#
4bfc0bb2 |
| 25-May-2019 |
Roman Gushchin <guro@fb.com> |
bpf: decouple the lifetime of cgroup_bpf from cgroup itself
Currently the lifetime of bpf programs attached to a cgroup is bound to the lifetime of the cgroup itself. It means that if a user forgets
bpf: decouple the lifetime of cgroup_bpf from cgroup itself
Currently the lifetime of bpf programs attached to a cgroup is bound to the lifetime of the cgroup itself. It means that if a user forgets (or intentionally avoids) to detach a bpf program before removing the cgroup, it will stay attached up to the release of the cgroup. Since the cgroup can stay in the dying state (the state between being rmdir()'ed and being released) for a very long time, it leads to a waste of memory. Also, it blocks a possibility to implement the memcg-based memory accounting for bpf objects, because a circular reference dependency will occur. Charged memory pages are pinning the corresponding memory cgroup, and if the memory cgroup is pinning the attached bpf program, nothing will be ever released.
A dying cgroup can not contain any processes, so the only chance for an attached bpf program to be executed is a live socket associated with the cgroup. So in order to release all bpf data early, let's count associated sockets using a new percpu refcounter. On cgroup removal the counter is transitioned to the atomic mode, and as soon as it reaches 0, all bpf programs are detached.
Because cgroup_bpf_release() can block, it can't be called from the percpu ref counter callback directly, so instead an asynchronous work is scheduled.
The reference counter is not socket specific, and can be used for any other types of programs, which can be executed from a cgroup-bpf hook outside of the process context, had such a need arise in the future.
Signed-off-by: Roman Gushchin <guro@fb.com> Cc: jolsa@redhat.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
#
02a8c817 |
| 14-Apr-2019 |
Alban Crequy <alban@kinvolk.io> |
bpf: add map helper functions push, pop, peek in more BPF programs
commit f1a2e44a3aec ("bpf: add queue and stack maps") introduced new BPF helper functions: - BPF_FUNC_map_push_elem - BPF_FUNC_map_
bpf: add map helper functions push, pop, peek in more BPF programs
commit f1a2e44a3aec ("bpf: add queue and stack maps") introduced new BPF helper functions: - BPF_FUNC_map_push_elem - BPF_FUNC_map_pop_elem - BPF_FUNC_map_peek_elem
but they were made available only for network BPF programs. This patch makes them available for tracepoint, cgroup and lirc programs.
Signed-off-by: Alban Crequy <alban@kinvolk.io> Cc: Mauricio Vasquez B <mauricio.vasquez@polito.it> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
show more ...
|
#
51356ac8 |
| 12-Apr-2019 |
Andrey Ignatov <rdna@fb.com> |
bpf: Fix distinct pointer types warning for ARCH=i386
Fix a new warning reported by kbuild for make ARCH=i386:
In file included from kernel/bpf/cgroup.c:11:0: kernel/bpf/cgroup.c: In function
bpf: Fix distinct pointer types warning for ARCH=i386
Fix a new warning reported by kbuild for make ARCH=i386:
In file included from kernel/bpf/cgroup.c:11:0: kernel/bpf/cgroup.c: In function '__cgroup_bpf_run_filter_sysctl': include/linux/kernel.h:827:29: warning: comparison of distinct pointer types lacks a cast (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1))) ^ include/linux/kernel.h:841:4: note: in expansion of macro '__typecheck' (__typecheck(x, y) && __no_side_effects(x, y)) ^~~~~~~~~~~ include/linux/kernel.h:851:24: note: in expansion of macro '__safe_cmp' __builtin_choose_expr(__safe_cmp(x, y), \ ^~~~~~~~~~ include/linux/kernel.h:860:19: note: in expansion of macro '__careful_cmp' #define min(x, y) __careful_cmp(x, y, <) ^~~~~~~~~~~~~ >> kernel/bpf/cgroup.c:837:17: note: in expansion of macro 'min' ctx.new_len = min(PAGE_SIZE, *pcount); ^~~
Fixes: 4e63acdff864 ("bpf: Introduce bpf_sysctl_{get,set}_new_value helpers") Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
#
d7a4cb9b |
| 18-Mar-2019 |
Andrey Ignatov <rdna@fb.com> |
bpf: Introduce bpf_strtol and bpf_strtoul helpers
Add bpf_strtol and bpf_strtoul to convert a string to long and unsigned long correspondingly. It's similar to user space strtol(3) and strtoul(3) wi
bpf: Introduce bpf_strtol and bpf_strtoul helpers
Add bpf_strtol and bpf_strtoul to convert a string to long and unsigned long correspondingly. It's similar to user space strtol(3) and strtoul(3) with a few changes to the API:
* instead of NUL-terminated C string the helpers expect buffer and buffer length;
* resulting long or unsigned long is returned in a separate result-argument;
* return value is used to indicate success or failure, on success number of consumed bytes is returned that can be used to identify position to read next if the buffer is expected to contain multiple integers;
* instead of *base* argument, *flags* is used that provides base in 5 LSB, other bits are reserved for future use;
* number of supported bases is limited.
Documentation for the new helpers is provided in bpf.h UAPI.
The helpers are made available to BPF_PROG_TYPE_CGROUP_SYSCTL programs to be able to convert string input to e.g. "ulongvec" output.
E.g. "net/ipv4/tcp_mem" consists of three ulong integers. They can be parsed by calling to bpf_strtoul three times.
Implementation notes:
Implementation includes "../../lib/kstrtox.h" to reuse integer parsing functions. It's done exactly same way as fs/proc/base.c already does.
Unfortunately existing kstrtoX function can't be used directly since they fail if any invalid character is present right after integer in the string. Existing simple_strtoX functions can't be used either since they're obsolete and don't handle overflow properly.
Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
#
e1550bfe |
| 07-Mar-2019 |
Andrey Ignatov <rdna@fb.com> |
bpf: Add file_pos field to bpf_sysctl ctx
Add file_pos field to bpf_sysctl context to read and write sysctl file position at which sysctl is being accessed (read or written).
The field can be used
bpf: Add file_pos field to bpf_sysctl ctx
Add file_pos field to bpf_sysctl context to read and write sysctl file position at which sysctl is being accessed (read or written).
The field can be used to e.g. override whole sysctl value on write to sysctl even when sys_write is called by user space with file_pos > 0. Or BPF program may reject such accesses.
Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
#
4e63acdf |
| 07-Mar-2019 |
Andrey Ignatov <rdna@fb.com> |
bpf: Introduce bpf_sysctl_{get,set}_new_value helpers
Add helpers to work with new value being written to sysctl by user space.
bpf_sysctl_get_new_value() copies value being written to sysctl into
bpf: Introduce bpf_sysctl_{get,set}_new_value helpers
Add helpers to work with new value being written to sysctl by user space.
bpf_sysctl_get_new_value() copies value being written to sysctl into provided buffer.
bpf_sysctl_set_new_value() overrides new value being written by user space with a one from provided buffer. Buffer should contain string representation of the value, similar to what can be seen in /proc/sys/.
Both helpers can be used only on sysctl write.
File position matters and can be managed by an interface that will be introduced separately. E.g. if user space calls sys_write to a file in /proc/sys/ at file position = X, where X > 0, then the value set by bpf_sysctl_set_new_value() will be written starting from X. If program wants to override whole value with specified buffer, file position has to be set to zero.
Documentation for the new helpers is provided in bpf.h UAPI.
Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
#
1d11b301 |
| 28-Feb-2019 |
Andrey Ignatov <rdna@fb.com> |
bpf: Introduce bpf_sysctl_get_current_value helper
Add bpf_sysctl_get_current_value() helper to copy current sysctl value into provided by BPF_PROG_TYPE_CGROUP_SYSCTL program buffer.
It provides sa
bpf: Introduce bpf_sysctl_get_current_value helper
Add bpf_sysctl_get_current_value() helper to copy current sysctl value into provided by BPF_PROG_TYPE_CGROUP_SYSCTL program buffer.
It provides same string as user space can see by reading corresponding file in /proc/sys/, including new line, etc.
Documentation for the new helper is provided in bpf.h UAPI.
Since current value is kept in ctl_table->data in a parsed form, ctl_table->proc_handler() with write=0 is called to read that data and convert it to a string. Such a string can later be parsed by a program using helpers that will be introduced separately.
Unfortunately it's not trivial to provide API to access parsed data due to variety of data representations (string, intvec, uintvec, ulongvec, custom structures, even NULL, etc). Instead it's assumed that user know how to handle specific sysctl they're interested in and appropriate helpers can be used.
Since ctl_table->proc_handler() expects __user buffer, conversion to __user happens for kernel allocated one where the value is stored.
Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
#
808649fb |
| 27-Feb-2019 |
Andrey Ignatov <rdna@fb.com> |
bpf: Introduce bpf_sysctl_get_name helper
Add bpf_sysctl_get_name() helper to copy sysctl name (/proc/sys/ entry) into provided by BPF_PROG_TYPE_CGROUP_SYSCTL program buffer.
By default full name (
bpf: Introduce bpf_sysctl_get_name helper
Add bpf_sysctl_get_name() helper to copy sysctl name (/proc/sys/ entry) into provided by BPF_PROG_TYPE_CGROUP_SYSCTL program buffer.
By default full name (w/o /proc/sys/) is copied, e.g. "net/ipv4/tcp_mem".
If BPF_F_SYSCTL_BASE_NAME flag is set, only base name will be copied, e.g. "tcp_mem".
Documentation for the new helper is provided in bpf.h UAPI.
Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
#
7b146ceb |
| 27-Feb-2019 |
Andrey Ignatov <rdna@fb.com> |
bpf: Sysctl hook
Containerized applications may run as root and it may create problems for whole host. Specifically such applications may change a sysctl and affect applications in other containers.
bpf: Sysctl hook
Containerized applications may run as root and it may create problems for whole host. Specifically such applications may change a sysctl and affect applications in other containers.
Furthermore in existing infrastructure it may not be possible to just completely disable writing to sysctl, instead such a process should be gradual with ability to log what sysctl are being changed by a container, investigate, limit the set of writable sysctl to currently used ones (so that new ones can not be changed) and eventually reduce this set to zero.
The patch introduces new program type BPF_PROG_TYPE_CGROUP_SYSCTL and attach type BPF_CGROUP_SYSCTL to solve these problems on cgroup basis.
New program type has access to following minimal context: struct bpf_sysctl { __u32 write; };
Where @write indicates whether sysctl is being read (= 0) or written (= 1).
Helpers to access sysctl name and value will be introduced separately.
BPF_CGROUP_SYSCTL attach point is added to sysctl code right before passing control to ctl_table->proc_handler so that BPF program can either allow or deny access to sysctl.
Suggested-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
#
b1cd609d |
| 12-Mar-2019 |
Andrey Ignatov <rdna@fb.com> |
bpf: Add base proto function for cgroup-bpf programs
Currently kernel/bpf/cgroup.c contains only one program type and one proto function cgroup_dev_func_proto(). It'd be useful to have base proto fu
bpf: Add base proto function for cgroup-bpf programs
Currently kernel/bpf/cgroup.c contains only one program type and one proto function cgroup_dev_func_proto(). It'd be useful to have base proto function that can be reused for new cgroup-bpf program types coming soon.
Introduce cgroup_base_func_proto().
Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
#
6cab5e90 |
| 28-Jan-2019 |
Alexei Starovoitov <ast@kernel.org> |
bpf: run bpf programs with preemption disabled
Disabled preemption is necessary for proper access to per-cpu maps from BPF programs.
But the sender side of socket filters didn't have preemption dis
bpf: run bpf programs with preemption disabled
Disabled preemption is necessary for proper access to per-cpu maps from BPF programs.
But the sender side of socket filters didn't have preemption disabled: unix_dgram_sendmsg->sk_filter->sk_filter_trim_cap->bpf_prog_run_save_cb->BPF_PROG_RUN
and a combination of af_packet with tun device didn't disable either: tpacket_snd->packet_direct_xmit->packet_pick_tx_queue->ndo_select_queue-> tun_select_queue->tun_ebpf_select_queue->bpf_prog_run_clear_cb->BPF_PROG_RUN
Disable preemption before executing BPF programs (both classic and extended).
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
show more ...
|
#
1832f4ef |
| 29-Jan-2019 |
Valdis Kletnieks <valdis.kletnieks@vt.edu> |
bpf, cgroups: clean up kerneldoc warnings
Building with W=1 reveals some bitrot:
CC kernel/bpf/cgroup.o kernel/bpf/cgroup.c:238: warning: Function parameter or member 'flags' not described i
bpf, cgroups: clean up kerneldoc warnings
Building with W=1 reveals some bitrot:
CC kernel/bpf/cgroup.o kernel/bpf/cgroup.c:238: warning: Function parameter or member 'flags' not described in '__cgroup_bpf_attach' kernel/bpf/cgroup.c:367: warning: Function parameter or member 'unused_flags' not described in '__cgroup_bpf_detach'
Add a kerneldoc line for 'flags'.
Fixing the warning for 'unused_flags' is best approached by removing the unused parameter on the function call.
Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
show more ...
|
#
c8dc7980 |
| 16-Jan-2019 |
Mathieu Malaterre <malat@debian.org> |
bpf: Annotate implicit fall through in cgroup_dev_func_proto
There is a plan to build the kernel with -Wimplicit-fallthrough and this place in the code produced a warnings (W=1).
This commit remove
bpf: Annotate implicit fall through in cgroup_dev_func_proto
There is a plan to build the kernel with -Wimplicit-fallthrough and this place in the code produced a warnings (W=1).
This commit removes the following warning:
kernel/bpf/cgroup.c:719:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
Signed-off-by: Mathieu Malaterre <malat@debian.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
show more ...
|
#
b39b5f41 |
| 19-Oct-2018 |
Song Liu <songliubraving@fb.com> |
bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB
BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the skb. This patch enables direct access of skb for these programs.
bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB
BPF programs of BPF_PROG_TYPE_CGROUP_SKB need to access headers in the skb. This patch enables direct access of skb for these programs.
Two helper functions bpf_compute_and_save_data_end() and bpf_restore_data_end() are introduced. There are used in __cgroup_bpf_run_filter_skb(), to compute proper data_end for the BPF program, and restore original data afterwards.
Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
show more ...
|
#
8bad74f9 |
| 28-Sep-2018 |
Roman Gushchin <guro@fb.com> |
bpf: extend cgroup bpf core to allow multiple cgroup storage types
In order to introduce per-cpu cgroup storage, let's generalize bpf cgroup core to support multiple cgroup storage types. Potentiall
bpf: extend cgroup bpf core to allow multiple cgroup storage types
In order to introduce per-cpu cgroup storage, let's generalize bpf cgroup core to support multiple cgroup storage types. Potentially, per-node cgroup storage can be added later.
This commit is mostly a formal change that replaces cgroup_storage pointer with a array of cgroup_storage pointers. It doesn't actually introduce a new storage type, it will be done later.
Each bpf program is now able to have one cgroup storage of each type.
Signed-off-by: Roman Gushchin <guro@fb.com> Acked-by: Song Liu <songliubraving@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
show more ...
|
#
5bf7a60b |
| 27-Sep-2018 |
Yonghong Song <yhs@fb.com> |
bpf: permit CGROUP_DEVICE programs accessing helper bpf_get_current_cgroup_id()
Currently, helper bpf_get_current_cgroup_id() is not permitted for CGROUP_DEVICE type of programs. If the helper is us
bpf: permit CGROUP_DEVICE programs accessing helper bpf_get_current_cgroup_id()
Currently, helper bpf_get_current_cgroup_id() is not permitted for CGROUP_DEVICE type of programs. If the helper is used in such cases, the verifier will log the following error:
0: (bf) r6 = r1 1: (69) r7 = *(u16 *)(r6 +0) 2: (85) call bpf_get_current_cgroup_id#80 unknown func bpf_get_current_cgroup_id#80
The bpf_get_current_cgroup_id() is useful for CGROUP_DEVICE type of programs in order to customize action based on cgroup id. This patch added such a support.
Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Yonghong Song <yhs@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Roman Gushchin <guro@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
show more ...
|
#
85fc4b16 |
| 06-Aug-2018 |
Roman Gushchin <guro@fb.com> |
bpf: introduce update_effective_progs()
__cgroup_bpf_attach() and __cgroup_bpf_detach() functions have a good amount of duplicated code, which is possible to eliminate by introducing the update_effe
bpf: introduce update_effective_progs()
__cgroup_bpf_attach() and __cgroup_bpf_detach() functions have a good amount of duplicated code, which is possible to eliminate by introducing the update_effective_progs() helper function.
The update_effective_progs() calls compute_effective_progs() and then in case of success it calls activate_effective_progs() for each descendant cgroup. In case of failure (OOM), it releases allocated prog arrays and return the error code.
Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
show more ...
|
#
cd339431 |
| 02-Aug-2018 |
Roman Gushchin <guro@fb.com> |
bpf: introduce the bpf_get_local_storage() helper function
The bpf_get_local_storage() helper function is used to get a pointer to the bpf local storage from a bpf program.
It takes a pointer to a
bpf: introduce the bpf_get_local_storage() helper function
The bpf_get_local_storage() helper function is used to get a pointer to the bpf local storage from a bpf program.
It takes a pointer to a storage map and flags as arguments. Right now it accepts only cgroup storage maps, and flags argument has to be 0. Further it can be extended to support other types of local storage: e.g. thread local storage etc.
Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
show more ...
|
#
394e40a2 |
| 02-Aug-2018 |
Roman Gushchin <guro@fb.com> |
bpf: extend bpf_prog_array to store pointers to the cgroup storage
This patch converts bpf_prog_array from an array of prog pointers to the array of struct bpf_prog_array_item elements.
This allows
bpf: extend bpf_prog_array to store pointers to the cgroup storage
This patch converts bpf_prog_array from an array of prog pointers to the array of struct bpf_prog_array_item elements.
This allows to save a cgroup storage pointer for each bpf program efficiently attached to a cgroup.
Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
show more ...
|
#
d7bf2c10 |
| 02-Aug-2018 |
Roman Gushchin <guro@fb.com> |
bpf: allocate cgroup storage entries on attaching bpf programs
If a bpf program is using cgroup local storage, allocate a bpf_cgroup_storage structure automatically on attaching the program to a cgr
bpf: allocate cgroup storage entries on attaching bpf programs
If a bpf program is using cgroup local storage, allocate a bpf_cgroup_storage structure automatically on attaching the program to a cgroup and save the pointer into the corresponding bpf_prog_list entry. Analogically, release the cgroup local storage on detaching of the bpf program.
Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
show more ...
|
#
3960f4fd |
| 13-Jul-2018 |
Roman Gushchin <guro@fb.com> |
bpf: fix rcu annotations in compute_effective_progs()
The progs local variable in compute_effective_progs() is marked as __rcu, which is not correct. This is a local pointer, which is initialized by
bpf: fix rcu annotations in compute_effective_progs()
The progs local variable in compute_effective_progs() is marked as __rcu, which is not correct. This is a local pointer, which is initialized by bpf_prog_array_alloc(), which also now returns a generic non-rcu pointer.
The real rcu-protected pointer is *array (array is a pointer to an RCU-protected pointer), so the assignment should be performed using rcu_assign_pointer().
Fixes: 324bda9e6c5a ("bpf: multi program support for cgroup+bpf") Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
show more ...
|
#
fdb5c453 |
| 18-Jun-2018 |
Sean Young <sean@mess.org> |
bpf: fix attach type BPF_LIRC_MODE2 dependency wrt CONFIG_CGROUP_BPF
If the kernel is compiled with CONFIG_CGROUP_BPF not enabled, it is not possible to attach, detach or query IR BPF programs to /d
bpf: fix attach type BPF_LIRC_MODE2 dependency wrt CONFIG_CGROUP_BPF
If the kernel is compiled with CONFIG_CGROUP_BPF not enabled, it is not possible to attach, detach or query IR BPF programs to /dev/lircN devices, making them impossible to use. For embedded devices, it should be possible to use IR decoding without cgroups or CONFIG_CGROUP_BPF enabled.
This change requires some refactoring, since bpf_prog_{attach,detach,query} functions are now always compiled, but their code paths for cgroups need moving out. Rather than a #ifdef CONFIG_CGROUP_BPF in kernel/bpf/syscall.c, moving them to kernel/bpf/cgroup.c and kernel/bpf/sockmap.c does not require #ifdefs since that is already conditionally compiled.
Fixes: f4364dcfc86d ("media: rc: introduce BPF_PROG_LIRC_MODE2") Signed-off-by: Sean Young <sean@mess.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
show more ...
|
#
1cedee13 |
| 25-May-2018 |
Andrey Ignatov <rdna@fb.com> |
bpf: Hooks for sys_sendmsg
In addition to already existing BPF hooks for sys_bind and sys_connect, the patch provides new hooks for sys_sendmsg.
It leverages existing BPF program type `BPF_PROG_TYP
bpf: Hooks for sys_sendmsg
In addition to already existing BPF hooks for sys_bind and sys_connect, the patch provides new hooks for sys_sendmsg.
It leverages existing BPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that provides access to socket itlself (properties like family, type, protocol) and user-passed `struct sockaddr *` so that BPF program can override destination IP and port for system calls such as sendto(2) or sendmsg(2) and/or assign source IP to the socket.
The hooks are implemented as two new attach types: `BPF_CGROUP_UDP4_SENDMSG` and `BPF_CGROUP_UDP6_SENDMSG` for UDPv4 and UDPv6 correspondingly.
UDPv4 and UDPv6 separate attach types for same reason as sys_bind and sys_connect hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields when user passes sockaddr_in since it'd be out-of-bound.
The difference with already existing hooks is sys_sendmsg are implemented only for unconnected UDP.
For TCP it doesn't make sense to change user-provided `struct sockaddr *` at sendto(2)/sendmsg(2) time since socket either was already connected and has source/destination set or wasn't connected and call to sendto(2)/sendmsg(2) would lead to ENOTCONN anyway.
Connected UDP is already handled by sys_connect hooks that can override source/destination at connect time and use fast-path later, i.e. these hooks don't affect UDP fast-path.
Rewriting source IP is implemented differently than that in sys_connect hooks. When sys_sendmsg is used with unconnected UDP it doesn't work to just bind socket to desired local IP address since source IP can be set on per-packet basis by using ancillary data (cmsg(3)). So no matter if socket is bound or not, source IP has to be rewritten on every call to sys_sendmsg.
To do so two new fields are added to UAPI `struct bpf_sock_addr`; * `msg_src_ip4` to set source IPv4 for UDPv4; * `msg_src_ip6` to set source IPv6 for UDPv6.
Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
show more ...
|
#
4fbac77d |
| 30-Mar-2018 |
Andrey Ignatov <rdna@fb.com> |
bpf: Hooks for sys_bind
== The problem ==
There is a use-case when all processes inside a cgroup should use one single IP address on a host that has multiple IP configured. Those processes should
bpf: Hooks for sys_bind
== The problem ==
There is a use-case when all processes inside a cgroup should use one single IP address on a host that has multiple IP configured. Those processes should use the IP for both ingress and egress, for TCP and UDP traffic. So TCP/UDP servers should be bound to that IP to accept incoming connections on it, and TCP/UDP clients should make outgoing connections from that IP. It should not require changing application code since it's often not possible.
Currently it's solved by intercepting glibc wrappers around syscalls such as `bind(2)` and `connect(2)`. It's done by a shared library that is preloaded for every process in a cgroup so that whenever TCP/UDP server calls `bind(2)`, the library replaces IP in sockaddr before passing arguments to syscall. When application calls `connect(2)` the library transparently binds the local end of connection to that IP (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty).
Shared library approach is fragile though, e.g.: * some applications clear env vars (incl. `LD_PRELOAD`); * `/etc/ld.so.preload` doesn't help since some applications are linked with option `-z nodefaultlib`; * other applications don't use glibc and there is nothing to intercept.
== The solution ==
The patch provides much more reliable in-kernel solution for the 1st part of the problem: binding TCP/UDP servers on desired IP. It does not depend on application environment and implementation details (whether glibc is used or not).
It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND` (similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`).
The new program type is intended to be used with sockets (`struct sock`) in a cgroup and provided by user `struct sockaddr`. Pointers to both of them are parts of the context passed to programs of newly added types.
The new attach types provides hooks in `bind(2)` system call for both IPv4 and IPv6 so that one can write a program to override IP addresses and ports user program tries to bind to and apply such a program for whole cgroup.
== Implementation notes ==
[1] Separate attach types for `AF_INET` and `AF_INET6` are added intentionally to prevent reading/writing to offsets that don't make sense for corresponding socket family. E.g. if user passes `sockaddr_in` it doesn't make sense to read from / write to `user_ip6[]` context fields.
[2] The write access to `struct bpf_sock_addr_kern` is implemented using special field as an additional "register".
There are just two registers in `sock_addr_convert_ctx_access`: `src` with value to write and `dst` with pointer to context that can't be changed not to break later instructions. But the fields, allowed to write to, are not available directly and to access them address of corresponding pointer has to be loaded first. To get additional register the 1st not used by `src` and `dst` one is taken, its content is saved to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load address of pointer field, and finally the register's content is restored from the temporary field after writing `src` value.
Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
show more ...
|
#
5e43f899 |
| 30-Mar-2018 |
Andrey Ignatov <rdna@fb.com> |
bpf: Check attach type at prog load time
== The problem ==
There are use-cases when a program of some type can be attached to multiple attach points and those attach points must have different perm
bpf: Check attach type at prog load time
== The problem ==
There are use-cases when a program of some type can be attached to multiple attach points and those attach points must have different permissions to access context or to call helpers.
E.g. context structure may have fields for both IPv4 and IPv6 but it doesn't make sense to read from / write to IPv6 field when attach point is somewhere in IPv4 stack.
Same applies to BPF-helpers: it may make sense to call some helper from some attach point, but not from other for same prog type.
== The solution ==
Introduce `expected_attach_type` field in in `struct bpf_attr` for `BPF_PROG_LOAD` command. If scenario described in "The problem" section is the case for some prog type, the field will be checked twice:
1) At load time prog type is checked to see if attach type for it must be known to validate program permissions correctly. Prog will be rejected with EINVAL if it's the case and `expected_attach_type` is not specified or has invalid value.
2) At attach time `attach_type` is compared with `expected_attach_type`, if prog type requires to have one, and, if they differ, attach will be rejected with EINVAL.
The `expected_attach_type` is now available as part of `struct bpf_prog` in both `bpf_verifier_ops->is_valid_access()` and `bpf_verifier_ops->get_func_proto()` () and can be used to check context accesses and calls to helpers correspondingly.
Initially the idea was discussed by Alexei Starovoitov <ast@fb.com> and Daniel Borkmann <daniel@iogearbox.net> here: https://marc.info/?l=linux-netdev&m=152107378717201&w=2
Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
show more ...
|