#
07c7b547 |
| 16-Jun-2020 |
Tony Lindgren <tony@atomide.com> |
Merge tag 'v5.8-rc1' into fixes
Linux 5.8-rc1
|
#
4b3c1f1b |
| 16-Jun-2020 |
Thomas Zimmermann <tzimmermann@suse.de> |
Merge v5.8-rc1 into drm-misc-fixes
Beginning a new release cycle for what will become v5.8. Updating drm-misc-fixes accordingly.
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
|
#
8440d4a7 |
| 12-Jun-2020 |
Rob Herring <robh@kernel.org> |
Merge branch 'dt/schema-cleanups' into dt/linus
|
#
f77d26a9 |
| 11-Jun-2020 |
Thomas Gleixner <tglx@linutronix.de> |
Merge branch 'x86/entry' into ras/core
to fix up conflicts in arch/x86/kernel/cpu/mce/core.c so MCE-specific follow-up patches can be applied without creating a horrible merge conflict afterwards.
|
Revision tags: v5.4.46, v5.7.2, v5.4.45, v5.7.1 |
|
#
8dd06ef3 |
| 06-Jun-2020 |
Dmitry Torokhov <dmitry.torokhov@gmail.com> |
Merge branch 'next' into for-linus
Prepare input updates for 5.8 merge window.
|
#
cb8e59cc |
| 03-Jun-2020 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:
1) Allow setting bluetooth L2CAP modes via socket option, from Luiz Augusto von Dentz.
2) Add GSO partial support to igc, from Sasha Neftin.
3) Several cleanups and improvements to r8169 from Heiner Kallweit.
4) Add IF_OPER_TESTING link state and use it when ethtool triggers a device self-test. From Andrew Lunn.
5) Start moving away from custom driver versions, use the globally defined kernel version instead, from Leon Romanovsky.
6) Support GRO via gro_cells in DSA layer, from Alexander Lobakin.
7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.
8) Add sriov and vf support to hinic, from Luo bin.
9) Support Media Redundancy Protocol (MRP) in the bridging code, from Horatiu Vultur.
10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.
11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina Dubroca. Also add ipv6 support for espintcp.
12) Lots of ReST conversions of the networking documentation, from Mauro Carvalho Chehab.
13) Support configuration of ethtool rxnfc flows in bcmgenet driver, from Doug Berger.
14) Allow to dump cgroup id and filter by it in inet_diag code, from Dmitry Yakunin.
15) Add infrastructure to export netlink attribute policies to userspace, from Johannes Berg.
16) Several optimizations to sch_fq scheduler, from Eric Dumazet.
17) Fallback to the default qdisc if qdisc init fails because otherwise a packet scheduler init failure will make a device inoperative. From Jesper Dangaard Brouer.
18) Several RISCV bpf jit optimizations, from Luke Nelson.
19) Correct the return type of the ->ndo_start_xmit() method in several drivers, it's netdev_tx_t but many drivers were using 'int'. From Yunjian Wang.
20) Add an ethtool interface for PHY master/slave config, from Oleksij Rempel.
21) Add BPF iterators, from Yonghong Song.
22) Add cable test infrastructure, including ethtool interfaces, from Andrew Lunn. Marvell PHY driver is the first to support this facility.
23) Remove zero-length arrays all over, from Gustavo A. R. Silva.
24) Calculate and maintain an explicit frame size in XDP, from Jesper Dangaard Brouer.
25) Add CAP_BPF, from Alexei Starovoitov.
26) Support terse dumps in the packet scheduler, from Vlad Buslov.
27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.
28) Add devm_register_netdev(), from Bartosz Golaszewski.
29) Minimize qdisc resets, from Cong Wang.
30) Get rid of kernel_getsockopt and kernel_setsockopt in order to eliminate set_fs/get_fs calls. From Christoph Hellwig.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
  selftests: net: ip_defrag: ignore EPERM
  net_failover: fixed rollback in net_failover_open()
  Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
  Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
  vmxnet3: allow rx flow hash ops only when rss is enabled
  hinic: add set_channels ethtool_ops support
  selftests/bpf: Add a default $(CXX) value
  tools/bpf: Don't use $(COMPILE.c)
  bpf, selftests: Use bpf_probe_read_kernel
  s390/bpf: Use bcr 0,%0 as tail call nop filler
  s390/bpf: Maintain 8-byte stack alignment
  selftests/bpf: Fix verifier test
  selftests/bpf: Fix sample_cnt shared between two threads
  bpf, selftests: Adapt cls_redirect to call csum_level helper
  bpf: Add csum_level helper for fixing up csum levels
  bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
  sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
  crypto/chtls: IPv6 support for inline TLS
  Crypto/chcr: Fixes a coccinile check error
  Crypto/chcr: Fixes compilations warnings
  ...
|
Revision tags: v5.4.44 |
|
#
9a25c1df |
| 01-Jun-2020 |
David S. Miller <davem@davemloft.net> |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:
==================== pull-request: bpf-next 2020-06-01
The following pull-request contains BPF updates for your *net-next* tree.
We've added 55 non-merge commits during the last 1 day(s) which contain a total of 91 files changed, 4986 insertions(+), 463 deletions(-).
The main changes are:
1) Add rx_queue_mapping to bpf_sock from Amritha.
2) Add BPF ring buffer, from Andrii.
3) Attach and run programs through devmap, from David.
4) Allow SO_BINDTODEVICE opt in bpf_setsockopt, from Ferenc.
5) link based flow_dissector, from Jakub.
6) Use tracing helpers for lsm programs, from Jiri.
7) Several sk_msg fixes and extensions, from John. ====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
cf51abcd |
| 01-Jun-2020 |
Alexei Starovoitov <ast@kernel.org> |
Merge branch 'Link-based-attach-to-netns'
Jakub Sitnicki says:
==================== One of the pieces of feedback from recent review of BPF hooks for socket lookup [0] was that new program types should use bpf_link-based attachment.
This series introduces new bpf_link type for attaching to network namespace. All link operations are supported. Errors returned from ops follow cgroup example. Patch 4 description goes into error semantics.
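For orientation, here is a minimal user-space sketch of link-based netns attachment using the libbpf API this series adds (bpf_program__attach_netns); program loading and the fallback path are omitted, and error handling is abbreviated:

    #include <fcntl.h>
    #include <unistd.h>
    #include <bpf/libbpf.h>

    /* prog is a loaded flow dissector program (loading not shown). */
    static struct bpf_link *attach_to_current_netns(struct bpf_program *prog)
    {
        int netns_fd = open("/proc/self/ns/net", O_RDONLY);
        struct bpf_link *link;

        if (netns_fd < 0)
            return NULL;

        /* The returned bpf_link holds the attachment; it is detached
         * when the link is destroyed or the netns goes away. */
        link = bpf_program__attach_netns(prog, netns_fd);
        close(netns_fd);
        return libbpf_get_error(link) ? NULL : link;
    }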
The major change in v2 is a switch away from RCU to mutex-only synchronization. Andrii pointed out that it is not needed, and it makes sense to keep locking straightforward.
Also, there were a couple of bugs in update_prog and fill_info initial implementation, one picked up by kbuild. Those are now fixed. Tests have been extended to cover them. Full changelog below.
Series is organized as so:
Patches 1-3 prepare a space in struct net to keep state for attached BPF programs, and massage the code in flow_dissector to make it attach type agnostic, to finally move it under kernel/bpf/.
Patch 4, the most important one, introduces new bpf_link link type for attaching to network namespace.
Patch 5 unifies the update error (ENOLINK) between BPF cgroup and netns.
Patches 6-8 make libbpf and bpftool aware of the new link type.
Patches 9-12 add and extend tests to check that the low- and high-level link API for operating on links to netns works as intended.
Thanks to Alexei, Andrii, Lorenz, Marek, and Stanislav for feedback.
-jkbs
[0] https://lore.kernel.org/bpf/20200511185218.1422406-1-jakub@cloudflare.com/
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com> Cc: Lorenz Bauer <lmb@cloudflare.com> Cc: Marek Majkowski <marek@cloudflare.com> Cc: Stanislav Fomichev <sdf@google.com>
v1 -> v2:
- Switch to mutex-only synchronization. Don't rely on RCU grace period guarantee when accessing struct net from link release / update / fill_info, and when accessing bpf_link from pernet pre_exit callback. (Andrii)
- Drop patch 1, no longer needed with mutex-only synchronization.
- Don't leak uninitialized variable contents from fill_info callback when link is in defunct state. (kbuild)
- Make fill_info treat the link as defunct (i.e. no attached netns) when struct net refcount is 0, but link has not been yet auto-detached.
- Add missing BPF_LINK_TYPE define in bpf_types.h for new link type.
- Fix link update_prog callback to update the prog that will run, and not just the link itself.
- Return EEXIST on prog attach when link already exists, and on link create when prog is already attached directly. (Andrii)
- Return EINVAL on prog detach when link is attached. (Andrii)
- Fold __netns_bpf_link_attach into its only caller. (Stanislav)
- Get rid of a wrapper around container_of() (Andrii)
- Use rcu_dereference_protected instead of rcu_access_pointer on update-side. (Stanislav)
- Make return-on-success from netns_bpf_link_create less confusing. (Andrii)
- Adapt bpf_link for cgroup to return ENOLINK when updating a defunct link. (Andrii, Alexei)
- Order new exported symbols in libbpf.map alphabetically (Andrii)
- Keep libbpf's "failed to attach link" warning message clear as to what we failed to attach to (cgroup vs netns). (Andrii)
- Extract helpers for printing link attach type. (bpftool, Andrii)
- Switch flow_dissector tests to BPF skeleton and extend them to exercise link-based flow dissector attachment. (Andrii)
- Harden flow dissector attachment tests with prog query checks after prog attach/detach, or link create/update/close.
- Extend flow dissector tests to cover fill_info for defunct links.
- Rebase onto recent bpf-next
====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
Revision tags: v5.7 |
|
#
b27f7bb5 |
| 31-May-2020 |
Jakub Sitnicki <jakub@cloudflare.com> |
flow_dissector: Move out netns_bpf prog callbacks
Move functions to manage BPF programs attached to netns that are not specific to flow dissector to a dedicated module named bpf/net_namespace.c.
The set of functions will grow with the addition of bpf_link support for netns attached programs. This patch prepares ground by creating a place for it.
This is a code move with no functional changes intended.
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200531082846.2117903-4-jakub@cloudflare.com
|
#
e255d327 |
| 29-May-2020 |
Daniel Borkmann <daniel@iogearbox.net> |
Merge branch 'bpf-ring-buffer'
Andrii Nakryiko says:
==================== Implement a new BPF ring buffer, as presented at BPF virtual conference ([0]). It presents an alternative to perf buffer, following its semantics closely, but allowing sharing same instance of ring buffer across multiple CPUs efficiently.
Most patches have extensive commentary explaining various aspects, so I'll keep the cover letter short. Overall structure of the patch set:
- patch #1 adds BPF ring buffer implementation to kernel and necessary verifier support;
- patch #2 adds libbpf consumer implementation for BPF ringbuf;
- patch #3 adds selftests, both for the single BPF ring buffer use case, as well as using it with array/hash of maps;
- patch #4 adds extensive benchmarks and provides some analysis in the commit message; it builds upon selftests/bpf's bench runner;
- patch #5 adds most of patch #1's commit message as a doc under Documentation/bpf/ringbuf.rst.
Litmus tests, validating consumer/producer protocols and memory orderings, were moved out as discussed in [1] and are going to be posted against -rcu tree and put under Documentation/litmus-tests/bpf-rb.
[0] https://docs.google.com/presentation/d/18ITdg77Bj6YDOH2LghxrnFxiPWe0fAqcmJY95t_qr0w [1] https://lkml.org/lkml/2020/5/22/1011
v3->v4:
- fix ringbuf freeing (vunmap, __free_page); verified with a trivial loop creating and closing ringbuf map endlessly (Daniel);

v2->v3:
- dropped unnecessary smp_wmb() (Paul);
- verifier reference type enhancement patch was dropped (Alexei);
- better verifier message for various memory access checks (Alexei);
- clarified a bit roundup_len() bit shifting (Alexei);
- converted doc to .rst (Alexei);
- fixed warning on 32-bit arches regarding tautological ring area size check.

v1->v2:
- commit()/discard()/output() accept flags (NO_WAKEUP/FORCE_WAKEUP) (Stanislav);
- bpf_ringbuf_query() added, returning available data size, ringbuf size, consumer/producer positions, needed to implement smarter notification policy (Stanislav);
- added ringbuf UAPI constants to include/uapi/linux/bpf.h (Jonathan);
- fixed sample size check, added proper ringbuf size check (Jonathan, Alexei);
- wake_up_all() is done through irq_work (Alexei);
- consistent use of smp_load_acquire/smp_store_release, no READ_ONCE/WRITE_ONCE (Alexei);
- added Documentation/bpf/ringbuf.txt (Stanislav);
- updated litmus test with smp_load_acquire/smp_store_release changes;
- added ring_buffer__consume() API to libbpf for busy-polling;
- ring_buffer__poll() on success returns number of records consumed;
- fixed EPOLL notifications, don't assume available data, done similarly to perfbuf's implementation;
- both ringbuf and perfbuf now have --rb-sampled mode, instead of pb-raw/pb-custom mode, updated benchmark results;
- extended ringbuf selftests to validate epoll logic/manual notification logic, as well as bpf_ringbuf_query().
====================
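For reference, a rough user-space consumer sketch using the libbpf APIs named above (ring_buffer__new()/ring_buffer__poll()); struct event is an assumption standing in for whatever record layout the producer writes:

    #include <stdio.h>
    #include <bpf/libbpf.h>

    struct event { int pid; };  /* assumed producer-defined record layout */

    static int handle_event(void *ctx, void *data, size_t size)
    {
        const struct event *e = data;

        printf("pid=%d\n", e->pid);
        return 0;  /* non-zero would abort consumption */
    }

    int consume(int ringbuf_map_fd)
    {
        struct ring_buffer *rb;

        rb = ring_buffer__new(ringbuf_map_fd, handle_event, NULL, NULL);
        if (!rb)
            return -1;
        /* On success ring_buffer__poll() returns number of records consumed. */
        while (ring_buffer__poll(rb, 100 /* timeout, ms */) >= 0)
            ;
        ring_buffer__free(rb);
        return 0;
    }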
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
|
#
457f4436 |
| 29-May-2020 |
Andrii Nakryiko <andriin@fb.com> |
bpf: Implement BPF ring buffer and verifier support for it
This commit adds a new MPSC ring buffer implementation to the BPF ecosystem, which allows multiple CPUs to submit data to a single shared ring buffer. On the consumption side, only a single consumer is assumed.
Motivation
----------
There are two distinctive motivators for this work, neither satisfied by the existing perf buffer, which prompted the creation of a new ring buffer implementation:
- more efficient memory utilization by sharing ring buffer across CPUs;
- preserving ordering of events that happen sequentially in time, even across multiple CPUs (e.g., fork/exec/exit events for a task).
These two problems are independent, but perf buffer fails to satisfy both. Both are a result of a choice to have per-CPU perf ring buffer. Both can be also solved by having an MPSC implementation of ring buffer. The ordering problem could technically be solved for perf buffer with some in-kernel counting, but given the first one requires an MPSC buffer, the same solution would solve the second problem automatically.
Semantics and APIs
------------------
A single ring buffer is presented to BPF programs as an instance of a BPF map of type BPF_MAP_TYPE_RINGBUF. Two other alternatives were considered, but ultimately rejected.
One way would be, similar to BPF_MAP_TYPE_PERF_EVENT_ARRAY, to have BPF_MAP_TYPE_RINGBUF represent an array of ring buffers, but without enforcing the "same CPU only" rule. This would be a more familiar interface compatible with existing perf buffer use in BPF, but would fail if an application needed more advanced logic to look up a ring buffer by arbitrary key. HASH_OF_MAPS addresses this with the current approach. Additionally, given the performance of BPF ringbuf, many use cases would just opt into a simple single ring buffer shared among all CPUs, for which the current approach would be overkill.
Another approach could introduce a new concept, alongside BPF map, to represent a generic "container" object, which doesn't necessarily have a key/value interface with lookup/update/delete operations. This approach would add a lot of extra infrastructure that has to be built for observability and verifier support. It would also add another concept that BPF developers would have to familiarize themselves with, new syntax in libbpf, etc. And yet it would really provide no additional benefits over the approach of using a map. BPF_MAP_TYPE_RINGBUF doesn't support lookup/update/delete operations, but neither do a few other map types (e.g., queue and stack; array doesn't support delete, etc.).
The approach chosen has an advantage of re-using existing BPF map infrastructure (introspection APIs in kernel, libbpf support, etc), being familiar concept (no need to teach users a new type of object in BPF program), and utilizing existing tooling (bpftool). For common scenario of using a single ring buffer for all CPUs, it's as simple and straightforward, as would be with a dedicated "container" object. On the other hand, by being a map, it can be combined with ARRAY_OF_MAPS and HASH_OF_MAPS map-in-maps to implement a wide variety of topologies, from one ring buffer for each CPU (e.g., as a replacement for perf buffer use cases), to a complicated application hashing/sharding of ring buffers (e.g., having a small pool of ring buffers with hashed task's tgid being a look up key to preserve order, but reduce contention).
Key and value sizes are enforced to be zero. max_entries is used to specify the size of ring buffer and has to be a power of 2 value.
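In BPF C that means the map definition carries only the type and the buffer size; a minimal sketch (the map name and the 16 MiB size are arbitrary choices):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_RINGBUF);
        __uint(max_entries, 16 * 1024 * 1024);  /* ring size in bytes, power of 2 */
    } events SEC(".maps");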
There are a bunch of similarities between perf buffer (BPF_MAP_TYPE_PERF_EVENT_ARRAY) and the new BPF ring buffer semantics:
- variable-length records;
- if there is no more space left in the ring buffer, reservation fails, no blocking;
- memory-mappable data area for user-space applications, for ease of consumption and high performance;
- epoll notifications for new incoming data;
- but still the ability to do busy polling for new data to achieve the lowest latency, if necessary.
BPF ringbuf provides two sets of APIs to BPF programs:
- bpf_ringbuf_output() allows to *copy* data from one place to a ring buffer, similarly to bpf_perf_event_output();
- bpf_ringbuf_reserve()/bpf_ringbuf_commit()/bpf_ringbuf_discard() APIs split the whole process into two steps. First, a fixed amount of space is reserved. If successful, a pointer to data inside the ring buffer data area is returned, which BPF programs can use similarly to data inside array/hash maps. Once ready, this piece of memory is either committed or discarded. Discard is similar to commit, but makes the consumer ignore the record.
bpf_ringbuf_output() has the disadvantage of incurring an extra memory copy, because the record has to be prepared in some other place first. But it allows submitting records of a length that isn't known to the verifier beforehand. It also closely matches bpf_perf_event_output(), so it will simplify migration significantly.
bpf_ringbuf_reserve() avoids the extra copy of memory by providing a memory pointer directly into ring buffer memory. In a lot of cases records are larger than BPF stack space allows, so many programs have had to use an extra per-CPU array as a temporary heap for preparing a sample. bpf_ringbuf_reserve() avoids this need completely. But in exchange, it only allows a known, constant size of memory to be reserved, so that the verifier can verify that the BPF program can't access memory outside its reserved record space. bpf_ringbuf_output(), while slightly slower due to the extra memory copy, covers some use cases that are not suitable for bpf_ringbuf_reserve().
The difference between commit and discard is very small. Discard just marks a record as discarded, and such records are supposed to be ignored by consumer code. Discard is useful for some advanced use-cases, such as ensuring all-or-nothing multi-record submission, or emulating temporary malloc()/free() within single BPF program invocation.
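A short BPF-side sketch of the two-step API against the events map above (the tracepoint and record layout are illustrative; note that the commit-step helper landed in the UAPI under the name bpf_ringbuf_submit()):

    struct event { int pid; };

    SEC("tracepoint/sched/sched_process_exit")
    int handle_exit(void *ctx)
    {
        struct event *e;

        /* Step 1: reserve a fixed-size record. */
        e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
        if (!e)
            return 0;  /* ring full, or lock contention in NMI context */

        e->pid = bpf_get_current_pid_tgid() >> 32;

        /* Step 2: make the record visible to the consumer, or call
         * bpf_ringbuf_discard() to have the consumer skip it. */
        bpf_ringbuf_submit(e, 0);
        return 0;
    }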
Each reserved record is tracked by verifier through existing reference-tracking logic, similar to socket ref-tracking. It is thus impossible to reserve a record, but forget to submit (or discard) it.
The bpf_ringbuf_query() helper allows querying various properties of the ring buffer. Currently 4 are supported:
- BPF_RB_AVAIL_DATA returns the amount of unconsumed data in the ring buffer;
- BPF_RB_RING_SIZE returns the size of the ring buffer;
- BPF_RB_CONS_POS/BPF_RB_PROD_POS returns the current logical position of the consumer/producer, respectively.
Returned values are momentary snapshots of ring buffer state and could be off by the time the helper returns, so this should be used only for debugging/reporting reasons or for implementing various heuristics that take into account the highly-changeable nature of some of those characteristics.
One such heuristic might involve more fine-grained control over poll/epoll notifications about new data availability in the ring buffer. Together with the BPF_RB_NO_WAKEUP/BPF_RB_FORCE_WAKEUP flags for the output/commit/discard helpers, it allows a BPF program a high degree of control and, e.g., more efficient batched notifications. The default self-balancing strategy, though, should be adequate for most applications and will work reliably and efficiently already.
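Continuing the sketch above, batched notifications with these flags might look as follows (the 64 KiB threshold is an arbitrary illustration):

    /* Wake the consumer only once enough unconsumed data accumulates. */
    __u64 flags = BPF_RB_NO_WAKEUP;

    if (bpf_ringbuf_query(&events, BPF_RB_AVAIL_DATA) > 64 * 1024)
        flags = BPF_RB_FORCE_WAKEUP;
    bpf_ringbuf_submit(e, flags);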
Design and implementation
-------------------------
This reserve/commit schema allows a natural way for multiple producers, either on different CPUs or even on the same CPU/in the same BPF program, to reserve independent records and work with them without blocking other producers. This means that if a BPF program was interrupted by another BPF program sharing the same ring buffer, they will both get a record reserved (provided there is enough space left) and can work with it and submit it independently. This applies to NMI context as well, except that due to using a spinlock during reservation, in NMI context bpf_ringbuf_reserve() might fail to get a lock, in which case reservation will fail even if the ring buffer is not full.
The ring buffer itself is internally implemented as a power-of-2 sized circular buffer, with two logical and ever-increasing counters (which might wrap around on 32-bit architectures; that's not a problem):
- the consumer counter shows up to which logical position the consumer has consumed the data;
- the producer counter denotes the amount of data reserved by all producers.
Each time a record is reserved, producer that "owns" the record will successfully advance producer counter. At that point, data is still not yet ready to be consumed, though. Each record has 8 byte header, which contains the length of reserved record, as well as two extra bits: busy bit to denote that record is still being worked on, and discard bit, which might be set at commit time if record is discarded. In the latter case, consumer is supposed to skip the record and move on to the next one. Record header also encodes record's relative offset from the beginning of ring buffer data area (in pages). This allows bpf_ringbuf_commit()/bpf_ringbuf_discard() to accept only the pointer to the record itself, without requiring also the pointer to ring buffer itself. Ring buffer memory location will be restored from record metadata header. This significantly simplifies verifier, as well as improving API usability.
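The header described here boils down to a small structure along these lines (a sketch consistent with the description; the authoritative definition lives in kernel/bpf/ringbuf.c):

    struct bpf_ringbuf_hdr {
        __u32 len;     /* record length; busy/discard flags occupy the top bits */
        __u32 pg_off;  /* offset back to the start of the data area, in pages */
    };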
Producer counter increments are serialized under a spinlock, so there is a strict ordering between reservations. Commits, on the other hand, are completely lockless and independent. All records become available to the consumer in the order of reservations, but only after all previous records were already committed. It is thus possible for slow producers to temporarily hold off records that were reserved later.
Reservation/commit/consumer protocol is verified by litmus tests in Documentation/litmus-test/bpf-rb.
One interesting implementation bit that significantly simplifies (and thus speeds up) the implementation of both producers and consumers is how the data area is mapped twice contiguously back-to-back in virtual memory. This allows not taking any special measures for samples that have to wrap around at the end of the circular buffer data area, because the next page after the last data page would be the first data page again, and thus the sample will still appear completely contiguous in virtual memory. See the comment and a simple ASCII diagram showing this visually in bpf_ringbuf_area_alloc().
Another feature that distinguishes BPF ringbuf from the perf ring buffer is self-pacing notification of new data availability. The bpf_ringbuf_commit() implementation will send a notification of a new record being available after commit only if the consumer has already caught up right up to the record being committed. If not, the consumer still has to catch up and thus will see the new data anyway, without needing an extra poll notification. Benchmarks (see tools/testing/selftests/bpf/benchs/bench_ringbuf.c) show that this allows achieving very high throughput without having to resort to tricks like "notify only every Nth sample", which are necessary with perf buffer. For extreme cases, when a BPF program wants more manual control of notifications, the commit/discard/output helpers accept the BPF_RB_NO_WAKEUP and BPF_RB_FORCE_WAKEUP flags, which give full control over notifications of data availability, but require extra caution and diligence in using this API.
Comparison to alternatives
--------------------------
Before implementing the BPF ring buffer from scratch, existing alternatives in the kernel were evaluated, but they didn't seem to meet the needs. They largely fell into a few categories:
- per-CPU buffers (perf, ftrace, etc.), which don't satisfy the two motivations outlined above (ordering and memory consumption);
- linked list-based implementations; while some were multi-producer designs, consuming these from user-space would be very complicated and most probably not performant; memory-mapping a contiguous piece of memory is simpler and more performant for user-space consumers;
- io_uring is SPSC, but also requires fixed-sized elements. Naively turning an SPSC queue into MPSC w/ lock would have subpar performance compared to locked reserve + lockless commit, as with BPF ring buffer. Fixed-sized elements would be too limiting for BPF programs, given that existing BPF programs heavily rely on the variable-sized perf buffer already;
- specialized implementations (like the new printk ring buffer, [0]) with lots of printk-specific limitations and implications that didn't seem a good fit for intended use with BPF programs.
[0] https://lwn.net/Articles/779550/
Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200529075424.3139988-2-andriin@fb.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
d053cf0d |
| 01-Jun-2020 |
Petr Mladek <pmladek@suse.com> |
Merge branch 'for-5.8' into for-linus
|
Revision tags: v5.4.43 |
|
#
a152b859 |
| 22-May-2020 |
David S. Miller <davem@davemloft.net> |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
==================== pull-request: bpf-next 2020-05-23
The following pull-request contains BPF updates for your *net-next* tree.
We've added 50 non-merge commits during the last 8 day(s) which contain a total of 109 files changed, 2776 insertions(+), 2887 deletions(-).
The main changes are:
1) Add a new AF_XDP buffer allocation API to the core in order to help lower the bar for drivers adopting AF_XDP support. i40e, ice, ixgbe as well as mlx5 have been moved over to the new API and also gained a small improvement in performance, from Björn Töpel and Magnus Karlsson.
2) Add getpeername()/getsockname() attach types for BPF sock_addr programs in order to allow for e.g. reverse translation of load-balancer backend to service address/port tuple from a connected peer, from Daniel Borkmann.
3) Improve the BPF verifier is_branch_taken() logic to evaluate pointers being non-NULL, e.g. if after an initial test another non-NULL test on that pointer follows in a given path, then it can be pruned right away, from John Fastabend.
4) Larger rework of BPF sockmap selftests to make output easier to understand and to reduce overall runtime as well as adding new BPF kTLS selftests that run in combination with sockmap, also from John Fastabend.
5) Batch of misc updates to BPF selftests including fixing up test_align to match verifier output again and moving it under test_progs, allowing bpf_iter selftest to compile on machines with older vmlinux.h, and updating config options for lirc and v6 segment routing helpers, from Stanislav Fomichev, Andrii Nakryiko and Alan Maguire.
6) Conversion of BPF tracing samples outdated internal BPF loader to use libbpf API instead, from Daniel T. Lee.
7) Follow-up to BPF kernel test infrastructure in order to fix a flake in the XDP selftests, from Jesper Dangaard Brouer.
8) Minor improvements to libbpf's internal hashmap implementation, from Ian Rogers. ====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
#
1f422417 |
| 22-May-2020 |
Daniel Lezcano <daniel.lezcano@linaro.org> |
Merge branch 'timers/drivers/timer-ti' into timers/drivers/next
|
#
79917b24 |
| 21-May-2020 |
Alexei Starovoitov <ast@kernel.org> |
Merge branch 'af_xdp-common-alloc'
Björn Töpel says:
==================== Overview ========
Driver adoption for AF_XDP has been slow. The amount of code required to properly support AF_XDP is substantial and the driver/core APIs are vague or even non-existent. Drivers have to manually adjust data offsets and update AF_XDP handles differently for different modes (aligned/unaligned).
This series attempts to improve the situation by introducing an AF_XDP buffer allocation API. The implementation is based on a single core (single producer/consumer) buffer pool for the AF_XDP UMEM.
A buffer is allocated using the xsk_buff_alloc() function and returned using xsk_buff_free(). If a buffer is disassociated from the pool, e.g. when a buffer is passed to an AF_XDP socket, the buffer is said to be released. Currently, the release function is only used by the AF_XDP internals and is not visible to the driver.
Drivers using this API should register the XDP memory model with the new MEM_TYPE_XSK_BUFF_POOL type, which will supersede the MEM_TYPE_ZERO_COPY type.
The buffer type is struct xdp_buff, and follows the lifetime of regular xdp_buffs, i.e. the lifetime of an xdp_buff is restricted to a NAPI context. In other words, the API is not replacing xdp_frames.
DMA mapping/synching is folded into the buffer handling as well.
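A driver-side sketch of the flow described above, with hypothetical driver-local names (struct my_rxq and its fields are stand-ins); the memory-model registration and xsk_buff_alloc()/xsk_buff_free() are the pieces this series introduces:

    #include <net/xdp_sock_drv.h>

    /* Register the new memory model (supersedes MEM_TYPE_ZERO_COPY). */
    static int my_rxq_enable_zc(struct my_rxq *rxq)
    {
        return xdp_rxq_info_reg_mem_model(&rxq->xdp_rxq,
                                          MEM_TYPE_XSK_BUFF_POOL, NULL);
    }

    /* Refill one RX slot from the UMEM; DMA mapping is folded into the
     * buffer handling, so the driver no longer maps pages itself. */
    static bool my_rxq_refill_one(struct my_rxq *rxq)
    {
        struct xdp_buff *xdp = xsk_buff_alloc(rxq->umem);

        if (!xdp)
            return false;  /* fill ring empty, retry later */
        rxq->bufs[rxq->next_to_use] = xdp;
        return true;
    }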
@JeffK The Intel drivers changes should go through the bpf-next tree, and not your regular Intel tree, since multiple (non-Intel) drivers are affected.
The outline of the series is as follows:
Patch 1 is a fix for xsk_umem_xdp_frame_sz().
Patch 2 to 4 are restructures/clean ups. The XSKMAP implementation is moved to net/xdp/. Functions/defines/enums that are only used by the AF_XDP internals are moved from the global include/net/xdp_sock.h to net/xdp/xsk.h. We are also introducing a new "driver include file", include/net/xdp_sock_drv.h, which is the only file NIC driver developers adding AF_XDP zero-copy support should care about.
Patch 5 adds the new API, and migrates the "copy-mode"/skb-mode AF_XDP path to the new API.
Patch 6 to 11 migrates the existing zero-copy drivers to the new API.
Patch 12 removes the MEM_TYPE_ZERO_COPY memory type, and the "handle" member of struct xdp_buff.
Patch 13 simplifies the xdp_return_{frame,frame_rx_napi,buff} functions.
Patch 14 is a performance patch, where some functions are inlined.
Finally, patch 15 updates the MAINTAINERS file to correctly mirror the new file layout.
Note that this series removes the "handle" member from struct xdp_buff, which reduces the xdp_buff size.
After this series, the diff stat of drivers/net/ is: 27 files changed, 419 insertions(+), 1288 deletions(-)
This series is a first step of simplifying the driver side of AF_XDP. I think more of the AF_XDP logic can be moved from the drivers to the AF_XDP core, e.g. the "need wakeup" set/clear functionality.
Statistics when allocation fails can now be added to the socket statistics via the XDP_STATISTICS getsockopt(). This will be added in a follow up series.
Performance
===========
As a nice side effect, performance is up a bit as well.
* i40e: 3% higher pps for rxdrop, zero-copy, aligned and unaligned (40 GbE, 64B packets).
* mlx5: RX +0.8 Mpps, TX +0.4 Mpps
Changelog
=========
v4->v5:
* Fix various kdoc and GCC warnings (W=1). (Jakub)

v3->v4:
* mlx5: Remove unused variable num_xsk_frames. (Jakub)
* i40e: Made i40e_fd_handle_status() static. (kbuild test robot)

v2->v3:
* Added xsk_umem_xdp_frame_sz() fix to the series. (Björn)
* Initialize struct xdp_buff member frame_sz. (Björn)
* Add API to query the DMA address of a frame. (Maxim)
* Do DMA sync for CPU till the end of the frame to handle possible growth (frame_sz). (Maxim)
* mlx5: Handle frame_sz, use xsk_buff_xdp_get_frame_dma, use xsk_buff API for DMA sync on TX, add performance numbers. (Maxim)

v1->v2:
* mlx5: Fix DMA address handling, set XDP metadata to invalid. (Maxim)
* ixgbe: Fixed xdp_buff data_end update. (Björn)
* Swapped SoBs in patch 4. (Maxim)

rfc->v1:
* Fixed build errors/warnings for m68k and riscv. (kbuild test robot)
* Added headroom/chunk size getter. (Maxim/Björn)
* mlx5: Put back the sanity check for XSK params, use XSK API to get the total headroom size. (Maxim)
* Fixed spelling in commit message. (Björn)
* Make sure xp_validate_desc() is inlined for Tx perf. (Maxim)
* Sorted file entries. (Joe)
* Added xdp_return_{frame,frame_rx_napi,buff} simplification (Björn)
Thanks for all the comments/input/help! ====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
d20a1676 |
| 20-May-2020 |
Björn Töpel <bjorn.topel@intel.com> |
xsk: Move xskmap.c to net/xdp/
The XSKMAP is partly implemented by net/xdp/xsk.c. Move xskmap.c from kernel/bpf/ to net/xdp/, which is the logical place for AF_XDP related code. Also, move AF_XDP struct definitions, and function declarations only used by AF_XDP internals into net/xdp/xsk.h.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200520192103.355233-3-bjorn.topel@gmail.com
|
Revision tags: v5.4.42 |
|
#
d00f26b6 |
| 14-May-2020 |
David S. Miller <davem@davemloft.net> |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:
==================== pull-request: bpf-next 2020-05-14
The following pull-request contains BPF updates for your *net-next* tree.
The main changes are:
1) Merged tag 'perf-for-bpf-2020-05-06' from tip tree that includes CAP_PERFMON.
2) support for narrow loads in bpf_sock_addr progs and additional helpers in cg-skb progs, from Andrey.
3) bpf benchmark runner, from Andrii.
4) arm and riscv JIT optimizations, from Luke.
5) bpf iterator infrastructure, from Yonghong. ====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
Revision tags: v5.4.41 |
|
#
0fdc50df |
| 12-May-2020 |
Dmitry Torokhov <dmitry.torokhov@gmail.com> |
Merge tag 'v5.6' into next
Sync up with mainline to get device tree and other changes.
|
#
68f0f269 |
| 11-May-2020 |
Thomas Gleixner <tglx@linutronix.de> |
Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul McKenney:
1. Miscellaneous fixes.
2. kfree_rcu() updates.
3. Remove scheduler locking restriction.
4. RCU-tasks update, including addition of RCU Tasks Trace for BPF use and RCU Tasks Rude. (This branch is on top of #3 due to overlap of changed code.)
5. RCU CPU stall warning updates.
6. Torture-test updates.
|
Revision tags: v5.4.40 |
|
#
180139dc |
| 09-May-2020 |
Alexei Starovoitov <ast@kernel.org> |
Merge branch 'bpf_iter'
Yonghong Song says:
==================== Motivation:
The current ways to dump kernel data structures are mostly:
1. the /proc system
2. various specific tools like "ss" which require kernel support
3. drgn
The drawback of the first two is that whenever you want to dump more, you need to change the kernel. For example, Martin wants to dump socket local storage with "ss"; a kernel change is needed for it to work ([1]). This is also the direct motivation for this work.
drgn ([2]) solves this problem nicely and no kernel change is needed. But since drgn is not able to verify the validity of a particular pointer value, it might present wrong results in rare cases.
In this patch set, we introduce the bpf iterator. Initial kernel changes are still needed for interested kernel data, but later data structure changes will not require kernel changes any more. The bpf program itself can adapt to new data structure changes. This will give certain flexibility with guaranteed correctness.
In this patch set, kernel seq_ops is used to facilitate iterating through kernel data, similar to current /proc and many other lossless kernel dumping facilities. In the future, different iterators can be implemented to trade off losslessness for other criteria e.g. no repeated object visits, etc.
User Interface:
1. Similar to prog/map/link, the iterator can be pinned into a path within a bpffs mount point.
2. The bpftool command can pin an iterator to a file: bpftool iter pin <bpf_prog.o> <path>
3. Use `cat <path>` to dump the contents. Use `rm -f <path>` to remove the pinned iterator.
4. The anonymous iterator can be created as well.
Please see patches #19 and #20 for bpf programs and bpf iterator output examples.
Note that certain iterators are namespace aware. For example, task and task_file targets only iterate through current pid namespace. ipv6_route and netlink will iterate through current net namespace.
Please see individual patches for implementation details.
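For a flavor of what such a bpf program looks like, here is a sketch of a task iterator along the lines of the selftests in patches #19 and #20 (the output format is illustrative):

    SEC("iter/task")
    int dump_task(struct bpf_iter__task *ctx)
    {
        struct seq_file *seq = ctx->meta->seq;
        struct task_struct *task = ctx->task;

        if (!task)  /* a NULL task signals the end of iteration */
            return 0;
        BPF_SEQ_PRINTF(seq, "%8d %s\n", task->pid, task->comm);
        return 0;
    }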
Performance:
The bpf iterator provides in-kernel aggregation abilities for kernel data. This can greatly improve performance compared to, e.g., iterating all process directories under /proc. For example, I did an experiment on my VM with an application forking different numbers of tasks and each forked process opening various numbers of files. The following are the results, with latency in microseconds:

  # of forked tasks   # of open files   # of bpf_prog calls   latency (us)
  100                 100               11503                 7586
  1000                1000              1013203               709513
  10000               100               1130203               764519

The number of bpf_prog calls may be more than forked tasks multiplied by open files since there are other tasks running on the system. The bpf program is a do-nothing program. One million bpf calls take less than one second.
Although the initial motivation is from Martin's sk_local_storage, this patch didn't implement tcp6 sockets and sk_local_storage. The /proc/net/tcp6 involves three types of sockets, timewait, request and tcp6 sockets. Some kind of type casting or other mechanism is needed to handle all these socket types in one bpf program. This will be addressed in future work.
Currently, we do not support kernel data generated under module. This requires some BTF work.
More work for more iterators, e.g., tcp, udp, bpf_map elements, etc.
Changelog:
v3 -> v4:
- in bpf_seq_read(), if start() failed with an error, return that error to user space (Andrii)
- in bpf_seq_printf(), if reading kernel memory failed for %s and %p{i,I}{4,6}, set buffer to empty string or address 0. Documented this behavior in uapi header (Andrii)
- fix a few error handling issues for bpftool (Andrii)
- A few other minor fixes and cosmetic changes.
v2 -> v3:
- add bpf_iter_unreg_target() to unregister a target, used in the error path of the __init functions.
- handle err != 0 before handling overflow (Andrii)
- reference count "task" for task_file target (Andrii)
- remove some redundancy for bpf_map/task/task_file targets
- add bpf_iter_unreg_target() in ip6_route_cleanup()
- Handling "%%" format in bpf_seq_printf() (Andrii)
- implement auto-attach for bpf_iter in libbpf (Andrii)
- add macros offsetof and container_of in bpf_helpers.h (Andrii)
- add tests for auto-attach and program-return-1 cases
- some other minor fixes
v1 -> v2:
- removed target_feature, using callback functions instead
- checking target to ensure program specified btf_id supported (Martin)
- link_create change with new changes from Andrii
- better handling of btf_iter vs. seq_file private data (Martin, Andrii)
- implemented bpf_seq_read() (Andrii, Alexei)
- percpu buffer for bpf_seq_printf() (Andrii)
- better syntax for BPF_SEQ_PRINTF macro (Andrii)
- bpftool fixes (Quentin)
- a lot of other fixes
RFC v2 -> v1:
- rename bpfdump to bpf_iter
- use bpffs instead of a new file system
- use bpf_link to streamline and simplify iterator creation.
References: [1]: https://lore.kernel.org/bpf/20200225230427.1976129-1-kafai@fb.com [2]: https://github.com/osandov/drgn ====================
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
#
eaaacd23 |
| 09-May-2020 |
Yonghong Song <yhs@fb.com> |
bpf: Add task and task/file iterator targets
Only the tasks belonging to "current" pid namespace are enumerated.
For task/file target, the bpf program will have access to:
  struct task_struct *task
  u32 fd
  struct file *file
where fd/file is an open file for the task.
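A sketch of a task/file iterator program using that context (the printed columns are chosen for illustration):

    SEC("iter/task_file")
    int dump_task_file(struct bpf_iter__task_file *ctx)
    {
        struct seq_file *seq = ctx->meta->seq;
        struct task_struct *task = ctx->task;
        struct file *file = ctx->file;

        if (!task || !file)
            return 0;
        BPF_SEQ_PRINTF(seq, "%8d %8d %lx\n", task->tgid, ctx->fd,
                       (long)file->f_op);
        return 0;
    }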
Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200509175911.2476407-1-yhs@fb.com
|
#
6086d29d |
| 09-May-2020 |
Yonghong Song <yhs@fb.com> |
bpf: Add bpf_map iterator
Implement seq_file operations to traverse all bpf_maps.
Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200509175909.2476096-1-yhs@fb.com
|
#
ae24345d |
| 09-May-2020 |
Yonghong Song <yhs@fb.com> |
bpf: Implement an interface to register bpf_iter targets
The target can call bpf_iter_reg_target() to register itself. The needed information:
  target: target name
  seq_ops: the seq_file operations for the target
  init_seq_private: target callback to initialize seq_priv during file open
  fini_seq_private: target callback to clean up seq_priv during file release
  seq_priv_size: the private_data size needed by the seq_file operations
The target name represents a target which provides a seq_ops for iterating objects.
The target can provide two callback functions, init_seq_private and fini_seq_private, called during file open/release time. For example, /proc/net/{tcp6, ipv6_route, netlink, ...}, net name space needs to be setup properly during file open and released properly during file release.
Function bpf_iter_unreg_target() is also implemented to unregister a particular target.
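Putting the pieces together, registering a hypothetical "foo" target would look roughly like this (the foo_* callbacks and private struct are stand-ins; field names follow the description above):

    static const struct seq_operations foo_seq_ops = {
        .start = foo_seq_start,
        .next  = foo_seq_next,
        .stop  = foo_seq_stop,
        .show  = foo_seq_show,
    };

    static struct bpf_iter_reg foo_reg_info = {
        .target           = "foo",
        .seq_ops          = &foo_seq_ops,
        .init_seq_private = foo_init_seq_priv,  /* runs at file open */
        .fini_seq_private = foo_fini_seq_priv,  /* runs at file release */
        .seq_priv_size    = sizeof(struct foo_iter_priv),
    };

    static int __init foo_iter_init(void)
    {
        return bpf_iter_reg_target(&foo_reg_info);
    }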
Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200509175859.2474669-1-yhs@fb.com
|
Revision tags: v5.4.39, v5.4.38, v5.4.37, v5.4.36 |
|
#
4353dd3b |
| 25-Apr-2020 |
Ingo Molnar <mingo@kernel.org> |
Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi into efi/core
Pull EFI changes for v5.8 from Ard Biesheuvel:
"- preliminary changes for RISC-V - add support for setti
Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi into efi/core
Pull EFI changes for v5.8 from Ard Biesheuvel:
"- preliminary changes for RISC-V - add support for setting the resolution on the EFI framebuffer - simplify kernel image loading for arm64 - Move .bss into .data via the linker script instead of relying on symbol annotations. - Get rid of __pure getters to access global variables - Clean up the config table matching arrays"
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
36dbae99 |
| 24-Apr-2020 |
Takashi Iwai <tiwai@suse.de> |
Merge branch 'topic/nhlt' into for-next
Merge NHLT init cleanup.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|