xref: /openbmc/linux/Documentation/bpf/ringbuf.rst (revision 1d3cab43)
197abb2b3SAndrii Nakryiko===============
297abb2b3SAndrii NakryikoBPF ring buffer
397abb2b3SAndrii Nakryiko===============
497abb2b3SAndrii Nakryiko
597abb2b3SAndrii NakryikoThis document describes BPF ring buffer design, API, and implementation details.
697abb2b3SAndrii Nakryiko
797abb2b3SAndrii Nakryiko.. contents::
897abb2b3SAndrii Nakryiko    :local:
997abb2b3SAndrii Nakryiko    :depth: 2
1097abb2b3SAndrii Nakryiko
1197abb2b3SAndrii NakryikoMotivation
1297abb2b3SAndrii Nakryiko----------
1397abb2b3SAndrii Nakryiko
1497abb2b3SAndrii NakryikoThere are two distinctive motivators for this work, which are not satisfied by
1597abb2b3SAndrii Nakryikoexisting perf buffer, which prompted creation of a new ring buffer
1697abb2b3SAndrii Nakryikoimplementation.
1797abb2b3SAndrii Nakryiko
1897abb2b3SAndrii Nakryiko- more efficient memory utilization by sharing ring buffer across CPUs;
1997abb2b3SAndrii Nakryiko- preserving ordering of events that happen sequentially in time, even across
2097abb2b3SAndrii Nakryiko  multiple CPUs (e.g., fork/exec/exit events for a task).
2197abb2b3SAndrii Nakryiko
2297abb2b3SAndrii NakryikoThese two problems are independent, but perf buffer fails to satisfy both.
2397abb2b3SAndrii NakryikoBoth are a result of a choice to have per-CPU perf ring buffer.  Both can be
2497abb2b3SAndrii Nakryikoalso solved by having an MPSC implementation of ring buffer. The ordering
2597abb2b3SAndrii Nakryikoproblem could technically be solved for perf buffer with some in-kernel
2697abb2b3SAndrii Nakryikocounting, but given the first one requires an MPSC buffer, the same solution
2797abb2b3SAndrii Nakryikowould solve the second problem automatically.
2897abb2b3SAndrii Nakryiko
2997abb2b3SAndrii NakryikoSemantics and APIs
3097abb2b3SAndrii Nakryiko------------------
3197abb2b3SAndrii Nakryiko
3297abb2b3SAndrii NakryikoSingle ring buffer is presented to BPF programs as an instance of BPF map of
3397abb2b3SAndrii Nakryikotype ``BPF_MAP_TYPE_RINGBUF``. Two other alternatives considered, but
3497abb2b3SAndrii Nakryikoultimately rejected.
3597abb2b3SAndrii Nakryiko
3697abb2b3SAndrii NakryikoOne way would be to, similar to ``BPF_MAP_TYPE_PERF_EVENT_ARRAY``, make
3797abb2b3SAndrii Nakryiko``BPF_MAP_TYPE_RINGBUF`` could represent an array of ring buffers, but not
3897abb2b3SAndrii Nakryikoenforce "same CPU only" rule. This would be more familiar interface compatible
3997abb2b3SAndrii Nakryikowith existing perf buffer use in BPF, but would fail if application needed more
4097abb2b3SAndrii Nakryikoadvanced logic to lookup ring buffer by arbitrary key.
4197abb2b3SAndrii Nakryiko``BPF_MAP_TYPE_HASH_OF_MAPS`` addresses this with current approach.
4297abb2b3SAndrii NakryikoAdditionally, given the performance of BPF ringbuf, many use cases would just
4397abb2b3SAndrii Nakryikoopt into a simple single ring buffer shared among all CPUs, for which current
4497abb2b3SAndrii Nakryikoapproach would be an overkill.
4597abb2b3SAndrii Nakryiko
4697abb2b3SAndrii NakryikoAnother approach could introduce a new concept, alongside BPF map, to represent
4797abb2b3SAndrii Nakryikogeneric "container" object, which doesn't necessarily have key/value interface
4897abb2b3SAndrii Nakryikowith lookup/update/delete operations. This approach would add a lot of extra
4997abb2b3SAndrii Nakryikoinfrastructure that has to be built for observability and verifier support. It
5097abb2b3SAndrii Nakryikowould also add another concept that BPF developers would have to familiarize
5197abb2b3SAndrii Nakryikothemselves with, new syntax in libbpf, etc. But then would really provide no
5297abb2b3SAndrii Nakryikoadditional benefits over the approach of using a map.  ``BPF_MAP_TYPE_RINGBUF``
5397abb2b3SAndrii Nakryikodoesn't support lookup/update/delete operations, but so doesn't few other map
5497abb2b3SAndrii Nakryikotypes (e.g., queue and stack; array doesn't support delete, etc).
5597abb2b3SAndrii Nakryiko
5697abb2b3SAndrii NakryikoThe approach chosen has an advantage of re-using existing BPF map
5797abb2b3SAndrii Nakryikoinfrastructure (introspection APIs in kernel, libbpf support, etc), being
5897abb2b3SAndrii Nakryikofamiliar concept (no need to teach users a new type of object in BPF program),
5997abb2b3SAndrii Nakryikoand utilizing existing tooling (bpftool). For common scenario of using a single
6097abb2b3SAndrii Nakryikoring buffer for all CPUs, it's as simple and straightforward, as would be with
6197abb2b3SAndrii Nakryikoa dedicated "container" object. On the other hand, by being a map, it can be
6297abb2b3SAndrii Nakryikocombined with ``ARRAY_OF_MAPS`` and ``HASH_OF_MAPS`` map-in-maps to implement
6397abb2b3SAndrii Nakryikoa wide variety of topologies, from one ring buffer for each CPU (e.g., as
6497abb2b3SAndrii Nakryikoa replacement for perf buffer use cases), to a complicated application
6597abb2b3SAndrii Nakryikohashing/sharding of ring buffers (e.g., having a small pool of ring buffers
6697abb2b3SAndrii Nakryikowith hashed task's tgid being a look up key to preserve order, but reduce
6797abb2b3SAndrii Nakryikocontention).
6897abb2b3SAndrii Nakryiko
6997abb2b3SAndrii NakryikoKey and value sizes are enforced to be zero. ``max_entries`` is used to specify
7097abb2b3SAndrii Nakryikothe size of ring buffer and has to be a power of 2 value.
7197abb2b3SAndrii Nakryiko
7297abb2b3SAndrii NakryikoThere are a bunch of similarities between perf buffer
7397abb2b3SAndrii Nakryiko(``BPF_MAP_TYPE_PERF_EVENT_ARRAY``) and new BPF ring buffer semantics:
7497abb2b3SAndrii Nakryiko
7597abb2b3SAndrii Nakryiko- variable-length records;
7697abb2b3SAndrii Nakryiko- if there is no more space left in ring buffer, reservation fails, no
7797abb2b3SAndrii Nakryiko  blocking;
7897abb2b3SAndrii Nakryiko- memory-mappable data area for user-space applications for ease of
7997abb2b3SAndrii Nakryiko  consumption and high performance;
8097abb2b3SAndrii Nakryiko- epoll notifications for new incoming data;
8197abb2b3SAndrii Nakryiko- but still the ability to do busy polling for new data to achieve the
8297abb2b3SAndrii Nakryiko  lowest latency, if necessary.
8397abb2b3SAndrii Nakryiko
8497abb2b3SAndrii NakryikoBPF ringbuf provides two sets of APIs to BPF programs:
8597abb2b3SAndrii Nakryiko
8697abb2b3SAndrii Nakryiko- ``bpf_ringbuf_output()`` allows to *copy* data from one place to a ring
8797abb2b3SAndrii Nakryiko  buffer, similarly to ``bpf_perf_event_output()``;
8897abb2b3SAndrii Nakryiko- ``bpf_ringbuf_reserve()``/``bpf_ringbuf_commit()``/``bpf_ringbuf_discard()``
8997abb2b3SAndrii Nakryiko  APIs split the whole process into two steps. First, a fixed amount of space
9097abb2b3SAndrii Nakryiko  is reserved. If successful, a pointer to a data inside ring buffer data
9197abb2b3SAndrii Nakryiko  area is returned, which BPF programs can use similarly to a data inside
9297abb2b3SAndrii Nakryiko  array/hash maps. Once ready, this piece of memory is either committed or
9397abb2b3SAndrii Nakryiko  discarded. Discard is similar to commit, but makes consumer ignore the
9497abb2b3SAndrii Nakryiko  record.
9597abb2b3SAndrii Nakryiko
9697abb2b3SAndrii Nakryiko``bpf_ringbuf_output()`` has disadvantage of incurring extra memory copy,
9797abb2b3SAndrii Nakryikobecause record has to be prepared in some other place first. But it allows to
9897abb2b3SAndrii Nakryikosubmit records of the length that's not known to verifier beforehand. It also
9997abb2b3SAndrii Nakryikoclosely matches ``bpf_perf_event_output()``, so will simplify migration
10097abb2b3SAndrii Nakryikosignificantly.
10197abb2b3SAndrii Nakryiko
10297abb2b3SAndrii Nakryiko``bpf_ringbuf_reserve()`` avoids the extra copy of memory by providing a memory
10397abb2b3SAndrii Nakryikopointer directly to ring buffer memory. In a lot of cases records are larger
10497abb2b3SAndrii Nakryikothan BPF stack space allows, so many programs have use extra per-CPU array as
10597abb2b3SAndrii Nakryikoa temporary heap for preparing sample. bpf_ringbuf_reserve() avoid this needs
10697abb2b3SAndrii Nakryikocompletely. But in exchange, it only allows a known constant size of memory to
10797abb2b3SAndrii Nakryikobe reserved, such that verifier can verify that BPF program can't access memory
10897abb2b3SAndrii Nakryikooutside its reserved record space. bpf_ringbuf_output(), while slightly slower
10997abb2b3SAndrii Nakryikodue to extra memory copy, covers some use cases that are not suitable for
11097abb2b3SAndrii Nakryiko``bpf_ringbuf_reserve()``.
11197abb2b3SAndrii Nakryiko
11297abb2b3SAndrii NakryikoThe difference between commit and discard is very small. Discard just marks
11397abb2b3SAndrii Nakryikoa record as discarded, and such records are supposed to be ignored by consumer
11497abb2b3SAndrii Nakryikocode. Discard is useful for some advanced use-cases, such as ensuring
11597abb2b3SAndrii Nakryikoall-or-nothing multi-record submission, or emulating temporary
11697abb2b3SAndrii Nakryiko``malloc()``/``free()`` within single BPF program invocation.
11797abb2b3SAndrii Nakryiko
11897abb2b3SAndrii NakryikoEach reserved record is tracked by verifier through existing
11997abb2b3SAndrii Nakryikoreference-tracking logic, similar to socket ref-tracking. It is thus
12097abb2b3SAndrii Nakryikoimpossible to reserve a record, but forget to submit (or discard) it.
12197abb2b3SAndrii Nakryiko
12297abb2b3SAndrii Nakryiko``bpf_ringbuf_query()`` helper allows to query various properties of ring
12397abb2b3SAndrii Nakryikobuffer.  Currently 4 are supported:
12497abb2b3SAndrii Nakryiko
12597abb2b3SAndrii Nakryiko- ``BPF_RB_AVAIL_DATA`` returns amount of unconsumed data in ring buffer;
12697abb2b3SAndrii Nakryiko- ``BPF_RB_RING_SIZE`` returns the size of ring buffer;
127*1d3cab43SRandy Dunlap- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` returns current logical position
12897abb2b3SAndrii Nakryiko  of consumer/producer, respectively.
12997abb2b3SAndrii Nakryiko
13097abb2b3SAndrii NakryikoReturned values are momentarily snapshots of ring buffer state and could be
13197abb2b3SAndrii Nakryikooff by the time helper returns, so this should be used only for
13297abb2b3SAndrii Nakryikodebugging/reporting reasons or for implementing various heuristics, that take
13397abb2b3SAndrii Nakryikointo account highly-changeable nature of some of those characteristics.
13497abb2b3SAndrii Nakryiko
13597abb2b3SAndrii NakryikoOne such heuristic might involve more fine-grained control over poll/epoll
13697abb2b3SAndrii Nakryikonotifications about new data availability in ring buffer. Together with
13797abb2b3SAndrii Nakryiko``BPF_RB_NO_WAKEUP``/``BPF_RB_FORCE_WAKEUP`` flags for output/commit/discard
13897abb2b3SAndrii Nakryikohelpers, it allows BPF program a high degree of control and, e.g., more
13997abb2b3SAndrii Nakryikoefficient batched notifications. Default self-balancing strategy, though,
14097abb2b3SAndrii Nakryikoshould be adequate for most applications and will work reliable and efficiently
14197abb2b3SAndrii Nakryikoalready.
14297abb2b3SAndrii Nakryiko
14397abb2b3SAndrii NakryikoDesign and Implementation
14497abb2b3SAndrii Nakryiko-------------------------
14597abb2b3SAndrii Nakryiko
14697abb2b3SAndrii NakryikoThis reserve/commit schema allows a natural way for multiple producers, either
14797abb2b3SAndrii Nakryikoon different CPUs or even on the same CPU/in the same BPF program, to reserve
14897abb2b3SAndrii Nakryikoindependent records and work with them without blocking other producers. This
149*1d3cab43SRandy Dunlapmeans that if BPF program was interrupted by another BPF program sharing the
15097abb2b3SAndrii Nakryikosame ring buffer, they will both get a record reserved (provided there is
15197abb2b3SAndrii Nakryikoenough space left) and can work with it and submit it independently. This
15297abb2b3SAndrii Nakryikoapplies to NMI context as well, except that due to using a spinlock during
15397abb2b3SAndrii Nakryikoreservation, in NMI context, ``bpf_ringbuf_reserve()`` might fail to get
15497abb2b3SAndrii Nakryikoa lock, in which case reservation will fail even if ring buffer is not full.
15597abb2b3SAndrii Nakryiko
15697abb2b3SAndrii NakryikoThe ring buffer itself internally is implemented as a power-of-2 sized
15797abb2b3SAndrii Nakryikocircular buffer, with two logical and ever-increasing counters (which might
15897abb2b3SAndrii Nakryikowrap around on 32-bit architectures, that's not a problem):
15997abb2b3SAndrii Nakryiko
16097abb2b3SAndrii Nakryiko- consumer counter shows up to which logical position consumer consumed the
16197abb2b3SAndrii Nakryiko  data;
16297abb2b3SAndrii Nakryiko- producer counter denotes amount of data reserved by all producers.
16397abb2b3SAndrii Nakryiko
16497abb2b3SAndrii NakryikoEach time a record is reserved, producer that "owns" the record will
16597abb2b3SAndrii Nakryikosuccessfully advance producer counter. At that point, data is still not yet
16697abb2b3SAndrii Nakryikoready to be consumed, though. Each record has 8 byte header, which contains the
16797abb2b3SAndrii Nakryikolength of reserved record, as well as two extra bits: busy bit to denote that
16897abb2b3SAndrii Nakryikorecord is still being worked on, and discard bit, which might be set at commit
16997abb2b3SAndrii Nakryikotime if record is discarded. In the latter case, consumer is supposed to skip
17097abb2b3SAndrii Nakryikothe record and move on to the next one. Record header also encodes record's
17197abb2b3SAndrii Nakryikorelative offset from the beginning of ring buffer data area (in pages). This
17297abb2b3SAndrii Nakryikoallows ``bpf_ringbuf_commit()``/``bpf_ringbuf_discard()`` to accept only the
17397abb2b3SAndrii Nakryikopointer to the record itself, without requiring also the pointer to ring buffer
17497abb2b3SAndrii Nakryikoitself. Ring buffer memory location will be restored from record metadata
17597abb2b3SAndrii Nakryikoheader. This significantly simplifies verifier, as well as improving API
17697abb2b3SAndrii Nakryikousability.
17797abb2b3SAndrii Nakryiko
17897abb2b3SAndrii NakryikoProducer counter increments are serialized under spinlock, so there is
17997abb2b3SAndrii Nakryikoa strict ordering between reservations. Commits, on the other hand, are
18097abb2b3SAndrii Nakryikocompletely lockless and independent. All records become available to consumer
18197abb2b3SAndrii Nakryikoin the order of reservations, but only after all previous records where
18297abb2b3SAndrii Nakryikoalready committed. It is thus possible for slow producers to temporarily hold
18397abb2b3SAndrii Nakryikooff submitted records, that were reserved later.
18497abb2b3SAndrii Nakryiko
18597abb2b3SAndrii NakryikoOne interesting implementation bit, that significantly simplifies (and thus
18697abb2b3SAndrii Nakryikospeeds up as well) implementation of both producers and consumers is how data
18797abb2b3SAndrii Nakryikoarea is mapped twice contiguously back-to-back in the virtual memory. This
18897abb2b3SAndrii Nakryikoallows to not take any special measures for samples that have to wrap around
18997abb2b3SAndrii Nakryikoat the end of the circular buffer data area, because the next page after the
19097abb2b3SAndrii Nakryikolast data page would be first data page again, and thus the sample will still
19197abb2b3SAndrii Nakryikoappear completely contiguous in virtual memory. See comment and a simple ASCII
19297abb2b3SAndrii Nakryikodiagram showing this visually in ``bpf_ringbuf_area_alloc()``.
19397abb2b3SAndrii Nakryiko
19497abb2b3SAndrii NakryikoAnother feature that distinguishes BPF ringbuf from perf ring buffer is
19597abb2b3SAndrii Nakryikoa self-pacing notifications of new data being availability.
19697abb2b3SAndrii Nakryiko``bpf_ringbuf_commit()`` implementation will send a notification of new record
19797abb2b3SAndrii Nakryikobeing available after commit only if consumer has already caught up right up to
19897abb2b3SAndrii Nakryikothe record being committed. If not, consumer still has to catch up and thus
19997abb2b3SAndrii Nakryikowill see new data anyways without needing an extra poll notification.
20065dce596SAndrii NakryikoBenchmarks (see tools/testing/selftests/bpf/benchs/bench_ringbufs.c) show that
20197abb2b3SAndrii Nakryikothis allows to achieve a very high throughput without having to resort to
20297abb2b3SAndrii Nakryikotricks like "notify only every Nth sample", which are necessary with perf
20397abb2b3SAndrii Nakryikobuffer. For extreme cases, when BPF program wants more manual control of
20497abb2b3SAndrii Nakryikonotifications, commit/discard/output helpers accept ``BPF_RB_NO_WAKEUP`` and
20597abb2b3SAndrii Nakryiko``BPF_RB_FORCE_WAKEUP`` flags, which give full control over notifications of
20697abb2b3SAndrii Nakryikodata availability, but require extra caution and diligence in using this API.
207