197abb2b3SAndrii Nakryiko=============== 297abb2b3SAndrii NakryikoBPF ring buffer 397abb2b3SAndrii Nakryiko=============== 497abb2b3SAndrii Nakryiko 597abb2b3SAndrii NakryikoThis document describes BPF ring buffer design, API, and implementation details. 697abb2b3SAndrii Nakryiko 797abb2b3SAndrii Nakryiko.. contents:: 897abb2b3SAndrii Nakryiko :local: 997abb2b3SAndrii Nakryiko :depth: 2 1097abb2b3SAndrii Nakryiko 1197abb2b3SAndrii NakryikoMotivation 1297abb2b3SAndrii Nakryiko---------- 1397abb2b3SAndrii Nakryiko 1497abb2b3SAndrii NakryikoThere are two distinctive motivators for this work, which are not satisfied by 1597abb2b3SAndrii Nakryikoexisting perf buffer, which prompted creation of a new ring buffer 1697abb2b3SAndrii Nakryikoimplementation. 1797abb2b3SAndrii Nakryiko 1897abb2b3SAndrii Nakryiko- more efficient memory utilization by sharing ring buffer across CPUs; 1997abb2b3SAndrii Nakryiko- preserving ordering of events that happen sequentially in time, even across 2097abb2b3SAndrii Nakryiko multiple CPUs (e.g., fork/exec/exit events for a task). 2197abb2b3SAndrii Nakryiko 2297abb2b3SAndrii NakryikoThese two problems are independent, but perf buffer fails to satisfy both. 2397abb2b3SAndrii NakryikoBoth are a result of a choice to have per-CPU perf ring buffer. Both can be 2497abb2b3SAndrii Nakryikoalso solved by having an MPSC implementation of ring buffer. The ordering 2597abb2b3SAndrii Nakryikoproblem could technically be solved for perf buffer with some in-kernel 2697abb2b3SAndrii Nakryikocounting, but given the first one requires an MPSC buffer, the same solution 2797abb2b3SAndrii Nakryikowould solve the second problem automatically. 2897abb2b3SAndrii Nakryiko 2997abb2b3SAndrii NakryikoSemantics and APIs 3097abb2b3SAndrii Nakryiko------------------ 3197abb2b3SAndrii Nakryiko 3297abb2b3SAndrii NakryikoSingle ring buffer is presented to BPF programs as an instance of BPF map of 3397abb2b3SAndrii Nakryikotype ``BPF_MAP_TYPE_RINGBUF``. Two other alternatives considered, but 3497abb2b3SAndrii Nakryikoultimately rejected. 3597abb2b3SAndrii Nakryiko 3697abb2b3SAndrii NakryikoOne way would be to, similar to ``BPF_MAP_TYPE_PERF_EVENT_ARRAY``, make 3797abb2b3SAndrii Nakryiko``BPF_MAP_TYPE_RINGBUF`` could represent an array of ring buffers, but not 3897abb2b3SAndrii Nakryikoenforce "same CPU only" rule. This would be more familiar interface compatible 3997abb2b3SAndrii Nakryikowith existing perf buffer use in BPF, but would fail if application needed more 4097abb2b3SAndrii Nakryikoadvanced logic to lookup ring buffer by arbitrary key. 4197abb2b3SAndrii Nakryiko``BPF_MAP_TYPE_HASH_OF_MAPS`` addresses this with current approach. 4297abb2b3SAndrii NakryikoAdditionally, given the performance of BPF ringbuf, many use cases would just 4397abb2b3SAndrii Nakryikoopt into a simple single ring buffer shared among all CPUs, for which current 4497abb2b3SAndrii Nakryikoapproach would be an overkill. 4597abb2b3SAndrii Nakryiko 4697abb2b3SAndrii NakryikoAnother approach could introduce a new concept, alongside BPF map, to represent 4797abb2b3SAndrii Nakryikogeneric "container" object, which doesn't necessarily have key/value interface 4897abb2b3SAndrii Nakryikowith lookup/update/delete operations. This approach would add a lot of extra 4997abb2b3SAndrii Nakryikoinfrastructure that has to be built for observability and verifier support. It 5097abb2b3SAndrii Nakryikowould also add another concept that BPF developers would have to familiarize 5197abb2b3SAndrii Nakryikothemselves with, new syntax in libbpf, etc. But then would really provide no 5297abb2b3SAndrii Nakryikoadditional benefits over the approach of using a map. ``BPF_MAP_TYPE_RINGBUF`` 5397abb2b3SAndrii Nakryikodoesn't support lookup/update/delete operations, but so doesn't few other map 5497abb2b3SAndrii Nakryikotypes (e.g., queue and stack; array doesn't support delete, etc). 5597abb2b3SAndrii Nakryiko 5697abb2b3SAndrii NakryikoThe approach chosen has an advantage of re-using existing BPF map 5797abb2b3SAndrii Nakryikoinfrastructure (introspection APIs in kernel, libbpf support, etc), being 5897abb2b3SAndrii Nakryikofamiliar concept (no need to teach users a new type of object in BPF program), 5997abb2b3SAndrii Nakryikoand utilizing existing tooling (bpftool). For common scenario of using a single 6097abb2b3SAndrii Nakryikoring buffer for all CPUs, it's as simple and straightforward, as would be with 6197abb2b3SAndrii Nakryikoa dedicated "container" object. On the other hand, by being a map, it can be 6297abb2b3SAndrii Nakryikocombined with ``ARRAY_OF_MAPS`` and ``HASH_OF_MAPS`` map-in-maps to implement 6397abb2b3SAndrii Nakryikoa wide variety of topologies, from one ring buffer for each CPU (e.g., as 6497abb2b3SAndrii Nakryikoa replacement for perf buffer use cases), to a complicated application 6597abb2b3SAndrii Nakryikohashing/sharding of ring buffers (e.g., having a small pool of ring buffers 6697abb2b3SAndrii Nakryikowith hashed task's tgid being a look up key to preserve order, but reduce 6797abb2b3SAndrii Nakryikocontention). 6897abb2b3SAndrii Nakryiko 6997abb2b3SAndrii NakryikoKey and value sizes are enforced to be zero. ``max_entries`` is used to specify 7097abb2b3SAndrii Nakryikothe size of ring buffer and has to be a power of 2 value. 7197abb2b3SAndrii Nakryiko 7297abb2b3SAndrii NakryikoThere are a bunch of similarities between perf buffer 7397abb2b3SAndrii Nakryiko(``BPF_MAP_TYPE_PERF_EVENT_ARRAY``) and new BPF ring buffer semantics: 7497abb2b3SAndrii Nakryiko 7597abb2b3SAndrii Nakryiko- variable-length records; 7697abb2b3SAndrii Nakryiko- if there is no more space left in ring buffer, reservation fails, no 7797abb2b3SAndrii Nakryiko blocking; 7897abb2b3SAndrii Nakryiko- memory-mappable data area for user-space applications for ease of 7997abb2b3SAndrii Nakryiko consumption and high performance; 8097abb2b3SAndrii Nakryiko- epoll notifications for new incoming data; 8197abb2b3SAndrii Nakryiko- but still the ability to do busy polling for new data to achieve the 8297abb2b3SAndrii Nakryiko lowest latency, if necessary. 8397abb2b3SAndrii Nakryiko 8497abb2b3SAndrii NakryikoBPF ringbuf provides two sets of APIs to BPF programs: 8597abb2b3SAndrii Nakryiko 8697abb2b3SAndrii Nakryiko- ``bpf_ringbuf_output()`` allows to *copy* data from one place to a ring 8797abb2b3SAndrii Nakryiko buffer, similarly to ``bpf_perf_event_output()``; 8897abb2b3SAndrii Nakryiko- ``bpf_ringbuf_reserve()``/``bpf_ringbuf_commit()``/``bpf_ringbuf_discard()`` 8997abb2b3SAndrii Nakryiko APIs split the whole process into two steps. First, a fixed amount of space 9097abb2b3SAndrii Nakryiko is reserved. If successful, a pointer to a data inside ring buffer data 9197abb2b3SAndrii Nakryiko area is returned, which BPF programs can use similarly to a data inside 9297abb2b3SAndrii Nakryiko array/hash maps. Once ready, this piece of memory is either committed or 9397abb2b3SAndrii Nakryiko discarded. Discard is similar to commit, but makes consumer ignore the 9497abb2b3SAndrii Nakryiko record. 9597abb2b3SAndrii Nakryiko 9697abb2b3SAndrii Nakryiko``bpf_ringbuf_output()`` has disadvantage of incurring extra memory copy, 9797abb2b3SAndrii Nakryikobecause record has to be prepared in some other place first. But it allows to 9897abb2b3SAndrii Nakryikosubmit records of the length that's not known to verifier beforehand. It also 9997abb2b3SAndrii Nakryikoclosely matches ``bpf_perf_event_output()``, so will simplify migration 10097abb2b3SAndrii Nakryikosignificantly. 10197abb2b3SAndrii Nakryiko 10297abb2b3SAndrii Nakryiko``bpf_ringbuf_reserve()`` avoids the extra copy of memory by providing a memory 10397abb2b3SAndrii Nakryikopointer directly to ring buffer memory. In a lot of cases records are larger 10497abb2b3SAndrii Nakryikothan BPF stack space allows, so many programs have use extra per-CPU array as 10597abb2b3SAndrii Nakryikoa temporary heap for preparing sample. bpf_ringbuf_reserve() avoid this needs 10697abb2b3SAndrii Nakryikocompletely. But in exchange, it only allows a known constant size of memory to 10797abb2b3SAndrii Nakryikobe reserved, such that verifier can verify that BPF program can't access memory 10897abb2b3SAndrii Nakryikooutside its reserved record space. bpf_ringbuf_output(), while slightly slower 10997abb2b3SAndrii Nakryikodue to extra memory copy, covers some use cases that are not suitable for 11097abb2b3SAndrii Nakryiko``bpf_ringbuf_reserve()``. 11197abb2b3SAndrii Nakryiko 11297abb2b3SAndrii NakryikoThe difference between commit and discard is very small. Discard just marks 11397abb2b3SAndrii Nakryikoa record as discarded, and such records are supposed to be ignored by consumer 11497abb2b3SAndrii Nakryikocode. Discard is useful for some advanced use-cases, such as ensuring 11597abb2b3SAndrii Nakryikoall-or-nothing multi-record submission, or emulating temporary 11697abb2b3SAndrii Nakryiko``malloc()``/``free()`` within single BPF program invocation. 11797abb2b3SAndrii Nakryiko 11897abb2b3SAndrii NakryikoEach reserved record is tracked by verifier through existing 11997abb2b3SAndrii Nakryikoreference-tracking logic, similar to socket ref-tracking. It is thus 12097abb2b3SAndrii Nakryikoimpossible to reserve a record, but forget to submit (or discard) it. 12197abb2b3SAndrii Nakryiko 12297abb2b3SAndrii Nakryiko``bpf_ringbuf_query()`` helper allows to query various properties of ring 12397abb2b3SAndrii Nakryikobuffer. Currently 4 are supported: 12497abb2b3SAndrii Nakryiko 12597abb2b3SAndrii Nakryiko- ``BPF_RB_AVAIL_DATA`` returns amount of unconsumed data in ring buffer; 12697abb2b3SAndrii Nakryiko- ``BPF_RB_RING_SIZE`` returns the size of ring buffer; 127*1d3cab43SRandy Dunlap- ``BPF_RB_CONS_POS``/``BPF_RB_PROD_POS`` returns current logical position 12897abb2b3SAndrii Nakryiko of consumer/producer, respectively. 12997abb2b3SAndrii Nakryiko 13097abb2b3SAndrii NakryikoReturned values are momentarily snapshots of ring buffer state and could be 13197abb2b3SAndrii Nakryikooff by the time helper returns, so this should be used only for 13297abb2b3SAndrii Nakryikodebugging/reporting reasons or for implementing various heuristics, that take 13397abb2b3SAndrii Nakryikointo account highly-changeable nature of some of those characteristics. 13497abb2b3SAndrii Nakryiko 13597abb2b3SAndrii NakryikoOne such heuristic might involve more fine-grained control over poll/epoll 13697abb2b3SAndrii Nakryikonotifications about new data availability in ring buffer. Together with 13797abb2b3SAndrii Nakryiko``BPF_RB_NO_WAKEUP``/``BPF_RB_FORCE_WAKEUP`` flags for output/commit/discard 13897abb2b3SAndrii Nakryikohelpers, it allows BPF program a high degree of control and, e.g., more 13997abb2b3SAndrii Nakryikoefficient batched notifications. Default self-balancing strategy, though, 14097abb2b3SAndrii Nakryikoshould be adequate for most applications and will work reliable and efficiently 14197abb2b3SAndrii Nakryikoalready. 14297abb2b3SAndrii Nakryiko 14397abb2b3SAndrii NakryikoDesign and Implementation 14497abb2b3SAndrii Nakryiko------------------------- 14597abb2b3SAndrii Nakryiko 14697abb2b3SAndrii NakryikoThis reserve/commit schema allows a natural way for multiple producers, either 14797abb2b3SAndrii Nakryikoon different CPUs or even on the same CPU/in the same BPF program, to reserve 14897abb2b3SAndrii Nakryikoindependent records and work with them without blocking other producers. This 149*1d3cab43SRandy Dunlapmeans that if BPF program was interrupted by another BPF program sharing the 15097abb2b3SAndrii Nakryikosame ring buffer, they will both get a record reserved (provided there is 15197abb2b3SAndrii Nakryikoenough space left) and can work with it and submit it independently. This 15297abb2b3SAndrii Nakryikoapplies to NMI context as well, except that due to using a spinlock during 15397abb2b3SAndrii Nakryikoreservation, in NMI context, ``bpf_ringbuf_reserve()`` might fail to get 15497abb2b3SAndrii Nakryikoa lock, in which case reservation will fail even if ring buffer is not full. 15597abb2b3SAndrii Nakryiko 15697abb2b3SAndrii NakryikoThe ring buffer itself internally is implemented as a power-of-2 sized 15797abb2b3SAndrii Nakryikocircular buffer, with two logical and ever-increasing counters (which might 15897abb2b3SAndrii Nakryikowrap around on 32-bit architectures, that's not a problem): 15997abb2b3SAndrii Nakryiko 16097abb2b3SAndrii Nakryiko- consumer counter shows up to which logical position consumer consumed the 16197abb2b3SAndrii Nakryiko data; 16297abb2b3SAndrii Nakryiko- producer counter denotes amount of data reserved by all producers. 16397abb2b3SAndrii Nakryiko 16497abb2b3SAndrii NakryikoEach time a record is reserved, producer that "owns" the record will 16597abb2b3SAndrii Nakryikosuccessfully advance producer counter. At that point, data is still not yet 16697abb2b3SAndrii Nakryikoready to be consumed, though. Each record has 8 byte header, which contains the 16797abb2b3SAndrii Nakryikolength of reserved record, as well as two extra bits: busy bit to denote that 16897abb2b3SAndrii Nakryikorecord is still being worked on, and discard bit, which might be set at commit 16997abb2b3SAndrii Nakryikotime if record is discarded. In the latter case, consumer is supposed to skip 17097abb2b3SAndrii Nakryikothe record and move on to the next one. Record header also encodes record's 17197abb2b3SAndrii Nakryikorelative offset from the beginning of ring buffer data area (in pages). This 17297abb2b3SAndrii Nakryikoallows ``bpf_ringbuf_commit()``/``bpf_ringbuf_discard()`` to accept only the 17397abb2b3SAndrii Nakryikopointer to the record itself, without requiring also the pointer to ring buffer 17497abb2b3SAndrii Nakryikoitself. Ring buffer memory location will be restored from record metadata 17597abb2b3SAndrii Nakryikoheader. This significantly simplifies verifier, as well as improving API 17697abb2b3SAndrii Nakryikousability. 17797abb2b3SAndrii Nakryiko 17897abb2b3SAndrii NakryikoProducer counter increments are serialized under spinlock, so there is 17997abb2b3SAndrii Nakryikoa strict ordering between reservations. Commits, on the other hand, are 18097abb2b3SAndrii Nakryikocompletely lockless and independent. All records become available to consumer 18197abb2b3SAndrii Nakryikoin the order of reservations, but only after all previous records where 18297abb2b3SAndrii Nakryikoalready committed. It is thus possible for slow producers to temporarily hold 18397abb2b3SAndrii Nakryikooff submitted records, that were reserved later. 18497abb2b3SAndrii Nakryiko 18597abb2b3SAndrii NakryikoOne interesting implementation bit, that significantly simplifies (and thus 18697abb2b3SAndrii Nakryikospeeds up as well) implementation of both producers and consumers is how data 18797abb2b3SAndrii Nakryikoarea is mapped twice contiguously back-to-back in the virtual memory. This 18897abb2b3SAndrii Nakryikoallows to not take any special measures for samples that have to wrap around 18997abb2b3SAndrii Nakryikoat the end of the circular buffer data area, because the next page after the 19097abb2b3SAndrii Nakryikolast data page would be first data page again, and thus the sample will still 19197abb2b3SAndrii Nakryikoappear completely contiguous in virtual memory. See comment and a simple ASCII 19297abb2b3SAndrii Nakryikodiagram showing this visually in ``bpf_ringbuf_area_alloc()``. 19397abb2b3SAndrii Nakryiko 19497abb2b3SAndrii NakryikoAnother feature that distinguishes BPF ringbuf from perf ring buffer is 19597abb2b3SAndrii Nakryikoa self-pacing notifications of new data being availability. 19697abb2b3SAndrii Nakryiko``bpf_ringbuf_commit()`` implementation will send a notification of new record 19797abb2b3SAndrii Nakryikobeing available after commit only if consumer has already caught up right up to 19897abb2b3SAndrii Nakryikothe record being committed. If not, consumer still has to catch up and thus 19997abb2b3SAndrii Nakryikowill see new data anyways without needing an extra poll notification. 20065dce596SAndrii NakryikoBenchmarks (see tools/testing/selftests/bpf/benchs/bench_ringbufs.c) show that 20197abb2b3SAndrii Nakryikothis allows to achieve a very high throughput without having to resort to 20297abb2b3SAndrii Nakryikotricks like "notify only every Nth sample", which are necessary with perf 20397abb2b3SAndrii Nakryikobuffer. For extreme cases, when BPF program wants more manual control of 20497abb2b3SAndrii Nakryikonotifications, commit/discard/output helpers accept ``BPF_RB_NO_WAKEUP`` and 20597abb2b3SAndrii Nakryiko``BPF_RB_FORCE_WAKEUP`` flags, which give full control over notifications of 20697abb2b3SAndrii Nakryikodata availability, but require extra caution and diligence in using this API. 207