xref: /openbmc/linux/Documentation/bpf/graph_ds_impl.rst (revision c900529f3d9161bfde5cca0754f83b4d3c3e0220)
1c31315c3SDave Marchevsky=========================
2c31315c3SDave MarchevskyBPF Graph Data Structures
3c31315c3SDave Marchevsky=========================
4c31315c3SDave Marchevsky
5c31315c3SDave MarchevskyThis document describes implementation details of new-style "graph" data
6c31315c3SDave Marchevskystructures (linked_list, rbtree), with particular focus on the verifier's
7c31315c3SDave Marchevskyimplementation of semantics specific to those data structures.
8c31315c3SDave Marchevsky
9c31315c3SDave MarchevskyAlthough no specific verifier code is referred to in this document, the document
10c31315c3SDave Marchevskyassumes that the reader has general knowledge of BPF verifier internals, BPF
11c31315c3SDave Marchevskymaps, and BPF program writing.
12c31315c3SDave Marchevsky
13c31315c3SDave MarchevskyNote that the intent of this document is to describe the current state of
14c31315c3SDave Marchevskythese graph data structures. **No guarantees** of stability for either
15c31315c3SDave Marchevskysemantics or APIs are made or implied here.
16c31315c3SDave Marchevsky
17c31315c3SDave Marchevsky.. contents::
18c31315c3SDave Marchevsky    :local:
19c31315c3SDave Marchevsky    :depth: 2
20c31315c3SDave Marchevsky
21c31315c3SDave MarchevskyIntroduction
22c31315c3SDave Marchevsky------------
23c31315c3SDave Marchevsky
24c31315c3SDave MarchevskyThe BPF map API has historically been the main way to expose data structures
25c31315c3SDave Marchevskyof various types for use within BPF programs. Some data structures fit naturally
26*d56b699dSBjorn Helgaaswith the map API (HASH, ARRAY), others less so. Consequently, programs
27c31315c3SDave Marchevskyinteracting with the latter group of data structures can be hard to parse
28c31315c3SDave Marchevskyfor kernel programmers without previous BPF experience.
29c31315c3SDave Marchevsky
30c31315c3SDave MarchevskyLuckily, some restrictions which necessitated the use of BPF map semantics are
31c31315c3SDave Marchevskyno longer relevant. With the introduction of kfuncs, kptrs, and the any-context
32c31315c3SDave MarchevskyBPF allocator, it is now possible to implement BPF data structures whose API
33c31315c3SDave Marchevskyand semantics more closely match those exposed to the rest of the kernel.
34c31315c3SDave Marchevsky
35c31315c3SDave MarchevskyTwo such data structures - linked_list and rbtree - have many verification
36c31315c3SDave Marchevskydetails in common. Because both have "root"s ("head" for linked_list) and
37c31315c3SDave Marchevsky"node"s, the verifier code and this document refer to common functionality
38c31315c3SDave Marchevskyas "graph_api", "graph_root", "graph_node", etc.
39c31315c3SDave Marchevsky
40c31315c3SDave MarchevskyUnless otherwise stated, examples and semantics below apply to both graph data
41c31315c3SDave Marchevskystructures.
42c31315c3SDave Marchevsky
43c31315c3SDave MarchevskyUnstable API
44c31315c3SDave Marchevsky------------
45c31315c3SDave Marchevsky
46c31315c3SDave MarchevskyData structures implemented using the BPF map API have historically used BPF
47c31315c3SDave Marchevskyhelper functions - either standard map API helpers like ``bpf_map_update_elem``
48c31315c3SDave Marchevskyor map-specific helpers. The new-style graph data structures instead use kfuncs
49c31315c3SDave Marchevskyto define their manipulation helpers. Because there are no stability guarantees
50c31315c3SDave Marchevskyfor kfuncs, the API and semantics for these data structures can be evolved in
51c31315c3SDave Marchevskya way that breaks backwards compatibility if necessary.
52c31315c3SDave Marchevsky
53c31315c3SDave MarchevskyRoot and node types for the new data structures are opaquely defined in the
54c31315c3SDave Marchevsky``uapi/linux/bpf.h`` header.
55c31315c3SDave Marchevsky
56c31315c3SDave MarchevskyLocking
57c31315c3SDave Marchevsky-------
58c31315c3SDave Marchevsky
59c31315c3SDave MarchevskyThe new-style data structures are intrusive and are defined similarly to their
60c31315c3SDave Marchevskyvanilla kernel counterparts:
61c31315c3SDave Marchevsky
62c31315c3SDave Marchevsky.. code-block:: c
63e2d323a1SBagas Sanjaya
64c31315c3SDave Marchevsky        struct node_data {
65c31315c3SDave Marchevsky          long key;
66c31315c3SDave Marchevsky          long data;
67c31315c3SDave Marchevsky          struct bpf_rb_node node;
68c31315c3SDave Marchevsky        };
69c31315c3SDave Marchevsky
70c31315c3SDave Marchevsky        struct bpf_spin_lock glock;
71c31315c3SDave Marchevsky        struct bpf_rb_root groot __contains(node_data, node);
72c31315c3SDave Marchevsky
73c31315c3SDave MarchevskyThe "root" type for both linked_list and rbtree expects to be in a map_value
74c31315c3SDave Marchevskywhich also contains a ``bpf_spin_lock`` - in the above example both global
75c31315c3SDave Marchevskyvariables are placed in a single-value arraymap. The verifier considers this
76c31315c3SDave Marchevskyspin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in
77c31315c3SDave Marchevskythe same map_value and will enforce that the correct lock is held when
78c31315c3SDave Marchevskyverifying BPF programs that manipulate the tree. Since this lock checking
79c31315c3SDave Marchevskyhappens at verification time, there is no runtime penalty.
80c31315c3SDave Marchevsky
81c31315c3SDave MarchevskyNon-owning references
82c31315c3SDave Marchevsky---------------------
83c31315c3SDave Marchevsky
84c31315c3SDave Marchevsky**Motivation**
85c31315c3SDave Marchevsky
86c31315c3SDave MarchevskyConsider the following BPF code:
87c31315c3SDave Marchevsky
88c31315c3SDave Marchevsky.. code-block:: c
89c31315c3SDave Marchevsky
90c31315c3SDave Marchevsky        struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */
91c31315c3SDave Marchevsky
92c31315c3SDave Marchevsky        bpf_spin_lock(&lock);
93c31315c3SDave Marchevsky
94c31315c3SDave Marchevsky        bpf_rbtree_add(&tree, n); /* PASSED */
95c31315c3SDave Marchevsky
96c31315c3SDave Marchevsky        bpf_spin_unlock(&lock);
97c31315c3SDave Marchevsky
98c31315c3SDave MarchevskyFrom the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new``
99c31315c3SDave Marchevskyhas type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of
100c31315c3SDave Marchevsky``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the
101c31315c3SDave Marchevskyprogram has ownership of the pointee's (object pointed to by ``n``) lifetime.
102c31315c3SDave MarchevskyThe BPF program must pass off ownership before exiting - either via
103c31315c3SDave Marchevsky``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with
104c31315c3SDave Marchevsky``bpf_rbtree_add``.
105c31315c3SDave Marchevsky
106c31315c3SDave Marchevsky(``ACQUIRED`` and ``PASSED`` comments in the example denote statements where
107c31315c3SDave Marchevsky"ownership is acquired" and "ownership is passed", respectively)
108c31315c3SDave Marchevsky
109c31315c3SDave MarchevskyWhat should the verifier do with ``n`` after ownership is passed off? If the
110c31315c3SDave Marchevskyobject was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier
111c31315c3SDave Marchevskyshould reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as
112c31315c3SDave Marchevskythe object is no longer valid. The underlying memory may have been reused for
113c31315c3SDave Marchevskysome other allocation, unmapped, etc.
114c31315c3SDave Marchevsky
115c31315c3SDave MarchevskyWhen ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less
116c31315c3SDave Marchevskyobvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``,
117c31315c3SDave Marchevskybut that would result in programs with useful, common coding patterns being
118c31315c3SDave Marchevskyrejected, e.g.:
119c31315c3SDave Marchevsky
120c31315c3SDave Marchevsky.. code-block:: c
121c31315c3SDave Marchevsky
122c31315c3SDave Marchevsky        int x;
123c31315c3SDave Marchevsky        struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */
124c31315c3SDave Marchevsky
125c31315c3SDave Marchevsky        bpf_spin_lock(&lock);
126c31315c3SDave Marchevsky
127c31315c3SDave Marchevsky        bpf_rbtree_add(&tree, n); /* PASSED */
128c31315c3SDave Marchevsky        x = n->data;
129c31315c3SDave Marchevsky        n->data = 42;
130c31315c3SDave Marchevsky
131c31315c3SDave Marchevsky        bpf_spin_unlock(&lock);
132c31315c3SDave Marchevsky
133c31315c3SDave MarchevskyBoth the read from and write to ``n->data`` would be rejected. The verifier
134c31315c3SDave Marchevskycan do better, though, by taking advantage of two details:
135c31315c3SDave Marchevsky
136c31315c3SDave Marchevsky  * Graph data structure APIs can only be used when the ``bpf_spin_lock``
137c31315c3SDave Marchevsky    associated with the graph root is held
138c31315c3SDave Marchevsky
139c31315c3SDave Marchevsky  * Both graph data structures have pointer stability
140c31315c3SDave Marchevsky
141c31315c3SDave Marchevsky     * Because graph nodes are allocated with ``bpf_obj_new`` and
142c31315c3SDave Marchevsky       adding / removing from the root involves fiddling with the
143c31315c3SDave Marchevsky       ``bpf_{list,rb}_node`` field of the node struct, a graph node will
144c31315c3SDave Marchevsky       remain at the same address after either operation.
145c31315c3SDave Marchevsky
146c31315c3SDave MarchevskyBecause the associated ``bpf_spin_lock`` must be held by any program adding
147c31315c3SDave Marchevskyor removing, if we're in the critical section bounded by that lock, we know
148c31315c3SDave Marchevskythat no other program can add or remove until the end of the critical section.
149c31315c3SDave MarchevskyThis combined with pointer stability means that, until the critical section
150c31315c3SDave Marchevskyends, we can safely access the graph node through ``n`` even after it was used
151c31315c3SDave Marchevskyto pass ownership.
152c31315c3SDave Marchevsky
153c31315c3SDave MarchevskyThe verifier considers such a reference a *non-owning reference*. The ref
154c31315c3SDave Marchevskyreturned by ``bpf_obj_new`` is accordingly considered an *owning reference*.
155c31315c3SDave MarchevskyBoth terms currently only have meaning in the context of graph nodes and API.
156c31315c3SDave Marchevsky
157c31315c3SDave Marchevsky**Details**
158c31315c3SDave Marchevsky
159c31315c3SDave MarchevskyLet's enumerate the properties of both types of references.
160c31315c3SDave Marchevsky
161c31315c3SDave Marchevsky*owning reference*
162c31315c3SDave Marchevsky
163c31315c3SDave Marchevsky  * This reference controls the lifetime of the pointee
164c31315c3SDave Marchevsky
165c31315c3SDave Marchevsky  * Ownership of pointee must be 'released' by passing it to some graph API
166c31315c3SDave Marchevsky    kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee
167c31315c3SDave Marchevsky
168c31315c3SDave Marchevsky    * If not released before program ends, verifier considers program invalid
169c31315c3SDave Marchevsky
170c31315c3SDave Marchevsky  * Access to the pointee's memory will not page fault
171c31315c3SDave Marchevsky
172c31315c3SDave Marchevsky*non-owning reference*
173c31315c3SDave Marchevsky
174c31315c3SDave Marchevsky  * This reference does not own the pointee
175c31315c3SDave Marchevsky
176c31315c3SDave Marchevsky     * It cannot be used to add the graph node to a graph root, nor ``free``'d via
177c31315c3SDave Marchevsky       ``bpf_obj_drop``
178c31315c3SDave Marchevsky
179c31315c3SDave Marchevsky  * No explicit control of lifetime, but can infer valid lifetime based on
180c31315c3SDave Marchevsky    non-owning ref existence (see explanation below)
181c31315c3SDave Marchevsky
182c31315c3SDave Marchevsky  * Access to the pointee's memory will not page fault
183c31315c3SDave Marchevsky
184c31315c3SDave MarchevskyFrom verifier's perspective non-owning references can only exist
185c31315c3SDave Marchevskybetween spin_lock and spin_unlock. Why? After spin_unlock another program
186c31315c3SDave Marchevskycan do arbitrary operations on the data structure like removing and ``free``-ing
187c31315c3SDave Marchevskyvia bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
188c31315c3SDave Marchevsky``free``'d, and reused via bpf_obj_new would point to an entirely different thing.
189c31315c3SDave MarchevskyOr the memory could go away.
190c31315c3SDave Marchevsky
191c31315c3SDave MarchevskyTo prevent this logic violation all non-owning references are invalidated by the
192c31315c3SDave Marchevskyverifier after a critical section ends. This is necessary to ensure the "will
193c31315c3SDave Marchevskynot page fault" property of non-owning references. So if the verifier hasn't
194c31315c3SDave Marchevskyinvalidated a non-owning ref, accessing it will not page fault.
195c31315c3SDave Marchevsky
196c31315c3SDave MarchevskyCurrently ``bpf_obj_drop`` is not allowed in the critical section, so
197c31315c3SDave Marchevskyif there's a valid non-owning ref, we must be in a critical section, and can
198c31315c3SDave Marchevskyconclude that the ref's memory hasn't been dropped-and- ``free``'d or
199c31315c3SDave Marchevskydropped-and-reused.
200c31315c3SDave Marchevsky
201c31315c3SDave MarchevskyAny reference to a node that is in an rbtree _must_ be non-owning, since
202c31315c3SDave Marchevskythe tree has control of the pointee's lifetime. Similarly, any ref to a node
203c31315c3SDave Marchevskythat isn't in rbtree _must_ be owning. This results in a nice property:
204c31315c3SDave Marchevskygraph API add / remove implementations don't need to check if a node
205c31315c3SDave Marchevskyhas already been added (or already removed), as the ownership model
206c31315c3SDave Marchevskyallows the verifier to prevent such a state from being valid by simply checking
207c31315c3SDave Marchevskytypes.
208c31315c3SDave Marchevsky
209c31315c3SDave MarchevskyHowever, pointer aliasing poses an issue for the above "nice property".
210c31315c3SDave MarchevskyConsider the following example:
211c31315c3SDave Marchevsky
212c31315c3SDave Marchevsky.. code-block:: c
213c31315c3SDave Marchevsky
214c31315c3SDave Marchevsky        struct node_data *n, *m, *o, *p;
215c31315c3SDave Marchevsky        n = bpf_obj_new(typeof(*n));     /* 1 */
216c31315c3SDave Marchevsky
217c31315c3SDave Marchevsky        bpf_spin_lock(&lock);
218c31315c3SDave Marchevsky
219c31315c3SDave Marchevsky        bpf_rbtree_add(&tree, n);        /* 2 */
220c31315c3SDave Marchevsky        m = bpf_rbtree_first(&tree);     /* 3 */
221c31315c3SDave Marchevsky
222c31315c3SDave Marchevsky        o = bpf_rbtree_remove(&tree, n); /* 4 */
223c31315c3SDave Marchevsky        p = bpf_rbtree_remove(&tree, m); /* 5 */
224c31315c3SDave Marchevsky
225c31315c3SDave Marchevsky        bpf_spin_unlock(&lock);
226c31315c3SDave Marchevsky
227c31315c3SDave Marchevsky        bpf_obj_drop(o);
228c31315c3SDave Marchevsky        bpf_obj_drop(p); /* 6 */
229c31315c3SDave Marchevsky
230c31315c3SDave MarchevskyAssume the tree is empty before this program runs. If we track verifier state
231c31315c3SDave Marchevskychanges here using numbers in above comments:
232c31315c3SDave Marchevsky
233c31315c3SDave Marchevsky  1) n is an owning reference
234c31315c3SDave Marchevsky
235c31315c3SDave Marchevsky  2) n is a non-owning reference, it's been added to the tree
236c31315c3SDave Marchevsky
237c31315c3SDave Marchevsky  3) n and m are non-owning references, they both point to the same node
238c31315c3SDave Marchevsky
239c31315c3SDave Marchevsky  4) o is an owning reference, n and m non-owning, all point to same node
240c31315c3SDave Marchevsky
241c31315c3SDave Marchevsky  5) o and p are owning, n and m non-owning, all point to the same node
242c31315c3SDave Marchevsky
243c31315c3SDave Marchevsky  6) a double-free has occurred, since o and p point to same node and o was
244c31315c3SDave Marchevsky     ``free``'d in previous statement
245c31315c3SDave Marchevsky
246c31315c3SDave MarchevskyStates 4 and 5 violate our "nice property", as there are non-owning refs to
247c31315c3SDave Marchevskya node which is not in an rbtree. Statement 5 will try to remove a node which
248c31315c3SDave Marchevskyhas already been removed as a result of this violation. State 6 is a dangerous
249c31315c3SDave Marchevskydouble-free.
250c31315c3SDave Marchevsky
251c31315c3SDave MarchevskyAt a minimum we should prevent state 6 from being possible. If we can't also
252c31315c3SDave Marchevskyprevent state 5 then we must abandon our "nice property" and check whether a
253c31315c3SDave Marchevskynode has already been removed at runtime.
254c31315c3SDave Marchevsky
255c31315c3SDave MarchevskyWe prevent both by generalizing the "invalidate non-owning references" behavior
256c31315c3SDave Marchevskyof ``bpf_spin_unlock`` and doing similar invalidation after
257c31315c3SDave Marchevsky``bpf_rbtree_remove``. The logic here being that any graph API kfunc which:
258c31315c3SDave Marchevsky
259c31315c3SDave Marchevsky  * takes an arbitrary node argument
260c31315c3SDave Marchevsky
261c31315c3SDave Marchevsky  * removes it from the data structure
262c31315c3SDave Marchevsky
263c31315c3SDave Marchevsky  * returns an owning reference to the removed node
264c31315c3SDave Marchevsky
265c31315c3SDave MarchevskyMay result in a state where some other non-owning reference points to the same
266c31315c3SDave Marchevskynode. So ``remove``-type kfuncs must be considered a non-owning reference
267c31315c3SDave Marchevskyinvalidation point as well.
268