1========================= 2BPF Graph Data Structures 3========================= 4 5This document describes implementation details of new-style "graph" data 6structures (linked_list, rbtree), with particular focus on the verifier's 7implementation of semantics specific to those data structures. 8 9Although no specific verifier code is referred to in this document, the document 10assumes that the reader has general knowledge of BPF verifier internals, BPF 11maps, and BPF program writing. 12 13Note that the intent of this document is to describe the current state of 14these graph data structures. **No guarantees** of stability for either 15semantics or APIs are made or implied here. 16 17.. contents:: 18 :local: 19 :depth: 2 20 21Introduction 22------------ 23 24The BPF map API has historically been the main way to expose data structures 25of various types for use within BPF programs. Some data structures fit naturally 26with the map API (HASH, ARRAY), others less so. Consequentially, programs 27interacting with the latter group of data structures can be hard to parse 28for kernel programmers without previous BPF experience. 29 30Luckily, some restrictions which necessitated the use of BPF map semantics are 31no longer relevant. With the introduction of kfuncs, kptrs, and the any-context 32BPF allocator, it is now possible to implement BPF data structures whose API 33and semantics more closely match those exposed to the rest of the kernel. 34 35Two such data structures - linked_list and rbtree - have many verification 36details in common. Because both have "root"s ("head" for linked_list) and 37"node"s, the verifier code and this document refer to common functionality 38as "graph_api", "graph_root", "graph_node", etc. 39 40Unless otherwise stated, examples and semantics below apply to both graph data 41structures. 42 43Unstable API 44------------ 45 46Data structures implemented using the BPF map API have historically used BPF 47helper functions - either standard map API helpers like ``bpf_map_update_elem`` 48or map-specific helpers. The new-style graph data structures instead use kfuncs 49to define their manipulation helpers. Because there are no stability guarantees 50for kfuncs, the API and semantics for these data structures can be evolved in 51a way that breaks backwards compatibility if necessary. 52 53Root and node types for the new data structures are opaquely defined in the 54``uapi/linux/bpf.h`` header. 55 56Locking 57------- 58 59The new-style data structures are intrusive and are defined similarly to their 60vanilla kernel counterparts: 61 62.. code-block:: c 63 64 struct node_data { 65 long key; 66 long data; 67 struct bpf_rb_node node; 68 }; 69 70 struct bpf_spin_lock glock; 71 struct bpf_rb_root groot __contains(node_data, node); 72 73The "root" type for both linked_list and rbtree expects to be in a map_value 74which also contains a ``bpf_spin_lock`` - in the above example both global 75variables are placed in a single-value arraymap. The verifier considers this 76spin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in 77the same map_value and will enforce that the correct lock is held when 78verifying BPF programs that manipulate the tree. Since this lock checking 79happens at verification time, there is no runtime penalty. 80 81Non-owning references 82--------------------- 83 84**Motivation** 85 86Consider the following BPF code: 87 88.. code-block:: c 89 90 struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */ 91 92 bpf_spin_lock(&lock); 93 94 bpf_rbtree_add(&tree, n); /* PASSED */ 95 96 bpf_spin_unlock(&lock); 97 98From the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new`` 99has type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of 100``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the 101program has ownership of the pointee's (object pointed to by ``n``) lifetime. 102The BPF program must pass off ownership before exiting - either via 103``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with 104``bpf_rbtree_add``. 105 106(``ACQUIRED`` and ``PASSED`` comments in the example denote statements where 107"ownership is acquired" and "ownership is passed", respectively) 108 109What should the verifier do with ``n`` after ownership is passed off? If the 110object was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier 111should reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as 112the object is no longer valid. The underlying memory may have been reused for 113some other allocation, unmapped, etc. 114 115When ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less 116obvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``, 117but that would result in programs with useful, common coding patterns being 118rejected, e.g.: 119 120.. code-block:: c 121 122 int x; 123 struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */ 124 125 bpf_spin_lock(&lock); 126 127 bpf_rbtree_add(&tree, n); /* PASSED */ 128 x = n->data; 129 n->data = 42; 130 131 bpf_spin_unlock(&lock); 132 133Both the read from and write to ``n->data`` would be rejected. The verifier 134can do better, though, by taking advantage of two details: 135 136 * Graph data structure APIs can only be used when the ``bpf_spin_lock`` 137 associated with the graph root is held 138 139 * Both graph data structures have pointer stability 140 141 * Because graph nodes are allocated with ``bpf_obj_new`` and 142 adding / removing from the root involves fiddling with the 143 ``bpf_{list,rb}_node`` field of the node struct, a graph node will 144 remain at the same address after either operation. 145 146Because the associated ``bpf_spin_lock`` must be held by any program adding 147or removing, if we're in the critical section bounded by that lock, we know 148that no other program can add or remove until the end of the critical section. 149This combined with pointer stability means that, until the critical section 150ends, we can safely access the graph node through ``n`` even after it was used 151to pass ownership. 152 153The verifier considers such a reference a *non-owning reference*. The ref 154returned by ``bpf_obj_new`` is accordingly considered an *owning reference*. 155Both terms currently only have meaning in the context of graph nodes and API. 156 157**Details** 158 159Let's enumerate the properties of both types of references. 160 161*owning reference* 162 163 * This reference controls the lifetime of the pointee 164 165 * Ownership of pointee must be 'released' by passing it to some graph API 166 kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee 167 168 * If not released before program ends, verifier considers program invalid 169 170 * Access to the pointee's memory will not page fault 171 172*non-owning reference* 173 174 * This reference does not own the pointee 175 176 * It cannot be used to add the graph node to a graph root, nor ``free``'d via 177 ``bpf_obj_drop`` 178 179 * No explicit control of lifetime, but can infer valid lifetime based on 180 non-owning ref existence (see explanation below) 181 182 * Access to the pointee's memory will not page fault 183 184From verifier's perspective non-owning references can only exist 185between spin_lock and spin_unlock. Why? After spin_unlock another program 186can do arbitrary operations on the data structure like removing and ``free``-ing 187via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd, 188``free``'d, and reused via bpf_obj_new would point to an entirely different thing. 189Or the memory could go away. 190 191To prevent this logic violation all non-owning references are invalidated by the 192verifier after a critical section ends. This is necessary to ensure the "will 193not page fault" property of non-owning references. So if the verifier hasn't 194invalidated a non-owning ref, accessing it will not page fault. 195 196Currently ``bpf_obj_drop`` is not allowed in the critical section, so 197if there's a valid non-owning ref, we must be in a critical section, and can 198conclude that the ref's memory hasn't been dropped-and- ``free``'d or 199dropped-and-reused. 200 201Any reference to a node that is in an rbtree _must_ be non-owning, since 202the tree has control of the pointee's lifetime. Similarly, any ref to a node 203that isn't in rbtree _must_ be owning. This results in a nice property: 204graph API add / remove implementations don't need to check if a node 205has already been added (or already removed), as the ownership model 206allows the verifier to prevent such a state from being valid by simply checking 207types. 208 209However, pointer aliasing poses an issue for the above "nice property". 210Consider the following example: 211 212.. code-block:: c 213 214 struct node_data *n, *m, *o, *p; 215 n = bpf_obj_new(typeof(*n)); /* 1 */ 216 217 bpf_spin_lock(&lock); 218 219 bpf_rbtree_add(&tree, n); /* 2 */ 220 m = bpf_rbtree_first(&tree); /* 3 */ 221 222 o = bpf_rbtree_remove(&tree, n); /* 4 */ 223 p = bpf_rbtree_remove(&tree, m); /* 5 */ 224 225 bpf_spin_unlock(&lock); 226 227 bpf_obj_drop(o); 228 bpf_obj_drop(p); /* 6 */ 229 230Assume the tree is empty before this program runs. If we track verifier state 231changes here using numbers in above comments: 232 233 1) n is an owning reference 234 235 2) n is a non-owning reference, it's been added to the tree 236 237 3) n and m are non-owning references, they both point to the same node 238 239 4) o is an owning reference, n and m non-owning, all point to same node 240 241 5) o and p are owning, n and m non-owning, all point to the same node 242 243 6) a double-free has occurred, since o and p point to same node and o was 244 ``free``'d in previous statement 245 246States 4 and 5 violate our "nice property", as there are non-owning refs to 247a node which is not in an rbtree. Statement 5 will try to remove a node which 248has already been removed as a result of this violation. State 6 is a dangerous 249double-free. 250 251At a minimum we should prevent state 6 from being possible. If we can't also 252prevent state 5 then we must abandon our "nice property" and check whether a 253node has already been removed at runtime. 254 255We prevent both by generalizing the "invalidate non-owning references" behavior 256of ``bpf_spin_unlock`` and doing similar invalidation after 257``bpf_rbtree_remove``. The logic here being that any graph API kfunc which: 258 259 * takes an arbitrary node argument 260 261 * removes it from the data structure 262 263 * returns an owning reference to the removed node 264 265May result in a state where some other non-owning reference points to the same 266node. So ``remove``-type kfuncs must be considered a non-owning reference 267invalidation point as well. 268