1c31315c3SDave Marchevsky========================= 2c31315c3SDave MarchevskyBPF Graph Data Structures 3c31315c3SDave Marchevsky========================= 4c31315c3SDave Marchevsky 5c31315c3SDave MarchevskyThis document describes implementation details of new-style "graph" data 6c31315c3SDave Marchevskystructures (linked_list, rbtree), with particular focus on the verifier's 7c31315c3SDave Marchevskyimplementation of semantics specific to those data structures. 8c31315c3SDave Marchevsky 9c31315c3SDave MarchevskyAlthough no specific verifier code is referred to in this document, the document 10c31315c3SDave Marchevskyassumes that the reader has general knowledge of BPF verifier internals, BPF 11c31315c3SDave Marchevskymaps, and BPF program writing. 12c31315c3SDave Marchevsky 13c31315c3SDave MarchevskyNote that the intent of this document is to describe the current state of 14c31315c3SDave Marchevskythese graph data structures. **No guarantees** of stability for either 15c31315c3SDave Marchevskysemantics or APIs are made or implied here. 16c31315c3SDave Marchevsky 17c31315c3SDave Marchevsky.. contents:: 18c31315c3SDave Marchevsky :local: 19c31315c3SDave Marchevsky :depth: 2 20c31315c3SDave Marchevsky 21c31315c3SDave MarchevskyIntroduction 22c31315c3SDave Marchevsky------------ 23c31315c3SDave Marchevsky 24c31315c3SDave MarchevskyThe BPF map API has historically been the main way to expose data structures 25c31315c3SDave Marchevskyof various types for use within BPF programs. Some data structures fit naturally 26*d56b699dSBjorn Helgaaswith the map API (HASH, ARRAY), others less so. Consequently, programs 27c31315c3SDave Marchevskyinteracting with the latter group of data structures can be hard to parse 28c31315c3SDave Marchevskyfor kernel programmers without previous BPF experience. 29c31315c3SDave Marchevsky 30c31315c3SDave MarchevskyLuckily, some restrictions which necessitated the use of BPF map semantics are 31c31315c3SDave Marchevskyno longer relevant. With the introduction of kfuncs, kptrs, and the any-context 32c31315c3SDave MarchevskyBPF allocator, it is now possible to implement BPF data structures whose API 33c31315c3SDave Marchevskyand semantics more closely match those exposed to the rest of the kernel. 34c31315c3SDave Marchevsky 35c31315c3SDave MarchevskyTwo such data structures - linked_list and rbtree - have many verification 36c31315c3SDave Marchevskydetails in common. Because both have "root"s ("head" for linked_list) and 37c31315c3SDave Marchevsky"node"s, the verifier code and this document refer to common functionality 38c31315c3SDave Marchevskyas "graph_api", "graph_root", "graph_node", etc. 39c31315c3SDave Marchevsky 40c31315c3SDave MarchevskyUnless otherwise stated, examples and semantics below apply to both graph data 41c31315c3SDave Marchevskystructures. 42c31315c3SDave Marchevsky 43c31315c3SDave MarchevskyUnstable API 44c31315c3SDave Marchevsky------------ 45c31315c3SDave Marchevsky 46c31315c3SDave MarchevskyData structures implemented using the BPF map API have historically used BPF 47c31315c3SDave Marchevskyhelper functions - either standard map API helpers like ``bpf_map_update_elem`` 48c31315c3SDave Marchevskyor map-specific helpers. The new-style graph data structures instead use kfuncs 49c31315c3SDave Marchevskyto define their manipulation helpers. Because there are no stability guarantees 50c31315c3SDave Marchevskyfor kfuncs, the API and semantics for these data structures can be evolved in 51c31315c3SDave Marchevskya way that breaks backwards compatibility if necessary. 52c31315c3SDave Marchevsky 53c31315c3SDave MarchevskyRoot and node types for the new data structures are opaquely defined in the 54c31315c3SDave Marchevsky``uapi/linux/bpf.h`` header. 55c31315c3SDave Marchevsky 56c31315c3SDave MarchevskyLocking 57c31315c3SDave Marchevsky------- 58c31315c3SDave Marchevsky 59c31315c3SDave MarchevskyThe new-style data structures are intrusive and are defined similarly to their 60c31315c3SDave Marchevskyvanilla kernel counterparts: 61c31315c3SDave Marchevsky 62c31315c3SDave Marchevsky.. code-block:: c 63e2d323a1SBagas Sanjaya 64c31315c3SDave Marchevsky struct node_data { 65c31315c3SDave Marchevsky long key; 66c31315c3SDave Marchevsky long data; 67c31315c3SDave Marchevsky struct bpf_rb_node node; 68c31315c3SDave Marchevsky }; 69c31315c3SDave Marchevsky 70c31315c3SDave Marchevsky struct bpf_spin_lock glock; 71c31315c3SDave Marchevsky struct bpf_rb_root groot __contains(node_data, node); 72c31315c3SDave Marchevsky 73c31315c3SDave MarchevskyThe "root" type for both linked_list and rbtree expects to be in a map_value 74c31315c3SDave Marchevskywhich also contains a ``bpf_spin_lock`` - in the above example both global 75c31315c3SDave Marchevskyvariables are placed in a single-value arraymap. The verifier considers this 76c31315c3SDave Marchevskyspin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in 77c31315c3SDave Marchevskythe same map_value and will enforce that the correct lock is held when 78c31315c3SDave Marchevskyverifying BPF programs that manipulate the tree. Since this lock checking 79c31315c3SDave Marchevskyhappens at verification time, there is no runtime penalty. 80c31315c3SDave Marchevsky 81c31315c3SDave MarchevskyNon-owning references 82c31315c3SDave Marchevsky--------------------- 83c31315c3SDave Marchevsky 84c31315c3SDave Marchevsky**Motivation** 85c31315c3SDave Marchevsky 86c31315c3SDave MarchevskyConsider the following BPF code: 87c31315c3SDave Marchevsky 88c31315c3SDave Marchevsky.. code-block:: c 89c31315c3SDave Marchevsky 90c31315c3SDave Marchevsky struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */ 91c31315c3SDave Marchevsky 92c31315c3SDave Marchevsky bpf_spin_lock(&lock); 93c31315c3SDave Marchevsky 94c31315c3SDave Marchevsky bpf_rbtree_add(&tree, n); /* PASSED */ 95c31315c3SDave Marchevsky 96c31315c3SDave Marchevsky bpf_spin_unlock(&lock); 97c31315c3SDave Marchevsky 98c31315c3SDave MarchevskyFrom the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new`` 99c31315c3SDave Marchevskyhas type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of 100c31315c3SDave Marchevsky``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the 101c31315c3SDave Marchevskyprogram has ownership of the pointee's (object pointed to by ``n``) lifetime. 102c31315c3SDave MarchevskyThe BPF program must pass off ownership before exiting - either via 103c31315c3SDave Marchevsky``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with 104c31315c3SDave Marchevsky``bpf_rbtree_add``. 105c31315c3SDave Marchevsky 106c31315c3SDave Marchevsky(``ACQUIRED`` and ``PASSED`` comments in the example denote statements where 107c31315c3SDave Marchevsky"ownership is acquired" and "ownership is passed", respectively) 108c31315c3SDave Marchevsky 109c31315c3SDave MarchevskyWhat should the verifier do with ``n`` after ownership is passed off? If the 110c31315c3SDave Marchevskyobject was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier 111c31315c3SDave Marchevskyshould reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as 112c31315c3SDave Marchevskythe object is no longer valid. The underlying memory may have been reused for 113c31315c3SDave Marchevskysome other allocation, unmapped, etc. 114c31315c3SDave Marchevsky 115c31315c3SDave MarchevskyWhen ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less 116c31315c3SDave Marchevskyobvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``, 117c31315c3SDave Marchevskybut that would result in programs with useful, common coding patterns being 118c31315c3SDave Marchevskyrejected, e.g.: 119c31315c3SDave Marchevsky 120c31315c3SDave Marchevsky.. code-block:: c 121c31315c3SDave Marchevsky 122c31315c3SDave Marchevsky int x; 123c31315c3SDave Marchevsky struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */ 124c31315c3SDave Marchevsky 125c31315c3SDave Marchevsky bpf_spin_lock(&lock); 126c31315c3SDave Marchevsky 127c31315c3SDave Marchevsky bpf_rbtree_add(&tree, n); /* PASSED */ 128c31315c3SDave Marchevsky x = n->data; 129c31315c3SDave Marchevsky n->data = 42; 130c31315c3SDave Marchevsky 131c31315c3SDave Marchevsky bpf_spin_unlock(&lock); 132c31315c3SDave Marchevsky 133c31315c3SDave MarchevskyBoth the read from and write to ``n->data`` would be rejected. The verifier 134c31315c3SDave Marchevskycan do better, though, by taking advantage of two details: 135c31315c3SDave Marchevsky 136c31315c3SDave Marchevsky * Graph data structure APIs can only be used when the ``bpf_spin_lock`` 137c31315c3SDave Marchevsky associated with the graph root is held 138c31315c3SDave Marchevsky 139c31315c3SDave Marchevsky * Both graph data structures have pointer stability 140c31315c3SDave Marchevsky 141c31315c3SDave Marchevsky * Because graph nodes are allocated with ``bpf_obj_new`` and 142c31315c3SDave Marchevsky adding / removing from the root involves fiddling with the 143c31315c3SDave Marchevsky ``bpf_{list,rb}_node`` field of the node struct, a graph node will 144c31315c3SDave Marchevsky remain at the same address after either operation. 145c31315c3SDave Marchevsky 146c31315c3SDave MarchevskyBecause the associated ``bpf_spin_lock`` must be held by any program adding 147c31315c3SDave Marchevskyor removing, if we're in the critical section bounded by that lock, we know 148c31315c3SDave Marchevskythat no other program can add or remove until the end of the critical section. 149c31315c3SDave MarchevskyThis combined with pointer stability means that, until the critical section 150c31315c3SDave Marchevskyends, we can safely access the graph node through ``n`` even after it was used 151c31315c3SDave Marchevskyto pass ownership. 152c31315c3SDave Marchevsky 153c31315c3SDave MarchevskyThe verifier considers such a reference a *non-owning reference*. The ref 154c31315c3SDave Marchevskyreturned by ``bpf_obj_new`` is accordingly considered an *owning reference*. 155c31315c3SDave MarchevskyBoth terms currently only have meaning in the context of graph nodes and API. 156c31315c3SDave Marchevsky 157c31315c3SDave Marchevsky**Details** 158c31315c3SDave Marchevsky 159c31315c3SDave MarchevskyLet's enumerate the properties of both types of references. 160c31315c3SDave Marchevsky 161c31315c3SDave Marchevsky*owning reference* 162c31315c3SDave Marchevsky 163c31315c3SDave Marchevsky * This reference controls the lifetime of the pointee 164c31315c3SDave Marchevsky 165c31315c3SDave Marchevsky * Ownership of pointee must be 'released' by passing it to some graph API 166c31315c3SDave Marchevsky kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee 167c31315c3SDave Marchevsky 168c31315c3SDave Marchevsky * If not released before program ends, verifier considers program invalid 169c31315c3SDave Marchevsky 170c31315c3SDave Marchevsky * Access to the pointee's memory will not page fault 171c31315c3SDave Marchevsky 172c31315c3SDave Marchevsky*non-owning reference* 173c31315c3SDave Marchevsky 174c31315c3SDave Marchevsky * This reference does not own the pointee 175c31315c3SDave Marchevsky 176c31315c3SDave Marchevsky * It cannot be used to add the graph node to a graph root, nor ``free``'d via 177c31315c3SDave Marchevsky ``bpf_obj_drop`` 178c31315c3SDave Marchevsky 179c31315c3SDave Marchevsky * No explicit control of lifetime, but can infer valid lifetime based on 180c31315c3SDave Marchevsky non-owning ref existence (see explanation below) 181c31315c3SDave Marchevsky 182c31315c3SDave Marchevsky * Access to the pointee's memory will not page fault 183c31315c3SDave Marchevsky 184c31315c3SDave MarchevskyFrom verifier's perspective non-owning references can only exist 185c31315c3SDave Marchevskybetween spin_lock and spin_unlock. Why? After spin_unlock another program 186c31315c3SDave Marchevskycan do arbitrary operations on the data structure like removing and ``free``-ing 187c31315c3SDave Marchevskyvia bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd, 188c31315c3SDave Marchevsky``free``'d, and reused via bpf_obj_new would point to an entirely different thing. 189c31315c3SDave MarchevskyOr the memory could go away. 190c31315c3SDave Marchevsky 191c31315c3SDave MarchevskyTo prevent this logic violation all non-owning references are invalidated by the 192c31315c3SDave Marchevskyverifier after a critical section ends. This is necessary to ensure the "will 193c31315c3SDave Marchevskynot page fault" property of non-owning references. So if the verifier hasn't 194c31315c3SDave Marchevskyinvalidated a non-owning ref, accessing it will not page fault. 195c31315c3SDave Marchevsky 196c31315c3SDave MarchevskyCurrently ``bpf_obj_drop`` is not allowed in the critical section, so 197c31315c3SDave Marchevskyif there's a valid non-owning ref, we must be in a critical section, and can 198c31315c3SDave Marchevskyconclude that the ref's memory hasn't been dropped-and- ``free``'d or 199c31315c3SDave Marchevskydropped-and-reused. 200c31315c3SDave Marchevsky 201c31315c3SDave MarchevskyAny reference to a node that is in an rbtree _must_ be non-owning, since 202c31315c3SDave Marchevskythe tree has control of the pointee's lifetime. Similarly, any ref to a node 203c31315c3SDave Marchevskythat isn't in rbtree _must_ be owning. This results in a nice property: 204c31315c3SDave Marchevskygraph API add / remove implementations don't need to check if a node 205c31315c3SDave Marchevskyhas already been added (or already removed), as the ownership model 206c31315c3SDave Marchevskyallows the verifier to prevent such a state from being valid by simply checking 207c31315c3SDave Marchevskytypes. 208c31315c3SDave Marchevsky 209c31315c3SDave MarchevskyHowever, pointer aliasing poses an issue for the above "nice property". 210c31315c3SDave MarchevskyConsider the following example: 211c31315c3SDave Marchevsky 212c31315c3SDave Marchevsky.. code-block:: c 213c31315c3SDave Marchevsky 214c31315c3SDave Marchevsky struct node_data *n, *m, *o, *p; 215c31315c3SDave Marchevsky n = bpf_obj_new(typeof(*n)); /* 1 */ 216c31315c3SDave Marchevsky 217c31315c3SDave Marchevsky bpf_spin_lock(&lock); 218c31315c3SDave Marchevsky 219c31315c3SDave Marchevsky bpf_rbtree_add(&tree, n); /* 2 */ 220c31315c3SDave Marchevsky m = bpf_rbtree_first(&tree); /* 3 */ 221c31315c3SDave Marchevsky 222c31315c3SDave Marchevsky o = bpf_rbtree_remove(&tree, n); /* 4 */ 223c31315c3SDave Marchevsky p = bpf_rbtree_remove(&tree, m); /* 5 */ 224c31315c3SDave Marchevsky 225c31315c3SDave Marchevsky bpf_spin_unlock(&lock); 226c31315c3SDave Marchevsky 227c31315c3SDave Marchevsky bpf_obj_drop(o); 228c31315c3SDave Marchevsky bpf_obj_drop(p); /* 6 */ 229c31315c3SDave Marchevsky 230c31315c3SDave MarchevskyAssume the tree is empty before this program runs. If we track verifier state 231c31315c3SDave Marchevskychanges here using numbers in above comments: 232c31315c3SDave Marchevsky 233c31315c3SDave Marchevsky 1) n is an owning reference 234c31315c3SDave Marchevsky 235c31315c3SDave Marchevsky 2) n is a non-owning reference, it's been added to the tree 236c31315c3SDave Marchevsky 237c31315c3SDave Marchevsky 3) n and m are non-owning references, they both point to the same node 238c31315c3SDave Marchevsky 239c31315c3SDave Marchevsky 4) o is an owning reference, n and m non-owning, all point to same node 240c31315c3SDave Marchevsky 241c31315c3SDave Marchevsky 5) o and p are owning, n and m non-owning, all point to the same node 242c31315c3SDave Marchevsky 243c31315c3SDave Marchevsky 6) a double-free has occurred, since o and p point to same node and o was 244c31315c3SDave Marchevsky ``free``'d in previous statement 245c31315c3SDave Marchevsky 246c31315c3SDave MarchevskyStates 4 and 5 violate our "nice property", as there are non-owning refs to 247c31315c3SDave Marchevskya node which is not in an rbtree. Statement 5 will try to remove a node which 248c31315c3SDave Marchevskyhas already been removed as a result of this violation. State 6 is a dangerous 249c31315c3SDave Marchevskydouble-free. 250c31315c3SDave Marchevsky 251c31315c3SDave MarchevskyAt a minimum we should prevent state 6 from being possible. If we can't also 252c31315c3SDave Marchevskyprevent state 5 then we must abandon our "nice property" and check whether a 253c31315c3SDave Marchevskynode has already been removed at runtime. 254c31315c3SDave Marchevsky 255c31315c3SDave MarchevskyWe prevent both by generalizing the "invalidate non-owning references" behavior 256c31315c3SDave Marchevskyof ``bpf_spin_unlock`` and doing similar invalidation after 257c31315c3SDave Marchevsky``bpf_rbtree_remove``. The logic here being that any graph API kfunc which: 258c31315c3SDave Marchevsky 259c31315c3SDave Marchevsky * takes an arbitrary node argument 260c31315c3SDave Marchevsky 261c31315c3SDave Marchevsky * removes it from the data structure 262c31315c3SDave Marchevsky 263c31315c3SDave Marchevsky * returns an owning reference to the removed node 264c31315c3SDave Marchevsky 265c31315c3SDave MarchevskyMay result in a state where some other non-owning reference points to the same 266c31315c3SDave Marchevskynode. So ``remove``-type kfuncs must be considered a non-owning reference 267c31315c3SDave Marchevskyinvalidation point as well. 268