1======================================== 2Generic Associative Array Implementation 3======================================== 4 5Overview 6======== 7 8This associative array implementation is an object container with the following 9properties: 10 111. Objects are opaque pointers. The implementation does not care where they 12 point (if anywhere) or what they point to (if anything). 13.. note:: Pointers to objects _must_ be zero in the least significant bit. 14 152. Objects do not need to contain linkage blocks for use by the array. This 16 permits an object to be located in multiple arrays simultaneously. 17 Rather, the array is made up of metadata blocks that point to objects. 18 193. Objects require index keys to locate them within the array. 20 214. Index keys must be unique. Inserting an object with the same key as one 22 already in the array will replace the old object. 23 245. Index keys can be of any length and can be of different lengths. 25 266. Index keys should encode the length early on, before any variation due to 27 length is seen. 28 297. Index keys can include a hash to scatter objects throughout the array. 30 318. The array can iterated over. The objects will not necessarily come out in 32 key order. 33 349. The array can be iterated over whilst it is being modified, provided the 35 RCU readlock is being held by the iterator. Note, however, under these 36 circumstances, some objects may be seen more than once. If this is a 37 problem, the iterator should lock against modification. Objects will not 38 be missed, however, unless deleted. 39 4010. Objects in the array can be looked up by means of their index key. 41 4211. Objects can be looked up whilst the array is being modified, provided the 43 RCU readlock is being held by the thread doing the look up. 44 45The implementation uses a tree of 16-pointer nodes internally that are indexed 46on each level by nibbles from the index key in the same manner as in a radix 47tree. To improve memory efficiency, shortcuts can be emplaced to skip over 48what would otherwise be a series of single-occupancy nodes. Further, nodes 49pack leaf object pointers into spare space in the node rather than making an 50extra branch until as such time an object needs to be added to a full node. 51 52 53The Public API 54============== 55 56The public API can be found in ``<linux/assoc_array.h>``. The associative 57array is rooted on the following structure:: 58 59 struct assoc_array { 60 ... 61 }; 62 63The code is selected by enabling ``CONFIG_ASSOCIATIVE_ARRAY`` with:: 64 65 ./script/config -e ASSOCIATIVE_ARRAY 66 67 68Edit Script 69----------- 70 71The insertion and deletion functions produce an 'edit script' that can later be 72applied to effect the changes without risking ``ENOMEM``. This retains the 73preallocated metadata blocks that will be installed in the internal tree and 74keeps track of the metadata blocks that will be removed from the tree when the 75script is applied. 76 77This is also used to keep track of dead blocks and dead objects after the 78script has been applied so that they can be freed later. The freeing is done 79after an RCU grace period has passed - thus allowing access functions to 80proceed under the RCU read lock. 81 82The script appears as outside of the API as a pointer of the type:: 83 84 struct assoc_array_edit; 85 86There are two functions for dealing with the script: 87 881. Apply an edit script:: 89 90 void assoc_array_apply_edit(struct assoc_array_edit *edit); 91 92This will perform the edit functions, interpolating various write barriers 93to permit accesses under the RCU read lock to continue. The edit script 94will then be passed to ``call_rcu()`` to free it and any dead stuff it points 95to. 96 972. Cancel an edit script:: 98 99 void assoc_array_cancel_edit(struct assoc_array_edit *edit); 100 101This frees the edit script and all preallocated memory immediately. If 102this was for insertion, the new object is _not_ released by this function, 103but must rather be released by the caller. 104 105These functions are guaranteed not to fail. 106 107 108Operations Table 109---------------- 110 111Various functions take a table of operations:: 112 113 struct assoc_array_ops { 114 ... 115 }; 116 117This points to a number of methods, all of which need to be provided: 118 1191. Get a chunk of index key from caller data:: 120 121 unsigned long (*get_key_chunk)(const void *index_key, int level); 122 123This should return a chunk of caller-supplied index key starting at the 124*bit* position given by the level argument. The level argument will be a 125multiple of ``ASSOC_ARRAY_KEY_CHUNK_SIZE`` and the function should return 126``ASSOC_ARRAY_KEY_CHUNK_SIZE bits``. No error is possible. 127 128 1292. Get a chunk of an object's index key:: 130 131 unsigned long (*get_object_key_chunk)(const void *object, int level); 132 133As the previous function, but gets its data from an object in the array 134rather than from a caller-supplied index key. 135 136 1373. See if this is the object we're looking for:: 138 139 bool (*compare_object)(const void *object, const void *index_key); 140 141Compare the object against an index key and return ``true`` if it matches and 142``false`` if it doesn't. 143 144 1454. Diff the index keys of two objects:: 146 147 int (*diff_objects)(const void *object, const void *index_key); 148 149Return the bit position at which the index key of the specified object 150differs from the given index key or -1 if they are the same. 151 152 1535. Free an object:: 154 155 void (*free_object)(void *object); 156 157Free the specified object. Note that this may be called an RCU grace period 158after ``assoc_array_apply_edit()`` was called, so ``synchronize_rcu()`` may be 159necessary on module unloading. 160 161 162Manipulation Functions 163---------------------- 164 165There are a number of functions for manipulating an associative array: 166 1671. Initialise an associative array:: 168 169 void assoc_array_init(struct assoc_array *array); 170 171This initialises the base structure for an associative array. It can't fail. 172 173 1742. Insert/replace an object in an associative array:: 175 176 struct assoc_array_edit * 177 assoc_array_insert(struct assoc_array *array, 178 const struct assoc_array_ops *ops, 179 const void *index_key, 180 void *object); 181 182This inserts the given object into the array. Note that the least 183significant bit of the pointer must be zero as it's used to type-mark 184pointers internally. 185 186If an object already exists for that key then it will be replaced with the 187new object and the old one will be freed automatically. 188 189The ``index_key`` argument should hold index key information and is 190passed to the methods in the ops table when they are called. 191 192This function makes no alteration to the array itself, but rather returns 193an edit script that must be applied. ``-ENOMEM`` is returned in the case of 194an out-of-memory error. 195 196The caller should lock exclusively against other modifiers of the array. 197 198 1993. Delete an object from an associative array:: 200 201 struct assoc_array_edit * 202 assoc_array_delete(struct assoc_array *array, 203 const struct assoc_array_ops *ops, 204 const void *index_key); 205 206This deletes an object that matches the specified data from the array. 207 208The ``index_key`` argument should hold index key information and is 209passed to the methods in the ops table when they are called. 210 211This function makes no alteration to the array itself, but rather returns 212an edit script that must be applied. ``-ENOMEM`` is returned in the case of 213an out-of-memory error. ``NULL`` will be returned if the specified object is 214not found within the array. 215 216The caller should lock exclusively against other modifiers of the array. 217 218 2194. Delete all objects from an associative array:: 220 221 struct assoc_array_edit * 222 assoc_array_clear(struct assoc_array *array, 223 const struct assoc_array_ops *ops); 224 225This deletes all the objects from an associative array and leaves it 226completely empty. 227 228This function makes no alteration to the array itself, but rather returns 229an edit script that must be applied. ``-ENOMEM`` is returned in the case of 230an out-of-memory error. 231 232The caller should lock exclusively against other modifiers of the array. 233 234 2355. Destroy an associative array, deleting all objects:: 236 237 void assoc_array_destroy(struct assoc_array *array, 238 const struct assoc_array_ops *ops); 239 240This destroys the contents of the associative array and leaves it 241completely empty. It is not permitted for another thread to be traversing 242the array under the RCU read lock at the same time as this function is 243destroying it as no RCU deferral is performed on memory release - 244something that would require memory to be allocated. 245 246The caller should lock exclusively against other modifiers and accessors 247of the array. 248 249 2506. Garbage collect an associative array:: 251 252 int assoc_array_gc(struct assoc_array *array, 253 const struct assoc_array_ops *ops, 254 bool (*iterator)(void *object, void *iterator_data), 255 void *iterator_data); 256 257This iterates over the objects in an associative array and passes each one to 258``iterator()``. If ``iterator()`` returns ``true``, the object is kept. If it 259returns ``false``, the object will be freed. If the ``iterator()`` function 260returns ``true``, it must perform any appropriate refcount incrementing on the 261object before returning. 262 263The internal tree will be packed down if possible as part of the iteration 264to reduce the number of nodes in it. 265 266The ``iterator_data`` is passed directly to ``iterator()`` and is otherwise 267ignored by the function. 268 269The function will return ``0`` if successful and ``-ENOMEM`` if there wasn't 270enough memory. 271 272It is possible for other threads to iterate over or search the array under 273the RCU read lock whilst this function is in progress. The caller should 274lock exclusively against other modifiers of the array. 275 276 277Access Functions 278---------------- 279 280There are two functions for accessing an associative array: 281 2821. Iterate over all the objects in an associative array:: 283 284 int assoc_array_iterate(const struct assoc_array *array, 285 int (*iterator)(const void *object, 286 void *iterator_data), 287 void *iterator_data); 288 289This passes each object in the array to the iterator callback function. 290``iterator_data`` is private data for that function. 291 292This may be used on an array at the same time as the array is being 293modified, provided the RCU read lock is held. Under such circumstances, 294it is possible for the iteration function to see some objects twice. If 295this is a problem, then modification should be locked against. The 296iteration algorithm should not, however, miss any objects. 297 298The function will return ``0`` if no objects were in the array or else it will 299return the result of the last iterator function called. Iteration stops 300immediately if any call to the iteration function results in a non-zero 301return. 302 303 3042. Find an object in an associative array:: 305 306 void *assoc_array_find(const struct assoc_array *array, 307 const struct assoc_array_ops *ops, 308 const void *index_key); 309 310This walks through the array's internal tree directly to the object 311specified by the index key.. 312 313This may be used on an array at the same time as the array is being 314modified, provided the RCU read lock is held. 315 316The function will return the object if found (and set ``*_type`` to the object 317type) or will return ``NULL`` if the object was not found. 318 319 320Index Key Form 321-------------- 322 323The index key can be of any form, but since the algorithms aren't told how long 324the key is, it is strongly recommended that the index key includes its length 325very early on before any variation due to the length would have an effect on 326comparisons. 327 328This will cause leaves with different length keys to scatter away from each 329other - and those with the same length keys to cluster together. 330 331It is also recommended that the index key begin with a hash of the rest of the 332key to maximise scattering throughout keyspace. 333 334The better the scattering, the wider and lower the internal tree will be. 335 336Poor scattering isn't too much of a problem as there are shortcuts and nodes 337can contain mixtures of leaves and metadata pointers. 338 339The index key is read in chunks of machine word. Each chunk is subdivided into 340one nibble (4 bits) per level, so on a 32-bit CPU this is good for 8 levels and 341on a 64-bit CPU, 16 levels. Unless the scattering is really poor, it is 342unlikely that more than one word of any particular index key will have to be 343used. 344 345 346Internal Workings 347================= 348 349The associative array data structure has an internal tree. This tree is 350constructed of two types of metadata blocks: nodes and shortcuts. 351 352A node is an array of slots. Each slot can contain one of four things: 353 354* A NULL pointer, indicating that the slot is empty. 355* A pointer to an object (a leaf). 356* A pointer to a node at the next level. 357* A pointer to a shortcut. 358 359 360Basic Internal Tree Layout 361-------------------------- 362 363Ignoring shortcuts for the moment, the nodes form a multilevel tree. The index 364key space is strictly subdivided by the nodes in the tree and nodes occur on 365fixed levels. For example:: 366 367 Level: 0 1 2 3 368 =============== =============== =============== =============== 369 NODE D 370 NODE B NODE C +------>+---+ 371 +------>+---+ +------>+---+ | | 0 | 372 NODE A | | 0 | | | 0 | | +---+ 373 +---+ | +---+ | +---+ | : : 374 | 0 | | : : | : : | +---+ 375 +---+ | +---+ | +---+ | | f | 376 | 1 |---+ | 3 |---+ | 7 |---+ +---+ 377 +---+ +---+ +---+ 378 : : : : | 8 |---+ 379 +---+ +---+ +---+ | NODE E 380 | e |---+ | f | : : +------>+---+ 381 +---+ | +---+ +---+ | 0 | 382 | f | | | f | +---+ 383 +---+ | +---+ : : 384 | NODE F +---+ 385 +------>+---+ | f | 386 | 0 | NODE G +---+ 387 +---+ +------>+---+ 388 : : | | 0 | 389 +---+ | +---+ 390 | 6 |---+ : : 391 +---+ +---+ 392 : : | f | 393 +---+ +---+ 394 | f | 395 +---+ 396 397In the above example, there are 7 nodes (A-G), each with 16 slots (0-f). 398Assuming no other meta data nodes in the tree, the key space is divided 399thusly:: 400 401 KEY PREFIX NODE 402 ========== ==== 403 137* D 404 138* E 405 13[0-69-f]* C 406 1[0-24-f]* B 407 e6* G 408 e[0-57-f]* F 409 [02-df]* A 410 411So, for instance, keys with the following example index keys will be found in 412the appropriate nodes:: 413 414 INDEX KEY PREFIX NODE 415 =============== ======= ==== 416 13694892892489 13 C 417 13795289025897 137 D 418 13889dde88793 138 E 419 138bbb89003093 138 E 420 1394879524789 12 C 421 1458952489 1 B 422 9431809de993ba - A 423 b4542910809cd - A 424 e5284310def98 e F 425 e68428974237 e6 G 426 e7fffcbd443 e F 427 f3842239082 - A 428 429To save memory, if a node can hold all the leaves in its portion of keyspace, 430then the node will have all those leaves in it and will not have any metadata 431pointers - even if some of those leaves would like to be in the same slot. 432 433A node can contain a heterogeneous mix of leaves and metadata pointers. 434Metadata pointers must be in the slots that match their subdivisions of key 435space. The leaves can be in any slot not occupied by a metadata pointer. It 436is guaranteed that none of the leaves in a node will match a slot occupied by a 437metadata pointer. If the metadata pointer is there, any leaf whose key matches 438the metadata key prefix must be in the subtree that the metadata pointer points 439to. 440 441In the above example list of index keys, node A will contain:: 442 443 SLOT CONTENT INDEX KEY (PREFIX) 444 ==== =============== ================== 445 1 PTR TO NODE B 1* 446 any LEAF 9431809de993ba 447 any LEAF b4542910809cd 448 e PTR TO NODE F e* 449 any LEAF f3842239082 450 451and node B:: 452 453 3 PTR TO NODE C 13* 454 any LEAF 1458952489 455 456 457Shortcuts 458--------- 459 460Shortcuts are metadata records that jump over a piece of keyspace. A shortcut 461is a replacement for a series of single-occupancy nodes ascending through the 462levels. Shortcuts exist to save memory and to speed up traversal. 463 464It is possible for the root of the tree to be a shortcut - say, for example, 465the tree contains at least 17 nodes all with key prefix ``1111``. The 466insertion algorithm will insert a shortcut to skip over the ``1111`` keyspace 467in a single bound and get to the fourth level where these actually become 468different. 469 470 471Splitting And Collapsing Nodes 472------------------------------ 473 474Each node has a maximum capacity of 16 leaves and metadata pointers. If the 475insertion algorithm finds that it is trying to insert a 17th object into a 476node, that node will be split such that at least two leaves that have a common 477key segment at that level end up in a separate node rooted on that slot for 478that common key segment. 479 480If the leaves in a full node and the leaf that is being inserted are 481sufficiently similar, then a shortcut will be inserted into the tree. 482 483When the number of objects in the subtree rooted at a node falls to 16 or 484fewer, then the subtree will be collapsed down to a single node - and this will 485ripple towards the root if possible. 486 487 488Non-Recursive Iteration 489----------------------- 490 491Each node and shortcut contains a back pointer to its parent and the number of 492slot in that parent that points to it. None-recursive iteration uses these to 493proceed rootwards through the tree, going to the parent node, slot N + 1 to 494make sure progress is made without the need for a stack. 495 496The backpointers, however, make simultaneous alteration and iteration tricky. 497 498 499Simultaneous Alteration And Iteration 500------------------------------------- 501 502There are a number of cases to consider: 503 5041. Simple insert/replace. This involves simply replacing a NULL or old 505 matching leaf pointer with the pointer to the new leaf after a barrier. 506 The metadata blocks don't change otherwise. An old leaf won't be freed 507 until after the RCU grace period. 508 5092. Simple delete. This involves just clearing an old matching leaf. The 510 metadata blocks don't change otherwise. The old leaf won't be freed until 511 after the RCU grace period. 512 5133. Insertion replacing part of a subtree that we haven't yet entered. This 514 may involve replacement of part of that subtree - but that won't affect 515 the iteration as we won't have reached the pointer to it yet and the 516 ancestry blocks are not replaced (the layout of those does not change). 517 5184. Insertion replacing nodes that we're actively processing. This isn't a 519 problem as we've passed the anchoring pointer and won't switch onto the 520 new layout until we follow the back pointers - at which point we've 521 already examined the leaves in the replaced node (we iterate over all the 522 leaves in a node before following any of its metadata pointers). 523 524 We might, however, re-see some leaves that have been split out into a new 525 branch that's in a slot further along than we were at. 526 5275. Insertion replacing nodes that we're processing a dependent branch of. 528 This won't affect us until we follow the back pointers. Similar to (4). 529 5306. Deletion collapsing a branch under us. This doesn't affect us because the 531 back pointers will get us back to the parent of the new node before we 532 could see the new node. The entire collapsed subtree is thrown away 533 unchanged - and will still be rooted on the same slot, so we shouldn't 534 process it a second time as we'll go back to slot + 1. 535 536.. note:: 537 538 Under some circumstances, we need to simultaneously change the parent 539 pointer and the parent slot pointer on a node (say, for example, we 540 inserted another node before it and moved it up a level). We cannot do 541 this without locking against a read - so we have to replace that node too. 542 543 However, when we're changing a shortcut into a node this isn't a problem 544 as shortcuts only have one slot and so the parent slot number isn't used 545 when traversing backwards over one. This means that it's okay to change 546 the slot number first - provided suitable barriers are used to make sure 547 the parent slot number is read after the back pointer. 548 549Obsolete blocks and leaves are freed up after an RCU grace period has passed, 550so as long as anyone doing walking or iteration holds the RCU read lock, the 551old superstructure should not go away on them. 552