1.. This file is dual-licensed: you can use it either under the terms 2.. of the GPL 2.0 or the GFDL 1.2 license, at your option. Note that this 3.. dual licensing only applies to this file, and not this project as a 4.. whole. 5.. 6.. a) This file is free software; you can redistribute it and/or 7.. modify it under the terms of the GNU General Public License as 8.. published by the Free Software Foundation version 2 of 9.. the License. 10.. 11.. This file is distributed in the hope that it will be useful, 12.. but WITHOUT ANY WARRANTY; without even the implied warranty of 13.. MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 14.. GNU General Public License for more details. 15.. 16.. Or, alternatively, 17.. 18.. b) Permission is granted to copy, distribute and/or modify this 19.. document under the terms of the GNU Free Documentation License, 20.. Version 1.2 version published by the Free Software 21.. Foundation, with no Invariant Sections, no Front-Cover Texts 22.. and no Back-Cover Texts. A copy of the license is included at 23.. Documentation/userspace-api/media/fdl-appendix.rst. 24.. 25.. TODO: replace it to GPL-2.0 OR GFDL-1.2 WITH no-invariant-sections 26 27=========================== 28Lockless Ring Buffer Design 29=========================== 30 31Copyright 2009 Red Hat Inc. 32 33:Author: Steven Rostedt <srostedt@redhat.com> 34:License: The GNU Free Documentation License, Version 1.2 35 (dual licensed under the GPL v2) 36:Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto, 37 and Frederic Weisbecker. 38 39 40Written for: 2.6.31 41 42Terminology used in this Document 43--------------------------------- 44 45tail 46 - where new writes happen in the ring buffer. 47 48head 49 - where new reads happen in the ring buffer. 50 51producer 52 - the task that writes into the ring buffer (same as writer) 53 54writer 55 - same as producer 56 57consumer 58 - the task that reads from the buffer (same as reader) 59 60reader 61 - same as consumer. 62 63reader_page 64 - A page outside the ring buffer used solely (for the most part) 65 by the reader. 66 67head_page 68 - a pointer to the page that the reader will use next 69 70tail_page 71 - a pointer to the page that will be written to next 72 73commit_page 74 - a pointer to the page with the last finished non-nested write. 75 76cmpxchg 77 - hardware-assisted atomic transaction that performs the following:: 78 79 A = B if previous A == C 80 81 R = cmpxchg(A, C, B) is saying that we replace A with B if and only 82 if current A is equal to C, and we put the old (current) 83 A into R 84 85 R gets the previous A regardless if A is updated with B or not. 86 87 To see if the update was successful a compare of ``R == C`` 88 may be used. 89 90The Generic Ring Buffer 91----------------------- 92 93The ring buffer can be used in either an overwrite mode or in 94producer/consumer mode. 95 96Producer/consumer mode is where if the producer were to fill up the 97buffer before the consumer could free up anything, the producer 98will stop writing to the buffer. This will lose most recent events. 99 100Overwrite mode is where if the producer were to fill up the buffer 101before the consumer could free up anything, the producer will 102overwrite the older data. This will lose the oldest events. 103 104No two writers can write at the same time (on the same per-cpu buffer), 105but a writer may interrupt another writer, but it must finish writing 106before the previous writer may continue. This is very important to the 107algorithm. The writers act like a "stack". The way interrupts works 108enforces this behavior:: 109 110 111 writer1 start 112 <preempted> writer2 start 113 <preempted> writer3 start 114 writer3 finishes 115 writer2 finishes 116 writer1 finishes 117 118This is very much like a writer being preempted by an interrupt and 119the interrupt doing a write as well. 120 121Readers can happen at any time. But no two readers may run at the 122same time, nor can a reader preempt/interrupt another reader. A reader 123cannot preempt/interrupt a writer, but it may read/consume from the 124buffer at the same time as a writer is writing, but the reader must be 125on another processor to do so. A reader may read on its own processor 126and can be preempted by a writer. 127 128A writer can preempt a reader, but a reader cannot preempt a writer. 129But a reader can read the buffer at the same time (on another processor) 130as a writer. 131 132The ring buffer is made up of a list of pages held together by a linked list. 133 134At initialization a reader page is allocated for the reader that is not 135part of the ring buffer. 136 137The head_page, tail_page and commit_page are all initialized to point 138to the same page. 139 140The reader page is initialized to have its next pointer pointing to 141the head page, and its previous pointer pointing to a page before 142the head page. 143 144The reader has its own page to use. At start up time, this page is 145allocated but is not attached to the list. When the reader wants 146to read from the buffer, if its page is empty (like it is on start-up), 147it will swap its page with the head_page. The old reader page will 148become part of the ring buffer and the head_page will be removed. 149The page after the inserted page (old reader_page) will become the 150new head page. 151 152Once the new page is given to the reader, the reader could do what 153it wants with it, as long as a writer has left that page. 154 155A sample of how the reader page is swapped: Note this does not 156show the head page in the buffer, it is for demonstrating a swap 157only. 158 159:: 160 161 +------+ 162 |reader| RING BUFFER 163 |page | 164 +------+ 165 +---+ +---+ +---+ 166 | |-->| |-->| | 167 | |<--| |<--| | 168 +---+ +---+ +---+ 169 ^ | ^ | 170 | +-------------+ | 171 +-----------------+ 172 173 174 +------+ 175 |reader| RING BUFFER 176 |page |-------------------+ 177 +------+ v 178 | +---+ +---+ +---+ 179 | | |-->| |-->| | 180 | | |<--| |<--| |<-+ 181 | +---+ +---+ +---+ | 182 | ^ | ^ | | 183 | | +-------------+ | | 184 | +-----------------+ | 185 +------------------------------------+ 186 187 +------+ 188 |reader| RING BUFFER 189 |page |-------------------+ 190 +------+ <---------------+ v 191 | ^ +---+ +---+ +---+ 192 | | | |-->| |-->| | 193 | | | | | |<--| |<-+ 194 | | +---+ +---+ +---+ | 195 | | | ^ | | 196 | | +-------------+ | | 197 | +-----------------------------+ | 198 +------------------------------------+ 199 200 +------+ 201 |buffer| RING BUFFER 202 |page |-------------------+ 203 +------+ <---------------+ v 204 | ^ +---+ +---+ +---+ 205 | | | | | |-->| | 206 | | New | | | |<--| |<-+ 207 | | Reader +---+ +---+ +---+ | 208 | | page ----^ | | 209 | | | | 210 | +-----------------------------+ | 211 +------------------------------------+ 212 213 214 215It is possible that the page swapped is the commit page and the tail page, 216if what is in the ring buffer is less than what is held in a buffer page. 217 218:: 219 220 reader page commit page tail page 221 | | | 222 v | | 223 +---+ | | 224 | |<----------+ | 225 | |<------------------------+ 226 | |------+ 227 +---+ | 228 | 229 v 230 +---+ +---+ +---+ +---+ 231 <---| |--->| |--->| |--->| |---> 232 --->| |<---| |<---| |<---| |<--- 233 +---+ +---+ +---+ +---+ 234 235This case is still valid for this algorithm. 236When the writer leaves the page, it simply goes into the ring buffer 237since the reader page still points to the next location in the ring 238buffer. 239 240 241The main pointers: 242 243 reader page 244 - The page used solely by the reader and is not part 245 of the ring buffer (may be swapped in) 246 247 head page 248 - the next page in the ring buffer that will be swapped 249 with the reader page. 250 251 tail page 252 - the page where the next write will take place. 253 254 commit page 255 - the page that last finished a write. 256 257The commit page only is updated by the outermost writer in the 258writer stack. A writer that preempts another writer will not move the 259commit page. 260 261When data is written into the ring buffer, a position is reserved 262in the ring buffer and passed back to the writer. When the writer 263is finished writing data into that position, it commits the write. 264 265Another write (or a read) may take place at anytime during this 266transaction. If another write happens it must finish before continuing 267with the previous write. 268 269 270 Write reserve:: 271 272 Buffer page 273 +---------+ 274 |written | 275 +---------+ <--- given back to writer (current commit) 276 |reserved | 277 +---------+ <--- tail pointer 278 | empty | 279 +---------+ 280 281 Write commit:: 282 283 Buffer page 284 +---------+ 285 |written | 286 +---------+ 287 |written | 288 +---------+ <--- next position for write (current commit) 289 | empty | 290 +---------+ 291 292 293 If a write happens after the first reserve:: 294 295 Buffer page 296 +---------+ 297 |written | 298 +---------+ <-- current commit 299 |reserved | 300 +---------+ <--- given back to second writer 301 |reserved | 302 +---------+ <--- tail pointer 303 304 After second writer commits:: 305 306 307 Buffer page 308 +---------+ 309 |written | 310 +---------+ <--(last full commit) 311 |reserved | 312 +---------+ 313 |pending | 314 |commit | 315 +---------+ <--- tail pointer 316 317 When the first writer commits:: 318 319 Buffer page 320 +---------+ 321 |written | 322 +---------+ 323 |written | 324 +---------+ 325 |written | 326 +---------+ <--(last full commit and tail pointer) 327 328 329The commit pointer points to the last write location that was 330committed without preempting another write. When a write that 331preempted another write is committed, it only becomes a pending commit 332and will not be a full commit until all writes have been committed. 333 334The commit page points to the page that has the last full commit. 335The tail page points to the page with the last write (before 336committing). 337 338The tail page is always equal to or after the commit page. It may 339be several pages ahead. If the tail page catches up to the commit 340page then no more writes may take place (regardless of the mode 341of the ring buffer: overwrite and produce/consumer). 342 343The order of pages is:: 344 345 head page 346 commit page 347 tail page 348 349Possible scenario:: 350 351 tail page 352 head page commit page | 353 | | | 354 v v v 355 +---+ +---+ +---+ +---+ 356 <---| |--->| |--->| |--->| |---> 357 --->| |<---| |<---| |<---| |<--- 358 +---+ +---+ +---+ +---+ 359 360There is a special case that the head page is after either the commit page 361and possibly the tail page. That is when the commit (and tail) page has been 362swapped with the reader page. This is because the head page is always 363part of the ring buffer, but the reader page is not. Whenever there 364has been less than a full page that has been committed inside the ring buffer, 365and a reader swaps out a page, it will be swapping out the commit page. 366 367:: 368 369 reader page commit page tail page 370 | | | 371 v | | 372 +---+ | | 373 | |<----------+ | 374 | |<------------------------+ 375 | |------+ 376 +---+ | 377 | 378 v 379 +---+ +---+ +---+ +---+ 380 <---| |--->| |--->| |--->| |---> 381 --->| |<---| |<---| |<---| |<--- 382 +---+ +---+ +---+ +---+ 383 ^ 384 | 385 head page 386 387 388In this case, the head page will not move when the tail and commit 389move back into the ring buffer. 390 391The reader cannot swap a page into the ring buffer if the commit page 392is still on that page. If the read meets the last commit (real commit 393not pending or reserved), then there is nothing more to read. 394The buffer is considered empty until another full commit finishes. 395 396When the tail meets the head page, if the buffer is in overwrite mode, 397the head page will be pushed ahead one. If the buffer is in producer/consumer 398mode, the write will fail. 399 400Overwrite mode:: 401 402 tail page 403 | 404 v 405 +---+ +---+ +---+ +---+ 406 <---| |--->| |--->| |--->| |---> 407 --->| |<---| |<---| |<---| |<--- 408 +---+ +---+ +---+ +---+ 409 ^ 410 | 411 head page 412 413 414 tail page 415 | 416 v 417 +---+ +---+ +---+ +---+ 418 <---| |--->| |--->| |--->| |---> 419 --->| |<---| |<---| |<---| |<--- 420 +---+ +---+ +---+ +---+ 421 ^ 422 | 423 head page 424 425 426 tail page 427 | 428 v 429 +---+ +---+ +---+ +---+ 430 <---| |--->| |--->| |--->| |---> 431 --->| |<---| |<---| |<---| |<--- 432 +---+ +---+ +---+ +---+ 433 ^ 434 | 435 head page 436 437Note, the reader page will still point to the previous head page. 438But when a swap takes place, it will use the most recent head page. 439 440 441Making the Ring Buffer Lockless: 442-------------------------------- 443 444The main idea behind the lockless algorithm is to combine the moving 445of the head_page pointer with the swapping of pages with the reader. 446State flags are placed inside the pointer to the page. To do this, 447each page must be aligned in memory by 4 bytes. This will allow the 2 448least significant bits of the address to be used as flags, since 449they will always be zero for the address. To get the address, 450simply mask out the flags:: 451 452 MASK = ~3 453 454 address & MASK 455 456Two flags will be kept by these two bits: 457 458 HEADER 459 - the page being pointed to is a head page 460 461 UPDATE 462 - the page being pointed to is being updated by a writer 463 and was or is about to be a head page. 464 465:: 466 467 reader page 468 | 469 v 470 +---+ 471 | |------+ 472 +---+ | 473 | 474 v 475 +---+ +---+ +---+ +---+ 476 <---| |--->| |-H->| |--->| |---> 477 --->| |<---| |<---| |<---| |<--- 478 +---+ +---+ +---+ +---+ 479 480 481The above pointer "-H->" would have the HEADER flag set. That is 482the next page is the next page to be swapped out by the reader. 483This pointer means the next page is the head page. 484 485When the tail page meets the head pointer, it will use cmpxchg to 486change the pointer to the UPDATE state:: 487 488 489 tail page 490 | 491 v 492 +---+ +---+ +---+ +---+ 493 <---| |--->| |-H->| |--->| |---> 494 --->| |<---| |<---| |<---| |<--- 495 +---+ +---+ +---+ +---+ 496 497 tail page 498 | 499 v 500 +---+ +---+ +---+ +---+ 501 <---| |--->| |-U->| |--->| |---> 502 --->| |<---| |<---| |<---| |<--- 503 +---+ +---+ +---+ +---+ 504 505"-U->" represents a pointer in the UPDATE state. 506 507Any access to the reader will need to take some sort of lock to serialize 508the readers. But the writers will never take a lock to write to the 509ring buffer. This means we only need to worry about a single reader, 510and writes only preempt in "stack" formation. 511 512When the reader tries to swap the page with the ring buffer, it 513will also use cmpxchg. If the flag bit in the pointer to the 514head page does not have the HEADER flag set, the compare will fail 515and the reader will need to look for the new head page and try again. 516Note, the flags UPDATE and HEADER are never set at the same time. 517 518The reader swaps the reader page as follows:: 519 520 +------+ 521 |reader| RING BUFFER 522 |page | 523 +------+ 524 +---+ +---+ +---+ 525 | |--->| |--->| | 526 | |<---| |<---| | 527 +---+ +---+ +---+ 528 ^ | ^ | 529 | +---------------+ | 530 +-----H-------------+ 531 532The reader sets the reader page next pointer as HEADER to the page after 533the head page:: 534 535 536 +------+ 537 |reader| RING BUFFER 538 |page |-------H-----------+ 539 +------+ v 540 | +---+ +---+ +---+ 541 | | |--->| |--->| | 542 | | |<---| |<---| |<-+ 543 | +---+ +---+ +---+ | 544 | ^ | ^ | | 545 | | +---------------+ | | 546 | +-----H-------------+ | 547 +--------------------------------------+ 548 549It does a cmpxchg with the pointer to the previous head page to make it 550point to the reader page. Note that the new pointer does not have the HEADER 551flag set. This action atomically moves the head page forward:: 552 553 +------+ 554 |reader| RING BUFFER 555 |page |-------H-----------+ 556 +------+ v 557 | ^ +---+ +---+ +---+ 558 | | | |-->| |-->| | 559 | | | |<--| |<--| |<-+ 560 | | +---+ +---+ +---+ | 561 | | | ^ | | 562 | | +-------------+ | | 563 | +-----------------------------+ | 564 +------------------------------------+ 565 566After the new head page is set, the previous pointer of the head page is 567updated to the reader page:: 568 569 +------+ 570 |reader| RING BUFFER 571 |page |-------H-----------+ 572 +------+ <---------------+ v 573 | ^ +---+ +---+ +---+ 574 | | | |-->| |-->| | 575 | | | | | |<--| |<-+ 576 | | +---+ +---+ +---+ | 577 | | | ^ | | 578 | | +-------------+ | | 579 | +-----------------------------+ | 580 +------------------------------------+ 581 582 +------+ 583 |buffer| RING BUFFER 584 |page |-------H-----------+ <--- New head page 585 +------+ <---------------+ v 586 | ^ +---+ +---+ +---+ 587 | | | | | |-->| | 588 | | New | | | |<--| |<-+ 589 | | Reader +---+ +---+ +---+ | 590 | | page ----^ | | 591 | | | | 592 | +-----------------------------+ | 593 +------------------------------------+ 594 595Another important point: The page that the reader page points back to 596by its previous pointer (the one that now points to the new head page) 597never points back to the reader page. That is because the reader page is 598not part of the ring buffer. Traversing the ring buffer via the next pointers 599will always stay in the ring buffer. Traversing the ring buffer via the 600prev pointers may not. 601 602Note, the way to determine a reader page is simply by examining the previous 603pointer of the page. If the next pointer of the previous page does not 604point back to the original page, then the original page is a reader page:: 605 606 607 +--------+ 608 | reader | next +----+ 609 | page |-------->| |<====== (buffer page) 610 +--------+ +----+ 611 | | ^ 612 | v | next 613 prev | +----+ 614 +------------->| | 615 +----+ 616 617The way the head page moves forward: 618 619When the tail page meets the head page and the buffer is in overwrite mode 620and more writes take place, the head page must be moved forward before the 621writer may move the tail page. The way this is done is that the writer 622performs a cmpxchg to convert the pointer to the head page from the HEADER 623flag to have the UPDATE flag set. Once this is done, the reader will 624not be able to swap the head page from the buffer, nor will it be able to 625move the head page, until the writer is finished with the move. 626 627This eliminates any races that the reader can have on the writer. The reader 628must spin, and this is why the reader cannot preempt the writer:: 629 630 tail page 631 | 632 v 633 +---+ +---+ +---+ +---+ 634 <---| |--->| |-H->| |--->| |---> 635 --->| |<---| |<---| |<---| |<--- 636 +---+ +---+ +---+ +---+ 637 638 tail page 639 | 640 v 641 +---+ +---+ +---+ +---+ 642 <---| |--->| |-U->| |--->| |---> 643 --->| |<---| |<---| |<---| |<--- 644 +---+ +---+ +---+ +---+ 645 646The following page will be made into the new head page:: 647 648 tail page 649 | 650 v 651 +---+ +---+ +---+ +---+ 652 <---| |--->| |-U->| |-H->| |---> 653 --->| |<---| |<---| |<---| |<--- 654 +---+ +---+ +---+ +---+ 655 656After the new head page has been set, we can set the old head page 657pointer back to NORMAL:: 658 659 tail page 660 | 661 v 662 +---+ +---+ +---+ +---+ 663 <---| |--->| |--->| |-H->| |---> 664 --->| |<---| |<---| |<---| |<--- 665 +---+ +---+ +---+ +---+ 666 667After the head page has been moved, the tail page may now move forward:: 668 669 tail page 670 | 671 v 672 +---+ +---+ +---+ +---+ 673 <---| |--->| |--->| |-H->| |---> 674 --->| |<---| |<---| |<---| |<--- 675 +---+ +---+ +---+ +---+ 676 677 678The above are the trivial updates. Now for the more complex scenarios. 679 680 681As stated before, if enough writes preempt the first write, the 682tail page may make it all the way around the buffer and meet the commit 683page. At this time, we must start dropping writes (usually with some kind 684of warning to the user). But what happens if the commit was still on the 685reader page? The commit page is not part of the ring buffer. The tail page 686must account for this:: 687 688 689 reader page commit page 690 | | 691 v | 692 +---+ | 693 | |<----------+ 694 | | 695 | |------+ 696 +---+ | 697 | 698 v 699 +---+ +---+ +---+ +---+ 700 <---| |--->| |-H->| |--->| |---> 701 --->| |<---| |<---| |<---| |<--- 702 +---+ +---+ +---+ +---+ 703 ^ 704 | 705 tail page 706 707If the tail page were to simply push the head page forward, the commit when 708leaving the reader page would not be pointing to the correct page. 709 710The solution to this is to test if the commit page is on the reader page 711before pushing the head page. If it is, then it can be assumed that the 712tail page wrapped the buffer, and we must drop new writes. 713 714This is not a race condition, because the commit page can only be moved 715by the outermost writer (the writer that was preempted). 716This means that the commit will not move while a writer is moving the 717tail page. The reader cannot swap the reader page if it is also being 718used as the commit page. The reader can simply check that the commit 719is off the reader page. Once the commit page leaves the reader page 720it will never go back on it unless a reader does another swap with the 721buffer page that is also the commit page. 722 723 724Nested writes 725------------- 726 727In the pushing forward of the tail page we must first push forward 728the head page if the head page is the next page. If the head page 729is not the next page, the tail page is simply updated with a cmpxchg. 730 731Only writers move the tail page. This must be done atomically to protect 732against nested writers:: 733 734 temp_page = tail_page 735 next_page = temp_page->next 736 cmpxchg(tail_page, temp_page, next_page) 737 738The above will update the tail page if it is still pointing to the expected 739page. If this fails, a nested write pushed it forward, the current write 740does not need to push it:: 741 742 743 temp page 744 | 745 v 746 tail page 747 | 748 v 749 +---+ +---+ +---+ +---+ 750 <---| |--->| |--->| |--->| |---> 751 --->| |<---| |<---| |<---| |<--- 752 +---+ +---+ +---+ +---+ 753 754Nested write comes in and moves the tail page forward:: 755 756 tail page (moved by nested writer) 757 temp page | 758 | | 759 v v 760 +---+ +---+ +---+ +---+ 761 <---| |--->| |--->| |--->| |---> 762 --->| |<---| |<---| |<---| |<--- 763 +---+ +---+ +---+ +---+ 764 765The above would fail the cmpxchg, but since the tail page has already 766been moved forward, the writer will just try again to reserve storage 767on the new tail page. 768 769But the moving of the head page is a bit more complex:: 770 771 tail page 772 | 773 v 774 +---+ +---+ +---+ +---+ 775 <---| |--->| |-H->| |--->| |---> 776 --->| |<---| |<---| |<---| |<--- 777 +---+ +---+ +---+ +---+ 778 779The write converts the head page pointer to UPDATE:: 780 781 tail page 782 | 783 v 784 +---+ +---+ +---+ +---+ 785 <---| |--->| |-U->| |--->| |---> 786 --->| |<---| |<---| |<---| |<--- 787 +---+ +---+ +---+ +---+ 788 789But if a nested writer preempts here, it will see that the next 790page is a head page, but it is also nested. It will detect that 791it is nested and will save that information. The detection is the 792fact that it sees the UPDATE flag instead of a HEADER or NORMAL 793pointer. 794 795The nested writer will set the new head page pointer:: 796 797 tail page 798 | 799 v 800 +---+ +---+ +---+ +---+ 801 <---| |--->| |-U->| |-H->| |---> 802 --->| |<---| |<---| |<---| |<--- 803 +---+ +---+ +---+ +---+ 804 805But it will not reset the update back to normal. Only the writer 806that converted a pointer from HEAD to UPDATE will convert it back 807to NORMAL:: 808 809 tail page 810 | 811 v 812 +---+ +---+ +---+ +---+ 813 <---| |--->| |-U->| |-H->| |---> 814 --->| |<---| |<---| |<---| |<--- 815 +---+ +---+ +---+ +---+ 816 817After the nested writer finishes, the outermost writer will convert 818the UPDATE pointer to NORMAL:: 819 820 821 tail page 822 | 823 v 824 +---+ +---+ +---+ +---+ 825 <---| |--->| |--->| |-H->| |---> 826 --->| |<---| |<---| |<---| |<--- 827 +---+ +---+ +---+ +---+ 828 829 830It can be even more complex if several nested writes came in and moved 831the tail page ahead several pages:: 832 833 834 (first writer) 835 836 tail page 837 | 838 v 839 +---+ +---+ +---+ +---+ 840 <---| |--->| |-H->| |--->| |---> 841 --->| |<---| |<---| |<---| |<--- 842 +---+ +---+ +---+ +---+ 843 844The write converts the head page pointer to UPDATE:: 845 846 tail page 847 | 848 v 849 +---+ +---+ +---+ +---+ 850 <---| |--->| |-U->| |--->| |---> 851 --->| |<---| |<---| |<---| |<--- 852 +---+ +---+ +---+ +---+ 853 854Next writer comes in, and sees the update and sets up the new 855head page:: 856 857 (second writer) 858 859 tail page 860 | 861 v 862 +---+ +---+ +---+ +---+ 863 <---| |--->| |-U->| |-H->| |---> 864 --->| |<---| |<---| |<---| |<--- 865 +---+ +---+ +---+ +---+ 866 867The nested writer moves the tail page forward. But does not set the old 868update page to NORMAL because it is not the outermost writer:: 869 870 tail page 871 | 872 v 873 +---+ +---+ +---+ +---+ 874 <---| |--->| |-U->| |-H->| |---> 875 --->| |<---| |<---| |<---| |<--- 876 +---+ +---+ +---+ +---+ 877 878Another writer preempts and sees the page after the tail page is a head page. 879It changes it from HEAD to UPDATE:: 880 881 (third writer) 882 883 tail page 884 | 885 v 886 +---+ +---+ +---+ +---+ 887 <---| |--->| |-U->| |-U->| |---> 888 --->| |<---| |<---| |<---| |<--- 889 +---+ +---+ +---+ +---+ 890 891The writer will move the head page forward:: 892 893 894 (third writer) 895 896 tail page 897 | 898 v 899 +---+ +---+ +---+ +---+ 900 <---| |--->| |-U->| |-U->| |-H-> 901 --->| |<---| |<---| |<---| |<--- 902 +---+ +---+ +---+ +---+ 903 904But now that the third writer did change the HEAD flag to UPDATE it 905will convert it to normal:: 906 907 908 (third writer) 909 910 tail page 911 | 912 v 913 +---+ +---+ +---+ +---+ 914 <---| |--->| |-U->| |--->| |-H-> 915 --->| |<---| |<---| |<---| |<--- 916 +---+ +---+ +---+ +---+ 917 918 919Then it will move the tail page, and return back to the second writer:: 920 921 922 (second writer) 923 924 tail page 925 | 926 v 927 +---+ +---+ +---+ +---+ 928 <---| |--->| |-U->| |--->| |-H-> 929 --->| |<---| |<---| |<---| |<--- 930 +---+ +---+ +---+ +---+ 931 932 933The second writer will fail to move the tail page because it was already 934moved, so it will try again and add its data to the new tail page. 935It will return to the first writer:: 936 937 938 (first writer) 939 940 tail page 941 | 942 v 943 +---+ +---+ +---+ +---+ 944 <---| |--->| |-U->| |--->| |-H-> 945 --->| |<---| |<---| |<---| |<--- 946 +---+ +---+ +---+ +---+ 947 948The first writer cannot know atomically if the tail page moved 949while it updates the HEAD page. It will then update the head page to 950what it thinks is the new head page:: 951 952 953 (first writer) 954 955 tail page 956 | 957 v 958 +---+ +---+ +---+ +---+ 959 <---| |--->| |-U->| |-H->| |-H-> 960 --->| |<---| |<---| |<---| |<--- 961 +---+ +---+ +---+ +---+ 962 963Since the cmpxchg returns the old value of the pointer the first writer 964will see it succeeded in updating the pointer from NORMAL to HEAD. 965But as we can see, this is not good enough. It must also check to see 966if the tail page is either where it use to be or on the next page:: 967 968 969 (first writer) 970 971 A B tail page 972 | | | 973 v v v 974 +---+ +---+ +---+ +---+ 975 <---| |--->| |-U->| |-H->| |-H-> 976 --->| |<---| |<---| |<---| |<--- 977 +---+ +---+ +---+ +---+ 978 979If tail page != A and tail page != B, then it must reset the pointer 980back to NORMAL. The fact that it only needs to worry about nested 981writers means that it only needs to check this after setting the HEAD page:: 982 983 984 (first writer) 985 986 A B tail page 987 | | | 988 v v v 989 +---+ +---+ +---+ +---+ 990 <---| |--->| |-U->| |--->| |-H-> 991 --->| |<---| |<---| |<---| |<--- 992 +---+ +---+ +---+ +---+ 993 994Now the writer can update the head page. This is also why the head page must 995remain in UPDATE and only reset by the outermost writer. This prevents 996the reader from seeing the incorrect head page:: 997 998 999 (first writer) 1000 1001 A B tail page 1002 | | | 1003 v v v 1004 +---+ +---+ +---+ +---+ 1005 <---| |--->| |--->| |--->| |-H-> 1006 --->| |<---| |<---| |<---| |<--- 1007 +---+ +---+ +---+ +---+ 1008