.. SPDX-License-Identifier: GPL-2.0

====================================================
pin_user_pages() and related calls
====================================================

.. contents:: :local:

Overview
========

This document describes the following functions::

    pin_user_pages()
    pin_user_pages_fast()
    pin_user_pages_remote()

Basic description of FOLL_PIN
=============================

FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
("gup") family of functions. FOLL_PIN has significant interactions and
interdependencies with FOLL_LONGTERM, so both are covered here.

FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
sites. This allows the associated wrapper functions (pin_user_pages*() and
others) to set the correct combination of these flags, and to check for problems
as well.

FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
This is in order to avoid creating a large number of wrapper functions to cover
all combinations of get*(), pin*(), FOLL_LONGTERM, and more.
Also, the
pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
that's a natural dividing line, and a good point to make separate wrapper calls.
In other words, use pin_user_pages*() for DMA-pinned pages, and
get_user_pages*() for other cases. There are five cases described later on in
this document, to further clarify that concept.

FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
multiple threads and call sites are free to pin the same struct pages, via both
FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
other, not the struct page(s).

The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
uses a different reference counting technique.

FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN.

Which flags are set by each wrapper
===================================

For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
flags the caller provides. The caller is required to pass in a non-null struct
pages* array, and the function then pins pages by incrementing the refcount of
each by a special value: GUP_PIN_COUNTING_BIAS.

For large folios, the GUP_PIN_COUNTING_BIAS scheme is not used.
Instead,
the extra space available in the struct folio is used to store the
pincount directly.

This approach for large folios avoids the counting upper limit problems
that are discussed below. Those limitations would have been aggravated
severely by huge pages, because each tail page adds a refcount to the
head page. And in fact, testing revealed that, without a separate pincount
field, refcount overflows were seen in some huge page stress tests.

This also means that huge pages and large folios do not suffer
from the false positives problem that is mentioned below. ::

    Function
    --------
    pin_user_pages          FOLL_PIN is always set internally by this function.
    pin_user_pages_fast     FOLL_PIN is always set internally by this function.
    pin_user_pages_remote   FOLL_PIN is always set internally by this function.

For these get_user_pages*() functions, FOLL_GET might not even be specified.
Behavior is a little more complex than above. If FOLL_GET was *not* specified,
but the caller passed in a non-null struct pages* array, then the function
sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount
of each page by +1. ::

    Function
    --------
    get_user_pages          FOLL_GET is sometimes set internally by this function.
    get_user_pages_fast     FOLL_GET is sometimes set internally by this function.
    get_user_pages_remote   FOLL_GET is sometimes set internally by this function.

Tracking dma-pinned pages
=========================

Some of the key design constraints, and solutions, for tracking dma-pinned
pages:

* An actual reference count, per struct page, is required. This is because
  multiple processes may pin and unpin a page.

* False positives (reporting that a page is dma-pinned, when in fact it is not)
  are acceptable, but false negatives are not.

* struct page may not be increased in size for this, and all fields are already
  used.

* Given the above, we can overload the page->_refcount field by using, sort of,
  the upper bits in that field for a dma-pinned count. "Sort of" means that,
  rather than dividing page->_refcount into bit fields, we simply add a
  medium-large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024:
  10 bits) to page->_refcount. This provides fuzzy behavior: if a page has
  get_page() called on it 1024 times, then it will appear to have a single
  dma-pinned count. And again, that's acceptable.

This also leads to limitations: there are only 31-10==21 bits available for a
counter that increments 10 bits at a time.

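
The arithmetic behind the biased refcount can be sketched in userspace C. This
is a minimal model, not kernel code: struct model_page, MODEL_PIN_BIAS and the
model_*() helpers are hypothetical names invented for illustration, and the bias
of 1024 is simply the value quoted above::

    #include <assert.h>
    #include <stdint.h>

    /* Illustrative stand-in for GUP_PIN_COUNTING_BIAS (1024, i.e. 10 bits). */
    #define MODEL_PIN_BIAS 1024

    /* Hypothetical model of a small page: one signed 32-bit refcount. */
    struct model_page {
        int32_t refcount;
    };

    /* An ordinary reference, as taken by get_page(), adds 1. */
    static inline void model_get_page(struct model_page *p)
    {
        p->refcount += 1;
    }

    /* A FOLL_PIN style pin adds the whole bias to the same field. */
    static inline void model_pin_page(struct model_page *p)
    {
        p->refcount += MODEL_PIN_BIAS;
    }

    /* Unpinning subtracts the bias; an unbalanced unpin trips the assert. */
    static inline void model_unpin_page(struct model_page *p)
    {
        assert(p->refcount >= MODEL_PIN_BIAS);
        p->refcount -= MODEL_PIN_BIAS;
    }

    /* How many pins the refcount appears to hold (fuzzy by design). */
    static inline int32_t model_apparent_pins(const struct model_page *p)
    {
        return p->refcount / MODEL_PIN_BIAS;
    }

    /* Largest number of pins that fits in 31 bits when each pin costs 1024. */
    static inline int32_t model_max_pins(void)
    {
        return INT32_MAX / MODEL_PIN_BIAS;
    }

In this model, 1024 calls to model_get_page() on a page whose refcount starts at
zero make model_apparent_pins() return 1, which is the false positive described
above, and model_max_pins() evaluates to 2097151 (2^21 - 1).
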

* Because of that limitation, special handling is applied to the zero pages
  when using FOLL_PIN. We only pretend to pin a zero page - we don't alter its
  refcount or pincount at all (it is permanent, so there's no need). The
  unpinning functions also don't do anything to a zero page. This is
  transparent to the caller.

* Callers must specifically request "dma-pinned tracking of pages". In other
  words, just calling get_user_pages() will not suffice; a new set of functions,
  pin_user_pages() and related, must be used.

FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
==========================================================

Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
these categories:

CASE 1: Direct IO (DIO)
-----------------------
There are GUP references to pages that are serving
as DIO buffers. These buffers are needed for a relatively short time (so they
are not "long term"). No special synchronization with page_mkclean() or
munmap() is provided. Therefore, flags to set at the call site are: ::

    FOLL_PIN

...but rather than setting FOLL_PIN directly, call sites should use one of
the pin_user_pages*() routines that set FOLL_PIN.

CASE 2: RDMA
------------
There are GUP references to pages that are serving as DMA
buffers. These buffers are needed for a long time ("long term"). No special
synchronization with page_mkclean() or munmap() is provided. Therefore, flags
to set at the call site are: ::

    FOLL_PIN | FOLL_LONGTERM

NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
because DAX pages do not have a separate page cache, and so "pinning" implies
locking down file system blocks, which is not (yet) supported in that way.

CASE 3: MMU notifier registration, with or without page faulting hardware
-------------------------------------------------------------------------
Device drivers can pin pages via get_user_pages*(), and register for mmu
notifier callbacks for the memory range. Then, upon receiving a notifier
"invalidate range" callback, stop the device from using the range, and unpin
the pages. There may be other possible schemes, such as explicitly
synchronizing against pending IO, that accomplish approximately the same thing.

Or, if the hardware supports replayable page faults, then the device driver can
avoid pinning entirely (this is ideal), as follows: register for mmu notifier
callbacks as above, but instead of stopping the device and unpinning in the
callback, simply remove the range from the device's page tables.

Either way, as long as the driver unpins the pages upon mmu notifier callback,
then there is proper synchronization with both filesystem and mm
(page_mkclean(), munmap(), etc). Therefore, neither flag needs to be set.

CASE 4: Pinning for struct page manipulation only
-------------------------------------------------
If only struct page data (as opposed to the actual memory contents that a page
is tracking) is affected, then normal GUP calls are sufficient, and neither flag
needs to be set.

CASE 5: Pinning in order to write to the data within the page
-------------------------------------------------------------
Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
write to a page's data, unpin" can cause a problem. Case 5 may be considered a
superset of Case 1, plus Case 2, plus anything that invokes that pattern.
In
other words, if the code is neither Case 1 nor Case 2, it may still require
FOLL_PIN, for patterns like this:

Correct (uses FOLL_PIN calls):
    pin_user_pages()
    write to the data within the pages
    unpin_user_pages()

INCORRECT (uses FOLL_GET calls):
    get_user_pages()
    write to the data within the pages
    put_page()

page_maybe_dma_pinned(): the whole point of pinning
===================================================

The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
(and file system writeback code in general) to make informed decisions about
what to do when a page cannot be unmapped due to such pins.

What to do in those cases is the subject of a years-long series of discussions
and debates (see the References at the end of this document). It's a TODO item
here: fill in the details once that's worked out. Meanwhile, it's safe to say
that having this available: ::

    static inline bool page_maybe_dma_pinned(struct page *page)

...is a prerequisite to solving the long-running gup+DMA problem.
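
To make the "maybe" in that name concrete, here is a small userspace sketch of
how such a query could behave under the two counting schemes described earlier.
It is an assumption-laden model rather than the kernel implementation: struct
model_folio, MODEL_PIN_BIAS and model_maybe_dma_pinned() are hypothetical names,
and the small-page check is assumed to be a plain threshold comparison against
the bias::

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Illustrative stand-in for GUP_PIN_COUNTING_BIAS. */
    #define MODEL_PIN_BIAS 1024

    /* Hypothetical model: small pages use only refcount, large folios
     * additionally carry an exact pincount. */
    struct model_folio {
        bool large;
        int refcount;
        int pincount;   /* meaningful only when large is true */
    };

    static bool model_maybe_dma_pinned(const struct model_folio *f)
    {
        assert(f != NULL);

        /* Large folio: the exact pincount gives a precise answer. */
        if (f->large)
            return f->pincount > 0;

        /* Small page: many plain references look the same as one pin,
         * so this can report a false positive, never a false negative. */
        return f->refcount >= MODEL_PIN_BIAS;
    }

A small page with 1024 ordinary references is reported as pinned by this model,
while a large folio with the same refcount and a pincount of zero is not, which
mirrors the false positive behavior discussed in "Tracking dma-pinned pages".
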

Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
===================================================================

Another way of thinking about these flags is as a progression of restrictions:
FOLL_GET is for struct page manipulation, without affecting the data that the
struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
will be pinned longterm, and whose data will be accessed.

Unit testing
============
This file::

    tools/testing/selftests/mm/gup_test.c

has the following new calls to exercise the new pin*() wrapper functions:

* PIN_FAST_BENCHMARK (./gup_test -a)
* PIN_BASIC_TEST (./gup_test -b)

You can monitor how many total dma-pinned pages have been acquired and released
since the system was booted, via two new entries in the /proc/vmstat file: ::

    nr_foll_pin_acquired
    nr_foll_pin_released

Under normal conditions, these two values will be equal unless there are any
long-term [R]DMA pins in place, or during pin/unpin transitions.
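
A small userspace helper can compute the number of currently outstanding pins
from these two counters. The sketch below assumes the usual /proc/vmstat layout
of one "name value" pair per line; vmstat_lookup() and foll_pins_outstanding()
are hypothetical helper names, not existing APIs::

    #include <assert.h>
    #include <stdio.h>
    #include <string.h>

    /*
     * Find a named counter in vmstat-style text read from stream f.
     * Returns 0 and stores the value on success, -1 if the name is absent.
     */
    static int vmstat_lookup(FILE *f, const char *name,
                             unsigned long long *value)
    {
        char key[128];
        unsigned long long v;

        assert(f != NULL && name != NULL && value != NULL);

        rewind(f);
        while (fscanf(f, "%127s %llu", key, &v) == 2) {
            if (strcmp(key, name) == 0) {
                *value = v;
                return 0;
            }
        }
        return -1;
    }

    /* Acquired minus released pins, or -1 if either counter is missing. */
    static long long foll_pins_outstanding(FILE *f)
    {
        unsigned long long acquired, released;

        if (vmstat_lookup(f, "nr_foll_pin_acquired", &acquired) != 0 ||
            vmstat_lookup(f, "nr_foll_pin_released", &released) != 0)
            return -1;

        return (long long)(acquired - released);
    }

A caller would open /proc/vmstat with fopen() and pass the stream to
foll_pins_outstanding(); on a system with no long-term pins and no pin/unpin in
flight, the expected result is 0.
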

* nr_foll_pin_acquired: This is the number of logical pins that have been
  acquired since the system was powered on. For huge pages, the head page is
  pinned once for each page (head page and each tail page) within the huge page.
  This follows the same sort of behavior that get_user_pages() uses for huge
  pages: the head page is refcounted once for each tail or head page in the huge
  page, when get_user_pages() is applied to a huge page.

* nr_foll_pin_released: The number of logical pins that have been released since
  the system was powered on. Note that pages are released (unpinned) on a
  PAGE_SIZE granularity, even if the original pin was applied to a huge page.
  Because of the pin count behavior described above in "nr_foll_pin_acquired",
  the accounting balances out, so that after doing this::

    pin_user_pages(huge_page);
    for (each page in huge_page)
        unpin_user_page(page);

...the following is expected::

    nr_foll_pin_released == nr_foll_pin_acquired

(...unless it was already out of balance due to a long-term RDMA pin being in
place.)

Other diagnostics
=================

dump_page() has been enhanced slightly to handle these new counting
fields, and to better report on large folios in general.
Specifically,
for large folios, the exact pincount is reported.

References
==========

* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`_

John Hubbard, October, 2019