.. SPDX-License-Identifier: GPL-2.0

====================================================
pin_user_pages() and related calls
====================================================

.. contents:: :local:

Overview
========

This document describes the following functions::

    pin_user_pages()
    pin_user_pages_fast()
    pin_user_pages_remote()

Basic description of FOLL_PIN
=============================

FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
("gup") family of functions. FOLL_PIN has significant interactions and
interdependencies with FOLL_LONGTERM, so both are covered here.

FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
sites. This allows the associated wrapper functions (pin_user_pages*() and
others) to set the correct combination of these flags, and to check for problems
as well.

FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
This is in order to avoid creating a large number of wrapper functions to cover
all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
that's a natural dividing line, and a good point to make separate wrapper calls.
In other words, use pin_user_pages*() for DMA-pinned pages, and
get_user_pages*() for other cases. There are four cases described later on in
this document, to further clarify that concept.

FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
multiple threads and call sites are free to pin the same struct pages, via both
FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
other, not the struct page(s).

The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
uses a different reference counting technique.

FOLL_PIN is a prerequisite to FOLL_LONGTERM.
Another way of saying that is, FOLL_LONGTERM is a specific, more restrictive
case of FOLL_PIN.

Which flags are set by each wrapper
===================================

For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
flags the caller provides. The caller is required to pass in a non-null struct
page* array, and the function then pins pages by incrementing the refcount of
each page by a special value. For now, that value is +1, just like
get_user_pages*().::

    Function
    --------
    pin_user_pages          FOLL_PIN is always set internally by this function.
    pin_user_pages_fast     FOLL_PIN is always set internally by this function.
    pin_user_pages_remote   FOLL_PIN is always set internally by this function.

For these get_user_pages*() functions, FOLL_GET might not even be specified.
Behavior is a little more complex than above. If FOLL_GET was *not* specified,
but the caller passed in a non-null struct page* array, then the function
sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount
of each page by +1.::

    Function
    --------
    get_user_pages          FOLL_GET is sometimes set internally by this function.
    get_user_pages_fast     FOLL_GET is sometimes set internally by this function.
    get_user_pages_remote   FOLL_GET is sometimes set internally by this function.

Tracking dma-pinned pages
=========================

Some of the key design constraints, and solutions, for tracking dma-pinned
pages:

* An actual reference count, per struct page, is required. This is because
  multiple processes may pin and unpin a page.

* False positives (reporting that a page is dma-pinned, when in fact it is not)
  are acceptable, but false negatives are not.

* struct page may not be increased in size for this, and all fields are already
  used.

* Given the above, we can overload the page->_refcount field by using, sort of,
  the upper bits in that field for a dma-pinned count. "Sort of" means that,
  rather than dividing page->_refcount into bit fields, we simply add a
  medium-large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024:
  10 bits) to page->_refcount. This provides fuzzy behavior: if a page has
  get_page() called on it 1024 times, then it will appear to have a single
  dma-pinned count. And again, that's acceptable.

This also leads to limitations: there are only 31-10==21 bits available for a
counter that increments 10 bits at a time.

TODO: for 1GB and larger huge pages, this is cutting it close. That's because
when pin_user_pages() follows such pages, it increments the head page by "1"
(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for
pin_user_pages()) for each tail page. So if you have a 1GB huge page:

* There are 256K (18 bits) worth of 4 KB tail pages.
* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is,
  10 bits at a time)
* There are 21 - 18 == 3 bits available to count. Except that there aren't,
  because you need to allow for a few normal get_page() calls on the head page,
  as well. Fortunately, the approach of using addition, rather than "hard"
  bitfields, within page->_refcount, allows for sharing these bits gracefully.
  But we're still looking at about 8 references.

This, however, is a missing feature more than anything else, because it's easily
solved by addressing an obvious inefficiency in the original get_user_pages()
approach of retrieving pages: stop treating all the pages as if they were
PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of
this, so some work is required. Once that's in place, this limitation mostly
disappears from view, because there will be ample refcounting range available.

* Callers must specifically request "dma-pinned tracking of pages". In other
  words, just calling get_user_pages() will not suffice; a new set of functions,
  pin_user_pages() and related, must be used.

FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
==========================================================

Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
these categories:

CASE 1: Direct IO (DIO)
-----------------------
There are GUP references to pages that are serving
as DIO buffers. These buffers are needed for a relatively short time (so they
are not "long term"). No special synchronization with page_mkclean() or
munmap() is provided. Therefore, flags to set at the call site are: ::

    FOLL_PIN

...but rather than setting FOLL_PIN directly, call sites should use one of
the pin_user_pages*() routines that set FOLL_PIN.

CASE 2: RDMA
------------
There are GUP references to pages that are serving as DMA
buffers. These buffers are needed for a long time ("long term"). No special
synchronization with page_mkclean() or munmap() is provided. Therefore, flags
to set at the call site are: ::

    FOLL_PIN | FOLL_LONGTERM

NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
because DAX pages do not have a separate page cache, and so "pinning" implies
locking down file system blocks, which is not (yet) supported in that way.

CASE 3: Hardware with page faulting support
-------------------------------------------
Here, a well-written driver doesn't normally need to pin pages at all. However,
if the driver does choose to do so, it can register MMU notifiers for the range,
and will be called back upon invalidation. Either way (avoiding page pinning, or
using MMU notifiers to unpin upon request), there is proper synchronization with
both filesystem and mm (page_mkclean(), munmap(), etc).

Therefore, neither flag needs to be set.

In this case, ideally, neither get_user_pages() nor pin_user_pages() should be
called. Instead, the software should be written so that it does not pin pages.
This allows mm and filesystems to operate more efficiently and reliably.

CASE 4: Pinning for struct page manipulation only
-------------------------------------------------
Here, normal GUP calls are sufficient, so neither flag needs to be set.

page_dma_pinned(): the whole point of pinning
=============================================

The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
(and file system writeback code in general) to make informed decisions about
what to do when a page cannot be unmapped due to such pins.

What to do in those cases is the subject of a years-long series of discussions
and debates (see the References at the end of this document). It's a TODO item
here: fill in the details once that's worked out. Meanwhile, it's safe to say
that having this available: ::

    static inline bool page_dma_pinned(struct page *page)

...is a prerequisite to solving the long-running gup+DMA problem.

Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
===================================================================

Another way of thinking about these flags is as a progression of restrictions:
FOLL_GET is for struct page manipulation, without affecting the data that the
struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
will be pinned longterm, and whose data will be accessed.
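
The biased-refcount scheme from "Tracking dma-pinned pages", and the
page_dma_pinned() query it enables, can be sketched as a small user-space
model. This is an illustration only, not kernel code: struct model_page,
MODEL_REFCOUNT_BITS and every model_*() helper are hypothetical names invented
for the sketch. The only values taken from this document are
GUP_PIN_COUNTING_BIAS == 1024 and the 31 usable refcount bits.

```c
/* User-space model only; the real logic lives in the kernel's mm code. */
#include <assert.h>
#include <stdbool.h>

/* Value quoted in this document: 1024, that is, 10 bits. */
#define GUP_PIN_COUNTING_BIAS 1024

/* Assumption taken from the text: 31 usable bits in page->_refcount. */
#define MODEL_REFCOUNT_BITS 31

/* Hypothetical stand-in for struct page; only the refcount is modelled. */
struct model_page {
	int refcount;	/* plays the role of page->_refcount */
};

/* Ordinary reference, as taken by get_page(): +1. */
static inline void model_get_page(struct model_page *page)
{
	page->refcount += 1;
}

/* Drop an ordinary reference: -1. */
static inline void model_put_page(struct model_page *page)
{
	page->refcount -= 1;
}

/* DMA pin: add the whole bias instead of +1. */
static inline void model_pin_page(struct model_page *page)
{
	page->refcount += GUP_PIN_COUNTING_BIAS;
}

/* Release a DMA pin: subtract the bias again. */
static inline void model_unpin_page(struct model_page *page)
{
	assert(page->refcount >= GUP_PIN_COUNTING_BIAS);
	page->refcount -= GUP_PIN_COUNTING_BIAS;
}

/*
 * "Is this page DMA-pinned?" May report a false positive (many plain
 * references), but never a false negative while a pin is held.
 */
static inline bool model_page_dma_pinned(const struct model_page *page)
{
	return page->refcount >= GUP_PIN_COUNTING_BIAS;
}

/*
 * How many times a huge page can be pinned when the head page is
 * biased once per base-sized tail page, ignoring plain references.
 */
static inline unsigned long long
model_max_huge_page_pins(unsigned long long huge_size,
			 unsigned long long base_size)
{
	unsigned long long bias_steps =
		(1ULL << MODEL_REFCOUNT_BITS) / GUP_PIN_COUNTING_BIAS;
	unsigned long long tail_pages = huge_size / base_size;

	return bias_steps / tail_pages;
}
```

In this model, one model_pin_page() call makes model_page_dma_pinned() return
true, and so do 1024 model_get_page() calls on an unpinned page, which is the
acceptable false positive described earlier. For a 1GB huge page made of 4 KB
pages, model_max_huge_page_pins() yields 8, matching the "about 8 references"
estimate.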

Unit testing
============
This file::

    tools/testing/selftests/vm/gup_benchmark.c

has the following new calls to exercise the new pin*() wrapper functions:

* PIN_FAST_BENCHMARK (./gup_benchmark -a)
* PIN_BENCHMARK (./gup_benchmark -b)

You can monitor how many total dma-pinned pages have been acquired and released
since the system was booted, via two new /proc/vmstat entries: ::

    /proc/vmstat/nr_foll_pin_requested
    /proc/vmstat/nr_foll_pin_returned

Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
because there is a noticeable performance drop in unpin_user_page() when they
are activated.

References
==========

* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_

John Hubbard, October, 2019