.. SPDX-License-Identifier: GPL-2.0

====================================================
pin_user_pages() and related calls
====================================================

.. contents:: :local:

Overview
========

This document describes the following functions::

    pin_user_pages()
    pin_user_pages_fast()
    pin_user_pages_remote()

Basic description of FOLL_PIN
=============================

FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
("gup") family of functions. FOLL_PIN has significant interactions and
interdependencies with FOLL_LONGTERM, so both are covered here.

FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
sites. This allows the associated wrapper functions (pin_user_pages*() and
others) to set the correct combination of these flags, and to check for problems
as well.

FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
This is in order to avoid creating a large number of wrapper functions to cover
all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
that's a natural dividing line, and a good point to make separate wrapper calls.
In other words, use pin_user_pages*() for DMA-pinned pages, and
get_user_pages*() for other cases. There are four cases described later on in
this document, to further clarify that concept.

FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
multiple threads and call sites are free to pin the same struct pages, via both
FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
other, not the struct page(s).

The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
uses a different reference counting technique.

FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN.

Which flags are set by each wrapper
===================================

For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
flags the caller provides. The caller is required to pass in a non-null struct
pages* array, and the function then pins pages by incrementing each page's
refcount by a special value: GUP_PIN_COUNTING_BIAS.

For huge pages (and in fact, any compound page of more than 2 pages), the
GUP_PIN_COUNTING_BIAS scheme is not used. Instead, an exact form of pin counting
is achieved, by using the 3rd struct page in the compound page. A new struct
page field, hpage_pinned_refcount, has been added in order to support this.

This approach for compound pages avoids the counting upper limit problems that
are discussed below. Those limitations would have been aggravated severely by
huge pages, because each tail page adds a refcount to the head page. And in
fact, testing revealed that, without a separate hpage_pinned_refcount field,
page overflows were seen in some huge page stress tests.

This also means that huge pages and compound pages (of order > 1) do not suffer
from the false positives problem that is mentioned below::

    Function
    --------
    pin_user_pages          FOLL_PIN is always set internally by this function.
    pin_user_pages_fast     FOLL_PIN is always set internally by this function.
    pin_user_pages_remote   FOLL_PIN is always set internally by this function.
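To make the wrapper pattern concrete, here is a simplified sketch of what one of
these functions does: reject FOLL_GET (the two flags are mutually exclusive), OR
in FOLL_PIN, and hand off to gup's internal machinery. This is not the verbatim
mm/gup.c code; in particular, internal_gup_fast() is a placeholder name for the
internal fast-path helper, which is not part of the public API::

    int pin_user_pages_fast(unsigned long start, int nr_pages,
                            unsigned int gup_flags, struct page **pages)
    {
            /* FOLL_PIN and FOLL_GET are mutually exclusive. */
            if (WARN_ON_ONCE(gup_flags & FOLL_GET))
                    return -EINVAL;

            /* Callers never pass FOLL_PIN; the wrapper ORs it in. */
            gup_flags |= FOLL_PIN;

            /* internal_gup_fast() is a stand-in for gup's internal fast path. */
            return internal_gup_fast(start, nr_pages, gup_flags, pages);
    }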
For these get_user_pages*() functions, FOLL_GET might not even be specified.
Behavior is a little more complex than above. If FOLL_GET was *not* specified,
but the caller passed in a non-null struct pages* array, then the function
sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount
of each page by +1::

    Function
    --------
    get_user_pages           FOLL_GET is sometimes set internally by this function.
    get_user_pages_fast      FOLL_GET is sometimes set internally by this function.
    get_user_pages_remote    FOLL_GET is sometimes set internally by this function.

Tracking dma-pinned pages
=========================

Some of the key design constraints, and solutions, for tracking dma-pinned
pages:

* An actual reference count, per struct page, is required. This is because
  multiple processes may pin and unpin a page.

* False positives (reporting that a page is dma-pinned, when in fact it is not)
  are acceptable, but false negatives are not.

* struct page may not be increased in size for this, and all fields are already
  used.

* Given the above, we can overload the page->_refcount field by using, sort of,
  the upper bits in that field for a dma-pinned count. "Sort of" means that,
  rather than dividing page->_refcount into bit fields, we simply add a medium-
  large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10 bits) to
  page->_refcount. This provides fuzzy behavior: if a page has get_page() called
  on it 1024 times, then it will appear to have a single dma-pinned count.
  And again, that's acceptable.

  This also leads to limitations: there are only 31-10==21 bits available for a
  counter that increments in steps of GUP_PIN_COUNTING_BIAS (1024).

* Callers must specifically request "dma-pinned tracking of pages". In other
  words, just calling get_user_pages() will not suffice; a new set of functions,
  pin_user_pages*() and related, must be used.

FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
==========================================================

Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
these categories:

CASE 1: Direct IO (DIO)
-----------------------
There are GUP references to pages that are serving as DIO buffers. These buffers
are needed for a relatively short time (so they are not "long term"). No special
synchronization with page_mkclean() or munmap() is provided. Therefore, flags to
set at the call site are::

    FOLL_PIN

...but rather than setting FOLL_PIN directly, call sites should use one of
the pin_user_pages*() routines that set FOLL_PIN.

CASE 2: RDMA
------------
There are GUP references to pages that are serving as DMA buffers. These buffers
are needed for a long time ("long term"). No special synchronization with
page_mkclean() or munmap() is provided. Therefore, flags to set at the call site
are::

    FOLL_PIN | FOLL_LONGTERM

NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
because DAX pages do not have a separate page cache, and so "pinning" implies
locking down file system blocks, which is not (yet) supported in that way.
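As an illustration of this case, a driver might pin a user buffer for long-term
DMA and later release it roughly as follows. This is only a sketch: the
example_*() helpers and the error handling are made up for this document, and
only the pin_user_pages*()/unpin_user_pages*() calls and the flag choices are
the point. The same pattern, minus FOLL_LONGTERM, applies to the short-term
pins of CASE 1::

    #include <linux/mm.h>
    #include <linux/slab.h>

    static int example_pin_user_buffer(unsigned long user_addr, int nr_pages,
                                       struct page ***pages_out)
    {
            struct page **pages;
            int pinned;

            pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
            if (!pages)
                    return -ENOMEM;

            /* FOLL_PIN is added internally by the pin_user_pages*() wrapper. */
            pinned = pin_user_pages_fast(user_addr, nr_pages,
                                         FOLL_WRITE | FOLL_LONGTERM, pages);
            if (pinned != nr_pages) {
                    /* Release any partial pins before failing. */
                    if (pinned > 0)
                            unpin_user_pages(pages, pinned);
                    kfree(pages);
                    return pinned < 0 ? pinned : -EFAULT;
            }

            *pages_out = pages;
            return 0;
    }

    static void example_unpin_user_buffer(struct page **pages, int nr_pages)
    {
            /* Mark the pages dirty (the device wrote to them) and unpin. */
            unpin_user_pages_dirty_lock(pages, nr_pages, true);
            kfree(pages);
    }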
CASE 3: Hardware with page faulting support
-------------------------------------------
Here, a well-written driver doesn't normally need to pin pages at all. However,
if the driver does choose to do so, it can register MMU notifiers for the range,
and will be called back upon invalidation. Either way (avoiding page pinning, or
using MMU notifiers to unpin upon request), there is proper synchronization with
both filesystem and mm (page_mkclean(), munmap(), etc.).

Therefore, neither flag needs to be set.

In this case, ideally, neither get_user_pages() nor pin_user_pages() should be
called. Instead, the software should be written so that it does not pin pages.
This allows mm and filesystems to operate more efficiently and reliably.

CASE 4: Pinning for struct page manipulation only
-------------------------------------------------
Here, normal GUP calls are sufficient, so neither flag needs to be set.

page_maybe_dma_pinned(): the whole point of pinning
===================================================

The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
(and file system writeback code in general) to make informed decisions about
what to do when a page cannot be unmapped due to such pins.

What to do in those cases is the subject of a years-long series of discussions
and debates (see the References at the end of this document). It's a TODO item
here: fill in the details once that's worked out. Meanwhile, it's safe to say
that having this available::

    static inline bool page_maybe_dma_pinned(struct page *page)

...is a prerequisite to solving the long-running gup+DMA problem.
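For orientation only, the query is conceptually along these lines. This sketch
is reconstructed from the counting rules described earlier in this document
(GUP_PIN_COUNTING_BIAS folded into page->_refcount, and an exact
hpage_pinned_refcount in the 3rd struct page of compound pages with order > 1);
it is not the real mm/ implementation, hence the example_ prefix::

    /* Conceptual sketch only; the real page_maybe_dma_pinned() lives in mm/. */
    static inline bool example_page_maybe_dma_pinned(struct page *page)
    {
            struct page *head = compound_head(page);

            /*
             * Compound pages of order > 1 keep an exact pin count in the
             * hpage_pinned_refcount field of their 3rd struct page.
             */
            if (PageCompound(head) && compound_order(head) > 1)
                    return atomic_read(&head[2].hpage_pinned_refcount) > 0;

            /*
             * Otherwise, the pin count is folded into _refcount in units of
             * GUP_PIN_COUNTING_BIAS, so any refcount of at least one bias
             * unit reads as "maybe pinned". False positives (for example,
             * 1024 ordinary get_page() references) are acceptable by design.
             */
            return page_ref_count(head) >= GUP_PIN_COUNTING_BIAS;
    }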
Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
===================================================================

Another way of thinking about these flags is as a progression of restrictions:
FOLL_GET is for struct page manipulation, without affecting the data that the
struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
will be pinned longterm, and whose data will be accessed.

Unit testing
============
This file::

    tools/testing/selftests/vm/gup_benchmark.c

has the following new calls to exercise the new pin*() wrapper functions:

* PIN_FAST_BENCHMARK (./gup_benchmark -a)
* PIN_BENCHMARK (./gup_benchmark -b)

You can monitor how many total dma-pinned pages have been acquired and released
since the system was booted, via two new fields in /proc/vmstat::

    nr_foll_pin_acquired
    nr_foll_pin_released

Under normal conditions, these two values will be equal unless there are any
long-term [R]DMA pins in place, or during pin/unpin transitions.

* nr_foll_pin_acquired: This is the number of logical pins that have been
  acquired since the system was powered on. For huge pages, the head page is
  pinned once for each page (head page and each tail page) within the huge page.
  This follows the same sort of behavior that get_user_pages() uses for huge
  pages: the head page is refcounted once for each tail or head page in the huge
  page, when get_user_pages() is applied to a huge page.

* nr_foll_pin_released: The number of logical pins that have been released since
  the system was powered on. Note that pages are released (unpinned) on a
  PAGE_SIZE granularity, even if the original pin was applied to a huge page.
  Because of the pin count behavior described above in "nr_foll_pin_acquired",
  the accounting balances out, so that after doing this::

    pin_user_pages(huge_page);
    for (each page in huge_page)
        unpin_user_page(page);

...the following is expected::

    nr_foll_pin_released == nr_foll_pin_acquired

(...unless it was already out of balance due to a long-term RDMA pin being in
place.)

Other diagnostics
=================

dump_page() has been enhanced slightly, to handle these new counting fields, and
to better report on compound pages in general. Specifically, for compound pages
with order > 1, the exact (hpage_pinned_refcount) pincount is reported.

References
==========

* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`_

John Hubbard, October, 2019