.. SPDX-License-Identifier: GPL-2.0

====================================================
pin_user_pages() and related calls
====================================================

.. contents:: :local:

Overview
========

This document describes the following functions::

 pin_user_pages()
 pin_user_pages_fast()
 pin_user_pages_remote()

Basic description of FOLL_PIN
=============================

FOLL_PIN and FOLL_LONGTERM are flags that can be passed to the get_user_pages*()
("gup") family of functions. FOLL_PIN has significant interactions and
interdependencies with FOLL_LONGTERM, so both are covered here.

FOLL_PIN is internal to gup, meaning that it should not appear at the gup call
sites. This allows the associated wrapper functions (pin_user_pages*() and
others) to set the correct combination of these flags, and to check for problems
as well.

FOLL_LONGTERM, on the other hand, *is* allowed to be set at the gup call sites.
This is in order to avoid creating a large number of wrapper functions to cover
all combinations of get*(), pin*(), FOLL_LONGTERM, and more. Also, the
pin_user_pages*() APIs are clearly distinct from the get_user_pages*() APIs, so
that's a natural dividing line, and a good point to make separate wrapper calls.
In other words, use pin_user_pages*() for DMA-pinned pages, and
get_user_pages*() for other cases. There are five cases described later on in
this document, to further clarify that concept.

FOLL_PIN and FOLL_GET are mutually exclusive for a given gup call. However,
multiple threads and call sites are free to pin the same struct pages, via both
FOLL_PIN and FOLL_GET. It's just the call site that needs to choose one or the
other, not the struct page(s).

The FOLL_PIN implementation is nearly the same as FOLL_GET, except that FOLL_PIN
uses a different reference counting technique.

FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying that is,
FOLL_LONGTERM is a specific, more restrictive case of FOLL_PIN.

Which flags are set by each wrapper
===================================

For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
flags the caller provides. The caller is required to pass in a non-null struct
pages* array, and the function then pins pages by incrementing each by a special
value: GUP_PIN_COUNTING_BIAS.

For large folios, the GUP_PIN_COUNTING_BIAS scheme is not used. Instead,
the extra space available in the struct folio is used to store the
pincount directly.

This approach for large folios avoids the counting upper limit problems
that are discussed below. Those limitations would have been aggravated
severely by huge pages, because each tail page adds a refcount to the
head page. And in fact, testing revealed that, without a separate pincount
field, refcount overflows were seen in some huge page stress tests.

This also means that huge pages and large folios do not suffer
from the false positives problem that is mentioned below::

 Function
 --------
 pin_user_pages          FOLL_PIN is always set internally by this function.
 pin_user_pages_fast     FOLL_PIN is always set internally by this function.
 pin_user_pages_remote   FOLL_PIN is always set internally by this function.

For these get_user_pages*() functions, FOLL_GET might not even be specified.
Behavior is a little more complex than above. If FOLL_GET was *not* specified,
but the caller passed in a non-null struct pages* array, then the function
sets FOLL_GET for you, and proceeds to pin pages by incrementing the refcount
of each page by +1::

 Function
 --------
 get_user_pages           FOLL_GET is sometimes set internally by this function.
 get_user_pages_fast      FOLL_GET is sometimes set internally by this function.
 get_user_pages_remote    FOLL_GET is sometimes set internally by this function.

Tracking dma-pinned pages
=========================

Some of the key design constraints, and solutions, for tracking dma-pinned
pages:

* An actual reference count, per struct page, is required. This is because
  multiple processes may pin and unpin a page.

* False positives (reporting that a page is dma-pinned, when in fact it is not)
  are acceptable, but false negatives are not.

* struct page may not be increased in size for this, and all fields are already
  used.

* Given the above, we can overload the page->_refcount field by using, sort of,
  the upper bits in that field for a dma-pinned count. "Sort of" means that,
  rather than dividing page->_refcount into bit fields, we simply add a
  medium-large value (GUP_PIN_COUNTING_BIAS, initially chosen to be 1024: 10
  bits) to page->_refcount. This provides fuzzy behavior: if a page has
  get_page() called on it 1024 times, then it will appear to have a single
  dma-pinned count. And again, that's acceptable.

This also leads to limitations: there are only 31-10==21 bits available for a
counter that increments 10 bits at a time.

* Because of that limitation, special handling is applied to the zero pages
  when using FOLL_PIN.  We only pretend to pin a zero page - we don't alter its
  refcount or pincount at all (it is permanent, so there's no need).  The
  unpinning functions also don't do anything to a zero page.  This is
  transparent to the caller.

* Callers must specifically request "dma-pinned tracking of pages". In other
  words, just calling get_user_pages() will not suffice; a new set of functions,
  pin_user_pages() and related, must be used.

FOLL_PIN, FOLL_GET, FOLL_LONGTERM: when to use which flags
==========================================================

Thanks to Jan Kara, Vlastimil Babka and several other -mm people, for describing
these categories:

CASE 1: Direct IO (DIO)
-----------------------
There are GUP references to pages that are serving
as DIO buffers. These buffers are needed for a relatively short time (so they
are not "long term"). No special synchronization with page_mkclean() or
munmap() is provided. Therefore, flags to set at the call site are: ::

    FOLL_PIN

...but rather than setting FOLL_PIN directly, call sites should use one of
the pin_user_pages*() routines that set FOLL_PIN.

CASE 2: RDMA
------------
There are GUP references to pages that are serving as DMA
buffers. These buffers are needed for a long time ("long term"). No special
synchronization with page_mkclean() or munmap() is provided. Therefore, flags
to set at the call site are: ::

    FOLL_PIN | FOLL_LONGTERM

NOTE: Some pages, such as DAX pages, cannot be pinned with longterm pins. That's
because DAX pages do not have a separate page cache, and so "pinning" implies
locking down file system blocks, which is not (yet) supported in that way.

CASE 3: MMU notifier registration, with or without page faulting hardware
-------------------------------------------------------------------------
Device drivers can pin pages via get_user_pages*(), and register for mmu
notifier callbacks for the memory range. Then, upon receiving a notifier
"invalidate range" callback, stop the device from using the range, and unpin
the pages. There may be other possible schemes, such as for example explicitly
synchronizing against pending IO, that accomplish approximately the same thing.

Or, if the hardware supports replayable page faults, then the device driver can
avoid pinning entirely (this is ideal), as follows: register for mmu notifier
callbacks as above, but instead of stopping the device and unpinning in the
callback, simply remove the range from the device's page tables.

Either way, as long as the driver unpins the pages upon mmu notifier callback,
then there is proper synchronization with both filesystem and mm
(page_mkclean(), munmap(), etc). Therefore, neither flag needs to be set.

CASE 4: Pinning for struct page manipulation only
-------------------------------------------------
If only struct page data (as opposed to the actual memory contents that a page
is tracking) is affected, then normal GUP calls are sufficient, and neither flag
needs to be set.

CASE 5: Pinning in order to write to the data within the page
-------------------------------------------------------------
Even though neither DMA nor Direct IO is involved, just a simple case of "pin,
write to a page's data, unpin" can cause a problem. Case 5 may be considered a
superset of Case 1, plus Case 2, plus anything that invokes that pattern. In
other words, if the code is neither Case 1 nor Case 2, it may still require
FOLL_PIN, for patterns like this:

Correct (uses FOLL_PIN calls):
    pin_user_pages()
    write to the data within the pages
    unpin_user_pages()

INCORRECT (uses FOLL_GET calls):
    get_user_pages()
    write to the data within the pages
    put_page()

page_maybe_dma_pinned(): the whole point of pinning
===================================================

The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
(and file system writeback code in general) to make informed decisions about
what to do when a page cannot be unmapped due to such pins.

What to do in those cases is the subject of a years-long series of discussions
and debates (see the References at the end of this document). It's a TODO item
here: fill in the details once that's worked out. Meanwhile, it's safe to say
that having this available: ::

        static inline bool page_maybe_dma_pinned(struct page *page)

...is a prerequisite to solving the long-running gup+DMA problem.

Another way of thinking about FOLL_GET, FOLL_PIN, and FOLL_LONGTERM
===================================================================

Another way of thinking about these flags is as a progression of restrictions:
FOLL_GET is for struct page manipulation, without affecting the data that the
struct page refers to. FOLL_PIN is a *replacement* for FOLL_GET, and is for
short term pins on pages whose data *will* get accessed. As such, FOLL_PIN is
a "more severe" form of pinning. And finally, FOLL_LONGTERM is an even more
restrictive case that has FOLL_PIN as a prerequisite: this is for pages that
will be pinned longterm, and whose data will be accessed.

Unit testing
============
This file::

 tools/testing/selftests/mm/gup_test.c

has the following new calls to exercise the new pin*() wrapper functions:

* PIN_FAST_BENCHMARK (./gup_test -a)
* PIN_BASIC_TEST (./gup_test -b)

You can monitor how many total dma-pinned pages have been acquired and released
since the system was booted, via two new /proc/vmstat entries: ::

    /proc/vmstat/nr_foll_pin_acquired
    /proc/vmstat/nr_foll_pin_released

Under normal conditions, these two values will be equal unless there are any
long-term [R]DMA pins in place, or during pin/unpin transitions.

* nr_foll_pin_acquired: This is the number of logical pins that have been
  acquired since the system was powered on. For huge pages, the head page is
  pinned once for each page (head page and each tail page) within the huge page.
  This follows the same sort of behavior that get_user_pages() uses for huge
  pages: the head page is refcounted once for each tail or head page in the huge
  page, when get_user_pages() is applied to a huge page.

* nr_foll_pin_released: The number of logical pins that have been released since
  the system was powered on. Note that pages are released (unpinned) on a
  PAGE_SIZE granularity, even if the original pin was applied to a huge page.
  Because of the pin count behavior described above in "nr_foll_pin_acquired",
  the accounting balances out, so that after doing this::

    pin_user_pages(huge_page);
    for (each page in huge_page)
        unpin_user_page(page);

...the following is expected::

    nr_foll_pin_released == nr_foll_pin_acquired

(...unless it was already out of balance due to a long-term RDMA pin being in
place.)

Other diagnostics
=================

dump_page() has been enhanced slightly to handle these new counting
fields, and to better report on large folios in general.  Specifically,
for large folios, the exact pincount is reported.

References
==========

* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`_

John Hubbard, October, 2019