============================
Transparent Hugepage Support
============================

This document describes design principles for Transparent Hugepage (THP)
support and its interaction with other parts of the memory management
system.

Design principles
=================

- "graceful fallback": mm components which don't have transparent hugepage
  knowledge fall back to breaking a huge pmd mapping into a table of ptes
  and, if necessary, splitting a transparent hugepage. Therefore these
  components can continue working on the regular pages or regular pte
  mappings.

- if a hugepage allocation fails because of memory fragmentation,
  regular pages should be gracefully allocated instead and mixed in
  the same vma without any failure or significant delay and without
  userland noticing

- if some task quits and more hugepages become available (either
  immediately in the buddy or through the VM), guest physical memory
  backed by regular pages should be relocated onto hugepages
  automatically (with khugepaged)

- it doesn't require memory reservation and in turn it uses hugepages
  whenever possible (the only possible reservation here is kernelcore=
  to prevent unmovable pages from fragmenting all the memory, but such a
  tweak is not specific to transparent hugepage support and it's a
  generic feature that applies to all dynamic high-order allocations in
  the kernel)

get_user_pages and follow_page
==============================

get_user_pages and follow_page, if run on a hugepage, will return the
head or tail pages as usual (exactly as they would do on
hugetlbfs). Most GUP users will only care about the actual physical
address of the page and its temporary pinning to release after the I/O
is complete, so they won't ever notice that the page is huge. But
if any driver is going to inspect the page structure of a tail page
(for example to check page->mapping or other fields that are relevant
for the head page and not the tail page), it should be updated to look
at the head page instead. Taking a reference on any head/tail page would
prevent the page from being split by anyone.

.. note::
   These aren't new constraints on the GUP API; they match the
   constraints that already apply to hugetlbfs, so any driver capable
   of handling GUP on hugetlbfs will also work fine on transparent
   hugepage backed mappings.
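For example, a GUP user that wants to look at metadata kept on the head
page can run compound_head() on whatever page GUP returned. A minimal
sketch, assuming a made-up helper name (inspect_mapping()) that is not
taken from any real driver::

    #include <linux/mm.h>
    #include <linux/printk.h>

    /*
     * Pin one user page and inspect its ->mapping.  If GUP returned a THP
     * tail page, ->mapping is only meaningful on the head page, so always
     * go through compound_head() before touching such fields.
     */
    static int inspect_mapping(unsigned long addr)
    {
            struct page *page, *head;
            int ret;

            ret = get_user_pages_fast(addr, 1, 0, &page);
            if (ret != 1)
                    return ret < 0 ? ret : -EFAULT;

            head = compound_head(page);
            pr_info("mapping=%p\n", head->mapping);

            put_page(page);         /* release the temporary GUP pin */
            return 0;
    }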
Graceful fallback
=================

Code walking pagetables but unaware of huge pmds can simply call
split_huge_pmd(vma, pmd, addr) where the pmd is the one returned by
pmd_offset. It's trivial to make the code transparent hugepage aware
by just grepping for "pmd_offset" and adding split_huge_pmd where
missing after pmd_offset returns the pmd. Thanks to the graceful
fallback design, with a one-liner change, you can avoid writing
hundreds if not thousands of lines of complex code to make your code
hugepage aware.

If you're not walking pagetables but you run into a physical hugepage
that you can't handle natively in your code, you can split it by
calling split_huge_page(page). This is what the Linux VM does, for
example, before it tries to swap out the hugepage. split_huge_page()
can fail if the page is pinned, and you must handle this correctly.

Example to make mremap.c transparent hugepage aware with a one-liner
change::

        diff --git a/mm/mremap.c b/mm/mremap.c
        --- a/mm/mremap.c
        +++ b/mm/mremap.c
        @@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru
                        return NULL;

                pmd = pmd_offset(pud, addr);
        +       split_huge_pmd(vma, pmd, addr);
                if (pmd_none_or_clear_bad(pmd))
                        return NULL;

Locking in hugepage aware code
==============================

We want as much code as possible to be hugepage aware, as calling
split_huge_page() or split_huge_pmd() has a cost.

To make pagetable walks huge pmd aware, all you need to do is to call
pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
mmap_lock in read (or write) mode to be sure a huge pmd cannot be
created from under you by khugepaged (khugepaged's collapse_huge_page
takes the mmap_lock in write mode in addition to the anon_vma lock). If
pmd_trans_huge returns false, you just fall back to the old code
paths. If instead pmd_trans_huge returns true, you have to take the
page table lock (pmd_lock()) and re-run pmd_trans_huge. Taking the
page table lock will prevent the huge pmd from being converted into a
regular pmd from under you (split_huge_pmd can run in parallel to the
pagetable walk). If the second pmd_trans_huge returns false, you
should just drop the page table lock and fall back to the old code as
before. Otherwise, you can proceed to process the huge pmd and the
hugepage natively. Once finished, you can drop the page table lock.
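Putting the pieces together, a pagetable walker that follows the rules
above looks roughly like the sketch below. pmd_trans_huge(), pmd_lock()
and the mmap_lock rule are the real interfaces; walk_one_pmd(),
do_huge_pmd() and do_ptes() are hypothetical names used only for
illustration::

    #include <linux/mm.h>

    /* Hypothetical helpers, declared only to keep the sketch self-contained. */
    void do_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr);
    void do_ptes(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr);

    /* Caller must hold mmap_lock in read (or write) mode. */
    static void walk_one_pmd(struct vm_area_struct *vma, pmd_t *pmd,
                             unsigned long addr)
    {
            spinlock_t *ptl;

            if (pmd_trans_huge(*pmd)) {
                    ptl = pmd_lock(vma->vm_mm, pmd);
                    if (pmd_trans_huge(*pmd)) {
                            /* Still huge: handle the whole pmd natively. */
                            do_huge_pmd(vma, pmd, addr);
                            spin_unlock(ptl);
                            return;
                    }
                    /* Split from under us: drop the lock, use the pte path. */
                    spin_unlock(ptl);
            }

            /* Regular pmd (or just split): fall back to the per-pte code. */
            do_ptes(vma, pmd, addr);
    }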
Refcounts and transparent huge pages
====================================

Refcounting on THP is mostly consistent with refcounting on other compound
pages:

  - get_page()/put_page() and GUP operate on folio->_refcount.

  - ->_refcount in tail pages is always zero: get_page_unless_zero() never
    succeeds on tail pages.

  - map/unmap of a PMD entry for the whole THP increments/decrements
    folio->_entire_mapcount and also increments/decrements
    folio->_nr_pages_mapped by COMPOUND_MAPPED when _entire_mapcount
    goes from -1 to 0 or from 0 to -1.

  - map/unmap of individual pages with a PTE entry increments/decrements
    page->_mapcount and also increments/decrements folio->_nr_pages_mapped
    when page->_mapcount goes from -1 to 0 or from 0 to -1, as this counts
    the number of pages mapped by PTE.

split_huge_page internally has to distribute the refcounts in the head
page to the tail pages before clearing all PG_head/tail bits from the page
structures. It can be done easily for refcounts taken by page table
entries, but we don't have enough information on how to distribute any
additional pins (e.g. from get_user_pages). split_huge_page() fails any
request to split a pinned huge page: it expects the page count to be equal
to the sum of the mapcounts of all sub-pages plus one (the split_huge_page
caller must hold a reference to the head page).

split_huge_page uses migration entries to stabilize page->_refcount and
page->_mapcount of anonymous pages. File pages just get unmapped.

We are safe against physical memory scanners too: the only legitimate way
a scanner can get a reference to a page is get_page_unless_zero().

All tail pages have zero ->_refcount until atomic_add(). This prevents a
scanner from getting a reference to a tail page up to that point. After the
atomic_add() we don't care about the ->_refcount value: we already know how
many references should be uncharged from the head page.

For the head page get_page_unless_zero() will succeed and we don't mind:
it's clear where such references should go after the split, they stay on
the head page.

Note that split_huge_pmd() doesn't have any limitations on refcounting:
a pmd can be split at any point and the operation never fails.
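Because a pinned huge page cannot be split, callers of split_huge_page()
have to be prepared for failure. A minimal sketch, with try_to_split() as
a hypothetical caller; split_huge_page() itself requires the page to be
locked and the caller to hold a reference::

    #include <linux/mm.h>
    #include <linux/huge_mm.h>
    #include <linux/pagemap.h>

    static int try_to_split(struct page *page)
    {
            int ret;

            /* The page must be locked across the split attempt. */
            if (!trylock_page(page))
                    return -EAGAIN;

            ret = split_huge_page(page);
            unlock_page(page);

            /*
             * Non-zero return (e.g. because of extra pins): the page is
             * still compound and must keep being handled as such, or the
             * split retried later.  On success the caller's reference now
             * pins an ordinary, non-compound page.
             */
            return ret;
    }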
Partial unmap and deferred_split_folio()
========================================

Unmapping part of a THP (with munmap() or in any other way) is not going
to free memory immediately. Instead, we detect in page_remove_rmap() that
a subpage of the THP is no longer in use and queue the THP for splitting
if memory pressure comes. Splitting will free up the unused subpages.

Splitting the page right away is not an option due to the locking context
in the place where we can detect the partial unmap. It also might be
counterproductive, since in many cases the partial unmap happens during
exit(2) if a THP crosses a VMA boundary.

The function deferred_split_folio() is used to queue a folio for splitting.
The splitting itself will happen when we get memory pressure via the
shrinker interface.
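The effect is visible from userspace: after a partial munmap() of a
THP-backed range, the unused subpages stay allocated until the deferred
split shrinker has run. A minimal userspace sketch, assuming x86-64's
2MiB PMD size and that THP is enabled on the system::

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    #define HPAGE_SIZE (2UL * 1024 * 1024)

    int main(void)
    {
            /* Over-map so a 2MiB-aligned, hugepage-sized window can be carved out. */
            char *raw = mmap(NULL, 2 * HPAGE_SIZE, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            char *p = (char *)(((uintptr_t)raw + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1));

            madvise(p, HPAGE_SIZE, MADV_HUGEPAGE);
            memset(p, 1, HPAGE_SIZE);       /* fault the range in, ideally as one THP */

            /*
             * Unmap the second half.  The first half keeps the folio alive, so
             * the unused subpages are only reclaimed later, once the deferred
             * split shrinker runs and the folio is actually split.
             */
            munmap(p + HPAGE_SIZE / 2, HPAGE_SIZE / 2);
            return 0;
    }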