=====================
Split page table lock
=====================

Originally, the mm->page_table_lock spinlock protected all page tables of the
mm_struct. This approach leads to poor page fault scalability in
multi-threaded applications due to high contention on the lock. To improve
scalability, the split page table lock was introduced.

With the split page table lock, each page table has its own lock to
serialize access to that table. At the moment the split lock is used for PTE
and PMD tables. Access to higher-level tables is protected by
mm->page_table_lock.

There are helpers to lock/unlock a table and other accessor functions:

 - pte_offset_map_lock()
	maps PTE and takes PTE table lock, returns pointer to PTE with
	pointer to its PTE table lock, or returns NULL if no PTE table;
 - pte_offset_map_nolock()
	maps PTE, returns pointer to PTE with pointer to its PTE table
	lock (not taken), or returns NULL if no PTE table;
 - pte_offset_map()
	maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
 - pte_unmap()
	unmaps PTE table;
 - pte_unmap_unlock()
	unlocks and unmaps PTE table;
 - pte_alloc_map_lock()
	allocates PTE table if needed and takes its lock, returns pointer to
	PTE with pointer to its lock, or returns NULL if allocation failed;
 - pmd_lock()
	takes PMD table lock, returns pointer to taken lock;
 - pmd_lockptr()
	returns pointer to PMD table lock;

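The calling pattern implied by the helpers above can be sketched in a small
userspace model. This is not kernel code: every model_* name, the
MODEL_PTRS_PER_PTE constant and the use of a C11 atomic_flag as the
per-table lock are assumptions made for illustration. The point it shows is
that a lookup can return NULL without taking any lock, and that the caller
must handle this before touching the entry.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MODEL_PTRS_PER_PTE 8	/* hypothetical; the real value is per-arch */

/* Hypothetical stand-in for a PTE table together with its split lock. */
struct model_pte_table {
	atomic_flag ptl;	/* models the per-table spinlock */
	unsigned long pte[MODEL_PTRS_PER_PTE];
};

/* Hypothetical stand-in for a PMD entry; NULL means "no PTE table". */
struct model_pmd {
	struct model_pte_table *table;
};

/* Puts the model table into a known state: unlocked, all entries zero. */
static void model_pte_table_init(struct model_pte_table *table)
{
	atomic_flag_clear(&table->ptl);
	for (size_t i = 0; i < MODEL_PTRS_PER_PTE; i++)
		table->pte[i] = 0;
}

/*
 * Models pte_offset_map_lock(): returns the entry with the table lock
 * held and *ptlp set, or NULL (no lock taken) if there is no PTE table.
 * The index bound check is specific to this model.
 */
static unsigned long *model_pte_offset_map_lock(struct model_pmd *pmd,
						size_t index,
						atomic_flag **ptlp)
{
	struct model_pte_table *table = pmd->table;

	if (!table || index >= MODEL_PTRS_PER_PTE)
		return NULL;
	while (atomic_flag_test_and_set_explicit(&table->ptl,
						 memory_order_acquire))
		;	/* spin until the lock is free */
	*ptlp = &table->ptl;
	return &table->pte[index];
}

/* Models pte_unmap_unlock(): releases the lock taken above. */
static void model_pte_unmap_unlock(unsigned long *pte, atomic_flag *ptl)
{
	assert(pte != NULL && ptl != NULL);
	atomic_flag_clear_explicit(ptl, memory_order_release);
}

/* Caller pattern: check for NULL, modify under the lock, then unlock. */
static int model_set_pte(struct model_pmd *pmd, size_t index,
			 unsigned long val)
{
	atomic_flag *ptl;
	unsigned long *pte = model_pte_offset_map_lock(pmd, index, &ptl);

	if (!pte)
		return -1;	/* no PTE table: caller retries or bails out */
	*pte = val;
	model_pte_unmap_unlock(pte, ptl);
	return 0;
}
```

In this model, model_set_pte() on a model_pmd whose table pointer is NULL
returns -1 and leaves no lock held.
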
Split page table lock for PTE tables is enabled at compile time if
CONFIG_SPLIT_PTLOCK_CPUS (usually 4) is less than or equal to NR_CPUS.
If the split lock is disabled, all tables are guarded by mm->page_table_lock.

Split page table lock for PMD tables is enabled if it is enabled for PTE
tables and the architecture supports it (see below).

Hugetlb and split page table lock
=================================

Hugetlb can support several page sizes. The split lock is used only at the
PMD level, not at the PUD level.

Hugetlb-specific helpers:

 - huge_pte_lock()
	takes the PMD split lock for a PMD_SIZE page, mm->page_table_lock
	otherwise;
 - huge_pte_lockptr()
	returns pointer to table lock;

Support of split page table lock by an architecture
===================================================

No special step is needed to enable the PTE split page table lock: everything
required is done by pagetable_pte_ctor() and pagetable_pte_dtor(), which
must be called on PTE table allocation / freeing.

Make sure the architecture doesn't use slab allocator for page table
allocation: slab uses page->slab_cache for its pages.
This field shares storage with page->ptl.

PMD split lock only makes sense if you have more than two page table
levels.

Enabling the PMD split lock requires a pagetable_pmd_ctor() call on PMD table
allocation and a pagetable_pmd_dtor() call on freeing.

Allocation usually happens in pmd_alloc_one(), freeing in pmd_free() and
pmd_free_tlb(), but make sure you cover all PMD table allocation / freeing
paths: e.g. X86_PAE preallocates a few PMDs in pgd_alloc().

With everything in place you can set CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK.

NOTE: pagetable_pte_ctor() and pagetable_pmd_ctor() can fail; the failure
must be handled properly.

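The constructor/destructor pairing and the failure handling described above
can be sketched as a userspace model. The model_* names, the table layout
and the inject_failure parameter are hypothetical and exist only to make the
failure path visible; the real constructors operate on the kernel's page
table descriptors, which are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a PMD table and its lock state. */
struct model_pmd_table {
	void *ptl;		/* set up by the constructor */
	unsigned long pmd[4];
};

/* Models pagetable_pmd_ctor(): may fail; failure is injected here. */
static bool model_pmd_ctor(struct model_pmd_table *table, bool inject_failure)
{
	if (inject_failure)
		return false;
	table->ptl = malloc(1);
	return table->ptl != NULL;
}

/* Models pagetable_pmd_dtor(): only valid on a constructed table. */
static void model_pmd_dtor(struct model_pmd_table *table)
{
	assert(table->ptl != NULL);
	free(table->ptl);
	table->ptl = NULL;
}

/* Allocation path: undo the allocation if the constructor fails. */
static struct model_pmd_table *model_pmd_alloc_one(bool inject_failure)
{
	struct model_pmd_table *table = calloc(1, sizeof(*table));

	if (!table)
		return NULL;
	if (!model_pmd_ctor(table, inject_failure)) {
		free(table);
		return NULL;
	}
	return table;
}

/* Freeing path: run the destructor before releasing the table. */
static void model_pmd_free(struct model_pmd_table *table)
{
	model_pmd_dtor(table);
	free(table);
}
```

Every allocation path that hands out a table goes through the constructor,
and every freeing path goes through the destructor, which is the property
the text asks an architecture to guarantee.
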
page->ptl
=========

page->ptl is used to access the split page table lock, where 'page' is the
struct page of the page containing the table. It shares storage with
page->private (and a few other fields in the union).

To avoid increasing the size of struct page and to get the best performance,
we use a trick:

 - if spinlock_t fits into a long, we use page->ptl as the spinlock itself,
   so we can avoid indirect access and save a cache line;
 - if spinlock_t is bigger than a long, we use page->ptl as a pointer to
   spinlock_t and allocate it dynamically. This allows the split lock to be
   used with DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC enabled, but costs one more
   cache line for the indirect access.

The spinlock_t is allocated in pagetable_pte_ctor() for a PTE table and in
pagetable_pmd_ctor() for a PMD table.

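The storage trick can be illustrated with a userspace model. All model_*
names and the MODEL_ALLOC_SPLIT_PTLOCKS switch are hypothetical; in
particular, the lock types here are plain structs chosen only for their
size, not real spinlocks. The sketch shows why callers go through an
accessor: the accessor hides whether the field holds the lock itself or a
pointer to it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical switch: set to 0 to model a lock that fits into a long. */
#ifndef MODEL_ALLOC_SPLIT_PTLOCKS
#define MODEL_ALLOC_SPLIT_PTLOCKS 1
#endif

#if MODEL_ALLOC_SPLIT_PTLOCKS
/* Models a debug-enabled lock: bigger than a long. */
typedef struct { long state; long debug[3]; } model_spinlock_t;
_Static_assert(sizeof(model_spinlock_t) > sizeof(long), "must not fit");
#else
/* Models a plain lock: fits into a long. */
typedef struct { int state; } model_spinlock_t;
_Static_assert(sizeof(model_spinlock_t) <= sizeof(long), "must fit");
#endif

/* Hypothetical stand-in for the part of struct page that holds ptl. */
struct model_page {
	union {
		unsigned long private;	/* shares storage with ptl */
#if MODEL_ALLOC_SPLIT_PTLOCKS
		model_spinlock_t *ptl;	/* pointer to an allocated lock */
#else
		model_spinlock_t ptl;	/* lock embedded in place */
#endif
	};
};

/* Models the constructor step that sets up the lock; it can fail. */
static bool model_ptlock_init(struct model_page *page)
{
#if MODEL_ALLOC_SPLIT_PTLOCKS
	page->ptl = calloc(1, sizeof(*page->ptl));
	return page->ptl != NULL;
#else
	page->ptl.state = 0;
	return true;
#endif
}

/* Accessor: the only place that knows which variant is in use. */
static model_spinlock_t *model_ptlock_ptr(struct model_page *page)
{
	assert(page != NULL);
#if MODEL_ALLOC_SPLIT_PTLOCKS
	return page->ptl;	/* one extra indirection */
#else
	return &page->ptl;	/* lock lives inside the descriptor */
#endif
}

/* Models the destructor step; a no-op for the embedded variant. */
static void model_ptlock_free(struct model_page *page)
{
#if MODEL_ALLOC_SPLIT_PTLOCKS
	free(page->ptl);
	page->ptl = NULL;
#else
	(void)page;
#endif
}
```

In both variants the descriptor itself stays pointer-sized; only the
allocated variant pays for a second memory access when the lock is taken.
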
Never access page->ptl directly; use the appropriate helper.