.. _highmem:

====================
High Memory Handling
====================

By: Peter Zijlstra <a.p.zijlstra@chello.nl>

.. contents:: :local:

What Is High Memory?
====================

High memory (highmem) is used when the size of physical memory approaches or
exceeds the maximum size of virtual memory.  At that point it becomes
impossible for the kernel to keep all of the available physical memory mapped
at all times.  This means the kernel needs to start using temporary mappings of
the pieces of physical memory that it wants to access.

The part of (physical) memory not covered by a permanent mapping is what we
refer to as 'highmem'.  There are various architecture dependent constraints on
where exactly that border lies.

In the i386 arch, for example, we choose to map the kernel into every process's
VM space so that we don't have to pay the full TLB invalidation costs for
kernel entry/exit.  This means the available virtual memory space (4GiB on
i386) has to be divided between user and kernel space.

The traditional split for architectures using this approach is 3:1, 3GiB for
userspace and the top 1GiB for kernel space::

		+--------+ 0xffffffff
		| Kernel |
		+--------+ 0xc0000000
		|        |
		| User   |
		|        |
		+--------+ 0x00000000

This means that the kernel can at most map 1GiB of physical memory at any one
time, but because we need virtual address space for other things - including
temporary maps to access the rest of the physical memory - the actual direct
map will typically be less (usually around 896MiB).

Other architectures that have mm context tagged TLBs can have separate kernel
and user maps.  Some hardware (like some ARMs), however, has limited virtual
space when it uses mm context tags.

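The arithmetic behind the 3:1 split described above can be checked with a
small user-space sketch.  This is not kernel code: the MODEL_* macros and the
model_*() helpers are hypothetical names introduced for illustration, the
896MiB figure is the typical direct-map size quoted above rather than a fixed
constant, and the highmem estimate ignores memory holes and any
architecture-specific reservations.

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical constants modelling the traditional i386 3:1 split. */
    #define MODEL_VA_SIZE     (1ULL << 32)    /* 4GiB of virtual address space */
    #define MODEL_PAGE_OFFSET 0xc0000000ULL   /* start of kernel space */
    #define MODEL_DIRECT_MAP  (896ULL << 20)  /* typical direct map, assumed */

    /* Virtual address space left for the kernel: the top 1GiB. */
    uint64_t model_kernel_va_size(void)
    {
        return MODEL_VA_SIZE - MODEL_PAGE_OFFSET;
    }

    /* Kernel address space not used by the direct map (temporary maps etc.). */
    uint64_t model_other_va_size(void)
    {
        return model_kernel_va_size() - MODEL_DIRECT_MAP;
    }

    /* Simplified estimate of how much of the given RAM ends up as highmem. */
    uint64_t model_highmem_size(uint64_t ram_bytes)
    {
        return ram_bytes > MODEL_DIRECT_MAP ? ram_bytes - MODEL_DIRECT_MAP : 0;
    }

Under these assumptions a machine with 2GiB of RAM would have 1152MiB of
highmem, while a machine with 512MiB would have none.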

Temporary Virtual Mappings
==========================

The kernel contains several ways of creating temporary mappings. The following
list shows them in order of preference of use.

* kmap_local_page().  This function is used to request short-term mappings.
  It can be invoked from any context (including interrupts) but the mappings
  can only be used in the context which acquired them.

  This function should be preferred, where feasible, over all the others.

  These mappings are thread-local and CPU-local, meaning that the mapping
  can only be accessed from within this thread and the thread is bound to the
  CPU while the mapping is active. Even if the thread is preempted (since
  preemption is never disabled by the function) the CPU cannot be
  unplugged from the system via CPU-hotplug until the mapping is disposed of.

  It's valid to take pagefaults in a local kmap region, unless the context
  in which the local mapping is acquired does not allow it for other reasons.

  kmap_local_page() always returns a valid virtual address and it is assumed
  that kunmap_local() will never fail.

  Nesting kmap_local_page() and kmap_atomic() mappings is allowed to a certain
  extent (up to KMAP_TYPE_NR) but their invocations have to be strictly ordered
  because the map implementation is stack based. See kmap_local_page() kdocs
  (included in the "Functions" section) for details on how to manage nested
  mappings.

* kmap_atomic().  This permits a very short duration mapping of a single
  page.  Since the mapping is restricted to the CPU that issued it, it
  performs well, but the issuing task is therefore required to stay on that
  CPU until it has finished, lest some other task displace its mappings.

  kmap_atomic() may also be used by interrupt contexts, since it does not
  sleep and the callers too may not sleep until after kunmap_atomic() is
  called.

  Each call of kmap_atomic() in the kernel creates a non-preemptible section
  and disables pagefaults. This could be a source of unwanted latency. Therefore
  users should prefer kmap_local_page() instead of kmap_atomic().

  It is assumed that k[un]map_atomic() won't fail.

* kmap().  This should be used to make a short duration mapping of a single
  page with no restrictions on preemption or migration. It comes with an
  overhead as mapping space is restricted and protected by a global lock
  for synchronization. When the mapping is no longer needed, the address that
  the page was mapped to must be released with kunmap().

  Mapping changes must be propagated across all the CPUs. kmap() also
  requires global TLB invalidation when the kmap's pool wraps and it might
  block when the mapping space is fully utilized until a slot becomes
  available. Therefore, kmap() is only callable from preemptible context.

  All the above work is necessary if a mapping must last for a relatively
  long time but the bulk of high-memory mappings in the kernel are
  short-lived and only used in one place. This means that the cost of
  kmap() is mostly wasted in such cases. kmap() was not intended for long
  term mappings but it has morphed in that direction. Its use is strongly
  discouraged in newer code, and the preceding functions should be preferred.

  On 64-bit systems, calls to kmap_local_page(), kmap_atomic() and kmap() have
  no real work to do because a 64-bit address space is more than sufficient to
  address all the physical memory whose pages are permanently mapped.

* vmap().  This can be used to make a long duration mapping of multiple
  physical pages into a contiguous virtual space.  It needs global
  synchronization to unmap.

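The strict ordering required for nested local mappings follows from the stack
based design mentioned in the kmap_local_page() entry above.  The sketch below
is a user-space model of that idea only, not the kernel implementation: struct
model_kmap_stack, model_map(), model_unmap() and MODEL_MAX_NEST are
hypothetical names, and the depth limit of 16 is an arbitrary assumption.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>

    #define MODEL_MAX_NEST 16   /* hypothetical nesting limit, chosen arbitrarily */

    /* One stack of mapping slots, standing in for the per-context state. */
    struct model_kmap_stack {
        const void *slot[MODEL_MAX_NEST];
        size_t depth;
    };

    /* Push a mapping; returns the slot index, or -1 if the stack is full. */
    int model_map(struct model_kmap_stack *s, const void *page)
    {
        if (s->depth == MODEL_MAX_NEST)
            return -1;
        s->slot[s->depth] = page;
        return (int)s->depth++;
    }

    /* Pop a mapping; only the most recently mapped slot may be released. */
    bool model_unmap(struct model_kmap_stack *s, int idx)
    {
        if (s->depth == 0 || idx != (int)s->depth - 1)
            return false;
        s->slot[--s->depth] = NULL;
        return true;
    }

In this model, mapping page1 and then page2 means page2 has to be unmapped
first; an attempt to release page1 while page2 is still mapped is rejected.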

Cost of Temporary Mappings
==========================

The cost of creating temporary mappings can be quite high.  The arch has to
manipulate the kernel's page tables, the data TLB and/or the MMU's registers.

If CONFIG_HIGHMEM is not set, then the kernel will try to create a mapping
simply with a bit of arithmetic that will convert the page struct address into
a pointer to the page contents rather than juggling mappings about.  In such a
case, the unmap operation may be a null operation.

If CONFIG_MMU is not set, then there can be no temporary mappings and no
highmem.  In such a case, the arithmetic approach will also be used.

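The arithmetic approach described above can be pictured with a user-space
model.  It assumes a flat array of page structures and a contiguous,
permanently mapped region; all model_* names and the sizes used are
hypothetical and are not the kernel's own definitions.

.. code-block:: c

    #include <stddef.h>

    #define MODEL_PAGE_SIZE 4096u   /* assumed page size */
    #define MODEL_NR_PAGES  8u      /* tiny "machine" for illustration */

    /* Stand-in for the page struct; only its address matters here. */
    struct model_page {
        unsigned long flags;
    };

    /* One page struct per page, and the permanently mapped memory itself. */
    static struct model_page model_mem_map[MODEL_NR_PAGES];
    static unsigned char model_direct_map[MODEL_NR_PAGES * MODEL_PAGE_SIZE];

    /* "Map" a page: pure arithmetic from page struct address to contents. */
    void *model_kmap(const struct model_page *page)
    {
        size_t pfn = (size_t)(page - model_mem_map);

        return model_direct_map + pfn * MODEL_PAGE_SIZE;
    }

    /* Nothing was set up, so there is nothing to tear down. */
    void model_kunmap(const void *addr)
    {
        (void)addr;
    }

With such a layout model_kunmap() has nothing to undo, which mirrors the
statement that the unmap operation may be a null operation.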

i386 PAE
========

The i386 arch, under some circumstances, will permit you to stick up to 64GiB
of RAM into your 32-bit machine.  This has a number of consequences:

* Linux needs a page-frame structure for each page in the system and the
  page-frames need to live in the permanent mapping, which means that you can
  have 896M/sizeof(struct page) page-frames at most; with struct page being
  32 bytes that would end up being something on the order of 112G worth of
  pages; the kernel, however, needs to store more than just page-frames in
  that memory...

* PAE makes your page tables larger - which slows the system down as more
  data has to be accessed to traverse them in TLB fills and the like.  One
  advantage is that PAE has more PTE bits and can provide advanced features
  like NX and PAT.

The general recommendation is that you don't use more than 8GiB on a 32-bit
machine - although more might work for you and your workload, you're pretty
much on your own - don't expect kernel developers to really care much if things
come apart.

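The page-frame bound quoted in the list above can be reproduced with a few
lines of user-space C.  The helper names are hypothetical, the 896MiB direct
map and the 32-byte struct page are the figures given above, and the 4KiB page
size is an assumption added here.

.. code-block:: c

    #include <stdint.h>

    /* How many page structs fit if the whole direct map held nothing else. */
    uint64_t model_max_page_frames(uint64_t lowmem_bytes,
                                   uint64_t page_struct_bytes)
    {
        return lowmem_bytes / page_struct_bytes;
    }

    /* Amount of memory that many page frames could describe. */
    uint64_t model_max_described_bytes(uint64_t lowmem_bytes,
                                       uint64_t page_struct_bytes,
                                       uint64_t page_size)
    {
        return model_max_page_frames(lowmem_bytes, page_struct_bytes) *
               page_size;
    }

Calling model_max_described_bytes(896ULL << 20, 32, 4096) yields 112GiB, the
upper bound mentioned above.  It is only a theoretical ceiling, since the
direct map also has to hold everything else the kernel keeps permanently
mapped.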

Functions
=========

.. kernel-doc:: include/linux/highmem.h
.. kernel-doc:: include/linux/highmem-internal.h