=============================
No-MMU memory mapping support
=============================

The kernel has limited support for memory mapping under no-MMU conditions, such
as are used in uClinux environments. From the userspace point of view, memory
mapping is used in conjunction with the mmap() system call, the shmat() call
and the execve() system call. From the kernel's point of view, execve() mapping
is actually performed by the binfmt drivers, which call back into the mmap()
routines to do the actual work.

Memory mapping behaviour also involves the way fork(), vfork(), clone() and
ptrace() work. Under uClinux there is no fork(), and clone() must be supplied
the CLONE_VM flag.
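
Because fork() is unavailable, userspace that needs to start another program
typically uses vfork() followed immediately by an exec call. The following is a
minimal sketch of that pattern; the helper name ``spawn_and_wait`` is
hypothetical and not part of any kernel or libc interface.

```c
#define _GNU_SOURCE		/* for vfork() under strict -std= modes */
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run the program at path and return its exit status, or -1 on failure.
 * The child created by vfork() borrows the parent's memory, so it does
 * nothing except call execv() or _exit().
 */
int spawn_and_wait(const char *path, char *const argv[])
{
	int status;
	pid_t pid;

	assert(path != NULL && argv != NULL);

	pid = vfork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		execv(path, argv);
		_exit(127);	/* only reached if execv() failed */
	}
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Called with ``/bin/sh`` and the arguments ``sh -c "exit 3"``, the helper is
expected to return 3.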

The behaviour is similar between the MMU and no-MMU cases, but not identical,
and it is also much more restricted in the latter case:

 (#) Anonymous mapping, MAP_PRIVATE

	In the MMU case: VM regions backed by arbitrary pages; copy-on-write
	across fork.

	In the no-MMU case: VM regions backed by arbitrary contiguous runs of
	pages.

 (#) Anonymous mapping, MAP_SHARED

	These behave very much like private mappings, except that they're
	shared across fork() or clone() without CLONE_VM in the MMU case. Since
	the no-MMU case supports neither fork() nor clone() without CLONE_VM,
	behaviour is identical to MAP_PRIVATE there.

 (#) File, MAP_PRIVATE, PROT_READ / PROT_EXEC, !PROT_WRITE

	In the MMU case: VM regions backed by pages read from file; changes to
	the underlying file are reflected in the mapping; copied across fork.

	In the no-MMU case:

	 - The kernel will re-use an existing mapping of the same segment of
	   the same file, if one exists with compatible permissions, even if
	   it was created by another process.

	 - If possible, the file mapping will be directly on the backing device
	   if the backing device has the NOMMU_MAP_DIRECT capability and
	   appropriate mapping protection capabilities. Ramfs, romfs, cramfs
	   and mtd might all permit this.

	 - If the backing device can't or won't permit direct sharing,
	   but does have the NOMMU_MAP_COPY capability, then a copy of the
	   appropriate bit of the file will be read into a contiguous bit of
	   memory and any extraneous space beyond the EOF will be cleared.

	 - Writes to the file do not affect the mapping; writes to the mapping
	   are visible in other processes (no MMU protection), but should not
	   happen.

 (#) File, MAP_PRIVATE, PROT_READ / PROT_EXEC, PROT_WRITE

	In the MMU case: like the non-PROT_WRITE case, except that the pages in
	question get copied before the write actually happens. From that point
	on writes to the file underneath that page no longer get reflected into
	the mapping's backing pages. The page is then backed by swap instead.

	In the no-MMU case: works much like the non-PROT_WRITE case, except
	that a copy is always taken and never shared.

 (#) Regular file / blockdev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE

	In the MMU case: VM regions backed by pages read from file; changes to
	pages written back to file; writes to file reflected into pages backing
	mapping; shared across fork.

	In the no-MMU case: not supported.

 (#) Memory backed regular file, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE

	In the MMU case: As for ordinary regular files.

	In the no-MMU case: The filesystem providing the memory-backed file
	(such as ramfs or tmpfs) may choose to honour an open, truncate, mmap
	sequence by providing a contiguous sequence of pages to map. In that
	case, a shared-writable memory mapping will be possible. It will work
	as for the MMU case. If the filesystem does not provide any such
	support, then the mapping request will be denied.

 (#) Memory backed blockdev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE

	In the MMU case: As for ordinary regular files.

	In the no-MMU case: As for memory backed regular files, but the
	blockdev must be able to provide a contiguous run of pages without
	truncate being called. The ramdisk driver could do this if it allocated
	all its memory as a contiguous array upfront.

 (#) Memory backed chardev, MAP_SHARED, PROT_READ / PROT_EXEC / PROT_WRITE

	In the MMU case: As for ordinary regular files.

	In the no-MMU case: The character device driver may choose to honour
	the mmap() by providing direct access to the underlying device if it
	provides memory or quasi-memory that can be accessed directly. Examples
	of such are frame buffers and flash devices. If the driver does not
	provide any such support, then the mapping request will be denied.
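
As an illustration of the read-only private file case above, the sketch below
maps the start of a file without PROT_WRITE and only reads through the mapping.
It passes a NULL address hint and no MAP_FIXED, so the kernel chooses the
address. The helper name ``check_private_ro_mapping`` is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the first len bytes of path privately and read-only, then compare
 * them with expect. Returns 0 on a match, -1 on a mismatch or failure.
 */
int check_private_ro_mapping(const char *path, const char *expect, size_t len)
{
	void *p;
	int fd, rc;

	assert(path != NULL && expect != NULL && len > 0);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* NULL hint and no MAP_FIXED: the kernel picks the address. */
	p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);		/* the mapping outlives the descriptor */
	if (p == MAP_FAILED)
		return -1;

	/* Read only: without an MMU, stray writes may reach other users. */
	rc = memcmp(p, expect, len) == 0 ? 0 : -1;

	/* Unmap the whole mapping with the same address and length. */
	munmap(p, len);
	return rc;
}
```

The code makes no assumption about the alignment of the returned pointer and
never stores through it, so it behaves the same whether the kernel maps the
backing device directly or hands back a copy.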


Further notes on no-MMU MMAP
============================

 (#) A request for a private mapping of a file may return a buffer that is not
     page-aligned.  This is because XIP may take place, and the data may not be
     page-aligned in the backing store.

 (#) A request for an anonymous mapping will always be page-aligned.  If
     possible, the size of the request should be a power of two; otherwise some
     of the space may be wasted, as the kernel must allocate a power-of-2
     granule but will only discard the excess if appropriately configured, as
     this has an effect on fragmentation.

 (#) The memory allocated by a request for an anonymous mapping will normally
     be cleared by the kernel before being returned in accordance with the
     Linux man pages (ver 2.22 or later).

     In the MMU case this can be achieved with reasonable performance as
     regions are backed by virtual pages, with the contents only being mapped
     to cleared physical pages when a write happens on that specific page
     (prior to which, the pages are effectively mapped to the global zero page
     from which reads can take place).  This spreads out the time it takes to
     initialize the contents of a page - depending on the write-usage of the
     mapping.

     In the no-MMU case, however, anonymous mappings are backed by physical
     pages, and the entire map is cleared at allocation time.  This can cause
     significant delays during a userspace malloc() as the C library does an
     anonymous mapping and the kernel then does a memset for the entire map.

     However, for memory that isn't required to be precleared - such as that
     returned by malloc() - mmap() can take a MAP_UNINITIALIZED flag to
     indicate to the kernel that it shouldn't bother clearing the memory before
     returning it.  Note that CONFIG_MMAP_ALLOW_UNINITIALIZED must be enabled
     to permit this, otherwise the flag will be ignored.

     uClibc uses this to speed up malloc(), and the ELF-FDPIC binfmt uses this
     to allocate the brk and stack region.

 (#) A list of all the private copy and anonymous mappings on the system is
     visible through /proc/maps in no-MMU mode.

 (#) A list of all the mappings in use by a process is visible through
     /proc/<pid>/maps in no-MMU mode.

 (#) Supplying MAP_FIXED or requesting a particular mapping address will
     result in an error.

 (#) Files mapped privately usually have to have a read method provided by the
     driver or filesystem so that the contents can be read into the memory
     allocated if mmap() chooses not to map the backing device directly. An
     error will result if they don't. This is most likely to be encountered
     with character device files, pipes, fifos and sockets.
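
The MAP_UNINITIALIZED note above might be used from userspace roughly as
follows. This is a sketch under stated assumptions: the fallback value for the
flag is the one found in the asm-generic UAPI header and may differ on some
architectures, and the helper name ``alloc_scratch`` is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Assumption: the C library may not expose MAP_UNINITIALIZED. The value
 * below is taken from the asm-generic UAPI header; check the architecture's
 * own <asm/mman.h> before relying on it.
 */
#ifndef MAP_UNINITIALIZED
#define MAP_UNINITIALIZED 0x4000000
#endif

/*
 * Allocate len bytes of anonymous memory that the caller will overwrite
 * anyway. The flag is only a hint, so the memory may still arrive zeroed.
 * Returns NULL on failure.
 */
void *alloc_scratch(size_t len)
{
	void *p;

	assert(len > 0);

	/* NULL hint, no MAP_FIXED: a fixed address is an error on no-MMU. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_UNINITIALIZED, -1, 0);
	if (p == MAP_FAILED && errno == EINVAL) {
		/* Retry without the hint in case the flag was rejected. */
		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	}
	return p == MAP_FAILED ? NULL : p;
}
```

Callers must treat the returned memory as having indeterminate contents and
initialise whatever they later read.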


Interprocess shared memory
==========================

Both SYSV IPC SHM shared memory and POSIX shared memory are supported in NOMMU
mode.  The former is provided through the usual mechanism, the latter through
files created on ramfs or tmpfs mounts.
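
The file-based route follows the open, truncate, mmap sequence. A minimal
sketch is given below; the helper name ``map_shared_file`` is hypothetical, and
on a no-MMU system the path is assumed to be on a ramfs or tmpfs mount.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create (or open) path, size it with ftruncate() and map it shared and
 * writable. Returns the mapping, or NULL on failure.
 */
void *map_shared_file(const char *path, size_t len)
{
	void *p;
	int fd;

	assert(path != NULL && len > 0);

	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return NULL;

	/* Growing the empty file gives the filesystem the size to back. */
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return NULL;
	}

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);
	return p == MAP_FAILED ? NULL : p;
}
```

Two callers that map the same path with the same length should then see each
other's stores.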


Futexes
=======

Futexes are supported in NOMMU mode if the arch supports them.  An error will
be given if an address passed to the futex system call lies outside the
mappings made by a process or if the mapping in which the address lies does not
support futexes (such as an I/O chardev mapping).
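
A small sketch of a futex call on a word that lies inside a mapping owned by
the caller is shown below. The helper names ``alloc_futex_word`` and
``futex_wake`` are hypothetical wrappers around mmap() and the raw futex system
call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Place the futex word in an anonymous mapping made by this process. */
uint32_t *alloc_futex_word(void)
{
	void *p = mmap(NULL, sizeof(uint32_t), PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/*
 * Wake up to nwake waiters blocked on the futex word at uaddr.
 * Returns the number of waiters woken, or -1 with errno set.
 */
long futex_wake(uint32_t *uaddr, int nwake)
{
	assert(uaddr != NULL);
	return syscall(SYS_futex, uaddr, FUTEX_WAKE, nwake, NULL, NULL, 0);
}
```

With no waiters present, the wake call is expected to return 0.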


No-MMU mremap
=============

The mremap() function is partially supported.  It may change the size of a
mapping, and may move it [#]_ if MREMAP_MAYMOVE is specified and if the new size
of the mapping exceeds the size of the slab object currently occupied by the
memory to which the mapping refers, or if a smaller slab object could be used.

MREMAP_FIXED is not supported, though it is ignored if there's no change of
address and the object does not need to be moved.

Shared mappings may not be moved.  Shareable mappings may not be moved either,
even if they are not currently shared.

The mremap() function must be given an exact match for base address and size of
a previously mapped object.  It may not be used to create holes in existing
mappings, move parts of existing mappings or resize parts of mappings.  It must
act on a complete mapping.

.. [#] Not currently supported.
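
A sketch of a resize that stays within these rules: the whole mapping is named
by its original base and length, and no move is requested. The helper name
``resize_in_place`` is hypothetical.

```c
#define _GNU_SOURCE		/* for mremap() */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Resize a complete mapping. base and old_len must match the mapping as
 * it was created. No MREMAP_* flags are passed, so the call fails rather
 * than relocating the mapping. Returns NULL on failure.
 */
void *resize_in_place(void *base, size_t old_len, size_t new_len)
{
	void *p;

	assert(base != NULL && old_len > 0 && new_len > 0);

	p = mremap(base, old_len, new_len, 0);
	return p == MAP_FAILED ? NULL : p;
}
```

Shrinking a private anonymous mapping this way should return the original base
address; growing it may fail, in which case the caller has to allocate a new
mapping and copy the data itself.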


Providing shareable character device support
============================================

To provide shareable character device support, a driver must provide a
file->f_op->get_unmapped_area() operation. The mmap() routines will call this
to get a proposed address for the mapping. This may return an error if it
doesn't wish to honour the mapping because it's too long, at a weird offset,
under some unsupported combination of flags or for any other reason.

The driver should also provide backing device information with capabilities set
to indicate the permitted types of mapping on such devices. The default is
assumed to be readable and writable, not executable, and only shareable
directly (can't be copied).

The file->f_op->mmap() operation will be called to actually set up the
mapping. It can be rejected at that point. Returning -ENOSYS will cause the
mapping to be copied instead if NOMMU_MAP_COPY is specified.

The vm_ops->close() routine will be invoked when the last mapping on a chardev
is removed. An existing mapping will be shared, partially or not, if possible
without notifying the driver.

The file->f_op->get_unmapped_area() operation is also permitted to return
-ENOSYS. This will be taken to mean that the operation doesn't want to handle
the request, despite being provided. For instance, it might try directing the
call to a secondary driver which turns out not to implement it. Such is the
case for the framebuffer driver which attempts to direct the call to the
device-specific driver. Under such circumstances, the mapping request will be
rejected if NOMMU_MAP_COPY is not specified, and a copy mapped otherwise.
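
The outcomes described in this section can be summarised as a small decision
function. This is only a userspace model of the wording above, not kernel code:
every name in it, including ``DEMO_CAP_COPY`` as a stand-in for the
NOMMU_MAP_COPY capability, is hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative stand-in for the NOMMU_MAP_COPY capability bit. */
#define DEMO_CAP_COPY 0x1u

enum demo_outcome { DEMO_DIRECT, DEMO_COPIED, DEMO_REJECTED };

/*
 * gua_ret:  0 or a negative errno from the get_unmapped_area() operation
 * mmap_ret: 0 or a negative errno from the mmap() operation
 * caps:     capability bits advertised for the backing device
 */
enum demo_outcome demo_resolve(int gua_ret, int mmap_ret, unsigned int caps)
{
	int can_copy = (caps & DEMO_CAP_COPY) != 0;

	assert(gua_ret <= 0 && mmap_ret <= 0);

	/* -ENOSYS from either operation means "not handled here". */
	if (gua_ret == -ENOSYS || (gua_ret == 0 && mmap_ret == -ENOSYS))
		return can_copy ? DEMO_COPIED : DEMO_REJECTED;

	/* Any other error is a refusal by the driver. */
	if (gua_ret < 0 || mmap_ret < 0)
		return DEMO_REJECTED;

	return DEMO_DIRECT;
}
```

For example, ``demo_resolve(-ENOSYS, 0, DEMO_CAP_COPY)`` models a driver that
declines to propose an address while copying is allowed, which yields a copied
mapping.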

.. important::

	Some types of device may present a different appearance to anyone
	looking at them in certain modes. Flash chips can be like this; for
	instance if they're in programming or erase mode, you might see the
	status reflected in the mapping, instead of the data.

	In such a case, care must be taken lest userspace see a shared or a
	private mapping showing such information when the driver is busy
	controlling the device. Remember especially: private executable
	mappings may still be mapped directly off the device under some
	circumstances!


Providing shareable memory-backed file support
==============================================

Provision of shared mappings on memory backed files is similar to the provision
of support for shared mapped character devices. The main difference is that the
filesystem providing the service will probably allocate a contiguous collection
of pages and permit mappings to be made on that.

It is recommended that a truncate operation applied to such a file that
increases the file size, if that file is empty, be taken as a request to gather
enough pages to honour a mapping. This is required to support POSIX shared
memory.

Memory backed devices are indicated by the mapping's backing device info having
the memory_backed flag set.


Providing shareable block device support
========================================

Provision of shared mappings on block device files is exactly the same as for
character devices. If there isn't a real device underneath, then the driver
should allocate sufficient contiguous memory to honour any supported mapping.


Adjusting page trimming behaviour
=================================

NOMMU mmap automatically rounds up to the nearest power-of-2 number of pages
when performing an allocation.  This can have adverse effects on memory
fragmentation, and as such, is left configurable.  The default behaviour is to
aggressively trim allocations and discard any excess pages back into the page
allocator.  In order to retain finer-grained control over fragmentation, this
behaviour can either be disabled completely, or bumped up to a higher page
watermark where trimming begins.

Page trimming behaviour is configurable via the sysctl ``vm.nr_trim_pages``.
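
The effect of the watermark can be modelled with a few lines of arithmetic.
This is a sketch, not the kernel's implementation: it assumes that a value of 0
disables trimming, that 1 trims any excess, and that larger values trim only
once the excess reaches that many pages. The function name ``pages_retained``
is hypothetical.

```c
#include <assert.h>

/*
 * Model how many pages stay allocated for a request of `pages` pages,
 * given a trim watermark of nr_trim_pages (assumed semantics: 0 disables
 * trimming, otherwise trim once the excess reaches the watermark).
 */
unsigned long pages_retained(unsigned long pages, unsigned long nr_trim_pages)
{
	unsigned long granule = 1;

	assert(pages > 0 && pages <= (~0UL >> 1));

	/* Round the request up to a power-of-2 number of pages. */
	while (granule < pages)
		granule <<= 1;

	if (nr_trim_pages && granule - pages >= nr_trim_pages)
		return pages;	/* excess pages handed back */
	return granule;		/* excess pages kept */
}
```

Under these assumptions a 5-page request keeps 5 pages with a watermark of 1,
and keeps all 8 pages with a watermark of 0 or 4.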