.. SPDX-License-Identifier: GPL-2.0

Idmappings
==========

Most filesystem developers will have encountered idmappings. They are used when
reading from or writing ownership to disk, reporting ownership to userspace, or
for permission checking. This document is aimed at filesystem developers that
want to know how idmappings work.

Formal notes
------------

An idmapping is essentially a translation of a range of ids into another or the
same range of ids. The notational convention for idmappings that is widely used
in userspace is::

 u:k:r

``u`` indicates the first element in the upper idmapset ``U`` and ``k``
indicates the first element in the lower idmapset ``K``. The ``r`` parameter
indicates the range of the idmapping, i.e. how many ids are mapped. From now
on, we will always prefix ids with ``u`` or ``k`` to make it clear whether
we're talking about an id in the upper or lower idmapset.

To see what this looks like in practice, let's take the following idmapping::

 u22:k10000:r3

and write down the mappings it will generate::

 u22 -> k10000
 u23 -> k10001
 u24 -> k10002

From a mathematical viewpoint ``U`` and ``K`` are well-ordered sets and an
idmapping is an order isomorphism from ``U`` into ``K``. So ``U`` and ``K`` are
order isomorphic. In fact, ``U`` and ``K`` are always well-ordered subsets of
the set of all possible ids usable on a given system.

Looking at this mathematically briefly will help us highlight some properties
that make it easier to understand how we can translate between idmappings. For
example, we know that the inverse idmapping is an order isomorphism as well::

 k10000 -> u22
 k10001 -> u23
 k10002 -> u24

Given that we are dealing with order isomorphisms plus the fact that we're
dealing with subsets we can embed idmappings into each other, i.e. we can
sensibly translate between different idmappings. For example, assume we've been
given the three idmappings::

 1. u0:k10000:r10000
 2. u0:k20000:r10000
 3. u0:k30000:r10000

and id ``k11000`` which has been generated by the first idmapping by mapping
``u1000`` from the upper idmapset down to ``k11000`` in the lower idmapset.

Because we're dealing with order isomorphic subsets it is meaningful to ask
what id ``k11000`` corresponds to in the second or third idmapping. The
straightforward algorithm to use is to apply the inverse of the first idmapping,
mapping ``k11000`` up to ``u1000``. Afterwards, we can map ``u1000`` down using
either the second or the third idmapping. The second idmapping would map
``u1000`` down to ``k21000``. The third idmapping would map ``u1000`` down to
``k31000``.

If we were given the same task for the following three idmappings::

 1. u0:k10000:r10000
 2. u0:k20000:r200
 3. u0:k30000:r300

we would fail to translate as the sets aren't order isomorphic over the full
range of the first idmapping anymore (however, they are order isomorphic over
the full range of the second idmapping). Neither the second nor the third
idmapping contains ``u1000`` in the upper idmapset ``U``. This is equivalent to
not having an id mapped. We can simply say that ``u1000`` is unmapped in the
second and third idmapping. The kernel will report unmapped ids as the
overflowuid ``(uid_t)-1`` or overflowgid ``(gid_t)-1`` to userspace.

The algorithm to calculate what a given id maps to is pretty simple. First, we
need to verify that the range can contain our target id. We will skip this step
for simplicity. After that if we want to know what ``id`` maps to we can do
simple calculations:

- If we want to map from left to right::

   u:k:r
   id - u + k = n

- If we want to map from right to left::

   u:k:r
   id - k + u = n

Instead of "left to right" we can also say "down" and instead of "right to
left" we can also say "up". Obviously mapping down and up invert each other.

To see whether the simple formulas above work, consider the following two
idmappings::

 1. u0:k20000:r10000
 2. u500:k30000:r10000

Assume we are given ``k21000`` in the lower idmapset of the first idmapping. We
want to know what id this was mapped from in the upper idmapset of the first
idmapping. So we're mapping up in the first idmapping::

 id     - k      + u  = n
 k21000 - k20000 + u0 = u1000

Now assume we are given the id ``u1100`` in the upper idmapset of the second
idmapping and we want to know what this id maps down to in the lower idmapset
of the second idmapping. This means we're mapping down in the second
idmapping::

 id    - u    + k      = n
 u1100 - u500 + k30000 = k30600
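
The two formulas, together with the range check that was skipped for
simplicity, can be sketched in C. This is only an illustration under stated
assumptions: ``struct idmap``, ``map_down()``, ``map_up()`` and the
``ID_UNMAPPED`` marker are hypothetical names chosen for this sketch and are
not kernel interfaces.

```c
#include <stdint.h>

/* Assumed marker for "unmapped"; mirrors the (uid_t)-1 used in the text. */
#define ID_UNMAPPED UINT32_MAX

/* One idmapping written as u:k:r. */
struct idmap {
	uint32_t u; /* first id in the upper idmapset */
	uint32_t k; /* first id in the lower idmapset */
	uint32_t r; /* number of ids that are mapped */
};

/* Map down (left to right): id - u + k, if id lies in [u, u + r). */
static inline uint32_t map_down(struct idmap m, uint32_t id)
{
	if (id < m.u || id - m.u >= m.r)
		return ID_UNMAPPED;
	return id - m.u + m.k;
}

/* Map up (right to left): id - k + u, if id lies in [k, k + r). */
static inline uint32_t map_up(struct idmap m, uint32_t id)
{
	if (id < m.k || id - m.k >= m.r)
		return ID_UNMAPPED;
	return id - m.k + m.u;
}
```

With these helpers, mapping ``k21000`` up in ``u0:k20000:r10000`` gives
``u1000`` and mapping ``u1100`` down in ``u500:k30000:r10000`` gives
``k30600``, matching the calculations above.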

General notes
-------------

In the context of the kernel an idmapping can be interpreted as mapping a range
of userspace ids into a range of kernel ids::

 userspace-id:kernel-id:range

A userspace id is always an element in the upper idmapset of an idmapping of
type ``uid_t`` or ``gid_t`` and a kernel id is always an element in the lower
idmapset of an idmapping of type ``kuid_t`` or ``kgid_t``. From now on
"userspace id" will be used to refer to the well known ``uid_t`` and ``gid_t``
types and "kernel id" will be used to refer to ``kuid_t`` and ``kgid_t``.

The kernel is mostly concerned with kernel ids. They are used when performing
permission checks and are stored in an inode's ``i_uid`` and ``i_gid`` fields.
A userspace id on the other hand is an id that is reported to userspace by the
kernel, or is passed by userspace to the kernel, or a raw device id that is
written to or read from disk.

Note that we are only concerned with idmappings as the kernel stores them, not
how userspace would specify them.

For the rest of this document we will prefix all userspace ids with ``u`` and
all kernel ids with ``k``. Ranges of idmappings will be prefixed with ``r``. So
an idmapping will be written as ``u0:k10000:r10000``.

For example, the id ``u1000`` is an id in the upper idmapset or "userspace
idmapset" starting with ``u0``. And it is mapped to ``k11000`` which is a
kernel id in the lower idmapset or "kernel idmapset" starting with ``k10000``.

A kernel id is always created by an idmapping. Such idmappings are associated
with user namespaces. Since we mainly care about how idmappings work we're not
going to be concerned with how idmappings are created nor how they are used
outside of the filesystem context. This is best left to an explanation of user
namespaces.

The initial user namespace is special. It always has an idmapping of the
following form::

 u0:k0:r4294967295

which is an identity idmapping over the full range of ids available on this
system.

Other user namespaces usually have non-identity idmappings such as::

 u0:k10000:r10000

When a process creates or wants to change ownership of a file, or when the
ownership of a file is read from disk by a filesystem, the userspace id is
immediately translated into a kernel id according to the idmapping associated
with the relevant user namespace.

For instance, consider a file that is stored on disk by a filesystem as being
owned by ``u1000``:

- If a filesystem were to be mounted in the initial user namespace (as most
  filesystems are) then the initial idmapping will be used. As we saw this is
  simply the identity idmapping. This would mean id ``u1000`` read from disk
  would be mapped to id ``k1000``. So an inode's ``i_uid`` and ``i_gid`` fields
  would contain ``k1000``.

- If a filesystem were to be mounted with an idmapping of ``u0:k10000:r10000``
  then ``u1000`` read from disk would be mapped to ``k11000``. So an inode's
  ``i_uid`` and ``i_gid`` would contain ``k11000``.

Translation algorithms
----------------------

We've already seen briefly that it is possible to translate between different
idmappings. We'll now take a closer look at how that works.

Crossmapping
~~~~~~~~~~~~

This translation algorithm is used by the kernel in quite a few places. For
example, it is used when reporting back the ownership of a file to userspace
via the ``stat()`` system call family.

If we've been given ``k11000`` from one idmapping we can map that id up in
another idmapping. In order for this to work both idmappings need to contain
the same kernel id in their kernel idmapsets. For example, consider the
following idmappings::

 1. u0:k10000:r10000
 2. u20000:k10000:r10000

and we are mapping ``u1000`` down to ``k11000`` in the first idmapping. We can
then translate ``k11000`` into a userspace id in the second idmapping using the
kernel idmapset of the second idmapping::

 /* Map the kernel id up into a userspace id in the second idmapping. */
 from_kuid(u20000:k10000:r10000, k11000) = u21000

Note how we can get back to the kernel id in the first idmapping by inverting
the algorithm::

 /* Map the userspace id down into a kernel id in the second idmapping. */
 make_kuid(u20000:k10000:r10000, u21000) = k11000

 /* Map the kernel id up into a userspace id in the first idmapping. */
 from_kuid(u0:k10000:r10000, k11000) = u1000

This algorithm allows us to answer the question what userspace id a given
kernel id corresponds to in a given idmapping. In order to be able to answer
this question both idmappings need to contain the same kernel id in their
respective kernel idmapsets.

For example, when the kernel reads a raw userspace id from disk it maps it down
into a kernel id according to the idmapping associated with the filesystem.
Let's assume the filesystem was mounted with an idmapping of
``u0:k20000:r10000`` and it reads a file owned by ``u1000`` from disk. This
means ``u1000`` will be mapped to ``k21000`` which is what will be stored in
the inode's ``i_uid`` and ``i_gid`` fields.

When someone in userspace calls ``stat()`` or a related function to get
ownership information about the file the kernel can't simply map the id back up
according to the filesystem's idmapping as this would give the wrong owner if
the caller is using an idmapping.

So the kernel will map the id back up in the idmapping of the caller. Let's
assume the caller has the somewhat unconventional idmapping
``u3000:k20000:r10000`` then ``k21000`` would map back up to ``u4000``.
Consequently the user would see that this file is owned by ``u4000``.
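
The ``stat()`` walk-through above can be sketched in C as "map down in the
filesystem's idmapping, then map up in the caller's idmapping". The names
``struct idmap``, ``sketch_make_kuid()``, ``sketch_from_kuid()``,
``crossmap()`` and ``ID_UNMAPPED`` are hypothetical and only model the
arithmetic; they are not the kernel's ``make_kuid()`` and ``from_kuid()``.

```c
#include <stdint.h>

/* Assumed marker for "unmapped"; mirrors the (uid_t)-1 used in the text. */
#define ID_UNMAPPED UINT32_MAX

/* One idmapping written as u:k:r. */
struct idmap {
	uint32_t u, k, r;
};

/* Model of mapping a userspace id down into a kernel id. */
static inline uint32_t sketch_make_kuid(struct idmap m, uint32_t uid)
{
	if (uid < m.u || uid - m.u >= m.r)
		return ID_UNMAPPED;
	return uid - m.u + m.k;
}

/* Model of mapping a kernel id up into a userspace id. */
static inline uint32_t sketch_from_kuid(struct idmap m, uint32_t kuid)
{
	if (kuid < m.k || kuid - m.k >= m.r)
		return ID_UNMAPPED;
	return kuid - m.k + m.u;
}

/* Crossmapping: both idmappings must contain the same kernel id. */
static inline uint32_t crossmap(struct idmap fs, struct idmap caller,
				uint32_t disk_uid)
{
	uint32_t kuid = sketch_make_kuid(fs, disk_uid);

	if (kuid == ID_UNMAPPED)
		return ID_UNMAPPED;
	return sketch_from_kuid(caller, kuid);
}
```

For the filesystem idmapping ``u0:k20000:r10000`` and the caller idmapping
``u3000:k20000:r10000`` this reports ``u1000`` on disk as ``u4000``. A caller
whose kernel idmapset does not contain ``k21000`` gets the unmapped marker
instead.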

Remapping
~~~~~~~~~

It is possible to translate a kernel id from one idmapping to another one via
the userspace idmapset of the two idmappings. This is equivalent to remapping
a kernel id.

Let's look at an example. We are given the following two idmappings::

 1. u0:k10000:r10000
 2. u0:k20000:r10000

and we are given ``k11000`` in the first idmapping. In order to translate this
kernel id in the first idmapping into a kernel id in the second idmapping we
need to perform two steps:

1. Map the kernel id up into a userspace id in the first idmapping::

    /* Map the kernel id up into a userspace id in the first idmapping. */
    from_kuid(u0:k10000:r10000, k11000) = u1000

2. Map the userspace id down into a kernel id in the second idmapping::

    /* Map the userspace id down into a kernel id in the second idmapping. */
    make_kuid(u0:k20000:r10000, u1000) = k21000

As you can see we used the userspace idmapset in both idmappings to translate
the kernel id in one idmapping to a kernel id in another idmapping.

This allows us to answer the question what kernel id we would need to use to
get the same userspace id in another idmapping. In order to be able to answer
this question both idmappings need to contain the same userspace id in their
respective userspace idmapsets.

Note how we can easily get back to the kernel id in the first idmapping by
inverting the algorithm:

1. Map the kernel id up into a userspace id in the second idmapping::

    /* Map the kernel id up into a userspace id in the second idmapping. */
    from_kuid(u0:k20000:r10000, k21000) = u1000

2. Map the userspace id down into a kernel id in the first idmapping::

    /* Map the userspace id down into a kernel id in the first idmapping. */
    make_kuid(u0:k10000:r10000, u1000) = k11000

Another way to look at this translation is to treat it as inverting one
idmapping and applying another idmapping if both idmappings have the relevant
userspace id mapped. This will come in handy when working with idmapped mounts.
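
Remapping can be sketched in C as "invert the first idmapping, then apply the
second". As before, ``struct idmap``, ``sketch_make_kuid()``,
``sketch_from_kuid()``, ``remap()`` and ``ID_UNMAPPED`` are hypothetical names
used only for illustration.

```c
#include <stdint.h>

/* Assumed marker for "unmapped"; mirrors the (uid_t)-1 used in the text. */
#define ID_UNMAPPED UINT32_MAX

/* One idmapping written as u:k:r. */
struct idmap {
	uint32_t u, k, r;
};

/* Model of mapping a userspace id down into a kernel id. */
static inline uint32_t sketch_make_kuid(struct idmap m, uint32_t uid)
{
	if (uid < m.u || uid - m.u >= m.r)
		return ID_UNMAPPED;
	return uid - m.u + m.k;
}

/* Model of mapping a kernel id up into a userspace id. */
static inline uint32_t sketch_from_kuid(struct idmap m, uint32_t kuid)
{
	if (kuid < m.k || kuid - m.k >= m.r)
		return ID_UNMAPPED;
	return kuid - m.k + m.u;
}

/* Remapping: both idmappings must contain the same userspace id. */
static inline uint32_t remap(struct idmap from, struct idmap to, uint32_t kuid)
{
	uint32_t uid = sketch_from_kuid(from, kuid);

	if (uid == ID_UNMAPPED)
		return ID_UNMAPPED;
	return sketch_make_kuid(to, uid);
}
```

Swapping the two idmapping arguments inverts the translation, which is the
round trip shown in the two numbered lists above.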

Invalid translations
~~~~~~~~~~~~~~~~~~~~

It is never valid to use an id in the kernel idmapset of one idmapping as the
id in the userspace idmapset of another or the same idmapping. While the kernel
idmapset always indicates an idmapset in the kernel id space the userspace
idmapset indicates a userspace id. So the following translations are forbidden::

 /* Map the userspace id down into a kernel id in the first idmapping. */
 make_kuid(u0:k10000:r10000, u1000) = k11000

 /* INVALID: Map the kernel id down into a kernel id in the second idmapping. */
 make_kuid(u10000:k20000:r10000, k11000) = k21000
                                 ~~~~~~

and equally wrong::

 /* Map the kernel id up into a userspace id in the first idmapping. */
 from_kuid(u0:k10000:r10000, k11000) = u1000

 /* INVALID: Map the userspace id up into a userspace id in the second idmapping. */
 from_kuid(u20000:k0:r10000, u1000) = k21000
                             ~~~~~

Since userspace ids have type ``uid_t`` and ``gid_t`` and kernel ids have type
``kuid_t`` and ``kgid_t`` the compiler will throw an error when they are
conflated. So the two examples above would cause a compilation failure.
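
How distinct types let the compiler reject such conflation can be sketched in
C. The sketch assumes that kernel ids are wrapped in a single-member struct
while userspace ids stay plain integers; ``sketch_uid_t``, ``sketch_kuid_t``,
``struct idmap`` and the two helpers are hypothetical stand-ins, not the
kernel's definitions.

```c
#include <stdint.h>

typedef uint32_t sketch_uid_t;                  /* stand-in for uid_t */
typedef struct { uint32_t val; } sketch_kuid_t; /* stand-in for kuid_t */

/* One idmapping written as u:k:r. */
struct idmap {
	uint32_t u, k, r;
};

/* Map a userspace id down; only accepts the plain integer type. */
static inline sketch_kuid_t sketch_make_kuid(struct idmap m, sketch_uid_t uid)
{
	sketch_kuid_t kuid = { UINT32_MAX }; /* assumed "unmapped" marker */

	if (uid >= m.u && uid - m.u < m.r)
		kuid.val = uid - m.u + m.k;
	return kuid;
}

/* Map a kernel id up; only accepts the wrapped type. */
static inline sketch_uid_t sketch_from_kuid(struct idmap m, sketch_kuid_t kuid)
{
	if (kuid.val >= m.k && kuid.val - m.k < m.r)
		return kuid.val - m.k + m.u;
	return UINT32_MAX; /* assumed "unmapped" marker */
}

/*
 * Passing a sketch_kuid_t to sketch_make_kuid() or a sketch_uid_t to
 * sketch_from_kuid() does not compile: a struct cannot be passed where
 * an integer is expected, and vice versa.
 */
```

The wrapper costs nothing at runtime in this sketch; its only purpose is to
turn the two invalid translations into type errors.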

Idmappings when creating filesystem objects
-------------------------------------------

The concepts of mapping an id down or mapping an id up are expressed in the two
kernel functions filesystem developers are rather familiar with and which we've
already used in this document::

 /* Map the userspace id down into a kernel id. */
 make_kuid(idmapping, uid)

 /* Map the kernel id up into a userspace id. */
 from_kuid(idmapping, kuid)

We will take an abbreviated look into how idmappings figure into creating
filesystem objects. For simplicity we will only look at what happens when the
VFS has already completed path lookup right before it calls into the filesystem
itself. So we're concerned with what happens when e.g. ``vfs_mkdir()`` is
called. We will also assume that the directory we're creating filesystem
objects in is readable and writable for everyone.

When creating a filesystem object the kernel will look at the caller's
filesystem ids. These are just regular ``uid_t`` and ``gid_t`` userspace ids
but they are exclusively used when determining file ownership which is why they
are called "filesystem ids". They are usually identical to the uid and gid of
the caller but can differ. We will just assume they are always identical to not
get lost in too many details.

When the caller enters the kernel two things happen:

1. Map the caller's userspace ids down into kernel ids in the caller's
   idmapping.
   (To be precise, the kernel will simply look at the kernel ids stashed in the
   credentials of the current task but for our education we'll pretend this
   translation happens just in time.)
2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
   filesystem's idmapping.

The second step is important as regular filesystems will ultimately need to map
the kernel id back up into a userspace id when writing to disk. So with the
second step the kernel guarantees that a valid userspace id can be written to
disk. If it can't the kernel will refuse the creation request to not even
remotely risk filesystem corruption.

The astute reader will have realized that this is simply a variation of the
crossmapping algorithm we mentioned above in a previous section. First, the
kernel maps the caller's userspace id down into a kernel id according to the
caller's idmapping and then maps that kernel id up according to the
filesystem's idmapping.
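
The two steps can be sketched in C as a single check. This is a simplified
model under stated assumptions: ``struct idmap``, ``sketch_make_kuid()``,
``sketch_from_kuid()``, ``sketch_can_create()`` and ``ID_UNMAPPED`` are
hypothetical names, and the sketch ignores everything except the id
arithmetic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed marker for "unmapped"; mirrors the (uid_t)-1 used in the text. */
#define ID_UNMAPPED UINT32_MAX

/* One idmapping written as u:k:r. */
struct idmap {
	uint32_t u, k, r;
};

/* Model of mapping a userspace id down into a kernel id. */
static inline uint32_t sketch_make_kuid(struct idmap m, uint32_t uid)
{
	if (uid < m.u || uid - m.u >= m.r)
		return ID_UNMAPPED;
	return uid - m.u + m.k;
}

/* Model of mapping a kernel id up into a userspace id. */
static inline uint32_t sketch_from_kuid(struct idmap m, uint32_t kuid)
{
	if (kuid < m.k || kuid - m.k >= m.r)
		return ID_UNMAPPED;
	return kuid - m.k + m.u;
}

/*
 * Step 1: map the caller's id down in the caller's idmapping.
 * Step 2: verify the result maps up in the filesystem's idmapping.
 */
static inline bool sketch_can_create(struct idmap caller, struct idmap fs,
				     uint32_t fsuid)
{
	uint32_t kuid = sketch_make_kuid(caller, fsuid);

	return kuid != ID_UNMAPPED &&
	       sketch_from_kuid(fs, kuid) != ID_UNMAPPED;
}
```

The check succeeds only when the caller's kernel id is also part of the
filesystem's kernel idmapset, which is the crossmapping precondition.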

Let's see some examples with caller/filesystem idmapping but without mount
idmappings. This will exhibit some problems we can hit. After that we will
revisit/reconsider these examples, this time using mount idmappings, to see how
they can solve the problems we observed before.

Example 1
~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k0:r4294967295
 filesystem idmapping: u0:k0:r4294967295

Both the caller and the filesystem use the identity idmapping:

1. Map the caller's userspace ids into kernel ids in the caller's idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping.

   For this second step the kernel will call the function
   ``fsuidgid_has_mapping()`` which ultimately boils down to calling
   ``from_kuid()``::

    from_kuid(u0:k0:r4294967295, k1000) = u1000

In this example both idmappings are the same so there's nothing exciting going
on. Ultimately the userspace id that lands on disk will be ``u1000``.

Example 2
~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k20000:r10000

1. Map the caller's userspace ids down into kernel ids in the caller's
   idmapping::

    make_kuid(u0:k10000:r10000, u1000) = k11000

2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k20000:r10000, k11000) = u-1

It's immediately clear that while the caller's userspace id could be
successfully mapped down into kernel ids in the caller's idmapping the kernel
ids could not be mapped up according to the filesystem's idmapping. So the
kernel will deny this creation request.

Note that while this example is less common, because most filesystems can't be
mounted with non-initial idmappings, this is a general problem as we can see in
the next examples.

Example 3
~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k0:r4294967295

1. Map the caller's userspace ids down into kernel ids in the caller's
   idmapping::

    make_kuid(u0:k10000:r10000, u1000) = k11000

2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k0:r4294967295, k11000) = u11000

We can see that the translation always succeeds. The userspace id that the
filesystem will ultimately put to disk will always be identical to the value of
the kernel id that was created in the caller's idmapping. This has two main
consequences.

First, we can't allow a caller to ultimately write to disk with another
userspace id. We could only do this if we were to mount the whole filesystem
with the caller's or another idmapping. But that solution is limited to a few
filesystems and not very flexible. But this is a use-case that is pretty
important in containerized workloads.

Second, the caller will usually not be able to create any files or access
directories that have stricter permissions because none of the filesystem's
kernel ids map up into valid userspace ids in the caller's idmapping:

1. Map raw userspace ids down to kernel ids in the filesystem's idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Map kernel ids up to userspace ids in the caller's idmapping::

    from_kuid(u0:k10000:r10000, k1000) = u-1

Example 4
~~~~~~~~~

::

 file id:              u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k0:r4294967295

In order to report ownership to userspace the kernel uses the crossmapping
algorithm introduced in a previous section:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Map the kernel id up into a userspace id in the caller's idmapping::

    from_kuid(u0:k10000:r10000, k1000) = u-1

The crossmapping algorithm fails in this case because the kernel id in the
filesystem idmapping cannot be mapped up to a userspace id in the caller's
idmapping. Thus, the kernel will report the ownership of this file as the
overflowid.

Example 5
~~~~~~~~~

::

 file id:              u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k20000:r10000

In order to report ownership to userspace the kernel uses the crossmapping
algorithm introduced in a previous section:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k20000:r10000, u1000) = k21000

2. Map the kernel id up into a userspace id in the caller's idmapping::

    from_kuid(u0:k10000:r10000, k21000) = u-1

Again, the crossmapping algorithm fails in this case because the kernel id in
the filesystem idmapping cannot be mapped to a userspace id in the caller's
idmapping. Thus, the kernel will report the ownership of this file as the
overflowid.

Note how in the last two examples things would be simple if the caller were
using the initial idmapping. For a filesystem mounted with the initial
idmapping it would be trivial. So we only consider a filesystem with an
idmapping of ``u0:k20000:r10000``:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k20000:r10000, u1000) = k21000

2. Map the kernel id up into a userspace id in the caller's idmapping::

    from_kuid(u0:k0:r4294967295, k21000) = u21000

Idmappings on idmapped mounts
-----------------------------

As the examples in the previous section show, a caller's idmapping that is
incompatible with the filesystem's idmapping causes various issues for
workloads. For a more complex but common example, consider two containers
started on the host. To completely prevent the two containers from affecting
each other, an administrator may often use different non-overlapping idmappings
for the two containers::

 container1 idmapping:  u0:k10000:r10000
 container2 idmapping:  u0:k20000:r10000
 filesystem idmapping:  u0:k30000:r10000

An administrator wanting to provide easy read-write access to the following set
of files::

 dir id:       u0
 dir/file1 id: u1000
 dir/file2 id: u2000

to both containers currently can't.

Of course the administrator has the option to recursively change ownership via
``chown()``. For example, they could change ownership so that ``dir`` and all
files below it can be crossmapped from the filesystem's into the container's
idmapping. Let's assume they change ownership so it is compatible with the
first container's idmapping::

 dir id:       u10000
 dir/file1 id: u11000
 dir/file2 id: u12000

This would still leave ``dir`` rather useless to the second container. In fact,
``dir`` and all files below it would continue to appear owned by the overflowid
for the second container.

Or consider another increasingly popular example. Some service managers such as
systemd implement a concept called "portable home directories". A user may want
to use their home directories on different machines where they are assigned
different login userspace ids. Most users will have ``u1000`` as the login id
on their machine at home and all files in their home directory will usually be
owned by ``u1000``. At uni or at work they may have another login id such as
``u1125``. This makes it rather difficult to interact with their home directory
on their work machine.

In both cases changing ownership recursively has grave implications. The most
obvious one is that ownership is changed globally and permanently. In the home
directory case this change in ownership would even need to happen every time
the user switches from their home to their work machine. For really large sets
of files this becomes increasingly costly.

If the user is lucky, they are dealing with a filesystem that is mountable
inside user namespaces. But this would also change ownership globally and the
change in ownership is tied to the lifetime of the filesystem mount, i.e. the
superblock. The only way to change ownership is to completely unmount the
filesystem and mount it again in another user namespace. This is usually
impossible because it would mean that all users currently accessing the
filesystem can't anymore. And it means that ``dir`` still can't be shared
between two containers with different idmappings. But usually the user doesn't
even have this option since most filesystems aren't mountable inside
containers. And not having them mountable might be desirable as it doesn't
require the filesystem to deal with malicious filesystem images.

But the use cases mentioned above and more can be handled by idmapped mounts.
They allow exposing the same set of dentries with different ownership at
different mounts. This is achieved by marking the mounts with a user namespace
through the ``mount_setattr()`` system call. The idmapping associated with it
is then used to translate from the caller's idmapping to the filesystem's
idmapping and vice versa using the remapping algorithm we introduced above.

Idmapped mounts make it possible to change ownership in a temporary and
localized way. The ownership changes are restricted to a specific mount and the
ownership changes are tied to the lifetime of the mount. All other users and
locations where the filesystem is exposed are unaffected.

Filesystems that support idmapped mounts don't have any real reason to support
being mountable inside user namespaces. A filesystem could be exposed
completely under an idmapped mount to get the same effect. This has the
advantage that filesystems can leave the creation of the superblock to
privileged users in the initial user namespace.

However, it is perfectly possible to combine idmapped mounts with filesystems
mountable inside user namespaces. We will touch on this further below.

Filesystem types vs idmapped mount types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With the introduction of idmapped mounts we need to distinguish between
filesystem ownership and mount ownership of a VFS object such as an inode. The
owner of an inode might be different when looked at from a filesystem
perspective than when looked at from an idmapped mount. Such fundamental
conceptual distinctions should almost always be clearly expressed in the code.
So, to distinguish idmapped mount ownership from filesystem ownership separate
types have been introduced.

If a uid or gid has been generated using the filesystem or caller's idmapping
then we will use the ``kuid_t`` and ``kgid_t`` types. However, if a uid or gid
has been generated using a mount idmapping then we will be using the dedicated
``vfsuid_t`` and ``vfsgid_t`` types.

All VFS helpers that generate or take uids and gids as arguments use the
``vfsuid_t`` and ``vfsgid_t`` types and we will be able to rely on the compiler
to catch errors that originate from conflating filesystem and VFS uids and gids.

The ``vfsuid_t`` and ``vfsgid_t`` types are often mapped from and to ``kuid_t``
and ``kgid_t`` types similar to how ``kuid_t`` and ``kgid_t`` types are mapped
from and to ``uid_t`` and ``gid_t`` types::

 uid_t <--> kuid_t <--> vfsuid_t
 gid_t <--> kgid_t <--> vfsgid_t

Whenever we report ownership based on a ``vfsuid_t`` or ``vfsgid_t`` type,
e.g., during ``stat()``, or store ownership information in a shared VFS object
based on a ``vfsuid_t`` or ``vfsgid_t`` type, e.g., during ``chown()`` we can
use the ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()`` helpers.

To illustrate why these helpers currently exist, consider what happens when we
change ownership of an inode from an idmapped mount. After we generated
a ``vfsuid_t`` or ``vfsgid_t`` based on the mount idmapping we later commit to
this ``vfsuid_t`` or ``vfsgid_t`` to become the new filesystem-wide ownership.
Thus, we are turning the ``vfsuid_t`` or ``vfsgid_t`` into a global ``kuid_t``
or ``kgid_t``. And this can be done by using ``vfsuid_into_kuid()`` and
``vfsgid_into_kgid()``.

Note, whenever a shared VFS object, e.g., a cached ``struct inode`` or a cached
``struct posix_acl``, stores ownership information a filesystem or "global"
``kuid_t`` and ``kgid_t`` must be used. Ownership expressed via ``vfsuid_t``
and ``vfsgid_t`` is specific to an idmapped mount.

We already noted that ``vfsuid_t`` and ``vfsgid_t`` types are generated based
on mount idmappings whereas ``kuid_t`` and ``kgid_t`` types are generated based
on filesystem idmappings. To prevent abusing filesystem idmappings to generate
``vfsuid_t`` or ``vfsgid_t`` types or mount idmappings to generate ``kuid_t``
or ``kgid_t`` types filesystem idmappings and mount idmappings are different
types as well.

All helpers that map to or from ``vfsuid_t`` and ``vfsgid_t`` types require
a mount idmapping to be passed which is of type ``struct mnt_idmap``. Passing
a filesystem or caller idmapping will cause a compilation error.

Similar to how we prefix all userspace ids in this document with ``u`` and all
kernel ids with ``k`` we will prefix all VFS ids with ``v``. So a mount
idmapping will be written as: ``u0:v10000:r10000``.
689
690Remapping helpers
691~~~~~~~~~~~~~~~~~
692
693Idmapping functions were added that translate between idmappings. They make use
694of the remapping algorithm we've introduced earlier. We're going to look at:
695
696- ``i_uid_into_vfsuid()`` and ``i_gid_into_vfsgid()``
697
698  The ``i_*id_into_vfs*id()`` functions translate filesystem's kernel ids into
699  VFS ids in the mount's idmapping::
700
701   /* Map the filesystem's kernel id up into a userspace id in the filesystem's idmapping. */
702   from_kuid(filesystem, kid) = uid
703
704   /* Map the filesystem's userspace id down ito a VFS id in the mount's idmapping. */
705   make_kuid(mount, uid) = kuid
706
707- ``mapped_fsuid()`` and ``mapped_fsgid()``
708
709  The ``mapped_fs*id()`` functions translate the caller's kernel ids into
710  kernel ids in the filesystem's idmapping. This translation is achieved by
711  remapping the caller's VFS ids using the mount's idmapping::
712
713   /* Map the caller's VFS id up into a userspace id in the mount's idmapping. */
714   from_kuid(mount, kid) = uid
715
716   /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
717   make_kuid(filesystem, uid) = kuid
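
- ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()``

  Whenever a VFS id has to be handed to a helper that expects a kernel id,
  these functions convert it into a kernel id of the same value, as in
  ``k11000 = vfsuid_into_kuid(v11000)``.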
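The two-step remappings listed above can be sketched as a small userspace
model. This is Python, not kernel code: the ``Idmap`` tuple, ``map_up()`` and
``map_down()`` are hypothetical stand-ins for an idmapping, ``from_kuid()``
and ``make_kuid()``, the function signatures are simplified, and the sample
idmappings are chosen here purely for illustration.

```python
from typing import NamedTuple, Optional


class Idmap(NamedTuple):
    upper: int  # first id of the upper idmapset (the "u" side)
    lower: int  # first id of the lower idmapset (the "k" or "v" side)
    count: int  # number of ids covered (the "r" part)


def map_up(m: Idmap, low: Optional[int]) -> Optional[int]:
    """Model of from_kuid(): lower id -> upper id, None if unmapped."""
    if low is not None and m.lower <= low < m.lower + m.count:
        return m.upper + (low - m.lower)
    return None


def map_down(m: Idmap, up: Optional[int]) -> Optional[int]:
    """Model of make_kuid(): upper id -> lower id, None if unmapped."""
    if up is not None and m.upper <= up < m.upper + m.count:
        return m.lower + (up - m.upper)
    return None


def i_uid_into_vfsuid(mount: Idmap, filesystem: Idmap, kid: int) -> Optional[int]:
    """Filesystem kernel id -> VFS id in the mount's idmapping."""
    uid = map_up(filesystem, kid)   # up through the filesystem's idmapping
    return map_down(mount, uid)     # down through the mount's idmapping


def mapped_fsuid(mount: Idmap, filesystem: Idmap, vfsid: int) -> Optional[int]:
    """Caller's VFS id -> kernel id in the filesystem's idmapping."""
    uid = map_up(mount, vfsid)        # up through the mount's idmapping
    return map_down(filesystem, uid)  # down through the filesystem's idmapping


filesystem = Idmap(0, 20000, 10000)  # u0:k20000:r10000 (illustrative)
mount = Idmap(0, 10000, 10000)       # u0:v10000:r10000 (illustrative)

vfsid = i_uid_into_vfsuid(mount, filesystem, 21000)
kid = mapped_fsuid(mount, filesystem, vfsid)
```

Ids outside either idmapping come back as ``None`` in this model, which plays
the role of an unmapped id.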

Note that ``i_*id_into_vfs*id()`` and ``mapped_fs*id()`` invert each other.
Consider the following idmappings::

 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k20000:r10000
 mount idmapping:      u0:v10000:r10000

Assume a file owned by ``u1000`` is read from disk. The filesystem maps this id
to ``k21000`` according to its idmapping. This is what is stored in the
inode's ``i_uid`` and ``i_gid`` fields.

When the caller queries the ownership of this file via ``stat()`` the kernel
would usually simply use the crossmapping algorithm and map the filesystem's
kernel id up to a userspace id in the caller's idmapping.

But when the caller is accessing the file on an idmapped mount the kernel will
first call ``i_uid_into_vfsuid()`` thereby translating the filesystem's kernel
id into a VFS id in the mount's idmapping::

 i_uid_into_vfsuid(k21000):
   /* Map the filesystem's kernel id up into a userspace id. */
   from_kuid(u0:k20000:r10000, k21000) = u1000

   /* Map the filesystem's userspace id down into a VFS id in the mount's idmapping. */
   make_kuid(u0:v10000:r10000, u1000) = v11000

Finally, when the kernel reports the owner to the caller it will turn the
VFS id in the mount's idmapping into a userspace id in the caller's
idmapping::

  k11000 = vfsuid_into_kuid(v11000)
  from_kuid(u0:k10000:r10000, k11000) = u1000

We can test whether this algorithm really works by verifying what happens when
we create a new file. Let's say the user is creating a file with ``u1000``.

The kernel maps this to ``k11000`` in the caller's idmapping. Usually the
kernel would now apply the crossmapping, verifying that ``k11000`` can be
mapped to a userspace id in the filesystem's idmapping. Since ``k11000`` can't
be mapped up in the filesystem's idmapping directly this creation request
fails.

But when the caller is accessing the file on an idmapped mount the kernel will
first call ``mapped_fs*id()`` thereby translating the caller's kernel id,
treated as the VFS id ``v11000``, into a kernel id in the filesystem's
idmapping via the mount's idmapping::

 mapped_fsuid(v11000):
    /* Map the caller's VFS id up into a userspace id in the mount's idmapping. */
    from_kuid(u0:v10000:r10000, v11000) = u1000

    /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
    make_kuid(u0:k20000:r10000, u1000) = k21000

When finally writing to disk the kernel will then map ``k21000`` up into a
userspace id in the filesystem's idmapping::

   from_kuid(u0:k20000:r10000, k21000) = u1000

As we can see, we end up with an invertible and therefore information
preserving algorithm. A file created from ``u1000`` on an idmapped mount will
also be reported as being owned by ``u1000`` and vice versa.
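The invertibility claim can be checked exhaustively for the idmappings used in
this section. The following Python sketch is a userspace model, not kernel
code: ``Idmap``, ``map_up()``, ``map_down()``, ``create_owner()`` and
``report_owner()`` are hypothetical names, with ``map_up()`` and
``map_down()`` standing in for ``from_kuid()`` and ``make_kuid()``.

```python
from typing import NamedTuple, Optional


class Idmap(NamedTuple):
    upper: int  # first id of the upper idmapset
    lower: int  # first id of the lower idmapset
    count: int  # number of ids covered


def map_up(m: Idmap, low: Optional[int]) -> Optional[int]:
    """Model of from_kuid(): lower id -> upper id, None if unmapped."""
    if low is not None and m.lower <= low < m.lower + m.count:
        return m.upper + (low - m.lower)
    return None


def map_down(m: Idmap, up: Optional[int]) -> Optional[int]:
    """Model of make_kuid(): upper id -> lower id, None if unmapped."""
    if up is not None and m.upper <= up < m.upper + m.count:
        return m.lower + (up - m.upper)
    return None


CALLER = Idmap(0, 10000, 10000)      # u0:k10000:r10000
FILESYSTEM = Idmap(0, 20000, 10000)  # u0:k20000:r10000
MOUNT = Idmap(0, 10000, 10000)       # u0:v10000:r10000


def create_owner(uid: int) -> Optional[int]:
    """Creator's userspace id -> userspace id written to disk."""
    kid = map_down(CALLER, uid)
    # mapped_fsuid(): the caller's kernel id is treated as a VFS id.
    fs_kid = map_down(FILESYSTEM, map_up(MOUNT, kid))
    return map_up(FILESYSTEM, fs_kid)


def report_owner(disk_uid: int) -> Optional[int]:
    """Userspace id on disk -> userspace id reported to the caller."""
    fs_kid = map_down(FILESYSTEM, disk_uid)
    # i_uid_into_vfsuid(), then vfsuid_into_kuid() keeps the value.
    vfsid = map_down(MOUNT, map_up(FILESYSTEM, fs_kid))
    return map_up(CALLER, vfsid)


# Every userspace id the caller can use survives create-then-report.
round_trips = all(report_owner(create_owner(u)) == u for u in range(CALLER.count))

# The two remapping helpers undo each other for every filesystem kernel id.
helpers_invert = all(
    map_down(FILESYSTEM, map_up(MOUNT, map_down(MOUNT, map_up(FILESYSTEM, k)))) == k
    for k in range(FILESYSTEM.lower, FILESYSTEM.lower + FILESYSTEM.count)
)
```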

Let's now briefly reconsider the failing examples from earlier in the context
of idmapped mounts.

Example 2 reconsidered
~~~~~~~~~~~~~~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k20000:r10000
 mount idmapping:      u0:v10000:r10000

When the caller is using a non-initial idmapping the common case is to attach
the same idmapping to the mount. We now perform three steps:

1. Map the caller's userspace ids into kernel ids in the caller's idmapping::

    make_kuid(u0:k10000:r10000, u1000) = k11000

2. Translate the caller's VFS id into a kernel id in the filesystem's
   idmapping::

    mapped_fsuid(v11000):
      /* Map the VFS id up into a userspace id in the mount's idmapping. */
      from_kuid(u0:v10000:r10000, v11000) = u1000

      /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
      make_kuid(u0:k20000:r10000, u1000) = k21000

3. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k20000:r10000, k21000) = u1000

So the ownership that lands on disk will be ``u1000``.

Example 3 reconsidered
~~~~~~~~~~~~~~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k0:r4294967295
 mount idmapping:      u0:v10000:r10000

The same translation algorithm works with the third example.

1. Map the caller's userspace ids into kernel ids in the caller's idmapping::

    make_kuid(u0:k10000:r10000, u1000) = k11000

2. Translate the caller's VFS id into a kernel id in the filesystem's
   idmapping::

    mapped_fsuid(v11000):
      /* Map the VFS id up into a userspace id in the mount's idmapping. */
      from_kuid(u0:v10000:r10000, v11000) = u1000

      /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
      make_kuid(u0:k0:r4294967295, u1000) = k1000

3. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k0:r4294967295, k1000) = u1000

So the ownership that lands on disk will be ``u1000``.
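The three creation steps of Examples 2 and 3 reconsidered can be replayed with
a small userspace model. This is a Python sketch, not kernel code:
``Idmap``, ``map_up()``, ``map_down()`` and ``owner_on_disk()`` are
hypothetical names, and passing ``mount=None`` models plain crossmapping
without an idmapped mount.

```python
from typing import NamedTuple, Optional


class Idmap(NamedTuple):
    upper: int  # first id of the upper idmapset
    lower: int  # first id of the lower idmapset
    count: int  # number of ids covered


def map_up(m: Idmap, low: Optional[int]) -> Optional[int]:
    """Model of from_kuid(): lower id -> upper id, None if unmapped."""
    if low is not None and m.lower <= low < m.lower + m.count:
        return m.upper + (low - m.lower)
    return None


def map_down(m: Idmap, up: Optional[int]) -> Optional[int]:
    """Model of make_kuid(): upper id -> lower id, None if unmapped."""
    if up is not None and m.upper <= up < m.upper + m.count:
        return m.lower + (up - m.upper)
    return None


def owner_on_disk(caller: Idmap, filesystem: Idmap, uid: int,
                  mount: Optional[Idmap] = None) -> Optional[int]:
    """Userspace id of the creating caller -> userspace id written to disk."""
    kid = map_down(caller, uid)                          # step 1
    if mount is not None:
        kid = map_down(filesystem, map_up(mount, kid))   # step 2: mapped_fsuid()
    return map_up(filesystem, kid)                       # step 3


CALLER = Idmap(0, 10000, 10000)    # u0:k10000:r10000
MOUNT = Idmap(0, 10000, 10000)     # u0:v10000:r10000
FS_EX2 = Idmap(0, 20000, 10000)    # u0:k20000:r10000
FS_EX3 = Idmap(0, 0, 4294967295)   # u0:k0:r4294967295

ex2_with_mount = owner_on_disk(CALLER, FS_EX2, 1000, MOUNT)
ex3_with_mount = owner_on_disk(CALLER, FS_EX3, 1000, MOUNT)
ex2_without_mount = owner_on_disk(CALLER, FS_EX2, 1000)
ex3_without_mount = owner_on_disk(CALLER, FS_EX3, 1000)
```

With the mount idmapping both examples yield ``1000`` in this model. Without
it the model returns ``None`` for Example 2's idmappings and ``11000`` for
Example 3's.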

Example 4 reconsidered
~~~~~~~~~~~~~~~~~~~~~~

::

 file id:              u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k0:r4294967295
 mount idmapping:      u0:v10000:r10000

In order to report ownership to userspace the kernel now does three steps using
the translation algorithm we introduced earlier:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Translate the kernel id into a VFS id in the mount's idmapping::

    i_uid_into_vfsuid(k1000):
      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
      from_kuid(u0:k0:r4294967295, k1000) = u1000

      /* Map the userspace id down into a VFS id in the mount's idmapping. */
      make_kuid(u0:v10000:r10000, u1000) = v11000

3. Map the VFS id up into a userspace id in the caller's idmapping::

    k11000 = vfsuid_into_kuid(v11000)
    from_kuid(u0:k10000:r10000, k11000) = u1000

Earlier, the file's kernel id couldn't be crossmapped in the caller's
idmapping. With the idmapped mount in place it now can be crossmapped into the
caller's idmapping via the mount's idmapping. The file is now reported as owned
by ``u1000`` according to the mount's idmapping.

Example 5 reconsidered
~~~~~~~~~~~~~~~~~~~~~~

::

 file id:              u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k20000:r10000
 mount idmapping:      u0:v10000:r10000

Again, in order to report ownership to userspace the kernel now does three
steps using the translation algorithm we introduced earlier:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k20000:r10000, u1000) = k21000

2. Translate the kernel id into a VFS id in the mount's idmapping::

    i_uid_into_vfsuid(k21000):
      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
      from_kuid(u0:k20000:r10000, k21000) = u1000

      /* Map the userspace id down into a VFS id in the mount's idmapping. */
      make_kuid(u0:v10000:r10000, u1000) = v11000

3. Map the VFS id up into a userspace id in the caller's idmapping::

    k11000 = vfsuid_into_kuid(v11000)
    from_kuid(u0:k10000:r10000, k11000) = u1000

Earlier, the file's kernel id couldn't be crossmapped in the caller's
idmapping. With the idmapped mount in place it now can be crossmapped into the
caller's idmapping via the mount's idmapping. The file is now reported as owned
by ``u1000`` according to the mount's idmapping.
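The three reporting steps of Examples 4 and 5 reconsidered can be replayed the
same way. This is a Python sketch of a userspace model, not kernel code:
``Idmap``, ``map_up()``, ``map_down()`` and ``reported_owner()`` are
hypothetical names, and ``mount=None`` models reporting without an idmapped
mount.

```python
from typing import NamedTuple, Optional


class Idmap(NamedTuple):
    upper: int  # first id of the upper idmapset
    lower: int  # first id of the lower idmapset
    count: int  # number of ids covered


def map_up(m: Idmap, low: Optional[int]) -> Optional[int]:
    """Model of from_kuid(): lower id -> upper id, None if unmapped."""
    if low is not None and m.lower <= low < m.lower + m.count:
        return m.upper + (low - m.lower)
    return None


def map_down(m: Idmap, up: Optional[int]) -> Optional[int]:
    """Model of make_kuid(): upper id -> lower id, None if unmapped."""
    if up is not None and m.upper <= up < m.upper + m.count:
        return m.lower + (up - m.upper)
    return None


def reported_owner(caller: Idmap, filesystem: Idmap, disk_uid: int,
                   mount: Optional[Idmap] = None) -> Optional[int]:
    """Userspace id on disk -> userspace id reported to the caller."""
    kid = map_down(filesystem, disk_uid)                 # step 1
    if mount is not None:
        # Step 2: i_uid_into_vfsuid(); vfsuid_into_kuid() keeps the value.
        kid = map_down(mount, map_up(filesystem, kid))
    return map_up(caller, kid)                           # step 3


CALLER = Idmap(0, 10000, 10000)    # u0:k10000:r10000
MOUNT = Idmap(0, 10000, 10000)     # u0:v10000:r10000
FS_EX4 = Idmap(0, 0, 4294967295)   # u0:k0:r4294967295
FS_EX5 = Idmap(0, 20000, 10000)    # u0:k20000:r10000

ex4_with_mount = reported_owner(CALLER, FS_EX4, 1000, MOUNT)
ex5_with_mount = reported_owner(CALLER, FS_EX5, 1000, MOUNT)
ex4_without_mount = reported_owner(CALLER, FS_EX4, 1000)
ex5_without_mount = reported_owner(CALLER, FS_EX5, 1000)
```

In this model both examples report ``1000`` through the idmapped mount, while
without it neither file's kernel id can be mapped up in the caller's idmapping.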

Changing ownership on a home directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We've seen above how idmapped mounts can be used to translate between
idmappings when either the caller, the filesystem or both use a non-initial
idmapping. A wide range of usecases exist when the caller is using
a non-initial idmapping. This mostly happens in the context of containerized
workloads. The consequence, as we have seen, is that for both filesystems
mounted with the initial idmapping and filesystems mounted with non-initial
idmappings, access to the filesystem isn't working because the kernel ids can't
be crossmapped between the caller's and the filesystem's idmapping.

As we've seen above idmapped mounts provide a solution to this by remapping the
caller's or filesystem's idmapping according to the mount's idmapping.

Aside from containerized workloads, idmapped mounts have the advantage that
they also work when both the caller and the filesystem use the initial
idmapping, which means users on the host can change the ownership of
directories and files on a per-mount basis.

Consider our previous example where a user has their home directory on portable
storage. At home they have id ``u1000`` and all files in their home directory
are owned by ``u1000`` whereas at uni or work they have login id ``u1125``.

Taking their home directory with them becomes problematic. They can't easily
access their files, they might not be able to write to disk without applying
lax permissions or ACLs and even if they can, they will end up with an annoying
mix of files and directories owned by ``u1000`` and ``u1125``.

Idmapped mounts make it possible to solve this problem. A user can create an
idmapped mount for their home directory on their work computer or their
computer at home depending on what ownership they would prefer to end up on the
portable storage itself.

Let's assume they want all files on disk to belong to ``u1000``. When the user
plugs in their portable storage at their workstation they can set up a job that
creates an idmapped mount with the minimal idmapping ``u1000:v1125:r1``. So now
when they create a file the kernel performs the following steps we already know
from above::

 caller id:            u1125
 caller idmapping:     u0:k0:r4294967295
 filesystem idmapping: u0:k0:r4294967295
 mount idmapping:      u1000:v1125:r1

1. Map the caller's userspace ids into kernel ids in the caller's idmapping::

    make_kuid(u0:k0:r4294967295, u1125) = k1125

2. Translate the caller's VFS id into a kernel id in the filesystem's
   idmapping::

    mapped_fsuid(v1125):
      /* Map the VFS id up into a userspace id in the mount's idmapping. */
      from_kuid(u1000:v1125:r1, v1125) = u1000

      /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
      make_kuid(u0:k0:r4294967295, u1000) = k1000

3. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k0:r4294967295, k1000) = u1000

So ultimately the file will be created with ``u1000`` on disk.
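The creation steps for the portable home directory can be replayed with a
userspace model as well. This is a Python sketch, not kernel code: ``Idmap``,
``map_up()``, ``map_down()`` and ``owner_on_disk()`` are hypothetical names.
The second caller id ``1126`` is an assumption added here to show what the
single-id mount idmapping does with an id it does not cover.

```python
from typing import NamedTuple, Optional


class Idmap(NamedTuple):
    upper: int  # first id of the upper idmapset
    lower: int  # first id of the lower idmapset
    count: int  # number of ids covered


def map_up(m: Idmap, low: Optional[int]) -> Optional[int]:
    """Model of from_kuid(): lower id -> upper id, None if unmapped."""
    if low is not None and m.lower <= low < m.lower + m.count:
        return m.upper + (low - m.lower)
    return None


def map_down(m: Idmap, up: Optional[int]) -> Optional[int]:
    """Model of make_kuid(): upper id -> lower id, None if unmapped."""
    if up is not None and m.upper <= up < m.upper + m.count:
        return m.lower + (up - m.upper)
    return None


def owner_on_disk(caller: Idmap, filesystem: Idmap, mount: Idmap,
                  uid: int) -> Optional[int]:
    """Creator's userspace id -> userspace id written to disk via the mount."""
    kid = map_down(caller, uid)                      # step 1
    kid = map_down(filesystem, map_up(mount, kid))   # step 2: mapped_fsuid()
    return map_up(filesystem, kid)                   # step 3


INITIAL = Idmap(0, 0, 4294967295)  # u0:k0:r4294967295
HOME_MOUNT = Idmap(1000, 1125, 1)  # u1000:v1125:r1

work_user = owner_on_disk(INITIAL, INITIAL, HOME_MOUNT, 1125)
# Assumed extra caller id that the one-id mount idmapping does not cover.
other_user = owner_on_disk(INITIAL, INITIAL, HOME_MOUNT, 1126)
```

In this model the work login ``1125`` lands on disk as ``1000``, while the
uncovered id ``1126`` has no mapping through the mount and yields ``None``.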

Now let's briefly look at what ownership the caller with id ``u1125`` will see
on their work computer:

::

 file id:              u1000
 caller idmapping:     u0:k0:r4294967295
 filesystem idmapping: u0:k0:r4294967295
 mount idmapping:      u1000:v1125:r1

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Translate the kernel id into a VFS id in the mount's idmapping::

    i_uid_into_vfsuid(k1000):
      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
      from_kuid(u0:k0:r4294967295, k1000) = u1000

      /* Map the userspace id down into a VFS id in the mount's idmapping. */
      make_kuid(u1000:v1125:r1, u1000) = v1125

3. Map the VFS id up into a userspace id in the caller's idmapping::

    k1125 = vfsuid_into_kuid(v1125)
    from_kuid(u0:k0:r4294967295, k1125) = u1125

So ultimately the kernel will report to the caller that the file belongs to
``u1125``, which is the caller's userspace id on their workstation in our
example.
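The reporting steps for the portable home directory can be sketched in the
same userspace model. This is Python, not kernel code: ``Idmap``,
``map_up()``, ``map_down()`` and ``reported_owner()`` are hypothetical names,
and ``mount=None`` models a plain mount without an idmapping attached.

```python
from typing import NamedTuple, Optional


class Idmap(NamedTuple):
    upper: int  # first id of the upper idmapset
    lower: int  # first id of the lower idmapset
    count: int  # number of ids covered


def map_up(m: Idmap, low: Optional[int]) -> Optional[int]:
    """Model of from_kuid(): lower id -> upper id, None if unmapped."""
    if low is not None and m.lower <= low < m.lower + m.count:
        return m.upper + (low - m.lower)
    return None


def map_down(m: Idmap, up: Optional[int]) -> Optional[int]:
    """Model of make_kuid(): upper id -> lower id, None if unmapped."""
    if up is not None and m.upper <= up < m.upper + m.count:
        return m.lower + (up - m.upper)
    return None


def reported_owner(caller: Idmap, filesystem: Idmap, disk_uid: int,
                   mount: Optional[Idmap] = None) -> Optional[int]:
    """Userspace id on disk -> userspace id reported to the caller."""
    kid = map_down(filesystem, disk_uid)                 # step 1
    if mount is not None:
        # Step 2: i_uid_into_vfsuid(); vfsuid_into_kuid() keeps the value.
        kid = map_down(mount, map_up(filesystem, kid))
    return map_up(caller, kid)                           # step 3


INITIAL = Idmap(0, 0, 4294967295)  # u0:k0:r4294967295
HOME_MOUNT = Idmap(1000, 1125, 1)  # u1000:v1125:r1

seen_via_idmapped_mount = reported_owner(INITIAL, INITIAL, 1000, HOME_MOUNT)
seen_via_plain_mount = reported_owner(INITIAL, INITIAL, 1000)
```

Through the idmapped mount the on-disk id ``1000`` is reported as ``1125`` in
this model. Through a plain mount with the initial idmappings it is reported
unchanged as ``1000``.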

The raw userspace id that is put on disk is ``u1000`` so when the user takes
their home directory back to their home computer where they are assigned
``u1000`` using the initial idmapping and mount the filesystem with the initial
idmapping they will see all those files owned by ``u1000``.
