.. SPDX-License-Identifier: GPL-2.0

Idmappings
==========

Most filesystem developers will have encountered idmappings. They are used when
reading from or writing ownership to disk, reporting ownership to userspace, or
for permission checking. This document is aimed at filesystem developers that
want to know how idmappings work.

Formal notes
------------

An idmapping is essentially a translation of a range of ids into another or the
same range of ids. The notational convention for idmappings that is widely used
in userspace is::

 u:k:r

``u`` indicates the first element in the upper idmapset ``U`` and ``k``
indicates the first element in the lower idmapset ``K``. The ``r`` parameter
indicates the range of the idmapping, i.e. how many ids are mapped. From now
on, we will always prefix ids with ``u`` or ``k`` to make it clear whether
we're talking about an id in the upper or lower idmapset.

To see what this looks like in practice, let's take the following idmapping::

 u22:k10000:r3

and write down the mappings it will generate::

 u22 -> k10000
 u23 -> k10001
 u24 -> k10002

From a mathematical viewpoint ``U`` and ``K`` are well-ordered sets and an
idmapping is an order isomorphism from ``U`` into ``K``. So ``U`` and ``K`` are
order isomorphic. In fact, ``U`` and ``K`` are always well-ordered subsets of
the set of all possible ids usable on a given system.

Looking at this mathematically briefly will help us highlight some properties
that make it easier to understand how we can translate between idmappings. For
example, we know that the inverse idmapping is an order isomorphism as well::

 k10000 -> u22
 k10001 -> u23
 k10002 -> u24

Given that we are dealing with order isomorphisms plus the fact that we're
dealing with subsets we can embed idmappings into each other, i.e. we can
sensibly translate between different idmappings.
For example, assume we've been
given the three idmappings::

 1. u0:k10000:r10000
 2. u0:k20000:r10000
 3. u0:k30000:r10000

and id ``k11000`` which has been generated by the first idmapping by mapping
``u1000`` from the upper idmapset down to ``k11000`` in the lower idmapset.

Because we're dealing with order isomorphic subsets it is meaningful to ask
what id ``k11000`` corresponds to in the second or third idmapping. The
straightforward algorithm to use is to apply the inverse of the first idmapping,
mapping ``k11000`` up to ``u1000``. Afterwards, we can map ``u1000`` down using
either the second or the third idmapping. The second idmapping would map
``u1000`` down to ``k21000``. The third idmapping would map ``u1000`` down to
``k31000``.

If we were given the same task for the following three idmappings::

 1. u0:k10000:r10000
 2. u0:k20000:r200
 3. u0:k30000:r300

we would fail to translate as the sets aren't order isomorphic over the full
range of the first idmapping anymore (however, they are order isomorphic over
the full range of the second idmapping). Neither the second nor the third
idmapping contains ``u1000`` in the upper idmapset ``U``. This is equivalent to
not having an id mapped. We can simply say that ``u1000`` is unmapped in the
second and third idmapping. The kernel will report unmapped ids as the
overflowuid ``(uid_t)-1`` or overflowgid ``(gid_t)-1`` to userspace.

The algorithm to calculate what a given id maps to is pretty simple. First, we
need to verify that the range can contain our target id. We will skip this step
for simplicity.
After that if we want to know what ``id`` maps to we can do
simple calculations:

- If we want to map from left to right::

   u:k:r
   id - u + k = n

- If we want to map from right to left::

   u:k:r
   id - k + u = n

Instead of "left to right" we can also say "down" and instead of "right to
left" we can also say "up". Obviously mapping down and up invert each other.

To see whether the simple formulas above work, consider the following two
idmappings::

 1. u0:k20000:r10000
 2. u500:k30000:r10000

Assume we are given ``k21000`` in the lower idmapset of the first idmapping. We
want to know what id this was mapped from in the upper idmapset of the first
idmapping. So we're mapping up in the first idmapping::

 id     - k      + u  = n
 k21000 - k20000 + u0 = u1000

Now assume we are given the id ``u1100`` in the upper idmapset of the second
idmapping and we want to know what this id maps down to in the lower idmapset
of the second idmapping. This means we're mapping down in the second
idmapping::

 id    - u    + k      = n
 u1100 - u500 + k30000 = k30600

General notes
-------------

In the context of the kernel an idmapping can be interpreted as mapping a range
of userspace ids into a range of kernel ids::

 userspace-id:kernel-id:range

A userspace id is always an element in the upper idmapset of an idmapping of
type ``uid_t`` or ``gid_t`` and a kernel id is always an element in the lower
idmapset of an idmapping of type ``kuid_t`` or ``kgid_t``. From now on
"userspace id" will be used to refer to the well known ``uid_t`` and ``gid_t``
types and "kernel id" will be used to refer to ``kuid_t`` and ``kgid_t``.

The kernel is mostly concerned with kernel ids. They are used when performing
permission checks and are stored in an inode's ``i_uid`` and ``i_gid`` field.
A userspace id on the other hand is an id that is reported to userspace by the
kernel, or is passed by userspace to the kernel, or a raw device id that is
written or read from disk.

Note that we are only concerned with idmappings as the kernel stores them, not
how userspace would specify them.

For the rest of this document we will prefix all userspace ids with ``u`` and
all kernel ids with ``k``. Ranges of idmappings will be prefixed with ``r``. So
an idmapping will be written as ``u0:k10000:r10000``.

For example, the id ``u1000`` is an id in the upper idmapset or "userspace
idmapset" starting with ``u0``. And it is mapped to ``k11000`` which is a
kernel id in the lower idmapset or "kernel idmapset" starting with ``k10000``.

A kernel id is always created by an idmapping. Such idmappings are associated
with user namespaces. Since we mainly care about how idmappings work we're not
going to be concerned with how idmappings are created nor how they are used
outside of the filesystem context. This is best left to an explanation of user
namespaces.

The initial user namespace is special. It always has an idmapping of the
following form::

 u0:k0:r4294967295

which is an identity idmapping over the full range of ids available on this
system.

Other user namespaces usually have non-identity idmappings such as::

 u0:k10000:r10000

When a process creates or wants to change ownership of a file, or when the
ownership of a file is read from disk by a filesystem, the userspace id is
immediately translated into a kernel id according to the idmapping associated
with the relevant user namespace.

For instance, consider a file that is stored on disk by a filesystem as being
owned by ``u1000``:

- If a filesystem were to be mounted in the initial user namespace (as most
  filesystems are) then the initial idmapping will be used.
  As we saw this is
  simply the identity idmapping. This would mean id ``u1000`` read from disk
  would be mapped to id ``k1000``. So an inode's ``i_uid`` and ``i_gid`` field
  would contain ``k1000``.

- If a filesystem were to be mounted with an idmapping of ``u0:k10000:r10000``
  then ``u1000`` read from disk would be mapped to ``k11000``. So an inode's
  ``i_uid`` and ``i_gid`` would contain ``k11000``.

Translation algorithms
----------------------

We've already seen briefly that it is possible to translate between different
idmappings. We'll now take a closer look at how that works.

Crossmapping
~~~~~~~~~~~~

This translation algorithm is used by the kernel in quite a few places. For
example, it is used when reporting back the ownership of a file to userspace
via the ``stat()`` system call family.

If we've been given ``k11000`` from one idmapping we can map that id up in
another idmapping. In order for this to work both idmappings need to contain
the same kernel id in their kernel idmapsets. For example, consider the
following idmappings::

 1. u0:k10000:r10000
 2. u20000:k10000:r10000

and we are mapping ``u1000`` down to ``k11000`` in the first idmapping. We can
then translate ``k11000`` into a userspace id in the second idmapping using the
kernel idmapset of the second idmapping::

 /* Map the kernel id up into a userspace id in the second idmapping. */
 from_kuid(u20000:k10000:r10000, k11000) = u21000

Note how we can get back to the kernel id in the first idmapping by inverting
the algorithm::

 /* Map the userspace id down into a kernel id in the second idmapping. */
 make_kuid(u20000:k10000:r10000, u21000) = k11000

 /* Map the kernel id up into a userspace id in the first idmapping. */
 from_kuid(u0:k10000:r10000, k11000) = u1000

This algorithm allows us to answer the question what userspace id a given
kernel id corresponds to in a given idmapping. In order to be able to answer
this question both idmappings need to contain the same kernel id in their
respective kernel idmapsets.

For example, when the kernel reads a raw userspace id from disk it maps it down
into a kernel id according to the idmapping associated with the filesystem.
Let's assume the filesystem was mounted with an idmapping of
``u0:k20000:r10000`` and it reads a file owned by ``u1000`` from disk. This
means ``u1000`` will be mapped to ``k21000`` which is what will be stored in
the inode's ``i_uid`` and ``i_gid`` field.

When someone in userspace calls ``stat()`` or a related function to get
ownership information about the file the kernel can't simply map the id back up
according to the filesystem's idmapping as this would give the wrong owner if
the caller is using an idmapping.

So the kernel will map the id back up in the idmapping of the caller. Let's
assume the caller has the somewhat unconventional idmapping
``u3000:k20000:r10000`` then ``k21000`` would map back up to ``u4000``.
Consequently the user would see that this file is owned by ``u4000``.

Remapping
~~~~~~~~~

It is possible to translate a kernel id from one idmapping to another one via
the userspace idmapset of the two idmappings. This is equivalent to remapping
a kernel id.

Let's look at an example. We are given the following two idmappings::

 1. u0:k10000:r10000
 2. u0:k20000:r10000

and we are given ``k11000`` in the first idmapping. In order to translate this
kernel id in the first idmapping into a kernel id in the second idmapping we
need to perform two steps:

1. Map the kernel id up into a userspace id in the first idmapping::

    /* Map the kernel id up into a userspace id in the first idmapping. */
    from_kuid(u0:k10000:r10000, k11000) = u1000

2. Map the userspace id down into a kernel id in the second idmapping::

    /* Map the userspace id down into a kernel id in the second idmapping. */
    make_kuid(u0:k20000:r10000, u1000) = k21000

As you can see we used the userspace idmapset in both idmappings to translate
the kernel id in one idmapping to a kernel id in another idmapping.

This allows us to answer the question what kernel id we would need to use to
get the same userspace id in another idmapping. In order to be able to answer
this question both idmappings need to contain the same userspace id in their
respective userspace idmapsets.

Note how we can easily get back to the kernel id in the first idmapping by
inverting the algorithm:

1. Map the kernel id up into a userspace id in the second idmapping::

    /* Map the kernel id up into a userspace id in the second idmapping. */
    from_kuid(u0:k20000:r10000, k21000) = u1000

2. Map the userspace id down into a kernel id in the first idmapping::

    /* Map the userspace id down into a kernel id in the first idmapping. */
    make_kuid(u0:k10000:r10000, u1000) = k11000

Another way to look at this translation is to treat it as inverting one
idmapping and applying another idmapping if both idmappings have the relevant
userspace id mapped. This will come in handy when working with idmapped mounts.

Invalid translations
~~~~~~~~~~~~~~~~~~~~

It is never valid to use an id in the kernel idmapset of one idmapping as the
id in the userspace idmapset of another or the same idmapping. While the kernel
idmapset always indicates an idmapset in the kernel id space the userspace
idmapset indicates a userspace id.
So the following translations are forbidden::

 /* Map the userspace id down into a kernel id in the first idmapping. */
 make_kuid(u0:k10000:r10000, u1000) = k11000

 /* INVALID: Map the kernel id down into a kernel id in the second idmapping. */
 make_kuid(u10000:k20000:r10000, k11000) = k21000
                                 ~~~~~~

and equally wrong::

 /* Map the kernel id up into a userspace id in the first idmapping. */
 from_kuid(u0:k10000:r10000, k11000) = u1000

 /* INVALID: Map the userspace id up into a userspace id in the second idmapping. */
 from_kuid(u20000:k0:r10000, u1000) = k21000
                             ~~~~~

Since userspace ids have type ``uid_t`` and ``gid_t`` and kernel ids have type
``kuid_t`` and ``kgid_t`` the compiler will throw an error when they are
conflated. So the two examples above would cause a compilation failure.

Idmappings when creating filesystem objects
-------------------------------------------

The concepts of mapping an id down or mapping an id up are expressed in the two
kernel functions filesystem developers are rather familiar with and which we've
already used in this document::

 /* Map the userspace id down into a kernel id. */
 make_kuid(idmapping, uid)

 /* Map the kernel id up into a userspace id. */
 from_kuid(idmapping, kuid)

We will take an abbreviated look into how idmappings figure into creating
filesystem objects. For simplicity we will only look at what happens when the
VFS has already completed path lookup right before it calls into the filesystem
itself. So we're concerned with what happens when e.g. ``vfs_mkdir()`` is
called. We will also assume that the directory we're creating filesystem
objects in is readable and writable for everyone.

When creating a filesystem object the kernel will look at the caller's
filesystem ids.
These are just regular ``uid_t`` and ``gid_t`` userspace ids
but they are exclusively used when determining file ownership which is why they
are called "filesystem ids". They are usually identical to the uid and gid of
the caller but can differ. We will just assume they are always identical to not
get lost in too many details.

When the caller enters the kernel two things happen:

1. Map the caller's userspace ids down into kernel ids in the caller's
   idmapping.
   (To be precise, the kernel will simply look at the kernel ids stashed in the
   credentials of the current task but for our education we'll pretend this
   translation happens just in time.)
2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
   filesystem's idmapping.

The second step is important as regular filesystems will ultimately need to map
the kernel id back up into a userspace id when writing to disk.
So with the second step the kernel guarantees that a valid userspace id can be
written to disk. If it can't the kernel will refuse the creation request to not
even remotely risk filesystem corruption.

The astute reader will have realized that this is simply a variation of the
crossmapping algorithm we mentioned above in a previous section. First, the
kernel maps the caller's userspace id down into a kernel id according to the
caller's idmapping and then maps that kernel id up according to the
filesystem's idmapping.

Let's see some examples with caller/filesystem idmapping but without mount
idmappings. This will exhibit some problems we can hit. After that we will
revisit/reconsider these examples, this time using mount idmappings, to see how
they can solve the problems we observed before.
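
The two steps above can be modeled with plain arithmetic. The following is only
a sketch for illustration: ``struct idmap_model``, ``MODEL_INVALID_ID`` and the
``model_*`` functions are hypothetical names that do not exist in the kernel,
and the sketch assumes that every idmapping consists of a single ``u:k:r``
extent. Unlike the formulas in the formal notes, it also performs the range
check that we skipped there.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a single-extent idmapping "u:k:r". */
struct idmap_model {
	uint32_t u;	/* first id in the upper (userspace) idmapset */
	uint32_t k;	/* first id in the lower (kernel) idmapset */
	uint32_t r;	/* number of ids that are mapped */
};

/* Stands in for an unmapped id, i.e. (uid_t)-1. */
#define MODEL_INVALID_ID UINT32_MAX

/* Map a userspace id down: id - u + k, or MODEL_INVALID_ID if unmapped. */
static uint32_t model_map_down(struct idmap_model m, uint32_t uid)
{
	if (uid < m.u || uid - m.u >= m.r)
		return MODEL_INVALID_ID;
	return uid - m.u + m.k;
}

/* Map a kernel id up: id - k + u, or MODEL_INVALID_ID if unmapped. */
static uint32_t model_map_up(struct idmap_model m, uint32_t kid)
{
	if (kid < m.k || kid - m.k >= m.r)
		return MODEL_INVALID_ID;
	return kid - m.k + m.u;
}

/*
 * Step 1: map the caller's userspace id down in the caller's idmapping.
 * Step 2: verify that the result maps up in the filesystem's idmapping.
 */
static bool model_may_create(struct idmap_model caller,
			     struct idmap_model fs, uint32_t uid)
{
	uint32_t kid = model_map_down(caller, uid);

	if (kid == MODEL_INVALID_ID)
		return false;
	return model_map_up(fs, kid) != MODEL_INVALID_ID;
}
```

With a caller idmapping of ``u0:k10000:r10000`` and a filesystem idmapping of
``u0:k20000:r10000`` the model refuses creation for ``u1000`` because
``k11000`` is not part of the filesystem's kernel idmapset. With the initial
idmapping ``u0:k0:r4294967295`` as the filesystem idmapping the check succeeds.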

Example 1
~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k0:r4294967295
 filesystem idmapping: u0:k0:r4294967295

Both the caller and the filesystem use the identity idmapping:

1. Map the caller's userspace ids into kernel ids in the caller's idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping.

   For this second step the kernel will call the function
   ``fsuidgid_has_mapping()`` which ultimately boils down to calling
   ``from_kuid()``::

    from_kuid(u0:k0:r4294967295, k1000) = u1000

In this example both idmappings are the same so there's nothing exciting going
on. Ultimately the userspace id that lands on disk will be ``u1000``.

Example 2
~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k20000:r10000

1. Map the caller's userspace ids down into kernel ids in the caller's
   idmapping::

    make_kuid(u0:k10000:r10000, u1000) = k11000

2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k20000:r10000, k11000) = u-1

It's immediately clear that while the caller's userspace id could be
successfully mapped down into kernel ids in the caller's idmapping the kernel
ids could not be mapped up according to the filesystem's idmapping. So the
kernel will deny this creation request.

Note that while this example is less common, because most filesystems can't be
mounted with non-initial idmappings, this is a general problem as we can see in
the next examples.

Example 3
~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k0:r4294967295

1. Map the caller's userspace ids down into kernel ids in the caller's
   idmapping::

    make_kuid(u0:k10000:r10000, u1000) = k11000

2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k0:r4294967295, k11000) = u11000

We can see that the translation always succeeds. The userspace id that the
filesystem will ultimately put to disk will always be identical to the value of
the kernel id that was created in the caller's idmapping. This has mainly two
consequences.

First, that we can't allow a caller to ultimately write to disk with another
userspace id. We could only do this if we were to mount the whole filesystem
with the caller's or another idmapping. But that solution is limited to a few
filesystems and not very flexible. But this is a use-case that is pretty
important in containerized workloads.

Second, the caller will usually not be able to create any files or access
directories that have stricter permissions because none of the filesystem's
kernel ids map up into valid userspace ids in the caller's idmapping:

1. Map raw userspace ids down to kernel ids in the filesystem's idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Map kernel ids up to userspace ids in the caller's idmapping::

    from_kuid(u0:k10000:r10000, k1000) = u-1

Example 4
~~~~~~~~~

::

 file id:              u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k0:r4294967295

In order to report ownership to userspace the kernel uses the crossmapping
algorithm introduced in a previous section:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Map the kernel id up into a userspace id in the caller's idmapping::

    from_kuid(u0:k10000:r10000, k1000) = u-1

The crossmapping algorithm fails in this case because the kernel id in the
filesystem idmapping cannot be mapped up to a userspace id in the caller's
idmapping. Thus, the kernel will report the ownership of this file as the
overflowid.

Example 5
~~~~~~~~~

::

 file id:              u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k20000:r10000

In order to report ownership to userspace the kernel uses the crossmapping
algorithm introduced in a previous section:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k20000:r10000, u1000) = k21000

2. Map the kernel id up into a userspace id in the caller's idmapping::

    from_kuid(u0:k10000:r10000, k21000) = u-1

Again, the crossmapping algorithm fails in this case because the kernel id in
the filesystem idmapping cannot be mapped to a userspace id in the caller's
idmapping. Thus, the kernel will report the ownership of this file as the
overflowid.

Note how in the last two examples things would be simple if the caller were
using the initial idmapping. For a filesystem mounted with the initial
idmapping it would be trivial. So we only consider a filesystem with an
idmapping of ``u0:k20000:r10000``:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k20000:r10000, u1000) = k21000

2. Map the kernel id up into a userspace id in the caller's idmapping::

    from_kuid(u0:k0:r4294967295, k21000) = u21000

Idmappings on idmapped mounts
-----------------------------

The examples we've seen in the previous section where the caller's idmapping
and the filesystem's idmapping are incompatible cause various issues for
workloads.
For a more complex but common example, consider two containers
started on the host. To completely prevent the two containers from affecting
each other, an administrator may often use different non-overlapping idmappings
for the two containers::

 container1 idmapping:  u0:k10000:r10000
 container2 idmapping:  u0:k20000:r10000
 filesystem idmapping:  u0:k30000:r10000

An administrator wanting to provide easy read-write access to the following set
of files::

 dir id:       u0
 dir/file1 id: u1000
 dir/file2 id: u2000

to both containers currently can't.

Of course the administrator has the option to recursively change ownership via
``chown()``. For example, they could change ownership so that ``dir`` and all
files below it can be crossmapped from the filesystem's into the container's
idmapping. Let's assume they change ownership so it is compatible with the
first container's idmapping::

 dir id:       u10000
 dir/file1 id: u11000
 dir/file2 id: u12000

This would still leave ``dir`` rather useless to the second container. In fact,
``dir`` and all files below it would continue to appear owned by the overflowid
for the second container.

Or consider another increasingly popular example. Some service managers such as
systemd implement a concept called "portable home directories". A user may want
to use their home directories on different machines where they are assigned
different login userspace ids. Most users will have ``u1000`` as the login id
on their machine at home and all files in their home directory will usually be
owned by ``u1000``. At uni or at work they may have another login id such as
``u1125``. This makes it rather difficult to interact with their home directory
on their work machine.

In both cases changing ownership recursively has grave implications. The most
obvious one is that ownership is changed globally and permanently.
In the home
directory case this change in ownership would even need to happen every time
the user switches from their home to their work machine. For really large sets
of files this becomes increasingly costly.

If the user is lucky, they are dealing with a filesystem that is mountable
inside user namespaces. But this would also change ownership globally and the
change in ownership is tied to the lifetime of the filesystem mount, i.e. the
superblock. The only way to change ownership is to completely unmount the
filesystem and mount it again in another user namespace. This is usually
impossible because it would mean that all users currently accessing the
filesystem can't anymore. And it means that ``dir`` still can't be shared
between two containers with different idmappings.
But usually the user doesn't even have this option since most filesystems
aren't mountable inside containers. And not having them mountable might be
desirable as it doesn't require the filesystem to deal with malicious
filesystem images.

But the use cases mentioned above and more can be handled by idmapped mounts.
They allow exposing the same set of dentries with different ownership at
different mounts. This is achieved by marking the mounts with a user namespace
through the ``mount_setattr()`` system call. The idmapping associated with it
is then used to translate from the caller's idmapping to the filesystem's
idmapping and vice versa using the remapping algorithm we introduced above.

Idmapped mounts make it possible to change ownership in a temporary and
localized way. The ownership changes are restricted to a specific mount and the
ownership changes are tied to the lifetime of the mount. All other users and
locations where the filesystem is exposed are unaffected.

Filesystems that support idmapped mounts don't have any real reason to support
being mountable inside user namespaces.
A filesystem could be exposed
completely under an idmapped mount to get the same effect. This has the
advantage that filesystems can leave the creation of the superblock to
privileged users in the initial user namespace.

However, it is perfectly possible to combine idmapped mounts with filesystems
mountable inside user namespaces. We will touch on this further below.

Filesystem types vs idmapped mount types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With the introduction of idmapped mounts we need to distinguish between
filesystem ownership and mount ownership of a VFS object such as an inode. The
owner of an inode might be different when looked at from a filesystem
perspective than when looked at from an idmapped mount. Such fundamental
conceptual distinctions should almost always be clearly expressed in the code.
So, to distinguish idmapped mount ownership from filesystem ownership separate
types have been introduced.

If a uid or gid has been generated using the filesystem or caller's idmapping
then we will use the ``kuid_t`` and ``kgid_t`` types. However, if a uid or gid
has been generated using a mount idmapping then we will be using the dedicated
``vfsuid_t`` and ``vfsgid_t`` types.

All VFS helpers that generate or take uids and gids as arguments use the
``vfsuid_t`` and ``vfsgid_t`` types and we will be able to rely on the compiler
to catch errors that originate from conflating filesystem and VFS uids and gids.
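
How distinct types let the compiler catch such mistakes can be sketched in a
few lines of C. The ``model_*`` names below are hypothetical stand-ins chosen
for this illustration and are not the kernel's definitions. The sketch only
assumes that each id kind is wrapped in its own single-member struct so that
the two kinds cannot be assigned to or compared with each other implicitly.

```c
#include <stdint.h>

/* Hypothetical stand-ins for a filesystem-wide id and a mount-specific id. */
typedef struct { uint32_t val; } model_kuid_t;
typedef struct { uint32_t val; } model_vfsuid_t;

/* Accessor for the raw value of a filesystem-wide id. */
static uint32_t model_kuid_val(model_kuid_t kuid)
{
	return kuid.val;
}

/* The only way to turn a VFS id into a filesystem-wide id is explicit. */
static model_kuid_t model_vfsuid_into_kuid(model_vfsuid_t vfsuid)
{
	return (model_kuid_t){ vfsuid.val };
}

/*
 * A helper that only accepts VFS ids. Passing a model_kuid_t here is a
 * type error because the two struct types are incompatible.
 */
static int model_vfsuid_eq(model_vfsuid_t left, model_vfsuid_t right)
{
	return left.val == right.val;
}
```

In this model a call such as ``model_vfsuid_eq(vfsuid, kuid)`` does not
compile, whereas ``model_vfsuid_eq(vfsuid, other_vfsuid)`` does. Any
conversion between the two kinds has to go through an explicit helper, which
makes the point where mount-specific ownership becomes filesystem-wide
ownership visible in the code.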

The ``vfsuid_t`` and ``vfsgid_t`` types are often mapped from and to ``kuid_t``
and ``kgid_t`` types similar to how ``kuid_t`` and ``kgid_t`` types are mapped
from and to ``uid_t`` and ``gid_t`` types::

 uid_t <--> kuid_t <--> vfsuid_t
 gid_t <--> kgid_t <--> vfsgid_t

Whenever we report ownership based on a ``vfsuid_t`` or ``vfsgid_t`` type,
e.g., during ``stat()``, or store ownership information in a shared VFS object
based on a ``vfsuid_t`` or ``vfsgid_t`` type, e.g., during ``chown()`` we can
use the ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()`` helpers.

To illustrate why these helpers currently exist, consider what happens when we
change ownership of an inode from an idmapped mount. After we generated
a ``vfsuid_t`` or ``vfsgid_t`` based on the mount idmapping we later commit to
this ``vfsuid_t`` or ``vfsgid_t`` to become the new filesystem-wide ownership.
Thus, we are turning the ``vfsuid_t`` or ``vfsgid_t`` into a global ``kuid_t``
or ``kgid_t``. And this can be done by using ``vfsuid_into_kuid()`` and
``vfsgid_into_kgid()``.

Note, whenever a shared VFS object, e.g., a cached ``struct inode`` or a cached
``struct posix_acl``, stores ownership information a filesystem or "global"
``kuid_t`` and ``kgid_t`` must be used. Ownership expressed via ``vfsuid_t``
and ``vfsgid_t`` is specific to an idmapped mount.

We already noted that ``vfsuid_t`` and ``vfsgid_t`` types are generated based
on mount idmappings whereas ``kuid_t`` and ``kgid_t`` types are generated based
on filesystem idmappings. To prevent abusing filesystem idmappings to generate
``vfsuid_t`` or ``vfsgid_t`` types or mount idmappings to generate ``kuid_t``
or ``kgid_t`` types filesystem idmappings and mount idmappings are different
types as well.

All helpers that map to or from ``vfsuid_t`` and ``vfsgid_t`` types require
a mount idmapping to be passed which is of type ``struct mnt_idmap``.
Passing
a filesystem or caller idmapping will cause a compilation error.

Similar to how we prefix all userspace ids in this document with ``u`` and all
kernel ids with ``k`` we will prefix all VFS ids with ``v``. So a mount
idmapping will be written as: ``u0:v10000:r10000``.

Remapping helpers
~~~~~~~~~~~~~~~~~

Idmapping functions were added that translate between idmappings. They make use
of the remapping algorithm we've introduced earlier. We're going to look at:

- ``i_uid_into_vfsuid()`` and ``i_gid_into_vfsgid()``

  The ``i_*id_into_vfs*id()`` functions translate filesystem's kernel ids into
  VFS ids in the mount's idmapping::

   /* Map the filesystem's kernel id up into a userspace id in the filesystem's idmapping. */
   from_kuid(filesystem, kid) = uid

   /* Map the filesystem's userspace id down into a VFS id in the mount's idmapping. */
   make_kuid(mount, uid) = kuid

- ``mapped_fsuid()`` and ``mapped_fsgid()``

  The ``mapped_fs*id()`` functions translate the caller's kernel ids into
  kernel ids in the filesystem's idmapping. This translation is achieved by
  remapping the caller's VFS ids using the mount's idmapping::

   /* Map the caller's VFS id up into a userspace id in the mount's idmapping. */
   from_kuid(mount, kid) = uid

   /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
   make_kuid(filesystem, uid) = kuid

- ``vfsuid_into_kuid()`` and ``vfsgid_into_kgid()``

  Whenever

Note that these two functions invert each other. Consider the following
idmappings::

 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k20000:r10000
 mount idmapping:      u0:v10000:r10000

Assume a file owned by ``u1000`` is read from disk. The filesystem maps this id
to ``k21000`` according to its idmapping. This is what is stored in the
inode's ``i_uid`` and ``i_gid`` fields.
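
The claim that the two remapping helpers invert each other can be checked with
a small arithmetic model. This is only a sketch: ``struct idmap_model`` and the
``model_*`` functions are hypothetical names, the real kernel helpers take
different arguments, and each idmapping is assumed to consist of a single
``u:k:r`` extent. For a mount idmapping the ``k`` member holds the first VFS
id.

```c
#include <stdint.h>

/* Hypothetical model of a single-extent idmapping "u:k:r" (or "u:v:r"). */
struct idmap_model {
	uint32_t u;	/* first id in the upper (userspace) idmapset */
	uint32_t k;	/* first id in the lower (kernel or VFS) idmapset */
	uint32_t r;	/* number of ids that are mapped */
};

/* Stands in for an unmapped id, i.e. (uid_t)-1. */
#define MODEL_INVALID_ID UINT32_MAX

/* Map down: id - u + k, or MODEL_INVALID_ID if the id is unmapped. */
static uint32_t model_map_down(struct idmap_model m, uint32_t uid)
{
	if (uid == MODEL_INVALID_ID || uid < m.u || uid - m.u >= m.r)
		return MODEL_INVALID_ID;
	return uid - m.u + m.k;
}

/* Map up: id - k + u, or MODEL_INVALID_ID if the id is unmapped. */
static uint32_t model_map_up(struct idmap_model m, uint32_t kid)
{
	if (kid == MODEL_INVALID_ID || kid < m.k || kid - m.k >= m.r)
		return MODEL_INVALID_ID;
	return kid - m.k + m.u;
}

/* Filesystem kernel id -> VFS id: up in the filesystem, down in the mount. */
static uint32_t model_i_uid_into_vfsuid(struct idmap_model mount,
					struct idmap_model fs, uint32_t kid)
{
	return model_map_down(mount, model_map_up(fs, kid));
}

/* VFS id -> filesystem kernel id: up in the mount, down in the filesystem. */
static uint32_t model_mapped_fsuid(struct idmap_model mount,
				   struct idmap_model fs, uint32_t vid)
{
	return model_map_down(fs, model_map_up(mount, vid));
}
```

With the filesystem idmapping ``u0:k20000:r10000`` and the mount idmapping
``u0:v10000:r10000`` from above, the first function turns ``k21000`` into
``v11000`` and the second one turns ``v11000`` back into ``k21000``. An id
that is not mapped in either idmapping yields ``MODEL_INVALID_ID``.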

When the caller queries the ownership of this file via ``stat()`` the kernel
would usually simply use the crossmapping algorithm and map the filesystem's
kernel id up to a userspace id in the caller's idmapping.

But when the caller is accessing the file on an idmapped mount the kernel will
first call ``i_uid_into_vfsuid()`` thereby translating the filesystem's kernel
id into a VFS id in the mount's idmapping::

 i_uid_into_vfsuid(k21000):
   /* Map the filesystem's kernel id up into a userspace id. */
   from_kuid(u0:k20000:r10000, k21000) = u1000

   /* Map the filesystem's userspace id down into a VFS id in the mount's idmapping. */
   make_kuid(u0:v10000:r10000, u1000) = v11000

Finally, when the kernel reports the owner to the caller it will turn the
VFS id in the mount's idmapping into a userspace id in the caller's
idmapping::

 k11000 = vfsuid_into_kuid(v11000)
 from_kuid(u0:k10000:r10000, k11000) = u1000

We can test whether this algorithm really works by verifying what happens when
we create a new file. Let's say the user is creating a file with ``u1000``.

The kernel maps this to ``k11000`` in the caller's idmapping. Usually the
kernel would now apply the crossmapping, verifying that ``k11000`` can be
mapped to a userspace id in the filesystem's idmapping. Since ``k11000`` can't
be mapped up in the filesystem's idmapping directly this creation request
fails.

But when the caller is accessing the file on an idmapped mount the kernel will
first call ``mapped_fs*id()`` thereby translating the caller's kernel id, taken
as a VFS id in the mount's idmapping, into a kernel id in the filesystem's
idmapping::

 mapped_fsuid(v11000):
   /* Map the caller's VFS id up into a userspace id in the mount's idmapping. */
   from_kuid(u0:v10000:r10000, v11000) = u1000

   /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
   make_kuid(u0:k20000:r10000, u1000) = k21000

When finally writing to disk the kernel will then map ``k21000`` up into a
userspace id in the filesystem's idmapping::

 from_kuid(u0:k20000:r10000, k21000) = u1000

As we can see, we end up with an invertible and therefore information
preserving algorithm. A file created from ``u1000`` on an idmapped mount will
also be reported as being owned by ``u1000`` and vice versa.

Let's now briefly reconsider the failing examples from earlier in the context
of idmapped mounts.

Example 2 reconsidered
~~~~~~~~~~~~~~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k20000:r10000
 mount idmapping:      u0:v10000:r10000

When the caller is using a non-initial idmapping the common case is to attach
the same idmapping to the mount. We now perform three steps:

1. Map the caller's userspace ids into kernel ids in the caller's idmapping::

    make_kuid(u0:k10000:r10000, u1000) = k11000

2. Translate the caller's VFS id into a kernel id in the filesystem's
   idmapping::

    mapped_fsuid(v11000):
      /* Map the VFS id up into a userspace id in the mount's idmapping. */
      from_kuid(u0:v10000:r10000, v11000) = u1000

      /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
      make_kuid(u0:k20000:r10000, u1000) = k21000

3. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k20000:r10000, k21000) = u1000

So the ownership that lands on disk will be ``u1000``.

Example 3 reconsidered
~~~~~~~~~~~~~~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k0:r4294967295
 mount idmapping:      u0:v10000:r10000

The same translation algorithm works with the third example.

1. Map the caller's userspace ids into kernel ids in the caller's idmapping::

    make_kuid(u0:k10000:r10000, u1000) = k11000

2. Translate the caller's VFS id into a kernel id in the filesystem's
   idmapping::

    mapped_fsuid(v11000):
       /* Map the VFS id up into a userspace id in the mount's idmapping. */
       from_kuid(u0:v10000:r10000, v11000) = u1000

       /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
       make_kuid(u0:k0:r4294967295, u1000) = k1000

3. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k0:r4294967295, k1000) = u1000

So the ownership that lands on disk will be ``u1000``.

Example 4 reconsidered
~~~~~~~~~~~~~~~~~~~~~~

::

 file id:              u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k0:r4294967295
 mount idmapping:      u0:v10000:r10000

In order to report ownership to userspace the kernel now does three steps using
the translation algorithm we introduced earlier:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Translate the kernel id into a VFS id in the mount's idmapping::

    i_uid_into_vfsuid(k1000):
      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
      from_kuid(u0:k0:r4294967295, k1000) = u1000

      /* Map the userspace id down into a VFS id in the mount's idmapping. */
      make_kuid(u0:v10000:r10000, u1000) = v11000

3. Map the VFS id up into a userspace id in the caller's idmapping::

    k11000 = vfsuid_into_kuid(v11000)
    from_kuid(u0:k10000:r10000, k11000) = u1000

Earlier, the caller's kernel id couldn't be crossmapped in the filesystem's
idmapping. With the idmapped mount in place it now can be crossmapped into the
filesystem's idmapping via the mount's idmapping.
The file is now reported as being owned by ``u1000`` according to the mount's
idmapping.

Example 5 reconsidered
~~~~~~~~~~~~~~~~~~~~~~

::

 file id:              u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k20000:r10000
 mount idmapping:      u0:v10000:r10000

Again, in order to report ownership to userspace the kernel now does three
steps using the translation algorithm we introduced earlier:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k20000:r10000, u1000) = k21000

2. Translate the kernel id into a VFS id in the mount's idmapping::

    i_uid_into_vfsuid(k21000):
      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
      from_kuid(u0:k20000:r10000, k21000) = u1000

      /* Map the userspace id down into a VFS id in the mount's idmapping. */
      make_kuid(u0:v10000:r10000, u1000) = v11000

3. Map the VFS id up into a userspace id in the caller's idmapping::

    k11000 = vfsuid_into_kuid(v11000)
    from_kuid(u0:k10000:r10000, k11000) = u1000

Earlier, the file's kernel id couldn't be crossmapped in the filesystem's
idmapping. With the idmapped mount in place it now can be crossmapped into the
filesystem's idmapping via the mount's idmapping. The file is now owned by
``u1000`` according to the mount's idmapping.

Changing ownership on a home directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We've seen above how idmapped mounts can be used to translate between
idmappings when either the caller, the filesystem or both use a non-initial
idmapping. A wide range of use cases exist when the caller is using
a non-initial idmapping. This mostly happens in the context of containerized
workloads.
The consequence, as we have seen, is that for both filesystems
mounted with the initial idmapping and filesystems mounted with non-initial
idmappings, access to the filesystem isn't working because the kernel ids can't
be crossmapped between the caller's and the filesystem's idmapping.

As we've seen above, idmapped mounts provide a solution to this by remapping
the caller's or filesystem's idmapping according to the mount's idmapping.

Aside from containerized workloads, idmapped mounts have the advantage that
they also work when both the caller and the filesystem use the initial
idmapping, which means users on the host can change the ownership of
directories and files on a per-mount basis.

Consider our previous example where a user has their home directory on portable
storage. At home they have id ``u1000`` and all files in their home directory
are owned by ``u1000`` whereas at uni or work they have login id ``u1125``.

Taking their home directory with them becomes problematic. They can't easily
access their files, they might not be able to write to disk without applying
lax permissions or ACLs, and even if they can, they will end up with an annoying
mix of files and directories owned by ``u1000`` and ``u1125``.

Idmapped mounts solve this problem. A user can create an idmapped
mount for their home directory on their work computer or their computer at home
depending on what ownership they would prefer to end up on the portable storage
itself.

Let's assume they want all files on disk to belong to ``u1000``. When the user
plugs in their portable storage at their workstation they can set up a job that
creates an idmapped mount with the minimal idmapping ``u1000:v1125:r1``.
So now when they create a file the kernel performs the following steps we
already know from above::

 caller id:            u1125
 caller idmapping:     u0:k0:r4294967295
 filesystem idmapping: u0:k0:r4294967295
 mount idmapping:      u1000:v1125:r1

1. Map the caller's userspace ids into kernel ids in the caller's idmapping::

    make_kuid(u0:k0:r4294967295, u1125) = k1125

2. Translate the caller's VFS id into a kernel id in the filesystem's
   idmapping::

    mapped_fsuid(v1125):
      /* Map the VFS id up into a userspace id in the mount's idmapping. */
      from_kuid(u1000:v1125:r1, v1125) = u1000

      /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
      make_kuid(u0:k0:r4294967295, u1000) = k1000

3. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k0:r4294967295, k1000) = u1000

So ultimately the file will be created with ``u1000`` on disk.

Now let's briefly look at what ownership the caller with id ``u1125`` will see
on their work computer:

::

 file id:              u1000
 caller idmapping:     u0:k0:r4294967295
 filesystem idmapping: u0:k0:r4294967295
 mount idmapping:      u1000:v1125:r1

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Translate the kernel id into a VFS id in the mount's idmapping::

    i_uid_into_vfsuid(k1000):
      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
      from_kuid(u0:k0:r4294967295, k1000) = u1000

      /* Map the userspace id down into a VFS id in the mount's idmapping. */
      make_kuid(u1000:v1125:r1, u1000) = v1125

3. Map the VFS id up into a userspace id in the caller's idmapping::

    k1125 = vfsuid_into_kuid(v1125)
    from_kuid(u0:k0:r4294967295, k1125) = u1125

So ultimately the kernel will report to the caller that the file belongs to
``u1125``, which is the caller's userspace id on their workstation in our
example.

The raw userspace id that is put on disk is ``u1000``, so when the user takes
their home directory back to their home computer, where they are assigned
``u1000`` using the initial idmapping, and mounts the filesystem with the
initial idmapping, they will see all those files owned by ``u1000``.

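
The home directory walkthrough above can be replayed with a minimal userspace
sketch. It is only a model under stated assumptions: every idmapping is
a single ``u:k:r`` extent, and ``struct idmap_model``, ``ID_UNMAPPED`` and the
``model_*()`` functions are hypothetical names that do not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an unmapped id; not a kernel definition. */
#define ID_UNMAPPED ((uint32_t)-1)

/* Hypothetical model of a single-extent idmapping written as u:k:r. */
struct idmap_model {
	uint32_t upper, lower, range;
};

/* Models make_kuid(): map an upper id down into the lower idmapset. */
static uint32_t model_map_down(struct idmap_model m, uint32_t u)
{
	if (u == ID_UNMAPPED || u < m.upper || u - m.upper >= m.range)
		return ID_UNMAPPED;
	return m.lower + (u - m.upper);
}

/* Models from_kuid(): map a lower id up into the upper idmapset. */
static uint32_t model_map_up(struct idmap_model m, uint32_t k)
{
	if (k == ID_UNMAPPED || k < m.lower || k - m.lower >= m.range)
		return ID_UNMAPPED;
	return m.upper + (k - m.lower);
}

/* Userspace id written to disk when the caller creates a file. */
static uint32_t model_owner_on_disk(struct idmap_model caller,
				    struct idmap_model mnt,
				    struct idmap_model fs, uint32_t caller_uid)
{
	uint32_t kid = model_map_down(caller, caller_uid);	/* step 1 */
	uint32_t uid = model_map_up(mnt, kid);		/* step 2, up */
	uint32_t fs_kid = model_map_down(fs, uid);	/* step 2, down */

	return model_map_up(fs, fs_kid);		/* step 3 */
}

/* Userspace id reported to the caller for an id stored on disk. */
static uint32_t model_owner_reported(struct idmap_model caller,
				     struct idmap_model mnt,
				     struct idmap_model fs, uint32_t disk_uid)
{
	uint32_t fs_kid = model_map_down(fs, disk_uid);	/* step 1 */
	uint32_t uid = model_map_up(fs, fs_kid);	/* step 2, up */
	uint32_t vid = model_map_down(mnt, uid);	/* step 2, down */

	return model_map_up(caller, vid);		/* step 3 */
}
```

With the initial idmapping ``u0:k0:r4294967295`` for caller and filesystem and
the mount idmapping ``u1000:v1125:r1``, the model writes ``u1000`` to disk for
caller ``u1125`` and reports ``u1125`` for a file stored as ``u1000``. Because
the mount idmapping covers a single id, any other caller id comes out as
unmapped in this model.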