.. SPDX-License-Identifier: GPL-2.0

Idmappings
==========

Most filesystem developers will have encountered idmappings. They are used when
reading from or writing ownership to disk, reporting ownership to userspace, or
for permission checking. This document is aimed at filesystem developers that
want to know how idmappings work.

Formal notes
------------

An idmapping is essentially a translation of a range of ids into another or the
same range of ids. The notational convention for idmappings that is widely used
in userspace is::

 u:k:r

``u`` indicates the first element in the upper idmapset ``U`` and ``k``
indicates the first element in the lower idmapset ``K``. The ``r`` parameter
indicates the range of the idmapping, i.e. how many ids are mapped. From now
on, we will always prefix ids with ``u`` or ``k`` to make it clear whether
we're talking about an id in the upper or lower idmapset.

To see what this looks like in practice, let's take the following idmapping::

 u22:k10000:r3

and write down the mappings it will generate::

 u22 -> k10000
 u23 -> k10001
 u24 -> k10002

From a mathematical viewpoint ``U`` and ``K`` are well-ordered sets and an
idmapping is an order isomorphism from ``U`` into ``K``. So ``U`` and ``K`` are
order isomorphic. In fact, ``U`` and ``K`` are always well-ordered subsets of
the set of all possible ids usable on a given system.

Looking at this mathematically briefly will help us highlight some properties
that make it easier to understand how we can translate between idmappings. For
example, we know that the inverse idmapping is an order isomorphism as well::

 k10000 -> u22
 k10001 -> u23
 k10002 -> u24

Given that we are dealing with order isomorphisms plus the fact that we're
dealing with subsets we can embed idmappings into each other, i.e. we can
sensibly translate between different idmappings.
For example, assume we've been given the three
idmappings::

 1. u0:k10000:r10000
 2. u0:k20000:r10000
 3. u0:k30000:r10000

and id ``k11000`` which has been generated by the first idmapping by mapping
``u1000`` from the upper idmapset down to ``k11000`` in the lower idmapset.

Because we're dealing with order isomorphic subsets it is meaningful to ask
what id ``k11000`` corresponds to in the second or third idmapping. The
straightforward algorithm to use is to apply the inverse of the first idmapping,
mapping ``k11000`` up to ``u1000``. Afterwards, we can map ``u1000`` down using
either the second or the third idmapping. The second idmapping would map
``u1000`` down to ``k21000``. The third idmapping would map ``u1000`` down to
``k31000``.

If we were given the same task for the following three idmappings::

 1. u0:k10000:r10000
 2. u0:k20000:r200
 3. u0:k30000:r300

we would fail to translate as the sets aren't order isomorphic over the full
range of the first idmapping anymore (however, they are order isomorphic over
the full range of the second idmapping). Neither the second nor the third
idmapping contains ``u1000`` in the upper idmapset ``U``. This is equivalent to
not having an id mapped. We can simply say that ``u1000`` is unmapped in the
second and third idmapping. The kernel will report unmapped ids as the
overflowuid ``(uid_t)-1`` or overflowgid ``(gid_t)-1`` to userspace.

The algorithm to calculate what a given id maps to is pretty simple. First, we
need to verify that the range can contain our target id. We will skip this step
for simplicity.
After that if we want to know what ``id`` maps to we can do
simple calculations:

- If we want to map from left to right::

   u:k:r
   id - u + k = n

- If we want to map from right to left::

   u:k:r
   id - k + u = n

Instead of "left to right" we can also say "down" and instead of "right to
left" we can also say "up". Obviously mapping down and up invert each other.

To see whether the simple formulas above work, consider the following two
idmappings::

 1. u0:k20000:r10000
 2. u500:k30000:r10000

Assume we are given ``k21000`` in the lower idmapset of the first idmapping. We
want to know what id this was mapped from in the upper idmapset of the first
idmapping. So we're mapping up in the first idmapping::

 id - k + u = n
 k21000 - k20000 + u0 = u1000

Now assume we are given the id ``u1100`` in the upper idmapset of the second
idmapping and we want to know what this id maps down to in the lower idmapset
of the second idmapping. This means we're mapping down in the second
idmapping::

 id - u + k = n
 u1100 - u500 + k30000 = k30600

General notes
-------------

In the context of the kernel an idmapping can be interpreted as mapping a range
of userspace ids into a range of kernel ids::

 userspace-id:kernel-id:range

A userspace id is always an element in the upper idmapset of an idmapping of
type ``uid_t`` or ``gid_t`` and a kernel id is always an element in the lower
idmapset of an idmapping of type ``kuid_t`` or ``kgid_t``. From now on
"userspace id" will be used to refer to the well known ``uid_t`` and ``gid_t``
types and "kernel id" will be used to refer to ``kuid_t`` and ``kgid_t``.

The kernel is mostly concerned with kernel ids. They are used when performing
permission checks and are stored in an inode's ``i_uid`` and ``i_gid`` field.
A userspace id on the other hand is an id that is reported to userspace by the
kernel, or is passed by userspace to the kernel, or a raw device id that is
written or read from disk.

Note that we are only concerned with idmappings as the kernel stores them, not
how userspace would specify them.

For the rest of this document we will prefix all userspace ids with ``u`` and
all kernel ids with ``k``. Ranges of idmappings will be prefixed with ``r``. So
an idmapping will be written as ``u0:k10000:r10000``.

For example, within this idmapping, the id ``u1000`` is an id in the upper
idmapset or "userspace idmapset" starting with ``u0``. And it is mapped to
``k11000`` which is a kernel id in the lower idmapset or "kernel idmapset"
starting with ``k10000``.

A kernel id is always created by an idmapping. Such idmappings are associated
with user namespaces. Since we mainly care about how idmappings work we're not
going to be concerned with how idmappings are created nor how they are used
outside of the filesystem context. This is best left to an explanation of user
namespaces.

The initial user namespace is special. It always has an idmapping of the
following form::

 u0:k0:r4294967295

which is an identity idmapping over the full range of ids available on this
system.

Other user namespaces usually have non-identity idmappings such as::

 u0:k10000:r10000

When a process creates or wants to change ownership of a file, or when the
ownership of a file is read from disk by a filesystem, the userspace id is
immediately translated into a kernel id according to the idmapping associated
with the relevant user namespace.

For instance, consider a file that is stored on disk by a filesystem as being
owned by ``u1000``:

- If a filesystem were to be mounted in the initial user namespace (as most
  filesystems are) then the initial idmapping will be used.
  As we saw this is
  simply the identity idmapping. This would mean id ``u1000`` read from disk
  would be mapped to id ``k1000``. So an inode's ``i_uid`` and ``i_gid`` field
  would contain ``k1000``.

- If a filesystem were to be mounted with an idmapping of ``u0:k10000:r10000``
  then ``u1000`` read from disk would be mapped to ``k11000``. So an inode's
  ``i_uid`` and ``i_gid`` would contain ``k11000``.

Translation algorithms
----------------------

We've already seen briefly that it is possible to translate between different
idmappings. We'll now take a closer look at how that works.

Crossmapping
~~~~~~~~~~~~

This translation algorithm is used by the kernel in quite a few places. For
example, it is used when reporting back the ownership of a file to userspace
via the ``stat()`` system call family.

If we've been given ``k11000`` from one idmapping we can map that id up in
another idmapping. In order for this to work both idmappings need to contain
the same kernel id in their kernel idmapsets. For example, consider the
following idmappings::

 1. u0:k10000:r10000
 2. u20000:k10000:r10000

and we are mapping ``u1000`` down to ``k11000`` in the first idmapping. We can
then translate ``k11000`` into a userspace id in the second idmapping using the
kernel idmapset of the second idmapping::

 /* Map the kernel id up into a userspace id in the second idmapping. */
 from_kuid(u20000:k10000:r10000, k11000) = u21000

Note how we can get back to the kernel id in the first idmapping by inverting
the algorithm::

 /* Map the userspace id down into a kernel id in the second idmapping. */
 make_kuid(u20000:k10000:r10000, u21000) = k11000

 /* Map the kernel id up into a userspace id in the first idmapping. */
 from_kuid(u0:k10000:r10000, k11000) = u1000

This algorithm allows us to answer the question what userspace id a given
kernel id corresponds to in a given idmapping. In order to be able to answer
this question both idmappings need to contain the same kernel id in their
respective kernel idmapsets.

For example, when the kernel reads a raw userspace id from disk it maps it down
into a kernel id according to the idmapping associated with the filesystem.
Let's assume the filesystem was mounted with an idmapping of
``u0:k20000:r10000`` and it reads a file owned by ``u1000`` from disk. This
means ``u1000`` will be mapped to ``k21000`` which is what will be stored in
the inode's ``i_uid`` and ``i_gid`` field.

When someone in userspace calls ``stat()`` or a related function to get
ownership information about the file the kernel can't simply map the id back up
according to the filesystem's idmapping as this would give the wrong owner if
the caller is using an idmapping.

So the kernel will map the id back up in the idmapping of the caller. Let's
assume the caller has the slightly unconventional idmapping
``u3000:k20000:r10000`` then ``k21000`` would map back up to ``u4000``.
Consequently the user would see that this file is owned by ``u4000``.

Remapping
~~~~~~~~~

It is possible to translate a kernel id from one idmapping to another one via
the userspace idmapset of the two idmappings. This is equivalent to remapping
a kernel id.

Let's look at an example. We are given the following two idmappings::

 1. u0:k10000:r10000
 2. u0:k20000:r10000

and we are given ``k11000`` in the first idmapping. In order to translate this
kernel id in the first idmapping into a kernel id in the second idmapping we
need to perform two steps:

1. Map the kernel id up into a userspace id in the first idmapping::

    /* Map the kernel id up into a userspace id in the first idmapping. */
    from_kuid(u0:k10000:r10000, k11000) = u1000

2. Map the userspace id down into a kernel id in the second idmapping::

    /* Map the userspace id down into a kernel id in the second idmapping. */
    make_kuid(u0:k20000:r10000, u1000) = k21000

As you can see we used the userspace idmapset in both idmappings to translate
the kernel id in one idmapping to a kernel id in another idmapping.

This allows us to answer the question what kernel id we would need to use to
get the same userspace id in another idmapping. In order to be able to answer
this question both idmappings need to contain the same userspace id in their
respective userspace idmapsets.

Note how we can easily get back to the kernel id in the first idmapping by
inverting the algorithm:

1. Map the kernel id up into a userspace id in the second idmapping::

    /* Map the kernel id up into a userspace id in the second idmapping. */
    from_kuid(u0:k20000:r10000, k21000) = u1000

2. Map the userspace id down into a kernel id in the first idmapping::

    /* Map the userspace id down into a kernel id in the first idmapping. */
    make_kuid(u0:k10000:r10000, u1000) = k11000

Another way to look at this translation is to treat it as inverting one
idmapping and applying another idmapping if both idmappings have the relevant
userspace id mapped. This will come in handy when working with idmapped mounts.

Invalid translations
~~~~~~~~~~~~~~~~~~~~

It is never valid to use an id in the kernel idmapset of one idmapping as the
id in the userspace idmapset of another or the same idmapping. While the kernel
idmapset always indicates an idmapset in the kernel id space, the userspace
idmapset indicates a userspace id.
So the following translations are forbidden::

 /* Map the userspace id down into a kernel id in the first idmapping. */
 make_kuid(u0:k10000:r10000, u1000) = k11000

 /* INVALID: Map the kernel id down into a kernel id in the second idmapping. */
 make_kuid(u10000:k20000:r10000, k11000) = k21000
                                 ~~~~~~

and equally wrong::

 /* Map the kernel id up into a userspace id in the first idmapping. */
 from_kuid(u0:k10000:r10000, k11000) = u1000

 /* INVALID: Map the userspace id up into a userspace id in the second idmapping. */
 from_kuid(u20000:k0:r10000, u1000) = k21000
                             ~~~~~

Idmappings when creating filesystem objects
-------------------------------------------

The concepts of mapping an id down or mapping an id up are expressed in the two
kernel functions filesystem developers are rather familiar with and which we've
already used in this document::

 /* Map the userspace id down into a kernel id. */
 make_kuid(idmapping, uid)

 /* Map the kernel id up into a userspace id. */
 from_kuid(idmapping, kuid)

We will take an abbreviated look into how idmappings figure into creating
filesystem objects. For simplicity we will only look at what happens when the
VFS has already completed path lookup right before it calls into the filesystem
itself. So we're concerned with what happens when e.g. ``vfs_mkdir()`` is
called. We will also assume that the directory we're creating filesystem
objects in is readable and writable for everyone.

When creating a filesystem object the kernel will look at the caller's
filesystem ids. These are just regular ``uid_t`` and ``gid_t`` userspace ids
but they are exclusively used when determining file ownership which is why they
are called "filesystem ids". They are usually identical to the uid and gid of
the caller but can differ. We will just assume they are always identical to not
get lost in too many details.
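
As a rough illustration of the two helpers named at the start of this section,
the following Python sketch models an idmapping as a plain ``u:k:r`` triple and
applies the formulas from the formal notes, including the range check that was
skipped there. It is a userspace model only: the real kernel helpers take a
user namespace rather than a triple, and ``None`` is merely this sketch's
stand-in for an unmapped id.

.. code-block:: python

   from typing import NamedTuple, Optional


   class Idmapping(NamedTuple):
       """Model of one u:k:r idmapping (an assumption, not a kernel type)."""
       u: int  # first id in the upper (userspace) idmapset
       k: int  # first id in the lower (kernel) idmapset
       r: int  # number of ids that are mapped


   def make_kuid(m: Idmapping, uid: int) -> Optional[int]:
       """Map a userspace id down: id - u + k, or None if unmapped."""
       if not m.u <= uid < m.u + m.r:
           return None
       return uid - m.u + m.k


   def from_kuid(m: Idmapping, kuid: int) -> Optional[int]:
       """Map a kernel id up: id - k + u, or None if unmapped."""
       if not m.k <= kuid < m.k + m.r:
           return None
       return kuid - m.k + m.u


   second = Idmapping(u=500, k=30000, r=10000)
   print(make_kuid(second, 1100))                      # 30600
   print(from_kuid(second, 30600))                     # 1100
   print(from_kuid(Idmapping(0, 20000, 200), 21000))   # None

Keeping the range check inside both helpers means an unmapped id can never be
mistaken for a valid result of the arithmetic.
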

When the caller enters the kernel two things happen:

1. Map the caller's userspace ids down into kernel ids in the caller's
   idmapping.
   (To be precise, the kernel will simply look at the kernel ids stashed in the
   credentials of the current task but for our education we'll pretend this
   translation happens just in time.)
2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
   filesystem's idmapping.

The second step is important as regular filesystems will ultimately need to map
the kernel id back up into a userspace id when writing to disk.
So with the second step the kernel guarantees that a valid userspace id can be
written to disk. If it can't the kernel will refuse the creation request to not
even remotely risk filesystem corruption.

The astute reader will have realized that this is simply a variation of the
crossmapping algorithm we mentioned above in a previous section. First, the
kernel maps the caller's userspace id down into a kernel id according to the
caller's idmapping and then maps that kernel id up according to the
filesystem's idmapping.

Example 1
~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k0:r4294967295
 filesystem idmapping: u0:k0:r4294967295

Both the caller and the filesystem use the identity idmapping:

1. Map the caller's userspace ids into kernel ids in the caller's idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping.

   For this second step the kernel will call the function
   ``fsuidgid_has_mapping()`` which ultimately boils down to calling
   ``from_kuid()``::

    from_kuid(u0:k0:r4294967295, k1000) = u1000

In this example both idmappings are the same so there's nothing exciting going
on. Ultimately the userspace id that lands on disk will be ``u1000``.
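
The two steps of the creation check can be sketched as a small Python model.
This is a simplification under the assumption that an idmapping is a
``(u, k, r)`` triple; ``map_down()``, ``map_up()`` and ``disk_id_on_create()``
are hypothetical names for this sketch, not kernel functions, and ``None``
stands for a refused creation request.

.. code-block:: python

   from typing import Optional, Tuple

   Idmapping = Tuple[int, int, int]  # (u, k, r), as in the u:k:r notation

   INITIAL: Idmapping = (0, 0, 4294967295)


   def map_down(m: Idmapping, uid: int) -> Optional[int]:
       """Userspace id -> kernel id; None if the id is unmapped."""
       u, k, r = m
       return uid - u + k if u <= uid < u + r else None


   def map_up(m: Idmapping, kuid: int) -> Optional[int]:
       """Kernel id -> userspace id; None if the id is unmapped."""
       u, k, r = m
       return kuid - k + u if k <= kuid < k + r else None


   def disk_id_on_create(caller: Idmapping, filesystem: Idmapping,
                         uid: int) -> Optional[int]:
       """Hypothetical helper: the id that would land on disk, or None."""
       kuid = map_down(caller, uid)        # step 1: caller's idmapping
       if kuid is None:
           return None
       return map_up(filesystem, kuid)     # step 2: filesystem's idmapping


   # Example 1: identity idmappings on both sides.
   print(disk_id_on_create(INITIAL, INITIAL, 1000))                          # 1000
   # A caller mapped at k10000 on a filesystem mapped at k20000 is refused.
   print(disk_id_on_create((0, 10000, 10000), (0, 20000, 10000), 1000))      # None
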

Example 2
~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k20000:r10000

1. Map the caller's userspace ids down into kernel ids in the caller's
   idmapping::

    make_kuid(u0:k10000:r10000, u1000) = k11000

2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k20000:r10000, k11000) = u-1

It's immediately clear that while the caller's userspace id could be
successfully mapped down into kernel ids in the caller's idmapping the kernel
ids could not be mapped up according to the filesystem's idmapping. So the
kernel will deny this creation request.

Note that while this example is less common, because most filesystems can't be
mounted with non-initial idmappings, this is a general problem as we can see in
the next examples.

Example 3
~~~~~~~~~

::

 caller id:            u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k0:r4294967295

1. Map the caller's userspace ids down into kernel ids in the caller's
   idmapping::

    make_kuid(u0:k10000:r10000, u1000) = k11000

2. Verify that the caller's kernel ids can be mapped up to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k0:r4294967295, k11000) = u11000

We can see that the translation always succeeds. The userspace id that the
filesystem will ultimately put to disk will always be identical to the value of
the kernel id that was created in the caller's idmapping. This has mainly two
consequences.

First, that we can't allow a caller to ultimately write to disk with another
userspace id. We could only do this if we were to mount the whole filesystem
with the caller's or another idmapping. But that solution is limited to a few
filesystems and not very flexible.
But this is a use-case that is pretty
important in containerized workloads.

Second, the caller will usually not be able to create any files or access
directories that have stricter permissions because none of the filesystem's
kernel ids map up into valid userspace ids in the caller's idmapping:

1. Map raw userspace ids down to kernel ids in the filesystem's idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Map kernel ids up to userspace ids in the caller's idmapping::

    from_kuid(u0:k10000:r10000, k1000) = u-1

Example 4
~~~~~~~~~

::

 file id:              u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k0:r4294967295

In order to report ownership to userspace the kernel uses the crossmapping
algorithm introduced in a previous section:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Map the kernel id up into a userspace id in the caller's idmapping::

    from_kuid(u0:k10000:r10000, k1000) = u-1

The crossmapping algorithm fails in this case because the kernel id in the
filesystem idmapping cannot be mapped up to a userspace id in the caller's
idmapping. Thus, the kernel will report the ownership of this file as the
overflowid.

Example 5
~~~~~~~~~

::

 file id:              u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k20000:r10000

In order to report ownership to userspace the kernel uses the crossmapping
algorithm introduced in a previous section:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k20000:r10000, u1000) = k21000

2. Map the kernel id up into a userspace id in the caller's idmapping::

    from_kuid(u0:k10000:r10000, k21000) = u-1

Again, the crossmapping algorithm fails in this case because the kernel id in
the filesystem idmapping cannot be mapped to a userspace id in the caller's
idmapping. Thus, the kernel will report the ownership of this file as the
overflowid.

Note how in the last two examples things would be simple if the caller were
using the initial idmapping. For a filesystem mounted with the initial
idmapping it would be trivial. So we only consider a filesystem with an
idmapping of ``u0:k20000:r10000``:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k20000:r10000, u1000) = k21000

2. Map the kernel id up into a userspace id in the caller's idmapping::

    from_kuid(u0:k0:r4294967295, k21000) = u21000

Idmappings on idmapped mounts
-----------------------------

The examples we've seen in the previous section where the caller's idmapping
and the filesystem's idmapping are incompatible cause various issues for
workloads. For a more complex but common example, consider two containers
started on the host. To completely prevent the two containers from affecting
each other, an administrator may often use different non-overlapping idmappings
for the two containers::

 container1 idmapping:  u0:k10000:r10000
 container2 idmapping:  u0:k20000:r10000
 filesystem idmapping:  u0:k30000:r10000

An administrator wanting to provide easy read-write access to the following set
of files::

 dir id:       u0
 dir/file1 id: u1000
 dir/file2 id: u2000

to both containers currently can't.

Of course the administrator has the option to recursively change ownership via
``chown()``.
For example, they could change ownership so that ``dir`` and all
files below it can be crossmapped from the filesystem's into the container's
idmapping. Let's assume they change ownership so it is compatible with the
first container's idmapping::

 dir id:       u10000
 dir/file1 id: u11000
 dir/file2 id: u12000

This would still leave ``dir`` rather useless to the second container. In fact,
``dir`` and all files below it would continue to appear owned by the overflowid
for the second container.

Or consider another increasingly popular example. Some service managers such as
systemd implement a concept called "portable home directories". A user may want
to use their home directories on different machines where they are assigned
different login userspace ids. Most users will have ``u1000`` as the login id
on their machine at home and all files in their home directory will usually be
owned by ``u1000``. At uni or at work they may have another login id such as
``u1125``. This makes it rather difficult to interact with their home directory
on their work machine.

In both cases changing ownership recursively has grave implications. The most
obvious one is that ownership is changed globally and permanently. In the home
directory case this change in ownership would even need to happen every time the
user switches from their home to their work machine. For really large sets of
files this becomes increasingly costly.

If the user is lucky, they are dealing with a filesystem that is mountable
inside user namespaces. But this would also change ownership globally and the
change in ownership is tied to the lifetime of the filesystem mount, i.e. the
superblock. The only way to change ownership is to completely unmount the
filesystem and mount it again in another user namespace.
This is usually
impossible because it would mean that all users currently accessing the
filesystem can't anymore. And it means that ``dir`` still can't be shared
between two containers with different idmappings.
But usually the user doesn't even have this option since most filesystems
aren't mountable inside containers. And not having them mountable might be
desirable as it doesn't require the filesystem to deal with malicious
filesystem images.

But the usecases mentioned above and more can be handled by idmapped mounts.
They allow exposing the same set of dentries with different ownership at
different mounts. This is achieved by marking the mounts with a user namespace
through the ``mount_setattr()`` system call. The idmapping associated with it
is then used to translate from the caller's idmapping to the filesystem's
idmapping and vice versa using the remapping algorithm we introduced above.

Idmapped mounts make it possible to change ownership in a temporary and
localized way. The ownership changes are restricted to a specific mount and the
ownership changes are tied to the lifetime of the mount. All other users and
locations where the filesystem is exposed are unaffected.

Filesystems that support idmapped mounts don't have any real reason to support
being mountable inside user namespaces. A filesystem could be exposed
completely under an idmapped mount to get the same effect. This has the
advantage that filesystems can leave the creation of the superblock to
privileged users in the initial user namespace.

However, it is perfectly possible to combine idmapped mounts with filesystems
mountable inside user namespaces. We will touch on this further below.

Remapping helpers
~~~~~~~~~~~~~~~~~

Idmapping functions were added that translate between idmappings. They make use
of the remapping algorithm we've introduced earlier.
We're going to look at
two:

- ``i_uid_into_mnt()`` and ``i_gid_into_mnt()``

  The ``i_*id_into_mnt()`` functions translate the filesystem's kernel ids into
  kernel ids in the mount's idmapping::

   /* Map the filesystem's kernel id up into a userspace id in the filesystem's idmapping. */
   from_kuid(filesystem, kid) = uid

   /* Map the filesystem's userspace id down into a kernel id in the mount's idmapping. */
   make_kuid(mount, uid) = kuid

- ``mapped_fsuid()`` and ``mapped_fsgid()``

  The ``mapped_fs*id()`` functions translate the caller's kernel ids into
  kernel ids in the filesystem's idmapping. This translation is achieved by
  remapping the caller's kernel ids using the mount's idmapping::

   /* Map the caller's kernel id up into a userspace id in the mount's idmapping. */
   from_kuid(mount, kid) = uid

   /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
   make_kuid(filesystem, uid) = kuid

Note that these two functions invert each other. Consider the following
idmappings::

 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k20000:r10000
 mount idmapping:      u0:k10000:r10000

Assume a file owned by ``u1000`` is read from disk. The filesystem maps this id
to ``k21000`` according to its idmapping. This is what is stored in the
inode's ``i_uid`` and ``i_gid`` fields.

When the caller queries the ownership of this file via ``stat()`` the kernel
would usually simply use the crossmapping algorithm and map the filesystem's
kernel id up to a userspace id in the caller's idmapping.

But when the caller is accessing the file on an idmapped mount the kernel will
first call ``i_uid_into_mnt()`` thereby translating the filesystem's kernel id
into a kernel id in the mount's idmapping::

 i_uid_into_mnt(k21000):
   /* Map the filesystem's kernel id up into a userspace id. */
   from_kuid(u0:k20000:r10000, k21000) = u1000

   /* Map the filesystem's userspace id down into a kernel id in the mount's idmapping. */
   make_kuid(u0:k10000:r10000, u1000) = k11000

Finally, when the kernel reports the owner to the caller it will turn the
kernel id in the mount's idmapping into a userspace id in the caller's
idmapping::

 from_kuid(u0:k10000:r10000, k11000) = u1000

We can test whether this algorithm really works by verifying what happens when
we create a new file. Let's say the user is creating a file with ``u1000``.

The kernel maps this to ``k11000`` in the caller's idmapping. Usually the
kernel would now apply the crossmapping, verifying that ``k11000`` can be
mapped to a userspace id in the filesystem's idmapping. Since ``k11000`` can't
be mapped up in the filesystem's idmapping directly this creation request
fails.

But when the caller is accessing the file on an idmapped mount the kernel will
first call ``mapped_fs*id()`` thereby translating the caller's kernel id into
a kernel id according to the mount's idmapping::

 mapped_fsuid(k11000):
    /* Map the caller's kernel id up into a userspace id in the mount's idmapping. */
    from_kuid(u0:k10000:r10000, k11000) = u1000

    /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */
    make_kuid(u0:k20000:r10000, u1000) = k21000

When finally writing to disk the kernel will then map ``k21000`` up into a
userspace id in the filesystem's idmapping::

 from_kuid(u0:k20000:r10000, k21000) = u1000

As we can see, we end up with an invertible and therefore information
preserving algorithm. A file created from ``u1000`` on an idmapped mount will
also be reported as being owned by ``u1000`` and vice versa.

Let's now briefly reconsider the failing examples from earlier in the context
of idmapped mounts.
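
Before doing so, the two remapping helpers walked through above can be modelled
in a few lines of Python. This is a userspace sketch under stated assumptions:
idmappings are ``(u, k, r)`` triples, ``remap_into_mnt()`` and
``remap_fsuid()`` are hypothetical stand-ins for ``i_uid_into_mnt()`` and
``mapped_fsuid()``, and their signatures are invented for the sketch rather
than taken from the kernel.

.. code-block:: python

   from typing import Optional, Tuple

   Idmapping = Tuple[int, int, int]  # (u, k, r)


   def map_down(m: Idmapping, uid: int) -> Optional[int]:
       """Userspace id -> kernel id; None if the id is unmapped."""
       u, k, r = m
       return uid - u + k if u <= uid < u + r else None


   def map_up(m: Idmapping, kuid: int) -> Optional[int]:
       """Kernel id -> userspace id; None if the id is unmapped."""
       u, k, r = m
       return kuid - k + u if k <= kuid < k + r else None


   def remap(kuid: int, up_in: Idmapping, down_in: Idmapping) -> Optional[int]:
       """Remapping: map up in one idmapping, then down in another."""
       uid = map_up(up_in, kuid)
       return None if uid is None else map_down(down_in, uid)


   def remap_into_mnt(mount: Idmapping, filesystem: Idmapping,
                      kuid: int) -> Optional[int]:
       """Model of i_uid_into_mnt(): filesystem kernel id -> mount kernel id."""
       return remap(kuid, up_in=filesystem, down_in=mount)


   def remap_fsuid(mount: Idmapping, filesystem: Idmapping,
                   kuid: int) -> Optional[int]:
       """Model of mapped_fsuid(): caller kernel id -> filesystem kernel id."""
       return remap(kuid, up_in=mount, down_in=filesystem)


   filesystem = (0, 20000, 10000)
   mount = (0, 10000, 10000)
   print(remap_into_mnt(mount, filesystem, 21000))   # 11000
   print(remap_fsuid(mount, filesystem, 11000))      # 21000

Expressing both helpers through one ``remap()`` function makes it visible that
they differ only in which idmapping is used for mapping up and which for
mapping down, which is why they invert each other.
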
714 715Example 2 reconsidered 716~~~~~~~~~~~~~~~~~~~~~~ 717 718:: 719 720 caller id: u1000 721 caller idmapping: u0:k10000:r10000 722 filesystem idmapping: u0:k20000:r10000 723 mount idmapping: u0:k10000:r10000 724 725When the caller is using a non-initial idmapping the common case is to attach 726the same idmapping to the mount. We now perform three steps: 727 7281. Map the caller's userspace ids into kernel ids in the caller's idmapping:: 729 730 make_kuid(u0:k10000:r10000, u1000) = k11000 731 7322. Translate the caller's kernel id into a kernel id in the filesystem's 733 idmapping:: 734 735 mapped_fsuid(k11000): 736 /* Map the kernel id up into a userspace id in the mount's idmapping. */ 737 from_kuid(u0:k10000:r10000, k11000) = u1000 738 739 /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ 740 make_kuid(u0:k20000:r10000, u1000) = k21000 741 7422. Verify that the caller's kernel ids can be mapped to userspace ids in the 743 filesystem's idmapping:: 744 745 from_kuid(u0:k20000:r10000, k21000) = u1000 746 747So the ownership that lands on disk will be ``u1000``. 748 749Example 3 reconsidered 750~~~~~~~~~~~~~~~~~~~~~~ 751 752:: 753 754 caller id: u1000 755 caller idmapping: u0:k10000:r10000 756 filesystem idmapping: u0:k0:r4294967295 757 mount idmapping: u0:k10000:r10000 758 759The same translation algorithm works with the third example. 760 7611. Map the caller's userspace ids into kernel ids in the caller's idmapping:: 762 763 make_kuid(u0:k10000:r10000, u1000) = k11000 764 7652. Translate the caller's kernel id into a kernel id in the filesystem's 766 idmapping:: 767 768 mapped_fsuid(k11000): 769 /* Map the kernel id up into a userspace id in the mount's idmapping. */ 770 from_kuid(u0:k10000:r10000, k11000) = u1000 771 772 /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ 773 make_kuid(u0:k0:r4294967295, u1000) = k1000 774 7752. 
   Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k0:r4294967295, k1000) = u1000

So the ownership that lands on disk will be ``u1000``.

Example 4 reconsidered
~~~~~~~~~~~~~~~~~~~~~~

::

 file id:              u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k0:r4294967295
 mount idmapping:      u0:k10000:r10000

In order to report ownership to userspace the kernel now does three steps using
the translation algorithm we introduced earlier:

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Translate the kernel id into a kernel id in the mount's idmapping::

    i_uid_into_mnt(k1000):
      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
      from_kuid(u0:k0:r4294967295, k1000) = u1000

      /* Map the userspace id down into a kernel id in the mount's idmapping. */
      make_kuid(u0:k10000:r10000, u1000) = k11000

3. Map the kernel id up into a userspace id in the caller's idmapping::

    from_kuid(u0:k10000:r10000, k11000) = u1000

Earlier, the caller's kernel id couldn't be crossmapped in the filesystem's
idmapping. With the idmapped mount in place it now can be crossmapped into the
filesystem's idmapping via the mount's idmapping. The file will now be created
with ``u1000`` according to the mount's idmapping.

Example 5 reconsidered
~~~~~~~~~~~~~~~~~~~~~~

::

 file id:              u1000
 caller idmapping:     u0:k10000:r10000
 filesystem idmapping: u0:k20000:r10000
 mount idmapping:      u0:k10000:r10000

Again, in order to report ownership to userspace the kernel now does three
steps using the translation algorithm we introduced earlier:

1.
   Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k20000:r10000, u1000) = k21000

2. Translate the kernel id into a kernel id in the mount's idmapping::

    i_uid_into_mnt(k21000):
      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
      from_kuid(u0:k20000:r10000, k21000) = u1000

      /* Map the userspace id down into a kernel id in the mount's idmapping. */
      make_kuid(u0:k10000:r10000, u1000) = k11000

3. Map the kernel id up into a userspace id in the caller's idmapping::

    from_kuid(u0:k10000:r10000, k11000) = u1000

Earlier, the file's kernel id couldn't be crossmapped in the filesystem's
idmapping. With the idmapped mount in place it now can be crossmapped into the
filesystem's idmapping via the mount's idmapping. The file is now owned by
``u1000`` according to the mount's idmapping.

Changing ownership on a home directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We've seen above how idmapped mounts can be used to translate between
idmappings when either the caller, the filesystem or both use a non-initial
idmapping. A wide range of use cases exist when the caller is using
a non-initial idmapping. This mostly happens in the context of containerized
workloads. The consequence, as we have seen, is that for both filesystems
mounted with the initial idmapping and filesystems mounted with non-initial
idmappings, access to the filesystem doesn't work because the kernel ids can't
be crossmapped between the caller's and the filesystem's idmapping.

As we've seen above idmapped mounts provide a solution to this by remapping the
caller's or filesystem's idmapping according to the mount's idmapping.
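
The translations in the reconsidered examples can be replayed with a small
userspace model. The following is only a sketch, written in Python rather than
kernel C: the ``Idmapping`` tuple and the argument order of ``mapped_fsuid()``
and ``i_uid_into_mnt()`` are assumptions made for illustration and do not match
the kernel's actual function signatures. Each idmapping is modeled as a single
``u:k:r`` extent and an unmapped id is modeled as ``None``.

```python
from typing import NamedTuple, Optional


class Idmapping(NamedTuple):
    """Hypothetical model of one u:k:r extent."""
    u: int  # first id in the upper idmapset
    k: int  # first id in the lower idmapset
    r: int  # number of ids that are mapped


def make_kuid(m: Idmapping, uid: Optional[int]) -> Optional[int]:
    """Map a userspace id down into a kernel id; None means unmapped."""
    if uid is None or not (m.u <= uid < m.u + m.r):
        return None
    return m.k + (uid - m.u)


def from_kuid(m: Idmapping, kid: Optional[int]) -> Optional[int]:
    """Map a kernel id up into a userspace id; None means unmapped."""
    if kid is None or not (m.k <= kid < m.k + m.r):
        return None
    return m.u + (kid - m.k)


def mapped_fsuid(mnt: Idmapping, fs: Idmapping, kid: Optional[int]) -> Optional[int]:
    """Caller's kernel id -> kernel id in the filesystem's idmapping."""
    return make_kuid(fs, from_kuid(mnt, kid))


def i_uid_into_mnt(mnt: Idmapping, fs: Idmapping, kid: Optional[int]) -> Optional[int]:
    """Filesystem's kernel id -> kernel id in the mount's idmapping."""
    return make_kuid(mnt, from_kuid(fs, kid))


caller = Idmapping(0, 10000, 10000)
fs = Idmapping(0, 20000, 10000)
mnt = Idmapping(0, 10000, 10000)

# Creating a file as u1000 (Example 2 reconsidered).
caller_kid = make_kuid(caller, 1000)          # k11000
fs_kid = mapped_fsuid(mnt, fs, caller_kid)    # k21000
on_disk = from_kuid(fs, fs_kid)               # u1000

# Reporting ownership of a file stored as u1000 (Example 5 reconsidered).
file_kid = make_kuid(fs, 1000)                # k21000
mnt_kid = i_uid_into_mnt(mnt, fs, file_kid)   # k11000
reported = from_kuid(caller, mnt_kid)         # u1000

# Without the idmapped mount the crossmapping fails: k11000 is unmapped
# in the filesystem's idmapping.
without_mount = from_kuid(fs, caller_kid)     # None

print(on_disk, reported, without_mount)
```

Swapping ``fs`` for ``Idmapping(0, 0, 4294967295)`` replays Examples 3 and 4
with the same two helpers, since only the filesystem's idmapping differs.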

Aside from containerized workloads, idmapped mounts have the advantage that
they also work when both the caller and the filesystem use the initial
idmapping which means users on the host can change the ownership of directories
and files on a per-mount basis.

Consider our previous example where a user has their home directory on portable
storage. At home they have id ``u1000`` and all files in their home directory
are owned by ``u1000`` whereas at uni or work they have login id ``u1125``.

Taking their home directory with them becomes problematic. They can't easily
access their files, they might not be able to write to disk without applying
lax permissions or ACLs and even if they can, they will end up with an annoying
mix of files and directories owned by ``u1000`` and ``u1125``.

Idmapped mounts make it possible to solve this problem. A user can create an
idmapped mount for their home directory on their work computer or their
computer at home depending on what ownership they would prefer to end up on the
portable storage itself.

Let's assume they want all files on disk to belong to ``u1000``. When the user
plugs in their portable storage at their workstation they can set up a job that
creates an idmapped mount with the minimal idmapping ``u1000:k1125:r1``. So now
when they create a file the kernel performs the following steps we already know
from above::

 caller id:            u1125
 caller idmapping:     u0:k0:r4294967295
 filesystem idmapping: u0:k0:r4294967295
 mount idmapping:      u1000:k1125:r1

1. Map the caller's userspace ids into kernel ids in the caller's idmapping::

    make_kuid(u0:k0:r4294967295, u1125) = k1125

2. Translate the caller's kernel id into a kernel id in the filesystem's
   idmapping::

    mapped_fsuid(k1125):
      /* Map the kernel id up into a userspace id in the mount's idmapping.
      */
      from_kuid(u1000:k1125:r1, k1125) = u1000

      /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
      make_kuid(u0:k0:r4294967295, u1000) = k1000

3. Verify that the caller's kernel ids can be mapped to userspace ids in the
   filesystem's idmapping::

    from_kuid(u0:k0:r4294967295, k1000) = u1000

So ultimately the file will be created with ``u1000`` on disk.

Now let's briefly look at what ownership the caller with id ``u1125`` will see
on their work computer:

::

 file id:              u1000
 caller idmapping:     u0:k0:r4294967295
 filesystem idmapping: u0:k0:r4294967295
 mount idmapping:      u1000:k1125:r1

1. Map the userspace id on disk down into a kernel id in the filesystem's
   idmapping::

    make_kuid(u0:k0:r4294967295, u1000) = k1000

2. Translate the kernel id into a kernel id in the mount's idmapping::

    i_uid_into_mnt(k1000):
      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
      from_kuid(u0:k0:r4294967295, k1000) = u1000

      /* Map the userspace id down into a kernel id in the mount's idmapping. */
      make_kuid(u1000:k1125:r1, u1000) = k1125

3. Map the kernel id up into a userspace id in the caller's idmapping::

    from_kuid(u0:k0:r4294967295, k1125) = u1125

So ultimately the caller will be told that the file belongs to ``u1125``
which is the caller's userspace id on their workstation in our example.

The raw userspace id that is put on disk is ``u1000`` so when the user takes
their home directory back to their home computer where they are assigned
``u1000`` using the initial idmapping and mount the filesystem with the initial
idmapping they will see all those files owned by ``u1000``.

Shortcircuiting
---------------

Currently, the implementation of idmapped mounts enforces that the filesystem
is mounted with the initial idmapping.
The reason is simply that none of the
filesystems that we targeted were mountable with a non-initial idmapping. But
that might change soon enough. As we've seen above, thanks to the properties of
idmappings the translation works for both filesystems mounted with the initial
idmapping and filesystems mounted with non-initial idmappings.

Based on this current restriction to filesystems mounted with the initial
idmapping two noticeable shortcuts have been taken:

1. We always stash a reference to the initial user namespace in ``struct
   vfsmount``. Idmapped mounts are thus mounts that have a non-initial user
   namespace attached to them.

   In order to support idmapped mounts on filesystems mounted with
   a non-initial idmapping this needs to be changed. Instead of stashing the
   initial user namespace the user namespace the filesystem was mounted with
   must be stashed. An idmapped mount is then any mount that has a different
   user namespace attached than the filesystem was mounted with. This has no
   user-visible consequences.

2. The translation algorithms in ``mapped_fs*id()`` and ``i_*id_into_mnt()``
   are simplified.

   Let's consider ``mapped_fs*id()`` first. This function translates the
   caller's kernel id into a kernel id in the filesystem's idmapping via
   a mount's idmapping. The full algorithm is::

    mapped_fsuid(kid):
      /* Map the kernel id up into a userspace id in the mount's idmapping. */
      from_kuid(mount-idmapping, kid) = uid

      /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
      make_kuid(filesystem-idmapping, uid) = kuid

   We know that the filesystem is always mounted with the initial idmapping as
   we enforce this in ``mount_setattr()``. So this can be shortened to::

    mapped_fsuid(kid):
      /* Map the kernel id up into a userspace id in the mount's idmapping.
      */
      from_kuid(mount-idmapping, kid) = uid

      /* Map the userspace id down into a kernel id in the filesystem's idmapping. */
      KUIDT_INIT(uid) = kuid

   Similarly, for ``i_*id_into_mnt()`` which translates the filesystem's kernel
   id into a mount's kernel id::

    i_uid_into_mnt(kid):
      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
      from_kuid(filesystem-idmapping, kid) = uid

      /* Map the userspace id down into a kernel id in the mount's idmapping. */
      make_kuid(mount-idmapping, uid) = kuid

   Again, we know that the filesystem is always mounted with the initial
   idmapping as we enforce this in ``mount_setattr()``. So this can be
   shortened to::

    i_uid_into_mnt(kid):
      /* Map the kernel id up into a userspace id in the filesystem's idmapping. */
      __kuid_val(kid) = uid

      /* Map the userspace id down into a kernel id in the mount's idmapping. */
      make_kuid(mount-idmapping, uid) = kuid

Handling filesystems mounted with non-initial idmappings requires that the
translation functions be converted to their full form. They can still be
shortcircuited on non-idmapped mounts. This has no user-visible consequences.
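
To see why the shortened forms are safe under the current restriction, the full
and shortened algorithms can be compared in a small userspace model. This is
a sketch under stated assumptions, not kernel code: it is written in Python,
``Idmapping`` and the ``*_full``/``*_short`` function names are hypothetical,
each idmapping is a single ``u:k:r`` extent, and ``KUIDT_INIT()`` and
``__kuid_val()`` are modeled as plain identity conversions. It reuses the
``u1000:k1125:r1`` mount idmapping from the home directory example.

```python
from typing import NamedTuple, Optional


class Idmapping(NamedTuple):
    """Hypothetical model of one u:k:r extent."""
    u: int
    k: int
    r: int


# The initial idmapping as written in this document.
INIT = Idmapping(0, 0, 4294967295)


def make_kuid(m: Idmapping, uid: Optional[int]) -> Optional[int]:
    """Map a userspace id down into a kernel id; None means unmapped."""
    if uid is None or not (m.u <= uid < m.u + m.r):
        return None
    return m.k + (uid - m.u)


def from_kuid(m: Idmapping, kid: Optional[int]) -> Optional[int]:
    """Map a kernel id up into a userspace id; None means unmapped."""
    if kid is None or not (m.k <= kid < m.k + m.r):
        return None
    return m.u + (kid - m.k)


def mapped_fsuid_full(mnt, fs, kid):
    return make_kuid(fs, from_kuid(mnt, kid))


def mapped_fsuid_short(mnt, kid):
    # KUIDT_INIT(uid) is modeled as the identity: the value is kept as-is.
    return from_kuid(mnt, kid)


def i_uid_into_mnt_full(mnt, fs, kid):
    return make_kuid(mnt, from_kuid(fs, kid))


def i_uid_into_mnt_short(mnt, kid):
    # __kuid_val(kid) is modeled as the identity: the value is kept as-is.
    return make_kuid(mnt, kid)


mnt = Idmapping(1000, 1125, 1)

# Compare both forms on a few sample ids, mapped and unmapped.
for kid in (0, 1000, 1125, 1126, 65534):
    assert mapped_fsuid_full(mnt, INIT, kid) == mapped_fsuid_short(mnt, kid)
    assert i_uid_into_mnt_full(mnt, INIT, kid) == i_uid_into_mnt_short(mnt, kid)

print(mapped_fsuid_short(mnt, 1125), i_uid_into_mnt_short(mnt, 1000))
```

The two forms agree on these sample ids only because the filesystem's
idmapping is the initial one. With any other filesystem idmapping passed to
the ``*_full`` variants the results diverge, which is why the full form is
needed once that restriction is lifted.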