======================
Kernel Self-Protection
======================

Kernel self-protection is the design and implementation of systems and
structures within the Linux kernel to protect against security flaws in
the kernel itself. This covers a wide range of issues, including removing
entire classes of bugs, blocking security flaw exploitation methods,
and actively detecting attack attempts. Not all topics are explored in
this document, but it should serve as a reasonable starting point and
answer any frequently asked questions. (Patches welcome, of course!)

In the worst-case scenario, we assume an unprivileged local attacker
has arbitrary read and write access to the kernel's memory. In many
cases, bugs being exploited will not provide this level of access,
but with systems in place that defend against the worst case we'll
cover the more limited cases as well. A higher bar, and one that should
still be kept in mind, is protecting the kernel against a *privileged*
local attacker, since the root user has access to a vastly increased
attack surface. (Especially when they have the ability to load arbitrary
kernel modules.)

The goals for successful self-protection systems would be that they
are effective, on by default, require no opt-in by developers, have no
performance impact, do not impede kernel debugging, and have tests.
It is uncommon that all these goals can be met, but it is worth explicitly
mentioning them, since these aspects need to be explored, dealt with,
and/or accepted.


Attack Surface Reduction
========================

The most fundamental defense against security exploits is to reduce the
areas of the kernel that can be used to redirect execution. This ranges
from limiting the exposed APIs available to userspace and making in-kernel
APIs hard to use incorrectly to minimizing the areas of writable kernel
memory.

Strict kernel memory permissions
--------------------------------

When all of kernel memory is writable, it becomes trivial for attacks
to redirect execution flow. To reduce the availability of these targets
the kernel needs to protect its memory with a tight set of permissions.

Executable code and read-only data must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Any areas of the kernel with executable memory must not be writable.
While this obviously includes the kernel text itself, we must consider
all additional places too: kernel modules, JIT memory, etc. (There are
temporary exceptions to this rule to support things like instruction
alternatives, breakpoints, kprobes, etc.
If these must exist in a
kernel, they are implemented in a way where the memory is temporarily
made writable during the update, and then returned to the original
permissions.)

In support of this are ``CONFIG_STRICT_KERNEL_RWX`` and
``CONFIG_STRICT_MODULE_RWX``, which seek to make sure that code is not
writable, data is not executable, and read-only data is neither writable
nor executable.

Most architectures have these options on by default and not user selectable.
For some architectures like arm that wish to have these be selectable,
the architecture Kconfig can select ARCH_OPTIONAL_KERNEL_RWX to enable
a Kconfig prompt. ``CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT`` determines
the default setting when ARCH_OPTIONAL_KERNEL_RWX is enabled.

Function pointers and sensitive variables must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Vast areas of kernel memory contain function pointers that are looked
up by the kernel and used to continue execution (e.g. descriptor/vector
tables, file/network/etc operation structures, etc). The number of these
variables must be reduced to an absolute minimum.

Many such variables can be made read-only by setting them "const"
so that they live in the .rodata section instead of the .data section
of the kernel, gaining the protection of the kernel's strict memory
permissions as described above.
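
As a minimal userspace sketch of the "const" approach, an operations
table can be declared ``static const`` so that the toolchain can place
its function pointers in read-only memory. The structure and function
names below (``demo_ops``, ``demo_fops``, etc.) are hypothetical and are
not taken from any kernel subsystem.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations structure, loosely modeled on kernel "ops"
 * tables; this is not a real kernel type. */
struct demo_ops {
	int (*open)(int id);
	int (*close)(int id);
};

static int demo_open(int id)  { return id + 1; }
static int demo_close(int id) { return id - 1; }

/* "const" allows the toolchain to place the table in a read-only
 * section (typically .rodata) instead of .data, so the function
 * pointers are covered by read-only memory permissions. */
static const struct demo_ops demo_fops = {
	.open  = demo_open,
	.close = demo_close,
};

/* Callers only receive a pointer-to-const, so an attempt to assign
 * through it is rejected at compile time. */
const struct demo_ops *demo_get_ops(void)
{
	return &demo_fops;
}

/* Dispatch through the read-only table. */
int demo_call_open(int id)
{
	const struct demo_ops *ops = demo_get_ops();

	assert(ops->open != NULL);
	return ops->open(id);
}
```

The design point is that nothing in the program ever needs to write to
``demo_fops`` after it is built, which is what makes it a candidate for
read-only placement in the first place.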

Variables that are initialized once at ``__init`` time can
be marked with the (new and under development) ``__ro_after_init``
attribute.

What remains are variables that are updated rarely (e.g. GDT). These
will need another infrastructure (similar to the temporary exceptions
made to kernel code mentioned above) that allows them to spend the rest
of their lifetime read-only. (For example, when being updated, only the
CPU thread performing the update would be given uninterruptible write
access to the memory.)

Segregation of kernel memory from userspace memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The kernel must never execute userspace memory. The kernel must also never
access userspace memory without explicit expectation to do so. These
rules can be enforced either by support of hardware-based restrictions
(x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing
attacks to operate entirely in kernel memory.

Reduced access to syscalls
--------------------------

One trivial way to eliminate many syscalls for 64-bit systems is building
without ``CONFIG_COMPAT``. However, this is rarely a feasible scenario.

The "seccomp" system provides an opt-in feature made available to
userspace, which provides a way to reduce the number of kernel entry
points available to a running process. This limits the breadth of kernel
code that can be reached, possibly reducing the availability of a given
bug to an attack.

An area of improvement would be creating viable ways to keep access to
things like compat, user namespaces, BPF creation, and perf limited only
to trusted processes. This would keep the scope of kernel entry points
restricted to the more regular set normally available to unprivileged
userspace.

Restricting access to kernel modules
------------------------------------

The kernel should never allow an unprivileged user the ability to
load specific kernel modules, since that would provide a facility to
unexpectedly extend the available attack surface. (The on-demand loading
of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
considered "expected" here, though additional consideration should be
given even to these.) For example, loading a filesystem module via an
unprivileged socket API is nonsense: only the root or physically local
user should trigger filesystem module loading. (And even this can be up
for debate in some scenarios.)

To protect against even privileged users, systems may need to either
disable module loading entirely (e.g.
monolithic kernel builds or
modules_disabled sysctl), or provide signed modules (e.g.
``CONFIG_MODULE_SIG_FORCE``, or dm-crypt with LoadPin), to keep from having
root load arbitrary kernel code via the module loader interface.


Memory integrity
================

There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack. By far the most commonly
understood is that of the stack buffer overflow in which the return
address stored on the stack is overwritten. Many other examples of this
kind of attack exist, and protections exist to defend against them.

Stack buffer overflow
---------------------

The classic stack buffer overflow involves writing past the expected end
of a variable stored on the stack, ultimately writing a controlled value
to the stack frame's stored return address. The most widely used defense
is the presence of a stack canary between the stack variables and the
return address (``CONFIG_STACKPROTECTOR``), which is verified just before
the function returns. Other defenses include things like shadow stacks.

Stack depth overflow
--------------------

A less well understood attack is using a bug that triggers the
kernel to consume stack memory with deep function calls or large stack
allocations.
With this attack it is possible to write beyond the end of
the kernel's preallocated stack space and into sensitive structures. Two
important changes need to be made for better protections: moving the
sensitive thread_info structure elsewhere, and adding a faulting memory
hole at the bottom of the stack to catch these overflows.

Heap memory integrity
---------------------

The structures used to track heap free lists can be sanity-checked during
allocation and freeing to make sure they aren't being used to manipulate
other memory areas.

Counter integrity
-----------------

Many places in the kernel use atomic counters to track object references
or perform similar lifetime management. When these counters can be made
to wrap (over or under) this traditionally exposes a use-after-free
flaw. By trapping atomic wrapping, this class of bug vanishes.

Size calculation overflow detection
-----------------------------------

Similar to counter overflow, integer overflows (usually size calculations)
need to be detected at runtime to kill this class of bug, which
traditionally leads to being able to write past the end of kernel buffers.


Probabilistic defenses
======================

While many protections can be considered deterministic (e.g.
read-only
memory cannot be written to), some protections provide only statistical
defense, in that an attack must gather enough information about a
running system to overcome the defense. While not perfect, these do
provide meaningful defenses.

Canaries, blinding, and other secrets
-------------------------------------

It should be noted that things like the stack canary discussed earlier
are technically statistical defenses, since they rely on a secret value,
and such values may become discoverable through an information exposure
flaw.

Blinding literal values for things like JITs, where the executable
contents may be partially under the control of userspace, needs a similar
secret value.

It is critical that the secret values used be separate (e.g.
different canary per stack) and high entropy (e.g. is the RNG actually
working?) in order to maximize their success.

Kernel Address Space Layout Randomization (KASLR)
-------------------------------------------------

Since the location of kernel memory is almost always instrumental in
mounting a successful attack, making the location non-deterministic
raises the difficulty of an exploit. (Note that this in turn makes
the value of information exposures higher, since they may be used to
discover desired memory locations.)

Text and module base
~~~~~~~~~~~~~~~~~~~~

By relocating the physical and virtual base address of the kernel at
boot-time (``CONFIG_RANDOMIZE_BASE``), attacks needing kernel code will be
frustrated. Additionally, offsetting the module loading base address
means that even systems that load the same set of modules in the same
order every boot will not share a common base address with the rest of
the kernel text.

Stack base
~~~~~~~~~~

If the base address of the kernel stack is not the same between processes,
or even not the same between syscalls, targets on or beyond the stack
become more difficult to locate.

Dynamic memory base
~~~~~~~~~~~~~~~~~~~

Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
being relatively deterministic in layout due to the order of early-boot
initializations. If the base address of these areas is not the same
between boots, targeting them is frustrated, requiring an information
exposure specific to the region.

Structure layout
~~~~~~~~~~~~~~~~

By performing a per-build randomization of the layout of sensitive
structures, attacks must either be tuned to known kernel builds or expose
enough kernel memory to determine structure layouts before manipulating
them.


Preventing Information Exposures
================================

Since the locations of sensitive structures are the primary target for
attacks, it is important to defend against exposure of both kernel memory
addresses and kernel memory contents (since they may contain kernel
addresses or other sensitive things like canary values).

Kernel addresses
----------------

Printing kernel addresses to userspace leaks sensitive information about
the kernel memory layout. Care should be exercised when using any printk
specifier that prints the raw address, currently %px, %p[ad], (and %p[sSb]
in certain circumstances [*]). Any file written to using one of these
specifiers should be readable only by privileged processes.

Kernels 4.14 and older printed the raw address using %p. As of 4.15-rc1
addresses printed with the specifier %p are hashed before printing.

[*] If KALLSYMS is enabled and symbol lookup fails, the raw address is
printed. If KALLSYMS is not enabled the raw address is printed.

Unique identifiers
------------------

Kernel memory addresses must never be used as identifiers exposed to
userspace. Instead, use an atomic counter, an idr, or similar unique
identifier.
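
A rough userspace illustration of the atomic-counter option follows. It
uses C11 atomics as a stand-in; in-kernel code would more likely use the
kernel's own atomic types or an idr. All names (``demo_object``,
``demo_next_id``, ``demo_object_init``) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical object type used only for this sketch. */
struct demo_object {
	uint64_t id;    /* identifier that may be shown to userspace */
	void *private;  /* internal pointer; must never be exposed */
};

/* Process-wide counter that hands out identifiers starting at 1. */
static atomic_uint_fast64_t demo_next_id = 1;

/* Assign an identifier from a monotonically increasing counter, so the
 * exposed value carries no information about the object's address. */
void demo_object_init(struct demo_object *obj, void *private)
{
	assert(obj != NULL);
	obj->id = atomic_fetch_add(&demo_next_id, 1);
	obj->private = private;
}
```

Because the identifier comes from a counter rather than from ``obj``
itself, two objects receive distinct identifiers without revealing
anything about memory layout.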

Memory initialization
---------------------

Memory copied to userspace must always be fully initialized. If not
explicitly memset(), this will require changes to the compiler to make
sure structure holes are cleared.

Memory poisoning
----------------

When releasing memory, it is best to poison the contents (clear stack on
syscall return, wipe heap memory on a free), to avoid reuse attacks that
rely on the old contents of memory. This frustrates many uninitialized
variable attacks, stack content exposures, heap content exposures, and
use-after-free attacks.

Destination tracking
--------------------

To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files),
it should automatically censor sensitive values.
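
The idea of censoring based on destination can be sketched in userspace
C as follows. This is only an illustration under assumed names
(``demo_dest``, ``demo_format_addr``); it does not reflect how the kernel
implements or would implement destination tracking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical tag describing where an output buffer will end up. */
enum demo_dest {
	DEMO_DEST_PRIVILEGED_LOG, /* read only by privileged consumers */
	DEMO_DEST_USERSPACE,      /* e.g. a /proc-style file */
};

/* Format an address into buf, replacing it with a fixed placeholder
 * when the buffer is destined for userspace. Returns the length that
 * snprintf() reports. */
int demo_format_addr(char *buf, size_t len, const void *addr,
		     enum demo_dest dest)
{
	assert(buf != NULL && len > 0);

	if (dest == DEMO_DEST_USERSPACE)
		return snprintf(buf, len, "(censored)");
	return snprintf(buf, len, "%p", addr);
}
```

Tying the decision to the destination, rather than to each call site,
means a caller cannot forget to censor a value when the same formatting
path is reused for a userspace-visible file.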