======================
Kernel Self-Protection
======================

Kernel self-protection is the design and implementation of systems and
structures within the Linux kernel to protect against security flaws in
the kernel itself. This covers a wide range of issues, including removing
entire classes of bugs, blocking security flaw exploitation methods,
and actively detecting attack attempts. Not all topics are explored in
this document, but it should serve as a reasonable starting point and
answer any frequently asked questions. (Patches welcome, of course!)

In the worst-case scenario, we assume an unprivileged local attacker
has arbitrary read and write access to the kernel's memory. In many
cases, bugs being exploited will not provide this level of access,
but with systems in place that defend against the worst case we'll
cover the more limited cases as well. A higher bar, and one that should
still be kept in mind, is protecting the kernel against a _privileged_
local attacker, since the root user has access to a vastly increased
attack surface. (Especially when they have the ability to load arbitrary
kernel modules.)

The goals for successful self-protection systems would be that they
are effective, on by default, require no opt-in by developers, have no
performance impact, do not impede kernel debugging, and have tests.
It is uncommon that all these goals can be met, but it is worth explicitly
mentioning them, since these aspects need to be explored, dealt with,
and/or accepted.


Attack Surface Reduction
========================

The most fundamental defense against security exploits is to reduce the
areas of the kernel that can be used to redirect execution. This ranges
from limiting the exposed APIs available to userspace, making in-kernel
APIs hard to use incorrectly, minimizing the areas of writable kernel
memory, etc.

Strict kernel memory permissions
--------------------------------

When all of kernel memory is writable, it becomes trivial for attacks
to redirect execution flow. To reduce the availability of these targets
the kernel needs to protect its memory with a tight set of permissions.

Executable code and read-only data must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Any areas of the kernel with executable memory must not be writable.
While this obviously includes the kernel text itself, we must consider
all additional places too: kernel modules, JIT memory, etc. (There are
temporary exceptions to this rule to support things like instruction
alternatives, breakpoints, kprobes, etc.
If these must exist in a kernel, they are implemented in a way where
the memory is temporarily made writable during the update, and then
returned to the original permissions.)

In support of this are ``CONFIG_STRICT_KERNEL_RWX`` and
``CONFIG_STRICT_MODULE_RWX``, which seek to make sure that code is not
writable, data is not executable, and read-only data is neither writable
nor executable.

Most architectures have these options on by default and not user selectable.
For some architectures like arm that wish to have these be selectable,
the architecture Kconfig can select ARCH_OPTIONAL_KERNEL_RWX to enable
a Kconfig prompt. ``CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT`` determines
the default setting when ARCH_OPTIONAL_KERNEL_RWX is enabled.

Function pointers and sensitive variables must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Vast areas of kernel memory contain function pointers that are looked
up by the kernel and used to continue execution (e.g. descriptor/vector
tables, file/network/etc operation structures, etc). The number of these
variables must be reduced to an absolute minimum.

Many such variables can be made read-only by setting them "const"
so that they live in the .rodata section instead of the .data section
of the kernel, gaining the protection of the kernel's strict memory
permissions as described above.
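
As a minimal illustration of the "const" approach, the following is a
userspace sketch rather than kernel code. The names ``demo_ops``,
``demo_fops``, ``demo_open``, ``demo_close`` and ``demo_dispatch`` are
hypothetical and exist only for this example. It assumes a typical
toolchain, which places a ``static const`` object in read-only data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations table, loosely modelled on kernel *_operations structs. */
struct demo_ops {
	int (*open)(int id);
	int (*close)(int id);
};

static int demo_open(int id)  { return id + 1; }
static int demo_close(int id) { return id - 1; }

/*
 * "const" allows the toolchain to place the table in read-only data
 * instead of .data. An assignment such as "demo_fops.open = NULL;" is
 * rejected at compile time.
 */
static const struct demo_ops demo_fops = {
	.open  = demo_open,
	.close = demo_close,
};

/* Callers only receive a pointer to const, so they cannot modify the table. */
static int demo_dispatch(const struct demo_ops *ops, int id)
{
	assert(ops != NULL && ops->open != NULL);
	return ops->open(id);
}
```

Calling ``demo_dispatch(&demo_fops, 41)`` goes through the read-only table
to reach ``demo_open``. The compiler only rejects direct assignments. It is
the strict memory permissions described above that stop a stray write
through an attacker-controlled pointer.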

For variables that are initialized once at ``__init`` time, these can
be marked with the ``__ro_after_init`` attribute.

What remains are variables that are updated rarely (e.g. GDT). These
will need another infrastructure (similar to the temporary exceptions
made to kernel code mentioned above) that allows them to spend the rest
of their lifetime read-only. (For example, when being updated, only the
CPU thread performing the update would be given uninterruptible write
access to the memory.)

Segregation of kernel memory from userspace memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The kernel must never execute userspace memory. The kernel must also never
access userspace memory without explicit expectation to do so. These
rules can be enforced either by support of hardware-based restrictions
(x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing
attacks to operate entirely in kernel memory.

Reduced access to syscalls
--------------------------

One trivial way to eliminate many syscalls for 64-bit systems is building
without ``CONFIG_COMPAT``. However, this is rarely a feasible scenario.

The "seccomp" system provides an opt-in feature made available to
userspace, which provides a way to reduce the number of kernel entry
points available to a running process. This limits the breadth of kernel
code that can be reached, possibly reducing the availability of a given
bug to an attack.

An area of improvement would be creating viable ways to keep access to
things like compat, user namespaces, BPF creation, and perf limited only
to trusted processes. This would keep the scope of kernel entry points
restricted to the more regular set of entry points normally available to
unprivileged userspace.

Restricting access to kernel modules
------------------------------------

The kernel should never allow an unprivileged user the ability to
load specific kernel modules, since that would provide a facility to
unexpectedly extend the available attack surface. (The on-demand loading
of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
considered "expected" here, though additional consideration should be
given even to these.) For example, loading a filesystem module via an
unprivileged socket API is nonsense: only the root or physically local
user should trigger filesystem module loading. (And even this can be up
for debate in some scenarios.)

To protect against even privileged users, systems may need to either
disable module loading entirely (e.g.
monolithic kernel builds or
modules_disabled sysctl), or provide signed modules (e.g.
``CONFIG_MODULE_SIG_FORCE``, or dm-crypt with LoadPin), to keep from having
root load arbitrary kernel code via the module loader interface.


Memory integrity
================

There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack. By far the most commonly
understood is that of the stack buffer overflow in which the return
address stored on the stack is overwritten. Many other examples of this
kind of attack exist, and protections exist to defend against them.

Stack buffer overflow
---------------------

The classic stack buffer overflow involves writing past the expected end
of a variable stored on the stack, ultimately writing a controlled value
to the stack frame's stored return address. The most widely used defense
is the presence of a stack canary between the stack variables and the
return address (``CONFIG_STACKPROTECTOR``), which is verified just before
the function returns. Other defenses include things like shadow stacks.

Stack depth overflow
--------------------

A less well understood attack is using a bug that triggers the
kernel to consume stack memory with deep function calls or large stack
allocations.
With this attack it is possible to write beyond the end of
the kernel's preallocated stack space and into sensitive structures. Two
important changes need to be made for better protections: moving the
sensitive thread_info structure elsewhere, and adding a faulting memory
hole at the bottom of the stack to catch these overflows.

Heap memory integrity
---------------------

The structures used to track heap free lists can be sanity-checked during
allocation and freeing to make sure they aren't being used to manipulate
other memory areas.

Counter integrity
-----------------

Many places in the kernel use atomic counters to track object references
or perform similar lifetime management. When these counters can be made
to wrap (over or under) this traditionally exposes a use-after-free
flaw. By trapping atomic wrapping, this class of bug vanishes.

Size calculation overflow detection
-----------------------------------

Similar to counter overflow, integer overflows (usually size calculations)
need to be detected at runtime to kill this class of bug, which
traditionally leads to being able to write past the end of kernel buffers.


Probabilistic defenses
======================

While many protections can be considered deterministic (e.g.
read-only
memory cannot be written to), some protections provide only statistical
defense, in that an attack must gather enough information about a
running system to overcome the defense. While not perfect, these do
provide meaningful defenses.

Canaries, blinding, and other secrets
-------------------------------------

It should be noted that things like the stack canary discussed earlier
are technically statistical defenses, since they rely on a secret value,
and such values may become discoverable through an information exposure
flaw.

Blinding literal values for things like JITs, where the executable
contents may be partially under the control of userspace, needs a similar
secret value.

It is critical that the secret values used must be separate (e.g.
different canary per stack) and high entropy (e.g. is the RNG actually
working?) in order to maximize their success.

Kernel Address Space Layout Randomization (KASLR)
-------------------------------------------------

Since the location of kernel memory is almost always instrumental in
mounting a successful attack, making the location non-deterministic
raises the difficulty of an exploit. (Note that this in turn makes
the value of information exposures higher, since they may be used to
discover desired memory locations.)

Text and module base
~~~~~~~~~~~~~~~~~~~~

By relocating the physical and virtual base address of the kernel at
boot-time (``CONFIG_RANDOMIZE_BASE``), attacks needing kernel code will be
frustrated. Additionally, offsetting the module loading base address
means that even systems that load the same set of modules in the same
order every boot will not share a common base address with the rest of
the kernel text.

Stack base
~~~~~~~~~~

If the base address of the kernel stack is not the same between processes,
or even not the same between syscalls, targets on or beyond the stack
become more difficult to locate.

Dynamic memory base
~~~~~~~~~~~~~~~~~~~

Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
being relatively deterministic in layout due to the order of early-boot
initializations. If the base address of these areas is not the same
between boots, targeting them is frustrated, requiring an information
exposure specific to the region.

Structure layout
~~~~~~~~~~~~~~~~

By performing a per-build randomization of the layout of sensitive
structures, attacks must either be tuned to known kernel builds or expose
enough kernel memory to determine structure layouts before manipulating
them.


Preventing Information Exposures
================================

Since the locations of sensitive structures are the primary target for
attacks, it is important to defend against exposure of both kernel memory
addresses and kernel memory contents (since they may contain kernel
addresses or other sensitive things like canary values).

Kernel addresses
----------------

Printing kernel addresses to userspace leaks sensitive information about
the kernel memory layout. Care should be exercised when using any printk
specifier that prints the raw address, currently %px, %p[ad], (and %p[sSb]
in certain circumstances [*]). Any file written to using one of these
specifiers should be readable only by privileged processes.

Kernels 4.14 and older printed the raw address using %p. As of 4.15-rc1
addresses printed with the specifier %p are hashed before printing.

[*] If KALLSYMS is enabled and symbol lookup fails, the raw address is
printed. If KALLSYMS is not enabled the raw address is printed.

Unique identifiers
------------------

Kernel memory addresses must never be used as identifiers exposed to
userspace. Instead, use an atomic counter, an idr, or similar unique
identifier.
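
The atomic counter option can be sketched as follows. This is a userspace
illustration using C11 atomics, not kernel code. The names ``demo_object``,
``demo_next_id`` and ``demo_object_init`` are hypothetical, and real kernel
code would use the kernel's own atomic or idr helpers instead of
``<stdatomic.h>``.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical object that needs an identifier visible to userspace. */
struct demo_object {
	uint64_t id;	/* opaque handle, safe to report to userspace */
	int payload;
};

/* Monotonic counter; starts at 1 so that 0 can mean "no identifier". */
static atomic_uint_fast64_t demo_next_id = 1;

/* Assign an identifier that reveals nothing about where the object lives. */
static void demo_object_init(struct demo_object *obj, int payload)
{
	/* atomic_fetch_add() returns the value held before the increment. */
	obj->id = atomic_fetch_add(&demo_next_id, 1);
	obj->payload = payload;
}
```

Identifiers handed out this way are unique for the lifetime of the counter
and carry no information about the memory layout. Reporting the address of
``obj`` instead would disclose a kernel pointer.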

Memory initialization
---------------------

Memory copied to userspace must always be fully initialized. If not
explicitly memset(), this will require changes to the compiler to make
sure structure holes are cleared.

Memory poisoning
----------------

When releasing memory, it is best to poison the contents, to avoid reuse
attacks that rely on the old contents of memory. E.g., clear stack on a
syscall return (``CONFIG_GCC_PLUGIN_STACKLEAK``), wipe heap memory on a
free. This frustrates many uninitialized variable attacks, stack content
exposures, heap content exposures, and use-after-free attacks.

Destination tracking
--------------------

To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files),
it should automatically censor sensitive values.