======================
Kernel Self-Protection
======================

Kernel self-protection is the design and implementation of systems and
structures within the Linux kernel to protect against security flaws in
the kernel itself. This covers a wide range of issues, including removing
entire classes of bugs, blocking security flaw exploitation methods,
and actively detecting attack attempts. Not all topics are explored in
this document, but it should serve as a reasonable starting point and
answer any frequently asked questions. (Patches welcome, of course!)

In the worst-case scenario, we assume an unprivileged local attacker
has arbitrary read and write access to the kernel's memory. In many
cases, bugs being exploited will not provide this level of access,
but with systems in place that defend against the worst case we'll
cover the more limited cases as well. A higher bar, and one that should
still be kept in mind, is protecting the kernel against a _privileged_
local attacker, since the root user has access to a vastly increased
attack surface. (Especially when they have the ability to load arbitrary
kernel modules.)

The goals for successful self-protection systems would be that they
are effective, on by default, require no opt-in by developers, have no
performance impact, do not impede kernel debugging, and have tests. It
is uncommon that all these goals can be met, but it is worth explicitly
mentioning them, since these aspects need to be explored, dealt with,
and/or accepted.


Attack Surface Reduction
========================

The most fundamental defense against security exploits is to reduce the
areas of the kernel that can be used to redirect execution. This ranges
from limiting the exposed APIs available to userspace, making in-kernel
APIs hard to use incorrectly, minimizing the areas of writable kernel
memory, etc.

Strict kernel memory permissions
--------------------------------

When all of kernel memory is writable, it becomes trivial for attacks
to redirect execution flow. To reduce the availability of these targets
the kernel needs to protect its memory with a tight set of permissions.

Executable code and read-only data must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Any areas of the kernel with executable memory must not be writable.
While this obviously includes the kernel text itself, we must consider
all additional places too: kernel modules, JIT memory, etc. (There are
temporary exceptions to this rule to support things like instruction
alternatives, breakpoints, kprobes, etc. If these must exist in a
kernel, they are implemented in a way where the memory is temporarily
made writable during the update, and then returned to the original
permissions.)

In support of this are ``CONFIG_STRICT_KERNEL_RWX`` and
``CONFIG_STRICT_MODULE_RWX``, which seek to make sure that code is not
writable, data is not executable, and read-only data is neither writable
nor executable.

Most architectures have these options on by default and not user selectable.
For some architectures like arm that wish to have these be selectable,
the architecture Kconfig can select ARCH_OPTIONAL_KERNEL_RWX to enable
a Kconfig prompt. ``CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT`` determines
the default setting when ARCH_OPTIONAL_KERNEL_RWX is enabled.

Function pointers and sensitive variables must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Vast areas of kernel memory contain function pointers that are looked
up by the kernel and used to continue execution (e.g. descriptor/vector
tables, file/network/etc operation structures, etc). The number of these
variables must be reduced to an absolute minimum.
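
To make the target concrete, the following userspace C fragment is a
minimal sketch of an operations structure of the kind described above.
``struct demo_ops`` and every other ``demo_`` name are invented for
illustration and are not kernel APIs. The instance is declared ``const``,
which is one way to take such a table out of writable memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations table, loosely modelled on kernel "ops" structs. */
struct demo_ops {
    int (*open)(int id);
    int (*close)(int id);
};

static int demo_open(int id)  { return id + 1; }
static int demo_close(int id) { return id - 1; }

/*
 * "const" asks the toolchain to place the table in read-only data
 * rather than writable data, so its function pointers cannot be
 * redirected by a stray or malicious write.
 */
static const struct demo_ops demo_file_ops = {
    .open  = demo_open,
    .close = demo_close,
};

/* Callers only receive a pointer-to-const and cannot modify the table. */
static int demo_dispatch_open(const struct demo_ops *ops, int id)
{
    assert(ops != NULL && ops->open != NULL);
    return ops->open(id);
}
```

If the table were not ``const``, any write primitive that reached it could
replace ``open`` with an attacker-chosen address, and the next call through
``demo_dispatch_open()`` would continue execution there.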

Many such variables can be made read-only by setting them "const"
so that they live in the .rodata section instead of the .data section
of the kernel, gaining the protection of the kernel's strict memory
permissions as described above.

For variables that are initialized once at ``__init`` time, these can
be marked with the (new and under development) ``__ro_after_init``
attribute.

What remains are variables that are updated rarely (e.g. GDT). These
will need another infrastructure (similar to the temporary exceptions
made to kernel code mentioned above) that allows them to spend the rest
of their lifetime read-only. (For example, when being updated, only the
CPU thread performing the update would be given uninterruptible write
access to the memory.)

Segregation of kernel memory from userspace memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The kernel must never execute userspace memory. The kernel must also never
access userspace memory without explicit expectation to do so. These
rules can be enforced either by support of hardware-based restrictions
(x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing
attacks to operate entirely in kernel memory.

Reduced access to syscalls
--------------------------

One trivial way to eliminate many syscalls for 64-bit systems is building
without ``CONFIG_COMPAT``. However, this is rarely a feasible scenario.

The "seccomp" system provides an opt-in feature made available to
userspace, which provides a way to reduce the number of kernel entry
points available to a running process. This limits the breadth of kernel
code that can be reached, possibly reducing the availability of a given
bug to an attack.
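
As a rough illustration of the opt-in nature of seccomp, the sketch below
installs a filter from userspace that makes one syscall fail with
``EPERM`` while allowing all others. The helper name
``demo_deny_getppid()`` and the choice of ``getppid`` are assumptions made
only for this example. For brevity the filter does not check the
architecture field of ``struct seccomp_data``, which a production filter
should do.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: install a seccomp filter that makes getppid(2)
 * fail with EPERM and allows every other syscall.
 * Returns 0 on success, -1 if the kernel refused the request.
 */
static int demo_deny_getppid(void)
{
    struct sock_filter insns[] = {
        /* Load the syscall number from struct seccomp_data. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        /* If it is getppid, fall through to the ERRNO return... */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        /* ...otherwise allow the syscall. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(insns) / sizeof(insns[0]),
        .filter = insns,
    };

    /* Needed so that an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
        return -1;
    return 0;
}
```

Once the filter is in place it cannot be removed by the process, and any
kernel code reachable only through the denied entry point is no longer
available to it.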

An area of improvement would be creating viable ways to keep access to
things like compat, user namespaces, BPF creation, and perf limited only
to trusted processes. This would keep the scope of kernel entry points
restricted to the more regular set of entry points normally available to
unprivileged userspace.

Restricting access to kernel modules
------------------------------------

The kernel should never allow an unprivileged user the ability to
load specific kernel modules, since that would provide a facility to
unexpectedly extend the available attack surface. (The on-demand loading
of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
considered "expected" here, though additional consideration should be
given even to these.) For example, loading a filesystem module via an
unprivileged socket API is nonsense: only the root or physically local
user should trigger filesystem module loading. (And even this can be up
for debate in some scenarios.)

To protect against even privileged users, systems may need to either
disable module loading entirely (e.g. monolithic kernel builds or
modules_disabled sysctl), or provide signed modules (e.g.
``CONFIG_MODULE_SIG_FORCE``, or dm-crypt with LoadPin), to keep from having
root load arbitrary kernel code via the module loader interface.


Memory integrity
================

There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack. By far the most commonly
understood is that of the stack buffer overflow in which the return
address stored on the stack is overwritten. Many other examples of this
kind of attack exist, and protections exist to defend against them.

Stack buffer overflow
---------------------

The classic stack buffer overflow involves writing past the expected end
of a variable stored on the stack, ultimately writing a controlled value
to the stack frame's stored return address. The most widely used defense
is the presence of a stack canary between the stack variables and the
return address (``CONFIG_CC_STACKPROTECTOR``), which is verified just before
the function returns. Other defenses include things like shadow stacks.

Stack depth overflow
--------------------

A less well understood attack is using a bug that triggers the
kernel to consume stack memory with deep function calls or large stack
allocations. With this attack it is possible to write beyond the end of
the kernel's preallocated stack space and into sensitive structures. Two
important changes need to be made for better protections: moving the
sensitive thread_info structure elsewhere, and adding a faulting memory
hole at the bottom of the stack to catch these overflows.

Heap memory integrity
---------------------

The structures used to track heap free lists can be sanity-checked during
allocation and freeing to make sure they aren't being used to manipulate
other memory areas.

Counter integrity
-----------------

Many places in the kernel use atomic counters to track object references
or perform similar lifetime management. When these counters can be made
to wrap (over or under) this traditionally exposes a use-after-free
flaw. By trapping atomic wrapping, this class of bug vanishes.

Size calculation overflow detection
-----------------------------------

Similar to counter overflow, integer overflows (usually size calculations)
need to be detected at runtime to kill this class of bug, which
traditionally leads to being able to write past the end of kernel buffers.
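
The following is a minimal userspace sketch of runtime overflow detection
for a size calculation. It assumes a GCC- or Clang-compatible compiler for
the ``__builtin_mul_overflow`` and ``__builtin_add_overflow`` intrinsics.
The helper names ``demo_checked_size()`` and ``demo_alloc_array()`` are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: compute count * elem_size + header and report
 * overflow instead of silently wrapping.
 */
static bool demo_checked_size(size_t count, size_t elem_size, size_t header,
                              size_t *out)
{
    size_t bytes;

    if (__builtin_mul_overflow(count, elem_size, &bytes))
        return false;
    if (__builtin_add_overflow(bytes, header, out))
        return false;
    return true;
}

/* Allocate only when the size calculation is known not to have wrapped. */
static void *demo_alloc_array(size_t count, size_t elem_size, size_t header)
{
    size_t total;

    if (!demo_checked_size(count, elem_size, header, &total))
        return NULL;    /* refuse rather than under-allocate */
    return malloc(total);
}
```

Without the check, a wrapped product would yield a buffer far smaller than
the caller expects, and later writes sized by ``count`` would run past its
end.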


Probabilistic defenses
======================

While many protections can be considered deterministic (e.g. read-only
memory cannot be written to), some protections provide only statistical
defense, in that an attack must gather enough information about a
running system to overcome the defense. While not perfect, these do
provide meaningful defenses.

Canaries, blinding, and other secrets
-------------------------------------

It should be noted that things like the stack canary discussed earlier
are technically statistical defenses, since they rely on a secret value,
and such values may become discoverable through an information exposure
flaw.

Blinding literal values for things like JITs, where the executable
contents may be partially under the control of userspace, needs a similar
secret value.

It is critical that the secret values used be separate (e.g.
different canary per stack) and high entropy (e.g. is the RNG actually
working?) in order to maximize their success.

Kernel Address Space Layout Randomization (KASLR)
-------------------------------------------------

Since the location of kernel memory is almost always instrumental in
mounting a successful attack, making the location non-deterministic
raises the difficulty of an exploit. (Note that this in turn makes
the value of information exposures higher, since they may be used to
discover desired memory locations.)

Text and module base
~~~~~~~~~~~~~~~~~~~~

By relocating the physical and virtual base address of the kernel at
boot-time (``CONFIG_RANDOMIZE_BASE``), attacks needing kernel code will be
frustrated. Additionally, offsetting the module loading base address
means that even systems that load the same set of modules in the same
order every boot will not share a common base address with the rest of
the kernel text.

Stack base
~~~~~~~~~~

If the base address of the kernel stack is not the same between processes,
or even not the same between syscalls, targets on or beyond the stack
become more difficult to locate.

Dynamic memory base
~~~~~~~~~~~~~~~~~~~

Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
being relatively deterministic in layout due to the order of early-boot
initializations. If the base address of these areas is not the same
between boots, targeting them is frustrated, requiring an information
exposure specific to the region.

Structure layout
~~~~~~~~~~~~~~~~

By performing a per-build randomization of the layout of sensitive
structures, attacks must either be tuned to known kernel builds or expose
enough kernel memory to determine structure layouts before manipulating
them.


Preventing Information Exposures
================================

Since the locations of sensitive structures are the primary target for
attacks, it is important to defend against exposure of both kernel memory
addresses and kernel memory contents (since they may contain kernel
addresses or other sensitive things like canary values).

Kernel addresses
----------------

Printing kernel addresses to userspace leaks sensitive information about
the kernel memory layout. Care should be exercised when using any printk
specifier that prints the raw address, currently %px, %p[ad] (and %p[sSb]
in certain circumstances [*]). Any file written to using one of these
specifiers should be readable only by privileged processes.

Kernels 4.14 and older printed the raw address using %p. As of 4.15-rc1
addresses printed with the specifier %p are hashed before printing.

[*] If KALLSYMS is enabled and symbol lookup fails, the raw address is
printed. If KALLSYMS is not enabled the raw address is printed.
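
The idea of hashing an address before printing it can be pictured with the
userspace sketch below. It is not the kernel's implementation: the mixing
function is a simple stand-in chosen for brevity, the fixed key is an
assumption (a real key would be random and set once at boot), and the
``demo_`` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: in a real system this key is random and set once at boot. */
static uint64_t demo_ptr_key = 0x9e3779b97f4a7c15ULL;

/*
 * Hypothetical stand-in for a keyed hash: mixes the address with the
 * secret key so the printed value is stable but is not the raw address.
 * This is not cryptographically strong; it only shows the shape of the
 * technique.
 */
static uint32_t demo_hash_ptr(const void *ptr)
{
    uint64_t x = (uint64_t)(uintptr_t)ptr ^ demo_ptr_key;

    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return (uint32_t)x;
}

/* Format an identifier for the pointer without revealing its address. */
static int demo_format_ptr(char *buf, size_t len, const void *ptr)
{
    return snprintf(buf, len, "%08x", (unsigned int)demo_hash_ptr(ptr));
}
```

The same pointer always produces the same string, so log lines can still be
correlated, but the output no longer discloses where the object lives.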

Unique identifiers
------------------

Kernel memory addresses must never be used as identifiers exposed to
userspace. Instead, use an atomic counter, an idr, or similar unique
identifier.

Memory initialization
---------------------

Memory copied to userspace must always be fully initialized. If not
explicitly memset(), this will require changes to the compiler to make
sure structure holes are cleared.

Memory poisoning
----------------

When releasing memory, it is best to poison the contents (clear stack on
syscall return, wipe heap memory on a free), to avoid reuse attacks that
rely on the old contents of memory. This frustrates many uninitialized
variable attacks, stack content exposures, heap content exposures, and
use-after-free attacks.

Destination tracking
--------------------

To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files),
it should automatically censor sensitive values.
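
One possible shape for destination tracking is sketched below in userspace
C. The output buffer carries a tag saying where its contents will end up,
and the routine that emits addresses consults that tag. ``struct demo_out``
and ``demo_emit_addr()`` are hypothetical names, and printing zero is just
one assumed way of censoring the value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical output buffer that records where its contents will go. */
struct demo_out {
    char buf[64];
    bool to_userspace;  /* destination tag, set when the buffer is created */
};

/*
 * Emit an address into the buffer. When the destination is userspace
 * the value is censored (printed as zero) so the real address never
 * leaves.
 */
static int demo_emit_addr(struct demo_out *out, const void *addr)
{
    uintptr_t shown = out->to_userspace ? 0 : (uintptr_t)addr;

    return snprintf(out->buf, sizeof(out->buf), "0x%016llx",
                    (unsigned long long)shown);
}
```

Because the decision lives in the emitting routine rather than at each call
site, a caller that forgets the address is sensitive still cannot expose it
through a buffer tagged for userspace.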