======================
Kernel Self-Protection
======================

Kernel self-protection is the design and implementation of systems and
structures within the Linux kernel to protect against security flaws in
the kernel itself. This covers a wide range of issues, including removing
entire classes of bugs, blocking security flaw exploitation methods,
and actively detecting attack attempts. Not all topics are explored in
this document, but it should serve as a reasonable starting point and
answer any frequently asked questions. (Patches welcome, of course!)

In the worst-case scenario, we assume an unprivileged local attacker
has arbitrary read and write access to the kernel's memory. In many
cases, bugs being exploited will not provide this level of access,
but with systems in place that defend against the worst case we'll
cover the more limited cases as well. A higher bar, and one that should
still be kept in mind, is protecting the kernel against a _privileged_
local attacker, since the root user has access to a vastly increased
attack surface. (Especially when they have the ability to load arbitrary
kernel modules.)

The goals for successful self-protection systems would be that they
are effective, on by default, require no opt-in by developers, have no
performance impact, do not impede kernel debugging, and have tests. It
is uncommon that all these goals can be met, but it is worth explicitly
mentioning them, since these aspects need to be explored, dealt with,
and/or accepted.


Attack Surface Reduction
========================

The most fundamental defense against security exploits is to reduce the
areas of the kernel that can be used to redirect execution. This ranges
from limiting the exposed APIs available to userspace, making in-kernel
APIs hard to use incorrectly, minimizing the areas of writable kernel
memory, etc.

Strict kernel memory permissions
--------------------------------

When all of kernel memory is writable, it becomes trivial for attacks
to redirect execution flow. To reduce the availability of these targets
the kernel needs to protect its memory with a tight set of permissions.

Executable code and read-only data must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Any areas of the kernel with executable memory must not be writable.
While this obviously includes the kernel text itself, we must consider
all additional places too: kernel modules, JIT memory, etc. (There are
temporary exceptions to this rule to support things like instruction
alternatives, breakpoints, kprobes, etc. If these must exist in a
kernel, they are implemented in a way where the memory is temporarily
made writable during the update, and then returned to the original
permissions.)

In support of this are ``CONFIG_STRICT_KERNEL_RWX`` and
``CONFIG_STRICT_MODULE_RWX``, which seek to make sure that code is not
writable, data is not executable, and read-only data is neither writable
nor executable.

Most architectures enable these options by default and do not make them
user selectable. For architectures such as arm that want them to be
selectable, the architecture Kconfig can select
``ARCH_OPTIONAL_KERNEL_RWX`` to enable a Kconfig prompt.
``CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT`` determines the default
setting when ``ARCH_OPTIONAL_KERNEL_RWX`` is enabled.

Function pointers and sensitive variables must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Vast areas of kernel memory contain function pointers that are looked
up by the kernel and used to continue execution (e.g. descriptor/vector
tables, file/network/etc operation structures, etc). The number of these
variables must be reduced to an absolute minimum.

Many such variables can be made read-only by setting them "const"
so that they live in the .rodata section instead of the .data section
of the kernel, gaining the protection of the kernel's strict memory
permissions as described above.

Variables that are initialized once at ``__init`` time can be marked
with the ``__ro_after_init`` attribute.
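As a rough illustration of the "const" approach, the following userspace
sketch declares a table of function pointers as ``const`` so that the
toolchain can place it in read-only data. The names ``demo_ops``,
``demo_fops``, and the ``demo_*`` functions are hypothetical and exist only
for this example. ``__ro_after_init`` is a kernel-only annotation and is
therefore not shown in runnable form.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations structure, loosely modeled on the kernel's
 * file/network operation tables. */
struct demo_ops {
    int (*open)(int id);
    int (*close)(int id);
};

int demo_open(int id)  { return id + 1; }
int demo_close(int id) { return id - 1; }

/*
 * "const" allows the table to be placed in .rodata instead of .data, so
 * it is covered by read-only memory permissions where those are enforced.
 */
static const struct demo_ops demo_fops = {
    .open  = demo_open,
    .close = demo_close,
};

/* Callers only receive a pointer-to-const, so an accidental
 * "ops->open = ..." assignment is rejected at compile time. */
const struct demo_ops *demo_get_ops(void)
{
    return &demo_fops;
}

int demo_call_open(const struct demo_ops *ops, int id)
{
    assert(ops != NULL && ops->open != NULL);
    return ops->open(id);
}
```

Calling ``demo_call_open(demo_get_ops(), 41)`` dispatches through the
read-only table. The design point is that the table has no writer after
build time, so nothing is lost by making it immutable.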

What remains are variables that are updated rarely (e.g. GDT). These
will need another infrastructure (similar to the temporary exceptions
made to kernel code mentioned above) that allows them to spend the rest
of their lifetime read-only. (For example, when being updated, only the
CPU thread performing the update would be given uninterruptible write
access to the memory.)

Segregation of kernel memory from userspace memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The kernel must never execute userspace memory. The kernel must also never
access userspace memory without explicit expectation to do so. These
rules can be enforced either by support of hardware-based restrictions
(x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing
attacks to operate entirely in kernel memory.

Reduced access to syscalls
--------------------------

One trivial way to eliminate many syscalls for 64-bit systems is building
without ``CONFIG_COMPAT``. However, this is rarely a feasible scenario.

The "seccomp" system provides an opt-in feature made available to
userspace, which provides a way to reduce the number of kernel entry
points available to a running process. This limits the breadth of kernel
code that can be reached, possibly reducing the availability of a given
bug to an attack.

An area of improvement would be creating viable ways to keep access to
things like compat, user namespaces, BPF creation, and perf limited only
to trusted processes. This would keep the scope of kernel entry points
restricted to the more regular set normally available to unprivileged
userspace.

Restricting access to kernel modules
------------------------------------

The kernel should never allow an unprivileged user the ability to
load specific kernel modules, since that would provide a facility to
unexpectedly extend the available attack surface. (The on-demand loading
of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
considered "expected" here, though additional consideration should be
given even to these.) For example, loading a filesystem module via an
unprivileged socket API is nonsense: only the root or physically local
user should trigger filesystem module loading. (And even this can be up
for debate in some scenarios.)

To protect against even privileged users, systems may need to either
disable module loading entirely (e.g. monolithic kernel builds or
modules_disabled sysctl), or provide signed modules (e.g.
``CONFIG_MODULE_SIG_FORCE``, or dm-crypt with LoadPin), to keep from having
root load arbitrary kernel code via the module loader interface.


Memory integrity
================

There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack. By far the most commonly
understood is that of the stack buffer overflow in which the return
address stored on the stack is overwritten. Many other examples of this
kind of attack exist, and protections exist to defend against them.

Stack buffer overflow
---------------------

The classic stack buffer overflow involves writing past the expected end
of a variable stored on the stack, ultimately writing a controlled value
to the stack frame's stored return address. The most widely used defense
is the presence of a stack canary between the stack variables and the
return address (``CONFIG_STACKPROTECTOR``), which is verified just before
the function returns. Other defenses include things like shadow stacks.

Stack depth overflow
--------------------

A less well understood attack is using a bug that triggers the
kernel to consume stack memory with deep function calls or large stack
allocations. With this attack it is possible to write beyond the end of
the kernel's preallocated stack space and into sensitive structures. Two
important changes need to be made for better protections: moving the
sensitive thread_info structure elsewhere, and adding a faulting memory
hole at the bottom of the stack to catch these overflows.

Heap memory integrity
---------------------

The structures used to track heap free lists can be sanity-checked during
allocation and freeing to make sure they aren't being used to manipulate
other memory areas.

Counter integrity
-----------------

Many places in the kernel use atomic counters to track object references
or perform similar lifetime management. When these counters can be made
to wrap (over or under), this traditionally exposes a use-after-free
flaw. By trapping atomic wrapping, this class of bug vanishes.
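One way to picture "trapping atomic wrapping" is a reference counter that
refuses to wrap in either direction. The sketch below is a userspace
approximation using C11 atomics. The type ``demo_ref``, its functions, and
the choice of ``INT_MAX`` as the pinned value are assumptions made for
illustration, not the kernel's implementation.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical: once the counter reaches this value it stays there. */
#define DEMO_REF_SATURATED INT_MAX

struct demo_ref {
    atomic_int count;
};

void demo_ref_init(struct demo_ref *r, int n)
{
    assert(n >= 0);
    atomic_init(&r->count, n);
}

/* Take a reference. Fails if the object is already dead (count == 0). */
bool demo_ref_get(struct demo_ref *r)
{
    int old = atomic_load(&r->count);

    for (;;) {
        if (old == 0)
            return false;   /* would resurrect a freed object */
        if (old == DEMO_REF_SATURATED)
            return true;    /* pinned: leak rather than wrap */
        /* On failure, "old" is reloaded and the checks run again. */
        if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
            return true;
    }
}

/* Drop a reference. Returns true only for the final put. */
bool demo_ref_put(struct demo_ref *r)
{
    int old = atomic_load(&r->count);

    for (;;) {
        if (old == 0)
            return false;   /* underflow attempt: refuse */
        if (old == DEMO_REF_SATURATED)
            return false;   /* pinned counters are never released */
        if (atomic_compare_exchange_weak(&r->count, &old, old - 1))
            return old == 1;
    }
}
```

The trade-off in this sketch is that a saturated counter leaks its object
forever, which is considered preferable to freeing memory that is still
referenced.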

Size calculation overflow detection
-----------------------------------

Similar to counter overflow, integer overflows (usually size calculations)
need to be detected at runtime to kill this class of bug, which
traditionally leads to being able to write past the end of kernel buffers.
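A minimal userspace sketch of runtime overflow detection for size
calculations is shown below. It relies on the GCC/Clang builtins
``__builtin_mul_overflow()`` and ``__builtin_add_overflow()``; the wrapper
names ``demo_alloc_array()`` and ``demo_flex_size()`` are hypothetical and
only loosely analogous to the helpers the kernel provides for this purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate n elements of elem_size bytes, or return NULL if the
 * multiplication would overflow size_t. */
void *demo_alloc_array(size_t n, size_t elem_size)
{
    size_t bytes;

    if (__builtin_mul_overflow(n, elem_size, &bytes))
        return NULL;
    return malloc(bytes);
}

/*
 * Compute header + n * elem_size for a structure with a trailing array.
 * Returns 0 on success and -1 on overflow; *out is only meaningful when
 * 0 is returned.
 */
int demo_flex_size(size_t header, size_t n, size_t elem_size, size_t *out)
{
    size_t payload;

    if (__builtin_mul_overflow(n, elem_size, &payload) ||
        __builtin_add_overflow(header, payload, out))
        return -1;
    return 0;
}
```

With these wrappers, a request such as ``demo_alloc_array(SIZE_MAX / 2 + 1, 2)``
fails cleanly instead of silently allocating a buffer that is far too small.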


Probabilistic defenses
======================

While many protections can be considered deterministic (e.g. read-only
memory cannot be written to), some protections provide only statistical
defense, in that an attack must gather enough information about a
running system to overcome the defense. While not perfect, these do
provide meaningful defenses.

Canaries, blinding, and other secrets
-------------------------------------

It should be noted that things like the stack canary discussed earlier
are technically statistical defenses, since they rely on a secret value,
and such values may become discoverable through an information exposure
flaw.

Blinding literal values for things like JITs, where the executable
contents may be partially under the control of userspace, needs a similar
secret value.

It is critical that the secret values used must be separate (e.g.
different canary per stack) and high entropy (e.g. is the RNG actually
working?) in order to maximize their success.

Kernel Address Space Layout Randomization (KASLR)
-------------------------------------------------

Since the location of kernel memory is almost always instrumental in
mounting a successful attack, making the location non-deterministic
raises the difficulty of an exploit. (Note that this in turn makes
the value of information exposures higher, since they may be used to
discover desired memory locations.)

Text and module base
~~~~~~~~~~~~~~~~~~~~

By relocating the physical and virtual base address of the kernel at
boot-time (``CONFIG_RANDOMIZE_BASE``), attacks needing kernel code will be
frustrated. Additionally, offsetting the module loading base address
means that even systems that load the same set of modules in the same
order every boot will not share a common base address with the rest of
the kernel text.

Stack base
~~~~~~~~~~

If the base address of the kernel stack is not the same between processes,
or even not the same between syscalls, targets on or beyond the stack
become more difficult to locate.

Dynamic memory base
~~~~~~~~~~~~~~~~~~~

Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
being relatively deterministic in layout due to the order of early-boot
initializations. If the base address of these areas is not the same
between boots, targeting them is frustrated, requiring an information
exposure specific to the region.

Structure layout
~~~~~~~~~~~~~~~~

By performing a per-build randomization of the layout of sensitive
structures, attacks must either be tuned to known kernel builds or expose
enough kernel memory to determine structure layouts before manipulating
them.


Preventing Information Exposures
================================

Since the locations of sensitive structures are the primary target for
attacks, it is important to defend against exposure of both kernel memory
addresses and kernel memory contents (since they may contain kernel
addresses or other sensitive things like canary values).

Kernel addresses
----------------

Printing kernel addresses to userspace leaks sensitive information about
the kernel memory layout. Care should be exercised when using any printk
specifier that prints the raw address: currently %px and %p[ad] (and
%p[sSb] in certain circumstances [*]). Any file written to using one of
these specifiers should be readable only by privileged processes.

Kernels 4.14 and older printed the raw address using %p. As of 4.15-rc1
addresses printed with the specifier %p are hashed before printing.

[*] If KALLSYMS is enabled and symbol lookup fails, the raw address is
printed. If KALLSYMS is not enabled the raw address is printed.
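To make the idea of hashing an address before printing more concrete, here
is a small userspace sketch. It is not the kernel's algorithm: the mixing
function is a non-cryptographic stand-in for a proper keyed hash, the fixed
value of ``demo_ptr_key`` stands in for a secret that would have to come
from a random source, and all ``demo_*`` names are hypothetical.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical secret. A real system must fill this from a random
 * source at startup; a fixed value is used here only for illustration. */
static uint64_t demo_ptr_key = 0x9e3779b97f4a7c15ULL;

/* NOT cryptographic: a 64-bit bit-mixing step standing in for a real
 * keyed hash function. */
uint64_t demo_mix64(uint64_t x)
{
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/* Map a pointer to a stable 32-bit identifier that does not reveal the
 * address to a reader who lacks the key. */
uint32_t demo_hash_ptr(const void *p)
{
    return (uint32_t)demo_mix64((uint64_t)(uintptr_t)p ^ demo_ptr_key);
}

/* Format the hashed identifier as 8 hex digits instead of the address.
 * Returns the snprintf() result. */
int demo_format_ptr(char *buf, size_t len, const void *p)
{
    return snprintf(buf, len, "%08" PRIx32, demo_hash_ptr(p));
}
```

The same pointer always maps to the same identifier, so log lines can still
be correlated with each other without exposing the memory layout.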

Unique identifiers
------------------

Kernel memory addresses must never be used as identifiers exposed to
userspace. Instead, use an atomic counter, an idr, or similar unique
identifier.
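The atomic counter option can be sketched in userspace C as follows. The
structure ``demo_obj`` and the counter ``demo_next_id`` are hypothetical
names used only for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical object whose identity must be visible to userspace. */
struct demo_obj {
    uint64_t id;    /* safe to expose: carries no address information */
    int payload;
};

/* Monotonic source of identifiers, shared by all objects. */
static atomic_uint_fast64_t demo_next_id = 1;

void demo_obj_init(struct demo_obj *obj, int payload)
{
    assert(obj != 0);
    /* Hand out the current value and advance the counter atomically. */
    obj->id = atomic_fetch_add(&demo_next_id, 1);
    obj->payload = payload;
}
```

Anything reported to userspace would then carry ``obj->id`` rather than the
value of ``obj`` itself.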

Memory initialization
---------------------

Memory copied to userspace must always be fully initialized. If not
explicitly memset(), this will require changes to the compiler to make
sure structure holes are cleared.
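The explicit memset() approach can be illustrated with a structure that
typically contains a padding hole. This is a userspace sketch:
``demo_reply`` and ``demo_fill_reply()`` are hypothetical, and the amount of
padding depends on the target ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* On common 64-bit ABIs there are 7 padding bytes between "kind" and
 * "value"; those bytes are the "structure hole". */
struct demo_reply {
    uint8_t  kind;
    uint64_t value;
};

/* Prepare a reply that is about to be copied out to another domain. */
void demo_fill_reply(struct demo_reply *r, uint8_t kind, uint64_t value)
{
    assert(r != NULL);
    /* Clear the whole object first so the padding carries no stale data.
     * Assigning the members alone would leave the hole untouched. */
    memset(r, 0, sizeof(*r));
    r->kind = kind;
    r->value = value;
}
```

Without the memset(), whatever previously occupied the padding bytes would
be copied out along with the two fields.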

Memory poisoning
----------------

When releasing memory, it is best to poison the contents, to avoid reuse
attacks that rely on the old contents of memory. E.g., clear stack on a
syscall return (``CONFIG_GCC_PLUGIN_STACKLEAK``), wipe heap memory on a
free. This frustrates many uninitialized variable attacks, stack content
exposures, heap content exposures, and use-after-free attacks.
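Wiping heap memory on free can be sketched in userspace as a wrapper around
``free()``. The names ``demo_poison()`` and ``demo_free_poisoned()`` and the
poison byte are assumptions for this example. Unlike a real allocator, the
wrapper needs the caller to pass the buffer length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical poison pattern written over released memory. */
#define DEMO_POISON_FREE 0x6b

/* Overwrite a buffer with the poison pattern. The volatile pointer keeps
 * the compiler from discarding the stores as dead writes. */
void demo_poison(void *p, size_t len)
{
    volatile unsigned char *b = p;

    assert(p != NULL || len == 0);
    while (len--)
        *b++ = DEMO_POISON_FREE;
}

/* Release a heap buffer only after its old contents are destroyed. */
void demo_free_poisoned(void *p, size_t len)
{
    if (p == NULL)
        return;
    demo_poison(p, len);
    free(p);
}
```

A recognizable non-zero pattern has the side benefit that a later
use-after-free tends to fail visibly instead of operating on plausible old
data.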

Destination tracking
--------------------

To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files),
it should automatically censor sensitive values.
317