======================
Kernel Self-Protection
======================

Kernel self-protection is the design and implementation of systems and
structures within the Linux kernel to protect against security flaws in
the kernel itself. This covers a wide range of issues, including removing
entire classes of bugs, blocking security flaw exploitation methods,
and actively detecting attack attempts. Not all topics are explored in
this document, but it should serve as a reasonable starting point and
answer any frequently asked questions. (Patches welcome, of course!)

In the worst-case scenario, we assume an unprivileged local attacker
has arbitrary read and write access to the kernel's memory. In many
cases, bugs being exploited will not provide this level of access,
but with systems in place that defend against the worst case, we'll
cover the more limited cases as well. A higher bar, and one that should
still be kept in mind, is protecting the kernel against a *privileged*
local attacker, since the root user has access to a vastly increased
attack surface. (Especially when they have the ability to load arbitrary
kernel modules.)

The goals for successful self-protection systems would be that they
are effective, on by default, require no opt-in by developers, have no
performance impact, do not impede kernel debugging, and have tests. It
is uncommon that all these goals can be met, but it is worth explicitly
mentioning them, since these aspects need to be explored, dealt with,
and/or accepted.


Attack Surface Reduction
========================

The most fundamental defense against security exploits is to reduce the
areas of the kernel that can be used to redirect execution. This includes
limiting the APIs exposed to userspace, making in-kernel APIs hard to use
incorrectly, and minimizing the areas of writable kernel memory.

Strict kernel memory permissions
--------------------------------

When all of kernel memory is writable, it becomes trivial for attacks
to redirect execution flow. To reduce the availability of these targets,
the kernel needs to protect its memory with a tight set of permissions.

Executable code and read-only data must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Any areas of the kernel with executable memory must not be writable.
While this obviously includes the kernel text itself, we must consider
all additional places too: kernel modules, JIT memory, etc. (There are
temporary exceptions to this rule to support things like instruction
alternatives, breakpoints, kprobes, etc. If these must exist in a
kernel, they are implemented in a way where the memory is temporarily
made writable during the update, and then returned to the original
permissions.)

In support of this are ``CONFIG_STRICT_KERNEL_RWX`` and
``CONFIG_STRICT_MODULE_RWX``, which seek to make sure that code is not
writable, data is not executable, and read-only data is neither writable
nor executable.

Most architectures have these options on by default and not user selectable.
For some architectures, like arm, that want these to be selectable,
the architecture Kconfig can select ``ARCH_OPTIONAL_KERNEL_RWX`` to enable
a Kconfig prompt. ``CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT`` determines
the default setting when ``ARCH_OPTIONAL_KERNEL_RWX`` is enabled.

Function pointers and sensitive variables must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Vast areas of kernel memory contain function pointers that are looked
up by the kernel and used to continue execution (e.g. descriptor/vector
tables, file/network/etc. operation structures). The number of these
variables must be reduced to an absolute minimum.

Many such variables can be made read-only by declaring them ``const``
so that they live in the .rodata section instead of the .data section
of the kernel, gaining the protection of the kernel's strict memory
permissions as described above.
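
As a hedged userspace illustration (``struct widget_ops`` and its functions
are hypothetical names, not kernel APIs), an operations table declared
``const`` can typically be emitted by the compiler into a read-only section,
so its function pointers cannot be retargeted at runtime:

```c
#include <assert.h>

/* Hypothetical operations table, shaped like the kernel's *_operations structs. */
struct widget_ops {
    int (*open)(int id);
    int (*close)(int id);
};

static int widget_open(int id)  { return id + 1; }
static int widget_close(int id) { return id - 1; }

/*
 * "const" lets the compiler place this table in a read-only section
 * (.rodata, or .data.rel.ro in position-independent builds), so the
 * function pointers cannot be overwritten without first changing the
 * page permissions.
 */
static const struct widget_ops demo_widget_ops = {
    .open  = widget_open,
    .close = widget_close,
};

/* Callers only ever see a pointer-to-const table. */
static int widget_roundtrip(const struct widget_ops *ops, int id)
{
    return ops->close(ops->open(id));
}
```

Assigning to ``demo_widget_ops.open`` is rejected by the compiler; a write
through a cast that discards ``const`` is undefined behaviour and would
typically fault, because the table sits on a read-only page.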

For variables that are initialized once at ``__init`` time, these can
be marked with the (new and under development) ``__ro_after_init``
attribute.
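
``__ro_after_init`` places a variable in a section that is made read-only
once kernel init completes. A rough userspace analog, assuming a Linux or
other POSIX system with ``mmap()`` and ``mprotect()``, keeps boot-time
settings on a page of their own and seals that page after initialization;
``struct boot_settings`` and both helpers are hypothetical:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical settings written once during "init". */
struct boot_settings {
    int max_widgets;
    int debug_level;
};

/* A page-aligned, page-sized area, so its permissions can be changed
 * without affecting any other data. */
static struct boot_settings *settings_alloc(void)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Called once "init" is finished: from now on any write faults. */
static int settings_seal(struct boot_settings *s)
{
    return mprotect(s, (size_t)sysconf(_SC_PAGESIZE), PROT_READ);
}
```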

What remains are variables that are updated rarely (e.g. GDT). These
will need another infrastructure (similar to the temporary exceptions
made to kernel code mentioned above) that allows them to spend the rest
of their lifetime read-only. (For example, when being updated, only the
CPU thread performing the update would be given uninterruptible write
access to the memory.)
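
The shape of such a temporary write window can be sketched in userspace
with ``mprotect()``; ``rare_update()`` is a hypothetical helper. Unlike the
scheme described above, this sketch does not limit write access to the
updating thread: every thread in the process can write while the window is
open, so it only shows the open, update, close sequence.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Update len bytes at offset inside ro_page, which must be a page-aligned,
 * one-page mapping that is normally read-only (e.g. from mmap()). */
static int rare_update(void *ro_page, size_t offset, const void *src, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    if (offset > page || len > page - offset)
        return -1;                                    /* stay inside the page */
    if (mprotect(ro_page, page, PROT_READ | PROT_WRITE) != 0)
        return -1;                                    /* open the write window */
    memcpy((unsigned char *)ro_page + offset, src, len);
    return mprotect(ro_page, page, PROT_READ);        /* close it again */
}
```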

Segregation of kernel memory from userspace memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The kernel must never execute userspace memory. The kernel must also never
access userspace memory without an explicit expectation to do so. These
rules can be enforced either by hardware-based restrictions
(x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing
attacks to operate entirely in kernel memory.

Reduced access to syscalls
--------------------------

One trivial way to eliminate many syscalls on 64-bit systems is building
without ``CONFIG_COMPAT``. However, this is rarely a feasible scenario.

The "seccomp" system provides an opt-in feature made available to
userspace, which provides a way to reduce the number of kernel entry
points available to a running process. This limits the breadth of kernel
code that can be reached, possibly reducing the availability of a given
bug to an attack.
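
For illustration, a minimal userspace sketch of a seccomp filter installed
with ``prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...)`` that makes one
system call fail with ``EPERM`` and allows everything else. It assumes a
Linux system with UAPI headers from 4.14 or later (for
``SECCOMP_RET_KILL_PROCESS``) and an x86-64 or arm64 build; other
architectures need their own ``AUDIT_ARCH_*`` value. The choice of
``getppid`` is arbitrary.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#if defined(__x86_64__)
#define SKETCH_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define SKETCH_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "set SKETCH_AUDIT_ARCH for this architecture"
#endif

/* Wrong architecture -> kill; getppid -> EPERM; anything else -> allow.
 * (A production filter on x86-64 would also reject x32 syscall numbers.) */
static int deny_getppid(void)
{
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SKETCH_AUDIT_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* Required before an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Filters are inherited across ``fork()`` and preserved across ``execve()``
and cannot be removed, so a process typically installs one just before it
starts handling untrusted input.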

An area of improvement would be creating viable ways to keep access to
things like compat, user namespaces, BPF creation, and perf limited only
to trusted processes. This would keep the scope of kernel entry points
restricted to the more regular set normally available to unprivileged
userspace.

Restricting access to kernel modules
------------------------------------

The kernel should never allow an unprivileged user the ability to
load specific kernel modules, since that would provide a facility to
unexpectedly extend the available attack surface. (The on-demand loading
of modules via their predefined subsystems, e.g. ``MODULE_ALIAS_*``, is
considered "expected" here, though additional consideration should be
given even to these.) For example, loading a filesystem module via an
unprivileged socket API is nonsense: only the root or physically local
user should trigger filesystem module loading. (And even this can be up
for debate in some scenarios.)

To protect against even privileged users, systems may need to either
disable module loading entirely (e.g. monolithic kernel builds or the
``modules_disabled`` sysctl), or provide signed modules (e.g.
``CONFIG_MODULE_SIG_FORCE``, or dm-crypt with LoadPin), to keep from having
root load arbitrary kernel code via the module loader interface.


Memory integrity
================

There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack. By far the most commonly
understood is that of the stack buffer overflow in which the return
address stored on the stack is overwritten. Many other examples of this
kind of attack exist, and protections exist to defend against them.

Stack buffer overflow
---------------------

The classic stack buffer overflow involves writing past the expected end
of a variable stored on the stack, ultimately writing a controlled value
to the stack frame's stored return address. The most widely used defense
is the presence of a stack canary between the stack variables and the
return address (``CONFIG_STACKPROTECTOR``), which is verified just before
the function returns. Other defenses include things like shadow stacks.
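
The canary check can be modelled in plain C to show what the
compiler-inserted prologue and epilogue do. This is a hypothetical model,
not how ``CONFIG_STACKPROTECTOR`` is implemented: there the compiler places
and checks the canary itself, and a mismatch ends in the kernel's
``__stack_chk_fail()``.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOCAL_BUF_LEN 16   /* hypothetical size of a local buffer */

/*
 * Model one stack frame as a byte array: LOCAL_BUF_LEN bytes of buffer
 * followed by the canary. Keeping both in one array means an "overflow"
 * of the buffer is still an in-bounds write in C terms.
 * Returns false when the canary was clobbered.
 */
static bool call_with_copy(uint64_t canary, const unsigned char *src, size_t len)
{
    unsigned char frame[LOCAL_BUF_LEN + sizeof(uint64_t)];
    uint64_t check;

    memcpy(frame + LOCAL_BUF_LEN, &canary, sizeof(canary)); /* "prologue" */
    assert(len <= sizeof(frame));       /* keep the model itself in bounds */
    memcpy(frame, src, len);            /* the buggy copy: len never checked against LOCAL_BUF_LEN */
    memcpy(&check, frame + LOCAL_BUF_LEN, sizeof(check));   /* "epilogue" */
    return check == canary;
}
```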

Stack depth overflow
--------------------

A less well understood attack is using a bug that triggers the
kernel to consume stack memory with deep function calls or large stack
allocations. With this attack it is possible to write beyond the end of
the kernel's preallocated stack space and into sensitive structures. Two
important changes need to be made for better protection: moving the
sensitive thread_info structure elsewhere, and adding a faulting memory
hole at the bottom of the stack to catch these overflows.
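
The faulting hole can be sketched in userspace as a ``PROT_NONE`` guard page
mapped directly below a downward-growing stack area. The helpers are
hypothetical; the probe runs in a forked child so that the expected crash
does not take down the caller.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map `pages` usable pages plus one PROT_NONE guard page below them.
 * Returns the lowest usable address; the stack "grows" down toward it. */
static unsigned char *alloc_guarded_stack(size_t pages, size_t page)
{
    unsigned char *base = mmap(NULL, (pages + 1) * page, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;
    if (mprotect(base, page, PROT_NONE) != 0) {   /* the faulting hole */
        munmap(base, (pages + 1) * page);
        return NULL;
    }
    return base + page;
}

/* In a child, write `depth` bytes downward from the top of the stack.
 * Returns the signal that killed the child, 0 on a clean exit, or -1 on error. */
static int probe_depth(unsigned char *lowest, size_t pages, size_t page, size_t depth)
{
    pid_t pid = fork();
    int status;

    if (pid < 0)
        return -1;
    if (pid == 0) {
        volatile unsigned char *top = lowest + pages * page;
        for (size_t i = 1; i <= depth; i++)
            top[-(ptrdiff_t)i] = 0xAA;  /* reaches the guard page once depth > pages * page */
        _exit(0);
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```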

Heap memory integrity
---------------------

The structures used to track heap free lists can be sanity-checked during
allocation and freeing to make sure they aren't being used to manipulate
other memory areas.
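
One common form of this sanity check is "safe unlinking" of a doubly linked
free list: before removing an entry, confirm that its neighbours still point
back at it. The sketch below is a hypothetical userspace model
(``struct free_chunk`` and ``checked_unlink()`` are illustrative names), not
the kernel's allocator code; ``CONFIG_DEBUG_LIST`` applies a similar
neighbour check to the kernel's generic linked lists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct free_chunk {
    struct free_chunk *prev, *next;
};

/*
 * Unlink c from a circular doubly linked free list, but only after checking
 * that its neighbours still point back at it. A forged prev/next pointer
 * (e.g. planted through a heap overflow) fails the check, so the unlink
 * cannot be turned into a write to an attacker-chosen address.
 */
static bool checked_unlink(struct free_chunk *c)
{
    if (c->next->prev != c || c->prev->next != c)
        return false;              /* corruption detected: touch nothing */
    c->prev->next = c->next;
    c->next->prev = c->prev;
    c->next = c->prev = c;
    return true;
}
```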

Counter integrity
-----------------

Many places in the kernel use atomic counters to track object references
or perform similar lifetime management. When these counters can be made
to wrap (over or under), this traditionally exposes a use-after-free
flaw. By trapping atomic wrapping, this class of bug vanishes.
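
In the kernel this is the job of ``refcount_t`` (``refcount_inc()``,
``refcount_dec_and_test()``, and related helpers), which saturates instead
of wrapping. The C11 sketch below shows the idea with hypothetical names and
a hypothetical saturation value of ``UINT_MAX``; it is not the kernel
implementation.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define REF_SATURATED UINT_MAX   /* hypothetical "pinned forever" value */

struct ref {
    atomic_uint count;
};

/* Increment; once the counter reaches REF_SATURATED it stays there
 * (leaking the object) instead of wrapping back to zero. */
static void ref_get(struct ref *r)
{
    unsigned int old = atomic_load(&r->count);

    do {
        if (old == REF_SATURATED)
            return;
    } while (!atomic_compare_exchange_weak(&r->count, &old, old + 1));
}

/* Decrement; returns true when the caller dropped the last reference.
 * A saturated counter is never decremented, and underflow is refused. */
static bool ref_put(struct ref *r)
{
    unsigned int old = atomic_load(&r->count);

    do {
        if (old == REF_SATURATED || old == 0)
            return false;
    } while (!atomic_compare_exchange_weak(&r->count, &old, old - 1));
    return old == 1;
}
```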

Size calculation overflow detection
-----------------------------------

Similar to counter overflow, integer overflows (usually size calculations)
need to be detected at runtime to kill this class of bug, which
traditionally leads to being able to write past the end of kernel buffers.
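
A minimal sketch of such a check for an allocation size, using the GCC/Clang
builtin ``__builtin_mul_overflow()``. ``alloc_array()`` is a hypothetical
helper, similar in spirit to the kernel's ``kmalloc_array()``, which refuses
allocations whose size computation would overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate n elements of `size` bytes each, or return NULL if n * size
 * does not fit in size_t (which would otherwise wrap to a small value). */
static void *alloc_array(size_t n, size_t size)
{
    size_t bytes;

    if (__builtin_mul_overflow(n, size, &bytes))
        return NULL;
    return malloc(bytes);
}
```

Without the check, ``(SIZE_MAX / 2 + 1) * 2`` wraps to 0, and a loop that
then fills the "allocated" elements would write far past the end of a
tiny buffer.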


Probabilistic defenses
======================

While many protections can be considered deterministic (e.g. read-only
memory cannot be written to), some protections provide only statistical
defense, in that an attack must gather enough information about a
running system to overcome the defense. While not perfect, these do
provide meaningful defenses.

Canaries, blinding, and other secrets
-------------------------------------

It should be noted that things like the stack canary discussed earlier
are technically statistical defenses, since they rely on a secret value,
and such values may become discoverable through an information exposure
flaw.

Blinding literal values for things like JITs, where the executable
contents may be partially under the control of userspace, needs a similar
secret value.
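
Constant blinding can be sketched with a toy instruction set
(``struct toy_insn`` and its two opcodes are hypothetical). Instead of
emitting an attacker-chosen immediate verbatim into executable memory, the
emitter writes ``imm ^ key`` followed by an instruction that XORs ``key``
back in at runtime; in practice ``key`` would be a fresh random value for
each emission. The kernel's BPF JIT hardening (``bpf_jit_harden``) uses the
same XOR approach.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-opcode toy instruction set. */
enum toy_op { TOY_MOV_IMM, TOY_XOR_IMM };

struct toy_insn {
    enum toy_op op;
    uint32_t imm;
};

/* Emit "r = imm" in blinded form: "r = imm ^ key; r ^= key".
 * Returns the number of instructions written to out (always 2). */
static size_t emit_blinded_mov(struct toy_insn *out, uint32_t imm, uint32_t key)
{
    out[0] = (struct toy_insn){ .op = TOY_MOV_IMM, .imm = imm ^ key };
    out[1] = (struct toy_insn){ .op = TOY_XOR_IMM, .imm = key };
    return 2;
}

/* Tiny interpreter standing in for executing the JIT output. */
static uint32_t toy_run(const struct toy_insn *prog, size_t n)
{
    uint32_t r = 0;

    for (size_t i = 0; i < n; i++)
        r = prog[i].op == TOY_MOV_IMM ? prog[i].imm : r ^ prog[i].imm;
    return r;
}
```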

It is critical that the secret values used be separate (e.g.
different canary per stack) and high entropy (e.g. is the RNG actually
working?) in order to maximize their effectiveness.

Kernel Address Space Layout Randomization (KASLR)
-------------------------------------------------

Since the location of kernel memory is almost always instrumental in
mounting a successful attack, making the location non-deterministic
raises the difficulty of an exploit. (Note that this in turn makes
the value of information exposures higher, since they may be used to
discover desired memory locations.)

Text and module base
~~~~~~~~~~~~~~~~~~~~

By relocating the physical and virtual base address of the kernel at
boot-time (``CONFIG_RANDOMIZE_BASE``), attacks needing kernel code will be
frustrated. Additionally, offsetting the module loading base address
means that even systems that load the same set of modules in the same
order every boot will not share a common base address with the rest of
the kernel text.
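
The arithmetic behind a randomized base can be sketched as picking one of
the aligned start positions in a window that still fits the image. The
function and the example sizes (a 1 GiB window, a 64 MiB image, 2 MiB
alignment) are hypothetical and ignore the real constraints (reserved
memory, holes in the memory map) that boot code must honor. The number of
slots bounds the entropy: here (1024 - 64) / 2 + 1 = 481 slots, a little
under 9 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Choose a base offset for an image of image_size bytes inside a window of
 * `window` bytes, aligned to `align` bytes, from the random input rnd.
 * Returns false if no aligned placement fits. (rnd % slots has a slight
 * modulo bias; real code would take care to avoid it.)
 */
static bool pick_base_offset(uint64_t window, uint64_t image_size,
                             uint64_t align, uint64_t rnd, uint64_t *offset)
{
    if (align == 0 || image_size > window)
        return false;
    uint64_t max_index = (window - image_size) / align;  /* last aligned start that fits */
    uint64_t index = max_index == UINT64_MAX ? rnd : rnd % (max_index + 1);
    *offset = index * align;   /* never exceeds window - image_size */
    return true;
}
```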

Stack base
~~~~~~~~~~

If the base address of the kernel stack is not the same between processes,
or even not the same between syscalls, targets on or beyond the stack
become more difficult to locate.

Dynamic memory base
~~~~~~~~~~~~~~~~~~~

Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc.) ends up
being relatively deterministic in layout due to the order of early-boot
initializations. If the base address of these areas is not the same
between boots, targeting them is frustrated, requiring an information
exposure specific to the region.

Structure layout
~~~~~~~~~~~~~~~~

By performing a per-build randomization of the layout of sensitive
structures, attacks must either be tuned to known kernel builds or expose
enough kernel memory to determine structure layouts before manipulating
them.
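
To see why this helps, consider two hypothetical builds of the same
structure with different member order, as a layout-randomizing build might
produce. A write primitive tuned to build A's offsets hits the intended
field there but lands on a different field in build B. All names below are
illustrative; in the kernel, structures opt in with the
``__randomize_layout`` annotation and the randstruct GCC plugin performs
the shuffling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The same logical structure as laid out by two different builds. */
struct demo_cred_build_a { uint32_t uid; uint32_t gid; void (*hook)(void); };
struct demo_cred_build_b { void (*hook)(void); uint32_t gid; uint32_t uid; };

/* An attacker's write primitive, hard-coded for build A's layout. */
static void write_uid_assuming_build_a(void *obj, uint32_t value)
{
    memcpy((unsigned char *)obj + offsetof(struct demo_cred_build_a, uid),
           &value, sizeof(value));
}
```

Code that names fields, for example through designated initializers such as
``{ .uid = 1000 }``, keeps working whatever the member order; only code
that hard-codes offsets breaks.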


Preventing Information Exposures
================================

Since the locations of sensitive structures are the primary target for
attacks, it is important to defend against exposure of both kernel memory
addresses and kernel memory contents (since they may contain kernel
addresses or other sensitive things like canary values).

Kernel addresses
----------------

Printing kernel addresses to userspace leaks sensitive information about
the kernel memory layout. Care should be exercised when using any printk
specifier that prints the raw address, currently %px, %p[ad] (and %p[sSb]
in certain circumstances [*]). Any file written to using one of these
specifiers should be readable only by privileged processes.

Kernels 4.14 and older printed the raw address using %p. As of 4.15-rc1,
addresses printed with the specifier %p are hashed before printing.

[*] If KALLSYMS is enabled and symbol lookup fails, the raw address is
printed. If KALLSYMS is not enabled, the raw address is printed.

Unique identifiers
------------------

Kernel memory addresses must never be used as identifiers exposed to
userspace. Instead, use an atomic counter, an idr, or similar unique
identifier.
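
A minimal C11 sketch of counter-based identifiers; ``struct demo_obj`` and
``demo_obj_init()`` are hypothetical. In the kernel, an atomic counter or an
IDR (``idr_alloc()``) fills the same role.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical object type; `id` is the only handle userspace ever sees. */
struct demo_obj {
    unsigned long id;
};

static atomic_ulong next_demo_id = 1;   /* 0 is never handed out */

static void demo_obj_init(struct demo_obj *o)
{
    /* A fresh number per object; the object's address is never reported. */
    o->id = atomic_fetch_add(&next_demo_id, 1);
}
```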

Memory initialization
---------------------

Memory copied to userspace must always be fully initialized. If it is not
explicitly memset(), this will require changes to the compiler to make
sure structure holes are cleared.
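
A minimal sketch of the explicit memset() approach, with ``copy_out()`` as a
hypothetical stand-in for ``copy_to_user()``. On common ABIs,
``struct demo_reply`` has three padding bytes after ``type``; without the
memset() those bytes would carry whatever previously occupied that memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct demo_reply {
    uint8_t  type;    /* usually followed by 3 bytes of padding */
    uint32_t value;
};

/* Fill a reply destined for userspace; clear everything first so that
 * padding bytes hold zeros rather than stale data. */
static void build_reply(struct demo_reply *r, uint8_t type, uint32_t value)
{
    memset(r, 0, sizeof(*r));
    r->type = type;
    r->value = value;
}

/* Stand-in for copy_to_user(): copies the raw object bytes out. */
static void copy_out(unsigned char *user_buf, const struct demo_reply *r)
{
    memcpy(user_buf, r, sizeof(*r));
}
```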

Memory poisoning
----------------

When releasing memory, it is best to poison the contents (clear stack on
syscall return, wipe heap memory on a free), to avoid reuse attacks that
rely on the old contents of memory. This frustrates many uninitialized
variable attacks, stack content exposures, heap content exposures, and
use-after-free attacks.
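
The heap half of this can be sketched with a tiny fixed-size object pool
that poisons objects as they are freed (stack clearing is not modelled).
The pool and its helpers are hypothetical; the poison byte ``0x6b`` is the
value the kernel's slab debugging writes into freed objects.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL_OBJS        4
#define OBJ_SIZE         32
#define POISON_FREE_BYTE 0x6b

static unsigned char pool[POOL_OBJS][OBJ_SIZE];
static int in_use[POOL_OBJS];

/* Hand out the first unused object, or NULL if the pool is exhausted. */
static void *pool_alloc(void)
{
    for (size_t i = 0; i < POOL_OBJS; i++) {
        if (!in_use[i]) {
            in_use[i] = 1;
            return pool[i];
        }
    }
    return NULL;
}

/* Wipe the object before it can be handed out again. */
static void pool_free(void *p)
{
    for (size_t i = 0; i < POOL_OBJS; i++) {
        if (p == pool[i]) {
            memset(pool[i], POISON_FREE_BYTE, OBJ_SIZE);
            in_use[i] = 0;
            return;
        }
    }
    assert(!"pool_free: pointer not from this pool");
}
```

The next user of a recycled object sees the poison pattern, not the
previous owner's data.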

Destination tracking
--------------------

To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files),
it should automatically censor sensitive values.
317c2ed6743SKees Cookit should automatically censor sensitive values.
318