The Kernel Concurrency Sanitizer (KCSAN)
========================================

The Kernel Concurrency Sanitizer (KCSAN) is a dynamic race detector that
relies on compile-time instrumentation and uses a watchpoint-based sampling
approach to detect races. KCSAN's primary purpose is to detect `data races`_.

Usage
-----

KCSAN requires Clang version 11 or later.

To enable KCSAN, configure the kernel with::

    CONFIG_KCSAN=y

KCSAN provides several other configuration options to customize behaviour (see
the respective help text in ``lib/Kconfig.kcsan`` for more information).

Error reports
~~~~~~~~~~~~~

A typical data race report looks like this::

    ==================================================================
    BUG: KCSAN: data-race in generic_permission / kernfs_refresh_inode

    write to 0xffff8fee4c40700c of 4 bytes by task 175 on cpu 4:
     kernfs_refresh_inode+0x70/0x170
     kernfs_iop_permission+0x4f/0x90
     inode_permission+0x190/0x200
     link_path_walk.part.0+0x503/0x8e0
     path_lookupat.isra.0+0x69/0x4d0
     filename_lookup+0x136/0x280
     user_path_at_empty+0x47/0x60
     vfs_statx+0x9b/0x130
     __do_sys_newlstat+0x50/0xb0
     __x64_sys_newlstat+0x37/0x50
     do_syscall_64+0x85/0x260
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

    read to 0xffff8fee4c40700c of 4 bytes by task 166 on cpu 6:
     generic_permission+0x5b/0x2a0
     kernfs_iop_permission+0x66/0x90
     inode_permission+0x190/0x200
     link_path_walk.part.0+0x503/0x8e0
     path_lookupat.isra.0+0x69/0x4d0
     filename_lookup+0x136/0x280
     user_path_at_empty+0x47/0x60
     do_faccessat+0x11a/0x390
     __x64_sys_access+0x3c/0x50
     do_syscall_64+0x85/0x260
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 6 PID: 166 Comm: systemd-journal Not tainted 5.3.0-rc7+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
    ==================================================================

The header of the report provides a short summary of the functions involved in
the race. It is followed by the access types and stack traces of the two
threads involved in the data race.

The other, less common, type of data race report looks like this::

    ==================================================================
    BUG: KCSAN: data-race in e1000_clean_rx_irq+0x551/0xb10

    race at unknown origin, with read to 0xffff933db8a2ae6c of 1 bytes by interrupt on cpu 0:
     e1000_clean_rx_irq+0x551/0xb10
     e1000_clean+0x533/0xda0
     net_rx_action+0x329/0x900
     __do_softirq+0xdb/0x2db
     irq_exit+0x9b/0xa0
     do_IRQ+0x9c/0xf0
     ret_from_intr+0x0/0x18
     default_idle+0x3f/0x220
     arch_cpu_idle+0x21/0x30
     do_idle+0x1df/0x230
     cpu_startup_entry+0x14/0x20
     rest_init+0xc5/0xcb
     arch_call_rest_init+0x13/0x2b
     start_kernel+0x6db/0x700

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-rc7+ #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
    ==================================================================

This report is generated when it was not possible to determine the other
racing thread, but a race was inferred because the data value of the watched
memory location changed. Such reports can occur, for example, due to missing
instrumentation or DMA accesses. They are only generated if
``CONFIG_KCSAN_REPORT_RACE_UNKNOWN_ORIGIN=y`` (selected by default).

Selective analysis
~~~~~~~~~~~~~~~~~~

It may be desirable to disable data race detection for specific accesses,
functions, compilation units, or entire subsystems. For static blacklisting,
the following options are available:

* KCSAN understands the ``data_race(expr)`` annotation, which tells KCSAN to
  ignore any data races due to accesses in ``expr``; the resulting behaviour
  when encountering such a data race is deemed safe.

* Disabling data race detection for entire functions can be accomplished by
  using the function attribute ``__no_kcsan``::

    __no_kcsan
    void foo(void)
    {
        ...
    }

  To dynamically limit which functions generate reports, see the
  blacklist/whitelist feature of the `DebugFS interface`_.

* To disable data race detection for a particular compilation unit, add to the
  ``Makefile``::

    KCSAN_SANITIZE_file.o := n

* To disable data race detection for all compilation units listed in a
  ``Makefile``, add to the respective ``Makefile``::

    KCSAN_SANITIZE := n

Furthermore, it is possible to tell KCSAN to show or hide entire classes of
data races, depending on preferences. These can be changed via the following
Kconfig options:

* ``CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY``: If enabled and a conflicting write
  is observed via a watchpoint, but the data value of the memory location was
  observed to remain unchanged, do not report the data race.

* ``CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC``: If enabled, assume that plain
  aligned writes up to word size are atomic by default, i.e. that such writes
  are not subject to unsafe compiler optimizations resulting in data races.
  The option causes KCSAN not to report data races due to conflicts where the
  only plain accesses are aligned writes up to word size.

DebugFS interface
~~~~~~~~~~~~~~~~~

The file ``/sys/kernel/debug/kcsan`` provides the following interface:

* Reading ``/sys/kernel/debug/kcsan`` returns various runtime statistics.

* Writing ``on`` or ``off`` to ``/sys/kernel/debug/kcsan`` turns KCSAN on or
  off, respectively.

* Writing ``!some_func_name`` to ``/sys/kernel/debug/kcsan`` adds
  ``some_func_name`` to the report filter list, which (by default) blacklists
  reporting data races where either one of the top stack frames is a function
  in the list.

* Writing either ``blacklist`` or ``whitelist`` to ``/sys/kernel/debug/kcsan``
  changes the report filtering behaviour. For example, the blacklist feature
  can be used to silence frequently occurring data races; the whitelist feature
  can help with reproduction and testing of fixes.

Tuning performance
~~~~~~~~~~~~~~~~~~

Core parameters that affect KCSAN's overall performance and bug detection
ability are exposed as kernel command-line arguments whose defaults can also be
changed via the corresponding Kconfig options.

* ``kcsan.skip_watch`` (``CONFIG_KCSAN_SKIP_WATCH``): Number of per-CPU memory
  operations to skip before another watchpoint is set up. Setting up
  watchpoints more frequently increases the likelihood of observing races.
  This parameter has the most significant impact on overall system performance
  and race detection ability.

* ``kcsan.udelay_task`` (``CONFIG_KCSAN_UDELAY_TASK``): For tasks, the
  microsecond delay to stall execution after a watchpoint has been set up.
  Larger values widen the window in which a race may be observed.

* ``kcsan.udelay_interrupt`` (``CONFIG_KCSAN_UDELAY_INTERRUPT``): For
  interrupts, the microsecond delay to stall execution after a watchpoint has
  been set up. Interrupts have tighter latency requirements, and their delay
  should generally be smaller than the one chosen for tasks.

These parameters may also be tweaked at runtime via
``/sys/module/kcsan/parameters/``.

Data Races
----------

In an execution, two memory accesses form a *data race* if they *conflict*,
they happen concurrently in different threads, and at least one of them is a
*plain access*; they *conflict* if both access the same memory location and at
least one is a write. For a more thorough discussion and definition, see
`"Plain Accesses and Data Races" in the LKMM`_.

.. _"Plain Accesses and Data Races" in the LKMM: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/explanation.txt#n1922

Relationship with the Linux-Kernel Memory Consistency Model (LKMM)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The LKMM defines the propagation and ordering rules of various memory
operations, which gives developers the ability to reason about concurrent code.
Ultimately, this allows developers to determine the possible executions of
concurrent code and whether that code is free from data races.

KCSAN is aware of *marked atomic operations* (``READ_ONCE``, ``WRITE_ONCE``,
``atomic_*``, etc.), but is oblivious to any ordering guarantees and simply
assumes that memory barriers are placed correctly. In other words, KCSAN
assumes that as long as a plain access is not observed to race with another
conflicting access, memory operations are correctly ordered.

This means that KCSAN will not report *potential* data races due to missing
memory ordering. Developers should therefore carefully consider the memory
ordering requirements that remain unchecked. If, however, missing memory
ordering (that is observable with a particular compiler and architecture)
leads to an observable data race (e.g. entering a critical section
erroneously), KCSAN will report the resulting data race.

Race Detection Beyond Data Races
--------------------------------

For code with complex concurrency design, race-condition bugs may not always
manifest as data races. Race conditions occur if concurrently executing
operations result in unexpected system behaviour. Data races, on the other
hand, are defined at the C-language level. The following macros can be used to
check properties of concurrent code where bugs would not manifest as data
races.

.. kernel-doc:: include/linux/kcsan-checks.h
    :functions: ASSERT_EXCLUSIVE_WRITER ASSERT_EXCLUSIVE_WRITER_SCOPED
                ASSERT_EXCLUSIVE_ACCESS ASSERT_EXCLUSIVE_ACCESS_SCOPED
                ASSERT_EXCLUSIVE_BITS

Implementation Details
----------------------

KCSAN relies on observing that two accesses happen concurrently. Crucially, we
want to (a) increase the chances of observing races (especially for races that
manifest rarely), and (b) be able to actually observe them. We can accomplish
(a) by injecting various delays, and (b) by using address watchpoints (or
breakpoints).

If we deliberately stall a memory access while a watchpoint for its address is
set up, and then observe the watchpoint fire, two accesses to the same address
just raced. Using hardware watchpoints, this is the approach taken in
`DataCollider
<http://usenix.org/legacy/events/osdi10/tech/full_papers/Erickson.pdf>`_.
Unlike DataCollider, KCSAN does not use hardware watchpoints, but instead
relies on compiler instrumentation and "soft watchpoints".

In KCSAN, watchpoints are implemented using an efficient encoding that stores
access type, size, and address in a long; the benefits of using "soft
watchpoints" are portability and greater flexibility. KCSAN then relies on the
compiler instrumenting plain accesses. For each instrumented plain access:

1. Check if a matching watchpoint exists; if yes, and at least one access is a
   write, then we encountered a racing access.

2. Periodically, if no matching watchpoint exists, set up a watchpoint and
   stall for a small randomized delay.

3. Also check the data value before the delay, and re-check the data value
   after the delay; if the values mismatch, we infer a race of unknown origin.

To detect data races between plain and marked accesses, KCSAN also annotates
marked accesses, but only to check if a watchpoint exists, i.e. KCSAN never
sets up a watchpoint on marked accesses. Consequently, if all concurrent
accesses to a variable are properly marked, KCSAN will never trigger a
watchpoint and therefore never report the accesses.

Key Properties
~~~~~~~~~~~~~~

1. **Memory Overhead:** The overall memory overhead is only a few MiB,
   depending on configuration. The current implementation uses a small array
   of longs to encode watchpoint information, which is negligible.

2. **Performance Overhead:** KCSAN's runtime aims to be minimal, using an
   efficient watchpoint encoding that does not require acquiring any shared
   locks in the fast-path. For kernel boot on a system with 8 CPUs:

   - 5.0x slow-down with the default KCSAN config;
   - 2.8x slow-down from runtime fast-path overhead only (set very large
     ``KCSAN_SKIP_WATCH`` and unset ``KCSAN_SKIP_WATCH_RANDOMIZE``).

3. **Annotation Overheads:** Minimal annotations are required outside the KCSAN
   runtime. As a result, maintenance overheads are minimal as the kernel
   evolves.

4. **Detects Racy Writes from Devices:** Because data values are checked when
   setting up watchpoints, racy writes from devices can also be detected.

5. **Memory Ordering:** KCSAN is *not* explicitly aware of the LKMM's ordering
   rules; this may result in missed data races (false negatives).

6. **Analysis Accuracy:** For observed executions, due to using a sampling
   strategy, the analysis is *unsound* (false negatives possible), but aims to
   be complete (no false positives).

Alternatives Considered
-----------------------

An alternative data race detection approach for the kernel can be found in the
`Kernel Thread Sanitizer (KTSAN) <https://github.com/google/ktsan/wiki>`_.
KTSAN is a happens-before data race detector that explicitly establishes the
happens-before order between memory operations, which can then be used to
determine data races as defined in `Data Races`_.

To build a correct happens-before relation, KTSAN must be aware of all ordering
rules of the LKMM and synchronization primitives. Unfortunately, any omission
leads to large numbers of false positives, which is especially detrimental in
the context of the kernel, which includes numerous custom synchronization
mechanisms. To track the happens-before relation, KTSAN's implementation
requires metadata for each memory location (shadow memory): each page of
memory corresponds to 4 pages of shadow memory, which can translate into an
overhead of tens of GiB on a large system.
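
To illustrate why a happens-before detector must model every synchronization
primitive, the following is a minimal vector-clock sketch, the textbook way of
tracking happens-before. It is a generic illustration under simplifying
assumptions, not KTSAN's implementation: it assumes a fixed number of threads,
and keeps one shadow slot per variable that records only the last write
(mirroring the per-location metadata described above). All ``hb_*`` functions
and the ``vclock``, ``shadow`` and ``lock_clock`` types are hypothetical.

.. code-block:: c

    #include <stdbool.h>
    #include <string.h>

    /* Hypothetical: the sketch models a fixed number of threads. */
    #define NTHREADS 2

    /* A vector clock holds one logical timestamp per thread. */
    struct vclock {
        unsigned int c[NTHREADS];
    };

    /* Hypothetical shadow slot for one variable: clock of its last write. */
    struct shadow {
        struct vclock last_write;
        int writer;             /* -1 if the variable was never written */
    };

    /* Clock carried by a lock from a release to a later acquire. */
    struct lock_clock {
        struct vclock vc;
    };

    /* Current vector clock of each thread. */
    static struct vclock thread_vc[NTHREADS];

    /* dst = component-wise maximum of dst and src. */
    static void vc_join(struct vclock *dst, const struct vclock *src)
    {
        for (int i = 0; i < NTHREADS; i++)
            if (src->c[i] > dst->c[i])
                dst->c[i] = src->c[i];
    }

    /* True if every component of a is <= the matching component of b. */
    static bool vc_leq(const struct vclock *a, const struct vclock *b)
    {
        for (int i = 0; i < NTHREADS; i++)
            if (a->c[i] > b->c[i])
                return false;
        return true;
    }

    /* Reset all threads; each starts at timestamp 1 in its own component. */
    void hb_init(void)
    {
        memset(thread_vc, 0, sizeof(thread_vc));
        for (int i = 0; i < NTHREADS; i++)
            thread_vc[i].c[i] = 1;
    }

    /* Release: publish the releaser's clock into the lock, then advance it. */
    void hb_release(int tid, struct lock_clock *l)
    {
        vc_join(&l->vc, &thread_vc[tid]);
        thread_vc[tid].c[tid]++;  /* later events are not covered by it */
    }

    /* Acquire: order the acquirer after everything the lock has seen. */
    void hb_acquire(int tid, const struct lock_clock *l)
    {
        vc_join(&thread_vc[tid], &l->vc);
    }

    /* Record a write by thread tid in the variable's shadow slot. */
    void hb_write(int tid, struct shadow *s)
    {
        s->last_write = thread_vc[tid];
        s->writer = tid;
    }

    /* A read races with the last write unless that write happens-before it. */
    bool hb_read_races(int tid, const struct shadow *s)
    {
        if (s->writer < 0 || s->writer == tid)
            return false;
        return !vc_leq(&s->last_write, &thread_vc[tid]);
    }

For example, after ``hb_init()``, if thread 0 calls ``hb_write(0, &x)`` and
``hb_release(0, &l)``, and thread 1 then calls ``hb_acquire(1, &l)``,
``hb_read_races(1, &x)`` returns false. If the detector does not model the
lock, as with a custom synchronization mechanism it is unaware of, the acquire
step is missing, thread 1's clock never absorbs thread 0's, and
``hb_read_races(1, &x)`` returns true: a false positive.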