Entry/exit handling for exceptions, interrupts, syscalls and KVM
================================================================

All transitions between execution domains require state updates which are
subject to strict ordering constraints. State updates are required for the
following:

  * Lockdep
  * RCU / Context tracking
  * Preemption counter
  * Tracing
  * Time accounting

The update order depends on the transition type and is explained below in
the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular
exceptions`_, `NMI and NMI-like exceptions`_.

Non-instrumentable code - noinstr
---------------------------------

Most instrumentation facilities depend on RCU, so instrumentation is prohibited
for entry code before RCU starts watching and exit code after RCU stops
watching. In addition, many architectures must save and restore register state,
which means that (for example) a breakpoint in the breakpoint entry code would
overwrite the debug registers of the initial breakpoint.

Such code must be marked with the 'noinstr' attribute, placing that code into a
special section inaccessible to instrumentation and debug facilities. Some
functions are partially instrumentable, which is handled by marking them
noinstr and using instrumentation_begin() and instrumentation_end() to flag the
instrumentable ranges of code:

.. code-block:: c

  noinstr void entry(void)
  {
          handle_entry();     // <-- must be 'noinstr' or '__always_inline'
          ...

          instrumentation_begin();
          handle_context();   // <-- instrumentable code
          instrumentation_end();

          ...
          handle_exit();      // <-- must be 'noinstr' or '__always_inline'
  }

This allows verification of the 'noinstr' restrictions via objtool on
supported architectures.

Invoking non-instrumentable functions from instrumentable context has no
restrictions and is useful to protect e.g.
state switching which would
cause malfunction if instrumented.

All non-instrumentable entry/exit code sections before and after the RCU
state transitions must run with interrupts disabled.

Syscalls
--------

Syscall-entry code starts in assembly code and calls out into low-level C code
after establishing low-level architecture-specific state and stack frames. This
low-level C code must not be instrumented. A typical syscall handling function
invoked from low-level assembly code looks like this:

.. code-block:: c

  noinstr void syscall(struct pt_regs *regs, int nr)
  {
          arch_syscall_enter(regs);
          nr = syscall_enter_from_user_mode(regs, nr);

          instrumentation_begin();
          if (!invoke_syscall(regs, nr) && nr != -1)
                  result_reg(regs) = __sys_ni_syscall(regs);
          instrumentation_end();

          syscall_exit_to_user_mode(regs);
  }

syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
establishes state in the following order:

  * Lockdep
  * RCU / Context tracking
  * Tracing

and then invokes the various entry work functions like ptrace, seccomp, audit,
syscall tracing, etc. After all that is done, the instrumentable invoke_syscall
function can be invoked. The instrumentable code section then ends, after which
syscall_exit_to_user_mode() is invoked.

syscall_exit_to_user_mode() handles all work which needs to be done before
returning to user space like tracing, audit, signals, task work etc. After
that it invokes exit_to_user_mode() which again handles the state
transition in the reverse order:

  * Tracing
  * RCU / Context tracking
  * Lockdep

syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
available as fine grained subfunctions in cases where the architecture code
has to do extra work between the various steps.
In such cases it has to
ensure that enter_from_user_mode() is called first on entry and
exit_to_user_mode() is called last on exit.

Do not nest syscalls. Nested syscalls will cause RCU and/or context tracking
to print a warning.

KVM
---

Entering or exiting guest mode is very similar to syscalls. From the host
kernel point of view the CPU goes off into user space when entering the
guest and returns to the kernel on exit.

kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode()
and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
The state operations have the same ordering.

Task work handling is done separately for guest at the boundary of the
vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
the work handled on return to user space.

Do not nest KVM entry/exit transitions because doing so is nonsensical.

Interrupts and regular exceptions
---------------------------------

Interrupt entry and exit handling is slightly more complex than syscalls
and KVM transitions.

If an interrupt is raised while the CPU executes in user space, the entry
and exit handling is exactly the same as for syscalls.

If the interrupt is raised while the CPU executes in kernel space the entry and
exit handling is slightly different. RCU state is only updated when the
interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will
already be watching. Lockdep and tracing have to be updated unconditionally.

irqentry_enter() and irqentry_exit() provide the implementation for this.

The architecture-specific part looks similar to syscall handling:

.. code-block:: c

  noinstr void interrupt(struct pt_regs *regs, int nr)
  {
          irqentry_state_t state;

          arch_interrupt_enter(regs);
          state = irqentry_enter(regs);

          instrumentation_begin();

          irq_enter_rcu();
          invoke_irq_handler(regs, nr);
          irq_exit_rcu();

          instrumentation_end();

          irqentry_exit(regs, state);
  }

Note that the invocation of the actual interrupt handler is within an
irq_enter_rcu() and irq_exit_rcu() pair.

irq_enter_rcu() updates the preemption count which makes in_hardirq()
return true, handles NOHZ tick state and interrupt time accounting. This
means that up to the point where irq_enter_rcu() is invoked in_hardirq()
returns false.

irq_exit_rcu() handles interrupt time accounting, undoes the preemption
count update and eventually handles soft interrupts and NOHZ tick state.

In theory, the preemption count could be updated in irqentry_enter(). In
practice, deferring this update to irq_enter_rcu() allows the preemption-count
code to be traced, while also maintaining symmetry with irq_exit_rcu() and
irqentry_exit(), which are described in the next paragraph. The only downside
is that the early entry code up to irq_enter_rcu() must be aware that the
preemption count has not yet been updated with the HARDIRQ_OFFSET state.

Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count
before it handles soft interrupts, whose handlers must run in BH context rather
than irq-disabled context. In addition, irqentry_exit() might schedule, which
also requires that HARDIRQ_OFFSET has been removed from the preemption count.

Even though interrupt handlers are expected to run with local interrupts
disabled, interrupt nesting is common from an entry/exit perspective. For
example, softirq handling happens within an irqentry_{enter,exit}() block with
local interrupts enabled.
Also, although uncommon, nothing prevents an
interrupt handler from re-enabling interrupts.

Interrupt entry/exit code doesn't strictly need to handle reentrancy, since it
runs with local interrupts disabled. But NMIs can happen anytime, and a lot of
the entry code is shared between the two.

NMI and NMI-like exceptions
---------------------------

NMIs and NMI-like exceptions (machine checks, double faults, debug
interrupts, etc.) can hit any context and must be extra careful with
the state.

State changes for debug exceptions and machine-check exceptions depend on
whether these exceptions happened in user-space (breakpoints or watchpoints) or
in kernel mode (code patching). From user-space, they are treated like
interrupts, while from kernel mode they are treated like NMIs.

NMIs and other NMI-like exceptions handle state transitions without
distinguishing between user-mode and kernel-mode origin.

The state update on entry is handled in irqentry_nmi_enter() which updates
state in the following order:

  * Preemption counter
  * Lockdep
  * RCU / Context tracking
  * Tracing

The exit counterpart irqentry_nmi_exit() does the reverse operation in the
reverse order.

Note that the update of the preemption counter has to be the first
operation on enter and the last operation on exit. The reason is that both
lockdep and RCU rely on in_nmi() returning true in this case. The
preemption count modification in the NMI entry/exit case must not be
traced.

Architecture-specific code looks like this:

.. code-block:: c

  noinstr void nmi(struct pt_regs *regs)
  {
          irqentry_state_t state;

          arch_nmi_enter(regs);
          state = irqentry_nmi_enter(regs);

          instrumentation_begin();
          nmi_handler(regs);
          instrumentation_end();

          irqentry_nmi_exit(regs, state);
  }

and for e.g. a debug exception it can look like this:

.. code-block:: c

  noinstr void debug(struct pt_regs *regs)
  {
          irqentry_state_t state;

          arch_nmi_enter(regs);

          debug_regs = save_debug_regs();

          if (user_mode(regs)) {
                  state = irqentry_enter(regs);

                  instrumentation_begin();
                  user_mode_debug_handler(regs, debug_regs);
                  instrumentation_end();

                  irqentry_exit(regs, state);
          } else {
                  state = irqentry_nmi_enter(regs);

                  instrumentation_begin();
                  kernel_mode_debug_handler(regs, debug_regs);
                  instrumentation_end();

                  irqentry_nmi_exit(regs, state);
          }
  }

There is no combined irqentry_nmi_if_kernel() function available as the
above cannot be handled in an exception-agnostic way.

NMIs can happen in any context. For example, an NMI-like exception can be
triggered while handling an NMI. So NMI entry code has to be reentrant and
state updates need to handle nesting.
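
The ordering and nesting rules for NMI entry and exit can be modelled in plain
userspace C. The following is a minimal sketch, not kernel code: every name in
it (struct model_cpu, model_nmi_enter(), model_nmi_exit(), model_in_nmi()) is
hypothetical, and the per-component nesting counters are an assumption made
purely for illustration. It only shows that updating the preemption counter
first on entry and last on exit keeps the model's in_nmi() analogue true while
the other components are updated, including across nested entries:

.. code-block:: c

  /* Userspace model only. None of these names exist in the kernel. */
  #include <assert.h>
  #include <stddef.h>

  /* State components, listed in NMI entry order. */
  enum model_step {
          STEP_PREEMPT,
          STEP_LOCKDEP,
          STEP_RCU,
          STEP_TRACING,
          STEP_COUNT
  };

  #define MODEL_LOG_SIZE 64

  struct model_cpu {
          int depth[STEP_COUNT];                  /* nesting depth per component */
          enum model_step log[MODEL_LOG_SIZE];    /* order of all updates */
          size_t nlog;
  };

  static void model_update(struct model_cpu *cpu, enum model_step step, int delta)
  {
          cpu->depth[step] += delta;
          assert(cpu->depth[step] >= 0);          /* exits must not outnumber entries */
          if (cpu->nlog < MODEL_LOG_SIZE)
                  cpu->log[cpu->nlog++] = step;
  }

  /* Analogue of in_nmi(): true while the modelled preemption counter is raised. */
  static int model_in_nmi(const struct model_cpu *cpu)
  {
          return cpu->depth[STEP_PREEMPT] > 0;
  }

  /* Entry: preemption counter first, then lockdep, RCU, tracing. */
  static void model_nmi_enter(struct model_cpu *cpu)
  {
          for (int s = STEP_PREEMPT; s < STEP_COUNT; s++) {
                  /* Later steps rely on the counter having been raised already. */
                  assert(s == STEP_PREEMPT || model_in_nmi(cpu));
                  model_update(cpu, (enum model_step)s, +1);
          }
  }

  /* Exit: strict reverse order, preemption counter last. */
  static void model_nmi_exit(struct model_cpu *cpu)
  {
          for (int s = STEP_COUNT - 1; s >= STEP_PREEMPT; s--)
                  model_update(cpu, (enum model_step)s, -1);
  }

Calling model_nmi_enter() twice and then model_nmi_exit() twice on a
zero-initialized struct model_cpu returns every depth to zero, with the
preemption counter being the first component raised and the last one released.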