Entry/exit handling for exceptions, interrupts, syscalls and KVM
================================================================

All transitions between execution domains require state updates which are
subject to strict ordering constraints. State updates are required for the
following:

 * Lockdep
 * RCU / Context tracking
 * Preemption counter
 * Tracing
 * Time accounting

The update order depends on the transition type and is explained below in
the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular
exceptions`_, `NMI and NMI-like exceptions`_.

Non-instrumentable code - noinstr
---------------------------------

Most instrumentation facilities depend on RCU, so instrumentation is prohibited
for entry code before RCU starts watching and for exit code after RCU stops
watching. In addition, many architectures must save and restore register state,
which means that (for example) a breakpoint in the breakpoint entry code would
overwrite the debug registers of the initial breakpoint.

Such code must be marked with the 'noinstr' attribute, placing that code into a
special section inaccessible to instrumentation and debug facilities. Some
functions are partially instrumentable, which is handled by marking them
noinstr and using instrumentation_begin() and instrumentation_end() to flag the
instrumentable ranges of code:

.. code-block:: c

  noinstr void entry(void)
  {
        handle_entry(); // <-- must be 'noinstr' or '__always_inline'
        ...

        instrumentation_begin();
        handle_context(); // <-- instrumentable code
        instrumentation_end();

        ...
        handle_exit(); // <-- must be 'noinstr' or '__always_inline'
  }

This allows verification of the 'noinstr' restrictions via objtool on
supported architectures.

Invoking non-instrumentable functions from instrumentable context has no
restrictions and is useful to protect e.g. state switching which would
cause malfunction if instrumented; see the sketch at the end of this
section.

All non-instrumentable entry/exit code sections before and after the RCU
state transitions must run with interrupts disabled.
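
As an illustration of the point about calling into non-instrumentable code, a
minimal sketch follows. All function names are made up for this example; the
only thing that matters is the 'noinstr' marking and the calling direction:

.. code-block:: c

  /* Hypothetical names; only the calling direction is the point. */
  noinstr void switch_critical_state(void)
  {
        /*
         * Must not be instrumented: a tracer hooked in here would run
         * while the state it depends on is inconsistent.
         */
        write_state_register();
  }

  void instrumentable_caller(void)      /* regular, traceable code */
  {
        prepare_state();                /* may be instrumented freely */
        switch_critical_state();        /* calling noinstr code is fine */
  }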

Syscalls
--------

Syscall-entry code starts in assembly code and calls out into low-level C code
after establishing low-level architecture-specific state and stack frames. This
low-level C code must not be instrumented. A typical syscall handling function
invoked from low-level assembly code looks like this:

.. code-block:: c

  noinstr void syscall(struct pt_regs *regs, int nr)
  {
        arch_syscall_enter(regs);
        nr = syscall_enter_from_user_mode(regs, nr);

        instrumentation_begin();
        if (!invoke_syscall(regs, nr) && nr != -1)
                result_reg(regs) = __sys_ni_syscall(regs);
        instrumentation_end();

        syscall_exit_to_user_mode(regs);
  }

syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
establishes state in the following order:

 * Lockdep
 * RCU / Context tracking
 * Tracing

and then invokes the various entry work functions like ptrace, seccomp, audit,
syscall tracing, etc. After all that is done, the instrumentable invoke_syscall
function can be invoked. The instrumentable code section then ends, after which
syscall_exit_to_user_mode() is invoked.

syscall_exit_to_user_mode() handles all work which needs to be done before
returning to user space like tracing, audit, signals, task work etc. After
that it invokes exit_to_user_mode() which again handles the state
transition in the reverse order:

 * Tracing
 * RCU / Context tracking
 * Lockdep

syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
available as fine-grained subfunctions in cases where the architecture code
has to do extra work between the various steps. In such cases it has to
ensure that enter_from_user_mode() is called first on entry and
exit_to_user_mode() is called last on exit; see the sketch at the end of
this section.

Do not nest syscalls. Nested syscalls will cause RCU and/or context tracking
to print a warning.
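
A minimal sketch of such a fine-grained variant follows. The exact split
points and the helper names other than enter_from_user_mode() and
exit_to_user_mode() are assumptions made for this example; the invariant
shown is only that enter_from_user_mode() comes first and
exit_to_user_mode() comes last:

.. code-block:: c

  noinstr void syscall(struct pt_regs *regs, int nr)
  {
        arch_syscall_enter(regs);

        /* First: Lockdep, RCU / context tracking, tracing. */
        enter_from_user_mode(regs);

        instrumentation_begin();
        arch_extra_work(regs);          /* hypothetical extra arch step */
        nr = syscall_enter_work(regs, nr); /* ptrace, seccomp, audit ... */

        if (!invoke_syscall(regs, nr) && nr != -1)
                result_reg(regs) = __sys_ni_syscall(regs);

        syscall_exit_work(regs);        /* tracing, audit, signals, task work */
        instrumentation_end();

        /* Last: tracing, RCU / context tracking, Lockdep, in reverse order. */
        exit_to_user_mode();
  }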

KVM
---

Entering or exiting guest mode is very similar to syscalls. From the host
kernel's point of view the CPU goes off into user space when entering the
guest and returns to the kernel on exit.

kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode()
and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
The state operations have the same ordering.

Task work handling is done separately for guests at the boundary of the
vcpu_run() loop via xfer_to_guest_mode_handle_work(), which is a subset of
the work handled on return to user space; see the sketch at the end of this
section.

Do not nest KVM entry/exit transitions because doing so is nonsensical.
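
A minimal sketch of how these pieces fit together in a run loop. This is not
the actual KVM code; the loop structure and the arch_enter_guest() and
handle_exit_reason() helpers are assumptions made for this example:

.. code-block:: c

  int vcpu_run(struct kvm_vcpu *vcpu)
  {
        int ret = 0;

        while (!ret) {
                /* Handle pending task work before entering the guest. */
                ret = xfer_to_guest_mode_handle_work(vcpu);
                if (ret)
                        break;

                local_irq_disable();
                kvm_guest_enter_irqoff();       /* like exit_to_user_mode() */
                arch_enter_guest(vcpu);         /* hypothetical: run the guest */
                kvm_guest_exit_irqoff();        /* like enter_from_user_mode() */
                local_irq_enable();

                ret = handle_exit_reason(vcpu); /* hypothetical exit handling */
        }
        return ret;
  }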

Interrupts and regular exceptions
---------------------------------

Interrupt entry and exit handling is slightly more complex than syscall
and KVM transitions.

If an interrupt is raised while the CPU executes in user space, the entry
and exit handling is exactly the same as for syscalls.

If the interrupt is raised while the CPU executes in kernel space the entry and
exit handling is slightly different. RCU state is only updated when the
interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will
already be watching. Lockdep and tracing have to be updated unconditionally.

irqentry_enter() and irqentry_exit() provide the implementation for this.

The architecture-specific part looks similar to syscall handling:

.. code-block:: c

  noinstr void interrupt(struct pt_regs *regs, int nr)
  {
        arch_interrupt_enter(regs);
        state = irqentry_enter(regs);

        instrumentation_begin();

        irq_enter_rcu();
        invoke_irq_handler(regs, nr);
        irq_exit_rcu();

        instrumentation_end();

        irqentry_exit(regs, state);
  }

Note that the invocation of the actual interrupt handler is within an
irq_enter_rcu() and irq_exit_rcu() pair.

irq_enter_rcu() updates the preemption count which makes in_hardirq()
return true, handles NOHZ tick state and interrupt time accounting. This
means that up to the point where irq_enter_rcu() is invoked, in_hardirq()
returns false.

irq_exit_rcu() handles interrupt time accounting, undoes the preemption
count update and eventually handles soft interrupts and NOHZ tick state.

In theory, the preemption count could be updated in irqentry_enter(). In
practice, deferring this update to irq_enter_rcu() allows the preemption-count
code to be traced, while also maintaining symmetry with irq_exit_rcu() and
irqentry_exit(), which are described in the next paragraph. The only downside
is that the early entry code up to irq_enter_rcu() must be aware that the
preemption count has not yet been updated with the HARDIRQ_OFFSET state.

Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count
before it handles soft interrupts, whose handlers must run in BH context rather
than irq-disabled context. In addition, irqentry_exit() might schedule, which
also requires that HARDIRQ_OFFSET has been removed from the preemption count.
The annotated sketch at the end of this section summarizes these transitions.

Even though interrupt handlers are expected to run with local interrupts
disabled, interrupt nesting is common from an entry/exit perspective. For
example, softirq handling happens within an irqentry_{enter,exit}() block with
local interrupts enabled. Also, although uncommon, nothing prevents an
interrupt handler from re-enabling interrupts.

Interrupt entry/exit code doesn't strictly need to handle reentrancy, since it
runs with local interrupts disabled. But NMIs can happen anytime, and a lot of
the entry code is shared between the two.
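
To summarize the preemption-count transitions, here is the interrupt example
again with the in_hardirq() view annotated. The annotations are explanatory
only; the code is unchanged from the example above:

.. code-block:: c

  noinstr void interrupt(struct pt_regs *regs, int nr)
  {
        arch_interrupt_enter(regs);
        state = irqentry_enter(regs);   /* in_hardirq() still returns false */

        instrumentation_begin();

        irq_enter_rcu();                /* adds HARDIRQ_OFFSET:
                                           in_hardirq() returns true now */
        invoke_irq_handler(regs, nr);
        irq_exit_rcu();                 /* removes HARDIRQ_OFFSET before
                                           running softirqs in BH context */

        instrumentation_end();

        irqentry_exit(regs, state);     /* may schedule: requires that
                                           HARDIRQ_OFFSET is gone */
  }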

NMI and NMI-like exceptions
---------------------------

NMIs and NMI-like exceptions (machine checks, double faults, debug
interrupts, etc.) can hit any context and must be extra careful with
the state.

State changes for debug exceptions and machine-check exceptions depend on
whether these exceptions happened in user-space (breakpoints or watchpoints) or
in kernel mode (code patching). From user-space, they are treated like
interrupts, while from kernel mode they are treated like NMIs.

NMIs and other NMI-like exceptions handle state transitions without
distinguishing between user-mode and kernel-mode origin.

The state update on entry is handled in irqentry_nmi_enter() which updates
state in the following order:

 * Preemption counter
 * Lockdep
 * RCU / Context tracking
 * Tracing

The exit counterpart irqentry_nmi_exit() does the reverse operation in the
reverse order.

Note that the update of the preemption counter has to be the first
operation on enter and the last operation on exit. The reason is that both
lockdep and RCU rely on in_nmi() returning true in this case. The
preemption count modification in the NMI entry/exit case must not be
traced; see the pseudocode sketch at the end of this section.

Architecture-specific code looks like this:

.. code-block:: c

  noinstr void nmi(struct pt_regs *regs)
  {
        arch_nmi_enter(regs);
        state = irqentry_nmi_enter(regs);

        instrumentation_begin();
        nmi_handler(regs);
        instrumentation_end();

        irqentry_nmi_exit(regs, state);
  }

and for e.g. a debug exception it can look like this:

.. code-block:: c

  noinstr void debug(struct pt_regs *regs)
  {
        arch_nmi_enter(regs);

        debug_regs = save_debug_regs();

        if (user_mode(regs)) {
                state = irqentry_enter(regs);

                instrumentation_begin();
                user_mode_debug_handler(regs, debug_regs);
                instrumentation_end();

                irqentry_exit(regs, state);
        } else {
                state = irqentry_nmi_enter(regs);

                instrumentation_begin();
                kernel_mode_debug_handler(regs, debug_regs);
                instrumentation_end();

                irqentry_nmi_exit(regs, state);
        }
  }

There is no combined irqentry_nmi_if_kernel() function available as the
above cannot be handled in an exception-agnostic way.
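
To illustrate the ordering rules above, here is a simplified pseudocode
sketch of the enter side. This is not the real implementation; every helper
name below is made up, and only the ordering and the untraced
preemption-count update are the point:

.. code-block:: c

  irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
  {
        irqentry_state_t state;

        /*
         * 1) Preemption counter, first and untraced: from here on
         *    in_nmi() returns true, which lockdep and RCU rely on.
         */
        untraced_preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);

        /* 2) Lockdep */
        state.lockdep = lockdep_hardirqs_save_and_off();

        /* 3) RCU / context tracking */
        rcu_context_tracking_nmi_enter();

        /* 4) Tracing, now that RCU is watching */
        instrumentation_begin();
        trace_nmi_enter();
        instrumentation_end();

        return state;
  }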

NMIs can happen in any context. For example, an NMI-like exception can be
triggered while an NMI is being handled. So NMI entry code has to be
reentrant, and its state updates need to handle nesting; see the sketch
below.
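
One thing that makes this nesting work in the examples above is that the
state cookie returned by irqentry_nmi_enter() lives on the stack, so each
nesting level saves and restores its own copy. A minimal sketch of that
pattern (the handler names are hypothetical):

.. code-block:: c

  noinstr void machine_check(struct pt_regs *regs)
  {
        /*
         * On-stack cookie: if this exception nests inside an NMI,
         * each level operates on its own saved state.
         */
        irqentry_state_t state = irqentry_nmi_enter(regs);

        instrumentation_begin();
        handle_machine_check(regs);     /* hypothetical handler */
        instrumentation_end();

        irqentry_nmi_exit(regs, state);
  }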