Entry/exit handling for exceptions, interrupts, syscalls and KVM
================================================================

All transitions between execution domains require state updates which are
subject to strict ordering constraints. State updates are required for the
following:

  * Lockdep
  * RCU / Context tracking
  * Preemption counter
  * Tracing
  * Time accounting

The update order depends on the transition type and is explained below in
the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular
exceptions`_, `NMI and NMI-like exceptions`_.

Non-instrumentable code - noinstr
---------------------------------
Most instrumentation facilities depend on RCU, so instrumentation is prohibited
for entry code before RCU starts watching and exit code after RCU stops
watching. In addition, many architectures must save and restore register state,
which means that (for example) a breakpoint in the breakpoint entry code would
overwrite the debug registers of the initial breakpoint.

Such code must be marked with the 'noinstr' attribute, placing that code into a
special section inaccessible to instrumentation and debug facilities. Some
functions are partially instrumentable, which is handled by marking them
noinstr and using instrumentation_begin() and instrumentation_end() to flag the
instrumentable ranges of code:

.. code-block:: c

  noinstr void entry(void)
  {
	handle_entry();     // <-- must be 'noinstr' or '__always_inline'
	...

	instrumentation_begin();
	handle_context();   // <-- instrumentable code
	instrumentation_end();

	...
	handle_exit();      // <-- must be 'noinstr' or '__always_inline'
  }

This allows verification of the 'noinstr' restrictions via objtool on
supported architectures.
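
The 'noinstr' attribute is, in essence, a section placement combined with a
set of opt-outs from the instrumentation facilities. A simplified sketch of
the definition (the real one in include/linux/compiler_types.h additionally
opts out of the various sanitizers, profiling and coverage) looks like this:

.. code-block:: c

  /* Simplified sketch - not the complete kernel definition */
  #define noinstr						\
	noinline notrace					\
	__attribute__((__section__(".noinstr.text")))

Placing such code in the dedicated .noinstr.text section is what enables
objtool to verify that nothing in that section calls out into instrumentable
code.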

Invoking non-instrumentable functions from instrumentable context has no
restrictions and is useful to protect e.g. state switching which would
cause malfunction if instrumented.
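
As an illustration (all function names here are hypothetical):

.. code-block:: c

  noinstr void switch_critical_state(void)
  {
	/*
	 * Must not be instrumented: a tracepoint or breakpoint hooked
	 * in here would observe half-switched state.
	 */
	...
  }

  void prepare_switch(void)		/* fully instrumentable */
  {
	trace_prepare_switch();		/* tracing is fine here */
	switch_critical_state();	/* calling noinstr code is fine too */
  }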

All non-instrumentable entry/exit code sections before and after the RCU
state transitions must run with interrupts disabled.

Syscalls
--------
Syscall-entry code starts in assembly code and calls out into low-level C code
after establishing low-level architecture-specific state and stack frames. This
low-level C code must not be instrumented. A typical syscall handling function
invoked from low-level assembly code looks like this:

.. code-block:: c

  noinstr void syscall(struct pt_regs *regs, int nr)
  {
	arch_syscall_enter(regs);
	nr = syscall_enter_from_user_mode(regs, nr);

	instrumentation_begin();
	if (!invoke_syscall(regs, nr) && nr != -1)
		result_reg(regs) = __sys_ni_syscall(regs);
	instrumentation_end();

	syscall_exit_to_user_mode(regs);
  }
syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
establishes state in the following order:

  * Lockdep
  * RCU / Context tracking
  * Tracing

and then invokes the various entry work functions like ptrace, seccomp, audit,
syscall tracing, etc. After all that is done, the instrumentable invoke_syscall
function can be invoked. The instrumentable code section then ends, after which
syscall_exit_to_user_mode() is invoked.
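
A simplified sketch of that ordering inside enter_from_user_mode() (details
such as architecture hooks and context-tracking sanity checks omitted):

.. code-block:: c

  static __always_inline void enter_from_user_mode(struct pt_regs *regs)
  {
	lockdep_hardirqs_off(CALLER_ADDR0);	/* 1) Lockdep */
	user_exit_irqoff();			/* 2) RCU / context tracking */

	instrumentation_begin();
	trace_hardirqs_off_finish();		/* 3) Tracing */
	instrumentation_end();
  }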

syscall_exit_to_user_mode() handles all work which needs to be done before
returning to user space like tracing, audit, signals, task work etc. After
that it invokes exit_to_user_mode() which again handles the state
transition in the reverse order:

  * Tracing
  * RCU / Context tracking
  * Lockdep
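
The corresponding simplified sketch of exit_to_user_mode(), mirroring the
entry sequence in reverse:

.. code-block:: c

  static __always_inline void exit_to_user_mode(void)
  {
	instrumentation_begin();
	trace_hardirqs_on_prepare();		/* 1) Tracing */
	lockdep_hardirqs_on_prepare();
	instrumentation_end();

	user_enter_irqoff();			/* 2) RCU / context tracking */
	lockdep_hardirqs_on(CALLER_ADDR0);	/* 3) Lockdep */
  }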

syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
available as fine-grained subfunctions in cases where the architecture code
has to do extra work between the various steps. In such cases the
architecture code has to ensure that enter_from_user_mode() is called first
on entry and exit_to_user_mode() is called last on exit.
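
A hedged sketch of such a split entry path (arch_do_extra_entry_work() is a
hypothetical architecture hook; the *_work() helpers are the fine-grained
subfunctions from include/linux/entry-common.h):

.. code-block:: c

  noinstr void syscall(struct pt_regs *regs, int nr)
  {
	arch_syscall_enter(regs);

	enter_from_user_mode(regs);		/* must be first */

	instrumentation_begin();
	local_irq_enable();
	arch_do_extra_entry_work(regs);		/* hypothetical extra step */
	nr = syscall_enter_from_user_mode_work(regs, nr);

	if (!invoke_syscall(regs, nr) && nr != -1)
		result_reg(regs) = __sys_ni_syscall(regs);

	syscall_exit_to_user_mode_work(regs);
	instrumentation_end();

	exit_to_user_mode();			/* must be last */
  }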

Do not nest syscalls. Nested syscalls will cause RCU and/or context tracking
to print a warning.

KVM
---

Entering or exiting guest mode is very similar to syscalls. From the host
kernel point of view the CPU goes off into user space when entering the
guest and returns to the kernel on exit.

kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode()
and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
The state operations have the same ordering.

Task work handling is done separately for guest mode at the boundary of the
vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
the work handled on return to user space.
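
A rough sketch of where this sits in a hypothetical vcpu_run() loop (the loop
body is heavily simplified and arch_run_guest() is a hypothetical hook; real
implementations are architecture-specific):

.. code-block:: c

  int vcpu_run(struct kvm_vcpu *vcpu)
  {
	int ret;

	for (;;) {
		/*
		 * Handle pending task work (signals, resched, ...) before
		 * entering the guest - a subset of return-to-user work.
		 */
		ret = xfer_to_guest_mode_handle_work(vcpu);
		if (ret)
			return ret;

		local_irq_disable();
		kvm_guest_enter_irqoff();  /* analogous to exit_to_user_mode() */

		arch_run_guest(vcpu);	   /* hypothetical arch hook */

		kvm_guest_exit_irqoff();   /* analogous to enter_from_user_mode() */
		local_irq_enable();
	}
  }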

Do not nest KVM entry/exit transitions because doing so is nonsensical.

Interrupts and regular exceptions
---------------------------------
Interrupt entry and exit handling is slightly more complex than syscalls
and KVM transitions.

If an interrupt is raised while the CPU executes in user space, the entry
and exit handling is exactly the same as for syscalls.

If the interrupt is raised while the CPU executes in kernel space the entry and
exit handling is slightly different. RCU state is only updated when the
interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will
already be watching. Lockdep and tracing have to be updated unconditionally.

irqentry_enter() and irqentry_exit() provide the implementation for this.
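
A simplified sketch of the conditional logic inside irqentry_enter() (the
names of the internal helpers vary across kernel versions):

.. code-block:: c

  irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs)
  {
	irqentry_state_t ret = { .exit_rcu = false };

	if (user_mode(regs)) {
		/* Same handling as syscall entry */
		irqentry_enter_from_user_mode(regs);
		return ret;
	}

	if (is_idle_task(current)) {
		/* Idle task: RCU is not watching, so start it */
		lockdep_hardirqs_off(CALLER_ADDR0);
		ct_irq_enter();
		instrumentation_begin();
		trace_hardirqs_off_finish();
		instrumentation_end();

		ret.exit_rcu = true;
		return ret;
	}

	/* RCU is already watching: only lockdep and tracing updates */
	lockdep_hardirqs_off(CALLER_ADDR0);
	instrumentation_begin();
	trace_hardirqs_off_finish();
	instrumentation_end();

	return ret;
  }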

The architecture-specific part looks similar to syscall handling:

.. code-block:: c

  noinstr void interrupt(struct pt_regs *regs, int nr)
  {
	arch_interrupt_enter(regs);
	state = irqentry_enter(regs);

	instrumentation_begin();

	irq_enter_rcu();
	invoke_irq_handler(regs, nr);
	irq_exit_rcu();

	instrumentation_end();

	irqentry_exit(regs, state);
  }

Note that the invocation of the actual interrupt handler is within an
irq_enter_rcu() and irq_exit_rcu() pair.

irq_enter_rcu() updates the preemption count which makes in_hardirq()
return true, handles NOHZ tick state and interrupt time accounting. This
means that up to the point where irq_enter_rcu() is invoked, in_hardirq()
returns false.

irq_exit_rcu() handles interrupt time accounting, undoes the preemption
count update and eventually handles soft interrupts and NOHZ tick state.

In theory, the preemption count could be updated in irqentry_enter(). In
practice, deferring this update to irq_enter_rcu() allows the preemption-count
code to be traced, while also maintaining symmetry with irq_exit_rcu() and
irqentry_exit(), which are described in the next paragraph. The only downside
is that the early entry code up to irq_enter_rcu() must be aware that the
preemption count has not yet been updated with the HARDIRQ_OFFSET state.

Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count
before it handles soft interrupts, whose handlers must run in BH context rather
than irq-disabled context. In addition, irqentry_exit() might schedule, which
also requires that HARDIRQ_OFFSET has been removed from the preemption count.
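
A simplified sketch of this HARDIRQ_OFFSET handling inside irq_enter_rcu()
and irq_exit_rcu() (time accounting and NOHZ details omitted):

.. code-block:: c

  void irq_enter_rcu(void)
  {
	/* ... irq time accounting, NOHZ tick handling ... */
	preempt_count_add(HARDIRQ_OFFSET);	/* in_hardirq() true from here */
  }

  void irq_exit_rcu(void)
  {
	/* ... irq time accounting ... */
	preempt_count_sub(HARDIRQ_OFFSET);	/* must precede softirqs */
	if (!in_interrupt() && local_softirq_pending())
		invoke_softirq();		/* runs in BH context */
	/* ... NOHZ tick handling ... */
  }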
186bf026e2eSThomas Gleixner
187*e3aa43e9SNicolas Saenz JulienneEven though interrupt handlers are expected to run with local interrupts
188*e3aa43e9SNicolas Saenz Juliennedisabled, interrupt nesting is common from an entry/exit perspective. For
189*e3aa43e9SNicolas Saenz Julienneexample, softirq handling happens within an irqentry_{enter,exit}() block with
190*e3aa43e9SNicolas Saenz Juliennelocal interrupts enabled. Also, although uncommon, nothing prevents an
191*e3aa43e9SNicolas Saenz Julienneinterrupt handler from re-enabling interrupts.
192*e3aa43e9SNicolas Saenz Julienne
193*e3aa43e9SNicolas Saenz JulienneInterrupt entry/exit code doesn't strictly need to handle reentrancy, since it
194*e3aa43e9SNicolas Saenz Julienneruns with local interrupts disabled. But NMIs can happen anytime, and a lot of
195*e3aa43e9SNicolas Saenz Juliennethe entry code is shared between the two.
196*e3aa43e9SNicolas Saenz Julienne
NMI and NMI-like exceptions
---------------------------

NMIs and NMI-like exceptions (machine checks, double faults, debug
interrupts, etc.) can hit any context and must be extra careful with
the state.

State changes for debug exceptions and machine-check exceptions depend on
whether these exceptions happened in user-space (breakpoints or watchpoints) or
in kernel mode (code patching). From user-space, they are treated like
interrupts, while from kernel mode they are treated like NMIs.

NMIs and other NMI-like exceptions handle state transitions without
distinguishing between user-mode and kernel-mode origin.

The state update on entry is handled in irqentry_nmi_enter() which updates
state in the following order:

  * Preemption counter
  * Lockdep
  * RCU / Context tracking
  * Tracing

The exit counterpart irqentry_nmi_exit() does the reverse operation in the
reverse order.

Note that the update of the preemption counter has to be the first
operation on enter and the last operation on exit. The reason is that both
lockdep and RCU rely on in_nmi() returning true in this case. The
preemption count modification in the NMI entry/exit case must not be
traced.
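
A simplified sketch of irqentry_nmi_enter() illustrating that ordering (the
untraced preemption count update comes first so that in_nmi() is already true
when lockdep and RCU are updated; helper names vary across kernel versions):

.. code-block:: c

  irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
  {
	irqentry_state_t irq_state;

	irq_state.lockdep = lockdep_hardirqs_enabled();

	__nmi_enter();				/* 1) Preemption counter, untraced */
	lockdep_hardirqs_off(CALLER_ADDR0);	/* 2) Lockdep */
	lockdep_hardirq_enter();
	ct_nmi_enter();				/* 3) RCU / context tracking */

	instrumentation_begin();
	trace_hardirqs_off_finish();		/* 4) Tracing */
	ftrace_nmi_enter();
	instrumentation_end();

	return irq_state;
  }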

Architecture-specific code looks like this:

.. code-block:: c

  noinstr void nmi(struct pt_regs *regs)
  {
	arch_nmi_enter(regs);
	state = irqentry_nmi_enter(regs);

	instrumentation_begin();
	nmi_handler(regs);
	instrumentation_end();

	irqentry_nmi_exit(regs, state);
  }

and for e.g. a debug exception it can look like this:

.. code-block:: c

  noinstr void debug(struct pt_regs *regs)
  {
	arch_nmi_enter(regs);

	debug_regs = save_debug_regs();

	if (user_mode(regs)) {
		state = irqentry_enter(regs);

		instrumentation_begin();
		user_mode_debug_handler(regs, debug_regs);
		instrumentation_end();

		irqentry_exit(regs, state);
	} else {
		state = irqentry_nmi_enter(regs);

		instrumentation_begin();
		kernel_mode_debug_handler(regs, debug_regs);
		instrumentation_end();

		irqentry_nmi_exit(regs, state);
	}
  }

There is no combined irqentry_nmi_if_kernel() function available as the
above cannot be handled in an exception-agnostic way.

NMIs can happen in any context. For example, an NMI-like exception can be
triggered while handling an NMI. So NMI entry code has to be reentrant and
state updates need to handle nesting.