.. SPDX-License-Identifier: GPL-2.0

===============================
Kernel level exception handling
===============================

Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com>

When a process runs in kernel mode, it often has to access user
mode memory whose address has been passed by an untrusted program.
To protect itself the kernel has to verify this address.

In older versions of Linux this was done with the
int verify_area(int type, const void * addr, unsigned long size)
function (which has since been replaced by access_ok()).
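
For comparison, an explicit range check with the modern helper looks roughly
like this (a sketch; ``uptr`` and ``len`` are hypothetical names, and recent
kernels take only the address and size)::

	if (!access_ok(uptr, len))
		return -EFAULT;
	/* access_ok() only checks that the range lies below the user/kernel
	 * split; whether the pages are actually mapped and accessible is left
	 * to the page fault machinery described below. */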

verify_area() verified that the memory area starting at address
'addr' and of size 'size' was accessible for the operation specified
in type (read or write). To do this, verify_area() had to look up the
virtual memory area (vma) that contained the address addr. In the
normal case (correctly working program), this test was successful.
It only failed for a few buggy programs. In some kernel profiling
tests, this normally unneeded verification used up a considerable
amount of time.

To overcome this situation, Linus decided to let the virtual memory
hardware present in every Linux-capable CPU handle this test.

How does this work?

Whenever the kernel tries to access an address that is currently not
accessible, the CPU generates a page fault exception and calls the
page fault handler::

  void exc_page_fault(struct pt_regs *regs, unsigned long error_code)

in arch/x86/mm/fault.c. The parameters on the stack are set up by
the low-level assembly glue in arch/x86/entry/entry_32.S. The parameter
regs is a pointer to the saved registers on the stack; error_code
contains a reason code for the exception.

exc_page_fault() first obtains the inaccessible address from the CPU
control register CR2. If the address is within the virtual address
space of the process, the fault probably occurred because the page
was not swapped in, was write protected, or something similar. However,
we are interested in the other case: the address is not valid and there
is no vma that contains it. In this case, the kernel jumps to the
bad_area label.
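
In pseudo-C, the start of this path looks roughly like the sketch below
(``sketch_page_fault`` is a hypothetical name used only for illustration;
the real handler does considerably more, including taking the mmap lock)::

	static void sketch_page_fault(struct pt_regs *regs, unsigned long error_code)
	{
		unsigned long address = read_cr2();	/* the faulting address */
		struct vm_area_struct *vma = find_vma(current->mm, address);

		if (vma && vma->vm_start <= address) {
			/* good area: swap the page in, handle COW, ... */
			return;
		}
		/* bad_area: no vma covers the address, try a fixup (see below) */
	}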

There it uses the address of the instruction that caused the exception
(i.e. regs->eip) to find an address where the execution can continue
(fixup). If this search is successful, the fault handler modifies the
return address (again regs->eip) and returns. The execution will
continue at the address in fixup.
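
Stripped of details, the fixup logic amounts to something like this
(a simplified sketch; the real fixup_exception() takes additional
arguments and, as described later in this document, supports several
handler types)::

	int fixup_exception(struct pt_regs *regs)
	{
		const struct exception_table_entry *fixup;

		fixup = search_exception_tables(regs->ip);	/* regs->eip on old i386 */
		if (!fixup)
			return 0;	/* no entry: this is a genuine kernel bug */

		/* resume at the fixup code; here the table holds absolute addresses */
		regs->ip = fixup->fixup;
		return 1;
	}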

Where does fixup point to?

Since we jump to the contents of fixup, fixup obviously points
to executable code. This code is hidden inside the user access macros.
I have picked the get_user() macro defined in arch/x86/include/asm/uaccess.h
as an example. The definition is somewhat hard to follow, so let's peek at
the code generated by the preprocessor and the compiler. I selected
the get_user() call in drivers/char/sysrq.c for a detailed examination.

The original code in sysrq.c line 587::

        get_user(c, buf);

The preprocessor output (edited to become somewhat readable)::

  (
    {
      long __gu_err = - 14 , __gu_val = 0;
      const __typeof__(*( (  buf ) )) *__gu_addr = ((buf));
      if (((((0 + current_set[0])->tss.segment) == 0x18 )  ||
        (((sizeof(*(buf))) <= 0xC0000000UL) &&
        ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf)))))))
        do {
          __gu_err  = 0;
          switch ((sizeof(*(buf)))) {
            case 1:
              __asm__ __volatile__(
                "1:      mov" "b" " %2,%" "b" "1\n"
                "2:\n"
                ".section .fixup,\"ax\"\n"
                "3:      movl %3,%0\n"
                "        xor" "b" " %" "b" "1,%" "b" "1\n"
                "        jmp 2b\n"
                ".section __ex_table,\"a\"\n"
                "        .align 4\n"
                "        .long 1b,3b\n"
                ".text"        : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *)
                              (   __gu_addr   )) ), "i"(- 14 ), "0"(  __gu_err  )) ;
                break;
            case 2:
              __asm__ __volatile__(
                "1:      mov" "w" " %2,%" "w" "1\n"
                "2:\n"
                ".section .fixup,\"ax\"\n"
                "3:      movl %3,%0\n"
                "        xor" "w" " %" "w" "1,%" "w" "1\n"
                "        jmp 2b\n"
                ".section __ex_table,\"a\"\n"
                "        .align 4\n"
                "        .long 1b,3b\n"
                ".text"        : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
                              (   __gu_addr   )) ), "i"(- 14 ), "0"(  __gu_err  ));
                break;
            case 4:
              __asm__ __volatile__(
                "1:      mov" "l" " %2,%" "" "1\n"
                "2:\n"
                ".section .fixup,\"ax\"\n"
                "3:      movl %3,%0\n"
                "        xor" "l" " %" "" "1,%" "" "1\n"
                "        jmp 2b\n"
                ".section __ex_table,\"a\"\n"
                "        .align 4\n"        "        .long 1b,3b\n"
                ".text"        : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
                              (   __gu_addr   )) ), "i"(- 14 ), "0"(__gu_err));
                break;
            default:
              (__gu_val) = __get_user_bad();
          }
        } while (0) ;
      ((c)) = (__typeof__(*((buf))))__gu_val;
      __gu_err;
    }
  );

WOW! Black GCC/assembly magic. This is impossible to follow, so let's
see what code gcc generates::

 >         xorl %edx,%edx
 >         movl current_set,%eax
 >         cmpl $24,788(%eax)
 >         je .L1424
 >         cmpl $-1073741825,64(%esp)
 >         ja .L1423
 > .L1424:
 >         movl %edx,%eax
 >         movl 64(%esp),%ebx
 > #APP
 > 1:      movb (%ebx),%dl                /* this is the actual user access */
 > 2:
 > .section .fixup,"ax"
 > 3:      movl $-14,%eax
 >         xorb %dl,%dl
 >         jmp 2b
 > .section __ex_table,"a"
 >         .align 4
 >         .long 1b,3b
 > .text
 > #NO_APP
 > .L1423:
 >         movzbl %dl,%esi

The optimizer does a good job and gives us something we can actually
understand. Can we? The actual user access is quite obvious. Thanks
to the unified address space we can just access the address in user
memory. But what does the .section stuff do?????

To understand this we have to look at the final kernel::

 > objdump --section-headers vmlinux
 >
 > vmlinux:     file format elf32-i386
 >
 > Sections:
 > Idx Name          Size      VMA       LMA       File off  Algn
 >   0 .text         00098f40  c0100000  c0100000  00001000  2**4
 >                   CONTENTS, ALLOC, LOAD, READONLY, CODE
 >   1 .fixup        000016bc  c0198f40  c0198f40  00099f40  2**0
 >                   CONTENTS, ALLOC, LOAD, READONLY, CODE
 >   2 .rodata       0000f127  c019a5fc  c019a5fc  0009b5fc  2**2
 >                   CONTENTS, ALLOC, LOAD, READONLY, DATA
 >   3 __ex_table    000015c0  c01a9724  c01a9724  000aa724  2**2
 >                   CONTENTS, ALLOC, LOAD, READONLY, DATA
 >   4 .data         0000ea58  c01abcf0  c01abcf0  000abcf0  2**4
 >                   CONTENTS, ALLOC, LOAD, DATA
 >   5 .bss          00018e21  c01ba748  c01ba748  000ba748  2**2
 >                   ALLOC
 >   6 .comment      00000ec4  00000000  00000000  000ba748  2**0
 >                   CONTENTS, READONLY
 >   7 .note         00001068  00000ec4  00000ec4  000bb60c  2**0
 >                   CONTENTS, READONLY

There are obviously two non-standard ELF sections in the generated object
file. But first we want to find out what happened to our code in the
final kernel executable::

 > objdump --disassemble --section=.text vmlinux
 >
 > c017e785 <do_con_write+c1> xorl   %edx,%edx
 > c017e787 <do_con_write+c3> movl   0xc01c7bec,%eax
 > c017e78c <do_con_write+c8> cmpl   $0x18,0x314(%eax)
 > c017e793 <do_con_write+cf> je     c017e79f <do_con_write+db>
 > c017e795 <do_con_write+d1> cmpl   $0xbfffffff,0x40(%esp,1)
 > c017e79d <do_con_write+d9> ja     c017e7a7 <do_con_write+e3>
 > c017e79f <do_con_write+db> movl   %edx,%eax
 > c017e7a1 <do_con_write+dd> movl   0x40(%esp,1),%ebx
 > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
 > c017e7a7 <do_con_write+e3> movzbl %dl,%esi

The whole user memory access is reduced to 10 x86 machine instructions.
The instructions bracketed in the .section directives are no longer
in the normal execution path. They are located in a different section
of the executable file::

 > objdump --disassemble --section=.fixup vmlinux
 >
 > c0199ff5 <.fixup+10b5> movl   $0xfffffff2,%eax
 > c0199ffa <.fixup+10ba> xorb   %dl,%dl
 > c0199ffc <.fixup+10bc> jmp    c017e7a7 <do_con_write+e3>

And finally::

 > objdump --full-contents --section=__ex_table vmlinux
 >
 >  c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0  ................
 >  c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0  ................
 >  c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0  ................

or in human readable byte order::

 >  c01aa7c4 c017c093 c0199fe0 c017c097 c017c099  ................
 >  c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
                               ^^^^^^^^^^^^^^^^^
                               this is the interesting part!
 >  c01aa7e4 c0180a08 c019a001 c0180a0a c019a004  ................

What happened? The assembly directives::

  .section .fixup,"ax"
  .section __ex_table,"a"

told the assembler to move the following code to the specified
sections in the ELF object file. So the instructions::

  3:      movl $-14,%eax
          xorb %dl,%dl
          jmp 2b

ended up in the .fixup section of the object file and the addresses::

        .long 1b,3b

ended up in the __ex_table section of the object file. 1b and 3b
are local labels. The local label 1b (1b stands for next label 1
backward) is the address of the instruction that might fault, i.e.
in our case the address of the label 1 is c017e7a5:
the original assembly code: > 1:      movb (%ebx),%dl
and linked in vmlinux     : > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl

The local label 3b (backward again) is the address of the code that handles
the fault; in our case the actual value is c0199ff5:
the original assembly code: > 3:      movl $-14,%eax
and linked in vmlinux     : > c0199ff5 <.fixup+10b5> movl   $0xfffffff2,%eax

If the fixup was able to handle the exception, control flow may be returned
to the instruction after the one that triggered the fault, i.e. local label 2b.

The assembly code::

 > .section __ex_table,"a"
 >         .align 4
 >         .long 1b,3b

becomes the value pair::

 >  c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
                               ^this is ^this is
                               1b       3b

c017e7a5,c0199ff5 in the exception table of the kernel.
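
Putting the pieces together, the classic i386 pattern can be written as a
small self-contained helper. The function below is purely illustrative (a
hypothetical helper, not kernel code; the kernel generates equivalent code
through its uaccess macros, as shown above)::

	/* Illustrative only -- a hand-written version of the pattern. */
	static int read_user_byte(const unsigned char *uaddr, unsigned char *val)
	{
		int err = 0;
		unsigned char v = 0;

		asm volatile("1:	movb %2,%b1\n"	/* may fault            */
			     "2:\n"
			     ".section .fixup,\"ax\"\n"
			     "3:	movl %3,%0\n"	/* err = -EFAULT        */
			     "	xorb %b1,%b1\n"		/* pretend we read zero */
			     "	jmp 2b\n"
			     ".section __ex_table,\"a\"\n"
			     "	.align 4\n"
			     "	.long 1b,3b\n"		/* (insn, fixup) pair   */
			     ".text"
			     : "=r" (err), "=q" (v)
			     : "m" (*uaddr), "i" (-14 /* -EFAULT */), "0" (err));

		*val = v;
		return err;
	}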

So, what actually happens if a fault from kernel mode with no suitable
vma occurs?

#. access to invalid address::

    > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
#. MMU generates exception
#. CPU calls exc_page_fault()
#. exc_page_fault() calls do_user_addr_fault()
#. do_user_addr_fault() calls kernelmode_fixup_or_oops()
#. kernelmode_fixup_or_oops() calls fixup_exception() (regs->eip == c017e7a5)
#. fixup_exception() calls search_exception_tables()
#. search_exception_tables() looks up the address c017e7a5 in the
   exception table (i.e. the contents of the ELF section __ex_table)
   and returns the address of the associated fault handling code c0199ff5.
#. fixup_exception() modifies the saved return address (regs->eip) to point
   to the fault handling code and returns.
#. execution continues in the fault handling code.
#. a) EAX becomes -EFAULT (== -14)
   b) DL  becomes zero (the value we "read" from user space)
   c) execution continues at local label 2 (address of the
      instruction immediately after the faulting user access).

Steps 11a to 11c in a certain way emulate the faulting instruction.

That's it, mostly. If you look at our example, you might ask why
we set EAX to -EFAULT in the exception handler code. Well, the
get_user() macro actually returns a value: 0 if the user access was
successful, -EFAULT on failure. Our original code did not test this
return value; however, the inline assembly code in get_user() tries to
return -EFAULT. GCC selected EAX to return this value.
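
So a caller that cares about the result simply checks it, for example::

	unsigned char c;

	if (get_user(c, buf))
		return -EFAULT;		/* buf pointed to an invalid address */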

NOTE:
Due to the way that the exception table is built and needs to be ordered,
only use exceptions for code in the .text section.  Any other section
will cause the exception table to not be sorted correctly, and the
exceptions will fail.

Things changed when 64-bit support was added to x86 Linux. Rather than
double the size of the exception table by expanding the two addresses
from 32 bits to 64 bits, a clever trick was used to store addresses
as relative offsets from the table itself. The assembly code changed
from::

    .long 1b,3b
  to:
          .long (from) - .
          .long (to) - .

and the C code that uses these values converts back to absolute addresses
like this::

	static inline unsigned long
	ex_insn_addr(const struct exception_table_entry *x)
	{
		return (unsigned long)&x->insn + x->insn;
	}
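
The fixup address is recovered the same way; a companion helper along the
same lines (a sketch following the pattern above) would be::

	static inline unsigned long
	ex_fixup_addr(const struct exception_table_entry *x)
	{
		return (unsigned long)&x->fixup + x->fixup;
	}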

In v4.6 the exception table entry was expanded with a new field "handler".
It is also 32 bits wide and contains a third relative function
pointer which points to one of:

1) ``int ex_handler_default(const struct exception_table_entry *fixup)``
     This is the legacy case that just jumps to the fixup code

2) ``int ex_handler_fault(const struct exception_table_entry *fixup)``
     This case provides the fault number of the trap that occurred at
     entry->insn. It is used to distinguish page faults from machine
     checks.

More functions can easily be added.
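
For reference, with all three fields stored as 32-bit relative offsets, the
entry layout at that point looks roughly like this (a sketch, with field
comments added here)::

	struct exception_table_entry {
		int insn;	/* offset to the faulting instruction */
		int fixup;	/* offset to the fixup code           */
		int handler;	/* offset to the handler listed above */
	};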

CONFIG_BUILDTIME_TABLE_SORT allows the __ex_table section to be sorted after
the kernel image is linked, via the host utility scripts/sorttable. It will
set the symbol main_extable_sort_needed to 0, avoiding sorting the __ex_table
section at boot time. With the exception table sorted, at runtime when an
exception occurs we can quickly look up the __ex_table entry via binary
search.
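
A sketch of such a lookup over the sorted, offset-based table might look like
this (simplified and with an illustrative name; the kernel's actual helper
differs in detail)::

	const struct exception_table_entry *
	lookup_extable(const struct exception_table_entry *base, int num,
		       unsigned long ip)
	{
		int lo = 0, hi = num - 1;

		while (lo <= hi) {
			int mid = lo + (hi - lo) / 2;
			unsigned long addr = ex_insn_addr(&base[mid]);

			if (addr < ip)
				lo = mid + 1;
			else if (addr > ip)
				hi = mid - 1;
			else
				return &base[mid];
		}
		return NULL;	/* ip is not covered by any entry */
	}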

This is not just a boot-time optimization; some architectures require this
table to be sorted in order to handle exceptions relatively early in the boot
process. For example, i386 makes use of this form of exception handling before
paging support is even enabled!