1 2=================== 3Classic BPF vs eBPF 4=================== 5 6eBPF is designed to be JITed with one to one mapping, which can also open up 7the possibility for GCC/LLVM compilers to generate optimized eBPF code through 8an eBPF backend that performs almost as fast as natively compiled code. 9 10Some core changes of the eBPF format from classic BPF: 11 12- Number of registers increase from 2 to 10: 13 14 The old format had two registers A and X, and a hidden frame pointer. The 15 new layout extends this to be 10 internal registers and a read-only frame 16 pointer. Since 64-bit CPUs are passing arguments to functions via registers 17 the number of args from eBPF program to in-kernel function is restricted 18 to 5 and one register is used to accept return value from an in-kernel 19 function. Natively, x86_64 passes first 6 arguments in registers, aarch64/ 20 sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved 21 registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers. 22 23 Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64, 24 etc, and eBPF calling convention maps directly to ABIs used by the kernel on 25 64-bit architectures. 26 27 On 32-bit architectures JIT may map programs that use only 32-bit arithmetic 28 and may let more complex programs to be interpreted. 29 30 R0 - R5 are scratch registers and eBPF program needs spill/fill them if 31 necessary across calls. Note that there is only one eBPF program (== one 32 eBPF main routine) and it cannot call other eBPF functions, it can only 33 call predefined in-kernel functions, though. 34 35- Register width increases from 32-bit to 64-bit: 36 37 Still, the semantics of the original 32-bit ALU operations are preserved 38 via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower 39 subregisters that zero-extend into 64-bit if they are being written to. 40 That behavior maps directly to x86_64 and arm64 subregister definition, but 41 makes other JITs more difficult. 42 43 32-bit architectures run 64-bit eBPF programs via interpreter. 44 Their JITs may convert BPF programs that only use 32-bit subregisters into 45 native instruction set and let the rest being interpreted. 46 47 Operation is 64-bit, because on 64-bit architectures, pointers are also 48 64-bit wide, and we want to pass 64-bit values in/out of kernel functions, 49 so 32-bit eBPF registers would otherwise require to define register-pair 50 ABI, thus, there won't be able to use a direct eBPF register to HW register 51 mapping and JIT would need to do combine/split/move operations for every 52 register in and out of the function, which is complex, bug prone and slow. 53 Another reason is the use of atomic 64-bit counters. 54 55- Conditional jt/jf targets replaced with jt/fall-through: 56 57 While the original design has constructs such as ``if (cond) jump_true; 58 else jump_false;``, they are being replaced into alternative constructs like 59 ``if (cond) jump_true; /* else fall-through */``. 60 61- Introduces bpf_call insn and register passing convention for zero overhead 62 calls from/to other kernel functions: 63 64 Before an in-kernel function call, the eBPF program needs to 65 place function arguments into R1 to R5 registers to satisfy calling 66 convention, then the interpreter will take them from registers and pass 67 to in-kernel function. If R1 - R5 registers are mapped to CPU registers 68 that are used for argument passing on given architecture, the JIT compiler 69 doesn't need to emit extra moves. Function arguments will be in the correct 70 registers and BPF_CALL instruction will be JITed as single 'call' HW 71 instruction. This calling convention was picked to cover common call 72 situations without performance penalty. 73 74 After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has 75 a return value of the function. Since R6 - R9 are callee saved, their state 76 is preserved across the call. 77 78 For example, consider three C functions:: 79 80 u64 f1() { return (*_f2)(1); } 81 u64 f2(u64 a) { return f3(a + 1, a); } 82 u64 f3(u64 a, u64 b) { return a - b; } 83 84 GCC can compile f1, f3 into x86_64:: 85 86 f1: 87 movl $1, %edi 88 movq _f2(%rip), %rax 89 jmp *%rax 90 f3: 91 movq %rdi, %rax 92 subq %rsi, %rax 93 ret 94 95 Function f2 in eBPF may look like:: 96 97 f2: 98 bpf_mov R2, R1 99 bpf_add R1, 1 100 bpf_call f3 101 bpf_exit 102 103 If f2 is JITed and the pointer stored to ``_f2``. The calls f1 -> f2 -> f3 and 104 returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to 105 be used to call into f2. 106 107 For practical reasons all eBPF programs have only one argument 'ctx' which is 108 already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs 109 can call kernel functions with up to 5 arguments. Calls with 6 or more arguments 110 are currently not supported, but these restrictions can be lifted if necessary 111 in the future. 112 113 On 64-bit architectures all register map to HW registers one to one. For 114 example, x86_64 JIT compiler can map them as ... 115 116 :: 117 118 R0 - rax 119 R1 - rdi 120 R2 - rsi 121 R3 - rdx 122 R4 - rcx 123 R5 - r8 124 R6 - rbx 125 R7 - r13 126 R8 - r14 127 R9 - r15 128 R10 - rbp 129 130 ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing 131 and rbx, r12 - r15 are callee saved. 132 133 Then the following eBPF pseudo-program:: 134 135 bpf_mov R6, R1 /* save ctx */ 136 bpf_mov R2, 2 137 bpf_mov R3, 3 138 bpf_mov R4, 4 139 bpf_mov R5, 5 140 bpf_call foo 141 bpf_mov R7, R0 /* save foo() return value */ 142 bpf_mov R1, R6 /* restore ctx for next call */ 143 bpf_mov R2, 6 144 bpf_mov R3, 7 145 bpf_mov R4, 8 146 bpf_mov R5, 9 147 bpf_call bar 148 bpf_add R0, R7 149 bpf_exit 150 151 After JIT to x86_64 may look like:: 152 153 push %rbp 154 mov %rsp,%rbp 155 sub $0x228,%rsp 156 mov %rbx,-0x228(%rbp) 157 mov %r13,-0x220(%rbp) 158 mov %rdi,%rbx 159 mov $0x2,%esi 160 mov $0x3,%edx 161 mov $0x4,%ecx 162 mov $0x5,%r8d 163 callq foo 164 mov %rax,%r13 165 mov %rbx,%rdi 166 mov $0x6,%esi 167 mov $0x7,%edx 168 mov $0x8,%ecx 169 mov $0x9,%r8d 170 callq bar 171 add %r13,%rax 172 mov -0x228(%rbp),%rbx 173 mov -0x220(%rbp),%r13 174 leaveq 175 retq 176 177 Which is in this example equivalent in C to:: 178 179 u64 bpf_filter(u64 ctx) 180 { 181 return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9); 182 } 183 184 In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64 185 arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper 186 registers and place their return value into ``%rax`` which is R0 in eBPF. 187 Prologue and epilogue are emitted by JIT and are implicit in the 188 interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve 189 them across the calls as defined by calling convention. 190 191 For example the following program is invalid:: 192 193 bpf_mov R1, 1 194 bpf_call foo 195 bpf_mov R0, R1 196 bpf_exit 197 198 After the call the registers R1-R5 contain junk values and cannot be read. 199 An in-kernel verifier.rst is used to validate eBPF programs. 200 201Also in the new design, eBPF is limited to 4096 insns, which means that any 202program will terminate quickly and will only call a fixed number of kernel 203functions. Original BPF and eBPF are two operand instructions, 204which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT. 205 206The input context pointer for invoking the interpreter function is generic, 207its content is defined by a specific use case. For seccomp register R1 points 208to seccomp_data, for converted BPF filters R1 points to a skb. 209 210A program, that is translated internally consists of the following elements:: 211 212 op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32 213 214So far 87 eBPF instructions were implemented. 8-bit 'op' opcode field 215has room for new instructions. Some of them may use 16/24/32 byte encoding. New 216instructions must be multiple of 8 bytes to preserve backward compatibility. 217 218eBPF is a general purpose RISC instruction set. Not every register and 219every instruction are used during translation from original BPF to eBPF. 220For example, socket filters are not using ``exclusive add`` instruction, but 221tracing filters may do to maintain counters of events, for example. Register R9 222is not used by socket filters either, but more complex filters may be running 223out of registers and would have to resort to spill/fill to stack. 224 225eBPF can be used as a generic assembler for last step performance 226optimizations, socket filters and seccomp are using it as assembler. Tracing 227filters may use it as assembler to generate code from kernel. In kernel usage 228may not be bounded by security considerations, since generated eBPF code 229may be optimizing internal code path and not being exposed to the user space. 230Safety of eBPF can come from the verifier.rst. In such use cases as 231described, it may be used as safe instruction set. 232 233Just like the original BPF, eBPF runs within a controlled environment, 234is deterministic and the kernel can easily prove that. The safety of the program 235can be determined in two steps: first step does depth-first-search to disallow 236loops and other CFG validation; second step starts from the first insn and 237descends all possible paths. It simulates execution of every insn and observes 238the state change of registers and stack. 239 240opcode encoding 241=============== 242 243eBPF is reusing most of the opcode encoding from classic to simplify conversion 244of classic BPF to eBPF. 245 246For arithmetic and jump instructions the 8-bit 'code' field is divided into three 247parts:: 248 249 +----------------+--------+--------------------+ 250 | 4 bits | 1 bit | 3 bits | 251 | operation code | source | instruction class | 252 +----------------+--------+--------------------+ 253 (MSB) (LSB) 254 255Three LSB bits store instruction class which is one of: 256 257 =================== =============== 258 Classic BPF classes eBPF classes 259 =================== =============== 260 BPF_LD 0x00 BPF_LD 0x00 261 BPF_LDX 0x01 BPF_LDX 0x01 262 BPF_ST 0x02 BPF_ST 0x02 263 BPF_STX 0x03 BPF_STX 0x03 264 BPF_ALU 0x04 BPF_ALU 0x04 265 BPF_JMP 0x05 BPF_JMP 0x05 266 BPF_RET 0x06 BPF_JMP32 0x06 267 BPF_MISC 0x07 BPF_ALU64 0x07 268 =================== =============== 269 270The 4th bit encodes the source operand ... 271 272 :: 273 274 BPF_K 0x00 275 BPF_X 0x08 276 277 * in classic BPF, this means:: 278 279 BPF_SRC(code) == BPF_X - use register X as source operand 280 BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand 281 282 * in eBPF, this means:: 283 284 BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand 285 BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand 286 287... and four MSB bits store operation code. 288 289If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of:: 290 291 BPF_ADD 0x00 292 BPF_SUB 0x10 293 BPF_MUL 0x20 294 BPF_DIV 0x30 295 BPF_OR 0x40 296 BPF_AND 0x50 297 BPF_LSH 0x60 298 BPF_RSH 0x70 299 BPF_NEG 0x80 300 BPF_MOD 0x90 301 BPF_XOR 0xa0 302 BPF_MOV 0xb0 /* eBPF only: mov reg to reg */ 303 BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */ 304 BPF_END 0xd0 /* eBPF only: endianness conversion */ 305 306If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of:: 307 308 BPF_JA 0x00 /* BPF_JMP only */ 309 BPF_JEQ 0x10 310 BPF_JGT 0x20 311 BPF_JGE 0x30 312 BPF_JSET 0x40 313 BPF_JNE 0x50 /* eBPF only: jump != */ 314 BPF_JSGT 0x60 /* eBPF only: signed '>' */ 315 BPF_JSGE 0x70 /* eBPF only: signed '>=' */ 316 BPF_CALL 0x80 /* eBPF BPF_JMP only: function call */ 317 BPF_EXIT 0x90 /* eBPF BPF_JMP only: function return */ 318 BPF_JLT 0xa0 /* eBPF only: unsigned '<' */ 319 BPF_JLE 0xb0 /* eBPF only: unsigned '<=' */ 320 BPF_JSLT 0xc0 /* eBPF only: signed '<' */ 321 BPF_JSLE 0xd0 /* eBPF only: signed '<=' */ 322 323So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF 324and eBPF. There are only two registers in classic BPF, so it means A += X. 325In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly, 326BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous 327src_reg = (u32) src_reg ^ (u32) imm32 in eBPF. 328 329Classic BPF is using BPF_MISC class to represent A = X and X = A moves. 330eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no 331BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean 332exactly the same operations as BPF_ALU, but with 64-bit wide operands 333instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.: 334dst_reg = dst_reg + src_reg 335 336Classic BPF wastes the whole BPF_RET class to represent a single ``ret`` 337operation. Classic BPF_RET | BPF_K means copy imm32 into return register 338and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT 339in eBPF means function exit only. The eBPF program needs to store return 340value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as 341BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide 342operands for the comparisons instead. 343 344For load and store instructions the 8-bit 'code' field is divided as:: 345 346 +--------+--------+-------------------+ 347 | 3 bits | 2 bits | 3 bits | 348 | mode | size | instruction class | 349 +--------+--------+-------------------+ 350 (MSB) (LSB) 351 352Size modifier is one of ... 353 354:: 355 356 BPF_W 0x00 /* word */ 357 BPF_H 0x08 /* half word */ 358 BPF_B 0x10 /* byte */ 359 BPF_DW 0x18 /* eBPF only, double word */ 360 361... which encodes size of load/store operation:: 362 363 B - 1 byte 364 H - 2 byte 365 W - 4 byte 366 DW - 8 byte (eBPF only) 367 368Mode modifier is one of:: 369 370 BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */ 371 BPF_ABS 0x20 372 BPF_IND 0x40 373 BPF_MEM 0x60 374 BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */ 375 BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */ 376 BPF_ATOMIC 0xc0 /* eBPF only, atomic operations */ 377