1 2============= 3eBPF verifier 4============= 5 6The safety of the eBPF program is determined in two steps. 7 8First step does DAG check to disallow loops and other CFG validation. 9In particular it will detect programs that have unreachable instructions. 10(though classic BPF checker allows them) 11 12Second step starts from the first insn and descends all possible paths. 13It simulates execution of every insn and observes the state change of 14registers and stack. 15 16At the start of the program the register R1 contains a pointer to context 17and has type PTR_TO_CTX. 18If verifier sees an insn that does R2=R1, then R2 has now type 19PTR_TO_CTX as well and can be used on the right hand side of expression. 20If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=SCALAR_VALUE, 21since addition of two valid pointers makes invalid pointer. 22(In 'secure' mode verifier will reject any type of pointer arithmetic to make 23sure that kernel addresses don't leak to unprivileged users) 24 25If register was never written to, it's not readable:: 26 27 bpf_mov R0 = R2 28 bpf_exit 29 30will be rejected, since R2 is unreadable at the start of the program. 31 32After kernel function call, R1-R5 are reset to unreadable and 33R0 has a return type of the function. 34 35Since R6-R9 are callee saved, their state is preserved across the call. 36 37:: 38 39 bpf_mov R6 = 1 40 bpf_call foo 41 bpf_mov R0 = R6 42 bpf_exit 43 44is a correct program. If there was R1 instead of R6, it would have 45been rejected. 46 47load/store instructions are allowed only with registers of valid types, which 48are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked. 49For example:: 50 51 bpf_mov R1 = 1 52 bpf_mov R2 = 2 53 bpf_xadd *(u32 *)(R1 + 3) += R2 54 bpf_exit 55 56will be rejected, since R1 doesn't have a valid pointer type at the time of 57execution of instruction bpf_xadd. 58 59At the start R1 type is PTR_TO_CTX (a pointer to generic ``struct bpf_context``) 60A callback is used to customize verifier to restrict eBPF program access to only 61certain fields within ctx structure with specified size and alignment. 62 63For example, the following insn:: 64 65 bpf_ld R0 = *(u32 *)(R6 + 8) 66 67intends to load a word from address R6 + 8 and store it into R0 68If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know 69that offset 8 of size 4 bytes can be accessed for reading, otherwise 70the verifier will reject the program. 71If R6=PTR_TO_STACK, then access should be aligned and be within 72stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8, 73so it will fail verification, since it's out of bounds. 74 75The verifier will allow eBPF program to read data from stack only after 76it wrote into it. 77 78Classic BPF verifier does similar check with M[0-15] memory slots. 79For example:: 80 81 bpf_ld R0 = *(u32 *)(R10 - 4) 82 bpf_exit 83 84is invalid program. 85Though R10 is correct read-only register and has type PTR_TO_STACK 86and R10 - 4 is within stack bounds, there were no stores into that location. 87 88Pointer register spill/fill is tracked as well, since four (R6-R9) 89callee saved registers may not be enough for some programs. 90 91Allowed function calls are customized with bpf_verifier_ops->get_func_proto() 92The eBPF verifier will check that registers match argument constraints. 93After the call register R0 will be set to return type of the function. 94 95Function calls is a main mechanism to extend functionality of eBPF programs. 96Socket filters may let programs to call one set of functions, whereas tracing 97filters may allow completely different set. 98 99If a function made accessible to eBPF program, it needs to be thought through 100from safety point of view. The verifier will guarantee that the function is 101called with valid arguments. 102 103seccomp vs socket filters have different security restrictions for classic BPF. 104Seccomp solves this by two stage verifier: classic BPF verifier is followed 105by seccomp verifier. In case of eBPF one configurable verifier is shared for 106all use cases. 107 108See details of eBPF verifier in kernel/bpf/verifier.c 109 110Register value tracking 111======================= 112 113In order to determine the safety of an eBPF program, the verifier must track 114the range of possible values in each register and also in each stack slot. 115This is done with ``struct bpf_reg_state``, defined in include/linux/ 116bpf_verifier.h, which unifies tracking of scalar and pointer values. Each 117register state has a type, which is either NOT_INIT (the register has not been 118written to), SCALAR_VALUE (some value which is not usable as a pointer), or a 119pointer type. The types of pointers describe their base, as follows: 120 121 122 PTR_TO_CTX 123 Pointer to bpf_context. 124 CONST_PTR_TO_MAP 125 Pointer to struct bpf_map. "Const" because arithmetic 126 on these pointers is forbidden. 127 PTR_TO_MAP_VALUE 128 Pointer to the value stored in a map element. 129 PTR_TO_MAP_VALUE_OR_NULL 130 Either a pointer to a map value, or NULL; map accesses 131 (see maps.rst) return this type, which becomes a 132 PTR_TO_MAP_VALUE when checked != NULL. Arithmetic on 133 these pointers is forbidden. 134 PTR_TO_STACK 135 Frame pointer. 136 PTR_TO_PACKET 137 skb->data. 138 PTR_TO_PACKET_END 139 skb->data + headlen; arithmetic forbidden. 140 PTR_TO_SOCKET 141 Pointer to struct bpf_sock_ops, implicitly refcounted. 142 PTR_TO_SOCKET_OR_NULL 143 Either a pointer to a socket, or NULL; socket lookup 144 returns this type, which becomes a PTR_TO_SOCKET when 145 checked != NULL. PTR_TO_SOCKET is reference-counted, 146 so programs must release the reference through the 147 socket release function before the end of the program. 148 Arithmetic on these pointers is forbidden. 149 150However, a pointer may be offset from this base (as a result of pointer 151arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable 152offset'. The former is used when an exactly-known value (e.g. an immediate 153operand) is added to a pointer, while the latter is used for values which are 154not exactly known. The variable offset is also used in SCALAR_VALUEs, to track 155the range of possible values in the register. 156 157The verifier's knowledge about the variable offset consists of: 158 159* minimum and maximum values as unsigned 160* minimum and maximum values as signed 161 162* knowledge of the values of individual bits, in the form of a 'tnum': a u64 163 'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown; 164 1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both 165 mask and value; no bit should ever be 1 in both. For example, if a byte is read 166 into a register from memory, the register's top 56 bits are known zero, while 167 the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we 168 then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0; 169 0x1ff), because of potential carries. 170 171Besides arithmetic, the register state can also be updated by conditional 172branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch 173it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false' 174branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or 175BPF_JSGE) would instead update the signed minimum/maximum values. Information 176from the signed and unsigned bounds can be combined; for instance if a value is 177first tested < 8 and then tested s> 4, the verifier will conclude that the value 178is also > 4 and s< 8, since the bounds prevent crossing the sign boundary. 179 180PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all 181pointers sharing that same variable offset. This is important for packet range 182checks: after adding a variable to a packet pointer register A, if you then copy 183it to another register B and then add a constant 4 to A, both registers will 184share the same 'id' but the A will have a fixed offset of +4. Then if A is 185bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is 186now known to have a safe range of at least 4 bytes. See 'Direct packet access', 187below, for more on PTR_TO_PACKET ranges. 188 189The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of 190the pointer returned from a map lookup. This means that when one copy is 191checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs. 192As well as range-checking, the tracked information is also used for enforcing 193alignment of pointer accesses. For instance, on most systems the packet pointer 194is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump 195over the Ethernet header, then reads IHL and addes (IHL * 4), the resulting 196pointer will have a variable offset known to be 4n+2 for some n, so adding the 2 197bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through 198that pointer are safe. 199The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common 200to all copies of the pointer returned from a socket lookup. This has similar 201behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but 202it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly 203represents a reference to the corresponding ``struct sock``. To ensure that the 204reference is not leaked, it is imperative to NULL-check the reference and in 205the non-NULL case, and pass the valid reference to the socket release function. 206 207Direct packet access 208==================== 209 210In cls_bpf and act_bpf programs the verifier allows direct access to the packet 211data via skb->data and skb->data_end pointers. 212Ex:: 213 214 1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */ 215 2: r3 = *(u32 *)(r1 +76) /* load skb->data */ 216 3: r5 = r3 217 4: r5 += 14 218 5: if r5 > r4 goto pc+16 219 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp 220 6: r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */ 221 222this 2byte load from the packet is safe to do, since the program author 223did check ``if (skb->data + 14 > skb->data_end) goto err`` at insn #5 which 224means that in the fall-through case the register R3 (which points to skb->data) 225has at least 14 directly accessible bytes. The verifier marks it 226as R3=pkt(id=0,off=0,r=14). 227id=0 means that no additional variables were added to the register. 228off=0 means that no additional constants were added. 229r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok. 230Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points 231to the packet data, but constant 14 was added to the register, so 232it now points to ``skb->data + 14`` and accessible range is [R5, R5 + 14 - 14) 233which is zero bytes. 234 235More complex packet access may look like:: 236 237 238 R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp 239 6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */ 240 7: r4 = *(u8 *)(r3 +12) 241 8: r4 *= 14 242 9: r3 = *(u32 *)(r1 +76) /* load skb->data */ 243 10: r3 += r4 244 11: r2 = r1 245 12: r2 <<= 48 246 13: r2 >>= 48 247 14: r3 += r2 248 15: r2 = r3 249 16: r2 += 8 250 17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */ 251 18: if r2 > r1 goto pc+2 252 R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp 253 19: r1 = *(u8 *)(r3 +4) 254 255The state of the register R3 is R3=pkt(id=2,off=0,r=8) 256id=2 means that two ``r3 += rX`` instructions were seen, so r3 points to some 257offset within a packet and since the program author did 258``if (r3 + 8 > r1) goto err`` at insn #18, the safe range is [R3, R3 + 8). 259The verifier only allows 'add'/'sub' operations on packet registers. Any other 260operation will set the register state to 'SCALAR_VALUE' and it won't be 261available for direct packet access. 262 263Operation ``r3 += rX`` may overflow and become less than original skb->data, 264therefore the verifier has to prevent that. So when it sees ``r3 += rX`` 265instruction and rX is more than 16-bit value, any subsequent bounds-check of r3 266against skb->data_end will not give us 'range' information, so attempts to read 267through the pointer will give "invalid access to packet" error. 268 269Ex. after insn ``r4 = *(u8 *)(r3 +12)`` (insn #7 above) the state of r4 is 270R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits 271of the register are guaranteed to be zero, and nothing is known about the lower 2728 bits. After insn ``r4 *= 14`` the state becomes 273R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit 274value by constant 14 will keep upper 52 bits as zero, also the least significant 275bit will be zero as 14 is even. Similarly ``r2 >>= 48`` will make 276R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign 277extending. This logic is implemented in adjust_reg_min_max_vals() function, 278which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice 279versa) and adjust_scalar_min_max_vals() for operations on two scalars. 280 281The end result is that bpf program author can access packet directly 282using normal C code as:: 283 284 void *data = (void *)(long)skb->data; 285 void *data_end = (void *)(long)skb->data_end; 286 struct eth_hdr *eth = data; 287 struct iphdr *iph = data + sizeof(*eth); 288 struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph); 289 290 if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end) 291 return 0; 292 if (eth->h_proto != htons(ETH_P_IP)) 293 return 0; 294 if (iph->protocol != IPPROTO_UDP || iph->ihl != 5) 295 return 0; 296 if (udp->dest == 53 || udp->source == 9) 297 ...; 298 299which makes such programs easier to write comparing to LD_ABS insn 300and significantly faster. 301 302Pruning 303======= 304 305The verifier does not actually walk all possible paths through the program. For 306each new branch to analyse, the verifier looks at all the states it's previously 307been in when at this instruction. If any of them contain the current state as a 308subset, the branch is 'pruned' - that is, the fact that the previous state was 309accepted implies the current state would be as well. For instance, if in the 310previous state, r1 held a packet-pointer, and in the current state, r1 holds a 311packet-pointer with a range as long or longer and at least as strict an 312alignment, then r1 is safe. Similarly, if r2 was NOT_INIT before then it can't 313have been used by any path from that point, so any value in r2 (including 314another NOT_INIT) is safe. The implementation is in the function regsafe(). 315Pruning considers not only the registers but also the stack (and any spilled 316registers it may hold). They must all be safe for the branch to be pruned. 317This is implemented in states_equal(). 318 319Understanding eBPF verifier messages 320==================================== 321 322The following are few examples of invalid eBPF programs and verifier error 323messages as seen in the log: 324 325Program with unreachable instructions:: 326 327 static struct bpf_insn prog[] = { 328 BPF_EXIT_INSN(), 329 BPF_EXIT_INSN(), 330 }; 331 332Error: 333 334 unreachable insn 1 335 336Program that reads uninitialized register:: 337 338 BPF_MOV64_REG(BPF_REG_0, BPF_REG_2), 339 BPF_EXIT_INSN(), 340 341Error:: 342 343 0: (bf) r0 = r2 344 R2 !read_ok 345 346Program that doesn't initialize R0 before exiting:: 347 348 BPF_MOV64_REG(BPF_REG_2, BPF_REG_1), 349 BPF_EXIT_INSN(), 350 351Error:: 352 353 0: (bf) r2 = r1 354 1: (95) exit 355 R0 !read_ok 356 357Program that accesses stack out of bounds:: 358 359 BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0), 360 BPF_EXIT_INSN(), 361 362Error:: 363 364 0: (7a) *(u64 *)(r10 +8) = 0 365 invalid stack off=8 size=8 366 367Program that doesn't initialize stack before passing its address into function:: 368 369 BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), 370 BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), 371 BPF_LD_MAP_FD(BPF_REG_1, 0), 372 BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), 373 BPF_EXIT_INSN(), 374 375Error:: 376 377 0: (bf) r2 = r10 378 1: (07) r2 += -8 379 2: (b7) r1 = 0x0 380 3: (85) call 1 381 invalid indirect read from stack off -8+0 size 8 382 383Program that uses invalid map_fd=0 while calling to map_lookup_elem() function:: 384 385 BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), 386 BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), 387 BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), 388 BPF_LD_MAP_FD(BPF_REG_1, 0), 389 BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), 390 BPF_EXIT_INSN(), 391 392Error:: 393 394 0: (7a) *(u64 *)(r10 -8) = 0 395 1: (bf) r2 = r10 396 2: (07) r2 += -8 397 3: (b7) r1 = 0x0 398 4: (85) call 1 399 fd 0 is not pointing to valid bpf_map 400 401Program that doesn't check return value of map_lookup_elem() before accessing 402map element:: 403 404 BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), 405 BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), 406 BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), 407 BPF_LD_MAP_FD(BPF_REG_1, 0), 408 BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), 409 BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), 410 BPF_EXIT_INSN(), 411 412Error:: 413 414 0: (7a) *(u64 *)(r10 -8) = 0 415 1: (bf) r2 = r10 416 2: (07) r2 += -8 417 3: (b7) r1 = 0x0 418 4: (85) call 1 419 5: (7a) *(u64 *)(r0 +0) = 0 420 R0 invalid mem access 'map_value_or_null' 421 422Program that correctly checks map_lookup_elem() returned value for NULL, but 423accesses the memory with incorrect alignment:: 424 425 BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), 426 BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), 427 BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), 428 BPF_LD_MAP_FD(BPF_REG_1, 0), 429 BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), 430 BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1), 431 BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0), 432 BPF_EXIT_INSN(), 433 434Error:: 435 436 0: (7a) *(u64 *)(r10 -8) = 0 437 1: (bf) r2 = r10 438 2: (07) r2 += -8 439 3: (b7) r1 = 1 440 4: (85) call 1 441 5: (15) if r0 == 0x0 goto pc+1 442 R0=map_ptr R10=fp 443 6: (7a) *(u64 *)(r0 +4) = 0 444 misaligned access off 4 size 8 445 446Program that correctly checks map_lookup_elem() returned value for NULL and 447accesses memory with correct alignment in one side of 'if' branch, but fails 448to do so in the other side of 'if' branch:: 449 450 BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), 451 BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), 452 BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), 453 BPF_LD_MAP_FD(BPF_REG_1, 0), 454 BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), 455 BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2), 456 BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), 457 BPF_EXIT_INSN(), 458 BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1), 459 BPF_EXIT_INSN(), 460 461Error:: 462 463 0: (7a) *(u64 *)(r10 -8) = 0 464 1: (bf) r2 = r10 465 2: (07) r2 += -8 466 3: (b7) r1 = 1 467 4: (85) call 1 468 5: (15) if r0 == 0x0 goto pc+2 469 R0=map_ptr R10=fp 470 6: (7a) *(u64 *)(r0 +0) = 0 471 7: (95) exit 472 473 from 5 to 8: R0=imm0 R10=fp 474 8: (7a) *(u64 *)(r0 +0) = 1 475 R0 invalid mem access 'imm' 476 477Program that performs a socket lookup then sets the pointer to NULL without 478checking it:: 479 480 BPF_MOV64_IMM(BPF_REG_2, 0), 481 BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), 482 BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), 483 BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), 484 BPF_MOV64_IMM(BPF_REG_3, 4), 485 BPF_MOV64_IMM(BPF_REG_4, 0), 486 BPF_MOV64_IMM(BPF_REG_5, 0), 487 BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), 488 BPF_MOV64_IMM(BPF_REG_0, 0), 489 BPF_EXIT_INSN(), 490 491Error:: 492 493 0: (b7) r2 = 0 494 1: (63) *(u32 *)(r10 -8) = r2 495 2: (bf) r2 = r10 496 3: (07) r2 += -8 497 4: (b7) r3 = 4 498 5: (b7) r4 = 0 499 6: (b7) r5 = 0 500 7: (85) call bpf_sk_lookup_tcp#65 501 8: (b7) r0 = 0 502 9: (95) exit 503 Unreleased reference id=1, alloc_insn=7 504 505Program that performs a socket lookup but does not NULL-check the returned 506value:: 507 508 BPF_MOV64_IMM(BPF_REG_2, 0), 509 BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), 510 BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), 511 BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), 512 BPF_MOV64_IMM(BPF_REG_3, 4), 513 BPF_MOV64_IMM(BPF_REG_4, 0), 514 BPF_MOV64_IMM(BPF_REG_5, 0), 515 BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), 516 BPF_EXIT_INSN(), 517 518Error:: 519 520 0: (b7) r2 = 0 521 1: (63) *(u32 *)(r10 -8) = r2 522 2: (bf) r2 = r10 523 3: (07) r2 += -8 524 4: (b7) r3 = 4 525 5: (b7) r4 = 0 526 6: (b7) r5 = 0 527 7: (85) call bpf_sk_lookup_tcp#65 528 8: (95) exit 529 Unreleased reference id=1, alloc_insn=7 530