.. SPDX-License-Identifier: GPL-2.0

===========
Packet MMAP
===========

Abstract
========

This file documents the mmap() facility available with the PACKET
socket interface. This type of socket is used to

i) capture network traffic with utilities like tcpdump,
ii) transmit network traffic, or serve any other purpose that needs raw
    access to a network interface.

A howto can be found at:

    https://sites.google.com/site/packetmmap/

Please send your comments to
    - Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
    - Johann Baudy

Why use PACKET_MMAP
===================

The capture process without PACKET_MMAP (plain AF_PACKET) is very
inefficient. It uses very limited buffers and requires one system call to
capture each packet; it requires two if you want to get the packet's timestamp
(like libpcap always does).

On the other hand PACKET_MMAP is very efficient.
PACKET_MMAP provides a size-configurable
circular buffer mapped in user space that can be used to either
send or receive packets. This way reading packets just needs to wait for them;
most of the time there is no need to issue a single system call. Concerning
transmission, multiple packets can be sent through one system call to get the
highest bandwidth. Using a shared buffer between the kernel and the user
also has the benefit of minimizing packet copies.

It's fine to use PACKET_MMAP to improve the performance of the capture and
transmission process, but it isn't everything. At least, if you are capturing
at high speeds (this is relative to the CPU speed), you should check if the
device driver of your network interface card supports some sort of interrupt
load mitigation or (even better) if it supports NAPI, and make sure it is
enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
supported by the devices of your network. CPU IRQ pinning of your network
interface card can also be an advantage.

How to use mmap() to improve capture process
============================================

From the user standpoint, you should use the higher level libpcap library, which
is a de facto standard, portable across nearly all operating systems
including Win32.

Packet MMAP support was integrated into libpcap around the time of version 1.3.0;
TPACKET_V3 support was added in version 1.5.0.

How to use mmap() directly to improve capture process
=====================================================

From the system call standpoint, the use of PACKET_MMAP involves
the following process::


    [setup]     socket() -------> creation of the capture socket
                setsockopt() ---> allocation of the circular buffer (ring)
                                  option: PACKET_RX_RING
                mmap() ---------> mapping of the allocated buffer to the
                                  user process

    [capture]   poll() ---------> to wait for incoming packets

    [shutdown]  close() --------> destruction of the capture socket and
                                  deallocation of all associated
                                  resources.


Socket creation and destruction is straightforward, and is done
the same way with or without PACKET_MMAP::

 int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));

where mode is SOCK_RAW for the raw interface, where link level
information can be captured, or SOCK_DGRAM for the cooked
interface, where link level information capture is not
supported and a link level pseudo-header is provided
by the kernel.

The destruction of the socket and all associated resources
is done by a simple call to close(fd).

As without PACKET_MMAP, it is possible to use one socket
for capture and transmission. This can be done by mapping the
allocated RX and TX buffer ring with a single mmap() call.
See "Mapping and use of the circular buffer (ring)".

Next I will describe the PACKET_MMAP settings and their constraints,
as well as the mapping of the circular buffer in the user process and
the use of this buffer.
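
The [setup] and [shutdown] steps of the capture process can be sketched in C
as follows. This is only a minimal sketch under stated assumptions: ``struct
rx_ring``, ``rx_ring_open()`` and ``rx_ring_close()`` are hypothetical helper
names introduced for illustration, the default TPACKET_V1 ring is assumed, and
the ring geometry passed in is expected to satisfy the constraints described
later in this document. Creating the socket normally requires CAP_NET_RAW, so
the helper reports failure through its return value and errno.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <arpa/inet.h>        /* htons() */
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <linux/if_packet.h>  /* struct tpacket_req, PACKET_RX_RING */
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper type, not part of any kernel or libc API. */
struct rx_ring {
    int fd;                  /* capture socket, -1 when closed */
    void *map;               /* start of the mapped ring, NULL when unmapped */
    size_t map_len;          /* tp_block_size * tp_block_nr */
    struct tpacket_req req;  /* geometry handed to PACKET_RX_RING */
};

/* [setup]: socket() -> setsockopt(PACKET_RX_RING) -> mmap().
 * Returns 0 on success, -1 on failure with errno from the failing call. */
static int rx_ring_open(struct rx_ring *r, unsigned int block_size,
                        unsigned int block_nr, unsigned int frame_size)
{
    int saved;

    assert(r != NULL && frame_size != 0 && block_size >= frame_size);

    r->map = NULL;
    r->map_len = (size_t)block_size * block_nr;
    r->req.tp_block_size = block_size;
    r->req.tp_block_nr = block_nr;
    r->req.tp_frame_size = frame_size;
    r->req.tp_frame_nr = (block_size / frame_size) * block_nr;

    r->fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (r->fd < 0)
        return -1;           /* e.g. EPERM without CAP_NET_RAW */

    if (setsockopt(r->fd, SOL_PACKET, PACKET_RX_RING,
                   &r->req, sizeof(r->req)) == 0) {
        r->map = mmap(NULL, r->map_len, PROT_READ | PROT_WRITE,
                      MAP_SHARED, r->fd, 0);
        if (r->map != MAP_FAILED)
            return 0;
    }

    saved = errno;           /* close() may overwrite errno */
    close(r->fd);
    r->fd = -1;
    r->map = NULL;
    errno = saved;
    return -1;
}

/* [shutdown]: the documented step is close(); munmap() is added here so the
 * user mapping is dropped as well. */
static void rx_ring_close(struct rx_ring *r)
{
    if (r->map != NULL)
        munmap(r->map, r->map_len);
    if (r->fd >= 0)
        close(r->fd);
    r->map = NULL;
    r->fd = -1;
}
```

A caller would typically invoke ``rx_ring_open(&ring, 4096, 4, 2048)``, wait
for frames with poll() on ``ring.fd``, and finish with ``rx_ring_close(&ring)``.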

How to use mmap() directly to improve transmission process
==========================================================

Transmission process is similar to capture as shown below::

    [setup]         socket() -------> creation of the transmission socket
                    setsockopt() ---> allocation of the circular buffer (ring)
                                      option: PACKET_TX_RING
                    bind() ---------> bind transmission socket with a network interface
                    mmap() ---------> mapping of the allocated buffer to the
                                      user process

    [transmission]  poll() ---------> wait for free packets (optional)
                    send() ---------> send all packets that are set as ready in
                                      the ring
                                      The flag MSG_DONTWAIT can be used to return
                                      before end of transfer.

    [shutdown]      close() --------> destruction of the transmission socket and
                                      deallocation of all associated resources.
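
As a rough illustration of the [transmission] step, the sketch below marks one
ring frame as ready to be sent. It is a minimal sketch, assuming the default
TPACKET_V1 frame layout (a ``struct tpacket_hdr`` followed by the data at
``TPACKET_HDRLEN - sizeof(struct sockaddr_ll)``) and no custom offset via
PACKET_TX_HAS_OFF. The helper name ``tx_frame_queue()`` is hypothetical, and
the release fence before the status update is an assumption about what a
careful user space writer would want, not something this document mandates.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>
#include <linux/if_packet.h>  /* struct tpacket_hdr, TP_STATUS_*, TPACKET_HDRLEN */

/* Hypothetical helper: copy one payload into a TX ring frame and flag it.
 * Returns 0 on success, -1 if the frame is not available or the payload
 * does not fit into the frame. */
static int tx_frame_queue(void *frame, size_t frame_size,
                          const void *payload, size_t len)
{
    struct tpacket_hdr *hdr = frame;
    /* Default data offset for TPACKET_V1, as given in this document. */
    size_t off = TPACKET_HDRLEN - sizeof(struct sockaddr_ll);

    assert(frame != NULL && payload != NULL);

    if (hdr->tp_status != TP_STATUS_AVAILABLE)
        return -1;            /* still owned by the kernel */
    if (frame_size < off || len > frame_size - off)
        return -1;            /* payload too large for this frame */

    memcpy((unsigned char *)frame + off, payload, len);
    hdr->tp_len = (unsigned int)len;

    /* Assumption: make payload and length visible before the status flips. */
    atomic_thread_fence(memory_order_release);
    hdr->tp_status = TP_STATUS_SEND_REQUEST;
    return 0;
}
```

After one or more frames have been queued this way, a single
``send(fd, NULL, 0, 0)`` on the bound socket asks the kernel to transmit
every frame marked TP_STATUS_SEND_REQUEST.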

Socket creation and destruction is also straightforward, and is done
the same way as in capturing described in the previous section::

 int fd = socket(PF_PACKET, mode, 0);

The protocol can optionally be 0 in case we only want to transmit
via this socket, which avoids an expensive call to packet_rcv().
In this case, you also need to bind(2) the TX_RING with sll_protocol = 0
set. Otherwise, use htons(ETH_P_ALL) or any other protocol, for example.

Binding the socket to your network interface is mandatory (with zero copy) to
know the header size of frames used in the circular buffer.

As with capture, each frame contains two parts::

    --------------------
   | struct tpacket_hdr | Header. It contains the status of
   |                    | this frame
   |--------------------|
   | data buffer        |
   .                    .  Data that will be sent over the network interface.
   .                    .
    --------------------

bind() associates the socket to your network interface thanks to
the sll_ifindex parameter of struct sockaddr_ll.

Initialization example::

    struct sockaddr_ll my_addr;
    struct ifreq s_ifr;
    ...

    /* strscpy_pad() is kernel-only; in user space copy the name like this */
    memset(&s_ifr, 0, sizeof(s_ifr));
    strncpy(s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name) - 1);

    /* get interface index of eth0 */
    ioctl(fd, SIOCGIFINDEX, &s_ifr);

    /* fill sockaddr_ll struct to prepare binding */
    my_addr.sll_family = AF_PACKET;
    my_addr.sll_protocol = htons(ETH_P_ALL);
    my_addr.sll_ifindex = s_ifr.ifr_ifindex;

    /* bind socket to eth0 */
    bind(fd, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));

A complete tutorial is available at: https://sites.google.com/site/packetmmap/

By default, the user should put data at::

 frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll)

So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW),
the beginning of the user data will be at::

 frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr))

If you wish to put user data at a custom offset from the beginning of
the frame (for payload alignment with SOCK_RAW mode for instance) you
can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order
to make this work it must be enabled previously with setsockopt()
and the PACKET_TX_HAS_OFF option.

PACKET_MMAP settings
====================

PACKET_MMAP is set up from user level code with a call like

 - Capture process::

     setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))

 - Transmission process::

     setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))

The most significant argument in the previous call is the req parameter.
This parameter must have the following structure::

    struct tpacket_req
    {
        unsigned int    tp_block_size;  /* Minimal size of contiguous block */
        unsigned int    tp_block_nr;    /* Number of blocks */
        unsigned int    tp_frame_size;  /* Size of frame */
        unsigned int    tp_frame_nr;    /* Total number of frames */
    };

This structure is defined in /usr/include/linux/if_packet.h and establishes a
circular buffer (ring) of unswappable memory.
Being mapped in the capture process allows reading the captured frames and
related meta-information like timestamps without requiring a system call.

Frames are grouped in blocks. Each block is a physically contiguous
region of memory and holds tp_block_size/tp_frame_size frames. The total number
of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because::

    frames_per_block = tp_block_size/tp_frame_size

indeed, packet_set_ring checks that the following condition is true::

    frames_per_block * tp_block_nr == tp_frame_nr

Let's see an example, with the following values::

     tp_block_size= 4096
     tp_frame_size= 2048
     tp_block_nr  = 4
     tp_frame_nr  = 8

we will get the following buffer structure::

            block #1                 block #2
    +---------+---------+    +---------+---------+
    | frame 1 | frame 2 |    | frame 3 | frame 4 |
    +---------+---------+    +---------+---------+

            block #3                 block #4
    +---------+---------+    +---------+---------+
    | frame 5 | frame 6 |    | frame 7 | frame 8 |
    +---------+---------+    +---------+---------+

A frame can be of any size with the only condition that it fits in a block. A
block can only hold an integer number of frames, or in other words, a frame
cannot span two blocks, so there are some details you have to take into
account when choosing the frame_size. See "Mapping and use of the circular
buffer (ring)".

PACKET_MMAP setting constraints
===============================

In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch),
the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or
16384 in a 64 bit architecture.

Block size limit
----------------

As stated earlier, each block is a contiguous physical region of memory. These
memory regions are allocated with calls to the __get_free_pages() function. As
the name indicates, this function allocates pages of memory, and the second
argument is "order" or a power of two number of pages, that is
(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
order=2 ==> 16384 bytes, etc.
The maximum size of a
region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
precisely the limit can be calculated as::

   PAGE_SIZE << MAX_ORDER

   In an i386 architecture PAGE_SIZE is 4096 bytes
   In a 2.4/i386 kernel MAX_ORDER is 10
   In a 2.6/i386 kernel MAX_ORDER is 11

So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
respectively, with an i386 architecture.

User space programs can include /usr/include/sys/user.h and
/usr/include/linux/mmzone.h to get the PAGE_SIZE and MAX_ORDER declarations.

The page size can also be determined dynamically with the getpagesize(2)
system call.

Block number limit
------------------

To understand the constraints of PACKET_MMAP, we have to see the structure
used to hold the pointers to each block.

Currently, this structure is a vector called pg_vec, dynamically allocated
with kmalloc; its size limits the number of blocks that can be allocated::

    +---+---+---+---+
    | x | x | x | x |
    +---+---+---+---+
      |   |   |   |
      |   |   |   v
      |   |   v  block #4
      |   v  block #3
      v  block #2
     block #1

kmalloc allocates any number of bytes of physically contiguous memory from
a pool of pre-determined sizes. This pool of memory is maintained by the slab
allocator, which is ultimately responsible for doing the allocation and
hence imposes the maximum memory that kmalloc can allocate.

In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes.
The
predetermined sizes that kmalloc uses can be checked in the "size-<bytes>"
entries of /proc/slabinfo

In a 32 bit architecture, pointers are 4 bytes long, so the total number of
pointers to blocks is::

     131072/4 = 32768 blocks

PACKET_MMAP buffer size calculator
==================================

Definitions:

==============  ================================================================
<size-max>      is the maximum size allocable with kmalloc
                (see /proc/slabinfo)
<pointer size>  depends on the architecture -- ``sizeof(void *)``
<page size>     depends on the architecture -- PAGE_SIZE or getpagesize (2)
<max-order>     is the value defined with MAX_ORDER
<frame size>    it's an upper bound of frame's capture size (more on this later)
==============  ================================================================

from these definitions we will derive::

    <block number> = <size-max>/<pointer size>
    <block size> = <page size> << <max-order>

so, the max buffer size is::

    <block number> * <block size>

and the number of frames is::

    <block number> * <block size> / <frame size>

Suppose the following parameters, which apply for a 2.6 kernel and an
i386 architecture::

    <size-max> = 131072 bytes
    <pointer size> = 4 bytes
    <page size> = 4096 bytes
    <max-order> = 11

and a value for <frame size> of 2048 bytes. These parameters will yield::

    <block number> = 131072/4 = 32768 blocks
    <block size> = 4096 << 11 = 8 MiB.

and hence the buffer will have a 262144 MiB size. So it can hold
262144 MiB / 2048 bytes = 134217728 frames

Actually, this buffer size is not possible with an i386 architecture.
Remember that the memory is allocated in kernel space; in the case of
an i386 kernel, the memory size is limited to 1GiB.

Memory allocations are not freed until the socket is closed. The memory
allocations are done with GFP_KERNEL priority, which basically means that
the allocation can wait and swap other processes' memory in order to allocate
the necessary memory, so normally limits can be reached.
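
The calculation above can be written down as a small helper. This is only a
sketch of the arithmetic in this section: ``struct ring_limits`` and
``ring_limits_calc()`` are hypothetical names, and the inputs are the values
the reader looks up for the target system (they are not queried from the
kernel here).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for the buffer size calculator. */
struct ring_limits {
    uint64_t max_blocks;       /* <block number> */
    uint64_t max_block_size;   /* <block size> in bytes */
    uint64_t max_buffer_size;  /* <block number> * <block size> in bytes */
    uint64_t max_frames;       /* max buffer size / <frame size> */
};

/* Apply the formulas of this section to caller-supplied parameters. */
static struct ring_limits ring_limits_calc(uint64_t size_max,
                                           uint64_t pointer_size,
                                           uint64_t page_size,
                                           unsigned int max_order,
                                           uint64_t frame_size)
{
    struct ring_limits l;

    assert(pointer_size != 0 && frame_size != 0 && max_order < 32);

    l.max_blocks = size_max / pointer_size;
    l.max_block_size = page_size << max_order;
    l.max_buffer_size = l.max_blocks * l.max_block_size;
    l.max_frames = l.max_buffer_size / frame_size;
    return l;
}
```

With the 2.6/i386 parameters used above,
``ring_limits_calc(131072, 4, 4096, 11, 2048)`` gives 32768 blocks of 8 MiB
each and 134217728 frames, matching the figures in the text. As noted above,
this is a theoretical upper bound, not a size that can actually be allocated
on i386.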

Other constraints
-----------------

If you check the source code you will see that what I draw here as a frame
is not only the link level frame. At the beginning of each frame there is a
header called struct tpacket_hdr, used in PACKET_MMAP to hold the link level
frame's meta information like the timestamp. So what we draw here as a frame is
really the following (from include/linux/if_packet.h)::

 /*
   Frame structure:

   - Start. Frame must be aligned to TPACKET_ALIGNMENT=16
   - struct tpacket_hdr
   - pad to TPACKET_ALIGNMENT=16
   - struct sockaddr_ll
   - Gap, chosen so that packet data (Start+tp_net) aligns to
     TPACKET_ALIGNMENT=16
   - Start+tp_mac: [ Optional MAC header ]
   - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
   - Pad to align to TPACKET_ALIGNMENT=16
 */

The following are conditions that are checked in packet_set_ring

   - tp_block_size must be a multiple of PAGE_SIZE (1)
   - tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
   - tp_frame_size must be a multiple of TPACKET_ALIGNMENT
   - tp_frame_nr   must be exactly frames_per_block*tp_block_nr

Note that tp_block_size should be chosen to be a power of two or there will
be a waste of memory.

Mapping and use of the circular buffer (ring)
---------------------------------------------

The mapping of the buffer in the user process is done with the conventional
mmap function. Even though the circular buffer is composed of several physically
discontiguous blocks of memory, they are contiguous to the user space, hence
just one call to mmap is needed::

    mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

If tp_frame_size is a divisor of tp_block_size, frames will be
contiguously spaced by tp_frame_size bytes. If not, every
tp_block_size/tp_frame_size frames there will be a gap between
the frames.
This is because a frame cannot span two
blocks.

To use one socket for capture and transmission, the mapping of both the
RX and TX buffer ring has to be done with one call to mmap::

    ...
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo));
    setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar));
    ...
    rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    tx_ring = rx_ring + size;

RX must be the first as the kernel maps the TX ring memory right
after the RX one.

At the beginning of each frame there is a status field (see
struct tpacket_hdr).
If this field is 0, the frame is ready
to be used by the kernel. If not, there is a frame the user can read
and the following flags apply:

Capture process
^^^^^^^^^^^^^^^

From include/linux/if_packet.h::

     #define TP_STATUS_COPY          (1 << 1)
     #define TP_STATUS_LOSING        (1 << 2)
     #define TP_STATUS_CSUMNOTREADY  (1 << 3)
     #define TP_STATUS_CSUM_VALID    (1 << 7)

======================  =======================================================
TP_STATUS_COPY          This flag indicates that the frame (and associated
                        meta information) has been truncated because it's
                        larger than tp_frame_size. This packet can be
                        read entirely with recvfrom().

                        In order to make this work it must be
                        enabled previously with setsockopt() and
                        the PACKET_COPY_THRESH option.

                        The number of frames that can be buffered to
                        be read with recvfrom is limited like a normal socket.
                        See the SO_RCVBUF option in the socket (7) man page.

TP_STATUS_LOSING        indicates there were packet drops since the last time
                        statistics were checked with getsockopt() and
                        the PACKET_STATISTICS option.

TP_STATUS_CSUMNOTREADY  currently it's used for outgoing IP packets whose
                        checksum will be done in hardware. So while
                        reading the packet we should not try to check the
                        checksum.

TP_STATUS_CSUM_VALID    This flag indicates that at least the transport
                        header checksum of the packet has been already
                        validated on the kernel side. If the flag is not set
                        then we are free to check the checksum by ourselves
                        provided that TP_STATUS_CSUMNOTREADY is also not set.
======================  =======================================================

for convenience there are also the following defines::

     #define TP_STATUS_KERNEL        0
     #define TP_STATUS_USER          1

The kernel initializes all frames to TP_STATUS_KERNEL. When the kernel
receives a packet it puts it in the buffer and updates the status with
at least the TP_STATUS_USER flag.
Then the user can read the packet;
once the packet is read, the user must zero the status field so the kernel
can use that frame buffer again.

The user can use poll() (any other variant should apply too) to check if new
packets are in the ring::

    struct pollfd pfd;

    pfd.fd = fd;
    pfd.revents = 0;
    pfd.events = POLLIN|POLLRDNORM|POLLERR;

    if (status == TP_STATUS_KERNEL)
        retval = poll(&pfd, 1, timeout);

Checking the status value first and then polling for frames does not
introduce a race condition.

Transmission process
^^^^^^^^^^^^^^^^^^^^

The following defines are used for transmission::

     #define TP_STATUS_AVAILABLE        0 // Frame is available
     #define TP_STATUS_SEND_REQUEST     1 // Frame will be sent on next send()
     #define TP_STATUS_SENDING          2 // Frame is currently in transmission
     #define TP_STATUS_WRONG_FORMAT     4 // Frame format is not correct

First, the kernel initializes all frames to TP_STATUS_AVAILABLE.
To send a
packet, the user fills the data buffer of an available frame, sets tp_len to
the current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST.
This can be done on multiple frames. Once the user is ready to transmit, it
calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are
forwarded to the network device. The kernel updates the status of each sent
frame to TP_STATUS_SENDING until the end of the transfer.

At the end of each transfer, the buffer status returns to TP_STATUS_AVAILABLE.

::

    header->tp_len = in_i_size;
    header->tp_status = TP_STATUS_SEND_REQUEST;
    retval = send(this->socket, NULL, 0, 0);

The user can also use poll() to check if a buffer is available:

(status == TP_STATUS_SENDING)

::

    struct pollfd pfd;
    pfd.fd = fd;
    pfd.revents = 0;
    pfd.events = POLLOUT;
    retval = poll(&pfd, 1, timeout);

What TPACKET versions are available and when to use them?
=========================================================

::

    int val = tpacket_version;
    socklen_t len = sizeof(val);

    setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
    /* getsockopt() takes a pointer to the length, not the length itself */
    getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, &len);

where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.

TPACKET_V1:
        - Default if not otherwise specified by setsockopt(2)
        - RX_RING, TX_RING available

TPACKET_V1 --> TPACKET_V2:
        - Made 64 bit clean due to unsigned long usage in TPACKET_V1
          structures, thus this also works on 64 bit kernel with 32 bit
          userspace and the like
        - Timestamp resolution in nanoseconds instead of microseconds
        - RX_RING, TX_RING available
        - VLAN metadata information available for packets
          (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID),
          in the tpacket2_hdr structure:

                - TP_STATUS_VLAN_VALID bit being set in the tp_status field
                  indicates that the tp_vlan_tci field has a valid VLAN TCI value
                - TP_STATUS_VLAN_TPID_VALID bit being set in the tp_status field
                  indicates that the tp_vlan_tpid field has a valid VLAN TPID value


        - How to switch to TPACKET_V2:

                1. Replace struct tpacket_hdr by struct tpacket2_hdr
                2. Query header len and save
                3. Set protocol version to 2, set up ring as usual
                4. For getting the sockaddr_ll,
                   use ``(void *)hdr + TPACKET_ALIGN(hdrlen)`` instead of
                   ``(void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))``

TPACKET_V2 --> TPACKET_V3:
        - Flexible buffer implementation for RX_RING:
                1. Blocks can be configured with non-static frame-size
                2. Read/poll is at a block-level (as opposed to packet-level)
                3. Added poll timeout to avoid indefinite user-space wait
                   on idle links
                4. Added user-configurable knobs:

                        4.1 block::timeout
                        4.2 tpkt_hdr::sk_rxhash

        - RX Hash data available in user space
        - TX_RING semantics are conceptually similar to TPACKET_V2;
          use tpacket3_hdr instead of tpacket2_hdr, and TPACKET3_HDRLEN
          instead of TPACKET2_HDRLEN. In the current implementation,
          the tp_next_offset field in the tpacket3_hdr MUST be set to
          zero, indicating that the ring does not hold variable sized frames.
          Packets with non-zero values of tp_next_offset will be dropped.

AF_PACKET fanout mode
=====================

In the AF_PACKET fanout mode, packet reception can be load balanced among
processes. This also works in combination with mmap(2) on packet sockets.

Currently implemented fanout policies are:

  - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash
  - PACKET_FANOUT_LB: schedule to socket by round-robin
  - PACKET_FANOUT_CPU: schedule to socket by the CPU the packet arrives on
  - PACKET_FANOUT_RND: schedule to socket by random selection
  - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another
  - PACKET_FANOUT_QM: schedule to socket by skb's recorded queue_mapping

Minimal example code by David S.
Miller (try things like "./test eth0 hash",
"./test eth0 lb", etc.)::

    #include <stddef.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/socket.h>
    #include <sys/ioctl.h>

    #include <unistd.h>

    #include <arpa/inet.h>      /* htons() */

    #include <linux/if_ether.h>
    #include <linux/if_packet.h>

    #include <net/if.h>

    static const char *device_name;
    static int fanout_type;
    static int fanout_id;

    #ifndef PACKET_FANOUT
    # define PACKET_FANOUT                  18
    # define PACKET_FANOUT_HASH             0
    # define PACKET_FANOUT_LB               1
    #endif

    static int setup_socket(void)
    {
        int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
        struct sockaddr_ll ll;
        struct ifreq ifr;
        int fanout_arg;

        if (fd < 0) {
            perror("socket");
            /* negative return so the caller's fd < 0 check catches it */
            return -1;
        }

        memset(&ifr, 0, sizeof(ifr));
        /* ifr was zeroed above, so the name stays NUL-terminated */
        strncpy(ifr.ifr_name, device_name, IFNAMSIZ - 1);
        err = ioctl(fd, SIOCGIFINDEX, &ifr);
        if (err < 0) {
            perror("SIOCGIFINDEX");
            return -1;
        }

        memset(&ll, 0, sizeof(ll));
        ll.sll_family = AF_PACKET;
        ll.sll_ifindex = ifr.ifr_ifindex;
        err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
        if (err < 0) {
            perror("bind");
            return -1;
        }

        /* low 16 bits: fanout group id, high 16 bits: fanout policy */
        fanout_arg = (fanout_id | (fanout_type << 16));
        err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
                         &fanout_arg, sizeof(fanout_arg));
        if (err) {
            perror("setsockopt");
            return -1;
        }

        return fd;
    }

    static void fanout_thread(void)
    {
        int fd = setup_socket();
        int limit = 10000;

        if (fd < 0)
            exit(EXIT_FAILURE);

        while (limit-- > 0) {
            char buf[1600];
            int err;

            err = read(fd, buf, sizeof(buf));
            if (err < 0) {
                perror("read");
                exit(EXIT_FAILURE);
            }
            if ((limit % 10) == 0)
                fprintf(stdout, "(%d) \n", getpid());
        }

        fprintf(stdout, "%d: Received 10000 packets\n", getpid());

        close(fd);
        exit(0);
    }

    int main(int argc, char **argp)
    {
        int fd, err;
        int i;

        if (argc != 3) {
            fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
            return EXIT_FAILURE;
        }

        if (!strcmp(argp[2], "hash"))
            fanout_type = PACKET_FANOUT_HASH;
        else if (!strcmp(argp[2], "lb"))
            fanout_type = PACKET_FANOUT_LB;
        else {
            fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
            exit(EXIT_FAILURE);
        }

        device_name = argp[1];
        fanout_id = getpid() & 0xffff;

        for (i = 0; i < 4; i++) {
            pid_t pid = fork();

            switch (pid) {
            case 0:
                /* child: fanout_thread() calls exit() and never returns */
                fanout_thread();

            case -1:
                perror("fork");
                exit(EXIT_FAILURE);
            }
        }

        for (i = 0; i < 4; i++) {
            int status;

            wait(&status);
        }

        return 0;
    }

AF_PACKET TPACKET_V3 example
============================

AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
sizes by doing its own memory management. It is based on blocks where polling
works on a per block basis instead of per ring as in TPACKET_V2 and its
predecessor.

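
As a rough sketch of the block-based layout described above, the helper below
fills a ``struct tpacket_req3`` from a block size, a block count and a frame
size. The name ``fill_req3`` and its divisibility check are illustrative
assumptions, not part of the kernel API; the check is stricter than strictly
needed, and the kernel applies its own validation when PACKET_RX_RING is set.

```c
#include <string.h>
#include <linux/if_packet.h>

/* Hypothetical helper (not a kernel API): describe an RX ring made of
 * block_nr blocks of block_size bytes each. Returns 0 on success, -1 if
 * the geometry is rejected by this sketch's own (simplified) check.
 */
static int fill_req3(struct tpacket_req3 *req, unsigned int block_size,
                     unsigned int block_nr, unsigned int frame_size,
                     unsigned int retire_tov)
{
    /* Simplifying assumption: require whole frames per block. */
    if (frame_size == 0 || block_size % frame_size != 0)
        return -1;

    memset(req, 0, sizeof(*req));
    req->tp_block_size = block_size;
    req->tp_block_nr = block_nr;
    req->tp_frame_size = frame_size;
    /* Total frame count: frames per block times number of blocks. */
    req->tp_frame_nr = (block_size / frame_size) * block_nr;
    /* Passed through unchanged as the block retire timeout. */
    req->tp_retire_blk_tov = retire_tov;
    return 0;
}
```

With 4 MiB blocks (1 << 22), 2 KiB frames (1 << 11) and 64 blocks, this yields
tp_frame_nr = 131072.
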

It is said that TPACKET_V3 brings the following benefits:

 * ~15% - 20% reduction in CPU-usage
 * ~20% increase in packet capture rate
 * ~2x increase in packet density
 * Port aggregation analysis
 * Non static frame size to capture entire packet payload

So it seems to be a good candidate to be used with packet fanout.

Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile
it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.)::

    /* Written from scratch, but kernel-to-user space API usage
     * dissected from lolpcap:
     *  Copyright 2011, Chetan Loke <loke.chetan@gmail.com>
     *  License: GPL, version 2.0
     */

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <string.h>
    #include <assert.h>
    #include <net/if.h>
    #include <arpa/inet.h>
    #include <netdb.h>
    #include <poll.h>
    #include <unistd.h>
    #include <signal.h>
    #include <inttypes.h>
    #include <sys/socket.h>
    #include <sys/mman.h>
    #include <linux/if_packet.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>

    #ifndef likely
    # define likely(x)              __builtin_expect(!!(x), 1)
    #endif
    #ifndef unlikely
    # define unlikely(x)            __builtin_expect(!!(x), 0)
    #endif

    struct block_desc {
        uint32_t version;
        uint32_t offset_to_priv;
        struct tpacket_hdr_v1 h1;
    };

    struct ring {
        struct iovec *rd;
        uint8_t *map;
        struct tpacket_req3 req;
    };

    static unsigned long packets_total = 0, bytes_total = 0;
    /* volatile: written from the signal handler, read in the main loop */
    static volatile sig_atomic_t sigint = 0;

    static void sighandler(int num)
    {
        sigint = 1;
    }

    static int setup_socket(struct ring *ring, char *netdev)
    {
        int err, i, fd, v = TPACKET_V3;
        struct sockaddr_ll ll;
        unsigned int blocksiz = 1 << 22, framesiz = 1 << 11;
        unsigned int blocknum = 64;

        fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) {
            perror("socket");
            exit(1);
        }

        err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v));
        if (err < 0) {
            perror("setsockopt");
            exit(1);
        }

        memset(&ring->req, 0, sizeof(ring->req));
        ring->req.tp_block_size = blocksiz;
        ring->req.tp_frame_size = framesiz;
        ring->req.tp_block_nr = blocknum;
        ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz;
        ring->req.tp_retire_blk_tov = 60;
        ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;

        err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
                         sizeof(ring->req));
        if (err < 0) {
            perror("setsockopt");
            exit(1);
        }

        ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
                         PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0);
        if (ring->map == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }

        ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd));
        assert(ring->rd);
        for (i = 0; i < ring->req.tp_block_nr; ++i) {
            ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size);
            ring->rd[i].iov_len = ring->req.tp_block_size;
        }

        memset(&ll, 0, sizeof(ll));
        ll.sll_family = PF_PACKET;
        ll.sll_protocol = htons(ETH_P_ALL);
        ll.sll_ifindex = if_nametoindex(netdev);
        ll.sll_hatype = 0;
        ll.sll_pkttype = 0;
        ll.sll_halen = 0;

        err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
        if (err < 0) {
            perror("bind");
            exit(1);
        }

        return fd;
    }

    static void display(struct tpacket3_hdr *ppd)
    {
        struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac);
        struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN);

        if (eth->h_proto == htons(ETH_P_IP)) {
            struct sockaddr_in ss, sd;
            char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST];

            memset(&ss, 0, sizeof(ss));
            ss.sin_family = PF_INET;
            ss.sin_addr.s_addr = ip->saddr;
            getnameinfo((struct sockaddr *) &ss, sizeof(ss),
                        sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST);

            memset(&sd, 0, sizeof(sd));
            sd.sin_family = PF_INET;
            sd.sin_addr.s_addr = ip->daddr;
            getnameinfo((struct sockaddr *) &sd, sizeof(sd),
                        dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST);

            printf("%s -> %s, ", sbuff, dbuff);
        }

        printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash);
    }

    static void walk_block(struct block_desc *pbd, const int block_num)
    {
        int num_pkts = pbd->h1.num_pkts, i;
        unsigned long bytes = 0;
        struct tpacket3_hdr *ppd;

        ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
                                       pbd->h1.offset_to_first_pkt);
        for (i = 0; i < num_pkts; ++i) {
            bytes += ppd->tp_snaplen;
            display(ppd);

            ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd +
                                           ppd->tp_next_offset);
        }

        packets_total += num_pkts;
        bytes_total += bytes;
    }

    static void flush_block(struct block_desc *pbd)
    {
        pbd->h1.block_status = TP_STATUS_KERNEL;
    }

    static void teardown_socket(struct ring *ring, int fd)
    {
        munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr);
        free(ring->rd);
        close(fd);
    }

    int main(int argc, char **argp)
    {
        int fd, err;
        socklen_t len;
        struct ring ring;
        struct pollfd pfd;
        unsigned int block_num = 0, blocks = 64;
        struct block_desc *pbd;
        struct tpacket_stats_v3 stats;

        if (argc != 2) {
            fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]);
            return EXIT_FAILURE;
        }

        signal(SIGINT, sighandler);

        memset(&ring, 0, sizeof(ring));
        fd = setup_socket(&ring, argp[argc - 1]);
        assert(fd > 0);

        memset(&pfd, 0, sizeof(pfd));
        pfd.fd = fd;
        pfd.events = POLLIN | POLLERR;
        pfd.revents = 0;

        while (likely(!sigint)) {
            pbd = (struct block_desc *) ring.rd[block_num].iov_base;

            if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
                poll(&pfd, 1, -1);
                continue;
            }

            walk_block(pbd, block_num);
            flush_block(pbd);
            block_num = (block_num + 1) % blocks;
        }

        len = sizeof(stats);
        err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len);
        if (err < 0) {
            perror("getsockopt");
            exit(1);
        }

        fflush(stdout);
        printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n",
               stats.tp_packets, bytes_total, stats.tp_drops,
               stats.tp_freeze_q_cnt);

        teardown_socket(&ring, fd);
        return 0;
    }

PACKET_QDISC_BYPASS
===================

If there is a requirement to load the network with many packets in a similar
fashion as pktgen does, you might set the following option after socket
creation::

    int one = 1;
    setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));

This has the side effect that packets sent through PF_PACKET will bypass the
kernel's qdisc layer and are forcibly pushed to the driver directly. This means
packets are not buffered, tc disciplines are ignored, increased loss can occur
and such packets are also not visible to other PF_PACKET sockets anymore. So,
you have been warned; generally, this can be useful for stress testing various
components of a system.

By default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled
on PF_PACKET sockets.

PACKET_TIMESTAMP
================

The PACKET_TIMESTAMP setting determines the source of the timestamp in
the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your
NIC is capable of timestamping packets in hardware, you can request those
hardware timestamps to be used.
Note: you may need to enable the generation
of hardware timestamps with SIOCSHWTSTAMP (see related information from
Documentation/networking/timestamping.rst).

PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING::

    int req = SOF_TIMESTAMPING_RAW_HARDWARE;
    setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req));

For the mmap(2)ed ring buffers, such timestamps are stored in the
``tpacket{,2,3}_hdr`` structure's tp_sec and ``tp_{n,u}sec`` members.
To determine what kind of timestamp has been reported, the tp_status field
is bitwise ORed with the following possible bits ...

::

    TP_STATUS_TS_RAW_HARDWARE
    TP_STATUS_TS_SOFTWARE

... that are equivalent to their ``SOF_TIMESTAMPING_*`` counterparts. For the
RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a
software fallback was invoked *within* PF_PACKET's processing code (less
precise).

Getting timestamps for the TX_RING works as follows: i) fill the ring frames,
ii) call sendto() e.g. in blocking mode, iii) wait for the status of the
relevant frames to be updated, i.e.
the frame handed over to the application, iv) walk
through the frames to pick up the individual hw/sw timestamps.

Only (!) if transmit timestamping is enabled are these bits combined with
TP_STATUS_AVAILABLE using a bitwise OR, so you must account for that in your
application: first check
``!(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING))`` to see whether
the frame belongs to the application, then extract the type of timestamp from
tp_status in a second step.

If you don't care about timestamps and leave transmit timestamping disabled,
checking for TP_STATUS_AVAILABLE or TP_STATUS_WRONG_FORMAT is sufficient. If
in the TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and
``tp_{n,u}sec`` members do not contain a valid value. For TX_RINGs, by default
no timestamp is generated!

See include/linux/net_tstamp.h and Documentation/networking/timestamping.rst
for more information on hardware timestamps.

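
The two-step check on tp_status for TX_RING frames could be sketched roughly
as follows. This is a minimal illustration, not code from the kernel tree: the
helper names (``tx_frame_is_ours``, ``frame_ts_kind``, ``tx_frame_timestamp``)
and ``enum ts_kind`` are hypothetical, and the sketch assumes TPACKET_V2
frames (``struct tpacket2_hdr``). Only the ``TP_STATUS_*`` constants and the
header layout come from ``<linux/if_packet.h>``.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/if_packet.h>

/* Hypothetical classification of the timestamp carried by a frame. */
enum ts_kind { TS_NONE, TS_SOFTWARE, TS_RAW_HARDWARE };

/* Step 1: the frame belongs to the application once the kernel has cleared
 * both TP_STATUS_SEND_REQUEST and TP_STATUS_SENDING. */
static int tx_frame_is_ours(uint32_t tp_status)
{
	return !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING));
}

/* Step 2: extract the type of timestamp from the remaining status bits. */
static enum ts_kind frame_ts_kind(uint32_t tp_status)
{
	if (tp_status & TP_STATUS_TS_RAW_HARDWARE)
		return TS_RAW_HARDWARE;
	if (tp_status & TP_STATUS_TS_SOFTWARE)
		return TS_SOFTWARE;
	return TS_NONE;
}

/* Returns 1 and fills *sec, *nsec and *kind if the frame is back with the
 * application and carries a timestamp; returns 0 otherwise.
 * Assumption: a real application would also have to read tp_status with
 * appropriate memory ordering, which is left out of this sketch. */
static int tx_frame_timestamp(const struct tpacket2_hdr *hdr,
			      uint32_t *sec, uint32_t *nsec,
			      enum ts_kind *kind)
{
	assert(hdr && sec && nsec && kind);

	if (!tx_frame_is_ours(hdr->tp_status))
		return 0;

	*kind = frame_ts_kind(hdr->tp_status);
	if (*kind == TS_NONE)
		return 0;	/* tp_sec/tp_nsec hold no valid value */

	*sec = hdr->tp_sec;
	*nsec = hdr->tp_nsec;
	return 1;
}
```

An application walking the TX_RING after sendto() might call
``tx_frame_timestamp()`` on each frame header and skip frames for which it
returns 0. Keeping the ownership test separate from the timestamp test mirrors
the two steps described above.
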

Miscellaneous bits
==================

- Packet sockets work well together with Linux socket filters, thus you also
  might want to have a look at Documentation/networking/filter.rst

THANKS
======

   Jesse Brandeburg, for fixing my grammatical/spelling errors