.. SPDX-License-Identifier: GPL-2.0

===========
Packet MMAP
===========

Abstract
========

This file documents the mmap() facility available with the PACKET
socket interface. This type of socket is used to

i) capture network traffic with utilities like tcpdump,
ii) transmit network traffic, or do anything else that needs raw
    access to a network interface.

A howto can be found at:

    https://sites.google.com/site/packetmmap/

Please send your comments to
    - Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
    - Johann Baudy

Why use PACKET_MMAP
===================

Capturing without PACKET_MMAP (plain AF_PACKET) is very inefficient. It uses
very limited buffers and requires one system call to capture each packet, or
two if you also want the packet's timestamp (as libpcap always does).

PACKET_MMAP, on the other hand, is very efficient. It provides a
size-configurable circular buffer mapped in user space that can be used to
either send or receive packets. Reading packets then only requires waiting for
them, and most of the time no system call is needed at all. For
transmission, multiple packets can be sent with one system call to get the
highest bandwidth. Sharing a buffer between the kernel and the user process
also minimizes packet copies.

PACKET_MMAP improves the performance of the capture and transmission
process, but it isn't everything. If you are capturing at high speeds
(relative to the CPU speed), you should at least check whether the device
driver of your network interface card supports some sort of interrupt load
mitigation or (even better) NAPI, and make sure it is enabled. For
transmission, check the MTU (Maximum Transmission Unit) used and supported by
the devices on your network. Pinning the IRQ of your network interface card
to a CPU can also be an advantage.

How to use mmap() to improve capture process
============================================

From the user standpoint, you should use the higher level libpcap library, which
is a de facto standard, portable across nearly all operating systems
including Win32.

Packet MMAP support was integrated into libpcap around the time of version 1.3.0;
TPACKET_V3 support was added in version 1.5.0.

How to use mmap() directly to improve capture process
=====================================================

From the system call standpoint, the use of PACKET_MMAP involves
the following process::


    [setup]     socket() -------> creation of the capture socket
                setsockopt() ---> allocation of the circular buffer (ring)
                                  option: PACKET_RX_RING
                mmap() ---------> mapping of the allocated buffer to the
                                  user process

    [capture]   poll() ---------> to wait for incoming packets

    [shutdown]  close() --------> destruction of the capture socket and
                                  deallocation of all associated
                                  resources.


Socket creation and destruction is straightforward, and is done
the same way with or without PACKET_MMAP::

 int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));

where mode is SOCK_RAW for the raw interface, where link level
information can be captured, or SOCK_DGRAM for the cooked
interface, where link level information capture is not
supported and a link level pseudo-header is provided
by the kernel.

The destruction of the socket and all associated resources
is done by a simple call to close(fd).

As without PACKET_MMAP, it is possible to use one socket
for capture and transmission. This can be done by mapping the
allocated RX and TX buffer ring with a single mmap() call.
See "Mapping and use of the circular buffer (ring)".

Next I will describe the PACKET_MMAP settings and their constraints,
as well as the mapping of the circular buffer in the user process and
the use of this buffer.

How to use mmap() directly to improve transmission process
==========================================================

The transmission process is similar to capture, as shown below::

    [setup]         socket() -------> creation of the transmission socket
                    setsockopt() ---> allocation of the circular buffer (ring)
                                      option: PACKET_TX_RING
                    bind() ---------> bind transmission socket with a network interface
                    mmap() ---------> mapping of the allocated buffer to the
                                      user process

    [transmission]  poll() ---------> wait for free packets (optional)
                    send() ---------> send all packets that are set as ready in
                                      the ring
                                      The flag MSG_DONTWAIT can be used to return
                                      before end of transfer.

    [shutdown]      close() --------> destruction of the transmission socket and
                                      deallocation of all associated resources.

Socket creation and destruction is also straightforward, and is done
the same way as for capture, described in the previous section::

 int fd = socket(PF_PACKET, mode, 0);

The protocol can optionally be 0 if we only want to transmit
via this socket, which avoids an expensive call to packet_rcv().
In this case, you also need to bind(2) the TX_RING with sll_protocol = 0
set. Otherwise, use htons(ETH_P_ALL) or any other protocol, for example.

Binding the socket to your network interface is mandatory (with zero copy) to
know the header size of frames used in the circular buffer.

As in capture, each frame contains two parts::

     --------------------
    | struct tpacket_hdr | Header. It contains the status
    |                    | of this frame
    |--------------------|
    | data buffer        |
    .                    .  Data that will be sent over the network interface.
    .                    .
     --------------------

 bind() associates the socket to your network interface through the
 sll_ifindex field of struct sockaddr_ll.

 Initialization example::

    struct sockaddr_ll my_addr;
    struct ifreq s_ifr;
    ...

    /* copy the interface name, keeping it NUL-terminated */
    memset(&s_ifr, 0, sizeof(s_ifr));
    strncpy(s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name) - 1);

    /* get interface index of eth0 */
    ioctl(this->socket, SIOCGIFINDEX, &s_ifr);

    /* fill sockaddr_ll struct to prepare binding */
    memset(&my_addr, 0, sizeof(my_addr));
    my_addr.sll_family = AF_PACKET;
    my_addr.sll_protocol = htons(ETH_P_ALL);
    my_addr.sll_ifindex = s_ifr.ifr_ifindex;

    /* bind socket to eth0 */
    bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));

 A complete tutorial is available at: https://sites.google.com/site/packetmmap/

By default, the user should put data at::

 frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll)

So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW),
the beginning of the user data will be at::

 frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr))

If you wish to put user data at a custom offset from the beginning of
the frame (for payload alignment with SOCK_RAW mode for instance) you
can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order
to make this work it must be enabled previously with setsockopt()
and the PACKET_TX_HAS_OFF option.
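
The default data offset described above can be sketched in C as follows. This
is a minimal illustration that assumes the TPACKET_V1 layout (struct
tpacket_hdr) and the default offset, that is, PACKET_TX_HAS_OFF is not enabled.
The helper names tx_data_offset() and tx_frame_fill() are hypothetical;
TP_STATUS_AVAILABLE and TP_STATUS_SEND_REQUEST are the TX status values
from linux/if_packet.h, and the __atomic builtin is a GCC/Clang extension.

```c
#include <stddef.h>
#include <string.h>
#include <linux/if_packet.h>

/* Default offset of the user data inside a TX frame (TPACKET_V1 layout). */
static size_t tx_data_offset(void)
{
	return TPACKET_HDRLEN - sizeof(struct sockaddr_ll);
}

/*
 * Hypothetical helper: copy one packet into a TX frame and mark it ready.
 * Returns 0 on success, -1 if the frame is still owned by the kernel or
 * the packet does not fit in the frame.
 */
static int tx_frame_fill(void *frame, size_t frame_size,
			 const void *pkt, size_t len)
{
	struct tpacket_hdr *hdr = frame;
	size_t off = tx_data_offset();

	if (hdr->tp_status != TP_STATUS_AVAILABLE)
		return -1;
	if (off > frame_size || len > frame_size - off)
		return -1;

	memcpy((char *)frame + off, pkt, len);
	hdr->tp_len = (unsigned int)len;

	/* Publish the status last, so the payload is visible before it. */
	__atomic_store_n(&hdr->tp_status,
			 (unsigned long)TP_STATUS_SEND_REQUEST,
			 __ATOMIC_RELEASE);
	return 0;
}
```

After one or more frames have been filled this way, a send() call on the
socket asks the kernel to transmit every frame marked as ready.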

PACKET_MMAP settings
====================

PACKET_MMAP is set up from user level code with a call like:

 - Capture process::

     setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))

 - Transmission process::

     setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))

The most significant argument in the previous call is the req parameter,
which must have the following structure::

    struct tpacket_req
    {
        unsigned int    tp_block_size;  /* Minimal size of contiguous block */
        unsigned int    tp_block_nr;    /* Number of blocks */
        unsigned int    tp_frame_size;  /* Size of frame */
        unsigned int    tp_frame_nr;    /* Total number of frames */
    };

This structure is defined in /usr/include/linux/if_packet.h and establishes a
circular buffer (ring) of unswappable memory.
Because the ring is mapped into the capture process, captured frames and
related meta-information like timestamps can be read without requiring a
system call.

Frames are grouped in blocks. Each block is a physically contiguous
region of memory and holds tp_block_size/tp_frame_size frames. The total number
of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because::

    frames_per_block = tp_block_size/tp_frame_size

indeed, packet_set_ring checks that the following condition is true::

    frames_per_block * tp_block_nr == tp_frame_nr

Let's see an example, with the following values::

     tp_block_size= 4096
     tp_frame_size= 2048
     tp_block_nr  = 4
     tp_frame_nr  = 8

we will get the following buffer structure::

            block #1                 block #2
    +---------+---------+    +---------+---------+
    | frame 1 | frame 2 |    | frame 3 | frame 4 |
    +---------+---------+    +---------+---------+

            block #3                 block #4
    +---------+---------+    +---------+---------+
    | frame 5 | frame 6 |    | frame 7 | frame 8 |
    +---------+---------+    +---------+---------+

A frame can be of any size, with the only condition that it fits in a block. A block
can only hold an integer number of frames, or in other words, a frame cannot
span two blocks, so there are some details you have to take into
account when choosing the frame_size. See "Mapping and use of the circular
buffer (ring)".
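
Since tp_frame_nr is redundant, it can be derived instead of being typed in by
hand. The sketch below fills a struct tpacket_req from a block size, a frame
size and a block count. The helper name ring_request_init() is hypothetical,
and the function only derives tp_frame_nr; it does not check the other
constraints the kernel applies.

```c
#include <limits.h>
#include <linux/if_packet.h>

/*
 * Hypothetical helper: fill *req and derive tp_frame_nr.
 * Returns 0 on success, -1 if no frame fits in a block or the frame
 * count would overflow an unsigned int.
 */
static int ring_request_init(struct tpacket_req *req,
			     unsigned int block_size,
			     unsigned int frame_size,
			     unsigned int block_nr)
{
	unsigned int frames_per_block;

	if (frame_size == 0 || block_size < frame_size)
		return -1;

	frames_per_block = block_size / frame_size;
	if (block_nr != 0 && frames_per_block > UINT_MAX / block_nr)
		return -1;

	req->tp_block_size = block_size;
	req->tp_block_nr   = block_nr;
	req->tp_frame_size = frame_size;
	req->tp_frame_nr   = frames_per_block * block_nr;
	return 0;
}
```

With the example values (4096, 2048, 4) this yields tp_frame_nr = 8, and the
resulting structure can be passed to setsockopt() with PACKET_RX_RING or
PACKET_TX_RING.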

PACKET_MMAP setting constraints
===============================

In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch),
the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or
16384 in a 64 bit architecture.

Block size limit
----------------

As stated earlier, each block is a contiguous physical region of memory. These
memory regions are allocated with calls to the __get_free_pages() function. As
the name indicates, this function allocates pages of memory, and the second
argument is "order", a power of two number of pages, that is
(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
order=2 ==> 16384 bytes, etc. The maximum size of a
region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
precisely the limit can be calculated as::

   PAGE_SIZE << MAX_ORDER

   On an i386 architecture PAGE_SIZE is 4096 bytes
   In a 2.4/i386 kernel MAX_ORDER is 10
   In a 2.6/i386 kernel MAX_ORDER is 11

So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
respectively, with an i386 architecture.

User space programs can include /usr/include/sys/user.h and
/usr/include/linux/mmzone.h to get the PAGE_SIZE and MAX_ORDER declarations.

The page size can also be determined dynamically with the getpagesize (2)
system call.

Block number limit
------------------

To understand the constraints of PACKET_MMAP, we have to see the structure
used to hold the pointers to each block.

Currently, this structure is a vector called pg_vec, dynamically allocated
with kmalloc; its size limits the number of blocks that can be allocated::

    +---+---+---+---+
    | x | x | x | x |
    +---+---+---+---+
      |   |   |   |
      |   |   |   v
      |   |   v  block #4
      |   v  block #3
      v  block #2
     block #1

kmalloc allocates any number of bytes of physically contiguous memory from
a pool of pre-determined sizes. This pool of memory is maintained by the slab
allocator, which is ultimately responsible for doing the allocation and
hence imposes the maximum memory that kmalloc can allocate.

In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The
predetermined sizes that kmalloc uses can be checked in the "size-<bytes>"
entries of /proc/slabinfo.

In a 32 bit architecture, pointers are 4 bytes long, so the total number of
pointers to blocks is::

     131072/4 = 32768 blocks

PACKET_MMAP buffer size calculator
==================================

Definitions:

==============  ================================================================
<size-max>      is the maximum size allocable with kmalloc
                (see /proc/slabinfo)
<pointer size>  depends on the architecture -- ``sizeof(void *)``
<page size>     depends on the architecture -- PAGE_SIZE or getpagesize (2)
<max-order>     is the value defined with MAX_ORDER
<frame size>    is an upper bound of a frame's capture size (more on this later)
==============  ================================================================

from these definitions we will derive::

        <block number> = <size-max>/<pointer size>
        <block size> = <page size> << <max-order>

so, the max buffer size is::

        <block number> * <block size>

and the number of frames is::

        <block number> * <block size> / <frame size>

Suppose the following parameters, which apply for 2.6 kernel and an
i386 architecture::

        <size-max> = 131072 bytes
        <pointer size> = 4 bytes
        <page size> = 4096 bytes
        <max-order> = 11

and a value for <frame size> of 2048 bytes. These parameters will yield::

        <block number> = 131072/4 = 32768 blocks
        <block size> = 4096 << 11 = 8 MiB.

and hence the buffer will have a 262144 MiB size. So it can hold
262144 MiB / 2048 bytes = 134217728 frames.

Actually, this buffer size is not possible with an i386 architecture.
Remember that the memory is allocated in kernel space; in the case of
an i386 kernel, the memory size is limited to 1GiB.

Memory allocations are not freed until the socket is closed. The memory
allocations are done with GFP_KERNEL priority, which basically means that
the allocation can wait and swap other processes' memory in order to allocate
the necessary memory, so normally limits can be reached.
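
The arithmetic of this calculator can be reproduced with a few lines of C.
The following is only a sketch of the formulas above: the struct and function
names are hypothetical, the inputs are passed in by the caller rather than
read from the running kernel, and 64 bit integers are used because the
theoretical buffer size does not fit in 32 bits. The caller is assumed to pass
a non-zero pointer size and frame size.

```c
#include <stdint.h>

/* Results of the buffer size calculation (hypothetical type). */
struct ring_limits {
	uint64_t block_number;	/* <size-max> / <pointer size> */
	uint64_t block_size;	/* <page size> << <max-order>, in bytes */
	uint64_t buffer_size;	/* <block number> * <block size>, in bytes */
	uint64_t frame_number;	/* buffer size / <frame size> */
};

static struct ring_limits ring_limits_calc(uint64_t size_max,
					   uint64_t pointer_size,
					   uint64_t page_size,
					   unsigned int max_order,
					   uint64_t frame_size)
{
	struct ring_limits l;

	l.block_number = size_max / pointer_size;
	l.block_size   = page_size << max_order;
	l.buffer_size  = l.block_number * l.block_size;
	l.frame_number = l.buffer_size / frame_size;
	return l;
}
```

Calling ring_limits_calc(131072, 4, 4096, 11, 2048) gives the same figures as
the worked example: 32768 blocks of 8 MiB each, a 262144 MiB buffer and
134217728 frames.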

Other constraints
-----------------

If you check the source code you will see that what I draw here as a frame
is not only the link level frame. At the beginning of each frame there is a
header called struct tpacket_hdr, used in PACKET_MMAP to hold the link level
frame's meta information, like the timestamp. So what we draw here as a frame
is really the following (from include/linux/if_packet.h)::

 /*
   Frame structure:

   - Start. Frame must be aligned to TPACKET_ALIGNMENT=16
   - struct tpacket_hdr
   - pad to TPACKET_ALIGNMENT=16
   - struct sockaddr_ll
   - Gap, chosen so that packet data (Start+tp_net) aligns to
     TPACKET_ALIGNMENT=16
   - Start+tp_mac: [ Optional MAC header ]
   - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
   - Pad to align to TPACKET_ALIGNMENT=16
 */

The following are conditions that are checked in packet_set_ring:

   - tp_block_size must be a multiple of PAGE_SIZE (1)
   - tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
   - tp_frame_size must be a multiple of TPACKET_ALIGNMENT
   - tp_frame_nr   must be exactly frames_per_block*tp_block_nr

Note that tp_block_size should be chosen to be a power of two or there will
be a waste of memory.
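
The four conditions listed above can be checked in user space before calling
setsockopt(). The sketch below mirrors only those four conditions; the kernel
may apply further checks that are not covered here. The function name
ring_request_check() and its negative return codes are hypothetical, and the
page size is passed in by the caller (for instance from getpagesize()).

```c
#include <linux/if_packet.h>

/*
 * Hypothetical pre-check of a ring request.
 * Returns 0 if the four documented conditions hold, otherwise a negative
 * number identifying the first condition that failed.
 */
static int ring_request_check(const struct tpacket_req *req,
			      unsigned int page_size)
{
	unsigned int frames_per_block;

	/* tp_block_size must be a multiple of PAGE_SIZE */
	if (page_size == 0 || req->tp_block_size == 0 ||
	    req->tp_block_size % page_size != 0)
		return -1;

	/* tp_frame_size must be greater than TPACKET_HDRLEN */
	if (req->tp_frame_size <= TPACKET_HDRLEN)
		return -2;

	/* tp_frame_size must be a multiple of TPACKET_ALIGNMENT */
	if (req->tp_frame_size % TPACKET_ALIGNMENT != 0)
		return -3;

	/* tp_frame_nr must be exactly frames_per_block * tp_block_nr */
	frames_per_block = req->tp_block_size / req->tp_frame_size;
	if (frames_per_block == 0 ||
	    (unsigned long long)frames_per_block * req->tp_block_nr !=
	    req->tp_frame_nr)
		return -4;

	return 0;
}
```

A request that passes this check can still be rejected by the kernel, for
example when the memory cannot be allocated, so the return value of
setsockopt() must be checked as well.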

Mapping and use of the circular buffer (ring)
---------------------------------------------

The mapping of the buffer in the user process is done with the conventional
mmap function. Even though the circular buffer is composed of several physically
discontiguous blocks of memory, they are contiguous in user space, hence
just one call to mmap is needed::

    mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

If tp_frame_size is a divisor of tp_block_size, frames will be
contiguously spaced by tp_frame_size bytes. If not, there will be a gap
after every tp_block_size/tp_frame_size frames. This is because a frame
cannot span two blocks.
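
Because of these possible gaps, the position of frame number i in the mapped
area is best computed per block rather than as i * tp_frame_size. A minimal
sketch, with a hypothetical helper name, assuming a zero-based frame index and
0 < frame_size <= block_size:

```c
#include <stddef.h>

/*
 * Byte offset of frame 'frame_idx' from the start of the mapped ring.
 * Frames never cross a block boundary, so the index is first split into
 * a block number and a slot inside that block.
 */
static size_t ring_frame_offset(unsigned int block_size,
				unsigned int frame_size,
				unsigned int frame_idx)
{
	unsigned int frames_per_block = block_size / frame_size;
	size_t block = frame_idx / frames_per_block;
	size_t slot  = frame_idx % frames_per_block;

	return block * block_size + slot * frame_size;
}
```

With tp_block_size = 4096 and tp_frame_size = 1536, each block holds two
frames, so the third frame (index 2) starts at offset 4096 and the last
1024 bytes of every block are unused.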

To use one socket for capture and transmission, the mapping of both the
RX and TX buffer ring has to be done with one call to mmap::

    ...
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo));
    setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar));
    ...
    rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    tx_ring = rx_ring + size;

RX must be the first as the kernel maps the TX ring memory right
after the RX one.
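
The snippet above assumes that both rings have the same size. A slightly more
general sketch is shown below; it assumes that each ring occupies
tp_block_size * tp_block_nr bytes and that both setsockopt() calls have
already succeeded on fd. The struct ring_pair type and the ring_pair_map()
helper are hypothetical names used only for this illustration.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <linux/if_packet.h>

/* Hypothetical container for the two mapped rings. */
struct ring_pair {
	void *rx;
	void *tx;
	size_t rx_size;
	size_t tx_size;
};

/*
 * Map the RX and TX rings of 'fd' with a single mmap() call.
 * Returns 0 on success and -1 on failure (errno is set by mmap()).
 */
static int ring_pair_map(int fd, const struct tpacket_req *rx_req,
			 const struct tpacket_req *tx_req,
			 struct ring_pair *out)
{
	size_t rx_size = (size_t)rx_req->tp_block_size * rx_req->tp_block_nr;
	size_t tx_size = (size_t)tx_req->tp_block_size * tx_req->tp_block_nr;
	void *base;

	base = mmap(NULL, rx_size + tx_size, PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);
	if (base == MAP_FAILED)
		return -1;

	out->rx = base;				/* RX ring comes first */
	out->tx = (char *)base + rx_size;	/* TX ring follows it */
	out->rx_size = rx_size;
	out->tx_size = tx_size;
	return 0;
}
```

The whole area can later be released with a single
munmap(out->rx, out->rx_size + out->tx_size) call.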
4304ba7bc9fSMauro Carvalho Chehab
At the beginning of each frame there is a status field (see
struct tpacket_hdr). If this field is 0, the frame is ready to be used
by the kernel. If not, there is a frame the user can read and the
following flags apply:

Capture process
^^^^^^^^^^^^^^^

From include/linux/if_packet.h::

     #define TP_STATUS_COPY          (1 << 1)
     #define TP_STATUS_LOSING        (1 << 2)
     #define TP_STATUS_CSUMNOTREADY  (1 << 3)
     #define TP_STATUS_CSUM_VALID    (1 << 7)

======================  =======================================================
TP_STATUS_COPY		This flag indicates that the frame (and associated
			meta information) has been truncated because it's
			larger than tp_frame_size. This packet can be
			read entirely with recvfrom().

			In order to make this work it must be
			enabled previously with setsockopt() and
			the PACKET_COPY_THRESH option.

			The number of frames that can be buffered to
			be read with recvfrom() is limited like a normal socket.
			See the SO_RCVBUF option in the socket(7) man page.

TP_STATUS_LOSING	This flag indicates that packets were dropped since
			the last time statistics were checked with
			getsockopt() and the PACKET_STATISTICS option.

TP_STATUS_CSUMNOTREADY	Currently used for outgoing IP packets whose
			checksum will be computed in hardware. So while
			reading the packet we should not try to check the
			checksum.

TP_STATUS_CSUM_VALID	This flag indicates that at least the transport
			header checksum of the packet has been already
			validated on the kernel side. If the flag is not set
			then we are free to check the checksum by ourselves
			provided that TP_STATUS_CSUMNOTREADY is also not set.
======================  =======================================================
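
As a rough illustration of how the capture flags above combine, the
following sketch tests a status word against them. The helper names
(frame_ready, frame_truncated, should_verify_csum) are hypothetical and not
part of any kernel or libc API; only the TP_STATUS_* constants come from
linux/if_packet.h.

```c
#include <linux/if_packet.h>
#include <stdbool.h>

/* The frame belongs to user space once TP_STATUS_USER is set. */
static bool frame_ready(unsigned long status)
{
	return (status & TP_STATUS_USER) != 0;
}

/* The frame was truncated; the full packet can be read with recvfrom(). */
static bool frame_truncated(unsigned long status)
{
	return (status & TP_STATUS_COPY) != 0;
}

/*
 * User space may check the transport checksum itself only when the kernel
 * has not validated it already and it is not left to the hardware.
 */
static bool should_verify_csum(unsigned long status)
{
	return !(status & (TP_STATUS_CSUM_VALID | TP_STATUS_CSUMNOTREADY));
}
```

A reader loop would typically call frame_ready() first and consult the other
two helpers only for frames it actually owns.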

For convenience there are also the following defines::

     #define TP_STATUS_KERNEL        0
     #define TP_STATUS_USER          1

The kernel initializes all frames to TP_STATUS_KERNEL. When the kernel
receives a packet it puts it in the buffer and updates the status with
at least the TP_STATUS_USER flag. The user can then read the packet.
Once the packet is read, the user must zero the status field so that the
kernel can use that frame buffer again.

The user can use poll (any other variant should apply too) to check if new
packets are in the ring::

    struct pollfd pfd;

    pfd.fd = fd;
    pfd.revents = 0;
    pfd.events = POLLIN|POLLRDNORM|POLLERR;

    if (status == TP_STATUS_KERNEL)
	retval = poll(&pfd, 1, timeout);

Checking the status value first and then polling for frames does not
incur a race condition.
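
The handoff between kernel and user described above could be sketched as
follows for a TPACKET_V1 RX ring. This is a minimal sketch rather than a
reference implementation: rx_ring_drain() is a hypothetical helper, it
assumes that tp_block_size is a multiple of tp_frame_size so that frames are
contiguous in the mapping, and the __sync_synchronize() barrier is an
assumption about what a careful reader would want on weakly ordered CPUs.

```c
#include <linux/if_packet.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: walk the ring starting at *idx and hand every frame
 * owned by user space back to the kernel. Returns the number of frames
 * consumed; 0 means that nothing is ready and the caller may poll().
 */
static unsigned int rx_ring_drain(uint8_t *ring, unsigned int frame_nr,
				  unsigned int frame_size, unsigned int *idx)
{
	unsigned int consumed = 0;

	while (consumed < frame_nr) {
		volatile struct tpacket_hdr *hdr = (volatile struct tpacket_hdr *)
			(ring + (size_t)(*idx) * frame_size);

		if (!(hdr->tp_status & TP_STATUS_USER))
			break;		/* still owned by the kernel */

		/* ... use hdr->tp_snaplen bytes at (uint8_t *)hdr + hdr->tp_mac ... */

		__sync_synchronize();	/* finish reading before releasing */
		hdr->tp_status = TP_STATUS_KERNEL;

		*idx = (*idx + 1) % frame_nr;
		consumed++;
	}

	return consumed;
}
```

When the helper returns 0, the frame at *idx is still owned by the kernel and
the caller can block in poll() as shown above.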

Transmission process
^^^^^^^^^^^^^^^^^^^^

The status field is also used for transmission, with the following defines::

     #define TP_STATUS_AVAILABLE        0 // Frame is available
     #define TP_STATUS_SEND_REQUEST     1 // Frame will be sent on next send()
     #define TP_STATUS_SENDING          2 // Frame is currently in transmission
     #define TP_STATUS_WRONG_FORMAT     4 // Frame format is not correct

First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a
packet, the user fills a data buffer of an available frame, sets tp_len to
the current data buffer size and sets its status field to
TP_STATUS_SEND_REQUEST. This can be done on multiple frames. Once the user
is ready to transmit, it calls send(). Then all buffers with status equal to
TP_STATUS_SEND_REQUEST are forwarded to the network device. The kernel sets
the status of each frame being sent to TP_STATUS_SENDING until the end of
the transfer.

At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE.

::

    header->tp_len = in_i_size;
    header->tp_status = TP_STATUS_SEND_REQUEST;
    retval = send(this->socket, NULL, 0, 0);

While a frame is still being transmitted (status == TP_STATUS_SENDING), the
user can also use poll() to wait until a buffer is available::

    struct pollfd pfd;
    pfd.fd = fd;
    pfd.revents = 0;
    pfd.events = POLLOUT;
    retval = poll(&pfd, 1, timeout);
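
The steps for queuing one frame could be sketched as below. tx_frame_queue()
is a hypothetical helper, not an existing API. The sketch assumes a
TPACKET_V1 TX ring whose payload starts at
TPACKET_HDRLEN - sizeof(struct sockaddr_ll) from the frame base; verify that
offset against the ring layout in use before relying on it.

```c
#include <linux/if_packet.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy one packet into a TX frame and mark it for
 * transmission. Returns 0 on success, -1 if the frame is not available or
 * the payload does not fit. The caller still has to call send().
 */
static int tx_frame_queue(void *frame, size_t frame_size,
			  const void *pkt, size_t len)
{
	struct tpacket_hdr *hdr = frame;
	/* Assumed payload offset for TPACKET_V1 TX frames */
	size_t off = TPACKET_HDRLEN - sizeof(struct sockaddr_ll);

	if (hdr->tp_status != TP_STATUS_AVAILABLE)
		return -1;
	if (frame_size < off || len > frame_size - off)
		return -1;

	memcpy((char *)frame + off, pkt, len);
	hdr->tp_len = len;
	__sync_synchronize();	/* payload must be visible before the status */
	hdr->tp_status = TP_STATUS_SEND_REQUEST;
	return 0;
}
```

After queuing one or more frames this way, a single send(fd, NULL, 0, 0)
hands all of them to the kernel.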

What TPACKET versions are available and when to use them?
=========================================================

::

 int val = tpacket_version;
 socklen_t len = sizeof(val);

 setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
 getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, &len);

where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2 or TPACKET_V3.

TPACKET_V1:
	- Default if not otherwise specified by setsockopt(2)
	- RX_RING, TX_RING available

TPACKET_V1 --> TPACKET_V2:
	- Made 64 bit clean due to unsigned long usage in TPACKET_V1
	  structures, thus this also works on 64 bit kernel with 32 bit
	  userspace and the like
	- Timestamp resolution in nanoseconds instead of microseconds
	- RX_RING, TX_RING available
	- VLAN metadata information available for packets
	  (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID),
	  in the tpacket2_hdr structure:

		- TP_STATUS_VLAN_VALID bit being set in the tp_status field indicates
		  that the tp_vlan_tci field has a valid VLAN TCI value
		- TP_STATUS_VLAN_TPID_VALID bit being set in the tp_status field
		  indicates that the tp_vlan_tpid field has a valid VLAN TPID value

	- How to switch to TPACKET_V2:

		1. Replace struct tpacket_hdr by struct tpacket2_hdr
		2. Query the header length and save it
		3. Set protocol version to 2, set up ring as usual
		4. For getting the sockaddr_ll,
		   use ``(void *)hdr + TPACKET_ALIGN(hdrlen)`` instead of
		   ``(void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))``
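
Steps 2 to 4 of the switch to TPACKET_V2 could look roughly like the sketch
below. The three helper names are hypothetical. The sketch assumes that the
header length is queried with the PACKET_HDRLEN socket option, passing the
TPACKET version in and receiving the length back in the same integer.

```c
#include <linux/if_packet.h>
#include <sys/socket.h>

/*
 * Step 2: ask the kernel for the header length of a TPACKET version.
 * Returns the length, or -1 on error.
 */
static int tpacket_query_hdrlen(int fd, int version)
{
	int val = version;	/* in: version, out: header length */
	socklen_t len = sizeof(val);

	if (getsockopt(fd, SOL_PACKET, PACKET_HDRLEN, &val, &len) < 0)
		return -1;
	return val;
}

/* Step 3: select TPACKET_V2 before the ring is set up. */
static int tpacket_use_v2(int fd)
{
	int val = TPACKET_V2;

	return setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
}

/* Step 4: locate the sockaddr_ll that follows the aligned frame header. */
static struct sockaddr_ll *frame_sockaddr_ll(void *hdr, int hdrlen)
{
	return (struct sockaddr_ll *)((char *)hdr + TPACKET_ALIGN(hdrlen));
}
```

The value returned by tpacket_query_hdrlen() would be saved once at setup
time and passed to frame_sockaddr_ll() for every frame.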

TPACKET_V2 --> TPACKET_V3:
	- Flexible buffer implementation for RX_RING:
		1. Blocks can be configured with non-static frame-size
		2. Read/poll is at a block-level (as opposed to packet-level)
		3. Added poll timeout to avoid indefinite user-space wait
		   on idle links
		4. Added user-configurable knobs:

			4.1 block::timeout
			4.2 tpkt_hdr::sk_rxhash

	- RX Hash data available in user space
	- TX_RING semantics are conceptually similar to TPACKET_V2;
	  use tpacket3_hdr instead of tpacket2_hdr, and TPACKET3_HDRLEN
	  instead of TPACKET2_HDRLEN. In the current implementation,
	  the tp_next_offset field in the tpacket3_hdr MUST be set to
	  zero, indicating that the ring does not hold variable sized frames.
	  Packets with non-zero values of tp_next_offset will be dropped.

AF_PACKET fanout mode
=====================

In the AF_PACKET fanout mode, packet reception can be load balanced among
processes. This also works in combination with mmap(2) on packet sockets.

Currently implemented fanout policies are:

  - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash
  - PACKET_FANOUT_LB: schedule to socket by round-robin
  - PACKET_FANOUT_CPU: schedule to socket by the CPU the packet arrives on
  - PACKET_FANOUT_RND: schedule to socket by random selection
  - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another
  - PACKET_FANOUT_QM: schedule to socket by skb's recorded queue_mapping

Minimal example code by David S. Miller (try things like "./test eth0 hash",
"./test eth0 lb", etc.)::

    #include <stddef.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/socket.h>
    #include <sys/ioctl.h>

    #include <unistd.h>

    #include <linux/if_ether.h>
    #include <linux/if_packet.h>

    #include <net/if.h>
    #include <arpa/inet.h>	/* htons() */

    static const char *device_name;
    static int fanout_type;
    static int fanout_id;

    #ifndef PACKET_FANOUT
    # define PACKET_FANOUT			18
    # define PACKET_FANOUT_HASH		0
    # define PACKET_FANOUT_LB		1
    #endif

    /* Returns a bound socket that joined the fanout group, or -1 on error */
    static int setup_socket(void)
    {
	    int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
	    struct sockaddr_ll ll;
	    struct ifreq ifr;
	    int fanout_arg;

	    if (fd < 0) {
		    perror("socket");
		    return -1;
	    }

	    memset(&ifr, 0, sizeof(ifr));
	    /* ifr was zeroed above, so the name stays NUL-terminated */
	    strncpy(ifr.ifr_name, device_name, IFNAMSIZ - 1);
	    err = ioctl(fd, SIOCGIFINDEX, &ifr);
	    if (err < 0) {
		    perror("SIOCGIFINDEX");
		    return -1;
	    }

	    memset(&ll, 0, sizeof(ll));
	    ll.sll_family = AF_PACKET;
	    ll.sll_ifindex = ifr.ifr_ifindex;
	    err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
	    if (err < 0) {
		    perror("bind");
		    return -1;
	    }

	    /* Group id in the low 16 bits, fanout policy in the high 16 bits */
	    fanout_arg = (fanout_id | (fanout_type << 16));
	    err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
			    &fanout_arg, sizeof(fanout_arg));
	    if (err) {
		    perror("setsockopt");
		    return -1;
	    }

	    return fd;
    }

    static void fanout_thread(void)
    {
	    int fd = setup_socket();
	    int limit = 10000;

	    if (fd < 0)
		    exit(EXIT_FAILURE);

	    while (limit-- > 0) {
		    char buf[1600];
		    int err;

		    err = read(fd, buf, sizeof(buf));
		    if (err < 0) {
			    perror("read");
			    exit(EXIT_FAILURE);
		    }
		    if ((limit % 10) == 0)
			    fprintf(stdout, "(%d) \n", getpid());
	    }

	    fprintf(stdout, "%d: Received 10000 packets\n", getpid());

	    close(fd);
	    exit(0);
    }

    int main(int argc, char **argp)
    {
	    int i;

	    if (argc != 3) {
		    fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
		    return EXIT_FAILURE;
	    }

	    if (!strcmp(argp[2], "hash"))
		    fanout_type = PACKET_FANOUT_HASH;
	    else if (!strcmp(argp[2], "lb"))
		    fanout_type = PACKET_FANOUT_LB;
	    else {
		    fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
		    exit(EXIT_FAILURE);
	    }

	    device_name = argp[1];
	    fanout_id = getpid() & 0xffff;

	    for (i = 0; i < 4; i++) {
		    pid_t pid = fork();

		    switch (pid) {
		    case 0:
			    /* Child: fanout_thread() never returns */
			    fanout_thread();

		    case -1:
			    perror("fork");
			    exit(EXIT_FAILURE);
		    }
	    }

	    /* Parent: reap the four children */
	    for (i = 0; i < 4; i++) {
		    int status;

		    wait(&status);
	    }

	    return 0;
    }
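
The example above only accepts "hash" and "lb". A possible way to cover the
other listed policies is sketched below. The command line names ("cpu",
"rnd", "rollover", "qm") and both helper functions are hypothetical; the
PACKET_FANOUT_* constants are taken from linux/if_packet.h and the argument
layout follows the fanout_arg computation in the example.

```c
#include <linux/if_packet.h>
#include <stdint.h>
#include <string.h>

/* Map a policy name to its PACKET_FANOUT_* value, or -1 if unknown. */
static int fanout_type_from_name(const char *name)
{
	static const struct {
		const char *name;
		int type;
	} tbl[] = {
		{ "hash",     PACKET_FANOUT_HASH },
		{ "lb",       PACKET_FANOUT_LB },
		{ "cpu",      PACKET_FANOUT_CPU },
		{ "rnd",      PACKET_FANOUT_RND },
		{ "rollover", PACKET_FANOUT_ROLLOVER },
		{ "qm",       PACKET_FANOUT_QM },
	};
	size_t i;

	for (i = 0; i < sizeof(tbl) / sizeof(tbl[0]); i++)
		if (!strcmp(name, tbl[i].name))
			return tbl[i].type;
	return -1;
}

/* Group id in the low 16 bits, fanout policy in the high 16 bits. */
static int fanout_make_arg(uint16_t id, uint16_t type)
{
	return (int)((uint32_t)id | ((uint32_t)type << 16));
}
```

The value returned by fanout_make_arg() would be passed to setsockopt() with
the PACKET_FANOUT option, as setup_socket() does in the example.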

AF_PACKET TPACKET_V3 example
============================

AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
sizes by doing its own memory management. It is based on blocks where polling
works on a per block basis instead of per ring as in TPACKET_V2 and its
predecessor.

It is said that TPACKET_V3 brings the following benefits:

 * ~15% - 20% reduction in CPU-usage
 * ~20% increase in packet capture rate
 * ~2x increase in packet density
 * Port aggregation analysis
 * Non static frame size to capture entire packet payload

So it seems to be a good candidate to be used with packet fanout.

Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile
it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.)::

    /* Written from scratch, but kernel-to-user space API usage
     * dissected from lolpcap:
     *  Copyright 2011, Chetan Loke <loke.chetan@gmail.com>
     *  License: GPL, version 2.0
     */

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <string.h>
    #include <assert.h>
    #include <net/if.h>
    #include <arpa/inet.h>
    #include <netdb.h>
    #include <poll.h>
    #include <unistd.h>
    #include <signal.h>
    #include <inttypes.h>
    #include <sys/socket.h>
    #include <sys/mman.h>
    #include <linux/if_packet.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>

    #ifndef likely
    # define likely(x)		__builtin_expect(!!(x), 1)
    #endif
    #ifndef unlikely
    # define unlikely(x)		__builtin_expect(!!(x), 0)
    #endif

    struct block_desc {
	    uint32_t version;
	    uint32_t offset_to_priv;
	    struct tpacket_hdr_v1 h1;
    };

    struct ring {
	    struct iovec *rd;
	    uint8_t *map;
	    struct tpacket_req3 req;
    };

    static unsigned long packets_total = 0, bytes_total = 0;
    /* volatile: written from the signal handler, read in the main loop */
    static volatile sig_atomic_t sigint = 0;

    static void sighandler(int num)
    {
	    sigint = 1;
    }

    static int setup_socket(struct ring *ring, char *netdev)
    {
	    int err, i, fd, v = TPACKET_V3;
	    struct sockaddr_ll ll;
	    unsigned int blocksiz = 1 << 22, framesiz = 1 << 11;
	    unsigned int blocknum = 64;

	    fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	    if (fd < 0) {
		    perror("socket");
		    exit(1);
	    }

	    err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v));
	    if (err < 0) {
		    perror("setsockopt");
		    exit(1);
	    }

	    memset(&ring->req, 0, sizeof(ring->req));
	    ring->req.tp_block_size = blocksiz;
	    ring->req.tp_frame_size = framesiz;
	    ring->req.tp_block_nr = blocknum;
	    ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz;
	    ring->req.tp_retire_blk_tov = 60;
	    ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;

	    err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
			    sizeof(ring->req));
	    if (err < 0) {
		    perror("setsockopt");
		    exit(1);
	    }

	    ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
			    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0);
	    if (ring->map == MAP_FAILED) {
		    perror("mmap");
		    exit(1);
	    }

	    ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd));
	    assert(ring->rd);
	    for (i = 0; i < ring->req.tp_block_nr; ++i) {
		    ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size);
		    ring->rd[i].iov_len = ring->req.tp_block_size;
	    }

	    memset(&ll, 0, sizeof(ll));
	    ll.sll_family = PF_PACKET;
	    ll.sll_protocol = htons(ETH_P_ALL);
	    ll.sll_ifindex = if_nametoindex(netdev);
	    ll.sll_hatype = 0;
	    ll.sll_pkttype = 0;
	    ll.sll_halen = 0;

	    err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
	    if (err < 0) {
		    perror("bind");
		    exit(1);
	    }

	    return fd;
    }

    static void display(struct tpacket3_hdr *ppd)
    {
	    struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac);
	    struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN);

	    if (eth->h_proto == htons(ETH_P_IP)) {
		    struct sockaddr_in ss, sd;
		    char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST];

		    memset(&ss, 0, sizeof(ss));
		    ss.sin_family = PF_INET;
		    ss.sin_addr.s_addr = ip->saddr;
		    getnameinfo((struct sockaddr *) &ss, sizeof(ss),
				sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST);

		    memset(&sd, 0, sizeof(sd));
		    sd.sin_family = PF_INET;
		    sd.sin_addr.s_addr = ip->daddr;
		    getnameinfo((struct sockaddr *) &sd, sizeof(sd),
				dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST);

		    printf("%s -> %s, ", sbuff, dbuff);
	    }

	    printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash);
    }

    static void walk_block(struct block_desc *pbd, const int block_num)
    {
	    int num_pkts = pbd->h1.num_pkts, i;
	    unsigned long bytes = 0;
	    struct tpacket3_hdr *ppd;

	    ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
					pbd->h1.offset_to_first_pkt);
	    for (i = 0; i < num_pkts; ++i) {
		    bytes += ppd->tp_snaplen;
		    display(ppd);

		    ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd +
						ppd->tp_next_offset);
	    }

	    packets_total += num_pkts;
	    bytes_total += bytes;
    }

    static void flush_block(struct block_desc *pbd)
    {
	    pbd->h1.block_status = TP_STATUS_KERNEL;
    }

    static void teardown_socket(struct ring *ring, int fd)
    {
	    munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr);
	    free(ring->rd);
	    close(fd);
    }

    int main(int argc, char **argp)
    {
	    int fd, err;
	    socklen_t len;
	    struct ring ring;
	    struct pollfd pfd;
	    unsigned int block_num = 0, blocks = 64;
	    struct block_desc *pbd;
	    struct tpacket_stats_v3 stats;

	    if (argc != 2) {
		    fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]);
		    return EXIT_FAILURE;
	    }

	    signal(SIGINT, sighandler);

	    memset(&ring, 0, sizeof(ring));
	    fd = setup_socket(&ring, argp[argc - 1]);
	    assert(fd > 0);

	    memset(&pfd, 0, sizeof(pfd));
	    pfd.fd = fd;
	    pfd.events = POLLIN | POLLERR;
	    pfd.revents = 0;

	    while (likely(!sigint)) {
		    pbd = (struct block_desc *) ring.rd[block_num].iov_base;

		    if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
			    poll(&pfd, 1, -1);
			    continue;
		    }

		    walk_block(pbd, block_num);
		    flush_block(pbd);
		    block_num = (block_num + 1) % blocks;
	    }

	    len = sizeof(stats);
	    err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len);
	    if (err < 0) {
		    perror("getsockopt");
		    exit(1);
	    }

	    fflush(stdout);
	    printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n",
		stats.tp_packets, bytes_total, stats.tp_drops,
		stats.tp_freeze_q_cnt);

	    teardown_socket(&ring, fd);
	    return 0;
    }

PACKET_QDISC_BYPASS
===================

If you need to load the network with many packets in a similar fashion to
pktgen, you can set the following option after socket creation::

    int one = 1;
    setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));

As a side effect, packets sent through PF_PACKET bypass the kernel's qdisc
layer and are pushed directly to the driver. This means that packets are not
buffered, tc disciplines are ignored, increased loss can occur, and such
packets are no longer visible to other PF_PACKET sockets. You have been
warned; that said, this option can be useful for stress testing various
components of a system.

By default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly
enabled on PF_PACKET sockets.

PACKET_TIMESTAMP
================

The PACKET_TIMESTAMP setting determines the source of the timestamp in
the packet meta information for mmap(2)ed RX_RINGs and TX_RINGs. If your
NIC is capable of timestamping packets in hardware, you can request that
those hardware timestamps be used. Note: you may need to enable the
generation of hardware timestamps with SIOCSHWTSTAMP (see the related
information in Documentation/networking/timestamping.rst).

PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING::

    int req = SOF_TIMESTAMPING_RAW_HARDWARE;
    setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req));

For the mmap(2)ed ring buffers, such timestamps are stored in the
``tpacket{,2,3}_hdr`` structure's tp_sec and ``tp_{n,u}sec`` members.
To determine what kind of timestamp has been reported, the tp_status field
is binary or'ed with one of the following bits ...

::

    TP_STATUS_TS_RAW_HARDWARE
    TP_STATUS_TS_SOFTWARE

... which are equivalent to their ``SOF_TIMESTAMPING_*`` counterparts. For the
RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a
software fallback was invoked *within* PF_PACKET's processing code (less
precise).
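
The status bits described above could be decoded along these lines. The
helper name ``ts_source()`` and the returned strings are purely
illustrative; only the ``TP_STATUS_TS_*`` constants come from
``<linux/if_packet.h>``.

```c
#include <stdint.h>
#include <linux/if_packet.h>

/* Hypothetical helper: name the timestamp source encoded in tp_status of
 * an RX_RING frame. */
const char *ts_source(uint32_t tp_status)
{
	if (tp_status & TP_STATUS_TS_RAW_HARDWARE)
		return "raw hardware";
	if (tp_status & TP_STATUS_TS_SOFTWARE)
		return "software";

	/* Neither bit set: timestamp taken inside PF_PACKET (less precise). */
	return "PF_PACKET fallback";
}
```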

Getting timestamps for the TX_RING works as follows: i) fill the ring frames,
ii) call sendto(), e.g. in blocking mode, iii) wait for the status of the
relevant frames to be updated, i.e. for the frames to be handed back to the
application, iv) walk through the frames to pick up the individual hw/sw
timestamps.

Only (!) if transmit timestamping is enabled are these bits combined via
binary | with TP_STATUS_AVAILABLE, so you must account for that in your
application: first check !(tp_status & (TP_STATUS_SEND_REQUEST |
TP_STATUS_SENDING)) to see whether the frame belongs to the application, and
then extract the type of timestamp from tp_status in a second step!
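
The two-step check for TX_RING frames might be sketched as follows,
assuming a TPACKET_V2 ring. The helper names ``tx_frame_is_ours()`` and
``tx_frame_timestamp()`` are hypothetical; a real application would also
need appropriate memory barriers when reading tp_status from the shared
ring.

```c
#include <stdint.h>
#include <time.h>
#include <linux/if_packet.h>

/* Step 1 (hypothetical helper): the frame belongs to the application once
 * the kernel is neither waiting to send it nor in the middle of sending. */
int tx_frame_is_ours(uint32_t tp_status)
{
	return !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING));
}

/* Step 2 (hypothetical helper): copy the timestamp out of a frame header.
 * Returns 1 and fills *ts if the frame is ours and carries a hw/sw
 * timestamp, 0 otherwise (*ts is left untouched in that case). */
int tx_frame_timestamp(const struct tpacket2_hdr *hdr, struct timespec *ts)
{
	uint32_t status = hdr->tp_status;

	if (!tx_frame_is_ours(status))
		return 0;
	if (!(status & (TP_STATUS_TS_RAW_HARDWARE | TP_STATUS_TS_SOFTWARE)))
		return 0;	/* available, but no valid timestamp */

	ts->tv_sec = hdr->tp_sec;
	ts->tv_nsec = hdr->tp_nsec;
	return 1;
}
```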

If you don't care about timestamps and thus leave transmit timestamping
disabled, checking for TP_STATUS_AVAILABLE or TP_STATUS_WRONG_FORMAT is
sufficient. If only TP_STATUS_AVAILABLE is set for a TX_RING frame, then the
tp_sec and tp_{n,u}sec members do not contain a valid value. For TX_RINGs, no
timestamp is generated by default!

See include/linux/net_tstamp.h and Documentation/networking/timestamping.rst
for more information on hardware timestamps.

Miscellaneous bits
==================

- Packet sockets work well together with Linux socket filters, so you
  might also want to have a look at Documentation/networking/filter.rst

THANKS
======

   Jesse Brandeburg, for fixing my grammatical/spelling errors