============
SNMP counter
============

This document explains the meaning of SNMP counters.

General IPv4 counters
=====================
All layer 4 packets and ICMP packets will change these counters, but
these counters won't be changed by layer 2 packets (such as STP) or
ARP packets.

* IpInReceives

Defined in `RFC1213 ipInReceives`_

.. _RFC1213 ipInReceives: https://tools.ietf.org/html/rfc1213#page-26

The number of packets received by the IP layer. It is increased at the
beginning of the ip_rcv function and is always updated together with
IpExtInOctets. It is increased even if the packet is dropped later
(e.g. because the IP header is invalid or the checksum is wrong). It
indicates the number of aggregated segments after GRO/LRO.

* IpInDelivers

Defined in `RFC1213 ipInDelivers`_

.. _RFC1213 ipInDelivers: https://tools.ietf.org/html/rfc1213#page-28

The number of packets delivered to the upper layer protocols, e.g. TCP,
UDP, ICMP and so on. If no one listens on a raw socket, only packets of
kernel supported protocols will be delivered. If someone listens on a
raw socket, all valid IP packets will be delivered.

* IpOutRequests

Defined in `RFC1213 ipOutRequests`_

.. _RFC1213 ipOutRequests: https://tools.ietf.org/html/rfc1213#page-28

The number of packets sent via the IP layer, for both unicast and
multicast packets. It is always updated together with
IpExtOutOctets.

* IpExtInOctets and IpExtOutOctets

They are Linux kernel extensions and have no RFC definitions. Please
note that RFC1213 does define ifInOctets and ifOutOctets, but they
are different things. The ifInOctets and ifOutOctets counters include
the MAC layer header size, but IpExtInOctets and IpExtOutOctets don't:
they only include the IP layer header and the IP layer data.
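
The counters in this document are exported by the kernel as text in
/proc/net/snmp and /proc/net/netstat, where each protocol has a line of
field names followed by a line of values. The sketch below is an
illustration only and is not part of the kernel or of nstat: it parses
that layout into the names used here, so the InOctets field of the IpExt
line becomes IpExtInOctets. The function name parse_snmp_text and the
sample numbers are made up for the example.

```python
def parse_snmp_text(text):
    """Parse /proc/net/snmp style text into {"<Prefix><Field>": int}."""
    counters = {}
    lines = [line for line in text.splitlines() if line.strip()]
    # Lines come in pairs: "Prefix: name name ..." then "Prefix: value value ...".
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        value_prefix, numbers = values.split(":", 1)
        if prefix != value_prefix:
            raise ValueError("header/value lines do not match: %r" % prefix)
        for name, number in zip(names.split(), numbers.split()):
            counters[prefix + name] = int(number)
    return counters


# Made-up sample in the same layout as /proc/net/snmp and /proc/net/netstat.
SAMPLE = """\
Ip: InReceives InDelivers OutRequests
Ip: 1 1 1
IpExt: InOctets OutOctets InNoECTPkts
IpExt: 84 84 1
"""

if __name__ == "__main__":
    for name, value in sorted(parse_snmp_text(SAMPLE).items()):
        print(name, value)
```

On a Linux host the same function can be fed the contents of
/proc/net/snmp or /proc/net/netstat. Those files hold absolute values,
whereas nstat by default prints the change since its previous run.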

* IpExtInNoECTPkts, IpExtInECT1Pkts, IpExtInECT0Pkts, IpExtInCEPkts

They indicate the number of four kinds of ECN IP packets, please refer to
`Explicit Congestion Notification`_ for more details.

.. _Explicit Congestion Notification: https://tools.ietf.org/html/rfc3168#page-6

These 4 counters count how many packets are received per ECN
status. They count the real frame number regardless of LRO/GRO. So
for the same packet, you might find that IpInReceives counts 1, but
IpExtInNoECTPkts counts 2 or more.

* IpInHdrErrors

Defined in `RFC1213 ipInHdrErrors`_. It indicates the packet is
dropped due to an IP header error. It might happen in both the IP input
and IP forward paths.

.. _RFC1213 ipInHdrErrors: https://tools.ietf.org/html/rfc1213#page-27

* IpInAddrErrors

Defined in `RFC1213 ipInAddrErrors`_. It will be increased in two
scenarios: (1) The IP address is invalid. (2) The destination IP
address is not a local address and IP forwarding is not enabled.

.. _RFC1213 ipInAddrErrors: https://tools.ietf.org/html/rfc1213#page-27

* IpExtInNoRoutes

This counter means the packet is dropped when the IP stack receives a
packet and can't find a route for it in the route table. It might
happen when IP forwarding is enabled, the destination IP address is
not a local address, and there is no route for the destination IP
address.

* IpInUnknownProtos

Defined in `RFC1213 ipInUnknownProtos`_. It will be increased if the
layer 4 protocol is unsupported by kernel. If an application is using
a raw socket, kernel will always deliver the packet to the raw socket
and this counter won't be increased.

.. _RFC1213 ipInUnknownProtos: https://tools.ietf.org/html/rfc1213#page-27

* IpExtInTruncatedPkts

For an IPv4 packet, it means the actual data size is smaller than the
"Total Length" field in the IPv4 header.

* IpInDiscards

Defined in `RFC1213 ipInDiscards`_.
It indicates the packet is dropped
in the IP receiving path due to kernel internal reasons (e.g. not
enough memory).

.. _RFC1213 ipInDiscards: https://tools.ietf.org/html/rfc1213#page-28

* IpOutDiscards

Defined in `RFC1213 ipOutDiscards`_. It indicates the packet is
dropped in the IP sending path due to kernel internal reasons.

.. _RFC1213 ipOutDiscards: https://tools.ietf.org/html/rfc1213#page-28

* IpOutNoRoutes

Defined in `RFC1213 ipOutNoRoutes`_. It indicates the packet is
dropped in the IP sending path because no route is found for it.

.. _RFC1213 ipOutNoRoutes: https://tools.ietf.org/html/rfc1213#page-29

ICMP counters
=============
* IcmpInMsgs and IcmpOutMsgs

Defined by `RFC1213 icmpInMsgs`_ and `RFC1213 icmpOutMsgs`_

.. _RFC1213 icmpInMsgs: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpOutMsgs: https://tools.ietf.org/html/rfc1213#page-43

As mentioned in RFC1213, these two counters include errors: they
would be increased even if the ICMP packet has an invalid type. The
ICMP output path will check the header of a raw socket, so
IcmpOutMsgs would still be updated if the IP header is constructed by
a userspace program.

* ICMP named types

| These counters include most of the common ICMP types, they are:
| IcmpInDestUnreachs: `RFC1213 icmpInDestUnreachs`_
| IcmpInTimeExcds: `RFC1213 icmpInTimeExcds`_
| IcmpInParmProbs: `RFC1213 icmpInParmProbs`_
| IcmpInSrcQuenchs: `RFC1213 icmpInSrcQuenchs`_
| IcmpInRedirects: `RFC1213 icmpInRedirects`_
| IcmpInEchos: `RFC1213 icmpInEchos`_
| IcmpInEchoReps: `RFC1213 icmpInEchoReps`_
| IcmpInTimestamps: `RFC1213 icmpInTimestamps`_
| IcmpInTimestampReps: `RFC1213 icmpInTimestampReps`_
| IcmpInAddrMasks: `RFC1213 icmpInAddrMasks`_
| IcmpInAddrMaskReps: `RFC1213 icmpInAddrMaskReps`_
| IcmpOutDestUnreachs: `RFC1213 icmpOutDestUnreachs`_
| IcmpOutTimeExcds: `RFC1213 icmpOutTimeExcds`_
| IcmpOutParmProbs: `RFC1213 icmpOutParmProbs`_
| IcmpOutSrcQuenchs: `RFC1213 icmpOutSrcQuenchs`_
| IcmpOutRedirects: `RFC1213 icmpOutRedirects`_
| IcmpOutEchos: `RFC1213 icmpOutEchos`_
| IcmpOutEchoReps: `RFC1213 icmpOutEchoReps`_
| IcmpOutTimestamps: `RFC1213 icmpOutTimestamps`_
| IcmpOutTimestampReps: `RFC1213 icmpOutTimestampReps`_
| IcmpOutAddrMasks: `RFC1213 icmpOutAddrMasks`_
| IcmpOutAddrMaskReps: `RFC1213 icmpOutAddrMaskReps`_

.. _RFC1213 icmpInDestUnreachs: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpInTimeExcds: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpInParmProbs: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInSrcQuenchs: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInRedirects: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInEchos: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInEchoReps: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInTimestamps: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInTimestampReps: https://tools.ietf.org/html/rfc1213#page-43
.. _RFC1213 icmpInAddrMasks: https://tools.ietf.org/html/rfc1213#page-43
.. _RFC1213 icmpInAddrMaskReps: https://tools.ietf.org/html/rfc1213#page-43

.. _RFC1213 icmpOutDestUnreachs: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutTimeExcds: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutParmProbs: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutSrcQuenchs: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutRedirects: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutEchos: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutEchoReps: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutTimestamps: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutTimestampReps: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutAddrMasks: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutAddrMaskReps: https://tools.ietf.org/html/rfc1213#page-46

Every ICMP type has two counters: 'In' and 'Out'. E.g., for the ICMP
Echo packet, they are IcmpInEchos and IcmpOutEchos. Their meanings are
straightforward. The 'In' counter means kernel receives such a packet
and the 'Out' counter means kernel sends such a packet.

* ICMP numeric types

They are IcmpMsgInType[N] and IcmpMsgOutType[N], where [N] indicates the
ICMP type number. These counters track all kinds of ICMP packets. The
ICMP type number definitions can be found in the `ICMP parameters`_
document.

.. _ICMP parameters: https://www.iana.org/assignments/icmp-parameters/icmp-parameters.xhtml

For example, if the Linux kernel sends an ICMP Echo packet,
IcmpMsgOutType8 would increase by 1. And if kernel gets an ICMP Echo Reply
packet, IcmpMsgInType0 would increase by 1.

* IcmpInCsumErrors

This counter indicates the checksum of the ICMP packet is
wrong. Kernel verifies the checksum after updating IcmpInMsgs and
before updating IcmpMsgInType[N].
If a packet has a bad checksum,
IcmpInMsgs would be updated but none of IcmpMsgInType[N] would be updated.

* IcmpInErrors and IcmpOutErrors

Defined by `RFC1213 icmpInErrors`_ and `RFC1213 icmpOutErrors`_

.. _RFC1213 icmpInErrors: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpOutErrors: https://tools.ietf.org/html/rfc1213#page-43

When an error occurs in the ICMP packet handler path, these two
counters would be updated. The receiving packet path uses IcmpInErrors
and the sending packet path uses IcmpOutErrors. When IcmpInCsumErrors
is increased, IcmpInErrors would always be increased too.

relationship of the ICMP counters
---------------------------------
The sum of IcmpMsgOutType[N] is always equal to IcmpOutMsgs, as they
are updated at the same time. The sum of IcmpMsgInType[N] plus
IcmpInErrors should be equal to or larger than IcmpInMsgs. When kernel
receives an ICMP packet, kernel follows the logic below:

1. increase IcmpInMsgs
2. if there is any error, update IcmpInErrors and finish the process
3. update IcmpMsgInType[N]
4. handle the packet depending on the type, if there is any error, update
   IcmpInErrors and finish the process

So if all errors occur in step (2), IcmpInMsgs should be equal to the
sum of IcmpMsgInType[N] plus IcmpInErrors. If all errors occur in
step (4), IcmpInMsgs should be equal to the sum of
IcmpMsgInType[N]. If the errors occur in both step (2) and step (4),
IcmpInMsgs should be less than the sum of IcmpMsgInType[N] plus
IcmpInErrors.

General TCP counters
====================
* TcpInSegs

Defined in `RFC1213 tcpInSegs`_

.. _RFC1213 tcpInSegs: https://tools.ietf.org/html/rfc1213#page-48

The number of packets received by the TCP layer. As mentioned in
RFC1213, it includes the packets received in error, such as checksum
error, invalid TCP header and so on.
Only one error won't be included:
if the layer 2 destination address is not the NIC's layer 2
address. It might happen if the packet is a multicast or broadcast
packet, or the NIC is in promiscuous mode. In these situations, the
packets would be delivered to the TCP layer, but the TCP layer will discard
these packets before increasing TcpInSegs. The TcpInSegs counter
isn't aware of GRO. So if two packets are merged by GRO, the TcpInSegs
counter would only increase by 1.

* TcpOutSegs

Defined in `RFC1213 tcpOutSegs`_

.. _RFC1213 tcpOutSegs: https://tools.ietf.org/html/rfc1213#page-48

The number of packets sent by the TCP layer. As mentioned in RFC1213,
it excludes the retransmitted packets, but it includes the SYN, ACK
and RST packets. Unlike TcpInSegs, TcpOutSegs is aware of
GSO, so if a packet would be split into 2 by GSO, TcpOutSegs will
increase by 2.

* TcpActiveOpens

Defined in `RFC1213 tcpActiveOpens`_

.. _RFC1213 tcpActiveOpens: https://tools.ietf.org/html/rfc1213#page-47

It means the TCP layer sends a SYN and comes into the SYN-SENT
state. Every time TcpActiveOpens increases by 1, TcpOutSegs should always
increase by 1.

* TcpPassiveOpens

Defined in `RFC1213 tcpPassiveOpens`_

.. _RFC1213 tcpPassiveOpens: https://tools.ietf.org/html/rfc1213#page-47

It means the TCP layer receives a SYN, replies with a SYN+ACK, and comes
into the SYN-RCVD state.

* TcpExtTCPRcvCoalesce

When packets are received by the TCP layer and have not been read by the
application, the TCP layer will try to merge them. This counter
indicates how many packets are merged in such a situation. If GRO is
enabled, lots of packets would be merged by GRO; these packets
wouldn't be counted in TcpExtTCPRcvCoalesce.

* TcpExtTCPAutoCorking

When sending packets, the TCP layer will try to merge small packets into
a bigger one.
This counter increases by 1 for every packet merged in such a
situation. Please refer to the LWN article for more details:
https://lwn.net/Articles/576263/

* TcpExtTCPOrigDataSent

This counter is explained by `kernel commit f19c29e3e391`_. The
explanation is pasted below::

  TCPOrigDataSent: number of outgoing packets with original data (excluding
  retransmission but including data-in-SYN). This counter is different from
  TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is
  more useful to track the TCP retransmission rate.

* TCPSynRetrans

This counter is explained by `kernel commit f19c29e3e391`_. The
explanation is pasted below::

  TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
  retransmissions into SYN, fast-retransmits, timeout retransmits, etc.

* TCPFastOpenActiveFail

This counter is explained by `kernel commit f19c29e3e391`_. The
explanation is pasted below::

  TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed because
  the remote does not accept it or the attempts timed out.

.. _kernel commit f19c29e3e391: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f19c29e3e391a66a273e9afebaf01917245148cd

* TcpExtListenOverflows and TcpExtListenDrops

When kernel receives a SYN from a client, and if the TCP accept queue
is full, kernel will drop the SYN and add 1 to TcpExtListenOverflows.
At the same time kernel will also add 1 to TcpExtListenDrops. When a
TCP socket is in LISTEN state, and kernel needs to drop a packet,
kernel would always add 1 to TcpExtListenDrops. So an increase of
TcpExtListenOverflows always comes with an increase of
TcpExtListenDrops, but TcpExtListenDrops could also increase without
TcpExtListenOverflows increasing, e.g. a memory allocation failure would
also increase TcpExtListenDrops.
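
The relationship between the two counters can be written down as a toy
model. This is only a sketch of the rule described above, not kernel
code; the class and method names are made up for the illustration.

```python
class ListenCounters:
    """Toy model of TcpExtListenOverflows and TcpExtListenDrops."""

    def __init__(self):
        self.listen_overflows = 0
        self.listen_drops = 0

    def drop(self, accept_queue_full):
        """Record one packet dropped by a socket in LISTEN state."""
        if accept_queue_full:
            self.listen_overflows += 1
        # Every drop on a listening socket is counted, whatever the reason.
        self.listen_drops += 1

    def drops_for_other_reasons(self):
        """Drops not explained by a full accept queue (e.g. no memory)."""
        return self.listen_drops - self.listen_overflows


if __name__ == "__main__":
    counters = ListenCounters()
    counters.drop(accept_queue_full=True)
    counters.drop(accept_queue_full=False)
    print(counters.listen_overflows, counters.listen_drops)
```

When reading real counters, TcpExtListenDrops minus TcpExtListenOverflows
gives the same "other reasons" number.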

Note: The above explanation is based on kernel 4.10 or above. On
an old kernel, the TCP stack behaves differently when the TCP accept
queue is full. On the old kernel, the TCP stack won't drop the SYN, it
would complete the 3-way handshake. As the accept queue is full, the TCP
stack will keep the socket in the TCP half-open queue. As it is in the
half-open queue, the TCP stack will send SYN+ACK on an exponential backoff
timer. After the client replies ACK, the TCP stack checks whether the
accept queue is still full. If it is not full, it moves the socket to the
accept queue. If it is full, it keeps the socket in the half-open queue,
and the next time the client replies ACK, this socket will get another
chance to move to the accept queue.


TCP Fast Path
=============
When kernel receives a TCP packet, it has two paths to handle the
packet, one is the fast path, the other is the slow path. The comment in
the kernel code provides a good explanation of them, it is pasted below::

  It is split into a fast path and a slow path. The fast path is
  disabled when:

  - A zero window was announced from us
  - zero window probing
    is only handled properly on the slow path.
  - Out of order segments arrived.
  - Urgent data is expected.
  - There is no buffer space left
  - Unexpected TCP flags/window values/header lengths are received
    (detected by checking the TCP header against pred_flags)
  - Data is sent in both directions. The fast path only supports pure senders
    or pure receivers (this means either the sequence number or the ack
    value must stay constant)
  - Unexpected TCP option.

Kernel will try to use the fast path unless any of the above conditions
are satisfied. If the packets are out of order, kernel will handle
them in the slow path, which means the performance might not be very
good.
Kernel would also come into the slow path if the "Delayed ack" is
used, because when using "Delayed ack", the data is sent in both
directions. When the TCP window scale option is not used, kernel will
try to enable the fast path immediately when the connection comes into the
established state, but if the TCP window scale option is used, kernel
will disable the fast path at first, and try to enable it after kernel
receives packets.

* TcpExtTCPPureAcks and TcpExtTCPHPAcks

If a packet has the ACK flag set and has no data, it is a pure ACK packet.
If kernel handles it in the fast path, TcpExtTCPHPAcks will increase by 1.
If kernel handles it in the slow path, TcpExtTCPPureAcks will
increase by 1.

* TcpExtTCPHPHits

If a TCP packet has data (which means it is not a pure ACK packet),
and this packet is handled in the fast path, TcpExtTCPHPHits will
increase by 1.


TCP abort
=========

* TcpExtTCPAbortOnData

It means the TCP layer has data in flight, but needs to close the
connection. So the TCP layer sends a RST to the other side, indicating the
connection is not closed gracefully. An easy way to increase this
counter is using the SO_LINGER option. Please refer to the SO_LINGER
section of the `socket man page`_:

.. _socket man page: http://man7.org/linux/man-pages/man7/socket.7.html

By default, when an application closes a connection, the close function
will return immediately and kernel will try to send the in-flight data
asynchronously. If you use the SO_LINGER option, set l_onoff to 1, and
l_linger to a positive number, the close function won't return immediately,
but waits until the in-flight data is acked by the other side. The max wait
time is l_linger seconds. If you set l_onoff to 1 and l_linger to 0,
when the application closes a connection, kernel will send a RST
immediately and increase the TcpExtTCPAbortOnData counter.
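
The l_onoff=1, l_linger=0 case can be tried on a loopback connection.
The following is a minimal sketch, assuming a Linux host with a working
loopback interface and the usual two-int layout of struct linger; the
helper names linger_value and abortive_close_resets_peer are made up for
this example.

```python
import socket
import struct


def linger_value(onoff, seconds):
    """Pack a struct linger {int l_onoff; int l_linger;} for setsockopt."""
    return struct.pack("ii", onoff, seconds)


def abortive_close_resets_peer():
    """Close a loopback connection with l_onoff=1, l_linger=0.

    Returns True if the accepting side observes a connection reset.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(listener.getsockname())
    peer, _ = listener.accept()
    peer.settimeout(5)
    try:
        client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                          linger_value(1, 0))
        client.close()  # with l_linger == 0 the kernel sends a RST, not a FIN
        try:
            peer.recv(1)
        except ConnectionResetError:
            return True
        return False
    finally:
        peer.close()
        listener.close()


if __name__ == "__main__":
    print("peer saw reset:", abortive_close_resets_peer())
```

Running nstat before and after the script should show
TcpExtTCPAbortOnData growing by one on the host, as described above.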

* TcpExtTCPAbortOnClose

This counter means the application has unread data in the TCP layer when
the application wants to close the TCP connection. In such a situation,
kernel will send a RST to the other side of the TCP connection.

* TcpExtTCPAbortOnMemory

When an application closes a TCP connection, kernel still needs to track
the connection and let it complete the TCP disconnect process. E.g. an
app calls the close method of a socket and kernel sends a FIN to the other
side of the connection. The app then has no relationship with the
socket any more, but kernel needs to keep the socket. This socket
becomes an orphan socket: kernel waits for the reply of the other side,
and would come to the TIME_WAIT state finally. When kernel does not have
enough memory to keep the orphan socket, kernel would send a RST to
the other side and delete the socket. In such a situation, kernel will
add 1 to TcpExtTCPAbortOnMemory. Two conditions would trigger
TcpExtTCPAbortOnMemory:

1. the memory used by the TCP protocol is higher than the third value of
   tcp_mem. Please refer to the tcp_mem section in the `TCP man page`_:

.. _TCP man page: http://man7.org/linux/man-pages/man7/tcp.7.html

2. the orphan socket count is higher than net.ipv4.tcp_max_orphans


* TcpExtTCPAbortOnTimeout

This counter will increase when any of the TCP timers expire. In such a
situation, kernel won't send a RST, it just gives up the connection.

* TcpExtTCPAbortOnLinger

When a TCP connection comes into the FIN_WAIT_2 state, instead of waiting
for the FIN packet from the other side, kernel could send a RST and
delete the socket immediately. This is not the default behavior of the
Linux kernel TCP stack. By configuring the TCP_LINGER2 socket option,
you could let kernel follow this behavior.

* TcpExtTCPAbortFailed

The kernel TCP layer will send a RST if the `RFC2525 2.17 section`_ is
satisfied. If an internal error occurs during this process,
TcpExtTCPAbortFailed will be increased.

.. _RFC2525 2.17 section: https://tools.ietf.org/html/rfc2525#page-50

TCP Hybrid Slow Start
=====================
The Hybrid Slow Start algorithm is an enhancement of the traditional
TCP congestion window Slow Start algorithm. It uses two pieces of
information to detect whether the max bandwidth of the TCP path is
approached. The two pieces of information are the ACK train length and
the increase in packet delay. For detailed information, please refer to the
`Hybrid Slow Start paper`_. When either the ACK train length or the packet
delay hits a specific threshold, the congestion control algorithm will come
into the Congestion Avoidance state. As of v4.20, two congestion
control algorithms use Hybrid Slow Start: cubic (the
default congestion control algorithm) and cdg. Four snmp counters
relate to the Hybrid Slow Start algorithm.

.. _Hybrid Slow Start paper: https://pdfs.semanticscholar.org/25e9/ef3f03315782c7f1cbcd31b587857adae7d1.pdf

* TcpExtTCPHystartTrainDetect

How many times the ACK train length threshold is detected.

* TcpExtTCPHystartTrainCwnd

The sum of CWND detected by ACK train length. Dividing this value by
TcpExtTCPHystartTrainDetect gives the average CWND detected by the
ACK train length.

* TcpExtTCPHystartDelayDetect

How many times the packet delay threshold is detected.

* TcpExtTCPHystartDelayCwnd

The sum of CWND detected by packet delay. Dividing this value by
TcpExtTCPHystartDelayDetect gives the average CWND detected by the
packet delay.

TCP retransmission and congestion control
=========================================
The TCP protocol has two retransmission mechanisms: SACK and fast
recovery.
They are mutually exclusive. When SACK is enabled,
the kernel TCP stack would use SACK, otherwise kernel would use fast
recovery. SACK is a TCP option, which is defined in `RFC2018`_.
Fast recovery is defined in `RFC6582`_ and is also called
'Reno'.

TCP congestion control is a big and complex topic. To understand
the related snmp counters, we need to know the states of the congestion
control state machine. There are 5 states: Open, Disorder, CWR,
Recovery and Loss. For details about these states, please refer to page 5
and page 6 of this document:
https://pdfs.semanticscholar.org/0e9c/968d09ab2e53e24c4dca5b2d67c7f7140f8e.pdf

.. _RFC2018: https://tools.ietf.org/html/rfc2018
.. _RFC6582: https://tools.ietf.org/html/rfc6582

* TcpExtTCPRenoRecovery and TcpExtTCPSackRecovery

When the congestion control comes into the Recovery state, if SACK is
used, TcpExtTCPSackRecovery increases by 1, and if SACK is not used,
TcpExtTCPRenoRecovery increases by 1. These two counters mean the TCP
stack begins to retransmit the lost packets.

* TcpExtTCPSACKReneging

A packet was acknowledged by SACK, but the receiver has dropped this
packet, so the sender needs to retransmit this packet. In this
situation, the sender adds 1 to TcpExtTCPSACKReneging. A receiver
could drop a packet which has been acknowledged by SACK. Although it is
unusual, it is allowed by the TCP protocol. The sender doesn't really
know what happened on the receiver side. The sender just waits until
the RTO expires for this packet, then the sender assumes this packet
has been dropped by the receiver.

* TcpExtTCPRenoReorder

The reordered packet is detected by fast recovery. It would only be used
if SACK is disabled. The fast recovery algorithm detects reordering by
the duplicate ACK number.
E.g., if retransmission is triggered, and
the original retransmitted packet is not lost, it is just out of
order, the receiver would acknowledge multiple times, once for the
retransmitted packet, and once more for the arrival of the original out of
order packet. Thus the sender would find more ACKs than it
expects, and the sender knows reordering has occurred.

* TcpExtTCPTSReorder

The reordered packet is detected when a hole is filled. E.g., assume the
sender sends packets 1,2,3,4,5, and the receiving order is
1,2,4,5,3. When the sender receives the ACK of packet 3 (which will
fill the hole), two conditions will let TcpExtTCPTSReorder increase by
1: (1) packet 3 has not been retransmitted yet. (2) packet
3 has been retransmitted but the timestamp of packet 3's ACK is earlier
than the retransmission timestamp.

* TcpExtTCPSACKReorder

The reordered packet is detected by SACK. SACK has two methods to
detect reordering: (1) DSACK is received by the sender. It means the
sender sent the same packet more than once. And the only reason
is the sender believes an out of order packet is lost so it sends the
packet again. (2) Assume packets 1,2,3,4,5 are sent by the sender, and
the sender has received SACKs for packets 2 and 5. Now the sender
receives a SACK for packet 4 and the sender hasn't retransmitted the
packet yet, so the sender would know packet 4 is out of order. The TCP
stack of kernel will increase TcpExtTCPSACKReorder for both of the
above scenarios.


DSACK
=====
DSACK is defined in `RFC2883`_. The receiver uses DSACK to report
duplicate packets to the sender. There are two kinds of
duplications: (1) a packet which has been acknowledged is
duplicate. (2) an out of order packet is duplicate. The TCP stack
counts these two kinds of duplications on both the receiver side and
the sender side.

.. _RFC2883: https://tools.ietf.org/html/rfc2883

* TcpExtTCPDSACKOldSent

The TCP stack receives a duplicate packet which has been acked, so it
sends a DSACK to the sender.

* TcpExtTCPDSACKOfoSent

The TCP stack receives an out of order duplicate packet, so it sends a
DSACK to the sender.

* TcpExtTCPDSACKRecv

The TCP stack receives a DSACK, which indicates an acknowledged
duplicate packet is received.

* TcpExtTCPDSACKOfoRecv

The TCP stack receives a DSACK, which indicates an out of order
duplicate packet is received.

TCP out of order
================
* TcpExtTCPOFOQueue

The TCP layer receives an out of order packet and has enough memory
to queue it.

* TcpExtTCPOFODrop

The TCP layer receives an out of order packet but doesn't have enough
memory, so drops it. Such packets won't be counted in
TcpExtTCPOFOQueue.

* TcpExtTCPOFOMerge

The received out of order packet overlaps with the previous
packet. The overlapping part will be dropped. All TcpExtTCPOFOMerge
packets will also be counted in TcpExtTCPOFOQueue.

TCP PAWS
========
PAWS (Protection Against Wrapped Sequence numbers) is an algorithm
which is used to drop old packets. It depends on the TCP
timestamps. For detailed information, please refer to the `timestamp wiki`_
and the `RFC of PAWS`_.

.. _RFC of PAWS: https://tools.ietf.org/html/rfc1323#page-17
.. _timestamp wiki: https://en.wikipedia.org/wiki/Transmission_Control_Protocol#TCP_timestamps

* TcpExtPAWSActive

Packets are dropped by PAWS in Syn-Sent status.

* TcpExtPAWSEstab

Packets are dropped by PAWS in any status other than Syn-Sent.

TCP ACK skip
============
In some scenarios, kernel would avoid sending duplicate ACKs too
frequently. Please find more details in the tcp_invalid_ratelimit
section of the `sysctl document`_.
When kernel decides to skip an ACK
due to tcp_invalid_ratelimit, kernel would update one of the counters
below to indicate in which scenario the ACK is skipped. The ACK
would only be skipped if the received packet is either a SYN packet or
it has no data.

.. _sysctl document: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

* TcpExtTCPACKSkippedSynRecv

The ACK is skipped in Syn-Recv status. The Syn-Recv status means the
TCP stack receives a SYN and replies SYN+ACK. Now the TCP stack is
waiting for an ACK. Generally, the TCP stack doesn't need to send an ACK
in the Syn-Recv status. But in several scenarios, the TCP stack needs
to send an ACK. E.g., the TCP stack receives the same SYN packet
repeatedly, the received packet does not pass the PAWS check, or the
received packet sequence number is out of window. In these scenarios,
the TCP stack needs to send an ACK. If the ACK sending frequency is higher
than tcp_invalid_ratelimit allows, the TCP stack will skip sending the ACK
and increase TcpExtTCPACKSkippedSynRecv.


* TcpExtTCPACKSkippedPAWS

The ACK is skipped because the PAWS (Protection Against Wrapped Sequence
numbers) check fails. If the PAWS check fails in Syn-Recv, Fin-Wait-2
or Time-Wait statuses, the skipped ACK would be counted in
TcpExtTCPACKSkippedSynRecv, TcpExtTCPACKSkippedFinWait2 or
TcpExtTCPACKSkippedTimeWait. In all other statuses, the skipped ACK
would be counted in TcpExtTCPACKSkippedPAWS.

* TcpExtTCPACKSkippedSeq

The sequence number is out of window, the timestamp passes the PAWS
check, and the TCP status is not Syn-Recv, Fin-Wait-2 or Time-Wait.

* TcpExtTCPACKSkippedFinWait2

The ACK is skipped in Fin-Wait-2 status. The reason would be either
that the PAWS check fails or the received sequence number is out of window.
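
The way a skipped ACK is attributed to a counter can be summarized as a
small lookup. This is only a model of the rules stated in this section,
not kernel code: the function name, the state strings and the
paws_failed flag are made up for the illustration, and challenge ACKs
are left out.

```python
# Statuses that have their own counter regardless of why the ACK was skipped.
STATE_COUNTERS = {
    "Syn-Recv": "TcpExtTCPACKSkippedSynRecv",
    "Fin-Wait-2": "TcpExtTCPACKSkippedFinWait2",
    "Time-Wait": "TcpExtTCPACKSkippedTimeWait",
}


def skipped_ack_counter(state, paws_failed):
    """Return the counter charged for a rate-limited ACK.

    state       -- TCP status name, e.g. "Established" or "Syn-Recv"
    paws_failed -- True if the PAWS check failed, False if the PAWS check
                   passed but the sequence number was out of window
    """
    if state in STATE_COUNTERS:
        return STATE_COUNTERS[state]
    if paws_failed:
        return "TcpExtTCPACKSkippedPAWS"
    return "TcpExtTCPACKSkippedSeq"


if __name__ == "__main__":
    print(skipped_ack_counter("Established", paws_failed=True))
    print(skipped_ack_counter("Fin-Wait-2", paws_failed=True))
```

The status is checked first, which is why a PAWS failure in Fin-Wait-2 is
reported as TcpExtTCPACKSkippedFinWait2 and not as
TcpExtTCPACKSkippedPAWS.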

* TcpExtTCPACKSkippedTimeWait

The ACK is skipped in Time-Wait status. The reason would be either
that the PAWS check fails or the received sequence number is out of window.

* TcpExtTCPACKSkippedChallenge

The ACK is skipped if the ACK is a challenge ACK. RFC 5961 defines
3 kinds of challenge ACK, please refer to `RFC 5961 section 3.2`_,
`RFC 5961 section 4.2`_ and `RFC 5961 section 5.2`_. Besides these
three scenarios, in some TCP statuses, the Linux TCP stack would also
send challenge ACKs if the ACK number is before the first
unacknowledged number (more strict than `RFC 5961 section 5.2`_).

.. _RFC 5961 section 3.2: https://tools.ietf.org/html/rfc5961#page-7
.. _RFC 5961 section 4.2: https://tools.ietf.org/html/rfc5961#page-9
.. _RFC 5961 section 5.2: https://tools.ietf.org/html/rfc5961#page-11


examples
========

ping test
---------
Run the ping command against the public dns server 8.8.8.8::

  nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
  PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
  64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=17.8 ms

  --- 8.8.8.8 ping statistics ---
  1 packets transmitted, 1 received, 0% packet loss, time 0ms
  rtt min/avg/max/mdev = 17.875/17.875/17.875/0.000 ms

The nstat result::

  nstatuser@nstat-a:~$ nstat
  #kernel
  IpInReceives                    1                  0.0
  IpInDelivers                    1                  0.0
  IpOutRequests                   1                  0.0
  IcmpInMsgs                      1                  0.0
  IcmpInEchoReps                  1                  0.0
  IcmpOutMsgs                     1                  0.0
  IcmpOutEchos                    1                  0.0
  IcmpMsgInType0                  1                  0.0
  IcmpMsgOutType8                 1                  0.0
  IpExtInOctets                   84                 0.0
  IpExtOutOctets                  84                 0.0
  IpExtInNoECTPkts                1                  0.0

The Linux server sent an ICMP Echo packet, so IpOutRequests,
IcmpOutMsgs, IcmpOutEchos and IcmpMsgOutType8 were increased by 1. The
server got an ICMP Echo Reply from 8.8.8.8, so IpInReceives, IcmpInMsgs,
IcmpInEchoReps and IcmpMsgInType0 were increased by 1.
The ICMP Echo Reply
was passed to the ICMP layer via the IP layer, so IpInDelivers was
increased by 1. The default ping data size is 56 bytes (see the
"56(84) bytes of data" line printed by ping), so an ICMP Echo packet
and its corresponding Echo Reply packet consist of:

* 14 bytes MAC header
* 20 bytes IP header
* 8 bytes ICMP header
* 56 bytes data (default value of the ping command)

The MAC header is not counted at the IP layer, so IpExtInOctets and
IpExtOutOctets are 20+8+56=84.

tcp 3-way handshake
-------------------
On the server side, we run::

  nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000
  Listening on [0.0.0.0] (family 0, port 9000)

On the client side, we run::

  nstatuser@nstat-a:~$ nc -nv 192.168.122.251 9000
  Connection to 192.168.122.251 9000 port [tcp/*] succeeded!

The server listened on TCP port 9000, the client connected to it, and they
completed the 3-way handshake.

On the server side, we can find the nstat output below::

  nstatuser@nstat-b:~$ nstat | grep -i tcp
  TcpPassiveOpens                 1                  0.0
  TcpInSegs                       2                  0.0
  TcpOutSegs                      1                  0.0
  TcpExtTCPPureAcks               1                  0.0

On the client side, we can find the nstat output below::

  nstatuser@nstat-a:~$ nstat | grep -i tcp
  TcpActiveOpens                  1                  0.0
  TcpInSegs                       1                  0.0
  TcpOutSegs                      2                  0.0

When the server received the first SYN, it replied with a SYN+ACK and entered
the SYN-RCVD state, so TcpPassiveOpens increased by 1. The server received
the SYN, sent the SYN+ACK and received the ACK, so the server sent 1 packet
and received 2 packets: TcpInSegs increased by 2, TcpOutSegs increased by 1.
The last ACK of the 3-way handshake is a pure ACK without data, so
TcpExtTCPPureAcks increased by 1.

When the client sent the SYN, it entered the SYN-SENT state, so
TcpActiveOpens increased by 1. The client sent the SYN, received the SYN+ACK
and sent the ACK, so the client sent 2 packets and received 1 packet:
TcpInSegs increased by 1, TcpOutSegs increased by 2.
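
The segment accounting of the handshake can be replayed in a few lines
of Python. This is only an illustrative sketch: the ``count_segments``
helper and the ``(sender, flags)`` tuples are made up for this example
and are not part of any kernel or nstat interface, and the model
assumes that no segment is retransmitted::

  # Hypothetical model of the 3-way handshake: (sender, TCP flags).
  HANDSHAKE = [
      ("client", "SYN"),
      ("server", "SYN+ACK"),
      ("client", "ACK"),
  ]


  def count_segments(segments, host):
      """Return the expected (TcpInSegs, TcpOutSegs) deltas on host."""
      out_segs = sum(1 for sender, _flags in segments if sender == host)
      in_segs = len(segments) - out_segs
      return in_segs, out_segs


  print("server:", count_segments(HANDSHAKE, "server"))
  print("client:", count_segments(HANDSHAKE, "client"))

The sketch yields (2, 1) for the server and (1, 2) for the client, which
matches the TcpInSegs and TcpOutSegs values in the nstat output of this
test.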

TCP normal traffic
------------------
Run nc on the server::

  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
  Listening on [0.0.0.0] (family 0, port 9000)

Run nc on the client::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!

Input a string in the nc client ('hello' in our example)::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!
  hello

The client side nstat output::

  nstatuser@nstat-a:~$ nstat
  #kernel
  IpInReceives                    1                  0.0
  IpInDelivers                    1                  0.0
  IpOutRequests                   1                  0.0
  TcpInSegs                       1                  0.0
  TcpOutSegs                      1                  0.0
  TcpExtTCPPureAcks               1                  0.0
  TcpExtTCPOrigDataSent           1                  0.0
  IpExtInOctets                   52                 0.0
  IpExtOutOctets                  58                 0.0
  IpExtInNoECTPkts                1                  0.0

The server side nstat output::

  nstatuser@nstat-b:~$ nstat
  #kernel
  IpInReceives                    1                  0.0
  IpInDelivers                    1                  0.0
  IpOutRequests                   1                  0.0
  TcpInSegs                       1                  0.0
  TcpOutSegs                      1                  0.0
  IpExtInOctets                   58                 0.0
  IpExtOutOctets                  52                 0.0
  IpExtInNoECTPkts                1                  0.0

Input a string in the nc client again ('world' in our example)::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!
  hello
  world

Client side nstat output::

  nstatuser@nstat-a:~$ nstat
  #kernel
  IpInReceives                    1                  0.0
  IpInDelivers                    1                  0.0
  IpOutRequests                   1                  0.0
  TcpInSegs                       1                  0.0
  TcpOutSegs                      1                  0.0
  TcpExtTCPHPAcks                 1                  0.0
  TcpExtTCPOrigDataSent           1                  0.0
  IpExtInOctets                   52                 0.0
  IpExtOutOctets                  58                 0.0
  IpExtInNoECTPkts                1                  0.0


Server side nstat output::

  nstatuser@nstat-b:~$ nstat
  #kernel
  IpInReceives                    1                  0.0
  IpInDelivers                    1                  0.0
  IpOutRequests                   1                  0.0
  TcpInSegs                       1                  0.0
  TcpOutSegs                      1                  0.0
  TcpExtTCPHPHits                 1                  0.0
  IpExtInOctets                   58                 0.0
  IpExtOutOctets                  52                 0.0
  IpExtInNoECTPkts                1                  0.0

Comparing the first and the second client-side nstat output,
we can find one difference: the first one had a 'TcpExtTCPPureAcks',
but the second one had a 'TcpExtTCPHPAcks'. The first and the second
server-side nstat output differ too: the
second one had a 'TcpExtTCPHPHits', but the first
one didn't. The network traffic patterns were
exactly the same: the client sent a packet to the server, and the server
replied with an ACK. But the kernel handled them in different ways. When the
TCP window scale option is not used, the kernel tries to enable the fast
path immediately when the connection enters the established state,
but if the TCP window scale option is used, the kernel disables the
fast path at first, and tries to enable it after it has received
packets. We can use the 'ss' command to verify whether the window
scale option is used. E.g.
run the command below on either the server or the
client::

  nstatuser@nstat-a:~$ ss -o state established -i '( dport = :9000 or sport = :9000 )'
  Netid    Recv-Q     Send-Q            Local Address:Port             Peer Address:Port
  tcp      0          0               192.168.122.250:40654         192.168.122.251:9000
             ts sack cubic wscale:7,7 rto:204 rtt:0.98/0.49 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:1 send 118.2Mbps lastsnd:46572 lastrcv:46572 lastack:46572 pacing_rate 236.4Mbps rcv_space:29200 rcv_ssthresh:29200 minrtt:0.98

The 'wscale:7,7' means both server and client set the window scale
option to 7. Now we can explain the nstat output in our test:

In the first nstat output of the client side, the client sent a packet and
the server replied with an ACK. When the kernel handled this ACK, the fast
path was not enabled, so the ACK was counted into 'TcpExtTCPPureAcks'.

In the second nstat output of the client side, the client sent a packet again
and received another ACK from the server. This time, the fast path was
enabled and the ACK qualified for the fast path, so it was handled by
the fast path and counted into 'TcpExtTCPHPAcks'.

In the first nstat output of the server side, the fast path was not enabled,
so there was no 'TcpExtTCPHPHits'.

In the second nstat output of the server side, the fast path was enabled
and the packet received from the client qualified for the fast path, so it
was counted into 'TcpExtTCPHPHits'.

TcpExtTCPAbortOnClose
---------------------
On the server side, we run the python script below::

  import socket
  import time

  port = 9000

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(('0.0.0.0', port))
  s.listen(1)
  sock, addr = s.accept()
  # keep the connection open without ever reading from it
  while True:
      time.sleep(9999999)

This python script listens on port 9000, but doesn't read anything from
the connection.

On the client side, we send the string "hello" by nc::

  nstatuser@nstat-a:~$ echo "hello" | nc nstat-b 9000

Then we come back to the server side. The server has received the "hello"
packet and the TCP layer has acked it, but the application hasn't
read it yet. We type Ctrl-C to terminate the server script. Then we
can find that TcpExtTCPAbortOnClose increased by 1 on the server side::

  nstatuser@nstat-b:~$ nstat | grep -i abort
  TcpExtTCPAbortOnClose           1                  0.0

If we run tcpdump on the server side, we can see that the server sent a
RST after we typed Ctrl-C.

TcpExtTCPAbortOnMemory and TcpExtTCPAbortOnTimeout
---------------------------------------------------
Below is an example which lets the orphan socket count become higher than
net.ipv4.tcp_max_orphans.
Change tcp_max_orphans to a smaller value on the client::

  sudo bash -c "echo 10 > /proc/sys/net/ipv4/tcp_max_orphans"

Client code (creates 64 connections to the server)::

  nstatuser@nstat-a:~$ cat client_orphan.py
  import socket
  import time

  server = 'nstat-b' # server address
  port = 9000

  count = 64

  connection_list = []

  for i in range(count):
      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      s.connect((server, port))
      connection_list.append(s)
      print("connection_count: %d" % len(connection_list))

  while True:
      time.sleep(99999)

Server code (accepts 64 connections from the client)::

  nstatuser@nstat-b:~$ cat server_orphan.py
  import socket
  import time

  port = 9000
  count = 64

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(('0.0.0.0', port))
  s.listen(count)
  connection_list = []
  while True:
      sock, addr = s.accept()
      connection_list.append((sock, addr))
      print("connection_count: %d" % len(connection_list))

Run the python scripts on server and client.

On server::

  python3 server_orphan.py

On client::

  python3 client_orphan.py

Run iptables on server::

  sudo iptables -A INPUT -i ens3 -p tcp --destination-port 9000 -j DROP

Type Ctrl-C on the client to stop client_orphan.py.

Check TcpExtTCPAbortOnMemory on client::

  nstatuser@nstat-a:~$ nstat | grep -i abort
  TcpExtTCPAbortOnMemory          54                 0.0

Check the orphan socket count on client::

  nstatuser@nstat-a:~$ ss -s
  Total: 131 (kernel 0)
  TCP:   14 (estab 1, closed 0, orphaned 10, synrecv 0, timewait 0/0), ports 0

  Transport Total     IP        IPv6
  *         0         -         -
  RAW       1         0         1
  UDP       1         1         0
  TCP       14        13        1
  INET      16        14        2
  FRAG      0         0         0

The explanation of the test: after running server_orphan.py and
client_orphan.py, we have 64 connections between server and
client. After the iptables command, the server drops all packets from
the client. When we type Ctrl-C on client_orphan.py, the client system
tries to close these connections, and before they are closed
gracefully, they become orphan sockets. As the iptables rule
on the server blocks packets from the client, the server doesn't receive
the FIN from the client, so all connections on the client are stuck in the
FIN_WAIT_1 state and stay orphan sockets until they time out. We have
echoed 10 to /proc/sys/net/ipv4/tcp_max_orphans, so the client system
only keeps 10 orphan sockets. For all other orphan sockets, the client
system sends a RST and deletes them. We have 64 connections, so
the 'ss -s' command shows that the system has 10 orphan sockets, and the
value of TcpExtTCPAbortOnMemory was 54.
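
The arithmetic of this test can be written down as a rough sketch. The
helper name ``expected_abort_on_memory`` is hypothetical, and the model
simply assumes that every orphan socket above the limit is aborted,
which the kernel's accounting only approximates::

  def expected_abort_on_memory(connections, max_orphans):
      """Orphans above the limit are reset and counted as aborts."""
      return max(connections - max_orphans, 0)


  # 64 connections closed at once, tcp_max_orphans set to 10
  print(expected_abort_on_memory(64, 10))

With the values of this test, the sketch gives 64 - 10 = 54, the
TcpExtTCPAbortOnMemory value reported by nstat.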

An additional explanation about the orphan socket count: you can find the
exact orphan socket count with the 'ss -s' command, but when the kernel
decides whether to increase TcpExtTCPAbortOnMemory and send a RST, it
doesn't always check the exact orphan socket count. To improve
performance, the kernel checks an approximate count first, and only if the
approximate count is more than tcp_max_orphans does it check the
exact count. So if the approximate count is less than
tcp_max_orphans but the exact count is more than tcp_max_orphans, you
would find that TcpExtTCPAbortOnMemory is not increased at all. If
tcp_max_orphans is large enough, this won't occur, but if you decrease
tcp_max_orphans to a small value as in our test, you might hit this
issue. That is why the client in our test set up 64 connections although
tcp_max_orphans is 10. If the client only set up 11 connections, we
couldn't observe a change of TcpExtTCPAbortOnMemory.

Continuing the previous test, we wait for several minutes. Because the
iptables rule on the server blocks the traffic, the server doesn't receive
the FIN, and all the client's orphan sockets finally time out in the
FIN_WAIT_1 state.
So after waiting for a few minutes, we can find
10 timeouts on the client::

  nstatuser@nstat-a:~$ nstat | grep -i abort
  TcpExtTCPAbortOnTimeout         10                 0.0

TcpExtTCPAbortOnLinger
----------------------
The server side code::

  nstatuser@nstat-b:~$ cat server_linger.py
  import socket
  import time

  port = 9000

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(('0.0.0.0', port))
  s.listen(1)
  sock, addr = s.accept()
  while True:
      time.sleep(9999999)

The client side code::

  nstatuser@nstat-a:~$ cat client_linger.py
  import socket
  import struct

  server = 'nstat-b' # server address
  port = 9000

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 10))
  s.setsockopt(socket.SOL_TCP, socket.TCP_LINGER2, struct.pack('i', -1))
  s.connect((server, port))
  s.close()

Run server_linger.py on server::

  nstatuser@nstat-b:~$ python3 server_linger.py

Run client_linger.py on client::

  nstatuser@nstat-a:~$ python3 client_linger.py

After running client_linger.py, check the output of nstat::

  nstatuser@nstat-a:~$ nstat | grep -i abort
  TcpExtTCPAbortOnLinger          1                  0.0

TcpExtTCPRcvCoalesce
--------------------
On the server, we run a program which listens on TCP port 9000, but
doesn't read any data::

  import socket
  import time
  port = 9000
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(('0.0.0.0', port))
  s.listen(1)
  sock, addr = s.accept()
  while True:
      time.sleep(9999999)

Save the above code as server_coalesce.py, and run::

  python3 server_coalesce.py

On the client, save the code below as client_coalesce.py::

  import socket
  server = 'nstat-b'
  port = 9000
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.connect((server, port))

Run::

  nstatuser@nstat-a:~$ python3 -i client_coalesce.py

We use '-i' to enter the interactive mode, then send a packet::

  >>> s.send(b'foo')
  3

Send a packet again::

  >>> s.send(b'bar')
  3

On the server, run nstat::

  ubuntu@nstat-b:~$ nstat
  #kernel
  IpInReceives                    2                  0.0
  IpInDelivers                    2                  0.0
  IpOutRequests                   2                  0.0
  TcpInSegs                       2                  0.0
  TcpOutSegs                      2                  0.0
  TcpExtTCPRcvCoalesce            1                  0.0
  IpExtInOctets                   110                0.0
  IpExtOutOctets                  104                0.0
  IpExtInNoECTPkts                2                  0.0

The client sent two packets, and the server didn't read any data. When
the second packet arrived at the server, the first packet was still in
the receiving queue. So the TCP layer merged the two packets, and we
can find that TcpExtTCPRcvCoalesce increased by 1.

TcpExtListenOverflows and TcpExtListenDrops
-------------------------------------------
On the server, run the nc command, listening on port 9000::

  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
  Listening on [0.0.0.0] (family 0, port 9000)

On the client, run 3 nc commands in different terminals::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!

The nc command only accepts 1 connection, and the accept queue length
is 1. In the current Linux implementation, setting the queue length to n
means the actual queue length is n+1. Now we create 3 connections: 1 is
accepted by nc and 2 are in the accept queue, so the accept queue is full.

Before running the 4th nc, we clean the nstat history on the server::

  nstatuser@nstat-b:~$ nstat -n

Run the 4th nc on the client::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000

If the nc server is running on kernel 4.10 or a higher version, you
won't see the "Connection to ... succeeded!"
string, because the kernel
drops the SYN if the accept queue is full. If the nc server is running
on an older kernel, you would see that the connection succeeded,
because the kernel would complete the 3-way handshake and keep the socket
in the half-open queue. This test was done on kernel 4.15. Below is the nstat
output on the server::

  nstatuser@nstat-b:~$ nstat
  #kernel
  IpInReceives                    4                  0.0
  IpInDelivers                    4                  0.0
  TcpInSegs                       4                  0.0
  TcpExtListenOverflows           4                  0.0
  TcpExtListenDrops               4                  0.0
  IpExtInOctets                   240                0.0
  IpExtInNoECTPkts                4                  0.0

Both TcpExtListenOverflows and TcpExtListenDrops were 4. If the time
between the 4th nc and the nstat was longer, the values of
TcpExtListenOverflows and TcpExtListenDrops would be larger, because
the SYN of the 4th nc was dropped and the client kept retrying.

IpInAddrErrors, IpExtInNoRoutes and IpOutNoRoutes
-------------------------------------------------
* server A IP address: 192.168.122.250
* server B IP address: 192.168.122.251

Prepare on server A, add a route to server B::

  $ sudo ip route add 8.8.8.8/32 via 192.168.122.251

Prepare on server B, disable send_redirects for all interfaces::

  $ sudo sysctl -w net.ipv4.conf.all.send_redirects=0
  $ sudo sysctl -w net.ipv4.conf.ens3.send_redirects=0
  $ sudo sysctl -w net.ipv4.conf.lo.send_redirects=0
  $ sudo sysctl -w net.ipv4.conf.default.send_redirects=0

We want to let server A send a packet to 8.8.8.8 and route the packet
to server B. When server B receives such a packet, it might send an ICMP
Redirect message to server A. Setting send_redirects to 0 disables
this behavior.

First, generate IpInAddrErrors.
On server B, we disable IP forwarding::

  $ sudo sysctl -w net.ipv4.conf.all.forwarding=0

On server A, we send packets to 8.8.8.8::

  $ nc -v 8.8.8.8 53

On server B, we check the output of nstat::

  $ nstat
  #kernel
  IpInReceives                    3                  0.0
  IpInAddrErrors                  3                  0.0
  IpExtInOctets                   180                0.0
  IpExtInNoECTPkts                3                  0.0

As we have let server A route 8.8.8.8 to server B and disabled IP
forwarding on server B, server A sent packets to server B, then server B
dropped the packets and increased IpInAddrErrors. As the nc command
re-sends the SYN packet if it doesn't receive a SYN+ACK, we can find
multiple IpInAddrErrors.

Second, generate IpExtInNoRoutes. On server B, we enable IP
forwarding::

  $ sudo sysctl -w net.ipv4.conf.all.forwarding=1

Check the route table of server B and remove the default route::

  $ ip route show
  default via 192.168.122.1 dev ens3 proto static
  192.168.122.0/24 dev ens3 proto kernel scope link src 192.168.122.251
  $ sudo ip route delete default via 192.168.122.1 dev ens3 proto static

On server A, we contact 8.8.8.8 again::

  $ nc -v 8.8.8.8 53
  nc: connect to 8.8.8.8 port 53 (tcp) failed: Network is unreachable

On server B, run nstat::

  $ nstat
  #kernel
  IpInReceives                    1                  0.0
  IpOutRequests                   1                  0.0
  IcmpOutMsgs                     1                  0.0
  IcmpOutDestUnreachs             1                  0.0
  IcmpMsgOutType3                 1                  0.0
  IpExtInNoRoutes                 1                  0.0
  IpExtInOctets                   60                 0.0
  IpExtOutOctets                  88                 0.0
  IpExtInNoECTPkts                1                  0.0

We enabled IP forwarding on server B, so when server B received a packet
whose destination IP address is 8.8.8.8, it tried to forward
this packet. We had deleted the default route, so there was no route for
8.8.8.8, and server B increased IpExtInNoRoutes and sent the "ICMP
Destination Unreachable" message to server A.

Third, generate IpOutNoRoutes.
Run the ping command on server B::

  $ ping -c 1 8.8.8.8
  connect: Network is unreachable

Run nstat on server B::

  $ nstat
  #kernel
  IpOutNoRoutes                   1                  0.0

We have deleted the default route on server B. Server B couldn't find
a route for the 8.8.8.8 IP address, so server B increased
IpOutNoRoutes.

TcpExtTCPACKSkippedSynRecv
--------------------------
In this test, we send 3 identical SYN packets from client to server. The
first SYN lets the server create a socket, set it to Syn-Recv status,
and reply with a SYN/ACK. The second SYN lets the server reply with the
SYN/ACK again and record the reply time (the duplicate ACK reply time). The
third SYN lets the server check the previous duplicate ACK reply time
and decide to skip the duplicate ACK, then increase the
TcpExtTCPACKSkippedSynRecv counter.

Run tcpdump to capture a SYN packet::

  nstatuser@nstat-a:~$ sudo tcpdump -c 1 -w /tmp/syn.pcap port 9000
  tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes

Open another terminal, run the nc command::

  nstatuser@nstat-a:~$ nc nstat-b 9000

As nstat-b didn't listen on port 9000, it should reply with a RST, and
the nc command exited immediately. That was enough for the tcpdump
command to capture a SYN packet. A Linux server might use hardware
offload for the TCP checksum, so the checksum in /tmp/syn.pcap
might not be correct.
We call tcprewrite to fix it::

  nstatuser@nstat-a:~$ tcprewrite --infile=/tmp/syn.pcap --outfile=/tmp/syn_fixcsum.pcap --fixcsum

On nstat-b, we run nc to listen on port 9000::

  nstatuser@nstat-b:~$ nc -lkv 9000
  Listening on [0.0.0.0] (family 0, port 9000)

On nstat-a, we block packets from port 9000, otherwise nstat-a would send
a RST to nstat-b::

  nstatuser@nstat-a:~$ sudo iptables -A INPUT -p tcp --sport 9000 -j DROP

Send the SYN to nstat-b 3 times::

  nstatuser@nstat-a:~$ for i in {1..3}; do sudo tcpreplay -i ens3 /tmp/syn_fixcsum.pcap; done

Check the snmp counter on nstat-b::

  nstatuser@nstat-b:~$ nstat | grep -i skip
  TcpExtTCPACKSkippedSynRecv      1                  0.0

As we expected, TcpExtTCPACKSkippedSynRecv is 1.

TcpExtTCPACKSkippedPAWS
-----------------------
To trigger PAWS, we could send an old SYN.

On nstat-b, let nc listen on port 9000::

  nstatuser@nstat-b:~$ nc -lkv 9000
  Listening on [0.0.0.0] (family 0, port 9000)

On nstat-a, run tcpdump to capture a SYN::

  nstatuser@nstat-a:~$ sudo tcpdump -w /tmp/paws_pre.pcap -c 1 port 9000
  tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes

On nstat-a, run nc as a client to connect to nstat-b::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!

Now tcpdump has captured the SYN and exited.
We should fix the
checksum::

  nstatuser@nstat-a:~$ tcprewrite --infile /tmp/paws_pre.pcap --outfile /tmp/paws.pcap --fixcsum

Send the SYN packet twice::

  nstatuser@nstat-a:~$ for i in {1..2}; do sudo tcpreplay -i ens3 /tmp/paws.pcap; done

On nstat-b, check the snmp counter::

  nstatuser@nstat-b:~$ nstat | grep -i skip
  TcpExtTCPACKSkippedPAWS         1                  0.0

We sent two SYNs via tcpreplay, and both of them failed the PAWS
check. nstat-b replied with an ACK for the first SYN, skipped the ACK
for the second SYN, and updated TcpExtTCPACKSkippedPAWS.

TcpExtTCPACKSkippedSeq
----------------------
To trigger TcpExtTCPACKSkippedSeq, we send packets which have a valid
timestamp (to pass the PAWS check) but whose sequence number is out of
window. The Linux TCP stack does not skip the ACK if the packet has
data, so we need a pure ACK packet. To generate such a packet, we
can create two sockets: one on port 9000, another on port 9001. Then
we capture an ACK on port 9001 and change the source/destination port
numbers to match the port 9000 socket. Then we can trigger
TcpExtTCPACKSkippedSeq via this packet.

On nstat-b, open two terminals and run two nc commands to listen on both
port 9000 and port 9001::

  nstatuser@nstat-b:~$ nc -lkv 9000
  Listening on [0.0.0.0] (family 0, port 9000)

  nstatuser@nstat-b:~$ nc -lkv 9001
  Listening on [0.0.0.0] (family 0, port 9001)

On nstat-a, run two nc clients::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!

  nstatuser@nstat-a:~$ nc -v nstat-b 9001
  Connection to nstat-b 9001 port [tcp/*] succeeded!

On nstat-a, run tcpdump to capture an ACK::

  nstatuser@nstat-a:~$ sudo tcpdump -w /tmp/seq_pre.pcap -c 1 dst port 9001
  tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes

On nstat-b, send a packet via the port 9001 socket. E.g. we sent the
string 'foo' in our example::

  nstatuser@nstat-b:~$ nc -lkv 9001
  Listening on [0.0.0.0] (family 0, port 9001)
  Connection from nstat-a 42132 received!
  foo

On nstat-a, tcpdump should have captured the ACK. We should check
the source port numbers of the two nc clients::

  nstatuser@nstat-a:~$ ss -ta '( dport = :9000 || dport = :9001 )' | tee
  State  Recv-Q   Send-Q         Local Address:Port           Peer Address:Port
  ESTAB  0        0            192.168.122.250:50208       192.168.122.251:9000
  ESTAB  0        0            192.168.122.250:42132       192.168.122.251:9001

Run tcprewrite to change port 9001 to port 9000 and port 42132 to
port 50208::

  nstatuser@nstat-a:~$ tcprewrite --infile /tmp/seq_pre.pcap --outfile /tmp/seq.pcap -r 9001:9000 -r 42132:50208 --fixcsum

Now /tmp/seq.pcap is the packet we need. Send it to nstat-b::

  nstatuser@nstat-a:~$ for i in {1..2}; do sudo tcpreplay -i ens3 /tmp/seq.pcap; done

Check TcpExtTCPACKSkippedSeq on nstat-b::

  nstatuser@nstat-b:~$ nstat | grep -i skip
  TcpExtTCPACKSkippedSeq          1                  0.0
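
When repeating these tests, it can be convenient to compare counters
programmatically instead of reading the nstat output by eye. Below is a
minimal sketch, assuming the three-column layout shown in the examples
(counter name, value, rate). ``parse_nstat`` is a hypothetical helper
written for this document and is not part of iproute2::

  def parse_nstat(text):
      """Map counter names to integer values from nstat output."""
      counters = {}
      for line in text.splitlines():
          fields = line.split()
          # Skip the "#kernel" header, blank lines and shell prompts.
          if len(fields) != 3 or line.startswith("#"):
              continue
          name, value, _rate = fields
          if value.isdigit():
              counters[name] = int(value)
      return counters


  sample = "#kernel\nTcpExtTCPACKSkippedSeq 1 0.0\n"
  print(parse_nstat(sample))

A test script could feed the captured nstat output to this helper and
check, for example, that the 'TcpExtTCPACKSkippedSeq' entry equals 1
after the replay.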