============
SNMP counter
============

This document explains the meaning of SNMP counters.

General IPv4 counters
=====================
All layer 4 packets and ICMP packets will change these counters, but
these counters won't be changed by layer 2 packets (such as STP) or
ARP packets.

* IpInReceives

Defined in `RFC1213 ipInReceives`_

.. _RFC1213 ipInReceives: https://tools.ietf.org/html/rfc1213#page-26

The number of packets received by the IP layer. It is increased at the
beginning of the ip_rcv function and is always updated together with
IpExtInOctets. It is increased even if the packet is dropped later
(e.g. because the IP header is invalid or the checksum is wrong). It
counts the aggregated segments after GRO/LRO.

* IpInDelivers

Defined in `RFC1213 ipInDelivers`_

.. _RFC1213 ipInDelivers: https://tools.ietf.org/html/rfc1213#page-28

The number of packets delivered to the upper layer protocols, e.g. TCP,
UDP, ICMP and so on. If no one listens on a raw socket, only packets of
kernel supported protocols are delivered. If someone listens on a raw
socket, all valid IP packets are delivered.

* IpOutRequests

Defined in `RFC1213 ipOutRequests`_

.. _RFC1213 ipOutRequests: https://tools.ietf.org/html/rfc1213#page-28

The number of packets sent via the IP layer, for both unicast and
multicast packets. It is always updated together with
IpExtOutOctets.

* IpExtInOctets and IpExtOutOctets

They are Linux kernel extensions and have no RFC definitions. Please
note that RFC1213 does define ifInOctets and ifOutOctets, but they are
different things. ifInOctets and ifOutOctets include the MAC layer
header size, while IpExtInOctets and IpExtOutOctets don't: they only
include the IP layer header and the IP layer data.

* IpExtInNoECTPkts, IpExtInECT1Pkts, IpExtInECT0Pkts, IpExtInCEPkts

They indicate the number of four kinds of ECN IP packets, please refer
to `Explicit Congestion Notification`_ for more details.

.. _Explicit Congestion Notification: https://tools.ietf.org/html/rfc3168#page-6

These 4 counters count how many packets were received per ECN
status. They count the real number of frames regardless of LRO/GRO. So
for the same packet, you might find that IpInReceives counts 1, but
IpExtInNoECTPkts counts 2 or more.

* IpInHdrErrors

Defined in `RFC1213 ipInHdrErrors`_. It indicates that the packet was
dropped due to an IP header error. It might happen in both the IP input
and the IP forward paths.

.. _RFC1213 ipInHdrErrors: https://tools.ietf.org/html/rfc1213#page-27

* IpInAddrErrors

Defined in `RFC1213 ipInAddrErrors`_. It is increased in two
scenarios: (1) The IP address is invalid. (2) The destination IP
address is not a local address and IP forwarding is not enabled.

.. _RFC1213 ipInAddrErrors: https://tools.ietf.org/html/rfc1213#page-27

* IpExtInNoRoutes

This counter means the packet was dropped because the IP stack received
it and couldn't find a route for it in the route table. It might
happen when IP forwarding is enabled, the destination IP address is
not a local address and there is no route for the destination IP
address.

* IpInUnknownProtos

Defined in `RFC1213 ipInUnknownProtos`_. It is increased if the
layer 4 protocol is unsupported by the kernel. If an application is
using a raw socket, the kernel always delivers the packet to the raw
socket and this counter is not increased.

.. _RFC1213 ipInUnknownProtos: https://tools.ietf.org/html/rfc1213#page-27

* IpExtInTruncatedPkts

For an IPv4 packet, it means the actual data size is smaller than the
"Total Length" field in the IPv4 header.

* IpInDiscards

Defined in `RFC1213 ipInDiscards`_. It indicates that the packet was
dropped in the IP receiving path due to kernel internal reasons (e.g.
not enough memory).

.. _RFC1213 ipInDiscards: https://tools.ietf.org/html/rfc1213#page-28

* IpOutDiscards

Defined in `RFC1213 ipOutDiscards`_. It indicates that the packet was
dropped in the IP sending path due to kernel internal reasons.

.. _RFC1213 ipOutDiscards: https://tools.ietf.org/html/rfc1213#page-28

* IpOutNoRoutes

Defined in `RFC1213 ipOutNoRoutes`_. It indicates that the packet was
dropped in the IP sending path because no route was found for it.

.. _RFC1213 ipOutNoRoutes: https://tools.ietf.org/html/rfc1213#page-29

ICMP counters
=============
* IcmpInMsgs and IcmpOutMsgs

Defined by `RFC1213 icmpInMsgs`_ and `RFC1213 icmpOutMsgs`_

.. _RFC1213 icmpInMsgs: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpOutMsgs: https://tools.ietf.org/html/rfc1213#page-43

As mentioned in RFC1213, these two counters include errors: they
are increased even if the ICMP packet has an invalid type. The
ICMP output path checks the header of a raw socket, so
IcmpOutMsgs is still updated if the IP header is constructed by
a userspace program.
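
The counter names used in this document (IpInReceives, IcmpInMsgs and
so on) join a protocol prefix and a field name. A minimal sketch of how
such names can be derived from text laid out like /proc/net/snmp is
shown below. The function name parse_snmp_text and the sample text are
hypothetical, and the sample values are made up for illustration; a real
file carries many more fields per protocol.

```python
def parse_snmp_text(text):
    """Parse text in the /proc/net/snmp layout into {name: value}.

    Each protocol is described by two consecutive lines that share a
    prefix: one line with field names and one line with the values.
    """
    counters = {}
    lines = [line.split() for line in text.splitlines() if line.strip()]
    # Header lines sit at even positions, value lines at odd positions.
    for header, values in zip(lines[0::2], lines[1::2]):
        if header[0] != values[0]:
            raise ValueError("prefix mismatch: %r vs %r" % (header[0], values[0]))
        prefix = header[0].rstrip(":")
        for name, value in zip(header[1:], values[1:]):
            counters[prefix + name] = int(value)
    return counters


# Made-up, heavily shortened sample (assumption: illustrative values only).
SAMPLE = """\
Ip: InReceives InDelivers OutRequests
Ip: 120 118 97
Icmp: InMsgs InErrors OutMsgs
Icmp: 4 0 4
"""

counters = parse_snmp_text(SAMPLE)
print(counters["IpInReceives"], counters["IcmpInMsgs"])  # prints: 120 4
```

The same pairing of header and value lines applies to the extension
counters (the IpExt and TcpExt prefixes), which the kernel exposes in
/proc/net/netstat.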

* ICMP named types

| These counters cover the most common ICMP types:
| IcmpInDestUnreachs: `RFC1213 icmpInDestUnreachs`_
| IcmpInTimeExcds: `RFC1213 icmpInTimeExcds`_
| IcmpInParmProbs: `RFC1213 icmpInParmProbs`_
| IcmpInSrcQuenchs: `RFC1213 icmpInSrcQuenchs`_
| IcmpInRedirects: `RFC1213 icmpInRedirects`_
| IcmpInEchos: `RFC1213 icmpInEchos`_
| IcmpInEchoReps: `RFC1213 icmpInEchoReps`_
| IcmpInTimestamps: `RFC1213 icmpInTimestamps`_
| IcmpInTimestampReps: `RFC1213 icmpInTimestampReps`_
| IcmpInAddrMasks: `RFC1213 icmpInAddrMasks`_
| IcmpInAddrMaskReps: `RFC1213 icmpInAddrMaskReps`_
| IcmpOutDestUnreachs: `RFC1213 icmpOutDestUnreachs`_
| IcmpOutTimeExcds: `RFC1213 icmpOutTimeExcds`_
| IcmpOutParmProbs: `RFC1213 icmpOutParmProbs`_
| IcmpOutSrcQuenchs: `RFC1213 icmpOutSrcQuenchs`_
| IcmpOutRedirects: `RFC1213 icmpOutRedirects`_
| IcmpOutEchos: `RFC1213 icmpOutEchos`_
| IcmpOutEchoReps: `RFC1213 icmpOutEchoReps`_
| IcmpOutTimestamps: `RFC1213 icmpOutTimestamps`_
| IcmpOutTimestampReps: `RFC1213 icmpOutTimestampReps`_
| IcmpOutAddrMasks: `RFC1213 icmpOutAddrMasks`_
| IcmpOutAddrMaskReps: `RFC1213 icmpOutAddrMaskReps`_

.. _RFC1213 icmpInDestUnreachs: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpInTimeExcds: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpInParmProbs: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInSrcQuenchs: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInRedirects: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInEchos: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInEchoReps: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInTimestamps: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInTimestampReps: https://tools.ietf.org/html/rfc1213#page-43
.. _RFC1213 icmpInAddrMasks: https://tools.ietf.org/html/rfc1213#page-43
.. _RFC1213 icmpInAddrMaskReps: https://tools.ietf.org/html/rfc1213#page-43

.. _RFC1213 icmpOutDestUnreachs: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutTimeExcds: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutParmProbs: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutSrcQuenchs: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutRedirects: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutEchos: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutEchoReps: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutTimestamps: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutTimestampReps: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutAddrMasks: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutAddrMaskReps: https://tools.ietf.org/html/rfc1213#page-46

Every ICMP type has two counters: 'In' and 'Out'. E.g., for the ICMP
Echo packet, they are IcmpInEchos and IcmpOutEchos. Their meanings are
straightforward. The 'In' counter means the kernel receives such a
packet and the 'Out' counter means the kernel sends such a packet.

* ICMP numeric types

They are IcmpMsgInType[N] and IcmpMsgOutType[N], where [N] indicates the
ICMP type number. These counters track all kinds of ICMP packets. The
ICMP type number definitions can be found in the `ICMP parameters`_
document.

.. _ICMP parameters: https://www.iana.org/assignments/icmp-parameters/icmp-parameters.xhtml

For example, if the Linux kernel sends an ICMP Echo packet,
IcmpMsgOutType8 increases by 1. If the kernel receives an ICMP Echo
Reply packet, IcmpMsgInType0 increases by 1.

* IcmpInCsumErrors

This counter indicates that the checksum of the ICMP packet is
wrong. The kernel verifies the checksum after updating IcmpInMsgs and
before updating IcmpMsgInType[N]. If a packet has a bad checksum,
IcmpInMsgs is updated but none of the IcmpMsgInType[N] counters is
updated.

* IcmpInErrors and IcmpOutErrors

Defined by `RFC1213 icmpInErrors`_ and `RFC1213 icmpOutErrors`_

.. _RFC1213 icmpInErrors: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpOutErrors: https://tools.ietf.org/html/rfc1213#page-43

When an error occurs in the ICMP packet handler path, these two
counters are updated. The receiving packet path uses IcmpInErrors
and the sending packet path uses IcmpOutErrors. When IcmpInCsumErrors
is increased, IcmpInErrors is always increased too.

Relationship of the ICMP counters
---------------------------------
The sum of IcmpMsgOutType[N] is always equal to IcmpOutMsgs, as they
are updated at the same time. The sum of IcmpMsgInType[N] plus
IcmpInErrors should be equal to or larger than IcmpInMsgs. When the
kernel receives an ICMP packet, it follows the logic below:

1. increase IcmpInMsgs
2. if there is any error, update IcmpInErrors and finish the process
3. update IcmpMsgInType[N]
4. handle the packet depending on the type, if there is any error,
   update IcmpInErrors and finish the process

So if all errors occur in step (2), IcmpInMsgs should be equal to the
sum of IcmpMsgInType[N] plus IcmpInErrors. If all errors occur in
step (4), IcmpInMsgs should be equal to the sum of
IcmpMsgInType[N]. If errors occur in both step (2) and step (4),
IcmpInMsgs should be less than the sum of IcmpMsgInType[N] plus
IcmpInErrors.

General TCP counters
====================
* TcpInSegs

Defined in `RFC1213 tcpInSegs`_

.. _RFC1213 tcpInSegs: https://tools.ietf.org/html/rfc1213#page-48

The number of packets received by the TCP layer. As mentioned in
RFC1213, it includes the packets received in error, such as checksum
error, invalid TCP header and so on. Only one kind of error is not
included: the layer 2 destination address is not the NIC's layer 2
address. It might happen if the packet is a multicast or broadcast
packet, or the NIC is in promiscuous mode. In these situations, the
packets are delivered to the TCP layer, but the TCP layer discards
these packets before increasing TcpInSegs. The TcpInSegs counter
isn't aware of GRO. So if two packets are merged by GRO, the TcpInSegs
counter only increases by 1.

* TcpOutSegs

Defined in `RFC1213 tcpOutSegs`_

.. _RFC1213 tcpOutSegs: https://tools.ietf.org/html/rfc1213#page-48

The number of packets sent by the TCP layer. As mentioned in RFC1213,
it excludes the retransmitted packets. But it includes the SYN, ACK
and RST packets. Unlike TcpInSegs, TcpOutSegs is aware of
GSO, so if a packet is split into 2 by GSO, TcpOutSegs
increases by 2.

* TcpActiveOpens

Defined in `RFC1213 tcpActiveOpens`_

.. _RFC1213 tcpActiveOpens: https://tools.ietf.org/html/rfc1213#page-47

It means the TCP layer sends a SYN and enters the SYN-SENT
state. Every time TcpActiveOpens increases by 1, TcpOutSegs should
also increase by 1.

* TcpPassiveOpens

Defined in `RFC1213 tcpPassiveOpens`_

.. _RFC1213 tcpPassiveOpens: https://tools.ietf.org/html/rfc1213#page-47

It means the TCP layer receives a SYN, replies with a SYN+ACK and
enters the SYN-RCVD state.

* TcpExtTCPRcvCoalesce

When packets are received by the TCP layer and have not yet been read
by the application, the TCP layer tries to merge them. This counter
indicates how many packets are merged in such a situation. If GRO is
enabled, lots of packets are merged by GRO, and these packets
are not counted in TcpExtTCPRcvCoalesce.

* TcpExtTCPAutoCorking

When sending packets, the TCP layer tries to merge small packets into
a bigger one. This counter increases by 1 for every packet merged in
such a situation. Please refer to the LWN article for more details:
https://lwn.net/Articles/576263/

* TcpExtTCPOrigDataSent

This counter is explained by `kernel commit f19c29e3e391`_, the
explanation is quoted below::

  TCPOrigDataSent: number of outgoing packets with original data (excluding
  retransmission but including data-in-SYN). This counter is different from
  TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is
  more useful to track the TCP retransmission rate.
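
As a rough illustration of the last sentence of the quote, the sketch
below estimates a retransmission rate over an interval from two counter
snapshots. It assumes the snapshots are plain dictionaries keyed by
counter name, and it uses TcpRetransSegs (the RFC1213 tcpRetransSegs
counter, which this document does not describe) as the numerator. The
function name and the sample numbers are hypothetical.

```python
def retransmission_rate(before, after):
    """Return retransmitted segments per original data segment.

    `before` and `after` are counter snapshots (dicts) taken at the
    start and end of the measurement interval.
    """
    retrans = after["TcpRetransSegs"] - before["TcpRetransSegs"]
    orig = after["TcpExtTCPOrigDataSent"] - before["TcpExtTCPOrigDataSent"]
    if orig <= 0:
        # No original data was sent in the interval, the rate is undefined.
        return None
    return retrans / orig


# Made-up snapshots for illustration only.
before = {"TcpRetransSegs": 10, "TcpExtTCPOrigDataSent": 1000}
after = {"TcpRetransSegs": 15, "TcpExtTCPOrigDataSent": 1500}
rate = retransmission_rate(before, after)
```

Using TcpOutSegs as the denominator instead would dilute the rate,
because TcpOutSegs also counts pure ACKs.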

* TCPSynRetrans

This counter is explained by `kernel commit f19c29e3e391`_, the
explanation is quoted below::

  TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
  retransmissions into SYN, fast-retransmits, timeout retransmits, etc.

* TCPFastOpenActiveFail

This counter is explained by `kernel commit f19c29e3e391`_, the
explanation is quoted below::

  TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed because
  the remote does not accept it or the attempts timed out.

.. _kernel commit f19c29e3e391: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f19c29e3e391a66a273e9afebaf01917245148cd

* TcpExtListenOverflows and TcpExtListenDrops

When the kernel receives a SYN from a client and the TCP accept queue
is full, the kernel drops the SYN and adds 1 to TcpExtListenOverflows.
At the same time the kernel also adds 1 to TcpExtListenDrops. Whenever
a TCP socket is in the LISTEN state and the kernel needs to drop a
packet, the kernel adds 1 to TcpExtListenDrops. So an increase of
TcpExtListenOverflows always comes with an increase of
TcpExtListenDrops, but TcpExtListenDrops can also increase without
TcpExtListenOverflows increasing, e.g. a memory allocation failure
also increases TcpExtListenDrops.
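
Because every overflow is also counted as a drop, the difference between
the two counters gives the drops that had some other cause. The sketch
below shows this arithmetic on two counter snapshots. The function name,
the dictionary layout and the sample numbers are hypothetical.

```python
def listen_drop_breakdown(before, after):
    """Split LISTEN-state drops over an interval by cause."""
    drops = after["TcpExtListenDrops"] - before["TcpExtListenDrops"]
    overflows = after["TcpExtListenOverflows"] - before["TcpExtListenOverflows"]
    return {
        # Drops caused by a full accept queue.
        "accept_queue_full": overflows,
        # Remaining drops, e.g. memory allocation failures.
        "other_reasons": drops - overflows,
    }


# Made-up snapshots for illustration only.
before = {"TcpExtListenDrops": 40, "TcpExtListenOverflows": 30}
after = {"TcpExtListenDrops": 52, "TcpExtListenOverflows": 39}
breakdown = listen_drop_breakdown(before, after)
```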

Note: The above explanation is based on kernel 4.10 or later. On
an older kernel, the TCP stack behaves differently when the TCP accept
queue is full. On the old kernel, the TCP stack doesn't drop the SYN,
it completes the 3-way handshake. As the accept queue is full, the TCP
stack keeps the socket in the TCP half-open queue. As the socket is in
the half-open queue, the TCP stack sends SYN+ACK on an exponential
backoff timer. After the client replies with an ACK, the TCP stack
checks whether the accept queue is still full. If it is not full, the
socket is moved to the accept queue. If it is full, the socket stays in
the half-open queue, and the next time the client replies with an ACK,
the socket gets another chance to move to the accept queue.


TCP Fast Open
=============
* TcpEstabResets

Defined in `RFC1213 tcpEstabResets`_. The socket receives a RST packet
in the ESTABLISHED or CLOSE_WAIT state.

.. _RFC1213 tcpEstabResets: https://tools.ietf.org/html/rfc1213#page-48

* TcpAttemptFails

Defined in `RFC1213 tcpAttemptFails`_.

.. _RFC1213 tcpAttemptFails: https://tools.ietf.org/html/rfc1213#page-48

* TcpOutRsts

Defined in `RFC1213 tcpOutRsts`_. The RFC says this counter indicates
the 'segments sent containing the RST flag', but in the Linux kernel,
this counter indicates the segments the kernel tried to send. The
sending process might fail due to some errors (e.g. a memory allocation
failure).

.. _RFC1213 tcpOutRsts: https://tools.ietf.org/html/rfc1213#page-52

* TcpExtTCPSpuriousRtxHostQueues

When the TCP stack wants to retransmit a packet and finds that the
packet is not lost in the network but has not been sent yet, the TCP
stack gives up the retransmission and updates this counter. It
might happen if a packet stays too long in a qdisc or driver
queue.

* TcpExtTCPKeepAlive

This counter indicates how many keepalive packets were sent. Keepalive
is not enabled by default. A userspace program can enable it by
setting the SO_KEEPALIVE socket option.

* TcpExtTCPSpuriousRTOs

The spurious retransmission timeouts detected by the `F-RTO`_
algorithm.

.. _F-RTO: https://tools.ietf.org/html/rfc5682

TCP Fast Path
=============
When the kernel receives a TCP packet, it has two paths to handle the
packet: one is the fast path, the other is the slow path. The comment
in the kernel code provides a good explanation of them, it is quoted
below::

  It is split into a fast path and a slow path. The fast path is
  disabled when:

  - A zero window was announced from us
  - zero window probing
    is only handled properly on the slow path.
  - Out of order segments arrived.
  - Urgent data is expected.
  - There is no buffer space left
  - Unexpected TCP flags/window values/header lengths are received
    (detected by checking the TCP header against pred_flags)
  - Data is sent in both directions. The fast path only supports pure senders
    or pure receivers (this means either the sequence number or the ack
    value must stay constant)
  - Unexpected TCP option.

The kernel tries to use the fast path unless any of the above
conditions is satisfied. If the packets are out of order, the kernel
handles them in the slow path, which means the performance might not be
very good. The kernel also enters the slow path if "Delayed ack" is
used, because when using "Delayed ack", the data is sent in both
directions. When the TCP window scale option is not used, the kernel
tries to enable the fast path immediately when the connection enters
the established state. If the TCP window scale option is used, the
kernel disables the fast path at first and tries to enable it after
the kernel receives packets.

* TcpExtTCPPureAcks and TcpExtTCPHPAcks

If a packet has the ACK flag set and carries no data, it is a pure ACK
packet. If the kernel handles it in the fast path, TcpExtTCPHPAcks
increases by 1. If the kernel handles it in the slow path,
TcpExtTCPPureAcks increases by 1.
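
Since each pure ACK is counted in exactly one of the two counters, their
ratio gives a rough picture of how often pure ACKs take the fast path.
The sketch below computes that share from a counter snapshot. The
function name, the dictionary layout and the sample numbers are
hypothetical.

```python
def pure_ack_fast_path_share(counters):
    """Return the fraction of pure ACKs handled in the fast path."""
    fast = counters["TcpExtTCPHPAcks"]
    slow = counters["TcpExtTCPPureAcks"]
    total = fast + slow
    if total == 0:
        # No pure ACK has been counted yet.
        return None
    return fast / total


# Made-up snapshot for illustration only.
share = pure_ack_fast_path_share(
    {"TcpExtTCPHPAcks": 300, "TcpExtTCPPureAcks": 100}
)
```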

* TcpExtTCPHPHits

If a TCP packet has data (which means it is not a pure ACK packet)
and this packet is handled in the fast path, TcpExtTCPHPHits
increases by 1.


TCP abort
=========
* TcpExtTCPAbortOnData

It means the TCP layer has data in flight, but needs to close the
connection. So the TCP layer sends a RST to the other side, indicating
that the connection was not closed gracefully. An easy way to increase
this counter is using the SO_LINGER option. Please refer to the
SO_LINGER section of the `socket man page`_:

.. _socket man page: http://man7.org/linux/man-pages/man7/socket.7.html

By default, when an application closes a connection, the close function
returns immediately and the kernel tries to send the in-flight data
asynchronously. If you use the SO_LINGER option, set l_onoff to 1 and
l_linger to a positive number, the close function doesn't return
immediately, but waits until the in-flight data is acked by the other
side. The maximum wait time is l_linger seconds. If you set l_onoff to
1 and l_linger to 0, when the application closes a connection, the
kernel sends a RST immediately and increases the TcpExtTCPAbortOnData
counter.

* TcpExtTCPAbortOnClose

This counter means the application has unread data in the TCP layer when
the application wants to close the TCP connection. In such a situation,
the kernel sends a RST to the other side of the TCP connection.

* TcpExtTCPAbortOnMemory

When an application closes a TCP connection, the kernel still needs to
track the connection and let it complete the TCP disconnect process.
E.g. an app calls the close method of a socket and the kernel sends a
FIN to the other side of the connection. The app then has no
relationship with the socket any more, but the kernel needs to keep the
socket. This socket becomes an orphan socket: the kernel waits for the
reply of the other side and finally enters the TIME_WAIT state. When
the kernel doesn't have enough memory to keep the orphan socket, it
sends a RST to the other side and deletes the socket. In such a
situation, the kernel adds 1 to TcpExtTCPAbortOnMemory. Two conditions
trigger TcpExtTCPAbortOnMemory:

1. the memory used by the TCP protocol is higher than the third value of
   tcp_mem. Please refer to the tcp_mem section in the `TCP man page`_:

.. _TCP man page: http://man7.org/linux/man-pages/man7/tcp.7.html

2. the orphan socket count is higher than net.ipv4.tcp_max_orphans


* TcpExtTCPAbortOnTimeout

This counter increases when any of the TCP timers expires. In such a
situation, the kernel doesn't send a RST, it just gives up the
connection.
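
The SO_LINGER setting described for TcpExtTCPAbortOnData earlier in this
section can be sketched as follows. This is a minimal example that
assumes Linux, where struct linger consists of two C ints (l_onoff and
l_linger). The helper names are hypothetical, and the socket is never
connected, so the example only sets and reads back the option.

```python
import socket
import struct


def enable_abortive_close(sock):
    """Set SO_LINGER with l_onoff=1 and l_linger=0 on a socket."""
    # struct linger on Linux: two C ints, l_onoff followed by l_linger.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))


def get_linger(sock):
    """Read back (l_onoff, l_linger) from a socket."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                          struct.calcsize("ii"))
    return struct.unpack("ii", raw)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_abortive_close(sock)
l_onoff, l_linger = get_linger(sock)
sock.close()
```

On a connected socket configured this way, closing the socket is the
step that, per the description above, makes the kernel send a RST.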
51580cc4950Syupeng 51680cc4950Syupeng* TcpExtTCPAbortOnLinger 517ae5220c6SRandy Dunlap 51880cc4950SyupengWhen a TCP connection comes into FIN_WAIT_2 state, instead of waiting 51980cc4950Syupengfor the fin packet from the other side, kernel could send a RST and 52080cc4950Syupengdelete the socket immediately. This is not the default behavior of 52180cc4950SyupengLinux kernel TCP stack. By configuring the TCP_LINGER2 socket option, 52280cc4950Syupengyou could let kernel follow this behavior. 52380cc4950Syupeng 52480cc4950Syupeng* TcpExtTCPAbortFailed 525ae5220c6SRandy Dunlap 52680cc4950SyupengThe kernel TCP layer will send RST if the `RFC2525 2.17 section`_ is 52780cc4950Syupengsatisfied. If an internal error occurs during this process, 52880cc4950SyupengTcpExtTCPAbortFailed will be increased. 52980cc4950Syupeng 53080cc4950Syupeng.. _RFC2525 2.17 section: https://tools.ietf.org/html/rfc2525#page-50 53180cc4950Syupeng 532712ee16cSyupengTCP Hybrid Slow Start 533ae5220c6SRandy Dunlap===================== 534712ee16cSyupengThe Hybrid Slow Start algorithm is an enhancement of the traditional 535712ee16cSyupengTCP congestion window Slow Start algorithm. It uses two pieces of 536712ee16cSyupenginformation to detect whether the max bandwidth of the TCP path is 537712ee16cSyupengapproached. The two pieces of information are ACK train length and 538712ee16cSyupengincrease in packet delay. For detail information, please refer the 539712ee16cSyupeng`Hybrid Slow Start paper`_. Either ACK train length or packet delay 540712ee16cSyupenghits a specific threshold, the congestion control algorithm will come 541712ee16cSyupenginto the Congestion Avoidance state. Until v4.20, two congestion 542712ee16cSyupengcontrol algorithms are using Hybrid Slow Start, they are cubic (the 543712ee16cSyupengdefault congestion control algorithm) and cdg. Four snmp counters 544712ee16cSyupengrelate with the Hybrid Slow Start algorithm. 545712ee16cSyupeng 546712ee16cSyupeng.. 
_Hybrid Slow Start paper: https://pdfs.semanticscholar.org/25e9/ef3f03315782c7f1cbcd31b587857adae7d1.pdf 547712ee16cSyupeng 548712ee16cSyupeng* TcpExtTCPHystartTrainDetect 549ae5220c6SRandy Dunlap 550712ee16cSyupengHow many times the ACK train length threshold is detected 551712ee16cSyupeng 552712ee16cSyupeng* TcpExtTCPHystartTrainCwnd 553ae5220c6SRandy Dunlap 554712ee16cSyupengThe sum of CWND detected by ACK train length. Dividing this value by 555712ee16cSyupengTcpExtTCPHystartTrainDetect is the average CWND which detected by the 556712ee16cSyupengACK train length. 557712ee16cSyupeng 558712ee16cSyupeng* TcpExtTCPHystartDelayDetect 559ae5220c6SRandy Dunlap 560712ee16cSyupengHow many times the packet delay threshold is detected. 561712ee16cSyupeng 562712ee16cSyupeng* TcpExtTCPHystartDelayCwnd 563ae5220c6SRandy Dunlap 564712ee16cSyupengThe sum of CWND detected by packet delay. Dividing this value by 565712ee16cSyupengTcpExtTCPHystartDelayDetect is the average CWND which detected by the 566712ee16cSyupengpacket delay. 567712ee16cSyupeng 5688e2ea53aSyupengTCP retransmission and congestion control 569ae5220c6SRandy Dunlap========================================= 5708e2ea53aSyupengThe TCP protocol has two retransmission mechanisms: SACK and fast 5718e2ea53aSyupengrecovery. They are exclusive with each other. When SACK is enabled, 5728e2ea53aSyupengthe kernel TCP stack would use SACK, or kernel would use fast 5738e2ea53aSyupengrecovery. The SACK is a TCP option, which is defined in `RFC2018`_, 5748e2ea53aSyupengthe fast recovery is defined in `RFC6582`_, which is also called 5758e2ea53aSyupeng'Reno'. 5768e2ea53aSyupeng 5778e2ea53aSyupengThe TCP congestion control is a big and complex topic. To understand 5788e2ea53aSyupengthe related snmp counter, we need to know the states of the congestion 5798e2ea53aSyupengcontrol state machine. There are 5 states: Open, Disorder, CWR, 5808e2ea53aSyupengRecovery and Loss. 
For details about these states, please refer to page 5
and page 6 of this document:
https://pdfs.semanticscholar.org/0e9c/968d09ab2e53e24c4dca5b2d67c7f7140f8e.pdf

.. _RFC2018: https://tools.ietf.org/html/rfc2018
.. _RFC6582: https://tools.ietf.org/html/rfc6582

* TcpExtTCPRenoRecovery and TcpExtTCPSackRecovery

When the congestion control enters the Recovery state, if SACK is
used, TcpExtTCPSackRecovery increases by 1; if SACK is not used,
TcpExtTCPRenoRecovery increases by 1. These two counters mean the TCP
stack begins to retransmit the lost packets.

* TcpExtTCPSACKReneging

A packet was acknowledged by SACK, but the receiver has dropped this
packet, so the sender needs to retransmit it. In this
situation, the sender adds 1 to TcpExtTCPSACKReneging. A receiver
could drop a packet which has been acknowledged by SACK; although it is
unusual, it is allowed by the TCP protocol. The sender doesn't really
know what happened on the receiver side. The sender just waits until
the RTO expires for this packet, then assumes this packet
has been dropped by the receiver.

* TcpExtTCPRenoReorder

Packet reordering is detected by fast recovery. This counter is only used
if SACK is disabled. The fast recovery algorithm detects reordering by
the duplicate ACK number.
E.g., if retransmission is triggered, and
the original retransmitted packet is not lost but just out of
order, the receiver would acknowledge multiple times: once for the
retransmitted packet, and once for the arrival of the original out of
order packet. Thus the sender would find more ACKs than it
expects, and knows that reordering has occurred.

* TcpExtTCPTSReorder

Packet reordering is detected when a hole is filled. E.g., assume the
sender sends packets 1,2,3,4,5, and the receiving order is
1,2,4,5,3. When the sender receives the ACK of packet 3 (which will
fill the hole), either of two conditions will increase TcpExtTCPTSReorder
by 1: (1) packet 3 has not been retransmitted yet; (2) packet
3 has been retransmitted but the timestamp of packet 3's ACK is earlier
than the retransmission timestamp.

* TcpExtTCPSACKReorder

Packet reordering is detected by SACK. SACK has two methods to
detect reordering: (1) A DSACK is received by the sender. It means the
sender sent the same packet more than once, and the only reason
is that the sender believed an out of order packet was lost so it sent the
packet again. (2) Assume packets 1,2,3,4,5 are sent by the sender, and
the sender has received SACKs for packets 2 and 5. If the sender now
receives a SACK for packet 4 and has not retransmitted that
packet yet, the sender knows packet 4 is out of order. The kernel TCP
stack will increase TcpExtTCPSACKReorder for both of the
above scenarios.
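
The second SACK scenario above can be modeled in a few lines. This is
only a minimal sketch, not the kernel's implementation: the function
name sack_indicates_reorder and its arguments are hypothetical, and
packets are assumed to be numbered 1, 2, 3, ... as in the example.

```python
def sack_indicates_reorder(sacked, newly_sacked, retransmitted):
    """Return True if a newly SACKed packet implies reordering.

    Simplified model (an assumption for illustration): `sacked` holds
    the packet numbers already SACKed, `retransmitted` holds the packet
    numbers the sender has retransmitted so far.
    """
    if not sacked:
        return False
    # A packet below the highest SACKed packet that was never
    # retransmitted must have arrived late, i.e. out of order.
    return newly_sacked < max(sacked) and newly_sacked not in retransmitted


# Scenario from the text: SACKs for 2 and 5 received, then a SACK for 4.
print(sack_indicates_reorder({2, 5}, 4, set()))  # True
print(sack_indicates_reorder({2, 5}, 4, {4}))    # False
```

If packet 4 had already been retransmitted, the late SACK could belong
to the retransmission, so this model does not report reordering.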

* TcpExtTCPSlowStartRetrans

The TCP stack wants to retransmit a packet and the congestion control
state is 'Loss'.

* TcpExtTCPFastRetrans

The TCP stack wants to retransmit a packet and the congestion control
state is not 'Loss'.

* TcpExtTCPLostRetransmit

A SACK points out that a retransmitted packet is lost again.

* TcpExtTCPRetransFail

The TCP stack tries to deliver a retransmission packet to lower layers
but the lower layers return an error.

* TcpExtTCPSynRetrans

The TCP stack retransmits a SYN packet.

DSACK
=====
DSACK is defined in `RFC2883`_. The receiver uses DSACK to report
duplicate packets to the sender. There are two kinds of
duplications: (1) a packet which has been acknowledged is
duplicated; (2) an out of order packet is duplicated. The TCP stack
counts these two kinds of duplications on both the receiver side and
the sender side.

.. _RFC2883: https://tools.ietf.org/html/rfc2883

* TcpExtTCPDSACKOldSent

The TCP stack receives a duplicate packet which has been acked, so it
sends a DSACK to the sender.

* TcpExtTCPDSACKOfoSent

The TCP stack receives an out of order duplicate packet, so it sends a
DSACK to the sender.
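
The two receiver-side cases above can be told apart by comparing the
duplicate segment with the receiver's state. The sketch below is a
simplified model under stated assumptions, not kernel code: the helper
classify_duplicate is hypothetical, sequence ranges are half-open
(start, end) pairs, and partially overlapping segments are ignored.

```python
def classify_duplicate(seg, rcv_nxt, ofo_ranges):
    """Name the DSACK counter a duplicate segment would update.

    Hypothetical model: `rcv_nxt` is the next in-order sequence number
    the receiver expects; `ofo_ranges` lists the ranges currently held
    in the out of order queue. Returns None if `seg` is not a duplicate.
    """
    start, end = seg
    if end <= rcv_nxt:
        # The whole segment was already received in order and acked.
        return "TcpExtTCPDSACKOldSent"
    for ofo_start, ofo_end in ofo_ranges:
        if ofo_start <= start and end <= ofo_end:
            # The segment repeats data held in the out of order queue.
            return "TcpExtTCPDSACKOfoSent"
    return None


print(classify_duplicate((100, 200), 300, []))            # TcpExtTCPDSACKOldSent
print(classify_duplicate((500, 600), 300, [(400, 700)]))  # TcpExtTCPDSACKOfoSent
```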

* TcpExtTCPDSACKRecv

The TCP stack receives a DSACK, which indicates an acknowledged
duplicate packet is received.

* TcpExtTCPDSACKOfoRecv

The TCP stack receives a DSACK, which indicates an out of order
duplicate packet is received.

invalid SACK and DSACK
======================
When a SACK (or DSACK) block is invalid, a corresponding counter is
updated. The validation method is based on the start/end sequence
numbers of the SACK block. For more details, please refer to the comment
of the function tcp_is_sackblock_valid in the kernel source code. A
SACK option could have up to 4 blocks; they are checked
individually. E.g., if 3 blocks of a SACK are invalid, the
corresponding counter is updated 3 times. The comment of the
`Add counters for discarded SACK blocks`_ patch has additional
explanation.

.. _Add counters for discarded SACK blocks: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18f02545a9a16c9a89778b91a162ad16d510bb32

* TcpExtTCPSACKDiscard

This counter indicates how many SACK blocks are invalid. If the invalid
SACK block is caused by ACK reordering, the TCP stack will only ignore
it and won't update this counter.

* TcpExtTCPDSACKIgnoredOld and TcpExtTCPDSACKIgnoredNoUndo

When a DSACK block is invalid, one of these two counters is
updated.
Which counter is updated depends on the undo_marker flag
of the TCP socket. If the undo_marker is not set, the TCP stack isn't
likely to have retransmitted any packets; if we still receive an invalid
DSACK block, the reason might be that the packet was duplicated in the
middle of the network. In such a scenario, TcpExtTCPDSACKIgnoredNoUndo
is updated. If the undo_marker is set, TcpExtTCPDSACKIgnoredOld
is updated. As implied by its name, it might be an old packet.

SACK shift
==========
The Linux networking stack stores data in the sk_buff struct (skb for
short). If a SACK block spans multiple skbs, the TCP stack will try
to re-arrange data in these skbs. E.g., if a SACK block acknowledges seq
10 to 15, skb1 has seq 10 to 13, and skb2 has seq 14 to 20, then seq 14 and
15 in skb2 are moved to skb1. This operation is 'shift'. If a
SACK block acknowledges seq 10 to 20, skb1 has seq 10 to 13, and skb2 has
seq 14 to 20, then all data in skb2 is moved to skb1, and skb2 is
discarded. This operation is 'merge'.

* TcpExtTCPSackShifted

A skb is shifted.

* TcpExtTCPSackMerged

A skb is merged.

* TcpExtTCPSackShiftFallback

A skb should be shifted or merged, but the TCP stack doesn't do it for
some reason.
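
The 'shift' and 'merge' examples from this section can be reproduced
with a small model. This is a sketch under stated assumptions, not the
kernel's skb handling: the function apply_sack is hypothetical, each skb
is an inclusive (first_seq, last_seq) pair, the two skbs are adjacent,
and skb1 is assumed to be fully covered by the SACK block.

```python
def apply_sack(sack_end, skb1, skb2):
    """Model 'shift' and 'merge' for two adjacent skbs.

    `sack_end` is the last sequence number the SACK block acknowledges.
    Returns (new_skb1, new_skb2, operation); new_skb2 is None when skb2
    has been merged into skb1 and discarded.
    """
    first2, last2 = skb2
    if sack_end >= last2:
        # The SACK block covers all of skb2: move everything to skb1.
        return (skb1[0], last2), None, "merge"
    if sack_end >= first2:
        # The SACK block covers only the head of skb2: move that part.
        return (skb1[0], sack_end), (sack_end + 1, last2), "shift"
    return skb1, skb2, "none"


print(apply_sack(15, (10, 13), (14, 20)))  # ((10, 15), (16, 20), 'shift')
print(apply_sack(20, (10, 13), (14, 20)))  # ((10, 20), None, 'merge')
```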

TCP out of order
================
* TcpExtTCPOFOQueue

The TCP layer receives an out of order packet and has enough memory
to queue it.

* TcpExtTCPOFODrop

The TCP layer receives an out of order packet but doesn't have enough
memory, so it drops the packet. Such packets won't be counted in
TcpExtTCPOFOQueue.

* TcpExtTCPOFOMerge

The received out of order packet overlaps with the previous
packet. The overlapping part will be dropped. All TcpExtTCPOFOMerge
packets are also counted in TcpExtTCPOFOQueue.

TCP PAWS
========
PAWS (Protection Against Wrapped Sequence numbers) is an algorithm
which is used to drop old packets. It depends on the TCP
timestamps. For detailed information, please refer to the `timestamp wiki`_
and the `RFC of PAWS`_.

.. _RFC of PAWS: https://tools.ietf.org/html/rfc1323#page-17
.. _timestamp wiki: https://en.wikipedia.org/wiki/Transmission_Control_Protocol#TCP_timestamps

* TcpExtPAWSActive

Packets are dropped by PAWS in Syn-Sent status.

* TcpExtPAWSEstab

Packets are dropped by PAWS in any status other than Syn-Sent.

TCP ACK skip
============
In some scenarios, the kernel avoids sending duplicate ACKs too
frequently.
Please find more details in the tcp_invalid_ratelimit
section of the `sysctl document`_. When the kernel decides to skip an ACK
due to tcp_invalid_ratelimit, it updates one of the counters below
to indicate in which scenario the ACK is skipped. The ACK
is only skipped if the received packet is either a SYN packet or
has no data.

.. _sysctl document: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.rst

* TcpExtTCPACKSkippedSynRecv

The ACK is skipped in Syn-Recv status. The Syn-Recv status means the
TCP stack has received a SYN and replied with a SYN+ACK. Now the TCP stack is
waiting for an ACK. Generally, the TCP stack doesn't need to send an ACK
in the Syn-Recv status. But in several scenarios, the TCP stack needs
to send an ACK. E.g., the TCP stack receives the same SYN packet
repeatedly, the received packet does not pass the PAWS check, or the
received packet sequence number is out of window. In these scenarios,
the TCP stack needs to send an ACK. If the ACK sending frequency is higher than
tcp_invalid_ratelimit allows, the TCP stack will skip sending the ACK and
increase TcpExtTCPACKSkippedSynRecv.


* TcpExtTCPACKSkippedPAWS

The ACK is skipped because the PAWS (Protection Against Wrapped Sequence
numbers) check fails. If the PAWS check fails in Syn-Recv, Fin-Wait-2
or Time-Wait statuses, the skipped ACK is counted in
TcpExtTCPACKSkippedSynRecv, TcpExtTCPACKSkippedFinWait2 or
TcpExtTCPACKSkippedTimeWait.
In all other statuses, the skipped ACK
is counted in TcpExtTCPACKSkippedPAWS.

* TcpExtTCPACKSkippedSeq

The sequence number is out of window, the timestamp passes the PAWS
check, and the TCP status is not Syn-Recv, Fin-Wait-2, or Time-Wait.

* TcpExtTCPACKSkippedFinWait2

The ACK is skipped in Fin-Wait-2 status. The reason is either that the
PAWS check fails or that the received sequence number is out of window.

* TcpExtTCPACKSkippedTimeWait

The ACK is skipped in Time-Wait status. The reason is either that the
PAWS check fails or that the received sequence number is out of window.

* TcpExtTCPACKSkippedChallenge

The ACK is skipped if the ACK is a challenge ACK. RFC 5961 defines
3 kinds of challenge ACK; please refer to `RFC 5961 section 3.2`_,
`RFC 5961 section 4.2`_ and `RFC 5961 section 5.2`_. Besides these
three scenarios, in some TCP statuses, the Linux TCP stack also
sends challenge ACKs if the ACK number is before the first
unacknowledged number (stricter than `RFC 5961 section 5.2`_).

.. _RFC 5961 section 3.2: https://tools.ietf.org/html/rfc5961#page-7
.. _RFC 5961 section 4.2: https://tools.ietf.org/html/rfc5961#page-9
.. _RFC 5961 section 5.2: https://tools.ietf.org/html/rfc5961#page-11

TCP receive window
==================
* TcpExtTCPWantZeroWindowAdv

Depending on current memory usage, the TCP stack tries to set the receive
window to zero. But the receive window might still be a non-zero
value. For example, if the previous window size is 10, and the TCP
stack receives 3 bytes, the current window size would be 7 even if the
window size calculated from the memory usage is zero.

* TcpExtTCPToZeroWindowAdv

The TCP receive window is set to zero from a non-zero value.

* TcpExtTCPFromZeroWindowAdv

The TCP receive window is set to a non-zero value from zero.


Delayed ACK
===========
The TCP Delayed ACK is a technique which is used for reducing the
packet count in the network. For more details, please refer to the
`Delayed ACK wiki`_.

.. _Delayed ACK wiki: https://en.wikipedia.org/wiki/TCP_delayed_acknowledgment

* TcpExtDelayedACKs

A delayed ACK timer expires. The TCP stack will send a pure ACK packet
and exit the delayed ACK mode.

* TcpExtDelayedACKLocked

A delayed ACK timer expires, but the TCP stack can't send an ACK
immediately because the socket is locked by a userspace program. The
TCP stack will send a pure ACK later (after the userspace program
unlocks the socket).
When the TCP stack sends the pure ACK later, the
TCP stack will also update TcpExtDelayedACKs and exit the delayed ACK
mode.

* TcpExtDelayedACKLost

It is updated when the TCP stack receives a packet which has already been
ACKed. A Delayed ACK loss might cause this issue, but it can also be
triggered by other reasons, such as a packet being duplicated in the
network.

Tail Loss Probe (TLP)
=====================
TLP is an algorithm which is used to detect TCP packet loss. For more
details, please refer to the `TLP paper`_.

.. _TLP paper: https://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01

* TcpExtTCPLossProbes

A TLP probe packet is sent.

* TcpExtTCPLossProbeRecovery

A packet loss is detected and recovered by TLP.

TCP Fast Open description
=========================
TCP Fast Open is a technology which allows data transfer before the
3-way handshake completes. Please refer to the `TCP Fast Open wiki`_ for a
general description.

.. _TCP Fast Open wiki: https://en.wikipedia.org/wiki/TCP_Fast_Open

* TcpExtTCPFastOpenActive

When the TCP stack receives an ACK packet in the SYN-SENT status, and
the ACK packet acknowledges the data in the SYN packet, the TCP stack
understands that the TFO cookie is accepted by the other side, then it
updates this counter.

* TcpExtTCPFastOpenActiveFail

This counter indicates that the TCP stack initiated a TCP Fast Open,
but it failed. This counter is updated in three scenarios: (1)
the other side doesn't acknowledge the data in the SYN packet; (2) the
SYN packet which has the TFO cookie times out at least once; (3)
after the 3-way handshake, the retransmission timeout happens
net.ipv4.tcp_retries1 times, because some middle-boxes may black-hole
fast open after the handshake.

* TcpExtTCPFastOpenPassive

This counter indicates how many times the TCP stack accepts the fast
open request.

* TcpExtTCPFastOpenPassiveFail

This counter indicates how many times the TCP stack rejects the fast
open request. It is caused by either the TFO cookie being invalid or the
TCP stack finding an error during the socket creation process.

* TcpExtTCPFastOpenListenOverflow

When the pending fast open request number is larger than
fastopenq->max_qlen, the TCP stack will reject the fast open request
and update this counter.
When this counter is updated, the TCP stack
won't update TcpExtTCPFastOpenPassive or
TcpExtTCPFastOpenPassiveFail. The fastopenq->max_qlen is set by the
TCP_FASTOPEN socket option and it cannot be larger than
net.core.somaxconn. For example::

  setsockopt(sfd, SOL_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen));

* TcpExtTCPFastOpenCookieReqd

This counter indicates how many times a client wants to request a TFO
cookie.

SYN cookies
===========
SYN cookies are used to mitigate SYN floods. For details, please refer
to the `SYN cookies wiki`_.

.. _SYN cookies wiki: https://en.wikipedia.org/wiki/SYN_cookies

* TcpExtSyncookiesSent

It indicates how many SYN cookies are sent.

* TcpExtSyncookiesRecv

How many reply packets of the SYN cookies the TCP stack receives.

* TcpExtSyncookiesFailed

The MSS decoded from the SYN cookie is invalid. When this counter is
updated, the received packet won't be treated as a SYN cookie and the
TcpExtSyncookiesRecv counter won't be updated.

Challenge ACK
=============
For details of challenge ACK, please refer to the explanation of
TcpExtTCPACKSkippedChallenge.

* TcpExtTCPChallengeACK

The number of challenge acks sent.

* TcpExtTCPSYNChallenge

The number of challenge acks sent in response to SYN packets. After
updating this counter, the TCP stack might send a challenge ACK and
update the TcpExtTCPChallengeACK counter, or it might skip
sending the challenge ACK and update TcpExtTCPACKSkippedChallenge.

prune
=====
When a socket is under memory pressure, the TCP stack will try to
reclaim memory from the receiving queue and the out of order queue. One of
the reclaiming methods is 'collapse', which means allocating a big skb,
copying the contiguous skbs to the single big skb, and freeing these
contiguous skbs.

* TcpExtPruneCalled

The TCP stack tries to reclaim memory for a socket. After updating this
counter, the TCP stack will try to collapse the out of order queue and
the receiving queue. If the memory is still not enough, the TCP stack
will try to discard packets from the out of order queue (and update the
TcpExtOfoPruned counter).

* TcpExtOfoPruned

The TCP stack tries to discard packets from the out of order queue.

* TcpExtRcvPruned

After 'collapse' and discarding packets from the out of order queue, if
the actually used memory is still larger than the max allowed memory,
this counter will be updated. It means the 'prune' fails.
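
The order in which the three prune counters above are updated can be
summarized as a small model. This is a hedged sketch, not kernel code:
the function prune and its parameters are hypothetical, and memory is
expressed in arbitrary units.

```python
def prune(used, limit, collapse_frees, ofo_frees):
    """Return the counters one prune attempt would update, in order.

    Hypothetical model: `used` is the memory in use, `limit` the max
    allowed memory, `collapse_frees` what 'collapse' reclaims, and
    `ofo_frees` what discarding the out of order queue reclaims.
    """
    counters = ["TcpExtPruneCalled"]
    used -= collapse_frees  # collapse both queues first
    if used > limit:
        counters.append("TcpExtOfoPruned")
        used -= ofo_frees   # discard packets from the out of order queue
    if used > limit:
        counters.append("TcpExtRcvPruned")  # the prune failed
    return counters


print(prune(100, 80, 30, 0))
# ['TcpExtPruneCalled']
print(prune(100, 80, 5, 5))
# ['TcpExtPruneCalled', 'TcpExtOfoPruned', 'TcpExtRcvPruned']
```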

* TcpExtTCPRcvCollapsed

This counter indicates how many skbs are freed during 'collapse'.

examples
========

ping test
---------
Run the ping command against the public DNS server 8.8.8.8::

  nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
  PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
  64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=17.8 ms

  --- 8.8.8.8 ping statistics ---
  1 packets transmitted, 1 received, 0% packet loss, time 0ms
  rtt min/avg/max/mdev = 17.875/17.875/17.875/0.000 ms

The nstat result::

  nstatuser@nstat-a:~$ nstat
  #kernel
  IpInReceives                    1                  0.0
  IpInDelivers                    1                  0.0
  IpOutRequests                   1                  0.0
  IcmpInMsgs                      1                  0.0
  IcmpInEchoReps                  1                  0.0
  IcmpOutMsgs                     1                  0.0
  IcmpOutEchos                    1                  0.0
  IcmpMsgInType0                  1                  0.0
  IcmpMsgOutType8                 1                  0.0
  IpExtInOctets                   84                 0.0
  IpExtOutOctets                  84                 0.0
  IpExtInNoECTPkts                1                  0.0

The Linux server sent an ICMP Echo packet, so IpOutRequests,
IcmpOutMsgs, IcmpOutEchos and IcmpMsgOutType8 were increased by 1. The
server got an ICMP Echo Reply from 8.8.8.8, so IpInReceives, IcmpInMsgs,
IcmpInEchoReps and IcmpMsgInType0 were increased by 1. The ICMP Echo Reply
was passed to the ICMP layer via the IP layer, so IpInDelivers was
increased by 1.
The default ping data size is 56 bytes, so an ICMP Echo packet
and its corresponding Echo Reply packet consist of:

* 14 bytes MAC header
* 20 bytes IP header
* 8 bytes ICMP header
* 56 bytes data (default value of the ping command)

So the IpExtInOctets and IpExtOutOctets are 20+8+56=84.

tcp 3-way handshake
-------------------
On server side, we run::

  nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000
  Listening on [0.0.0.0] (family 0, port 9000)

On client side, we run::

  nstatuser@nstat-a:~$ nc -nv 192.168.122.251 9000
  Connection to 192.168.122.251 9000 port [tcp/*] succeeded!

The server listened on TCP port 9000, the client connected to it, and they
completed the 3-way handshake.

On server side, we can find the nstat output below::

  nstatuser@nstat-b:~$ nstat | grep -i tcp
  TcpPassiveOpens                 1                  0.0
  TcpInSegs                       2                  0.0
  TcpOutSegs                      1                  0.0
  TcpExtTCPPureAcks               1                  0.0

On client side, we can find the nstat output below::

  nstatuser@nstat-a:~$ nstat | grep -i tcp
  TcpActiveOpens                  1                  0.0
  TcpInSegs                       1                  0.0
  TcpOutSegs                      2                  0.0

When the server received the first SYN, it replied with a SYN+ACK, and entered the
SYN-RCVD state, so TcpPassiveOpens increased by 1.
The server received
SYN, sent SYN+ACK, received ACK, so the server sent 1 packet and received 2
packets: TcpInSegs increased by 2, TcpOutSegs increased by 1. The last ACK
of the 3-way handshake is a pure ACK without data, so
TcpExtTCPPureAcks increased by 1.

When the client sent the SYN, the client entered the SYN-SENT state, so
TcpActiveOpens increased by 1. The client sent SYN, received SYN+ACK, sent
ACK, so the client sent 2 packets and received 1 packet: TcpInSegs increased by
1, TcpOutSegs increased by 2.

TCP normal traffic
------------------
Run nc on server::

  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
  Listening on [0.0.0.0] (family 0, port 9000)

Run nc on client::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!

Input a string in the nc client ('hello' in our example)::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!
  hello

The client side nstat output::

  nstatuser@nstat-a:~$ nstat
  #kernel
  IpInReceives                    1                  0.0
  IpInDelivers                    1                  0.0
  IpOutRequests                   1                  0.0
  TcpInSegs                       1                  0.0
  TcpOutSegs                      1                  0.0
  TcpExtTCPPureAcks               1                  0.0
  TcpExtTCPOrigDataSent           1                  0.0
  IpExtInOctets                   52                 0.0
  IpExtOutOctets                  58                 0.0
  IpExtInNoECTPkts                1                  0.0

The server side nstat output::

  nstatuser@nstat-b:~$ nstat
  #kernel
  IpInReceives                    1                  0.0
  IpInDelivers                    1                  0.0
  IpOutRequests                   1                  0.0
  TcpInSegs                       1                  0.0
  TcpOutSegs                      1                  0.0
  IpExtInOctets                   58                 0.0
  IpExtOutOctets                  52                 0.0
  IpExtInNoECTPkts                1                  0.0

Input a string in nc client side again ('world' in our example)::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!
  hello
  world

Client side nstat output::

  nstatuser@nstat-a:~$ nstat
  #kernel
  IpInReceives                    1                  0.0
  IpInDelivers                    1                  0.0
  IpOutRequests                   1                  0.0
  TcpInSegs                       1                  0.0
  TcpOutSegs                      1                  0.0
  TcpExtTCPHPAcks                 1                  0.0
  TcpExtTCPOrigDataSent           1                  0.0
  IpExtInOctets                   52                 0.0
  IpExtOutOctets                  58                 0.0
  IpExtInNoECTPkts                1                  0.0


Server side nstat output::

  nstatuser@nstat-b:~$ nstat
  #kernel
  IpInReceives                    1                  0.0
  IpInDelivers                    1                  0.0
  IpOutRequests                   1                  0.0
  TcpInSegs                       1                  0.0
  TcpOutSegs                      1                  0.0
  TcpExtTCPHPHits                 1                  0.0
  IpExtInOctets                   58                 0.0
  IpExtOutOctets                  52                 0.0
  IpExtInNoECTPkts                1                  0.0

Comparing the first client-side nstat and the second client-side nstat,
we can find one difference: the first one had a 'TcpExtTCPPureAcks',
but the second one had a 'TcpExtTCPHPAcks'. The first server-side
nstat and the second server-side nstat had a difference too: the
second server-side nstat had a TcpExtTCPHPHits, but the first
server-side nstat didn't have it. The network traffic patterns were
exactly the same: the client sent a packet to the server, the server
replied with an ACK. But the kernel handled them in different ways.
When the TCP window scale option is not used, the kernel tries to
enable the fast path immediately when the connection enters the
established state, but if the TCP window scale option is used, the
kernel disables the fast path at first and tries to enable it after
it receives packets. We can use the 'ss' command to verify whether the
window scale option is used. E.g. run the command below on either the
server or the client::

  nstatuser@nstat-a:~$ ss -o state established -i '( dport = :9000 or sport = :9000 )'
  Netid    Recv-Q     Send-Q            Local Address:Port             Peer Address:Port
  tcp      0          0               192.168.122.250:40654         192.168.122.251:9000
             ts sack cubic wscale:7,7 rto:204 rtt:0.98/0.49 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:1 send 118.2Mbps lastsnd:46572 lastrcv:46572 lastack:46572 pacing_rate 236.4Mbps rcv_space:29200 rcv_ssthresh:29200 minrtt:0.98

The 'wscale:7,7' means both the server and the client set the window
scale option to 7. Now we can explain the nstat output in our test:

In the first client-side nstat output, the client sent a packet and the
server replied with an ACK. When the kernel handled this ACK, the fast
path was not enabled, so the ACK was counted into 'TcpExtTCPPureAcks'.

In the second client-side nstat output, the client sent a packet again
and received another ACK from the server. This time, the fast path was
enabled and the ACK qualified for it, so the ACK was handled by the
fast path and counted into 'TcpExtTCPHPAcks'.
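
The counter differences discussed here can also be observed without
nstat. The following is a minimal sketch, not part of any kernel
tooling. It assumes that the extended counters are exposed in
/proc/net/netstat as pairs of lines (a header line of names followed by
a line of values, both starting with the same prefix), and that the
nstat name is the prefix joined with the field name (e.g. TcpExt plus
TCPHPAcks). The helper names parse_netstat, snapshot and delta are
hypothetical and only used for illustration::

  def parse_netstat(text):
      """Turn /proc/net/netstat text into {'TcpExtTCPHPHits': 12, ...}."""
      counters = {}
      lines = text.strip().splitlines()
      # Lines come in pairs: names first, then the matching values.
      for header, values in zip(lines[0::2], lines[1::2]):
          prefix, names = header.split(":", 1)
          _, numbers = values.split(":", 1)
          for name, number in zip(names.split(), numbers.split()):
              counters[prefix + name] = int(number)
      return counters


  def snapshot(path="/proc/net/netstat"):
      """Read the current counter values (the path is an assumption)."""
      with open(path) as f:
          return parse_netstat(f.read())


  def delta(before, after, names):
      """Return how much each named counter grew between two snapshots."""
      return {n: after.get(n, 0) - before.get(n, 0) for n in names}

Taking one snapshot() before sending a string over the nc connection
and another one afterwards, then passing both to delta() together with
TcpExtTCPPureAcks, TcpExtTCPHPAcks and TcpExtTCPHPHits, should show
which of these counters moved for that exchange.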

In the first server-side nstat output, the fast path was not enabled,
so there was no 'TcpExtTCPHPHits'.

In the second server-side nstat output, the fast path was enabled
and the packet received from the client qualified for the fast path,
so it was counted into 'TcpExtTCPHPHits'.

TcpExtTCPAbortOnClose
---------------------
On the server side, we run the Python script below::

  import socket
  import time

  port = 9000

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(('0.0.0.0', port))
  s.listen(1)
  sock, addr = s.accept()
  while True:
      time.sleep(9999999)

This Python script listens on port 9000, but doesn't read anything from
the connection.

On the client side, we send the string "hello" by nc::

  nstatuser@nstat-a:~$ echo "hello" | nc nstat-b 9000

Then, we come back to the server side. The server has received the
"hello" packet, and the TCP layer has acked this packet, but the
application hasn't read it yet. We type Ctrl-C to terminate the server
script.
Then we can see that TcpExtTCPAbortOnClose increased by 1 on the
server side::

  nstatuser@nstat-b:~$ nstat | grep -i abort
  TcpExtTCPAbortOnClose           1                  0.0

If we run tcpdump on the server side, we can see that the server sent a
RST after we typed Ctrl-C.

TcpExtTCPAbortOnMemory and TcpExtTCPAbortOnTimeout
---------------------------------------------------
Below is an example which lets the orphan socket count exceed
net.ipv4.tcp_max_orphans.
Change tcp_max_orphans to a smaller value on the client::

  sudo bash -c "echo 10 > /proc/sys/net/ipv4/tcp_max_orphans"

Client code (create 64 connections to the server)::

  nstatuser@nstat-a:~$ cat client_orphan.py
  import socket
  import time

  server = 'nstat-b' # server address
  port = 9000

  count = 64

  connection_list = []

  for i in range(count):
      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      s.connect((server, port))
      connection_list.append(s)
      print("connection_count: %d" % len(connection_list))

  while True:
      time.sleep(99999)

Server code (accept 64 connections from the client)::

  nstatuser@nstat-b:~$ cat server_orphan.py
  import socket
  import time

  port = 9000
  count = 64

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(('0.0.0.0', port))
  s.listen(count)
  connection_list = []
  while True:
      sock, addr = s.accept()
      connection_list.append((sock, addr))
      print("connection_count: %d" % len(connection_list))

Run the Python scripts on the server and the client.

On the server::

  python3 server_orphan.py

On the client::

  python3 client_orphan.py

Run iptables on the server::

  sudo iptables -A INPUT -i ens3 -p tcp --destination-port 9000 -j DROP

Type Ctrl-C on the client to stop client_orphan.py.

Check TcpExtTCPAbortOnMemory on the client::

  nstatuser@nstat-a:~$ nstat | grep -i abort
  TcpExtTCPAbortOnMemory          54                 0.0

Check the orphaned socket count on the client::

  nstatuser@nstat-a:~$ ss -s
  Total: 131 (kernel 0)
  TCP:   14 (estab 1, closed 0, orphaned 10, synrecv 0, timewait 0/0), ports 0

  Transport Total     IP        IPv6
  *         0         -         -
  RAW       1         0         1
  UDP       1         1         0
  TCP       14        13        1
  INET      16        14        2
  FRAG      0         0         0

The explanation of the test: after running server_orphan.py and
client_orphan.py, we have set up 64 connections between the server and
the client.
After we run the iptables command, the server drops all packets from
the client. When we type Ctrl-C on client_orphan.py, the client system
tries to close these connections, and before they are closed
gracefully, these connections become orphan sockets. As the iptables
rule on the server blocks packets from the client, the server won't
receive the FIN from the client, so all connections on the client are
stuck in the FIN_WAIT_1 state and remain orphan sockets until they time
out. We have echoed 10 to /proc/sys/net/ipv4/tcp_max_orphans, so the
client system only keeps 10 orphan sockets; for all other orphan
sockets, the client system sends a RST and deletes them. We have 64
connections, so the 'ss -s' command shows the system has 10 orphan
sockets, and the value of TcpExtTCPAbortOnMemory is 54.

An additional explanation about the orphan socket count: you can find
the exact orphan socket count with the 'ss -s' command, but when the
kernel decides whether to increase TcpExtTCPAbortOnMemory and send a
RST, it doesn't always check the exact orphan socket count. To improve
performance, the kernel checks an approximate count first; if the
approximate count is more than tcp_max_orphans, the kernel checks the
exact count again. So if the approximate count is less than
tcp_max_orphans, but the exact count is more than tcp_max_orphans, you
will find that TcpExtTCPAbortOnMemory is not increased at all.
If tcp_max_orphans is large enough, this won't occur, but if you
decrease tcp_max_orphans to a small value as in our test, you might see
this issue. That is why, in our test, the client set up 64 connections
although tcp_max_orphans is 10. If the client only set up 11
connections, we couldn't see the change of TcpExtTCPAbortOnMemory.

Continuing the previous test, we wait for several minutes. Because the
iptables rule on the server blocks the traffic, the server doesn't
receive the FIN, and all the client's orphan sockets finally time out
in the FIN_WAIT_1 state. So after a few minutes, we can find 10
timeouts on the client::

  nstatuser@nstat-a:~$ nstat | grep -i abort
  TcpExtTCPAbortOnTimeout         10                 0.0

TcpExtTCPAbortOnLinger
----------------------
The server side code::

  nstatuser@nstat-b:~$ cat server_linger.py
  import socket
  import time

  port = 9000

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(('0.0.0.0', port))
  s.listen(1)
  sock, addr = s.accept()
  while True:
      time.sleep(9999999)

The client side code::

  nstatuser@nstat-a:~$ cat client_linger.py
  import socket
  import struct

  server = 'nstat-b' # server address
  port = 9000

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 10))
  s.setsockopt(socket.SOL_TCP, socket.TCP_LINGER2, struct.pack('i', -1))
  s.connect((server, port))
  s.close()

Run server_linger.py on the server::

  nstatuser@nstat-b:~$ python3 server_linger.py

Run client_linger.py on the client::

  nstatuser@nstat-a:~$ python3 client_linger.py

After running client_linger.py, check the output of nstat::

  nstatuser@nstat-a:~$ nstat | grep -i abort
  TcpExtTCPAbortOnLinger          1                  0.0

TcpExtTCPRcvCoalesce
--------------------
On the server, we run a program which listens on TCP port 9000, but
doesn't read any data::

  import socket
  import time
  port = 9000
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(('0.0.0.0', port))
  s.listen(1)
  sock, addr = s.accept()
  while True:
      time.sleep(9999999)

Save the above code as server_coalesce.py, and run::

  python3 server_coalesce.py

On the client, save the code below as client_coalesce.py::

  import socket
  server = 'nstat-b'
  port = 9000
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.connect((server, port))

Run::

  nstatuser@nstat-a:~$ python3 -i client_coalesce.py

We use '-i' to enter the interactive mode, then send a packet::

  >>> s.send(b'foo')
  3

Send a packet again::

  >>> s.send(b'bar')
  3

On the server, run nstat::

  ubuntu@nstat-b:~$ nstat
  #kernel
  IpInReceives                    2                  0.0
  IpInDelivers                    2                  0.0
  IpOutRequests                   2                  0.0
  TcpInSegs                       2                  0.0
  TcpOutSegs                      2                  0.0
  TcpExtTCPRcvCoalesce            1                  0.0
  IpExtInOctets                   110                0.0
  IpExtOutOctets                  104                0.0
  IpExtInNoECTPkts                2                  0.0

The client sent two packets, and the server didn't read any data. When
the second packet arrived at the server, the first packet was still in
the receive queue. So the TCP layer merged the two packets, and we
can see that TcpExtTCPRcvCoalesce increased by 1.

TcpExtListenOverflows and TcpExtListenDrops
-------------------------------------------
On the server, run the nc command to listen on port 9000::

  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
  Listening on [0.0.0.0] (family 0, port 9000)

On the client, run 3 nc commands in different terminals::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!

The nc command only accepts 1 connection, and the accept queue length
is 1. In the current Linux implementation, setting the queue length to
n means the actual queue length is n+1. Now we create 3 connections: 1
is accepted by nc and 2 are in the accept queue, so the accept queue is
full.

Before running the 4th nc, we clean the nstat history on the server::

  nstatuser@nstat-b:~$ nstat -n

Run the 4th nc on the client::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000

If the nc server is running on kernel 4.10 or a later version, you
won't see the "Connection to ... succeeded!" string, because the kernel
drops the SYN if the accept queue is full. If the nc client is running
on an old kernel, you will see that the connection succeeded,
because the kernel completes the 3-way handshake and keeps the socket
in the half-open queue. I did the test on kernel 4.15. Below is the
nstat output on the server::

  nstatuser@nstat-b:~$ nstat
  #kernel
  IpInReceives                    4                  0.0
  IpInDelivers                    4                  0.0
  TcpInSegs                       4                  0.0
  TcpExtListenOverflows           4                  0.0
  TcpExtListenDrops               4                  0.0
  IpExtInOctets                   240                0.0
  IpExtInNoECTPkts                4                  0.0

Both TcpExtListenOverflows and TcpExtListenDrops were 4.
If the time between the 4th nc and the nstat was longer, the values of
TcpExtListenOverflows and TcpExtListenDrops would be larger, because
the SYN of the 4th nc was dropped and the client kept retrying.

IpInAddrErrors, IpExtInNoRoutes and IpOutNoRoutes
-------------------------------------------------
* Server A IP address: 192.168.122.250
* Server B IP address: 192.168.122.251

Prepare on server A by adding a route to server B::

  $ sudo ip route add 8.8.8.8/32 via 192.168.122.251

Prepare on server B by disabling send_redirects for all interfaces::

  $ sudo sysctl -w net.ipv4.conf.all.send_redirects=0
  $ sudo sysctl -w net.ipv4.conf.ens3.send_redirects=0
  $ sudo sysctl -w net.ipv4.conf.lo.send_redirects=0
  $ sudo sysctl -w net.ipv4.conf.default.send_redirects=0

We want server A to send a packet to 8.8.8.8 and route the packet
to server B. When server B receives such a packet, it might send an
ICMP Redirect message to server A; setting send_redirects to 0 disables
this behavior.

First, generate IpInAddrErrors.
On server B, we disable IP forwarding::

  $ sudo sysctl -w net.ipv4.conf.all.forwarding=0

On server A, we send packets to 8.8.8.8::

  $ nc -v 8.8.8.8 53

On server B, we check the output of nstat::

  $ nstat
  #kernel
  IpInReceives                    3                  0.0
  IpInAddrErrors                  3                  0.0
  IpExtInOctets                   180                0.0
  IpExtInNoECTPkts                3                  0.0

As we have let server A route 8.8.8.8 to server B, and we disabled IP
forwarding on server B, server A sent packets to server B, then server
B dropped the packets and increased IpInAddrErrors. As the nc command
re-sends the SYN packet if it doesn't receive a SYN+ACK, we can find
multiple IpInAddrErrors.

Second, generate IpExtInNoRoutes.
On server B, we enable IP forwarding::

  $ sudo sysctl -w net.ipv4.conf.all.forwarding=1

Check the route table of server B and remove the default route::

  $ ip route show
  default via 192.168.122.1 dev ens3 proto static
  192.168.122.0/24 dev ens3 proto kernel scope link src 192.168.122.251
  $ sudo ip route delete default via 192.168.122.1 dev ens3 proto static

On server A, we contact 8.8.8.8 again::

  $ nc -v 8.8.8.8 53
  nc: connect to 8.8.8.8 port 53 (tcp) failed: Network is unreachable

On server B, run nstat::

  $ nstat
  #kernel
  IpInReceives                    1                  0.0
  IpOutRequests                   1                  0.0
  IcmpOutMsgs                     1                  0.0
  IcmpOutDestUnreachs             1                  0.0
  IcmpMsgOutType3                 1                  0.0
  IpExtInNoRoutes                 1                  0.0
  IpExtInOctets                   60                 0.0
  IpExtOutOctets                  88                 0.0
  IpExtInNoECTPkts                1                  0.0

We enabled IP forwarding on server B, so when server B received a
packet whose destination IP address was 8.8.8.8, it tried to forward
the packet. We had deleted the default route, so there was no route for
8.8.8.8; server B increased IpExtInNoRoutes and sent the "ICMP
Destination Unreachable" message to server A.

Third, generate IpOutNoRoutes.
Run the ping command on server B::

  $ ping -c 1 8.8.8.8
  connect: Network is unreachable

Run nstat on server B::

  $ nstat
  #kernel
  IpOutNoRoutes                   1                  0.0

We have deleted the default route on server B. Server B couldn't find
a route for the 8.8.8.8 IP address, so server B increased
IpOutNoRoutes.

TcpExtTCPACKSkippedSynRecv
--------------------------
In this test, we send 3 identical SYN packets from the client to the
server. The first SYN lets the server create a socket, set it to the
Syn-Recv state, and reply with a SYN/ACK. The second SYN lets the
server reply with the SYN/ACK again and record the reply time (the
duplicate ACK reply time). The third SYN lets the server check the
previous duplicate ACK reply time and decide to skip the duplicate ACK,
then increase the TcpExtTCPACKSkippedSynRecv counter.

Run tcpdump to capture a SYN packet::

  nstatuser@nstat-a:~$ sudo tcpdump -c 1 -w /tmp/syn.pcap port 9000
  tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes

Open another terminal and run the nc command::

  nstatuser@nstat-a:~$ nc nstat-b 9000

As nstat-b didn't listen on port 9000, it should reply with a RST, and
the nc command exits immediately. That is enough for the tcpdump
command to capture a SYN packet.
A Linux server might use hardware offload for the TCP checksum, so the
checksum in /tmp/syn.pcap might not be correct. We call tcprewrite to
fix it::

  nstatuser@nstat-a:~$ tcprewrite --infile=/tmp/syn.pcap --outfile=/tmp/syn_fixcsum.pcap --fixcsum

On nstat-b, we run nc to listen on port 9000::

  nstatuser@nstat-b:~$ nc -lkv 9000
  Listening on [0.0.0.0] (family 0, port 9000)

On nstat-a, we block packets from port 9000, or nstat-a would send a
RST to nstat-b::

  nstatuser@nstat-a:~$ sudo iptables -A INPUT -p tcp --sport 9000 -j DROP

Send 3 SYNs repeatedly to nstat-b::

  nstatuser@nstat-a:~$ for i in {1..3}; do sudo tcpreplay -i ens3 /tmp/syn_fixcsum.pcap; done

Check the SNMP counter on nstat-b::

  nstatuser@nstat-b:~$ nstat | grep -i skip
  TcpExtTCPACKSkippedSynRecv      1                  0.0

As we expected, TcpExtTCPACKSkippedSynRecv is 1.

TcpExtTCPACKSkippedPAWS
-----------------------
To trigger PAWS, we could send an old SYN.

On nstat-b, let nc listen on port 9000::

  nstatuser@nstat-b:~$ nc -lkv 9000
  Listening on [0.0.0.0] (family 0, port 9000)

On nstat-a, run tcpdump to capture a SYN::

  nstatuser@nstat-a:~$ sudo tcpdump -w /tmp/paws_pre.pcap -c 1 port 9000
  tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes

On nstat-a, run nc as a client to connect to nstat-b::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!

Now tcpdump has captured the SYN and exited. We should fix the
checksum::

  nstatuser@nstat-a:~$ tcprewrite --infile /tmp/paws_pre.pcap --outfile /tmp/paws.pcap --fixcsum

Send the SYN packet twice::

  nstatuser@nstat-a:~$ for i in {1..2}; do sudo tcpreplay -i ens3 /tmp/paws.pcap; done

On nstat-b, check the SNMP counter::

  nstatuser@nstat-b:~$ nstat | grep -i skip
  TcpExtTCPACKSkippedPAWS         1                  0.0

We sent two SYNs via tcpreplay, and both of them failed the PAWS
check. nstat-b replied with an ACK for the first SYN, skipped the ACK
for the second SYN, and updated TcpExtTCPACKSkippedPAWS.
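
The "reply to the first, skip the second" behavior can be pictured as a
simple rate limiter. The following is a toy model only, not the kernel
code. It assumes that duplicate ACKs for invalid packets are limited by
the net.ipv4.tcp_invalid_ratelimit sysctl, whose documented default is
500 milliseconds; the class DupAckLimiter and its method names are
hypothetical and exist only for this illustration::

  class DupAckLimiter:
      """Toy model of rate-limiting duplicate ACKs for invalid packets."""

      def __init__(self, ratelimit_ms=500):
          # 500 ms is assumed from the tcp_invalid_ratelimit default.
          self.ratelimit_ms = ratelimit_ms
          self.last_ack_ms = None  # when the last duplicate ACK was sent
          self.skipped = 0         # stands in for a TcpExtTCPACKSkipped* counter

      def on_invalid_packet(self, now_ms):
          """Return True if a duplicate ACK is sent, False if skipped."""
          if (self.last_ack_ms is not None
                  and 0 <= now_ms - self.last_ack_ms < self.ratelimit_ms):
              self.skipped += 1
              return False
          # Only a sent ACK refreshes the timestamp in this model.
          self.last_ack_ms = now_ms
          return True

With two invalid packets arriving a few milliseconds apart, as in the
tcpreplay loop, the first call returns True (an ACK is sent) and the
second returns False with skipped equal to 1, which matches the counter
value seen in the test.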

TcpExtTCPACKSkippedSeq
----------------------
To trigger TcpExtTCPACKSkippedSeq, we send packets which have a valid
timestamp (to pass the PAWS check) but whose sequence number is out of
window. The Linux TCP stack avoids skipping if the packet has
data, so we need a pure ACK packet. To generate such a packet, we
could create two sockets: one on port 9000, another on port 9001. Then
we capture an ACK on port 9001 and change the source/destination port
numbers to match the port 9000 socket. Then we could trigger
TcpExtTCPACKSkippedSeq via this packet.

On nstat-b, open two terminals and run two nc commands to listen on
both port 9000 and port 9001::

  nstatuser@nstat-b:~$ nc -lkv 9000
  Listening on [0.0.0.0] (family 0, port 9000)

  nstatuser@nstat-b:~$ nc -lkv 9001
  Listening on [0.0.0.0] (family 0, port 9001)

On nstat-a, run two nc clients::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!

  nstatuser@nstat-a:~$ nc -v nstat-b 9001
  Connection to nstat-b 9001 port [tcp/*] succeeded!

On nstat-a, run tcpdump to capture an ACK::

  nstatuser@nstat-a:~$ sudo tcpdump -w /tmp/seq_pre.pcap -c 1 dst port 9001
  tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes

On nstat-b, send a packet via the port 9001 socket. E.g.
we sent the string 'foo' in our example::

  nstatuser@nstat-b:~$ nc -lkv 9001
  Listening on [0.0.0.0] (family 0, port 9001)
  Connection from nstat-a 42132 received!
  foo

On nstat-a, the tcpdump should have captured the ACK. We should check
the source port numbers of the two nc clients::

  nstatuser@nstat-a:~$ ss -ta '( dport = :9000 || dport = :9001 )' | tee
  State  Recv-Q   Send-Q         Local Address:Port           Peer Address:Port
  ESTAB  0        0            192.168.122.250:50208       192.168.122.251:9000
  ESTAB  0        0            192.168.122.250:42132       192.168.122.251:9001

Run tcprewrite to change port 9001 to port 9000 and port 42132 to
port 50208::

  nstatuser@nstat-a:~$ tcprewrite --infile /tmp/seq_pre.pcap --outfile /tmp/seq.pcap -r 9001:9000 -r 42132:50208 --fixcsum

Now /tmp/seq.pcap is the packet we need. Send it to nstat-b::

  nstatuser@nstat-a:~$ for i in {1..2}; do sudo tcpreplay -i ens3 /tmp/seq.pcap; done

Check TcpExtTCPACKSkippedSeq on nstat-b::

  nstatuser@nstat-b:~$ nstat | grep -i skip
  TcpExtTCPACKSkippedSeq          1                  0.0
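
The rewritten packet keeps the sequence number of the port 9001
connection, which most likely lies outside the receive window of the
port 9000 connection because the two connections chose their initial
sequence numbers independently. The following is a simplified sketch of
an "in window" test in the style of RFC 793 for a segment without data;
it is not the check the Linux stack actually implements, and the
function name seq_in_window is hypothetical::

  MOD = 1 << 32  # TCP sequence numbers are 32 bits wide and wrap around


  def seq_in_window(seq, rcv_nxt, rcv_wnd):
      """True if seq lies in [rcv_nxt, rcv_nxt + rcv_wnd) modulo 2**32."""
      # The modular subtraction gives the distance from the left window
      # edge, so the comparison also works across the wrap-around point.
      return (seq - rcv_nxt) % MOD < rcv_wnd

A packet for which such a test fails is answered with a duplicate ACK
at most once per rate-limit interval; the second replayed packet in the
test above arrives too soon, so its ACK is skipped and
TcpExtTCPACKSkippedSeq is increased.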