============
SNMP counter
============

This document explains the meaning of SNMP counters.
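
On Linux these counters can be read from /proc/net/snmp and
/proc/net/netstat, where each group takes two lines: a line of names and
a line of values, both starting with the same prefix. The following is a
minimal sketch, not part of the kernel tree, that turns such text into a
dictionary keyed by the names used in this document (prefix plus name,
e.g. IpInReceives). The helper name ``parse_proc_net_snmp`` and the sample
numbers are made up for illustration.

```python
def parse_proc_net_snmp(text):
    """Parse /proc/net/snmp-style text into {"IpInReceives": 120, ...}."""
    counters = {}
    lines = [line for line in text.splitlines() if line.strip()]
    # Lines come in pairs: a header line of names, then a line of values,
    # both starting with the same "Prefix:" token.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        value_prefix, numbers = values.split(":", 1)
        if prefix != value_prefix:
            raise ValueError("mismatched lines: %r / %r" % (prefix, value_prefix))
        for name, number in zip(names.split(), numbers.split()):
            counters[prefix + name] = int(number)
    return counters


# Hypothetical sample in the same layout; the numbers are illustrative only.
SAMPLE = """\
Ip: InReceives InDelivers OutRequests
Ip: 120 118 97
Icmp: InMsgs InErrors OutMsgs
Icmp: 4 1 3
"""

if __name__ == "__main__":
    for name, value in sorted(parse_proc_net_snmp(SAMPLE).items()):
        print(name, value)
```

On a real system the same function could be fed the contents of
/proc/net/snmp or /proc/net/netstat instead of the sample string.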

General IPv4 counters
=====================
All layer 4 packets and ICMP packets will change these counters, but
these counters won't be changed by layer 2 packets (such as STP) or
ARP packets.

* IpInReceives

Defined in `RFC1213 ipInReceives`_

.. _RFC1213 ipInReceives: https://tools.ietf.org/html/rfc1213#page-26

The number of packets received by the IP layer. It is increased at the
beginning of the ip_rcv function and is always updated together with
IpExtInOctets. It is increased even if the packet is dropped later
(e.g. because the IP header is invalid or the checksum is wrong). It
counts the number of aggregated segments after GRO/LRO.

* IpInDelivers

Defined in `RFC1213 ipInDelivers`_

.. _RFC1213 ipInDelivers: https://tools.ietf.org/html/rfc1213#page-28

The number of packets delivered to the upper layer protocols, e.g. TCP,
UDP, ICMP and so on. If no one listens on a raw socket, only packets of
kernel-supported protocols will be delivered. If someone listens on a
raw socket, all valid IP packets will be delivered.

* IpOutRequests

Defined in `RFC1213 ipOutRequests`_

.. _RFC1213 ipOutRequests: https://tools.ietf.org/html/rfc1213#page-28

The number of packets sent via the IP layer, for both unicast and
multicast packets. It is always updated together with
IpExtOutOctets.

* IpExtInOctets and IpExtOutOctets

They are Linux kernel extensions and have no RFC definitions. Please
note that RFC1213 does define ifInOctets and ifOutOctets, but they
are different things. ifInOctets and ifOutOctets include the MAC
layer header size, but IpExtInOctets and IpExtOutOctets don't; they
only include the IP layer header and the IP layer data.

* IpExtInNoECTPkts, IpExtInECT1Pkts, IpExtInECT0Pkts, IpExtInCEPkts

They indicate the number of the four kinds of ECN IP packets. Please
refer to `Explicit Congestion Notification`_ for more details.

.. _Explicit Congestion Notification: https://tools.ietf.org/html/rfc3168#page-6

These 4 counters count how many packets are received per ECN
status. They count the real frame number regardless of LRO/GRO. So
for the same packet, you might find that IpInReceives counts 1, but
IpExtInNoECTPkts counts 2 or more.
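
As a rough illustration of the four ECN kinds, the sketch below maps the
two ECN bits of the IPv4 TOS byte to the counter that would be expected to
increase, following the codepoints in RFC 3168 (00 Not-ECT, 01 ECT(1),
10 ECT(0), 11 CE). The function name ``ecn_counter_for_tos`` is
hypothetical and this is not how the kernel implements the accounting.

```python
# ECN codepoints from RFC 3168: the two least significant bits of the TOS byte.
ECN_COUNTERS = {
    0b00: "IpExtInNoECTPkts",  # Not-ECT
    0b01: "IpExtInECT1Pkts",   # ECT(1)
    0b10: "IpExtInECT0Pkts",   # ECT(0)
    0b11: "IpExtInCEPkts",     # CE (congestion experienced)
}


def ecn_counter_for_tos(tos):
    """Return the name of the counter matching the ECN bits of a TOS byte."""
    return ECN_COUNTERS[tos & 0x03]


if __name__ == "__main__":
    for tos in (0x00, 0x01, 0x02, 0x03):
        print(hex(tos), ecn_counter_for_tos(tos))
```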

* IpInHdrErrors

Defined in `RFC1213 ipInHdrErrors`_. It indicates the packet is
dropped due to an IP header error. It might happen in both the IP input
and IP forward paths.

.. _RFC1213 ipInHdrErrors: https://tools.ietf.org/html/rfc1213#page-27

* IpInAddrErrors

Defined in `RFC1213 ipInAddrErrors`_. It will be increased in two
scenarios: (1) The IP address is invalid. (2) The destination IP
address is not a local address and IP forwarding is not enabled.

.. _RFC1213 ipInAddrErrors: https://tools.ietf.org/html/rfc1213#page-27

* IpExtInNoRoutes

This counter means the packet is dropped when the IP stack receives a
packet and can't find a route for it in the route table. It might
happen when IP forwarding is enabled, the destination IP address is
not a local address, and there is no route for the destination IP
address.

* IpInUnknownProtos

Defined in `RFC1213 ipInUnknownProtos`_. It will be increased if the
layer 4 protocol is unsupported by the kernel. If an application is
using a raw socket, the kernel will always deliver the packet to the raw
socket and this counter won't be increased.

.. _RFC1213 ipInUnknownProtos: https://tools.ietf.org/html/rfc1213#page-27

* IpExtInTruncatedPkts

For an IPv4 packet, it means the actual data size is smaller than the
"Total Length" field in the IPv4 header.

* IpInDiscards

Defined in `RFC1213 ipInDiscards`_. It indicates the packet is dropped
in the IP receiving path due to kernel internal reasons (e.g. not
enough memory).

.. _RFC1213 ipInDiscards: https://tools.ietf.org/html/rfc1213#page-28

* IpOutDiscards

Defined in `RFC1213 ipOutDiscards`_. It indicates the packet is
dropped in the IP sending path due to kernel internal reasons.

.. _RFC1213 ipOutDiscards: https://tools.ietf.org/html/rfc1213#page-28

* IpOutNoRoutes

Defined in `RFC1213 ipOutNoRoutes`_. It indicates the packet is
dropped in the IP sending path because no route is found for it.

.. _RFC1213 ipOutNoRoutes: https://tools.ietf.org/html/rfc1213#page-29

ICMP counters
=============
* IcmpInMsgs and IcmpOutMsgs

Defined by `RFC1213 icmpInMsgs`_ and `RFC1213 icmpOutMsgs`_

.. _RFC1213 icmpInMsgs: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpOutMsgs: https://tools.ietf.org/html/rfc1213#page-43

As mentioned in RFC1213, these two counters include errors: they
are increased even if the ICMP packet has an invalid type. The
ICMP output path will check the header of a raw socket, so
IcmpOutMsgs is still updated if the IP header is constructed by
a userspace program.

* ICMP named types

| These counters include most of the common ICMP types. They are:
| IcmpInDestUnreachs: `RFC1213 icmpInDestUnreachs`_
| IcmpInTimeExcds: `RFC1213 icmpInTimeExcds`_
| IcmpInParmProbs: `RFC1213 icmpInParmProbs`_
| IcmpInSrcQuenchs: `RFC1213 icmpInSrcQuenchs`_
| IcmpInRedirects: `RFC1213 icmpInRedirects`_
| IcmpInEchos: `RFC1213 icmpInEchos`_
| IcmpInEchoReps: `RFC1213 icmpInEchoReps`_
| IcmpInTimestamps: `RFC1213 icmpInTimestamps`_
| IcmpInTimestampReps: `RFC1213 icmpInTimestampReps`_
| IcmpInAddrMasks: `RFC1213 icmpInAddrMasks`_
| IcmpInAddrMaskReps: `RFC1213 icmpInAddrMaskReps`_
| IcmpOutDestUnreachs: `RFC1213 icmpOutDestUnreachs`_
| IcmpOutTimeExcds: `RFC1213 icmpOutTimeExcds`_
| IcmpOutParmProbs: `RFC1213 icmpOutParmProbs`_
| IcmpOutSrcQuenchs: `RFC1213 icmpOutSrcQuenchs`_
| IcmpOutRedirects: `RFC1213 icmpOutRedirects`_
| IcmpOutEchos: `RFC1213 icmpOutEchos`_
| IcmpOutEchoReps: `RFC1213 icmpOutEchoReps`_
| IcmpOutTimestamps: `RFC1213 icmpOutTimestamps`_
| IcmpOutTimestampReps: `RFC1213 icmpOutTimestampReps`_
| IcmpOutAddrMasks: `RFC1213 icmpOutAddrMasks`_
| IcmpOutAddrMaskReps: `RFC1213 icmpOutAddrMaskReps`_

.. _RFC1213 icmpInDestUnreachs: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpInTimeExcds: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpInParmProbs: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInSrcQuenchs: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInRedirects: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInEchos: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInEchoReps: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInTimestamps: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInTimestampReps: https://tools.ietf.org/html/rfc1213#page-43
.. _RFC1213 icmpInAddrMasks: https://tools.ietf.org/html/rfc1213#page-43
.. _RFC1213 icmpInAddrMaskReps: https://tools.ietf.org/html/rfc1213#page-43

.. _RFC1213 icmpOutDestUnreachs: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutTimeExcds: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutParmProbs: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutSrcQuenchs: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutRedirects: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutEchos: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutEchoReps: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutTimestamps: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutTimestampReps: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutAddrMasks: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutAddrMaskReps: https://tools.ietf.org/html/rfc1213#page-46

Every ICMP type has two counters: 'In' and 'Out'. E.g., for the ICMP
Echo packet, they are IcmpInEchos and IcmpOutEchos. Their meanings are
straightforward. The 'In' counter means the kernel receives such a
packet and the 'Out' counter means the kernel sends such a packet.

* ICMP numeric types

They are IcmpMsgInType[N] and IcmpMsgOutType[N], where [N] indicates the
ICMP type number. These counters track all kinds of ICMP packets. The
ICMP type number definitions can be found in the `ICMP parameters`_
document.

.. _ICMP parameters: https://www.iana.org/assignments/icmp-parameters/icmp-parameters.xhtml

For example, if the Linux kernel sends an ICMP Echo packet,
IcmpMsgOutType8 increases by 1. And if the kernel gets an ICMP Echo
Reply packet, IcmpMsgInType0 increases by 1.

* IcmpInCsumErrors

This counter indicates the checksum of the ICMP packet is
wrong. The kernel verifies the checksum after updating IcmpInMsgs and
before updating IcmpMsgInType[N]. If a packet has a bad checksum,
IcmpInMsgs is updated but none of IcmpMsgInType[N] is updated.

* IcmpInErrors and IcmpOutErrors

Defined by `RFC1213 icmpInErrors`_ and `RFC1213 icmpOutErrors`_

.. _RFC1213 icmpInErrors: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpOutErrors: https://tools.ietf.org/html/rfc1213#page-43

When an error occurs in the ICMP packet handler path, these two
counters are updated. The receiving packet path uses IcmpInErrors
and the sending packet path uses IcmpOutErrors. When IcmpInCsumErrors
is increased, IcmpInErrors is always increased too.

Relationship of the ICMP counters
---------------------------------
The sum of IcmpMsgOutType[N] is always equal to IcmpOutMsgs, as they
are updated at the same time. The sum of IcmpMsgInType[N] plus
IcmpInErrors should be equal to or larger than IcmpInMsgs. When the
kernel receives an ICMP packet, it follows the logic below:

1. increase IcmpInMsgs
2. if there is any error, update IcmpInErrors and finish the process
3. update IcmpMsgInType[N]
4. handle the packet depending on the type; if there is any error, update
   IcmpInErrors and finish the process

So if all errors occur in step (2), IcmpInMsgs should be equal to the
sum of IcmpMsgInType[N] plus IcmpInErrors. If all errors occur in
step (4), IcmpInMsgs should be equal to the sum of
IcmpMsgInType[N]. If errors occur in both step (2) and step (4),
IcmpInMsgs should be less than the sum of IcmpMsgInType[N] plus
IcmpInErrors.
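
The receive-path steps above can be replayed with a small simulation. This
is only a sketch under the assumption that the per-type counter updated on
the receive path is IcmpMsgInType[N]; the names ``receive_icmp`` and
``check_relationship`` are hypothetical and the code does not mirror the
kernel source.

```python
from collections import Counter


def receive_icmp(counters, icmp_type, early_error=False, late_error=False):
    """Replay the four receive steps for one ICMP packet."""
    counters["IcmpInMsgs"] += 1                    # step 1
    if early_error:                                # step 2 (e.g. bad checksum)
        counters["IcmpInErrors"] += 1
        return
    counters["IcmpMsgInType%d" % icmp_type] += 1   # step 3
    if late_error:                                 # step 4 (type handler failed)
        counters["IcmpInErrors"] += 1


def check_relationship(counters):
    """True if sum(InType) <= IcmpInMsgs <= sum(InType) + IcmpInErrors."""
    type_sum = sum(value for name, value in counters.items()
                   if name.startswith("IcmpMsgInType"))
    in_msgs = counters["IcmpInMsgs"]
    return type_sum <= in_msgs <= type_sum + counters["IcmpInErrors"]


if __name__ == "__main__":
    c = Counter()
    receive_icmp(c, 8)                     # clean Echo
    receive_icmp(c, 8, early_error=True)   # error in step 2
    receive_icmp(c, 0, late_error=True)    # error in step 4
    print(dict(c), check_relationship(c))
```

With one error in each of step (2) and step (4), IcmpInMsgs is 3 while the
per-type sum plus IcmpInErrors is 4, which matches the "less than" case
described above.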

General TCP counters
====================
* TcpInSegs

Defined in `RFC1213 tcpInSegs`_

.. _RFC1213 tcpInSegs: https://tools.ietf.org/html/rfc1213#page-48

The number of packets received by the TCP layer. As mentioned in
RFC1213, it includes the packets received in error, such as checksum
errors, invalid TCP headers and so on. Only one error won't be included:
if the layer 2 destination address is not the NIC's layer 2
address. It might happen if the packet is a multicast or broadcast
packet, or the NIC is in promiscuous mode. In these situations, the
packets would be delivered to the TCP layer, but the TCP layer will
discard these packets before increasing TcpInSegs. The TcpInSegs counter
isn't aware of GRO. So if two packets are merged by GRO, the TcpInSegs
counter only increases by 1.

* TcpOutSegs

Defined in `RFC1213 tcpOutSegs`_

.. _RFC1213 tcpOutSegs: https://tools.ietf.org/html/rfc1213#page-48

The number of packets sent by the TCP layer. As mentioned in RFC1213,
it excludes the retransmitted packets. But it includes the SYN, ACK
and RST packets. Unlike TcpInSegs, TcpOutSegs is aware of
GSO, so if a packet is split into 2 by GSO, TcpOutSegs
increases by 2.

* TcpActiveOpens

Defined in `RFC1213 tcpActiveOpens`_

.. _RFC1213 tcpActiveOpens: https://tools.ietf.org/html/rfc1213#page-47

It means the TCP layer sends a SYN and enters the SYN-SENT
state. Every time TcpActiveOpens increases by 1, TcpOutSegs should
always increase by 1.

* TcpPassiveOpens

Defined in `RFC1213 tcpPassiveOpens`_

.. _RFC1213 tcpPassiveOpens: https://tools.ietf.org/html/rfc1213#page-47

It means the TCP layer receives a SYN, replies with a SYN+ACK, and
enters the SYN-RCVD state.

* TcpExtTCPRcvCoalesce

When packets are received by the TCP layer and have not been read by the
application, the TCP layer will try to merge them. This counter
indicates how many packets are merged in such a situation. If GRO is
enabled, lots of packets would be merged by GRO; these packets
wouldn't be counted in TcpExtTCPRcvCoalesce.

* TcpExtTCPAutoCorking

When sending packets, the TCP layer will try to merge small packets into
a bigger one. This counter increases by 1 for every packet merged in
such a situation. Please refer to the LWN article for more details:
https://lwn.net/Articles/576263/

* TcpExtTCPOrigDataSent

This counter is explained by `kernel commit f19c29e3e391`_; the
explanation is quoted below::

  TCPOrigDataSent: number of outgoing packets with original data (excluding
  retransmission but including data-in-SYN). This counter is different from
  TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is
  more useful to track the TCP retransmission rate.
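
As a sketch of how TcpExtTCPOrigDataSent might be used to track the
retransmission rate, the function below divides the increase of
retransmitted segments by the increase of original data packets between
two snapshots. It assumes the retransmitted segments are taken from the
TcpRetransSegs counter, which is not described in this section; the
function name and the sample numbers are hypothetical.

```python
def retransmission_rate(before, after):
    """Retransmitted segments per original data packet between two snapshots."""
    sent = after["TcpExtTCPOrigDataSent"] - before["TcpExtTCPOrigDataSent"]
    # Assumption: TcpRetransSegs is used as the count of retransmitted segments.
    retrans = after["TcpRetransSegs"] - before["TcpRetransSegs"]
    if sent <= 0:
        return 0.0  # nothing new was sent, so no meaningful rate
    return retrans / sent


if __name__ == "__main__":
    # Made-up snapshots for illustration.
    before = {"TcpExtTCPOrigDataSent": 1000, "TcpRetransSegs": 10}
    after = {"TcpExtTCPOrigDataSent": 3000, "TcpRetransSegs": 60}
    print(retransmission_rate(before, after))
```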

* TCPSynRetrans

This counter is explained by `kernel commit f19c29e3e391`_; the
explanation is quoted below::

  TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
  retransmissions into SYN, fast-retransmits, timeout retransmits, etc.

* TCPFastOpenActiveFail

This counter is explained by `kernel commit f19c29e3e391`_; the
explanation is quoted below::

  TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed because
  the remote does not accept it or the attempts timed out.

.. _kernel commit f19c29e3e391: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f19c29e3e391a66a273e9afebaf01917245148cd

* TcpExtListenOverflows and TcpExtListenDrops

When the kernel receives a SYN from a client and the TCP accept queue
is full, the kernel will drop the SYN and add 1 to TcpExtListenOverflows.
At the same time the kernel will also add 1 to TcpExtListenDrops. When a
TCP socket is in the LISTEN state and the kernel needs to drop a packet,
the kernel always adds 1 to TcpExtListenDrops. So an increase of
TcpExtListenOverflows always increases TcpExtListenDrops at the
same time, but TcpExtListenDrops can also increase without
TcpExtListenOverflows increasing, e.g. a memory allocation failure would
also increase TcpExtListenDrops.

Note: The above explanation is based on kernel 4.10 or later. On
an older kernel, the TCP stack behaves differently when the TCP accept
queue is full: it won't drop the SYN, but will complete the 3-way
handshake. As the accept queue is full, the TCP stack will keep the
socket in the TCP half-open queue. As it is in the half-open queue, the
TCP stack will send SYN+ACK on an exponential backoff timer. After the
client replies with an ACK, the TCP stack checks whether the accept
queue is still full. If it is not full, it moves the socket to the
accept queue; if it is full, it keeps the socket in the half-open queue,
and the next time the client replies with an ACK, this socket will get
another chance to move to the accept queue.

TCP Fast Open
=============
* TcpEstabResets

Defined in `RFC1213 tcpEstabResets`_.

.. _RFC1213 tcpEstabResets: https://tools.ietf.org/html/rfc1213#page-48

* TcpAttemptFails

Defined in `RFC1213 tcpAttemptFails`_.

.. _RFC1213 tcpAttemptFails: https://tools.ietf.org/html/rfc1213#page-48

* TcpOutRsts

Defined in `RFC1213 tcpOutRsts`_. The RFC says this counter indicates
the 'segments sent containing the RST flag', but in the Linux kernel,
this counter indicates the segments the kernel tried to send. The
sending process might fail due to some errors (e.g. a memory allocation
failure).

.. _RFC1213 tcpOutRsts: https://tools.ietf.org/html/rfc1213#page-52

* TcpExtTCPSpuriousRtxHostQueues

When the TCP stack wants to retransmit a packet and finds that the
packet is not lost in the network but has not been sent yet, the TCP
stack gives up the retransmission and updates this counter. It
might happen if a packet stays too long in a qdisc or driver
queue.

* TcpEstabResets

The socket receives a RST packet in the Established or CloseWait state.

* TcpExtTCPKeepAlive

This counter indicates how many keepalive packets were sent. Keepalive
is not enabled by default. A userspace program can enable it by
setting the SO_KEEPALIVE socket option.
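
A minimal sketch of enabling keepalive from userspace, using Python's
standard socket module. The helper name ``enable_keepalive`` is
hypothetical; keepalive probes are only sent on a connected, idle socket,
so this snippet by itself will not change TcpExtTCPKeepAlive.

```python
import socket


def enable_keepalive(sock):
    """Turn on SO_KEEPALIVE and return the value read back from the socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("before:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
        print("after:", enable_keepalive(s))
```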

* TcpExtTCPSpuriousRTOs

The spurious retransmission timeout detected by the `F-RTO`_
algorithm.

.. _F-RTO: https://tools.ietf.org/html/rfc5682

TCP Fast Path
=============
When the kernel receives a TCP packet, it has two paths to handle the
packet: one is the fast path, the other is the slow path. The comment in
the kernel code provides a good explanation of them; it is quoted below::

  It is split into a fast path and a slow path. The fast path is
  disabled when:

  - A zero window was announced from us
  - zero window probing
    is only handled properly on the slow path.
  - Out of order segments arrived.
  - Urgent data is expected.
  - There is no buffer space left
  - Unexpected TCP flags/window values/header lengths are received
    (detected by checking the TCP header against pred_flags)
  - Data is sent in both directions. The fast path only supports pure senders
    or pure receivers (this means either the sequence number or the ack
    value must stay constant)
  - Unexpected TCP option.

The kernel will try to use the fast path unless any of the above
conditions is satisfied. If the packets are out of order, the kernel
will handle them in the slow path, which means the performance might not
be very good. The kernel also enters the slow path if "Delayed ack" is
used, because when using "Delayed ack", the data is sent in both
directions. When the TCP window scale option is not used, the kernel will
try to enable the fast path immediately when the connection enters the
established state, but if the TCP window scale option is used, the kernel
will disable the fast path at first, and try to enable it after the
kernel receives packets.

* TcpExtTCPPureAcks and TcpExtTCPHPAcks

If a packet has the ACK flag set and has no data, it is a pure ACK
packet. If the kernel handles it in the fast path, TcpExtTCPHPAcks
increases by 1; if the kernel handles it in the slow path,
TcpExtTCPPureAcks increases by 1.

* TcpExtTCPHPHits

If a TCP packet has data (which means it is not a pure ACK packet),
and this packet is handled in the fast path, TcpExtTCPHPHits
increases by 1.
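
The three fast path counters can be summarised as a small decision table.
The sketch below only restates the rules given in this section; the
function name ``fast_path_counter`` is hypothetical and the code is not
taken from the kernel.

```python
def fast_path_counter(has_data, fast_path):
    """Name of the counter expected to increase for one received packet.

    Returns None for a data packet handled in the slow path, since this
    section does not name a counter for that case.
    """
    if not has_data:
        # Pure ACK: the counter depends on which path handled it.
        return "TcpExtTCPHPAcks" if fast_path else "TcpExtTCPPureAcks"
    return "TcpExtTCPHPHits" if fast_path else None


if __name__ == "__main__":
    for has_data in (False, True):
        for fast_path in (True, False):
            print(has_data, fast_path, fast_path_counter(has_data, fast_path))
```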

TCP abort
=========
* TcpExtTCPAbortOnData

It means the TCP layer has data in flight, but needs to close the
connection. So the TCP layer sends a RST to the other side, indicating
the connection is not closed gracefully. An easy way to increase this
counter is using the SO_LINGER option. Please refer to the SO_LINGER
section of the `socket man page`_:

.. _socket man page: http://man7.org/linux/man-pages/man7/socket.7.html

By default, when an application closes a connection, the close function
returns immediately and the kernel tries to send the in-flight data
asynchronously. If you use the SO_LINGER option, set l_onoff to 1, and
l_linger to a positive number, the close function won't return
immediately, but waits until the in-flight data is acked by the other
side; the max wait time is l_linger seconds. If you set l_onoff to 1 and
l_linger to 0, when the application closes a connection, the kernel will
send a RST immediately and increase the TcpExtTCPAbortOnData counter.
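
A minimal sketch of configuring SO_LINGER from Python follows. It assumes
the Linux layout of struct linger (two native ints, l_onoff then
l_linger); the helper names ``set_linger`` and ``get_linger`` are
hypothetical. The snippet only sets and reads back the option, it does not
open a connection, so it will not change any counter on its own.

```python
import socket
import struct

LINGER_FORMAT = "ii"  # assumed layout of struct linger: l_onoff, l_linger


def set_linger(sock, onoff, seconds):
    """Set SO_LINGER; (1, 0) requests an immediate RST on close."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack(LINGER_FORMAT, onoff, seconds))


def get_linger(sock):
    """Return the current (l_onoff, l_linger) pair."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                          struct.calcsize(LINGER_FORMAT))
    return struct.unpack(LINGER_FORMAT, raw)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        set_linger(s, 1, 0)
        print(get_linger(s))
```

Calling ``set_linger(sock, 1, 0)`` on a connected socket before closing it
is the case described above in which the kernel sends a RST immediately.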

* TcpExtTCPAbortOnClose

This counter means the application has unread data in the TCP layer when
the application wants to close the TCP connection. In such a situation,
the kernel will send a RST to the other side of the TCP connection.

* TcpExtTCPAbortOnMemory

When an application closes a TCP connection, the kernel still needs to
track the connection and let it complete the TCP disconnect process.
E.g. an app calls the close method of a socket, the kernel sends a FIN to
the other side of the connection, and then the app has no relationship
with the socket any more, but the kernel needs to keep the socket. This
socket becomes an orphan socket; the kernel waits for the reply of the
other side, and finally enters the TIME_WAIT state. When the kernel does
not have enough memory to keep the orphan socket, it sends a RST to
the other side and deletes the socket. In such a situation, the kernel
adds 1 to TcpExtTCPAbortOnMemory. Two conditions would trigger
TcpExtTCPAbortOnMemory:

1. the memory used by the TCP protocol is higher than the third value of
   tcp_mem. Please refer to the tcp_mem section in the `TCP man page`_:

.. _TCP man page: http://man7.org/linux/man-pages/man7/tcp.7.html

2. the orphan socket count is higher than net.ipv4.tcp_max_orphans
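
The two conditions can be written down as a simple check. This is a
simplified sketch of the rule as stated above, not the kernel's actual
logic; it assumes tcp_mem is given as the three whitespace-separated page
counts found in the net.ipv4.tcp_mem sysctl, and all function and
parameter names are hypothetical.

```python
def parse_tcp_mem(text):
    """Split a tcp_mem string into its three values, measured in pages."""
    low, pressure, high = (int(field) for field in text.split())
    return low, pressure, high


def orphan_abort_expected(tcp_pages_in_use, tcp_mem_text,
                          orphan_count, tcp_max_orphans):
    """True if either condition described above is met."""
    _, _, high = parse_tcp_mem(tcp_mem_text)
    over_memory = tcp_pages_in_use > high          # condition 1
    too_many_orphans = orphan_count > tcp_max_orphans  # condition 2
    return over_memory or too_many_orphans


if __name__ == "__main__":
    # Made-up numbers for illustration.
    print(orphan_abort_expected(100, "10 50 90", 0, 10))
    print(orphan_abort_expected(80, "10 50 90", 0, 10))
```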

* TcpExtTCPAbortOnTimeout

This counter will increase when any of the TCP timers expire. In such a
situation, the kernel won't send a RST, it just gives up the connection.

* TcpExtTCPAbortOnLinger

When a TCP connection enters the FIN_WAIT_2 state, instead of waiting
for the FIN packet from the other side, the kernel could send a RST and
delete the socket immediately. This is not the default behavior of the
Linux kernel TCP stack. By configuring the TCP_LINGER2 socket option,
you could let the kernel follow this behavior.
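
A minimal sketch of setting TCP_LINGER2 from Python on Linux. The text
above does not say which value selects this behavior; the sketch assumes
that a negative value does, which should be verified against the kernel
version in use. The helper name ``request_abort_on_linger`` is
hypothetical.

```python
import socket


def request_abort_on_linger(sock):
    """Set TCP_LINGER2 to a negative value and return the value read back.

    Assumption: a negative TCP_LINGER2 asks the kernel to send a RST
    instead of keeping the socket in FIN_WAIT_2 after close.
    """
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_LINGER2, -1)
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_LINGER2)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print("TCP_LINGER2:", request_abort_on_linger(s))
```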
52380cc4950Syupeng
52480cc4950Syupeng* TcpExtTCPAbortFailed
525ae5220c6SRandy Dunlap
52680cc4950SyupengThe kernel TCP layer will send RST if the `RFC2525 2.17 section`_ is
52780cc4950Syupengsatisfied. If an internal error occurs during this process,
52880cc4950SyupengTcpExtTCPAbortFailed will be increased.
52980cc4950Syupeng
53080cc4950Syupeng.. _RFC2525 2.17 section: https://tools.ietf.org/html/rfc2525#page-50

TCP Hybrid Slow Start
=====================
The Hybrid Slow Start algorithm is an enhancement of the traditional
TCP congestion window Slow Start algorithm. It uses two pieces of
information to detect whether the max bandwidth of the TCP path is
being approached: the ACK train length and the increase in packet
delay. For detailed information, please refer to the
`Hybrid Slow Start paper`_. When either the ACK train length or the
packet delay hits a specific threshold, the congestion control
algorithm enters the Congestion Avoidance state. As of v4.20, two
congestion control algorithms use Hybrid Slow Start: cubic (the
default congestion control algorithm) and cdg. Four SNMP counters
relate to the Hybrid Slow Start algorithm.

.. _Hybrid Slow Start paper: https://pdfs.semanticscholar.org/25e9/ef3f03315782c7f1cbcd31b587857adae7d1.pdf

* TcpExtTCPHystartTrainDetect

How many times the ACK train length threshold is detected.

* TcpExtTCPHystartTrainCwnd

The sum of CWND detected by ACK train length. Dividing this value by
TcpExtTCPHystartTrainDetect gives the average CWND detected by the
ACK train length.

* TcpExtTCPHystartDelayDetect

How many times the packet delay threshold is detected.

* TcpExtTCPHystartDelayCwnd

The sum of CWND detected by packet delay. Dividing this value by
TcpExtTCPHystartDelayDetect gives the average CWND detected by the
packet delay.
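
The averages described for the Hybrid Slow Start counters can be
computed from a counter snapshot. The following is a minimal sketch:
the helper name average_hystart_cwnd and the counters mapping are
assumptions made for illustration, and the sample values are made up.

```python
def average_hystart_cwnd(counters, kind):
    """Average CWND at which Hybrid Slow Start fired for one detection kind.

    counters: mapping of SNMP counter name to integer value (hypothetical
              input, e.g. collected from nstat output).
    kind:     "Train" (ACK train length) or "Delay" (packet delay).
    """
    detect = counters.get("TcpExtTCPHystart%sDetect" % kind, 0)
    cwnd_sum = counters.get("TcpExtTCPHystart%sCwnd" % kind, 0)
    if detect == 0:
        return None  # nothing detected yet, so the average is undefined
    return cwnd_sum / detect


# Made-up sample values, for illustration only.
sample = {
    "TcpExtTCPHystartTrainDetect": 4,
    "TcpExtTCPHystartTrainCwnd": 200,
}
print(average_hystart_cwnd(sample, "Train"))
print(average_hystart_cwnd(sample, "Delay"))
```

Returning None when the Detect counter is zero avoids a division by
zero on hosts where the threshold has never been hit.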

TCP retransmission and congestion control
=========================================
The TCP protocol has two retransmission mechanisms: SACK and fast
recovery. They are mutually exclusive. When SACK is enabled,
the kernel TCP stack uses SACK; otherwise it uses fast
recovery. SACK is a TCP option, which is defined in `RFC2018`_.
Fast recovery, which is also called 'Reno', is defined in `RFC6582`_
(that RFC specifies the NewReno modification of fast recovery).

TCP congestion control is a big and complex topic. To understand
the related SNMP counters, we need to know the states of the congestion
control state machine. There are 5 states: Open, Disorder, CWR,
Recovery and Loss. For details about these states, please refer to
page 5 and page 6 of this document:
https://pdfs.semanticscholar.org/0e9c/968d09ab2e53e24c4dca5b2d67c7f7140f8e.pdf

.. _RFC2018: https://tools.ietf.org/html/rfc2018
.. _RFC6582: https://tools.ietf.org/html/rfc6582

* TcpExtTCPRenoRecovery and TcpExtTCPSackRecovery

When the congestion control enters the Recovery state, if SACK is
used, TcpExtTCPSackRecovery increases by 1; if SACK is not used,
TcpExtTCPRenoRecovery increases by 1. These two counters mean the TCP
stack begins to retransmit the lost packets.

* TcpExtTCPSACKReneging

A packet was acknowledged by SACK, but the receiver has dropped this
packet, so the sender needs to retransmit this packet. In this
situation, the sender adds 1 to TcpExtTCPSACKReneging. A receiver
could drop a packet which has been acknowledged by SACK; although it is
unusual, it is allowed by the TCP protocol. The sender doesn't really
know what happened on the receiver side. The sender just waits until
the RTO expires for this packet, then the sender assumes this packet
has been dropped by the receiver.

* TcpExtTCPRenoReorder

A reordered packet is detected by fast recovery. This counter is only
used if SACK is disabled. The fast recovery algorithm detects
reordering by the number of duplicate ACKs. E.g., if a retransmission
is triggered, and the originally transmitted packet is not lost but
just out of order, the receiver would acknowledge multiple times, once
for the retransmitted packet and once for the arrival of the original
out of order packet. Thus the sender would see more ACKs than it
expects, and it knows that reordering has occurred.

* TcpExtTCPTSReorder

A reordered packet is detected when a hole is filled. E.g., assume the
sender sends packets 1,2,3,4,5, and the receiving order is
1,2,4,5,3. When the sender receives the ACK of packet 3 (which
fills the hole), either of two conditions makes TcpExtTCPTSReorder
increase by 1: (1) packet 3 has not been retransmitted yet; (2) packet
3 has been retransmitted but the timestamp of packet 3's ACK is earlier
than the retransmission timestamp.
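
The two TcpExtTCPTSReorder conditions can be written as a small
predicate. This is only a sketch of the rule as stated in this
section, not the kernel implementation: the function and parameter
names are hypothetical, and timestamps are treated as plain integers
(32-bit wrap-around is ignored).

```python
def hole_fill_counts_as_reorder(retransmitted,
                                ack_timestamp=None,
                                retrans_timestamp=None):
    """Decide whether an ACK that fills a hole indicates reordering.

    retransmitted:     True if the hole-filling packet was retransmitted.
    ack_timestamp:     timestamp carried by the ACK of that packet.
    retrans_timestamp: timestamp at which the retransmission was sent.
    """
    if not retransmitted:
        # Condition (1): the original packet arrived late by itself.
        return True
    if ack_timestamp is None or retrans_timestamp is None:
        return False  # without timestamps the ACK is ambiguous
    # Condition (2): the ACK predates the retransmission, so it must
    # acknowledge the original transmission.
    return ack_timestamp < retrans_timestamp


print(hole_fill_counts_as_reorder(False))
print(hole_fill_counts_as_reorder(True, ack_timestamp=100, retrans_timestamp=120))
print(hole_fill_counts_as_reorder(True, ack_timestamp=130, retrans_timestamp=120))
```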

* TcpExtTCPSACKReorder

A reordered packet is detected by SACK. SACK has two methods to
detect reordering: (1) A DSACK is received by the sender. It means the
sender sent the same packet more than once, and the only reason
is that the sender believed an out of order packet was lost, so it sent
the packet again. (2) Assume packets 1,2,3,4,5 are sent by the sender,
and the sender has received SACKs for packets 2 and 5. If the sender
now receives a SACK for packet 4 and has not retransmitted that
packet yet, the sender knows packet 4 is out of order. The kernel TCP
stack increases TcpExtTCPSACKReorder in both of the
above scenarios.

* TcpExtTCPSlowStartRetrans

The TCP stack wants to retransmit a packet and the congestion control
state is 'Loss'.

* TcpExtTCPFastRetrans

The TCP stack wants to retransmit a packet and the congestion control
state is not 'Loss'.

* TcpExtTCPLostRetransmit

A SACK points out that a retransmitted packet is lost again.

* TcpExtTCPRetransFail

The TCP stack tries to deliver a retransmitted packet to the lower
layers but the lower layers return an error.

* TcpExtTCPSynRetrans

The TCP stack retransmits a SYN packet.

DSACK
=====
DSACK is defined in `RFC2883`_. The receiver uses DSACK to report
duplicate packets to the sender. There are two kinds of
duplication: (1) a packet which has already been acknowledged is
duplicated; (2) an out of order packet is duplicated. The TCP stack
counts these two kinds of duplication on both the receiver side and
the sender side.

.. _RFC2883: https://tools.ietf.org/html/rfc2883

* TcpExtTCPDSACKOldSent

The TCP stack receives a duplicate packet which has been acked, so it
sends a DSACK to the sender.

* TcpExtTCPDSACKOfoSent

The TCP stack receives an out of order duplicate packet, so it sends a
DSACK to the sender.

* TcpExtTCPDSACKRecv

The TCP stack receives a DSACK, which indicates that an acknowledged
duplicate packet was received.

* TcpExtTCPDSACKOfoRecv

The TCP stack receives a DSACK, which indicates that an out of order
duplicate packet was received.

invalid SACK and DSACK
======================
When a SACK (or DSACK) block is invalid, a corresponding counter is
updated. The validation method is based on the start/end sequence
numbers of the SACK block. For more details, please refer to the comment
of the function tcp_is_sackblock_valid in the kernel source code. A
SACK option can have up to 4 blocks, which are checked
individually. E.g., if 3 blocks of a SACK are invalid, the
corresponding counter is updated 3 times. The comment of the
`Add counters for discarded SACK blocks`_ patch has additional
explanation.

.. _Add counters for discarded SACK blocks: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18f02545a9a16c9a89778b91a162ad16d510bb32

* TcpExtTCPSACKDiscard

This counter indicates how many SACK blocks are invalid. If the invalid
SACK block is caused by ACK reordering, the TCP stack will only ignore
it and won't update this counter.

* TcpExtTCPDSACKIgnoredOld and TcpExtTCPDSACKIgnoredNoUndo

When a DSACK block is invalid, one of these two counters is
updated. Which counter is updated depends on the undo_marker flag
of the TCP socket. If the undo_marker is not set, the TCP stack is
unlikely to have retransmitted any packets, yet it still received an
invalid DSACK block; the reason might be that the packet was duplicated
in the middle of the network. In such a scenario,
TcpExtTCPDSACKIgnoredNoUndo is updated. If the undo_marker is set,
TcpExtTCPDSACKIgnoredOld is updated. As implied by its name, it might
be an old packet.

SACK shift
==========
The Linux networking stack stores data in the sk_buff struct (skb for
short). If a SACK block spans multiple skbs, the TCP stack will try
to re-arrange the data in these skbs. E.g., if a SACK block
acknowledges seq 10 to 15, skb1 has seq 10 to 13, and skb2 has seq 14
to 20, then seq 14 and 15 in skb2 are moved to skb1. This operation is
a 'shift'. If a SACK block acknowledges seq 10 to 20, skb1 has seq 10
to 13, and skb2 has seq 14 to 20, all data in skb2 is moved to skb1
and skb2 is discarded. This operation is a 'merge'.
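
The 'shift' and 'merge' examples can be reproduced with a toy model.
The sketch below is not the kernel code: the function name is
hypothetical, ranges are inclusive (start, end) tuples as in the
example above, and it only handles a SACK block that covers all of
skb1 and reaches into the directly following skb2.

```python
def apply_sack_block(sack, skb1, skb2):
    """Return (operation, new_skb1, new_skb2) for two adjacent skbs.

    All ranges are inclusive (start, end) sequence-number tuples and
    skb2 is assumed to start right after skb1 ends.
    """
    sack_start, sack_end = sack
    start1, _end1 = skb1
    start2, end2 = skb2
    if sack_start > start1 or sack_end < start2:
        # The block does not cover skb1 and reach into skb2.
        return ("none", skb1, skb2)
    if sack_end >= end2:
        # Every byte of skb2 is acknowledged: fold it into skb1.
        return ("merge", (start1, end2), None)
    # Only the head of skb2 is acknowledged: move that part to skb1.
    return ("shift", (start1, sack_end), (sack_end + 1, end2))


print(apply_sack_block((10, 15), (10, 13), (14, 20)))
print(apply_sack_block((10, 20), (10, 13), (14, 20)))
```

In this model a 'shift' leaves a shorter skb2 behind, while a 'merge'
removes skb2 entirely.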

* TcpExtTCPSackShifted

An skb is shifted.

* TcpExtTCPSackMerged

An skb is merged.

* TcpExtTCPSackShiftFallback

An skb should be shifted or merged, but the TCP stack doesn't do it for
some reason.

TCP out of order
================
* TcpExtTCPOFOQueue

The TCP layer receives an out of order packet and has enough memory
to queue it.

* TcpExtTCPOFODrop

The TCP layer receives an out of order packet but doesn't have enough
memory, so it drops the packet. Such packets are not counted in
TcpExtTCPOFOQueue.

* TcpExtTCPOFOMerge

The received out of order packet overlaps with the previous
packet. The overlapping part is dropped. All TcpExtTCPOFOMerge
packets are also counted in TcpExtTCPOFOQueue.

TCP PAWS
========
PAWS (Protection Against Wrapped Sequence numbers) is an algorithm
which is used to drop old packets. It depends on TCP
timestamps. For detailed information, please refer to the
`timestamp wiki`_ and the `RFC of PAWS`_.

.. _RFC of PAWS: https://tools.ietf.org/html/rfc1323#page-17
.. _timestamp wiki: https://en.wikipedia.org/wiki/Transmission_Control_Protocol#TCP_timestamps

* TcpExtPAWSActive

Packets are dropped by PAWS in the Syn-Sent state.

* TcpExtPAWSEstab

Packets are dropped by PAWS in any state other than Syn-Sent.
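
The heart of PAWS is a comparison of 32-bit timestamps that tolerates
wrap-around. The sketch below shows only that comparison, assuming the
most recently accepted timestamp and the timestamp of the incoming
segment are available as integers; the function name is hypothetical
and the additional conditions applied by the kernel are omitted.

```python
def paws_reject(ts_recent, seg_tsval):
    """Return True if seg_tsval is older than ts_recent modulo 2**32.

    ts_recent: timestamp value most recently accepted from the peer.
    seg_tsval: timestamp value carried by the incoming segment.
    """
    diff = (ts_recent - seg_tsval) & 0xFFFFFFFF
    # A difference in (0, 2**31) means ts_recent is ahead of seg_tsval,
    # i.e. the segment is old. Larger differences mean seg_tsval is
    # actually newer and the subtraction wrapped around.
    return 0 < diff < 0x80000000


print(paws_reject(100, 90))          # older segment
print(paws_reject(100, 100))         # same timestamp
print(paws_reject(0xFFFFFFF0, 5))    # newer segment after wrap-around
```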

TCP ACK skip
============
In some scenarios, the kernel avoids sending duplicate ACKs too
frequently. Please find more details in the tcp_invalid_ratelimit
section of the `sysctl document`_. When the kernel decides to skip an
ACK due to tcp_invalid_ratelimit, it updates one of the counters below
to indicate in which scenario the ACK was skipped. The ACK
is only skipped if the received packet is either a SYN packet or
has no data.

.. _sysctl document: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.rst

* TcpExtTCPACKSkippedSynRecv

The ACK is skipped in the Syn-Recv state. The Syn-Recv state means the
TCP stack has received a SYN and replied with a SYN+ACK, and is now
waiting for an ACK. Generally, the TCP stack doesn't need to send an ACK
in the Syn-Recv state. But in several scenarios it needs
to send one. E.g., the TCP stack receives the same SYN packet
repeatedly, the received packet does not pass the PAWS check, or the
sequence number of the received packet is out of window. If the ACK
sending frequency is higher than tcp_invalid_ratelimit allows, the TCP
stack skips sending the ACK and increases TcpExtTCPACKSkippedSynRecv.

* TcpExtTCPACKSkippedPAWS

The ACK is skipped because the PAWS (Protection Against Wrapped
Sequence numbers) check fails. If the PAWS check fails in the Syn-Recv,
Fin-Wait-2 or Time-Wait state, the skipped ACK is counted in
TcpExtTCPACKSkippedSynRecv, TcpExtTCPACKSkippedFinWait2 or
TcpExtTCPACKSkippedTimeWait respectively. In all other states, the
skipped ACK is counted in TcpExtTCPACKSkippedPAWS.

* TcpExtTCPACKSkippedSeq

The sequence number is out of window, the timestamp passes the PAWS
check, and the TCP state is not Syn-Recv, Fin-Wait-2 or Time-Wait.

* TcpExtTCPACKSkippedFinWait2

The ACK is skipped in the Fin-Wait-2 state. The reason is either that
the PAWS check fails or that the received sequence number is out of
window.

* TcpExtTCPACKSkippedTimeWait

The ACK is skipped in the Time-Wait state. The reason is either that
the PAWS check fails or that the received sequence number is out of
window.

* TcpExtTCPACKSkippedChallenge

The ACK is skipped if the ACK is a challenge ACK. RFC 5961 defines
three kinds of challenge ACK; please refer to `RFC 5961 section 3.2`_,
`RFC 5961 section 4.2`_ and `RFC 5961 section 5.2`_. Besides these
three scenarios, in some TCP states the Linux TCP stack also
sends challenge ACKs if the ACK number is before the first
unacknowledged number (stricter than `RFC 5961 section 5.2`_).

.. _RFC 5961 section 3.2: https://tools.ietf.org/html/rfc5961#page-7
.. _RFC 5961 section 4.2: https://tools.ietf.org/html/rfc5961#page-9
.. _RFC 5961 section 5.2: https://tools.ietf.org/html/rfc5961#page-11
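
The way a skipped ACK is attributed to one of the counters in this
section can be summarized as a lookup. This is a sketch based only on
the descriptions above: the function name, the state strings and the
reason strings are hypothetical, and classifying challenge ACKs before
looking at the state is an assumption of this sketch.

```python
# States that have a dedicated counter, as described in this section.
STATE_COUNTERS = {
    "Syn-Recv": "TcpExtTCPACKSkippedSynRecv",
    "Fin-Wait-2": "TcpExtTCPACKSkippedFinWait2",
    "Time-Wait": "TcpExtTCPACKSkippedTimeWait",
}


def skipped_ack_counter(state, reason):
    """Name the counter updated when a rate-limited ACK is skipped.

    state:  TCP state name, e.g. "Syn-Recv" or "Established".
    reason: "paws" (PAWS check failed), "seq" (sequence number out of
            window) or "challenge" (the ACK was a challenge ACK).
    """
    if reason == "challenge":
        # Assumption: challenge ACKs are classified first.
        return "TcpExtTCPACKSkippedChallenge"
    if state in STATE_COUNTERS:
        return STATE_COUNTERS[state]
    if reason == "paws":
        return "TcpExtTCPACKSkippedPAWS"
    if reason == "seq":
        return "TcpExtTCPACKSkippedSeq"
    raise ValueError("unknown reason: %r" % (reason,))


print(skipped_ack_counter("Fin-Wait-2", "paws"))
print(skipped_ack_counter("Established", "seq"))
```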

TCP receive window
==================
* TcpExtTCPWantZeroWindowAdv

Depending on current memory usage, the TCP stack tries to set the
receive window to zero. But the receive window might still be a
non-zero value. For example, if the previous window size is 10, and the
TCP stack receives 3 bytes, the current window size would be 7 even if
the window size calculated from the memory usage is zero.

* TcpExtTCPToZeroWindowAdv

The TCP receive window is set to zero from a non-zero value.

* TcpExtTCPFromZeroWindowAdv

The TCP receive window is set to a non-zero value from zero.
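
The TcpExtTCPWantZeroWindowAdv example (previous window 10, 3 bytes
received, resulting window 7) can be expressed as a one-line rule. The
sketch below assumes the window already offered to the peer is never
taken back; the function and parameter names are hypothetical.

```python
def advertised_window(prev_window, bytes_received, wanted_window):
    """Window advertised after bytes_received bytes have arrived.

    prev_window:    window advertised before the data arrived.
    bytes_received: in-window bytes received since then.
    wanted_window:  window the stack would like to offer, e.g. zero
                    when memory is tight.
    """
    still_offered = prev_window - bytes_received
    # What was already promised to the peer is not pulled back, so the
    # advertised window cannot drop below what is still offered.
    return max(wanted_window, still_offered)


print(advertised_window(10, 3, 0))
```

With these inputs the stack wanted a zero window, but the advertised
value stays at 7.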

Delayed ACK
===========
TCP Delayed ACK is a technique which is used to reduce the
packet count in the network. For more details, please refer to the
`Delayed ACK wiki`_.

.. _Delayed ACK wiki: https://en.wikipedia.org/wiki/TCP_delayed_acknowledgment

* TcpExtDelayedACKs

A delayed ACK timer expires. The TCP stack will send a pure ACK packet
and exit the delayed ACK mode.

* TcpExtDelayedACKLocked

A delayed ACK timer expires, but the TCP stack can't send an ACK
immediately because the socket is locked by a userspace program. The
TCP stack will send a pure ACK later (after the userspace program
unlocks the socket). When the TCP stack sends the pure ACK later, it
will also update TcpExtDelayedACKs and exit the delayed ACK
mode.

* TcpExtDelayedACKLost

It is updated when the TCP stack receives a packet which has already
been ACKed. A lost delayed ACK might cause this, but it can also be
triggered by other reasons, such as a packet being duplicated in the
network.

Tail Loss Probe (TLP)
=====================
TLP is an algorithm which is used to detect TCP packet loss. For more
details, please refer to the `TLP paper`_.

.. _TLP paper: https://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01

* TcpExtTCPLossProbes

A TLP probe packet is sent.

* TcpExtTCPLossProbeRecovery

A packet loss is detected and recovered by TLP.

TCP Fast Open description
=========================
TCP Fast Open is a technology which allows data transfer before the
3-way handshake completes. Please refer to the `TCP Fast Open wiki`_
for a general description.

.. _TCP Fast Open wiki: https://en.wikipedia.org/wiki/TCP_Fast_Open

* TcpExtTCPFastOpenActive

When the TCP stack receives an ACK packet in the SYN-SENT state, and
the ACK packet acknowledges the data in the SYN packet, the TCP stack
understands that the TFO cookie was accepted by the other side, and it
updates this counter.

* TcpExtTCPFastOpenActiveFail

This counter indicates that the TCP stack initiated a TCP Fast Open,
but it failed. This counter is updated in three scenarios: (1)
the other side doesn't acknowledge the data in the SYN packet; (2) the
SYN packet which has the TFO cookie times out at least once; (3)
after the 3-way handshake, the retransmission timeout happens
net.ipv4.tcp_retries1 times, because some middle-boxes may black-hole
fast open after the handshake.

* TcpExtTCPFastOpenPassive

This counter indicates how many times the TCP stack accepts the fast
open request.

* TcpExtTCPFastOpenPassiveFail

This counter indicates how many times the TCP stack rejects the fast
open request. The cause is either that the TFO cookie is invalid or
that the TCP stack finds an error during the socket creation process.

* TcpExtTCPFastOpenListenOverflow

When the number of pending fast open requests is larger than
fastopenq->max_qlen, the TCP stack rejects the fast open request
and updates this counter. When this counter is updated, the TCP stack
won't update TcpExtTCPFastOpenPassive or
TcpExtTCPFastOpenPassiveFail. The fastopenq->max_qlen is set by the
TCP_FASTOPEN socket option and it cannot be larger than
net.core.somaxconn. For example::

  setsockopt(sfd, SOL_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen));

* TcpExtTCPFastOpenCookieReqd

This counter indicates how many times a client wants to request a TFO
cookie.

SYN cookies
===========
SYN cookies are used to mitigate SYN floods. For details, please refer
to the `SYN cookies wiki`_.

.. _SYN cookies wiki: https://en.wikipedia.org/wiki/SYN_cookies

* TcpExtSyncookiesSent

It indicates how many SYN cookies are sent.

* TcpExtSyncookiesRecv

How many reply packets to the SYN cookies the TCP stack receives.

* TcpExtSyncookiesFailed

The MSS decoded from the SYN cookie is invalid. When this counter is
updated, the received packet won't be treated as a SYN cookie and the
TcpExtSyncookiesRecv counter won't be updated.

Challenge ACK
=============
For details of challenge ACK, please refer to the explanation of
TcpExtTCPACKSkippedChallenge.

* TcpExtTCPChallengeACK

The number of challenge ACKs sent.

* TcpExtTCPSYNChallenge

The number of challenge ACKs sent in response to SYN packets. After
updating this counter, the TCP stack might send a challenge ACK and
update the TcpExtTCPChallengeACK counter, or it might skip
sending the challenge and update the TcpExtTCPACKSkippedChallenge
counter.

prune
=====
When a socket is under memory pressure, the TCP stack will try to
reclaim memory from the receive queue and the out of order queue. One
of the reclaiming methods is 'collapse', which means allocating a big
skb, copying the contiguous skbs to the single big skb, and freeing
these contiguous skbs.

* TcpExtPruneCalled

The TCP stack tries to reclaim memory for a socket. After updating this
counter, the TCP stack will try to collapse the out of order queue and
the receive queue. If the memory is still not enough, the TCP stack
will try to discard packets from the out of order queue (and update the
TcpExtOfoPruned counter).

* TcpExtOfoPruned

The TCP stack tries to discard packets from the out of order queue.

* TcpExtRcvPruned

After 'collapse' and discarding packets from the out of order queue, if
the actually used memory is still larger than the max allowed memory,
this counter is updated. It means the 'prune' fails.

* TcpExtTCPRcvCollapsed

This counter indicates how many skbs are freed during 'collapse'.
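
The order in which the prune counters are touched can be modeled with
a few lines. This is a simplified sketch of the sequence described in
this section, not the kernel logic: memory is reduced to plain byte
counts and every name is hypothetical.

```python
def prune(used, limit, collapse_gain, ofo_bytes):
    """Model one prune attempt and return (used_after, counters).

    used:          bytes currently charged to the socket.
    limit:         maximum allowed bytes.
    collapse_gain: bytes saved by collapsing the queues.
    ofo_bytes:     bytes freed by dropping the out of order queue.
    """
    counters = {"TcpExtPruneCalled": 1,
                "TcpExtOfoPruned": 0,
                "TcpExtRcvPruned": 0}
    used -= collapse_gain            # step 1: collapse both queues
    if used > limit:
        counters["TcpExtOfoPruned"] += 1
        used -= ofo_bytes            # step 2: drop out of order packets
    if used > limit:
        counters["TcpExtRcvPruned"] += 1   # step 3: the prune failed
    return used, counters


print(prune(100, 80, 30, 10))
print(prune(100, 80, 5, 5))
```

The first call succeeds after 'collapse' alone, while the second one
ends with TcpExtRcvPruned updated because memory is still above the
limit.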

examples
========

ping test
---------
Run the ping command against the public DNS server 8.8.8.8::

  nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
  PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
  64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=17.8 ms

  --- 8.8.8.8 ping statistics ---
  1 packets transmitted, 1 received, 0% packet loss, time 0ms
  rtt min/avg/max/mdev = 17.875/17.875/17.875/0.000 ms

The nstat result::

  nstatuser@nstat-a:~$ nstat
  #kernel
  IpInReceives                    1                  0.0
  IpInDelivers                    1                  0.0
  IpOutRequests                   1                  0.0
  IcmpInMsgs                      1                  0.0
  IcmpInEchoReps                  1                  0.0
  IcmpOutMsgs                     1                  0.0
  IcmpOutEchos                    1                  0.0
  IcmpMsgInType0                  1                  0.0
  IcmpMsgOutType8                 1                  0.0
  IpExtInOctets                   84                 0.0
  IpExtOutOctets                  84                 0.0
  IpExtInNoECTPkts                1                  0.0

The Linux server sent an ICMP Echo packet, so IpOutRequests,
IcmpOutMsgs, IcmpOutEchos and IcmpMsgOutType8 were increased by 1. The
server got an ICMP Echo Reply from 8.8.8.8, so IpInReceives, IcmpInMsgs,
IcmpInEchoReps and IcmpMsgInType0 were increased by 1. The ICMP Echo
Reply was passed to the ICMP layer via the IP layer, so IpInDelivers was
increased by 1. The default ping data size is 56 bytes (the "56(84)" in
the ping output), so an ICMP Echo packet and its corresponding Echo
Reply packet consist of:

* 14 bytes MAC header
* 20 bytes IP header
* 8 bytes ICMP header
* 56 bytes data (default value of the ping command)

The MAC header is not counted by the IP layer, so IpExtInOctets and
IpExtOutOctets are 20+8+56=84.
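
When comparing counters across test runs, it can help to load the
nstat table into a dictionary. The following is a minimal sketch that
assumes the three-column layout shown in the ping test output (counter
name, value, and a third numeric column); the helper name parse_nstat
is hypothetical and the sample text is an excerpt of that output.

```python
def parse_nstat(text):
    """Parse nstat-style output into a {counter name: int value} dict."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the "#kernel" header, shell prompts and blank lines.
        if len(fields) != 3 or fields[0].startswith("#"):
            continue
        try:
            counters[fields[0]] = int(fields[1])
        except ValueError:
            continue  # second column is not an integer, not a counter row
    return counters


SAMPLE = """\
nstatuser@nstat-a:~$ nstat
#kernel
IpInReceives                    1                  0.0
IpOutRequests                   1                  0.0
IpExtInOctets                   84                 0.0
IpExtOutOctets                  84                 0.0
"""

counters = parse_nstat(SAMPLE)
print(counters["IpExtInOctets"], counters["IpExtOutOctets"])
```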

tcp 3-way handshake
-------------------
On the server side, we run::

  nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000
  Listening on [0.0.0.0] (family 0, port 9000)

On the client side, we run::

  nstatuser@nstat-a:~$ nc -nv 192.168.122.251 9000
  Connection to 192.168.122.251 9000 port [tcp/*] succeeded!

The server listened on TCP port 9000, the client connected to it, and
they completed the 3-way handshake.

On the server side, we can find the nstat output below::

  nstatuser@nstat-b:~$ nstat | grep -i tcp
  TcpPassiveOpens                 1                  0.0
  TcpInSegs                       2                  0.0
  TcpOutSegs                      1                  0.0
  TcpExtTCPPureAcks               1                  0.0

On the client side, we can find the nstat output below::

  nstatuser@nstat-a:~$ nstat | grep -i tcp
  TcpActiveOpens                  1                  0.0
  TcpInSegs                       1                  0.0
  TcpOutSegs                      2                  0.0

When the server received the first SYN, it replied with a SYN+ACK and
entered the SYN-RCVD state, so TcpPassiveOpens increased by 1. The
server received a SYN, sent a SYN+ACK and received an ACK, so the server
sent 1 packet and received 2 packets: TcpInSegs increased by 2 and
TcpOutSegs increased by 1. The last ACK of the 3-way handshake is a
pure ACK without data, so TcpExtTCPPureAcks increased by 1.

When the client sent the SYN, the client entered the SYN-SENT state, so
TcpActiveOpens increased by 1. The client sent a SYN, received a
SYN+ACK and sent an ACK, so the client sent 2 packets and received 1
packet: TcpInSegs increased by 1 and TcpOutSegs increased by 2.
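
The TcpInSegs and TcpOutSegs values of the handshake follow from
counting each packet once on the sending side and once on the
receiving side. The sketch below replays the three handshake packets;
the trace format and the function name are assumptions made for
illustration.

```python
# The three packets of the handshake as (sender, flags) pairs.
HANDSHAKE = [("client", "SYN"), ("server", "SYN+ACK"), ("client", "ACK")]


def tally_segments(trace):
    """Count TcpInSegs and TcpOutSegs per host for a two-host trace."""
    counters = {
        "client": {"TcpInSegs": 0, "TcpOutSegs": 0},
        "server": {"TcpInSegs": 0, "TcpOutSegs": 0},
    }
    for sender, _flags in trace:
        receiver = "server" if sender == "client" else "client"
        counters[sender]["TcpOutSegs"] += 1
        counters[receiver]["TcpInSegs"] += 1
    return counters


result = tally_segments(HANDSHAKE)
print(result["server"])
print(result["client"])
```

The result matches the nstat output of both hosts: 2 in and 1 out on
the server, 1 in and 2 out on the client.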

TCP normal traffic
------------------
Run nc on the server::

  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
  Listening on [0.0.0.0] (family 0, port 9000)

Run nc on the client::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!

Type a string into the nc client ('hello' in our example)::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!
  hello

The client-side nstat output::

  nstatuser@nstat-a:~$ nstat
  #kernel
  IpInReceives                    1                  0.0
  IpInDelivers                    1                  0.0
  IpOutRequests                   1                  0.0
  TcpInSegs                       1                  0.0
  TcpOutSegs                      1                  0.0
  TcpExtTCPPureAcks               1                  0.0
  TcpExtTCPOrigDataSent           1                  0.0
  IpExtInOctets                   52                 0.0
  IpExtOutOctets                  58                 0.0
  IpExtInNoECTPkts                1                  0.0

The server-side nstat output::

  nstatuser@nstat-b:~$ nstat
  #kernel
  IpInReceives                    1                  0.0
  IpInDelivers                    1                  0.0
  IpOutRequests                   1                  0.0
  TcpInSegs                       1                  0.0
  TcpOutSegs                      1                  0.0
  IpExtInOctets                   58                 0.0
  IpExtOutOctets                  52                 0.0
  IpExtInNoECTPkts                1                  0.0

Type another string into the nc client ('world' in our example)::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!
  hello
  world

The client-side nstat output::

  nstatuser@nstat-a:~$ nstat
  #kernel
  IpInReceives                    1                  0.0
  IpInDelivers                    1                  0.0
  IpOutRequests                   1                  0.0
  TcpInSegs                       1                  0.0
  TcpOutSegs                      1                  0.0
  TcpExtTCPHPAcks                 1                  0.0
  TcpExtTCPOrigDataSent           1                  0.0
  IpExtInOctets                   52                 0.0
  IpExtOutOctets                  58                 0.0
  IpExtInNoECTPkts                1                  0.0

The server-side nstat output::

  nstatuser@nstat-b:~$ nstat
  #kernel
  IpInReceives                    1                  0.0
  IpInDelivers                    1                  0.0
  IpOutRequests                   1                  0.0
  TcpInSegs                       1                  0.0
  TcpOutSegs                      1                  0.0
  TcpExtTCPHPHits                 1                  0.0
  IpExtInOctets                   58                 0.0
  IpExtOutOctets                  52                 0.0
  IpExtInNoECTPkts                1                  0.0

Comparing the first and the second client-side nstat outputs, we find
one difference: the first one had a 'TcpExtTCPPureAcks', but the second
one had a 'TcpExtTCPHPAcks'. The two server-side outputs differ too: the
second one had a 'TcpExtTCPHPHits', but the first one didn't. The
network traffic patterns were exactly the same: the client sent a packet
to the server, and the server replied with an ACK. But the kernel
handled them in different ways. When the TCP window scale option is not
used, the kernel tries to enable the fast path immediately when the
connection enters the established state, but if the TCP window scale
option is used, the kernel disables the fast path at first and tries to
enable it after it receives packets. We can use the 'ss' command to
verify whether the window scale option is used, e.g. run the command
below on either the server or the client::

  nstatuser@nstat-a:~$ ss -o state established -i '( dport = :9000 or sport = :9000 )'
  Netid    Recv-Q     Send-Q            Local Address:Port             Peer Address:Port
  tcp      0          0               192.168.122.250:40654         192.168.122.251:9000
             ts sack cubic wscale:7,7 rto:204 rtt:0.98/0.49 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:1 send 118.2Mbps lastsnd:46572 lastrcv:46572 lastack:46572 pacing_rate 236.4Mbps rcv_space:29200 rcv_ssthresh:29200 minrtt:0.98

The 'wscale:7,7' means both the server and the client set the window
scale option to 7. Now we can explain the nstat output in our test:

In the first client-side nstat output, the client sent a packet and the
server replied with an ACK. When the kernel handled this ACK, the fast
path was not enabled, so the ACK was counted in 'TcpExtTCPPureAcks'.

In the second client-side nstat output, the client sent a packet again
and received another ACK from the server. This time the fast path was
enabled and the ACK qualified for it, so the ACK was handled by the fast
path and counted in 'TcpExtTCPHPAcks'.

In the first server-side nstat output, the fast path was not enabled,
so there was no 'TcpExtTCPHPHits'.

In the second server-side nstat output, the fast path was enabled and
the packet received from the client qualified for it, so the packet was
counted in 'TcpExtTCPHPHits'.
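
The comparison between two nstat runs can also be done mechanically.
The sketch below is an illustration only: ``parse_nstat`` and
``counter_names_diff`` are hypothetical helper names, and the two sample
strings are abridged copies of the client-side outputs of this test::

  def parse_nstat(text):
      """Parse nstat output into a {counter_name: value} dictionary."""
      counters = {}
      for line in text.splitlines():
          fields = line.split()
          # Skip the '#kernel' header, shell prompts and blank lines.
          if len(fields) != 3 or fields[0].startswith("#"):
              continue
          name, value, _rate = fields
          if value.isdigit():
              counters[name] = int(value)
      return counters


  def counter_names_diff(first, second):
      """Return the counter names that appear in only one of two outputs."""
      a, b = parse_nstat(first), parse_nstat(second)
      return sorted(set(a) - set(b)), sorted(set(b) - set(a))


  FIRST_CLIENT = """\
  #kernel
  TcpInSegs                       1                  0.0
  TcpOutSegs                      1                  0.0
  TcpExtTCPPureAcks               1                  0.0
  TcpExtTCPOrigDataSent           1                  0.0
  """

  SECOND_CLIENT = """\
  #kernel
  TcpInSegs                       1                  0.0
  TcpOutSegs                      1                  0.0
  TcpExtTCPHPAcks                 1                  0.0
  TcpExtTCPOrigDataSent           1                  0.0
  """

  if __name__ == "__main__":
      only_first, only_second = counter_names_diff(FIRST_CLIENT, SECOND_CLIENT)
      print("only in the first run: ", only_first)
      print("only in the second run:", only_second)

The same helpers can be fed the complete nstat outputs of the server to
spot the extra 'TcpExtTCPHPHits' line.
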

TcpExtTCPAbortOnClose
---------------------
On the server side, we run the Python script below::

  import socket
  import time

  port = 9000

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(('0.0.0.0', port))
  s.listen(1)
  # Accept one connection, then never read from it.
  sock, addr = s.accept()
  while True:
      time.sleep(9999999)

This Python script listens on port 9000, but doesn't read anything from
the connection.

On the client side, we send the string "hello" with nc::

  nstatuser@nstat-a:~$ echo "hello" | nc nstat-b 9000

Back on the server side, the server has received the "hello" packet and
the TCP layer has acked it, but the application has not read it yet. We
type Ctrl-C to terminate the server script. Then we find that
TcpExtTCPAbortOnClose increased by 1 on the server side::

  nstatuser@nstat-b:~$ nstat | grep -i abort
  TcpExtTCPAbortOnClose           1                  0.0

If we run tcpdump on the server side, we can see that the server sent a
RST after we typed Ctrl-C.
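
The same close-with-unread-data situation can be reproduced on a single
machine over the loopback interface. This is a minimal sketch and not
part of the original test: it assumes a Linux host, uses an ephemeral
port instead of 9000, and ``abort_on_close_demo`` is a hypothetical
name. It observes the RST from the client side as a connection reset
error; the counter itself still has to be checked with nstat::

  import select
  import socket


  def abort_on_close_demo():
      """Close a socket that still holds unread data and report what
      the peer observes ("reset", "eof" or "data")."""
      listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      listener.bind(("127.0.0.1", 0))    # port 0: the kernel picks a port
      listener.listen(1)

      client = socket.create_connection(listener.getsockname())
      server_side, _addr = listener.accept()
      client.sendall(b"hello\n")

      # Wait until the data is queued on the server socket, never read it.
      select.select([server_side], [], [], 5.0)

      # Closing with unread data makes the kernel abort the connection.
      server_side.close()

      client.settimeout(5.0)
      try:
          data = client.recv(16)
          outcome = "eof" if data == b"" else "data"
      except ConnectionResetError:
          outcome = "reset"
      finally:
          client.close()
          listener.close()
      return outcome


  if __name__ == "__main__":
      print("client observed:", abort_on_close_demo())

Running ``nstat | grep -i abort`` before and after the script shows
whether TcpExtTCPAbortOnClose moved on the local machine.
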

TcpExtTCPAbortOnMemory and TcpExtTCPAbortOnTimeout
---------------------------------------------------
Below is an example that makes the orphan socket count exceed
net.ipv4.tcp_max_orphans.
Change tcp_max_orphans to a smaller value on the client::

  sudo bash -c "echo 10 > /proc/sys/net/ipv4/tcp_max_orphans"

Client code (create 64 connections to the server)::

  nstatuser@nstat-a:~$ cat client_orphan.py
  import socket
  import time

  server = 'nstat-b' # server address
  port = 9000

  count = 64

  connection_list = []

  for i in range(count):
      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      s.connect((server, port))
      connection_list.append(s)
      print("connection_count: %d" % len(connection_list))

  while True:
      time.sleep(99999)

Server code (accept 64 connections from the client)::

  nstatuser@nstat-b:~$ cat server_orphan.py
  import socket
  import time

  port = 9000
  count = 64

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(('0.0.0.0', port))
  s.listen(count)
  connection_list = []
  while True:
      sock, addr = s.accept()
      connection_list.append((sock, addr))
      print("connection_count: %d" % len(connection_list))

Run the Python scripts on the server and the client.

On the server::

  python3 server_orphan.py

On the client::

  python3 client_orphan.py

Run iptables on the server::

  sudo iptables -A INPUT -i ens3 -p tcp --destination-port 9000 -j DROP

Type Ctrl-C on the client to stop client_orphan.py.

Check TcpExtTCPAbortOnMemory on the client::

  nstatuser@nstat-a:~$ nstat | grep -i abort
  TcpExtTCPAbortOnMemory          54                 0.0

Check the orphaned socket count on the client::

  nstatuser@nstat-a:~$ ss -s
  Total: 131 (kernel 0)
  TCP:   14 (estab 1, closed 0, orphaned 10, synrecv 0, timewait 0/0), ports 0

  Transport Total     IP        IPv6
  *         0         -         -
  RAW       1         0         1
  UDP       1         1         0
  TCP       14        13        1
  INET      16        14        2
  FRAG      0         0         0

The explanation of the test: after running server_orphan.py and
client_orphan.py, we have 64 connections between the server and the
client. After the iptables command, the server drops all packets from
the client. When we type Ctrl-C on client_orphan.py, the client system
tries to close these connections, and before they are closed gracefully,
they become orphan sockets. As iptables on the server blocks packets
from the client, the server never receives the FIN from the client, so
all connections on the client are stuck in the FIN_WAIT_1 state and
remain orphan sockets until they time out. We have written 10 to
/proc/sys/net/ipv4/tcp_max_orphans, so the client system keeps only 10
orphan sockets; for all other orphan sockets, it sends a RST and deletes
them. We have 64 connections, so the 'ss -s' command shows that the
system has 10 orphan sockets, and the value of TcpExtTCPAbortOnMemory
was 54.
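
The numbers of this test can be checked with a toy model. The sketch
below is a simplification made for illustration: ``close_all`` is a
hypothetical function, and it assumes that the kernel sees an exact
orphan count on every close::

  def close_all(connections, max_orphans):
      """Close every connection while the peer is unreachable.

      Returns (orphans_kept, aborted_on_memory).
      """
      orphans = 0
      aborted = 0
      for _ in range(connections):
          if orphans < max_orphans:
              # Room left: the socket stays in FIN_WAIT_1 as an orphan.
              orphans += 1
          else:
              # Limit reached: the socket is reset and deleted right away.
              aborted += 1
      return orphans, aborted


  if __name__ == "__main__":
      orphans, aborted = close_all(connections=64, max_orphans=10)
      print("orphaned sockets (ss -s):", orphans)
      print("TcpExtTCPAbortOnMemory:  ", aborted)

With 64 connections and a limit of 10, the model keeps 10 orphans and
aborts the remaining 54, which is what 'ss -s' and nstat reported.
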

An additional note about the orphan socket count: you can find the
exact orphan socket count with the 'ss -s' command, but when the kernel
decides whether to increase TcpExtTCPAbortOnMemory and send a RST, it
doesn't always check the exact orphan socket count. To improve
performance, the kernel checks an approximate count first; only if the
approximate count is more than tcp_max_orphans does it check the exact
count. So if the approximate count is less than tcp_max_orphans but the
exact count is more than tcp_max_orphans, you will find that
TcpExtTCPAbortOnMemory is not increased at all. If tcp_max_orphans is
large enough, this won't occur, but if you decrease tcp_max_orphans to a
small value as in our test, you might run into this issue. That is why
the client in our test set up 64 connections although tcp_max_orphans
is 10. If the client had set up only 11 connections, we couldn't
observe a change in TcpExtTCPAbortOnMemory.
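
One way to picture the approximate and the exact count is a counter that
keeps a small per-CPU delta and folds it into a shared total only when
the delta reaches a batch size. The sketch below models that idea under
loud assumptions and is not the kernel's code: the class name, the
batch size of 32, and the fact that all closes happen on CPU 0 are
hypothetical, so the numbers are not meant to reproduce the 54 aborts
of the test::

  class BatchedCounter:
      """Counter with a cheap approximate read and a costly exact read."""

      def __init__(self, cpus, batch):
          self.total = 0               # shared value, cheap to read
          self.local = [0] * cpus      # per-CPU deltas not yet folded in
          self.batch = batch

      def inc(self, cpu):
          self.local[cpu] += 1
          if self.local[cpu] >= self.batch:
              # Fold the local delta into the shared total.
              self.total += self.local[cpu]
              self.local[cpu] = 0

      def approximate(self):
          return self.total

      def exact(self):
          return self.total + sum(self.local)


  def too_many_orphans(counter, max_orphans):
      """Two-step check: consult the exact count only when the
      approximate count already exceeds the limit."""
      if counter.approximate() <= max_orphans:
          return False
      return counter.exact() > max_orphans


  if __name__ == "__main__":
      counter = BatchedCounter(cpus=4, batch=32)
      for n in range(1, 41):
          counter.inc(cpu=0)
          if n in (11, 40):
              print(n, "orphans: approximate", counter.approximate(),
                    "exact", counter.exact(),
                    "too many:", too_many_orphans(counter, max_orphans=10))

In this model 11 orphans stay invisible to the approximate read, so the
limit of 10 is never enforced, while 40 orphans are noticed.
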

Continuing the previous test, we wait for several minutes. Because
iptables on the server blocked the traffic, the server never receives
the FIN, and all of the client's orphan sockets eventually time out in
the FIN_WAIT_1 state. After waiting a few minutes, we find 10 timeouts
on the client::

  nstatuser@nstat-a:~$ nstat | grep -i abort
  TcpExtTCPAbortOnTimeout         10                 0.0

TcpExtTCPAbortOnLinger
----------------------
The server-side code::

  nstatuser@nstat-b:~$ cat server_linger.py
  import socket
  import time

  port = 9000

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(('0.0.0.0', port))
  s.listen(1)
  sock, addr = s.accept()
  while True:
      time.sleep(9999999)

The client-side code::

  nstatuser@nstat-a:~$ cat client_linger.py
  import socket
  import struct

  server = 'nstat-b' # server address
  port = 9000

  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 10))
  s.setsockopt(socket.SOL_TCP, socket.TCP_LINGER2, struct.pack('i', -1))
  s.connect((server, port))
  s.close()

Run server_linger.py on the server::

  nstatuser@nstat-b:~$ python3 server_linger.py

Run client_linger.py on the client::

  nstatuser@nstat-a:~$ python3 client_linger.py

After running client_linger.py, check the output of nstat::

  nstatuser@nstat-a:~$ nstat | grep -i abort
  TcpExtTCPAbortOnLinger          1                  0.0
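
The two ``setsockopt`` calls in client_linger.py pass packed C
structures. As a small illustration (Linux only, nothing is sent on the
network, and ``linger_settings`` is a hypothetical helper), the values
can be read back with ``getsockopt`` to see how they are laid out:
SO_LINGER takes an on/off flag plus a timeout in seconds, and
TCP_LINGER2 takes a single integer::

  import socket
  import struct


  def linger_settings(onoff, seconds, linger2):
      """Apply SO_LINGER and TCP_LINGER2 to a fresh socket and read
      them back."""
      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      try:
          # struct linger { int l_onoff; int l_linger; }
          s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                       struct.pack('ii', onoff, seconds))
          s.setsockopt(socket.SOL_TCP, socket.TCP_LINGER2,
                       struct.pack('i', linger2))

          raw = s.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                             struct.calcsize('ii'))
          so_linger = struct.unpack('ii', raw)
          tcp_linger2 = s.getsockopt(socket.SOL_TCP, socket.TCP_LINGER2)
          return so_linger, tcp_linger2
      finally:
          s.close()


  if __name__ == "__main__":
      so_linger, tcp_linger2 = linger_settings(1, 10, -1)
      print("SO_LINGER (l_onoff, l_linger):", so_linger)
      print("TCP_LINGER2:", tcp_linger2)
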

TcpExtTCPRcvCoalesce
--------------------
On the server, we run a program which listens on TCP port 9000, but
doesn't read any data::

  import socket
  import time
  port = 9000
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.bind(('0.0.0.0', port))
  s.listen(1)
  sock, addr = s.accept()
  while True:
      time.sleep(9999999)

Save the above code as server_coalesce.py, and run::

  python3 server_coalesce.py

On the client, save the code below as client_coalesce.py::

  import socket
  server = 'nstat-b'
  port = 9000
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.connect((server, port))

Run::

  nstatuser@nstat-a:~$ python3 -i client_coalesce.py

We use '-i' to enter the interactive mode, then send a packet::

  >>> s.send(b'foo')
  3

Send a packet again::

  >>> s.send(b'bar')
  3

On the server, run nstat::

  ubuntu@nstat-b:~$ nstat
  #kernel
  IpInReceives                    2                  0.0
  IpInDelivers                    2                  0.0
  IpOutRequests                   2                  0.0
  TcpInSegs                       2                  0.0
  TcpOutSegs                      2                  0.0
  TcpExtTCPRcvCoalesce            1                  0.0
  IpExtInOctets                   110                0.0
  IpExtOutOctets                  104                0.0
  IpExtInNoECTPkts                2                  0.0

The client sent two packets, and the server didn't read any data. When
the second packet arrived at the server, the first packet was still in
the receive queue. So the TCP layer merged the two packets, and we
find that TcpExtTCPRcvCoalesce increased by 1.
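
The merging step can be pictured with a toy receive queue. This is a
simplified sketch for illustration and not the kernel's data structure:
the ``ReceiveQueue`` class, its attributes and the sequence numbers 1
and 4 are all hypothetical::

  class ReceiveQueue:
      """Toy receive queue that merges back-to-back segments."""

      def __init__(self):
          self.segments = []      # list of [start_seq, payload]
          self.coalesced = 0      # stands in for TcpExtTCPRcvCoalesce

      def enqueue(self, seq, payload):
          if self.segments:
              tail_seq, tail_payload = self.segments[-1]
              if tail_seq + len(tail_payload) == seq:
                  # The new data directly follows the unread tail: merge.
                  self.segments[-1][1] = tail_payload + payload
                  self.coalesced += 1
                  return
          self.segments.append([seq, payload])

      def read(self):
          """Drain the queue, as an application calling recv() would."""
          data = b"".join(payload for _seq, payload in self.segments)
          self.segments.clear()
          return data


  if __name__ == "__main__":
      queue = ReceiveQueue()
      queue.enqueue(1, b"foo")
      queue.enqueue(4, b"bar")
      print("segments queued:", len(queue.segments))
      print("coalesced:", queue.coalesced)

If the application had called ``read()`` between the two packets, the
queue would have been empty when 'bar' arrived and nothing would have
been merged.
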

TcpExtListenOverflows and TcpExtListenDrops
-------------------------------------------
On the server, run the nc command to listen on port 9000::

  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
  Listening on [0.0.0.0] (family 0, port 9000)

On the client, run 3 nc commands in different terminals::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!

The nc command only accepts 1 connection, and the accept queue length
is 1. In the current Linux implementation, setting the queue length to n
means the actual queue length is n+1. Now we have created 3 connections:
1 is accepted by nc and 2 are in the accept queue, so the accept queue
is full.

Before running the 4th nc, we clear the nstat history on the server::

  nstatuser@nstat-b:~$ nstat -n

Run the 4th nc on the client::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000

If the nc server is running on kernel 4.10 or a higher version, you
won't see the "Connection to ... succeeded!" string, because the kernel
drops the SYN if the accept queue is full. If the nc server is running
on an older kernel, you will see that the connection succeeds, because
the kernel completes the 3-way handshake and keeps the socket on the
half-open queue. This test was done on kernel 4.15. Below is the nstat
output on the server::

  nstatuser@nstat-b:~$ nstat
  #kernel
  IpInReceives                    4                  0.0
  IpInDelivers                    4                  0.0
  TcpInSegs                       4                  0.0
  TcpExtListenOverflows           4                  0.0
  TcpExtListenDrops               4                  0.0
  IpExtInOctets                   240                0.0
  IpExtInNoECTPkts                4                  0.0

Both TcpExtListenOverflows and TcpExtListenDrops were 4. If the time
between the 4th nc and the nstat had been longer, the values of
TcpExtListenOverflows and TcpExtListenDrops would be larger, because
the SYN of the 4th nc was dropped and the client kept retrying.
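
The accept queue behaviour of this test can be sketched as follows. The
model is deliberately crude and written for illustration only: the
``Listener`` class is hypothetical, it collapses the whole handshake
into one ``syn()`` call, and it only covers the drop-on-full behaviour
seen on the 4.15 kernel used above::

  class Listener:
      """Toy listening socket that only models the accept queue."""

      def __init__(self, backlog):
          self.backlog = backlog
          self.accept_queue = 0
          self.listen_overflows = 0    # stands in for TcpExtListenOverflows
          self.listen_drops = 0        # stands in for TcpExtListenDrops

      def queue_is_full(self):
          # A backlog of n leaves room for n + 1 queued connections.
          return self.accept_queue > self.backlog

      def syn(self):
          """Handle an incoming SYN; return True if a connection is queued."""
          if self.queue_is_full():
              self.listen_overflows += 1
              self.listen_drops += 1
              return False
          self.accept_queue += 1
          return True

      def accept(self):
          if self.accept_queue:
              self.accept_queue -= 1


  if __name__ == "__main__":
      listener = Listener(backlog=1)
      listener.syn()          # 1st nc
      listener.accept()       # nc accepts exactly one connection
      listener.syn()          # 2nd nc, waits in the accept queue
      listener.syn()          # 3rd nc, waits in the accept queue
      for _ in range(4):      # 4th nc: the SYN and 3 retransmissions
          listener.syn()
      print("TcpExtListenOverflows:", listener.listen_overflows)
      print("TcpExtListenDrops:    ", listener.listen_drops)

Each retransmitted SYN of the 4th nc hits the full queue again, which is
why both counters keep growing while the client retries.
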

IpInAddrErrors, IpExtInNoRoutes and IpOutNoRoutes
-------------------------------------------------
* server A IP address: 192.168.122.250
* server B IP address: 192.168.122.251

Prepare on server A, add a route to server B::

  $ sudo ip route add 8.8.8.8/32 via 192.168.122.251

Prepare on server B, disable send_redirects for all interfaces::

  $ sudo sysctl -w net.ipv4.conf.all.send_redirects=0
  $ sudo sysctl -w net.ipv4.conf.ens3.send_redirects=0
  $ sudo sysctl -w net.ipv4.conf.lo.send_redirects=0
  $ sudo sysctl -w net.ipv4.conf.default.send_redirects=0

We want to let server A send a packet to 8.8.8.8 and route the packet
to server B. When server B receives such a packet, it might send an ICMP
Redirect message to server A; setting send_redirects to 0 disables
this behavior.

First, generate IpInAddrErrors. On server B, we disable IP forwarding::

  $ sudo sysctl -w net.ipv4.conf.all.forwarding=0

On server A, we send packets to 8.8.8.8::

  $ nc -v 8.8.8.8 53

On server B, we check the output of nstat::

  $ nstat
  #kernel
  IpInReceives                    3                  0.0
  IpInAddrErrors                  3                  0.0
  IpExtInOctets                   180                0.0
  IpExtInNoECTPkts                3                  0.0

As we have let server A route 8.8.8.8 to server B and disabled IP
forwarding on server B, server A sent packets to server B, and server B
dropped them and increased IpInAddrErrors. As the SYN packet is re-sent
when nc doesn't receive a SYN+ACK, we find multiple IpInAddrErrors.

Second, generate IpExtInNoRoutes. On server B, we enable IP
forwarding::

  $ sudo sysctl -w net.ipv4.conf.all.forwarding=1

Check the route table of server B and remove the default route::

  $ ip route show
  default via 192.168.122.1 dev ens3 proto static
  192.168.122.0/24 dev ens3 proto kernel scope link src 192.168.122.251
  $ sudo ip route delete default via 192.168.122.1 dev ens3 proto static

On server A, we contact 8.8.8.8 again::

  $ nc -v 8.8.8.8 53
  nc: connect to 8.8.8.8 port 53 (tcp) failed: Network is unreachable

On server B, run nstat::

  $ nstat
  #kernel
  IpInReceives                    1                  0.0
  IpOutRequests                   1                  0.0
  IcmpOutMsgs                     1                  0.0
  IcmpOutDestUnreachs             1                  0.0
  IcmpMsgOutType3                 1                  0.0
  IpExtInNoRoutes                 1                  0.0
  IpExtInOctets                   60                 0.0
  IpExtOutOctets                  88                 0.0
  IpExtInNoECTPkts                1                  0.0

We enabled IP forwarding on server B, so when server B received a packet
whose destination IP address is 8.8.8.8, it tried to forward the
packet. We had deleted the default route, so there was no route for
8.8.8.8; server B increased IpExtInNoRoutes and sent an "ICMP
Destination Unreachable" message to server A.
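
The octet counters in the two nstat outputs of server B are consistent
with the packet sizes involved. The arithmetic below is a reading of
those numbers, not a separate measurement: it assumes a 20-byte IPv4
header without options, a 60-byte SYN (taken from the IpExtInOctets
value for one received packet), and an ICMP error made of an 8-byte ICMP
header followed by the whole offending packet::

  IPV4_HEADER = 20      # bytes, assuming no IP options
  ICMP_HEADER = 8       # bytes, type/code/checksum plus 4 unused bytes
  SYN_PACKET = 60       # bytes, IpExtInOctets with one SYN received


  def in_octets(syn_count):
      """Octets received when syn_count copies of the SYN arrive."""
      return syn_count * SYN_PACKET


  def dest_unreachable_size(offending_packet):
      """Size of an ICMP error that quotes the whole offending packet."""
      return IPV4_HEADER + ICMP_HEADER + offending_packet


  if __name__ == "__main__":
      print("3 SYNs received:", in_octets(3), "octets")
      print("ICMP Destination Unreachable:",
            dest_unreachable_size(SYN_PACKET), "octets")

The two results match the IpExtInOctets value of the first experiment
and the IpExtOutOctets value of the second one.
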

Third, generate IpOutNoRoutes. Run the ping command on server B::

  $ ping -c 1 8.8.8.8
  connect: Network is unreachable

Run nstat on server B::

  $ nstat
  #kernel
  IpOutNoRoutes                   1                  0.0

We had deleted the default route on server B. Server B couldn't find
a route for the 8.8.8.8 IP address, so it increased
IpOutNoRoutes.
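
The three experiments can be summarized as a small decision function.
It only restates the outcomes observed in this section for a packet
whose destination is not a local address; ``counter_for`` and its
parameters are hypothetical names, and the kernel's real route lookup
has many more cases::

  def counter_for(locally_generated, forwarding, has_route):
      """Return the counter increased for a packet to a non-local
      address, or None if the packet can be sent on."""
      if locally_generated:
          # Third experiment: ping from server B without a default route.
          return None if has_route else "IpOutNoRoutes"
      if not forwarding:
          # First experiment: forwarding disabled, the packet is dropped.
          return "IpInAddrErrors"
      if not has_route:
          # Second experiment: forwarding enabled, but no route to use.
          return "IpExtInNoRoutes"
      return None


  if __name__ == "__main__":
      print("first: ", counter_for(False, forwarding=False, has_route=True))
      print("second:", counter_for(False, forwarding=True, has_route=False))
      print("third: ", counter_for(True, forwarding=True, has_route=False))
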

TcpExtTCPACKSkippedSynRecv
--------------------------
In this test, we send 3 identical SYN packets from the client to the
server. The first SYN makes the server create a socket, set it to the
SYN-RECV state, and reply with a SYN/ACK. The second SYN makes the
server reply with the SYN/ACK again and record the reply time (the
duplicate ACK reply time). The third SYN makes the server check the
previous duplicate ACK reply time and decide to skip the duplicate ACK,
then increase the TcpExtTCPACKSkippedSynRecv counter.

Run tcpdump to capture a SYN packet::

  nstatuser@nstat-a:~$ sudo tcpdump -c 1 -w /tmp/syn.pcap port 9000
  tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes

Open another terminal and run the nc command::

  nstatuser@nstat-a:~$ nc nstat-b 9000

As nstat-b didn't listen on port 9000, it replied with a RST and the nc
command exited immediately. That was enough for the tcpdump command to
capture a SYN packet. A Linux server might use hardware offload for the
TCP checksum, so the checksum in /tmp/syn.pcap might not be correct. We
call tcprewrite to fix it::

  nstatuser@nstat-a:~$ tcprewrite --infile=/tmp/syn.pcap --outfile=/tmp/syn_fixcsum.pcap --fixcsum

On nstat-b, we run nc to listen on port 9000::

  nstatuser@nstat-b:~$ nc -lkv 9000
  Listening on [0.0.0.0] (family 0, port 9000)

On nstat-a, we block the packets from port 9000; otherwise nstat-a
would send a RST to nstat-b::

  nstatuser@nstat-a:~$ sudo iptables -A INPUT -p tcp --sport 9000 -j DROP

Send the SYN to nstat-b 3 times::

  nstatuser@nstat-a:~$ for i in {1..3}; do sudo tcpreplay -i ens3 /tmp/syn_fixcsum.pcap; done

Check the SNMP counter on nstat-b::

  nstatuser@nstat-b:~$ nstat | grep -i skip
  TcpExtTCPACKSkippedSynRecv      1                  0.0

As we expected, TcpExtTCPACKSkippedSynRecv is 1.
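
The decision described at the start of this section can be sketched as
a small rate limiter. This is a simplified model rather than the
kernel's implementation: the class and attribute names are
hypothetical, and the 500 ms window is an assumption standing in for
the net.ipv4.tcp_invalid_ratelimit sysctl::

  class SynRecvSocket:
      """Toy SYN-RECV socket that rate-limits duplicate SYN/ACKs."""

      def __init__(self, ratelimit_ms=500):
          self.ratelimit_ms = ratelimit_ms
          self.last_dup_reply_ms = None   # time of the last duplicate reply
          self.synacks_sent = 0
          self.acks_skipped = 0   # stands in for TcpExtTCPACKSkippedSynRecv

      def first_syn(self):
          # The socket is created and the initial SYN/ACK is sent.
          self.synacks_sent += 1

      def duplicate_syn(self, now_ms):
          last = self.last_dup_reply_ms
          if last is not None and now_ms - last < self.ratelimit_ms:
              # A duplicate was answered too recently: skip this reply.
              self.acks_skipped += 1
              return False
          self.last_dup_reply_ms = now_ms
          self.synacks_sent += 1
          return True


  if __name__ == "__main__":
      sock = SynRecvSocket()
      sock.first_syn()            # 1st SYN: socket created, SYN/ACK sent
      sock.duplicate_syn(10)      # 2nd SYN: SYN/ACK sent, time recorded
      sock.duplicate_syn(20)      # 3rd SYN: inside the window, skipped
      print("SYN/ACKs sent:", sock.synacks_sent)
      print("TcpExtTCPACKSkippedSynRecv:", sock.acks_skipped)

If the third SYN arrived after the window had expired, the model would
answer it and the counter would stay at 0, so the three replays have to
be sent in quick succession.
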

TcpExtTCPACKSkippedPAWS
-----------------------
To trigger PAWS, we can send an old SYN.

On nstat-b, let nc listen on port 9000::

  nstatuser@nstat-b:~$ nc -lkv 9000
  Listening on [0.0.0.0] (family 0, port 9000)

On nstat-a, run tcpdump to capture a SYN::

  nstatuser@nstat-a:~$ sudo tcpdump -w /tmp/paws_pre.pcap -c 1 port 9000
  tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes

On nstat-a, run nc as a client to connect to nstat-b::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!

Now tcpdump has captured the SYN and exited. We should fix the
checksum::

  nstatuser@nstat-a:~$ tcprewrite --infile /tmp/paws_pre.pcap --outfile /tmp/paws.pcap --fixcsum

Send the SYN packet twice::

  nstatuser@nstat-a:~$ for i in {1..2}; do sudo tcpreplay -i ens3 /tmp/paws.pcap; done

On nstat-b, check the SNMP counter::

  nstatuser@nstat-b:~$ nstat | grep -i skip
  TcpExtTCPACKSkippedPAWS         1                  0.0

We sent two SYNs via tcpreplay. Both of them failed the PAWS check;
nstat-b replied with an ACK for the first SYN, skipped the ACK
for the second SYN, and updated TcpExtTCPACKSkippedPAWS.
17312b965472Syupeng
17322b965472SyupengTcpExtTCPACKSkippedSeq
1733ae5220c6SRandy Dunlap----------------------
To trigger TcpExtTCPACKSkippedSeq, we send packets which have a valid
timestamp (to pass the PAWS check) but an out-of-window sequence
number. The Linux TCP stack does not skip the ACK if the packet carries
data, so we need a pure ACK packet. To generate such a packet, we
create two sockets: one on port 9000, another on port 9001. Then
we capture an ACK on port 9001 and change the source/destination port
numbers to match the port 9000 socket. This packet then triggers
TcpExtTCPACKSkippedSeq.
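The skipping rule that these tests rely on can be modelled roughly as
follows. This is a simplified sketch, not the kernel code: the function
name ``simulate_ack_ratelimit`` and its input format (arrival time in
milliseconds and payload length, one packet per line) are made up for
illustration, and the 500 ms default is assumed to match the default of
the ``net.ipv4.tcp_invalid_ratelimit`` sysctl. A SYN can be fed in as a
packet with payload length 0.

```shell
# Simplified model of ACK rate limiting for invalid packets (not kernel code).
# stdin: one packet per line, "<arrival_ms> <payload_len>".
# $1: rate-limit interval in ms (assumed default: 500).
simulate_ack_ratelimit() {
    awk -v interval="${1:-500}" '
        $2 > 0 {
            # Packets carrying data are always answered and do not
            # touch the rate-limit timestamp.
            print $1, "ack"
            next
        }
        {
            if (have_last && $1 - last_ack < interval) {
                print $1, "skip"
                skipped++
            } else {
                print $1, "ack"
                last_ack = $1
                have_last = 1
            }
        }
        END { print "skipped", skipped + 0 }
    '
}

# Two pure ACKs sent 10 ms apart: the second one is skipped.
printf '0 0\n10 0\n' | simulate_ack_ratelimit
```

In this model only the second of two back-to-back pure ACKs is counted
as skipped, which corresponds to the counter value of 1 seen when a
captured packet is replayed twice.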

On nstat-b, open two terminals and run two nc commands to listen on
port 9000 and port 9001::

  nstatuser@nstat-b:~$ nc -lkv 9000
  Listening on [0.0.0.0] (family 0, port 9000)

  nstatuser@nstat-b:~$ nc -lkv 9001
  Listening on [0.0.0.0] (family 0, port 9001)

On nstat-a, run two nc clients::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!

  nstatuser@nstat-a:~$ nc -v nstat-b 9001
  Connection to nstat-b 9001 port [tcp/*] succeeded!

On nstat-a, run tcpdump to capture an ACK::

  nstatuser@nstat-a:~$ sudo tcpdump -w /tmp/seq_pre.pcap -c 1 dst port 9001
  tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes

On nstat-b, send a packet via the port 9001 socket. In our example, we
send the string 'foo'::

  nstatuser@nstat-b:~$ nc -lkv 9001
  Listening on [0.0.0.0] (family 0, port 9001)
  Connection from nstat-a 42132 received!
  foo

On nstat-a, tcpdump should have captured the ACK. Check the source port
numbers of the two nc clients::

  nstatuser@nstat-a:~$ ss -ta '( dport = :9000 || dport = :9001 )' | tee
  State  Recv-Q   Send-Q         Local Address:Port           Peer Address:Port
  ESTAB  0        0            192.168.122.250:50208       192.168.122.251:9000
  ESTAB  0        0            192.168.122.250:42132       192.168.122.251:9001

Run tcprewrite to change port 9001 to port 9000 and port 42132 to
port 50208::

  nstatuser@nstat-a:~$ tcprewrite --infile /tmp/seq_pre.pcap --outfile /tmp/seq.pcap -r 9001:9000 -r 42132:50208 --fixcsum
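The client-side ports used in the ``-r`` mappings above are read off the
ss output by hand. A small helper can pick them out instead. This is
only a sketch: ``local_port_for`` is a hypothetical name, it assumes
numeric ss output in the column layout shown above (State, Recv-Q,
Send-Q, Local Address:Port, Peer Address:Port), and it only looks at
ESTAB lines.

```shell
# Print the local port of the first established connection whose peer port is $1.
# stdin: ss output in the column layout shown above.
local_port_for() {
    awk -v dport="$1" '
        $1 == "ESTAB" {
            n = split($5, peer, ":")
            m = split($4, loc, ":")
            if (peer[n] == dport) {
                print loc[m]
                exit
            }
        }
    '
}

# Sample input copied from the ss output above.
printf '%s\n' \
    'State  Recv-Q   Send-Q         Local Address:Port           Peer Address:Port' \
    'ESTAB  0        0            192.168.122.250:50208       192.168.122.251:9000' \
    'ESTAB  0        0            192.168.122.250:42132       192.168.122.251:9001' \
    | local_port_for 9001
```

On a live system the input would come from something like ``ss -tan``,
where ``-n`` keeps the ports numeric.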

/tmp/seq.pcap now contains the packet we need. Send it to nstat-b::

  nstatuser@nstat-a:~$ for i in {1..2}; do sudo tcpreplay -i ens3 /tmp/seq.pcap; done

Check TcpExtTCPACKSkippedSeq on nstat-b::

  nstatuser@nstat-b:~$ nstat | grep -i skip
  TcpExtTCPACKSkippedSeq          1                  0.0
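By default nstat prints the change since its previous run. The absolute
values are available in /proc/net/netstat, where each group is a header
line of counter names followed by a line of values. A minimal sketch to
read one TcpExt counter directly, assuming that two-line layout;
``tcpext_counter`` and its optional file argument are hypothetical, added
so that the parser can be tried on saved or piped input:

```shell
# Print the absolute value of one TcpExt counter.
# $1: counter name without the "TcpExt" prefix, e.g. TCPACKSkippedSeq
# $2: file to parse, "-" for stdin (defaults to /proc/net/netstat)
tcpext_counter() {
    awk -v name="$1" '
        $1 == "TcpExt:" && !have_header {
            # First TcpExt line: remember the column of every counter name.
            for (i = 2; i <= NF; i++) col[$i] = i
            have_header = 1
            next
        }
        $1 == "TcpExt:" && have_header {
            # Second TcpExt line: values in the same column order.
            if (name in col) print $(col[name])
            exit
        }
    ' "${2:-/proc/net/netstat}"
}

# Sample input in the /proc/net/netstat layout, fed via stdin.
printf '%s\n' \
    'TcpExt: TCPACKSkippedSynRecv TCPACKSkippedPAWS TCPACKSkippedSeq' \
    'TcpExt: 1 1 1' \
    | tcpext_counter TCPACKSkippedSeq -
```

Called with only a counter name, the function reads /proc/net/netstat on
the local machine.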