============
SNMP counter
============

This document explains the meaning of SNMP counters.

General IPv4 counters
=====================
All layer 4 packets and ICMP packets will change these counters, but
these counters won't be changed by layer 2 packets (such as STP) or
ARP packets.

* IpInReceives

Defined in `RFC1213 ipInReceives`_

.. _RFC1213 ipInReceives: https://tools.ietf.org/html/rfc1213#page-26

The number of packets received by the IP layer. It is increased at the
beginning of the ip_rcv function and is always updated together with
IpExtInOctets. It is increased even if the packet is dropped later
(e.g. because the IP header is invalid or the checksum is wrong). It
counts aggregated segments after GRO/LRO.

* IpInDelivers

Defined in `RFC1213 ipInDelivers`_

.. _RFC1213 ipInDelivers: https://tools.ietf.org/html/rfc1213#page-28

The number of packets delivered to the upper layer protocols, e.g. TCP,
UDP, ICMP and so on. If no one listens on a raw socket, only packets of
kernel supported protocols will be delivered. If someone listens on a
raw socket, all valid IP packets will be delivered.

* IpOutRequests

Defined in `RFC1213 ipOutRequests`_

.. _RFC1213 ipOutRequests: https://tools.ietf.org/html/rfc1213#page-28

The number of packets sent via the IP layer, for both unicast and
multicast packets. It is always updated together with IpExtOutOctets.

* IpExtInOctets and IpExtOutOctets

They are Linux kernel extensions and have no RFC definitions. Please
note that RFC1213 does define ifInOctets and ifOutOctets, but they are
different things. The ifInOctets and ifOutOctets counters include the
MAC layer header size, but IpExtInOctets and IpExtOutOctets don't; they
only include the IP layer header and the IP layer data.

* IpExtInNoECTPkts, IpExtInECT1Pkts, IpExtInECT0Pkts, IpExtInCEPkts

They indicate the number of four kinds of ECN IP packets. Please refer
to `Explicit Congestion Notification`_ for more details.

.. _Explicit Congestion Notification: https://tools.ietf.org/html/rfc3168#page-6

These 4 counters count how many packets are received per ECN
status. They count the real frame number regardless of LRO/GRO. So
for the same packet, you might find that IpInReceives counts 1, but
IpExtInNoECTPkts counts 2 or more.

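As a rough illustration of how a packet maps to one of the four ECN
counters, the sketch below uses the ECN codepoints from RFC 3168 (the two
least significant bits of the IPv4 TOS byte). The function name
``ecn_counter_for_tos`` is hypothetical and is not a kernel interface.

.. code-block:: python

   # ECN codepoints from RFC 3168: the two least significant bits of the
   # IPv4 TOS byte.
   ECN_COUNTERS = {
       0b00: "IpExtInNoECTPkts",  # Not-ECT
       0b01: "IpExtInECT1Pkts",   # ECT(1)
       0b10: "IpExtInECT0Pkts",   # ECT(0)
       0b11: "IpExtInCEPkts",     # CE (congestion experienced)
   }


   def ecn_counter_for_tos(tos):
       """Return the counter name matching the ECN bits of a TOS byte."""
       return ECN_COUNTERS[tos & 0b11]

For example, a TOS byte of ``0x02`` maps to IpExtInECT0Pkts, and any TOS
byte whose low two bits are zero maps to IpExtInNoECTPkts.
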
* IpInHdrErrors

Defined in `RFC1213 ipInHdrErrors`_. It indicates the packet is
dropped due to an IP header error. It might happen in both the IP input
and IP forward paths.

.. _RFC1213 ipInHdrErrors: https://tools.ietf.org/html/rfc1213#page-27

* IpInAddrErrors

Defined in `RFC1213 ipInAddrErrors`_. It will be increased in two
scenarios: (1) The IP address is invalid. (2) The destination IP
address is not a local address and IP forwarding is not enabled.

.. _RFC1213 ipInAddrErrors: https://tools.ietf.org/html/rfc1213#page-27

* IpExtInNoRoutes

This counter means the packet is dropped when the IP stack receives a
packet and can't find a route for it in the route table. It might
happen when IP forwarding is enabled, the destination IP address is
not a local address, and there is no route for the destination IP
address.

* IpInUnknownProtos

Defined in `RFC1213 ipInUnknownProtos`_. It will be increased if the
layer 4 protocol is unsupported by the kernel. If an application is
using a raw socket, the kernel will always deliver the packet to the raw
socket and this counter won't be increased.

.. _RFC1213 ipInUnknownProtos: https://tools.ietf.org/html/rfc1213#page-27

* IpExtInTruncatedPkts

For an IPv4 packet, it means the actual data size is smaller than the
"Total Length" field in the IPv4 header.

* IpInDiscards

Defined in `RFC1213 ipInDiscards`_. It indicates the packet is dropped
in the IP receiving path due to kernel internal reasons (e.g. not
enough memory).

.. _RFC1213 ipInDiscards: https://tools.ietf.org/html/rfc1213#page-28

* IpOutDiscards

Defined in `RFC1213 ipOutDiscards`_. It indicates the packet is
dropped in the IP sending path due to kernel internal reasons.

.. _RFC1213 ipOutDiscards: https://tools.ietf.org/html/rfc1213#page-28

* IpOutNoRoutes

Defined in `RFC1213 ipOutNoRoutes`_. It indicates the packet is
dropped in the IP sending path because no route is found for it.

.. _RFC1213 ipOutNoRoutes: https://tools.ietf.org/html/rfc1213#page-29

ICMP counters
=============
* IcmpInMsgs and IcmpOutMsgs

Defined by `RFC1213 icmpInMsgs`_ and `RFC1213 icmpOutMsgs`_

.. _RFC1213 icmpInMsgs: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpOutMsgs: https://tools.ietf.org/html/rfc1213#page-43

As mentioned in RFC1213, these two counters include errors; they
would be increased even if the ICMP packet has an invalid type. The
ICMP output path will check the header of a raw socket, so
IcmpOutMsgs would still be updated if the IP header is constructed by
a userspace program.

* ICMP named types

| These counters include most of the common ICMP types. They are:
| IcmpInDestUnreachs: `RFC1213 icmpInDestUnreachs`_
| IcmpInTimeExcds: `RFC1213 icmpInTimeExcds`_
| IcmpInParmProbs: `RFC1213 icmpInParmProbs`_
| IcmpInSrcQuenchs: `RFC1213 icmpInSrcQuenchs`_
| IcmpInRedirects: `RFC1213 icmpInRedirects`_
| IcmpInEchos: `RFC1213 icmpInEchos`_
| IcmpInEchoReps: `RFC1213 icmpInEchoReps`_
| IcmpInTimestamps: `RFC1213 icmpInTimestamps`_
| IcmpInTimestampReps: `RFC1213 icmpInTimestampReps`_
| IcmpInAddrMasks: `RFC1213 icmpInAddrMasks`_
| IcmpInAddrMaskReps: `RFC1213 icmpInAddrMaskReps`_
| IcmpOutDestUnreachs: `RFC1213 icmpOutDestUnreachs`_
| IcmpOutTimeExcds: `RFC1213 icmpOutTimeExcds`_
| IcmpOutParmProbs: `RFC1213 icmpOutParmProbs`_
| IcmpOutSrcQuenchs: `RFC1213 icmpOutSrcQuenchs`_
| IcmpOutRedirects: `RFC1213 icmpOutRedirects`_
| IcmpOutEchos: `RFC1213 icmpOutEchos`_
| IcmpOutEchoReps: `RFC1213 icmpOutEchoReps`_
| IcmpOutTimestamps: `RFC1213 icmpOutTimestamps`_
| IcmpOutTimestampReps: `RFC1213 icmpOutTimestampReps`_
| IcmpOutAddrMasks: `RFC1213 icmpOutAddrMasks`_
| IcmpOutAddrMaskReps: `RFC1213 icmpOutAddrMaskReps`_

.. _RFC1213 icmpInDestUnreachs: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpInTimeExcds: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpInParmProbs: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInSrcQuenchs: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInRedirects: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInEchos: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInEchoReps: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInTimestamps: https://tools.ietf.org/html/rfc1213#page-42
.. _RFC1213 icmpInTimestampReps: https://tools.ietf.org/html/rfc1213#page-43
.. _RFC1213 icmpInAddrMasks: https://tools.ietf.org/html/rfc1213#page-43
.. _RFC1213 icmpInAddrMaskReps: https://tools.ietf.org/html/rfc1213#page-43

.. _RFC1213 icmpOutDestUnreachs: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutTimeExcds: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutParmProbs: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutSrcQuenchs: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutRedirects: https://tools.ietf.org/html/rfc1213#page-44
.. _RFC1213 icmpOutEchos: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutEchoReps: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutTimestamps: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutTimestampReps: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutAddrMasks: https://tools.ietf.org/html/rfc1213#page-45
.. _RFC1213 icmpOutAddrMaskReps: https://tools.ietf.org/html/rfc1213#page-46

Every ICMP type has two counters: 'In' and 'Out'. E.g., for the ICMP
Echo packet, they are IcmpInEchos and IcmpOutEchos. Their meanings are
straightforward. The 'In' counter means the kernel receives such a packet
and the 'Out' counter means the kernel sends such a packet.

* ICMP numeric types

They are IcmpMsgInType[N] and IcmpMsgOutType[N], where [N] indicates the
ICMP type number. These counters track all kinds of ICMP packets. The
ICMP type number definitions can be found in the `ICMP parameters`_
document.

.. _ICMP parameters: https://www.iana.org/assignments/icmp-parameters/icmp-parameters.xhtml

For example, if the Linux kernel sends an ICMP Echo packet,
IcmpMsgOutType8 would increase by 1. And if the kernel gets an ICMP Echo
Reply packet, IcmpMsgInType0 would increase by 1.

* IcmpInCsumErrors

This counter indicates the checksum of the ICMP packet is
wrong. The kernel verifies the checksum after updating IcmpInMsgs and
before updating IcmpMsgInType[N]. If a packet has a bad checksum,
IcmpInMsgs would be updated but none of IcmpMsgInType[N] would be updated.

* IcmpInErrors and IcmpOutErrors

Defined by `RFC1213 icmpInErrors`_ and `RFC1213 icmpOutErrors`_

.. _RFC1213 icmpInErrors: https://tools.ietf.org/html/rfc1213#page-41
.. _RFC1213 icmpOutErrors: https://tools.ietf.org/html/rfc1213#page-43

When an error occurs in the ICMP packet handler path, these two
counters would be updated. The receiving packet path uses IcmpInErrors
and the sending packet path uses IcmpOutErrors. When IcmpInCsumErrors
is increased, IcmpInErrors would always be increased too.

Relationship of the ICMP counters
---------------------------------
The sum of IcmpMsgOutType[N] is always equal to IcmpOutMsgs, as they
are updated at the same time. The sum of IcmpMsgInType[N] plus
IcmpInErrors should be equal to or larger than IcmpInMsgs. When the
kernel receives an ICMP packet, it follows the logic below:

1. increase IcmpInMsgs
2. if there is any error, update IcmpInErrors and finish the process
3. update IcmpMsgInType[N]
4. handle the packet depending on the type; if there is any error,
   update IcmpInErrors and finish the process

So if all errors occur in step (2), IcmpInMsgs should be equal to the
sum of IcmpMsgInType[N] plus IcmpInErrors. If all errors occur in
step (4), IcmpInMsgs should be equal to the sum of
IcmpMsgInType[N]. If the errors occur in both step (2) and step (4),
IcmpInMsgs should be less than the sum of IcmpMsgInType[N] plus
IcmpInErrors.

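The receive-side steps can be modelled with a small simulation. This is
only a sketch of the counter bookkeeping, not kernel code: the function
``receive_icmp`` and its ``early_error`` / ``handler_error`` flags are
hypothetical, and step 3 updates IcmpMsgInType[N] because the packet is
being received.

.. code-block:: python

   from collections import Counter


   def receive_icmp(counters, icmp_type, early_error=False, handler_error=False):
       """Apply the four receive steps to a Counter of SNMP-style counters."""
       counters["IcmpInMsgs"] += 1                  # step 1
       if early_error:                              # step 2, e.g. bad checksum
           counters["IcmpInErrors"] += 1
           return
       counters[f"IcmpMsgInType{icmp_type}"] += 1   # step 3
       if handler_error:                            # step 4
           counters["IcmpInErrors"] += 1


   def sum_in_types(counters):
       """Sum every IcmpMsgInType[N] counter."""
       return sum(v for k, v in counters.items() if k.startswith("IcmpMsgInType"))


   counters = Counter()
   receive_icmp(counters, 0)                        # clean Echo Reply
   receive_icmp(counters, 8, early_error=True)      # error in step 2
   receive_icmp(counters, 3, handler_error=True)    # error in step 4

With one error in step (2) and one in step (4), IcmpInMsgs is 3 while the
sum of IcmpMsgInType[N] plus IcmpInErrors is 4, which matches the "less
than" case described above.
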
General TCP counters
====================
* TcpInSegs

Defined in `RFC1213 tcpInSegs`_

.. _RFC1213 tcpInSegs: https://tools.ietf.org/html/rfc1213#page-48

The number of packets received by the TCP layer. As mentioned in
RFC1213, it includes the packets received in error, such as checksum
error, invalid TCP header and so on. Only one error won't be included:
the layer 2 destination address is not the NIC's layer 2
address. It might happen if the packet is a multicast or broadcast
packet, or the NIC is in promiscuous mode. In these situations, the
packets would be delivered to the TCP layer, but the TCP layer will discard
these packets before increasing TcpInSegs. The TcpInSegs counter
isn't aware of GRO. So if two packets are merged by GRO, the TcpInSegs
counter would only increase by 1.

* TcpOutSegs

Defined in `RFC1213 tcpOutSegs`_

.. _RFC1213 tcpOutSegs: https://tools.ietf.org/html/rfc1213#page-48

The number of packets sent by the TCP layer. As mentioned in RFC1213,
it excludes the retransmitted packets. But it includes the SYN, ACK
and RST packets. Unlike TcpInSegs, TcpOutSegs is aware of
GSO, so if a packet would be split into 2 by GSO, TcpOutSegs will
increase by 2.

* TcpActiveOpens

Defined in `RFC1213 tcpActiveOpens`_

.. _RFC1213 tcpActiveOpens: https://tools.ietf.org/html/rfc1213#page-47

It means the TCP layer sends a SYN and comes into the SYN-SENT
state. Every time TcpActiveOpens increases by 1, TcpOutSegs should
always increase by 1.

* TcpPassiveOpens

Defined in `RFC1213 tcpPassiveOpens`_

.. _RFC1213 tcpPassiveOpens: https://tools.ietf.org/html/rfc1213#page-47

It means the TCP layer receives a SYN, replies with a SYN+ACK, and comes
into the SYN-RCVD state.

* TcpExtTCPRcvCoalesce

When packets are received by the TCP layer and are not read by the
application, the TCP layer will try to merge them. This counter
indicates how many packets are merged in such a situation. If GRO is
enabled, lots of packets would be merged by GRO; these packets
wouldn't be counted in TcpExtTCPRcvCoalesce.

* TcpExtTCPAutoCorking

When sending packets, the TCP layer will try to merge small packets into
a bigger one. This counter increases by 1 for every packet merged in
such a situation. Please refer to the LWN article for more details:
https://lwn.net/Articles/576263/

* TcpExtTCPOrigDataSent

This counter is explained by `kernel commit f19c29e3e391`_. The
explanation is quoted below::

  TCPOrigDataSent: number of outgoing packets with original data (excluding
  retransmission but including data-in-SYN). This counter is different from
  TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is
  more useful to track the TCP retransmission rate.

* TCPSynRetrans

This counter is explained by `kernel commit f19c29e3e391`_. The
explanation is quoted below::

  TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
  retransmissions into SYN, fast-retransmits, timeout retransmits, etc.

* TCPFastOpenActiveFail

This counter is explained by `kernel commit f19c29e3e391`_. The
explanation is quoted below::

  TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed because
  the remote does not accept it or the attempts timed out.

.. _kernel commit f19c29e3e391: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f19c29e3e391a66a273e9afebaf01917245148cd

* TcpExtListenOverflows and TcpExtListenDrops

When the kernel receives a SYN from a client and the TCP accept queue
is full, the kernel will drop the SYN and add 1 to TcpExtListenOverflows.
At the same time the kernel will also add 1 to TcpExtListenDrops. When a
TCP socket is in the LISTEN state and the kernel needs to drop a packet,
the kernel always adds 1 to TcpExtListenDrops. So an increase of
TcpExtListenOverflows always increases TcpExtListenDrops at the
same time, but TcpExtListenDrops can also increase without
TcpExtListenOverflows increasing, e.g. a memory allocation failure
would also increase TcpExtListenDrops.

Note: The above explanation is based on kernel 4.10 or above. On
an old kernel, the TCP stack behaves differently when the TCP accept
queue is full. On the old kernel, the TCP stack won't drop the SYN; it
would complete the 3-way handshake. As the accept queue is full, the TCP
stack will keep the socket in the TCP half-open queue. As it is in the
half-open queue, the TCP stack will send SYN+ACK on an exponential
backoff timer. After the client replies with an ACK, the TCP stack checks
whether the accept queue is still full. If it is not full, it moves the
socket to the accept queue; if it is full, it keeps the socket in the
half-open queue, and the next time the client replies with an ACK, this
socket will get another chance to move to the accept queue.

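The relationship between the two listen counters (for kernel 4.10 or
above) can be sketched as follows. This is a toy model of the
bookkeeping only; ``drop_on_listen_socket`` is a hypothetical helper, not
a kernel function.

.. code-block:: python

   from collections import Counter


   def drop_on_listen_socket(counters, accept_queue_full):
       """Account for one packet dropped on a socket in the LISTEN state."""
       if accept_queue_full:
           counters["TcpExtListenOverflows"] += 1
       # Every drop on a listening socket is counted here, whatever the reason.
       counters["TcpExtListenDrops"] += 1


   counters = Counter()
   drop_on_listen_socket(counters, accept_queue_full=True)    # queue overflow
   drop_on_listen_socket(counters, accept_queue_full=False)   # e.g. allocation failure

In this model TcpExtListenDrops can never be smaller than
TcpExtListenOverflows, which is a handy sanity check when reading real
counter values.
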

TCP Fast Path
=============
When the kernel receives a TCP packet, it has two paths to handle the
packet: one is the fast path, the other is the slow path. The comment in
the kernel code provides a good explanation of them, quoted below::

  It is split into a fast path and a slow path. The fast path is
  disabled when:

  - A zero window was announced from us
  - zero window probing
    is only handled properly on the slow path.
  - Out of order segments arrived.
  - Urgent data is expected.
  - There is no buffer space left
  - Unexpected TCP flags/window values/header lengths are received
    (detected by checking the TCP header against pred_flags)
  - Data is sent in both directions. The fast path only supports pure senders
    or pure receivers (this means either the sequence number or the ack
    value must stay constant)
  - Unexpected TCP option.

The kernel will try to use the fast path unless any of the above
conditions are satisfied. If the packets are out of order, the kernel
will handle them in the slow path, which means the performance might not
be very good. The kernel would also come into the slow path if "Delayed
ack" is used, because when using "Delayed ack", the data is sent in both
directions. When the TCP window scale option is not used, the kernel will
try to enable the fast path immediately when the connection comes into
the established state, but if the TCP window scale option is used, the
kernel will disable the fast path at first, and try to enable it after
the kernel receives packets.

* TcpExtTCPPureAcks and TcpExtTCPHPAcks

If a packet has the ACK flag set and has no data, it is a pure ACK
packet. If the kernel handles it in the fast path, TcpExtTCPHPAcks will
increase by 1; if the kernel handles it in the slow path,
TcpExtTCPPureAcks will increase by 1.

* TcpExtTCPHPHits

If a TCP packet has data (which means it is not a pure ACK packet),
and this packet is handled in the fast path, TcpExtTCPHPHits will
increase by 1.

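The three counters above can be summarized as a small decision function.
This is an illustrative sketch; ``fast_path_counter`` is a hypothetical
name, and it returns ``None`` for data packets handled in the slow path
because this section does not name a counter for that case.

.. code-block:: python

   def fast_path_counter(has_data, fast_path):
       """Return the counter increased for a received packet, or None."""
       if has_data:
           # Data packet: only the fast path case has a counter here.
           return "TcpExtTCPHPHits" if fast_path else None
       # Pure ACK: the counter depends on which path handled it.
       return "TcpExtTCPHPAcks" if fast_path else "TcpExtTCPPureAcks"
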

TCP abort
=========

* TcpExtTCPAbortOnData

It means the TCP layer has data in flight, but needs to close the
connection. So the TCP layer sends a RST to the other side, indicating
the connection is not closed gracefully. An easy way to increase this
counter is using the SO_LINGER option. Please refer to the SO_LINGER
section of the `socket man page`_:

.. _socket man page: http://man7.org/linux/man-pages/man7/socket.7.html

By default, when an application closes a connection, the close function
will return immediately and the kernel will try to send the in-flight
data asynchronously. If you use the SO_LINGER option, set l_onoff to 1
and l_linger to a positive number, the close function won't return
immediately, but will wait until the in-flight data is acked by the other
side; the max wait time is l_linger seconds. If you set l_onoff to 1 and
l_linger to 0, when the application closes a connection, the kernel will
send a RST immediately and increase the TcpExtTCPAbortOnData counter.

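A minimal sketch of setting SO_LINGER with l_onoff=1 and l_linger=0 from
Python is shown below. It assumes a Linux host, where ``struct linger``
is two C ints; the helper names ``enable_abortive_close`` and
``get_linger`` are made up for this example. The socket is never
connected here, so no RST is actually sent.

.. code-block:: python

   import socket
   import struct


   def enable_abortive_close(sock):
       """Set SO_LINGER with l_onoff=1 and l_linger=0 on a socket."""
       # struct linger is two C ints: l_onoff, l_linger.
       sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))


   def get_linger(sock):
       """Return (l_onoff, l_linger) as currently configured."""
       raw = sock.getsockopt(
           socket.SOL_SOCKET, socket.SO_LINGER, struct.calcsize("ii")
       )
       return struct.unpack("ii", raw)


   sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
   enable_abortive_close(sock)
   linger = get_linger(sock)
   sock.close()

On a connected socket configured this way, calling ``close()`` is the
step that would make the kernel send the RST.
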
* TcpExtTCPAbortOnClose

This counter means the application has unread data in the TCP layer when
the application wants to close the TCP connection. In such a situation,
the kernel will send a RST to the other side of the TCP connection.

* TcpExtTCPAbortOnMemory

When an application closes a TCP connection, the kernel still needs to
track the connection and let it complete the TCP disconnect process. E.g.
an app calls the close method of a socket, the kernel sends a FIN to the
other side of the connection, then the app has no relationship with the
socket any more, but the kernel needs to keep the socket. This socket
becomes an orphan socket; the kernel waits for the reply of the other
side, and would come to the TIME_WAIT state finally. When the kernel
doesn't have enough memory to keep the orphan socket, the kernel would
send a RST to the other side and delete the socket. In such a situation,
the kernel will add 1 to TcpExtTCPAbortOnMemory. Two conditions would
trigger TcpExtTCPAbortOnMemory:

1. the memory used by the TCP protocol is higher than the third value of
   tcp_mem. Please refer to the tcp_mem section in the `TCP man page`_.
2. the orphan socket count is higher than net.ipv4.tcp_max_orphans

.. _TCP man page: http://man7.org/linux/man-pages/man7/tcp.7.html

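The two trigger conditions can be written as a simple predicate. This is
a simplified sketch, not the kernel's actual check: the function names
are hypothetical, and it assumes the tcp_mem sysctl text holds three
page counts (low, pressure, high) as described in the TCP man page.

.. code-block:: python

   def parse_tcp_mem(text):
       """Split a tcp_mem string into its three page counts."""
       low, pressure, high = (int(field) for field in text.split())
       return low, pressure, high


   def orphan_would_be_aborted(tcp_pages, tcp_mem_text, orphan_count, max_orphans):
       """Return True if either TcpExtTCPAbortOnMemory condition holds."""
       _, _, high = parse_tcp_mem(tcp_mem_text)
       over_memory = tcp_pages > high             # condition 1
       over_orphans = orphan_count > max_orphans  # condition 2
       return over_memory or over_orphans

For instance, with a tcp_mem of ``"190239 253654 380478"`` (made-up
values), the predicate is true once TCP uses more than 380478 pages or
the orphan count exceeds the configured maximum.
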

* TcpExtTCPAbortOnTimeout

This counter will increase when any of the TCP timers expire. In such a
situation, the kernel won't send a RST; it just gives up the connection.

* TcpExtTCPAbortOnLinger

When a TCP connection comes into the FIN_WAIT_2 state, instead of waiting
for the FIN packet from the other side, the kernel could send a RST and
delete the socket immediately. This is not the default behavior of the
Linux kernel TCP stack. By configuring the TCP_LINGER2 socket option,
you could let the kernel follow this behavior.

* TcpExtTCPAbortFailed

The kernel TCP layer will send a RST if the `RFC2525 2.17 section`_ is
satisfied. If an internal error occurs during this process,
TcpExtTCPAbortFailed will be increased.

.. _RFC2525 2.17 section: https://tools.ietf.org/html/rfc2525#page-50

TCP Hybrid Slow Start
=====================
The Hybrid Slow Start algorithm is an enhancement of the traditional
TCP congestion window Slow Start algorithm. It uses two pieces of
information to detect whether the max bandwidth of the TCP path is
approached. The two pieces of information are ACK train length and
increase in packet delay. For detailed information, please refer to the
`Hybrid Slow Start paper`_. When either the ACK train length or the
packet delay hits a specific threshold, the congestion control algorithm
will come into the Congestion Avoidance state. As of v4.20, two
congestion control algorithms use Hybrid Slow Start: cubic (the
default congestion control algorithm) and cdg. Four SNMP counters
relate to the Hybrid Slow Start algorithm.

.. _Hybrid Slow Start paper: https://pdfs.semanticscholar.org/25e9/ef3f03315782c7f1cbcd31b587857adae7d1.pdf

* TcpExtTCPHystartTrainDetect

How many times the ACK train length threshold is detected.

* TcpExtTCPHystartTrainCwnd

The sum of CWND detected by ACK train length. Dividing this value by
TcpExtTCPHystartTrainDetect gives the average CWND detected by the
ACK train length.

* TcpExtTCPHystartDelayDetect

How many times the packet delay threshold is detected.

* TcpExtTCPHystartDelayCwnd

The sum of CWND detected by packet delay. Dividing this value by
TcpExtTCPHystartDelayDetect gives the average CWND detected by the
packet delay.

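The average CWND described above is a plain division of the two
counters. The sketch below adds a guard for the case where nothing has
been detected yet; ``average_hystart_cwnd`` is a hypothetical helper and
the numbers in the example are made up.

.. code-block:: python

   def average_hystart_cwnd(cwnd_sum, detect_count):
       """Return the average CWND, or None if nothing was detected yet."""
       if detect_count == 0:
           return None
       return cwnd_sum / detect_count


   # Hypothetical readings of TcpExtTCPHystartTrainCwnd and
   # TcpExtTCPHystartTrainDetect.
   train_average = average_hystart_cwnd(300, 4)

The same function applies to the TcpExtTCPHystartDelayCwnd and
TcpExtTCPHystartDelayDetect pair.
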
TCP retransmission and congestion control
=========================================
The TCP protocol has two retransmission mechanisms: SACK and fast
recovery. They are mutually exclusive. When SACK is enabled,
the kernel TCP stack uses SACK; otherwise the kernel uses fast
recovery. SACK is a TCP option, which is defined in `RFC2018`_.
Fast recovery is defined in `RFC6582`_ and is also called
'Reno'.

TCP congestion control is a big and complex topic. To understand
the related SNMP counters, we need to know the states of the congestion
control state machine. There are 5 states: Open, Disorder, CWR,
Recovery and Loss. For details about these states, please refer to page 5
and page 6 of this document:
https://pdfs.semanticscholar.org/0e9c/968d09ab2e53e24c4dca5b2d67c7f7140f8e.pdf

.. _RFC2018: https://tools.ietf.org/html/rfc2018
.. _RFC6582: https://tools.ietf.org/html/rfc6582

* TcpExtTCPRenoRecovery and TcpExtTCPSackRecovery

When the congestion control comes into the Recovery state, if SACK is
used, TcpExtTCPSackRecovery increases by 1; if SACK is not used,
TcpExtTCPRenoRecovery increases by 1. These two counters mean the TCP
stack begins to retransmit the lost packets.

* TcpExtTCPSACKReneging

A packet was acknowledged by SACK, but the receiver has dropped this
packet, so the sender needs to retransmit this packet. In this
situation, the sender adds 1 to TcpExtTCPSACKReneging. A receiver
could drop a packet which has been acknowledged by SACK; although it is
unusual, it is allowed by the TCP protocol. The sender doesn't really
know what happened on the receiver side. The sender just waits until
the RTO expires for this packet, then the sender assumes this packet
has been dropped by the receiver.

* TcpExtTCPRenoReorder

The reordered packet is detected by fast recovery. It would only be used
if SACK is disabled. The fast recovery algorithm detects reordering by
the duplicate ACK number. E.g., if retransmission is triggered, and
the original retransmitted packet is not lost but just out of
order, the receiver would acknowledge multiple times, once for the
retransmitted packet, and once for the arrival of the original out of
order packet. Thus the sender would find more ACKs than it
expected, and the sender knows reordering has occurred.

* TcpExtTCPTSReorder

The reordered packet is detected when a hole is filled. E.g., assume the
sender sends packets 1,2,3,4,5, and the receiving order is
1,2,4,5,3. When the sender receives the ACK of packet 3 (which will
fill the hole), two conditions will let TcpExtTCPTSReorder increase
by 1: (1) packet 3 has not been retransmitted yet. (2) packet
3 has been retransmitted but the timestamp of packet 3's ACK is earlier
than the retransmission timestamp.

* TcpExtTCPSACKReorder

The reordered packet is detected by SACK. SACK has two methods to
detect reordering: (1) DSACK is received by the sender. It means the
sender sent the same packet more than once. And the only reason
is the sender believed an out of order packet was lost so it sent the
packet again. (2) Assume packets 1,2,3,4,5 are sent by the sender, and
the sender has received SACKs for packets 2 and 5. Now the sender
receives a SACK for packet 4 and the sender hasn't retransmitted the
packet yet, so the sender knows packet 4 is out of order. The TCP
stack of the kernel will increase TcpExtTCPSACKReorder for both of the
above scenarios.


DSACK
=====
DSACK is defined in `RFC2883`_. The receiver uses DSACK to report
duplicate packets to the sender. There are two kinds of
duplications: (1) a packet which has been acknowledged is
duplicated. (2) an out of order packet is duplicated. The TCP stack
counts these two kinds of duplications on both the receiver side and
the sender side.

.. _RFC2883: https://tools.ietf.org/html/rfc2883

* TcpExtTCPDSACKOldSent

The TCP stack receives a duplicate packet which has been acked, so it
sends a DSACK to the sender.

* TcpExtTCPDSACKOfoSent

The TCP stack receives an out of order duplicate packet, so it sends a
DSACK to the sender.

* TcpExtTCPDSACKRecv

The TCP stack receives a DSACK, which indicates an acknowledged
duplicate packet is received.

* TcpExtTCPDSACKOfoRecv

The TCP stack receives a DSACK, which indicates an out of order
duplicate packet is received.

TCP out of order
================
* TcpExtTCPOFOQueue

The TCP layer receives an out of order packet and has enough memory
to queue it.

* TcpExtTCPOFODrop

The TCP layer receives an out of order packet but doesn't have enough
memory, so drops it. Such packets won't be counted in
TcpExtTCPOFOQueue.

* TcpExtTCPOFOMerge

The received out of order packet overlaps with the previous
packet. The overlapping part will be dropped. All TcpExtTCPOFOMerge
packets will also be counted in TcpExtTCPOFOQueue.

TCP PAWS
========
PAWS (Protection Against Wrapped Sequence numbers) is an algorithm
which is used to drop old packets. It depends on the TCP
timestamps. For detailed information, please refer to the `timestamp wiki`_
and the `RFC of PAWS`_.

.. _RFC of PAWS: https://tools.ietf.org/html/rfc1323#page-17
.. _timestamp wiki: https://en.wikipedia.org/wiki/Transmission_Control_Protocol#TCP_timestamps

* TcpExtPAWSActive

Packets are dropped by PAWS in Syn-Sent status.

* TcpExtPAWSEstab

Packets are dropped by PAWS in any status other than Syn-Sent.

TCP ACK skip
============
In some scenarios, the kernel would avoid sending duplicate ACKs too
frequently. Please find more details in the tcp_invalid_ratelimit
section of the `sysctl document`_. When the kernel decides to skip an ACK
due to tcp_invalid_ratelimit, the kernel would update one of the
counters below to indicate in which scenario the ACK is skipped. The ACK
would only be skipped if the received packet is either a SYN packet or
it has no data.

.. _sysctl document: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

* TcpExtTCPACKSkippedSynRecv

The ACK is skipped in Syn-Recv status. The Syn-Recv status means the
TCP stack receives a SYN and replies SYN+ACK. Now the TCP stack is
waiting for an ACK. Generally, the TCP stack doesn't need to send an ACK
in the Syn-Recv status. But in several scenarios, the TCP stack needs
to send an ACK. E.g., the TCP stack receives the same SYN packet
repeatedly, the received packet does not pass the PAWS check, or the
received packet sequence number is out of window. In these scenarios,
the TCP stack needs to send an ACK. If the ACK sending frequency is higher
than tcp_invalid_ratelimit allows, the TCP stack will skip sending the
ACK and increase TcpExtTCPACKSkippedSynRecv.

* TcpExtTCPACKSkippedPAWS

The ACK is skipped because the PAWS (Protection Against Wrapped Sequence
numbers) check fails. If the PAWS check fails in Syn-Recv, Fin-Wait-2
or Time-Wait statuses, the skipped ACK would be counted in
TcpExtTCPACKSkippedSynRecv, TcpExtTCPACKSkippedFinWait2 or
TcpExtTCPACKSkippedTimeWait. In all other statuses, the skipped ACK
would be counted in TcpExtTCPACKSkippedPAWS.

* TcpExtTCPACKSkippedSeq

The sequence number is out of window, the timestamp passes the PAWS
check, and the TCP status is not Syn-Recv, Fin-Wait-2 or Time-Wait.

* TcpExtTCPACKSkippedFinWait2

The ACK is skipped in Fin-Wait-2 status. The reason would be either
that the PAWS check fails or the received sequence number is out of window.

* TcpExtTCPACKSkippedTimeWait

The ACK is skipped in Time-Wait status. The reason would be either
that the PAWS check fails or the received sequence number is out of window.

* TcpExtTCPACKSkippedChallenge

The ACK is skipped if the ACK is a challenge ACK. RFC 5961 defines
3 kinds of challenge ACK; please refer to `RFC 5961 section 3.2`_,
`RFC 5961 section 4.2`_ and `RFC 5961 section 5.2`_. Besides these
three scenarios, in some TCP statuses, the Linux TCP stack would also
send challenge ACKs if the ACK number is before the first
unacknowledged number (more strict than `RFC 5961 section 5.2`_).

.. _RFC 5961 section 3.2: https://tools.ietf.org/html/rfc5961#page-7
.. _RFC 5961 section 4.2: https://tools.ietf.org/html/rfc5961#page-9
.. _RFC 5961 section 5.2: https://tools.ietf.org/html/rfc5961#page-11

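How the status and the failure reason select one of the skip counters
can be sketched as a lookup. This is an illustration of the rules in
this section only: ``ack_skipped_counter`` is a hypothetical name, it
assumes the ACK was skipped because of a PAWS failure or an out of
window sequence number, and it does not cover challenge ACKs.

.. code-block:: python

   PER_STATUS_COUNTERS = {
       "Syn-Recv": "TcpExtTCPACKSkippedSynRecv",
       "Fin-Wait-2": "TcpExtTCPACKSkippedFinWait2",
       "Time-Wait": "TcpExtTCPACKSkippedTimeWait",
   }


   def ack_skipped_counter(status, paws_failed):
       """Return the counter increased when a rate-limited ACK is skipped."""
       # These three statuses have their own counter, whatever the reason.
       if status in PER_STATUS_COUNTERS:
           return PER_STATUS_COUNTERS[status]
       # In all other statuses the reason picks the counter.
       return "TcpExtTCPACKSkippedPAWS" if paws_failed else "TcpExtTCPACKSkippedSeq"
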

Examples
========

Ping test
---------
Run the ping command against the public DNS server 8.8.8.8::

  nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
  PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
  64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=17.8 ms

  --- 8.8.8.8 ping statistics ---
  1 packets transmitted, 1 received, 0% packet loss, time 0ms
  rtt min/avg/max/mdev = 17.875/17.875/17.875/0.000 ms

The nstat result::

  nstatuser@nstat-a:~$ nstat
  #kernel
  IpInReceives                    1                  0.0
  IpInDelivers                    1                  0.0
  IpOutRequests                   1                  0.0
  IcmpInMsgs                      1                  0.0
  IcmpInEchoReps                  1                  0.0
  IcmpOutMsgs                     1                  0.0
  IcmpOutEchos                    1                  0.0
  IcmpMsgInType0                  1                  0.0
  IcmpMsgOutType8                 1                  0.0
  IpExtInOctets                   84                 0.0
  IpExtOutOctets                  84                 0.0
  IpExtInNoECTPkts                1                  0.0

The Linux server sent an ICMP Echo packet, so IpOutRequests,
IcmpOutMsgs, IcmpOutEchos and IcmpMsgOutType8 were each increased by 1.
The server got an ICMP Echo Reply from 8.8.8.8, so IpInReceives,
IcmpInMsgs, IcmpInEchoReps and IcmpMsgInType0 were each increased by 1.
The ICMP Echo Reply was passed to the ICMP layer via the IP layer, so
IpInDelivers was increased by 1. The default ping data size is 56 bytes
(the "56(84)" in the ping output), so an ICMP Echo packet and its
corresponding Echo Reply packet each consist of:

* 14 bytes MAC header
* 20 bytes IP header
* 8 bytes ICMP header
* 56 bytes data (default value of the ping command)

The MAC header is not counted at the IP layer, so IpExtInOctets and
IpExtOutOctets are 20+8+56=84.
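
The 84-byte total can be double-checked with a short Python sketch that
builds an ICMP Echo Request by hand. The identifier, the sequence
number and the zero-filled payload are placeholders chosen for
illustration (ping fills the payload differently), and the 20-byte IP
header assumes that no IP options are present.

```python
import struct

IP_HEADER_LEN = 20    # IPv4 header without options (assumption)
PING_DATA_LEN = 56    # default ping data size, the "56" in "56(84)"


def icmp_checksum(data: bytes) -> int:
    """Internet checksum (RFC 1071) over an even-length byte string."""
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(ident: int, seq: int, payload: bytes) -> bytes:
    """Return an ICMP Echo Request: 8-byte header followed by payload."""
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)   # type 8, code 0
    csum = icmp_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, csum, ident, seq) + payload


packet = build_echo_request(ident=1, seq=1, payload=bytes(PING_DATA_LEN))
ip_octets = IP_HEADER_LEN + len(packet)
print(len(packet), ip_octets)
```

The ICMP message is 64 bytes long, and adding the IP header gives the
84 bytes reported by IpExtInOctets and IpExtOutOctets.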
771
TCP 3-way handshake
773-------------------
774On server side, we run::
775
776  nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000
777  Listening on [0.0.0.0] (family 0, port 9000)
778
779On client side, we run::
780
781  nstatuser@nstat-a:~$ nc -nv 192.168.122.251 9000
782  Connection to 192.168.122.251 9000 port [tcp/*] succeeded!
783
The server listened on TCP port 9000, the client connected to it, and
they completed the 3-way handshake.
786
On the server side, we can find the nstat output below::
788
789  nstatuser@nstat-b:~$ nstat | grep -i tcp
790  TcpPassiveOpens                 1                  0.0
791  TcpInSegs                       2                  0.0
792  TcpOutSegs                      1                  0.0
793  TcpExtTCPPureAcks               1                  0.0
794
On the client side, we can find the nstat output below::
796
797  nstatuser@nstat-a:~$ nstat | grep -i tcp
798  TcpActiveOpens                  1                  0.0
799  TcpInSegs                       1                  0.0
800  TcpOutSegs                      2                  0.0
801
When the server received the first SYN, it replied with a SYN+ACK and
entered the SYN-RCVD state, so TcpPassiveOpens increased by 1. The
server received the SYN, sent the SYN+ACK, and received the ACK, so the
server sent 1 packet and received 2 packets: TcpInSegs increased by 2
and TcpOutSegs increased by 1. The last ACK of the 3-way handshake is a
pure ACK without data, so TcpExtTCPPureAcks increased by 1.

When the client sent the SYN, it entered the SYN-SENT state, so
TcpActiveOpens increased by 1. The client sent the SYN, received the
SYN+ACK, and sent the ACK, so the client sent 2 packets and received 1
packet: TcpInSegs increased by 1 and TcpOutSegs increased by 2.
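
The bookkeeping above can be replayed with a small Python sketch. It is
only a model of the explanation in this section, not of the kernel
code: the ``tally`` function and the way segments are written down are
made up for illustration.

```python
from collections import Counter

# Segments of a 3-way handshake as (sender, flags), in order.
HANDSHAKE = [("client", "SYN"), ("server", "SYN+ACK"), ("client", "ACK")]


def tally(segments):
    """Count segments per host the way the explanation above does."""
    counters = {"client": Counter(), "server": Counter()}
    for sender, flags in segments:
        receiver = "server" if sender == "client" else "client"
        counters[sender]["TcpOutSegs"] += 1
        counters[receiver]["TcpInSegs"] += 1
        if flags == "SYN":
            counters[sender]["TcpActiveOpens"] += 1      # enters SYN-SENT
            counters[receiver]["TcpPassiveOpens"] += 1   # enters SYN-RCVD
        elif flags == "ACK":
            # The final ACK carries no data, so it is a pure ACK.
            counters[receiver]["TcpExtTCPPureAcks"] += 1
    return counters


result = tally(HANDSHAKE)
print(dict(result["server"]))
print(dict(result["client"]))
```

The two dictionaries contain the same counters and values as the
server-side and client-side nstat outputs shown above.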
813
814TCP normal traffic
815------------------
816Run nc on server::
817
818  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
819  Listening on [0.0.0.0] (family 0, port 9000)
820
821Run nc on client::
822
823  nstatuser@nstat-a:~$ nc -v nstat-b 9000
824  Connection to nstat-b 9000 port [tcp/*] succeeded!
825
826Input a string in the nc client ('hello' in our example)::
827
828  nstatuser@nstat-a:~$ nc -v nstat-b 9000
829  Connection to nstat-b 9000 port [tcp/*] succeeded!
830  hello
831
832The client side nstat output::
833
834  nstatuser@nstat-a:~$ nstat
835  #kernel
836  IpInReceives                    1                  0.0
837  IpInDelivers                    1                  0.0
838  IpOutRequests                   1                  0.0
839  TcpInSegs                       1                  0.0
840  TcpOutSegs                      1                  0.0
841  TcpExtTCPPureAcks               1                  0.0
842  TcpExtTCPOrigDataSent           1                  0.0
843  IpExtInOctets                   52                 0.0
844  IpExtOutOctets                  58                 0.0
845  IpExtInNoECTPkts                1                  0.0
846
847The server side nstat output::
848
849  nstatuser@nstat-b:~$ nstat
850  #kernel
851  IpInReceives                    1                  0.0
852  IpInDelivers                    1                  0.0
853  IpOutRequests                   1                  0.0
854  TcpInSegs                       1                  0.0
855  TcpOutSegs                      1                  0.0
856  IpExtInOctets                   58                 0.0
857  IpExtOutOctets                  52                 0.0
858  IpExtInNoECTPkts                1                  0.0
859
Input a string in the nc client again ('world' in our example)::
861
862  nstatuser@nstat-a:~$ nc -v nstat-b 9000
863  Connection to nstat-b 9000 port [tcp/*] succeeded!
864  hello
865  world
866
867Client side nstat output::
868
869  nstatuser@nstat-a:~$ nstat
870  #kernel
871  IpInReceives                    1                  0.0
872  IpInDelivers                    1                  0.0
873  IpOutRequests                   1                  0.0
874  TcpInSegs                       1                  0.0
875  TcpOutSegs                      1                  0.0
876  TcpExtTCPHPAcks                 1                  0.0
877  TcpExtTCPOrigDataSent           1                  0.0
878  IpExtInOctets                   52                 0.0
879  IpExtOutOctets                  58                 0.0
880  IpExtInNoECTPkts                1                  0.0
881
882
883Server side nstat output::
884
885  nstatuser@nstat-b:~$ nstat
886  #kernel
887  IpInReceives                    1                  0.0
888  IpInDelivers                    1                  0.0
889  IpOutRequests                   1                  0.0
890  TcpInSegs                       1                  0.0
891  TcpOutSegs                      1                  0.0
892  TcpExtTCPHPHits                 1                  0.0
893  IpExtInOctets                   58                 0.0
894  IpExtOutOctets                  52                 0.0
895  IpExtInNoECTPkts                1                  0.0
896
Comparing the first and the second client-side nstat outputs, we find
one difference: the first one had a 'TcpExtTCPPureAcks', but the second
one had a 'TcpExtTCPHPAcks'. The two server-side nstat outputs differed
too: the second one had a 'TcpExtTCPHPHits', but the first one didn't.
The network traffic patterns were exactly the same: the client sent a
packet to the server, and the server replied with an ACK. But the
kernel handled them in different ways. When the TCP window scale option
is not used, the kernel tries to enable the fast path immediately when
the connection enters the established state. If the TCP window scale
option is used, the kernel disables the fast path at first and tries to
enable it after the kernel receives packets. We can use the 'ss' command
to verify whether the window scale option is used, e.g. run the command
below on either the server or the client::
912
  nstatuser@nstat-a:~$ ss -o state established -i '( dport = :9000 or sport = :9000 )'
914  Netid    Recv-Q     Send-Q            Local Address:Port             Peer Address:Port
915  tcp      0          0               192.168.122.250:40654         192.168.122.251:9000
916             ts sack cubic wscale:7,7 rto:204 rtt:0.98/0.49 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:1 send 118.2Mbps lastsnd:46572 lastrcv:46572 lastack:46572 pacing_rate 236.4Mbps rcv_space:29200 rcv_ssthresh:29200 minrtt:0.98
917
The 'wscale:7,7' means both the server and the client set the window
scale option to 7. Now we can explain the nstat output in our test:

In the first client-side nstat output, the client sent a packet and
the server replied with an ACK. When the kernel handled this ACK, the
fast path was not enabled, so the ACK was counted into
'TcpExtTCPPureAcks'.

In the second client-side nstat output, the client sent a packet again
and received another ACK from the server. This time the fast path was
enabled and the ACK qualified for it, so the ACK was handled by the
fast path and counted into 'TcpExtTCPHPAcks'.

In the first server-side nstat output, the fast path was not enabled,
so there was no 'TcpExtTCPHPHits'.

In the second server-side nstat output, the fast path was enabled and
the packet received from the client qualified for it, so the packet
was counted into 'TcpExtTCPHPHits'.
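
Spotting such differences by eye gets tedious with longer outputs. A
minimal sketch of a helper that parses two nstat outputs and lists the
counters that differ could look like this. The function names are
hypothetical, and the sample text is a shortened copy of the two
client-side outputs above.

```python
def parse_nstat(text):
    """Parse nstat output into a {counter_name: value} dictionary."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the '#kernel' header line.
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        counters[fields[0]] = int(fields[1])
    return counters


def changed_counters(first, second):
    """Return names present in only one run, or whose values differ."""
    names = set(first) | set(second)
    return sorted(n for n in names if first.get(n) != second.get(n))


FIRST_CLIENT = """\
#kernel
TcpInSegs                       1                  0.0
TcpOutSegs                      1                  0.0
TcpExtTCPPureAcks               1                  0.0
TcpExtTCPOrigDataSent           1                  0.0
"""
SECOND_CLIENT = """\
#kernel
TcpInSegs                       1                  0.0
TcpOutSegs                      1                  0.0
TcpExtTCPHPAcks                 1                  0.0
TcpExtTCPOrigDataSent           1                  0.0
"""

diff = changed_counters(parse_nstat(FIRST_CLIENT), parse_nstat(SECOND_CLIENT))
print(diff)   # ['TcpExtTCPHPAcks', 'TcpExtTCPPureAcks']
```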
936
937TcpExtTCPAbortOnClose
938---------------------
On the server side, we run the Python script below::
940
941  import socket
942  import time
943
944  port = 9000
945
946  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
947  s.bind(('0.0.0.0', port))
948  s.listen(1)
949  sock, addr = s.accept()
950  while True:
951      time.sleep(9999999)
952
This Python script listens on port 9000, but doesn't read anything from
the connection.
955
956On the client side, we send the string "hello" by nc::
957
958  nstatuser@nstat-a:~$ echo "hello" | nc nstat-b 9000
959
Then we come back to the server side. The server has received the
"hello" packet and the TCP layer has acked it, but the application
hasn't read it yet. We type Ctrl-C to terminate the server script.
Then we find that TcpExtTCPAbortOnClose increased by 1 on the server
side::
964
965  nstatuser@nstat-b:~$ nstat | grep -i abort
966  TcpExtTCPAbortOnClose           1                  0.0
967
If we run tcpdump on the server side, we find that the server sent a
RST after we typed Ctrl-C.
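
The same effect can be observed without tcpdump. Below is a minimal
single-host sketch, assuming a Linux machine where loopback
connections are allowed. It replaces the nc client and the server
script with two sockets in one process, and the function name is made
up for illustration.

```python
import select
import socket


def close_with_unread_data():
    """Close a server-side socket that still has unread data queued."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))          # any free loopback port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname(), timeout=5.0)
    conn, _ = listener.accept()
    try:
        client.sendall(b"hello\n")
        select.select([conn], [], [], 5.0)   # wait until the data is queued
        conn.close()                         # the application never read it
        try:
            data = client.recv(16)
            return "eof" if data == b"" else "data"
        except ConnectionResetError:
            return "reset"
        except socket.timeout:
            return "timeout"
    finally:
        client.close()
        conn.close()
        listener.close()


outcome = close_with_unread_data()
print(outcome)
```

On Linux we would expect the client to see a connection reset, and
TcpExtTCPAbortOnClose to increase by 1 on the same host. Other
operating systems may behave differently.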
970
971TcpExtTCPAbortOnMemory and TcpExtTCPAbortOnTimeout
972---------------------------------------------------
Below is an example which lets the orphan socket count exceed
net.ipv4.tcp_max_orphans.
Change tcp_max_orphans to a smaller value on the client::
976
977  sudo bash -c "echo 10 > /proc/sys/net/ipv4/tcp_max_orphans"
978
Client code (creates 64 connections to the server)::
980
981  nstatuser@nstat-a:~$ cat client_orphan.py
982  import socket
983  import time
984
985  server = 'nstat-b' # server address
986  port = 9000
987
988  count = 64
989
990  connection_list = []
991
  for i in range(count):
993      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
994      s.connect((server, port))
995      connection_list.append(s)
996      print("connection_count: %d" % len(connection_list))
997
998  while True:
999      time.sleep(99999)
1000
Server code (accepts 64 connections from the client)::
1002
1003  nstatuser@nstat-b:~$ cat server_orphan.py
1004  import socket
1005  import time
1006
1007  port = 9000
1008  count = 64
1009
1010  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
1011  s.bind(('0.0.0.0', port))
1012  s.listen(count)
1013  connection_list = []
1014  while True:
1015      sock, addr = s.accept()
1016      connection_list.append((sock, addr))
1017      print("connection_count: %d" % len(connection_list))
1018
1019Run the python scripts on server and client.
1020
1021On server::
1022
1023  python3 server_orphan.py
1024
1025On client::
1026
1027  python3 client_orphan.py
1028
1029Run iptables on server::
1030
1031  sudo iptables -A INPUT -i ens3 -p tcp --destination-port 9000 -j DROP
1032
Type Ctrl-C on the client to stop client_orphan.py.
1034
1035Check TcpExtTCPAbortOnMemory on client::
1036
1037  nstatuser@nstat-a:~$ nstat | grep -i abort
1038  TcpExtTCPAbortOnMemory          54                 0.0
1039
Check the orphan socket count on the client::
1041
1042  nstatuser@nstat-a:~$ ss -s
1043  Total: 131 (kernel 0)
1044  TCP:   14 (estab 1, closed 0, orphaned 10, synrecv 0, timewait 0/0), ports 0
1045
1046  Transport Total     IP        IPv6
1047  *         0         -         -
1048  RAW       1         0         1
1049  UDP       1         1         0
1050  TCP       14        13        1
1051  INET      16        14        2
1052  FRAG      0         0         0
1053
The explanation of the test: after running server_orphan.py and
client_orphan.py, we had 64 connections between the server and the
client. After the iptables command was run, the server dropped all
packets from the client. When we typed Ctrl-C on client_orphan.py, the
client system tried to close these connections, and before they were
closed gracefully, they became orphan sockets. As the iptables rule on
the server blocked packets from the client, the server didn't receive
the FIN from the client, so all connections on the client were stuck in
the FIN_WAIT_1 state and stayed orphan sockets until they timed out. We
had echoed 10 to /proc/sys/net/ipv4/tcp_max_orphans, so the client
system kept only 10 orphan sockets. For all other orphan sockets, the
client system sent a RST and deleted them. We had 64 connections, so
the 'ss -s' command showed that the system had 10 orphan sockets, and
the value of TcpExtTCPAbortOnMemory was 64-10=54.
1068
An additional explanation about the orphan socket count: you can find
the exact orphan socket count with the 'ss -s' command, but when the
kernel decides whether to increase TcpExtTCPAbortOnMemory and send a
RST, it doesn't always check the exact orphan socket count. To improve
performance, the kernel checks an approximate count first, and only if
the approximate count is more than tcp_max_orphans does it check the
exact count. So if the approximate count is less than tcp_max_orphans
but the exact count is more than tcp_max_orphans, you would find that
TcpExtTCPAbortOnMemory is not increased at all. If tcp_max_orphans is
large enough, this won't occur, but if you decrease tcp_max_orphans to
a small value as in our test, you might hit this issue. That is why the
client in our test set up 64 connections although tcp_max_orphans is
10. If the client only set up 11 connections, we wouldn't find a change
of TcpExtTCPAbortOnMemory.
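
The two-step check can be illustrated with a toy model. This is a
sketch under stated assumptions, not the kernel implementation: the
class, the per-CPU layout, and the batch size of 32 are all invented
for illustration. The idea is that each CPU keeps a small local delta
and only folds it into the shared value once the delta reaches the
batch size, so a cheap read of the shared value can lag behind the
exact count.

```python
class ApproxCounter:
    """Toy per-CPU counter: cheap approximate read, costly exact sum."""

    def __init__(self, cpus, batch):
        self.shared = 0              # what a cheap read sees
        self.local = [0] * cpus      # per-CPU deltas not yet folded in
        self.batch = batch

    def add(self, cpu, amount=1):
        self.local[cpu] += amount
        if abs(self.local[cpu]) >= self.batch:
            self.shared += self.local[cpu]   # fold the delta into shared
            self.local[cpu] = 0

    def read_approx(self):
        return max(self.shared, 0)

    def read_exact(self):
        return max(self.shared + sum(self.local), 0)


def too_many_orphans(counter, max_orphans):
    """Approximate check first; exact check only if that one trips."""
    if counter.read_approx() <= max_orphans:
        return False
    return counter.read_exact() > max_orphans


few = ApproxCounter(cpus=2, batch=32)
for _ in range(11):
    few.add(cpu=0)

many = ApproxCounter(cpus=2, batch=32)
for _ in range(64):
    many.add(cpu=0)

print(few.read_exact(), too_many_orphans(few, 10))     # 11 False
print(many.read_exact(), too_many_orphans(many, 10))   # 64 True
```

In this model, 11 orphans stay below the batch size, so the
approximate read still returns 0 and the limit of 10 is never noticed,
while 64 orphans are enough to push the approximate value over the
limit.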
1083
Continuing the previous test, we wait for several minutes. Because the
iptables rule on the server blocks the traffic, the server doesn't
receive the FIN, and all the client's orphan sockets finally time out
in the FIN_WAIT_1 state. After a few minutes, we find 10 timeouts on
the client::
1089
1090  nstatuser@nstat-a:~$ nstat | grep -i abort
1091  TcpExtTCPAbortOnTimeout         10                 0.0
1092
1093TcpExtTCPAbortOnLinger
1094----------------------
1095The server side code::
1096
1097  nstatuser@nstat-b:~$ cat server_linger.py
1098  import socket
1099  import time
1100
1101  port = 9000
1102
1103  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
1104  s.bind(('0.0.0.0', port))
1105  s.listen(1)
1106  sock, addr = s.accept()
1107  while True:
1108      time.sleep(9999999)
1109
1110The client side code::
1111
1112  nstatuser@nstat-a:~$ cat client_linger.py
1113  import socket
1114  import struct
1115
1116  server = 'nstat-b' # server address
1117  port = 9000
1118
1119  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
1120  s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 10))
1121  s.setsockopt(socket.SOL_TCP, socket.TCP_LINGER2, struct.pack('i', -1))
1122  s.connect((server, port))
1123  s.close()
1124
1125Run server_linger.py on server::
1126
1127  nstatuser@nstat-b:~$ python3 server_linger.py
1128
1129Run client_linger.py on client::
1130
1131  nstatuser@nstat-a:~$ python3 client_linger.py
1132
After running client_linger.py, check the output of nstat::
1134
1135  nstatuser@nstat-a:~$ nstat | grep -i abort
1136  TcpExtTCPAbortOnLinger          1                  0.0
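
The two socket options in client_linger.py are passed as packed C
structures. The sketch below assumes a Linux host where Python exposes
``socket.TCP_LINGER2``. It sets both options on an unconnected socket
and reads SO_LINGER back to show how the packed bytes map to the
``l_onoff`` and ``l_linger`` fields of ``struct linger``.

```python
import socket
import struct

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# struct linger { int l_onoff; int l_linger; }: enable linger, 10 seconds.
s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 10))
# The test above relies on a negative TCP_LINGER2 value.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_LINGER2, struct.pack("i", -1))

raw = s.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.calcsize("ii"))
l_onoff, l_linger = struct.unpack("ii", raw)
print(l_onoff, l_linger)
s.close()
```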
1137
1138TcpExtTCPRcvCoalesce
1139--------------------
On the server, we run a program which listens on TCP port 9000, but
doesn't read any data::
1142
1143  import socket
1144  import time
1145  port = 9000
1146  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
1147  s.bind(('0.0.0.0', port))
1148  s.listen(1)
1149  sock, addr = s.accept()
1150  while True:
1151      time.sleep(9999999)
1152
1153Save the above code as server_coalesce.py, and run::
1154
1155  python3 server_coalesce.py
1156
1157On the client, save below code as client_coalesce.py::
1158
1159  import socket
1160  server = 'nstat-b'
1161  port = 9000
1162  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
1163  s.connect((server, port))
1164
1165Run::
1166
1167  nstatuser@nstat-a:~$ python3 -i client_coalesce.py
1168
We use '-i' to enter the interactive mode, then send a packet::
1170
1171  >>> s.send(b'foo')
1172  3
1173
1174Send a packet again::
1175
1176  >>> s.send(b'bar')
1177  3
1178
1179On the server, run nstat::
1180
1181  ubuntu@nstat-b:~$ nstat
1182  #kernel
1183  IpInReceives                    2                  0.0
1184  IpInDelivers                    2                  0.0
1185  IpOutRequests                   2                  0.0
1186  TcpInSegs                       2                  0.0
1187  TcpOutSegs                      2                  0.0
1188  TcpExtTCPRcvCoalesce            1                  0.0
1189  IpExtInOctets                   110                0.0
1190  IpExtOutOctets                  104                0.0
1191  IpExtInNoECTPkts                2                  0.0
1192
The client sent two packets, and the server didn't read any data. When
the second packet arrived at the server, the first packet was still in
the receiving queue. So the TCP layer merged the two packets, and we
find that TcpExtTCPRcvCoalesce increased by 1.
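
The octet counters in this output can be reproduced with simple
arithmetic. The sketch below assumes an IPv4 header without options
and a TCP header that carries only the timestamp option (10 bytes
padded to 12), which matches the 52-byte pure ACKs seen in the earlier
examples. The helper name is made up for illustration.

```python
IP_HEADER_LEN = 20          # IPv4 header without options
TCP_HEADER_LEN = 20         # TCP header without options
TCP_TIMESTAMP_OPT_LEN = 12  # timestamp option (10 bytes) padded with 2 NOPs


def ip_octets(payload_len, tcp_options_len=TCP_TIMESTAMP_OPT_LEN):
    """Bytes one TCP segment contributes to IpExtInOctets/IpExtOutOctets."""
    return IP_HEADER_LEN + TCP_HEADER_LEN + tcp_options_len + payload_len


data_in = ip_octets(len(b"foo")) + ip_octets(len(b"bar"))   # two data segments
acks_out = 2 * ip_octets(0)                                 # two pure ACKs
print(data_in, acks_out)   # 110 104
```

The same helper gives 58 for the 6-byte "hello" line (5 characters
plus the newline) sent in the TCP normal traffic example.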
1197
1198TcpExtListenOverflows and TcpExtListenDrops
1199-------------------------------------------
1200On server, run the nc command, listen on port 9000::
1201
1202  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
1203  Listening on [0.0.0.0] (family 0, port 9000)
1204
1205On client, run 3 nc commands in different terminals::
1206
1207  nstatuser@nstat-a:~$ nc -v nstat-b 9000
1208  Connection to nstat-b 9000 port [tcp/*] succeeded!
1209
The nc command only accepts 1 connection, and the accept queue length
is 1. In the current Linux implementation, setting the queue length to
n means the actual queue length is n+1. Now we have created 3
connections: 1 is accepted by nc and 2 are in the accept queue, so the
accept queue is full.
1214
1215Before running the 4th nc, we clean the nstat history on the server::
1216
1217  nstatuser@nstat-b:~$ nstat -n
1218
1219Run the 4th nc on the client::
1220
1221  nstatuser@nstat-a:~$ nc -v nstat-b 9000
1222
If the nc server is running on kernel 4.10 or a higher version, you
won't see the "Connection to ... succeeded!" string, because the kernel
drops the SYN if the accept queue is full. If the nc server is running
on an older kernel, you would see that the connection succeeded,
because the kernel would complete the 3-way handshake and keep the
socket in the half-open queue. This test was done on kernel 4.15. Below
is the nstat output on the server::
1230
1231  nstatuser@nstat-b:~$ nstat
1232  #kernel
1233  IpInReceives                    4                  0.0
1234  IpInDelivers                    4                  0.0
1235  TcpInSegs                       4                  0.0
1236  TcpExtListenOverflows           4                  0.0
1237  TcpExtListenDrops               4                  0.0
1238  IpExtInOctets                   240                0.0
1239  IpExtInNoECTPkts                4                  0.0
1240
Both TcpExtListenOverflows and TcpExtListenDrops were 4. If the time
between the 4th nc and the nstat was longer, the values of
TcpExtListenOverflows and TcpExtListenDrops would be larger, because
the SYN of the 4th nc was dropped and the client kept retrying.
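
The queue behaviour in this test can be sketched as a toy model. It
collapses the handshake into a single step and only models the "n
means n+1" rule described above, and the class and method names are
hypothetical. The four SYNs stand for the first SYN of the 4th nc plus
three retries, which is an assumption chosen to match the nstat output.

```python
class Listener:
    """Toy accept queue; names and structure are illustrative only."""

    def __init__(self, backlog):
        self.backlog = backlog
        self.accept_queue = []
        self.listen_overflows = 0
        self.listen_drops = 0

    def queue_is_full(self):
        # Full only once the length exceeds backlog, so backlog + 1 fit.
        return len(self.accept_queue) > self.backlog

    def on_syn(self, conn):
        """Return True if the connection is queued, False if dropped."""
        if self.queue_is_full():
            self.listen_overflows += 1
            self.listen_drops += 1
            return False
        self.accept_queue.append(conn)
        return True

    def accept(self):
        return self.accept_queue.pop(0)


server = Listener(backlog=1)
server.on_syn("nc-1")
server.accept()                 # nc takes the first connection, then stops
server.on_syn("nc-2")
server.on_syn("nc-3")           # queue now holds backlog + 1 = 2 connections
results = [server.on_syn("nc-4") for _ in range(4)]   # first SYN plus retries
print(results, server.listen_overflows, server.listen_drops)
```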
1245
1246IpInAddrErrors, IpExtInNoRoutes and IpOutNoRoutes
1247-------------------------------------------------
Server A IP address: 192.168.122.250

Server B IP address: 192.168.122.251

Prepare on server A, add a route to server B::
1251
1252  $ sudo ip route add 8.8.8.8/32 via 192.168.122.251
1253
1254Prepare on server B, disable send_redirects for all interfaces::
1255
1256  $ sudo sysctl -w net.ipv4.conf.all.send_redirects=0
1257  $ sudo sysctl -w net.ipv4.conf.ens3.send_redirects=0
1258  $ sudo sysctl -w net.ipv4.conf.lo.send_redirects=0
1259  $ sudo sysctl -w net.ipv4.conf.default.send_redirects=0
1260
We want to let server A send a packet to 8.8.8.8 and route the packet
to server B. When server B receives such a packet, it might send an
ICMP Redirect message to server A; setting send_redirects to 0
disables this behavior.
1265
First, generate IpInAddrErrors. On server B, we disable IP forwarding::
1267
1268  $ sudo sysctl -w net.ipv4.conf.all.forwarding=0
1269
1270On server A, we send packets to 8.8.8.8::
1271
1272  $ nc -v 8.8.8.8 53
1273
1274On server B, we check the output of nstat::
1275
1276  $ nstat
1277  #kernel
1278  IpInReceives                    3                  0.0
1279  IpInAddrErrors                  3                  0.0
1280  IpExtInOctets                   180                0.0
1281  IpExtInNoECTPkts                3                  0.0
1282
As we have let server A route 8.8.8.8 to server B, and we disabled IP
forwarding on server B, server A sent packets to server B, then server
B dropped the packets and increased IpInAddrErrors. As the TCP stack on
server A re-sends the SYN packet if it doesn't receive a SYN+ACK, we
find multiple IpInAddrErrors.
1288
1289Second, generate IpExtInNoRoutes. On server B, we enable IP
1290forwarding::
1291
1292  $ sudo sysctl -w net.ipv4.conf.all.forwarding=1
1293
1294Check the route table of server B and remove the default route::
1295
1296  $ ip route show
1297  default via 192.168.122.1 dev ens3 proto static
1298  192.168.122.0/24 dev ens3 proto kernel scope link src 192.168.122.251
1299  $ sudo ip route delete default via 192.168.122.1 dev ens3 proto static
1300
1301On server A, we contact 8.8.8.8 again::
1302
1303  $ nc -v 8.8.8.8 53
1304  nc: connect to 8.8.8.8 port 53 (tcp) failed: Network is unreachable
1305
1306On server B, run nstat::
1307
1308  $ nstat
1309  #kernel
1310  IpInReceives                    1                  0.0
1311  IpOutRequests                   1                  0.0
1312  IcmpOutMsgs                     1                  0.0
1313  IcmpOutDestUnreachs             1                  0.0
1314  IcmpMsgOutType3                 1                  0.0
1315  IpExtInNoRoutes                 1                  0.0
1316  IpExtInOctets                   60                 0.0
1317  IpExtOutOctets                  88                 0.0
1318  IpExtInNoECTPkts                1                  0.0
1319
We enabled IP forwarding on server B, so when server B received a
packet whose destination IP address was 8.8.8.8, server B tried to
forward this packet. We had deleted the default route, so there was no
route for 8.8.8.8; server B increased IpExtInNoRoutes and sent the
"ICMP Destination Unreachable" message to server A.
1325
1326Third, generate IpOutNoRoutes. Run ping command on server B::
1327
1328  $ ping -c 1 8.8.8.8
1329  connect: Network is unreachable
1330
1331Run nstat on server B::
1332
1333  $ nstat
1334  #kernel
1335  IpOutNoRoutes                   1                  0.0
1336
1337We have deleted the default route on server B. Server B couldn't find
1338a route for the 8.8.8.8 IP address, so server B increased
1339IpOutNoRoutes.
1340
1341TcpExtTCPACKSkippedSynRecv
1342--------------------------
In this test, we send 3 identical SYN packets from client to server.
The first SYN lets the server create a socket, set it to Syn-Recv
status, and reply with a SYN/ACK. The second SYN lets the server reply
with the SYN/ACK again and record the reply time (the duplicate ACK
reply time). The third SYN lets the server check the previous
duplicate ACK reply time and decide to skip the duplicate ACK, then
increase the TcpExtTCPACKSkippedSynRecv counter.
1350
1351Run tcpdump to capture a SYN packet::
1352
1353  nstatuser@nstat-a:~$ sudo tcpdump -c 1 -w /tmp/syn.pcap port 9000
1354  tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
1355
1356Open another terminal, run nc command::
1357
1358  nstatuser@nstat-a:~$ nc nstat-b 9000
1359
As nstat-b didn't listen on port 9000, it should reply with a RST, and
the nc command exited immediately. That was enough for the tcpdump
command to capture a SYN packet. A Linux server might use hardware
offload for the TCP checksum, so the checksum in /tmp/syn.pcap might
not be correct. We call tcprewrite to fix it::
1365
1366  nstatuser@nstat-a:~$ tcprewrite --infile=/tmp/syn.pcap --outfile=/tmp/syn_fixcsum.pcap --fixcsum
1367
1368On nstat-b, we run nc to listen on port 9000::
1369
1370  nstatuser@nstat-b:~$ nc -lkv 9000
1371  Listening on [0.0.0.0] (family 0, port 9000)
1372
On nstat-a, we block the packets from port 9000; otherwise nstat-a would
send a RST to nstat-b::
1375
1376  nstatuser@nstat-a:~$ sudo iptables -A INPUT -p tcp --sport 9000 -j DROP
1377
Send 3 SYNs repeatedly to nstat-b::
1379
1380  nstatuser@nstat-a:~$ for i in {1..3}; do sudo tcpreplay -i ens3 /tmp/syn_fixcsum.pcap; done
1381
Check the SNMP counter on nstat-b::
1383
1384  nstatuser@nstat-b:~$ nstat | grep -i skip
1385  TcpExtTCPACKSkippedSynRecv      1                  0.0
1386
1387As we expected, TcpExtTCPACKSkippedSynRecv is 1.
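
The reason why three SYNs produce exactly one skipped ACK can be seen
in a toy rate limiter. This is a sketch of the mechanism described at
the start of this section, not the kernel code. The class name is
hypothetical, and the 0.5 second window is an assumption chosen to
match the commonly documented default of the tcp_invalid_ratelimit
sysctl.

```python
class OowAckLimiter:
    """Toy rate limiter for duplicate ACK replies on one socket."""

    def __init__(self, ratelimit):
        self.ratelimit = ratelimit      # seconds between allowed replies
        self.last_reply = None          # time of the last duplicate reply
        self.skipped = 0                # stands in for TcpExtTCPACKSkipped*

    def allow(self, now):
        """Return True to send the reply, False to skip it."""
        if (self.last_reply is not None
                and 0 <= now - self.last_reply < self.ratelimit):
            self.skipped += 1
            return False
        self.last_reply = now
        return True


limiter = OowAckLimiter(ratelimit=0.5)
socket_exists = False
replies = []
for now in (0.0, 0.1, 0.2):             # arrival times of the three SYNs
    if not socket_exists:
        socket_exists = True            # first SYN: create the Syn-Recv socket
        replies.append("SYN/ACK")
    elif limiter.allow(now):
        replies.append("SYN/ACK (duplicate)")
    else:
        replies.append("skipped")
print(replies, limiter.skipped)
```

The second SYN only records the reply time, so the third SYN is the
first one that can be skipped.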
1388
1389TcpExtTCPACKSkippedPAWS
1390-----------------------
1391To trigger PAWS, we could send an old SYN.
1392
1393On nstat-b, let nc listen on port 9000::
1394
1395  nstatuser@nstat-b:~$ nc -lkv 9000
1396  Listening on [0.0.0.0] (family 0, port 9000)
1397
1398On nstat-a, run tcpdump to capture a SYN::
1399
1400  nstatuser@nstat-a:~$ sudo tcpdump -w /tmp/paws_pre.pcap -c 1 port 9000
1401  tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
1402
1403On nstat-a, run nc as a client to connect nstat-b::
1404
1405  nstatuser@nstat-a:~$ nc -v nstat-b 9000
1406  Connection to nstat-b 9000 port [tcp/*] succeeded!
1407
Now tcpdump has captured the SYN and exited. We should fix the
checksum::
1410
1411  nstatuser@nstat-a:~$ tcprewrite --infile /tmp/paws_pre.pcap --outfile /tmp/paws.pcap --fixcsum
1412
1413Send the SYN packet twice::
1414
1415  nstatuser@nstat-a:~$ for i in {1..2}; do sudo tcpreplay -i ens3 /tmp/paws.pcap; done
1416
1417On nstat-b, check the snmp counter::
1418
1419  nstatuser@nstat-b:~$ nstat | grep -i skip
1420  TcpExtTCPACKSkippedPAWS         1                  0.0
1421
We sent two SYNs via tcpreplay, and both of them failed the PAWS
check. nstat-b replied with an ACK for the first SYN, skipped the ACK
for the second SYN, and updated TcpExtTCPACKSkippedPAWS.
1425
1426TcpExtTCPACKSkippedSeq
1427----------------------
To trigger TcpExtTCPACKSkippedSeq, we send packets which have a valid
timestamp (to pass the PAWS check) but a sequence number that is out of
window. The Linux TCP stack avoids skipping if the packet has data, so
we need a pure ACK packet. To generate such a packet, we can create two
sockets: one on port 9000, another on port 9001. Then we capture an ACK
on port 9001 and change the source/destination port numbers to match
the port 9000 socket. Then we can trigger TcpExtTCPACKSkippedSeq via
this packet.
1436
1437On nstat-b, open two terminals, run two nc commands to listen on both
1438port 9000 and port 9001::
1439
1440  nstatuser@nstat-b:~$ nc -lkv 9000
1441  Listening on [0.0.0.0] (family 0, port 9000)
1442
1443  nstatuser@nstat-b:~$ nc -lkv 9001
1444  Listening on [0.0.0.0] (family 0, port 9001)
1445
1446On nstat-a, run two nc clients::
1447
1448  nstatuser@nstat-a:~$ nc -v nstat-b 9000
1449  Connection to nstat-b 9000 port [tcp/*] succeeded!
1450
1451  nstatuser@nstat-a:~$ nc -v nstat-b 9001
1452  Connection to nstat-b 9001 port [tcp/*] succeeded!
1453
1454On nstat-a, run tcpdump to capture an ACK::
1455
1456  nstatuser@nstat-a:~$ sudo tcpdump -w /tmp/seq_pre.pcap -c 1 dst port 9001
1457  tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
1458
On nstat-b, send a packet via the port 9001 socket. E.g. we sent the
string 'foo' in our example::
1461
1462  nstatuser@nstat-b:~$ nc -lkv 9001
1463  Listening on [0.0.0.0] (family 0, port 9001)
1464  Connection from nstat-a 42132 received!
1465  foo
1466
On nstat-a, tcpdump should have captured the ACK. We should check
the source port numbers of the two nc clients::
1469
1470  nstatuser@nstat-a:~$ ss -ta '( dport = :9000 || dport = :9001 )' | tee
1471  State  Recv-Q   Send-Q         Local Address:Port           Peer Address:Port
1472  ESTAB  0        0            192.168.122.250:50208       192.168.122.251:9000
1473  ESTAB  0        0            192.168.122.250:42132       192.168.122.251:9001
1474
Run tcprewrite, change port 9001 to port 9000, and change port 42132 to
port 50208::
1477
1478  nstatuser@nstat-a:~$ tcprewrite --infile /tmp/seq_pre.pcap --outfile /tmp/seq.pcap -r 9001:9000 -r 42132:50208 --fixcsum
1479
1480Now the /tmp/seq.pcap is the packet we need. Send it to nstat-b::
1481
1482  nstatuser@nstat-a:~$ for i in {1..2}; do sudo tcpreplay -i ens3 /tmp/seq.pcap; done
1483
1484Check TcpExtTCPACKSkippedSeq on nstat-b::
1485
1486  nstatuser@nstat-b:~$ nstat | grep -i skip
1487  TcpExtTCPACKSkippedSeq          1                  0.0
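
An ACK captured from the port 9001 connection carries sequence numbers
that are unrelated to the port 9000 connection, which is why it falls
out of window. A simplified sketch of such a window test with 32-bit
wraparound is shown below. The function names and the sample numbers
are illustrative, and the real kernel check uses more state than the
two values modelled here.

```python
MOD = 1 << 32


def before(seq1, seq2):
    """True if seq1 is earlier than seq2 in 32-bit sequence space."""
    return ((seq1 - seq2) % MOD) >= (1 << 31)


def after(seq1, seq2):
    """True if seq1 is later than seq2 in 32-bit sequence space."""
    return before(seq2, seq1)


def in_window(seq, end_seq, rcv_nxt, rcv_wnd):
    """Simplified receive-window test for a segment [seq, end_seq]."""
    right_edge = (rcv_nxt + rcv_wnd) % MOD
    return not before(end_seq, rcv_nxt) and not after(seq, right_edge)


# Hypothetical state of the port 9000 socket on nstat-b.
RCV_NXT, RCV_WND = 1000, 29200

expected_ack = in_window(1000, 1000, RCV_NXT, RCV_WND)
replayed_ack = in_window(3_000_000_000, 3_000_000_000, RCV_NXT, RCV_WND)
wrapped = in_window(5, 5, MOD - 10, 100)   # window spans the 2**32 wrap
print(expected_ack, replayed_ack, wrapped)   # True False True
```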
1488