============
SNMP counter
============
4
5This document explains the meaning of SNMP counters.
6
General IPv4 counters
=====================
All layer 4 packets and ICMP packets change these counters. Layer 2
packets (such as STP) and ARP packets do not change them.
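
The counter names used in this document can be read programmatically. The
sketch below assumes the counters are exposed in the two-line layout of
/proc/net/snmp and /proc/net/netstat (a header line with field names, then
a line with the values, both starting with the same prefix). The function
name and the sample values are made up for illustration.

```python
def parse_snmp_text(text):
    """Parse /proc/net/snmp-style text into a {name: value} dict.

    Each protocol occupies two lines: a header line with the field names
    and a line with the matching values, both starting with "<Prefix>:".
    The prefix and the field name are joined, e.g. "Ip" + "InReceives".
    """
    counters = {}
    lines = [line.split() for line in text.splitlines() if line.strip()]
    # Walk the lines in (header, values) pairs.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix = header[0].rstrip(":")
        for name, value in zip(header[1:], values[1:]):
            counters[prefix + name] = int(value)
    return counters


# Shortened sample in the assumed layout; the values are invented.
SAMPLE = """\
Ip: Forwarding DefaultTTL InReceives InHdrErrors InDelivers OutRequests
Ip: 2 64 1200 0 1180 950
Icmp: InMsgs InErrors InCsumErrors OutMsgs
Icmp: 12 1 1 10
"""

counters = parse_snmp_text(SAMPLE)
print(counters["IpInReceives"], counters["IcmpInMsgs"])
```

On a live system the same function could be fed the content of
/proc/net/snmp, e.g. ``parse_snmp_text(open("/proc/net/snmp").read())``.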
12
13* IpInReceives
14Defined in `RFC1213 ipInReceives`_
15
16.. _RFC1213 ipInReceives: https://tools.ietf.org/html/rfc1213#page-26
17
The number of packets received by the IP layer. It is increased at the
beginning of the ip_rcv function and is always updated together with
IpExtInOctets. It is increased even if the packet is dropped later
(e.g. because the IP header is invalid or the checksum is wrong). It
counts the aggregated segments after GRO/LRO.
24
25* IpInDelivers
26Defined in `RFC1213 ipInDelivers`_
27
28.. _RFC1213 ipInDelivers: https://tools.ietf.org/html/rfc1213#page-28
29
The number of packets delivered to the upper layer protocols, e.g. TCP,
UDP and ICMP. If no one listens on a raw socket, only packets of
kernel-supported protocols are delivered. If someone listens on a raw
socket, all valid IP packets are delivered.
34
35* IpOutRequests
36Defined in `RFC1213 ipOutRequests`_
37
38.. _RFC1213 ipOutRequests: https://tools.ietf.org/html/rfc1213#page-28
39
The number of packets sent via the IP layer, for both unicast and
multicast packets. It is always updated together with IpExtOutOctets.
43
44* IpExtInOctets and IpExtOutOctets
They are Linux kernel extensions and have no RFC definitions. Note that
RFC1213 does define ifInOctets and ifOutOctets, but they are different
things. ifInOctets and ifOutOctets include the MAC layer header size,
whereas IpExtInOctets and IpExtOutOctets include only the IP layer
header and the IP layer data.
50
51* IpExtInNoECTPkts, IpExtInECT1Pkts, IpExtInECT0Pkts, IpExtInCEPkts
They count the four kinds of ECN IP packets. Please refer to
`Explicit Congestion Notification`_ for more details.
54
55.. _Explicit Congestion Notification: https://tools.ietf.org/html/rfc3168#page-6
56
These 4 counters count how many packets are received per ECN
status. They count the real frame number regardless of LRO/GRO. So
for the same packet, you might find that IpInReceives counts 1, but
IpExtInNoECTPkts counts 2 or more.
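
As an illustration, the ECN status of a packet is carried in the two
least significant bits of the IPv4 TOS byte (RFC 3168). The sketch below
maps those bits to the counter that would be expected to increase. The
function and table names are hypothetical, not kernel symbols.

```python
# ECN codepoints from RFC 3168 mapped to the counters described above.
ECN_COUNTERS = {
    0b00: "IpExtInNoECTPkts",  # Not-ECT: sender does not support ECN
    0b01: "IpExtInECT1Pkts",   # ECT(1)
    0b10: "IpExtInECT0Pkts",   # ECT(0)
    0b11: "IpExtInCEPkts",     # CE: congestion experienced
}


def ecn_counter_for_tos(tos):
    """Return the counter name matching the ECN bits of a TOS byte."""
    return ECN_COUNTERS[tos & 0x03]


print(ecn_counter_for_tos(0x02))
```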
61
62* IpInHdrErrors
Defined in `RFC1213 ipInHdrErrors`_. It indicates that the packet is
dropped due to an IP header error. It might happen in both the IP input
and IP forward paths.
66
67.. _RFC1213 ipInHdrErrors: https://tools.ietf.org/html/rfc1213#page-27
68
69* IpInAddrErrors
Defined in `RFC1213 ipInAddrErrors`_. It is increased in two
scenarios: (1) the IP address is invalid; (2) the destination IP
address is not a local address and IP forwarding is not enabled.
73
74.. _RFC1213 ipInAddrErrors: https://tools.ietf.org/html/rfc1213#page-27
75
76* IpExtInNoRoutes
This counter means that the IP stack receives a packet, can't find a
route for it in the route table, and drops it. It might happen when IP
forwarding is enabled, the destination IP address is not a local
address, and there is no route for the destination IP address.
82
83* IpInUnknownProtos
Defined in `RFC1213 ipInUnknownProtos`_. It is increased if the
layer 4 protocol is not supported by the kernel. If an application is
using a raw socket, the kernel always delivers the packet to the raw
socket and this counter isn't increased.
88
89.. _RFC1213 ipInUnknownProtos: https://tools.ietf.org/html/rfc1213#page-27
90
91* IpExtInTruncatedPkts
For an IPv4 packet, it means that the actual data size is smaller than
the "Total Length" field in the IPv4 header.
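
The check can be sketched as follows, assuming a raw IPv4 packet held in
a bytes object. The "Total Length" field is the 16-bit big-endian value at
byte offset 2 of the IPv4 header. The function name is hypothetical and
the sketch covers only this one comparison, not the other header checks
the kernel performs.

```python
import struct


def is_truncated_ipv4(packet):
    """Return True if the packet is shorter than its Total Length field."""
    if len(packet) < 20:
        # Too short to hold an IPv4 header at all; this sketch does not
        # classify that case.
        raise ValueError("packet shorter than the minimum IPv4 header")
    (total_length,) = struct.unpack_from("!H", packet, 2)
    return len(packet) < total_length


# A 20-byte header that announces 40 bytes in total (other fields zeroed).
header = struct.pack("!BBH", 0x45, 0, 40) + bytes(16)
print(is_truncated_ipv4(header + bytes(10)))  # only 30 bytes present
print(is_truncated_ipv4(header + bytes(20)))  # all 40 bytes present
```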
94
95* IpInDiscards
Defined in `RFC1213 ipInDiscards`_. It indicates that the packet is
dropped in the IP receiving path due to kernel internal reasons (e.g.
not enough memory).
99
100.. _RFC1213 ipInDiscards: https://tools.ietf.org/html/rfc1213#page-28
101
102* IpOutDiscards
Defined in `RFC1213 ipOutDiscards`_. It indicates that the packet is
dropped in the IP sending path due to kernel internal reasons.
105
106.. _RFC1213 ipOutDiscards: https://tools.ietf.org/html/rfc1213#page-28
107
108* IpOutNoRoutes
Defined in `RFC1213 ipOutNoRoutes`_. It indicates that the packet is
dropped in the IP sending path because no route is found for it.
111
112.. _RFC1213 ipOutNoRoutes: https://tools.ietf.org/html/rfc1213#page-29
113
ICMP counters
=============
116* IcmpInMsgs and IcmpOutMsgs
117Defined by `RFC1213 icmpInMsgs`_ and `RFC1213 icmpOutMsgs`_
118
119.. _RFC1213 icmpInMsgs: https://tools.ietf.org/html/rfc1213#page-41
120.. _RFC1213 icmpOutMsgs: https://tools.ietf.org/html/rfc1213#page-43
121
As mentioned in RFC1213, these two counters include errors: they are
increased even if the ICMP packet has an invalid type. The ICMP output
path checks the header of a raw socket, so IcmpOutMsgs is still updated
if the IP header is constructed by a userspace program.
127
128* ICMP named types
| These counters cover most of the common ICMP types. They are:
130| IcmpInDestUnreachs: `RFC1213 icmpInDestUnreachs`_
131| IcmpInTimeExcds: `RFC1213 icmpInTimeExcds`_
132| IcmpInParmProbs: `RFC1213 icmpInParmProbs`_
133| IcmpInSrcQuenchs: `RFC1213 icmpInSrcQuenchs`_
134| IcmpInRedirects: `RFC1213 icmpInRedirects`_
135| IcmpInEchos: `RFC1213 icmpInEchos`_
136| IcmpInEchoReps: `RFC1213 icmpInEchoReps`_
137| IcmpInTimestamps: `RFC1213 icmpInTimestamps`_
138| IcmpInTimestampReps: `RFC1213 icmpInTimestampReps`_
139| IcmpInAddrMasks: `RFC1213 icmpInAddrMasks`_
140| IcmpInAddrMaskReps: `RFC1213 icmpInAddrMaskReps`_
141| IcmpOutDestUnreachs: `RFC1213 icmpOutDestUnreachs`_
142| IcmpOutTimeExcds: `RFC1213 icmpOutTimeExcds`_
143| IcmpOutParmProbs: `RFC1213 icmpOutParmProbs`_
144| IcmpOutSrcQuenchs: `RFC1213 icmpOutSrcQuenchs`_
145| IcmpOutRedirects: `RFC1213 icmpOutRedirects`_
146| IcmpOutEchos: `RFC1213 icmpOutEchos`_
147| IcmpOutEchoReps: `RFC1213 icmpOutEchoReps`_
148| IcmpOutTimestamps: `RFC1213 icmpOutTimestamps`_
149| IcmpOutTimestampReps: `RFC1213 icmpOutTimestampReps`_
150| IcmpOutAddrMasks: `RFC1213 icmpOutAddrMasks`_
151| IcmpOutAddrMaskReps: `RFC1213 icmpOutAddrMaskReps`_
152
153.. _RFC1213 icmpInDestUnreachs: https://tools.ietf.org/html/rfc1213#page-41
154.. _RFC1213 icmpInTimeExcds: https://tools.ietf.org/html/rfc1213#page-41
155.. _RFC1213 icmpInParmProbs: https://tools.ietf.org/html/rfc1213#page-42
156.. _RFC1213 icmpInSrcQuenchs: https://tools.ietf.org/html/rfc1213#page-42
157.. _RFC1213 icmpInRedirects: https://tools.ietf.org/html/rfc1213#page-42
158.. _RFC1213 icmpInEchos: https://tools.ietf.org/html/rfc1213#page-42
159.. _RFC1213 icmpInEchoReps: https://tools.ietf.org/html/rfc1213#page-42
160.. _RFC1213 icmpInTimestamps: https://tools.ietf.org/html/rfc1213#page-42
161.. _RFC1213 icmpInTimestampReps: https://tools.ietf.org/html/rfc1213#page-43
162.. _RFC1213 icmpInAddrMasks: https://tools.ietf.org/html/rfc1213#page-43
163.. _RFC1213 icmpInAddrMaskReps: https://tools.ietf.org/html/rfc1213#page-43
164
165.. _RFC1213 icmpOutDestUnreachs: https://tools.ietf.org/html/rfc1213#page-44
166.. _RFC1213 icmpOutTimeExcds: https://tools.ietf.org/html/rfc1213#page-44
167.. _RFC1213 icmpOutParmProbs: https://tools.ietf.org/html/rfc1213#page-44
168.. _RFC1213 icmpOutSrcQuenchs: https://tools.ietf.org/html/rfc1213#page-44
169.. _RFC1213 icmpOutRedirects: https://tools.ietf.org/html/rfc1213#page-44
170.. _RFC1213 icmpOutEchos: https://tools.ietf.org/html/rfc1213#page-45
171.. _RFC1213 icmpOutEchoReps: https://tools.ietf.org/html/rfc1213#page-45
172.. _RFC1213 icmpOutTimestamps: https://tools.ietf.org/html/rfc1213#page-45
173.. _RFC1213 icmpOutTimestampReps: https://tools.ietf.org/html/rfc1213#page-45
174.. _RFC1213 icmpOutAddrMasks: https://tools.ietf.org/html/rfc1213#page-45
175.. _RFC1213 icmpOutAddrMaskReps: https://tools.ietf.org/html/rfc1213#page-46
176
Every ICMP type has two counters: 'In' and 'Out'. E.g., for the ICMP
Echo packet, they are IcmpInEchos and IcmpOutEchos. Their meanings are
straightforward. The 'In' counter means the kernel receives such a
packet and the 'Out' counter means the kernel sends such a packet.
181
182* ICMP numeric types
They are IcmpMsgInType[N] and IcmpMsgOutType[N], where [N] indicates the
ICMP type number. These counters track all kinds of ICMP packets. The
ICMP type number definitions can be found in the `ICMP parameters`_
document.
187
188.. _ICMP parameters: https://www.iana.org/assignments/icmp-parameters/icmp-parameters.xhtml
189
For example, if the Linux kernel sends an ICMP Echo packet,
IcmpMsgOutType8 increases by 1. And if the kernel gets an ICMP Echo
Reply packet, IcmpMsgInType0 increases by 1.
193
194* IcmpInCsumErrors
This counter indicates that the checksum of the ICMP packet is
wrong. The kernel verifies the checksum after updating IcmpInMsgs and
before updating IcmpMsgInType[N]. If a packet has a bad checksum,
IcmpInMsgs is updated but none of the IcmpMsgInType[N] counters is
updated.
199
200* IcmpInErrors and IcmpOutErrors
201Defined by `RFC1213 icmpInErrors`_ and `RFC1213 icmpOutErrors`_
202
203.. _RFC1213 icmpInErrors: https://tools.ietf.org/html/rfc1213#page-41
204.. _RFC1213 icmpOutErrors: https://tools.ietf.org/html/rfc1213#page-43
205
When an error occurs in the ICMP packet handler path, these two
counters are updated. The receiving packet path uses IcmpInErrors
and the sending packet path uses IcmpOutErrors. When IcmpInCsumErrors
is increased, IcmpInErrors is always increased too.
210
relationship of the ICMP counters
---------------------------------
The sum of IcmpMsgOutType[N] is always equal to IcmpOutMsgs, as they
are updated at the same time. The sum of IcmpMsgInType[N] plus
IcmpInErrors should be equal to or larger than IcmpInMsgs. When the
kernel receives an ICMP packet, it follows the logic below:

1. increase IcmpInMsgs
2. if there is any error, update IcmpInErrors and finish the process
3. update IcmpMsgInType[N]
4. handle the packet depending on the type; if there is any error,
   update IcmpInErrors and finish the process

So if all errors occur in step (2), IcmpInMsgs should be equal to the
sum of IcmpMsgInType[N] plus IcmpInErrors. If all errors occur in
step (4), IcmpInMsgs should be equal to the sum of
IcmpMsgInType[N]. If errors occur in both step (2) and step (4),
IcmpInMsgs should be less than the sum of IcmpMsgInType[N] plus
IcmpInErrors.
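
The four steps can be modelled with a small simulation. This is only a
sketch of the counter bookkeeping, not kernel code. The function names
and the two error flags are hypothetical. It assumes that step (3) updates
the per-type 'In' counter, since this is the receive path.

```python
from collections import Counter


def receive_icmp(counters, icmp_type, early_error=False, late_error=False):
    """Update the counters for one received ICMP packet."""
    counters["IcmpInMsgs"] += 1                    # step 1
    if early_error:                                # step 2, e.g. bad checksum
        counters["IcmpInErrors"] += 1
        return
    counters["IcmpMsgInType%d" % icmp_type] += 1   # step 3
    if late_error:                                 # step 4, type handler fails
        counters["IcmpInErrors"] += 1


def in_type_sum(counters):
    """Sum of all IcmpMsgInType[N] counters."""
    return sum(value for name, value in counters.items()
               if name.startswith("IcmpMsgInType"))


counters = Counter()
receive_icmp(counters, 8)                     # a good Echo
receive_icmp(counters, 8, early_error=True)   # error in step (2)
receive_icmp(counters, 0, late_error=True)    # error in step (4)

# Errors occurred in both steps, so IcmpInMsgs is less than the sum.
print(counters["IcmpInMsgs"], in_type_sum(counters) + counters["IcmpInErrors"])
```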
230
General TCP counters
====================
233* TcpInSegs
234Defined in `RFC1213 tcpInSegs`_
235
236.. _RFC1213 tcpInSegs: https://tools.ietf.org/html/rfc1213#page-48
237
The number of packets received by the TCP layer. As mentioned in
RFC1213, it includes the packets received in error, such as packets
with a checksum error or an invalid TCP header. Only one kind of error
isn't included: the layer 2 destination address is not the NIC's layer
2 address. This might happen if the packet is a multicast or broadcast
packet, or the NIC is in promiscuous mode. In these situations, the
packets are delivered to the TCP layer, but the TCP layer discards
them before increasing TcpInSegs. The TcpInSegs counter isn't aware of
GRO, so if two packets are merged by GRO, TcpInSegs only increases
by 1.
248
249* TcpOutSegs
250Defined in `RFC1213 tcpOutSegs`_
251
252.. _RFC1213 tcpOutSegs: https://tools.ietf.org/html/rfc1213#page-48
253
The number of packets sent by the TCP layer. As mentioned in RFC1213,
it excludes the retransmitted packets, but it includes the SYN, ACK
and RST packets. Unlike TcpInSegs, TcpOutSegs is aware of GSO, so if a
packet is split into 2 by GSO, TcpOutSegs increases by 2.
259
260* TcpActiveOpens
261Defined in `RFC1213 tcpActiveOpens`_
262
263.. _RFC1213 tcpActiveOpens: https://tools.ietf.org/html/rfc1213#page-47
264
It means the TCP layer sends a SYN and enters the SYN-SENT
state. Every time TcpActiveOpens increases by 1, TcpOutSegs should also
increase by 1.
268
269* TcpPassiveOpens
270Defined in `RFC1213 tcpPassiveOpens`_
271
272.. _RFC1213 tcpPassiveOpens: https://tools.ietf.org/html/rfc1213#page-47
273
It means the TCP layer receives a SYN, replies with a SYN+ACK, and
enters the SYN-RCVD state.
276
277* TcpExtTCPRcvCoalesce
When packets are received by the TCP layer and have not been read by the
application, the TCP layer tries to merge them. This counter
indicates how many packets are merged in such a situation. If GRO is
enabled, lots of packets are merged by GRO, and these packets are not
counted in TcpExtTCPRcvCoalesce.
283
284* TcpExtTCPAutoCorking
When sending packets, the TCP layer tries to merge small packets into
a bigger one. This counter increases by 1 for every packet merged in
such a situation. Please refer to the LWN article for more details:
https://lwn.net/Articles/576263/
289
290* TcpExtTCPOrigDataSent
This counter is explained by `kernel commit f19c29e3e391`_. The
explanation is pasted below::
293
294  TCPOrigDataSent: number of outgoing packets with original data (excluding
295  retransmission but including data-in-SYN). This counter is different from
296  TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is
297  more useful to track the TCP retransmission rate.
298
299* TCPSynRetrans
This counter is explained by `kernel commit f19c29e3e391`_. The
explanation is pasted below::
302
303  TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
304  retransmissions into SYN, fast-retransmits, timeout retransmits, etc.
305
306* TCPFastOpenActiveFail
This counter is explained by `kernel commit f19c29e3e391`_. The
explanation is pasted below::
309
310  TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed because
311  the remote does not accept it or the attempts timed out.
312
313.. _kernel commit f19c29e3e391: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f19c29e3e391a66a273e9afebaf01917245148cd
314
315* TcpExtListenOverflows and TcpExtListenDrops
When the kernel receives a SYN from a client and the TCP accept queue
is full, the kernel drops the SYN and adds 1 to TcpExtListenOverflows.
At the same time the kernel also adds 1 to TcpExtListenDrops. When a
TCP socket is in the LISTEN state and the kernel needs to drop a
packet, the kernel always adds 1 to TcpExtListenDrops. So an increase
of TcpExtListenOverflows always comes with an increase of
TcpExtListenDrops, but TcpExtListenDrops can also increase without
TcpExtListenOverflows increasing, e.g. a memory allocation failure
also increases TcpExtListenDrops.
325
Note: The above explanation is based on kernel 4.10 or later. On an
older kernel, the TCP stack behaves differently when the TCP accept
queue is full: it doesn't drop the SYN but completes the 3-way
handshake. As the accept queue is full, the TCP stack keeps the socket
in the TCP half-open queue. While the socket is in the half-open queue,
the TCP stack sends SYN+ACK on an exponential backoff timer. After the
client replies with an ACK, the TCP stack checks whether the accept
queue is still full. If it is not full, the socket is moved to the
accept queue. If it is full, the socket stays in the half-open queue,
and the next time the client replies with an ACK, the socket gets
another chance to move to the accept queue.
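
Because every overflow is also counted as a drop, the two counters can be
combined to separate accept queue overflows from other drops on listening
sockets. The sketch below assumes two counter snapshots taken at
different times, each a dict keyed by counter name. The function name and
the result keys are made up for illustration.

```python
def listen_drop_breakdown(before, after):
    """Split the listen drops between two snapshots by cause."""
    overflows = after["TcpExtListenOverflows"] - before["TcpExtListenOverflows"]
    drops = after["TcpExtListenDrops"] - before["TcpExtListenDrops"]
    return {
        "overflow_drops": overflows,       # accept queue was full
        "other_drops": drops - overflows,  # e.g. memory allocation failure
    }


before = {"TcpExtListenOverflows": 10, "TcpExtListenDrops": 12}
after = {"TcpExtListenOverflows": 15, "TcpExtListenDrops": 20}
print(listen_drop_breakdown(before, after))
```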
337
338
339* TcpEstabResets
340Defined in `RFC1213 tcpEstabResets`_.
341
342.. _RFC1213 tcpEstabResets: https://tools.ietf.org/html/rfc1213#page-48
343
344* TcpAttemptFails
345Defined in `RFC1213 tcpAttemptFails`_.
346
347.. _RFC1213 tcpAttemptFails: https://tools.ietf.org/html/rfc1213#page-48
348
349* TcpOutRsts
Defined in `RFC1213 tcpOutRsts`_. The RFC says this counter indicates
the 'segments sent containing the RST flag', but in the Linux kernel,
this counter indicates the segments the kernel tried to send. The
sending process might fail due to some errors (e.g. a memory allocation
failure).
354
355.. _RFC1213 tcpOutRsts: https://tools.ietf.org/html/rfc1213#page-52
356
357
TCP Fast Path
=============
When the kernel receives a TCP packet, it has two paths to handle the
packet: the fast path and the slow path. The comment in the kernel
code provides a good explanation of them, pasted below::
363
364  It is split into a fast path and a slow path. The fast path is
365  disabled when:
366
367  - A zero window was announced from us
368  - zero window probing
369    is only handled properly on the slow path.
370  - Out of order segments arrived.
371  - Urgent data is expected.
372  - There is no buffer space left
373  - Unexpected TCP flags/window values/header lengths are received
374    (detected by checking the TCP header against pred_flags)
375  - Data is sent in both directions. The fast path only supports pure senders
376    or pure receivers (this means either the sequence number or the ack
377    value must stay constant)
378  - Unexpected TCP option.
379
The kernel tries to use the fast path unless any of the above
conditions is satisfied. If the packets are out of order, the kernel
handles them in the slow path, which means the performance might not be
very good. The kernel also enters the slow path if "Delayed ack" is
used, because when using "Delayed ack", the data is sent in both
directions. When the TCP window scale option is not used, the kernel
tries to enable the fast path immediately when the connection enters
the established state. If the TCP window scale option is used, the
kernel disables the fast path at first and tries to enable it after it
receives packets.
390
391* TcpExtTCPPureAcks and TcpExtTCPHPAcks
If a packet has the ACK flag set and carries no data, it is a pure ACK
packet. If the kernel handles it in the fast path, TcpExtTCPHPAcks
increases by 1. If the kernel handles it in the slow path,
TcpExtTCPPureAcks increases by 1.
396
397* TcpExtTCPHPHits
If a TCP packet has data (which means it is not a pure ACK packet),
and this packet is handled in the fast path, TcpExtTCPHPHits
increases by 1.
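
The three counters above can be summarized as a small decision function.
This is a sketch of the rules as described in this section only. The
function name is hypothetical, and a data packet handled in the slow path
is not covered by any of these three counters.

```python
def fast_path_counter(has_data, fast_path):
    """Return the counter increased for a received packet, or None."""
    if fast_path:
        # Fast path: data packets and pure ACKs have separate counters.
        return "TcpExtTCPHPHits" if has_data else "TcpExtTCPHPAcks"
    # Slow path: only pure ACKs are counted by the counters above.
    return None if has_data else "TcpExtTCPPureAcks"


for has_data in (False, True):
    for fast_path in (False, True):
        print(has_data, fast_path, fast_path_counter(has_data, fast_path))
```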
401
402
TCP abort
=========
405* TcpExtTCPAbortOnData
It means the TCP layer has data in flight but needs to close the
connection. So the TCP layer sends a RST to the other side, indicating
that the connection is not closed gracefully. An easy way to increase
this counter is using the SO_LINGER option. Please refer to the
SO_LINGER section of the `socket man page`_:
411
412.. _socket man page: http://man7.org/linux/man-pages/man7/socket.7.html
413
By default, when an application closes a connection, the close function
returns immediately and the kernel tries to send the in-flight data
asynchronously. If you use the SO_LINGER option with l_onoff set to 1
and l_linger set to a positive number, the close function doesn't
return immediately but waits until the in-flight data is acked by the
other side. The maximum wait time is l_linger seconds. If l_onoff is
set to 1 and l_linger is set to 0, the kernel sends a RST immediately
when the application closes the connection and increases the
TcpExtTCPAbortOnData counter.
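
A minimal sketch of setting the option from Python is shown below. It
assumes the usual layout of struct linger as two C ints (l_onoff,
l_linger). The helper names are hypothetical; setsockopt, SOL_SOCKET and
SO_LINGER come from the standard socket module.

```python
import socket
import struct


def pack_linger(onoff, seconds):
    """Build the struct linger payload: two C ints (l_onoff, l_linger)."""
    return struct.pack("ii", onoff, seconds)


def set_linger(sock, onoff, seconds):
    """Apply SO_LINGER with the given l_onoff and l_linger to a socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    pack_linger(onoff, seconds))


def close_with_reset(sock):
    """Close with l_onoff=1 and l_linger=0, the case described above."""
    set_linger(sock, 1, 0)
    sock.close()
```

Calling ``close_with_reset`` on a connected socket is expected to match
the last case of the paragraph above; ``set_linger(sock, 1, 5)`` would
instead make close wait up to 5 seconds.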
422
423* TcpExtTCPAbortOnClose
424This counter means the application has unread data in the TCP layer when
425the application wants to close the TCP connection. In such a situation,
426kernel will send a RST to the other side of the TCP connection.
427
428* TcpExtTCPAbortOnMemory
When an application closes a TCP connection, the kernel still needs to
track the connection and let it complete the TCP disconnect process.
E.g. an app calls the close method of a socket and the kernel sends a
FIN to the other side of the connection. The app then has no
relationship with the socket any more, but the kernel needs to keep the
socket. This socket becomes an orphan socket. The kernel waits for the
reply of the other side and finally enters the TIME_WAIT state. When
the kernel doesn't have enough memory to keep the orphan socket, it
sends a RST to the other side and deletes the socket. In such a
situation, the kernel increases TcpExtTCPAbortOnMemory by 1. Two
conditions trigger TcpExtTCPAbortOnMemory:

1. the memory used by the TCP protocol is higher than the third value of
   tcp_mem. Please refer to the tcp_mem section in the `TCP man page`_.
443
444.. _TCP man page: http://man7.org/linux/man-pages/man7/tcp.7.html
445
4462. the orphan socket count is higher than net.ipv4.tcp_max_orphans
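
The two conditions can be checked from user space with a sketch like the
one below. It assumes that /proc/net/sockstat has a "TCP:" line with
"orphan" and "mem" fields, and that both "mem" and the tcp_mem values
are counted in pages. The function names and result keys are
hypothetical, and the kernel's own check may differ in detail.

```python
def parse_tcp_sockstat(text):
    """Return the key/value pairs of the "TCP:" line of sockstat text."""
    for line in text.splitlines():
        if line.startswith("TCP:"):
            fields = line.split()[1:]
            return {key: int(value)
                    for key, value in zip(fields[0::2], fields[1::2])}
    raise ValueError("no TCP line found")


def abort_on_memory_conditions(sockstat_text, tcp_mem_text, max_orphans):
    """Evaluate the two conditions listed above."""
    tcp = parse_tcp_sockstat(sockstat_text)
    tcp_mem_third = int(tcp_mem_text.split()[2])  # third value of tcp_mem
    return {
        "memory_over_limit": tcp["mem"] > tcp_mem_third,
        "too_many_orphans": tcp["orphan"] > max_orphans,
    }


# Invented sample values in the assumed formats.
SOCKSTAT = "sockets: used 290\nTCP: inuse 12 orphan 1 tw 3 alloc 15 mem 4\n"
TCP_MEM = "4096 8192 16384\n"
print(abort_on_memory_conditions(SOCKSTAT, TCP_MEM, max_orphans=8192))
```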
447
448
449* TcpExtTCPAbortOnTimeout
This counter increases when any of the TCP timers expires. In such a
situation, the kernel doesn't send a RST; it just gives up the
connection.
452
453* TcpExtTCPAbortOnLinger
When a TCP connection enters the FIN_WAIT_2 state, instead of waiting
for the FIN packet from the other side, the kernel could send a RST and
delete the socket immediately. This is not the default behavior of
the Linux kernel TCP stack. By configuring the TCP_LINGER2 socket
option, you can let the kernel follow this behavior.
459
460* TcpExtTCPAbortFailed
The kernel TCP layer sends a RST if the conditions described in
`RFC2525 2.17 section`_ are satisfied. If an internal error occurs
during this process, TcpExtTCPAbortFailed is increased.
464
465.. _RFC2525 2.17 section: https://tools.ietf.org/html/rfc2525#page-50
466
TCP Hybrid Slow Start
=====================
The Hybrid Slow Start algorithm is an enhancement of the traditional
TCP congestion window Slow Start algorithm. It uses two pieces of
information to detect whether the max bandwidth of the TCP path is
approached: the ACK train length and the increase in packet delay. For
detailed information, please refer to the `Hybrid Slow Start paper`_.
When either the ACK train length or the packet delay hits a specific
threshold, the congestion control algorithm enters the Congestion
Avoidance state. As of v4.20, two congestion control algorithms use
Hybrid Slow Start: cubic (the default congestion control algorithm)
and cdg. Four SNMP counters relate to the Hybrid Slow Start algorithm.
480
481.. _Hybrid Slow Start paper: https://pdfs.semanticscholar.org/25e9/ef3f03315782c7f1cbcd31b587857adae7d1.pdf
482
483* TcpExtTCPHystartTrainDetect
How many times the ACK train length threshold is detected.
485
486* TcpExtTCPHystartTrainCwnd
The sum of CWND detected by ACK train length. Dividing this value by
TcpExtTCPHystartTrainDetect gives the average CWND detected by the
ACK train length.
490
491* TcpExtTCPHystartDelayDetect
492How many times the packet delay threshold is detected.
493
494* TcpExtTCPHystartDelayCwnd
The sum of CWND detected by packet delay. Dividing this value by
TcpExtTCPHystartDelayDetect gives the average CWND detected by the
packet delay.
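
The two divisions can be written as a small helper. The sketch assumes a
dict of counters keyed by the names used in this document. The function
name is hypothetical, and None is returned when the matching detection
counter is 0, to avoid a division by zero.

```python
def hystart_average_cwnd(counters):
    """Average CWND at detection time, per detection method."""
    result = {}
    for kind in ("Train", "Delay"):
        detects = counters.get("TcpExtTCPHystart%sDetect" % kind, 0)
        cwnd_sum = counters.get("TcpExtTCPHystart%sCwnd" % kind, 0)
        result[kind] = cwnd_sum / detects if detects else None
    return result


sample = {
    "TcpExtTCPHystartTrainDetect": 4,
    "TcpExtTCPHystartTrainCwnd": 200,
    "TcpExtTCPHystartDelayDetect": 0,
    "TcpExtTCPHystartDelayCwnd": 0,
}
print(hystart_average_cwnd(sample))
```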
498
TCP retransmission and congestion control
=========================================
The TCP protocol has two retransmission mechanisms: SACK and fast
recovery. They are mutually exclusive. When SACK is enabled, the
kernel TCP stack uses SACK; otherwise it uses fast recovery. SACK is a
TCP option, which is defined in `RFC2018`_. Fast recovery is defined
in `RFC6582`_ and is also called 'Reno'.
507
TCP congestion control is a big and complex topic. To understand
the related SNMP counters, we need to know the states of the congestion
control state machine. There are 5 states: Open, Disorder, CWR,
Recovery and Loss. For details about these states, please refer to
pages 5 and 6 of this document:
https://pdfs.semanticscholar.org/0e9c/968d09ab2e53e24c4dca5b2d67c7f7140f8e.pdf
514
515.. _RFC2018: https://tools.ietf.org/html/rfc2018
516.. _RFC6582: https://tools.ietf.org/html/rfc6582
517
518* TcpExtTCPRenoRecovery and TcpExtTCPSackRecovery
When the congestion control enters the Recovery state, if SACK is
used, TcpExtTCPSackRecovery increases by 1; if SACK is not used,
TcpExtTCPRenoRecovery increases by 1. These two counters mean that the
TCP stack begins to retransmit the lost packets.
523
524* TcpExtTCPSACKReneging
A packet was acknowledged by SACK, but the receiver has dropped this
packet, so the sender needs to retransmit it. In this situation, the
sender adds 1 to TcpExtTCPSACKReneging. A receiver could drop a packet
which has been acknowledged by SACK. Although this is unusual, it is
allowed by the TCP protocol. The sender doesn't really know what
happened on the receiver side. It just waits until the RTO expires for
this packet, and then assumes that the packet has been dropped by the
receiver.
533
534* TcpExtTCPRenoReorder
The reordered packet is detected by fast recovery. This counter is only
used if SACK is disabled. The fast recovery algorithm detects reordering
by the number of duplicate ACKs. E.g., if retransmission is triggered,
and the original retransmitted packet is not lost but just out of
order, the receiver acknowledges multiple times: once for the
retransmitted packet and once for the arrival of the original out of
order packet. Thus the sender finds more ACKs than it expects and
knows that reordering occurred.
543
544* TcpExtTCPTSReorder
The reordered packet is detected when a hole is filled. E.g., assume
the sender sends packets 1,2,3,4,5, and the receiving order is
1,2,4,5,3. When the sender receives the ACK of packet 3 (which fills
the hole), either of two conditions lets TcpExtTCPTSReorder increase
by 1: (1) packet 3 has not been retransmitted yet; (2) packet 3 has
been retransmitted but the timestamp of packet 3's ACK is earlier than
the retransmission timestamp.
552
553* TcpExtTCPSACKReorder
The reordered packet is detected by SACK. SACK has two methods to
detect reordering: (1) A DSACK is received by the sender. It means the
sender sent the same packet more than once, and the only reason is
that the sender believed an out of order packet was lost, so it sent
the packet again. (2) Assume packets 1,2,3,4,5 are sent by the sender,
and the sender has received SACKs for packets 2 and 5. If the sender
now receives a SACK for packet 4 and hasn't retransmitted the packet
yet, the sender knows that packet 4 is out of order. The kernel TCP
stack increases TcpExtTCPSACKReorder for both of the above scenarios.
564
565DSACK
566=====
DSACK is defined in `RFC2883`_. The receiver uses DSACK to report
duplicate packets to the sender. There are two kinds of
duplication: (1) a packet which has already been acknowledged is
duplicated; (2) an out of order packet is duplicated. The TCP stack
counts these two kinds of duplication on both the receiver side and
the sender side.
573
.. _RFC2883: https://tools.ietf.org/html/rfc2883
575
576* TcpExtTCPDSACKOldSent
577The TCP stack receives a duplicate packet which has been acked, so it
578sends a DSACK to the sender.
579
580* TcpExtTCPDSACKOfoSent
581The TCP stack receives an out of order duplicate packet, so it sends a
582DSACK to the sender.
583
584* TcpExtTCPDSACKRecv
585The TCP stack receives a DSACK, which indicates an acknowledged
586duplicate packet is received.
587
588* TcpExtTCPDSACKOfoRecv
The TCP stack receives a DSACK, which indicates that an out of order
duplicate packet is received.
591
invalid SACK and DSACK
======================
When a SACK (or DSACK) block is invalid, a corresponding counter is
updated. The validation method is based on the start/end sequence
numbers of the SACK block. For more details, please refer to the
comment of the function tcp_is_sackblock_valid in the kernel source
code. A SACK option can have up to 4 blocks, which are checked
individually. E.g., if 3 blocks of a SACK are invalid, the
corresponding counter is updated 3 times. The comment of the
`Add counters for discarded SACK blocks`_ patch has additional
explanation.
603
604.. _Add counters for discarded SACK blocks: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18f02545a9a16c9a89778b91a162ad16d510bb32
605
606* TcpExtTCPSACKDiscard
This counter indicates how many SACK blocks are invalid. If the invalid
SACK block is caused by ACK reordering, the TCP stack just ignores
it and doesn't update this counter.
610
611* TcpExtTCPDSACKIgnoredOld and TcpExtTCPDSACKIgnoredNoUndo
When a DSACK block is invalid, one of these two counters is
updated. Which counter is updated depends on the undo_marker flag
of the TCP socket. If the undo_marker is not set, the TCP stack isn't
likely to have retransmitted any packets. If we still receive an
invalid DSACK block, the reason might be that the packet was duplicated
in the middle of the network. In such a scenario,
TcpExtTCPDSACKIgnoredNoUndo is updated. If the undo_marker is set,
TcpExtTCPDSACKIgnoredOld is updated. As implied by its name, it might
be an old packet.
620
SACK shift
==========
The Linux networking stack stores data in the sk_buff struct (skb for
short). If a SACK block spans multiple skbs, the TCP stack tries
to re-arrange the data in these skbs. E.g. if a SACK block acknowledges
seq 10 to 15, skb1 has seq 10 to 13 and skb2 has seq 14 to 20, then seq
14 and 15 in skb2 are moved to skb1. This operation is 'shift'. If a
SACK block acknowledges seq 10 to 20, skb1 has seq 10 to 13 and skb2
has seq 14 to 20, then all data in skb2 is moved to skb1 and skb2 is
discarded. This operation is 'merge'.
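
The two examples can be reproduced with a toy model in which each skb
and the SACK block are inclusive (first_seq, last_seq) tuples. This is
only an illustration of the 'shift' and 'merge' outcomes described
above. The function name and the tuple model are hypothetical and far
simpler than the kernel logic.

```python
def apply_sack(skb1, skb2, sack):
    """Return (operation, resulting skb list) for two adjacent skbs."""
    sack_start, sack_end = sack
    # The model only handles a SACK block that covers all of skb1, with
    # skb2 following skb1 directly.
    covers_skb1 = sack_start <= skb1[0] and sack_end >= skb1[1]
    adjacent = skb2[0] == skb1[1] + 1
    if not (covers_skb1 and adjacent):
        return "none", [skb1, skb2]
    if sack_end >= skb2[1]:
        # skb2 is fully acknowledged: move all of it into skb1.
        return "merge", [(skb1[0], skb2[1])]
    if sack_end >= skb2[0]:
        # Only the head of skb2 is acknowledged: move that part to skb1.
        return "shift", [(skb1[0], sack_end), (sack_end + 1, skb2[1])]
    return "none", [skb1, skb2]


print(apply_sack((10, 13), (14, 20), (10, 15)))
print(apply_sack((10, 13), (14, 20), (10, 20)))
```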
631
632* TcpExtTCPSackShifted
A skb is shifted.
634
635* TcpExtTCPSackMerged
A skb is merged.
637
638* TcpExtTCPSackShiftFallback
A skb should be shifted or merged, but the TCP stack doesn't do it for
some reason.
641
TCP out of order
================
644* TcpExtTCPOFOQueue
645The TCP layer receives an out of order packet and has enough memory
646to queue it.
647
648* TcpExtTCPOFODrop
649The TCP layer receives an out of order packet but doesn't have enough
650memory, so drops it. Such packets won't be counted into
651TcpExtTCPOFOQueue.
652
653* TcpExtTCPOFOMerge
The received out of order packet overlaps with the previous
packet. The overlapping part is dropped. All TcpExtTCPOFOMerge
packets are also counted in TcpExtTCPOFOQueue.
657
TCP PAWS
========
PAWS (Protection Against Wrapped Sequence numbers) is an algorithm
which is used to drop old packets. It depends on the TCP
timestamps. For detailed information, please refer to the
`timestamp wiki`_ and the `RFC of PAWS`_.
664
665.. _RFC of PAWS: https://tools.ietf.org/html/rfc1323#page-17
666.. _timestamp wiki: https://en.wikipedia.org/wiki/Transmission_Control_Protocol#TCP_timestamps
667
668* TcpExtPAWSActive
669Packets are dropped by PAWS in Syn-Sent status.
670
671* TcpExtPAWSEstab
672Packets are dropped by PAWS in any status other than Syn-Sent.
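
The core comparison of PAWS can be sketched as below: a segment is
treated as old when its timestamp value is earlier than the most recent
timestamp recorded for the connection, compared in 32-bit wrapping
arithmetic. This is a simplified illustration only. The function name is
hypothetical, and the additional rules of the RFC and of the kernel
implementation are left out.

```python
def paws_reject(ts_recent, tsval):
    """True if tsval is older than ts_recent in 32-bit serial arithmetic."""
    diff = (ts_recent - tsval) & 0xFFFFFFFF
    if diff >= 0x80000000:
        diff -= 0x100000000  # reinterpret as a signed 32-bit value
    return diff > 0


print(paws_reject(1000, 999))       # older timestamp
print(paws_reject(0xFFFFFFF0, 5))   # newer timestamp after wrap-around
```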
673
TCP ACK skip
============
In some scenarios, the kernel avoids sending duplicate ACKs too
frequently. Please find more details in the tcp_invalid_ratelimit
section of the `sysctl document`_. When the kernel decides to skip an
ACK due to tcp_invalid_ratelimit, it updates one of the counters below
to indicate in which scenario the ACK is skipped. The ACK is only
skipped if the received packet is either a SYN packet or has no data.
683
684.. _sysctl document: https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
685
686* TcpExtTCPACKSkippedSynRecv
The ACK is skipped in Syn-Recv status. The Syn-Recv status means the
TCP stack has received a SYN and replied with a SYN+ACK, and is now
waiting for an ACK. Generally, the TCP stack doesn't need to send an
ACK in the Syn-Recv status. But in several scenarios, the TCP stack
needs to send an ACK, e.g. when it receives the same SYN packet
repeatedly, when the received packet does not pass the PAWS check, or
when the received packet's sequence number is out of window. If the ACK
sending frequency is higher than tcp_invalid_ratelimit allows, the TCP
stack skips sending the ACK and increases TcpExtTCPACKSkippedSynRecv.
697
698
699* TcpExtTCPACKSkippedPAWS
The ACK is skipped because the PAWS (Protection Against Wrapped Sequence
numbers) check fails. If the PAWS check fails in the Syn-Recv, Fin-Wait-2
or Time-Wait status, the skipped ACK is counted in
TcpExtTCPACKSkippedSynRecv, TcpExtTCPACKSkippedFinWait2 or
TcpExtTCPACKSkippedTimeWait respectively. In all other statuses, the
skipped ACK is counted in TcpExtTCPACKSkippedPAWS.
706
707* TcpExtTCPACKSkippedSeq
The sequence number is out of window, the timestamp passes the PAWS
check, and the TCP status is not Syn-Recv, Fin-Wait-2 or Time-Wait.
710
711* TcpExtTCPACKSkippedFinWait2
The ACK is skipped in Fin-Wait-2 status. The reason is either that the
PAWS check fails or that the received sequence number is out of window.
714
715* TcpExtTCPACKSkippedTimeWait
The ACK is skipped in Time-Wait status. The reason is either that the
PAWS check fails or that the received sequence number is out of window.
718
719* TcpExtTCPACKSkippedChallenge
The ACK is skipped if the ACK is a challenge ACK. RFC 5961 defines
3 kinds of challenge ACK, please refer to `RFC 5961 section 3.2`_,
`RFC 5961 section 4.2`_ and `RFC 5961 section 5.2`_. Besides these
three scenarios, in some TCP statuses, the Linux TCP stack also
sends challenge ACKs if the ACK number is before the first
unacknowledged number (stricter than `RFC 5961 section 5.2`_).
726
727.. _RFC 5961 section 3.2: https://tools.ietf.org/html/rfc5961#page-7
728.. _RFC 5961 section 4.2: https://tools.ietf.org/html/rfc5961#page-9
729.. _RFC 5961 section 5.2: https://tools.ietf.org/html/rfc5961#page-11
730
731TCP receive window
732=================
733* TcpExtTCPWantZeroWindowAdv
Depending on current memory usage, the TCP stack tries to set the receive
window to zero. But the receive window might still be a nonzero
value. For example, if the previous window size is 10, and the TCP
stack receives 3 bytes, the current window size would be 7 even if the
window size calculated from the memory usage is zero.
739
740* TcpExtTCPToZeroWindowAdv
The TCP receive window is set to zero from a nonzero value.

* TcpExtTCPFromZeroWindowAdv
The TCP receive window is set to a nonzero value from zero.
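
The TcpExtTCPWantZeroWindowAdv example (previous window 10, 3 bytes
received, wanted window 0) can be written out as a small calculation.
The function below is only a simplified model of "do not take back
window space that was already advertised"; its name and parameters are
hypothetical and it is not how the kernel structures the computation::

  def advertised_window(prev_window, received_bytes, wanted_window):
      """Return the window to advertise without shrinking the right edge."""
      # Space already promised to the peer and not yet used.
      still_promised = prev_window - received_bytes
      return max(wanted_window, still_promised)

  print(advertised_window(prev_window=10, received_bytes=3, wanted_window=0))

The call prints 7: the stack wanted a zero window but still had to
advertise the 7 bytes it had promised earlier.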
745
746
747Delayed ACK
748==========
TCP Delayed ACK is a technique used to reduce the
packet count in the network. For more details, please refer to the
`Delayed ACK wiki`_.
752
753.. _Delayed ACK wiki: https://en.wikipedia.org/wiki/TCP_delayed_acknowledgment
754
755* TcpExtDelayedACKs
756A delayed ACK timer expires. The TCP stack will send a pure ACK packet
757and exit the delayed ACK mode.
758
759* TcpExtDelayedACKLocked
A delayed ACK timer expires, but the TCP stack can't send an ACK
immediately because the socket is locked by a userspace program. The
TCP stack will send a pure ACK later (after the userspace program
unlocks the socket). When the TCP stack sends the pure ACK later, it
will also update TcpExtDelayedACKs and exit the delayed ACK
mode.
766
767* TcpExtDelayedACKLost
It is updated when the TCP stack receives a packet which has already been
ACKed. A delayed ACK loss might cause this, but it can also be
triggered by other reasons, such as a packet being duplicated in the
network.
772
773Tail Loss Probe (TLP)
774===================
TLP is an algorithm which is used to detect TCP packet loss. For more
details, please refer to the `TLP paper`_.
777
778.. _TLP paper: https://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01
779
780* TcpExtTCPLossProbes
781A TLP probe packet is sent.
782
783* TcpExtTCPLossProbeRecovery
784A packet loss is detected and recovered by TLP.
785
Examples
========
788
789ping test
790--------
Run the ping command against the public DNS server 8.8.8.8::
792
793  nstatuser@nstat-a:~$ ping 8.8.8.8 -c 1
794  PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
795  64 bytes from 8.8.8.8: icmp_seq=1 ttl=119 time=17.8 ms
796
797  --- 8.8.8.8 ping statistics ---
798  1 packets transmitted, 1 received, 0% packet loss, time 0ms
799  rtt min/avg/max/mdev = 17.875/17.875/17.875/0.000 ms
800
The nstat result::
802
803  nstatuser@nstat-a:~$ nstat
804  #kernel
805  IpInReceives                    1                  0.0
806  IpInDelivers                    1                  0.0
807  IpOutRequests                   1                  0.0
808  IcmpInMsgs                      1                  0.0
809  IcmpInEchoReps                  1                  0.0
810  IcmpOutMsgs                     1                  0.0
811  IcmpOutEchos                    1                  0.0
812  IcmpMsgInType0                  1                  0.0
813  IcmpMsgOutType8                 1                  0.0
814  IpExtInOctets                   84                 0.0
815  IpExtOutOctets                  84                 0.0
816  IpExtInNoECTPkts                1                  0.0
817
The Linux server sent an ICMP Echo packet, so IpOutRequests,
IcmpOutMsgs, IcmpOutEchos and IcmpMsgOutType8 increased by 1. The
server got an ICMP Echo Reply from 8.8.8.8, so IpInReceives, IcmpInMsgs,
IcmpInEchoReps and IcmpMsgInType0 increased by 1. The ICMP Echo Reply
was passed to the ICMP layer via the IP layer, so IpInDelivers
increased by 1. The default ping data size is 56 bytes (the "56(84)
bytes of data" in the ping output), so an ICMP Echo packet
and its corresponding Echo Reply packet consist of:

* 14 bytes MAC header
* 20 bytes IP header
* 8 bytes ICMP header
* 56 bytes data (default value of the ping command)

So the IpExtInOctets and IpExtOutOctets are 20+8+56=84.
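
The size bookkeeping above can be written out as a short calculation.
The 64-byte figure below is the ICMP header and data taken together, and
the constant names are illustrative only::

  MAC_HEADER = 14      # counted by interface counters, not by IpExt*Octets
  IP_HEADER = 20
  ICMP_MESSAGE = 64    # ICMP header plus data, as carried in the IP payload

  ip_octets = IP_HEADER + ICMP_MESSAGE
  frame_octets = MAC_HEADER + ip_octets
  print(ip_octets, frame_octets)

This prints 84 and 98: the first value is what IpExtInOctets and
IpExtOutOctets report, the second is the size of the frame on the wire
including the MAC header.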
832
TCP 3-way handshake
-------------------
835On server side, we run::
836
837  nstatuser@nstat-b:~$ nc -lknv 0.0.0.0 9000
838  Listening on [0.0.0.0] (family 0, port 9000)
839
840On client side, we run::
841
842  nstatuser@nstat-a:~$ nc -nv 192.168.122.251 9000
843  Connection to 192.168.122.251 9000 port [tcp/*] succeeded!
844
The server listened on TCP port 9000, the client connected to it, and they
completed the 3-way handshake.
847
On the server side, we can find the following nstat output::
849
850  nstatuser@nstat-b:~$ nstat | grep -i tcp
851  TcpPassiveOpens                 1                  0.0
852  TcpInSegs                       2                  0.0
853  TcpOutSegs                      1                  0.0
854  TcpExtTCPPureAcks               1                  0.0
855
On the client side, we can find the following nstat output::
857
858  nstatuser@nstat-a:~$ nstat | grep -i tcp
859  TcpActiveOpens                  1                  0.0
860  TcpInSegs                       1                  0.0
861  TcpOutSegs                      2                  0.0
862
When the server received the first SYN, it replied with a SYN+ACK and
entered the SYN-RCVD state, so TcpPassiveOpens increased by 1. The server
received the SYN, sent the SYN+ACK and received the ACK, so it sent 1
packet and received 2 packets: TcpInSegs increased by 2 and TcpOutSegs
increased by 1. The last ACK of the 3-way handshake is a pure ACK without
data, so TcpExtTCPPureAcks increased by 1.

When the client sent the SYN, it entered the SYN-SENT state, so
TcpActiveOpens increased by 1. The client sent the SYN, received the
SYN+ACK and sent the ACK, so it sent 2 packets and received 1 packet:
TcpInSegs increased by 1 and TcpOutSegs increased by 2.
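
The counting in the two paragraphs above can be replayed with a small
model. It is a sketch that only tracks the counters mentioned in this
example; the function name is made up, and the real kernel of course
updates these counters in many other code paths::

  from collections import Counter

  def three_way_handshake():
      """Count segments seen by each side during SYN, SYN+ACK, ACK."""
      client, server = Counter(), Counter()
      client["TcpActiveOpens"] += 1      # client enters SYN-SENT
      for sender, receiver in ((client, server),   # SYN
                               (server, client),   # SYN+ACK
                               (client, server)):  # ACK
          sender["TcpOutSegs"] += 1
          receiver["TcpInSegs"] += 1
      server["TcpPassiveOpens"] += 1     # server went through SYN-RCVD
      server["TcpExtTCPPureAcks"] += 1   # final ACK carries no data
      return client, server

  client, server = three_way_handshake()
  print(dict(client))
  print(dict(server))

The resulting values match the nstat outputs shown for the client and
the server.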
874
875TCP normal traffic
876-----------------
877Run nc on server::
878
879  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
880  Listening on [0.0.0.0] (family 0, port 9000)
881
882Run nc on client::
883
884  nstatuser@nstat-a:~$ nc -v nstat-b 9000
885  Connection to nstat-b 9000 port [tcp/*] succeeded!
886
887Input a string in the nc client ('hello' in our example)::
888
889  nstatuser@nstat-a:~$ nc -v nstat-b 9000
890  Connection to nstat-b 9000 port [tcp/*] succeeded!
891  hello
892
893The client side nstat output::
894
895  nstatuser@nstat-a:~$ nstat
896  #kernel
897  IpInReceives                    1                  0.0
898  IpInDelivers                    1                  0.0
899  IpOutRequests                   1                  0.0
900  TcpInSegs                       1                  0.0
901  TcpOutSegs                      1                  0.0
902  TcpExtTCPPureAcks               1                  0.0
903  TcpExtTCPOrigDataSent           1                  0.0
904  IpExtInOctets                   52                 0.0
905  IpExtOutOctets                  58                 0.0
906  IpExtInNoECTPkts                1                  0.0
907
908The server side nstat output::
909
910  nstatuser@nstat-b:~$ nstat
911  #kernel
912  IpInReceives                    1                  0.0
913  IpInDelivers                    1                  0.0
914  IpOutRequests                   1                  0.0
915  TcpInSegs                       1                  0.0
916  TcpOutSegs                      1                  0.0
917  IpExtInOctets                   58                 0.0
918  IpExtOutOctets                  52                 0.0
919  IpExtInNoECTPkts                1                  0.0
920
Input a string in nc client side again ('world' in our example)::
922
923  nstatuser@nstat-a:~$ nc -v nstat-b 9000
924  Connection to nstat-b 9000 port [tcp/*] succeeded!
925  hello
926  world
927
928Client side nstat output::
929
930  nstatuser@nstat-a:~$ nstat
931  #kernel
932  IpInReceives                    1                  0.0
933  IpInDelivers                    1                  0.0
934  IpOutRequests                   1                  0.0
935  TcpInSegs                       1                  0.0
936  TcpOutSegs                      1                  0.0
937  TcpExtTCPHPAcks                 1                  0.0
938  TcpExtTCPOrigDataSent           1                  0.0
939  IpExtInOctets                   52                 0.0
940  IpExtOutOctets                  58                 0.0
941  IpExtInNoECTPkts                1                  0.0
942
943
944Server side nstat output::
945
946  nstatuser@nstat-b:~$ nstat
947  #kernel
948  IpInReceives                    1                  0.0
949  IpInDelivers                    1                  0.0
950  IpOutRequests                   1                  0.0
951  TcpInSegs                       1                  0.0
952  TcpOutSegs                      1                  0.0
953  TcpExtTCPHPHits                 1                  0.0
954  IpExtInOctets                   58                 0.0
955  IpExtOutOctets                  52                 0.0
956  IpExtInNoECTPkts                1                  0.0
957
Comparing the first and the second client-side nstat outputs,
we can find one difference: the first one had a 'TcpExtTCPPureAcks',
but the second one had a 'TcpExtTCPHPAcks'. The first and the second
server-side nstat outputs differed too: the second one had a
'TcpExtTCPHPHits', but the first one didn't. The network traffic
patterns were exactly the same: the client sent a packet to the server,
and the server replied with an ACK. But the kernel handled them in
different ways. When the TCP window scale option is not used, the kernel
tries to enable the fast path immediately when the connection enters the
established state, but if the TCP window scale option is used, the kernel
disables the fast path at first, and tries to enable it after the kernel
receives packets. We can use the 'ss' command to verify whether the
window scale option is used, e.g. run the command below on either the
server or the client::
973
  nstatuser@nstat-a:~$ ss -o state established -i '( dport = :9000 or sport = :9000 )'
975  Netid    Recv-Q     Send-Q            Local Address:Port             Peer Address:Port
976  tcp      0          0               192.168.122.250:40654         192.168.122.251:9000
977             ts sack cubic wscale:7,7 rto:204 rtt:0.98/0.49 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:1 send 118.2Mbps lastsnd:46572 lastrcv:46572 lastack:46572 pacing_rate 236.4Mbps rcv_space:29200 rcv_ssthresh:29200 minrtt:0.98
978
The 'wscale:7,7' means both server and client set the window scale
option to 7. Now we can explain the nstat output in our test:

In the first nstat output of the client side, the client sent a packet
and the server replied with an ACK. When the kernel handled this ACK, the
fast path was not enabled, so the ACK was counted into
'TcpExtTCPPureAcks'.

In the second nstat output of the client side, the client sent a packet
again and received another ACK from the server. This time, the fast path
was enabled and the ACK qualified for it, so the ACK was handled by the
fast path and counted into 'TcpExtTCPHPAcks'.

In the first nstat output of the server side, the fast path was not
enabled, so there was no 'TcpExtTCPHPHits'.

In the second nstat output of the server side, the fast path was enabled,
and the packet received from the client qualified for the fast path, so it
was counted into 'TcpExtTCPHPHits'.
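
When comparing nstat outputs like the ones in this example, it can help
to compute the difference mechanically. The sketch below uses the TCP
counters from the two client-side outputs; the helper name ``only_in``
is made up for this illustration::

  first_client = {"TcpInSegs": 1, "TcpOutSegs": 1,
                  "TcpExtTCPPureAcks": 1, "TcpExtTCPOrigDataSent": 1}
  second_client = {"TcpInSegs": 1, "TcpOutSegs": 1,
                   "TcpExtTCPHPAcks": 1, "TcpExtTCPOrigDataSent": 1}

  def only_in(a, b):
      """Counter names reported in a but not in b."""
      return sorted(set(a) - set(b))

  print(only_in(first_client, second_client))
  print(only_in(second_client, first_client))

The first line of output names TcpExtTCPPureAcks and the second names
TcpExtTCPHPAcks, which is the slow path versus fast path difference
discussed above.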
997
998TcpExtTCPAbortOnClose
999--------------------
On the server side, we run the Python script below::
1001
1002  import socket
1003  import time
1004
1005  port = 9000
1006
1007  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
1008  s.bind(('0.0.0.0', port))
1009  s.listen(1)
1010  sock, addr = s.accept()
1011  while True:
1012      time.sleep(9999999)
1013
This Python script listens on port 9000, but doesn't read anything from
the connection.
1016
1017On the client side, we send the string "hello" by nc::
1018
1019  nstatuser@nstat-a:~$ echo "hello" | nc nstat-b 9000
1020
Then, we come back to the server side. The server has received the "hello"
packet, and the TCP layer has ACKed this packet, but the application hasn't
read it yet. We type Ctrl-C to terminate the server script. Then we
can find that TcpExtTCPAbortOnClose increased by 1 on the server side::
1025
1026  nstatuser@nstat-b:~$ nstat | grep -i abort
1027  TcpExtTCPAbortOnClose           1                  0.0
1028
If we run tcpdump on the server side, we can see that the server sent a
RST after we typed Ctrl-C.
1031
1032TcpExtTCPAbortOnMemory and TcpExtTCPAbortOnTimeout
1033-----------------------------------------------
Below is an example which lets the orphan socket count exceed
net.ipv4.tcp_max_orphans.
Change tcp_max_orphans to a smaller value on the client::
1037
1038  sudo bash -c "echo 10 > /proc/sys/net/ipv4/tcp_max_orphans"
1039
Client code (create 64 connections to the server)::
1041
1042  nstatuser@nstat-a:~$ cat client_orphan.py
1043  import socket
1044  import time
1045
1046  server = 'nstat-b' # server address
1047  port = 9000
1048
1049  count = 64
1050
1051  connection_list = []
1052
  for i in range(count):
1054      s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
1055      s.connect((server, port))
1056      connection_list.append(s)
1057      print("connection_count: %d" % len(connection_list))
1058
1059  while True:
1060      time.sleep(99999)
1061
Server code (accept 64 connections from the client)::
1063
1064  nstatuser@nstat-b:~$ cat server_orphan.py
1065  import socket
1066  import time
1067
1068  port = 9000
1069  count = 64
1070
1071  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
1072  s.bind(('0.0.0.0', port))
1073  s.listen(count)
1074  connection_list = []
1075  while True:
1076      sock, addr = s.accept()
1077      connection_list.append((sock, addr))
1078      print("connection_count: %d" % len(connection_list))
1079
1080Run the python scripts on server and client.
1081
1082On server::
1083
1084  python3 server_orphan.py
1085
1086On client::
1087
1088  python3 client_orphan.py
1089
1090Run iptables on server::
1091
1092  sudo iptables -A INPUT -i ens3 -p tcp --destination-port 9000 -j DROP
1093
1094Type Ctrl-C on client, stop client_orphan.py.
1095
1096Check TcpExtTCPAbortOnMemory on client::
1097
1098  nstatuser@nstat-a:~$ nstat | grep -i abort
1099  TcpExtTCPAbortOnMemory          54                 0.0
1100
Check the orphan socket count on the client::
1102
1103  nstatuser@nstat-a:~$ ss -s
1104  Total: 131 (kernel 0)
1105  TCP:   14 (estab 1, closed 0, orphaned 10, synrecv 0, timewait 0/0), ports 0
1106
1107  Transport Total     IP        IPv6
1108  *         0         -         -
1109  RAW       1         0         1
1110  UDP       1         1         0
1111  TCP       14        13        1
1112  INET      16        14        2
1113  FRAG      0         0         0
1114
The explanation of the test: after running server_orphan.py and
client_orphan.py, we have 64 connections between the server and the
client. Once the iptables command is run, the server drops all packets
from the client. When we type Ctrl-C on client_orphan.py, the client
system tries to close these connections, and before they are closed
gracefully, they become orphan sockets. As the iptables rule on the
server blocks packets from the client, the server doesn't receive the FIN
from the client, so all connections on the client are stuck in the
FIN_WAIT_1 state and stay orphan sockets until they time out. We have
written 10 to /proc/sys/net/ipv4/tcp_max_orphans, so the client system
keeps only 10 orphan sockets; for all other orphan sockets, the client
system sends a RST and deletes them. We have 64 connections, so
the 'ss -s' command shows the system has 10 orphan sockets, and the
value of TcpExtTCPAbortOnMemory is 54.
1129
An additional explanation about the orphan socket count: you can find the
exact orphan socket count with the 'ss -s' command, but when the kernel
decides whether to increase TcpExtTCPAbortOnMemory and send a RST, it
doesn't always check the exact orphan socket count. To improve
performance, the kernel checks an approximate count first; only if the
approximate count is more than tcp_max_orphans does the kernel check the
exact count. So if the approximate count is less than
tcp_max_orphans, but the exact count is more than tcp_max_orphans, you
will find that TcpExtTCPAbortOnMemory is not increased at all. If
tcp_max_orphans is large enough, this won't occur, but if you decrease
tcp_max_orphans to a small value as in our test, you might hit this
issue. That is why the client in our test set up 64 connections although
tcp_max_orphans is 10. If the client only set up 11 connections, we
wouldn't see any change in TcpExtTCPAbortOnMemory.
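
The two-step check described in this paragraph can be sketched as
follows. This is a model of the decision only, not kernel code: the
function name is hypothetical, and the approximate counts passed in are
invented values chosen to show both outcomes::

  def too_many_orphans(approx_count, exact_count, max_orphans):
      """Two-step check: cheap approximate count first, exact count second."""
      if approx_count <= max_orphans:
          return False              # exact count is never consulted
      return exact_count > max_orphans

  # Approximation lags behind: the limit is exceeded but not noticed.
  print(too_many_orphans(approx_count=0, exact_count=11, max_orphans=10))
  # With many more orphans the approximation also exceeds the limit.
  print(too_many_orphans(approx_count=32, exact_count=64, max_orphans=10))

The first call returns False even though 11 orphans exceed the limit of
10, which corresponds to the case where TcpExtTCPAbortOnMemory does not
change. The second call returns True.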
1144
Continuing the previous test, we wait for several minutes. Because the
iptables rule on the server blocks the traffic, the server doesn't receive
the FIN, and all the client's orphan sockets eventually time out in the
FIN_WAIT_1 state. So after a few minutes, we can find
10 timeouts on the client::
1150
1151  nstatuser@nstat-a:~$ nstat | grep -i abort
1152  TcpExtTCPAbortOnTimeout         10                 0.0
1153
1154TcpExtTCPAbortOnLinger
1155---------------------
1156The server side code::
1157
1158  nstatuser@nstat-b:~$ cat server_linger.py
1159  import socket
1160  import time
1161
1162  port = 9000
1163
1164  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
1165  s.bind(('0.0.0.0', port))
1166  s.listen(1)
1167  sock, addr = s.accept()
1168  while True:
1169      time.sleep(9999999)
1170
1171The client side code::
1172
1173  nstatuser@nstat-a:~$ cat client_linger.py
1174  import socket
1175  import struct
1176
1177  server = 'nstat-b' # server address
1178  port = 9000
1179
1180  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
1181  s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 10))
1182  s.setsockopt(socket.SOL_TCP, socket.TCP_LINGER2, struct.pack('i', -1))
1183  s.connect((server, port))
1184  s.close()
1185
1186Run server_linger.py on server::
1187
1188  nstatuser@nstat-b:~$ python3 server_linger.py
1189
1190Run client_linger.py on client::
1191
1192  nstatuser@nstat-a:~$ python3 client_linger.py
1193
After running client_linger.py, check the output of nstat::
1195
1196  nstatuser@nstat-a:~$ nstat | grep -i abort
1197  TcpExtTCPAbortOnLinger          1                  0.0
1198
1199TcpExtTCPRcvCoalesce
1200-------------------
On the server, we run a program which listens on TCP port 9000, but
doesn't read any data::
1203
1204  import socket
1205  import time
1206  port = 9000
1207  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
1208  s.bind(('0.0.0.0', port))
1209  s.listen(1)
1210  sock, addr = s.accept()
1211  while True:
1212      time.sleep(9999999)
1213
1214Save the above code as server_coalesce.py, and run::
1215
1216  python3 server_coalesce.py
1217
1218On the client, save below code as client_coalesce.py::
1219
1220  import socket
1221  server = 'nstat-b'
1222  port = 9000
1223  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
1224  s.connect((server, port))
1225
1226Run::
1227
1228  nstatuser@nstat-a:~$ python3 -i client_coalesce.py
1229
We use '-i' to enter the interactive mode, then send a packet::
1231
1232  >>> s.send(b'foo')
1233  3
1234
1235Send a packet again::
1236
1237  >>> s.send(b'bar')
1238  3
1239
1240On the server, run nstat::
1241
1242  ubuntu@nstat-b:~$ nstat
1243  #kernel
1244  IpInReceives                    2                  0.0
1245  IpInDelivers                    2                  0.0
1246  IpOutRequests                   2                  0.0
1247  TcpInSegs                       2                  0.0
1248  TcpOutSegs                      2                  0.0
1249  TcpExtTCPRcvCoalesce            1                  0.0
1250  IpExtInOctets                   110                0.0
1251  IpExtOutOctets                  104                0.0
1252  IpExtInNoECTPkts                2                  0.0
1253
The client sent two packets, and the server didn't read any data. When
the second packet arrived at the server, the first packet was still in
the receive queue. So the TCP layer merged the two packets, and we
can find that TcpExtTCPRcvCoalesce increased by 1.
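
The byte counters in this nstat output can be cross-checked with a
little arithmetic. The sketch below assumes that every segment on this
connection carries the same 52 bytes of IP and TCP headers as the
data-less ACK in the "TCP normal traffic" example (IpExtInOctets 52).
That figure is read off the outputs in this document rather than taken
from a specification, and the function name is made up::

  PURE_ACK_OCTETS = 52   # IP-level size of a data-less segment in these traces

  def ip_octets(payloads):
      """IP-level byte count for segments carrying the given payloads."""
      return sum(PURE_ACK_OCTETS + len(p) for p in payloads)

  print(ip_octets([b"foo", b"bar"]))   # data segments received by the server
  print(ip_octets([b"", b""]))         # the two ACKs the server sent back

The results, 110 and 104, agree with IpExtInOctets and IpExtOutOctets
above.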
1258
1259TcpExtListenOverflows and TcpExtListenDrops
1260----------------------------------------
1261On server, run the nc command, listen on port 9000::
1262
1263  nstatuser@nstat-b:~$ nc -lkv 0.0.0.0 9000
1264  Listening on [0.0.0.0] (family 0, port 9000)
1265
1266On client, run 3 nc commands in different terminals::
1267
1268  nstatuser@nstat-a:~$ nc -v nstat-b 9000
1269  Connection to nstat-b 9000 port [tcp/*] succeeded!
1270
The nc command only accepts 1 connection, and the accept queue length
is 1. In the current Linux implementation, setting the queue length to n
means the actual queue length is n+1. Now we create 3 connections: 1 is
accepted by nc and 2 are in the accept queue, so the accept queue is full.
1275
1276Before running the 4th nc, we clean the nstat history on the server::
1277
1278  nstatuser@nstat-b:~$ nstat -n
1279
1280Run the 4th nc on the client::
1281
1282  nstatuser@nstat-a:~$ nc -v nstat-b 9000
1283
If the nc server is running on kernel 4.10 or a higher version, you
won't see the "Connection to ... succeeded!" string, because the kernel
drops the SYN if the accept queue is full. If the nc server is running
on an older kernel, you will see that the connection succeeds,
because the kernel completes the 3-way handshake and keeps the socket
in the half-open queue. I did the test on kernel 4.15. Below is the nstat
output on the server::
1291
1292  nstatuser@nstat-b:~$ nstat
1293  #kernel
1294  IpInReceives                    4                  0.0
1295  IpInDelivers                    4                  0.0
1296  TcpInSegs                       4                  0.0
1297  TcpExtListenOverflows           4                  0.0
1298  TcpExtListenDrops               4                  0.0
1299  IpExtInOctets                   240                0.0
1300  IpExtInNoECTPkts                4                  0.0
1301
Both TcpExtListenOverflows and TcpExtListenDrops were 4. If the time
between the 4th nc and the nstat was longer, the values of
TcpExtListenOverflows and TcpExtListenDrops would be larger, because
the SYN of the 4th nc was dropped and the client kept retrying.
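
The capacity reasoning in this example can be modelled in a few lines.
The function below is a sketch of the "queue length n means n+1" rule
described above combined with the one connection nc has already
accepted; the function and parameter names are invented for the
illustration::

  def syn_outcomes(backlog, accepted_by_app, attempts):
      """Return 'ok' or 'dropped' for each connection attempt in order."""
      queue_capacity = backlog + 1          # "n means n+1" as noted above
      room = accepted_by_app + queue_capacity
      return ["ok" if i < room else "dropped" for i in range(attempts)]

  print(syn_outcomes(backlog=1, accepted_by_app=1, attempts=4))

The first three attempts succeed and the 4th is dropped, as in the
test.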
1306
1307IpInAddrErrors, IpExtInNoRoutes and IpOutNoRoutes
1308----------------------------------------------
* Server A IP address: 192.168.122.250
* Server B IP address: 192.168.122.251

Prepare on server A, add a route to 8.8.8.8 via server B::
1312
1313  $ sudo ip route add 8.8.8.8/32 via 192.168.122.251
1314
1315Prepare on server B, disable send_redirects for all interfaces::
1316
1317  $ sudo sysctl -w net.ipv4.conf.all.send_redirects=0
1318  $ sudo sysctl -w net.ipv4.conf.ens3.send_redirects=0
1319  $ sudo sysctl -w net.ipv4.conf.lo.send_redirects=0
1320  $ sudo sysctl -w net.ipv4.conf.default.send_redirects=0
1321
We want to let server A send a packet to 8.8.8.8, and route the packet
to server B. When server B receives such a packet, it might send an ICMP
Redirect message to server A. Setting send_redirects to 0 disables
this behavior.
1326
First, generate IpInAddrErrors. On server B, we disable IP forwarding::
1328
1329  $ sudo sysctl -w net.ipv4.conf.all.forwarding=0
1330
1331On server A, we send packets to 8.8.8.8::
1332
1333  $ nc -v 8.8.8.8 53
1334
1335On server B, we check the output of nstat::
1336
1337  $ nstat
1338  #kernel
1339  IpInReceives                    3                  0.0
1340  IpInAddrErrors                  3                  0.0
1341  IpExtInOctets                   180                0.0
1342  IpExtInNoECTPkts                3                  0.0
1343
As we have let server A route 8.8.8.8 to server B, and we disabled IP
forwarding on server B, server A sent packets to server B, then server B
dropped the packets and increased IpInAddrErrors. As the nc command
re-sends the SYN packet if it doesn't receive a SYN+ACK, we can find
multiple IpInAddrErrors.
1349
1350Second, generate IpExtInNoRoutes. On server B, we enable IP
1351forwarding::
1352
1353  $ sudo sysctl -w net.ipv4.conf.all.forwarding=1
1354
1355Check the route table of server B and remove the default route::
1356
1357  $ ip route show
1358  default via 192.168.122.1 dev ens3 proto static
1359  192.168.122.0/24 dev ens3 proto kernel scope link src 192.168.122.251
1360  $ sudo ip route delete default via 192.168.122.1 dev ens3 proto static
1361
1362On server A, we contact 8.8.8.8 again::
1363
1364  $ nc -v 8.8.8.8 53
1365  nc: connect to 8.8.8.8 port 53 (tcp) failed: Network is unreachable
1366
1367On server B, run nstat::
1368
1369  $ nstat
1370  #kernel
1371  IpInReceives                    1                  0.0
1372  IpOutRequests                   1                  0.0
1373  IcmpOutMsgs                     1                  0.0
1374  IcmpOutDestUnreachs             1                  0.0
1375  IcmpMsgOutType3                 1                  0.0
1376  IpExtInNoRoutes                 1                  0.0
1377  IpExtInOctets                   60                 0.0
1378  IpExtOutOctets                  88                 0.0
1379  IpExtInNoECTPkts                1                  0.0
1380
We enabled IP forwarding on server B, so when server B received a packet
whose destination IP address is 8.8.8.8, it tried to forward
this packet. We had deleted the default route, so there was no route for
8.8.8.8; server B increased IpExtInNoRoutes and sent the "ICMP
Destination Unreachable" message to server A.
1386
1387Third, generate IpOutNoRoutes. Run ping command on server B::
1388
1389  $ ping -c 1 8.8.8.8
1390  connect: Network is unreachable
1391
1392Run nstat on server B::
1393
1394  $ nstat
1395  #kernel
1396  IpOutNoRoutes                   1                  0.0
1397
1398We have deleted the default route on server B. Server B couldn't find
1399a route for the 8.8.8.8 IP address, so server B increased
1400IpOutNoRoutes.
1401
1402TcpExtTCPACKSkippedSynRecv
1403------------------------
In this test, we send the same SYN packet 3 times from client to server.
The first SYN makes the server create a socket, set it to Syn-Recv status,
and reply with a SYN/ACK. The second SYN makes the server reply with the
SYN/ACK again, and record the reply time (the duplicate ACK reply time).
The third SYN makes the server check the previous duplicate ACK reply
time, decide to skip the duplicate ACK, and increase the
TcpExtTCPACKSkippedSynRecv counter.
1411
1412Run tcpdump to capture a SYN packet::
1413
1414  nstatuser@nstat-a:~$ sudo tcpdump -c 1 -w /tmp/syn.pcap port 9000
1415  tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
1416
1417Open another terminal, run nc command::
1418
1419  nstatuser@nstat-a:~$ nc nstat-b 9000
1420
As nstat-b didn't listen on port 9000, it replied with a RST, and
the nc command exited immediately. That was enough for the tcpdump
command to capture a SYN packet. A Linux server might use hardware
offload for the TCP checksum, so the checksum in /tmp/syn.pcap
might not be correct. We call tcprewrite to fix it::
1426
1427  nstatuser@nstat-a:~$ tcprewrite --infile=/tmp/syn.pcap --outfile=/tmp/syn_fixcsum.pcap --fixcsum
1428
1429On nstat-b, we run nc to listen on port 9000::
1430
1431  nstatuser@nstat-b:~$ nc -lkv 9000
1432  Listening on [0.0.0.0] (family 0, port 9000)
1433
On nstat-a, we block the packets from port 9000, otherwise nstat-a would
send a RST to nstat-b::
1436
1437  nstatuser@nstat-a:~$ sudo iptables -A INPUT -p tcp --sport 9000 -j DROP
1438
Send the SYN 3 times to nstat-b::
1440
1441  nstatuser@nstat-a:~$ for i in {1..3}; do sudo tcpreplay -i ens3 /tmp/syn_fixcsum.pcap; done
1442
Check the SNMP counter on nstat-b::
1444
1445  nstatuser@nstat-b:~$ nstat | grep -i skip
1446  TcpExtTCPACKSkippedSynRecv      1                  0.0
1447
1448As we expected, TcpExtTCPACKSkippedSynRecv is 1.
1449
1450TcpExtTCPACKSkippedPAWS
1451----------------------
1452To trigger PAWS, we could send an old SYN.
1453
1454On nstat-b, let nc listen on port 9000::
1455
1456  nstatuser@nstat-b:~$ nc -lkv 9000
1457  Listening on [0.0.0.0] (family 0, port 9000)
1458
1459On nstat-a, run tcpdump to capture a SYN::
1460
1461  nstatuser@nstat-a:~$ sudo tcpdump -w /tmp/paws_pre.pcap -c 1 port 9000
1462  tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
1463
1464On nstat-a, run nc as a client to connect nstat-b::
1465
1466  nstatuser@nstat-a:~$ nc -v nstat-b 9000
1467  Connection to nstat-b 9000 port [tcp/*] succeeded!
1468
Now tcpdump has captured the SYN and exited. We should fix the
checksum::
1471
1472  nstatuser@nstat-a:~$ tcprewrite --infile /tmp/paws_pre.pcap --outfile /tmp/paws.pcap --fixcsum
1473
1474Send the SYN packet twice::
1475
1476  nstatuser@nstat-a:~$ for i in {1..2}; do sudo tcpreplay -i ens3 /tmp/paws.pcap; done
1477
1478On nstat-b, check the snmp counter::
1479
1480  nstatuser@nstat-b:~$ nstat | grep -i skip
1481  TcpExtTCPACKSkippedPAWS         1                  0.0
1482
We sent two SYNs via tcpreplay, and both of them failed the PAWS
check. nstat-b replied with an ACK for the first SYN, skipped the ACK
for the second SYN, and updated TcpExtTCPACKSkippedPAWS.

TcpExtTCPACKSkippedSeq
----------------------
To trigger TcpExtTCPACKSkippedSeq, we send packets that have a valid
timestamp (to pass the PAWS check) but an out-of-window sequence
number. The Linux TCP stack does not skip the ACK if the packet
carries data, so we need a pure ACK packet. To generate such a packet,
we can create two sockets: one on port 9000 and another on port 9001.
Then we capture an ACK on port 9001 and change its source and
destination port numbers to match the port 9000 socket. Replaying this
packet triggers TcpExtTCPACKSkippedSeq.
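To illustrate what "out of window" means here, the sketch below models
the check for a pure ACK with 32-bit wrap-around arithmetic. It is a
deliberate simplification and not the kernel's actual acceptance test.
The function name ``seq_in_window`` and the sample numbers are
hypothetical.

```shell
# Illustrative model only: succeed if a pure ACK's sequence number lies in
# [rcv_nxt, rcv_nxt + rcv_wnd], treating sequence numbers as 32-bit values.
seq_in_window() {
    seq=$1
    rcv_nxt=$2
    rcv_wnd=$3
    # Distance from rcv_nxt to seq, modulo 2^32.
    offset=$(( (seq - rcv_nxt) & 0xFFFFFFFF ))
    [ "$offset" -le "$rcv_wnd" ]
}

# A sequence number close to rcv_nxt is inside the window.
if seq_in_window 1500 1000 65535; then echo in-window; else echo out-of-window; fi
# A sequence number taken from an unrelated connection is almost always outside.
if seq_in_window 3000000000 1000 65535; then echo in-window; else echo out-of-window; fi
```

The rewritten packet carries the sequence number of the port 9001
connection, which is unrelated to what the port 9000 socket expects, so
it is very likely to fall outside the window.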

On nstat-b, open two terminals and run two nc commands to listen on
port 9000 and port 9001::

  nstatuser@nstat-b:~$ nc -lkv 9000
  Listening on [0.0.0.0] (family 0, port 9000)

  nstatuser@nstat-b:~$ nc -lkv 9001
  Listening on [0.0.0.0] (family 0, port 9001)

On nstat-a, run two nc clients::

  nstatuser@nstat-a:~$ nc -v nstat-b 9000
  Connection to nstat-b 9000 port [tcp/*] succeeded!

  nstatuser@nstat-a:~$ nc -v nstat-b 9001
  Connection to nstat-b 9001 port [tcp/*] succeeded!

On nstat-a, run tcpdump to capture an ACK::

  nstatuser@nstat-a:~$ sudo tcpdump -w /tmp/seq_pre.pcap -c 1 dst port 9001
  tcpdump: listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes

On nstat-b, send a packet via the port 9001 socket. In our example, we
send the string 'foo'::

  nstatuser@nstat-b:~$ nc -lkv 9001
  Listening on [0.0.0.0] (family 0, port 9001)
  Connection from nstat-a 42132 received!
  foo

On nstat-a, tcpdump should have captured the ACK. Check the source
port numbers of the two nc clients::

  nstatuser@nstat-a:~$ ss -ta '( dport = :9000 || dport = :9001 )' | tee
  State  Recv-Q   Send-Q         Local Address:Port           Peer Address:Port
  ESTAB  0        0            192.168.122.250:50208       192.168.122.251:9000
  ESTAB  0        0            192.168.122.250:42132       192.168.122.251:9001

Run tcprewrite to change port 9001 to port 9000 and port 42132 to
port 50208::

  nstatuser@nstat-a:~$ tcprewrite --infile /tmp/seq_pre.pcap --outfile /tmp/seq.pcap -r 9001:9000 -r 42132:50208 --fixcsum

Now /tmp/seq.pcap contains the packet we need. Send it to nstat-b
twice::

  nstatuser@nstat-a:~$ for i in {1..2}; do sudo tcpreplay -i ens3 /tmp/seq.pcap; done

Check TcpExtTCPACKSkippedSeq on nstat-b::

  nstatuser@nstat-b:~$ nstat | grep -i skip
  TcpExtTCPACKSkippedSeq          1                  0.0
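By default, nstat prints the increment since its previous run, so
repeating the check can show a different value. As a cross-check, the
absolute value since boot can be read from /proc/net/netstat, where each
group takes two lines: counter names, then values. The helper below is a
minimal sketch for that format. The function name ``netstat_counter`` is
hypothetical, and it accepts the name in the form nstat prints it.

```shell
# Print the absolute value of one counter, named the way nstat names it
# (for example TcpExtTCPACKSkippedSeq), from a /proc/net/netstat-style file.
# Usage: netstat_counter NAME [FILE]
netstat_counter() {
    awk -v name="$1" '
        # Second line of a pair: it has the same prefix as the name line.
        $1 == prefix {
            for (i = 2; i <= NF; i++)
                if ((group names[i]) == name)
                    print $i
            prefix = ""
            next
        }
        # First line of a pair: remember the prefix and the counter names.
        {
            prefix = $1
            group = substr($1, 1, length($1) - 1)   # drop the trailing ":"
            for (i = 2; i <= NF; i++)
                names[i] = $i
        }
    ' "${2:-/proc/net/netstat}"
}

# Only query the live file where it exists (Linux).
if [ -r /proc/net/netstat ]; then
    netstat_counter TcpExtTCPACKSkippedSeq
fi
```

Because the value is absolute, compare readings taken before and after
the replay rather than expecting it to equal 1.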
1549