Revision tags: v6.6.67, v6.6.66, v6.6.65, v6.6.64, v6.6.63, v6.6.62, v6.6.61, v6.6.60, v6.6.59, v6.6.58 |
|
#
7b7fd0ac |
| 17-Oct-2024 |
Andrew Jeffery <andrew@codeconstruct.com.au> |
Merge tag 'v6.6.57' into for/openbmc/dev-6.6
This is the 6.6.57 stable release
|
Revision tags: v6.6.57, v6.6.56, v6.6.55, v6.6.54, v6.6.53, v6.6.52, v6.6.51, v6.6.50, v6.6.49, v6.6.48, v6.6.47, v6.6.46, v6.6.45, v6.6.44, v6.6.43, v6.6.42, v6.6.41, v6.6.40, v6.6.39, v6.6.38, v6.6.37, v6.6.36, v6.6.35, v6.6.34, v6.6.33, v6.6.32, v6.6.31, v6.6.30, v6.6.29, v6.6.28, v6.6.27, v6.6.26, v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4 |
|
#
718c49f8 |
| 14-Sep-2023 |
Aananth V <aananthv@google.com> |
tcp: new TCP_INFO stats for RTO events
[ Upstream commit 3868ab0f192581eff978501a05f3dc2e01541d77 ]
The 2023 SIGCOMM paper "Improving Network Availability with Protective ReRoute" has indicated Lin
tcp: new TCP_INFO stats for RTO events
[ Upstream commit 3868ab0f192581eff978501a05f3dc2e01541d77 ]
The 2023 SIGCOMM paper "Improving Network Availability with Protective ReRoute" has indicated Linux TCP's RTO-triggered txhash rehashing can effectively reduce application disruption during outages. To better measure the efficacy of this feature, this patch adds three more detailed stats during RTO recovery and exports via TCP_INFO. Applications and monitoring systems can leverage this data to measure the network path diversity and end-to-end repair latency during network outages to improve their network infrastructure.
The following counters are added to tcp_sock in order to track RTO events over the lifetime of a TCP socket.
1. u16 total_rto - Counts the total number of RTO timeouts. 2. u16 total_rto_recoveries - Counts the total number of RTO recoveries. 3. u32 total_rto_time - Counts the total time spent (ms) in RTO recoveries. (time spent in CA_Loss and CA_Recovery states)
To compute total_rto_time, we add a new u32 rto_stamp field to tcp_sock. rto_stamp records the start timestamp (ms) of the last RTO recovery (CA_Loss).
Corresponding fields are also added to the tcp_info struct.
Signed-off-by: Aananth V <aananthv@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Stable-dep-of: 27c80efcc204 ("tcp: fix TFO SYN_RECV to not zero retrans_stamp with retransmits out") Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
7e24a55b |
| 04-Aug-2024 |
Andrew Jeffery <andrew@codeconstruct.com.au> |
Merge tag 'v6.6.44' into for/openbmc/dev-6.6
This is the 6.6.44 stable release
|
#
bc4f9c2c |
| 28-May-2024 |
Eric Dumazet <edumazet@google.com> |
tcp: fix race in tcp_write_err()
[ Upstream commit 853c3bd7b7917670224c9fe5245bd045cac411dd ]
I noticed flakes in a packetdrill test, expecting an epoll_wait() to return EPOLLERR | EPOLLHUP on a fa
tcp: fix race in tcp_write_err()
[ Upstream commit 853c3bd7b7917670224c9fe5245bd045cac411dd ]
I noticed flakes in a packetdrill test, expecting an epoll_wait() to return EPOLLERR | EPOLLHUP on a failed connect() attempt, after multiple SYN retransmits. It sometimes return EPOLLERR only.
The issue is that tcp_write_err(): 1) writes an error in sk->sk_err, 2) calls sk_error_report(), 3) then calls tcp_done().
tcp_done() is writing SHUTDOWN_MASK into sk->sk_shutdown, among other things.
Problem is that the awaken user thread (from 2) sk_error_report()) might call tcp_poll() before tcp_done() has written sk->sk_shutdown.
tcp_poll() only sees a non zero sk->sk_err and returns EPOLLERR.
This patch fixes the issue by making sure to call sk_error_report() after tcp_done().
tcp_write_err() also lacks an smp_wmb().
We can reuse tcp_done_with_error() to factor out the details, as Neal suggested.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20240528125253.1966136-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
6e4f6b5e |
| 18-Jul-2024 |
Andrew Jeffery <andrew@codeconstruct.com.au> |
Merge tag 'v6.6.41' into dev-6.6
This is the 6.6.41 stable release
|
#
dfcdd7f8 |
| 09-Jul-2024 |
Eric Dumazet <edumazet@google.com> |
tcp: avoid too many retransmit packets
commit 97a9063518f198ec0adb2ecb89789de342bb8283 upstream.
If a TCP socket is using TCP_USER_TIMEOUT, and the other peer retracted its window to zero, tcp_retr
tcp: avoid too many retransmit packets
commit 97a9063518f198ec0adb2ecb89789de342bb8283 upstream.
If a TCP socket is using TCP_USER_TIMEOUT, and the other peer retracted its window to zero, tcp_retransmit_timer() can retransmit a packet every two jiffies (2 ms for HZ=1000), for about 4 minutes after TCP_USER_TIMEOUT has 'expired'.
The fix is to make sure tcp_rtx_probe0_timed_out() takes icsk->icsk_user_timeout into account.
Before blamed commit, the socket would not timeout after icsk->icsk_user_timeout, but would use standard exponential backoff for the retransmits.
Also worth noting that before commit e89688e3e978 ("net: tcp: fix unexcepted socket die when snd_wnd is 0"), the issue would last 2 minutes instead of 4.
Fixes: b701a99e431d ("tcp: Add tcp_clamp_rto_to_user_timeout() helper to improve accuracy") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Reviewed-by: Jon Maxwell <jmaxwell37@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240710001402.2758273-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
#
b75f281b |
| 07-Jun-2024 |
Eric Dumazet <edumazet@google.com> |
tcp: use signed arithmetic in tcp_rtx_probe0_timed_out()
commit 36534d3c54537bf098224a32dc31397793d4594d upstream.
Due to timer wheel implementation, a timer will usually fire after its schedule.
tcp: use signed arithmetic in tcp_rtx_probe0_timed_out()
commit 36534d3c54537bf098224a32dc31397793d4594d upstream.
Due to timer wheel implementation, a timer will usually fire after its schedule.
For instance, for HZ=1000, a timeout between 512ms and 4s has a granularity of 64ms. For this range of values, the extra delay could be up to 63ms.
For TCP, this means that tp->rcv_tstamp may be after inet_csk(sk)->icsk_timeout whenever the timer interrupt finally triggers, if one packet came during the extra delay.
We need to make sure tcp_rtx_probe0_timed_out() handles this case.
Fixes: e89688e3e978 ("net: tcp: fix unexcepted socket die when snd_wnd is 0") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Menglong Dong <imagedong@tencent.com> Acked-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://lore.kernel.org/r/20240607125652.1472540-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
#
124886cf |
| 03-Jul-2024 |
Neal Cardwell <ncardwell@google.com> |
tcp: fix incorrect undo caused by DSACK of TLP retransmit
[ Upstream commit 0ec986ed7bab6801faed1440e8839dcc710331ff ]
Loss recovery undo_retrans bookkeeping had a long-standing bug where a DSACK f
tcp: fix incorrect undo caused by DSACK of TLP retransmit
[ Upstream commit 0ec986ed7bab6801faed1440e8839dcc710331ff ]
Loss recovery undo_retrans bookkeeping had a long-standing bug where a DSACK from a spurious TLP retransmit packet could cause an erroneous undo of a fast recovery or RTO recovery that repaired a single really-lost packet (in a sequence range outside that of the TLP retransmit). Basically, because the loss recovery state machine didn't account for the fact that it sent a TLP retransmit, the DSACK for the TLP retransmit could erroneously be implicitly be interpreted as corresponding to the normal fast recovery or RTO recovery retransmit that plugged a real hole, thus resulting in an improper undo.
For example, consider the following buggy scenario where there is a real packet loss but the congestion control response is improperly undone because of this bug:
+ send packets P1, P2, P3, P4 + P1 is really lost + send TLP retransmit of P4 + receive SACK for original P2, P3, P4 + enter fast recovery, fast-retransmit P1, increment undo_retrans to 1 + receive DSACK for TLP P4, decrement undo_retrans to 0, undo (bug!) + receive cumulative ACK for P1-P4 (fast retransmit plugged real hole)
The fix: when we initialize undo machinery in tcp_init_undo(), if there is a TLP retransmit in flight, then increment tp->undo_retrans so that we make sure that we receive a DSACK corresponding to the TLP retransmit, as well as DSACKs for all later normal retransmits, before triggering a loss recovery undo. Note that we also have to move the line that clears tp->tlp_high_seq for RTO recovery, so that upon RTO we remember the tp->tlp_high_seq value until tcp_init_undo() and clear it only afterward.
Also note that the bug dates back to the original 2013 TLP implementation, commit 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)").
However, this patch will only compile and work correctly with kernels that have tp->tlp_retrans, which was added only in v5.8 in 2020 in commit 76be93fc0702 ("tcp: allow at most one TLP probe per flight"). So we associate this fix with that later commit.
Fixes: 76be93fc0702 ("tcp: allow at most one TLP probe per flight") Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Kevin Yang <yyd@google.com> Link: https://patch.msgid.link/20240703171246.1739561-1-ncardwell.sw@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
Revision tags: v6.5.3 |
|
#
c900529f |
| 12-Sep-2023 |
Thomas Zimmermann <tzimmermann@suse.de> |
Merge drm/drm-fixes into drm-misc-fixes
Forwarding to v6.6-rc1.
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
|
Revision tags: v6.5.2, v6.1.51, v6.5.1 |
|
#
1ac731c5 |
| 30-Aug-2023 |
Dmitry Torokhov <dmitry.torokhov@gmail.com> |
Merge branch 'next' into for-linus
Prepare input updates for 6.6 merge window.
|
Revision tags: v6.1.50 |
|
#
bd6c11bc |
| 29-Aug-2023 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni: "Core:
- Increase size limits for to-be-sent skb frag allocat
Merge tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni: "Core:
- Increase size limits for to-be-sent skb frag allocations. This allows tun, tap devices and packet sockets to better cope with large writes operations
- Store netdevs in an xarray, to simplify iterating over netdevs
- Refactor nexthop selection for multipath routes
- Improve sched class lifetime handling
- Add backup nexthop ID support for bridge
- Implement drop reasons support in openvswitch
- Several data races annotations and fixes
- Constify the sk parameter of routing functions
- Prepend kernel version to netconsole message
Protocols:
- Implement support for TCP probing the peer being under memory pressure
- Remove hard coded limitation on IPv6 specific info placement inside the socket struct
- Get rid of sysctl_tcp_adv_win_scale and use an auto-estimated per socket scaling factor
- Scaling-up the IPv6 expired route GC via a separated list of expiring routes
- In-kernel support for the TLS alert protocol
- Better support for UDP reuseport with connected sockets
- Add NEXT-C-SID support for SRv6 End.X behavior, reducing the SR header size
- Get rid of additional ancillary per MPTCP connection struct socket
- Implement support for BPF-based MPTCP packet schedulers
- Format MPTCP subtests selftests results in TAP
- Several new SMC 2.1 features including unique experimental options, max connections per lgr negotiation, max links per lgr negotiation
BPF:
- Multi-buffer support in AF_XDP
- Add multi uprobe BPF links for attaching multiple uprobes and usdt probes, which is significantly faster and saves extra fds
- Implement an fd-based tc BPF attach API (TCX) and BPF link support on top of it
- Add SO_REUSEPORT support for TC bpf_sk_assign
- Support new instructions from cpu v4 to simplify the generated code and feature completeness, for x86, arm64, riscv64
- Support defragmenting IPv(4|6) packets in BPF
- Teach verifier actual bounds of bpf_get_smp_processor_id() and fix perf+libbpf issue related to custom section handling
- Introduce bpf map element count and enable it for all program types
- Add a BPF hook in sys_socket() to change the protocol ID from IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy
- Introduce bpf_me_mcache_free_rcu() and fix OOM under stress
- Add uprobe support for the bpf_get_func_ip helper
- Check skb ownership against full socket
- Support for up to 12 arguments in BPF trampoline
- Extend link_info for kprobe_multi and perf_event links
Netfilter:
- Speed-up process exit by aborting ruleset validation if a fatal signal is pending
- Allow NLA_POLICY_MASK to be used with BE16/BE32 types
Driver API:
- Page pool optimizations, to improve data locality and cache usage
- Introduce ndo_hwtstamp_get() and ndo_hwtstamp_set() to avoid the need for raw ioctl() handling in drivers
- Simplify genetlink dump operations (doit/dumpit) providing them the common information already populated in struct genl_info
- Extend and use the yaml devlink specs to [re]generate the split ops
- Introduce devlink selective dumps, to allow SF filtering SF based on handle and other attributes
- Add yaml netlink spec for netlink-raw families, allow route, link and address related queries via the ynl tool
- Remove phylink legacy mode support
- Support offload LED blinking to phy
- Add devlink port function attributes for IPsec
New hardware / drivers:
- Ethernet: - Broadcom ASP 2.0 (72165) ethernet controller - MediaTek MT7988 SoC - Texas Instruments AM654 SoC - Texas Instruments IEP driver - Atheros qca8081 phy - Marvell 88Q2110 phy - NXP TJA1120 phy
- WiFi: - MediaTek mt7981 support
- Can: - Kvaser SmartFusion2 PCI Express devices - Allwinner T113 controllers - Texas Instruments tcan4552/4553 chips
- Bluetooth: - Intel Gale Peak - Qualcomm WCN3988 and WCN7850 - NXP AW693 and IW624 - Mediatek MT2925
Drivers:
- Ethernet NICs: - nVidia/Mellanox: - mlx5: - support UDP encapsulation in packet offload mode - IPsec packet offload support in eswitch mode - improve aRFS observability by adding new set of counters - extends MACsec offload support to cover RoCE traffic - dynamic completion EQs - mlx4: - convert to use auxiliary bus instead of custom interface logic - Intel - ice: - implement switchdev bridge offload, even for LAG interfaces - implement SRIOV support for LAG interfaces - igc: - add support for multiple in-flight TX timestamps - Broadcom: - bnxt: - use the unified RX page pool buffers for XDP and non-XDP - use the NAPI skb allocation cache - OcteonTX2: - support Round Robin scheduling HTB offload - TC flower offload support for SPI field - Freescale: - add XDP_TX feature support - AMD: - ionic: add support for PCI FLR event - sfc: - basic conntrack offload - introduce eth, ipv4 and ipv6 pedit offloads - ST Microelectronics: - stmmac: maximze PTP timestamping resolution
- Virtual NICs: - Microsoft vNIC: - batch ringing RX queue doorbell on receiving packets - add page pool for RX buffers - Virtio vNIC: - add per queue interrupt coalescing support - Google vNIC: - add queue-page-list mode support
- Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add port range matching tc-flower offload - permit enslavement to netdevices with uppers
- Ethernet embedded switches: - Marvell (mv88e6xxx): - convert to phylink_pcs - Renesas: - r8A779fx: add speed change support - rzn1: enables vlan support
- Ethernet PHYs: - convert mv88e6xxx to phylink_pcs
- WiFi: - Qualcomm Wi-Fi 7 (ath12k): - extremely High Throughput (EHT) PHY support - RealTek (rtl8xxxu): - enable AP mode for: RTL8192FU, RTL8710BU (RTL8188GU), RTL8192EU and RTL8723BU - RealTek (rtw89): - Introduce Time Averaged SAR (TAS) support
- Connector: - support for event filtering"
* tag 'net-next-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1806 commits) net: ethernet: mtk_wed: minor change in wed_{tx,rx}info_show net: ethernet: mtk_wed: add some more info in wed_txinfo_show handler net: stmmac: clarify difference between "interface" and "phy_interface" r8152: add vendor/device ID pair for D-Link DUB-E250 devlink: move devlink_notify_register/unregister() to dev.c devlink: move small_ops definition into netlink.c devlink: move tracepoint definitions into core.c devlink: push linecard related code into separate file devlink: push rate related code into separate file devlink: push trap related code into separate file devlink: use tracepoint_enabled() helper devlink: push region related code into separate file devlink: push param related code into separate file devlink: push resource related code into separate file devlink: push dpipe related code into separate file devlink: move and rename devlink_dpipe_send_and_alloc_skb() helper devlink: push shared buffer related code into separate file devlink: push port related code into separate file devlink: push object register/unregister notifications into separate helpers inet: fix IP_TRANSPARENT error handling ...
show more ...
|
Revision tags: v6.5, v6.1.49, v6.1.48 |
|
#
fdebffeb |
| 23-Aug-2023 |
Dave Airlie <airlied@redhat.com> |
BackMerge tag 'v6.5-rc7' into drm-next
Linux 6.5-rc7
This is needed for the CI stuff and the msm pull has fixes in it.
Signed-off-by: Dave Airlie <airlied@redhat.com>
|
#
642073c3 |
| 20-Aug-2023 |
Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
Merge commit b320441c04c9 ("Merge tag 'tty-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty") into tty-next
We need the serial-core fixes in here as well.
Signed-off-by: Greg Kr
Merge commit b320441c04c9 ("Merge tag 'tty-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty") into tty-next
We need the serial-core fixes in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
#
7ff57803 |
| 18-Aug-2023 |
Jakub Kicinski <kuba@kernel.org> |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/ethernet/sfc/tc.c fa165e194997 ("sfc: don't unregister flo
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR.
Conflicts:
drivers/net/ethernet/sfc/tc.c fa165e194997 ("sfc: don't unregister flow_indr if it was never registered") 3bf969e88ada ("sfc: add MAE table machinery for conntrack table") https://lore.kernel.org/all/20230818112159.7430e9b4@canb.auug.org.au/
No adjacent changes.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
0e8860d2 |
| 17-Aug-2023 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge tag 'net-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski: "Including fixes from ipsec and netfilter.
No known outstanding reg
Merge tag 'net-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski: "Including fixes from ipsec and netfilter.
No known outstanding regressions.
Fixes to fixes:
- virtio-net: set queues after driver_ok, avoid a potential race added by recent fix
- Revert "vlan: Fix VLAN 0 memory leak", it may lead to a warning when VLAN 0 is registered explicitly
- nf_tables: - fix false-positive lockdep splat in recent fixes - don't fail inserts if duplicate has expired (fix test failures) - fix races between garbage collection and netns dismantle
Current release - new code bugs:
- mlx5: Fix mlx5_cmd_update_root_ft() error flow
Previous releases - regressions:
- phy: fix IRQ-based wake-on-lan over hibernate / power off
Previous releases - always broken:
- sock: fix misuse of sk_under_memory_pressure() preventing system from exiting global TCP memory pressure if a single cgroup is under pressure
- fix the RTO timer retransmitting skb every 1ms if linear option is enabled
- af_key: fix sadb_x_filter validation, amment netlink policy
- ipsec: fix slab-use-after-free in decode_session6()
- macb: in ZynqMP resume always configure PS GTR for non-wakeup source
Misc:
- netfilter: set default timeout to 3 secs for sctp shutdown send and recv state (from 300ms), align with protocol timers"
* tag 'net-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (49 commits) ice: Block switchdev mode when ADQ is active and vice versa qede: fix firmware halt over suspend and resume net: do not allow gso_size to be set to GSO_BY_FRAGS sock: Fix misuse of sk_under_memory_pressure() sfc: don't fail probe if MAE/TC setup fails sfc: don't unregister flow_indr if it was never registered net: dsa: mv88e6xxx: Wait for EEPROM done before HW reset net/mlx5: Fix mlx5_cmd_update_root_ft() error flow net/mlx5e: XDP, Fix fifo overrun on XDP_REDIRECT i40e: fix misleading debug logs iavf: fix FDIR rule fields masks validation ipv6: fix indentation of a config attribute mailmap: add entries for Simon Horman broadcom: b44: Use b44_writephy() return value net: openvswitch: reject negative ifindex team: Fix incorrect deletion of ETH_P_8021AD protocol vid from slaves net: phy: broadcom: stub c45 read/write for 54810 netfilter: nft_dynset: disallow object maps netfilter: nf_tables: GC transaction race with netns dismantle netfilter: nf_tables: fix GC transaction races with netns and netlink event exit path ...
show more ...
|
Revision tags: v6.1.46, v6.1.45 |
|
#
e4dd0d3a |
| 10-Aug-2023 |
Jason Xing <kernelxing@tencent.com> |
net: fix the RTO timer retransmitting skb every 1ms if linear option is enabled
In the real workload, I encountered an issue which could cause the RTO timer to retransmit the skb per 1ms with linear
net: fix the RTO timer retransmitting skb every 1ms if linear option is enabled
In the real workload, I encountered an issue which could cause the RTO timer to retransmit the skb per 1ms with linear option enabled. The amount of lost-retransmitted skbs can go up to 1000+ instantly.
The root cause is that if the icsk_rto happens to be zero in the 6th round (which is the TCP_THIN_LINEAR_RETRIES value), then it will always be zero due to the changed calculation method in tcp_retransmit_timer() as follows:
icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
Above line could be converted to icsk->icsk_rto = min(0 << 1, TCP_RTO_MAX) = 0
Therefore, the timer expires so quickly without any doubt.
I read through the RFC 6298 and found that the RTO value can be rounded up to a certain value, in Linux, say TCP_RTO_MIN as default, which is regarded as the lower bound in this patch as suggested by Eric.
Fixes: 36e31b0af587 ("net: TCP thin linear timeouts") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
86f03776 |
| 13-Aug-2023 |
David S. Miller <davem@davemloft.net> |
Merge branch 'tcp-oom-probe'
Menglong Dong says:
==================== net: tcp: support probing OOM
In this series, we make some small changes to make the tcp retransmission become zero-window pro
Merge branch 'tcp-oom-probe'
Menglong Dong says:
==================== net: tcp: support probing OOM
In this series, we make some small changes to make the tcp retransmission become zero-window probes if the receiver drops the skb because of memory pressure.
In the 1st patch, we reply a zero-window ACK if the skb is dropped because out of memory, instead of dropping the skb silently.
In the 2nd patch, we allow a zero-window ACK to update the window.
In the 3rd patch, fix unexcepted socket die when snd_wnd is 0 in tcp_retransmit_timer().
In the 4th patch, we refactor the debug message in tcp_retransmit_timer() to make it more correct.
After these changes, the tcp can probe the OOM of the receiver forever.
Changes since v3: - make the timeout "2 * TCP_RTO_MAX" in the 3rd patch - tp->retrans_stamp is not based on jiffies and can't be compared with icsk->icsk_timeout in the 3rd patch. Fix it. - introduce the 4th patch
Changes since v2: - refactor the code to avoid code duplication in the 1st patch - use after() instead of max() in tcp_rtx_probe0_timed_out()
Changes since v1: - send 0 rwin ACK for the receive queue empty case when necessary in the 1st patch - send the ACK immediately by using the ICSK_ACK_NOW flag in the 1st patch - consider the case of the connection restart from idle, as Neal comment, in the 3rd patch ====================
Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
031c44b7 |
| 10-Aug-2023 |
Menglong Dong <imagedong@tencent.com> |
net: tcp: refactor the dbg message in tcp_retransmit_timer()
The debug message in tcp_retransmit_timer() is slightly wrong, because they could be printed even if we did not receive a new ACK packet
net: tcp: refactor the dbg message in tcp_retransmit_timer()
The debug message in tcp_retransmit_timer() is slightly wrong, because they could be printed even if we did not receive a new ACK packet from the remote peer.
Change it to probing zero-window, as it is a expected case now. The description may be not correct.
Adding the duration since the last ACK we received, and the duration of the retransmission, which are useful for debugging.
And the message now like this:
Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 209ms ago, lasting 209ms Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 404ms ago, lasting 408ms Probing zero-window on 127.0.0.1:9999/46946, seq=3737778959:3737791503, recv 812ms ago, lasting 1224ms
Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
e89688e3 |
| 10-Aug-2023 |
Menglong Dong <imagedong@tencent.com> |
net: tcp: fix unexcepted socket die when snd_wnd is 0
In tcp_retransmit_timer(), a window shrunk connection will be regarded as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
net: tcp: fix unexcepted socket die when snd_wnd is 0
In tcp_retransmit_timer(), a window shrunk connection will be regarded as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not right all the time.
The retransmits will become zero-window probes in tcp_retransmit_timer() if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to TCP_RTO_MAX sooner or later.
However, the timer can be delayed and be triggered after 122877ms, not TCP_RTO_MAX, as I tested.
Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true once the RTO come up to TCP_RTO_MAX, and the socket will die.
Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout', which is exact the timestamp of the timeout.
However, "tp->rcv_tstamp" can restart from idle, then tp->rcv_tstamp could already be a long time (minutes or hours) in the past even on the first RTO. So we double check the timeout with the duration of the retransmission.
Meanwhile, making "2 * TCP_RTO_MAX" as the timeout to avoid the socket dying too soon.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Link: https://lore.kernel.org/netdev/CADxym3YyMiO+zMD4zj03YPM3FBi-1LHi6gSD2XT8pyAMM096pg@mail.gmail.com/ Signed-off-by: Menglong Dong <imagedong@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v6.1.44 |
|
#
2612e3bb |
| 07-Aug-2023 |
Rodrigo Vivi <rodrigo.vivi@intel.com> |
Merge drm/drm-next into drm-intel-next
Catching-up with drm-next and drm-intel-gt-next. It will unblock a code refactor around the platform definitions (names vs acronyms).
Signed-off-by: Rodrigo V
Merge drm/drm-next into drm-intel-next
Catching-up with drm-next and drm-intel-gt-next. It will unblock a code refactor around the platform definitions (names vs acronyms).
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
show more ...
|
#
9f771739 |
| 07-Aug-2023 |
Joonas Lahtinen <joonas.lahtinen@linux.intel.com> |
Merge drm/drm-next into drm-intel-gt-next
Need to pull in b3e4aae612ec ("drm/i915/hdcp: Modify hdcp_gsc_message msg sending mechanism") as a dependency for https://patchwork.freedesktop.org/series/1
Merge drm/drm-next into drm-intel-gt-next
Need to pull in b3e4aae612ec ("drm/i915/hdcp: Modify hdcp_gsc_message msg sending mechanism") as a dependency for https://patchwork.freedesktop.org/series/121735/
Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
show more ...
|
#
16fd7539 |
| 06-Aug-2023 |
David S. Miller <davem@davemloft.net> |
Merge branch 'tcp-options-lockless'
Eric Dumazet says:
==================== tcp: set few options locklessly
This series is avoiding the socket lock for six TCP options.
They are not heavily used,
Merge branch 'tcp-options-lockless'
Eric Dumazet says:
==================== tcp: set few options locklessly
This series is avoiding the socket lock for six TCP options.
They are not heavily used, but this exercise can give ideas for other parts of TCP/IP stack :) ====================
Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
a81722dd |
| 04-Aug-2023 |
Eric Dumazet <edumazet@google.com> |
tcp: set TCP_LINGER2 locklessly
tp->linger2 can be set locklessly as long as readers use READ_ONCE().
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@googl
tcp: set TCP_LINGER2 locklessly
tp->linger2 can be set locklessly as long as readers use READ_ONCE().
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
d58f2e15 |
| 04-Aug-2023 |
Eric Dumazet <edumazet@google.com> |
tcp: set TCP_USER_TIMEOUT locklessly
icsk->icsk_user_timeout can be set locklessly, if all read sides use READ_ONCE().
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yega
tcp: set TCP_USER_TIMEOUT locklessly
icsk->icsk_user_timeout can be set locklessly, if all read sides use READ_ONCE().
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
d44fd4a7 |
| 04-Aug-2023 |
Eric Dumazet <edumazet@google.com> |
tcp: set TCP_SYNCNT locklessly
icsk->icsk_syn_retries can safely be set without locking the socket.
We have to add READ_ONCE() annotations in tcp_fastopen_synack_timer() and tcp_write_timeout().
S
tcp: set TCP_SYNCNT locklessly
icsk->icsk_syn_retries can safely be set without locking the socket.
We have to add READ_ONCE() annotations in tcp_fastopen_synack_timer() and tcp_write_timeout().
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|