Revision tags: v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44, v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32, v6.1.31, v6.1.30, v6.1.29, v6.1.28, v6.1.27, v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23, v6.1.22, v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15, v6.1.14, v6.1.13, v6.2, v6.1.12, v6.1.11, v6.1.10, v6.1.9, v6.1.8, v6.1.7, v6.1.6, v6.1.5, v6.0.19, v6.0.18, v6.1.4, v6.1.3, v6.0.17, v6.1.2, v6.0.16, v6.1.1, v6.0.15, v6.0.14, v6.0.13, v6.1, v6.0.12, v6.0.11, v6.0.10, v5.15.80, v6.0.9, v5.15.79, v6.0.8, v5.15.78, v6.0.7, v5.15.77, v5.15.76, v6.0.6, v6.0.5, v5.15.75, v6.0.4 |
|
#
b261eda8 |
| 21-Oct-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
soreuseport: Fix socket selection for SO_INCOMING_CPU.
Kazuho Oku reported that setsockopt(SO_INCOMING_CPU) does not work with setsockopt(SO_REUSEPORT) since v4.6.
With the combination of SO_REUSEP
soreuseport: Fix socket selection for SO_INCOMING_CPU.
Kazuho Oku reported that setsockopt(SO_INCOMING_CPU) does not work with setsockopt(SO_REUSEPORT) since v4.6.
With the combination of SO_REUSEPORT and SO_INCOMING_CPU, we could build a highly efficient server application.
setsockopt(SO_INCOMING_CPU) associates a CPU with a TCP listener or UDP socket, and then incoming packets processed on the CPU will likely be distributed to the socket. Technically, a socket could even receive packets handled on another CPU if no sockets in the reuseport group have the same CPU receiving the flow.
The logic exists in compute_score() so that a socket will get a higher score if it has the same CPU with the flow. However, the score gets ignored after the blamed two commits, which introduced a faster socket selection algorithm for SO_REUSEPORT.
This patch introduces a counter of sockets with SO_INCOMING_CPU in a reuseport group to check if we should iterate all sockets to find a proper one. We increment the counter when
* calling listen() if the socket has SO_INCOMING_CPU and SO_REUSEPORT
* enabling SO_INCOMING_CPU if the socket is in a reuseport group
Also, we decrement it when
* detaching a socket out of the group to apply SO_INCOMING_CPU to migrated TCP requests
* disabling SO_INCOMING_CPU if the socket is in a reuseport group
When the counter reaches 0, we can get back to the O(1) selection algorithm.
The overall changes are negligible for the non-SO_INCOMING_CPU case, and the only notable thing is that we have to update sk_incomnig_cpu under reuseport_lock. Otherwise, the race prevents transitioning to the O(n) algorithm and results in the wrong socket selection.
cpu1 (setsockopt) cpu2 (listen) +-----------------+ +-------------+
lock_sock(sk1) lock_sock(sk2)
reuseport_update_incoming_cpu(sk1, val) . | /* set CPU as 0 */ |- WRITE_ONCE(sk1->incoming_cpu, val) | | spin_lock_bh(&reuseport_lock) | reuseport_grow(sk2, reuse) | . | |- more_socks_size = reuse->max_socks * 2U; | |- if (more_socks_size > U16_MAX && | | reuse->num_closed_socks) | | . | | |- RCU_INIT_POINTER(sk1->sk_reuseport_cb, NULL); | | `- __reuseport_detach_closed_sock(sk1, reuse) | | . | | `- reuseport_put_incoming_cpu(sk1, reuse) | | . | | | /* Read shutdown()ed sk1's sk_incoming_cpu | | | * without lock_sock(). | | | */ | | `- if (sk1->sk_incoming_cpu >= 0) | | . | | | /* decrement not-yet-incremented | | | * count, which is never incremented. | | | */ | | `- __reuseport_put_incoming_cpu(reuse); | | | `- spin_lock_bh(&reuseport_lock) | |- spin_lock_bh(&reuseport_lock) | |- reuse = rcu_dereference_protected(sk1->sk_reuseport_cb, ...) |- if (!reuse) | . | | /* Cannot increment reuse->incoming_cpu. */ | `- goto out; | `- spin_unlock_bh(&reuseport_lock)
Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection") Fixes: c125e80b8868 ("soreuseport: fast reuseport TCP socket selection") Reported-by: Kazuho Oku <kazuhooku@gmail.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
show more ...
|
Revision tags: v6.0.3, v6.0.2, v5.15.74 |
|
#
69421bf9 |
| 14-Oct-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
udp: Update reuse->has_conns under reuseport_lock.
When we call connect() for a UDP socket in a reuseport group, we have to update sk->sk_reuseport_cb->has_conns to 1. Otherwise, the kernel could s
udp: Update reuse->has_conns under reuseport_lock.
When we call connect() for a UDP socket in a reuseport group, we have to update sk->sk_reuseport_cb->has_conns to 1. Otherwise, the kernel could select a unconnected socket wrongly for packets sent to the connected socket.
However, the current way to set has_conns is illegal and possible to trigger that problem. reuseport_has_conns() changes has_conns under rcu_read_lock(), which upgrades the RCU reader to the updater. Then, it must do the update under the updater's lock, reuseport_lock, but it doesn't for now.
For this reason, there is a race below where we fail to set has_conns resulting in the wrong socket selection. To avoid the race, let's split the reader and updater with proper locking.
cpu1 cpu2 +----+ +----+
__ip[46]_datagram_connect() reuseport_grow() . . |- reuseport_has_conns(sk, true) |- more_reuse = __reuseport_alloc(more_socks_size) | . | | |- rcu_read_lock() | |- reuse = rcu_dereference(sk->sk_reuseport_cb) | | | | | /* reuse->has_conns == 0 here */ | | |- more_reuse->has_conns = reuse->has_conns | |- reuse->has_conns = 1 | /* more_reuse->has_conns SHOULD BE 1 HERE */ | | | | | |- rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb, | | | more_reuse) | `- rcu_read_unlock() `- kfree_rcu(reuse, rcu) | |- sk->sk_state = TCP_ESTABLISHED
Note the likely(reuse) in reuseport_has_conns_set() is always true, but we put the test there for ease of review. [0]
For the record, usually, sk_reuseport_cb is changed under lock_sock(). The only exception is reuseport_grow() & TCP reqsk migration case.
1) shutdown() TCP listener, which is moved into the latter part of reuse->socks[] to migrate reqsk.
2) New listen() overflows reuse->socks[] and call reuseport_grow().
3) reuse->max_socks overflows u16 with the new listener.
4) reuseport_grow() pops the old shutdown()ed listener from the array and update its sk->sk_reuseport_cb as NULL without lock_sock().
shutdown()ed TCP sk->sk_reuseport_cb can be changed without lock_sock(), but, reuseport_has_conns_set() is called only for UDP under lock_sock(), so likely(reuse) never be false in reuseport_has_conns_set().
[0]: https://lore.kernel.org/netdev/CANn89iLja=eQHbsM_Ta2sQF0tOGU8vAGrh_izRuuHjuO1ouUag@mail.gmail.com/
Fixes: acdcecc61285 ("udp: correct reuseport selection with connected sockets") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20221014182625.89913-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
show more ...
|
Revision tags: v5.15.73, v6.0.1, v5.15.72, v6.0, v5.15.71, v5.15.70, v5.15.69, v5.15.68, v5.15.67, v5.15.66, v5.15.65, v5.15.64, v5.15.63, v5.15.62, v5.15.61, v5.15.60, v5.15.59, v5.19, v5.15.58, v5.15.57, v5.15.56 |
|
#
4177f545 |
| 15-Jul-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Fix data-races around sysctl_tcp_migrate_req.
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: f9ac779f881c ("net:
tcp: Fix data-races around sysctl_tcp_migrate_req.
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: f9ac779f881c ("net: Introduce net.ipv4.tcp_migrate_req.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
a8df9d04 |
| 14-Oct-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
udp: Update reuse->has_conns under reuseport_lock.
[ Upstream commit 69421bf98482d089e50799f45e48b25ce4a8d154 ]
When we call connect() for a UDP socket in a reuseport group, we have to update sk->s
udp: Update reuse->has_conns under reuseport_lock.
[ Upstream commit 69421bf98482d089e50799f45e48b25ce4a8d154 ]
When we call connect() for a UDP socket in a reuseport group, we have to update sk->sk_reuseport_cb->has_conns to 1. Otherwise, the kernel could select a unconnected socket wrongly for packets sent to the connected socket.
However, the current way to set has_conns is illegal and possible to trigger that problem. reuseport_has_conns() changes has_conns under rcu_read_lock(), which upgrades the RCU reader to the updater. Then, it must do the update under the updater's lock, reuseport_lock, but it doesn't for now.
For this reason, there is a race below where we fail to set has_conns resulting in the wrong socket selection. To avoid the race, let's split the reader and updater with proper locking.
cpu1 cpu2 +----+ +----+
__ip[46]_datagram_connect() reuseport_grow() . . |- reuseport_has_conns(sk, true) |- more_reuse = __reuseport_alloc(more_socks_size) | . | | |- rcu_read_lock() | |- reuse = rcu_dereference(sk->sk_reuseport_cb) | | | | | /* reuse->has_conns == 0 here */ | | |- more_reuse->has_conns = reuse->has_conns | |- reuse->has_conns = 1 | /* more_reuse->has_conns SHOULD BE 1 HERE */ | | | | | |- rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb, | | | more_reuse) | `- rcu_read_unlock() `- kfree_rcu(reuse, rcu) | |- sk->sk_state = TCP_ESTABLISHED
Note the likely(reuse) in reuseport_has_conns_set() is always true, but we put the test there for ease of review. [0]
For the record, usually, sk_reuseport_cb is changed under lock_sock(). The only exception is reuseport_grow() & TCP reqsk migration case.
1) shutdown() TCP listener, which is moved into the latter part of reuse->socks[] to migrate reqsk.
2) New listen() overflows reuse->socks[] and call reuseport_grow().
3) reuse->max_socks overflows u16 with the new listener.
4) reuseport_grow() pops the old shutdown()ed listener from the array and update its sk->sk_reuseport_cb as NULL without lock_sock().
shutdown()ed TCP sk->sk_reuseport_cb can be changed without lock_sock(), but, reuseport_has_conns_set() is called only for UDP under lock_sock(), so likely(reuse) never be false in reuseport_has_conns_set().
[0]: https://lore.kernel.org/netdev/CANn89iLja=eQHbsM_Ta2sQF0tOGU8vAGrh_izRuuHjuO1ouUag@mail.gmail.com/
Fixes: acdcecc61285 ("udp: correct reuseport selection with connected sockets") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20221014182625.89913-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
fcf6c6d8 |
| 15-Jul-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need t
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: f9ac779f881c ("net: Introduce net.ipv4.tcp_migrate_req.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
fcf6c6d8 |
| 15-Jul-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need t
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: f9ac779f881c ("net: Introduce net.ipv4.tcp_migrate_req.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
fcf6c6d8 |
| 15-Jul-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need t
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: f9ac779f881c ("net: Introduce net.ipv4.tcp_migrate_req.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
fcf6c6d8 |
| 15-Jul-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need t
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: f9ac779f881c ("net: Introduce net.ipv4.tcp_migrate_req.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
fcf6c6d8 |
| 15-Jul-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need t
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: f9ac779f881c ("net: Introduce net.ipv4.tcp_migrate_req.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
fcf6c6d8 |
| 15-Jul-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need t
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: f9ac779f881c ("net: Introduce net.ipv4.tcp_migrate_req.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
fcf6c6d8 |
| 15-Jul-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need t
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: f9ac779f881c ("net: Introduce net.ipv4.tcp_migrate_req.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
fcf6c6d8 |
| 15-Jul-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need t
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: f9ac779f881c ("net: Introduce net.ipv4.tcp_migrate_req.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
fcf6c6d8 |
| 15-Jul-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need t
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: f9ac779f881c ("net: Introduce net.ipv4.tcp_migrate_req.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
fcf6c6d8 |
| 15-Jul-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need t
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: f9ac779f881c ("net: Introduce net.ipv4.tcp_migrate_req.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
fcf6c6d8 |
| 15-Jul-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need t
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: f9ac779f881c ("net: Introduce net.ipv4.tcp_migrate_req.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
fcf6c6d8 |
| 15-Jul-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need t
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: f9ac779f881c ("net: Introduce net.ipv4.tcp_migrate_req.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
fcf6c6d8 |
| 15-Jul-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need t
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: f9ac779f881c ("net: Introduce net.ipv4.tcp_migrate_req.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
fcf6c6d8 |
| 15-Jul-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need t
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: f9ac779f881c ("net: Introduce net.ipv4.tcp_migrate_req.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
fcf6c6d8 |
| 15-Jul-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need t
tcp: Fix data-races around sysctl_tcp_migrate_req.
[ Upstream commit 4177f545895b1da08447a80692f30617154efa6e ]
While reading sysctl_tcp_migrate_req, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: f9ac779f881c ("net: Introduce net.ipv4.tcp_migrate_req.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
Revision tags: v5.15.55, v5.15.54, v5.15.53, v5.15.52, v5.15.51, v5.15.50, v5.15.49, v5.15.48, v5.15.47, v5.15.46, v5.15.45, v5.15.44, v5.15.43, v5.15.42, v5.18, v5.15.41, v5.15.40, v5.15.39, v5.15.38, v5.15.37, v5.15.36, v5.15.35, v5.15.34, v5.15.33, v5.15.32, v5.15.31, v5.17, v5.15.30, v5.15.29, v5.15.28, v5.15.27, v5.15.26, v5.15.25, v5.15.24, v5.15.23, v5.15.22, v5.15.21, v5.15.20, v5.15.19, v5.15.18, v5.15.17, v5.4.173, v5.15.16, v5.15.15, v5.16, v5.15.10, v5.15.9, v5.15.8, v5.15.7, v5.15.6, v5.15.5, v5.15.4, v5.15.3, v5.15.2, v5.15.1, v5.15, v5.14.14, v5.14.13, v5.14.12, v5.14.11, v5.14.10, v5.14.9, v5.14.8, v5.14.7, v5.14.6, v5.10.67, v5.10.66, v5.14.5, v5.14.4, v5.10.65, v5.14.3, v5.10.64, v5.14.2, v5.10.63, v5.14.1, v5.10.62, v5.14, v5.10.61, v5.10.60, v5.10.53, v5.10.52, v5.10.51, v5.10.50, v5.10.49, v5.13, v5.10.46 |
|
#
55d444b3 |
| 22-Jun-2021 |
Kuniyuki Iwashima <kuniyu@amazon.co.jp> |
tcp: Add stats for socket migration.
This commit adds two stats for the socket migration feature to evaluate the effectiveness: LINUX_MIB_TCPMIGRATEREQ(SUCCESS|FAILURE).
If the migration fails beca
tcp: Add stats for socket migration.
This commit adds two stats for the socket migration feature to evaluate the effectiveness: LINUX_MIB_TCPMIGRATEREQ(SUCCESS|FAILURE).
If the migration fails because of the own_req race in receiving ACK and sending SYN+ACK paths, we do not increment the failure stat. Then another CPU is responsible for the req.
Link: https://lore.kernel.org/bpf/CAK6E8=cgFKuGecTzSCSQ8z3YJ_163C0uwO9yRvfDSE7vOe9mJA@mail.gmail.com/ Suggested-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
d5e4ddae |
| 12-Jun-2021 |
Kuniyuki Iwashima <kuniyu@amazon.co.jp> |
bpf: Support socket migration by eBPF.
This patch introduces a new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to check if the attached eBPF program is capable of migrating sockets. When the eBPF
bpf: Support socket migration by eBPF.
This patch introduces a new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to check if the attached eBPF program is capable of migrating sockets. When the eBPF program is attached, we run it for socket migration if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE or net.ipv4.tcp_migrate_req is enabled.
Currently, the expected_attach_type is not enforced for the BPF_PROG_TYPE_SK_REUSEPORT type of program. Thus, this commit follows the earlier idea in the commit aac3fc320d94 ("bpf: Post-hooks for sys_bind") to fix up the zero expected_attach_type in bpf_prog_load_fixup_attach_type().
Moreover, this patch adds a new field (migrating_sk) to sk_reuseport_md to select a new listener based on the child socket. migrating_sk varies depending on if it is migrating a request in the accept queue or during 3WHS.
- accept_queue : sock (ESTABLISHED/SYN_RECV) - 3WHS : request_sock (NEW_SYN_RECV)
In the eBPF program, we can select a new listener by BPF_FUNC_sk_select_reuseport(). Also, we can cancel migration by returning SK_DROP. This feature is useful when listeners have different settings at the socket API level or when we want to free resources as soon as possible.
- SK_PASS with selected_sk, select it as a new listener - SK_PASS with selected_sk NULL, fallbacks to the random selection - SK_DROP, cancel the migration.
There is a noteworthy point. We select a listening socket in three places, but we do not have struct skb at closing a listener or retransmitting a SYN+ACK. On the other hand, some helper functions do not expect skb is NULL (e.g. skb_header_pointer() in BPF_FUNC_skb_load_bytes(), skb_tail_pointer() in BPF_FUNC_skb_load_bytes_relative()). So we allocate an empty skb temporarily before running the eBPF program.
Suggested-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/netdev/20201203042402.6cskdlit5f3mw4ru@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/netdev/20201209030903.hhow5r53l6fmozjn@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/bpf/20210612123224.12525-10-kuniyu@amazon.co.jp
show more ...
|
#
1cd62c21 |
| 12-Jun-2021 |
Kuniyuki Iwashima <kuniyu@amazon.co.jp> |
tcp: Add reuseport_migrate_sock() to select a new listener.
reuseport_migrate_sock() does the same check done in reuseport_listen_stop_sock(). If the reuseport group is capable of migration, reusepo
tcp: Add reuseport_migrate_sock() to select a new listener.
reuseport_migrate_sock() does the same check done in reuseport_listen_stop_sock(). If the reuseport group is capable of migration, reuseport_migrate_sock() selects a new listener by the child socket hash and increments the listener's sk_refcnt beforehand. Thus, if we fail in the migration, we have to decrement it later.
We will support migration by eBPF in the later commits.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-5-kuniyu@amazon.co.jp
show more ...
|
#
333bb73f |
| 12-Jun-2021 |
Kuniyuki Iwashima <kuniyu@amazon.co.jp> |
tcp: Keep TCP_CLOSE sockets in the reuseport group.
When we close a listening socket, to migrate its connections to another listener in the same reuseport group, we have to handle two kinds of child
tcp: Keep TCP_CLOSE sockets in the reuseport group.
When we close a listening socket, to migrate its connections to another listener in the same reuseport group, we have to handle two kinds of child sockets. One is that a listening socket has a reference to, and the other is not.
The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the accept queue of their listening socket. So we can pop them out and push them into another listener's queue at close() or shutdown() syscalls. On the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the three-way handshake and not in the accept queue. Thus, we cannot access such sockets at close() or shutdown() syscalls. Accordingly, we have to migrate immature sockets after their listening socket has been closed.
Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At that time, if we could select a new listener from the same reuseport group, no connection would be aborted. However, we cannot do that because reuseport_detach_sock() sets NULL to sk_reuseport_cb and forbids access to the reuseport group from closed sockets.
This patch allows TCP_CLOSE sockets to remain in the reuseport group and access it while any child socket references them. The point is that reuseport_detach_sock() was called twice from inet_unhash() and sk_destruct(). This patch replaces the first reuseport_detach_sock() with reuseport_stop_listen_sock(), which checks if the reuseport group is capable of migration. If capable, it decrements num_socks, moves the socket backwards in socks[] and increments num_closed_socks. When all connections are migrated, sk_destruct() calls reuseport_detach_sock() to remove the socket from socks[], decrement num_closed_socks, and set NULL to sk_reuseport_cb.
By this change, closed or shutdowned sockets can keep sk_reuseport_cb. Consequently, calling listen() after shutdown() can cause EADDRINUSE or EBUSY in inet_csk_bind_conflict() or reuseport_add_sock() which expects such sockets not to have the reuseport group. Therefore, this patch also loosens such validation rules so that a socket can listen again if it has a reuseport group with num_closed_socks more than 0.
When such sockets listen again, we handle them in reuseport_resurrect(). If there is an existing reuseport group (reuseport_add_sock() path), we move the socket from the old group to the new one and free the old one if necessary. If there is no existing group (reuseport_alloc() path), we allocate a new reuseport group, detach sk from the old one, and free it if necessary, not to break the current shutdown behaviour:
- we cannot carry over the eBPF prog of shutdowned sockets - we cannot attach/detach an eBPF prog to/from listening sockets via shutdowned sockets
Note that when the number of sockets gets over U16_MAX, we try to detach a closed socket randomly to make room for the new listening socket in reuseport_grow().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-4-kuniyu@amazon.co.jp
show more ...
|
#
5c040eaf |
| 12-Jun-2021 |
Kuniyuki Iwashima <kuniyu@amazon.co.jp> |
tcp: Add num_closed_socks to struct sock_reuseport.
As noted in the following commit, a closed listener has to hold the reference to the reuseport group for socket migration. This patch adds a field
tcp: Add num_closed_socks to struct sock_reuseport.
As noted in the following commit, a closed listener has to hold the reference to the reuseport group for socket migration. This patch adds a field (num_closed_socks) to struct sock_reuseport to manage closed sockets within the same reuseport group. Moreover, this and the following commits introduce some helper functions to split socks[] into two sections and keep TCP_LISTEN and TCP_CLOSE sockets in each section. Like a double-ended queue, we will place TCP_LISTEN sockets from the front and TCP_CLOSE sockets from the end.
TCP_LISTEN----------> <-------TCP_CLOSE +---+---+ --- +---+ --- +---+ --- +---+ | 0 | 1 | ... | i | ... | j | ... | k | +---+---+ --- +---+ --- +---+ --- +---+
i = num_socks - 1 j = max_socks - num_closed_socks k = max_socks - 1
This patch also extends reuseport_add_sock() and reuseport_grow() to support num_closed_socks.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-3-kuniyu@amazon.co.jp
show more ...
|
Revision tags: v5.10.43, v5.10.42, v5.10.41, v5.10.40, v5.10.39, v5.4.119, v5.10.36, v5.10.35, v5.10.34, v5.4.116, v5.10.33, v5.12, v5.10.32, v5.10.31, v5.10.30, v5.10.27, v5.10.26, v5.10.25, v5.10.24, v5.10.23, v5.10.22, v5.10.21, v5.10.20, v5.10.19, v5.4.101, v5.10.18, v5.10.17, v5.11, v5.10.16, v5.10.15, v5.10.14 |
|
#
fd2ddef0 |
| 06-Jan-2021 |
Baptiste Lepers <baptiste.lepers@gmail.com> |
udp: Prevent reuseport_select_sock from reading uninitialized socks
reuse->socks[] is modified concurrently by reuseport_add_sock. To prevent reading values that have not been fully initialized, onl
udp: Prevent reuseport_select_sock from reading uninitialized socks
reuse->socks[] is modified concurrently by reuseport_add_sock. To prevent reading values that have not been fully initialized, only read the array up until the last known safe index instead of incorrectly re-reading the last index of the array.
Fixes: acdcecc61285f ("udp: correct reuseport selection with connected sockets") Signed-off-by: Baptiste Lepers <baptiste.lepers@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20210107051110.12247-1-baptiste.lepers@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|