Revision tags: v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48 |
|
#
cba3f178 |
| 17-Aug-2023 |
Eric Dumazet <edumazet@google.com> |
dccp: annotate data-races in dccp_poll()
We changed tcp_poll() over time, bug never updated dccp.
Note that we also could remove dccp instead of maintaining it.
Fixes: 7c657876b63c ("[DCCP]: Initi
dccp: annotate data-races in dccp_poll()
We changed tcp_poll() over time, bug never updated dccp.
Note that we also could remove dccp instead of maintaining it.
Fixes: 7c657876b63c ("[DCCP]: Initial implementation") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230818015820.2701595-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v6.1.46, v6.1.45, v6.1.44 |
|
#
a47e598f |
| 03-Aug-2023 |
Eric Dumazet <edumazet@google.com> |
dccp: fix data-race around dp->dccps_mss_cache
dccp_sendmsg() reads dp->dccps_mss_cache before locking the socket. Same thing in do_dccp_getsockopt().
Add READ_ONCE()/WRITE_ONCE() annotations, and
dccp: fix data-race around dp->dccps_mss_cache
dccp_sendmsg() reads dp->dccps_mss_cache before locking the socket. Same thing in do_dccp_getsockopt().
Add READ_ONCE()/WRITE_ONCE() annotations, and change dccp_sendmsg() to check again dccps_mss_cache after socket is locked.
Fixes: 7c657876b63c ("[DCCP]: Initial implementation") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230803163021.2958262-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34 |
|
#
e1d001fa |
| 09-Jun-2023 |
Breno Leitao <leitao@debian.org> |
net: ioctl: Use kernel memory on protocol ioctl callbacks
Most of the ioctls to net protocols operates directly on userspace argument (arg). Usually doing get_user()/put_user() directly in the ioctl
net: ioctl: Use kernel memory on protocol ioctl callbacks
Most of the ioctls to net protocols operates directly on userspace argument (arg). Usually doing get_user()/put_user() directly in the ioctl callback. This is not flexible, because it is hard to reuse these functions without passing userspace buffers.
Change the "struct proto" ioctls to avoid touching userspace memory and operate on kernel buffers, i.e., all protocol's ioctl callbacks is adapted to operate on a kernel memory other than on userspace (so, no more {put,get}_user() and friends being called in the ioctl callback).
This changes the "struct proto" ioctl format in the following way:
int (*ioctl)(struct sock *sk, int cmd, - unsigned long arg); + int *karg);
(Important to say that this patch does not touch the "struct proto_ops" protocols)
So, the "karg" argument, which is passed to the ioctl callback, is a pointer allocated to kernel space memory (inside a function wrapper). This buffer (karg) may contain input argument (copied from userspace in a prep function) and it might return a value/buffer, which is copied back to userspace if necessary. There is not one-size-fits-all format (that is I am using 'may' above), but basically, there are three type of ioctls:
1) Do not read from userspace, returns a result to userspace 2) Read an input parameter from userspace, and does not return anything to userspace 3) Read an input from userspace, and return a buffer to userspace.
The default case (1) (where no input parameter is given, and an "int" is returned to userspace) encompasses more than 90% of the cases, but there are two other exceptions. Here is a list of exceptions:
* Protocol RAW: * cmd = SIOCGETVIFCNT: * input and output = struct sioc_vif_req * cmd = SIOCGETSGCNT * input and output = struct sioc_sg_req * Explanation: for the SIOCGETVIFCNT case, userspace passes the input argument, which is struct sioc_vif_req. Then the callback populates the struct, which is copied back to userspace.
* Protocol RAW6: * cmd = SIOCGETMIFCNT_IN6 * input and output = struct sioc_mif_req6 * cmd = SIOCGETSGCNT_IN6 * input and output = struct sioc_sg_req6
* Protocol PHONET: * cmd == SIOCPNADDRESOURCE | SIOCPNDELRESOURCE * input int (4 bytes) * Nothing is copied back to userspace.
For the exception cases, functions sock_sk_ioctl_inout() will copy the userspace input, and copy it back to kernel space.
The wrapper that prepare the buffer and put the buffer back to user is sk_ioctl(), so, instead of calling sk->sk_prot->ioctl(), the callee now calls sk_ioctl(), which will handle all cases.
Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20230609152800.830401-1-leitao@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
b144fcaf |
| 14-Jun-2023 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
dccp: Print deprecation notice.
DCCP was marked as Orphan in the MAINTAINERS entry 2 years ago in commit 054c4610bd05 ("MAINTAINERS: dccp: move Gerrit Renker to CREDITS"). It says we haven't heard
dccp: Print deprecation notice.
DCCP was marked as Orphan in the MAINTAINERS entry 2 years ago in commit 054c4610bd05 ("MAINTAINERS: dccp: move Gerrit Renker to CREDITS"). It says we haven't heard from the maintainer for five years, so DCCP is not well maintained for 7 years now.
Recently DCCP only receives updates for bugs, and major distros disable it by default.
Removing DCCP would allow for better organisation of TCP fields to reduce the number of cache lines hit in the fast path.
Let's add a deprecation notice when DCCP socket is created and schedule its removal to 2025.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v6.1.33, v6.1.32, v6.1.31, v6.1.30, v6.1.29, v6.1.28, v6.1.27, v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23, v6.1.22, v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15, v6.1.14, v6.1.13, v6.2, v6.1.12, v6.1.11, v6.1.10, v6.1.9, v6.1.8, v6.1.7, v6.1.6, v6.1.5, v6.0.19, v6.0.18, v6.1.4, v6.1.3, v6.0.17, v6.1.2, v6.0.16, v6.1.1, v6.0.15, v6.0.14, v6.0.13, v6.1, v6.0.12, v6.0.11, v6.0.10, v5.15.80 |
|
#
e0833d1f |
| 18-Nov-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
dccp/tcp: Fixup bhash2 bucket when connect() fails.
If a socket bound to a wildcard address fails to connect(), we only reset saddr and keep the port. Then, we have to fix up the bhash2 bucket; oth
dccp/tcp: Fixup bhash2 bucket when connect() fails.
If a socket bound to a wildcard address fails to connect(), we only reset saddr and keep the port. Then, we have to fix up the bhash2 bucket; otherwise, the bucket has an inconsistent address in the list.
Also, listen() for such a socket will fire the WARN_ON() in inet_csk_get_port(). [0]
Note that when a system runs out of memory, we give up fixing the bucket and unlink sk from bhash and bhash2 by inet_put_port().
[0]: WARNING: CPU: 0 PID: 207 at net/ipv4/inet_connection_sock.c:548 inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1)) Modules linked in: CPU: 0 PID: 207 Comm: bhash2_prev_rep Not tainted 6.1.0-rc3-00799-gc8421681c845 #63 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.amzn2022.0.1 04/01/2014 RIP: 0010:inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1)) Code: 74 a7 eb 93 48 8b 54 24 18 0f b7 cb 4c 89 e6 4c 89 ff e8 48 b2 ff ff 49 8b 87 18 04 00 00 e9 32 ff ff ff 0f 0b e9 34 ff ff ff <0f> 0b e9 42 ff ff ff 41 8b 7f 50 41 8b 4f 54 89 fe 81 f6 00 00 ff RSP: 0018:ffffc900003d7e50 EFLAGS: 00010202 RAX: ffff8881047fb500 RBX: 0000000000004e20 RCX: 0000000000000000 RDX: 000000000000000a RSI: 00000000fffffe00 RDI: 00000000ffffffff RBP: ffffffff8324dc00 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000001 R14: 0000000000004e20 R15: ffff8881054e1280 FS: 00007f8ac04dc740(0000) GS:ffff88842fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020001540 CR3: 00000001055fa003 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> inet_csk_listen_start (net/ipv4/inet_connection_sock.c:1205) inet_listen (net/ipv4/af_inet.c:228) __sys_listen (net/socket.c:1810) __x64_sys_listen (net/socket.c:1819 net/socket.c:1817 net/socket.c:1817) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) RIP: 0033:0x7f8ac051de5d Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48 RSP: 002b:00007ffc1c177248 EFLAGS: 00000206 ORIG_RAX: 0000000000000032 RAX: ffffffffffffffda RBX: 0000000020001550 RCX: 00007f8ac051de5d RDX: ffffffffffffff80 RSI: 0000000000000000 RDI: 0000000000000004 RBP: 00007ffc1c177270 R08: 0000000000000018 R09: 0000000000000007 R10: 0000000020001540 R11: 0000000000000206 R12: 00007ffc1c177388 R13: 0000000000401169 R14: 0000000000403e18 R15: 00007f8ac0723000 </TASK>
Fixes: 28044fc1d495 ("net: Add a bhash2 table hashed by port and address") Reported-by: syzbot <syzkaller@googlegroups.com> Reported-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v6.0.9, v5.15.79, v6.0.8, v5.15.78, v6.0.7, v5.15.77, v5.15.76, v6.0.6, v6.0.5, v5.15.75, v6.0.4, v6.0.3 |
|
#
1651951e |
| 19-Oct-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
dccp: Call inet6_destroy_sock() via sk->sk_destruct().
After commit d38afeec26ed ("tcp/udp: Call inet6_destroy_sock() in IPv6 sk->sk_destruct()."), we call inet6_destroy_sock() in sk->sk_destruct()
dccp: Call inet6_destroy_sock() via sk->sk_destruct().
After commit d38afeec26ed ("tcp/udp: Call inet6_destroy_sock() in IPv6 sk->sk_destruct()."), we call inet6_destroy_sock() in sk->sk_destruct() by setting inet6_sock_destruct() to it to make sure we do not leak inet6-specific resources.
DCCP sets its own sk->sk_destruct() in the dccp_init_sock(), and DCCPv6 socket shares it by calling the same init function via dccp_v6_init_sock().
To call inet6_sock_destruct() from DCCPv6 sk->sk_destruct(), we export it and set dccp_v6_sk_destruct() in the init function.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v6.0.2, v5.15.74, v5.15.73, v6.0.1, v5.15.72, v6.0, v5.15.71, v5.15.70, v5.15.69, v5.15.68, v5.15.67, v5.15.66 |
|
#
d1e5e640 |
| 07-Sep-2022 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
tcp: Introduce optional per-netns ehash.
The more sockets we have in the hash table, the longer we spend looking up the socket. While running a number of small workloads on the same host, they pena
tcp: Introduce optional per-netns ehash.
The more sockets we have in the hash table, the longer we spend looking up the socket. While running a number of small workloads on the same host, they penalise each other and cause performance degradation.
The root cause might be a single workload that consumes much more resources than the others. It often happens on a cloud service where different workloads share the same computing resource.
On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash entries), after running iperf3 in different netns, creating 24Mi sockets without data transfer in the root netns causes about 10% performance regression for the iperf3's connection.
thash_entries sockets length Gbps 524288 1 1 50.7 24Mi 48 45.1
It is basically related to the length of the list of each hash bucket. For testing purposes to see how performance drops along the length, I set 131072 (1Mi / 8) to thash_entries, and here's the result.
thash_entries sockets length Gbps 131072 1 1 50.7 1Mi 8 49.9 2Mi 16 48.9 4Mi 32 47.3 8Mi 64 44.6 16Mi 128 40.6 24Mi 192 36.3 32Mi 256 32.5 40Mi 320 27.0 48Mi 384 25.0
To resolve the socket lookup degradation, we introduce an optional per-netns hash table for TCP, but it's just ehash, and we still share the global bhash, bhash2 and lhash2.
With a smaller ehash, we can look up non-listener sockets faster and isolate such noisy neighbours. In addition, we can reduce lock contention.
We can control the ehash size by a new sysctl knob. However, depending on workloads, it will require very sensitive tuning, so we disable the feature by default (net.ipv4.tcp_child_ehash_entries == 0). Moreover, we can fall back to using the global ehash in case we fail to allocate enough memory for a new ehash. The maximum size is 16Mi, which is large enough that even if we have 48Mi sockets, the average list length is 3, and regression would be less than 1%.
We can check the current ehash size by another read-only sysctl knob, net.ipv4.tcp_ehash_entries. A negative value means the netns shares the global ehash (per-netns ehash is disabled or failed to allocate memory).
# dmesg | cut -d ' ' -f 5- | grep "established hash" TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)
# sysctl net.ipv4.tcp_ehash_entries net.ipv4.tcp_ehash_entries = 524288 # can be changed by thash_entries
# sysctl net.ipv4.tcp_child_ehash_entries net.ipv4.tcp_child_ehash_entries = 0 # disabled by default
# ip netns add test1 # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries net.ipv4.tcp_ehash_entries = -524288 # share the global ehash
# sysctl -w net.ipv4.tcp_child_ehash_entries=100 net.ipv4.tcp_child_ehash_entries = 100
# ip netns add test2 # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries net.ipv4.tcp_ehash_entries = 128 # own a per-netns ehash with 2^n buckets
When more than two processes in the same netns create per-netns ehash concurrently with different sizes, we need to guarantee the size in one of the following ways:
1) Share the global ehash and create per-netns ehash
First, unshare() with tcp_child_ehash_entries==0. It creates dedicated netns sysctl knobs where we can safely change tcp_child_ehash_entries and clone()/unshare() to create a per-netns ehash.
2) Control write on sysctl by BPF
We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on sysctl knobs.
Note that the global ehash allocated at the boot time is spread over available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate pages for each per-netns ehash depending on the current process's NUMA policy. By default, the allocation is done in the local node only, so the per-netns hash table could fully reside on a random node. Thus, depending on the NUMA policy the netns is created with and the CPU the current thread is running on, we could see some performance differences for highly optimised networking applications.
Note also that the default values of two sysctl knobs depend on the ehash size and should be tuned carefully:
tcp_max_tw_buckets : tcp_child_ehash_entries / 2 tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)
As a bonus, we can dismantle netns faster. Currently, while destroying netns, we call inet_twsk_purge(), which walks through the global ehash. It can be potentially big because it can have many sockets other than TIME_WAIT in all netns. Splitting ehash changes that situation, where it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets in each netns.
With regard to this, we do not free the per-netns ehash in inet_twsk_kill() to avoid UAF while iterating the per-netns ehash in inet_twsk_purge(). Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to keep it protocol-family-independent.
In the future, we could optimise ehash lookup/iteration further by removing netns comparison for the per-netns ehash.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v5.15.65, v5.15.64, v5.15.63 |
|
#
28044fc1 |
| 22-Aug-2022 |
Joanne Koong <joannelkoong@gmail.com> |
net: Add a bhash2 table hashed by port and address
The current bind hashtable (bhash) is hashed by port only. In the socket bind path, we have to check for bind conflicts by traversing the specified
net: Add a bhash2 table hashed by port and address
The current bind hashtable (bhash) is hashed by port only. In the socket bind path, we have to check for bind conflicts by traversing the specified port's inet_bind_bucket while holding the hashbucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()). In instances where there are tons of sockets hashed to the same port at different addresses, the bind conflict check is time-intensive and can cause softirq cpu lockups, as well as stops new tcp connections since __inet_inherit_port() also contests for the spinlock.
This patch adds a second bind table, bhash2, that hashes by port and sk->sk_rcv_saddr (ipv4) and sk->sk_v6_rcv_saddr (ipv6). Searching the bhash2 table leads to significantly faster conflict resolution and less time holding the hashbucket spinlock.
Please note a few things: * There can be the case where the a socket's address changes after it has been bound. There are two cases where this happens:
1) The case where there is a bind() call on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6) and then a connect() call. The kernel will assign the socket an address when it handles the connect()
2) In inet_sk_reselect_saddr(), which is called when rebuilding the sk header and a few pre-conditions are met (eg rerouting fails).
In these two cases, we need to update the bhash2 table by removing the entry for the old address, and add a new entry reflecting the updated address.
* The bhash2 table must have its own lock, even though concurrent accesses on the same port are protected by the bhash lock. Bhash2 must have its own lock to protect against cases where sockets on different ports hash to different bhash hashbuckets but to the same bhash2 hashbucket.
This brings up a few stipulations: 1) When acquiring both the bhash and the bhash2 lock, the bhash2 lock will always be acquired after the bhash lock and released before the bhash lock is released.
2) There are no nested bhash2 hashbucket locks. A bhash2 lock is always acquired+released before another bhash2 lock is acquired+released.
* The bhash table cannot be superseded by the bhash2 table because for bind requests on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6), every socket bound to that port must be checked for a potential conflict. The bhash table is the only source of port->socket associations.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v5.15.62, v5.15.61, v5.15.60, v5.15.59, v5.19, v5.15.58 |
|
#
a41b17ff |
| 29-Jul-2022 |
Hangyu Hua <hbh25y@gmail.com> |
dccp: put dccp_qpolicy_full() and dccp_qpolicy_push() in the same lock
In the case of sk->dccps_qpolicy == DCCPQ_POLICY_PRIO, dccp_qpolicy_full will drop a skb when qpolicy is full. And the lock in
dccp: put dccp_qpolicy_full() and dccp_qpolicy_push() in the same lock
In the case of sk->dccps_qpolicy == DCCPQ_POLICY_PRIO, dccp_qpolicy_full will drop a skb when qpolicy is full. And the lock in dccp_sendmsg is released before sock_alloc_send_skb and then relocked after sock_alloc_send_skb. The following conditions may lead dccp_qpolicy_push to add skb to an already full sk_write_queue:
thread1--->lock thread1--->dccp_qpolicy_full: queue is full. drop a skb thread1--->unlock thread2--->lock thread2--->dccp_qpolicy_full: queue is not full. no need to drop. thread2--->unlock thread1--->lock thread1--->dccp_qpolicy_push: add a skb. queue is full. thread1--->unlock thread2--->lock thread2--->dccp_qpolicy_push: add a skb! thread2--->unlock
Fix this by moving dccp_qpolicy_full.
Fixes: b1308dc015eb ("[DCCP]: Set TX Queue Length Bounds via Sysctl") Signed-off-by: Hangyu Hua <hbh25y@gmail.com> Link: https://lore.kernel.org/r/20220729110027.40569-1-hbh25y@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v5.15.57, v5.15.56, v5.15.55, v5.15.54, v5.15.53, v5.15.52, v5.15.51, v5.15.50, v5.15.49, v5.15.48 |
|
#
593d1ebe |
| 15-Jun-2022 |
Joanne Koong <joannelkoong@gmail.com> |
Revert "net: Add a second bind table hashed by port and address"
This reverts:
commit d5a42de8bdbe ("net: Add a second bind table hashed by port and address") commit 538aaf9b2383 ("selftests: Add t
Revert "net: Add a second bind table hashed by port and address"
This reverts:
commit d5a42de8bdbe ("net: Add a second bind table hashed by port and address") commit 538aaf9b2383 ("selftests: Add test for timing a bind request to a port with a populated bhash entry") Link: https://lore.kernel.org/netdev/20220520001834.2247810-1-kuba@kernel.org/
There are a few things that need to be fixed here: * Updating bhash2 in cases where the socket's rcv saddr changes * Adding bhash2 hashbucket locks
Links to syzbot reports: https://lore.kernel.org/netdev/00000000000022208805e0df247a@google.com/ https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/
Fixes: d5a42de8bdbe ("net: Add a second bind table hashed by port and address") Reported-by: syzbot+015d756bbd1f8b5c8f09@syzkaller.appspotmail.com Reported-by: syzbot+98fd2d1422063b0f8c44@syzkaller.appspotmail.com Reported-by: syzbot+0a847a982613c6438fba@syzkaller.appspotmail.com Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Link: https://lore.kernel.org/r/20220615193213.2419568-1-joannelkoong@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v5.15.47, v5.15.46, v5.15.45, v5.15.44, v5.15.43, v5.15.42, v5.18 |
|
#
d5a42de8 |
| 19-May-2022 |
Joanne Koong <joannelkoong@gmail.com> |
net: Add a second bind table hashed by port and address
We currently have one tcp bind table (bhash) which hashes by port number only. In the socket bind path, we check for bind conflicts by travers
net: Add a second bind table hashed by port and address
We currently have one tcp bind table (bhash) which hashes by port number only. In the socket bind path, we check for bind conflicts by traversing the specified port's inet_bind2_bucket while holding the bucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()).
In instances where there are tons of sockets hashed to the same port at different addresses, checking for a bind conflict is time-intensive and can cause softirq cpu lockups, as well as stops new tcp connections since __inet_inherit_port() also contests for the spinlock.
This patch proposes adding a second bind table, bhash2, that hashes by port and ip address. Searching the bhash2 table leads to significantly faster conflict resolution and less time holding the spinlock.
Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v5.15.41, v5.15.40, v5.15.39 |
|
#
cae3873c |
| 11-May-2022 |
Martin KaFai Lau <kafai@fb.com> |
net: inet: Retire port only listening_hash
The listen sk is currently stored in two hash tables, listening_hash (hashed by port) and lhash2 (hashed by port and address).
After commit 0ee58dad5b06 (
net: inet: Retire port only listening_hash
The listen sk is currently stored in two hash tables, listening_hash (hashed by port) and lhash2 (hashed by port and address).
After commit 0ee58dad5b06 ("net: tcp6: prefer listeners bound to an address") and commit d9fbc7f6431f ("net: tcp: prefer listeners bound to an address"), the TCP-SYN lookup fast path does not use listening_hash.
The commit 05c0b35709c5 ("tcp: seq_file: Replace listening_hash with lhash2") also moved the seq_file (/proc/net/tcp) iteration usage from listening_hash to lhash2.
There are still a few listening_hash usages left. One of them is inet_reuseport_add_sock() which uses the listening_hash to search a listen sk during the listen() system call. This turns out to be very slow on use cases that listen on many different VIPs at a popular port (e.g. 443). [ On top of the slowness in adding to the tail in the IPv6 case ]. The latter patch has a selftest to demonstrate this case.
This patch takes this chance to move all remaining listening_hash usages to lhash2 and then retire listening_hash.
Since most changes need to be done together, it is hard to cut the listening_hash to lhash2 switch into small patches. The changes in this patch is highlighted here for the review purpose.
1. Because of the listening_hash removal, lhash2 can use the sk->sk_nulls_node instead of the icsk->icsk_listen_portaddr_node. This will also keep the sk_unhashed() check to work as is after stop adding sk to listening_hash.
The union is removed from inet_listen_hashbucket because only nulls_head is needed.
2. icsk->icsk_listen_portaddr_node and its helpers are removed.
3. The current lhash2 users needs to iterate with sk_nulls_node instead of icsk_listen_portaddr_node.
One case is in the inet[6]_lhash2_lookup().
Another case is the seq_file iterator in tcp_ipv4.c. One thing to note is sk_nulls_next() is needed because the old inet_lhash2_for_each_icsk_continue() does a "next" first before iterating.
4. Move the remaining listening_hash usage to lhash2
inet_reuseport_add_sock() which this series is trying to improve.
inet_diag.c and mptcp_diag.c are the final two remaining use cases and is moved to lhash2 now also.
Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v5.15.38, v5.15.37, v5.15.36, v5.15.35, v5.15.34 |
|
#
ec095263 |
| 11-Apr-2022 |
Oliver Hartkopp <socketcan@hartkopp.net> |
net: remove noblock parameter from recvmsg() entities
The internal recvmsg() functions have two parameters 'flags' and 'noblock' that were merged inside skb_recv_datagram(). As a follow up patch to
net: remove noblock parameter from recvmsg() entities
The internal recvmsg() functions have two parameters 'flags' and 'noblock' that were merged inside skb_recv_datagram(). As a follow up patch to commit f4b41f062c42 ("net: remove noblock parameter from skb_recv_datagram()") this patch removes the separate 'noblock' parameter for recvmsg().
Analogue to the referenced patch for skb_recv_datagram() the 'flags' and 'noblock' parameters are unnecessarily split up with e.g.
err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT, flags & ~MSG_DONTWAIT, &addr_len);
or in
err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg, sk, msg, size, flags & MSG_DONTWAIT, flags & ~MSG_DONTWAIT, &addr_len);
instead of simply using only flags all the time and check for MSG_DONTWAIT where needed (to preserve for the formerly separated no(n)block condition).
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
show more ...
|
Revision tags: v5.15.33, v5.15.32, v5.15.31, v5.17, v5.15.30, v5.15.29, v5.15.28, v5.15.27, v5.15.26, v5.15.25, v5.15.24, v5.15.23, v5.15.22, v5.15.21, v5.15.20, v5.15.19, v5.15.18, v5.15.17, v5.4.173, v5.15.16, v5.15.15, v5.16, v5.15.10, v5.15.9, v5.15.8, v5.15.7, v5.15.6, v5.15.5 |
|
#
b4a8e749 |
| 22-Nov-2021 |
Kuniyuki Iwashima <kuniyu@amazon.co.jp> |
dccp: Inline dccp_listen_start().
This patch inlines dccp_listen_start() and removes a stale comment in inet_dccp_listen() so that it looks like inet_listen().
Signed-off-by: Kuniyuki Iwashima <kun
dccp: Inline dccp_listen_start().
This patch inlines dccp_listen_start() and removes a stale comment in inet_dccp_listen() so that it looks like inet_listen().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Reviewed-by: Richard Sailer <richard_siegfried@systemli.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
e7049395 |
| 22-Nov-2021 |
Kuniyuki Iwashima <kuniyu@amazon.co.jp> |
dccp/tcp: Remove an unused argument in inet_csk_listen_start().
The commit 1295e2cf3065 ("inet: minor optimization for backlog setting in listen(2)") added change so that sk_max_ack_backlog is initi
dccp/tcp: Remove an unused argument in inet_csk_listen_start().
The commit 1295e2cf3065 ("inet: minor optimization for backlog setting in listen(2)") added change so that sk_max_ack_backlog is initialised earlier in inet_dccp_listen() and inet_listen(). Since then, we no longer use backlog in inet_csk_listen_start(), so let's remove it.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Acked-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Richard Sailer <richard_siegfried@systemli.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v5.15.4, v5.15.3, v5.15.2, v5.15.1, v5.15, v5.14.14, v5.14.13 |
|
#
19757ceb |
| 14-Oct-2021 |
Eric Dumazet <edumazet@google.com> |
tcp: switch orphan_count to bare per-cpu counters
Use of percpu_counter structure to track count of orphaned sockets is causing problems on modern hosts with 256 cpus or more.
Stefan Bach reported
tcp: switch orphan_count to bare per-cpu counters
Use of percpu_counter structure to track count of orphaned sockets is causing problems on modern hosts with 256 cpus or more.
Stefan Bach reported a serious spinlock contention in real workloads, that I was able to reproduce with a netfilter rule dropping incoming FIN packets.
53.56% server [kernel.kallsyms] [k] queued_spin_lock_slowpath | ---queued_spin_lock_slowpath | --53.51%--_raw_spin_lock_irqsave | --53.51%--__percpu_counter_sum tcp_check_oom | |--39.03%--__tcp_close | tcp_close | inet_release | inet6_release | sock_close | __fput | ____fput | task_work_run | exit_to_usermode_loop | do_syscall_64 | entry_SYSCALL_64_after_hwframe | __GI___libc_close | --14.48%--tcp_out_of_resources tcp_write_timeout tcp_retransmit_timer tcp_write_timer_handler tcp_write_timer call_timer_fn expire_timers __run_timers run_timer_softirq __softirqentry_text_start
As explained in commit cf86a086a180 ("net/dst: use a smaller percpu_counter batch for dst entries accounting"), default batch size is too big for the default value of tcp_max_orphans (262144).
But even if we reduce batch sizes, there would still be cases where the estimated count of orphans is beyond the limit, and where tcp_too_many_orphans() has to call the expensive percpu_counter_sum_positive().
One solution is to use plain per-cpu counters, and have a timer to periodically refresh this cache.
Updating this cache every 100ms seems about right, tcp pressure state is not radically changing over shorter periods.
percpu_counter was nice 15 years ago while hosts had less than 16 cpus, not anymore by current standards.
v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m, reported by kernel test robot <lkp@intel.com> Remove unused socket argument from tcp_too_many_orphans()
Fixes: dd24c00191d5 ("net: Use a percpu_counter for orphan_count") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Stefan Bach <sfb@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
8890bdb9 |
| 29-Jul-2022 |
Hangyu Hua <hbh25y@gmail.com> |
dccp: put dccp_qpolicy_full() and dccp_qpolicy_push() in the same lock
[ Upstream commit a41b17ff9dacd22f5f118ee53d82da0f3e52d5e3 ]
In the case of sk->dccps_qpolicy == DCCPQ_POLICY_PRIO, dccp_qpoli
dccp: put dccp_qpolicy_full() and dccp_qpolicy_push() in the same lock
[ Upstream commit a41b17ff9dacd22f5f118ee53d82da0f3e52d5e3 ]
In the case of sk->dccps_qpolicy == DCCPQ_POLICY_PRIO, dccp_qpolicy_full will drop a skb when qpolicy is full. And the lock in dccp_sendmsg is released before sock_alloc_send_skb and then relocked after sock_alloc_send_skb. The following conditions may lead dccp_qpolicy_push to add skb to an already full sk_write_queue:
thread1--->lock thread1--->dccp_qpolicy_full: queue is full. drop a skb thread1--->unlock thread2--->lock thread2--->dccp_qpolicy_full: queue is not full. no need to drop. thread2--->unlock thread1--->lock thread1--->dccp_qpolicy_push: add a skb. queue is full. thread1--->unlock thread2--->lock thread2--->dccp_qpolicy_push: add a skb! thread2--->unlock
Fix this by moving dccp_qpolicy_full.
Fixes: b1308dc015eb ("[DCCP]: Set TX Queue Length Bounds via Sysctl") Signed-off-by: Hangyu Hua <hbh25y@gmail.com> Link: https://lore.kernel.org/r/20220729110027.40569-1-hbh25y@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
300ae3a5 |
| 14-Oct-2021 |
Eric Dumazet <edumazet@google.com> |
tcp: switch orphan_count to bare per-cpu counters
[ Upstream commit 19757cebf0c5016a1f36f7fe9810a9f0b33c0832 ]
Use of percpu_counter structure to track count of orphaned sockets is causing problems
tcp: switch orphan_count to bare per-cpu counters
[ Upstream commit 19757cebf0c5016a1f36f7fe9810a9f0b33c0832 ]
Use of percpu_counter structure to track count of orphaned sockets is causing problems on modern hosts with 256 cpus or more.
Stefan Bach reported a serious spinlock contention in real workloads, that I was able to reproduce with a netfilter rule dropping incoming FIN packets.
53.56% server [kernel.kallsyms] [k] queued_spin_lock_slowpath | ---queued_spin_lock_slowpath | --53.51%--_raw_spin_lock_irqsave | --53.51%--__percpu_counter_sum tcp_check_oom | |--39.03%--__tcp_close | tcp_close | inet_release | inet6_release | sock_close | __fput | ____fput | task_work_run | exit_to_usermode_loop | do_syscall_64 | entry_SYSCALL_64_after_hwframe | __GI___libc_close | --14.48%--tcp_out_of_resources tcp_write_timeout tcp_retransmit_timer tcp_write_timer_handler tcp_write_timer call_timer_fn expire_timers __run_timers run_timer_softirq __softirqentry_text_start
As explained in commit cf86a086a180 ("net/dst: use a smaller percpu_counter batch for dst entries accounting"), default batch size is too big for the default value of tcp_max_orphans (262144).
But even if we reduce batch sizes, there would still be cases where the estimated count of orphans is beyond the limit, and where tcp_too_many_orphans() has to call the expensive percpu_counter_sum_positive().
One solution is to use plain per-cpu counters, and have a timer to periodically refresh this cache.
Updating this cache every 100ms seems about right, tcp pressure state is not radically changing over shorter periods.
percpu_counter was nice 15 years ago while hosts had less than 16 cpus, not anymore by current standards.
v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m, reported by kernel test robot <lkp@intel.com> Remove unused socket argument from tcp_too_many_orphans()
Fixes: dd24c00191d5 ("net: Use a percpu_counter for orphan_count") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Stefan Bach <sfb@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
Revision tags: v5.14.12, v5.14.11, v5.14.10, v5.14.9, v5.14.8, v5.14.7, v5.14.6, v5.10.67, v5.10.66, v5.14.5, v5.14.4, v5.10.65, v5.14.3, v5.10.64, v5.14.2, v5.10.63, v5.14.1, v5.10.62, v5.14, v5.10.61, v5.10.60, v5.10.53, v5.10.52 |
|
#
990c74e3 |
| 19-Jul-2021 |
Vasily Averin <vvs@virtuozzo.com> |
memcg: enable accounting for inet_bin_bucket cache
net namespace can create up to 64K tcp and dccp ports and force kernel to allocate up to several megabytes of memory per netns for inet_bind_bucket
memcg: enable accounting for inet_bin_bucket cache
net namespace can create up to 64K tcp and dccp ports and force kernel to allocate up to several megabytes of memory per netns for inet_bind_bucket objects.
It makes sense to account for them to restrict the host's memory consumption from inside the memcg-limited container.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v5.10.51, v5.10.50, v5.10.49 |
|
#
e3ae2365 |
| 27-Jun-2021 |
Alexander Aring <aahringo@redhat.com> |
net: sock: introduce sk_error_report
This patch introduces a function wrapper to call the sk_error_report callback. That will prepare to add additional handling whenever sk_error_report is called, f
net: sock: introduce sk_error_report
This patch introduces a function wrapper to call the sk_error_report callback. That will prepare to add additional handling whenever sk_error_report is called, for example to trace socket errors.
Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v5.13, v5.10.46, v5.10.43, v5.10.42, v5.10.41, v5.10.40, v5.10.39, v5.4.119, v5.10.36, v5.10.35, v5.10.34, v5.4.116, v5.10.33, v5.12, v5.10.32, v5.10.31, v5.10.30, v5.10.27, v5.10.26, v5.10.25, v5.10.24, v5.10.23, v5.10.22, v5.10.21, v5.10.20, v5.10.19, v5.4.101, v5.10.18, v5.10.17, v5.11, v5.10.16, v5.10.15, v5.10.14, v5.10, v5.8.17, v5.8.16, v5.8.15, v5.9, v5.8.14, v5.8.13, v5.8.12, v5.8.11, v5.8.10, v5.8.9, v5.8.8, v5.8.7, v5.8.6, v5.4.62, v5.8.5, v5.8.4, v5.4.61 |
|
#
df561f66 |
| 23-Aug-2020 |
Gustavo A. R. Silva <gustavoars@kernel.org> |
treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through mar
treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case.
[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
show more ...
|
#
1aecbf18 |
| 22-Aug-2020 |
Miaohe Lin <linmiaohe@huawei.com> |
net: dccp: Convert to use the preferred fallthrough macro
Convert the uses of fallthrough comments to fallthrough macro.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: David S. Mil
net: dccp: Convert to use the preferred fallthrough macro
Convert the uses of fallthrough comments to fallthrough macro.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v5.8.3, v5.4.60, v5.8.2, v5.4.59, v5.8.1, v5.4.58, v5.4.57, v5.4.56, v5.8, v5.7.12, v5.4.55, v5.7.11, v5.4.54 |
|
#
d3c48151 |
| 28-Jul-2020 |
Christoph Hellwig <hch@lst.de> |
net: remove sockptr_advance
sockptr_advance never properly worked. Replace it with _offset variants of copy_from_sockptr and copy_to_sockptr.
Fixes: ba423fdaa589 ("net: add a new sockptr_t type")
net: remove sockptr_advance
sockptr_advance never properly worked. Replace it with _offset variants of copy_from_sockptr and copy_to_sockptr.
Fixes: ba423fdaa589 ("net: add a new sockptr_t type") Reported-by: Jason A. Donenfeld <Jason@zx2c4.com> Reported-by: Ido Schimmel <idosch@idosch.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jason A. Donenfeld <Jason@zx2c4.com> Tested-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
a7b75c5a |
| 23-Jul-2020 |
Christoph Hellwig <hch@lst.de> |
net: pass a sockptr_t into ->setsockopt
Rework the remaining setsockopt code to pass a sockptr_t instead of a plain user pointer. This removes the last remaining set_fs(KERNEL_DS) outside of archit
net: pass a sockptr_t into ->setsockopt
Rework the remaining setsockopt code to pass a sockptr_t instead of a plain user pointer. This removes the last remaining set_fs(KERNEL_DS) outside of architecture specific code.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> [ieee802154] Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v5.7.10, v5.4.53 |
|
#
749c08f8 |
| 20-Jul-2020 |
Richard Sailer <richard_siegfried@systemli.org> |
net: dccp: Add SIOCOUTQ IOCTL support (send buffer fill)
This adds support for the SIOCOUTQ IOCTL to get the send buffer fill of a DCCP socket, like UDP and TCP sockets already have.
Regarding the
net: dccp: Add SIOCOUTQ IOCTL support (send buffer fill)
This adds support for the SIOCOUTQ IOCTL to get the send buffer fill of a DCCP socket, like UDP and TCP sockets already have.
Regarding the used data field: DCCP uses per packet sequence numbers, not per byte, so sequence numbers can't be used like in TCP. sk_wmem_queued is not used by DCCP and always 0, even in test on highly congested paths. Therefore this uses sk_wmem_alloc like in UDP.
Signed-off-by: Richard Sailer <richard_siegfried@systemli.org> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|