Revision tags: v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44, v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32, v6.1.31, v6.1.30, v6.1.29, v6.1.28, v6.1.27, v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23, v6.1.22 |
|
#
8fb1a76a |
| 22-Mar-2023 |
Kui-Feng Lee <kuifeng@meta.com> |
net: Update an existing TCP congestion control algorithm.
This feature lets you immediately transition to another congestion control algorithm or implementation with the same name. Once a name is u
net: Update an existing TCP congestion control algorithm.
This feature lets you immediately transition to another congestion control algorithm or implementation with the same name. Once a name is updated, new connections will apply this new algorithm.
The purpose is to update a customized algorithm implemented in BPF struct_ops with a new version on the flight. The following is an example of using the userspace API implemented in later BPF patches.
link = bpf_map__attach_struct_ops(skel->maps.ca_update_1); ....... err = bpf_link__update_map(link, skel->maps.ca_update_2);
We first load and register an algorithm implemented in BPF struct_ops, then swap it out with a new one using the same name. After that, newly created connections will apply the updated algorithm, while older ones retain the previous version already applied.
This patch also takes this chance to refactor the ca validation into the new tcp_validate_congestion_control() function.
Cc: netdev@vger.kernel.org, Eric Dumazet <edumazet@google.com> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com> Link: https://lore.kernel.org/r/20230323032405.3735486-3-kuifeng@meta.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
show more ...
|
Revision tags: v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15, v6.1.14, v6.1.13, v6.2, v6.1.12, v6.1.11, v6.1.10 |
|
#
400031e0 |
| 01-Feb-2023 |
David Vernet <void@manifault.com> |
bpf: Add __bpf_kfunc tag to all kfuncs
Now that we have the __bpf_kfunc tag, we should use add it to all existing kfuncs to ensure that they'll never be elided in LTO builds.
Signed-off-by: David V
bpf: Add __bpf_kfunc tag to all kfuncs
Now that we have the __bpf_kfunc tag, we should use add it to all existing kfuncs to ensure that they'll never be elided in LTO builds.
Signed-off-by: David Vernet <void@manifault.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/bpf/20230201173016.342758-4-void@manifault.com
show more ...
|
Revision tags: v6.1.9, v6.1.8, v6.1.7, v6.1.6, v6.1.5, v6.0.19, v6.0.18, v6.1.4, v6.1.3, v6.0.17, v6.1.2, v6.0.16, v6.1.1, v6.0.15, v6.0.14, v6.0.13, v6.1, v6.0.12, v6.0.11, v6.0.10, v5.15.80, v6.0.9, v5.15.79, v6.0.8, v5.15.78, v6.0.7, v5.15.77, v5.15.76, v6.0.6, v6.0.5, v5.15.75, v6.0.4, v6.0.3, v6.0.2, v5.15.74, v5.15.73, v6.0.1, v5.15.72, v6.0, v5.15.71, v5.15.70, v5.15.69, v5.15.68, v5.15.67, v5.15.66, v5.15.65, v5.15.64, v5.15.63, v5.15.62, v5.15.61, v5.15.60, v5.15.59, v5.19, v5.15.58, v5.15.57, v5.15.56, v5.15.55, v5.15.54, v5.15.53, v5.15.52, v5.15.51, v5.15.50, v5.15.49, v5.15.48, v5.15.47, v5.15.46, v5.15.45, v5.15.44, v5.15.43, v5.15.42, v5.18, v5.15.41, v5.15.40, v5.15.39, v5.15.38, v5.15.37, v5.15.36, v5.15.35, v5.15.34, v5.15.33 |
|
#
15fcdf6a |
| 05-Apr-2022 |
Ping Gan <jacky_gam_2001@163.com> |
tcp: Add tracepoint for tcp_set_ca_state
The congestion status of a tcp flow may be updated since there is congestion between tcp sender and receiver. It makes sense to add tracepoint for congestion
tcp: Add tracepoint for tcp_set_ca_state
The congestion status of a tcp flow may be updated since there is congestion between tcp sender and receiver. It makes sense to add tracepoint for congestion status set function to summate cc status duration and evaluate the performance of network and congestion algorithm. the backgound of this patch is below.
Link: https://github.com/iovisor/bcc/pull/3899
Signed-off-by: Ping Gan <jacky_gam_2001@163.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220406010956.19656-1-jacky_gam_2001@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
40570375 |
| 05-Apr-2022 |
Eric Dumazet <edumazet@google.com> |
tcp: add accessors to read/set tp->snd_cwnd
We had various bugs over the years with code breaking the assumption that tp->snd_cwnd is greater than zero.
Lately, syzbot reported the WARN_ON_ONCE(!tp
tcp: add accessors to read/set tp->snd_cwnd
We had various bugs over the years with code breaking the assumption that tp->snd_cwnd is greater than zero.
Lately, syzbot reported the WARN_ON_ONCE(!tp->prior_cwnd) added in commit 8b8a321ff72c ("tcp: fix zero cwnd in tcp_cwnd_reduction") can trigger, and without a repro we would have to spend considerable time finding the bug.
Instead of complaining too late, we want to catch where and when tp->snd_cwnd is set to an illegal value.
Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Link: https://lore.kernel.org/r/20220405233538.947344-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v5.15.32, v5.15.31, v5.17, v5.15.30, v5.15.29, v5.15.28 |
|
#
515bb307 |
| 10-Mar-2022 |
Christoph Hellwig <hch@lst.de> |
tcp: unexport tcp_ca_get_key_by_name and tcp_ca_get_name_by_key
Both functions are only used by core networking code.
Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/2
tcp: unexport tcp_ca_get_key_by_name and tcp_ca_get_name_by_key
Both functions are only used by core networking code.
Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220310143229.895319-1-hch@lst.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
3308676e |
| 05-Apr-2022 |
Eric Dumazet <edumazet@google.com> |
tcp: add accessors to read/set tp->snd_cwnd
[ Upstream commit 40570375356c874b1578e05c1dcc3ff7c1322dbe ]
We had various bugs over the years with code breaking the assumption that tp->snd_cwnd is gr
tcp: add accessors to read/set tp->snd_cwnd
[ Upstream commit 40570375356c874b1578e05c1dcc3ff7c1322dbe ]
We had various bugs over the years with code breaking the assumption that tp->snd_cwnd is greater than zero.
Lately, syzbot reported the WARN_ON_ONCE(!tp->prior_cwnd) added in commit 8b8a321ff72c ("tcp: fix zero cwnd in tcp_cwnd_reduction") can trigger, and without a repro we would have to spend considerable time finding the bug.
Instead of complaining too late, we want to catch where and when tp->snd_cwnd is set to an illegal value.
Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Link: https://lore.kernel.org/r/20220405233538.947344-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
3308676e |
| 05-Apr-2022 |
Eric Dumazet <edumazet@google.com> |
tcp: add accessors to read/set tp->snd_cwnd
[ Upstream commit 40570375356c874b1578e05c1dcc3ff7c1322dbe ]
We had various bugs over the years with code breaking the assumption that tp->snd_cwnd is gr
tcp: add accessors to read/set tp->snd_cwnd
[ Upstream commit 40570375356c874b1578e05c1dcc3ff7c1322dbe ]
We had various bugs over the years with code breaking the assumption that tp->snd_cwnd is greater than zero.
Lately, syzbot reported the WARN_ON_ONCE(!tp->prior_cwnd) added in commit 8b8a321ff72c ("tcp: fix zero cwnd in tcp_cwnd_reduction") can trigger, and without a repro we would have to spend considerable time finding the bug.
Instead of complaining too late, we want to catch where and when tp->snd_cwnd is set to an illegal value.
Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Link: https://lore.kernel.org/r/20220405233538.947344-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
Revision tags: v5.15.27, v5.15.26, v5.15.25, v5.15.24, v5.15.23, v5.15.22, v5.15.21, v5.15.20, v5.15.19, v5.15.18, v5.15.17, v5.4.173, v5.15.16, v5.15.15, v5.16, v5.15.10, v5.15.9, v5.15.8, v5.15.7, v5.15.6, v5.15.5, v5.15.4, v5.15.3, v5.15.2, v5.15.1, v5.15, v5.14.14, v5.14.13, v5.14.12, v5.14.11, v5.14.10, v5.14.9, v5.14.8, v5.14.7, v5.14.6, v5.10.67, v5.10.66, v5.14.5, v5.14.4, v5.10.65, v5.14.3, v5.10.64, v5.14.2, v5.10.63, v5.14.1, v5.10.62, v5.14, v5.10.61, v5.10.60, v5.10.53, v5.10.52, v5.10.51, v5.10.50, v5.10.49, v5.13, v5.10.46, v5.10.43, v5.10.42, v5.10.41, v5.10.40, v5.10.39, v5.4.119, v5.10.36, v5.10.35, v5.10.34, v5.4.116 |
|
#
8d432592 |
| 01-May-2021 |
Jonathon Reinhart <jonathon.reinhart@gmail.com> |
net: Only allow init netns to set default tcp cong to a restricted algo
tcp_set_default_congestion_control() is netns-safe in that it writes to &net->ipv4.tcp_congestion_control, but it also sets ca
net: Only allow init netns to set default tcp cong to a restricted algo
tcp_set_default_congestion_control() is netns-safe in that it writes to &net->ipv4.tcp_congestion_control, but it also sets ca->flags |= TCP_CONG_NON_RESTRICTED which is not namespaced. This has the unintended side-effect of changing the global net.ipv4.tcp_allowed_congestion_control sysctl, despite the fact that it is read-only: 97684f0970f6 ("net: Make tcp_allowed_congestion_control readonly in non-init netns")
Resolve this netns "leak" by only allowing the init netns to set the default algorithm to one that is restricted. This restriction could be removed if tcp_allowed_congestion_control were namespace-ified in the future.
This bug was uncovered with https://github.com/JonathonReinhart/linux-netns-sysctl-verify
Fixes: 6670e1524477 ("tcp: Namespace-ify sysctl_tcp_default_congestion_control") Signed-off-by: Jonathon Reinhart <jonathon.reinhart@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
6c1ea8be |
| 01-May-2021 |
Jonathon Reinhart <jonathon.reinhart@gmail.com> |
net: Only allow init netns to set default tcp cong to a restricted algo
commit 8d432592f30fcc34ef5a10aac4887b4897884493 upstream.
tcp_set_default_congestion_control() is netns-safe in that it write
net: Only allow init netns to set default tcp cong to a restricted algo
commit 8d432592f30fcc34ef5a10aac4887b4897884493 upstream.
tcp_set_default_congestion_control() is netns-safe in that it writes to &net->ipv4.tcp_congestion_control, but it also sets ca->flags |= TCP_CONG_NON_RESTRICTED which is not namespaced. This has the unintended side-effect of changing the global net.ipv4.tcp_allowed_congestion_control sysctl, despite the fact that it is read-only: 97684f0970f6 ("net: Make tcp_allowed_congestion_control readonly in non-init netns")
Resolve this netns "leak" by only allowing the init netns to set the default algorithm to one that is restricted. This restriction could be removed if tcp_allowed_congestion_control were namespace-ified in the future.
This bug was uncovered with https://github.com/JonathonReinhart/linux-netns-sysctl-verify
Fixes: 6670e1524477 ("tcp: Namespace-ify sysctl_tcp_default_congestion_control") Signed-off-by: Jonathon Reinhart <jonathon.reinhart@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
Revision tags: v5.10.33, v5.12, v5.10.32, v5.10.31, v5.10.30, v5.10.27, v5.10.26, v5.10.25, v5.10.24, v5.10.23, v5.10.22, v5.10.21, v5.10.20, v5.10.19, v5.4.101, v5.10.18, v5.10.17, v5.11, v5.10.16, v5.10.15, v5.10.14, v5.10 |
|
#
55472017 |
| 19-Nov-2020 |
Alexander Duyck <alexanderduyck@fb.com> |
tcp: Set INET_ECN_xmit configuration in tcp_reinit_congestion_control
When setting congestion control via a BPF program it is seen that the SYN/ACK for packets within a given flow will not include t
tcp: Set INET_ECN_xmit configuration in tcp_reinit_congestion_control
When setting congestion control via a BPF program it is seen that the SYN/ACK for packets within a given flow will not include the ECT0 flag. A bit of simple printk debugging shows that when this is configured without BPF we will see the value INET_ECN_xmit value initialized in tcp_assign_congestion_control however when we configure this via BPF the socket is in the closed state and as such it isn't configured, and I do not see it being initialized when we transition the socket into the listen state. The result of this is that the ECT0 bit is configured based on whatever the default state is for the socket.
Any easy way to reproduce this is to monitor the following with tcpdump: tools/testing/selftests/bpf/test_progs -t bpf_tcp_ca
Without this patch the SYN/ACK will follow whatever the default is. If dctcp all SYN/ACK packets will have the ECT0 bit set, and if it is not then ECT0 will be cleared on all SYN/ACK packets. With this patch applied the SYN/ACK bit matches the value seen on the other packets in the given stream.
Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control") Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v5.8.17, v5.8.16, v5.8.15, v5.9, v5.8.14, v5.8.13, v5.8.12, v5.8.11, v5.8.10, v5.8.9 |
|
#
5050bef8 |
| 10-Sep-2020 |
Neal Cardwell <ncardwell@google.com> |
tcp: Simplify tcp_set_congestion_control() load=false case
Simplify tcp_set_congestion_control() by removing the initialization code path for the !load case.
There are only two call sites for tcp_s
tcp: Simplify tcp_set_congestion_control() load=false case
Simplify tcp_set_congestion_control() by removing the initialization code path for the !load case.
There are only two call sites for tcp_set_congestion_control(). The EBPF call site is the only one that passes load=false; it also passes cap_net_admin=true. Because of that, the exact same behavior can be achieved by removing the special if (!load) branch of the logic. Both before and after this commit, the EBPF case will call bpf_try_module_get(), and if that succeeds then call tcp_reinit_congestion_control() or if that fails then return EBUSY.
Note that this returns the logic to a structure very similar to the structure before: commit 91b5b21c7c16 ("bpf: Add support for changing congestion control") except that the CAP_NET_ADMIN status is passed in as a function argument.
This clean-up was suggested by Martin KaFai Lau.
Suggested-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Cc: Lawrence Brakmo <brakmo@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Kevin Yang <yyd@google.com>
show more ...
|
#
29a94932 |
| 10-Sep-2020 |
Neal Cardwell <ncardwell@google.com> |
tcp: simplify tcp_set_congestion_control(): Always reinitialize
Now that the previous patches ensure that all call sites for tcp_set_congestion_control() want to initialize congestion control, we ca
tcp: simplify tcp_set_congestion_control(): Always reinitialize
Now that the previous patches ensure that all call sites for tcp_set_congestion_control() want to initialize congestion control, we can simplify tcp_set_congestion_control() by removing the reinit argument and the code to support it.
Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Kevin Yang <yyd@google.com> Cc: Lawrence Brakmo <brakmo@fb.com>
show more ...
|
#
8919a9b3 |
| 10-Sep-2020 |
Neal Cardwell <ncardwell@google.com> |
tcp: Only init congestion control if not initialized already
Change tcp_init_transfer() to only initialize congestion control if it has not been initialized already.
With this new approach, we can
tcp: Only init congestion control if not initialized already
Change tcp_init_transfer() to only initialize congestion control if it has not been initialized already.
With this new approach, we can arrange things so that if the EBPF code sets the congestion control by calling setsockopt(TCP_CONGESTION) then tcp_init_transfer() will not re-initialize the CC module.
This is an approach that has the following beneficial properties:
(1) This allows CC module customizations made by the EBPF called in tcp_init_transfer() to persist, and not be wiped out by a later call to tcp_init_congestion_control() in tcp_init_transfer().
(2) Does not flip the order of EBPF and CC init, to avoid causing bugs for existing code upstream that depends on the current order.
(3) Does not cause 2 initializations for for CC in the case where the EBPF called in tcp_init_transfer() wants to set the CC to a new CC algorithm.
(4) Allows follow-on simplifications to the code in net/core/filter.c and net/ipv4/tcp_cong.c, which currently both have some complexity to special-case CC initialization to avoid double CC initialization if EBPF sets the CC.
Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Kevin Yang <yyd@google.com> Cc: Lawrence Brakmo <brakmo@fb.com>
show more ...
|
Revision tags: v5.8.8, v5.8.7, v5.8.6, v5.4.62, v5.8.5, v5.8.4, v5.4.61, v5.8.3, v5.4.60, v5.8.2, v5.4.59, v5.8.1, v5.4.58, v5.4.57, v5.4.56, v5.8, v5.7.12, v5.4.55, v5.7.11, v5.4.54, v5.7.10, v5.4.53, v5.4.52, v5.7.9, v5.7.8, v5.4.51 |
|
#
ce69e563 |
| 08-Jul-2020 |
Christoph Paasch <cpaasch@apple.com> |
tcp: make sure listeners don't initialize congestion-control state
syzkaller found its way into setsockopt with TCP_CONGESTION "cdg". tcp_cdg_init() does a kcalloc to store the gradients. As sk_clon
tcp: make sure listeners don't initialize congestion-control state
syzkaller found its way into setsockopt with TCP_CONGESTION "cdg". tcp_cdg_init() does a kcalloc to store the gradients. As sk_clone_lock just copies all the memory, the allocated pointer will be copied as well, if the app called setsockopt(..., TCP_CONGESTION) on the listener. If now the socket will be destroyed before the congestion-control has properly been initialized (through a call to tcp_init_transfer), we will end up freeing memory that does not belong to that particular socket, opening the door to a double-free:
[ 11.413102] ================================================================== [ 11.414181] BUG: KASAN: double-free or invalid-free in tcp_cleanup_congestion_control+0x58/0xd0 [ 11.415329] [ 11.415560] CPU: 3 PID: 4884 Comm: syz-executor.5 Not tainted 5.8.0-rc2 #80 [ 11.416544] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 [ 11.418148] Call Trace: [ 11.418534] <IRQ> [ 11.418834] dump_stack+0x7d/0xb0 [ 11.419297] print_address_description.constprop.0+0x1a/0x210 [ 11.422079] kasan_report_invalid_free+0x51/0x80 [ 11.423433] __kasan_slab_free+0x15e/0x170 [ 11.424761] kfree+0x8c/0x230 [ 11.425157] tcp_cleanup_congestion_control+0x58/0xd0 [ 11.425872] tcp_v4_destroy_sock+0x57/0x5a0 [ 11.426493] inet_csk_destroy_sock+0x153/0x2c0 [ 11.427093] tcp_v4_syn_recv_sock+0xb29/0x1100 [ 11.427731] tcp_get_cookie_sock+0xc3/0x4a0 [ 11.429457] cookie_v4_check+0x13d0/0x2500 [ 11.433189] tcp_v4_do_rcv+0x60e/0x780 [ 11.433727] tcp_v4_rcv+0x2869/0x2e10 [ 11.437143] ip_protocol_deliver_rcu+0x23/0x190 [ 11.437810] ip_local_deliver+0x294/0x350 [ 11.439566] __netif_receive_skb_one_core+0x15d/0x1a0 [ 11.441995] process_backlog+0x1b1/0x6b0 [ 11.443148] net_rx_action+0x37e/0xc40 [ 11.445361] __do_softirq+0x18c/0x61a [ 11.445881] asm_call_on_stack+0x12/0x20 [ 11.446409] </IRQ> [ 11.446716] do_softirq_own_stack+0x34/0x40 [ 11.447259] do_softirq.part.0+0x26/0x30 [ 11.447827] __local_bh_enable_ip+0x46/0x50 [ 11.448406] ip_finish_output2+0x60f/0x1bc0 [ 11.450109] __ip_queue_xmit+0x71c/0x1b60 [ 11.451861] __tcp_transmit_skb+0x1727/0x3bb0 [ 11.453789] tcp_rcv_state_process+0x3070/0x4d3a [ 11.456810] tcp_v4_do_rcv+0x2ad/0x780 [ 11.457995] __release_sock+0x14b/0x2c0 [ 11.458529] release_sock+0x4a/0x170 [ 11.459005] __inet_stream_connect+0x467/0xc80 [ 11.461435] inet_stream_connect+0x4e/0xa0 [ 11.462043] __sys_connect+0x204/0x270 [ 11.465515] __x64_sys_connect+0x6a/0xb0 [ 11.466088] do_syscall_64+0x3e/0x70 [ 11.466617] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 11.467341] RIP: 0033:0x7f56046dc469 [ 11.467844] Code: Bad RIP value. [ 11.468282] RSP: 002b:00007f5604dccdd8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a [ 11.469326] RAX: ffffffffffffffda RBX: 000000000068bf00 RCX: 00007f56046dc469 [ 11.470379] RDX: 0000000000000010 RSI: 0000000020000000 RDI: 0000000000000004 [ 11.471311] RBP: 00000000ffffffff R08: 0000000000000000 R09: 0000000000000000 [ 11.472286] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 11.473341] R13: 000000000041427c R14: 00007f5604dcd5c0 R15: 0000000000000003 [ 11.474321] [ 11.474527] Allocated by task 4884: [ 11.475031] save_stack+0x1b/0x40 [ 11.475548] __kasan_kmalloc.constprop.0+0xc2/0xd0 [ 11.476182] tcp_cdg_init+0xf0/0x150 [ 11.476744] tcp_init_congestion_control+0x9b/0x3a0 [ 11.477435] tcp_set_congestion_control+0x270/0x32f [ 11.478088] do_tcp_setsockopt.isra.0+0x521/0x1a00 [ 11.478744] __sys_setsockopt+0xff/0x1e0 [ 11.479259] __x64_sys_setsockopt+0xb5/0x150 [ 11.479895] do_syscall_64+0x3e/0x70 [ 11.480395] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 11.481097] [ 11.481321] Freed by task 4872: [ 11.481783] save_stack+0x1b/0x40 [ 11.482230] __kasan_slab_free+0x12c/0x170 [ 11.482839] kfree+0x8c/0x230 [ 11.483240] tcp_cleanup_congestion_control+0x58/0xd0 [ 11.483948] tcp_v4_destroy_sock+0x57/0x5a0 [ 11.484502] inet_csk_destroy_sock+0x153/0x2c0 [ 11.485144] tcp_close+0x932/0xfe0 [ 11.485642] inet_release+0xc1/0x1c0 [ 11.486131] __sock_release+0xc0/0x270 [ 11.486697] sock_close+0xc/0x10 [ 11.487145] __fput+0x277/0x780 [ 11.487632] task_work_run+0xeb/0x180 [ 11.488118] __prepare_exit_to_usermode+0x15a/0x160 [ 11.488834] do_syscall_64+0x4a/0x70 [ 11.489326] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Wei Wang fixed a part of these CDG-malloc issues with commit c12014440750 ("tcp: memset ca_priv data to 0 properly").
This patch here fixes the listener-scenario: We make sure that listeners setting the congestion-control through setsockopt won't initialize it (thus CDG never allocates on listeners). For those who use AF_UNSPEC to reuse a socket, tcp_disconnect() is changed to cleanup afterwards.
(The issue can be reproduced at least down to v4.4.x.)
Cc: Wei Wang <weiwan@google.com> Cc: Eric Dumazet <edumazet@google.com> Fixes: 2b0a8c9eee81 ("tcp: add CDG congestion control") Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v5.4.50, v5.7.7, v5.4.49, v5.7.6, v5.7.5, v5.4.48, v5.7.4, v5.7.3, v5.4.47, v5.4.46, v5.7.2, v5.4.45, v5.7.1, v5.4.44, v5.7, v5.4.43, v5.4.42, v5.4.41, v5.4.40, v5.4.39, v5.4.38, v5.4.37, v5.4.36, v5.4.35, v5.4.34, v5.4.33, v5.4.32, v5.4.31, v5.4.30, v5.4.29, v5.6, v5.4.28, v5.4.27, v5.4.26, v5.4.25, v5.4.24, v5.4.23, v5.4.22, v5.4.21, v5.4.20, v5.4.19, v5.4.18, v5.4.17, v5.4.16, v5.5, v5.4.15, v5.4.14, v5.4.13, v5.4.12, v5.4.11, v5.4.10, v5.4.9 |
|
#
0baf26b0 |
| 08-Jan-2020 |
Martin KaFai Lau <kafai@fb.com> |
bpf: tcp: Support tcp_congestion_ops in bpf
This patch makes "struct tcp_congestion_ops" to be the first user of BPF STRUCT_OPS. It allows implementing a tcp_congestion_ops in bpf.
The BPF impleme
bpf: tcp: Support tcp_congestion_ops in bpf
This patch makes "struct tcp_congestion_ops" to be the first user of BPF STRUCT_OPS. It allows implementing a tcp_congestion_ops in bpf.
The BPF implemented tcp_congestion_ops can be used like regular kernel tcp-cc through sysctl and setsockopt. e.g. [root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic net.ipv4.tcp_congestion_control = bpf_cubic
There has been attempt to move the TCP CC to the user space (e.g. CCP in TCP). The common arguments are faster turn around, get away from long-tail kernel versions in production...etc, which are legit points.
BPF has been the continuous effort to join both kernel and userspace upsides together (e.g. XDP to gain the performance advantage without bypassing the kernel). The recent BPF advancements (in particular BTF-aware verifier, BPF trampoline, BPF CO-RE...) made implementing kernel struct ops (e.g. tcp cc) possible in BPF. It allows a faster turnaround for testing algorithm in the production while leveraging the existing (and continue growing) BPF feature/framework instead of building one specifically for userspace TCP CC.
This patch allows write access to a few fields in tcp-sock (in bpf_tcp_ca_btf_struct_access()).
The optional "get_info" is unsupported now. It can be added later. One possible way is to output the info with a btf-id to describe the content.
Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200109003508.3856115-1-kafai@fb.com
show more ...
|
Revision tags: v5.4.8, v5.4.7, v5.4.6, v5.4.5, v5.4.4, v5.4.3, v5.3.15, v5.4.2, v5.4.1, v5.3.14, v5.4, v5.3.13, v5.3.12 |
|
#
9bb59a21 |
| 20-Nov-2019 |
Hangbin Liu <liuhangbin@gmail.com> |
tcp: warn if offset reach the maxlen limit when using snprintf
snprintf returns the number of chars that would be written, not number of chars that were actually written. As such, 'offs' may get lar
tcp: warn if offset reach the maxlen limit when using snprintf
snprintf returns the number of chars that would be written, not number of chars that were actually written. As such, 'offs' may get larger than 'tbl.maxlen', causing the 'tbl.maxlen - offs' being < 0, and since the parameter is size_t, it would overflow.
Since using scnprintf may hide the limit error, while the buffer is still enough now, let's just add a WARN_ON_ONCE in case it reach the limit in future.
v2: Use WARN_ON_ONCE as Jiri and Eric suggested.
Suggested-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v5.3.11, v5.3.10, v5.3.9, v5.3.8, v5.3.7, v5.3.6, v5.3.5, v5.3.4, v5.3.3, v5.3.2, v5.3.1, v5.3, v5.2.14, v5.3-rc8, v5.2.13, v5.2.12, v5.2.11, v5.2.10, v5.2.9, v5.2.8, v5.2.7, v5.2.6, v5.2.5, v5.2.4, v5.2.3, v5.2.2 |
|
#
8d650cde |
| 18-Jul-2019 |
Eric Dumazet <edumazet@google.com> |
tcp: fix tcp_set_congestion_control() use from bpf hook
Neal reported incorrect use of ns_capable() from bpf hook.
bpf_setsockopt(...TCP_CONGESTION...) -> tcp_set_congestion_control() -> ns_ca
tcp: fix tcp_set_congestion_control() use from bpf hook
Neal reported incorrect use of ns_capable() from bpf hook.
bpf_setsockopt(...TCP_CONGESTION...) -> tcp_set_congestion_control() -> ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN) -> ns_capable_common() -> current_cred() -> rcu_dereference_protected(current->cred, 1)
Accessing 'current' in bpf context makes no sense, since packets are processed from softirq context.
As Neal stated : The capability check in tcp_set_congestion_control() was written assuming a system call context, and then was reused from a BPF call site.
The fix is to add a new parameter to tcp_set_congestion_control(), so that the ns_capable() call is only performed under the right context.
Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Lawrence Brakmo <brakmo@fb.com> Reported-by: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v5.2.1, v5.2, v5.1.16, v5.1.15, v5.1.14, v5.1.13, v5.1.12, v5.1.11, v5.1.10, v5.1.9, v5.1.8, v5.1.7, v5.1.6, v5.1.5, v5.1.4 |
|
#
457c8996 |
| 19-May-2019 |
Thomas Gleixner <tglx@linutronix.de> |
treewide: Add SPDX license identifier for missed files
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have EXPORT_.*_SYMBOL_GPL inside which was use
treewide: Add SPDX license identifier for missed files
Add SPDX license identifiers to all files which:
- Have no license information of any form
- Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file
These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is:
GPL-2.0-only
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
Revision tags: v5.1.3, v5.1.2, v5.1.1, v5.0.14, v5.1, v5.0.13, v5.0.12, v5.0.11, v5.0.10, v5.0.9, v5.0.8, v5.0.7, v5.0.6, v5.0.5, v5.0.4, v5.0.3, v4.19.29, v5.0.2, v4.19.28, v5.0.1, v4.19.27, v5.0, v4.19.26, v4.19.25, v4.19.24, v4.19.23, v4.19.22, v4.19.21, v4.19.20, v4.19.19, v4.19.18, v4.19.17, v4.19.16, v4.19.15, v4.19.14, v4.19.13, v4.19.12, v4.19.11, v4.19.10, v4.19.9, v4.19.8, v4.19.7, v4.19.6, v4.19.5, v4.19.4, v4.18.20, v4.19.3, v4.18.19, v4.19.2, v4.18.18, v4.18.17, v4.19.1, v4.19, v4.18.16, v4.18.15, v4.18.14, v4.18.13, v4.18.12, v4.18.11, v4.18.10, v4.18.9, v4.18.7, v4.18.6, v4.18.5, v4.17.18, v4.18.4, v4.18.3, v4.17.17, v4.18.2, v4.17.16, v4.17.15, v4.18.1, v4.18, v4.17.14, v4.17.13, v4.17.12, v4.17.11, v4.17.10, v4.17.9, v4.17.8, v4.17.7, v4.17.6, v4.17.5, v4.17.4, v4.17.3, v4.17.2, v4.17.1, v4.17, v4.16, v4.15, v4.13.16 |
|
#
6670e152 |
| 14-Nov-2017 |
Stephen Hemminger <stephen@networkplumber.org> |
tcp: Namespace-ify sysctl_tcp_default_congestion_control
Make default TCP default congestion control to a per namespace value. This changes default congestion control to a pointer to congestion ops
tcp: Namespace-ify sysctl_tcp_default_congestion_control
Make default TCP default congestion control to a per namespace value. This changes default congestion control to a pointer to congestion ops (rather than implicit as first element of available lsit).
The congestion control setting of new namespaces is inherited from the current setting of the root namespace.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v4.14, v4.13.5, v4.13 |
|
#
ebfa00c5 |
| 25-Aug-2017 |
Sabrina Dubroca <sd@queasysnail.net> |
tcp: fix refcnt leak with ebpf congestion control
There are a few bugs around refcnt handling in the new BPF congestion control setsockopt:
- The new ca is assigned to icsk->icsk_ca_ops even in th
tcp: fix refcnt leak with ebpf congestion control
There are a few bugs around refcnt handling in the new BPF congestion control setsockopt:
- The new ca is assigned to icsk->icsk_ca_ops even in the case where we cannot get a reference on it. This would lead to a use after free, since that ca is going away soon.
- Changing the congestion control case doesn't release the refcnt on the previous ca.
- In the reinit case, we first leak a reference on the old ca, then we call tcp_reinit_congestion_control on the ca that we have just assigned, leading to deinitializing the wrong ca (->release of the new ca on the old ca's data) and releasing the refcount on the ca that we actually want to use.
This is visible by building (for example) BIC as a module and setting net.ipv4.tcp_congestion_control=bic, and using tcp_cong_kern.c from samples/bpf.
This patch fixes the refcount issues, and moves reinit back into tcp core to avoid passing a ca pointer back to BPF.
Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
4faf7839 |
| 03-Aug-2017 |
Yuchung Cheng <ycheng@google.com> |
tcp: fix cwnd undo in Reno and HTCP congestion controls
Using ssthresh to revert cwnd is less reliable when ssthresh is bounded to 2 packets. This patch uses an existing variable in TCP "prior_cwnd"
tcp: fix cwnd undo in Reno and HTCP congestion controls
Using ssthresh to revert cwnd is less reliable when ssthresh is bounded to 2 packets. This patch uses an existing variable in TCP "prior_cwnd" that snapshots the cwnd right before entering fast recovery and RTO recovery in Reno. This fixes the issue discussed in netdev thread: "A buggy behavior for Linux TCP Reno and HTCP" https://www.spinics.net/lists/netdev/msg444955.html
Suggested-by: Neal Cardwell <ncardwell@google.com> Reported-by: Wei Sun <unlcsewsun@gmail.com> Signed-off-by: Yuchung Cheng <ncardwell@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v4.12 |
|
#
91b5b21c |
| 30-Jun-2017 |
Lawrence Brakmo <brakmo@fb.com> |
bpf: Add support for changing congestion control
Added support for changing congestion control for SOCK_OPS bpf programs through the setsockopt bpf helper function. It also adds a new SOCK_OPS op, B
bpf: Add support for changing congestion control
Added support for changing congestion control for SOCK_OPS bpf programs through the setsockopt bpf helper function. It also adds a new SOCK_OPS op, BPF_SOCK_OPS_NEEDS_ECN, that is needed for congestion controls, like dctcp, that need to enable ECN in the SYN packets.
Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
44abafc4 |
| 31-May-2017 |
Yuchung Cheng <ycheng@google.com> |
tcp: disallow cwnd undo when switching congestion control
When the sender switches its congestion control during loss recovery, if the recovery is spurious then it may incorrectly revert cwnd and ss
tcp: disallow cwnd undo when switching congestion control
When the sender switches its congestion control during loss recovery, if the recovery is spurious then it may incorrectly revert cwnd and ssthresh to the older values set by a previous congestion control. Consider a congestion control (like BBR) that does not use ssthresh and keeps it infinite: the connection may incorrectly revert cwnd to an infinite value when switching from BBR to another congestion control.
This patch fixes it by disallowing such cwnd undo operation upon switching congestion control. Note that undo_marker is not reset s.t. the packets that were incorrectly marked lost would be corrected. We only avoid undoing the cwnd in tcp_undo_cwnd_reduction().
Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v4.10.17, v4.10.16, v4.10.15, v4.10.14, v4.10.13 |
|
#
c1201444 |
| 25-Apr-2017 |
Wei Wang <weiwan@google.com> |
tcp: memset ca_priv data to 0 properly
Always zero out ca_priv data in tcp_assign_congestion_control() so that ca_priv data is cleared out during socket creation. Also always zero out ca_priv data i
tcp: memset ca_priv data to 0 properly
Always zero out ca_priv data in tcp_assign_congestion_control() so that ca_priv data is cleared out during socket creation. Also always zero out ca_priv data in tcp_reinit_congestion_control() so that when cc algorithm is changed, ca_priv data is cleared out as well. We should still zero out ca_priv data even in TCP_CLOSE state because user could call connect() on AF_UNSPEC to disconnect the socket and leave it in TCP_CLOSE state and later call setsockopt() to switch cc algorithm on this socket.
Fixes: 2b0a8c9ee ("tcp: add CDG congestion control") Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v4.10.12, v4.10.11, v4.10.10, v4.10.9, v4.10.8, v4.10.7, v4.10.6, v4.10.5, v4.10.4, v4.10.3, v4.10.2, v4.10.1, v4.10, v4.9 |
|
#
e9799183 |
| 21-Nov-2016 |
Florian Westphal <fw@strlen.de> |
tcp: make undo_cwnd mandatory for congestion modules
The undo_cwnd fallback in the stack doubles cwnd based on ssthresh, which un-does reno halving behaviour.
It seems more appropriate to let congc
tcp: make undo_cwnd mandatory for congestion modules
The undo_cwnd fallback in the stack doubles cwnd based on ssthresh, which un-does reno halving behaviour.
It seems more appropriate to let congctl algorithms pair .ssthresh and .undo_cwnd properly. Add a 'tcp_reno_undo_cwnd' function and wire it up for all congestion algorithms that used to rely on the fallback.
Cc: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|