History log of /openbmc/linux/net/ipv4/tcp_bbr.c (Results 1 – 25 of 87)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44, v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32, v6.1.31, v6.1.30, v6.1.29, v6.1.28, v6.1.27, v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23, v6.1.22, v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15, v6.1.14, v6.1.13, v6.2, v6.1.12, v6.1.11, v6.1.10
# 400031e0 01-Feb-2023 David Vernet <void@manifault.com>

bpf: Add __bpf_kfunc tag to all kfuncs

Now that we have the __bpf_kfunc tag, we should use add it to all
existing kfuncs to ensure that they'll never be elided in LTO builds.

Signed-off-by: David V

bpf: Add __bpf_kfunc tag to all kfuncs

Now that we have the __bpf_kfunc tag, we should use add it to all
existing kfuncs to ensure that they'll never be elided in LTO builds.

Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20230201173016.342758-4-void@manifault.com

show more ...


Revision tags: v6.1.9, v6.1.8, v6.1.7, v6.1.6, v6.1.5, v6.0.19, v6.0.18, v6.1.4, v6.1.3, v6.0.17, v6.1.2, v6.0.16, v6.1.1, v6.0.15, v6.0.14, v6.0.13, v6.1, v6.0.12, v6.0.11, v6.0.10, v5.15.80, v6.0.9, v5.15.79, v6.0.8, v5.15.78, v6.0.7, v5.15.77, v5.15.76, v6.0.6, v6.0.5, v5.15.75, v6.0.4, v6.0.3, v6.0.2, v5.15.74, v5.15.73, v6.0.1
# 8032bf12 09-Oct-2022 Jason A. Donenfeld <Jason@zx2c4.com>

treewide: use get_random_u32_below() instead of deprecated function

This is a simple mechanical transformation done by:

@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
(E)

Reviewed-

treewide: use get_random_u32_below() instead of deprecated function

This is a simple mechanical transformation done by:

@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
(E)

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

show more ...


Revision tags: v5.15.72, v6.0, v5.15.71, v5.15.70, v5.15.69, v5.15.68, v5.15.67, v5.15.66, v5.15.65, v5.15.64, v5.15.63, v5.15.62, v5.15.61, v5.15.60, v5.15.59, v5.19, v5.15.58, v5.15.57, v5.15.56
# a4703e31 21-Jul-2022 Kumar Kartikeya Dwivedi <memxor@gmail.com>

bpf: Switch to new kfunc flags infrastructure

Instead of populating multiple sets to indicate some attribute and then
researching the same BTF ID in them, prepare a single unified BTF set
which indi

bpf: Switch to new kfunc flags infrastructure

Instead of populating multiple sets to indicate some attribute and then
researching the same BTF ID in them, prepare a single unified BTF set
which indicates whether a kfunc is allowed to be called, and also its
attributes if any at the same time. Now, only one call is needed to
perform the lookup for both kfunc availability and its attributes.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220721134245.2450-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

show more ...


Revision tags: v5.15.55, v5.15.54, v5.15.53, v5.15.52, v5.15.51, v5.15.50, v5.15.49, v5.15.48, v5.15.47, v5.15.46, v5.15.45, v5.15.44, v5.15.43, v5.15.42, v5.18, v5.15.41, v5.15.40
# 7c4e983c 13-May-2022 Alexander Duyck <alexanderduyck@fb.com>

net: allow gso_max_size to exceed 65536

The code for gso_max_size was added originally to allow for debugging and
workaround of buggy devices that couldn't support TSO with blocks 64K in
size. The o

net: allow gso_max_size to exceed 65536

The code for gso_max_size was added originally to allow for debugging and
workaround of buggy devices that couldn't support TSO with blocks 64K in
size. The original reason for limiting it to 64K was because that was the
existing limits of IPv4 and non-jumbogram IPv6 length fields.

With the addition of Big TCP we can remove this limit and allow the value
to potentially go up to UINT_MAX and instead be limited by the tso_max_size
value.

So in order to support this we need to go through and clean up the
remaining users of the gso_max_size value so that the values will cap at
64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value
so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper
limit for GSO_MAX_SIZE.

v6: (edumazet) fixed a compile error if CONFIG_IPV6=n,
in a new sk_trim_gso_size() helper.
netif_set_tso_max_size() caps the requested TSO size
with GSO_MAX_SIZE.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

show more ...


Revision tags: v5.15.39, v5.15.38, v5.15.37, v5.15.36, v5.15.35, v5.15.34, v5.15.33
# 40570375 05-Apr-2022 Eric Dumazet <edumazet@google.com>

tcp: add accessors to read/set tp->snd_cwnd

We had various bugs over the years with code
breaking the assumption that tp->snd_cwnd is greater
than zero.

Lately, syzbot reported the WARN_ON_ONCE(!tp

tcp: add accessors to read/set tp->snd_cwnd

We had various bugs over the years with code
breaking the assumption that tp->snd_cwnd is greater
than zero.

Lately, syzbot reported the WARN_ON_ONCE(!tp->prior_cwnd) added
in commit 8b8a321ff72c ("tcp: fix zero cwnd in tcp_cwnd_reduction")
can trigger, and without a repro we would have to spend
considerable time finding the bug.

Instead of complaining too late, we want to catch where
and when tp->snd_cwnd is set to an illegal value.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Link: https://lore.kernel.org/r/20220405233538.947344-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

show more ...


Revision tags: v5.15.32, v5.15.31, v5.17, v5.15.30, v5.15.29, v5.15.28, v5.15.27, v5.15.26, v5.15.25, v5.15.24, v5.15.23, v5.15.22, v5.15.21, v5.15.20, v5.15.19, v5.15.18, v5.15.17, v5.4.173, v5.15.16, v5.15.15
# b202d844 14-Jan-2022 Kumar Kartikeya Dwivedi <memxor@gmail.com>

bpf: Remove check_kfunc_call callback and old kfunc BTF ID API

Completely remove the old code for check_kfunc_call to help it work
with modules, and also the callback itself.

The previous commit ad

bpf: Remove check_kfunc_call callback and old kfunc BTF ID API

Completely remove the old code for check_kfunc_call to help it work
with modules, and also the callback itself.

The previous commit adds infrastructure to register all sets and put
them in vmlinux or module BTF, and concatenates all related sets
organized by the hook and the type. Once populated, these sets remain
immutable for the lifetime of the struct btf.

Also, since we don't need the 'owner' module anywhere when doing
check_kfunc_call, drop the 'btf_modp' module parameter from
find_kfunc_desc_btf.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20220114163953.1455836-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

show more ...


Revision tags: v5.16, v5.15.10, v5.15.9, v5.15.8, v5.15.7, v5.15.6, v5.15.5, v5.15.4, v5.15.3, v5.15.2, v5.15.1, v5.15, v5.14.14, v5.14.13, v5.14.12, v5.14.11, v5.14.10
# 0e32dfc8 01-Oct-2021 Kumar Kartikeya Dwivedi <memxor@gmail.com>

bpf: Enable TCP congestion control kfunc from modules

This commit moves BTF ID lookup into the newly added registration
helper, in a way that the bbr, cubic, and dctcp implementation set up
their se

bpf: Enable TCP congestion control kfunc from modules

This commit moves BTF ID lookup into the newly added registration
helper, in a way that the bbr, cubic, and dctcp implementation set up
their sets in the bpf_tcp_ca kfunc_btf_set list, while the ones not
dependent on modules are looked up from the wrapper function.

This lifts the restriction for them to be compiled as built in objects,
and can be loaded as modules if required. Also modify Makefile.modfinal
to call resolve_btfids for each module.

Note that since kernel kfunc_ids never overlap with module kfunc_ids, we
only match the owner for module btf id sets.

See following commits for background on use of:

CONFIG_X86 ifdef:
569c484f9995 (bpf: Limit static tcp-cc functions in the .BTF_ids list to x86)

CONFIG_DYNAMIC_FTRACE ifdef:
7aae231ac93b (bpf: tcp: Limit calling some tcp cc functions to CONFIG_DYNAMIC_FTRACE)

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211002011757.311265-6-memxor@gmail.com

show more ...


# 3308676e 05-Apr-2022 Eric Dumazet <edumazet@google.com>

tcp: add accessors to read/set tp->snd_cwnd

[ Upstream commit 40570375356c874b1578e05c1dcc3ff7c1322dbe ]

We had various bugs over the years with code
breaking the assumption that tp->snd_cwnd is gr

tcp: add accessors to read/set tp->snd_cwnd

[ Upstream commit 40570375356c874b1578e05c1dcc3ff7c1322dbe ]

We had various bugs over the years with code
breaking the assumption that tp->snd_cwnd is greater
than zero.

Lately, syzbot reported the WARN_ON_ONCE(!tp->prior_cwnd) added
in commit 8b8a321ff72c ("tcp: fix zero cwnd in tcp_cwnd_reduction")
can trigger, and without a repro we would have to spend
considerable time finding the bug.

Instead of complaining too late, we want to catch where
and when tp->snd_cwnd is set to an illegal value.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Link: https://lore.kernel.org/r/20220405233538.947344-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>

show more ...


# 3308676e 05-Apr-2022 Eric Dumazet <edumazet@google.com>

tcp: add accessors to read/set tp->snd_cwnd

[ Upstream commit 40570375356c874b1578e05c1dcc3ff7c1322dbe ]

We had various bugs over the years with code
breaking the assumption that tp->snd_cwnd is gr

tcp: add accessors to read/set tp->snd_cwnd

[ Upstream commit 40570375356c874b1578e05c1dcc3ff7c1322dbe ]

We had various bugs over the years with code
breaking the assumption that tp->snd_cwnd is greater
than zero.

Lately, syzbot reported the WARN_ON_ONCE(!tp->prior_cwnd) added
in commit 8b8a321ff72c ("tcp: fix zero cwnd in tcp_cwnd_reduction")
can trigger, and without a repro we would have to spend
considerable time finding the bug.

Instead of complaining too late, we want to catch where
and when tp->snd_cwnd is set to an illegal value.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Link: https://lore.kernel.org/r/20220405233538.947344-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>

show more ...


Revision tags: v5.14.9, v5.14.8, v5.14.7, v5.14.6, v5.10.67, v5.10.66, v5.14.5, v5.14.4, v5.10.65, v5.14.3, v5.10.64, v5.14.2, v5.10.63, v5.14.1, v5.10.62, v5.14, v5.10.61, v5.10.60
# 6de035fe 10-Aug-2021 Neal Cardwell <ncardwell@google.com>

tcp_bbr: fix u32 wrap bug in round logic if bbr_init() called after 2B packets

Currently if BBR congestion control is initialized after more than 2B
packets have been delivered, depending on the pha

tcp_bbr: fix u32 wrap bug in round logic if bbr_init() called after 2B packets

Currently if BBR congestion control is initialized after more than 2B
packets have been delivered, depending on the phase of the
tp->delivered counter the tracking of BBR round trips can get stuck.

The bug arises because if tp->delivered is between 2^31 and 2^32 at
the time the BBR congestion control module is initialized, then the
initialization of bbr->next_rtt_delivered to 0 will cause the logic to
believe that the end of the round trip is still billions of packets in
the future. More specifically, the following check will fail
repeatedly:

!before(rs->prior_delivered, bbr->next_rtt_delivered)

and thus the connection will take up to 2B packets delivered before
that check will pass and the connection will set:

bbr->round_start = 1;

This could cause many mechanisms in BBR to fail to trigger, for
example bbr_check_full_bw_reached() would likely never exit STARTUP.

This bug is 5 years old and has not been observed, and as a practical
matter this would likely rarely trigger, since it would require
transferring at least 2B packets, or likely more than 3 terabytes of
data, before switching congestion control algorithms to BBR.

This patch is a stable candidate for kernels as far back as v4.9,
when tcp_bbr.c was added.

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Kevin Yang <yyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210811024056.235161-1-ncardwell@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

show more ...


# 696afe28 10-Aug-2021 Neal Cardwell <ncardwell@google.com>

tcp_bbr: fix u32 wrap bug in round logic if bbr_init() called after 2B packets

[ Upstream commit 6de035fec045f8ae5ee5f3a02373a18b939e91fb ]

Currently if BBR congestion control is initialized after

tcp_bbr: fix u32 wrap bug in round logic if bbr_init() called after 2B packets

[ Upstream commit 6de035fec045f8ae5ee5f3a02373a18b939e91fb ]

Currently if BBR congestion control is initialized after more than 2B
packets have been delivered, depending on the phase of the
tp->delivered counter the tracking of BBR round trips can get stuck.

The bug arises because if tp->delivered is between 2^31 and 2^32 at
the time the BBR congestion control module is initialized, then the
initialization of bbr->next_rtt_delivered to 0 will cause the logic to
believe that the end of the round trip is still billions of packets in
the future. More specifically, the following check will fail
repeatedly:

!before(rs->prior_delivered, bbr->next_rtt_delivered)

and thus the connection will take up to 2B packets delivered before
that check will pass and the connection will set:

bbr->round_start = 1;

This could cause many mechanisms in BBR to fail to trigger, for
example bbr_check_full_bw_reached() would likely never exit STARTUP.

This bug is 5 years old and has not been observed, and as a practical
matter this would likely rarely trigger, since it would require
transferring at least 2B packets, or likely more than 3 terabytes of
data, before switching congestion control algorithms to BBR.

This patch is a stable candidate for kernels as far back as v4.9,
when tcp_bbr.c was added.

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Kevin Yang <yyd@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210811024056.235161-1-ncardwell@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>

show more ...


Revision tags: v5.10.53, v5.10.52, v5.10.51, v5.10.50, v5.10.49, v5.13, v5.10.46, v5.10.43, v5.10.42, v5.10.41, v5.10.40, v5.10.39, v5.4.119, v5.10.36, v5.10.35, v5.10.34, v5.4.116, v5.10.33, v5.12, v5.10.32, v5.10.31, v5.10.30, v5.10.27, v5.10.26, v5.10.25, v5.10.24, v5.10.23, v5.10.22, v5.10.21, v5.10.20, v5.10.19, v5.4.101, v5.10.18, v5.10.17, v5.11, v5.10.16, v5.10.15, v5.10.14, v5.10
# 1b9e2a8c 16-Nov-2020 Ryan Sharpelletti <sharpelletti@google.com>

tcp: only postpone PROBE_RTT if RTT is < current min_rtt estimate

During loss recovery, retransmitted packets are forced to use TCP
timestamps to calculate the RTT samples, which have a millisecond

tcp: only postpone PROBE_RTT if RTT is < current min_rtt estimate

During loss recovery, retransmitted packets are forced to use TCP
timestamps to calculate the RTT samples, which have a millisecond
granularity. BBR is designed using a microsecond granularity. As a
result, multiple RTT samples could be truncated to the same RTT value
during loss recovery. This is problematic, as BBR will not enter
PROBE_RTT if the RTT sample is <= the current min_rtt sample, meaning
that if there are persistent losses, PROBE_RTT will constantly be
pushed off and potentially never re-entered. This patch makes sure
that BBR enters PROBE_RTT by checking if RTT sample is < the current
min_rtt sample, rather than <=.

The Netflix transport/TCP team discovered this bug in the Linux TCP
BBR code during lab tests.

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Ryan Sharpelletti <sharpelletti@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Link: https://lore.kernel.org/r/20201116174412.1433277-1-sharpelletti.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

show more ...


Revision tags: v5.8.17, v5.8.16, v5.8.15, v5.9, v5.8.14, v5.8.13, v5.8.12, v5.8.11, v5.8.10, v5.8.9, v5.8.8, v5.8.7, v5.8.6, v5.4.62, v5.8.5, v5.8.4, v5.4.61, v5.8.3, v5.4.60, v5.8.2, v5.4.59, v5.8.1, v5.4.58, v5.4.57, v5.4.56, v5.8, v5.7.12, v5.4.55, v5.7.11, v5.4.54, v5.7.10, v5.4.53, v5.4.52, v5.7.9, v5.7.8, v5.4.51, v5.4.50, v5.7.7, v5.4.49, v5.7.6, v5.7.5, v5.4.48, v5.7.4, v5.7.3, v5.4.47, v5.4.46, v5.7.2, v5.4.45, v5.7.1, v5.4.44, v5.7, v5.4.43, v5.4.42, v5.4.41, v5.4.40, v5.4.39, v5.4.38, v5.4.37, v5.4.36, v5.4.35, v5.4.34, v5.4.33, v5.4.32, v5.4.31, v5.4.30, v5.4.29, v5.6, v5.4.28, v5.4.27, v5.4.26, v5.4.25, v5.4.24, v5.4.23, v5.4.22, v5.4.21, v5.4.20, v5.4.19, v5.4.18, v5.4.17, v5.4.16, v5.5, v5.4.15, v5.4.14
# 5b2f1f30 20-Jan-2020 Wen Yang <wenyang@linux.alibaba.com>

tcp_bbr: improve arithmetic division in bbr_update_bw()

do_div() does a 64-by-32 division. Use div64_long() instead of it
if the divisor is long, to avoid truncation to 32-bit.
And as a nice side ef

tcp_bbr: improve arithmetic division in bbr_update_bw()

do_div() does a 64-by-32 division. Use div64_long() instead of it
if the divisor is long, to avoid truncation to 32-bit.
And as a nice side effect also cleans up the function a bit.

Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

show more ...


Revision tags: v5.4.13, v5.4.12, v5.4.11, v5.4.10, v5.4.9, v5.4.8, v5.4.7, v5.4.6, v5.4.5, v5.4.4
# 7c68fa2b 16-Dec-2019 Eric Dumazet <edumazet@google.com>

net: annotate lockless accesses to sk->sk_pacing_shift

sk->sk_pacing_shift can be read and written without lock
synchronization. This patch adds annotations to
document this fact and avoid future sy

net: annotate lockless accesses to sk->sk_pacing_shift

sk->sk_pacing_shift can be read and written without lock
synchronization. This patch adds annotations to
document this fact and avoid future syzbot complains.

This might also avoid unexpected false sharing
in sk_pacing_shift_update(), as the compiler
could remove the conditional check and always
write over sk->sk_pacing_shift :

if (sk->sk_pacing_shift != val)
sk->sk_pacing_shift = val;

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

show more ...


Revision tags: v5.4.3, v5.3.15, v5.4.2, v5.4.1, v5.3.14, v5.4, v5.3.13, v5.3.12, v5.3.11, v5.3.10, v5.3.9, v5.3.8, v5.3.7, v5.3.6, v5.3.5, v5.3.4, v5.3.3, v5.3.2
# 6b3656a6 26-Sep-2019 Kevin(Yudong) Yang <yyd@google.com>

tcp_bbr: fix quantization code to not raise cwnd if not probing bandwidth

There was a bug in the previous logic that attempted to ensure gain cycling
gets inflight above BDP even for small BDPs. Thi

tcp_bbr: fix quantization code to not raise cwnd if not probing bandwidth

There was a bug in the previous logic that attempted to ensure gain cycling
gets inflight above BDP even for small BDPs. This code correctly raised and
lowered target inflight values during the gain cycle. And this code
correctly ensured that cwnd was raised when probing bandwidth. However, it
did not correspondingly ensure that cwnd was *not* raised in this way when
*not* probing for bandwidth. The result was that small-BDP flows that were
always cwnd-bound could go for many cycles with a fixed cwnd, and not probe
or yield bandwidth at all. This meant that multiple small-BDP flows could
fail to converge in their bandwidth allocations.

Fixes: 3c346b233c68 ("tcp_bbr: fix bw probing to raise in-flight data for very small BDPs")
Signed-off-by: Kevin(Yudong) Yang <yyd@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

show more ...


Revision tags: v5.3.1, v5.3, v5.2.14, v5.3-rc8, v5.2.13, v5.2.12
# de8e1beb 29-Aug-2019 Luke Hsiao <lukehsiao@google.com>

tcp_bbr: clarify that bbr_bdp() rounds up in comments

This explicitly clarifies that bbr_bdp() returns the rounded-up value of
the bandwidth-delay product and why in the comments.

Signed-off-by: Lu

tcp_bbr: clarify that bbr_bdp() rounds up in comments

This explicitly clarifies that bbr_bdp() returns the rounded-up value of
the bandwidth-delay product and why in the comments.

Signed-off-by: Luke Hsiao <lukehsiao@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

show more ...


Revision tags: v5.2.11, v5.2.10, v5.2.9, v5.2.8, v5.2.7, v5.2.6, v5.2.5, v5.2.4, v5.2.3, v5.2.2, v5.2.1, v5.2, v5.1.16, v5.1.15, v5.1.14, v5.1.13, v5.1.12, v5.1.11, v5.1.10, v5.1.9, v5.1.8, v5.1.7, v5.1.6, v5.1.5, v5.1.4, v5.1.3, v5.1.2, v5.1.1, v5.0.14, v5.1, v5.0.13, v5.0.12, v5.0.11, v5.0.10, v5.0.9, v5.0.8, v5.0.7, v5.0.6, v5.0.5, v5.0.4, v5.0.3, v4.19.29, v5.0.2, v4.19.28, v5.0.1, v4.19.27, v5.0, v4.19.26, v4.19.25, v4.19.24, v4.19.23, v4.19.22, v4.19.21, v4.19.20, v4.19.19, v4.19.18
# 78dc70eb 23-Jan-2019 Priyaranjan Jha <priyarjha@google.com>

tcp_bbr: adapt cwnd based on ack aggregation estimation

Aggregation effects are extremely common with wifi, cellular, and cable
modem link technologies, ACK decimation in middleboxes, and LRO and GR

tcp_bbr: adapt cwnd based on ack aggregation estimation

Aggregation effects are extremely common with wifi, cellular, and cable
modem link technologies, ACK decimation in middleboxes, and LRO and GRO
in receiving hosts. The aggregation can happen in either direction,
data or ACKs, but in either case the aggregation effect is visible
to the sender in the ACK stream.

Previously BBR's sending was often limited by cwnd under severe ACK
aggregation/decimation because BBR sized the cwnd at 2*BDP. If packets
were acked in bursts after long delays (e.g. one ACK acking 5*BDP after
5*RTT), BBR's sending was halted after sending 2*BDP over 2*RTT, leaving
the bottleneck idle for potentially long periods. Note that loss-based
congestion control does not have this issue because when facing
aggregation it continues increasing cwnd after bursts of ACKs, growing
cwnd until the buffer is full.

To achieve good throughput in the presence of aggregation effects, this
algorithm allows the BBR sender to put extra data in flight to keep the
bottleneck utilized during silences in the ACK stream that it has evidence
to suggest were caused by aggregation.

A summary of the algorithm: when a burst of packets are acked by a
stretched ACK or a burst of ACKs or both, BBR first estimates the expected
amount of data that should have been acked, based on its estimated
bandwidth. Then the surplus ("extra_acked") is recorded in a windowed-max
filter to estimate the recent level of observed ACK aggregation. Then cwnd
is increased by the ACK aggregation estimate. The larger cwnd avoids BBR
being cwnd-limited in the face of ACK silences that recent history suggests
were caused by aggregation. As a sanity check, the ACK aggregation degree
is upper-bounded by the cwnd (at the time of measurement) and a global max
of BW * 100ms. The algorithm is further described by the following
presentation:
https://datatracker.ietf.org/meeting/101/materials/slides-101-iccrg-an-update-on-bbr-work-at-google-00

In our internal testing, we observed a significant increase in BBR
throughput (measured using netperf), in a basic wifi setup.
- Host1 (sender on ethernet) -> AP -> Host2 (receiver on wifi)
- 2.4 GHz -> BBR before: ~73 Mbps; BBR after: ~102 Mbps; CUBIC: ~100 Mbps
- 5.0 GHz -> BBR before: ~362 Mbps; BBR after: ~593 Mbps; CUBIC: ~601 Mbps

Also, this code is running globally on YouTube TCP connections and produced
significant bandwidth increases for YouTube traffic.

This is based on Ian Swett's max_ack_height_ algorithm from the
QUIC BBR implementation.

Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

show more ...


# 232aa8ec 23-Jan-2019 Priyaranjan Jha <priyarjha@google.com>

tcp_bbr: refactor bbr_target_cwnd() for general inflight provisioning

Because bbr_target_cwnd() is really a general-purpose BBR helper for
computing some volume of inflight data as a function of the

tcp_bbr: refactor bbr_target_cwnd() for general inflight provisioning

Because bbr_target_cwnd() is really a general-purpose BBR helper for
computing some volume of inflight data as a function of the estimated
BDP, refactor it into following helper functions:
- bbr_bdp()
- bbr_quantization_budget()
- bbr_inflight()

Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

show more ...


Revision tags: v4.19.17, v4.19.16, v4.19.15, v4.19.14, v4.19.13, v4.19.12, v4.19.11, v4.19.10, v4.19.9, v4.19.8, v4.19.7, v4.19.6, v4.19.5, v4.19.4, v4.18.20, v4.19.3, v4.18.19, v4.19.2, v4.18.18
# 1106a5ad 08-Nov-2018 Neal Cardwell <ncardwell@google.com>

tcp_bbr: update comments to reflect pacing_margin_percent

Recently, in commit ab408b6dc744 ("tcp: switch tcp and sch_fq to new
earliest departure time model"), the TCP BBR code switched to a new
app

tcp_bbr: update comments to reflect pacing_margin_percent

Recently, in commit ab408b6dc744 ("tcp: switch tcp and sch_fq to new
earliest departure time model"), the TCP BBR code switched to a new
approach of using an explicit bbr_pacing_margin_percent for shaving a
pacing rate "haircut", rather than the previous implict
approach. Update an old comment to reflect the new approach.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

show more ...


Revision tags: v4.18.17, v4.19.1, v4.19, v4.18.16, v4.18.15
# cf33e25c 16-Oct-2018 Neal Cardwell <ncardwell@google.com>

tcp_bbr: centralize code to set gains

Centralize the code that sets gains used for computing cwnd and pacing
rate. This simplifies the code and makes it easier to change the state
machine or (in the

tcp_bbr: centralize code to set gains

Centralize the code that sets gains used for computing cwnd and pacing
rate. This simplifies the code and makes it easier to change the state
machine or (in the future) dynamically change the gain values and
ensure that the correct gain values are always used.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

show more ...


# a87c83d5 16-Oct-2018 Neal Cardwell <ncardwell@google.com>

tcp_bbr: adjust TCP BBR for departure time pacing

Adjust TCP BBR for the new departure time pacing model in the recent
commit ab408b6dc7449 ("tcp: switch tcp and sch_fq to new earliest
departure tim

tcp_bbr: adjust TCP BBR for departure time pacing

Adjust TCP BBR for the new departure time pacing model in the recent
commit ab408b6dc7449 ("tcp: switch tcp and sch_fq to new earliest
departure time model").

With TSQ and pacing at lower layers, there are often several skbs
queued in the pacing layer, and thus there is less data "in the
network" than "in flight".

With departure time pacing at lower layers (e.g. fq or potential
future NICs), the data in the pacing layer now has a pre-scheduled
("baked-in") departure time that cannot be changed, even if the
congestion control algorithm decides to use a new pacing rate.

This means that there can be a non-trivial lag between when BBR makes
a pacing rate change and when the inter-skb pacing delays
change. After a pacing rate change, the number of packets in the
network can gradually evolve to be higher or lower, depending on
whether the sending rate is higher or lower than the delivery
rate. Thus ignoring this lag can cause significant overshoot, with the
flow ending up with too many or too few packets in the network.

This commit changes BBR to adapt its pacing rate based on the amount
of data in the network that it estimates has already been "baked in"
by previous departure time decisions. We estimate the number of our
packets that will be in the network at the earliest departure time
(EDT) for the next skb scheduled as:

in_network_at_edt = inflight_at_edt - (EDT - now) * bw

If we're increasing the amount of data in the network ("in_network"),
then we want to know if the transmit of the EDT skb will push
in_network above the target, so our answer includes
bbr_tso_segs_goal() from the skb departing at EDT. If we're decreasing
in_network, then we want to know if in_network will sink too low just
before the EDT transmit, so our answer does not include the segments
from the skb departing at EDT.

Why do we treat pacing_gain > 1.0 case and pacing_gain < 1.0 case
differently? The in_network curve is a step function: in_network goes
up on transmits, and down on ACKs. To accurately predict when
in_network will go beyond our target value, this will happen on
different events, depending on whether we're concerned about
in_network potentially going too high or too low:

o if pushing in_network up (pacing_gain > 1.0),
then in_network goes above target upon a transmit event

o if pushing in_network down (pacing_gain < 1.0),
then in_network goes below target upon an ACK event

This commit changes the BBR state machine to use this estimated
"packets in network" value to make its decisions.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

show more ...


# 97ec3eb3 15-Oct-2018 Neal Cardwell <ncardwell@google.com>

tcp_bbr: fix typo in bbr_pacing_margin_percent

There was a typo in this parameter name.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-

tcp_bbr: fix typo in bbr_pacing_margin_percent

There was a typo in this parameter name.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

show more ...


# 76a9ebe8 15-Oct-2018 Eric Dumazet <edumazet@google.com>

net: extend sk_pacing_rate to unsigned long

sk_pacing_rate has beed introduced as a u32 field in 2013,
effectively limiting per flow pacing to 34Gbit.

We believe it is time to allow TCP to pace hig

net: extend sk_pacing_rate to unsigned long

sk_pacing_rate has beed introduced as a u32 field in 2013,
effectively limiting per flow pacing to 34Gbit.

We believe it is time to allow TCP to pace high speed flows
on 64bit hosts, as we now can reach 100Gbit on one TCP flow.

This patch adds no cost for 32bit kernels.

The tcpi_pacing_rate and tcpi_max_pacing_rate were already
exported as 64bit, so iproute2/ss command require no changes.

Unfortunately the SO_MAX_PACING_RATE socket option will stay
32bit and we will need to add a new option to let applications
control high pacing rates.

State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 1787144 10.246.9.76:49992 10.246.9.77:36741
timer:(on,003ms,0) ino:91863 sk:2 <->
skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
rcvmss:536 advmss:1448
cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177
segs_in:3916318 data_segs_out:177279175
bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
send 28045.5Mbps lastrcv:73333
pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
notsent:2085120 minrtt:0.013

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

show more ...


Revision tags: v4.18.14, v4.18.13, v4.18.12, v4.18.11, v4.18.10
# ab408b6d 21-Sep-2018 Eric Dumazet <edumazet@google.com>

tcp: switch tcp and sch_fq to new earliest departure time model

TCP keeps track of tcp_wstamp_ns by itself, meaning sch_fq
no longer has to do it.

Thanks to this model, TCP can get more accurate RT

tcp: switch tcp and sch_fq to new earliest departure time model

TCP keeps track of tcp_wstamp_ns by itself, meaning sch_fq
no longer has to do it.

Thanks to this model, TCP can get more accurate RTT samples,
since pacing no longer inflates them.

This has the nice effect of removing some delays caused by FQ
quantum mechanism, causing inflated max/P99 latencies.

Also we might relax TCP Small Queue tight limits in the future,
since this new model allow TCP to build bigger batches, since
sch_fq (or a device with earliest departure time offload) ensure
these packets will be delivered on time.

Note that other protocols are not converted (they will probably
never be) so sch_fq has still support for SO_MAX_PACING_RATE

Tested:

Test showing FQ pacing quantum artifact for low-rate flows,
adding unexpected throttles for RPC flows, inflating max and P99 latencies.

The parameters chosen here are to show what happens typically when
a TCP flow has a reduced pacing rate (this can be caused by a reduced
cwin after few losses, or/and rtt above few ms)

MIBS="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY"
Before :
$ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
19,82.78,5279,3825,482.02

After :
$ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
20,49.94,128,63,3.18

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

show more ...


Revision tags: v4.18.9, v4.18.7, v4.18.6, v4.18.5
# 8e995bf1 22-Aug-2018 Kevin Yang <yyd@google.com>

tcp_bbr: apply PROBE_RTT cwnd cap even if acked==0

This commit fixes a corner case where TCP BBR would enter PROBE_RTT
mode but not reduce its cwnd. If a TCP receiver ACKed less than one
full segmen

tcp_bbr: apply PROBE_RTT cwnd cap even if acked==0

This commit fixes a corner case where TCP BBR would enter PROBE_RTT
mode but not reduce its cwnd. If a TCP receiver ACKed less than one
full segment, the number of delivered/acked packets was 0, so that
bbr_set_cwnd() would short-circuit and exit early, without cutting
cwnd to the value we want for PROBE_RTT.

The fix is to instead make sure that even when 0 full packets are
ACKed, we do apply all the appropriate caps, including the cap that
applies in PROBE_RTT mode.

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Kevin Yang <yyd@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

show more ...


1234