Revision tags: v5.8.9, v5.8.8, v5.8.7 |
|
#
383e3f3e |
| 05-Sep-2020 |
Wang Hai <wanghai38@huawei.com> |
net/packet: Remove unused macro BLOCK_PRIV
BLOCK_PRIV is never used after it was introduced. So better to remove it.
Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@h
net/packet: Remove unused macro BLOCK_PRIV
BLOCK_PRIV is never used after it was introduced. So better to remove it.
Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
acf69c94 |
| 03-Sep-2020 |
Or Cohen <orcohen@paloaltonetworks.com> |
net/packet: fix overflow in tpacket_rcv
Using tp_reserve to calculate netoff can overflow as tp_reserve is unsigned int and netoff is unsigned short.
This may lead to macoff receving a smaller valu
net/packet: fix overflow in tpacket_rcv
Using tp_reserve to calculate netoff can overflow as tp_reserve is unsigned int and netoff is unsigned short.
This may lead to macoff receving a smaller value then sizeof(struct virtio_net_hdr), and if po->has_vnet_hdr is set, an out-of-bounds write will occur when calling virtio_net_hdr_from_skb.
The bug is fixed by converting netoff to unsigned int and checking if it exceeds USHRT_MAX.
This addresses CVE-2020-14386
Fixes: 8913336a7e8d ("packet: add PACKET_RESERVE sockopt") Signed-off-by: Or Cohen <orcohen@paloaltonetworks.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
Revision tags: v5.8.6, v5.4.62, v5.8.5, v5.8.4, v5.4.61 |
|
#
df561f66 |
| 23-Aug-2020 |
Gustavo A. R. Silva <gustavoars@kernel.org> |
treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through mar
treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case.
[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
show more ...
|
Revision tags: v5.8.3, v5.4.60, v5.8.2, v5.4.59 |
|
#
88fd1cb8 |
| 13-Aug-2020 |
John Ogness <john.ogness@linutronix.de> |
af_packet: TPACKET_V3: fix fill status rwlock imbalance
After @blk_fill_in_prog_lock is acquired there is an early out vnet situation that can occur. In that case, the rwlock needs to be released.
af_packet: TPACKET_V3: fix fill status rwlock imbalance
After @blk_fill_in_prog_lock is acquired there is an early out vnet situation that can occur. In that case, the rwlock needs to be released.
Also, since @blk_fill_in_prog_lock is only acquired when @tp_version is exactly TPACKET_V3, only release it on that exact condition as well.
And finally, add sparse annotation so that it is clearer that prb_fill_curr_block() and prb_clear_blk_fill_status() are acquiring and releasing @blk_fill_in_prog_lock, respectively. sparse is still unable to understand the balance, but the warnings are now on a higher level that make more sense.
Fixes: 632ca50f2cbd ("af_packet: TPACKET_V3: replace busy-wait loop") Signed-off-by: John Ogness <john.ogness@linutronix.de> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v5.8.1, v5.4.58, v5.4.57, v5.4.56, v5.8, v5.7.12, v5.4.55, v5.7.11, v5.4.54 |
|
#
a7b75c5a |
| 23-Jul-2020 |
Christoph Hellwig <hch@lst.de> |
net: pass a sockptr_t into ->setsockopt
Rework the remaining setsockopt code to pass a sockptr_t instead of a plain user pointer. This removes the last remaining set_fs(KERNEL_DS) outside of archit
net: pass a sockptr_t into ->setsockopt
Rework the remaining setsockopt code to pass a sockptr_t instead of a plain user pointer. This removes the last remaining set_fs(KERNEL_DS) outside of architecture specific code.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> [ieee802154] Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
b1ea9ff6 |
| 23-Jul-2020 |
Christoph Hellwig <hch@lst.de> |
net: switch copy_bpf_fprog_from_user to sockptr_t
Pass a sockptr_t to prepare for set_fs-less handling of the kernel pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-of
net: switch copy_bpf_fprog_from_user to sockptr_t
Pass a sockptr_t to prepare for set_fs-less handling of the kernel pointer from bpf-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v5.7.10, v5.4.53 |
|
#
a44d9e72 |
| 17-Jul-2020 |
Christoph Hellwig <hch@lst.de> |
net: make ->{get,set}sockopt in proto_ops optional
Just check for a NULL method instead of wiring up sock_no_{get,set}sockopt.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Marc Kleine-Bu
net: make ->{get,set}sockopt in proto_ops optional
Just check for a NULL method instead of wiring up sock_no_{get,set}sockopt.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
4d295e54 |
| 17-Jul-2020 |
Christoph Hellwig <hch@lst.de> |
net: simplify cBPF setsockopt compat handling
Add a helper that copies either a native or compat bpf_fprog from userspace after verifying the length, and remove the compat setsockopt handlers that n
net: simplify cBPF setsockopt compat handling
Add a helper that copies either a native or compat bpf_fprog from userspace after verifying the length, and remove the compat setsockopt handlers that now aren't required.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v5.4.52, v5.7.9, v5.7.8, v5.4.51 |
|
#
632ca50f |
| 07-Jul-2020 |
John Ogness <john.ogness@linutronix.de> |
af_packet: TPACKET_V3: replace busy-wait loop
A busy-wait loop is used to implement waiting for bits to be copied from the skb to the kernel buffer before retiring a block. This is a problem on PREE
af_packet: TPACKET_V3: replace busy-wait loop
A busy-wait loop is used to implement waiting for bits to be copied from the skb to the kernel buffer before retiring a block. This is a problem on PREEMPT_RT because the copying task could be preempted by the busy-waiting task and thus live lock in the busy-wait loop.
Replace the busy-wait logic with an rwlock_t. This provides lockdep coverage and makes the code RT ready.
Signed-off-by: John Ogness <john.ogness@linutronix.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
2a6d6c31 |
| 01-Jul-2020 |
Colin Ian King <colin.king@canonical.com> |
net/packet: remove redundant initialization of variable err
The variable err is being initialized with a value that is never read and it is being updated later with a new value. The initialization
net/packet: remove redundant initialization of variable err
The variable err is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed.
Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v5.4.50, v5.7.7, v5.4.49, v5.7.6, v5.7.5, v5.4.48, v5.7.4, v5.7.3, v5.4.47, v5.4.46, v5.7.2, v5.4.45, v5.7.1, v5.4.44, v5.7, v5.4.43, v5.4.42, v5.4.41, v5.4.40, v5.4.39, v5.4.38, v5.4.37, v5.4.36, v5.4.35, v5.4.34, v5.4.33, v5.4.32, v5.4.31, v5.4.30, v5.4.29, v5.6, v5.4.28, v5.4.27, v5.4.26 |
|
#
61fad681 |
| 13-Mar-2020 |
Willem de Bruijn <willemb@google.com> |
net/packet: tpacket_rcv: avoid a producer race condition
PACKET_RX_RING can cause multiple writers to access the same slot if a fast writer wraps the ring while a slow writer is still copying. This
net/packet: tpacket_rcv: avoid a producer race condition
PACKET_RX_RING can cause multiple writers to access the same slot if a fast writer wraps the ring while a slow writer is still copying. This is particularly likely with few, large, slots (e.g., GSO packets).
Synchronize kernel thread ownership of rx ring slots with a bitmap.
Writers acquire a slot race-free by testing tp_status TP_STATUS_KERNEL while holding the sk receive queue lock. They release this lock before copying and set tp_status to TP_STATUS_USER to release to userspace when done. During copying, another writer may take the lock, also see TP_STATUS_KERNEL, and start writing to the same slot.
Introduce a new rx_owner_map bitmap with a bit per slot. To acquire a slot, test and set with the lock held. To release race-free, update tp_status and owner bit as a transaction, so take the lock again.
This is the one of a variety of discussed options (see Link below):
* instead of a shadow ring, embed the data in the slot itself, such as in tp_padding. But any test for this field may match a value left by userspace, causing deadlock.
* avoid the lock on release. This leaves a small race if releasing the shadow slot before setting TP_STATUS_USER. The below reproducer showed that this race is not academic. If releasing the slot after tp_status, the race is more subtle. See the first link for details.
* add a new tp_status TP_KERNEL_OWNED to avoid the transactional store of two fields. But, legacy applications may interpret all non-zero tp_status as owned by the user. As libpcap does. So this is possible only opt-in by newer processes. It can be added as an optional mode.
* embed the struct at the tail of pg_vec to avoid extra allocation. The implementation proved no less complex than a separate field.
The additional locking cost on release adds contention, no different than scaling on multicore or multiqueue h/w. In practice, below reproducer nor small packet tcpdump showed a noticeable change in perf report in cycles spent in spinlock. Where contention is problematic, packet sockets support mitigation through PACKET_FANOUT. And we can consider adding opt-in state TP_KERNEL_OWNED.
Easy to reproduce by running multiple netperf or similar TCP_STREAM flows concurrently with `tcpdump -B 129 -n greater 60000`.
Based on an earlier patchset by Jon Rosen. See links below.
I believe this issue goes back to the introduction of tpacket_rcv, which predates git history.
Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg237222.html Suggested-by: Jon Rosen <jrosen@cisco.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jon Rosen <jrosen@cisco.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v5.4.25 |
|
#
46e4c421 |
| 09-Mar-2020 |
Willem de Bruijn <willemb@google.com> |
net/packet: tpacket_rcv: do not increment ring index on drop
In one error case, tpacket_rcv drops packets after incrementing the ring producer index.
If this happens, it does not update tp_status t
net/packet: tpacket_rcv: do not increment ring index on drop
In one error case, tpacket_rcv drops packets after incrementing the ring producer index.
If this happens, it does not update tp_status to TP_STATUS_USER and thus the reader is stalled for an iteration of the ring, causing out of order arrival.
The only such error path is when virtio_net_hdr_from_skb fails due to encountering an unknown GSO type.
Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v5.4.24, v5.4.23, v5.4.22, v5.4.21, v5.4.20, v5.4.19, v5.4.18, v5.4.17, v5.4.16, v5.5, v5.4.15, v5.4.14, v5.4.13, v5.4.12, v5.4.11, v5.4.10, v5.4.9, v5.4.8, v5.4.7 |
|
#
0914d2bb |
| 23-Dec-2019 |
Mao Wenan <maowenan@huawei.com> |
af_packet: refactoring code for prb_calc_retire_blk_tmo
If __ethtool_get_link_ksettings() is failed and with non-zero value, prb_calc_retire_blk_tmo() should return DEFAULT_PRB_RETIRE_TOV firstly.
af_packet: refactoring code for prb_calc_retire_blk_tmo
If __ethtool_get_link_ksettings() is failed and with non-zero value, prb_calc_retire_blk_tmo() should return DEFAULT_PRB_RETIRE_TOV firstly.
This patch is to refactory code and make it more readable.
Signed-off-by: Mao Wenan <maowenan@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v5.4.6, v5.4.5, v5.4.4, v5.4.3, v5.3.15, v5.4.2, v5.4.1, v5.3.14, v5.4, v5.3.13, v5.3.12, v5.3.11, v5.3.10, v5.3.9, v5.3.8, v5.3.7, v5.3.6, v5.3.5, v5.3.4, v5.3.3, v5.3.2, v5.3.1, v5.3, v5.2.14, v5.3-rc8, v5.2.13, v5.2.12, v5.2.11, v5.2.10, v5.2.9, v5.2.8, v5.2.7, v5.2.6, v5.2.5, v5.2.4, v5.2.3, v5.2.2, v5.2.1, v5.2, v5.1.16, v5.1.15, v5.1.14, v5.1.13, v5.1.12, v5.1.11, v5.1.10, v5.1.9, v5.1.8, v5.1.7, v5.1.6, v5.1.5, v5.1.4, v5.1.3, v5.1.2, v5.1.1, v5.0.14, v5.1, v5.0.13, v5.0.12, v5.0.11, v5.0.10, v5.0.9, v5.0.8, v5.0.7, v5.0.6, v5.0.5, v5.0.4, v5.0.3, v4.19.29, v5.0.2, v4.19.28, v5.0.1, v4.19.27, v5.0, v4.19.26, v4.19.25, v4.19.24, v4.19.23, v4.19.22, v4.19.21, v4.19.20, v4.19.19, v4.19.18, v4.19.17, v4.19.16, v4.19.15, v4.19.14, v4.19.13, v4.19.12, v4.19.11, v4.19.10, v4.19.9, v4.19.8, v4.19.7, v4.19.6, v4.19.5, v4.19.4, v4.18.20, v4.19.3, v4.18.19, v4.19.2, v4.18.18, v4.18.17, v4.19.1, v4.19, v4.18.16, v4.18.15, v4.18.14, v4.18.13, v4.18.12, v4.18.11, v4.18.10, v4.18.9, v4.18.7, v4.18.6, v4.18.5, v4.17.18, v4.18.4, v4.18.3, v4.17.17, v4.18.2, v4.17.16, v4.17.15, v4.18.1, v4.18, v4.17.14, v4.17.13, v4.17.12, v4.17.11, v4.17.10, v4.17.9, v4.17.8, v4.17.7, v4.17.6, v4.17.5, v4.17.4, v4.17.3, v4.17.2, v4.17.1, v4.17, v4.16, v4.15 |
|
#
d413fcb4 |
| 27-Nov-2017 |
Arnd Bergmann <arnd@arndb.de> |
packet: clarify timestamp overflow
The memory mapped packet socket data structure in version 1 through 3 all contain 32-bit second values for the packet time stamps, which makes them suffer from the
packet: clarify timestamp overflow
The memory mapped packet socket data structure in version 1 through 3 all contain 32-bit second values for the packet time stamps, which makes them suffer from the overflow of time_t in y2038 or y2106 (depending on whether user space interprets the value as signed or unsigned).
The implementation uses the deprecated getnstimeofday() function.
In order to get rid of that, this changes the code to use ktime_get_real_ts64() as a replacement, documenting the nature of the overflow. As long as the user applications treat the timestamps as unsigned, or only use the difference between timestamps, they are fine, and changing the timestamps to 64-bit wouldn't require a more invasive user space API change.
Note: a lot of other APIs suffer from incompatible structures when time_t gets redefined to 64-bit in 32-bit user space, but this one does not.
Acked-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/lkml/CAF=yD-Jomr-gWSR-EBNKnSpFL46UeG564FLfqTCMNEm-prEaXA@mail.gmail.com/T/#u Signed-off-by: Arnd Bergmann <arnd@arndb.de>
show more ...
|
#
b43d1f9f |
| 09-Dec-2019 |
Mao Wenan <maowenan@huawei.com> |
af_packet: set defaule value for tmo
There is softlockup when using TPACKET_V3: ... NMI watchdog: BUG: soft lockup - CPU#2 stuck for 60010ms! (__irq_svc) from [<c0558a0c>] (_raw_spin_unlock_irqresto
af_packet: set defaule value for tmo
There is softlockup when using TPACKET_V3: ... NMI watchdog: BUG: soft lockup - CPU#2 stuck for 60010ms! (__irq_svc) from [<c0558a0c>] (_raw_spin_unlock_irqrestore+0x44/0x54) (_raw_spin_unlock_irqrestore) from [<c027b7e8>] (mod_timer+0x210/0x25c) (mod_timer) from [<c0549c30>] (prb_retire_rx_blk_timer_expired+0x68/0x11c) (prb_retire_rx_blk_timer_expired) from [<c027a7ac>] (call_timer_fn+0x90/0x17c) (call_timer_fn) from [<c027ab6c>] (run_timer_softirq+0x2d4/0x2fc) (run_timer_softirq) from [<c021eaf4>] (__do_softirq+0x218/0x318) (__do_softirq) from [<c021eea0>] (irq_exit+0x88/0xac) (irq_exit) from [<c0240130>] (msa_irq_exit+0x11c/0x1d4) (msa_irq_exit) from [<c0209cf0>] (handle_IPI+0x650/0x7f4) (handle_IPI) from [<c02015bc>] (gic_handle_irq+0x108/0x118) (gic_handle_irq) from [<c0558ee4>] (__irq_usr+0x44/0x5c) ...
If __ethtool_get_link_ksettings() is failed in prb_calc_retire_blk_tmo(), msec and tmo will be zero, so tov_in_jiffies is zero and the timer expire for retire_blk_timer is turn to mod_timer(&pkc->retire_blk_timer, jiffies + 0), which will trigger cpu usage of softirq is 100%.
Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.") Tested-by: Xiao Jiangfeng <xiaojiangfeng@huawei.com> Signed-off-by: Mao Wenan <maowenan@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
b756ad92 |
| 08-Nov-2019 |
Eric Dumazet <edumazet@google.com> |
packet: fix data-race in fanout_flow_is_huge()
KCSAN reported the following data-race [1]
Adding a couple of READ_ONCE()/WRITE_ONCE() should silence it.
Since the report hinted about multiple cpus
packet: fix data-race in fanout_flow_is_huge()
KCSAN reported the following data-race [1]
Adding a couple of READ_ONCE()/WRITE_ONCE() should silence it.
Since the report hinted about multiple cpus using the history concurrently, I added a test avoiding writing on it if the victim slot already contains the desired value.
[1]
BUG: KCSAN: data-race in fanout_demux_rollover / fanout_demux_rollover
read to 0xffff8880b01786cc of 4 bytes by task 18921 on cpu 1: fanout_flow_is_huge net/packet/af_packet.c:1303 [inline] fanout_demux_rollover+0x33e/0x3f0 net/packet/af_packet.c:1353 packet_rcv_fanout+0x34e/0x490 net/packet/af_packet.c:1453 deliver_skb net/core/dev.c:1888 [inline] dev_queue_xmit_nit+0x15b/0x540 net/core/dev.c:1958 xmit_one net/core/dev.c:3195 [inline] dev_hard_start_xmit+0x3f5/0x430 net/core/dev.c:3215 __dev_queue_xmit+0x14ab/0x1b40 net/core/dev.c:3792 dev_queue_xmit+0x21/0x30 net/core/dev.c:3825 neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530 neigh_output include/net/neighbour.h:511 [inline] ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline] __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175 dst_output include/net/dst.h:436 [inline] ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179 ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795 udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173 udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471 inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576 sock_sendmsg_nosec net/socket.c:637 [inline] sock_sendmsg+0x9f/0xc0 net/socket.c:657 ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311 __sys_sendmmsg+0x123/0x350 net/socket.c:2413 __do_sys_sendmmsg net/socket.c:2442 [inline] __se_sys_sendmmsg net/socket.c:2439 [inline] __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9
write to 0xffff8880b01786cc of 4 bytes by task 18922 on cpu 0: fanout_flow_is_huge net/packet/af_packet.c:1306 [inline] fanout_demux_rollover+0x3a4/0x3f0 net/packet/af_packet.c:1353 packet_rcv_fanout+0x34e/0x490 net/packet/af_packet.c:1453 deliver_skb net/core/dev.c:1888 [inline] dev_queue_xmit_nit+0x15b/0x540 net/core/dev.c:1958 xmit_one net/core/dev.c:3195 [inline] dev_hard_start_xmit+0x3f5/0x430 net/core/dev.c:3215 __dev_queue_xmit+0x14ab/0x1b40 net/core/dev.c:3792 dev_queue_xmit+0x21/0x30 net/core/dev.c:3825 neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530 neigh_output include/net/neighbour.h:511 [inline] ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline] __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175 dst_output include/net/dst.h:436 [inline] ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179 ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795 udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173 udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471 inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576 sock_sendmsg_nosec net/socket.c:637 [inline] sock_sendmsg+0x9f/0xc0 net/socket.c:657 ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311 __sys_sendmmsg+0x123/0x350 net/socket.c:2413 __do_sys_sendmmsg net/socket.c:2442 [inline] __se_sys_sendmmsg net/socket.c:2439 [inline] __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 18922 Comm: syz-executor.3 Not tainted 5.4.0-rc6+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Fixes: 3b3a5b0aab5b ("packet: rollover huge flows before small flows") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
895b5c9f |
| 29-Sep-2019 |
Florian Westphal <fw@strlen.de> |
netfilter: drop bridge nf reset from nf_reset
commit 174e23810cd31 ("sk_buff: drop all skb extensions on free and skb scrubbing") made napi recycle always drop skb extensions. The additional skb_ex
netfilter: drop bridge nf reset from nf_reset
commit 174e23810cd31 ("sk_buff: drop all skb extensions on free and skb scrubbing") made napi recycle always drop skb extensions. The additional skb_ext_del() that is performed via nf_reset on napi skb recycle is not needed anymore.
Most nf_reset() calls in the stack are there so queued skb won't block 'rmmod nf_conntrack' indefinitely.
This removes the skb_ext_del from nf_reset, and renames it to a more fitting nf_reset_ct().
In a few selected places, add a call to skb_ext_reset to make sure that no active extensions remain.
I am submitting this for "net", because we're still early in the release cycle. The patch applies to net-next too, but I think the rename causes needless divergence between those trees.
Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
show more ...
|
#
32d3182c |
| 14-Aug-2019 |
Eric Dumazet <edumazet@google.com> |
net/packet: fix race in tpacket_snd()
packet_sendmsg() checks tx_ring.pg_vec to decide if it must call tpacket_snd().
Problem is that the check is lockless, meaning another thread can issue a concu
net/packet: fix race in tpacket_snd()
packet_sendmsg() checks tx_ring.pg_vec to decide if it must call tpacket_snd().
Problem is that the check is lockless, meaning another thread can issue a concurrent setsockopt(PACKET_TX_RING ) to flip tx_ring.pg_vec back to NULL.
Given that tpacket_snd() grabs pg_vec_lock mutex, we can perform the check again to solve the race.
syzbot reported :
kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 11429 Comm: syz-executor394 Not tainted 5.3.0-rc4+ #101 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:packet_lookup_frame+0x8d/0x270 net/packet/af_packet.c:474 Code: c1 ee 03 f7 73 0c 80 3c 0e 00 0f 85 cb 01 00 00 48 8b 0b 89 c0 4c 8d 24 c1 48 b8 00 00 00 00 00 fc ff df 4c 89 e1 48 c1 e9 03 <80> 3c 01 00 0f 85 94 01 00 00 48 8d 7b 10 4d 8b 3c 24 48 b8 00 00 RSP: 0018:ffff88809f82f7b8 EFLAGS: 00010246 RAX: dffffc0000000000 RBX: ffff8880a45c7030 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 1ffff110148b8e06 RDI: ffff8880a45c703c RBP: ffff88809f82f7e8 R08: ffff888087aea200 R09: fffffbfff134ae50 R10: fffffbfff134ae4f R11: ffffffff89a5727f R12: 0000000000000000 R13: 0000000000000001 R14: ffff8880a45c6ac0 R15: 0000000000000000 FS: 00007fa04716f700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa04716edb8 CR3: 0000000091eb4000 CR4: 00000000001406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: packet_current_frame net/packet/af_packet.c:487 [inline] tpacket_snd net/packet/af_packet.c:2667 [inline] packet_sendmsg+0x590/0x6250 net/packet/af_packet.c:2975 sock_sendmsg_nosec net/socket.c:637 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:657 ___sys_sendmsg+0x3e2/0x920 net/socket.c:2311 __sys_sendmmsg+0x1bf/0x4d0 net/socket.c:2413 __do_sys_sendmmsg net/socket.c:2442 [inline] __se_sys_sendmmsg net/socket.c:2439 [inline] __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2439 do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296 entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fixes: 69e3c75f4d54 ("net: TX_RING and packet mmap") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
89ed5b51 |
| 25-Jun-2019 |
Neil Horman <nhorman@tuxdriver.com> |
af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET
When an application is run that: a) Sets its scheduler to be SCHED_FIFO and b) Opens a memory mapped AF_PACKET socke
af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET
When an application is run that: a) Sets its scheduler to be SCHED_FIFO and b) Opens a memory mapped AF_PACKET socket, and sends frames with the MSG_DONTWAIT flag cleared, its possible for the application to hang forever in the kernel. This occurs because when waiting, the code in tpacket_snd calls schedule, which under normal circumstances allows other tasks to run, including ksoftirqd, which in some cases is responsible for freeing the transmitted skb (which in AF_PACKET calls a destructor that flips the status bit of the transmitted frame back to available, allowing the transmitting task to complete).
However, when the calling application is SCHED_FIFO, its priority is such that the schedule call immediately places the task back on the cpu, preventing ksoftirqd from freeing the skb, which in turn prevents the transmitting task from detecting that the transmission is complete.
We can fix this by converting the schedule call to a completion mechanism. By using a completion queue, we force the calling task, when it detects there are no more frames to send, to schedule itself off the cpu until such time as the last transmitted skb is freed, allowing forward progress to be made.
Tested by myself and the reporter, with good results
Change Notes:
V1->V2: Enhance the sleep logic to support being interruptible and allowing for honoring to SK_SNDTIMEO (Willem de Bruijn)
V2->V3: Rearrage the point at which we wait for the completion queue, to avoid needing to check for ph/skb being null at the end of the loop. Also move the complete call to the skb destructor to avoid needing to modify __packet_set_status. Also gate calling complete on packet_read_pending returning zero to avoid multiple calls to complete. (Willem de Bruijn)
Move timeo computation within loop, to re-fetch the socket timeout since we also use the timeo variable to record the return code from the wait_for_complete call (Neil Horman)
V3->V4: Willem has requested that the control flow be restored to the previous state. Doing so lets us eliminate the need for the po->wait_on_complete flag variable, and lets us get rid of the packet_next_frame function, but introduces another complexity. Specifically, but using the packet pending count, we can, if an applications calls sendmsg multiple times with MSG_DONTWAIT set, each set of transmitted frames, when complete, will cause tpacket_destruct_skb to issue a complete call, for which there will never be a wait_on_completion call. This imbalance will lead to any future call to wait_for_completion here to return early, when the frames they sent may not have completed. To correct this, we need to re-init the completion queue on every call to tpacket_snd before we enter the loop so as to ensure we wait properly for the frames we send in this iteration.
Change the timeout and interrupted gotos to out_put rather than out_status so that we don't try to free a non-existant skb Clean up some extra newlines (Willem de Bruijn)
Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reported-by: Matteo Croce <mcroce@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
55655e3d |
| 24-Jun-2019 |
Eric Dumazet <edumazet@google.com> |
net/packet: fix memory leak in packet_set_ring()
syzbot found we can leak memory in packet_set_ring(), if user application provides buggy parameters.
Fixes: 7f953ab2ba46 ("af_packet: TX_RING suppor
net/packet: fix memory leak in packet_set_ring()
syzbot found we can leak memory in packet_set_ring(), if user application provides buggy parameters.
Fixes: 7f953ab2ba46 ("af_packet: TX_RING support for TPACKET_V3") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Sowmini Varadhan <sowmini.varadhan@oracle.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
9bb6cd65 |
| 12-Jun-2019 |
Eric Dumazet <edumazet@google.com> |
net/packet: introduce packet_rcv_try_clear_pressure() helper
There are two places where we want to clear the pressure if possible, add a helper to make it more obvious.
Signed-off-by: Eric Dumazet
net/packet: introduce packet_rcv_try_clear_pressure() helper
There are two places where we want to clear the pressure if possible, add a helper to make it more obvious.
Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Willem de Bruijn <willemb@google.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
3a2bb84e |
| 12-Jun-2019 |
Eric Dumazet <edumazet@google.com> |
net/packet: remove locking from packet_rcv_has_room()
__packet_rcv_has_room() can now be run without lock being held.
po->pressure is only a non persistent hint, we can mark all read/write accesses
net/packet: remove locking from packet_rcv_has_room()
__packet_rcv_has_room() can now be run without lock being held.
po->pressure is only a non persistent hint, we can mark all read/write accesses with READ_ONCE()/WRITE_ONCE() to document the fact that the field could be written without any synchronization.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
2c51c627 |
| 12-Jun-2019 |
Eric Dumazet <edumazet@google.com> |
net/packet: implement shortcut in tpacket_rcv()
tpacket_rcv() can be hit under DDOS quite hard, since it will always grab a socket spinlock, to eventually find there is no room for an additional pac
net/packet: implement shortcut in tpacket_rcv()
tpacket_rcv() can be hit under DDOS quite hard, since it will always grab a socket spinlock, to eventually find there is no room for an additional packet.
Using tcpdump [1] on a busy host can lead to catastrophic consequences, because of all cpus spinning on a contended spinlock.
This replicates a similar strategy used in packet_rcv()
[1] Also some applications mistakenly use af_packet socket bound to ETH_P_ALL only to send packets. Receive queue is never drained and immediately full.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
8e8e2951 |
| 12-Jun-2019 |
Eric Dumazet <edumazet@google.com> |
net/packet: make tp_drops atomic
Under DDOS, we want to be able to increment tp_drops without touching the spinlock. This will help readers to drain the receive queue slightly faster :/
Signed-off-
net/packet: make tp_drops atomic
Under DDOS, we want to be able to increment tp_drops without touching the spinlock. This will help readers to drain the receive queue slightly faster :/
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
0338a145 |
| 12-Jun-2019 |
Eric Dumazet <edumazet@google.com> |
net/packet: constify __packet_rcv_has_room()
Goal is use the helper without lock being held.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
|