Revision tags: v6.6.25, v6.6.24, v6.6.23 |
|
#
6915b1b2 |
| 11-Mar-2024 |
Eric Dumazet <edumazet@google.com> |
net/sched: taprio: proper TCA_TAPRIO_TC_ENTRY_INDEX check
[ Upstream commit 343041b59b7810f9cdca371f445dd43b35c740b1 ]
taprio_parse_tc_entry() is not correctly checking TCA_TAPRIO_TC_ENTRY_INDEX at
net/sched: taprio: proper TCA_TAPRIO_TC_ENTRY_INDEX check
[ Upstream commit 343041b59b7810f9cdca371f445dd43b35c740b1 ]
taprio_parse_tc_entry() is not correctly checking TCA_TAPRIO_TC_ENTRY_INDEX attribute:
int tc; // Signed value
tc = nla_get_u32(tb[TCA_TAPRIO_TC_ENTRY_INDEX]); if (tc >= TC_QOPT_MAX_QUEUE) { NL_SET_ERR_MSG_MOD(extack, "TC entry index out of range"); return -ERANGE; }
syzbot reported that it could fed arbitary negative values:
UBSAN: shift-out-of-bounds in net/sched/sch_taprio.c:1722:18 shift exponent -2147418108 is negative CPU: 0 PID: 5066 Comm: syz-executor367 Not tainted 6.8.0-rc7-syzkaller-00136-gc8a5c731fd12 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106 ubsan_epilogue lib/ubsan.c:217 [inline] __ubsan_handle_shift_out_of_bounds+0x3c7/0x420 lib/ubsan.c:386 taprio_parse_tc_entry net/sched/sch_taprio.c:1722 [inline] taprio_parse_tc_entries net/sched/sch_taprio.c:1768 [inline] taprio_change+0xb87/0x57d0 net/sched/sch_taprio.c:1877 taprio_init+0x9da/0xc80 net/sched/sch_taprio.c:2134 qdisc_create+0x9d4/0x1190 net/sched/sch_api.c:1355 tc_modify_qdisc+0xa26/0x1e40 net/sched/sch_api.c:1776 rtnetlink_rcv_msg+0x885/0x1040 net/core/rtnetlink.c:6617 netlink_rcv_skb+0x1e3/0x430 net/netlink/af_netlink.c:2543 netlink_unicast_kernel net/netlink/af_netlink.c:1341 [inline] netlink_unicast+0x7ea/0x980 net/netlink/af_netlink.c:1367 netlink_sendmsg+0xa3b/0xd70 net/netlink/af_netlink.c:1908 sock_sendmsg_nosec net/socket.c:730 [inline] __sock_sendmsg+0x221/0x270 net/socket.c:745 ____sys_sendmsg+0x525/0x7d0 net/socket.c:2584 ___sys_sendmsg net/socket.c:2638 [inline] __sys_sendmsg+0x2b0/0x3a0 net/socket.c:2667 do_syscall_64+0xf9/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 RIP: 0033:0x7f1b2dea3759 Code: 48 83 c4 28 c3 e8 d7 19 00 00 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffd4de452f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f1b2def0390 RCX: 00007f1b2dea3759 RDX: 0000000000000000 RSI: 00000000200007c0 RDI: 0000000000000004 RBP: 0000000000000003 R08: 0000555500000000 R09: 0000555500000000 R10: 0000555500000000 R11: 0000000000000246 R12: 00007ffd4de45340 R13: 00007ffd4de45310 R14: 0000000000000001 R15: 00007ffd4de45340
Fixes: a54fc09e4cba ("net/sched: taprio: allow user input of per-tc max SDU") Reported-and-tested-by: syzbot+a340daa06412d6028918@syzkaller.appspotmail.com Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
Revision tags: v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44 |
|
#
665338b2 |
| 07-Aug-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: dump class stats for the actual q->qdiscs[]
This makes a difference for the software scheduling mode, where dev_queue->qdisc_sleeping is the same as the taprio root Qdisc itself,
net/sched: taprio: dump class stats for the actual q->qdiscs[]
This makes a difference for the software scheduling mode, where dev_queue->qdisc_sleeping is the same as the taprio root Qdisc itself, but when we're talking about what Qdisc and stats get reported for a traffic class, the root taprio isn't what comes to mind, but q->qdiscs[] is.
To understand the difference, I've attempted to send 100 packets in software mode through class 8001:5, and recorded the stats before and after the change.
Here is before:
$ tc -s class show dev eth0 class taprio 8001:1 root leaf 8001: Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:2 root leaf 8001: Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:3 root leaf 8001: Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:4 root leaf 8001: Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:5 root leaf 8001: Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:6 root leaf 8001: Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:7 root leaf 8001: Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:8 root leaf 8001: Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0
and here is after:
class taprio 8001:1 root Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:2 root Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:3 root Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:4 root Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:5 root Sent 9400 bytes 100 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:6 root Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:7 root Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0 class taprio 8001:8 root leaf 800d: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 window_drops 0
The most glaring (and expected) difference is that before, all class stats reported the global stats, whereas now, they really report just the counters for that traffic class.
Finally, Pedro Tammela points out that there is a tc selftest which checks specifically which handle do the child Qdiscs corresponding to each class have. That's changing here - taprio no longer reports tcm->tcm_info as the same handle "1:" as itself (the root Qdisc), but 0 (the handle of the default pfifo child Qdiscs). Since iproute2 does not print a child Qdisc handle of 0, adjust the test's expected output.
Link: https://lore.kernel.org/netdev/3b83fcf6-a5e8-26fb-8c8a-ec34ec4c3342@mojatatu.com/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20230807193324.4128292-6-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
6e0ec800 |
| 07-Aug-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: delete misleading comment about preallocating child qdiscs
As mentioned in commit af7b29b1deaa ("Revert "net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child
net/sched: taprio: delete misleading comment about preallocating child qdiscs
As mentioned in commit af7b29b1deaa ("Revert "net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"") - unlike mqprio, taprio doesn't use q->qdiscs[] only as a temporary transport between Qdisc_ops :: init() and Qdisc_ops :: attach().
Delete the comment, which is just stolen from mqprio, but there, the usage patterns are a lot different, and this is nothing but confusing.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20230807193324.4128292-5-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
98766add |
| 07-Aug-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: try again to report q->qdiscs[] to qdisc_leaf()
This is another stab at commit 1461d212ab27 ("net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"), l
net/sched: taprio: try again to report q->qdiscs[] to qdisc_leaf()
This is another stab at commit 1461d212ab27 ("net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"), later reverted in commit af7b29b1deaa ("Revert "net/sched: taprio: make qdisc_leaf() see the per-netdev-queue pfifo child qdiscs"").
I believe that the problems that caused the revert were fixed, and thus, this change is identical to the original patch.
Its purpose is to properly reject attaching a software taprio child qdisc to a software taprio parent. Because unoffloaded taprio currently reports itself (the root Qdisc) as the return value from qdisc_leaf(), then the process of attaching another taprio as child to a Qdisc class of the root will just result in a Qdisc_ops :: change() call for the root. Whereas that's not we want. We want Qdisc_ops :: init() to be called for the taprio child, in order to give the taprio child a chance to check whether its sch->parent is TC_H_ROOT or not (and reject this configuration).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20230807193324.4128292-4-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
25b0d4e4 |
| 07-Aug-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: keep child Qdisc refcount elevated at 2 in offload mode
Normally, Qdiscs have one reference on them held by their owner and one held for each TXQ to which they are attached, howev
net/sched: taprio: keep child Qdisc refcount elevated at 2 in offload mode
Normally, Qdiscs have one reference on them held by their owner and one held for each TXQ to which they are attached, however this is not the case with the children of an offloaded taprio. Instead, the taprio qdisc currently lives in the following fragile equilibrium.
In the software scheduling case, taprio attaches itself (the root Qdisc) to all TXQs, thus having a refcount of 1 + the number of TX queues. In this mode, the q->qdiscs[] children are not visible directly to the Qdisc API. The lifetime of the Qdiscs from this private array lasts until qdisc_destroy() -> taprio_destroy().
In the fully offloaded case, the root taprio has a refcount of 1, and all child q->qdiscs[] also have a refcount of 1. The child q->qdiscs[] are attached to the netdev TXQs directly and thus are visible to the Qdisc API, however taprio loses a reference to them very early - during qdisc_graft(parent==NULL) -> taprio_attach(). At that time, taprio frees the q->qdiscs[] array to not leak memory, but interestingly, it does not release a reference on these qdiscs because it doesn't effectively own them - they are created by taprio but owned by the Qdisc core, and will be freed by qdisc_graft(parent==NULL, new==NULL) -> qdisc_put(old) when the Qdisc is deleted or when the child Qdisc is replaced with something else.
My interest is to change this equilibrium such that taprio also owns a reference on the q->qdiscs[] child Qdiscs for the lifetime of the root Qdisc, including in full offload mode. I want this because I would like taprio_leaf(), taprio_dump_class(), taprio_dump_class_stats() to have insight into q->qdiscs[] for the software scheduling mode - currently they look at dev_queue->qdisc_sleeping, which is, as mentioned, the same as the root taprio.
The following set of changes is necessary: - don't free q->qdiscs[] early in taprio_attach(), free it late in taprio_destroy() for consistency with software mode. But: - currently that's not possible, because taprio doesn't own a reference on q->qdiscs[]. So hold that reference - once during the initial attach() and once during subsequent graft() calls when the child is changed. - always keep track of the current child in q->qdiscs[], even for full offload mode, so that we free in taprio_destroy() what we should, and not something stale.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20230807193324.4128292-3-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
09e0c3bb |
| 07-Aug-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: don't access q->qdiscs[] in unoffloaded mode during attach()
This is a simple code transformation with no intended behavior change, just to make it absolutely clear that q->qdiscs
net/sched: taprio: don't access q->qdiscs[] in unoffloaded mode during attach()
This is a simple code transformation with no intended behavior change, just to make it absolutely clear that q->qdiscs[] is only attached to the child taprio classes in full offload mode.
Right now we use the q->qdiscs[] variable in taprio_attach() for software mode too, but that is quite confusing and avoidable. We use it only to reach the netdev TX queue, but we could as well just use netdev_get_tx_queue() for that.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20230807193324.4128292-2-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v6.1.43 |
|
#
e7397184 |
| 28-Jul-2023 |
Kuniyuki Iwashima <kuniyu@amazon.com> |
net/sched: taprio: Limit TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME to INT_MAX.
syzkaller found zero division error [0] in div_s64_rem() called from get_cycle_time_elapsed(), where sched->cycle_time is the di
net/sched: taprio: Limit TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME to INT_MAX.
syzkaller found zero division error [0] in div_s64_rem() called from get_cycle_time_elapsed(), where sched->cycle_time is the divisor.
We have tests in parse_taprio_schedule() so that cycle_time will never be 0, and actually cycle_time is not 0 in get_cycle_time_elapsed().
The problem is that the types of divisor are different; cycle_time is s64, but the argument of div_s64_rem() is s32.
syzkaller fed this input and 0x100000000 is cast to s32 to be 0.
@TCA_TAPRIO_ATTR_SCHED_CYCLE_TIME={0xc, 0x8, 0x100000000}
We use s64 for cycle_time to cast it to ktime_t, so let's keep it and set max for cycle_time.
While at it, we prevent overflow in setup_txtime() and add another test in parse_taprio_schedule() to check if cycle_time overflows.
Also, we add a new tdc test case for this issue.
[0]: divide error: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 1 PID: 103 Comm: kworker/1:3 Not tainted 6.5.0-rc1-00330-g60cc1f7d0605 #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Workqueue: ipv6_addrconf addrconf_dad_work RIP: 0010:div_s64_rem include/linux/math64.h:42 [inline] RIP: 0010:get_cycle_time_elapsed net/sched/sch_taprio.c:223 [inline] RIP: 0010:find_entry_to_transmit+0x252/0x7e0 net/sched/sch_taprio.c:344 Code: 3c 02 00 0f 85 5e 05 00 00 48 8b 4c 24 08 4d 8b bd 40 01 00 00 48 8b 7c 24 48 48 89 c8 4c 29 f8 48 63 f7 48 99 48 89 74 24 70 <48> f7 fe 48 29 d1 48 8d 04 0f 49 89 cc 48 89 44 24 20 49 8d 85 10 RSP: 0018:ffffc90000acf260 EFLAGS: 00010206 RAX: 177450e0347560cf RBX: 0000000000000000 RCX: 177450e0347560cf RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000100000000 RBP: 0000000000000056 R08: 0000000000000000 R09: ffffed10020a0934 R10: ffff8880105049a7 R11: ffff88806cf3a520 R12: ffff888010504800 R13: ffff88800c00d800 R14: ffff8880105049a0 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88806cf00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f0edf84f0e8 CR3: 000000000d73c002 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: <TASK> get_packet_txtime net/sched/sch_taprio.c:508 [inline] taprio_enqueue_one+0x900/0xff0 net/sched/sch_taprio.c:577 taprio_enqueue+0x378/0xae0 net/sched/sch_taprio.c:658 dev_qdisc_enqueue+0x46/0x170 net/core/dev.c:3732 __dev_xmit_skb net/core/dev.c:3821 [inline] __dev_queue_xmit+0x1b2f/0x3000 net/core/dev.c:4169 dev_queue_xmit include/linux/netdevice.h:3088 [inline] neigh_resolve_output net/core/neighbour.c:1552 [inline] neigh_resolve_output+0x4a7/0x780 net/core/neighbour.c:1532 neigh_output include/net/neighbour.h:544 [inline] ip6_finish_output2+0x924/0x17d0 net/ipv6/ip6_output.c:135 __ip6_finish_output+0x620/0xaa0 net/ipv6/ip6_output.c:196 ip6_finish_output net/ipv6/ip6_output.c:207 [inline] NF_HOOK_COND include/linux/netfilter.h:292 [inline] ip6_output+0x206/0x410 net/ipv6/ip6_output.c:228 dst_output include/net/dst.h:458 [inline] NF_HOOK.constprop.0+0xea/0x260 include/linux/netfilter.h:303 ndisc_send_skb+0x872/0xe80 net/ipv6/ndisc.c:508 ndisc_send_ns+0xb5/0x130 net/ipv6/ndisc.c:666 addrconf_dad_work+0xc14/0x13f0 net/ipv6/addrconf.c:4175 process_one_work+0x92c/0x13a0 kernel/workqueue.c:2597 worker_thread+0x60f/0x1240 kernel/workqueue.c:2748 kthread+0x2fe/0x3f0 kernel/kthread.c:389 ret_from_fork+0x2c/0x50 arch/x86/entry/entry_64.S:308 </TASK> Modules linked in:
Fixes: 4cfd5779bd6e ("taprio: Add support for txtime-assist mode") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Co-developed-by: Eric Dumazet <edumazet@google.com> Co-developed-by: Pedro Tammela <pctammela@mojatatu.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34 |
|
#
2b84960f |
| 09-Jun-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: report class offload stats per TXQ, not per TC
The taprio Qdisc creates child classes per netdev TX queue, but taprio_dump_class_stats() currently reports offload statistics per t
net/sched: taprio: report class offload stats per TXQ, not per TC
The taprio Qdisc creates child classes per netdev TX queue, but taprio_dump_class_stats() currently reports offload statistics per traffic class. Traffic classes are groups of TXQs sharing the same dequeue priority, so this is incorrect and we shouldn't be bundling up the TXQ stats when reporting them, as we currently do in enetc.
Modify the API from taprio to drivers such that they report TXQ offload stats and not TC offload stats.
There is no change in the UAPI or in the global Qdisc stats.
Fixes: 6c1adb650c8d ("net/sched: taprio: add netlink reporting for offload statistics counters") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v6.1.33 |
|
#
d457a0e3 |
| 08-Jun-2023 |
Eric Dumazet <edumazet@google.com> |
net: move gso declarations and functions to their own files
Move declarations into include/net/gso.h and code into net/core/gso.c
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stanislav Fom
net: move gso declarations and functions to their own files
Move declarations into include/net/gso.h and code into net/core/gso.c
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stanislav Fomichev <sdf@google.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230608191738.3947077-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
be3618d9 |
| 08-Jun-2023 |
Zhengchao Shao <shaozhengchao@huawei.com> |
net/sched: taprio: fix slab-out-of-bounds Read in taprio_dequeue_from_txq
As shown in [1], out-of-bounds access occurs in two cases: 1)when the qdisc of the taprio type is used to replace the previo
net/sched: taprio: fix slab-out-of-bounds Read in taprio_dequeue_from_txq
As shown in [1], out-of-bounds access occurs in two cases: 1)when the qdisc of the taprio type is used to replace the previously configured taprio, count and offset in tc_to_txq can be set to 0. In this case, the value of *txq in taprio_next_tc_txq() will increases continuously. When the number of accessed queues exceeds the number of queues on the device, out-of-bounds access occurs. 2)When packets are dequeued, taprio can be deleted. In this case, the tc rule of dev is cleared. The count and offset values are also set to 0. In this case, out-of-bounds access is also caused.
Now the restriction on the queue number is added.
[1] https://groups.google.com/g/syzkaller-bugs/c/_lYOKgkBVMg Fixes: 2f530df76c8c ("net/sched: taprio: give higher priority to higher TCs in software dequeue mode") Reported-by: syzbot+04afcb3d2c840447559a@syzkaller.appspotmail.com Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Tested-by: Pedro Tammela <pctammela@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
d636fc5d |
| 06-Jun-2023 |
Eric Dumazet <edumazet@google.com> |
net: sched: add rcu annotations around qdisc->qdisc_sleeping
syzbot reported a race around qdisc->qdisc_sleeping [1]
It is time we add proper annotations to reads and writes to/from qdisc->qdisc_sl
net: sched: add rcu annotations around qdisc->qdisc_sleeping
syzbot reported a race around qdisc->qdisc_sleeping [1]
It is time we add proper annotations to reads and writes to/from qdisc->qdisc_sleeping.
[1] BUG: KCSAN: data-race in dev_graft_qdisc / qdisc_lookup_rcu
read to 0xffff8881286fc618 of 8 bytes by task 6928 on cpu 1: qdisc_lookup_rcu+0x192/0x2c0 net/sched/sch_api.c:331 __tcf_qdisc_find+0x74/0x3c0 net/sched/cls_api.c:1174 tc_get_tfilter+0x18f/0x990 net/sched/cls_api.c:2547 rtnetlink_rcv_msg+0x7af/0x8c0 net/core/rtnetlink.c:6386 netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413 netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline] netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365 netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg net/socket.c:747 [inline] ____sys_sendmsg+0x375/0x4c0 net/socket.c:2503 ___sys_sendmsg net/socket.c:2557 [inline] __sys_sendmsg+0x1e3/0x270 net/socket.c:2586 __do_sys_sendmsg net/socket.c:2595 [inline] __se_sys_sendmsg net/socket.c:2593 [inline] __x64_sys_sendmsg+0x46/0x50 net/socket.c:2593 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd
write to 0xffff8881286fc618 of 8 bytes by task 6912 on cpu 0: dev_graft_qdisc+0x4f/0x80 net/sched/sch_generic.c:1115 qdisc_graft+0x7d0/0xb60 net/sched/sch_api.c:1103 tc_modify_qdisc+0x712/0xf10 net/sched/sch_api.c:1693 rtnetlink_rcv_msg+0x807/0x8c0 net/core/rtnetlink.c:6395 netlink_rcv_skb+0x126/0x220 net/netlink/af_netlink.c:2546 rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:6413 netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline] netlink_unicast+0x56f/0x640 net/netlink/af_netlink.c:1365 netlink_sendmsg+0x665/0x770 net/netlink/af_netlink.c:1913 sock_sendmsg_nosec net/socket.c:724 [inline] sock_sendmsg net/socket.c:747 [inline] ____sys_sendmsg+0x375/0x4c0 net/socket.c:2503 ___sys_sendmsg net/socket.c:2557 [inline] __sys_sendmsg+0x1e3/0x270 net/socket.c:2586 __do_sys_sendmsg net/socket.c:2595 [inline] __se_sys_sendmsg net/socket.c:2593 [inline] __x64_sys_sendmsg+0x46/0x50 net/socket.c:2593 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd
Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 6912 Comm: syz-executor.5 Not tainted 6.4.0-rc3-syzkaller-00190-g0d85b27b0cc6 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/16/2023
Fixes: 3a7d0d07a386 ("net: sched: extend Qdisc with rcu") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Vlad Buslov <vladbu@nvidia.com> Acked-by: Jamal Hadi Salim<jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v6.1.32, v6.1.31 |
|
#
6c1adb65 |
| 30-May-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: add netlink reporting for offload statistics counters
Offloading drivers may report some additional statistics counters, some of them even suggested by 802.1Q, like TransmissionOv
net/sched: taprio: add netlink reporting for offload statistics counters
Offloading drivers may report some additional statistics counters, some of them even suggested by 802.1Q, like TransmissionOverrun.
In my opinion we don't have to limit ourselves to reporting counters only globally to the Qdisc/interface, especially if the device has more detailed reporting (per traffic class), since the more detailed info is valuable for debugging and can help identifying who is exceeding its time slot.
But on the other hand, some devices may not be able to report both per TC and global stats.
So we end up reporting both ways, and use the good old ethtool_put_stat() strategy to determine which statistics are supported by this NIC. Statistics which aren't set are simply not reported to netlink. For this reason, we need something dynamic (a nlattr nest) to be reported through TCA_STATS_APP, and not something daft like the fixed-size and inextensible struct tc_codel_xstats. A good model for xstats which are a nlattr nest rather than a fixed struct seems to be cake.
# Global stats $ tc -s qdisc show dev eth0 root # Per-tc stats $ tc -s class show dev eth0
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
2d800bc5 |
| 30-May-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: replace tc_taprio_qopt_offload :: enable with a "cmd" enum
Inspired from struct flow_cls_offload :: cmd, in order for taprio to be able to report statistics (which is future work)
net/sched: taprio: replace tc_taprio_qopt_offload :: enable with a "cmd" enum
Inspired from struct flow_cls_offload :: cmd, in order for taprio to be able to report statistics (which is future work), it seems that we need to drill one step further with the ndo_setup_tc(TC_SETUP_QDISC_TAPRIO) multiplexing, and pass the command as part of the common portion of the muxed structure.
Since we already have an "enable" variable in tc_taprio_qopt_offload, refactor all drivers to check for "cmd" instead of "enable", and reject every other command except "replace" and "destroy" - to be future proof.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com> # for lan966x Acked-by: Kurt Kanzenbach <kurt@linutronix.de> # hellcreek Reviewed-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com> Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
dced11ef |
| 30-May-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: don't overwrite "sch" variable in taprio_dump_class_stats()
In taprio_dump_class_stats() we don't need a reference to the root Qdisc once we get the reference to the child corresp
net/sched: taprio: don't overwrite "sch" variable in taprio_dump_class_stats()
In taprio_dump_class_stats() we don't need a reference to the root Qdisc once we get the reference to the child corresponding to this traffic class, so it's okay to overwrite "sch". But in a future patch we will need the root Qdisc too, so create a dedicated "child" pointer variable to hold the child reference. This also makes the code adhere to a more conventional coding style.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v6.1.30, v6.1.29, v6.1.28, v6.1.27, v6.1.26, v6.3, v6.1.25, v6.1.24 |
|
#
a721c3e5 |
| 11-Apr-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: allow per-TC user input of FP adminStatus
This is a duplication of the FP adminStatus logic introduced for tc-mqprio. Offloading is done through the tc_mqprio_qopt_offload structu
net/sched: taprio: allow per-TC user input of FP adminStatus
This is a duplication of the FP adminStatus logic introduced for tc-mqprio. Offloading is done through the tc_mqprio_qopt_offload structure embedded within tc_taprio_qopt_offload. So practically, if a device driver is written to treat the mqprio portion of taprio just like standalone mqprio, it gets unified handling of frame preemption.
I would have reused more code with taprio, but this is mostly netlink attribute parsing, which is hard to transform into generic code without having something that stinks as a result. We have the same variables with the same semantics, just different nlattr type values (TCA_MQPRIO_TC_ENTRY=5 vs TCA_TAPRIO_ATTR_TC_ENTRY=12; TCA_MQPRIO_TC_ENTRY_FP=2 vs TCA_TAPRIO_TC_ENTRY_FP=3, etc) and consequently, different policies for the nest.
Every time nla_parse_nested() is called, an on-stack table "tb" of nlattr pointers is allocated statically, up to the maximum understood nlattr type. That array size is hardcoded as a constant, but when transforming this into a common parsing function, it would become either a VLA (which the Linux kernel rightfully doesn't like) or a call to the allocator.
Having FP adminStatus in tc-taprio can be seen as addressing the 802.1Q Annex S.3 "Scheduling and preemption used in combination, no HOLD/RELEASE" and S.4 "Scheduling and preemption used in combination with HOLD/RELEASE" use cases. HOLD and RELEASE events are emitted towards the underlying MAC Merge layer when the schedule hits a Set-And-Hold-MAC or a Set-And-Release-MAC gate operation. So within the tc-taprio UAPI space, one can distinguish between the 2 use cases by choosing whether to use the TC_TAPRIO_CMD_SET_AND_HOLD and TC_TAPRIO_CMD_SET_AND_RELEASE gate operations within the schedule, or just TC_TAPRIO_CMD_SET_GATES.
A small part of the change is dedicated to refactoring the max_sdu nlattr parsing to put all logic under the "if" that tests for presence of that nlattr.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ferenc Fejes <fejes@inf.elte.hu> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
c54876cd |
| 11-Apr-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: pass netlink extack to mqprio and taprio offload
With the multiplexed ndo_setup_tc() model which lacks a first-class struct netlink_ext_ack * argument, the only way to pass the netlink ex
net/sched: pass netlink extack to mqprio and taprio offload
With the multiplexed ndo_setup_tc() model which lacks a first-class struct netlink_ext_ack * argument, the only way to pass the netlink extended ACK message down to the device driver is to embed it within the offload structure.
Do this for struct tc_mqprio_qopt_offload and struct tc_taprio_qopt_offload.
Since struct tc_taprio_qopt_offload also contains a tc_mqprio_qopt_offload structure, and since device drivers might effectively reuse their mqprio implementation for the mqprio portion of taprio, we make taprio set the extack in both offload structures to point at the same netlink extack message.
In fact, the taprio handling is a bit more tricky, for 2 reasons.
First is because the offload structure has a longer lifetime than the extack structure. The driver is supposed to populate the extack synchronously from ndo_setup_tc() and leave it alone afterwards. To not have any use-after-free surprises, we zero out the extack pointer when we leave taprio_enable_offload().
The second reason is because taprio does overwrite the extack message on ndo_setup_tc() error. We need to switch to the weak form of setting an extack message, which preserves a potential message set by the driver.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v6.1.23, v6.1.22, v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15, v6.1.14, v6.1.13, v6.2 |
|
#
64cb6aad |
| 15-Feb-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: dynamic max_sdu larger than the max_mtu is unlimited
It makes no sense to keep randomly large max_sdu values, especially if larger than the device's max_mtu. These are visible in
net/sched: taprio: dynamic max_sdu larger than the max_mtu is unlimited
It makes no sense to keep randomly large max_sdu values, especially if larger than the device's max_mtu. These are visible in "tc qdisc show". Such a max_sdu is practically unlimited and will cause no packets for that traffic class to be dropped on enqueue.
Just set max_sdu_dynamic to U32_MAX, which in the logic below causes taprio to save a max_frm_len of U32_MAX and a max_sdu presented to user space of 0 (unlimited).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
show more ...
|
#
bdf366bd |
| 15-Feb-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: don't allow dynamic max_sdu to go negative after stab adjustment
The overhead specified in the size table comes from the user. With small time intervals (or gates always closed),
net/sched: taprio: don't allow dynamic max_sdu to go negative after stab adjustment
The overhead specified in the size table comes from the user. With small time intervals (or gates always closed), the overhead can be larger than the max interval for that traffic class, and their difference is negative.
What we want to happen is for max_sdu_dynamic to have the smallest non-zero value possible (1) which means that all packets on that traffic class are dropped on enqueue. However, since max_sdu_dynamic is u32, a negative is represented as a large value and oversized dropping never happens.
Use max_t with int to force a truncation of max_frm_len to no smaller than dev->hard_header_len + 1, which in turn makes max_sdu_dynamic no smaller than 1.
Fixes: fed87cc6718a ("net/sched: taprio: automatically calculate queueMaxSDU based on TC gate durations") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
show more ...
|
#
09dbdf28 |
| 15-Feb-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: fix calculation of maximum gate durations
taprio_calculate_gate_durations() depends on netdev_get_num_tc() and this returns 0. So it calculates the maximum gate durations for no t
net/sched: taprio: fix calculation of maximum gate durations
taprio_calculate_gate_durations() depends on netdev_get_num_tc() and this returns 0. So it calculates the maximum gate durations for no traffic class.
I had tested the blamed commit only with another patch in my tree, one which in the end I decided isn't valuable enough to submit ("net/sched: taprio: mask off bits in gate mask that exceed number of TCs").
The problem is that having this patch threw off my testing. By moving the netdev_set_num_tc() call earlier, we implicitly gave to taprio_calculate_gate_durations() the information it needed.
Extract only the portion from the unsubmitted change which applies the mqprio configuration to the netdev earlier.
Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230130173145.475943-15-vladimir.oltean@nxp.com/ Fixes: a306a90c8ffe ("net/sched: taprio: calculate tc gate durations") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
show more ...
|
Revision tags: v6.1.12, v6.1.11 |
|
#
39b02d6d |
| 07-Feb-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: don't segment unnecessarily
Improve commit 497cc00224cf ("taprio: Handle short intervals and large packets") to only perform segmentation when skb->len exceeds what taprio_dequeue
net/sched: taprio: don't segment unnecessarily
Improve commit 497cc00224cf ("taprio: Handle short intervals and large packets") to only perform segmentation when skb->len exceeds what taprio_dequeue() expects.
In practice, this will make the biggest difference when a traffic class gate is always open in the schedule. This is because the max_frm_len will be U32_MAX, and such large skb->len values as Kurt reported will be sent just fine unsegmented.
What I don't seem to know how to handle is how to make sure that the segmented skbs themselves are smaller than the maximum frame size given by the current queueMaxSDU[tc]. Nonetheless, we still need to drop those, otherwise the Qdisc will hang.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
2d5e8071 |
| 07-Feb-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: split segmentation logic from qdisc_enqueue()
The majority of the taprio_enqueue()'s function is spent doing TCP segmentation, which doesn't look right to me. Compilers shouldn't
net/sched: taprio: split segmentation logic from qdisc_enqueue()
The majority of the taprio_enqueue()'s function is spent doing TCP segmentation, which doesn't look right to me. Compilers shouldn't have a problem in inlining code no matter how we write it, so move the segmentation logic to a separate function.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
fed87cc6 |
| 07-Feb-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: automatically calculate queueMaxSDU based on TC gate durations
taprio today has a huge problem with small TC gate durations, because it might accept packets in taprio_enqueue() wh
net/sched: taprio: automatically calculate queueMaxSDU based on TC gate durations
taprio today has a huge problem with small TC gate durations, because it might accept packets in taprio_enqueue() which will never be sent by taprio_dequeue().
Since not much infrastructure was available, a kludge was added in commit 497cc00224cf ("taprio: Handle short intervals and large packets"), which segmented large TCP segments, but the fact of the matter is that the issue isn't specific to large TCP segments (and even worse, the performance penalty in segmenting those is absolutely huge).
In commit a54fc09e4cba ("net/sched: taprio: allow user input of per-tc max SDU"), taprio gained support for queueMaxSDU, which is precisely the mechanism through which packets should be dropped at qdisc_enqueue() if they cannot be sent.
After that patch, it was necessary for the user to manually limit the maximum MTU per TC. This change adds the necessary logic for taprio to further limit the values specified (or not specified) by the user to some minimum values which never allow oversized packets to be sent.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
a878fd46 |
| 07-Feb-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: keep the max_frm_len information inside struct sched_gate_list
I have one practical reason for doing this and one concerning correctness.
The practical reason has to do with a follow-up
net/sched: keep the max_frm_len information inside struct sched_gate_list
I have one practical reason for doing this and one concerning correctness.
The practical reason has to do with a follow-up patch, which aims to mix 2 sources of max_sdu (one coming from the user and the other automatically calculated based on TC gate durations @current link speed). Among those 2 sources of input, we must always select the smaller max_sdu value, but this can change at various link speeds. So the max_sdu coming from the user must be kept separated from the value that is operationally used (the minimum of the 2), because otherwise we overwrite it and forget what the user asked us to do.
To solve that, this patch proposes that struct sched_gate_list contains the operationally active max_frm_len, and q->max_sdu contains just what was requested by the user.
The reason having to do with correctness is based on the following observation: the admin sched_gate_list becomes operational at a given base_time in the future. Until then, it is inactive and applies no shaping, all gates are open, etc. So the queueMaxSDU dropping shouldn't apply either (this is a mechanism to ensure that packets smaller than the largest gate duration for that TC don't hang the port; clearly it makes little sense if the gates are always open).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
a3d91b2c |
| 07-Feb-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: warn about missing size table
Vinicius intended taprio to take the L1 overhead into account when estimating packet transmission time through user input, specifically through the q
net/sched: taprio: warn about missing size table
Vinicius intended taprio to take the L1 overhead into account when estimating packet transmission time through user input, specifically through the qdisc size table (man tc-stab).
Something like this:
tc qdisc replace dev $eth root stab overhead 24 taprio \ num_tc 8 \ map 0 1 2 3 4 5 6 7 \ queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \ base-time 0 \ sched-entry S 0x7e 9000000 \ sched-entry S 0x82 1000000 \ max-sdu 0 0 0 0 0 0 0 200 \ flags 0x0 clockid CLOCK_TAI
Without the overhead being specified, transmission times will be underestimated and will cause late transmissions. For an offloading driver, it might even cause TX hangs if there is no open gate large enough to send the maximum sized packets for that TC (including L1 overhead). Properly knowing the L1 overhead will ensure that we are able to auto-calculate the queueMaxSDU per traffic class just right, and avoid these hangs due to head-of-line blocking.
We can't make the stab mandatory due to existing setups, but we can warn the user that it's important with a warning netlink extack.
Link: https://patchwork.kernel.org/project/netdevbpf/patch/20220505160357.298794-1-vladimir.oltean@nxp.com/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
a1e6ad30 |
| 07-Feb-2023 |
Vladimir Oltean <vladimir.oltean@nxp.com> |
net/sched: taprio: calculate guard band against actual TC gate close time
taprio_dequeue_from_txq() looks at the entry->end_time to determine whether the skb will overrun its traffic class gate, as
net/sched: taprio: calculate guard band against actual TC gate close time
taprio_dequeue_from_txq() looks at the entry->end_time to determine whether the skb will overrun its traffic class gate, as if at the end of the schedule entry there surely is a "gate close" event for it. Hint: maybe there isn't.
For each schedule entry, introduce an array of kernel times which actually tracks when in the future will there be an *actual* gate close event for that traffic class, and use that in the guard band overrun calculation.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|