Revision tags: v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44, v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39 |
|
#
e0f52298 |
| 18-Jul-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: xsk: Fix invalid buffer access for legacy rq
The below crash can be encountered when using xdpsock in rx mode for legacy rq: the buffer gets released in the XDP_REDIRECT path, and then on
net/mlx5e: xsk: Fix invalid buffer access for legacy rq
The below crash can be encountered when using xdpsock in rx mode for legacy rq: the buffer gets released in the XDP_REDIRECT path, and then once again in the driver. This fix sets the flag to avoid releasing on the driver side.
XSK handling of buffers for legacy rq was relying on the caller to set the skip release flag. But the referenced fix started using fragment counts for pages instead of the skip flag.
Crash log: general protection fault, probably for non-canonical address 0xffff8881217e3a: 0000 [#1] SMP CPU: 0 PID: 14 Comm: ksoftirqd/0 Not tainted 6.5.0-rc1+ #31 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:bpf_prog_03b13f331978c78c+0xf/0x28 Code: ... RSP: 0018:ffff88810082fc98 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888138404901 RCX: c0ffffc900027cbc RDX: ffffffffa000b514 RSI: 00ffff8881217e32 RDI: ffff888138404901 RBP: ffff88810082fc98 R08: 0000000000091100 R09: 0000000000000006 R10: 0000000000000800 R11: 0000000000000800 R12: ffffc9000027a000 R13: ffff8881217e2dc0 R14: ffff8881217e2910 R15: ffff8881217e2f00 FS: 0000000000000000(0000) GS:ffff88852c800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000564cb2e2cde0 CR3: 000000010e603004 CR4: 0000000000370eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? die_addr+0x32/0x80 ? exc_general_protection+0x192/0x390 ? asm_exc_general_protection+0x22/0x30 ? 0xffffffffa000b514 ? bpf_prog_03b13f331978c78c+0xf/0x28 mlx5e_xdp_handle+0x48/0x670 [mlx5_core] ? dev_gro_receive+0x3b5/0x6e0 mlx5e_xsk_skb_from_cqe_linear+0x6e/0x90 [mlx5_core] mlx5e_handle_rx_cqe+0x55/0x100 [mlx5_core] mlx5e_poll_rx_cq+0x87/0x6e0 [mlx5_core] mlx5e_napi_poll+0x45e/0x6b0 [mlx5_core] __napi_poll+0x25/0x1a0 net_rx_action+0x28a/0x300 __do_softirq+0xcd/0x279 ? sort_range+0x20/0x20 run_ksoftirqd+0x1a/0x20 smpboot_thread_fn+0xa2/0x130 kthread+0xc9/0xf0 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> Modules linked in: mlx5_ib mlx5_core rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter overlay zram zsmalloc fuse [last unloaded: mlx5_core] ---[ end trace 0000000000000000 ]---
Fixes: 7abd955a58fb ("net/mlx5e: RX, Fix page_pool page fragment tracking for XDP") Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
show more ...
|
Revision tags: v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32, v6.1.31, v6.1.30, v6.1.29, v6.1.28, v6.1.27, v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23, v6.1.22, v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15, v6.1.14, v6.1.13 |
|
#
3f93f829 |
| 21-Feb-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Defer page release in legacy rq for better recycling
Currently, fragmented pages from the page pool can be released in two ways:
1) In the mlx5e driver when trimming off the unused f
net/mlx5e: RX, Defer page release in legacy rq for better recycling
Currently, fragmented pages from the page pool can be released in two ways:
1) In the mlx5e driver when trimming off the unused fragments AND the associated skb fragments have been released. This path allows recycling of pages to the page pool cache (allow_direct == true).
2) On the skb release path (last fragment release), which will always release pages to the page pool ring (allow_direct == false).
Whichever is releasing the last fragment will be decisive on where the page gets released: the cache or the ring. So we obviously want to maximize for doing the release from 1.
This patch does that by deferring the release of page fragments right before requesting new ones from the page pool. A flag is added to make sure that there's no release before first alloc and that XDP_TX fragments are not released prematurely.
This is a preparation patch that doesn't unlock the performance improvements yet. A followup patch will do that.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
show more ...
|
Revision tags: v6.2 |
|
#
38a36efc |
| 17-Feb-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Rename xdp_xmit_bitmap to a more generic name
The xdp_xmit_bitmap currently serves only one purpose: to avoid releasing pages that are still in use due to XDP TX.
A following patch w
net/mlx5e: RX, Rename xdp_xmit_bitmap to a more generic name
The xdp_xmit_bitmap currently serves only one purpose: to avoid releasing pages that are still in use due to XDP TX.
A following patch will use this bitmap in a slightly different context but for the same purpose. So rename the bitmap to a more generic name that reflects the purpose not the context.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
show more ...
|
Revision tags: v6.1.12, v6.1.11, v6.1.10, v6.1.9 |
|
#
d39092ca |
| 30-Jan-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Remove alloc unit layout constraint for striding rq
This change removes the usage of mlx5e_alloc_unit union for striding rq. The change is more straightforward than legacy rq as the a
net/mlx5e: RX, Remove alloc unit layout constraint for striding rq
This change removes the usage of mlx5e_alloc_unit union for striding rq. The change is more straightforward than legacy rq as the alloc units union is already in place.
This patch only moves things around: instead of an array of unions make it a union of arrays. This has the effect that each mlx5e_mpw_info will allocate the largest possible size of the array member. For xsk this means that the array of xdp_buff pointers for the wqe will still be contiguous, but there will be some extra unused bytes at the end of the array.
As further patch in the series will add the mlx5e_frag_page struct for which the described size constraint will no longer hold.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
show more ...
|
#
8fb1814f |
| 27-Jan-2023 |
Dragos Tatulea <dtatulea@nvidia.com> |
net/mlx5e: RX, Remove alloc unit layout constraint for legacy rq
The mlx5e_alloc_unit union is conveniently used to store arrays of pointers to struct page or struct xdp_buff (for xsk). The union is
net/mlx5e: RX, Remove alloc unit layout constraint for legacy rq
The mlx5e_alloc_unit union is conveniently used to store arrays of pointers to struct page or struct xdp_buff (for xsk). The union is currently expected to have the size of a pointer for xsk batch allocations to work. This is conveniet for the current state of the code but makes it impossible to add a structure of a different size to the alloc unit.
A further patch in the series will add the mlx5e_frag_page struct for which the described size constraint will no longer hold.
This change removes the usage of mlx5e_alloc_unit union for legacy rq:
- A union of arrays is introduced (mlx5e_alloc_units) to replace the array of unions to allow structures of different sizes.
- Each fragment has a pointer to a unit in the mlx5e_alloc_units array.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
show more ...
|
#
9da5294e |
| 15-Feb-2023 |
Tariq Toukan <tariqt@nvidia.com> |
net/mlx5e: Remove redundant page argument in mlx5e_xdp_handle()
Remove the page parameter, it can be derived from the xdp_buff member of mlx5e_xdp_buff.
Signed-off-by: Tariq Toukan <tariqt@nvidia.c
net/mlx5e: Remove redundant page argument in mlx5e_xdp_handle()
Remove the page parameter, it can be derived from the xdp_buff member of mlx5e_xdp_buff.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
show more ...
|
Revision tags: v6.1.8 |
|
#
bc8d405b |
| 19-Jan-2023 |
Toke Høiland-Jørgensen <toke@redhat.com> |
net/mlx5e: Support RX XDP metadata
Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe pointer to the mlx5e_skb_from* functions so it can be retrieved from the XDP ctx to do th
net/mlx5e: Support RX XDP metadata
Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe pointer to the mlx5e_skb_from* functions so it can be retrieved from the XDP ctx to do this.
Cc: Tariq Toukan <tariqt@nvidia.com> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Anatoly Burakov <anatoly.burakov@intel.com> Cc: Alexander Lobakin <alexandr.lobakin@intel.com> Cc: Magnus Karlsson <magnus.karlsson@gmail.com> Cc: Maryam Tahhan <mtahhan@redhat.com> Cc: xdp-hints@xdp-project.net Cc: netdev@vger.kernel.org Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20230119221536.3349901-17-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
show more ...
|
#
384a13ca |
| 19-Jan-2023 |
Toke Høiland-Jørgensen <toke@redhat.com> |
net/mlx5e: Introduce wrapper for xdp_buff
Preparation for implementing HW metadata kfuncs. No functional change.
Cc: Tariq Toukan <tariqt@nvidia.com> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: John
net/mlx5e: Introduce wrapper for xdp_buff
Preparation for implementing HW metadata kfuncs. No functional change.
Cc: Tariq Toukan <tariqt@nvidia.com> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Anatoly Burakov <anatoly.burakov@intel.com> Cc: Alexander Lobakin <alexandr.lobakin@intel.com> Cc: Magnus Karlsson <magnus.karlsson@gmail.com> Cc: Maryam Tahhan <mtahhan@redhat.com> Cc: xdp-hints@xdp-project.net Cc: netdev@vger.kernel.org Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20230119221536.3349901-16-sdf@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
show more ...
|
Revision tags: v6.1.7, v6.1.6, v6.1.5, v6.0.19, v6.0.18, v6.1.4, v6.1.3, v6.0.17, v6.1.2, v6.0.16, v6.1.1, v6.0.15, v6.0.14, v6.0.13, v6.1, v6.0.12, v6.0.11, v6.0.10, v5.15.80, v6.0.9, v5.15.79, v6.0.8, v5.15.78, v6.0.7, v5.15.77, v5.15.76, v6.0.6, v6.0.5, v5.15.75, v6.0.4, v6.0.3, v6.0.2, v5.15.74, v5.15.73, v6.0.1, v5.15.72, v6.0 |
|
#
c2c9e31d |
| 01-Oct-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: xsk: Optimize for unaligned mode with 3072-byte frames
When XSK frame size is 3072 (or another power of two multiplied by 3), KLM mechanism for NIC virtual memory page mapping can be opti
net/mlx5e: xsk: Optimize for unaligned mode with 3072-byte frames
When XSK frame size is 3072 (or another power of two multiplied by 3), KLM mechanism for NIC virtual memory page mapping can be optimized by replacing it with KSM.
Before this change, two KLM entries were needed to map an XSK frame that is not a power of two: one entry maps the UMEM memory up to the frame length, the other maps the rest of the stride to the garbage page.
When the frame length divided by 3 is a power of two, it can be mapped using 3 KSM entries, and the fourth will map the rest of the stride to the garbage page. All 4 KSM entries are of the same size, which allows for a much faster lookup.
Frame size 3072 is useful in certain use cases, because it allows packing 4 frames into 3 pages. Generally speaking, other frame sizes equal to PAGE_SIZE minus a power of two can be optimized in a similar way, but it will require many more KSMs per frame, which slows down UMRs a little bit, but more importantly may hit the limit for the maximum number of KSM entries.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
13921345 |
| 01-Oct-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: xsk: Use KLM to protect frame overrun in unaligned mode
XSK RQs support striding RQ linear mode, but the stride size may be bigger than the XSK frame size, because:
1. The stride size mu
net/mlx5e: xsk: Use KLM to protect frame overrun in unaligned mode
XSK RQs support striding RQ linear mode, but the stride size may be bigger than the XSK frame size, because:
1. The stride size must be a power of two.
2. The stride size must be equal to the UMR page size. Each XSK frame is treated as a separate page, because they aren't necessarily adjacent in physical memory, so the driver can't put more than one stride per page.
3. The minimal MTT page size is 4096 on older firmware.
That means that if XSK frame size is 2048 or not a power of two, the strides may be bigger than XSK frames. Normally, it's not a problem if the hardware enforces the MTU. However, traffic between vports skips the hardware MTU check, and oversized packets may be received.
If an oversized packet is bigger than the XSK frame but not bigger than the stride, it will cause overwriting of the adjacent UMEM region. If the packet takes more than one stride, they can be recycled for reuse, so it's not a problem when the XSK frame size matches the stride size.
Work around the above issue by leveraging KLM to make a more fine-grained mapping. The beginning of each stride is mapped to the frame memory, and the padding up to the closest power of two is mapped to the overflow page that doesn't belong to UMEM. This way, application data corruption won't happen upon receiving packets bigger than MTU.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
9f123f74 |
| 01-Oct-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Improve MTT/KSM alignment
Make mlx5e_mpwrq_mtts_per_wqe take into account that KSM requires smaller alignment than MTT.
Ensure that there is always an even amount of MTTs in a UMR WQE, s
net/mlx5e: Improve MTT/KSM alignment
Make mlx5e_mpwrq_mtts_per_wqe take into account that KSM requires smaller alignment than MTT.
Ensure that there is always an even amount of MTTs in a UMR WQE, so that complete octwords are formed, and no garbage is mapped.
Drop extra alignment in MLX5_MTT_OCTW that may cause setting too big ucseg->xlt_octowords, also leading to mapping garbage.
Generalize some calculations by introducing the MLX5_OCTWORD constant.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
168723c1 |
| 01-Oct-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: xsk: Use umr_mode to calculate striding RQ parameters
Instead of passing the unaligned flag, pass an enum that indicates the UMR mode. The next commit will add the third mode (KLM for cer
net/mlx5e: xsk: Use umr_mode to calculate striding RQ parameters
Instead of passing the unaligned flag, pass an enum that indicates the UMR mode. The next commit will add the third mode (KLM for certain configurations of XSK), which will be added to this enum instead of adding another bool flag everywhere.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
a752b2ed |
| 30-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: xsk: Support XDP metadata on XSK RQs
Add support for XDP metadata on XSK RQs for cross-program communication. The driver no longer calls xdp_set_data_meta_invalid and copies the metadata
net/mlx5e: xsk: Support XDP metadata on XSK RQs
Add support for XDP metadata on XSK RQs for cross-program communication. The driver no longer calls xdp_set_data_meta_invalid and copies the metadata to a newly allocated SKB on XDP_PASS.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
ddb7afee |
| 30-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Optimize RQ page deallocation
mlx5e_free_rx_mpwqe loops over all pages of a MPWQE, calling mlx5e_page_release for ones that are not scheduled for XDP_TX or XDP_REDIRECT; and mlx5e_page_re
net/mlx5e: Optimize RQ page deallocation
mlx5e_free_rx_mpwqe loops over all pages of a MPWQE, calling mlx5e_page_release for ones that are not scheduled for XDP_TX or XDP_REDIRECT; and mlx5e_page_release checks whether it's an XSK RQ or a regular one for each page/XSK frame. This check can be moved outside the loop to reduce the number of branches.
mlx5e_free_rx_wqe loops over all fragments, calling mlx5e_page_release for the ones that are last in a page; and mlx5e_page_release checks whether it's an XSK RQ or a regular one for each fragment. Using the fact that XSK doesn't support multiple fragments, it can be optimized for both XSK and regular usages:
1. Make an early check for XSK and call its deallocator directly, saving 3 branches (loop condition, frag->last_in_page and selection of deallocator).
2. Call the regular deallocator directly in the non-XSK case, saving a branch per fragment, except the first one.
After the changes, mlx5e_page_release is removed, as there are no callers left.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
cf544517 |
| 30-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: xsk: Use xsk_buff_alloc_batch on striding RQ
XSK provides a function to allocate frames in batches for more efficient processing. This commit starts using this function on striding RQ and
net/mlx5e: xsk: Use xsk_buff_alloc_batch on striding RQ
XSK provides a function to allocate frames in batches for more efficient processing. This commit starts using this function on striding RQ and creates an optimized flow for XSK. A side effect is an opportunity to optimize the regular RX flow by dropping branching for XSK cases.
Performance improvement is up to 6.4% in the aligned mode and up to 7.5% in the unaligned mode.
Aligned mode, 2048-byte frames: 12.9 Mpps -> 13.8 Mpps Aligned mode, 4096-byte frames: 11.8 Mpps -> 12.5 Mpps Unaligned mode, 2048-byte frames: 11.9 Mpps -> 12.8 Mpps Unaligned mode, 3072-byte frames: 11.4 Mpps -> 12.1 Mpps Unaligned mode, 4096-byte frames: 11.0 Mpps -> 11.2 Mpps
CPU: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
259bbc64 |
| 30-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: xsk: Use xsk_buff_alloc_batch on legacy RQ
XSK provides a function to allocate frames in batches for more efficient processing. This commit starts using this function on legacy RQ, adding
net/mlx5e: xsk: Use xsk_buff_alloc_batch on legacy RQ
XSK provides a function to allocate frames in batches for more efficient processing. This commit starts using this function on legacy RQ, adding a special case for XSK. The new branch introduced basically replaces the branch that was removed from the same place a few commits before.
A check is made that DMA sync is not needed, because the batching allocator falls back to returning one frame when DMA sync is needed, and this is best handled by the loop in the standard case.
Performance improvement is up to 8% in the aligned mode and up to 9% in the unaligned mode.
Aligned mode, 2048-byte frames: 12.8 Mpps -> 13.5 Mpps Aligned mode, 4096-byte frames: 11.5 Mpps -> 12.4 Mpps Unaligned mode, 2048-byte frames: 12.2 Mpps -> 13.4 Mpps Unaligned mode, 3072-byte frames: 11.6 Mpps -> 12.5 Mpps Unaligned mode, 4096-byte frames: 11.2 Mpps -> 12.2 Mpps
CPU: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
a2e5ba24 |
| 30-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: xsk: Split out WQE allocation for legacy XSK RQ
Allocation of XSK frames on legacy RQ may be made more efficient with a specialized routine that relies on certain assumptions, such as the
net/mlx5e: xsk: Split out WQE allocation for legacy XSK RQ
Allocation of XSK frames on legacy RQ may be made more efficient with a specialized routine that relies on certain assumptions, such as there is only one fragment, allocation units (XSK frames) are not shared among multiple packets. It reduces the number of branches both in the XSK code and in the regular RQ, because with this approach there is only a single check whether it's an XSK or regular RQ.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
79008676 |
| 29-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Rename mlx5e_dma_info to prepare for removal of DMA address
The next commit will remove the DMA address from the struct currently called mlx5e_dma_info, because the same value can be retr
net/mlx5e: Rename mlx5e_dma_info to prepare for removal of DMA address
The next commit will remove the DMA address from the struct currently called mlx5e_dma_info, because the same value can be retrieved with page_pool_get_dma_addr(page) in almost all cases, with the notable exception of SHAMPO (HW GRO implementation) that modifies this address on the fly, after the initial allocation.
To keep the SHAMPO logic intact, struct mlx5e_dma_info remains in the SHAMPO code, consisting of addr and page (XSK is not compatible with SHAMPO). The struct used in all other places is renamed to mlx5e_alloc_unit, allowing the next commit to remove the addr field without affecting SHAMPO.
The new name means "allocation unit", and it's more appropriate after the field with the DMA address gets removed.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v5.15.71 |
|
#
258e655c |
| 27-Sep-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Make dma_info array dynamic in struct mlx5e_mpw_info
This commit moves the dma_info array to the end of struct mlx5e_mpw_info to make it a flexible array. It also removes the intermediate
net/mlx5e: Make dma_info array dynamic in struct mlx5e_mpw_info
This commit moves the dma_info array to the end of struct mlx5e_mpw_info to make it a flexible array. It also removes the intermediate struct mlx5e_umr_dma_info, which used to contain only this array. The flexibility of dma_info will allow to choose its size dynamically in a following commit.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v5.15.70, v5.15.69, v5.15.68, v5.15.67, v5.15.66, v5.15.65, v5.15.64, v5.15.63, v5.15.62, v5.15.61, v5.15.60, v5.15.59, v5.19, v5.15.58, v5.15.57, v5.15.56, v5.15.55, v5.15.54, v5.15.53, v5.15.52, v5.15.51, v5.15.50, v5.15.49, v5.15.48, v5.15.47, v5.15.46, v5.15.45, v5.15.44, v5.15.43, v5.15.42, v5.18, v5.15.41, v5.15.40, v5.15.39, v5.15.38, v5.15.37, v5.15.36, v5.15.35, v5.15.34, v5.15.33, v5.15.32, v5.15.31, v5.17, v5.15.30, v5.15.29, v5.15.28, v5.15.27, v5.15.26, v5.15.25, v5.15.24, v5.15.23, v5.15.22, v5.15.21, v5.15.20, v5.15.19, v5.15.18, v5.15.17 |
|
#
84a137f0 |
| 20-Jan-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Drop error CQE handling from the XSK RX handler
This commit removes the redundant check and removes the unused cqe parameter of skb_from_cqe handlers.
Signed-off-by: Maxim Mikityanskiy <
net/mlx5e: Drop error CQE handling from the XSK RX handler
This commit removes the redundant check and removes the unused cqe parameter of skb_from_cqe handlers.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
show more ...
|
#
064990d0 |
| 27-Jan-2022 |
Maxim Mikityanskiy <maximmi@nvidia.com> |
net/mlx5e: Drop the len output parameter from mlx5e_xdp_handle
The len parameter of mlx5e_xdp_handle is used to output the new packet length after XDP has processed the packet and returned XDP_PASS.
net/mlx5e: Drop the len output parameter from mlx5e_xdp_handle
The len parameter of mlx5e_xdp_handle is used to output the new packet length after XDP has processed the packet and returned XDP_PASS. However, this value can be calculated on the caller site, as the caller knows if it was an XDP_PASS.
This commit drops the len parameter and moves the calculation to the caller, reducing the number of parameters passed to the function and preparing for XDP support in non-linear legacy RQ.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
show more ...
|
Revision tags: v5.4.173, v5.15.16 |
|
#
e26eceb9 |
| 19-Jan-2022 |
Tariq Toukan <tariqt@nvidia.com> |
net/mlx5e: RX, Test the XDP program existence out of the handler
Instead of early return inside mlx5e_xdp_handle(), let the caller check if an XDP program is loaded. This allows saving a few unnece
net/mlx5e: RX, Test the XDP program existence out of the handler
Instead of early return inside mlx5e_xdp_handle(), let the caller check if an XDP program is loaded. This allows saving a few unnecessary function calls and calculations in case !prog.
Performance test: single core, drop packets in iptables Before: 3,872,504 pps After: 3,975,628 pps (+2.66%)
Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Maxim Mikityanskiy <maximmi@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
show more ...
|
Revision tags: v5.15.15, v5.16, v5.15.10, v5.15.9, v5.15.8, v5.15.7, v5.15.6, v5.15.5, v5.15.4, v5.15.3, v5.15.2, v5.15.1, v5.15, v5.14.14, v5.14.13, v5.14.12, v5.14.11, v5.14.10, v5.14.9, v5.14.8, v5.14.7, v5.14.6, v5.10.67, v5.10.66, v5.14.5, v5.14.4, v5.10.65, v5.14.3, v5.10.64, v5.14.2, v5.10.63, v5.14.1, v5.10.62, v5.14, v5.10.61, v5.10.60, v5.10.53, v5.10.52, v5.10.51, v5.10.50, v5.10.49, v5.13, v5.10.46, v5.10.43, v5.10.42, v5.10.41, v5.10.40, v5.10.39, v5.4.119, v5.10.36, v5.10.35, v5.10.34, v5.4.116, v5.10.33, v5.12, v5.10.32, v5.10.31, v5.10.30, v5.10.27, v5.10.26, v5.10.25, v5.10.24, v5.10.23, v5.10.22, v5.10.21, v5.10.20, v5.10.19, v5.4.101, v5.10.18, v5.10.17, v5.11, v5.10.16, v5.10.15, v5.10.14, v5.10, v5.8.17, v5.8.16, v5.8.15, v5.9, v5.8.14, v5.8.13, v5.8.12, v5.8.11, v5.8.10, v5.8.9, v5.8.8, v5.8.7, v5.8.6, v5.4.62, v5.8.5, v5.8.4, v5.4.61, v5.8.3, v5.4.60, v5.8.2, v5.4.59, v5.8.1, v5.4.58, v5.4.57, v5.4.56, v5.8, v5.7.12, v5.4.55, v5.7.11, v5.4.54, v5.7.10, v5.4.53, v5.4.52, v5.7.9, v5.7.8, v5.4.51, v5.4.50, v5.7.7, v5.4.49, v5.7.6, v5.7.5, v5.4.48, v5.7.4, v5.7.3, v5.4.47 |
|
#
9c25a22d |
| 11-Jun-2020 |
Maxim Mikityanskiy <maximmi@mellanox.com> |
net/mlx5e: Use synchronize_rcu to sync with NAPI
As described in the previous commit, napi_synchronize doesn't quite fit the purpose when we just need to wait until the currently running NAPI quits.
net/mlx5e: Use synchronize_rcu to sync with NAPI
As described in the previous commit, napi_synchronize doesn't quite fit the purpose when we just need to wait until the currently running NAPI quits. Its implementation waits until NAPI is not running by polling and waiting for 1ms in between. In cases where we need to deactivate one queue (e.g., recovery flows) or where we deactivate them one-by-one (deactivate channel flow), we may get stuck in napi_synchronize forever if other queues keep NAPI active, causing a soft lockup. Depending on kernel configuration (CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC), it may result in a kernel panic.
To fix the issue, use synchronize_rcu to wait for NAPI to quit, and wrap the whole NAPI in rcu_read_lock.
Fixes: acc6c5953af1 ("net/mlx5e: Split open/close channels to stages") Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
show more ...
|
#
9647c57b |
| 28-Aug-2020 |
Magnus Karlsson <magnus.karlsson@intel.com> |
xsk: i40e: ice: ixgbe: mlx5: Test for dma_need_sync earlier for better performance
Test for dma_need_sync earlier to increase performance. xsk_buff_dma_sync_for_cpu() takes an xdp_buff as parameter
xsk: i40e: ice: ixgbe: mlx5: Test for dma_need_sync earlier for better performance
Test for dma_need_sync earlier to increase performance. xsk_buff_dma_sync_for_cpu() takes an xdp_buff as parameter and from that the xsk_buff_pool reference is dug out. Perf shows that this dereference causes a lot of cache misses. But as the buffer pool is now sent down to the driver at zero-copy initialization time, we might as well use this pointer directly, instead of going via the xsk_buff and we can do so already in xsk_buff_dma_sync_for_cpu() instead of in xp_dma_sync_for_cpu. This gets rid of these cache misses.
Throughput increases with 3% for the xdpsock l2fwd sample application on my machine.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Björn Töpel <bjorn.topel@intel.com> Link: https://lore.kernel.org/bpf/1598603189-32145-11-git-send-email-magnus.karlsson@intel.com
show more ...
|
#
e20f0dbf |
| 26-Aug-2020 |
Tariq Toukan <tariqt@mellanox.com> |
net/mlx5e: RX, Add a prefetch command for small L1_CACHE_BYTES
A single cacheline might not contain the packet header for small L1_CACHE_BYTES values. Use net_prefetch() as it issues an additional p
net/mlx5e: RX, Add a prefetch command for small L1_CACHE_BYTES
A single cacheline might not contain the packet header for small L1_CACHE_BYTES values. Use net_prefetch() as it issues an additional prefetch in this case.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|