3f734b8c | 17-Apr-2023 | Tariq Toukan <tariqt@nvidia.com>
net/mlx5e: XDP, Use multiple single-entry objects in xdpi_fifo
Here we fix the current wi->num_pkts abuse, as it was used to indicate multiple xdpi entries in the xdpi_fifo.
Instead, reduce mlx5e_xdp_info to the size of a single field, making it a union of unions. Per packet, use as many instances as needed to provide the information needed at the time of completion.
The sequence of xdpi instances pushed is well defined, derived by the xmit_mode.
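A minimal standalone model of this scheme (type and field names are illustrative, not the driver's): each FIFO slot is the size of a single field, and a packet pushes as many slots as its xmit_mode requires, popping them back in the same well-defined order at completion time.

  #include <stdint.h>
  #include <stdio.h>

  /* One FIFO entry is the size of a single field; its meaning depends
   * on its position in the per-packet sequence. */
  union xdpi_entry {
      int mode;           /* first entry: the xmit mode */
      void *page;         /* subsequent entries: e.g. a page to release */
      uint32_t frame_len; /* ... or a length, depending on the mode */
  };

  #define FIFO_SZ 64 /* power of two so the index mask works */

  struct xdpi_fifo {
      union xdpi_entry e[FIFO_SZ];
      uint32_t pc, cc; /* producer/consumer counters */
  };

  static void push(struct xdpi_fifo *f, union xdpi_entry v)
  {
      f->e[f->pc++ & (FIFO_SZ - 1)] = v;
  }

  static union xdpi_entry pop(struct xdpi_fifo *f)
  {
      return f->e[f->cc++ & (FIFO_SZ - 1)];
  }

  int main(void)
  {
      struct xdpi_fifo f = {0};
      char page = 'P';

      /* Per packet: push a sequence derived from the xmit mode. */
      push(&f, (union xdpi_entry){ .mode = 1 });
      push(&f, (union xdpi_entry){ .page = &page });

      /* On completion: pop in the same order, interpreting each slot. */
      int mode = pop(&f).mode;
      void *p = pop(&f).page;
      printf("mode=%d page=%p\n", mode, p);
      return 0;
  }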
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
3f93f829 | 21-Feb-2023 | Dragos Tatulea <dtatulea@nvidia.com>
net/mlx5e: RX, Defer page release in legacy rq for better recycling
Currently, fragmented pages from the page pool can be released in two ways:
1) In the mlx5e driver when trimming off the unused fragments AND the associated skb fragments have been released. This path allows recycling of pages to the page pool cache (allow_direct == true).
2) On the skb release path (last fragment release), which will always release pages to the page pool ring (allow_direct == false).
Whichever path releases the last fragment decides where the page ends up: the cache or the ring. So we obviously want to maximize releases through path 1.
This patch does that by deferring the release of page fragments right before requesting new ones from the page pool. A flag is added to make sure that there's no release before first alloc and that XDP_TX fragments are not released prematurely.
This is a preparation patch that doesn't unlock the performance improvements yet. A followup patch will do that.
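A standalone sketch of the deferred-release pattern described above, with hypothetical names; the real driver operates on page-pool fragments, not this toy state:

  #include <stdbool.h>
  #include <stdio.h>

  /* Hypothetical per-fragment state; field names are illustrative. */
  struct frag_state {
      void *page;
      bool skip_release; /* guards first-alloc and XDP_TX fragments */
  };

  static void *pool_alloc(void) { static char pg; return &pg; }

  static void pool_release(void *page, bool allow_direct)
  {
      /* allow_direct selects cache vs. ring in a real page pool */
      printf("release %p -> %s\n", page, allow_direct ? "cache" : "ring");
  }

  /* Deferred scheme: the old fragment is released only here, right
   * before its slot is refilled, so the driver side is more likely to
   * free last and recycle into the cache (allow_direct == true). */
  static void refill(struct frag_state *f)
  {
      if (f->page && !f->skip_release)
          pool_release(f->page, true);
      f->skip_release = false;
      f->page = pool_alloc();
  }

  int main(void)
  {
      struct frag_state f = { .page = NULL, .skip_release = false };
      refill(&f); /* first alloc: nothing to release yet */
      refill(&f); /* later refills release the previous page first */
      return 0;
  }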
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
38a36efc | 17-Feb-2023 | Dragos Tatulea <dtatulea@nvidia.com>
net/mlx5e: RX, Rename xdp_xmit_bitmap to a more generic name
The xdp_xmit_bitmap currently serves only one purpose: to avoid releasing pages that are still in use due to XDP TX.
A following patch will use this bitmap in a slightly different context but for the same purpose. So rename the bitmap to a more generic name that reflects the purpose, not the context.
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
d39092ca | 30-Jan-2023 | Dragos Tatulea <dtatulea@nvidia.com>
net/mlx5e: RX, Remove alloc unit layout constraint for striding rq
This change removes the usage of mlx5e_alloc_unit union for striding rq. The change is more straightforward than legacy rq as the alloc units union is already in place.
This patch only moves things around: instead of an array of unions make it a union of arrays. This has the effect that each mlx5e_mpw_info will allocate the largest possible size of the array member. For xsk this means that the array of xdp_buff pointers for the wqe will still be contiguous, but there will be some extra unused bytes at the end of the array.
A further patch in the series will add the mlx5e_frag_page struct, for which the described size constraint will no longer hold.
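A compilable illustration of the layout change (the count and type names are placeholders, not the driver's definitions): the union-of-arrays form keeps each member array contiguous, and the struct sizes itself to the largest member.

  #include <stdio.h>

  #define PAGES_PER_WQE 64 /* illustrative count */

  struct page;     /* stand-ins for the real types */
  struct xdp_buff;

  /* Before: an array of unions - members of one WQE interleave. */
  struct mpw_info_before {
      union {
          struct page *page;
          struct xdp_buff *xskp;
      } alloc_units[PAGES_PER_WQE];
  };

  /* After: a union of arrays - each member array is contiguous, and
   * the struct takes the size of the largest member, possibly leaving
   * unused bytes at the end for the smaller ones. */
  struct mpw_info_after {
      union {
          struct page *pages[PAGES_PER_WQE];
          struct xdp_buff *xsk_buffs[PAGES_PER_WQE];
      } alloc_units;
  };

  int main(void)
  {
      printf("before=%zu after=%zu\n",
             sizeof(struct mpw_info_before),
             sizeof(struct mpw_info_after));
      return 0;
  }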
Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
c2c9e31d | 01-Oct-2022 | Maxim Mikityanskiy <maximmi@nvidia.com>
net/mlx5e: xsk: Optimize for unaligned mode with 3072-byte frames
When XSK frame size is 3072 (or another power of two multiplied by 3), KLM mechanism for NIC virtual memory page mapping can be optimized by replacing it with KSM.
Before this change, two KLM entries were needed to map an XSK frame that is not a power of two: one entry maps the UMEM memory up to the frame length, the other maps the rest of the stride to the garbage page.
When the frame length divided by 3 is a power of two, it can be mapped using 3 KSM entries, and the fourth will map the rest of the stride to the garbage page. All 4 KSM entries are of the same size, which allows for a much faster lookup.
Frame size 3072 is useful in certain use cases, because it allows packing 4 frames into 3 pages. Generally speaking, other frame sizes equal to PAGE_SIZE minus a power of two can be optimized in a similar way, but it will require many more KSMs per frame, which slows down UMRs a little bit, but more importantly may hit the limit for the maximum number of KSM entries.
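A quick standalone check of the arithmetic above, assuming 4096-byte pages:

  #include <assert.h>
  #include <stdio.h>

  int main(void)
  {
      const unsigned frame = 3072, page = 4096;
      const unsigned stride = 4096;   /* next power of two >= frame */
      const unsigned ksm = frame / 3; /* 1024, a power of two */

      /* 3 equal KSM entries cover the frame; a 4th covers the rest
       * of the stride (the garbage page). */
      assert(3 * ksm == frame);
      assert(4 * ksm == stride);

      /* Packing: 4 frames fit exactly into 3 pages. */
      assert(4 * frame == 3 * page);
      printf("ksm=%u entries_per_stride=4\n", ksm);
      return 0;
  }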
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
c6f04204 | 01-Oct-2022 | Maxim Mikityanskiy <maximmi@nvidia.com>
net/mlx5e: xsk: Print a warning in slow configurations
On striding RQ, when the XSK frame size doesn't match the MKey page size, KLM is used for memory mappings, which is a slower mechanism than MTT or KSM. It may happen in two cases:
1. Frame size is not a power of two (only possible in the unaligned mode of XSK).
2. Frame size is 2048 bytes, and the firmware doesn't support MKey pages smaller than 4096 bytes.
Depending on the case, print a warning and recommend disabling striding RQ or upgrading the firmware.
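A hypothetical standalone sketch of the two conditions and recommendations; the real driver's checks and message text differ:

  #include <stdbool.h>
  #include <stdio.h>

  static bool is_pow2(unsigned x) { return x && !(x & (x - 1)); }

  static void warn_if_slow(unsigned frame_sz, unsigned min_mkey_page_sz)
  {
      if (!is_pow2(frame_sz)) /* case 1: unaligned mode only */
          printf("XSK frame size %u is not a power of two: KLM in "
                 "use; consider disabling striding RQ\n", frame_sz);
      else if (frame_sz < min_mkey_page_sz) /* case 2: old firmware */
          printf("XSK frame size %u below minimal MKey page size %u: "
                 "KLM in use; consider a firmware upgrade\n",
                 frame_sz, min_mkey_page_sz);
  }

  int main(void)
  {
      warn_if_slow(3072, 4096);
      warn_if_slow(2048, 4096);
      return 0;
  }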
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
13921345 | 01-Oct-2022 | Maxim Mikityanskiy <maximmi@nvidia.com>
net/mlx5e: xsk: Use KLM to protect frame overrun in unaligned mode
XSK RQs support striding RQ linear mode, but the stride size may be bigger than the XSK frame size, because:
1. The stride size must be a power of two.
2. The stride size must be equal to the UMR page size. Each XSK frame is treated as a separate page, because they aren't necessarily adjacent in physical memory, so the driver can't put more than one stride per page.
3. The minimal MTT page size is 4096 on older firmware.
That means that if XSK frame size is 2048 or not a power of two, the strides may be bigger than XSK frames. Normally, it's not a problem if the hardware enforces the MTU. However, traffic between vports skips the hardware MTU check, and oversized packets may be received.
If an oversized packet is bigger than the XSK frame but not bigger than the stride, it overwrites the adjacent UMEM region. If the packet spans more than one stride, those strides can be recycled for reuse, so this isn't a problem when the XSK frame size matches the stride size.
Work around the above issue by leveraging KLM to make a more fine-grained mapping. The beginning of each stride is mapped to the frame memory, and the padding up to the closest power of two is mapped to the overflow page that doesn't belong to UMEM. This way, application data corruption won't happen upon receiving packets bigger than MTU.
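A standalone sketch of the per-stride mapping idea with made-up numbers: the frame occupies the start of the stride, and the padding up to the next power of two points at an overflow page outside UMEM.

  #include <stdio.h>

  struct klm_entry { const char *target; unsigned len; };

  int main(void)
  {
      const unsigned frame = 3000; /* unaligned XSK frame */
      unsigned stride = 1;
      while (stride < frame)       /* round up to a power of two */
          stride <<= 1;

      /* Two entries per stride: an oversized packet spills into the
       * overflow page instead of the next frame's memory. */
      struct klm_entry map[2] = {
          { "umem frame",    frame          },
          { "overflow page", stride - frame },
      };
      for (int i = 0; i < 2; i++)
          printf("klm[%d]: %-13s len=%u\n", i, map[i].target, map[i].len);
      return 0;
  }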
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
9f123f74 | 01-Oct-2022 | Maxim Mikityanskiy <maximmi@nvidia.com>
net/mlx5e: Improve MTT/KSM alignment
Make mlx5e_mpwrq_mtts_per_wqe take into account that KSM requires smaller alignment than MTT.
Ensure that there is always an even amount of MTTs in a UMR WQE, so that complete octwords are formed, and no garbage is mapped.
Drop the extra alignment in MLX5_MTT_OCTW that may set ucseg->xlt_octowords too big, also leading to mapping garbage.
Generalize some calculations by introducing the MLX5_OCTWORD constant.
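A standalone sketch of the octword math; the 8-byte MTT and 16-byte KSM entry sizes used here are stated as assumptions:

  #include <assert.h>
  #include <stdio.h>

  #define MLX5_OCTWORD 16 /* bytes; the constant introduced here */

  int main(void)
  {
      const unsigned mtt_sz = 8;  /* assumed per-entry sizes */
      const unsigned ksm_sz = 16;

      /* Two MTTs fill one octword, so the MTT count must be even to
       * avoid a half-filled octword mapping garbage; a single KSM
       * already fills an octword, needing less alignment. */
      unsigned mtts = 17;
      unsigned mtts_aligned = (mtts + 1) & ~1u; /* round up to even */

      assert(mtts_aligned * mtt_sz % MLX5_OCTWORD == 0);
      assert(ksm_sz == MLX5_OCTWORD);
      printf("mtts=%u aligned=%u octwords=%u\n",
             mtts, mtts_aligned, mtts_aligned * mtt_sz / MLX5_OCTWORD);
      return 0;
  }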
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
168723c1 | 01-Oct-2022 | Maxim Mikityanskiy <maximmi@nvidia.com>
net/mlx5e: xsk: Use umr_mode to calculate striding RQ parameters
Instead of passing the unaligned flag, pass an enum that indicates the UMR mode. The next commit will add the third mode (KLM for certain configurations of XSK), which will be added to this enum instead of adding another bool flag everywhere.
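An illustrative sketch of the interface change only; these enumerator and function names are placeholders, not necessarily the driver's:

  /* before: a bare flag rode through every parameter calculation */
  int mpwrq_log_wqe_sz_before(int page_shift, int unaligned);

  /* after: an enum names the UMR mode and leaves room for more modes */
  enum umr_mode {
      UMR_MODE_ALIGNED,
      UMR_MODE_UNALIGNED,
      /* the KLM mode for certain XSK configurations lands here next */
  };

  int mpwrq_log_wqe_sz_after(int page_shift, enum umr_mode mode);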
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cfb4d09c | 01-Oct-2022 | Maxim Mikityanskiy <maximmi@nvidia.com>
net/mlx5e: xsk: Improve need_wakeup logic
The XSK need_wakeup mechanism allows the driver to stop busy-waiting for buffers when the fill ring is empty, yield to the application, and signal that the driver needs to be woken up after the application refills the fill ring.
Add protection against the race condition on the RX (refill) side: if the application refills buffers after xskrq->post_wqes is called, but before mlx5e_xsk_update_rx_wakeup, NAPI will exit, skipping taking these buffers to the hardware WQ, and the application won't wake it up again.
Optimize the whole need_wakeup logic, removing unneeded flows, to compensate for this new check.
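A standalone model of the race fix, with stand-in names: arm the wakeup flag, then re-check the fill ring before letting NAPI exit.

  #include <stdbool.h>
  #include <stdio.h>

  struct xskrq {
      int fill_ring_avail;
      bool need_wakeup;
  };

  static bool post_wqes(struct xskrq *rq)
  {
      return rq->fill_ring_avail > 0; /* had buffers to post */
  }

  /* Without the re-check, a refill landing between post_wqes() and
   * setting need_wakeup is lost: NAPI exits and never gets woken. */
  static bool napi_keep_polling(struct xskrq *rq)
  {
      if (post_wqes(rq))
          return true;
      rq->need_wakeup = true;
      if (rq->fill_ring_avail > 0) { /* app refilled in the window */
          rq->need_wakeup = false;
          return true;
      }
      return false;
  }

  int main(void)
  {
      struct xskrq rq = { .fill_ring_avail = 0, .need_wakeup = false };
      printf("keep polling: %d\n", napi_keep_polling(&rq));
      return 0;
  }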
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
3db4c85c | 30-Sep-2022 | Maxim Mikityanskiy <maximmi@nvidia.com>
net/mlx5e: xsk: Use queue indices starting from 0 for XSK queues
In the initial implementation of XSK in mlx5e, XSK RQs coexisted with regular RQs in the same channel. The main idea was to let RSS work the same way for regular traffic, without needing to reconfigure RSS to exclude XSK queues.
However, this scheme didn't prove to be beneficial, mainly because of incompatibility with other vendors. Some tools don't properly support using higher indices for XSK queues, and some get confused by the doubled number of RQs exposed in sysfs. Some use cases are purely XSK, and allocating the same amount of unused regular RQs is a waste of resources.
This commit changes the queuing scheme to the standard one, where XSK RQs replace regular RQs on the channels where XSK sockets are open. Two RQs still exist in the channel to allow failsafe disable of XSK, but only one is exposed at a time. The next commit will achieve the desired memory save by flushing the buffers when the regular RQ is unused.
As the result of this transition:
1. It's possible to use RSS contexts over XSK RQs.
2. It's possible to dedicate all queues to XSK.
3. When XSK RQs coexist with regular RQs, the admin should make sure no unwanted traffic goes into XSK RQs by either excluding them from RSS or setting up the XDP program to return XDP_PASS for non-XSK traffic.
4. When using a mixed fleet of mlx5e devices and other netdevs, the same configuration can be applied. If the application supports the fallback to copy mode on unsupported drivers, it will work too.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
a752b2ed | 30-Sep-2022 | Maxim Mikityanskiy <maximmi@nvidia.com>
net/mlx5e: xsk: Support XDP metadata on XSK RQs
Add support for XDP metadata on XSK RQs for cross-program communication. The driver no longer calls xdp_set_data_meta_invalid and copies the metadata to a newly allocated SKB on XDP_PASS.
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ddb7afee | 30-Sep-2022 | Maxim Mikityanskiy <maximmi@nvidia.com>
net/mlx5e: Optimize RQ page deallocation
mlx5e_free_rx_mpwqe loops over all pages of a MPWQE, calling mlx5e_page_release for ones that are not scheduled for XDP_TX or XDP_REDIRECT; and mlx5e_page_release checks whether it's an XSK RQ or a regular one for each page/XSK frame. This check can be moved outside the loop to reduce the number of branches.
mlx5e_free_rx_wqe loops over all fragments, calling mlx5e_page_release for the ones that are last in a page; and mlx5e_page_release checks whether it's an XSK RQ or a regular one for each fragment. Using the fact that XSK doesn't support multiple fragments, it can be optimized for both XSK and regular usages:
1. Make an early check for XSK and call its deallocator directly, saving 3 branches (loop condition, frag->last_in_page and selection of deallocator).
2. Call the regular deallocator directly in the non-XSK case, saving a branch per fragment, except the first one.
After the changes, mlx5e_page_release is removed, as there are no callers left.
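A standalone before/after sketch of the branch hoisting, using toy types rather than the driver's:

  #include <stdbool.h>

  struct frag { bool last_in_page; void *page; };

  static void xsk_free(void *p) { (void)p; }
  static void regular_free(void *p) { (void)p; }

  /* Before: the XSK check is re-tested for every fragment. */
  static void free_wqe_before(struct frag *f, int n, bool is_xsk)
  {
      for (int i = 0; i < n; i++)
          if (f[i].last_in_page) {
              if (is_xsk)
                  xsk_free(f[i].page);
              else
                  regular_free(f[i].page);
          }
  }

  /* After: test XSK once; XSK never has multiple fragments, so its
   * path needs no loop, and the regular path calls its deallocator
   * directly, saving a branch per fragment. */
  static void free_wqe_after(struct frag *f, int n, bool is_xsk)
  {
      if (is_xsk) {
          xsk_free(f[0].page);
          return;
      }
      for (int i = 0; i < n; i++)
          if (f[i].last_in_page)
              regular_free(f[i].page);
  }

  int main(void)
  {
      struct frag f[2] = { { true, 0 }, { true, 0 } };
      free_wqe_before(f, 2, false);
      free_wqe_after(f, 2, false);
      return 0;
  }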
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
cf544517 | 30-Sep-2022 | Maxim Mikityanskiy <maximmi@nvidia.com>
net/mlx5e: xsk: Use xsk_buff_alloc_batch on striding RQ
XSK provides a function to allocate frames in batches for more efficient processing. This commit starts using this function on striding RQ and creates an optimized flow for XSK. A side effect is an opportunity to optimize the regular RX flow by dropping branching for XSK cases.
Performance improvement is up to 6.4% in the aligned mode and up to 7.5% in the unaligned mode.
Aligned mode, 2048-byte frames: 12.9 Mpps -> 13.8 Mpps
Aligned mode, 4096-byte frames: 11.8 Mpps -> 12.5 Mpps
Unaligned mode, 2048-byte frames: 11.9 Mpps -> 12.8 Mpps
Unaligned mode, 3072-byte frames: 11.4 Mpps -> 12.1 Mpps
Unaligned mode, 4096-byte frames: 11.0 Mpps -> 11.2 Mpps
CPU: Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
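A hedged, standalone sketch of the batched path; the real helper is xsk_buff_alloc_batch() from include/net/xdp_sock_drv.h, stubbed here so the example runs on its own:

  #include <stdio.h>

  struct xdp_buff { int dummy; };
  typedef unsigned int u32;

  /* Stub standing in for xsk_buff_alloc_batch(pool, xdp, max): returns
   * how many frames it could take from the fill ring, up to max. */
  static u32 xsk_buff_alloc_batch_stub(struct xdp_buff **xdp, u32 max)
  {
      static struct xdp_buff frames[64];
      for (u32 i = 0; i < max; i++)
          xdp[i] = &frames[i];
      return max;
  }

  int main(void)
  {
      struct xdp_buff *bufs[8];
      u32 pages_per_wqe = 8;

      /* One batched call fills the whole striding-RQ WQE instead of
       * eight separate allocations; a short return means the fill
       * ring ran dry and the WQE is not posted. */
      u32 got = xsk_buff_alloc_batch_stub(bufs, pages_per_wqe);
      printf("allocated %u/%u frames for the WQE\n", got, pages_per_wqe);
      return 0;
  }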
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>