Revision tags: v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45 |
|
#
6db541ae |
| 09-Aug-2023 |
Nick Child <nnac123@linux.ibm.com> |
ibmvnic: Ensure login failure recovery is safe from other resets
If a login request fails, the recovery process should be protected against parallel resets. It is a known issue that freeing and regi
ibmvnic: Ensure login failure recovery is safe from other resets
If a login request fails, the recovery process should be protected against parallel resets. It is a known issue that freeing and registering CRQ's in quick succession can result in a failover CRQ from the VIOS. Processing a failover during login recovery is dangerous for two reasons: 1. This will result in two parallel initialization processes, this can cause serious issues during login. 2. It is possible that the failover CRQ is received but never executed. We get notified of a pending failover through a transport event CRQ. The reset is not performed until a INIT CRQ request is received. Previously, if CRQ init fails during login recovery, then the ibmvnic irq is freed and the login process returned error. If failover_pending is true (a transport event was received), then the ibmvnic device would never be able to process the reset since it cannot receive the CRQ_INIT request due to the irq being freed. This leaved the device in a inoperable state.
Therefore, the login failure recovery process must be hardened against these possible issues. Possible failovers (due to quick CRQ free and init) must be avoided and any issues during re-initialization should be dealt with instead of being propagated up the stack. This logic is similar to that of ibmvnic_probe().
Fixes: dff515a3e71d ("ibmvnic: Harden device login requests") Signed-off-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230809221038.51296-5-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
23cc5f66 |
| 09-Aug-2023 |
Nick Child <nnac123@linux.ibm.com> |
ibmvnic: Do partial reset on login failure
Perform a partial reset before sending a login request if any of the following are true: 1. If a previous request times out. This can be dangerous because
ibmvnic: Do partial reset on login failure
Perform a partial reset before sending a login request if any of the following are true: 1. If a previous request times out. This can be dangerous because the VIOS could still receive the old login request at any point after the timeout. Therefore, it is best to re-register the CRQ's and sub-CRQ's before retrying. 2. If the previous request returns an error that is not described in PAPR. PAPR provides procedures if the login returns with partial success or aborted return codes (section L.5.1) but other values do not have a defined procedure. Previously, these conditions just returned error from the login function rather than trying to resolve the issue. This can cause further issues since most callers of the login function are not prepared to handle an error when logging in. This improper cleanup can lead to the device being permanently DOWN'd. For example, if the VIOS believes that the device is already logged in then it will return INVALID_STATE (-7). If we never re-register CRQ's then it will always think that the device is already logged in. This leaves the device inoperable.
The partial reset involves freeing the sub-CRQs, freeing the CRQ then registering and initializing a new CRQ and sub-CRQs. This essentially restarts all communication with VIOS to allow for a fresh login attempt that will be unhindered by any previous failed attempts.
Fixes: dff515a3e71d ("ibmvnic: Harden device login requests") Signed-off-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230809221038.51296-4-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
d78a671e |
| 09-Aug-2023 |
Nick Child <nnac123@linux.ibm.com> |
ibmvnic: Handle DMA unmapping of login buffs in release functions
Rather than leaving the DMA unmapping of the login buffers to the login response handler, move this work into the login release func
ibmvnic: Handle DMA unmapping of login buffs in release functions
Rather than leaving the DMA unmapping of the login buffers to the login response handler, move this work into the login release functions. Previously, these functions were only used for freeing the allocated buffers. This could lead to issues if there are more than one outstanding login buffer requests, which is possible if a login request times out.
If a login request times out, then there is another call to send login. The send login function makes a call to the login buffer release function. In the past, this freed the buffers but did not DMA unmap. Therefore, the VIOS could still write to the old login (now freed) buffer. It is for this reason that it is a good idea to leave the DMA unmap call to the login buffers release function.
Since the login buffer release functions now handle DMA unmapping, remove the duplicate DMA unmapping in handle_login_rsp().
Fixes: dff515a3e71d ("ibmvnic: Harden device login requests") Signed-off-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230809221038.51296-3-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
411c565b |
| 09-Aug-2023 |
Nick Child <nnac123@linux.ibm.com> |
ibmvnic: Unmap DMA login rsp buffer on send login fail
If the LOGIN CRQ fails to send then we must DMA unmap the response buffer. Previously, if the CRQ failed then the memory was freed without DMA
ibmvnic: Unmap DMA login rsp buffer on send login fail
If the LOGIN CRQ fails to send then we must DMA unmap the response buffer. Previously, if the CRQ failed then the memory was freed without DMA unmapping.
Fixes: c98d9cc4170d ("ibmvnic: send_login should check for crq errors") Signed-off-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230809221038.51296-2-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
db17ba71 |
| 09-Aug-2023 |
Nick Child <nnac123@linux.ibm.com> |
ibmvnic: Enforce stronger sanity checks on login response
Ensure that all offsets in a login response buffer are within the size of the allocated response buffer. Any offsets or lengths that surpass
ibmvnic: Enforce stronger sanity checks on login response
Ensure that all offsets in a login response buffer are within the size of the allocated response buffer. Any offsets or lengths that surpass the allocation are likely the result of an incomplete response buffer. In these cases, a full reset is necessary.
When attempting to login, the ibmvnic device will allocate a response buffer and pass a reference to the VIOS. The VIOS will then send the ibmvnic device a LOGIN_RSP CRQ to signal that the buffer has been filled with data. If the ibmvnic device does not get a response in 20 seconds, the old buffer is freed and a new login request is sent. With 2 outstanding requests, any LOGIN_RSP CRQ's could be for the older login request. If this is the case then the login response buffer (which is for the newer login request) could be incomplete and contain invalid data. Therefore, we must enforce strict sanity checks on the response buffer values.
Testing has shown that the `off_rxadd_buff_size` value is filled in last by the VIOS and will be the smoking gun for these circumstances.
Until VIOS can implement a mechanism for tracking outstanding response buffers and a method for mapping a LOGIN_RSP CRQ to a particular login response buffer, the best ibmvnic can do in this situation is perform a full reset.
Fixes: dff515a3e71d ("ibmvnic: Harden device login requests") Signed-off-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20230809221038.51296-1-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v6.1.44 |
|
#
813f3662 |
| 04-Aug-2023 |
Yu Liao <liaoyu15@huawei.com> |
ibmvnic: remove unused rc variable
gcc with W=1 reports drivers/net/ethernet/ibm/ibmvnic.c:194:13: warning: variable 'rc' set but not used [-Wunused-but-set-variable]
ibmvnic: remove unused rc variable
gcc with W=1 reports drivers/net/ethernet/ibm/ibmvnic.c:194:13: warning: variable 'rc' set but not used [-Wunused-but-set-variable] ^ This variable is not used so remove it.
Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202308040609.zQsSXWXI-lkp@intel.com/ Signed-off-by: Yu Liao <liaoyu15@huawei.com> Reviewed-by: Nick Child <nnac123@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37 |
|
#
48538ccb |
| 28-Jun-2023 |
Nick Child <nnac123@linux.ibm.com> |
ibmvnic: Do not reset dql stats on NON_FATAL err
All ibmvnic resets, make a call to netdev_tx_reset_queue() when re-opening the device. netdev_tx_reset_queue() resets the num_queued and num_complete
ibmvnic: Do not reset dql stats on NON_FATAL err
All ibmvnic resets, make a call to netdev_tx_reset_queue() when re-opening the device. netdev_tx_reset_queue() resets the num_queued and num_completed byte counters. These stats are used in Byte Queue Limit (BQL) algorithms. The difference between these two stats tracks the number of bytes currently sitting on the physical NIC. ibmvnic increases the number of queued bytes though calls to netdev_tx_sent_queue() in the drivers xmit function. When, VIOS reports that it is done transmitting bytes, the ibmvnic device increases the number of completed bytes through calls to netdev_tx_completed_queue(). It is important to note that the driver batches its transmit calls and num_queued is increased every time that an skb is added to the next batch, not necessarily when the batch is sent to VIOS for transmission.
Unlike other reset types, a NON FATAL reset will not flush the sub crq tx buffers. Therefore, it is possible for the batched skb array to be partially full. So if there is call to netdev_tx_reset_queue() when re-opening the device, the value of num_queued (0) would not account for the skb's that are currently batched. Eventually, when the batch is sent to VIOS, the call to netdev_tx_completed_queue() would increase num_completed to a value greater than the num_queued. This causes a BUG_ON crash:
ibmvnic 30000002: Firmware reports error, cause: adapter problem. Starting recovery... ibmvnic 30000002: tx error 600 ibmvnic 30000002: tx error 600 ibmvnic 30000002: tx error 600 ibmvnic 30000002: tx error 600 ------------[ cut here ]------------ kernel BUG at lib/dynamic_queue_limits.c:27! Oops: Exception in kernel mode, sig: 5 [....] NIP dql_completed+0x28/0x1c0 LR ibmvnic_complete_tx.isra.0+0x23c/0x420 [ibmvnic] Call Trace: ibmvnic_complete_tx.isra.0+0x3f8/0x420 [ibmvnic] (unreliable) ibmvnic_interrupt_tx+0x40/0x70 [ibmvnic] __handle_irq_event_percpu+0x98/0x270 ---[ end trace ]---
Therefore, do not reset the dql stats when performing a NON_FATAL reset.
Fixes: 0d973388185d ("ibmvnic: Introduce xmit_more support using batched subCRQ hcalls") Signed-off-by: Nick Child <nnac123@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32, v6.1.31, v6.1.30, v6.1.29, v6.1.28, v6.1.27, v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23, v6.1.22, v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15, v6.1.14 |
|
#
6f2ce45f |
| 23-Feb-2023 |
Nick Child <nnac123@linux.ibm.com> |
ibmvnic: Assign XPS map to correct queue index
When setting the XPS map value for TX queues, use the index of the transmit queue. Previously, the function was passing the index of the loop that iter
ibmvnic: Assign XPS map to correct queue index
When setting the XPS map value for TX queues, use the index of the transmit queue. Previously, the function was passing the index of the loop that iterates over all queues (RX and TX). This was causing invalid XPS map values.
Fixes: 6831582937bd ("ibmvnic: Toggle between queue types in affinity mapping") Signed-off-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Link: https://lore.kernel.org/r/20230223153944.44969-1-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v6.1.13, v6.2, v6.1.12, v6.1.11, v6.1.10, v6.1.9 |
|
#
68315829 |
| 27-Jan-2023 |
Nick Child <nnac123@linux.ibm.com> |
ibmvnic: Toggle between queue types in affinity mapping
Previously, ibmvnic IRQs were assigned to CPU numbers by assigning all the IRQs for transmit queues then assigning all the IRQs for receive qu
ibmvnic: Toggle between queue types in affinity mapping
Previously, ibmvnic IRQs were assigned to CPU numbers by assigning all the IRQs for transmit queues then assigning all the IRQs for receive queues. With multi-threaded processors, in a heavy RX or TX environment, physical cores would either be overloaded or underutilized (due to the IRQ assignment algorithm). This approach is sub-optimal because IRQs for the same subprocess (RX or TX) would be bound to adjacent CPU numbers, meaning they were more likely to be contending for the same core.
For example, in a system with 64 CPU's and 32 queues, the IRQs would be bound to CPU in the following pattern:
IRQ type | CPU number ----------------------- TX0 | 0-1 TX1 | 2-3 <etc> RX0 | 32-33 RX1 | 34-35 <etc>
Observe that in SMT-8, the first 4 tx queues would be sharing the same core.
A more optimal algorithm would balance the number RX and TX IRQ's across the physical cores. Therefore, to increase performance, distribute RX and TX IRQs across cores by alternating between assigning IRQs for RX and TX queues to CPUs. With a system with 64 CPUs and 32 queues, this results in the following pattern:
IRQ type | CPU number ----------------------- TX0 | 0-1 RX0 | 2-3 TX1 | 4-5 RX1 | 6-7 <etc>
Observe that in SMT-8, there is equal distribution of RX and TX IRQs per core. In the above case, each core handles 2 TX and 2 RX IRQ's.
Signed-off-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Haren Myneni <haren@linux.ibm.com> Link: https://lore.kernel.org/r/20230127214358.318152-1-nnac123@linux.ibm.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
show more ...
|
Revision tags: v6.1.8, v6.1.7, v6.1.6, v6.1.5, v6.0.19, v6.0.18, v6.1.4, v6.1.3, v6.0.17, v6.1.2, v6.0.16, v6.1.1, v6.0.15, v6.0.14, v6.0.13, v6.1, v6.0.12, v6.0.11, v6.0.10, v5.15.80, v6.0.9, v5.15.79 |
|
#
df8f66d0 |
| 10-Nov-2022 |
Nick Child <nnac123@linux.ibm.com> |
ibmvnic: Update XPS assignments during affinity binding
Transmit Packet Steering (XPS) maps cpu numbers to transmit queues. By running the same connection on the same set of cpu's, contention for th
ibmvnic: Update XPS assignments during affinity binding
Transmit Packet Steering (XPS) maps cpu numbers to transmit queues. By running the same connection on the same set of cpu's, contention for the queue and cache miss rate can be minimized. When assigning a cpu mask for a tranmit queues irq number, assign the same cpu mask as the set of cpu's that XPS should use for that queue.
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com> Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com> Reviewed-by: Haren Myneni <haren@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
92125c3a |
| 10-Nov-2022 |
Nick Child <nnac123@linux.ibm.com> |
ibmvnic: Add hotpluggable CPU callbacks to reassign affinity hints
When CPU's are added and removed, ibmvnic devices will reassign hint values. Introduce a new cpu hotplug state CPUHP_IBMVNIC_DEAD t
ibmvnic: Add hotpluggable CPU callbacks to reassign affinity hints
When CPU's are added and removed, ibmvnic devices will reassign hint values. Introduce a new cpu hotplug state CPUHP_IBMVNIC_DEAD to signal to ibmvnic devices that the CPU has been removed and it is time to reset affinity hint assignments. On the other hand, when CPU's are being added, add a state instance to CPUHP_AP_ONLINE_DYN which will trigger a reassignment of affinity hints once the new CPU's are online. This implementation is based on the virtio_net driver.
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com> Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com> Reviewed-by: Haren Myneni <haren@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
44fbc1b6 |
| 10-Nov-2022 |
Nick Child <nnac123@linux.ibm.com> |
ibmvnic: Assign IRQ affinity hints to device queues
Assign affinity hints to ibmvnic device queue interrupts. Affinity hints are assigned and removed during sub-crq init and teardown, respectively.
ibmvnic: Assign IRQ affinity hints to device queues
Assign affinity hints to ibmvnic device queue interrupts. Affinity hints are assigned and removed during sub-crq init and teardown, respectively. This update should improve latency if utilized as interrupt lines and processing are more equally distributed among CPU's. This implementation is based on the virtio_net driver.
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com> Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com> Reviewed-by: Haren Myneni <haren@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v6.0.8, v5.15.78, v6.0.7, v5.15.77 |
|
#
d6dd2fe7 |
| 31-Oct-2022 |
Nick Child <nnac123@linux.ibm.com> |
ibmvnic: Free rwi on reset success
Free the rwi structure in the event that the last rwi in the list processed successfully. The logic in commit 4f408e1fa6e1 ("ibmvnic: retry reset if there are no o
ibmvnic: Free rwi on reset success
Free the rwi structure in the event that the last rwi in the list processed successfully. The logic in commit 4f408e1fa6e1 ("ibmvnic: retry reset if there are no other resets") introduces an issue that results in a 32 byte memory leak whenever the last rwi in the list gets processed.
Fixes: 4f408e1fa6e1 ("ibmvnic: retry reset if there are no other resets") Signed-off-by: Nick Child <nnac123@linux.ibm.com> Link: https://lore.kernel.org/r/20221031150642.13356-1-nnac123@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v5.15.76, v6.0.6, v6.0.5, v5.15.75, v6.0.4, v6.0.3, v6.0.2, v5.15.74, v5.15.73, v6.0.1, v5.15.72, v6.0, v5.15.71 |
|
#
b48b89f9 |
| 27-Sep-2022 |
Jakub Kicinski <kuba@kernel.org> |
net: drop the weight argument from netif_napi_add
We tell driver developers to always pass NAPI_POLL_WEIGHT as the weight to netif_napi_add(). This may be confusing to newcomers, drop the weight arg
net: drop the weight argument from netif_napi_add
We tell driver developers to always pass NAPI_POLL_WEIGHT as the weight to netif_napi_add(). This may be confusing to newcomers, drop the weight argument, those who really need to tweak the weight can use netif_napi_add_weight().
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for CAN Link: https://lore.kernel.org/r/20220927132753.750069-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v5.15.70, v5.15.69, v5.15.68, v5.15.67, v5.15.66, v5.15.65, v5.15.64, v5.15.63, v5.15.62, v5.15.61, v5.15.60, v5.15.59, v5.19, v5.15.58, v5.15.57, v5.15.56, v5.15.55, v5.15.54, v5.15.53, v5.15.52 |
|
#
1b18f09d |
| 02-Jul-2022 |
Rick Lindsley <ricklind@us.ibm.com> |
ibmvnic: Properly dispose of all skbs during a failover.
During a reset, there may have been transmits in flight that are no longer valid and cannot be fulfilled. Resetting and clearing the queues
ibmvnic: Properly dispose of all skbs during a failover.
During a reset, there may have been transmits in flight that are no longer valid and cannot be fulfilled. Resetting and clearing the queues is insufficient; each skb also needs to be explicitly freed so that upper levels are not left waiting for confirmation of a transmit that will never happen. If this happens frequently enough, the apparent backlog will cause TCP to begin "congestion control" unnecessarily, culminating in permanently decreased throughput.
Fixes: d7c0ef36bde03 ("ibmvnic: Free and re-allocate scrqs when tx/rx scrqs change") Tested-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: Rick Lindsley <ricklind@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v5.15.51, v5.15.50, v5.15.49, v5.15.48, v5.15.47, v5.15.46, v5.15.45, v5.15.44, v5.15.43, v5.15.42, v5.18, v5.15.41, v5.15.40, v5.15.39, v5.15.38 |
|
#
6bff3ffc |
| 04-May-2022 |
Christophe Leroy <christophe.leroy@csgroup.eu> |
net: ethernet: Prepare cleanup of powerpc's asm/prom.h
powerpc's asm/prom.h includes some headers that it doesn't need itself.
In order to clean powerpc's asm/prom.h up in a further step, first cle
net: ethernet: Prepare cleanup of powerpc's asm/prom.h
powerpc's asm/prom.h includes some headers that it doesn't need itself.
In order to clean powerpc's asm/prom.h up in a further step, first clean all files that include asm/prom.h
Some files don't need asm/prom.h at all. For those ones, just remove inclusion of asm/prom.h
Some files don't need any of the items provided by asm/prom.h, but need some of the headers included by asm/prom.h. For those ones, add the needed headers that are brought by asm/prom.h at the moment and remove asm/prom.h
Some files really need asm/prom.h but also need some of the headers included by asm/prom.h. For those one, leave asm/prom.h but also add the needed headers so that they can be removed from asm/prom.h in a later step.
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://lore.kernel.org/r/09a13d592d628de95d30943e59b2170af5b48110.1651663857.git.christophe.leroy@csgroup.eu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v5.15.37 |
|
#
aeaf59b7 |
| 27-Apr-2022 |
Dany Madden <drt@linux.ibm.com> |
Revert "ibmvnic: Add ethtool private flag for driver-defined queue limits"
This reverts commit 723ad916134784b317b72f3f6cf0f7ba774e5dae
When client requests channel or ring size larger than what th
Revert "ibmvnic: Add ethtool private flag for driver-defined queue limits"
This reverts commit 723ad916134784b317b72f3f6cf0f7ba774e5dae
When client requests channel or ring size larger than what the server can support the server will cap the request to the supported max. So, the client would not be able to successfully request resources that exceed the server limit.
Fixes: 723ad9161347 ("ibmvnic: Add ethtool private flag for driver-defined queue limits") Signed-off-by: Dany Madden <drt@linux.ibm.com> Link: https://lore.kernel.org/r/20220427235146.23189-1-drt@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v5.15.36, v5.15.35, v5.15.34 |
|
#
93b1ebb3 |
| 13-Apr-2022 |
Sukadev Bhattiprolu <sukadev@linux.ibm.com> |
ibmvnic: Allow multiple ltbs in txpool ltb_set
Allow multiple LTBs in the txpool's ltb_set. i.e rather than using a single large LTB, use several smaller LTBs.
The first n-1 LTBs will all be of the
ibmvnic: Allow multiple ltbs in txpool ltb_set
Allow multiple LTBs in the txpool's ltb_set. i.e rather than using a single large LTB, use several smaller LTBs.
The first n-1 LTBs will all be of the same size. The size of the last LTB in the set depends on the number of buffers and buffer (mtu) size. This strategy hopefully allows more reuse of the initial LTBs and also reduces the chances of an allocation failure (of the large LTB) when system is low in memory.
Suggested-by: Brian King <brking@linux.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
a75de820 |
| 13-Apr-2022 |
Sukadev Bhattiprolu <sukadev@linux.ibm.com> |
ibmvnic: Allow multiple ltbs in rxpool ltb_set
Allow multiple LTBs in the rxpool's ltb_set. The first n-1 LTBs will all be of the same size. The size of the last LTB in the set depends on the number
ibmvnic: Allow multiple ltbs in rxpool ltb_set
Allow multiple LTBs in the rxpool's ltb_set. The first n-1 LTBs will all be of the same size. The size of the last LTB in the set depends on the number of buffers and buffer (mtu) size.
Having a set of LTBs per pool provides a couple of benefits.
First, with the current value of IBMVNIC_MAX_LTB_SIZE of 16MB, with an MTU of 9000, we need a LTB (DMA buffer) of that size but the allocation can fail in low memory conditions. With a set of LTBs per pool, we can use several smaller (8MB) LTBs and hopefully have fewer allocation failures. (See also comments in ibmvnic.h on the trade-off with smaller LTBs)
Second since the kernel limits the size of the DMA buffer to 16MB (based on MAX_ORDER), with a single DMA buffer per pool, the pool is also limited to 16MB. This in turn limits the number of buffers per pool to 1763 when MTU is 9000. With a set of LTBs per pool, we can have upto the max of 4096 buffers per pool even when MTU is 9000.
Suggested-by: Brian King <brking@linux.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
d6b45850 |
| 13-Apr-2022 |
Sukadev Bhattiprolu <sukadev@linux.ibm.com> |
ibmvnic: convert rxpool ltb to a set of ltbs
Define and use interfaces that treat the long term buffer (LTB) of an rxpool as a set of LTBs rather than a single LTB. The set only has one LTB for now.
ibmvnic: convert rxpool ltb to a set of ltbs
Define and use interfaces that treat the long term buffer (LTB) of an rxpool as a set of LTBs rather than a single LTB. The set only has one LTB for now.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
0c91bf9c |
| 13-Apr-2022 |
Sukadev Bhattiprolu <sukadev@linux.ibm.com> |
ibmvnic: define map_txpool_buf_to_ltb()
Define a helper to map a given txpool buffer into its corresponding long term buffer (LTB) and offset. Currently there is just one LTB per txpool so the mappi
ibmvnic: define map_txpool_buf_to_ltb()
Define a helper to map a given txpool buffer into its corresponding long term buffer (LTB) and offset. Currently there is just one LTB per txpool so the mapping is trivial. When we add support for multiple LTBs per txpool, this helper will be more useful.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
2872a67c |
| 13-Apr-2022 |
Sukadev Bhattiprolu <sukadev@linux.ibm.com> |
ibmvnic: define map_rxpool_buf_to_ltb()
Define a helper to map a given rx pool buffer into its corresponding long term buffer (LTB) and offset. Currently there is just one LTB per pool so the mappin
ibmvnic: define map_rxpool_buf_to_ltb()
Define a helper to map a given rx pool buffer into its corresponding long term buffer (LTB) and offset. Currently there is just one LTB per pool so the mapping is trivial. When we add support for multiple LTBs per pool, this helper will be more useful.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
#
8880fc66 |
| 13-Apr-2022 |
Sukadev Bhattiprolu <sukadev@linux.ibm.com> |
ibmvnic: rename local variable index to bufidx
The local variable 'index' is heavily used in some functions and is confusing with the presence of other "index" fields like pool->index, ->consumer_in
ibmvnic: rename local variable index to bufidx
The local variable 'index' is heavily used in some functions and is confusing with the presence of other "index" fields like pool->index, ->consumer_index, etc. Rename it to bufidx to better reflect that its the index of a buffer in the pool.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v5.15.33, v5.15.32, v5.15.31, v5.17, v5.15.30 |
|
#
4219196d |
| 16-Mar-2022 |
Sukadev Bhattiprolu <sukadev@linux.ibm.com> |
ibmvnic: fix race between xmit and reset
There is a race between reset and the transmit paths that can lead to ibmvnic_xmit() accessing an scrq after it has been freed in the reset path. It can resu
ibmvnic: fix race between xmit and reset
There is a race between reset and the transmit paths that can lead to ibmvnic_xmit() accessing an scrq after it has been freed in the reset path. It can result in a crash like:
Kernel attempted to read user page (0) - exploit attempt? (uid: 0) BUG: Kernel NULL pointer dereference on read at 0x00000000 Faulting instruction address: 0xc0080000016189f8 Oops: Kernel access of bad area, sig: 11 [#1] ... NIP [c0080000016189f8] ibmvnic_xmit+0x60/0xb60 [ibmvnic] LR [c000000000c0046c] dev_hard_start_xmit+0x11c/0x280 Call Trace: [c008000001618f08] ibmvnic_xmit+0x570/0xb60 [ibmvnic] (unreliable) [c000000000c0046c] dev_hard_start_xmit+0x11c/0x280 [c000000000c9cfcc] sch_direct_xmit+0xec/0x330 [c000000000bfe640] __dev_xmit_skb+0x3a0/0x9d0 [c000000000c00ad4] __dev_queue_xmit+0x394/0x730 [c008000002db813c] __bond_start_xmit+0x254/0x450 [bonding] [c008000002db8378] bond_start_xmit+0x40/0xc0 [bonding] [c000000000c0046c] dev_hard_start_xmit+0x11c/0x280 [c000000000c00ca4] __dev_queue_xmit+0x564/0x730 [c000000000cf97e0] neigh_hh_output+0xd0/0x180 [c000000000cfa69c] ip_finish_output2+0x31c/0x5c0 [c000000000cfd244] __ip_queue_xmit+0x194/0x4f0 [c000000000d2a3c4] __tcp_transmit_skb+0x434/0x9b0 [c000000000d2d1e0] __tcp_retransmit_skb+0x1d0/0x6a0 [c000000000d2d984] tcp_retransmit_skb+0x34/0x130 [c000000000d310e8] tcp_retransmit_timer+0x388/0x6d0 [c000000000d315ec] tcp_write_timer_handler+0x1bc/0x330 [c000000000d317bc] tcp_write_timer+0x5c/0x200 [c000000000243270] call_timer_fn+0x50/0x1c0 [c000000000243704] __run_timers.part.0+0x324/0x460 [c000000000243894] run_timer_softirq+0x54/0xa0 [c000000000ea713c] __do_softirq+0x15c/0x3e0 [c000000000166258] __irq_exit_rcu+0x158/0x190 [c000000000166420] irq_exit+0x20/0x40 [c00000000002853c] timer_interrupt+0x14c/0x2b0 [c000000000009a00] decrementer_common_virt+0x210/0x220 --- interrupt: 900 at plpar_hcall_norets_notrace+0x18/0x2c
The immediate cause of the crash is the access of tx_scrq in the following snippet during a reset, where the tx_scrq can be either NULL or an address that will soon be invalid:
ibmvnic_xmit() { ... tx_scrq = adapter->tx_scrq[queue_num]; txq = netdev_get_tx_queue(netdev, queue_num); ind_bufp = &tx_scrq->ind_buf;
if (test_bit(0, &adapter->resetting)) { ... }
But beyond that, the call to ibmvnic_xmit() itself is not safe during a reset and the reset path attempts to avoid this by stopping the queue in ibmvnic_cleanup(). However just after the queue was stopped, an in-flight ibmvnic_complete_tx() could have restarted the queue even as the reset is progressing.
Since the queue was restarted we could get a call to ibmvnic_xmit() which can then access the bad tx_scrq (or other fields).
We cannot however simply have ibmvnic_complete_tx() check the ->resetting bit and skip starting the queue. This can race at the "back-end" of a good reset which just restarted the queue but has not cleared the ->resetting bit yet. If we skip restarting the queue due to ->resetting being true, the queue would remain stopped indefinitely potentially leading to transmit timeouts.
IOW ->resetting is too broad for this purpose. Instead use a new flag that indicates whether or not the queues are active. Only the open/ reset paths control when the queues are active. ibmvnic_complete_tx() and others wake up the queue only if the queue is marked active.
So we will have: A. reset/open thread in ibmvnic_cleanup() and __ibmvnic_open()
->resetting = true ->tx_queues_active = false disable tx queues ... ->tx_queues_active = true start tx queues
B. Tx interrupt in ibmvnic_complete_tx():
if (->tx_queues_active) netif_wake_subqueue();
To ensure that ->tx_queues_active and state of the queues are consistent, we need a lock which:
- must also be taken in the interrupt path (ibmvnic_complete_tx()) - shared across the multiple queues in the adapter (so they don't become serialized)
Use rcu_read_lock() and have the reset thread synchronize_rcu() after updating the ->tx_queues_active state.
While here, consolidate a few boolean fields in ibmvnic_adapter for better alignment.
Based on discussions with Brian King and Dany Madden.
Fixes: 7ed5b31f4a66 ("net/ibmvnic: prevent more than one thread from running in reset") Reported-by: Vaishnavi Bhat <vaish123@in.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v5.15.29, v5.15.28, v5.15.27, v5.15.26 |
|
#
fd98693c |
| 25-Feb-2022 |
Sukadev Bhattiprolu <sukadev@linux.ibm.com> |
ibmvnic: Allow queueing resets during probe
We currently don't allow queuing resets when adapter is in VNIC_PROBING state - instead we throw away the reset and return EBUSY. The reasoning is probabl
ibmvnic: Allow queueing resets during probe
We currently don't allow queuing resets when adapter is in VNIC_PROBING state - instead we throw away the reset and return EBUSY. The reasoning is probably that during ibmvnic_probe() the ibmvnic_adapter itself is being initialized so performing a reset during this time can lead us to accessing fields in the ibmvnic_adapter that are not fully initialized. A review of the code shows that all the adapter state neede to process a reset is initialized before registering the CRQ so that should no longer be a concern.
Further the expectation is that if we do get a reset (transport event) during probe, the do..while() loop in ibmvnic_probe() will handle this by reinitializing the CRQ.
While that is true to some extent, it is possible that the reset might occur _after_ the CRQ is registered and CRQ_INIT message was exchanged but _before_ the adapter state is set to VNIC_PROBED. As mentioned above, such a reset will be thrown away. While the client assumes that the adapter is functional, the vnic server will wait for the client to reinit the adapter. This disconnect between the two leaves the adapter down needing manual intervention.
Because ibmvnic_probe() has other work to do after initializing the CRQ (such as registering the netdev at a minimum) and because the reset event can occur at any instant after the CRQ is initialized, there will always be a window between initializing the CRQ and considering the adapter ready for resets (ie state == PROBED).
So rather than discarding resets during this window, allow queueing them - but only process them after the adapter is fully initialized.
To do this, introduce a new completion state ->probe_done and have the reset worker thread wait on this before processing resets.
This change brings up two new situations in or just after ibmvnic_probe(). First after one or more resets were queued, we encounter an error and decide to retry the initialization. At that point the queued resets are no longer relevant since we could be talking to a new vnic server. So we must purge/flush the queued resets before restarting the initialization. As a side note, since we are still in the probing stage and we have not registered the netdev, it will not be CHANGE_PARAM reset.
Second this change opens up a potential race between the worker thread in __ibmvnic_reset(), the tasklet and the ibmvnic_open() due to the following sequence of events:
1. Register CRQ 2. Get transport event before CRQ_INIT completes. 3. Tasklet schedules reset: a) add rwi to list b) schedule_work() to start worker thread which runs and waits for ->probe_done. 4. ibmvnic_probe() decides to retry, purges rwi_list 5. Re-register crq and this time rest of probe succeeds - register netdev and complete(->probe_done). 6. Worker thread resumes in __ibmvnic_reset() from 3b. 7. Worker thread sets ->resetting bit 8. ibmvnic_open() comes in, notices ->resetting bit, sets state to IBMVNIC_OPEN and returns early expecting worker thread to finish the open. 9. Worker thread finds rwi_list empty and returns without opening the interface.
If this happens, the ->ndo_open() call is effectively lost and the interface remains down. To address this, ensure that ->rwi_list is not empty before setting the ->resetting bit. See also comments in __ibmvnic_reset().
Fixes: 6a2fb0e99f9c ("ibmvnic: driver initialization for kdump/kexec") Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|