History log of /openbmc/linux/tools/testing/selftests/powerpc/eeh/eeh-functions.sh
Revision Date Author Comments
Revision tags: v5.15.41, v5.15.40, v5.15.39, v5.15.38, v5.15.37, v5.15.36, v5.15.35, v5.15.34, v5.15.33, v5.15.32, v5.15.31, v5.15.30, v5.15.29, v5.15.28, v5.15.27, v5.15.26, v5.15.25, v5.15.24, v5.15.23, v5.15.22, v5.15.21, v5.15.20, v5.15.19, v5.15.18, v5.15.17, v5.4.173, v5.15.16, v5.15.15, v5.15.10, v5.15.9, v5.15.8, v5.15.7, v5.15.6, v5.15.5, v5.15.4, v5.15.3, v5.15.2, v5.15.1, v5.15, v5.14.14, v5.14.13, v5.14.12, v5.14.11, v5.14.10, v5.14.9, v5.14.8, v5.14.7, v5.14.6, v5.10.67, v5.10.66, v5.14.5, v5.14.4, v5.10.65, v5.14.3, v5.10.64, v5.14.2, v5.10.63, v5.14.1, v5.10.62, v5.14, v5.10.61, v5.10.60, v5.10.53, v5.10.52, v5.10.51, v5.10.50, v5.10.49, v5.13, v5.10.46, v5.10.43, v5.10.42, v5.10.41, v5.10.40, v5.10.39, v5.4.119, v5.10.36, v5.10.35, v5.10.34, v5.4.116, v5.10.33, v5.12, v5.10.32, v5.10.31, v5.10.30, v5.10.27, v5.10.26, v5.10.25, v5.10.24, v5.10.23, v5.10.22, v5.10.21, v5.10.20, v5.10.19, v5.4.101, v5.10.18, v5.10.17, v5.11, v5.10.16, v5.10.15, v5.10.14, v5.10
# 38132cc0 02-Nov-2020 Oliver O'Halloran <oohall@gmail.com>

selftests/powerpc: Add VF recovery tests

The basic EEH test ignores VFs since the way the eeh_dev_break debugfs
interface works means that if multiple VFs are enabled we may cause errors
on all of them. However, we can work around that by only enabling a
single VF at a time.

This patch adds some infrastructure for finding SR-IOV capable devices and
enabling / disabling VFs so we can exercise the VF specific EEH recovery
paths. Two new tests are added, one for testing EEH aware devices and one
for EEH un-aware VFs.
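
The "one VF at a time" workaround described above can be sketched with the standard PCI sysfs attributes sriov_totalvfs and sriov_numvfs. This is a minimal sketch, not the code the patch adds: the function names are hypothetical, and each function takes a device directory (or a sysfs-like root) as an argument so it can be tried against a scratch directory instead of /sys/bus/pci/devices.

```sh
#!/bin/sh
# Hypothetical helpers; only the sysfs attribute names are real kernel ABI.

# Succeeds if the device directory $1 advertises at least one possible VF.
is_sriov_capable() {
	total=$(cat "$1/sriov_totalvfs" 2>/dev/null)
	[ -n "$total" ] && [ "$total" -gt 0 ]
}

# Print every SR-IOV capable device directory under root $1, one per line.
list_sriov_devices() {
	for dev in "$1"/*; do
		if is_sriov_capable "$dev"; then
			echo "$dev"
		fi
	done
}

# Enable exactly one VF on device directory $1. The count is reset to 0
# first because the kernel refuses to change an already non-zero VF count.
enable_one_vf() {
	echo 0 > "$1/sriov_numvfs" && echo 1 > "$1/sriov_numvfs"
}

# Disable all VFs on device directory $1.
disable_vfs() {
	echo 0 > "$1/sriov_numvfs"
}
```

With a single VF enabled, an injected error can only land on that one VF, which avoids the multi-VF problem the message describes.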

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201103044503.917128-3-oohall@gmail.com



# d6749ccb 02-Nov-2020 Oliver O'Halloran <oohall@gmail.com>

selftests/powerpc: Use stderr for debug messages in eeh-functions

We want to use stdout to return lists of devices, etc so log debug / status
messages to stderr rather than stdout.
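
The split described above can be illustrated with a small sketch. The helper names and the device addresses are placeholders chosen for illustration, not taken from eeh-functions.sh; only the stdout/stderr convention comes from the commit message.

```sh
#!/bin/sh
# Hypothetical helpers illustrating the stdout/stderr convention.

# Status and debug output goes to stderr so it never pollutes captured results.
log() {
	echo "$@" >&2
}

# "Returns" a device list on stdout, one address per line.
# The addresses are placeholders, not real discovery logic.
list_devices() {
	log "Scanning for devices..."
	echo "0000:01:00.0"
	echo "0000:02:00.0"
}

# Command substitution captures stdout only, so $devs holds just the list
# while the log line still reaches the terminal.
devs=$(list_devices)
```

Had log written to stdout, the status line would end up inside $devs and be treated as a device address by the caller.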

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201103044503.917128-2-oohall@gmail.com



# db82f709 02-Nov-2020 Oliver O'Halloran <oohall@gmail.com>

selftests/powerpc: Hoist helper code out of eeh-basic

Hoist some of the useful test environment checking and prep code into
eeh-functions.sh so they can be reused in other tests.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20201103044503.917128-1-oohall@gmail.com



Revision tags: v5.8.17, v5.8.16, v5.8.15, v5.9, v5.8.14, v5.8.13, v5.8.12, v5.8.11, v5.8.10, v5.8.9, v5.8.8, v5.8.7, v5.8.6, v5.4.62, v5.8.5, v5.8.4, v5.4.61, v5.8.3, v5.4.60, v5.8.2, v5.4.59, v5.8.1, v5.4.58, v5.4.57, v5.4.56, v5.8, v5.7.12, v5.4.55, v5.7.11, v5.4.54
# 5f8cf647 26-Jul-2020 Oliver O'Halloran <oohall@gmail.com>

selftests/powerpc: Squash spurious errors due to device removal

For drivers that don't have the error handling callbacks we implement
recovery by removing the device and re-probing it. This causes the sysfs
directory for the PCI device to be removed which causes the following
spurious error to be printed when checking the PE state:

Breaking 0005:03:00.0...
./eeh-basic.sh: line 13: can't open /sys/bus/pci/devices/0005:03:00.0/eeh_pe_state: no such file
0005:03:00.0, waited 0/60
0005:03:00.0, waited 1/60
0005:03:00.0, waited 2/60
0005:03:00.0, waited 3/60
0005:03:00.0, waited 4/60
0005:03:00.0, waited 5/60
0005:03:00.0, waited 6/60
0005:03:00.0, waited 7/60
0005:03:00.0, Recovered after 8 seconds

We currently try to avoid this by checking if the PE state file exists
before reading from it. This is however inherently racy so re-work the
state checking so that we only read from the file once, and we squash any
errors that occur while reading.
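
The difference between the racy check and the single read can be sketched as follows. This is an illustration under assumptions rather than the patched code: read_state_once is a hypothetical name, and the contents of the state file are treated as an opaque string.

```sh
#!/bin/sh
# Racy pattern: the file can vanish between the existence test and the read,
# in which case the read prints an error.
#   [ -e "$path" ] && state=$(cat "$path")

# Single read with errors squashed. Prints the file contents and returns 0,
# or prints nothing and returns 1 if the file is missing or empty.
read_state_once() {
	state=$(cat "$1" 2>/dev/null)
	[ -n "$state" ] || return 1
	echo "$state"
}
```

A caller polling for recovery can treat a non-zero return as "not recovered yet" and try again, so a device directory that disappears during re-probing no longer produces noise.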

Fixes: 85d86c8aa52e ("selftests/powerpc: Add basic EEH selftest")
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200727010127.23698-1-oohall@gmail.com



Revision tags: v5.7.10, v5.4.53, v5.4.52, v5.7.9, v5.7.8, v5.4.51, v5.4.50, v5.7.7, v5.4.49, v5.7.6, v5.7.5, v5.4.48, v5.7.4, v5.7.3, v5.4.47, v5.4.46, v5.7.2, v5.4.45, v5.7.1, v5.4.44, v5.7, v5.4.43, v5.4.42, v5.4.41, v5.4.40, v5.4.39, v5.4.38, v5.4.37, v5.4.36, v5.4.35, v5.4.34, v5.4.33, v5.4.32, v5.4.31, v5.4.30, v5.4.29, v5.6, v5.4.28, v5.4.27, v5.4.26, v5.4.25, v5.4.24, v5.4.23, v5.4.22, v5.4.21, v5.4.20, v5.4.19, v5.4.18, v5.4.17, v5.4.16, v5.5, v5.4.15, v5.4.14
# 414f5043 21-Jan-2020 Oliver O'Halloran <oohall@gmail.com>

selftests/eeh: Bump EEH wait time to 60s

Some newer cards supported by aacraid can take up to 40s to recover
after an EEH event. This causes spurious failures in the basic EEH
self-test since the current maximum timeout is only 30s.

Fix the immediate issue by bumping the timeout to a default of 60s,
and allow the wait time to be specified via an environment variable
(EEH_MAX_WAIT).
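
A default-with-override of this kind is usually written with shell parameter expansion. The sketch below assumes a one-second polling interval, and max_wait and wait_for are hypothetical names; only EEH_MAX_WAIT and the 60s default come from the message above.

```sh
#!/bin/sh
# Print the wait limit in seconds: EEH_MAX_WAIT if set and non-empty, else 60.
max_wait() {
	echo "${EEH_MAX_WAIT:-60}"
}

# Run the check command given as arguments once per second until it
# succeeds or the limit is reached. Returns 0 on success, 1 on timeout.
wait_for() {
	limit=$(max_wait)
	waited=0
	while [ "$waited" -lt "$limit" ]; do
		if "$@"; then
			return 0
		fi
		sleep 1
		waited=$((waited + 1))
	done
	return 1
}
```

A slow card could then be accommodated by running the test as, for example, EEH_MAX_WAIT=120 ./eeh-basic.sh without editing the script.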

Reported-by: Steve Best <sbest@redhat.com>
Suggested-by: Douglas Miller <dougmill@us.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200122031125.25991-1-oohall@gmail.com



Revision tags: v5.4.13, v5.4.12, v5.4.11, v5.4.10, v5.4.9, v5.4.8, v5.4.7, v5.4.6, v5.4.5, v5.4.4, v5.4.3, v5.3.15, v5.4.2, v5.4.1, v5.3.14, v5.4, v5.3.13, v5.3.12, v5.3.11, v5.3.10, v5.3.9, v5.3.8, v5.3.7, v5.3.6, v5.3.5, v5.3.4, v5.3.3, v5.3.2, v5.3.1, v5.3, v5.2.14, v5.3-rc8, v5.2.13, v5.2.12
# 85d86c8a 03-Sep-2019 Oliver O'Halloran <oohall@gmail.com>

selftests/powerpc: Add basic EEH selftest

Use the new eeh_dev_check and eeh_dev_break interfaces to test EEH
recovery. Historically this has been done manually using platform specific
EEH error injection facilities (e.g. via RTAS). However, documentation on
how to use these facilities is haphazard at best and non-existent at worst
so it's hard to develop a cross-platform test.

The new debugfs interfaces allow the kernel to handle the platform specific
details so we can write a more generic set of tests. This patch adds the
most basic of recovery tests where:

a) Errors are injected and recovered from sequentially,
b) Errors are not injected into PCI-PCI bridges, such as PCIe switches.
c) Errors are only injected into device function zero.
d) No errors are injected into Virtual Functions.

a), b) and c) are largely due to limitations of Linux's EEH support. EEH
recovery is serialised in the EEH recovery thread which forces a).
Similarly, multi-function PCI devices are almost always grouped into the
same PE so injecting an error on one function exercises the same code
paths. b) is because we currently more or less ignore PCI bridges during
recovery and assume that the recovered topology will be the same as the
original.

d) is due to the limits of the eeh_dev_break interface. With the current
implementation we can't inject an error into a specific VF without
potentially causing additional errors on other VFs. Due to the serialised
recovery process we might end up timing out waiting for another function to
recover before the function of interest is recovered. The platform specific
error injection facilities are finer-grained and allow this capability, but
doing that requires working out how to use those facilities first.
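
Rules b), c) and d) amount to a filter over the PCI device list, which can be sketched as below. This is an illustration, not necessarily how the selftest detects each case: is_basic_test_candidate is a hypothetical name, bridges are identified here by the PCI-PCI bridge class code (0x0604) in the sysfs class attribute, and VFs by the physfn link that sysfs gives every virtual function.

```sh
#!/bin/sh
# Succeeds if device directory $1 (named after its PCI address, e.g.
# .../0005:03:00.0) is a candidate under rules b), c) and d).
is_basic_test_candidate() {
	dev_dir="$1"

	# c) only function zero: the address must end in ".0"
	case "$dev_dir" in
		*.0) ;;
		*) return 1 ;;
	esac

	# d) skip VFs: they carry a physfn link back to the parent function
	if [ -e "$dev_dir/physfn" ]; then
		return 1
	fi

	# b) skip PCI-PCI bridges: class code 0x0604xx
	class=$(cat "$dev_dir/class" 2>/dev/null)
	case "$class" in
		0x0604*) return 1 ;;
	esac

	return 0
}
```

Rule a) needs no filter: iterating over the candidates in a plain loop and waiting for each recovery before moving on already injects errors sequentially.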

Basically, it's better than nothing and it's a base to build on.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190903101605.2890-15-oohall@gmail.com



# 94976f93 26-Jul-2020 Oliver O'Halloran <oohall@gmail.com>

selftests/powerpc: Squash spurious errors due to device removal

[ Upstream commit 5f8cf6475828b600ff6d000e580c961ac839cc61 ]

For drivers that don't have the error handling callbacks we implement
recovery by removing the device and re-probing it. This causes the sysfs
directory for the PCI device to be removed which causes the following
spurious error to be printed when checking the PE state:

Breaking 0005:03:00.0...
./eeh-basic.sh: line 13: can't open /sys/bus/pci/devices/0005:03:00.0/eeh_pe_state: no such file
0005:03:00.0, waited 0/60
0005:03:00.0, waited 1/60
0005:03:00.0, waited 2/60
0005:03:00.0, waited 3/60
0005:03:00.0, waited 4/60
0005:03:00.0, waited 5/60
0005:03:00.0, waited 6/60
0005:03:00.0, waited 7/60
0005:03:00.0, Recovered after 8 seconds

We currently try to avoid this by checking if the PE state file exists
before reading from it. This is however inherently racy so re-work the
state checking so that we only read from the file once, and we squash any
errors that occur while reading.

Fixes: 85d86c8aa52e ("selftests/powerpc: Add basic EEH selftest")
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200727010127.23698-1-oohall@gmail.com
Signed-off-by: Sasha Levin <sashal@kernel.org>


