/openbmc/linux/drivers/gpu/drm/i915/ |
H A D | i915_gem.c | diff 2232f0315c6688f5ff6b2067ea88d97542034873 Thu Sep 04 02:36:18 CDT 2014 Daniel Vetter <daniel.vetter@ffwll.ch> drm/i915: Fix EIO/wedged handling in gem fault handler
In
commit 1f83fee08d625f8d0130f9fe5ef7b17c2e022f3c Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Thu Nov 15 17:17:22 2012 +0100
drm/i915: clear up wedged transitions
I've accidentally inverted the EIO/wedged handling in the fault handler: We want to return the EIO as a SIGBUS only if it's not because of the gpu having died, to prevent userspace from unduly dying.
In my defence the comment right above is completely misleading, so fix both.
v2: Drop the WARN_ON, it's not actually a bug to e.g. receive an -EIO when swap-in fails.
v3: Don't remove too much ... oops.
Reported-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: stable@vger.kernel.org Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Jani Nikula <jani.nikula@intel.com> diff 7abb690a0e095717420ba78dcab4309abbbec78a Fri May 24 14:29:32 CDT 2013 Daniel Vetter <daniel.vetter@ffwll.ch> drm/i915: Fix spurious -EIO/SIGBUS on wedged gpus
Chris Wilson noticed that since
commit 1f83fee08d625f8d0130f9fe5ef7b17c2e022f3c [v3.9] Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Thu Nov 15 17:17:22 2012 +0100
drm/i915: clear up wedged transitions
X can again get -EIO when it does not expect it. And even worse score a SIGBUS when accessing gtt mmaps. The established ABI is that we _only_ return an -EIO from execbuf - all other ioctls should just work. And since the reset code moves all bos out of gpu domains and clears out all the last_seqno/ring tracking there really shouldn't be any reason for non-execbuf code to ever touch the hw and see an -EIO.
After some extensive discussions we've noticed that these spurios -EIO are caused by i915_gem_wait_for_error:
http://www.mail-archive.com/intel-gfx@lists.freedesktop.org/msg20540.html
That is easy to fix by returning 0 instead of -EIO, since grabbing the dev->struct_mutex does not yet mean that we actually want to touch the hw. And so there is no reason at all to fail with -EIO.
But that's not the entire since, since often (at least it's easily googleable) dmesg indicates that the reset fails and we declare the gpu wedged. Then, quite a bit later X wakes up with the "Timed out waiting for the gpu reset to complete" DRM_ERROR message in wait_for_errror and brings down the desktop with an -EIO/SIGBUS.
So clearly we're missing a wakeup somewhere, since the gpu reset just doesn't take 10 seconds to complete. And indeed we're do handle the terminally wedged state wrong.
Fix this all up.
References: https://bugs.freedesktop.org/show_bug.cgi?id=63921 References: https://bugs.freedesktop.org/show_bug.cgi?id=64073 Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Damien Lespiau <damien.lespiau@intel.com> Cc: stable@vger.kernel.org Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> diff 1f83fee08d625f8d0130f9fe5ef7b17c2e022f3c Thu Nov 15 10:17:22 CST 2012 Daniel Vetter <daniel.vetter@ffwll.ch> drm/i915: clear up wedged transitions
We have two important transitions of the wedged state in the current code:
- 0 -> 1: This means a hang has been detected, and signals to everyone that they please get of any locks, so that the reset work item can do its job.
- 1 -> 0: The reset handler has completed.
Now the last transition mixes up two states: "Reset completed and successful" and "Reset failed". To distinguish these two we do some tricks with the reset completion, but I simply could not convince myself that this doesn't race under odd circumstances.
Hence split this up, and add a new terminal state indicating that the hw is gone for good.
Also add explicit #defines for both states, update comments.
v2: Split out the reset handling bugfix for the throttle ioctl.
v3: s/tmp/wedged/ sugested by Chris Wilson. Also fixup up a rebase error which prevented this patch from actually compiling.
v4: To unify the wedged state with the reset counter, keep the reset-in-progress state just as a flag. The terminally-wedged state is now denoted with a big number.
v5: Add a comment to the reset_counter special values explaining that WEDGED & RESET_IN_PROGRESS needs to be true for the code to be correct.
v6: Fixup logic errors introduced with the wedged+reset_counter unification. Since WEDGED implies reset-in-progress (in a way we're terminally stuck in the dead-but-reset-not-completed state), we need ensure that we check for this everywhere. The specific bug was in wait_for_error, which would simply have timed out.
v7: Extract an inline i915_reset_in_progress helper to make the code more readable. Also annote the reset-in-progress case with an unlikely, to help the compiler optimize the fastpath. Do the same for the terminally wedged case with i915_terminally_wedged.
Reviewed-by: Damien Lespiau <damien.lespiau@intel.com> Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
|
H A D | i915_debugfs.c | diff 1f83fee08d625f8d0130f9fe5ef7b17c2e022f3c Thu Nov 15 10:17:22 CST 2012 Daniel Vetter <daniel.vetter@ffwll.ch> drm/i915: clear up wedged transitions
We have two important transitions of the wedged state in the current code:
- 0 -> 1: This means a hang has been detected, and signals to everyone that they please get of any locks, so that the reset work item can do its job.
- 1 -> 0: The reset handler has completed.
Now the last transition mixes up two states: "Reset completed and successful" and "Reset failed". To distinguish these two we do some tricks with the reset completion, but I simply could not convince myself that this doesn't race under odd circumstances.
Hence split this up, and add a new terminal state indicating that the hw is gone for good.
Also add explicit #defines for both states, update comments.
v2: Split out the reset handling bugfix for the throttle ioctl.
v3: s/tmp/wedged/ sugested by Chris Wilson. Also fixup up a rebase error which prevented this patch from actually compiling.
v4: To unify the wedged state with the reset counter, keep the reset-in-progress state just as a flag. The terminally-wedged state is now denoted with a big number.
v5: Add a comment to the reset_counter special values explaining that WEDGED & RESET_IN_PROGRESS needs to be true for the code to be correct.
v6: Fixup logic errors introduced with the wedged+reset_counter unification. Since WEDGED implies reset-in-progress (in a way we're terminally stuck in the dead-but-reset-not-completed state), we need ensure that we check for this everywhere. The specific bug was in wait_for_error, which would simply have timed out.
v7: Extract an inline i915_reset_in_progress helper to make the code more readable. Also annote the reset-in-progress case with an unlikely, to help the compiler optimize the fastpath. Do the same for the terminally wedged case with i915_terminally_wedged.
Reviewed-by: Damien Lespiau <damien.lespiau@intel.com> Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
|
H A D | i915_irq.c | diff 1f83fee08d625f8d0130f9fe5ef7b17c2e022f3c Thu Nov 15 10:17:22 CST 2012 Daniel Vetter <daniel.vetter@ffwll.ch> drm/i915: clear up wedged transitions
We have two important transitions of the wedged state in the current code:
- 0 -> 1: This means a hang has been detected, and signals to everyone that they please get of any locks, so that the reset work item can do its job.
- 1 -> 0: The reset handler has completed.
Now the last transition mixes up two states: "Reset completed and successful" and "Reset failed". To distinguish these two we do some tricks with the reset completion, but I simply could not convince myself that this doesn't race under odd circumstances.
Hence split this up, and add a new terminal state indicating that the hw is gone for good.
Also add explicit #defines for both states, update comments.
v2: Split out the reset handling bugfix for the throttle ioctl.
v3: s/tmp/wedged/ sugested by Chris Wilson. Also fixup up a rebase error which prevented this patch from actually compiling.
v4: To unify the wedged state with the reset counter, keep the reset-in-progress state just as a flag. The terminally-wedged state is now denoted with a big number.
v5: Add a comment to the reset_counter special values explaining that WEDGED & RESET_IN_PROGRESS needs to be true for the code to be correct.
v6: Fixup logic errors introduced with the wedged+reset_counter unification. Since WEDGED implies reset-in-progress (in a way we're terminally stuck in the dead-but-reset-not-completed state), we need ensure that we check for this everywhere. The specific bug was in wait_for_error, which would simply have timed out.
v7: Extract an inline i915_reset_in_progress helper to make the code more readable. Also annote the reset-in-progress case with an unlikely, to help the compiler optimize the fastpath. Do the same for the terminally wedged case with i915_terminally_wedged.
Reviewed-by: Damien Lespiau <damien.lespiau@intel.com> Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
|
H A D | i915_drv.h | diff 1f83fee08d625f8d0130f9fe5ef7b17c2e022f3c Thu Nov 15 10:17:22 CST 2012 Daniel Vetter <daniel.vetter@ffwll.ch> drm/i915: clear up wedged transitions
We have two important transitions of the wedged state in the current code:
- 0 -> 1: This means a hang has been detected, and signals to everyone that they please get of any locks, so that the reset work item can do its job.
- 1 -> 0: The reset handler has completed.
Now the last transition mixes up two states: "Reset completed and successful" and "Reset failed". To distinguish these two we do some tricks with the reset completion, but I simply could not convince myself that this doesn't race under odd circumstances.
Hence split this up, and add a new terminal state indicating that the hw is gone for good.
Also add explicit #defines for both states, update comments.
v2: Split out the reset handling bugfix for the throttle ioctl.
v3: s/tmp/wedged/ sugested by Chris Wilson. Also fixup up a rebase error which prevented this patch from actually compiling.
v4: To unify the wedged state with the reset counter, keep the reset-in-progress state just as a flag. The terminally-wedged state is now denoted with a big number.
v5: Add a comment to the reset_counter special values explaining that WEDGED & RESET_IN_PROGRESS needs to be true for the code to be correct.
v6: Fixup logic errors introduced with the wedged+reset_counter unification. Since WEDGED implies reset-in-progress (in a way we're terminally stuck in the dead-but-reset-not-completed state), we need ensure that we check for this everywhere. The specific bug was in wait_for_error, which would simply have timed out.
v7: Extract an inline i915_reset_in_progress helper to make the code more readable. Also annote the reset-in-progress case with an unlikely, to help the compiler optimize the fastpath. Do the same for the terminally wedged case with i915_terminally_wedged.
Reviewed-by: Damien Lespiau <damien.lespiau@intel.com> Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
|