drm/i915: don't bail out of intel_wait_ring_buffer too early

In the pre-gem days with non-existing hangcheck and gpu reset code, this timeout of 3 seconds was pretty important to avoid stuck processes. But now we have the hangcheck code in gem that goes to great length to ensure that the gpu is really dead before declaring it wedged. So there's no need for this timeout anymore. Actually it's even harmful because we can bail out too early (e.g. with xscreensaver slip) when running giant batchbuffers. And our code isn't robust enough to properly unroll any state-changes, we pretty much rely on the gpu reset code cleaning up the mess (like cache tracking, fencing state, active list/request tracking, ...). With this change intel_begin_ring can only fail when the gpu is wedged, and it will return -EAGAIN (like wait_request in case the gpu reset is still outstanding). v2: Chris Wilson noted that on resume timers aren't running and hence we won't ever get kicked out of this loop by the hangcheck code. Use an insanely large timeout instead for the HAS_GEM case to prevent resume bugs from totally hanging the machine. Signed-off-by: N Daniel Vetter <daniel.vetter@ffwll.ch> Reviewed-by: N Chris Wilson <chris@chris-wilson.co.uk> Acked-by: N Ben Widawsky <ben@bwidawsk.net> Reviewed-by: N Eugeni Dodonov <eugeni.dodonov@intel.com> Signed-off-by: N Keith Packard <keithp@keithp.com>

drm/i915: don't bail out of intel_wait_ring_buffer too early
In the pre-gem days with non-existing hangcheck and gpu reset code, this timeout of 3 seconds was pretty important to avoid stuck processes. But now we have the hangcheck code in gem that goes to great length to ensure that the gpu is really dead before declaring it wedged. So there's no need for this timeout anymore. Actually it's even harmful because we can bail out too early (e.g. with xscreensaver slip) when running giant batchbuffers. And our code isn't robust enough to properly unroll any state-changes, we pretty much rely on the gpu reset code cleaning up the mess (like cache tracking, fencing state, active list/request tracking, ...). With this change intel_begin_ring can only fail when the gpu is wedged, and it will return -EAGAIN (like wait_request in case the gpu reset is still outstanding). v2: Chris Wilson noted that on resume timers aren't running and hence we won't ever get kicked out of this loop by the hangcheck code. Use an insanely large timeout instead for the HAS_GEM case to prevent resume bugs from totally hanging the machine. Signed-off-by: N Daniel Vetter <daniel.vetter@ffwll.ch> Reviewed-by: N Chris Wilson <chris@chris-wilson.co.uk> Acked-by: N Ben Widawsky <ben@bwidawsk.net> Reviewed-by: N Eugeni Dodonov <eugeni.dodonov@intel.com> Signed-off-by: N Keith Packard <keithp@keithp.com>
e6bfaf85 · Daniel Vetter · Keith Packard · 4e0e90dc · e6bfaf85
隐藏空白更改
内联并排

Showing with 10 addition and 1 deletion

drivers/gpu/drm/i915/intel_ringbuffer.c drivers/gpu/drm/i915/intel_ringbuffer.c +10 -1

未找到文件。
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -1135,7 +1135,16 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n)
 	}

 	trace_i915_ring_wait_begin(ring);
-	end = jiffies + 3 * HZ;
+	if (drm_core_check_feature(dev, DRIVER_GEM))
+		/* With GEM the hangcheck timer should kick us out of the loop,
+		 * leaving it early runs the risk of corrupting GEM state (due
+		 * to running on almost untested codepaths). But on resume
+		 * timers don't work yet, so prevent a complete hang in that
+		 * case by choosing an insanely large timeout. */
+		end = jiffies + 60 * HZ;
+	else
+		end = jiffies + 3 * HZ;
+
 	do {
 		ring->head = I915_READ_HEAD(ring);
 		ring->space = ring_space(ring);