1. 06 Mar, 2014 4 commits
    •
      drm/i915: Add reason for capture in error state · 58174462
      Mika Kuoppala committed
      We capture error state not only when the GPU hangs but also in
      other situations, such as interrupt errors and cases where
      we can kick things forward without a GPU reset. There will be a log
      entry in most of these cases. But the error state capture might be
      the only thing we have if dmesg was not captured, and in the GEN4 case
      an interrupt error can trigger error state capture without a log entry.
      In those cases the exact reason why the capture was made is hard to
      decipher.
      
      v2: Split out the error code stuff into a separate patch (Ben)
      
      References: https://bugs.freedesktop.org/show_bug.cgi?id=74193
      Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
      Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      58174462
    •
      drm/i915: Add error code into error state · cb383002
      Mika Kuoppala committed
      commit 011cf577
      Author: Ben Widawsky <benjamin.widawsky@intel.com>
      Date:   Tue Feb 4 12:18:55 2014 +0000
      
          drm/i915: Generate a hang error code
      
      added error code debug into dmesg. Store this also
      with error state to make matching dmesg logs and error
      states easier.
      
      As we need to have the full ring state for error code generation,
      always do a full capture, print the hang message into the log and then
      decide if we need to keep the error state.

      Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
      Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      cb383002
    •
      drm/i915: Record pid/comm of hanging task · ab0e7ff9
      Chris Wilson committed
      After finding the guilty batch and request, we can use it to find the
      process that submitted the batch and then add the culprit into the error
      state.
      
      This is a slightly different approach from Ben's in that instead of
      adding the extra information into the struct i915_hw_context, we use the
      information already captured in struct drm_file which is then referenced
      from the request.
      
      v2: Also capture the workaround buffer for gen2, so that we can compare
          its contents against the intended batch for the active request.
      
      v3: Rebase (Mika)
      v4: Check for null context (Chris)
          checkpatch warnings fixed
      
      Link: http://lists.freedesktop.org/archives/intel-gfx/2013-August/032280.html
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> (v2)
      Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> (v4)
      Acked-by: Ben Widawsky <ben@bwidawsk.net>
      Cc: Ben Widawsky <ben@bwidawsk.net>
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      ab0e7ff9
    •
      drm/i915: Rely on accurate request tracking for finding hung batches · 8d9fc7fd
      Chris Wilson committed
      In the past, it was possible to have multiple batches per request due to
      a stray signal or ENOMEM. As a result we had to scan each active object
      (filtered by those having the COMMAND domain) for the one that contained
      the ACTHD pointer. This was then made more complicated by the
      introduction of ppgtt, whereby ACTHD then pointed into the address space
      of the context and so also needed to be taken into account.
      
      This is a fairly robust approach (though the implementation is a little
      fragile and depends upon the per-generation setup, registers and
      parameters). However, due to the requirements for hangstats, we needed a
      robust method for associating batches with a particular request and
      having that we can rely upon it for finding the associated batch object
      for error capture.
      
      If the batch buffer tracking is not robust enough, that should become
      apparent quite quickly through an erroneous error capture. That should
      also help to make sure that the runtime reporting to userspace is
      robust. It also means that we then report the oldest incomplete batch on
      each ring, which can be useful for determining the state of userspace at
      the time of a hang.
      
      v2: Use i915_gem_find_active_request (Mika)
      
      v3: remove check for ring->get_seqno, split long lines (Ben)
      
      v4: check that context is available (Chris)
          checkpatch warnings fixed
      
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> (v1)
      Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> (v3)
      Cc: Ben Widawsky <benjamin.widawsky@intel.com>
      Reviewed-by: Ben Widawsky <ben@bwidawsk.net> (v3)
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      8d9fc7fd
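
The request-based lookup described in this commit can be sketched roughly as follows. This is a minimal userspace illustration, not the driver code: `struct request`, `seqno_passed()` and `find_active_request()` are simplified, hypothetical stand-ins, and the sketch assumes requests are kept oldest first with monotonically increasing 32-bit sequence numbers that may wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a submitted request. */
struct request {
	uint32_t seqno;        /* sequence number assigned at submission */
	uint32_t batch_offset; /* where this request's batch buffer lives */
};

/* Wrap-safe "has seq1 reached seq2?" comparison on 32-bit counters. */
static bool seqno_passed(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) >= 0;
}

/*
 * Requests are stored oldest first. Return the first one the hardware
 * has not completed yet, or NULL if every request has completed.
 */
static const struct request *
find_active_request(const struct request *reqs, size_t count,
		    uint32_t hw_seqno)
{
	assert(reqs != NULL || count == 0);

	for (size_t i = 0; i < count; i++) {
		if (!seqno_passed(hw_seqno, reqs[i].seqno))
			return &reqs[i];
	}
	return NULL;
}
```

With one batch per request, the batch to capture is simply the one attached to the returned request, so no scan over active objects for the ACTHD pointer is needed.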
  2. 06 Feb, 2014 1 commit
    •
      drm/i915: Generate a hang error code · 011cf577
      Ben Widawsky committed
      We get a large number of bug reports that attract "hey, I have that
      too" comments because people see a GPU hang in dmesg. While two
      machines of the same model both having a GPU hang is indeed a
      coincidence, it is far from enough evidence to suggest the hangs are
      the same.

      In order to reduce this effect, and hopefully get people to file new bug
      reports, generate an error code for each hang; the error message itself
      has clearly been insufficient (see the reference at the bottom for a bug
      report with this characteristic).
      
      The algorithm is purposely pretty naive. I don't think we need much in
      order to avoid the problem I am trying to solve, and keeping it naive
      gives us some ability to make a decent test case.
      
      Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
      References: https://bugs.freedesktop.org/show_bug.cgi?id=73276
      Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      011cf577
  3. 31 Jan, 2014 2 commits
  4. 30 Jan, 2014 5 commits
  5. 28 Jan, 2014 2 commits
  6. 18 Dec, 2013 5 commits
    •
      drm/i915: Use multiple VMs -- the point of no return · 7e0d96bc
      Ben Widawsky committed
      As with processes which run on the CPU, the goal of multiple VMs is to
      provide process isolation. Specific to GEN, there is also the ability to
      map more objects per process (2GB each instead of 2GB-2k total).
      
      For the most part, all the pipes have been laid, and all we need to do
      is remove asserts and actually start changing address spaces with the
      context switch. Since prior to this we've converted the setting of the
      page tables to a streamed version, this is quite easy.
      
      One important thing to point out (since it had been hotly contested) is
      that with this patch, every context created will have its own address
      space (provided the HW can do it).
      
      v2: Disable BDW on rebase
      
      NOTE: I tried to make this commit as small as possible. I needed one
      place where I could "turn everything on" and that is here. It could be
      split into finer commits, but I didn't really see much point.
      
      Cc: Eric Anholt <eric@anholt.net>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      7e0d96bc
    •
      drm/i915: Make pin count per VMA · d7f46fc4
      Ben Widawsky committed
      Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      d7f46fc4
    •
      drm/i915: Identify active VM for batchbuffer capture · 685987c6
      Ben Widawsky committed
      Using the current state of the page directory registers, we can
      determine which of our address spaces was active when the hang occurred.
      This allows us to scan through all the address spaces to identify the
      "active" one during error capture.
      
      v2: Rebased for BDW error detection. BDW error detection is similar
      except instead of PP_DIR_BASE, we can use the PDP registers.
      Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
      [danvet: Add FIXME about global gtt misuse.]
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      685987c6
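
The idea of matching a register snapshot against the known address spaces can be sketched as below. This is only an illustration under stated assumptions: `struct address_space`, its `pd_base` field and `find_active_vm()` are hypothetical names, and the real driver reads the hardware registers (PP_DIR_BASE, or the PDP registers on BDW) rather than taking the value as a parameter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified address space: only the field we compare. */
struct address_space {
	const char *name;
	uint32_t pd_base; /* value programmed into the page directory base */
};

/*
 * Compare a snapshot of the page directory base register against every
 * known address space. Returns the index of the match, or -1 if none.
 */
static int find_active_vm(const struct address_space *vms, size_t count,
			  uint32_t pd_base_reg)
{
	assert(vms != NULL || count == 0);

	for (size_t i = 0; i < count; i++) {
		if (vms[i].pd_base == pd_base_reg)
			return (int)i;
	}
	return -1;
}
```

A caller would walk the capture list with the returned index and mark that address space as the "active" one; a return of -1 means the snapshot matched none of the tracked address spaces.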
    •
      drm/i915: Don't use gtt mapping for !gtt error objects · 496bfcb9
      Ben Widawsky committed
      The existing check was insufficient to determine whether we can use the
      GTT mapping to read out the object during error capture.
      
      The previous condition was: if the object has a GGTT mapping, and the
      reloc is in the GTT range... This can happen with objects mapped into
      multiple vms (one of which being the GTT).
      
      There are two solutions to this problem:
      1. This patch, which avoids reading the io mapping
      2. Use the GGTT offset with the io mapping.
      
      Since error capture is about recording the most accurate possible error
      state, and the error was caused by the object not in the GGTT - I opted
      for the former.
      Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      496bfcb9
    •
      drm/i915: Add vm to error BO capture · a7b91078
      Ben Widawsky committed
      formerly: drm/i915: Create VMAs (part 6) - finish error plumbing
      Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      a7b91078
  7. 12 Dec, 2013 2 commits
  8. 09 Nov, 2013 2 commits
  9. 30 Oct, 2013 1 commit
  10. 10 Oct, 2013 1 commit
  11. 09 Oct, 2013 1 commit
  12. 04 Oct, 2013 1 commit
    •
      drm/i915: Fix __wait_seqno to use true infinite timeouts · 094f9a54
      Chris Wilson committed
      When we switched to always using a timeout in conjunction with
      wait_seqno, we lost the ability to detect missed interrupts. Since we
      have had issues with interrupts on a number of generations, and they are
      required to be delivered in a timely fashion for a smooth UX, it is
      important that we do log errors found in the wild and prevent the
      display stalling for upwards of 1s every time the seqno interrupt is
      missed.
      
      Rather than continue to fix up the timeouts to work around the interface
      impedance in wait_event_*(), open code the combination of
      wait_event[_interruptible][_timeout], and use the exposed timer to
      poll for seqno should we detect a lost interrupt.
      
      v2: In order to satisfy the debug requirement of logging missed
      interrupts with the real world requirements of making machines work even
      if interrupts are hosed, we revert to polling after detecting a missed
      interrupt.
      
      v3: Throw in a debugfs interface to simulate broken hw not reporting
      interrupts.
      
      v4: s/EGAIN/EAGAIN/ (Imre)
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: Imre Deak <imre.deak@intel.com>
      [danvet: Don't use the struct typedef in new code.]
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      094f9a54
  13. 01 Oct, 2013 2 commits
  14. 24 Sep, 2013 1 commit
    •
      drm/i915: Use a temporary va_list for two-pass string handling · e29bb4eb
      Chris Wilson committed
      In
      
      commit edc3d884
      Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
      Date:   Thu May 23 13:55:35 2013 +0300
      
          drm/i915: avoid big kmallocs on reading error state
      
      we introduce a two-pass mechanism for splitting long strings being
      formatted into the error-state. The first pass finds the length, and the
      second pass emits the right portion of the string into the accumulation
      buffer. Unfortunately we use the same va_list for both passes, resulting
      in the second pass reading garbage off the end of the argument list. As
      the two passes are only used for boundaries between read() calls, the
      corruption is only rarely seen.
      
      This fixes the root cause behind
      
      commit baf27f9b
      Author: Chris Wilson <chris@chris-wilson.co.uk>
      Date:   Sat Jun 29 23:26:50 2013 +0100
      
          drm/i915: Break up the large vsnprintf() in print_error_buffers()
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Mika Kuoppala <mika.kuoppala@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: stable@vger.kernel.org
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      e29bb4eb
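
The bug class fixed here is easy to reproduce outside the kernel: a `va_list` that has been walked by one `vsnprintf()` call must not be reused for a second one. A minimal userspace sketch of the two-pass pattern with a temporary copy follows. `emit_window()` is a hypothetical helper written for this illustration, not the i915 function, and it uses a heap buffer for simplicity where the driver writes into its accumulation buffer.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: format a message, then copy only the bytes
 * starting at offset 'skip' into 'out' (at most cap - 1 bytes plus NUL).
 * Returns the number of bytes copied, or -1 on error.
 */
static int emit_window(char *out, size_t cap, size_t skip,
		       const char *fmt, ...)
{
	va_list args, tmp;
	char *full;
	size_t n = 0;
	int len;

	assert(out != NULL && cap > 0);

	va_start(args, fmt);

	/* Pass 1: measure. vsnprintf() consumes the list it is given,
	 * so hand it a temporary copy and keep 'args' untouched. */
	va_copy(tmp, args);
	len = vsnprintf(NULL, 0, fmt, tmp);
	va_end(tmp);
	if (len < 0) {
		va_end(args);
		return -1;
	}

	full = malloc((size_t)len + 1);
	if (!full) {
		va_end(args);
		return -1;
	}

	/* Pass 2: format for real with the still-unused original list. */
	vsnprintf(full, (size_t)len + 1, fmt, args);
	va_end(args);

	if (skip < (size_t)len) {
		n = (size_t)len - skip;
		if (n > cap - 1)
			n = cap - 1;
		memcpy(out, full + skip, n);
	}
	out[n] = '\0';
	free(full);
	return (int)n;
}
```

Passing `args` itself to both calls compiles without complaint, which is why the corruption only shows up when the second pass actually runs, in this commit's case at boundaries between read() calls.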
  15. 06 Sep, 2013 1 commit
  16. 04 Sep, 2013 1 commit
  17. 22 Aug, 2013 1 commit
  18. 08 Aug, 2013 2 commits
    •
      drm/i915: Update error capture for VMs · 95f5301d
      Ben Widawsky committed
      formerly: "drm/i915: Create VMAs (part 4) - Error capture"
      
      Since the active/inactive lists are per VM, we need to modify the error
      capture code to be aware of this, and also extend it to capture the
      buffers from all the VMs. For now all the code assumes only 1 VM, but it
      will become more generic over the next few patches.
      
      NOTE: If the number of VMs in a real world system grows significantly
      we'll have to focus on only capturing the guilty VM, or else it's likely
      there won't be enough space for error capture.
      
      v2: Squashed in the "part 6" which had dependencies on the mm_list
      change. Since I've moved the mm_list change to an earlier point in the
      series, we were able to accomplish it here and now.
      
      v3: Rebased over new error capture
      Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      95f5301d
    •
      drm/i915: mm_list is per VMA · ca191b13
      Ben Widawsky committed
      formerly: "drm/i915: Create VMAs (part 5) - move mm_list"
      
      The mm_list is used for the active/inactive LRUs. Since those LRUs are
      per address space, the link should be per VMA.
      
      Because we'll only ever have 1 VMA before this point, it's not incorrect
      to defer this change until this point in the patch series, and doing it
      here makes the change much easier to understand.
      
      Shamelessly manipulated out of Daniel:
      "active/inactive stuff is used by eviction when we run out of address
      space, so needs to be per-vma and per-address space. Bound/unbound otoh
      is used by the shrinker which only cares about the amount of memory used
      and not one bit about in which address space this memory is all used in.
      Of course to actually kick out an object we need to unbind it from every
      address space, but for that we have the per-object list of vmas."
      
      v2: only bump GGTT LRU in i915_gem_object_set_to_gtt_domain (Chris)
      
      v3: Moved earlier in the series
      
      v4: Add dropped message from v3
      Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
      [danvet: Frob patch to apply and use vma->node.size directly as
      discussed with Ben. Also drop a needless BUG_ON before move_to_inactive,
      the function itself has the same check.]
      [danvet 2nd: Rebase on top of the lost "drm/i915: Cleanup more of VMA
      in destroy", specifically unlink the vma from the mm_list in
      vma_unbind (to keep it symmetric with bind_to_vm) instead of
      vma_destroy.]
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      ca191b13
  19. 06 Aug, 2013 1 commit
  20. 18 Jul, 2013 1 commit
    •
      drm/i915: Move active/inactive lists to new mm · 5cef07e1
      Ben Widawsky committed
      Shamelessly manipulated out of Daniel :-)
      "When moving the lists around explain that the active/inactive stuff is
      used by eviction when we run out of address space, so needs to be
      per-vma and per-address space. Bound/unbound otoh is used by the
      shrinker which only cares about the amount of memory used and not one
      bit about in which address space this memory is all used in. Of course
      to actually kick out an object we need to unbind it from every address
      space, but for that we have the per-object list of vmas."
      
      v2: Leave the bound list as a global one. (Chris, indirectly)
      
      v3: Rebased with no i915_gtt_vm. In most places I added a new *vm local,
      since it will eventually be replaced by a vm argument.
      Put comment back inline, since it no longer makes sense to do otherwise.
      
      v4: Rebased on hangcheck/error state movement
      Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
      Reviewed-by: Imre Deak <imre.deak@intel.com>
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      5cef07e1
  21. 13 Jul, 2013 1 commit