1. 09 4月, 2016 2 次提交
  2. 08 4月, 2016 1 次提交
  3. 04 4月, 2016 1 次提交
    • T
      drm/i915: Move execlists irq handler to a bottom half · 27af5eea
      Tvrtko Ursulin 提交于
      Doing a lot of work in the interrupt handler introduces huge
      latencies to the system as a whole.
      
      Most dramatic effect can be seen by running an all engine
      stress test like igt/gem_exec_nop/all where, when the kernel
      config is lean enough, the whole system can be brought into
      multi-second periods of complete non-interactivty. That can
      look for example like this:
      
       NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143]
       Modules linked in: [redacted for brevity]
       CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G     U       L  4.5.0-160321+ #183
       Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1
       Workqueue: i915 gen6_pm_rps_work [i915]
       task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000
       RIP: 0010:[<ffffffff8104a3c2>]  [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0
       RSP: 0000:ffff88014f403f38  EFLAGS: 00000206
       RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0
       RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80
       RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022
       R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030
       R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082
       FS:  0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Stack:
        042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a
        0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080
        0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8
       Call Trace:
        <IRQ>
        [<ffffffff8104a716>] irq_exit+0x86/0x90
        [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50
        [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90
        <EOI>
        [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915]
        [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20
        [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915]
        [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0
        [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915]
        [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915]
        [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915]
        [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915]
        [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0
        [<ffffffff8105ab29>] process_one_work+0x139/0x350
        [<ffffffff8105b186>] worker_thread+0x126/0x490
        [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320
        [<ffffffff8105fa64>] kthread+0xc4/0xe0
        [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
        [<ffffffff814f351f>] ret_from_fork+0x3f/0x70
        [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
      
      I could not explain, or find a code path, which would explain
      a +20 second lockup, but from some instrumentation it was
      apparent the interrupts off proportion of time was between
      10-25% under heavy load which is quite bad.
      
      When a interrupt "cliff" is reached, which was >~320k irq/s on
      my machine, the whole system goes into a terrible state of the
      above described multi-second lockups.
      
      By moving the GT interrupt handling to a tasklet in a most
      simple way, the problem above disappears completely.
      
      Testing the effect on sytem-wide latencies using
      igt/gem_syslatency shows the following before this patch:
      
      gem_syslatency: cycles=1532739, latency mean=416531.829us max=2499237us
      gem_syslatency: cycles=1839434, latency mean=1458099.157us max=4998944us
      gem_syslatency: cycles=1432570, latency mean=2688.451us max=1201185us
      gem_syslatency: cycles=1533543, latency mean=416520.499us max=2498886us
      
      This shows that the unrelated process is experiencing huge
      delays in its wake-up latency. After the patch the results
      look like this:
      
      gem_syslatency: cycles=808907, latency mean=53.133us max=1640us
      gem_syslatency: cycles=862154, latency mean=62.778us max=2117us
      gem_syslatency: cycles=856039, latency mean=58.079us max=2123us
      gem_syslatency: cycles=841683, latency mean=56.914us max=1667us
      
      Showing a huge improvement in the unrelated process wake-up
      latency. It also shows an approximate halving in the number
      of total empty batches submitted during the test. This may
      not be worrying since the test puts the driver under
      a very unrealistic load with ncpu threads doing empty batch
      submission to all GPU engines each.
      
      Another benefit compared to the hard-irq handling is that now
      work on all engines can be dispatched in parallel since we can
      have up to number of CPUs active tasklets. (While previously
      a single hard-irq would serially dispatch on one engine after
      another.)
      
      More interesting scenario with regards to throughput is
      "gem_latency -n 100" which  shows 25% better throughput and
      CPU usage, and 14% better dispatch latencies.
      
      I did not find any gains or regressions with Synmark2 or
      GLbench under light testing. More benchmarking is certainly
      required.
      
      v2:
         * execlists_lock should be taken as spin_lock_bh when
           queuing work from userspace now. (Chris Wilson)
         * uncore.lock must be taken with spin_lock_irq when
           submitting requests since that now runs from either
           softirq or process context.
      
      v3:
         * Expanded commit message with more testing data;
         * converted missed locking sites to _bh;
         * added execlist_lock comment. (Chris Wilson)
      
      v4:
         * Mention dispatch parallelism in commit. (Chris Wilson)
         * Do not hold uncore.lock over MMIO reads since the block
           is already serialised per-engine via the tasklet itself.
           (Chris Wilson)
         * intel_lrc_irq_handler should be static. (Chris Wilson)
         * Cancel/sync the tasklet on GPU reset. (Chris Wilson)
         * Document and WARN that tasklet cannot be active/pending
           on engine cleanup. (Chris Wilson/Imre Deak)
      Signed-off-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Imre Deak <imre.deak@intel.com>
      Testcase: igt/gem_exec_nop/all
      Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94350Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk>
      Link: http://patchwork.freedesktop.org/patch/msgid/1459768316-6670-1-git-send-email-tvrtko.ursulin@linux.intel.com
      27af5eea
  4. 21 3月, 2016 1 次提交
  5. 16 3月, 2016 5 次提交
  6. 04 3月, 2016 1 次提交
  7. 01 3月, 2016 1 次提交
    • T
      drm/i915: Execlists small cleanups and micro-optimisations · c6a2ac71
      Tvrtko Ursulin 提交于
      Assorted changes in the areas of code cleanup, reduction of
      invariant conditional in the interrupt handler and lock
      contention and MMIO access optimisation.
      
       * Remove needless initialization.
       * Improve cache locality by reorganizing code and/or using
         branch hints to keep unexpected or error conditions out
         of line.
       * Favor busy submit path vs. empty queue.
       * Less branching in hot-paths.
      
      v2:
      
       * Avoid mmio reads when possible. (Chris Wilson)
       * Use natural integer size for csb indices.
       * Remove useless return value from execlists_update_context.
       * Extract 32-bit ppgtt PDPs update so it is out of line and
         shared with two callers.
       * Grab forcewake across all mmio operations to ease the
         load on uncore lock and use chepear mmio ops.
      
      v3:
      
       * Removed some more pointless u8 data types.
       * Removed unused return from execlists_context_queue.
       * Commit message updates.
      
      v4:
       * Unclumsify the unqueue if statement. (Chris Wilson)
       * Hide forcewake from the queuing function. (Chris Wilson)
      
      Version 3 now makes the irq handling code path ~20% smaller on
      48-bit PPGTT hardware, and a little bit less elsewhere. Hot
      paths are mostly in-line now and hammering on the uncore
      spinlock is greatly reduced together with mmio traffic to an
      extent.
      
      Benchmarking with "gem_latency -n 100" (keep submitting
      batches with 100 nop instruction) shows approximately 4% higher
      throughput, 2% less CPU time and 22% smaller latencies. This was
      on a big-core while small-cores could benefit even more.
      
      Most likely reason for the improvements are the MMIO
      optimization and uncore lock traffic reduction.
      
      One odd result is with "gem_latency -n 0" (dispatching empty
      batches) which shows 5% more throughput, 8% less CPU time,
      25% better producer and consumer latencies, but 15% higher
      dispatch latency which is yet unexplained.
      Signed-off-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk>
      Link: http://patchwork.freedesktop.org/patch/msgid/1456505912-22286-1-git-send-email-tvrtko.ursulin@linux.intel.com
      c6a2ac71
  8. 25 1月, 2016 1 次提交
  9. 21 1月, 2016 4 次提交
  10. 20 1月, 2016 1 次提交
  11. 18 1月, 2016 2 次提交
  12. 08 1月, 2016 1 次提交
    • M
      drm/i915: Inspect subunit states on hangcheck · 61642ff0
      Mika Kuoppala 提交于
      If head seems stuck and engine in question is rcs,
      inspect subunit state transitions from undone to done,
      before deciding that this really is a hang instead of limited
      progress. Only account the transitions of subunits from
      undone to done once, to prevent unstable subunit states
      to keep us falsely active.
      
      As this adds one extra steps to hangcheck heuristics,
      before hang is declared, it adds 1500ms to to detect hang
      for render ring to a total of 7500ms. We could sample
      the subunit states on first head stuck condition but
      decide not to do so only in order to mimic old behaviour. This
      way the check order of promotion from seqno > atchd > instdone
      is consistently done.
      
      v2: Deal with unstable done states (Arun)
          Clear instdone progress on head and seqno movement (Chris)
          Report raw and accumulated instdone's in in debugfs (Chris)
          Return HANGCHECK_ACTIVE on undone->done
      
      References: https://bugs.freedesktop.org/show_bug.cgi?id=93029
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Dave Gordon <david.s.gordon@intel.com>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Arun Siluvery <arun.siluvery@linux.intel.com>
      Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: NMika Kuoppala <mika.kuoppala@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/1448985372-19535-1-git-send-email-mika.kuoppala@intel.com
      61642ff0
  13. 10 12月, 2015 1 次提交
  14. 18 11月, 2015 2 次提交
  15. 29 10月, 2015 1 次提交
    • C
      drm/i915: Recover all available ringbuffer space following reset · 608c1a52
      Chris Wilson 提交于
      Having flushed all requests from all queues, we know that all
      ringbuffers must now be empty. However, since we do not reclaim
      all space when retiring the request (to prevent HEADs colliding
      with rapid ringbuffer wraparound) the amount of available space
      on each ringbuffer upon reset is less than when we start. Do one
      more pass over all the ringbuffers to reset the available space
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: NMika Kuoppala <mika.kuoppala@intel.com>
      Cc: Arun Siluvery <arun.siluvery@linux.intel.com>
      Cc: Mika Kuoppala <mika.kuoppala@intel.com>
      Cc: Dave Gordon <david.s.gordon@intel.com>
      608c1a52
  16. 04 9月, 2015 1 次提交
  17. 26 8月, 2015 1 次提交
    • I
      drm/i915/bxt: work around HW coherency issue when accessing GPU seqno · 319404df
      Imre Deak 提交于
      By running igt/store_dword_loop_render on BXT we can hit a coherency
      problem where the seqno written at GPU command completion time is not
      seen by the CPU. This results in __i915_wait_request seeing the stale
      seqno and not completing the request (not considering the lost
      interrupt/GPU reset mechanism). I also verified that this isn't a case
      of a lost interrupt, or that the command didn't complete somehow: when
      the coherency issue occured I read the seqno via an uncached GTT mapping
      too. While the cached version of the seqno still showed the stale value
      the one read via the uncached mapping was the correct one.
      
      Work around this issue by clflushing the corresponding CPU cacheline
      following any store of the seqno and preceding any reading of it. When
      reading it do this only when the caller expects a coherent view.
      
      v2:
      - fix using the proper logical && instead of a bitwise & (Jani, Mika)
      - limit the workaround to A stepping, on later steppings this HW issue
        is fixed
      v3:
      - use a separate get_seqno/set_seqno vfunc (Chris)
      
      Testcase: igt/store_dword_loop_render
      Signed-off-by: NImre Deak <imre.deak@intel.com>
      Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: NDaniel Vetter <daniel.vetter@ffwll.ch>
      319404df
  18. 14 7月, 2015 1 次提交
    • T
      drm/i915: Snapshot seqno of most recently submitted request. · 94f7bbe1
      Tomas Elf 提交于
      The hang checker needs to inspect whether or not the ring request list is empty
      as well as if the given engine has reached or passed the most recently
      submitted request. The problem with this is that the hang checker cannot grab
      the struct_mutex, which is required in order to safely inspect requests since
      requests might be deallocated during inspection. In the past we've had kernel
      panics due to this very unsynchronized access in the hang checker.
      
      One solution to this problem is to not inspect the requests directly since
      we're only interested in the seqno of the most recently submitted request - not
      the request itself. Instead the seqno of the most recently submitted request is
      stored separately, which the hang checker then inspects, circumventing the
      issue of synchronization from the hang checker entirely.
      
      This fixes a regression introduced in
      
      commit 44cdd6d2
      Author: John Harrison <John.C.Harrison@Intel.com>
      Date:   Mon Nov 24 18:49:40 2014 +0000
      
          drm/i915: Convert 'ring_idle()' to use requests not seqnos
      
      v2 (Chris Wilson):
      - Pass current engine seqno to ring_idle() from i915_hangcheck_elapsed() rather
      than compute it over again.
      - Remove extra whitespace.
      
      Issue: VIZ-5998
      Signed-off-by: NTomas Elf <tomas.elf@intel.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk>
      [danvet: Add regressing commit citation provided by Chris.]
      Signed-off-by: NDaniel Vetter <daniel.vetter@ffwll.ch>
      94f7bbe1
  19. 06 7月, 2015 1 次提交
  20. 03 7月, 2015 1 次提交
    • J
      drm/i915: Reserve space improvements · 79bbcc29
      John Harrison 提交于
      An earlier patch was added to reserve space in the ring buffer for the
      commands issued during 'add_request()'. The initial version was
      pessimistic in the way it handled buffer wrapping and would cause
      premature wraps and thus waste ring space.
      
      This patch updates the code to better handle the wrap case. It no
      longer enforces that the space being asked for and the reserved space
      are a single contiguous block. Instead, it allows the reserve to be on
      the far end of a wrap operation. It still guarantees that the space is
      available so when the wrap occurs, no wait will happen. Thus the wrap
      cannot fail which is the whole point of the exercise.
      
      Also fixed a merge failure with some comments from the original patch.
      
      v2: Incorporated suggestion by David Gordon to move the wrap code
      inside the prepare function and thus allow a single combined
      wait_for_space() call rather than doing one before the wrap and
      another after. This also makes the prepare code much simpler and
      easier to follow.
      
      v3: Fix for 'effective_size' vs 'size' during ring buffer remainder
      calculations (spotted by Tomas Elf).
      
      For: VIZ-5115
      CC: Daniel Vetter <daniel@ffwll.ch>
      Signed-off-by: NJohn Harrison <John.C.Harrison@Intel.com>
      Reviewed-by: NTomas Elf <tomas.elf@intel.com>
      Signed-off-by: NDaniel Vetter <daniel.vetter@ffwll.ch>
      79bbcc29
  21. 23 6月, 2015 10 次提交