1. 28 4月, 2016 3 次提交
  2. 25 4月, 2016 1 次提交
  3. 20 4月, 2016 1 次提交
  4. 14 4月, 2016 5 次提交
    • P
      drm/i915/mocs: Program MOCS for all engines on init · 0ccdacf6
      Peter Antoine 提交于
      Allow for the MOCS to be programmed for all engines.
      Currently we program the MOCS when the first render batch
      goes through. This works on most platforms but fails on
      platforms that do not run a render batch early,
      i.e. headless servers. The patch now programs all initialised engines
      on init and the RCS is programmed again within the initial batch. This
      is done for predictable consistency with regards to the hardware
      context.
      
      Hardware context loading sets the values of the MOCS for RCS
      and L3CC. Programming them from within the batch makes sure that
      the render context is valid, no matter what the previous state of
      the saved-context was.
      
      v2: posted correct version to the mailing list.
      v3: moved programming to within engine->init_hw() (Chris Wilson)
      v4: code formatting and white-space changes. (Chris Wilson)
      
      Testcase: igt/gem_mocs_settings
      Signed-off-by: NPeter Antoine <peter.antoine@intel.com>
      Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Link: http://patchwork.freedesktop.org/patch/msgid/1460556205-6644-1-git-send-email-chris@chris-wilson.co.uk
      0ccdacf6
    • C
      drm/i915: Late request cancellations are harmful · aa9b7810
      Chris Wilson 提交于
      Conceptually, each request is a record of a hardware transaction - we
      build up a list of pending commands and then either commit them to
      hardware, or cancel them. However, whilst building up the list of
      pending commands, we may modify state outside of the request and make
      references to the pending request. If we do so and then cancel that
      request, external objects then point to the deleted request leading to
      both graphical and memory corruption.
      
      The easiest example is to consider object/VMA tracking. When we mark an
      object as active in a request, we store a pointer to this, the most
      recent request, in the object. Then we want to free that object, we wait
      for the most recent request to be idle before proceeding (otherwise the
      hardware will write to pages now owned by the system, or we will attempt
      to read from those pages before the hardware is finished writing). If
      the request was cancelled instead, that wait completes immediately. As a
      result, all requests must be committed and not cancelled if the external
      state is unknown.
      
      All that remains of i915_gem_request_cancel() users are just a couple of
      extremely unlikely allocation failures, so remove the API entirely.
      
      A consequence of committing all incomplete requests is that we generate
      excess breadcrumbs and fill the ring much more often with dummy work. We
      have completely undone the outstanding_last_seqno optimisation.
      
      Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=93907Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: NDaniel Vetter <daniel.vetter@ffwll.ch>
      Link: http://patchwork.freedesktop.org/patch/msgid/1460565315-7748-16-git-send-email-chris@chris-wilson.co.uk
      aa9b7810
    • C
      drm/i915: Prevent leaking of -EIO from i915_wait_request() · f4457ae7
      Chris Wilson 提交于
      Reporting -EIO from i915_wait_request() has proven very troublematic
      over the years, with numerous hard-to-reproduce bugs cropping up in the
      corner case of where a reset occurs and the code wasn't expecting such
      an error.
      
      If the we reset the GPU or have detected a hang and wish to reset the
      GPU, the request is forcibly complete and the wait broken. Currently, we
      report either -EAGAIN or -EIO in order for the caller to retreat and
      restart the wait (if appropriate) after dropping and then reacquiring
      the struct_mutex (essential to allow the GPU reset to proceed). However,
      if we take the view that the request is complete (no further work will
      be done on it by the GPU because it is dead and soon to be reset), then
      we can proceed with the task at hand and then drop the struct_mutex
      allowing the reset to occur. This transfers the burden of checking
      whether it is safe to proceed to the caller, which in all but one
      instance it is safe - completely eliminating the source of all spurious
      -EIO.
      
      Of note, we only have two API entry points where we expect that
      userspace can observe an EIO. First is when submitting an execbuf, if
      the GPU is terminally wedged, then the operation cannot succeed and an
      -EIO is reported. Secondly, existing userspace uses the throttle ioctl
      to detect an already wedged GPU before starting using HW acceleration
      (or to confirm that the GPU is wedged after an error condition). So if
      the GPU is wedged when the user calls throttle, also report -EIO.
      
      v2: Split more carefully the change to i915_wait_request() and assorted
      ABI from the reset handling.
      v3: Add a couple of WARN_ON(EIO) to the interruptible modesetting code
      so that we don't start to leak EIO there in future (and break our hang
      resistant modesetting).
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Reviewed-by: NDaniel Vetter <daniel.vetter@ffwll.ch>
      Link: http://patchwork.freedesktop.org/patch/msgid/1460565315-7748-9-git-send-email-chris@chris-wilson.co.uk
      Link: http://patchwork.freedesktop.org/patch/msgid/1460565315-7748-1-git-send-email-chris@chris-wilson.co.uk
      f4457ae7
    • C
      drm/i915: Store the reset counter when constructing a request · 299259a3
      Chris Wilson 提交于
      As the request is only valid during the same global reset epoch, we can
      record the current reset_counter when constructing the request and reuse
      it when waiting upon that request in future. This removes a very hairy
      atomic check serialised by the struct_mutex at the time of waiting and
      allows us to transfer those waits to a central dispatcher for all
      waiters and all requests.
      
      PS: With per-engine resets, we obviously cannot assume a global reset
      epoch for the requests - a per-engine epoch makes the most sense. The
      challenge then is how to handle checking in the waiter for when to break
      the wait, as the fine-grained reset may also want to requeue the
      request (i.e. the assumption that just because the epoch changes the
      request is completed may be broken - or we just avoid breaking that
      assumption with the fine-grained resets).
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Reviewed-by: NDaniel Vetter <daniel.vetter@ffwll.ch>
      Link: http://patchwork.freedesktop.org/patch/msgid/1460565315-7748-7-git-send-email-chris@chris-wilson.co.uk
      299259a3
    • C
      drm/i915: Hide the atomic_read(reset_counter) behind a helper · c19ae989
      Chris Wilson 提交于
      This is principally a little bit of syntatic sugar to hide the
      atomic_read()s throughout the code to retrieve the current reset_counter.
      It also provides the other utility functions to check the reset state on the
      already read reset_counter, so that (in later patches) we can read it once
      and do multiple tests rather than risk the value changing between tests.
      
      v2: Be more strict on converting existing i915_reset_in_progress() over to
      the more verbose i915_reset_in_progress_or_wedged().
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Reviewed-by: NDaniel Vetter <daniel.vetter@ffwll.ch>
      Link: http://patchwork.freedesktop.org/patch/msgid/1460565315-7748-4-git-send-email-chris@chris-wilson.co.uk
      c19ae989
  5. 13 4月, 2016 3 次提交
  6. 12 4月, 2016 1 次提交
  7. 11 4月, 2016 1 次提交
  8. 09 4月, 2016 2 次提交
  9. 04 4月, 2016 1 次提交
    • T
      drm/i915: Move execlists irq handler to a bottom half · 27af5eea
      Tvrtko Ursulin 提交于
      Doing a lot of work in the interrupt handler introduces huge
      latencies to the system as a whole.
      
      Most dramatic effect can be seen by running an all engine
      stress test like igt/gem_exec_nop/all where, when the kernel
      config is lean enough, the whole system can be brought into
      multi-second periods of complete non-interactivty. That can
      look for example like this:
      
       NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143]
       Modules linked in: [redacted for brevity]
       CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G     U       L  4.5.0-160321+ #183
       Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1
       Workqueue: i915 gen6_pm_rps_work [i915]
       task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000
       RIP: 0010:[<ffffffff8104a3c2>]  [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0
       RSP: 0000:ffff88014f403f38  EFLAGS: 00000206
       RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0
       RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80
       RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022
       R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030
       R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082
       FS:  0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Stack:
        042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a
        0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080
        0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8
       Call Trace:
        <IRQ>
        [<ffffffff8104a716>] irq_exit+0x86/0x90
        [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50
        [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90
        <EOI>
        [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915]
        [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20
        [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915]
        [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0
        [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915]
        [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915]
        [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915]
        [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915]
        [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0
        [<ffffffff8105ab29>] process_one_work+0x139/0x350
        [<ffffffff8105b186>] worker_thread+0x126/0x490
        [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320
        [<ffffffff8105fa64>] kthread+0xc4/0xe0
        [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
        [<ffffffff814f351f>] ret_from_fork+0x3f/0x70
        [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170
      
      I could not explain, or find a code path, which would explain
      a +20 second lockup, but from some instrumentation it was
      apparent the interrupts off proportion of time was between
      10-25% under heavy load which is quite bad.
      
      When a interrupt "cliff" is reached, which was >~320k irq/s on
      my machine, the whole system goes into a terrible state of the
      above described multi-second lockups.
      
      By moving the GT interrupt handling to a tasklet in a most
      simple way, the problem above disappears completely.
      
      Testing the effect on sytem-wide latencies using
      igt/gem_syslatency shows the following before this patch:
      
      gem_syslatency: cycles=1532739, latency mean=416531.829us max=2499237us
      gem_syslatency: cycles=1839434, latency mean=1458099.157us max=4998944us
      gem_syslatency: cycles=1432570, latency mean=2688.451us max=1201185us
      gem_syslatency: cycles=1533543, latency mean=416520.499us max=2498886us
      
      This shows that the unrelated process is experiencing huge
      delays in its wake-up latency. After the patch the results
      look like this:
      
      gem_syslatency: cycles=808907, latency mean=53.133us max=1640us
      gem_syslatency: cycles=862154, latency mean=62.778us max=2117us
      gem_syslatency: cycles=856039, latency mean=58.079us max=2123us
      gem_syslatency: cycles=841683, latency mean=56.914us max=1667us
      
      Showing a huge improvement in the unrelated process wake-up
      latency. It also shows an approximate halving in the number
      of total empty batches submitted during the test. This may
      not be worrying since the test puts the driver under
      a very unrealistic load with ncpu threads doing empty batch
      submission to all GPU engines each.
      
      Another benefit compared to the hard-irq handling is that now
      work on all engines can be dispatched in parallel since we can
      have up to number of CPUs active tasklets. (While previously
      a single hard-irq would serially dispatch on one engine after
      another.)
      
      More interesting scenario with regards to throughput is
      "gem_latency -n 100" which  shows 25% better throughput and
      CPU usage, and 14% better dispatch latencies.
      
      I did not find any gains or regressions with Synmark2 or
      GLbench under light testing. More benchmarking is certainly
      required.
      
      v2:
         * execlists_lock should be taken as spin_lock_bh when
           queuing work from userspace now. (Chris Wilson)
         * uncore.lock must be taken with spin_lock_irq when
           submitting requests since that now runs from either
           softirq or process context.
      
      v3:
         * Expanded commit message with more testing data;
         * converted missed locking sites to _bh;
         * added execlist_lock comment. (Chris Wilson)
      
      v4:
         * Mention dispatch parallelism in commit. (Chris Wilson)
         * Do not hold uncore.lock over MMIO reads since the block
           is already serialised per-engine via the tasklet itself.
           (Chris Wilson)
         * intel_lrc_irq_handler should be static. (Chris Wilson)
         * Cancel/sync the tasklet on GPU reset. (Chris Wilson)
         * Document and WARN that tasklet cannot be active/pending
           on engine cleanup. (Chris Wilson/Imre Deak)
      Signed-off-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Imre Deak <imre.deak@intel.com>
      Testcase: igt/gem_exec_nop/all
      Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94350Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk>
      Link: http://patchwork.freedesktop.org/patch/msgid/1459768316-6670-1-git-send-email-tvrtko.ursulin@linux.intel.com
      27af5eea
  10. 24 3月, 2016 1 次提交
  11. 22 3月, 2016 1 次提交
  12. 18 3月, 2016 2 次提交
  13. 16 3月, 2016 5 次提交
  14. 04 3月, 2016 1 次提交
  15. 01 3月, 2016 1 次提交
    • T
      drm/i915: Execlists small cleanups and micro-optimisations · c6a2ac71
      Tvrtko Ursulin 提交于
      Assorted changes in the areas of code cleanup, reduction of
      invariant conditional in the interrupt handler and lock
      contention and MMIO access optimisation.
      
       * Remove needless initialization.
       * Improve cache locality by reorganizing code and/or using
         branch hints to keep unexpected or error conditions out
         of line.
       * Favor busy submit path vs. empty queue.
       * Less branching in hot-paths.
      
      v2:
      
       * Avoid mmio reads when possible. (Chris Wilson)
       * Use natural integer size for csb indices.
       * Remove useless return value from execlists_update_context.
       * Extract 32-bit ppgtt PDPs update so it is out of line and
         shared with two callers.
       * Grab forcewake across all mmio operations to ease the
         load on uncore lock and use chepear mmio ops.
      
      v3:
      
       * Removed some more pointless u8 data types.
       * Removed unused return from execlists_context_queue.
       * Commit message updates.
      
      v4:
       * Unclumsify the unqueue if statement. (Chris Wilson)
       * Hide forcewake from the queuing function. (Chris Wilson)
      
      Version 3 now makes the irq handling code path ~20% smaller on
      48-bit PPGTT hardware, and a little bit less elsewhere. Hot
      paths are mostly in-line now and hammering on the uncore
      spinlock is greatly reduced together with mmio traffic to an
      extent.
      
      Benchmarking with "gem_latency -n 100" (keep submitting
      batches with 100 nop instruction) shows approximately 4% higher
      throughput, 2% less CPU time and 22% smaller latencies. This was
      on a big-core while small-cores could benefit even more.
      
      Most likely reason for the improvements are the MMIO
      optimization and uncore lock traffic reduction.
      
      One odd result is with "gem_latency -n 0" (dispatching empty
      batches) which shows 5% more throughput, 8% less CPU time,
      25% better producer and consumer latencies, but 15% higher
      dispatch latency which is yet unexplained.
      Signed-off-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk>
      Link: http://patchwork.freedesktop.org/patch/msgid/1456505912-22286-1-git-send-email-tvrtko.ursulin@linux.intel.com
      c6a2ac71
  16. 26 2月, 2016 3 次提交
  17. 29 1月, 2016 4 次提交
  18. 25 1月, 2016 2 次提交
  19. 21 1月, 2016 2 次提交