1. 06 Mar 2019, 1 commit
  2. 02 Mar 2019, 3 commits
    • drm/i915: Prioritise non-busywait semaphore workloads · f9e9e9de
      Authored by Chris Wilson
      We don't want to busywait on the GPU if we have other work to do. If we
      give non-busywaiting workloads higher (initial) priority than workloads
      that require a busywait, we will prioritise work that is ready to run
      immediately. We then also have to be careful that we don't give earlier
      semaphores an accidental boost because later work doesn't wait on other
      rings, hence we keep a history of semaphore usage of the dependency chain.
      
      v2: Stop rolling the bits into a chain and just use a flag in case this
      request or any of our dependencies use a semaphore. The rolling around
      was contagious as Tvrtko was heard to fall off his chair.
      
      Testcase: igt/gem_exec_schedule/semaphore
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190301170901.8340-4-chris@chris-wilson.co.uk
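The scheme described above can be sketched in a few lines of C. This is a minimal illustration of the v2 idea (a flag recording whether this request or any dependency uses a semaphore, consulted when choosing the initial priority); all names and the priority encoding are assumptions, not the driver's actual structures.

```c
#include <assert.h>
#include <stdbool.h>

enum { PRIO_NORMAL = 0, PRIO_NOSEMAPHORE = 1 };

struct sketch_request {
	bool uses_semaphore;   /* this request emits a semaphore busywait */
	bool chain_semaphore;  /* something in its dependency chain did */
};

/* v2 simplification: just propagate a flag from each dependency,
 * instead of rolling bits along the chain. */
static void sketch_add_dependency(struct sketch_request *rq,
				  const struct sketch_request *dep)
{
	if (dep->uses_semaphore || dep->chain_semaphore)
		rq->chain_semaphore = true;
}

/* Work that never busywaits gets the higher initial priority, so it is
 * scheduled ahead of workloads that would spin on the GPU. */
static int sketch_initial_priority(const struct sketch_request *rq)
{
	if (rq->uses_semaphore || rq->chain_semaphore)
		return PRIO_NORMAL;
	return PRIO_NOSEMAPHORE;
}
```

Note how the boost is denied both to the semaphore user itself and to anything depending on it, matching the "keep a history of semaphore usage of the dependency chain" point above.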
    • drm/i915: Use HW semaphores for inter-engine synchronisation on gen8+ · e8861964
      Authored by Chris Wilson
      Having introduced per-context seqnos, we now have a means to identify
      progress across the system without fear of the rollback that befell
      the global_seqno. That is, we can program a MI_SEMAPHORE_WAIT
      operation in advance of submission, safe in the knowledge that our
      target seqno and address are stable.
      
      However, since we are telling the GPU to busy-spin on the target address
      until it matches the signaling seqno, we only want to do so when we are
      sure that busy-spin will be completed quickly. To achieve this we only
      submit the request to HW once the signaler is itself executing (modulo
      preemption causing us to wait longer), and we only do so for default and
      above priority requests (so that idle priority tasks never themselves
      hog the GPU waiting for others).
      
      As might be reasonably expected, HW semaphores excel in inter-engine
      synchronisation microbenchmarks (where the 3x reduced latency / increased
      throughput more than offset the power cost of spinning on a second ring)
      and have significant improvement (can be up to ~10%, most see no change)
      for single clients that utilize multiple engines (typically media players
      and transcoders), without regressing multiple clients that can saturate
      the system or changing the power envelope dramatically.
      
      v3: Drop the older NEQ branch, now we pin the signaler's HWSP anyway.
      v4: Tell the world and include it as part of scheduler caps.
      
      Testcase: igt/gem_exec_whisper
      Testcase: igt/benchmarks/gem_wsim
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190301170901.8340-3-chris@chris-wilson.co.uk
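The submission rule in the second paragraph can be condensed into a predicate. A sketch under stated assumptions (the names and the "default priority is zero" encoding are illustrative, not the driver's code):

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PRIO_DEFAULT 0

/* Only emit a MI_SEMAPHORE_WAIT busy-spin when the signaler is already
 * executing on its engine and the waiter is at default priority or
 * above, so the spin is expected to complete quickly and idle-priority
 * tasks never hog the GPU waiting for others. */
static bool sketch_use_semaphore(bool signaler_on_hw, int waiter_prio)
{
	if (waiter_prio < SKETCH_PRIO_DEFAULT)
		return false;
	return signaler_on_hw;
}
```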
    • drm/i915: Keep timeline HWSP allocated until idle across the system · ebece753
      Authored by Chris Wilson
      In preparation for enabling HW semaphores, we need to keep the
      in-flight timeline HWSP alive until its use across the entire system
      has completed, as any other timeline active on the GPU may still
      refer back to the already retired timeline. We therefore have to
      delay both recycling available cachelines and unpinning the old HWSP
      until the next idle point.
      
      An easy option would be to simply keep all used HWSP until the system as
      a whole was idle, i.e. we could release them all at once on parking.
      However, on a busy system, we may never see a global idle point,
      essentially meaning the resource will be leaked until we are forced to
      do a GC pass. We already employ a fine-grained idle detection mechanism
      for vma, which we can reuse here so that each cacheline can be freed
      immediately after the last request using it is retired.
      
      v3: Keep track of the activity of each cacheline.
      v4: cacheline_free() on canceling the seqno tracking
      v5: Finally with a testcase to exercise wraparound
      v6: Pack cacheline into empty bits of page-aligned vaddr
      v7: Use i915_utils to hide the pointer casting around bit manipulation
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190301170901.8340-2-chris@chris-wilson.co.uk
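The v6 note above (packing the cacheline into the empty bits of a page-aligned vaddr) is a classic pointer-tagging trick. A minimal userspace sketch with hypothetical names (the driver hides the casts behind i915_utils helpers):

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_CACHELINE_MASK ((uintptr_t)SKETCH_PAGE_SIZE - 1)

/* A page-aligned vaddr has its low 12 bits clear, so the cacheline
 * index can ride along in them for free. */
static void *sketch_pack(void *page_vaddr, unsigned int cacheline)
{
	return (void *)((uintptr_t)page_vaddr |
			(cacheline & SKETCH_CACHELINE_MASK));
}

static unsigned int sketch_cacheline(void *packed)
{
	return (uintptr_t)packed & SKETCH_CACHELINE_MASK;
}

static void *sketch_vaddr(void *packed)
{
	return (void *)((uintptr_t)packed & ~SKETCH_CACHELINE_MASK);
}
```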
  3. 01 Mar 2019, 2 commits
    • drm/i915: Introduce i915_timeline.mutex · 3ef71149
      Authored by Chris Wilson
      A simple mutex used for guarding the flow of requests in and out of the
      timeline. In the short-term, it will be used only to guard the addition
      of requests into the timeline, taken on alloc and released on commit so
      that only one caller can construct a request into the timeline
      (important as the seqno and ring pointers must be serialised). This will
      be used by observers to ensure that the seqno/hwsp is stable. Later,
      when we have reduced retiring to only operate on a single timeline at a
      time, we can then use the mutex as the sole guard required for retiring.
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190301110547.14758-2-chris@chris-wilson.co.uk
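The taken-on-alloc, released-on-commit pattern described above can be sketched as follows. To keep the example self-contained the mutex is modelled as a flag; the comments mark where the real mutex_lock/mutex_unlock would sit, and the names are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_timeline {
	bool mutex_held;	/* stands in for i915_timeline.mutex */
	unsigned int seqno;
};

/* Taken on request allocation and held across construction, so only
 * one caller at a time can build a request into the timeline (the
 * seqno and ring pointers must be serialised). */
static unsigned int sketch_request_alloc(struct sketch_timeline *tl)
{
	assert(!tl->mutex_held);	/* mutex_lock(&tl->mutex) */
	tl->mutex_held = true;
	return ++tl->seqno;		/* safe: we are the sole writer */
}

/* Released on commit, letting the next constructor proceed. */
static void sketch_request_commit(struct sketch_timeline *tl)
{
	assert(tl->mutex_held);
	tl->mutex_held = false;		/* mutex_unlock(&tl->mutex) */
}
```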
    • drm/i915/execlists: Suppress mere WAIT preemption · b5773a36
      Authored by Chris Wilson
      WAIT is occasionally suppressed by virtue of preempted requests being
      promoted to NEWCLIENT if they have not already received that boost.
      Make this consistent for all WAIT boosts: they are not allowed to
      preempt executing contexts and are merely granted the right to be at
      the front of the queue for the next execution slot. This is in
      keeping with the desire that the WAIT boost be a minor tweak that
      does not give excessive promotion to its user or open ourselves to
      trivial abuse.

      The problem with the inconsistent WAIT preemption becomes more
      apparent as the preemption is propagated across the engines, where
      one engine may preempt and the other not, while we are relying on the
      exact execution order being consistent across engines (e.g. using HW
      semaphores to coordinate parallel execution).
      
      v2: Also protect GuC submission from false preemption loops.
      v3: Build bug safeguards and better debug messages for st.
      v4: Do the priority bumping in unsubmit (i.e. on preemption/reset
      unwind), applying it earlier during submit causes out-of-order execution
      combined with execute fences.
      v5: Call sw_fence_fini for our dummy request (Matthew)
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Matthew Auld <matthew.auld@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190228220639.3173-1-chris@chris-wilson.co.uk
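The "front of the queue but no preemption" rule can be expressed as masking the WAIT bit out of the priority used for the preemption decision. A sketch with an assumed bit encoding (the macro names and layout are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_WAIT_BOOST	(1 << 0)
#define SKETCH_NEWCLIENT_BOOST	(1 << 1)

/* The WAIT bit is stripped when deciding whether to kick the running
 * context, so a mere WAIT boost only reorders the pending queue. */
static int sketch_effective_prio(int prio)
{
	return prio & ~SKETCH_WAIT_BOOST;
}

static bool sketch_need_preempt(int queued_prio, int active_prio)
{
	return sketch_effective_prio(queued_prio) > active_prio;
}
```

The queue itself would still sort on the full priority value, so WAIT keeps its front-of-queue effect for the next execution slot.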
  4. 28 Feb 2019, 1 commit
  5. 26 Feb 2019, 2 commits
  6. 21 Feb 2019, 1 commit
  7. 19 Feb 2019, 1 commit
    • drm/i915: Use time based guilty context banning · 7f4127c4
      Authored by Chris Wilson
      Currently, we accumulate a score each time a context hangs the GPU,
      offset against the number of requests it submits, and if that score
      exceeds a certain threshold, we ban that context from submitting any
      more requests (cancelling any work in flight). In contrast, we use a
      simple timer on the file: if we see more than 9 hangs, each less
      than 60s apart, in total across all of its contexts, we ban the
      client from creating any more contexts. This leads to a confusing
      situation where the file may be banned before the context, so let's
      use a simple timer scheme for each.

      If the context submits 3 hanging requests within a 120s period,
      declare it forbidden to ever send more requests.
      
      This has the advantage of not being easy to repair by simply sending
      empty requests, but has the disadvantage that if the context is idle
      then it is forgiven. However, if the context is idle, it is not
      disrupting the system, but a hog can evade the request counting and
      cause much more severe disruption to the system.
      
      Updating ban_score from request retirement is dubious as the retirement
      is purposely not in sync with request submission (i.e. we try and batch
      retirement to reduce overhead and avoid latency on submission), which
      leads to surprising situations where we can forgive a hang immediately
      due to a backlog of requests from before the hang being retired
      afterwards.
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Mika Kuoppala <mika.kuoppala@intel.com>
      Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190219122215.8941-2-chris@chris-wilson.co.uk
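The 3-hangs-in-120s rule amounts to keeping the timestamps of the last three hangs in a small ring buffer. A sketch under those assumptions (names and layout are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_BAN_LIMIT  3
#define SKETCH_BAN_PERIOD 120ul	/* seconds */

struct sketch_ctx {
	unsigned long hang_ts[SKETCH_BAN_LIMIT];
	int idx, count;
	bool banned;
};

static void sketch_mark_guilty(struct sketch_ctx *c, unsigned long now)
{
	c->hang_ts[c->idx] = now;
	c->idx = (c->idx + 1) % SKETCH_BAN_LIMIT;
	if (c->count < SKETCH_BAN_LIMIT)
		c->count++;
	/* once the ring is full, the oldest recorded hang sits at c->idx;
	 * three hangs inside the window means the context is banned */
	if (c->count == SKETCH_BAN_LIMIT &&
	    now - c->hang_ts[c->idx] <= SKETCH_BAN_PERIOD)
		c->banned = true;
}
```

Because only timestamps are kept, an idle context's old hangs age out of the window, which is exactly the forgiveness trade-off discussed above.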
  8. 15 Feb 2019, 1 commit
  9. 13 Feb 2019, 1 commit
  10. 06 Feb 2019, 1 commit
  11. 05 Feb 2019, 2 commits
    • drm/i915: Add timeline barrier support · 78108584
      Authored by Tvrtko Ursulin
      Timeline barrier allows serialization between different timelines.
      
      After calling i915_timeline_set_barrier with a request, all following
      submissions on this timeline will be set up as depending on this request,
      or barrier. Once the barrier has been completed it automatically gets
      cleared and things continue as normal.
      
      This facility will be used by the upcoming context SSEU code.
      
      v2:
       * Assert barrier has been retired on timeline_fini. (Chris Wilson)
       * Fix mock_timeline.
      
      v3:
       * Improved comment language. (Chris Wilson)
      
      v4:
       * Maintain ordering with previous barriers set on the timeline.
      
      v5:
       * Rebase.
      Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190205095032.22673-3-tvrtko.ursulin@linux.intel.com
    • drm/i915: Trim NEWCLIENT boosting · 1413b2bc
      Authored by Chris Wilson
      Limit the NEWCLIENT boost to give its small priority bump only to
      fresh clients that have no dependencies.
      
      The idea for using NEWCLIENT boosting, commit b16c7651 ("drm/i915:
      Priority boost for new clients"), is that short-lived streams are often
      interactive and require lower latency -- and that by executing those
      ahead of the long running hogs, the short-lived clients do little to
      interfere with the system throughput by virtue of their short-lived
      nature. However, we were only considering the client's own timeline for
      determining whether or not it was a fresh stream. This allowed
      compositors to wake up before their vblank and bump all of their client
      streams. However, in testing with media-bench this results in chaining
      all cooperating contexts together preventing us from being able to
      reorder contexts to reduce bubbles (pipeline stalls), overall increasing
      latency, and reducing system throughput. The exact opposite of our
      intent. The compromise of applying the NEWCLIENT boost to strictly fresh
      clients (that do not wait upon anything else) should maintain the
      "real-time response under load" characteristics of FQ_CODEL, without
      locking together the long chains of dependencies across the system.
      
      References: b16c7651 ("drm/i915: Priority boost for new clients")
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190204150101.30759-1-chris@chris-wilson.co.uk
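The trimmed condition boils down to one predicate. A sketch with illustrative names: the boost applies only when the request is first on its own timeline AND waits on nothing else, so a compositor waking its dependent clients no longer chains everything together.

```c
#include <assert.h>
#include <stdbool.h>

/* NEWCLIENT is a small latency boost for genuinely fresh, short-lived
 * streams; any dependency disqualifies the request so long chains of
 * cooperating contexts are not locked together. */
static bool sketch_newclient_boost(bool first_on_timeline,
				   unsigned int num_dependencies)
{
	return first_on_timeline && num_dependencies == 0;
}
```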
  12. 30 Jan 2019, 2 commits
    • drm/i915: Replace global breadcrumbs with per-context interrupt tracking · 52c0fdb2
      Authored by Chris Wilson
      A few years ago, see commit 688e6c72 ("drm/i915: Slaughter the
      thundering i915_wait_request herd"), the issue of handling multiple
      clients waiting in parallel was brought to our attention. The
      requirement was that every client should be woken immediately upon its
      request being signaled, without incurring any cpu overhead.
      
      Handling certain fragility of our hw meant that we could not do a
      simple check inside the irq handler (some generations required almost
      unbounded delays before we could be sure of seqno coherency), and so
      request completion checking required delegation.
      
      Before commit 688e6c72, the solution was simple. Every client
      waiting on a request would be woken on every interrupt and each would do
      a heavyweight check to see if their request was complete. Commit
      688e6c72 introduced an rbtree so that only the earliest waiter on
      the global timeline would be woken, and it would wake the next and so on.
      (Along with various complications to handle requests being reordered
      along the global timeline, and also a requirement for kthread to provide
      a delegate for fence signaling that had no process context.)
      
      The global rbtree depends on knowing the execution timeline (and global
      seqno). Without knowing that order, we must instead check all contexts
      queued to the HW to see which may have advanced. We trim that list by
      only checking queued contexts that are being waited on, but still we
      keep a list of all active contexts and their active signalers that we
      inspect from inside the irq handler. By moving the waiters onto the fence
      signal list, we can combine the client wakeup with the dma_fence
      signaling (a dramatic reduction in complexity, but does require the HW
      being coherent, the seqno must be visible from the cpu before the
      interrupt is raised - we keep a timer backup just in case).
      
      Having previously fixed all the issues with irq-seqno serialisation (by
      inserting delays onto the GPU after each request instead of random delays
      on the CPU after each interrupt), we can rely on the seqno state to
      perform direct wakeups from the interrupt handler. This allows us to
      preserve our single context switch behaviour of the current routine,
      with the only downside that we lose the RT priority sorting of wakeups.
      In general, direct wakeup latency of multiple clients is about the same
      (about 10% better in most cases) with a reduction in total CPU time spent
      in the waiter (about 20-50% depending on gen). Average herd behaviour is
      improved, but at the cost of not delegating wakeups on task_prio.
      
      v2: Capture fence signaling state for error state and add comments to
      warm even the most cold of hearts.
      v3: Check if the request is still active before busywaiting
      v4: Reduce the amount of pointer misdirection with list_for_each_safe
      and using a local i915_request variable inside the loops
      v5: Add a missing pluralisation to a purely informative selftest message.
      
      References: 688e6c72 ("drm/i915: Slaughter the thundering i915_wait_request herd")
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190129205230.19056-2-chris@chris-wilson.co.uk
    • drm/i915: Identify active requests · 85474441
      Authored by Chris Wilson
      To allow requests to forgo a common execution timeline, one question we
      need to be able to answer is "is this request running?". To track
      whether a request has started on HW, we can emit a breadcrumb at the
      beginning of the request and check its timeline's HWSP to see if the
      breadcrumb has advanced past the start of this request. (This is in
      contrast to the global timeline where we need only ask if we are on the
      global timeline and if the timeline has advanced past the end of the
      previous request.)
      
      There is still confusion from a preempted request, which has already
      started but relinquished the HW to a high priority request. For the
      common case, this discrepancy should be negligible. However, for
      identification of hung requests, knowing which one was running at the
      time of the hang will be much more important.
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190129185452.20989-2-chris@chris-wilson.co.uk
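The "has this request started?" check above is a wrap-safe seqno comparison of the start breadcrumb against the timeline's HWSP. A sketch using the classic i915_seqno_passed idiom (function names here are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Signed difference handles the u32 seqno wrapping around zero. */
static bool sketch_seqno_passed(uint32_t current, uint32_t seqno)
{
	return (int32_t)(current - seqno) >= 0;
}

/* The request has started once the timeline's HWSP has advanced past
 * the breadcrumb emitted at the beginning of the request. */
static bool sketch_request_started(uint32_t hwsp_seqno,
				   uint32_t start_breadcrumb)
{
	return sketch_seqno_passed(hwsp_seqno, start_breadcrumb);
}
```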
  13. 29 Jan 2019, 2 commits
  14. 25 Jan 2019, 2 commits
  15. 22 Jan 2019, 1 commit
  16. 21 Jan 2019, 1 commit
  17. 17 Jan 2019, 1 commit
  18. 10 Jan 2019, 1 commit
  19. 31 Dec 2018, 2 commits
  20. 28 Dec 2018, 1 commit
  21. 07 Dec 2018, 2 commits
  22. 27 Nov 2018, 1 commit
  23. 26 Oct 2018, 1 commit
    • drm/i915: Park signaling thread while wrapping the seqno · 1e016a86
      Authored by Chris Wilson
      A danger encountered when resetting the seqno (using
      debugfs/i915_next_seqno) is that as we change the breadcrumb stored in
      the HWSP, it may be inspected by the signaler thread leading to
      confusion in our sanity checks.
      
      <0> [136.331342] i915/sig-347     3..s1 136336154us : execlists_submission_tasklet: rcs0 awake?=1, active=5
      <0> [136.331373] i915/sig-347     3d.s2 136336155us : process_csb: rcs0 cs-irq head=5, tail=0
      <0> [136.331402] i915/sig-347     3d.s2 136336155us : process_csb: rcs0 csb[0]: status=0x00000018:0x00000002, active=0x5
      <0> [136.331434] i915/sig-347     3d.s2 136336156us : process_csb: rcs0 out[0]: ctx=2.1, global=219 (fence 46:8455) (current 219), prio=0
      <0> [136.331466] i915/sig-347     3d.s2 136336156us : process_csb: rcs0 completed ctx=2
      <0> [136.332027] gem_exec-1049    0.... 136336246us : reset_all_global_seqno.part.5: rcs0 seqno 219 (current 219) -> -43
      <0> [136.332056] gem_exec-1049    0.... 136336251us : reset_all_global_seqno.part.5: bcs0 seqno 183 (current 183) -> -43
      <0> [136.332085] gem_exec-1049    0.... 136336255us : reset_all_global_seqno.part.5: vcs0 seqno 191 (current 191) -> -43
      <0> [136.332114] gem_exec-1049    0.... 136336259us : reset_all_global_seqno.part.5: vcs1 seqno 180 (current 180) -> -43
      <0> [136.332143] gem_exec-1049    0.... 136336262us : reset_all_global_seqno.part.5: vecs0 seqno 212 (current 212) -> -43
      <0> [136.332174] i915/sig-347     3.... 136336280us : intel_breadcrumbs_signaler: intel_breadcrumbs_signaler:673 GEM_BUG_ON(!i915_request_completed(rq))
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20181024104939.2861-1-chris@chris-wilson.co.uk
  24. 05 Oct 2018, 1 commit
  25. 02 Oct 2018, 3 commits
  26. 14 Sep 2018, 1 commit
    • drm/i915: Limit the backpressure for i915_request allocation · 11abf0c5
      Authored by Chris Wilson
      If we try and fail to allocate an i915_request, we apply some
      backpressure on the clients to throttle the memory allocations coming
      from i915.ko. Currently, we wait until completely idle, but this is far
      too heavy and leads to some situations where the only escape is to
      declare a client hung and reset the GPU. The intent is to only ratelimit
      the allocation requests and to allow ourselves to recycle requests and
      memory from any long queues built up by a client hog.
      
      Although system memory is inherently a global resource, we don't
      want to overly penalize an unlucky client by making it pay the price
      of reaping a hog. To reduce the influence of one client on another,
      we can, instead of waiting for the entire GPU to idle, impose a
      barrier on the local client.
      (One end goal for request allocation is for scalability to many
      concurrent allocators; simultaneous execbufs.)
      
      To prevent ourselves from getting caught out by long running requests
      (requests that may never finish without userspace intervention, whom we
      are blocking) we need to impose a finite timeout, ideally shorter than
      hangcheck. A long time ago Paul McKenney suggested that RCU users should
      ratelimit themselves using judicious use of cond_synchronize_rcu(). This
      gives us the opportunity to reduce our indefinite wait for the GPU to
      idle to a wait for the RCU grace period of the previous allocation along
      this timeline to expire, satisfying both the local and finite properties
      we desire for our ratelimiting.
      
      There are still a few global steps (reclaim not least amongst those!)
      when we exhaust the immediate slab pool, at least now the wait is itself
      decoupled from struct_mutex for our glorious highly parallel future!
      
      Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20180914080017.30308-1-chris@chris-wilson.co.uk
  27. 07 Aug 2018, 1 commit
  28. 09 Jul 2018, 1 commit
    • drm/i915: Provide a timeout to i915_gem_wait_for_idle() · ec625fb9
      Authored by Chris Wilson
      Usually we have no idea about the upper bound we need to wait to catch
      up with userspace when idling the device, but in a few situations we
      know the system was idle beforehand and can provide a short timeout in
      order to very quickly catch a failure, long before hangcheck kicks in.
      
      In the following patches, we will use the timeout to curtail two overly
      long waits, where we know we can expect the GPU to complete within a
      reasonable time or declare it broken.
      
      In particular, with a broken GPU we expect it to fail during the
      initial GPU setup, where we do a couple of context switches to record
      the defaults. This is a task that takes a few milliseconds even on
      the slowest of devices, but we may have to wait 60s for hangcheck to
      give in and declare the machine inoperable. This is a case where any
      GPU hang is unacceptable, from both a timeliness and a practical
      standpoint.
      
      The other improvement is that in selftests, we do not need to arm an
      independent timer to inject a wedge, as we can just limit the timeout on
      the wait directly.
      
      v2: Include the timeout parameter in the trace.
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20180709122044.7028-1-chris@chris-wilson.co.uk
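The shape of a bounded idle-wait can be sketched in userspace C, with polling steps standing in for jiffies. All names are illustrative; this is a sketch of the idea, not the driver's i915_gem_wait_for_idle() implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Poll for idle for at most `timeout` steps instead of waiting forever
 * and letting hangcheck fire after 60s; on expiry the caller can
 * immediately declare the GPU broken. */
static int sketch_wait_for_idle(bool (*is_idle)(void *), void *data,
				long timeout)
{
	do {
		if (is_idle(data))
			return 0;
	} while (timeout-- > 0);
	return -ETIME;
}

/* helper for the usage example: reports idle after N polls */
static bool sketch_idle_after(void *data)
{
	long *countdown = data;
	return --*countdown <= 0;
}
```

This also mirrors the selftest improvement noted above: rather than arming an independent timer to inject a wedge, the wait itself carries the limit.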