1. 19 6月, 2019 1 次提交
    • C
      drm/i915: Make the semaphore saturation mask global · 44d89409
      Chris Wilson 提交于
      The idea behind keeping the saturation mask local to a context backfired
      spectacularly. The premise with the local mask was that we would be more
      proactive in attempting to use semaphores after each time the context
      idled, and that all new contexts would attempt to use semaphores
      ignoring the current state of the system. This turns out to be horribly
      optimistic. If the system state is still oversaturated and the existing
      workloads have all stopped using semaphores, the new workloads would
      attempt to use semaphores and be deprioritised behind real work. The
      new contexts would not switch off using semaphores until their initial
      batch of low priority work had completed. Given sufficient backload load
      of equal user priority, this would completely starve the new work of any
      GPU time.
      
      To compensate, remove the local tracking in favour of keeping it as
      global state on the engine -- once the system is saturated and
      semaphores are disabled, everyone stops attempting to use semaphores
      until the system is idle again. One of the reason for preferring local
      context tracking was that it worked with virtual engines, so for
      switching to global state we could either do a complete check of all the
      virtual siblings or simply disable semaphores for those requests. This
      takes the simpler approach of disabling semaphores on virtual engines.
      
      The downside is that the decision that the engine is saturated is a
      local measure -- we are only checking whether or not this context was
      scheduled in a timely fashion, it may be legitimately delayed due to user
      priorities. We still have the same dilemma though, that we do not want
      to employ the semaphore poll unless it will be used.
      
      v2: Explain why we need to assume the worst wrt virtual engines.
      
      Fixes: ca6e56f6 ("drm/i915: Disable semaphore busywaits on saturated systems")
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
      Cc: Dmitry Ermilov <dmitry.ermilov@intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190618074153.16055-8-chris@chris-wilson.co.uk
      44d89409
  2. 18 6月, 2019 1 次提交
  3. 15 6月, 2019 3 次提交
  4. 14 6月, 2019 2 次提交
  5. 12 6月, 2019 1 次提交
  6. 11 6月, 2019 1 次提交
  7. 28 5月, 2019 1 次提交
  8. 22 5月, 2019 2 次提交
  9. 20 5月, 2019 3 次提交
    • C
      drm/i915: Truly bump ready tasks ahead of busywaits · a491cc8e
      Chris Wilson 提交于
      In commit b7404c7e ("drm/i915: Bump ready tasks ahead of
      busywaits"), I tried cutting a corner in order to not install a signal
      for each of our dependencies, and only listened to requests on which we
      were intending to busywait. The compromise that was made was that
      instead of then being able to promote the request with a full
      NOSEMAPHORE like its non-busywaiting brethren, as we had not ensured we
      had cleared the semaphore chain, we settled for only using the NEWCLIENT
      boost. With an over saturated system with multiple NEWCLIENTS in flight
      at any time, this was found to be an inadequate promotion and left us
      with a much poorer scheduling order than prior to using semaphores.
      
      The outcome of this patch, is that all requests have NOSEMAPHORE
      priority when they have no dependencies and are ready to run and not
      busywait, restoring the pre-semaphore ordering on saturated systems.
      
      We can demonstrate the effect of poor scheduling order by oversaturating
      the system using gem_wsim on a system with multiple vcs engines
      (i.e running the same workloads across more clients than required for
      peak throughput, e.g. media_load_balance_17i7.wsim -c4 -b context):
      
      x v5.1 (normalized)
      + tip
      * fix
      +------------------------------------------------------------------------+
      |                                                                    x   |
      |                                                                    x   |
      |                                                                    x   |
      |                                                                    x   |
      |                                                                   %x   |
      |                                                                  %%x   |
      |                                                                  %%x   |
      |                                                                  %%x   |
      |                                                                  %%x   |
      |                                                                  %%x   |
      |                                                                  %%x   |
      |                                                                  %%x   |
      |                                                                  %%x   |
      |                                                                  %%x   |
      |                                                                  %%x   |
      |                                                                  %#x   |
      |                                                                  %#x   |
      |                                                                  %#x   |
      |                                                                  %#x   |
      |                                                                  %#x   |
      |         +                                                        %#xx  |
      |         +                                                        %#xx  |
      |         +                                                       %%#xx  |
      |         +                                                       %%#xx  |
      |         +                                                       %%#xx  |
      |         +                                                       %%#xx  |
      |         +                                                       %%##x  |
      |         +++                                                     %%##x  |
      |         +++                                                     %%##x  |
      |         +++                                                     %%##x  |
      |        ++++                                                     %%##x  |
      |        ++++                                                     %%##x  |
      |        ++++                                                     %%##xx |
      |        ++++                                                     %###xx |
      |        ++++                                                     %###xx |
      |        ++++                                                     %###xx |
      |        ++++                                                     %###xx |
      |        ++++ +                                                   %#O#xx |
      |        ++++ +                                                   %#O#xx |
      |        ++++++ +                                                 %#O#xx |
      |       ++++++++++                                                %OOOxxx|
      |       ++++++++++       +                                       %#OOO#xx|
      |     + ++++++++++++ ++ +++++    +                        ++    @@OOOO#xx|
      |                                                                   |A_| |
      ||__________M_______A____________________|                               |
      |                                                                 |A_|   |
      +------------------------------------------------------------------------+
          N           Min           Max        Median           Avg        Stddev
      x 120       0.99456       1.00628      0.999985     1.0001545  0.0024387139
      + 120      0.873021       1.00037      0.884134    0.90148752   0.039190862
      Difference at 99.5% confidence
      	-0.098667 +/- 0.0110762
      	-9.86517% +/- 1.10745%
      	(Student's t, pooled s = 0.0277657)
      % 120      0.990207       1.00165     0.9970265    0.99699748     0.0021024
      Difference at 99.5% confidence
      	-0.003157 +/- 0.000908245
      	-0.315651% +/- 0.0908105%
      	(Student's t, pooled s = 0.00227678)
      
      Fixes: b7404c7e ("drm/i915: Bump ready tasks ahead of busywaits")
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
      Cc: Dmitry Ermilov <dmitry.ermilov@intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190515130052.4475-2-chris@chris-wilson.co.uk
      (cherry picked from commit 17db337f)
      Signed-off-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      a491cc8e
    • C
      drm/i915: Downgrade NEWCLIENT to non-preemptive · c80274bb
      Chris Wilson 提交于
      Commit 1413b2bc ("drm/i915: Trim NEWCLIENT boosting") had the
      intended consequence of not allowing a sequence of work that merely
      crossed into a new engine the privilege to be promoted to NEWCLIENT
      status. It also had the unintended consequence of actually making
      NEWCLIENT effective on heavily oversubscribed transcode machines and
      impacting upon their throughput.
      
      If we consider a client packet composed of (rcsA, rcsB, vcs) and 30 of
      those clients, using the NEWCLIENT boost that will be scheduled as
      
      	rcsA x 30, (rcsB, vcs) x 30
      
      where as before it would have been
      
      	(rcsA, rcsB, vcs) x 30
      
      That is with NEWCLIENT only boosting the first request of each client,
      we would execute all rcsA requests prior to running on the vcs engines;
      acruing a lot of dead time as compared to the previous case where the
      vcs engine would be started in parallel to processing the second client.
      
      The previous patch has the effect of delaying submission until it is
      required by a third party (either the user with an explicit wait, or by
      another client/engine). We reduce the NEWCLIENT bump to a mere WAIT,
      which has the effect of removing its preemptive grant and reducing it to
      the same level as any other user interaction -- that it will not be
      promoted above the interengine dependencies, and so preventing NEWCLIENTS
      from starving other engines. This a large nerf to the rrul properties of
      the current NEWCLIENT, but it still does give prioritised submission to
      new requests from light workloads.
      
      References: b16c7651 ("drm/i915: Priority boost for new clients")
      Fixes: 1413b2bc ("drm/i915: Trim NEWCLIENT boosting") # customer impact
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
      Cc: Dmitry Ermilov <dmitry.ermilov@intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190515130052.4475-4-chris@chris-wilson.co.uk
      (cherry picked from commit 68fc728b)
      Signed-off-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      c80274bb
    • C
      drm/i915: Bump signaler priority on adding a waiter · 9981927c
      Chris Wilson 提交于
      The handling of the no-preemption priority level imposes the restriction
      that we need to maintain the implied ordering even though preemption is
      disabled. Otherwise we may end up with an AB-BA deadlock across multiple
      engine due to a real preemption event reordering the no-preemption
      WAITs. To resolve this issue we currently promote all requests to WAIT
      on unsubmission, however this interferes with the timeslicing
      requirement that we do not apply any implicit promotion that will defeat
      the round-robin timeslice list. (If we automatically promote the active
      request it will go back to the head of the queue and not the tail!)
      
      So we need implicit promotion to prevent reordering around semaphores
      where we are not allowed to preempt, and we must avoid implicit
      promotion on unsubmission. So instead of at unsubmit, if we apply that
      implicit promotion on adding the dependency, we avoid the semaphore
      deadlock and we also reduce the gains made by the promotion for user
      space waiting. Furthermore, by keeping the earlier dependencies at a
      higher level, we reduce the search space for timeslicing without
      altering runtime scheduling too badly (no dependencies at all will be
      assigned a higher priority for rrul).
      
      v2: Limit the bump to external edges (as originally intended) i.e.
      between contexts and out to the user.
      
      Testcase: igt/gem_concurrent_blit
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190515130052.4475-3-chris@chris-wilson.co.uk
      (cherry picked from commit 6e7eb7a8)
      Signed-off-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      9981927c
  10. 17 5月, 2019 4 次提交
    • C
      drm/i915: Downgrade NEWCLIENT to non-preemptive · 68fc728b
      Chris Wilson 提交于
      Commit 1413b2bc ("drm/i915: Trim NEWCLIENT boosting") had the
      intended consequence of not allowing a sequence of work that merely
      crossed into a new engine the privilege to be promoted to NEWCLIENT
      status. It also had the unintended consequence of actually making
      NEWCLIENT effective on heavily oversubscribed transcode machines and
      impacting upon their throughput.
      
      If we consider a client packet composed of (rcsA, rcsB, vcs) and 30 of
      those clients, using the NEWCLIENT boost that will be scheduled as
      
      	rcsA x 30, (rcsB, vcs) x 30
      
      where as before it would have been
      
      	(rcsA, rcsB, vcs) x 30
      
      That is with NEWCLIENT only boosting the first request of each client,
      we would execute all rcsA requests prior to running on the vcs engines;
      acruing a lot of dead time as compared to the previous case where the
      vcs engine would be started in parallel to processing the second client.
      
      The previous patch has the effect of delaying submission until it is
      required by a third party (either the user with an explicit wait, or by
      another client/engine). We reduce the NEWCLIENT bump to a mere WAIT,
      which has the effect of removing its preemptive grant and reducing it to
      the same level as any other user interaction -- that it will not be
      promoted above the interengine dependencies, and so preventing NEWCLIENTS
      from starving other engines. This a large nerf to the rrul properties of
      the current NEWCLIENT, but it still does give prioritised submission to
      new requests from light workloads.
      
      References: b16c7651 ("drm/i915: Priority boost for new clients")
      Fixes: 1413b2bc ("drm/i915: Trim NEWCLIENT boosting") # customer impact
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
      Cc: Dmitry Ermilov <dmitry.ermilov@intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190515130052.4475-4-chris@chris-wilson.co.uk
      68fc728b
    • C
      drm/i915: Bump signaler priority on adding a waiter · 6e7eb7a8
      Chris Wilson 提交于
      The handling of the no-preemption priority level imposes the restriction
      that we need to maintain the implied ordering even though preemption is
      disabled. Otherwise we may end up with an AB-BA deadlock across multiple
      engine due to a real preemption event reordering the no-preemption
      WAITs. To resolve this issue we currently promote all requests to WAIT
      on unsubmission, however this interferes with the timeslicing
      requirement that we do not apply any implicit promotion that will defeat
      the round-robin timeslice list. (If we automatically promote the active
      request it will go back to the head of the queue and not the tail!)
      
      So we need implicit promotion to prevent reordering around semaphores
      where we are not allowed to preempt, and we must avoid implicit
      promotion on unsubmission. So instead of at unsubmit, if we apply that
      implicit promotion on adding the dependency, we avoid the semaphore
      deadlock and we also reduce the gains made by the promotion for user
      space waiting. Furthermore, by keeping the earlier dependencies at a
      higher level, we reduce the search space for timeslicing without
      altering runtime scheduling too badly (no dependencies at all will be
      assigned a higher priority for rrul).
      
      v2: Limit the bump to external edges (as originally intended) i.e.
      between contexts and out to the user.
      
      Testcase: igt/gem_concurrent_blit
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190515130052.4475-3-chris@chris-wilson.co.uk
      6e7eb7a8
    • C
      drm/i915: Truly bump ready tasks ahead of busywaits · 17db337f
      Chris Wilson 提交于
      In commit b7404c7e ("drm/i915: Bump ready tasks ahead of
      busywaits"), I tried cutting a corner in order to not install a signal
      for each of our dependencies, and only listened to requests on which we
      were intending to busywait. The compromise that was made was that
      instead of then being able to promote the request with a full
      NOSEMAPHORE like its non-busywaiting brethren, as we had not ensured we
      had cleared the semaphore chain, we settled for only using the NEWCLIENT
      boost. With an over saturated system with multiple NEWCLIENTS in flight
      at any time, this was found to be an inadequate promotion and left us
      with a much poorer scheduling order than prior to using semaphores.
      
      The outcome of this patch, is that all requests have NOSEMAPHORE
      priority when they have no dependencies and are ready to run and not
      busywait, restoring the pre-semaphore ordering on saturated systems.
      
      We can demonstrate the effect of poor scheduling order by oversaturating
      the system using gem_wsim on a system with multiple vcs engines
      (i.e running the same workloads across more clients than required for
      peak throughput, e.g. media_load_balance_17i7.wsim -c4 -b context):
      
      x v5.1 (normalized)
      + tip
      * fix
      +------------------------------------------------------------------------+
      |                                                                    x   |
      |                                                                    x   |
      |                                                                    x   |
      |                                                                    x   |
      |                                                                   %x   |
      |                                                                  %%x   |
      |                                                                  %%x   |
      |                                                                  %%x   |
      |                                                                  %%x   |
      |                                                                  %%x   |
      |                                                                  %%x   |
      |                                                                  %%x   |
      |                                                                  %%x   |
      |                                                                  %%x   |
      |                                                                  %%x   |
      |                                                                  %#x   |
      |                                                                  %#x   |
      |                                                                  %#x   |
      |                                                                  %#x   |
      |                                                                  %#x   |
      |         +                                                        %#xx  |
      |         +                                                        %#xx  |
      |         +                                                       %%#xx  |
      |         +                                                       %%#xx  |
      |         +                                                       %%#xx  |
      |         +                                                       %%#xx  |
      |         +                                                       %%##x  |
      |         +++                                                     %%##x  |
      |         +++                                                     %%##x  |
      |         +++                                                     %%##x  |
      |        ++++                                                     %%##x  |
      |        ++++                                                     %%##x  |
      |        ++++                                                     %%##xx |
      |        ++++                                                     %###xx |
      |        ++++                                                     %###xx |
      |        ++++                                                     %###xx |
      |        ++++                                                     %###xx |
      |        ++++ +                                                   %#O#xx |
      |        ++++ +                                                   %#O#xx |
      |        ++++++ +                                                 %#O#xx |
      |       ++++++++++                                                %OOOxxx|
      |       ++++++++++       +                                       %#OOO#xx|
      |     + ++++++++++++ ++ +++++    +                        ++    @@OOOO#xx|
      |                                                                   |A_| |
      ||__________M_______A____________________|                               |
      |                                                                 |A_|   |
      +------------------------------------------------------------------------+
          N           Min           Max        Median           Avg        Stddev
      x 120       0.99456       1.00628      0.999985     1.0001545  0.0024387139
      + 120      0.873021       1.00037      0.884134    0.90148752   0.039190862
      Difference at 99.5% confidence
      	-0.098667 +/- 0.0110762
      	-9.86517% +/- 1.10745%
      	(Student's t, pooled s = 0.0277657)
      % 120      0.990207       1.00165     0.9970265    0.99699748     0.0021024
      Difference at 99.5% confidence
      	-0.003157 +/- 0.000908245
      	-0.315651% +/- 0.0908105%
      	(Student's t, pooled s = 0.00227678)
      
      Fixes: b7404c7e ("drm/i915: Bump ready tasks ahead of busywaits")
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
      Cc: Dmitry Ermilov <dmitry.ermilov@intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190515130052.4475-2-chris@chris-wilson.co.uk
      17db337f
    • C
      drm/i915: Mark semaphores as complete on unsubmit out if payload was started · dba5a7f3
      Chris Wilson 提交于
      Avoid charging us for the presumed busywait if the request was preempted
      after successfully using semaphores to reduce inter-engine latency.
      
      v2: Bump the priority to reflect the lack of semaphores now required.
      
      References: ca6e56f6 ("drm/i915: Disable semaphore busywaits on saturated systems")
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190515130052.4475-1-chris@chris-wilson.co.uk
      dba5a7f3
  11. 13 5月, 2019 1 次提交
    • C
      drm/i915: Seal races between async GPU cancellation, retirement and signaling · c36beba6
      Chris Wilson 提交于
      Currently there is an underlying assumption that i915_request_unsubmit()
      is synchronous wrt the GPU -- that is the request is no longer in flight
      as we remove it. In the near future that may change, and this may upset
      our signaling as we can process an interrupt for that request while it
      is no longer in flight.
      
      CPU0					CPU1
      intel_engine_breadcrumbs_irq
      (queue request completion)
      					i915_request_cancel_signaling
      ...					...
      					i915_request_enable_signaling
      dma_fence_signal
      
      Hence in the time it took us to drop the lock to signal the request, a
      preemption event may have occurred and re-queued the request. In the
      process, that request would have seen I915_FENCE_FLAG_SIGNAL clear and
      so reused the rq->signal_link that was in use on CPU0, leading to bad
      pointer chasing in intel_engine_breadcrumbs_irq.
      
      A related issue was that if someone started listening for a signal on a
      completed but no longer in-flight request, we missed the opportunity to
      immediately signal that request.
      
      Furthermore, as intel_contexts may be immediately released during
      request retirement, in order to be entirely sure that
      intel_engine_breadcrumbs_irq may no longer dereference the intel_context
      (ce->signals and ce->signal_link), we must wait for irq spinlock.
      
      In order to prevent the race, we use a bit in the fence.flags to signal
      the transfer onto the signal list inside intel_engine_breadcrumbs_irq.
      For simplicity, we use the DMA_FENCE_FLAG_SIGNALED_BIT as it then
      quickly signals to any outside observer that the fence is indeed signaled.
      
      v2: Sketch out potential dma-fence API for manual signaling
      v3: And the test_and_set_bit()
      
      Fixes: 52c0fdb2 ("drm/i915: Replace global breadcrumbs with per-context interrupt tracking")
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190508112452.18942-1-chris@chris-wilson.co.uk
      (cherry picked from commit 0152b3b3)
      Signed-off-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      c36beba6
  12. 08 5月, 2019 2 次提交
    • C
      drm/i915: Seal races between async GPU cancellation, retirement and signaling · 0152b3b3
      Chris Wilson 提交于
      Currently there is an underlying assumption that i915_request_unsubmit()
      is synchronous wrt the GPU -- that is the request is no longer in flight
      as we remove it. In the near future that may change, and this may upset
      our signaling as we can process an interrupt for that request while it
      is no longer in flight.
      
      CPU0					CPU1
      intel_engine_breadcrumbs_irq
      (queue request completion)
      					i915_request_cancel_signaling
      ...					...
      					i915_request_enable_signaling
      dma_fence_signal
      
      Hence in the time it took us to drop the lock to signal the request, a
      preemption event may have occurred and re-queued the request. In the
      process, that request would have seen I915_FENCE_FLAG_SIGNAL clear and
      so reused the rq->signal_link that was in use on CPU0, leading to bad
      pointer chasing in intel_engine_breadcrumbs_irq.
      
      A related issue was that if someone started listening for a signal on a
      completed but no longer in-flight request, we missed the opportunity to
      immediately signal that request.
      
      Furthermore, as intel_contexts may be immediately released during
      request retirement, in order to be entirely sure that
      intel_engine_breadcrumbs_irq may no longer dereference the intel_context
      (ce->signals and ce->signal_link), we must wait for irq spinlock.
      
      In order to prevent the race, we use a bit in the fence.flags to signal
      the transfer onto the signal list inside intel_engine_breadcrumbs_irq.
      For simplicity, we use the DMA_FENCE_FLAG_SIGNALED_BIT as it then
      quickly signals to any outside observer that the fence is indeed signaled.
      
      v2: Sketch out potential dma-fence API for manual signaling
      v3: And the test_and_set_bit()
      
      Fixes: 52c0fdb2 ("drm/i915: Replace global breadcrumbs with per-context interrupt tracking")
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190508112452.18942-1-chris@chris-wilson.co.uk
      0152b3b3
    • C
      drm/i915: Only reschedule the submission tasklet if preemption is possible · 25d851ad
      Chris Wilson 提交于
      If we couple the scheduler more tightly with the execlists policy, we
      can apply the preemption policy to the question of whether we need to
      kick the tasklet at all for this priority bump.
      
      v2: Rephrase it as a core i915 policy and not an execlists foible.
      v3: Pull the kick together.
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190507122544.12698-1-chris@chris-wilson.co.uk
      25d851ad
  13. 07 5月, 2019 3 次提交
    • C
      drm/i915: Acquire the signaler's timeline HWSP last · c8a0e2ae
      Chris Wilson 提交于
      Acquiring the signaler's timeline takes an active reference to their
      HWSP that we would like to avoid if possible, so take it after
      performing all of our allocations required to set up the fencing. The
      acquisition also provides the final check that the target has not
      already signaled allowing us to avoid the semaphore at the last moment.
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190503140239.32668-1-chris@chris-wilson.co.uk
      c8a0e2ae
    • C
      drm/i915: Disable semaphore busywaits on saturated systems · 2564fe70
      Chris Wilson 提交于
      Asking the GPU to busywait on a memory address, perhaps not unexpectedly
      in hindsight for a shared system, leads to bus contention that affects
      CPU programs trying to concurrently access memory. This can manifest as
      a drop in transcode throughput on highly over-saturated workloads.
      
      The only clue offered by perf, is that the bus-cycles (perf stat -e
      bus-cycles) jumped by 50% when enabling semaphores. This corresponds
      with extra CPU active cycles being attributed to intel_idle's mwait.
      
      This patch introduces a heuristic to try and detect when more than one
      client is submitting to the GPU pushing it into an oversaturated state.
      As we already keep track of when the semaphores are signaled, we can
      inspect their state on submitting the busywait batch and if we planned
      to use a semaphore but were too late, conclude that the GPU is
      overloaded and not try to use semaphores in future requests. In
      practice, this means we optimistically try to use semaphores for the
      first frame of a transcode job split over multiple engines, and fail if
      there are multiple clients active and continue not to use semaphores for
      the subsequent frames in the sequence. Periodically, we try to
      optimistically switch semaphores back on whenever the client waits to
      catch up with the transcode results.
      
      With 1 client, on Broxton J3455, with the relative fps normalized by %cpu:
      
      x no semaphores
      + drm-tip
      * patched
      +------------------------------------------------------------------------+
      |                                                    *                   |
      |                                                    *+                  |
      |                                                    **+                 |
      |                                                    **+  x              |
      |                                x               *  +**+  x              |
      |                                x  x       *    *  +***x xx             |
      |                                x  x       *    * *+***x *x             |
      |                                x  x*   +  *    * *****x *x x           |
      |                         +    x xx+x*   + ***   * ********* x   *       |
      |                         +    x xx+x*   * *** +** ********* xx  *       |
      |    *   +         ++++*  +    x*x****+*+* ***+*************+x*  *       |
      |*+ +** *+ + +* + *++****** *xxx**********x***+*****************+*++    *|
      |                                   |__________A_____M_____|             |
      |                           |_______________A____M_________|             |
      |                                 |____________A___M________|            |
      +------------------------------------------------------------------------+
          N           Min           Max        Median           Avg        Stddev
      x 120       2.60475       3.50941       3.31123     3.2143953    0.21117399
      + 120        2.3826       3.57077       3.25101     3.1414161    0.28146407
      Difference at 95.0% confidence
      	-0.0729792 +/- 0.0629585
      	-2.27039% +/- 1.95864%
      	(Student's t, pooled s = 0.248814)
      * 120       2.35536       3.66713        3.2849     3.2059917    0.24618565
      No difference proven at 95.0% confidence
      
      With 10 clients over-saturating the pipeline:
      
      x no semaphores
      + drm-tip
      * patched
      +------------------------------------------------------------------------+
      |                     ++                                        **       |
      |                     ++                                        **       |
      |                     ++                                        **       |
      |                     ++                                        **       |
      |                     ++                                    xx ***       |
      |                     ++                                    xx ***       |
      |                     ++                                    xxx***       |
      |                     ++                                    xxx***       |
      |                    +++                                    xxx***       |
      |                    +++                                    xx****       |
      |                    +++                                    xx****       |
      |                    +++                                    xx****       |
      |                    +++                                    xx****       |
      |                    ++++                                   xx****       |
      |                   +++++                                   xx****       |
      |                   +++++                                 x x******      |
      |                  ++++++                                 xxx*******     |
      |                  ++++++                                 xxx*******     |
      |                  ++++++                                 xxx*******     |
      |                  ++++++                                 xx********     |
      |                  ++++++                               xxxx********     |
      |                  ++++++                               xxxx********     |
      |                ++++++++                             xxxxx*********     |
      |+ +  +        + ++++++++                           xxx*xx**********x*  *|
      |                                                         |__A__|        |
      |                 |__AM__|                                               |
      |                                                            |__A_|      |
      +------------------------------------------------------------------------+
          N           Min           Max        Median           Avg        Stddev
      x 120       2.47855        2.8972       2.72376     2.7193402   0.074604933
      + 120       1.17367       1.77459       1.71977     1.6966782   0.085850697
      Difference at 95.0% confidence
      	-1.02266 +/- 0.0203502
      	-37.607% +/- 0.748352%
      	(Student's t, pooled s = 0.0804246)
      * 120       2.57868       3.00821       2.80142     2.7923878   0.058646477
      Difference at 95.0% confidence
      	0.0730476 +/- 0.0169791
      	2.68622% +/- 0.624383%
      	(Student's t, pooled s = 0.0671018)
      
      Indicating that we've recovered the regression from enabling semaphores
      on this saturated setup, with a hint towards an overall improvement.
      
      Very similar, but of smaller magnitude, results are observed on both
      Skylake(gt2) and Kabylake(gt4). This may be due to the reduced impact of
      bus-cycles, where we see a 50% hit on Broxton, it is only 10% on the big
      core, in this particular test.
      
      One observation to make here is that for a greedy client trying to
      maximise its own throughput, using semaphores is the right choice. It is
      only the holistic system-wide view that semaphores of one client
      impacts another and reduces the overall throughput where we would choose
      to disable semaphores.
      
      The most noticeable negactive impact this has is on the no-op
      microbenchmarks, which are also very notable for having no cpu bus load.
      In particular, this increases the runtime and energy consumption of
      gem_exec_whisper.
      
      Fixes: e8861964 ("drm/i915: Use HW semaphores for inter-engine synchronisation on gen8+")
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
      Cc: Dmitry Ermilov <dmitry.ermilov@intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190504070707.30902-1-chris@chris-wilson.co.uk
      (cherry picked from commit ca6e56f6)
      Signed-off-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      2564fe70
    • C
      drm/i915: Delay semaphore submission until the start of the signaler · e766fde6
      Chris Wilson 提交于
      Currently we submit the semaphore busywait as soon as the signaler is
      submitted to HW. However, we may submit the signaler as the tail of a
      batch of requests, and even not as the first context in the HW list,
      i.e. the busywait may start spinning far in advance of the signaler even
      starting.
      
      If we wait until the request before the signaler is completed before
      submitting the busywait, we prevent the busywait from starting too
      early, if the signaler is not first in submission port.
      
      To handle the case where the signaler is at the start of the second (or
      later) submission port, we will need to delay the execution callback
      until we know the context is promoted to port0. A challenge for later.
      
      Fixes: e8861964 ("drm/i915: Use HW semaphores for inter-engine synchronisation on gen8+")
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190501114541.10077-9-chris@chris-wilson.co.uk
      (cherry picked from commit 0d90ccb7)
      [Joonas: edited Fixes: tag into single line.]
      Signed-off-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      e766fde6
  14. 04 5月, 2019 1 次提交
    • C
      drm/i915: Disable semaphore busywaits on saturated systems · ca6e56f6
      Chris Wilson 提交于
      Asking the GPU to busywait on a memory address, perhaps not unexpectedly
      in hindsight for a shared system, leads to bus contention that affects
      CPU programs trying to concurrently access memory. This can manifest as
      a drop in transcode throughput on highly over-saturated workloads.
      
      The only clue offered by perf, is that the bus-cycles (perf stat -e
      bus-cycles) jumped by 50% when enabling semaphores. This corresponds
      with extra CPU active cycles being attributed to intel_idle's mwait.
      
      This patch introduces a heuristic to try and detect when more than one
      client is submitting to the GPU pushing it into an oversaturated state.
      As we already keep track of when the semaphores are signaled, we can
      inspect their state on submitting the busywait batch and if we planned
      to use a semaphore but were too late, conclude that the GPU is
      overloaded and not try to use semaphores in future requests. In
      practice, this means we optimistically try to use semaphores for the
      first frame of a transcode job split over multiple engines, and fail if
      there are multiple clients active and continue not to use semaphores for
      the subsequent frames in the sequence. Periodically, we try to
      optimistically switch semaphores back on whenever the client waits to
      catch up with the transcode results.
      
      With 1 client, on Broxton J3455, with the relative fps normalized by %cpu:
      
      x no semaphores
      + drm-tip
      * patched
      +------------------------------------------------------------------------+
      |                                                    *                   |
      |                                                    *+                  |
      |                                                    **+                 |
      |                                                    **+  x              |
      |                                x               *  +**+  x              |
      |                                x  x       *    *  +***x xx             |
      |                                x  x       *    * *+***x *x             |
      |                                x  x*   +  *    * *****x *x x           |
      |                         +    x xx+x*   + ***   * ********* x   *       |
      |                         +    x xx+x*   * *** +** ********* xx  *       |
      |    *   +         ++++*  +    x*x****+*+* ***+*************+x*  *       |
      |*+ +** *+ + +* + *++****** *xxx**********x***+*****************+*++    *|
      |                                   |__________A_____M_____|             |
      |                           |_______________A____M_________|             |
      |                                 |____________A___M________|            |
      +------------------------------------------------------------------------+
          N           Min           Max        Median           Avg        Stddev
      x 120       2.60475       3.50941       3.31123     3.2143953    0.21117399
      + 120        2.3826       3.57077       3.25101     3.1414161    0.28146407
      Difference at 95.0% confidence
      	-0.0729792 +/- 0.0629585
      	-2.27039% +/- 1.95864%
      	(Student's t, pooled s = 0.248814)
      * 120       2.35536       3.66713        3.2849     3.2059917    0.24618565
      No difference proven at 95.0% confidence
      
      With 10 clients over-saturating the pipeline:
      
      x no semaphores
      + drm-tip
      * patched
      +------------------------------------------------------------------------+
      |                     ++                                        **       |
      |                     ++                                        **       |
      |                     ++                                        **       |
      |                     ++                                        **       |
      |                     ++                                    xx ***       |
      |                     ++                                    xx ***       |
      |                     ++                                    xxx***       |
      |                     ++                                    xxx***       |
      |                    +++                                    xxx***       |
      |                    +++                                    xx****       |
      |                    +++                                    xx****       |
      |                    +++                                    xx****       |
      |                    +++                                    xx****       |
      |                    ++++                                   xx****       |
      |                   +++++                                   xx****       |
      |                   +++++                                 x x******      |
      |                  ++++++                                 xxx*******     |
      |                  ++++++                                 xxx*******     |
      |                  ++++++                                 xxx*******     |
      |                  ++++++                                 xx********     |
      |                  ++++++                               xxxx********     |
      |                  ++++++                               xxxx********     |
      |                ++++++++                             xxxxx*********     |
      |+ +  +        + ++++++++                           xxx*xx**********x*  *|
      |                                                         |__A__|        |
      |                 |__AM__|                                               |
      |                                                            |__A_|      |
      +------------------------------------------------------------------------+
          N           Min           Max        Median           Avg        Stddev
      x 120       2.47855        2.8972       2.72376     2.7193402   0.074604933
      + 120       1.17367       1.77459       1.71977     1.6966782   0.085850697
      Difference at 95.0% confidence
      	-1.02266 +/- 0.0203502
      	-37.607% +/- 0.748352%
      	(Student's t, pooled s = 0.0804246)
      * 120       2.57868       3.00821       2.80142     2.7923878   0.058646477
      Difference at 95.0% confidence
      	0.0730476 +/- 0.0169791
      	2.68622% +/- 0.624383%
      	(Student's t, pooled s = 0.0671018)
      
      Indicating that we've recovered the regression from enabling semaphores
      on this saturated setup, with a hint towards an overall improvement.
      
      Very similar, but of smaller magnitude, results are observed on both
      Skylake(gt2) and Kabylake(gt4). This may be due to the reduced impact of
      bus-cycles, where we see a 50% hit on Broxton, it is only 10% on the big
      core, in this particular test.
      
      One observation to make here is that for a greedy client trying to
      maximise its own throughput, using semaphores is the right choice. It is
      only the holistic system-wide view that semaphores of one client
      impacts another and reduces the overall throughput where we would choose
      to disable semaphores.
      
      The most noticeable negactive impact this has is on the no-op
      microbenchmarks, which are also very notable for having no cpu bus load.
      In particular, this increases the runtime and energy consumption of
      gem_exec_whisper.
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
      Cc: Dmitry Ermilov <dmitry.ermilov@intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190504070707.30902-1-chris@chris-wilson.co.uk
      ca6e56f6
  15. 03 5月, 2019 1 次提交
  16. 27 4月, 2019 3 次提交
  17. 25 4月, 2019 5 次提交
  18. 20 4月, 2019 1 次提交
    • C
      drm/i915: Expose the busyspin durations for i915_wait_request · 7ce99d24
      Chris Wilson 提交于
      An interesting discussion regarding "hybrid interrupt polling" for NVMe
      came to the conclusion that the ideal busyspin before sleeping was half
      of the expected request latency (and better if it was already halfway
      through that request). This suggested that we too should look again at
      our tradeoff between spinning and waiting. Currently, our spin simply
      tries to hide the cost of enabling the interrupt, which is good to avoid
      penalising nop requests (i.e. test throughput) and not much else.
      Studying real world workloads suggests that a spin of upto 500us can
      dramatically boost performance, but the suggestion is that this is not
      from avoiding interrupt latency per-se, but from secondary effects of
      sleeping such as allowing the CPU reduce cstate and context switch away.
      
      In a truly hybrid interrupt polling scheme, we would aim to sleep until
      just before the request completed and then wake up in advance of the
      interrupt and do a quick poll to handle completion. This is tricky for
      ourselves at the moment as we are not recording request times, and since
      we allow preemption, our requests are not on as a nicely ordered
      timeline as IO. However, the idea is interesting, for it will certainly
      help us decide when busyspinning is worthwhile.
      
      v2: Expose the spin setting via Kconfig options for easier adjustment
      and testing.
      v3: Don't get caught sneaking in a change to the busyspin parameters.
      v4: Explain more about the "hybrid interrupt polling" scheme that we
      want to migrate towards.
      Suggested-by: NSagar Kamble <sagar.a.kamble@intel.com>
      References: http://events.linuxfoundation.org/sites/events/files/slides/lemoal-nvme-polling-vault-2017-final_0.pdfSigned-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Sagar Kamble <sagar.a.kamble@intel.com>
      Cc: Eero Tamminen <eero.t.tamminen@intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Ben Widawsky <ben@bwidawsk.net>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Michał Winiarski <michal.winiarski@intel.com>
      Reviewed-by: NSagar Kamble <sagar.a.kamble@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190419182625.11186-1-chris@chris-wilson.co.uk
      7ce99d24
  19. 12 4月, 2019 1 次提交
  20. 11 4月, 2019 1 次提交
    • C
      drm/i915: Bump ready tasks ahead of busywaits · b7404c7e
      Chris Wilson 提交于
      Consider two tasks that are running in parallel on a pair of engines
      (vcs0, vcs1), but then must complete on a shared engine (rcs0). To
      maximise throughput, we want to run the first ready task on rcs0 (i.e.
      the first task that completes on either of vcs0 or vcs1). When using
      semaphores, however, we will instead queue onto rcs in submission order.
      
      To resolve this incorrect ordering, we want to re-evaluate the priority
      queue when each of the request is ready. Normally this happens because
      we only insert into the priority queue requests that are ready, but with
      semaphores we are inserting ahead of their readiness and to compensate
      we penalize those tasks with reduced priority (so that tasks that do not
      need to busywait should naturally be run first). However, given a series
      of tasks that each use semaphores, the queue degrades into submission
      fifo rather than readiness fifo, and so to counter this we give a small
      boost to semaphore users as their dependent tasks are completed (and so
      we no longer require any busywait prior to running the user task as they
      are then ready themselves).
      
      v2: Fixup irqsave for schedule_lock (Tvrtko)
      
      Testcase: igt/gem_exec_schedule/semaphore-codependency
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
      Cc: Dmitry Ermilov <dmitry.ermilov@intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190409152922.23894-1-chris@chris-wilson.co.uk
      b7404c7e
  21. 09 4月, 2019 1 次提交
  22. 08 4月, 2019 1 次提交