1. 30 1月, 2019 2 次提交
    • C
      drm/i915: Replace global breadcrumbs with per-context interrupt tracking · 52c0fdb2
      Chris Wilson 提交于
      A few years ago, see commit 688e6c72 ("drm/i915: Slaughter the
      thundering i915_wait_request herd"), the issue of handling multiple
      clients waiting in parallel was brought to our attention. The
      requirement was that every client should be woken immediately upon its
      request being signaled, without incurring any cpu overhead.
      
      To handle certain fragility of our hw meant that we could not do a
      simple check inside the irq handler (some generations required almost
      unbounded delays before we could be sure of seqno coherency) and so
      request completion checking required delegation.
      
      Before commit 688e6c72, the solution was simple. Every client
      waiting on a request would be woken on every interrupt and each would do
      a heavyweight check to see if their request was complete. Commit
      688e6c72 introduced an rbtree so that only the earliest waiter on
      the global timeline would woken, and would wake the next and so on.
      (Along with various complications to handle requests being reordered
      along the global timeline, and also a requirement for kthread to provide
      a delegate for fence signaling that had no process context.)
      
      The global rbtree depends on knowing the execution timeline (and global
      seqno). Without knowing that order, we must instead check all contexts
      queued to the HW to see which may have advanced. We trim that list by
      only checking queued contexts that are being waited on, but still we
      keep a list of all active contexts and their active signalers that we
      inspect from inside the irq handler. By moving the waiters onto the fence
      signal list, we can combine the client wakeup with the dma_fence
      signaling (a dramatic reduction in complexity, but does require the HW
      being coherent, the seqno must be visible from the cpu before the
      interrupt is raised - we keep a timer backup just in case).
      
      Having previously fixed all the issues with irq-seqno serialisation (by
      inserting delays onto the GPU after each request instead of random delays
      on the CPU after each interrupt), we can rely on the seqno state to
      perfom direct wakeups from the interrupt handler. This allows us to
      preserve our single context switch behaviour of the current routine,
      with the only downside that we lose the RT priority sorting of wakeups.
      In general, direct wakeup latency of multiple clients is about the same
      (about 10% better in most cases) with a reduction in total CPU time spent
      in the waiter (about 20-50% depending on gen). Average herd behaviour is
      improved, but at the cost of not delegating wakeups on task_prio.
      
      v2: Capture fence signaling state for error state and add comments to
      warm even the most cold of hearts.
      v3: Check if the request is still active before busywaiting
      v4: Reduce the amount of pointer misdirection with list_for_each_safe
      and using a local i915_request variable inside the loops
      v5: Add a missing pluralisation to a purely informative selftest message.
      
      References: 688e6c72 ("drm/i915: Slaughter the thundering i915_wait_request herd")
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190129205230.19056-2-chris@chris-wilson.co.uk
      52c0fdb2
    • C
      drm/i915: Identify active requests · 85474441
      Chris Wilson 提交于
      To allow requests to forgo a common execution timeline, one question we
      need to be able to answer is "is this request running?". To track
      whether a request has started on HW, we can emit a breadcrumb at the
      beginning of the request and check its timeline's HWSP to see if the
      breadcrumb has advanced past the start of this request. (This is in
      contrast to the global timeline where we need only ask if we are on the
      global timeline and if the timeline has advanced past the end of the
      previous request.)
      
      There is still confusion from a preempted request, which has already
      started but relinquished the HW to a high priority request. For the
      common case, this discrepancy should be negligible. However, for
      identification of hung requests, knowing which one was running at the
      time of the hang will be much more important.
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190129185452.20989-2-chris@chris-wilson.co.uk
      85474441
  2. 29 1月, 2019 2 次提交
  3. 22 1月, 2019 1 次提交
  4. 27 12月, 2018 1 次提交
  5. 02 10月, 2018 2 次提交
  6. 14 9月, 2018 1 次提交
    • C
      drm/i915: Limit the backpressure for i915_request allocation · 11abf0c5
      Chris Wilson 提交于
      If we try and fail to allocate a i915_request, we apply some
      backpressure on the clients to throttle the memory allocations coming
      from i915.ko. Currently, we wait until completely idle, but this is far
      too heavy and leads to some situations where the only escape is to
      declare a client hung and reset the GPU. The intent is to only ratelimit
      the allocation requests and to allow ourselves to recycle requests and
      memory from any long queues built up by a client hog.
      
      Although the system memory is inherently a global resources, we don't
      want to overly penalize an unlucky client to pay the price of reaping a
      hog. To reduce the influence of one client on another, we can instead of
      waiting for the entire GPU to idle, impose a barrier on the local client.
      (One end goal for request allocation is for scalability to many
      concurrent allocators; simultaneous execbufs.)
      
      To prevent ourselves from getting caught out by long running requests
      (requests that may never finish without userspace intervention, whom we
      are blocking) we need to impose a finite timeout, ideally shorter than
      hangcheck. A long time ago Paul McKenney suggested that RCU users should
      ratelimit themselves using judicious use of cond_synchronize_rcu(). This
      gives us the opportunity to reduce our indefinite wait for the GPU to
      idle to a wait for the RCU grace period of the previous allocation along
      this timeline to expire, satisfying both the local and finite properties
      we desire for our ratelimiting.
      
      There are still a few global steps (reclaim not least amongst those!)
      when we exhaust the immediate slab pool, at least now the wait is itself
      decoupled from struct_mutex for our glorious highly parallel future!
      
      Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20180914080017.30308-1-chris@chris-wilson.co.uk
      11abf0c5
  7. 07 8月, 2018 1 次提交
  8. 07 7月, 2018 2 次提交
  9. 14 6月, 2018 1 次提交
  10. 11 6月, 2018 1 次提交
    • C
      drm/i915/ringbuffer: Fix context restore upon reset · b3ee09a4
      Chris Wilson 提交于
      The discovery with trying to enable full-ppgtt was that we were
      completely failing to the load both the mm and context following the
      reset. Although we were performing mmio to set the PP_DIR (per-process
      GTT) and CCID (context), these were taking no effect (the assumption was
      that this would trigger reload of the context and restore the page
      tables). It was not until we performed the LRI + MI_SET_CONTEXT in a
      following context switch would anything occur.
      
      Since we are then required to reset the context image and PP_DIR using
      CS commands, we place those commands into every batch. The hardware
      should recognise the no-ops and eliminate the expensive context loads,
      but we still have to pay the cost of using cross-powerwell register
      writes. In practice, this has no effect on actual context switch times,
      and only adds a few hundred nanoseconds to no-op switches. We can improve
      the latter by eliminating the w/a around known no-op switches, but there
      is an ulterior motive to keeping them.
      
      Always emitting the context switch at the beginning of the request (and
      relying on HW to skip unneeded switches) does have one key advantage.
      Should we implement request reordering on Haswell, we will not know in
      advance what the previous executing context was on the GPU and so we
      would not be able to elide the MI_SET_CONTEXT commands ourselves and
      always have to emit them. Having our hand forced now actually prepares
      us for later.
      
      Now since that context and mm follow the request, we no longer (and not
      for a long time since requests took over!) require a trace point to tell
      when we write the switch into the ring, since it is always. (This is
      even more important when you remember that simply writing into the ring
      bears no relation to the current mm.)
      
      v2: Sandybridge has to agree to use LRI as well.
      
      Testcase: igt/drv_selftests/live_hangcheck
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
      Cc: Matthew Auld <matthew.william.auld@gmail.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20180611110845.31890-1-chris@chris-wilson.co.uk
      b3ee09a4
  11. 01 6月, 2018 1 次提交
  12. 18 5月, 2018 2 次提交
  13. 03 5月, 2018 1 次提交
  14. 19 4月, 2018 3 次提交
  15. 06 3月, 2018 1 次提交
    • C
      drm/i915/breadcrumbs: Reduce signaler rbtree to a sorted list · cd46c545
      Chris Wilson 提交于
      The goal here is to try and reduce the latency of signaling additional
      requests following the wakeup from interrupt by reducing the list of
      to-be-signaled requests from an rbtree to a sorted linked list. The
      original choice of using an rbtree was to facilitate random insertions
      of request into the signaler while maintaining a sorted list. However,
      if we assume that most new requests are added when they are submitted,
      we see those new requests in execution order making a insertion sort
      fast, and the reduction in overhead of each signaler iteration
      significant.
      
      Since commit 56299fb7 ("drm/i915: Signal first fence from irq handler
      if complete"), we signal most fences directly from notify_ring() in the
      interrupt handler greatly reducing the amount of work that actually
      needs to be done by the signaler kthread. All the thread is then
      required to do is operate as the bottom-half, cleaning up after the
      interrupt handler and preparing the next waiter. This includes signaling
      all later completed fences in a saturated system, but on a mostly idle
      system we only have to rebuild the wait rbtree in time for the next
      interrupt. With this de-emphasis of the signaler's role, we want to
      rejig it's datastructures to reduce the amount of work we require to
      both setup the signal tree and maintain it on every interrupt.
      
      References: 56299fb7 ("drm/i915: Signal first fence from irq handler if complete")
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20180222092545.17216-1-chris@chris-wilson.co.uk
      cd46c545
  16. 22 2月, 2018 1 次提交
  17. 19 1月, 2018 1 次提交
    • C
      drm/i915: Avoid waitboosting on the active request · e9af4ea2
      Chris Wilson 提交于
      Watching a light workload on Baytrail (running glxgears and a 1080p
      decode), instead of the system remaining at low frequency, the glxgears
      would regularly trigger waitboosting after which it would have to spend
      a few seconds throttling back down. In this case, the waitboosting is
      counter productive as the minimal wait for glxgears doesn't prevent it
      from functioning correctly and delivering frames on time. In this case,
      glxgears happens to almost always be waiting on the current request,
      which we already expect to complete quickly (see i915_spin_request) and
      so avoiding the waitboost on the active request and spinning instead
      provides the best latency without overcommitting to upclocking.
      However, if the system falls behind we still force the waitboost.
      Similarly, we will also trigger upclocking if we detect the system is
      not delivering frames on time - again using a mechanism that tries to
      detect a miss and not preemptively upclock.
      
      v2: Also skip boosting for after missed vblank if the desired request is
      already active.
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Radoslaw Szwichtenberg <radoslaw.szwichtenberg@intel.com>
      Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20180118131609.16574-1-chris@chris-wilson.co.uk
      e9af4ea2
  18. 03 1月, 2018 2 次提交
  19. 13 12月, 2017 1 次提交
  20. 05 10月, 2017 1 次提交
    • C
      drm/i915/scheduler: Support user-defined priorities · ac14fbd4
      Chris Wilson 提交于
      Use a priority stored in the context as the initial value when
      submitting a request. This allows us to change the default priority on a
      per-context basis, allowing different contexts to be favoured with GPU
      time at the expense of lower importance work. The user can adjust the
      context's priority via I915_CONTEXT_PARAM_PRIORITY, with more positive
      values being higher priority (they will be serviced earlier, after their
      dependencies have been resolved). Any prerequisite work for an execbuf
      will have its priority raised to match the new request as required.
      
      Normal users can specify any value in the range of -1023 to 0 [default],
      i.e. they can reduce the priority of their workloads (and temporarily
      boost it back to normal if so desired).
      
      Privileged users can specify any value in the range of -1023 to 1023,
      [default is 0], i.e. they can raise their priority above all overs and
      so potentially starve the system.
      
      Note that the existing schedulers are not fair, nor load balancing, the
      execution is strictly by priority on a first-come, first-served basis,
      and the driver may choose to boost some requests above the range
      available to users.
      
      This priority was originally based around nice(2), but evolved to allow
      clients to adjust their priority within a small range, and allow for a
      privileged high priority range.
      
      For example, this can be used to implement EGL_IMG_context_priority
      https://www.khronos.org/registry/egl/extensions/IMG/EGL_IMG_context_priority.txt
      
      	EGL_CONTEXT_PRIORITY_LEVEL_IMG determines the priority level of
              the context to be created. This attribute is a hint, as an
              implementation may not support multiple contexts at some
              priority levels and system policy may limit access to high
              priority contexts to appropriate system privilege level. The
              default value for EGL_CONTEXT_PRIORITY_LEVEL_IMG is
              EGL_CONTEXT_PRIORITY_MEDIUM_IMG."
      
      so we can map
      
      	PRIORITY_HIGH -> 1023 [privileged, will failback to 0]
      	PRIORITY_MED -> 0 [default]
      	PRIORITY_LOW -> -1023
      
      They also map onto the priorities used by VkQueue (and a VkQueue is
      essentially a timeline, our i915_gem_context under full-ppgtt).
      
      v2: s/CAP_SYS_ADMIN/CAP_SYS_NICE/
      v3: Report min/max user priorities as defines in the uapi, and rebase
      internal priorities on the exposed values.
      
      Testcase: igt/gem_exec_schedule
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20171003203453.15692-9-chris@chris-wilson.co.uk
      ac14fbd4
  21. 29 9月, 2017 1 次提交
  22. 23 9月, 2017 1 次提交
  23. 28 6月, 2017 1 次提交
    • C
      drm/i915: Avoid keeping waitboost active for signaling threads · 7b92c1bd
      Chris Wilson 提交于
      Once a client has requested a waitboost, we keep that waitboost active
      until all clients are no longer waiting. This is because we don't
      distinguish which waiter deserves the boost. However, with the advent of
      fence signaling, the signaler threads appear as waiters to the RPS
      interrupt handler. So instead of using a single boolean to track when to
      keep the waitboost active, use a counter of all outstanding waitboosted
      requests.
      
      At this point, I have removed all vestiges of the rate limiting on
      clients. Whilst this means that compositors should remain more fluid,
      it also means that boosts are more prevalent. See commit b29c19b6
      ("drm/i915: Boost RPS frequency for CPU stalls") for a longer discussion
      on the pros and cons of both approaches.
      
      A drawback of this implementation is that it requires constant request
      submission to keep the waitboost trimmed (as it is now cancelled when the
      request is completed). This will be fine for a busy system, but near
      idle the boosts may be kept for longer than desired (effectively tens of
      vblanks worstcase) and there is a reliance on rc6 instead.
      
      v2: Remove defunct rps.client_lock
      Reported-by: NMichał Winiarski <michal.winiarski@intel.com>
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Michał Winiarski <michal.winiarski@intel.com>
      Reviewed-by: NMichał Winiarski <michal.winiarski@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170628123548.9236-1-chris@chris-wilson.co.uk
      7b92c1bd
  24. 20 6月, 2017 1 次提交
    • I
      sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Ingo Molnar 提交于
      Rename:
      
      	wait_queue_t		=>	wait_queue_entry_t
      
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      ac6424b9
  25. 17 5月, 2017 2 次提交
    • C
      drm/i915: Split execlist priority queue into rbtree + linked list · 6c067579
      Chris Wilson 提交于
      All the requests at the same priority are executed in FIFO order. They
      do not need to be stored in the rbtree themselves, as they are a simple
      list within a level. If we move the requests at one priority into a list,
      we can then reduce the rbtree to the set of priorities. This should keep
      the height of the rbtree small, as the number of active priorities can not
      exceed the number of active requests and should be typically only a few.
      
      Currently, we have ~2k possible different priority levels, that may
      increase to allow even more fine grained selection. Allocating those in
      advance seems a waste (and may be impossible), so we opt for allocating
      upon first use, and freeing after its requests are depleted. To avoid
      the possibility of an allocation failure causing us to lose a request,
      we preallocate the default priority (0) and bump any request to that
      priority if we fail to allocate it the appropriate plist. Having a
      request (that is ready to run, so not leading to corruption) execute
      out-of-order is better than leaking the request (and its dependency
      tree) entirely.
      
      There should be a benefit to reducing execlists_dequeue() to principally
      using a simple list (and reducing the frequency of both rbtree iteration
      and balancing on erase) but for typical workloads, request coalescing
      should be small enough that we don't notice any change. The main gain is
      from improving PI calls to schedule, and the explicit list within a
      level should make request unwinding simpler (we just need to insert at
      the head of the list rather than the tail and not have to make the
      rbtree search more complicated).
      
      v2: Avoid use-after-free when deleting a depleted priolist
      
      v3: Michał found the solution to handling the allocation failure
      gracefully. If we disable all priority scheduling following the
      allocation failure, those requests will be executed in fifo and we will
      ensure that this request and its dependencies are in strict fifo (even
      when it doesn't realise it is only a single list). Normal scheduling is
      restored once we know the device is idle, until the next failure!
      Suggested-by: NMichał Wajdeczko <michal.wajdeczko@intel.com>
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Michał Winiarski <michal.winiarski@intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Reviewed-by: NMichał Winiarski <michal.winiarski@intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170517121007.27224-8-chris@chris-wilson.co.uk
      6c067579
    • C
      drm/i915: Use a define for the default priority [0] · e4f815f6
      Chris Wilson 提交于
      Explicitly assign the default priority, and give it a name. After much
      discussion, we have chosen to call it I915_PRIORITY_NORMAL!
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: NMika Kuoppala <mika.kuoppala@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170517121007.27224-7-chris@chris-wilson.co.uk
      e4f815f6
  26. 19 4月, 2017 1 次提交
    • P
      mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU · 5f0d5a3a
      Paul E. McKenney 提交于
      A group of Linux kernel hackers reported chasing a bug that resulted
      from their assumption that SLAB_DESTROY_BY_RCU provided an existence
      guarantee, that is, that no block from such a slab would be reallocated
      during an RCU read-side critical section.  Of course, that is not the
      case.  Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire
      slab of blocks.
      
      However, there is a phrase for this, namely "type safety".  This commit
      therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order
      to avoid future instances of this sort of confusion.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <linux-mm@kvack.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      [ paulmck: Add comments mentioning the old name, as requested by Eric
        Dumazet, in order to help people familiar with the old name find
        the new one. ]
      Acked-by: NDavid Rientjes <rientjes@google.com>
      5f0d5a3a
  27. 15 4月, 2017 1 次提交
  28. 17 3月, 2017 1 次提交
  29. 09 3月, 2017 1 次提交
  30. 03 3月, 2017 1 次提交
  31. 28 2月, 2017 1 次提交
    • C
      drm/i915: Signal first fence from irq handler if complete · 56299fb7
      Chris Wilson 提交于
      As execlists and other non-semaphore multi-engine devices coordinate
      between engines using interrupts, we can shave off a few 10s of
      microsecond of scheduling latency by doing the fence signaling from the
      interrupt as opposed to a RT kthread. (Realistically the delay adds
      about 1% to an individual cross-engine workload.) We only signal the
      first fence in order to limit the amount of work we move into the
      interrupt handler. We also have to remember that our breadcrumbs may be
      unordered with respect to the interrupt and so we still require the
      waiter process to perform some heavyweight coherency fixups, as well as
      traversing the tree of waiters.
      
      v2: No need for early exit in irq handler - it breaks the flow between
      patches and prevents the tracepoint
      v3: Restore rcu hold across irq signaling of request
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170227205850.2828-2-chris@chris-wilson.co.uk
      56299fb7