1. 18 9月, 2017 3 次提交
  2. 14 9月, 2017 2 次提交
    • C
      drm/i915/execlists: Read the context-status HEAD from the HWSP · 767a983a
      Chris Wilson 提交于
      The engine also provides a mirror of the CSB write pointer in the HWSP,
      but not of our read pointer. To take advantage of this we need to
      remember where we read up to on the last interrupt and continue off from
      there. This poses a problem following a reset, as we don't know where
      the hw will start writing from, and due to the use of power contexts we
      cannot perform that query during the reset itself. So we continue the
      current modus operandi of delaying the first read of the context-status
      read/write pointers until after the first interrupt. With this we should
      now have eliminated all uncached mmio reads in handling the
      context-status interrupt, though we still have the uncached mmio writes
      for submitting new work, and many uncached mmio reads in the global
      interrupt handler itself. Still a step in the right direction towards
      reducing our resubmit latency, although it appears lost in the noise!
      
      v2: Cannonlake moved the CSB write index
      v3: Include the sw/hwsp state in debugfs/i915_engine_info
      v4: Also revert to using CSB mmio for GVT-g
      v5: Prevent the compiler reloading tail (Mika)
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Michel Thierry <michel.thierry@intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Mika Kuoppala <mika.kuoppala@intel.com>
      Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
      Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
      Cc: Zhi Wang <zhi.a.wang@intel.com>
      Acked-by: NMichel Thierry <michel.thierry@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20170913085605.18299-6-chris@chris-wilson.co.ukReviewed-by: NMika Kuoppala <mika.kuoppala@intel.com>
      767a983a
    • C
      drm/i915/execlists: Read the context-status buffer from the HWSP · 6d2cb5aa
      Chris Wilson 提交于
      The engine provides a mirror of the CSB in the HWSP. If we use the
      cacheable reads from the HWSP, we can shave off a few mmio reads per
      context-switch interrupt (which are quite frequent!). Just removing a
      couple of mmio is not enough to actually reduce any latency, but a small
      reduction in overall cpu usage.
      
      Much appreciation for Ben dropping the bombshell that the CSB was in the
      HWSP and for Michel in digging out the details.
      
      v2: Don't be lazy, add the defines for the indices.
      v3: Include the HWSP in debugfs/i915_engine_info
      v4: Check for GVT-g, it currently depends on intercepting CSB mmio
      v5: Fixup GVT-g mmio path
      v6: Disable HWSP if VT-d is active as the iommu adds unpredictable
      memory latency. (Mika)
      v7: Also markup the CSB read with READ_ONCE() as it may still be an mmio
      read and we want to stop the compiler from issuing a later (v.slow) reload.
      Suggested-by: NBen Widawsky <benjamin.widawsky@intel.com>
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Michel Thierry <michel.thierry@intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Mika Kuoppala <mika.kuoppala@intel.com>
      Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
      Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
      Cc: Zhi Wang <zhi.a.wang@intel.com>
      Acked-by: NMichel Thierry <michel.thierry@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20170913133534.26927-1-chris@chris-wilson.co.ukReviewed-by: NMika Kuoppala <mika.kuoppala@intel.com>
      6d2cb5aa
  3. 08 9月, 2017 1 次提交
  4. 07 9月, 2017 1 次提交
  5. 18 8月, 2017 1 次提交
    • C
      drm/i915: Replace execbuf vma ht with an idr · d1b48c1e
      Chris Wilson 提交于
      This was the competing idea long ago, but it was only with the rewrite
      of the idr as an radixtree and using the radixtree directly ourselves,
      along with the realisation that we can store the vma directly in the
      radixtree and only need a list for the reverse mapping, that made the
      patch performant enough to displace using a hashtable. Though the vma ht
      is fast and doesn't require any extra allocation (as we can embed the node
      inside the vma), it does require a thread for resizing and serialization
      and will have the occasional slow lookup. That is hairy enough to
      investigate alternatives and favour them if equivalent in peak performance.
      One advantage of allocating an indirection entry is that we can support a
      single shared bo between many clients, something that was done on a
      first-come first-serve basis for shared GGTT vma previously. To offset
      the extra allocations, we create yet another kmem_cache for them.
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20170816085210.4199-5-chris@chris-wilson.co.uk
      d1b48c1e
  6. 11 8月, 2017 1 次提交
  7. 27 7月, 2017 2 次提交
  8. 21 7月, 2017 1 次提交
  9. 14 7月, 2017 1 次提交
  10. 11 7月, 2017 1 次提交
  11. 10 7月, 2017 1 次提交
  12. 08 7月, 2017 1 次提交
  13. 03 7月, 2017 1 次提交
  14. 28 6月, 2017 1 次提交
    • C
      drm/i915: Avoid keeping waitboost active for signaling threads · 7b92c1bd
      Chris Wilson 提交于
      Once a client has requested a waitboost, we keep that waitboost active
      until all clients are no longer waiting. This is because we don't
      distinguish which waiter deserves the boost. However, with the advent of
      fence signaling, the signaler threads appear as waiters to the RPS
      interrupt handler. So instead of using a single boolean to track when to
      keep the waitboost active, use a counter of all outstanding waitboosted
      requests.
      
      At this point, I have removed all vestiges of the rate limiting on
      clients. Whilst this means that compositors should remain more fluid,
      it also means that boosts are more prevalent. See commit b29c19b6
      ("drm/i915: Boost RPS frequency for CPU stalls") for a longer discussion
      on the pros and cons of both approaches.
      
      A drawback of this implementation is that it requires constant request
      submission to keep the waitboost trimmed (as it is now cancelled when the
      request is completed). This will be fine for a busy system, but near
      idle the boosts may be kept for longer than desired (effectively tens of
      vblanks worstcase) and there is a reliance on rc6 instead.
      
      v2: Remove defunct rps.client_lock
      Reported-by: NMichał Winiarski <michal.winiarski@intel.com>
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Michał Winiarski <michal.winiarski@intel.com>
      Reviewed-by: NMichał Winiarski <michal.winiarski@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170628123548.9236-1-chris@chris-wilson.co.uk
      7b92c1bd
  15. 26 6月, 2017 1 次提交
  16. 21 6月, 2017 3 次提交
  17. 16 6月, 2017 1 次提交
    • C
      drm/i915: Store a direct lookup from object handle to vma · 4ff4b44c
      Chris Wilson 提交于
      The advent of full-ppgtt lead to an extra indirection between the object
      and its binding. That extra indirection has a noticeable impact on how
      fast we can convert from the user handles to our internal vma for
      execbuffer. In order to bypass the extra indirection, we use a
      resizable hashtable to jump from the object to the per-ctx vma.
      rhashtable was considered but we don't need the online resizing feature
      and the extra complexity proved to undermine its usefulness. Instead, we
      simply reallocate the hastable on demand in a background task and
      serialize it before iterating.
      
      In non-full-ppgtt modes, multiple files and multiple contexts can share
      the same vma. This leads to having multiple possible handle->vma links,
      so we only use the first to establish the fast path. The majority of
      buffers are not shared and so we should still be able to realise
      speedups with multiple clients.
      
      v2: Prettier names, more magic.
      v3: Many style tweaks, most notably hiding the misuse of execobj[].rsvd2
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      4ff4b44c
  18. 07 6月, 2017 2 次提交
  19. 30 5月, 2017 1 次提交
  20. 26 5月, 2017 1 次提交
  21. 22 5月, 2017 1 次提交
  22. 19 5月, 2017 3 次提交
  23. 18 5月, 2017 1 次提交
  24. 17 5月, 2017 2 次提交
    • C
      drm/i915: Split execlist priority queue into rbtree + linked list · 6c067579
      Chris Wilson 提交于
      All the requests at the same priority are executed in FIFO order. They
      do not need to be stored in the rbtree themselves, as they are a simple
      list within a level. If we move the requests at one priority into a list,
      we can then reduce the rbtree to the set of priorities. This should keep
      the height of the rbtree small, as the number of active priorities can not
      exceed the number of active requests and should be typically only a few.
      
      Currently, we have ~2k possible different priority levels, that may
      increase to allow even more fine grained selection. Allocating those in
      advance seems a waste (and may be impossible), so we opt for allocating
      upon first use, and freeing after its requests are depleted. To avoid
      the possibility of an allocation failure causing us to lose a request,
      we preallocate the default priority (0) and bump any request to that
      priority if we fail to allocate it the appropriate plist. Having a
      request (that is ready to run, so not leading to corruption) execute
      out-of-order is better than leaking the request (and its dependency
      tree) entirely.
      
      There should be a benefit to reducing execlists_dequeue() to principally
      using a simple list (and reducing the frequency of both rbtree iteration
      and balancing on erase) but for typical workloads, request coalescing
      should be small enough that we don't notice any change. The main gain is
      from improving PI calls to schedule, and the explicit list within a
      level should make request unwinding simpler (we just need to insert at
      the head of the list rather than the tail and not have to make the
      rbtree search more complicated).
      
      v2: Avoid use-after-free when deleting a depleted priolist
      
      v3: Michał found the solution to handling the allocation failure
      gracefully. If we disable all priority scheduling following the
      allocation failure, those requests will be executed in fifo and we will
      ensure that this request and its dependencies are in strict fifo (even
      when it doesn't realise it is only a single list). Normal scheduling is
      restored once we know the device is idle, until the next failure!
      Suggested-by: NMichał Wajdeczko <michal.wajdeczko@intel.com>
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Michał Winiarski <michal.winiarski@intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Reviewed-by: NMichał Winiarski <michal.winiarski@intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170517121007.27224-8-chris@chris-wilson.co.uk
      6c067579
    • C
      drm/i915/execlists: Pack the count into the low bits of the port.request · 77f0d0e9
      Chris Wilson 提交于
      add/remove: 1/1 grow/shrink: 5/4 up/down: 391/-578 (-187)
      function                                     old     new   delta
      execlists_submit_ports                       262     471    +209
      port_assign.isra                               -     136    +136
      capture                                     6344    6359     +15
      reset_common_ring                            438     452     +14
      execlists_submit_request                     228     238     +10
      gen8_init_common_ring                        334     341      +7
      intel_engine_is_idle                         106     105      -1
      i915_engine_info                            2314    2290     -24
      __i915_gem_set_wedged_BKL                    485     411     -74
      intel_lrc_irq_handler                       1789    1604    -185
      execlists_update_context                     294       -    -294
      
      The most important change there is the improve to the
      intel_lrc_irq_handler and excclist_submit_ports (net improvement since
      execlists_update_context is now inlined).
      
      v2: Use the port_api() for guc as well (even though currently we do not
      pack any counters in there, yet) and hide all port->request_count inside
      the helpers.
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Mika Kuoppala <mika.kuoppala@intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170517121007.27224-5-chris@chris-wilson.co.uk
      77f0d0e9
  25. 16 5月, 2017 1 次提交
  26. 11 5月, 2017 2 次提交
  27. 10 5月, 2017 1 次提交
    • V
      drm/i915: Two stage watermarks for g4x · 04548cba
      Ville Syrjälä 提交于
      Implement proper two stage watermark programming for g4x. As with
      other pre-SKL platforms, the watermark registers aren't double
      buffered on g4x. Hence we must sequence the watermark update
      carefully around plane updates.
      
      The code is quite heavily modelled on the VLV/CHV code, with some
      fairly significant differences due to the different hardware
      architecture:
      * g4x doesn't use inverted watermark values
      * CxSR actually affects the watermarks since it controls memory self
        refresh in addition to the max FIFO mode
      * A further HPLL SR mode is possible with higher memory wakeup
        latency
      * g4x has FBC2 and so it also has FBC watermarks
      * max FIFO mode for primary plane only (cursor is allowed, sprite is not)
      * g4x has no manual FIFO repartitioning
      * some TLB miss related workarounds are needed for the watermarks
      
      Actually the hardware is quite similar to ILK+ in many ways. The
      most visible differences are in the actual watermakr register
      layout. ILK revamped that part quite heavily whereas g4x is still
      using the layout inherited from earlier platforms.
      
      Note that we didn't previously enable the HPLL SR on g4x. So in order
      to not introduce too many functional changes in this patch I've not
      actually enabled it here either, even though the code is now fully
      ready for it. We'll enable it separately later on.
      Signed-off-by: NVille Syrjälä <ville.syrjala@linux.intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170421181432.15216-13-ville.syrjala@linux.intel.comReviewed-by: NMaarten Lankhorst <maarten.lankhorst@linux.intel.com>
      04548cba
  28. 09 5月, 2017 1 次提交
  29. 08 5月, 2017 1 次提交