1. 05 10月, 2017 1 次提交
  2. 04 10月, 2017 1 次提交
  3. 29 9月, 2017 3 次提交
  4. 27 9月, 2017 1 次提交
  5. 26 9月, 2017 1 次提交
  6. 25 9月, 2017 7 次提交
  7. 22 9月, 2017 1 次提交
  8. 18 9月, 2017 6 次提交
  9. 14 9月, 2017 2 次提交
    • C
      drm/i915/execlists: Read the context-status HEAD from the HWSP · 767a983a
      Chris Wilson 提交于
      The engine also provides a mirror of the CSB write pointer in the HWSP,
      but not of our read pointer. To take advantage of this we need to
      remember where we read up to on the last interrupt and continue off from
      there. This poses a problem following a reset, as we don't know where
      the hw will start writing from, and due to the use of power contexts we
      cannot perform that query during the reset itself. So we continue the
      current modus operandi of delaying the first read of the context-status
      read/write pointers until after the first interrupt. With this we should
      now have eliminated all uncached mmio reads in handling the
      context-status interrupt, though we still have the uncached mmio writes
      for submitting new work, and many uncached mmio reads in the global
      interrupt handler itself. Still a step in the right direction towards
      reducing our resubmit latency, although it appears lost in the noise!
      
      v2: Cannonlake moved the CSB write index
      v3: Include the sw/hwsp state in debugfs/i915_engine_info
      v4: Also revert to using CSB mmio for GVT-g
      v5: Prevent the compiler reloading tail (Mika)
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Michel Thierry <michel.thierry@intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Mika Kuoppala <mika.kuoppala@intel.com>
      Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
      Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
      Cc: Zhi Wang <zhi.a.wang@intel.com>
      Acked-by: NMichel Thierry <michel.thierry@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20170913085605.18299-6-chris@chris-wilson.co.ukReviewed-by: NMika Kuoppala <mika.kuoppala@intel.com>
      767a983a
    • C
      drm/i915/execlists: Read the context-status buffer from the HWSP · 6d2cb5aa
      Chris Wilson 提交于
      The engine provides a mirror of the CSB in the HWSP. If we use the
      cacheable reads from the HWSP, we can shave off a few mmio reads per
      context-switch interrupt (which are quite frequent!). Just removing a
      couple of mmio is not enough to actually reduce any latency, but a small
      reduction in overall cpu usage.
      
      Much appreciation for Ben dropping the bombshell that the CSB was in the
      HWSP and for Michel in digging out the details.
      
      v2: Don't be lazy, add the defines for the indices.
      v3: Include the HWSP in debugfs/i915_engine_info
      v4: Check for GVT-g, it currently depends on intercepting CSB mmio
      v5: Fixup GVT-g mmio path
      v6: Disable HWSP if VT-d is active as the iommu adds unpredictable
      memory latency. (Mika)
      v7: Also markup the CSB read with READ_ONCE() as it may still be an mmio
      read and we want to stop the compiler from issuing a later (v.slow) reload.
      Suggested-by: NBen Widawsky <benjamin.widawsky@intel.com>
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Michel Thierry <michel.thierry@intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Mika Kuoppala <mika.kuoppala@intel.com>
      Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
      Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
      Cc: Zhi Wang <zhi.a.wang@intel.com>
      Acked-by: NMichel Thierry <michel.thierry@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20170913133534.26927-1-chris@chris-wilson.co.ukReviewed-by: NMika Kuoppala <mika.kuoppala@intel.com>
      6d2cb5aa
  10. 13 9月, 2017 3 次提交
  11. 21 8月, 2017 1 次提交
  12. 19 8月, 2017 2 次提交
  13. 27 7月, 2017 1 次提交
  14. 21 6月, 2017 1 次提交
  15. 19 6月, 2017 1 次提交
  16. 15 6月, 2017 1 次提交
    • R
      drm/i915/perf: Add OA unit support for Gen 8+ · 19f81df2
      Robert Bragg 提交于
      Enables access to OA unit metrics for BDW, CHV, SKL and BXT which all
      share (more-or-less) the same OA unit design.
      
      Of particular note in comparison to Haswell: some OA unit HW config
      state has become per-context state and as a consequence it is somewhat
      more complicated to manage synchronous state changes from the cpu while
      there's no guarantee of what context (if any) is currently actively
      running on the gpu.
      
      The periodic sampling frequency which can be particularly useful for
      system-wide analysis (as opposed to command stream synchronised
      MI_REPORT_PERF_COUNT commands) is perhaps the most surprising state to
      have become per-context save and restored (while the OABUFFER
      destination is still a shared, system-wide resource).
      
      This support for gen8+ takes care to consider a number of timing
      challenges involved in synchronously updating per-context state
      primarily by programming all config state from the cpu and updating all
      current and saved contexts synchronously while the OA unit is still
      disabled.
      
      The driver intentionally avoids depending on command streamer
      programming to update OA state considering the lack of synchronization
      between the automatic loading of OACTXCONTROL state (that includes the
      periodic sampling state and enable state) on context restore and the
      parsing of any general purpose BB the driver can control. I.e. this
      implementation is careful to avoid the possibility of a context restore
      temporarily enabling any out-of-date periodic sampling state. In
      addition to the risk of transiently-out-of-date state being loaded
      automatically; there are also internal HW latencies involved in the
      loading of MUX configurations which would be difficult to account for
      from the command streamer (and we only want to enable the unit when once
      the MUX configuration is complete).
      
      Since the Gen8+ OA unit design no longer supports clock gating the unit
      off for a single given context (which effectively stopped any progress
      of counters while any other context was running) and instead supports
      tagging OA reports with a context ID for filtering on the CPU, it means
      we can no longer hide the system-wide progress of counters from a
      non-privileged application only interested in metrics for its own
      context. Although we could theoretically try and subtract the progress
      of other contexts before forwarding reports via read() we aren't in a
      position to filter reports captured via MI_REPORT_PERF_COUNT commands.
      As a result, for Gen8+, we always require the
      dev.i915.perf_stream_paranoid to be unset for any access to OA metrics
      if not root.
      
      v5: Drain submitted requests when enabling metric set to ensure no
          lite-restore erases the context image we just updated (Lionel)
      
      v6: In addition to drain, switch to kernel context & update all
          context in place (Chris)
      
      v7: Add missing mutex_unlock() if switching to kernel context fails
          (Matthew)
      
      v8: Simplify OA period/flex-eu-counters programming by using the
          batchbuffer instead of modifying ctx-image (Lionel)
      
      v9: Back to updating the context image (due to erroneous testing,
          batchbuffer programming the OA unit doesn't actually work)
          (Lionel)
          Pin context before updating context image (Chris)
          Drop MMIO programming now that we switch to a kernel context with
          right values in initial context image (Chris)
      
      v10: Just pin_map the contexts we want to modify or let the
           configuration happen on first use (Chris)
      
      v11: Update kernel context OA config through the batchbuffer rather
           than on the fly ctx-image update (Lionel)
      
      v12: Rework OA context registers update again by swithing away from
           user contexts and reconfiguring the kernel context through the
           batchbuffer and updating all the other contexts' context image.
           Also take care to lock slice/subslice configuration when OA is
           on. (Lionel)
      
      v13: Request rpcs updates on all engine when updating the OA config
           (Lionel)
      
      v14: Drop any kind of rpcs management now that we monitor sseu
           configuration changes in a later patch (Lionel)
           Remove usleep after programming the NOA configs on Gen8+, this
           doesn't seem to be needed (Lionel)
      
      v15: Respect coding style for block comments (Chris)
      
      v16: Add missing i915_add_request() in case we fail to emit OA
           configuration (Matthew)
      Signed-off-by: NRobert Bragg <robert@sixbynine.org>
      Signed-off-by: NLionel Landwerlin <lionel.g.landwerlin@intel.com>
      Reviewed-by: Matthew Auld <matthew.auld@intel.com> \o/
      Signed-off-by: NBen Widawsky <ben@bwidawsk.net>
      19f81df2
  17. 07 6月, 2017 1 次提交
  18. 19 5月, 2017 1 次提交
  19. 17 5月, 2017 4 次提交
    • C
      drm/i915/execlists: Reduce lock contention between schedule/submit_request · 349bdb68
      Chris Wilson 提交于
      If we do not require to perform priority bumping, and we haven't yet
      submitted the request, we can update its priority in situ and skip
      acquiring the engine locks -- thus avoiding any contention between us
      and submit/execute.
      
      v2: Remove the stack element from the list if we can do the early
      assignment.
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170517121007.27224-10-chris@chris-wilson.co.uk
      349bdb68
    • C
      drm/i915: Create a kmem_cache to allocate struct i915_priolist from · c5cf9a91
      Chris Wilson 提交于
      The i915_priolist are allocated within an atomic context on a path where
      we wish to minimise latency. If we use a dedicated kmem_cache, we have
      the advantage of a local freelist from which to service new requests
      that should keep the latency impact of an allocation small. Though
      currently we expect the majority of requests to be at default priority
      (and so hit the preallocate priolist), once userspace starts using
      priorities they are likely to use many fine grained policies improving
      the utilisation of a private slab.
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170517121007.27224-9-chris@chris-wilson.co.uk
      c5cf9a91
    • C
      drm/i915: Split execlist priority queue into rbtree + linked list · 6c067579
      Chris Wilson 提交于
      All the requests at the same priority are executed in FIFO order. They
      do not need to be stored in the rbtree themselves, as they are a simple
      list within a level. If we move the requests at one priority into a list,
      we can then reduce the rbtree to the set of priorities. This should keep
      the height of the rbtree small, as the number of active priorities can not
      exceed the number of active requests and should be typically only a few.
      
      Currently, we have ~2k possible different priority levels, that may
      increase to allow even more fine grained selection. Allocating those in
      advance seems a waste (and may be impossible), so we opt for allocating
      upon first use, and freeing after its requests are depleted. To avoid
      the possibility of an allocation failure causing us to lose a request,
      we preallocate the default priority (0) and bump any request to that
      priority if we fail to allocate it the appropriate plist. Having a
      request (that is ready to run, so not leading to corruption) execute
      out-of-order is better than leaking the request (and its dependency
      tree) entirely.
      
      There should be a benefit to reducing execlists_dequeue() to principally
      using a simple list (and reducing the frequency of both rbtree iteration
      and balancing on erase) but for typical workloads, request coalescing
      should be small enough that we don't notice any change. The main gain is
      from improving PI calls to schedule, and the explicit list within a
      level should make request unwinding simpler (we just need to insert at
      the head of the list rather than the tail and not have to make the
      rbtree search more complicated).
      
      v2: Avoid use-after-free when deleting a depleted priolist
      
      v3: Michał found the solution to handling the allocation failure
      gracefully. If we disable all priority scheduling following the
      allocation failure, those requests will be executed in fifo and we will
      ensure that this request and its dependencies are in strict fifo (even
      when it doesn't realise it is only a single list). Normal scheduling is
      restored once we know the device is idle, until the next failure!
      Suggested-by: NMichał Wajdeczko <michal.wajdeczko@intel.com>
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Michał Winiarski <michal.winiarski@intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Reviewed-by: NMichał Winiarski <michal.winiarski@intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170517121007.27224-8-chris@chris-wilson.co.uk
      6c067579
    • C
      drm/i915/execlists: Pack the count into the low bits of the port.request · 77f0d0e9
      Chris Wilson 提交于
      add/remove: 1/1 grow/shrink: 5/4 up/down: 391/-578 (-187)
      function                                     old     new   delta
      execlists_submit_ports                       262     471    +209
      port_assign.isra                               -     136    +136
      capture                                     6344    6359     +15
      reset_common_ring                            438     452     +14
      execlists_submit_request                     228     238     +10
      gen8_init_common_ring                        334     341      +7
      intel_engine_is_idle                         106     105      -1
      i915_engine_info                            2314    2290     -24
      __i915_gem_set_wedged_BKL                    485     411     -74
      intel_lrc_irq_handler                       1789    1604    -185
      execlists_update_context                     294       -    -294
      
      The most important change there is the improve to the
      intel_lrc_irq_handler and excclist_submit_ports (net improvement since
      execlists_update_context is now inlined).
      
      v2: Use the port_api() for guc as well (even though currently we do not
      pack any counters in there, yet) and hide all port->request_count inside
      the helpers.
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Mika Kuoppala <mika.kuoppala@intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170517121007.27224-5-chris@chris-wilson.co.uk
      77f0d0e9
  20. 12 5月, 2017 1 次提交