1. 27 Jul, 2017 1 commit
  2. 21 Jun, 2017 1 commit
  3. 19 Jun, 2017 1 commit
  4. 15 Jun, 2017 1 commit
    • drm/i915/perf: Add OA unit support for Gen 8+ · 19f81df2
      Robert Bragg committed
      Enables access to OA unit metrics for BDW, CHV, SKL and BXT which all
      share (more-or-less) the same OA unit design.
      
      Of particular note in comparison to Haswell: some OA unit HW config
      state has become per-context state, and as a consequence it is somewhat
      more complicated to manage synchronous state changes from the CPU while
      there is no guarantee of which context (if any) is currently running on
      the GPU.
      
      The periodic sampling frequency, which can be particularly useful for
      system-wide analysis (as opposed to command-stream-synchronised
      MI_REPORT_PERF_COUNT commands), is perhaps the most surprising state to
      have become per-context saved and restored (while the OABUFFER
      destination is still a shared, system-wide resource).
      
      This support for gen8+ takes care to consider a number of timing
      challenges involved in synchronously updating per-context state,
      primarily by programming all config state from the CPU and updating all
      current and saved contexts synchronously while the OA unit is still
      disabled.
      
      The driver intentionally avoids depending on command streamer
      programming to update OA state, considering the lack of synchronization
      between the automatic loading of OACTXCONTROL state (which includes the
      periodic sampling state and enable state) on context restore and the
      parsing of any general-purpose BB the driver can control. I.e. this
      implementation is careful to avoid the possibility of a context restore
      temporarily enabling any out-of-date periodic sampling state. In
      addition to the risk of transiently out-of-date state being loaded
      automatically, there are also internal HW latencies involved in the
      loading of MUX configurations which would be difficult to account for
      from the command streamer (and we only want to enable the unit once
      the MUX configuration is complete).
      
      Since the Gen8+ OA unit design no longer supports clock gating the unit
      off for a single given context (which effectively stopped any progress
      of counters while any other context was running) and instead supports
      tagging OA reports with a context ID for filtering on the CPU, we can
      no longer hide the system-wide progress of counters from a
      non-privileged application only interested in metrics for its own
      context. Although we could theoretically try to subtract the progress
      of other contexts before forwarding reports via read(), we aren't in a
      position to filter reports captured via MI_REPORT_PERF_COUNT commands.
      As a result, for Gen8+, any non-root access to OA metrics always
      requires dev.i915.perf_stream_paranoid to be unset.
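
      As a rough illustration of the userspace side of this interface, here
      is a minimal, hedged sketch of opening a Gen8+ OA stream through the
      i915 perf uapi. The metrics-set ID and sampling exponent below are
      placeholders, the i915_drm.h include path varies between libdrm and
      kernel-header installs, and non-root use additionally depends on the
      paranoid sysctl discussed above.

      #include <stdint.h>
      #include <stdio.h>
      #include <sys/ioctl.h>
      #include <drm/i915_drm.h>   /* or <libdrm/i915_drm.h>, depending on install */

      /* Hedged sketch: open an i915 perf OA stream using the Gen8+ report
       * format. Non-root users also need dev.i915.perf_stream_paranoid=0. */
      static int open_oa_stream(int drm_fd)
      {
              uint64_t properties[] = {
                      DRM_I915_PERF_PROP_SAMPLE_OA, 1,
                      DRM_I915_PERF_PROP_OA_METRICS_SET, 1,   /* placeholder config id */
                      DRM_I915_PERF_PROP_OA_FORMAT,
                              I915_OA_FORMAT_A32u40_A4u32_B8_C8,
                      DRM_I915_PERF_PROP_OA_EXPONENT, 16,     /* placeholder period */
              };
              struct drm_i915_perf_open_param param = {
                      .flags = I915_PERF_FLAG_FD_CLOEXEC,
                      .num_properties = sizeof(properties) / (2 * sizeof(uint64_t)),
                      .properties_ptr = (uint64_t)(uintptr_t)properties,
              };
              int stream_fd = ioctl(drm_fd, DRM_IOCTL_I915_PERF_OPEN, &param);

              if (stream_fd < 0)
                      perror("DRM_IOCTL_I915_PERF_OPEN");
              return stream_fd;       /* OA reports are then read() from this fd */
      }

      On Gen8+ each report carries the context ID mentioned above, which is
      what lets userspace filter system-wide samples down to its own context.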
      
      v5: Drain submitted requests when enabling metric set to ensure no
          lite-restore erases the context image we just updated (Lionel)
      
      v6: In addition to drain, switch to kernel context & update all
          contexts in place (Chris)
      
      v7: Add missing mutex_unlock() if switching to kernel context fails
          (Matthew)
      
      v8: Simplify OA period/flex-eu-counters programming by using the
          batchbuffer instead of modifying ctx-image (Lionel)
      
      v9: Back to updating the context image (due to erroneous testing,
          batchbuffer programming the OA unit doesn't actually work)
          (Lionel)
          Pin context before updating context image (Chris)
          Drop MMIO programming now that we switch to a kernel context with
          right values in initial context image (Chris)
      
      v10: Just pin_map the contexts we want to modify or let the
           configuration happen on first use (Chris)
      
      v11: Update kernel context OA config through the batchbuffer rather
           than on the fly ctx-image update (Lionel)
      
      v12: Rework OA context registers update again by switching away from
           user contexts and reconfiguring the kernel context through the
           batchbuffer and updating all the other contexts' context image.
           Also take care to lock slice/subslice configuration when OA is
           on. (Lionel)
      
      v13: Request rpcs updates on all engines when updating the OA config
           (Lionel)
      
      v14: Drop any kind of rpcs management now that we monitor sseu
           configuration changes in a later patch (Lionel)
           Remove usleep after programming the NOA configs on Gen8+; this
           doesn't seem to be needed (Lionel)
      
      v15: Respect coding style for block comments (Chris)
      
      v16: Add missing i915_add_request() in case we fail to emit OA
           configuration (Matthew)
      Signed-off-by: Robert Bragg <robert@sixbynine.org>
      Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
      Reviewed-by: Matthew Auld <matthew.auld@intel.com> \o/
      Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
  5. 07 Jun, 2017 1 commit
  6. 19 May, 2017 1 commit
  7. 17 May, 2017 4 commits
    • drm/i915/execlists: Reduce lock contention between schedule/submit_request · 349bdb68
      Chris Wilson committed
      If we do not require to perform priority bumping, and we haven't yet
      submitted the request, we can update its priority in situ and skip
      acquiring the engine locks -- thus avoiding any contention between us
      and submit/execute.
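
      The fast path being described is roughly the following (a hypothetical
      sketch with made-up request fields, not the driver code): only when the
      in-situ update is not possible do we fall back to the path that takes
      the engine locks.

      #include <linux/compiler.h>
      #include <linux/spinlock.h>
      #include <linux/types.h>

      /* Hypothetical request with only the fields needed for the sketch. */
      struct sketch_request {
              spinlock_t lock;        /* request-local lock */
              bool submitted;         /* set once handed to the engine */
              int priority;
      };

      /* Fast path: bump the priority in situ if the request has not been
       * submitted yet, avoiding the engine timeline lock entirely. Returns
       * false if the caller must take the locked slow path instead. */
      static bool try_bump_priority_in_situ(struct sketch_request *rq, int prio)
      {
              bool done = false;

              if (prio <= READ_ONCE(rq->priority))
                      return true;    /* no bump required */

              spin_lock(&rq->lock);
              if (!rq->submitted) {
                      rq->priority = prio;
                      done = true;
              }
              spin_unlock(&rq->lock);

              return done;
      }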
      
      v2: Remove the stack element from the list if we can do the early
      assignment.
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170517121007.27224-10-chris@chris-wilson.co.uk
    • drm/i915: Create a kmem_cache to allocate struct i915_priolist from · c5cf9a91
      Chris Wilson committed
      The i915_priolist structs are allocated within an atomic context on a
      path where we wish to minimise latency. If we use a dedicated
      kmem_cache, we have the advantage of a local freelist from which to
      service new requests, which should keep the latency impact of an
      allocation small. Though currently we expect the majority of requests
      to be at default priority (and so hit the preallocated priolist), once
      userspace starts using priorities it is likely to use many fine-grained
      policies, improving the utilisation of a private slab.
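
      The general technique looks roughly like this (illustrative names, not
      the i915 ones): create a dedicated slab cache once at init, then
      allocate from it with GFP_ATOMIC on the latency-sensitive path.

      #include <linux/errno.h>
      #include <linux/list.h>
      #include <linux/slab.h>

      /* Illustrative priority level; the real struct i915_priolist differs. */
      struct example_priolist {
              struct list_head requests;
              int priority;
      };

      static struct kmem_cache *priolist_cache;

      static int priolists_init(void)
      {
              /* Dedicated slab: its local freelists keep atomic allocations cheap. */
              priolist_cache = KMEM_CACHE(example_priolist, 0);
              return priolist_cache ? 0 : -ENOMEM;
      }

      static struct example_priolist *priolist_alloc_atomic(int priority)
      {
              /* GFP_ATOMIC: we may be on an atomic, latency-sensitive path. */
              struct example_priolist *pl =
                      kmem_cache_zalloc(priolist_cache, GFP_ATOMIC);

              if (pl) {
                      INIT_LIST_HEAD(&pl->requests);
                      pl->priority = priority;
              }
              return pl;
      }
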
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170517121007.27224-9-chris@chris-wilson.co.uk
    • drm/i915: Split execlist priority queue into rbtree + linked list · 6c067579
      Chris Wilson committed
      All the requests at the same priority are executed in FIFO order. They
      do not need to be stored in the rbtree themselves, as they are a simple
      list within a level. If we move the requests at one priority into a list,
      we can then reduce the rbtree to the set of priorities. This should keep
      the height of the rbtree small, as the number of active priorities cannot
      exceed the number of active requests and should typically be only a few.
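
      A hedged sketch of that split (illustrative types, not the driver's):
      the rbtree is keyed by priority alone, and each node carries the FIFO
      list of requests at that priority.

      #include <linux/list.h>
      #include <linux/rbtree.h>
      #include <linux/slab.h>

      struct sketch_priolist {
              struct rb_node node;        /* keyed by priority in the rbtree */
              struct list_head requests;  /* FIFO of requests at this priority */
              int priority;
      };

      /* Find the priority level, inserting a new (empty) one if needed.
       * The rbtree height now scales with the number of distinct priorities,
       * not with the number of requests. */
      static struct sketch_priolist *lookup_priolist(struct rb_root *root, int prio)
      {
              struct rb_node **p = &root->rb_node, *parent = NULL;
              struct sketch_priolist *pl;

              while (*p) {
                      parent = *p;
                      pl = rb_entry(parent, struct sketch_priolist, node);
                      if (prio > pl->priority)
                              p = &parent->rb_left;   /* higher priority to the left */
                      else if (prio < pl->priority)
                              p = &parent->rb_right;
                      else
                              return pl;              /* level already exists */
              }

              pl = kmalloc(sizeof(*pl), GFP_ATOMIC);
              if (!pl)
                      return NULL;    /* caller falls back to the default level */

              pl->priority = prio;
              INIT_LIST_HEAD(&pl->requests);
              rb_link_node(&pl->node, parent, p);
              rb_insert_color(&pl->node, root);
              return pl;
      }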
      
      Currently, we have ~2k possible different priority levels, and that may
      increase to allow even more fine-grained selection. Allocating those in
      advance seems a waste (and may be impossible), so we opt for allocating
      a level upon first use, and freeing it after its requests are depleted.
      To avoid the possibility of an allocation failure causing us to lose a
      request, we preallocate the default priority (0) and bump any request to
      that priority if we fail to allocate the appropriate plist for it.
      Having a request (that is ready to run, so not leading to corruption)
      execute out-of-order is better than leaking the request (and its
      dependency tree) entirely.
      
      There should be a benefit to reducing execlists_dequeue() to principally
      using a simple list (and reducing the frequency of both rbtree iteration
      and balancing on erase) but for typical workloads, request coalescing
      should be small enough that we don't notice any change. The main gain is
      from improving PI calls to schedule, and the explicit list within a
      level should make request unwinding simpler (we just need to insert at
      the head of the list rather than the tail and not have to make the
      rbtree search more complicated).
      
      v2: Avoid use-after-free when deleting a depleted priolist
      
      v3: Michał found the solution to handling the allocation failure
      gracefully. If we disable all priority scheduling following the
      allocation failure, those requests will be executed in FIFO order and we
      will ensure that this request and its dependencies remain in strict FIFO
      (even though the scheduler doesn't realise it is only a single list).
      Normal scheduling is restored once we know the device is idle, until the
      next failure!
      Suggested-by: Michał Wajdeczko <michal.wajdeczko@intel.com>
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Michał Winiarski <michal.winiarski@intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
      Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170517121007.27224-8-chris@chris-wilson.co.uk
    • drm/i915/execlists: Pack the count into the low bits of the port.request · 77f0d0e9
      Chris Wilson committed
      add/remove: 1/1 grow/shrink: 5/4 up/down: 391/-578 (-187)
      function                                     old     new   delta
      execlists_submit_ports                       262     471    +209
      port_assign.isra                               -     136    +136
      capture                                     6344    6359     +15
      reset_common_ring                            438     452     +14
      execlists_submit_request                     228     238     +10
      gen8_init_common_ring                        334     341      +7
      intel_engine_is_idle                         106     105      -1
      i915_engine_info                            2314    2290     -24
      __i915_gem_set_wedged_BKL                    485     411     -74
      intel_lrc_irq_handler                       1789    1604    -185
      execlists_update_context                     294       -    -294
      
      The most important change there is the improvement to
      intel_lrc_irq_handler and execlists_submit_ports (a net improvement
      since execlists_update_context is now inlined).
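
      The underlying trick is the usual one of stashing a small counter in
      the otherwise-zero low bits of an aligned pointer; a generic, hedged
      sketch (not the driver's actual port_* helpers):

      #include <assert.h>
      #include <stdint.h>

      /* Requests are at least 8-byte aligned, so the low 3 bits of the
       * pointer are free to carry a small submission count. */
      #define COUNT_MASK ((uintptr_t)0x7)

      struct request;     /* opaque for the sketch */

      static inline uintptr_t port_pack(struct request *rq, unsigned int count)
      {
              assert(count <= COUNT_MASK);                  /* count must fit */
              assert(((uintptr_t)rq & COUNT_MASK) == 0);    /* pointer aligned */
              return (uintptr_t)rq | count;
      }

      static inline struct request *packed_request(uintptr_t packed)
      {
              return (struct request *)(packed & ~COUNT_MASK);
      }

      static inline unsigned int packed_count(uintptr_t packed)
      {
              return packed & COUNT_MASK;
      }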
      
      v2: Use the port_api() for guc as well (even though we do not pack any
      counters in there yet) and hide all port->request_count accesses inside
      the helpers.
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Mika Kuoppala <mika.kuoppala@intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170517121007.27224-5-chris@chris-wilson.co.uk
  8. 12 May, 2017 1 commit
  9. 04 May, 2017 1 commit
  10. 28 Apr, 2017 1 commit
    • drm/i915: Sanitize engine context sizes · 63ffbcda
      Joonas Lahtinen committed
      Pre-calculate engine context size based on engine class and device
      generation and store it in the engine instance.
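
      The shape of such a lookup is roughly the following (a hedged sketch;
      the class names and sizes are made up, not the driver's tables): pick
      the context-image size once per engine from (class, gen) and keep the
      rounding.

      #include <linux/kernel.h>
      #include <linux/mm.h>
      #include <linux/sizes.h>
      #include <linux/types.h>

      enum sketch_engine_class { SKETCH_RENDER_CLASS, SKETCH_OTHER_CLASS };

      /* Pick the context-image size once at engine init from (class, gen)
       * instead of recomputing it for every context allocation. The numbers
       * here are made up for the sketch. */
      static u32 sketch_context_size(enum sketch_engine_class class, int gen)
      {
              u32 size;

              switch (class) {
              case SKETCH_RENDER_CLASS:
                      size = gen >= 9 ? 22 * SZ_4K : 20 * SZ_4K;
                      break;
              default:
                      size = SZ_4K;
                      break;
              }

              return round_up(size, PAGE_SIZE);   /* keep the rounding noted in v3 */
      }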
      
      v2:
      - Squash and get rid of hw_context_size (Chris)
      
      v3:
      - Move after MMIO init for probing on Gen7 and 8 (Chris)
      - Retained rounding (Tvrtko)
      
      v4:
      - Rebase for deferred legacy context allocation
      Signed-off-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Paulo Zanoni <paulo.r.zanoni@intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Oscar Mateo <oscar.mateo@intel.com>
      Cc: Zhenyu Wang <zhenyuw@linux.intel.com>
      Cc: intel-gvt-dev@lists.freedesktop.org
      Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
  11. 25 Apr, 2017 2 commits
  12. 12 Apr, 2017 2 commits
  13. 03 Apr, 2017 1 commit
    • drm/i915: intel_ring.engine is unused · d822bb18
      Chris Wilson committed
      Or rather, it is used only by intel_ring_pin() to extract the
      drm_i915_private, which we can easily pass in. As this is a relatively
      rare operation, save the space in the struct; the extra code for passing
      the parameter around roughly breaks even:
      
      add/remove: 0/0 grow/shrink: 2/3 up/down: 15/-15 (0)
      function                                     old     new   delta
      intel_init_ring_buffer                       906     918     +12
      execlists_context_pin                       1308    1311      +3
      mock_engine                                  407     403      -4
      intel_engine_create_ring                     367     363      -4
      intel_ring_pin                               326     319      -7
      Total: Before=1261794, After=1261794, chg +0.00%
      
      v2: Reorder intel_init_ring_buffer to keep the ring setup together:
      
      add/remove: 0/0 grow/shrink: 2/3 up/down: 9/-15 (-6)
      function                                     old     new   delta
      intel_init_ring_buffer                       906     912      +6
      execlists_context_pin                       1308    1311      +3
      mock_engine                                  407     403      -4
      intel_engine_create_ring                     367     363      -4
      intel_ring_pin                               326     319      -7
      Total: Before=1261794, After=1261788, chg -0.00%
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170403113426.25707-1-chris@chris-wilson.co.uk
  14. 31 Mar, 2017 1 commit
  15. 29 Mar, 2017 3 commits
  16. 27 Mar, 2017 4 commits
  17. 24 Mar, 2017 1 commit
  18. 21 Mar, 2017 4 commits
  19. 20 Mar, 2017 1 commit
  20. 17 Mar, 2017 4 commits
  21. 16 Mar, 2017 1 commit
    • drm/i915/scheduler: emulate a scheduler for guc · 31de7350
      Chris Wilson committed
      This emulates execlists on top of the GuC in order to defer submission of
      requests to the hardware. This deferral allows time for high priority
      requests to gazump their way to the head of the queue; however, it nerfs
      the GuC by converting it back into a simple execlist (where the CPU has
      to wake up after every request to feed new commands into the GuC).
      
      v2: Drop hack status - though iirc there is still a lockdep inversion
      between fence and engine->timeline->lock (which is impossible as the
      nesting only occurs on different fences - hopefully just requires some
      judicious lockdep annotation)
      v3: Apply lockdep nesting to enabling signaling on the request, using
      the pattern we already have in __i915_gem_request_submit();
      v4: Replaying requests after a hang also now needs the timeline
      spinlock, to disable the interrupts at least
      v5: Hold wq lock for completeness, and emit a tracepoint for enabling signal
      v6: Reorder interrupt checking for a happier gcc.
      v7: Only signal the tasklet after a user-interrupt if using guc scheduling
      v8: Restore lost update of rq through the i915_guc_irq_handler (Tvrtko)
      v9: Avoid re-initialising the engine->irq_tasklet from inside a reset
      v10: Hook up the execlists-style tracepoints
      v11: Clear the execlists irq_posted bit after taking over the interrupt/tasklet
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170316125619.6856-1-chris@chris-wilson.co.uk
      Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>
  22. 03 Mar, 2017 3 commits