1. 16 6月, 2017 2 次提交
    • C
      drm/i915: Eliminate lots of iterations over the execobjects array · 2889caa9
      Chris Wilson 提交于
      The major scaling bottleneck in execbuffer is the processing of the
      execobjects. Creating an auxiliary list is inefficient when compared to
      using the execobject array we already have allocated.
      
      Reservation is then split into phases. As we lookup up the VMA, we
      try and bind it back into active location. Only if that fails, do we add
      it to the unbound list for phase 2. In phase 2, we try and add all those
      objects that could not fit into their previous location, with fallback
      to retrying all objects and evicting the VM in case of severe
      fragmentation. (This is the same as before, except that phase 1 is now
      done inline with looking up the VMA to avoid an iteration over the
      execobject array. In the ideal case, we eliminate the separate reservation
      phase). During the reservation phase, we only evict from the VM between
      passes (rather than currently as we try to fit every new VMA). In
      testing with Unreal Engine's Atlantis demo which stresses the eviction
      logic on gen7 class hardware, this speed up the framerate by a factor of
      2.
      
      The second loop amalgamation is between move_to_gpu and move_to_active.
      As we always submit the request, even if incomplete, we can use the
      current request to track active VMA as we perform the flushes and
      synchronisation required.
      
      The next big advancement is to avoid copying back to the user any
      execobjects and relocations that are not changed.
      
      v2: Add a Theory of Operation spiel.
      v3: Fall back to slow relocations in preparation for flushing userptrs.
      v4: Document struct members, factor out eb_validate_vma(), add a few
      more comments to explain some magic and hide other magic behind macros.
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      2889caa9
    • C
      drm/i915: Store a direct lookup from object handle to vma · 4ff4b44c
      Chris Wilson 提交于
      The advent of full-ppgtt lead to an extra indirection between the object
      and its binding. That extra indirection has a noticeable impact on how
      fast we can convert from the user handles to our internal vma for
      execbuffer. In order to bypass the extra indirection, we use a
      resizable hashtable to jump from the object to the per-ctx vma.
      rhashtable was considered but we don't need the online resizing feature
      and the extra complexity proved to undermine its usefulness. Instead, we
      simply reallocate the hastable on demand in a background task and
      serialize it before iterating.
      
      In non-full-ppgtt modes, multiple files and multiple contexts can share
      the same vma. This leads to having multiple possible handle->vma links,
      so we only use the first to establish the fast path. The majority of
      buffers are not shared and so we should still be able to realise
      speedups with multiple clients.
      
      v2: Prettier names, more magic.
      v3: Many style tweaks, most notably hiding the misuse of execobj[].rsvd2
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      4ff4b44c
  2. 15 6月, 2017 8 次提交
    • V
      drm/i915: Remove pipe A quirk remnants · e56134bc
      Ville Syrjälä 提交于
      With 830 the only thing needing pipe quirks, we can just drop the quirk
      defines and replace the checks with IS_I830() checks.
      Signed-off-by: NVille Syrjälä <ville.syrjala@linux.intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170601143619.27840-8-ville.syrjala@linux.intel.comAcked-by: NChris Wilson <chris@chris-wilson.co.uk>
      Acked-by: NMaarten Lankhorst <maarten.lankhorst@linux.intel.com>
      e56134bc
    • L
      drm/i915: add KBL GT2/GT3 check macros · 3891589e
      Lionel Landwerlin 提交于
      Add macros to detect GT2/GT3 skus so we can apply the proper OA
      configuration later.
      Signed-off-by: NLionel Landwerlin <lionel.g.landwerlin@intel.com>
      Reviewed-by: NMatthew Auld <matthew.auld@intel.com>
      Signed-off-by: NBen Widawsky <ben@bwidawsk.net>
      3891589e
    • R
      drm/i915/perf: remove perf.hook_lock · 1bef3409
      Robert Bragg 提交于
      In earlier iterations of the i915-perf driver we had a number of
      callbacks/hooks from other parts of the i915 driver to e.g. notify us
      when a legacy context was pinned and these could run asynchronously with
      respect to the stream file operations and might also run in atomic
      context.
      
      dev_priv->perf.hook_lock had been for serialising access to state needed
      within these callbacks, but as the code has evolved some of the hooks
      have gone away or are implemented to avoid needing to lock any state.
      
      The remaining use of this lock was actually redundant considering how
      the gen7 oacontrol state used to be updated as part of a context pin
      hook.
      Signed-off-by: NRobert Bragg <robert@sixbynine.org>
      Signed-off-by: NLionel Landwerlin <lionel.g.landwerlin@intel.com>
      Reviewed-by: NMatthew Auld <matthew.auld@intel.com>
      Signed-off-by: NBen Widawsky <ben@bwidawsk.net>
      1bef3409
    • R
      drm/i915/perf: per-gen timebase for checking sample freq · 155e941f
      Robert Bragg 提交于
      An oa_exponent_to_ns() utility and per-gen timebase constants where
      recently removed when updating the tail pointer race condition WA, and
      this restores those so we can update the _PROP_OA_EXPONENT validation
      done in read_properties_unlocked() to not assume we have a 12.5MHz
      timebase as we did for Haswell.
      
      Accordingly the oa_sample_rate_hard_limit value that's referenced by
      proc_dointvec_minmax defining the absolute limit for the OA sampling
      frequency is now initialized to (timestamp_frequency / 2) instead of the
      6.25MHz constant for Haswell.
      
      v2:
          Specify frequency of 19.2MHz for BXT (Ville)
          Initialize oa_sample_rate_hard_limit per-gen too (Lionel)
      Signed-off-by: NRobert Bragg <robert@sixbynine.org>
      Signed-off-by: NLionel Landwerlin <lionel.g.landwerlin@intel.com>
      Reviewed-by: NMatthew Auld <matthew.auld@intel.com>
      Signed-off-by: NBen Widawsky <ben@bwidawsk.net>
      155e941f
    • R
      drm/i915/perf: Add more OA configs for BDW, CHV, SKL + BXT · fc599211
      Robert Bragg 提交于
      These are auto generated from an XML description of metric sets,
      currently maintained in gputop, ref:
      
       https://github.com/rib/gputop
       > gputop-data/oa-*.xml
       > scripts/i915-perf-kernelgen.py
      
       $ make -C gputop-data -f Makefile.xml
      Signed-off-by: NRobert Bragg <robert@sixbynine.org>
      Signed-off-by: NLionel Landwerlin <lionel.g.landwerlin@intel.com>
      Reviewed-by: NMatthew Auld <matthew.auld@intel.com>
      Signed-off-by: NBen Widawsky <ben@bwidawsk.net>
      fc599211
    • R
      drm/i915/perf: Add OA unit support for Gen 8+ · 19f81df2
      Robert Bragg 提交于
      Enables access to OA unit metrics for BDW, CHV, SKL and BXT which all
      share (more-or-less) the same OA unit design.
      
      Of particular note in comparison to Haswell: some OA unit HW config
      state has become per-context state and as a consequence it is somewhat
      more complicated to manage synchronous state changes from the cpu while
      there's no guarantee of what context (if any) is currently actively
      running on the gpu.
      
      The periodic sampling frequency which can be particularly useful for
      system-wide analysis (as opposed to command stream synchronised
      MI_REPORT_PERF_COUNT commands) is perhaps the most surprising state to
      have become per-context save and restored (while the OABUFFER
      destination is still a shared, system-wide resource).
      
      This support for gen8+ takes care to consider a number of timing
      challenges involved in synchronously updating per-context state
      primarily by programming all config state from the cpu and updating all
      current and saved contexts synchronously while the OA unit is still
      disabled.
      
      The driver intentionally avoids depending on command streamer
      programming to update OA state considering the lack of synchronization
      between the automatic loading of OACTXCONTROL state (that includes the
      periodic sampling state and enable state) on context restore and the
      parsing of any general purpose BB the driver can control. I.e. this
      implementation is careful to avoid the possibility of a context restore
      temporarily enabling any out-of-date periodic sampling state. In
      addition to the risk of transiently-out-of-date state being loaded
      automatically; there are also internal HW latencies involved in the
      loading of MUX configurations which would be difficult to account for
      from the command streamer (and we only want to enable the unit when once
      the MUX configuration is complete).
      
      Since the Gen8+ OA unit design no longer supports clock gating the unit
      off for a single given context (which effectively stopped any progress
      of counters while any other context was running) and instead supports
      tagging OA reports with a context ID for filtering on the CPU, it means
      we can no longer hide the system-wide progress of counters from a
      non-privileged application only interested in metrics for its own
      context. Although we could theoretically try and subtract the progress
      of other contexts before forwarding reports via read() we aren't in a
      position to filter reports captured via MI_REPORT_PERF_COUNT commands.
      As a result, for Gen8+, we always require the
      dev.i915.perf_stream_paranoid to be unset for any access to OA metrics
      if not root.
      
      v5: Drain submitted requests when enabling metric set to ensure no
          lite-restore erases the context image we just updated (Lionel)
      
      v6: In addition to drain, switch to kernel context & update all
          context in place (Chris)
      
      v7: Add missing mutex_unlock() if switching to kernel context fails
          (Matthew)
      
      v8: Simplify OA period/flex-eu-counters programming by using the
          batchbuffer instead of modifying ctx-image (Lionel)
      
      v9: Back to updating the context image (due to erroneous testing,
          batchbuffer programming the OA unit doesn't actually work)
          (Lionel)
          Pin context before updating context image (Chris)
          Drop MMIO programming now that we switch to a kernel context with
          right values in initial context image (Chris)
      
      v10: Just pin_map the contexts we want to modify or let the
           configuration happen on first use (Chris)
      
      v11: Update kernel context OA config through the batchbuffer rather
           than on the fly ctx-image update (Lionel)
      
      v12: Rework OA context registers update again by swithing away from
           user contexts and reconfiguring the kernel context through the
           batchbuffer and updating all the other contexts' context image.
           Also take care to lock slice/subslice configuration when OA is
           on. (Lionel)
      
      v13: Request rpcs updates on all engine when updating the OA config
           (Lionel)
      
      v14: Drop any kind of rpcs management now that we monitor sseu
           configuration changes in a later patch (Lionel)
           Remove usleep after programming the NOA configs on Gen8+, this
           doesn't seem to be needed (Lionel)
      
      v15: Respect coding style for block comments (Chris)
      
      v16: Add missing i915_add_request() in case we fail to emit OA
           configuration (Matthew)
      Signed-off-by: NRobert Bragg <robert@sixbynine.org>
      Signed-off-by: NLionel Landwerlin <lionel.g.landwerlin@intel.com>
      Reviewed-by: Matthew Auld <matthew.auld@intel.com> \o/
      Signed-off-by: NBen Widawsky <ben@bwidawsk.net>
      19f81df2
    • R
      drm/i915/perf: Add 'render basic' Gen8+ OA unit configs · 5182f646
      Robert Bragg 提交于
      Adds a static OA unit, MUX, B Counter + Flex EU configurations for basic
      render metrics on Broadwell, Cherryview, Skylake and Broxton. These are
      auto generated from an XML description of metric sets, currently
      maintained in gputop, ref:
      
       https://github.com/rib/gputop
       > gputop-data/oa-*.xml
       > scripts/i915-perf-kernelgen.py
      
       $ make -C gputop-data -f Makefile.xml WHITELIST=RenderBasic
      
      v2: add newlines to debug messages + fix comment (Matthew Auld)
      Signed-off-by: NRobert Bragg <robert@sixbynine.org>
      Signed-off-by: NLionel Landwerlin <lionel.g.landwerlin@intel.com>
      Reviewed-by: NMatthew Auld <matthew.auld@intel.com>
      Signed-off-by: NBen Widawsky <ben@bwidawsk.net>
      5182f646
    • L
      drm/i915/perf: rework mux configurations queries · 3f488d99
      Lionel Landwerlin 提交于
      Gen8+ might have mux configurations per slices/subslices. Depending on
      whether slices/subslices have been fused off, only part of the
      configuration needs to be applied. This change reworks the mux
      configurations query mechanism to allow more than one set of registers
      to be programmed.
      
      v2: s/n_mux_regs/n_mux_configs/ (Matthew)
      Signed-off-by: NLionel Landwerlin <lionel.g.landwerlin@intel.com>
      Reviewed-by: NMatthew Auld <matthew.auld@intel.com>
      Signed-off-by: NBen Widawsky <ben@bwidawsk.net>
      3f488d99
  3. 13 6月, 2017 2 次提交
  4. 09 6月, 2017 1 次提交
  5. 07 6月, 2017 2 次提交
  6. 03 6月, 2017 2 次提交
  7. 30 5月, 2017 1 次提交
  8. 29 5月, 2017 1 次提交
  9. 26 5月, 2017 2 次提交
  10. 25 5月, 2017 1 次提交
    • J
      drm/i915: Serialize GTT/Aperture accesses on BXT · 0ef34ad6
      Jon Bloomfield 提交于
      BXT has a H/W issue with IOMMU which can lead to system hangs when
      Aperture accesses are queued within the GAM behind GTT Accesses.
      
      This patch avoids the condition by wrapping all GTT updates in stop_machine
      and using a flushing read prior to restarting the machine.
      
      The stop_machine guarantees no new Aperture accesses can begin while
      the PTE writes are being emmitted. The flushing read ensures that
      any following Aperture accesses cannot begin until the PTE writes
      have been cleared out of the GAM's fifo.
      
      Only FOLLOWING Aperture accesses need to be separated from in flight
      PTE updates. PTE Writes may follow tightly behind already in flight
      Aperture accesses, so no flushing read is required at the start of
      a PTE update sequence.
      
      This issue was reproduced by running
      	igt/gem_readwrite and
      	igt/gem_render_copy
      simultaneously from different processes, each in a tight loop,
      with INTEL_IOMMU enabled.
      
      This patch was originally published as:
      	drm/i915: Serialize GTT Updates on BXT
      
      v2: Move bxt/iommu detection into static function
          Remove #ifdef CONFIG_INTEL_IOMMU protection
          Make function names more reflective of purpose
          Move flushing read into static function
      
      v3: Tidy up for checkpatch.pl
      
      Testcase: igt/gem_concurrent_blit
      Signed-off-by: NJon Bloomfield <jon.bloomfield@intel.com>
      Cc: John Harrison <john.C.Harrison@intel.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/1495641251-30022-1-git-send-email-jon.bloomfield@intel.comReviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      0ef34ad6
  11. 18 5月, 2017 3 次提交
  12. 17 5月, 2017 1 次提交
  13. 15 5月, 2017 1 次提交
  14. 13 5月, 2017 3 次提交
    • R
      drm/i915/perf: rate limit spurious oa report notice · 712122ea
      Robert Bragg 提交于
      This change is pre-emptively aiming to avoid a potential cause of kernel
      logging noise in case some condition were to result in us seeing invalid
      OA reports.
      
      The workaround for the OA unit's tail pointer race condition is what
      avoids the primary known cause of invalid reports being seen and with
      that in place we aren't expecting to see this notice but it can't be
      entirely ruled out.
      
      Just in case some condition does lead to the notice then it's likely
      that it will be triggered repeatedly while attempting to append a
      sequence of reports and depending on the configured OA sampling
      frequency that might be a large number of repeat notices.
      
      v2: (Chris) avoid inconsistent warning on throttle with
          printk_ratelimit()
      v3: (Matt) init and summarise with stream init/close not driver init/fini
      Signed-off-by: NRobert Bragg <robert@sixbynine.org>
      Reviewed-by: NMatthew Auld <matthew.auld@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170511154345.962-9-lionel.g.landwerlin@intel.comSigned-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      712122ea
    • R
      drm/i915/perf: improve tail race workaround · 0dd860cf
      Robert Bragg 提交于
      There's a HW race condition between OA unit tail pointer register
      updates and writes to memory whereby the tail pointer can sometimes get
      ahead of what's been written out to the OA buffer so far (in terms of
      what's visible to the CPU).
      
      Although this can be observed explicitly while copying reports to
      userspace by checking for a zeroed report-id field in tail reports, we
      want to account for this earlier, as part of the _oa_buffer_check to
      avoid lots of redundant read() attempts.
      
      Previously the driver used to define an effective tail pointer that
      lagged the real pointer by a 'tail margin' measured in bytes derived
      from OA_TAIL_MARGIN_NSEC and the configured sampling frequency.
      Unfortunately this was flawed considering that the OA unit may also
      automatically generate non-periodic reports (such as on context switch)
      or the OA unit may be enabled without any periodic sampling.
      
      This improves how we define a tail pointer for reading that lags the
      real tail pointer by at least %OA_TAIL_MARGIN_NSEC nanoseconds, which
      gives enough time for the corresponding reports to become visible to the
      CPU.
      
      The driver now maintains two tail pointers:
       1) An 'aging' tail with an associated timestamp that is tracked until we
          can trust the corresponding data is visible to the CPU; at which point
          it is considered 'aged'.
       2) An 'aged' tail that can be used for read()ing.
      
      The two separate pointers let us decouple read()s from tail pointer aging.
      
      The tail pointers are checked and updated at a limited rate within a
      hrtimer callback (the same callback that is used for delivering POLLIN
      events) and since we're now measuring the wall clock time elapsed since
      a given tail pointer was read the mechanism no longer cares about
      the OA unit's periodic sampling frequency.
      
      The natural place to handle the tail pointer updates was in
      gen7_oa_buffer_is_empty() which is called as part of blocking reads and
      the hrtimer callback used for polling, and so this was renamed to
      oa_buffer_check() considering the added side effect while checking
      whether the buffer contains data.
      Signed-off-by: NRobert Bragg <robert@sixbynine.org>
      Reviewed-by: NMatthew Auld <matthew.auld@intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170511154345.962-6-lionel.g.landwerlin@intel.comSigned-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      0dd860cf
    • R
      drm/i915/perf: avoid read back of head register · f279020a
      Robert Bragg 提交于
      There's no need for the driver to keep reading back the head pointer
      from hardware since the hardware doesn't update it automatically. This
      way we can treat any invalid head pointer value as a software/driver
      bug instead of spurious hardware behaviour.
      
      This change is also a small stepping stone towards re-working how
      the head and tail state is managed as part of an improved workaround
      for the tail register race condition.
      Signed-off-by: NRobert Bragg <robert@sixbynine.org>
      Reviewed-by: NMatthew Auld <matthew.auld@intel.com>
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170511154345.962-4-lionel.g.landwerlin@intel.com
      f279020a
  15. 11 5月, 2017 1 次提交
    • V
      drm/i915: Support variable cursor height on ivb+ · 024faac7
      Ville Syrjälä 提交于
      IVB introduced the CUR_FBC_CTL register which allows reducing the cursor
      height down to 8 lines from the otherwise square cursor dimensions.
      Implement support for it. CUR_FBC_CTL can't be used when the cursor
      is rotated.
      
      Commandeer the otherwise unused cursor->cursor.size to track the
      current value of CUR_FBC_CTL to optimize away redundant CUR_FBC_CTL
      writes, and to notice when we need to arm the update via CURBASE if
      just CUR_FBC_CTL changes.
      
      v2: Reverse the gen check to make it sane
      v3: Only enable CUR_FBC_CTL when cursor is enabled, adapt to
          earlier code changes which means we now actually turn off
          the cursor when we're supposed to unlike v2
      v4: Add a comment about rotation vs. CUR_FBC_CTL,
          rebase due to 'dirty' (Chris)
      v5: Rebase to the atomic world
          Handle 180 degree rotation
          Add HAS_CUR_FBC()
      v6: Rebase
      v7: Rebase due to I915_WRITE_FW/uncore.lock
          s/size/fbc_ctl/
      Signed-off-by: NVille Syrjälä <ville.syrjala@linux.intel.com>
      Link: http://patchwork.freedesktop.org/patch/msgid/20170327185546.2977-12-ville.syrjala@linux.intel.comReviewed-by: NImre Deak <imre.deak@intel.com>
      024faac7
  16. 10 5月, 2017 2 次提交
  17. 09 5月, 2017 1 次提交
  18. 03 5月, 2017 3 次提交
  19. 02 5月, 2017 2 次提交
  20. 28 4月, 2017 1 次提交
    • J
      drm/i915: Eliminate HAS_HW_CONTEXTS · f2e4d76e
      Joonas Lahtinen 提交于
      HAS_HW_CONTEXTS is misleading condition for GPU reset and CCID,
      replace it with Gen specific (to be updated in next patches).
      
      HAS_HW_CONTEXTS in i915_l3_write is bogus because each HAS_L3_DPF
      match also has .has_hw_contexts = 1 set.
      
      This leads to us being able to get rid of the property completely.
      
      v2:
      - Keep the checks at Gen6 for no functional change (Ville)
      Signed-off-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Mika Kuoppala <mika.kuoppala@intel.com>
      Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
      f2e4d76e