1. 08 5月, 2019 1 次提交
  2. 07 5月, 2019 1 次提交
  3. 03 5月, 2019 1 次提交
  4. 02 5月, 2019 1 次提交
  5. 01 5月, 2019 1 次提交
  6. 27 4月, 2019 3 次提交
  7. 26 4月, 2019 2 次提交
    • C
      drm/i915: Enable render context support for gen4 (Broadwater to Cantiga) · 9ce9bdb0
      Chris Wilson 提交于
      Broadwater and the rest of gen4  do support being able to saving and
      reloading context specific registers between contexts, providing isolation
      of the basic GPU state (as programmable by userspace). This allows
      userspace to assume that the GPU retains their state from one batch to the
      next, minimising the amount of state it needs to reload and manually save
      across batches.
      
      v2: CONSTANT_BUFFER woes
      
      Running through piglit turned up an interesting issue, a GPU hang inside
      the context load. The context image includes the CONSTANT_BUFFER command
      that loads an address into a on-gpu buffer, and the context load was
      executing that immediately. However, since it was reading from the GTT
      there is no guarantee that the GTT retains the same configuration as
      when the context was saved, resulting in stray reads and a GPU hang.
      
      Having tried issuing a CONSTANT_BUFFER (to disable the command) from the
      ring before saving the context to no avail, we resort to patching out
      the instruction inside the context image before loading.
      
      This does impose that gen4 always reissues CONSTANT_BUFFER commands on
      each batch, but due to the use of a shared GTT that was and will remain
      a requirement.
      
      v3: ECOSKPD to the rescue
      
      Ville found the magic bit in the ECOSKPD to disable saving and restoring
      the CONSTANT_BUFFER from the context image, thereby completely avoiding
      the GPU hangs from chasing invalid pointers. This appears to be the
      default behaviour for gen5, and so we just need to tweak gen4 to match.
      
      v4: Fix spelling of ECOSKPD and discover it already exists
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Cc: Kenneth Graunke <kenneth@whitecape.org>
      Reviewed-by: NKenneth Graunke <kenneth@whitecape.org>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190419172720.5462-1-chris@chris-wilson.co.uk
      9ce9bdb0
    • C
      drm/i915: Enable render context support for Ironlake (gen5) · 1215d28e
      Chris Wilson 提交于
      Ironlake does support being able to saving and reloading context specific
      registers between contexts, providing isolation of the basic GPU state
      (as programmable by userspace). This allows userspace to assume that the
      GPU retains their state from one batch to the next, minimising the
      amount of state it needs to reload, or manually save and restore.
      
      v2: Fix off-by-one in reading CXT_SIZE, and add a comment that the
      CXT_SIZE and context-layout do not match in bspec, but the difference is
      irrelevant as we overallocate the full page anyway (Ville).
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Cc: Kenneth Graunke <kenneth@whitecape.org>
      Reviewed-by: NVille Syrjälä <ville.syrjala@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190419111749.3910-2-chris@chris-wilson.co.uk
      1215d28e
  8. 25 4月, 2019 2 次提交
    • C
      drm/i915: Invert the GEM wakeref hierarchy · 79ffac85
      Chris Wilson 提交于
      In the current scheme, on submitting a request we take a single global
      GEM wakeref, which trickles down to wake up all GT power domains. This
      is undesirable as we would like to be able to localise our power
      management to the available power domains and to remove the global GEM
      operations from the heart of the driver. (The intent there is to push
      global GEM decisions to the boundary as used by the GEM user interface.)
      
      Now during request construction, each request is responsible via its
      logical context to acquire a wakeref on each power domain it intends to
      utilize. Currently, each request takes a wakeref on the engine(s) and
      the engines themselves take a chipset wakeref. This gives us a
      transition on each engine which we can extend if we want to insert more
      powermangement control (such as soft rc6). The global GEM operations
      that currently require a struct_mutex are reduced to listening to pm
      events from the chipset GT wakeref. As we reduce the struct_mutex
      requirement, these listeners should evaporate.
      
      Perhaps the biggest immediate change is that this removes the
      struct_mutex requirement around GT power management, allowing us greater
      flexibility in request construction. Another important knock-on effect,
      is that by tracking engine usage, we can insert a switch back to the
      kernel context on that engine immediately, avoiding any extra delay or
      inserting global synchronisation barriers. This makes tracking when an
      engine and its associated contexts are idle much easier -- important for
      when we forgo our assumed execution ordering and need idle barriers to
      unpin used contexts. In the process, it means we remove a large chunk of
      code whose only purpose was to switch back to the kernel context.
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Cc: Imre Deak <imre.deak@intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190424200717.1686-5-chris@chris-wilson.co.uk
      79ffac85
    • C
      drm/i915: Move GraphicsTechnology files under gt/ · 112ed2d3
      Chris Wilson 提交于
      Start partitioning off the code that talks to the hardware (GT) from the
      uapi layers and move the device facing code under gt/
      
      One casualty is s/intel_ringbuffer.h/intel_engine.h/ with the plan to
      subdivide that header and body further (and split out the submission
      code from the ringbuffer and logical context handling). This patch aims
      to be simple motion so git can fixup inflight patches with little mess.
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Acked-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Acked-by: NJani Nikula <jani.nikula@intel.com>
      Acked-by: NRodrigo Vivi <rodrigo.vivi@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190424174839.7141-1-chris@chris-wilson.co.uk
      112ed2d3
  9. 24 4月, 2019 1 次提交
  10. 11 4月, 2019 2 次提交
  11. 27 3月, 2019 3 次提交
  12. 22 3月, 2019 1 次提交
    • C
      drm/i915: Flush pages on acquisition · a679f58d
      Chris Wilson 提交于
      When we return pages to the system, we ensure that they are marked as
      being in the CPU domain since any external access is uncontrolled and we
      must assume the worst. This means that we need to always flush the pages
      on acquisition if we need to use them on the GPU, and from the beginning
      have used set-domain. Set-domain is overkill for the purpose as it is a
      general synchronisation barrier, but our intent is to only flush the
      pages being swapped in. If we move that flush into the pages acquisition
      phase, we know then that when we have obj->mm.pages, they are coherent
      with the GPU and need only maintain that status without resorting to
      heavy handed use of set-domain.
      
      The principle knock-on effect for userspace is through mmap-gtt
      pagefaulting. Our uAPI has always implied that the GTT mmap was async
      (especially as when any pagefault occurs is unpredicatable to userspace)
      and so userspace had to apply explicit domain control itself
      (set-domain). However, swapping is transparent to the kernel, and so on
      first fault we need to acquire the pages and make them coherent for
      access through the GTT. Our use of set-domain here leaks into the uABI
      that the first pagefault was synchronous. This is unintentional and
      baring a few igt should be unoticed, nevertheless we bump the uABI
      version for mmap-gtt to reflect the change in behaviour.
      
      Another implication of the change is that gem_create() is presumed to
      create an object that is coherent with the CPU and is in the CPU write
      domain, so a set-domain(CPU) following a gem_create() would be a minor
      operation that merely checked whether we could allocate all pages for
      the object. On applying this change, a set-domain(CPU) causes a clflush
      as we acquire the pages. This will have a small impact on mesa as we move
      the clflush here on !llc from execbuf time to create, but that should
      have minimal performance impact as the same clflush exists but is now
      done early and because of the clflush issue, userspace recycles bo and
      so should resist allocating fresh objects.
      
      Internally, the presumption that objects are created in the CPU
      write-domain and remain so through writes to obj->mm.mapping is more
      prevalent than I expected; but easy enough to catch and apply a manual
      flush.
      
      For the future, we should push the page flush from the central
      set_pages() into the callers so that we can more finely control when it
      is applied, but for now doing it one location is easier to validate, at
      the cost of sometimes flushing when there is no need.
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Matthew Auld <matthew.william.auld@gmail.com>
      Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
      Cc: Antonio Argenziano <antonio.argenziano@intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Reviewed-by: NMatthew Auld <matthew.william.auld@gmail.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190321161908.8007-1-chris@chris-wilson.co.uk
      a679f58d
  13. 21 3月, 2019 2 次提交
  14. 08 3月, 2019 4 次提交
  15. 06 3月, 2019 3 次提交
  16. 02 3月, 2019 1 次提交
    • C
      drm/i915: Use HW semaphores for inter-engine synchronisation on gen8+ · e8861964
      Chris Wilson 提交于
      Having introduced per-context seqno, we now have a means to identity
      progress across the system without feel of rollback as befell the
      global_seqno. That is we can program a MI_SEMAPHORE_WAIT operation in
      advance of submission safe in the knowledge that our target seqno and
      address is stable.
      
      However, since we are telling the GPU to busy-spin on the target address
      until it matches the signaling seqno, we only want to do so when we are
      sure that busy-spin will be completed quickly. To achieve this we only
      submit the request to HW once the signaler is itself executing (modulo
      preemption causing us to wait longer), and we only do so for default and
      above priority requests (so that idle priority tasks never themselves
      hog the GPU waiting for others).
      
      As might be reasonably expected, HW semaphores excel in inter-engine
      synchronisation microbenchmarks (where the 3x reduced latency / increased
      throughput more than offset the power cost of spinning on a second ring)
      and have significant improvement (can be up to ~10%, most see no change)
      for single clients that utilize multiple engines (typically media players
      and transcoders), without regressing multiple clients that can saturate
      the system or changing the power envelope dramatically.
      
      v3: Drop the older NEQ branch, now we pin the signaler's HWSP anyway.
      v4: Tell the world and include it as part of scheduler caps.
      
      Testcase: igt/gem_exec_whisper
      Testcase: igt/benchmarks/gem_wsim
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190301170901.8340-3-chris@chris-wilson.co.uk
      e8861964
  17. 28 2月, 2019 3 次提交
  18. 27 2月, 2019 1 次提交
    • C
      drm/i915: Avoid waking the engines just to check if they are idle · 0b702dca
      Chris Wilson 提交于
      Exploit that reads of the ring registers return 0 from the engine when
      it is idle and we do not apply forcewake to know that if the engine is
      idle then both reads will be identical (and so we interpret the ring as
      idle).
      
      The ulterior motive is to try and reduce the number of spurious wakeups
      to avoid untimely death, such as:
      
      <3> [85.046836] [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
      <4> [85.051916] ------------[ cut here ]------------
      <4> [85.051917] GT thread status wait timed out
      <4> [85.051963] WARNING: CPU: 2 PID: 2195 at drivers/gpu/drm/i915/intel_uncore.c:303 __gen6_gt_wait_for_thread_c0+0x6e/0xa0 [i915]
      <4> [85.051964] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal coretemp mei_hdcp crct10dif_pclmul crc32_pclmul snd_hda_intel ghash_clmulni_intel snd_hda_codec broadcom bcm_phy_lib i2c_i801 snd_hwdep snd_hda_core tg3 snd_pcm ptp pps_core mei_me mei prime_numbers lpc_ich
      <4> [85.051980] CPU: 2 PID: 2195 Comm: drm_read Tainted: G     U            5.0.0-rc8-CI-CI_DRM_5662+ #1
      <4> [85.051981] Hardware name: Dell Inc. XPS 8300  /0Y2MRG, BIOS A06 10/17/2011
      <4> [85.052012] RIP: 0010:__gen6_gt_wait_for_thread_c0+0x6e/0xa0 [i915]
      <4> [85.052015] Code: 8b 92 5c 80 13 00 83 e2 07 75 d5 5b 5d c3 80 3d 5b 6a 1a 00 00 75 f4 48 c7 c7 38 21 31 a0 c6 05 4b 6a 1a 00 01 e8 e2 84 ea e0 <0f> 0b eb dd 80 3d 3a 6a 1a 00 00 75 98 48 c7 c6 08 21 31 a0 48 c7
      <4> [85.052016] RSP: 0018:ffffc9000043bd00 EFLAGS: 00010086
      <4> [85.052019] RAX: 0000000000000000 RBX: ffff888217c50000 RCX: 0000000000000000
      <4> [85.052020] RDX: 0000000000000007 RSI: ffffffff820cb141 RDI: 00000000ffffffff
      <4> [85.052022] RBP: 00000013cd30f2fb R08: 0000000000000000 R09: 0000000000000001
      <4> [85.052024] R10: ffffc9000043bce0 R11: 0000000000000000 R12: ffff888217c50ee0
      <4> [85.052025] R13: 0000000000000001 R14: 00000000ffffffff R15: ffff888218076530
      <4> [85.052028] FS:  00007fc79d049980(0000) GS:ffff888227a80000(0000) knlGS:0000000000000000
      <4> [85.052029] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      <4> [85.052031] CR2: 00007f782e2940f8 CR3: 000000022458e006 CR4: 00000000000606e0
      <4> [85.052033] Call Trace:
      <4> [85.052064]  gen6_read32+0x14e/0x250 [i915]
      <4> [85.052096]  intel_engine_is_idle+0x7d/0x180 [i915]
      <4> [85.052126]  intel_engines_are_idle+0x29/0x50 [i915]
      <4> [85.052153]  i915_drop_caches_set+0x21c/0x290 [i915]
      <4> [85.052160]  simple_attr_write+0xb0/0xd0
      <4> [85.052165]  full_proxy_write+0x51/0x80
      <4> [85.052170]  __vfs_write+0x31/0x190
      <4> [85.052176]  ? rcu_read_lock_sched_held+0x6f/0x80
      <4> [85.052178]  ? rcu_sync_lockdep_assert+0x29/0x50
      <4> [85.052181]  ? __sb_start_write+0x152/0x1f0
      <4> [85.052183]  ? __sb_start_write+0x163/0x1f0
      <4> [85.052187]  vfs_write+0xbd/0x1b0
      <4> [85.052191]  ksys_write+0x50/0xc0
      <4> [85.052196]  do_syscall_64+0x55/0x190
      <4> [85.052200]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      <4> [85.052202] RIP: 0033:0x7fc79c9d3281
      <4> [85.052204] Code: c3 0f 1f 84 00 00 00 00 00 48 8b 05 59 8d 20 00 c3 0f 1f 84 00 00 00 00 00 8b 05 8a d1 20 00 85 c0 75 16 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 57 f3 c3 0f 1f 44 00 00 41 54 55 49 89 d4 53
      <4> [85.052206] RSP: 002b:00007fffa4a0a7f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      <4> [85.052208] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fc79c9d3281
      <4> [85.052210] RDX: 0000000000000005 RSI: 00007fffa4a0a880 RDI: 0000000000000008
      <4> [85.052212] RBP: 00007fffa4a0a820 R08: 0000000000000000 R09: 0000000000000000
      <4> [85.052213] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fc79c9bc718
      <4> [85.052215] R13: 0000000000000003 R14: 00007fc79c9c1628 R15: 00007fc79c9bdd80
      <4> [85.052223] irq event stamp: 71630
      <4> [85.052226] hardirqs last  enabled at (71629): [<ffffffff8197b64c>] _raw_spin_unlock_irqrestore+0x4c/0x60
      <4> [85.052228] hardirqs last disabled at (71630): [<ffffffff8197b4bd>] _raw_spin_lock_irqsave+0xd/0x50
      <4> [85.052231] softirqs last  enabled at (70444): [<ffffffff81c0033a>] __do_softirq+0x33a/0x4b9
      <4> [85.052234] softirqs last disabled at (70433): [<ffffffff810b51b1>] irq_exit+0xd1/0xe0
      <4> [85.052264] WARNING: CPU: 2 PID: 2195 at drivers/gpu/drm/i915/intel_uncore.c:303 __gen6_gt_wait_for_thread_c0+0x6e/0xa0 [i915]
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Mika Kuoppala <mika.kuoppala@intel.com>
      Reviewed-by: NMika Kuoppala <mika.kuoppala@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190227114958.32438-1-chris@chris-wilson.co.uk
      0b702dca
  19. 26 2月, 2019 3 次提交
  20. 21 2月, 2019 1 次提交
  21. 12 2月, 2019 1 次提交
  22. 06 2月, 2019 1 次提交
  23. 30 1月, 2019 1 次提交
    • C
      drm/i915: Replace global breadcrumbs with per-context interrupt tracking · 52c0fdb2
      Chris Wilson 提交于
      A few years ago, see commit 688e6c72 ("drm/i915: Slaughter the
      thundering i915_wait_request herd"), the issue of handling multiple
      clients waiting in parallel was brought to our attention. The
      requirement was that every client should be woken immediately upon its
      request being signaled, without incurring any cpu overhead.
      
      To handle certain fragility of our hw meant that we could not do a
      simple check inside the irq handler (some generations required almost
      unbounded delays before we could be sure of seqno coherency) and so
      request completion checking required delegation.
      
      Before commit 688e6c72, the solution was simple. Every client
      waiting on a request would be woken on every interrupt and each would do
      a heavyweight check to see if their request was complete. Commit
      688e6c72 introduced an rbtree so that only the earliest waiter on
      the global timeline would woken, and would wake the next and so on.
      (Along with various complications to handle requests being reordered
      along the global timeline, and also a requirement for kthread to provide
      a delegate for fence signaling that had no process context.)
      
      The global rbtree depends on knowing the execution timeline (and global
      seqno). Without knowing that order, we must instead check all contexts
      queued to the HW to see which may have advanced. We trim that list by
      only checking queued contexts that are being waited on, but still we
      keep a list of all active contexts and their active signalers that we
      inspect from inside the irq handler. By moving the waiters onto the fence
      signal list, we can combine the client wakeup with the dma_fence
      signaling (a dramatic reduction in complexity, but does require the HW
      being coherent, the seqno must be visible from the cpu before the
      interrupt is raised - we keep a timer backup just in case).
      
      Having previously fixed all the issues with irq-seqno serialisation (by
      inserting delays onto the GPU after each request instead of random delays
      on the CPU after each interrupt), we can rely on the seqno state to
      perfom direct wakeups from the interrupt handler. This allows us to
      preserve our single context switch behaviour of the current routine,
      with the only downside that we lose the RT priority sorting of wakeups.
      In general, direct wakeup latency of multiple clients is about the same
      (about 10% better in most cases) with a reduction in total CPU time spent
      in the waiter (about 20-50% depending on gen). Average herd behaviour is
      improved, but at the cost of not delegating wakeups on task_prio.
      
      v2: Capture fence signaling state for error state and add comments to
      warm even the most cold of hearts.
      v3: Check if the request is still active before busywaiting
      v4: Reduce the amount of pointer misdirection with list_for_each_safe
      and using a local i915_request variable inside the loops
      v5: Add a missing pluralisation to a purely informative selftest message.
      
      References: 688e6c72 ("drm/i915: Slaughter the thundering i915_wait_request herd")
      Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
      Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20190129205230.19056-2-chris@chris-wilson.co.uk
      52c0fdb2