- 02 7月, 2016 4 次提交
-
-
由 Chris Wilson 提交于
The gen2 w/a buffer is stuffed into the same slot as the gen5+ scratch buffer. If we pass in the size we want to allocate for the scratch buffer, both callers can use the same routine. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-11-git-send-email-chris@chris-wilson.co.uk
-
由 Chris Wilson 提交于
After the elimination of using the scratch page for Ironlake's breadcrumb, we no longer need to kmap the object. We therefore can move it into the high unmappable space and do not need to force the object to be coherent (i.e. snooped on !llc platforms). Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-9-git-send-email-chris@chris-wilson.co.uk
-
由 Chris Wilson 提交于
By using the same address for storing the HWS on every platform, we can remove the platform specific vfuncs and reduce the get-seqno routine to a single read of a cached memory location. v2: Fix semaphore_passed() to look at the signaling engine (not the waiter's) Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-8-git-send-email-chris@chris-wilson.co.uk
-
由 Chris Wilson 提交于
One particularly stressful scenario consists of many independent tasks all competing for GPU time and waiting upon the results (e.g. realtime transcoding of many, many streams). One bottleneck in particular is that each client waits on its own results, but every client is woken up after every batchbuffer - hence the thunder of hooves as then every client must do its heavyweight dance to read a coherent seqno to see if it is the lucky one. Ideally, we only want one client to wake up after the interrupt and check its request for completion. Since the requests must retire in order, we can select the first client on the oldest request to be woken. Once that client has completed his wait, we can then wake up the next client and so on. However, all clients then incur latency as every process in the chain may be delayed for scheduling - this may also then cause some priority inversion. To reduce the latency, when a client is added or removed from the list, we scan the tree for completed seqno and wake up all the completed waiters in parallel. Using igt/benchmarks/gem_latency, we can demonstrate this effect. The benchmark measures the number of GPU cycles between completion of a batch and the client waking up from a call to wait-ioctl. With many concurrent waiters, with each on a different request, we observe that the wakeup latency before the patch scales nearly linearly with the number of waiters (before external factors kick in making the scaling much worse). After applying the patch, we can see that only the single waiter for the request is being woken up, providing a constant wakeup latency for every operation. However, the situation is not quite as rosy for many waiters on the same request, though to the best of my knowledge this is much less likely in practice. Here, we can observe that the concurrent waiters incur extra latency from being woken up by the solitary bottom-half, rather than directly by the interrupt. This appears to be scheduler induced (having discounted adverse effects from having a rbtree walk/erase in the wakeup path), each additional wake_up_process() costs approximately 1us on big core. Another effect of performing the secondary wakeups from the first bottom-half is the incurred delay this imposes on high priority threads - rather than immediately returning to userspace and leaving the interrupt handler to wake the others. To offset the delay incurred with additional waiters on a request, we could use a hybrid scheme that did a quick read in the interrupt handler and dequeued all the completed waiters (incurring the overhead in the interrupt handler, not the best plan either as we then incur GPU submission latency) but we would still have to wake up the bottom-half every time to do the heavyweight slow read. Or we could only kick the waiters on the seqno with the same priority as the current task (i.e. in the realtime waiter scenario, only it is woken up immediately by the interrupt and simply queues the next waiter before returning to userspace, minimising its delay at the expense of the chain, and also reducing contention on its scheduler runqueue). This is effective at avoid long pauses in the interrupt handler and at avoiding the extra latency in realtime/high-priority waiters. v2: Convert from a kworker per engine into a dedicated kthread for the bottom-half. v3: Rename request members and tweak comments. v4: Use a per-engine spinlock in the breadcrumbs bottom-half. v5: Fix race in locklessly checking waiter status and kicking the task on adding a new waiter. v6: Fix deciding when to force the timer to hide missing interrupts. v7: Move the bottom-half from the kthread to the first client process. v8: Reword a few comments v9: Break the busy loop when the interrupt is unmasked or has fired. v10: Comments, unnecessary churn, better debugging from Tvrtko v11: Wake all completed waiters on removing the current bottom-half to reduce the latency of waking up a herd of clients all waiting on the same request. v12: Rearrange missed-interrupt fault injection so that it works with igt/drv_missed_irq_hang v13: Rename intel_breadcrumb and friends to intel_wait in preparation for signal handling. v14: RCU commentary, assert_spin_locked v15: Hide BUG_ON behind the compiler; report on gem_latency findings. v16: Sort seqno-groups by priority so that first-waiter has the highest task priority (and so avoid priority inversion). v17: Add waiters to post-mortem GPU hang state. v18: Return early for a completed wait after acquiring the spinlock. Avoids adding ourselves to the tree if the is already complete, and skips the awkward question of why we don't do completion wakeups for waits earlier than or equal to ourselves. v19: Prepare for init_breadcrumbs to fail. Later patches may want to allocate during init, so be prepared to propagate back the error code. Testcase: igt/gem_concurrent_blit Testcase: igt/benchmarks/gem_latency Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com> Cc: "Gong, Zhipeng" <zhipeng.gong@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: "Goel, Akash" <akash.goel@intel.com> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> #v18 Link: http://patchwork.freedesktop.org/patch/msgid/1467390209-3576-6-git-send-email-chris@chris-wilson.co.uk
-
- 01 7月, 2016 1 次提交
-
-
由 Tvrtko Ursulin 提交于
Replace the macro initializer with a programatic loop which results in smaller code and hopefully just as clear. v2: Rebase. Signed-off-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk>
-
- 24 5月, 2016 1 次提交
-
-
由 Chris Wilson 提交于
Our goal is to rename the anonymous per-engine struct beneath the current intel_context. However, after a lively debate resolving around the confusion between intel_context_engine and intel_engine_context, the realisation is that the two structs target different users. The outer struct is API / user facing, and so carries the higher level GEM information. The inner struct is hw facing. Thus we want to name the inner struct intel_context and the outer one i915_gem_context. As the first step, we need to rename the current struct: s/struct intel_context/struct i915_gem_context/ which fits much better with its constructors already conveying the i915_gem_context prefix! Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1464098023-3294-1-git-send-email-chris@chris-wilson.co.uk
-
- 23 5月, 2016 2 次提交
-
-
由 Chris Wilson 提交于
Combine the near identical implementations of intel_logical_ring_begin() and intel_ring_begin() - the only difference is that the logical wait has to check for a matching ring (which is assumed by legacy). In the process some debug messages are culled as there were following a WARN if we hit an actual error. v2: Updated commentary Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1461833819-3991-12-git-send-email-chris@chris-wilson.co.uk (cherry picked from commit 987046ad) Signed-off-by: NJani Nikula <jani.nikula@intel.com>
-
由 Chris Wilson 提交于
With the introduction of a distinct engine->id vs the hardware id, we need to fix up the value we use for selecting the target engine when signaling a semaphore. Note that these values can be merged with engine->guc_id. Fixes: de1add36 Cc: stable@vger.kernel.org # v4.6 Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: NVille Syrjälä <ville.syrjala@linux.intel.com> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1461932305-14637-3-git-send-email-chris@chris-wilson.co.uk (cherry picked from commit 215a7e32) Signed-off-by: NJani Nikula <jani.nikula@intel.com>
-
- 09 5月, 2016 1 次提交
-
-
由 Chris Wilson 提交于
text data bss dec hex filename 6309351 3578714 696320 10584385 a18141 vmlinux 6308391 3578714 696320 10583425 a17d81 vmlinux Almost 1KiB of code reduction. v2: More s/INTEL_INFO()->gen/INTEL_GEN()/ and IS_GENx() conversions text data bss dec hex filename 6304579 3578778 696320 10579677 a16edd vmlinux 6303427 3578778 696320 10578525 a16a5d vmlinux Now over 1KiB! Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1462545621-30125-3-git-send-email-chris@chris-wilson.co.uk
-
- 29 4月, 2016 3 次提交
-
-
由 Chris Wilson 提交于
With the introduction of a distinct engine->id vs the hardware id, we need to fix up the value we use for selecting the target engine when signaling a semaphore. Note that these values can be merged with engine->guc_id. Fixes: de1add36Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: NVille Syrjälä <ville.syrjala@linux.intel.com> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1461932305-14637-3-git-send-email-chris@chris-wilson.co.uk
-
由 Chris Wilson 提交于
For legacy ringbuffer mode, we need the new ordered breadcrumb emission tried and tested on execlists in order to avoid the dreaded "missed interrupt" syndrome. A secondary advantage of the execlists method is that it writes to an arbitrary address, useful if one wants to write a breadcrumb elsewhere. This fix is taken from commit 7c17d377 (drm/i915: Use ordered seqno write interrupt generation on gen8+ execlists). Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1461932305-14637-1-git-send-email-chris@chris-wilson.co.uk
-
由 Chris Wilson 提交于
With 5 rings and a flush, we need 192 bytes of space to emit the breadcrumb and semaphores. However, we need some spare room the size of the single largest packet (36 dwords, 144 bytes) to accommodate wraparound giving a grand total of 336 bytes Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1461917226-9132-1-git-send-email-chris@chris-wilson.co.uk
-
- 28 4月, 2016 3 次提交
-
-
由 Tvrtko Ursulin 提交于
With the previous patch having extended the pinned lifetime of contexts by referencing the previous context from the current request until the latter is retired (completed by the GPU), we can now remove usage of execlist retired queue entirely. This is because the above now guarantees that all execlist object access requirements are satisfied by this new tracking, and we can stop taking additional references and stop keeping request on the execlists retired queue. The latter was a source of significant scalability issues in the driver causing performance hits on some tests. Most dramatical of which was igt/gem_close_race which had run time in tens of minutes which is now reduced to tens of seconds. Signed-off-by: NTvrtko Ursulin <tvrtko@ursulin.net> Cc: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk> Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Link: http://patchwork.freedesktop.org/patch/msgid/1461833819-3991-24-git-send-email-chris@chris-wilson.co.uk
-
由 Chris Wilson 提交于
Now that we share intel_ring_begin(), reserving space for the tail of the request is identical between legacy/execlists and so the tautology can be removed. In the process, we move the reserved space tracking from the ringbuffer on to the request. This is to enable us to reorder the reserved space allocation in the next patch. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1461833819-3991-13-git-send-email-chris@chris-wilson.co.uk
-
由 Chris Wilson 提交于
Combine the near identical implementations of intel_logical_ring_begin() and intel_ring_begin() - the only difference is that the logical wait has to check for a matching ring (which is assumed by legacy). In the process some debug messages are culled as there were following a WARN if we hit an actual error. v2: Updated commentary Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1461833819-3991-12-git-send-email-chris@chris-wilson.co.uk
-
- 12 4月, 2016 1 次提交
-
-
由 Tvrtko Ursulin 提交于
Rather than blindly waking up all forcewake domains on command submission, we can teach each engine what is (or are) the correct one to take. On platforms with multiple forcewake domains like VLV, CHV, SKL and BXT, this has the potential of lowering the GPU and CPU power use and submission latency. To implement it we add a function named intel_uncore_forcewake_for_reg whose purpose is to query which forcewake domains need to be taken to read or write a specific register with raw mmio accessors. These enables the execlists engine setup to query which forcewake domains are relevant per engine on the currently running platform. v2: * Kerneldoc. * Split from intel_uncore.c macro extraction, WARN_ON, no warns on old platforms. (Chris Wilson) v3: * Single domain per engine, mention all registers, bi-directional function and a new name, fix handling of gen6 and gen7 writes. (Chris Wilson) Signed-off-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk> Link: http://patchwork.freedesktop.org/patch/msgid/1460468251-14069-1-git-send-email-tvrtko.ursulin@linux.intel.com
-
- 09 4月, 2016 4 次提交
-
-
由 Chris Wilson 提交于
When reading from the HWS page, we use barrier() to prevent the compiler optimising away the read from the volatile (may be updated by the GPU) memory address. This is more suited to READ_ONCE(); make it so. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Reviewed-by: NMika Kuoppala <mika.kuoppala@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1460195877-20520-5-git-send-email-chris@chris-wilson.co.uk
-
由 Chris Wilson 提交于
Rather than call a function to compute the matching cachelines and clflush them, just call the clflush *instruction* directly. We also know that we can use the unpatched plain clflush rather than the clflushopt alternative. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Cc: Imre Deak <imre.deak@intel.com> Reviewed-by: NMika Kuoppala <mika.kuoppala@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1460195877-20520-4-git-send-email-chris@chris-wilson.co.uk
-
由 Chris Wilson 提交于
Only declare a missed interrupt if we find that the GPU is idle with waiters and a hangcheck interval has passed in which no new user interrupts have been raised. v2: Clear the stuck interrupt marker between successful batches Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Reviewed-by: NMika Kuoppala <mika.kuoppala@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1460195877-20520-3-git-send-email-chris@chris-wilson.co.uk
-
由 Chris Wilson 提交于
In order to simplify future patches, extract the lazy_coherency optimisation our of the engine->get_seqno() vfunc into its own callback. v2: Rename the barrier to engine->irq_seqno_barrier to try and better reflect that the barrier is only required after the user interrupt before reading the seqno (to ensure that the seqno update lands in time as we do not have strict seqno-irq ordering on all platforms). Reviewed-by: Dave Gordon <david.s.gordon@intel.com> [#v2] v3: Comments for hangcheck paranoia. Mika wanted to keep the extra barrier inside the hangcheck, just in case. I can argue that it doesn't provide a barrier against anything, but the side-effects of applying the barrier may prevent a false declaration of a hung GPU. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Cc: Dave Gordon <david.s.gordon@intel.com> Reviewed-by: NMika Kuoppala <mika.kuoppala@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1460195877-20520-2-git-send-email-chris@chris-wilson.co.uk
-
- 08 4月, 2016 1 次提交
-
-
由 Chris Wilson 提交于
We reuse the same calculation into two macros, and I want to add a third user. Time to refactor. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1460010558-10705-5-git-send-email-chris@chris-wilson.co.uk
-
- 04 4月, 2016 1 次提交
-
-
由 Tvrtko Ursulin 提交于
Doing a lot of work in the interrupt handler introduces huge latencies to the system as a whole. Most dramatic effect can be seen by running an all engine stress test like igt/gem_exec_nop/all where, when the kernel config is lean enough, the whole system can be brought into multi-second periods of complete non-interactivty. That can look for example like this: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/u8:3:143] Modules linked in: [redacted for brevity] CPU: 0 PID: 143 Comm: kworker/u8:3 Tainted: G U L 4.5.0-160321+ #183 Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1 Workqueue: i915 gen6_pm_rps_work [i915] task: ffff8800aae88000 ti: ffff8800aae90000 task.ti: ffff8800aae90000 RIP: 0010:[<ffffffff8104a3c2>] [<ffffffff8104a3c2>] __do_softirq+0x72/0x1d0 RSP: 0000:ffff88014f403f38 EFLAGS: 00000206 RAX: ffff8800aae94000 RBX: 0000000000000000 RCX: 00000000000006e0 RDX: 0000000000000020 RSI: 0000000004208060 RDI: 0000000000215d80 RBP: ffff88014f403f80 R08: 0000000b1b42c180 R09: 0000000000000022 R10: 0000000000000004 R11: 00000000ffffffff R12: 000000000000a030 R13: 0000000000000082 R14: ffff8800aa4d0080 R15: 0000000000000082 FS: 0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa53b90c000 CR3: 0000000001a0a000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Stack: 042080601b33869f ffff8800aae94000 00000000fffc2678 ffff88010000000a 0000000000000000 000000000000a030 0000000000005302 ffff8800aa4d0080 0000000000000206 ffff88014f403f90 ffffffff8104a716 ffff88014f403fa8 Call Trace: <IRQ> [<ffffffff8104a716>] irq_exit+0x86/0x90 [<ffffffff81031e7d>] smp_apic_timer_interrupt+0x3d/0x50 [<ffffffff814f3eac>] apic_timer_interrupt+0x7c/0x90 <EOI> [<ffffffffa01c5b40>] ? gen8_write64+0x1a0/0x1a0 [i915] [<ffffffff814f2b39>] ? _raw_spin_unlock_irqrestore+0x9/0x20 [<ffffffffa01c5c44>] gen8_write32+0x104/0x1a0 [i915] [<ffffffff8132c6a2>] ? n_tty_receive_buf_common+0x372/0xae0 [<ffffffffa017cc9e>] gen6_set_rps_thresholds+0x1be/0x330 [i915] [<ffffffffa017eaf0>] gen6_set_rps+0x70/0x200 [i915] [<ffffffffa0185375>] intel_set_rps+0x25/0x30 [i915] [<ffffffffa01768fd>] gen6_pm_rps_work+0x10d/0x2e0 [i915] [<ffffffff81063852>] ? finish_task_switch+0x72/0x1c0 [<ffffffff8105ab29>] process_one_work+0x139/0x350 [<ffffffff8105b186>] worker_thread+0x126/0x490 [<ffffffff8105b060>] ? rescuer_thread+0x320/0x320 [<ffffffff8105fa64>] kthread+0xc4/0xe0 [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 [<ffffffff814f351f>] ret_from_fork+0x3f/0x70 [<ffffffff8105f9a0>] ? kthread_create_on_node+0x170/0x170 I could not explain, or find a code path, which would explain a +20 second lockup, but from some instrumentation it was apparent the interrupts off proportion of time was between 10-25% under heavy load which is quite bad. When a interrupt "cliff" is reached, which was >~320k irq/s on my machine, the whole system goes into a terrible state of the above described multi-second lockups. By moving the GT interrupt handling to a tasklet in a most simple way, the problem above disappears completely. Testing the effect on sytem-wide latencies using igt/gem_syslatency shows the following before this patch: gem_syslatency: cycles=1532739, latency mean=416531.829us max=2499237us gem_syslatency: cycles=1839434, latency mean=1458099.157us max=4998944us gem_syslatency: cycles=1432570, latency mean=2688.451us max=1201185us gem_syslatency: cycles=1533543, latency mean=416520.499us max=2498886us This shows that the unrelated process is experiencing huge delays in its wake-up latency. After the patch the results look like this: gem_syslatency: cycles=808907, latency mean=53.133us max=1640us gem_syslatency: cycles=862154, latency mean=62.778us max=2117us gem_syslatency: cycles=856039, latency mean=58.079us max=2123us gem_syslatency: cycles=841683, latency mean=56.914us max=1667us Showing a huge improvement in the unrelated process wake-up latency. It also shows an approximate halving in the number of total empty batches submitted during the test. This may not be worrying since the test puts the driver under a very unrealistic load with ncpu threads doing empty batch submission to all GPU engines each. Another benefit compared to the hard-irq handling is that now work on all engines can be dispatched in parallel since we can have up to number of CPUs active tasklets. (While previously a single hard-irq would serially dispatch on one engine after another.) More interesting scenario with regards to throughput is "gem_latency -n 100" which shows 25% better throughput and CPU usage, and 14% better dispatch latencies. I did not find any gains or regressions with Synmark2 or GLbench under light testing. More benchmarking is certainly required. v2: * execlists_lock should be taken as spin_lock_bh when queuing work from userspace now. (Chris Wilson) * uncore.lock must be taken with spin_lock_irq when submitting requests since that now runs from either softirq or process context. v3: * Expanded commit message with more testing data; * converted missed locking sites to _bh; * added execlist_lock comment. (Chris Wilson) v4: * Mention dispatch parallelism in commit. (Chris Wilson) * Do not hold uncore.lock over MMIO reads since the block is already serialised per-engine via the tasklet itself. (Chris Wilson) * intel_lrc_irq_handler should be static. (Chris Wilson) * Cancel/sync the tasklet on GPU reset. (Chris Wilson) * Document and WARN that tasklet cannot be active/pending on engine cleanup. (Chris Wilson/Imre Deak) Signed-off-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Imre Deak <imre.deak@intel.com> Testcase: igt/gem_exec_nop/all Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=94350Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk> Link: http://patchwork.freedesktop.org/patch/msgid/1459768316-6670-1-git-send-email-tvrtko.ursulin@linux.intel.com
-
- 21 3月, 2016 1 次提交
-
-
由 Jordan Justen 提交于
For Haswell, we will want another table of registers while retaining the large common table of whitelisted registers shared by all gen7 devices. Signed-off-by: NJordan Justen <jordan.l.justen@intel.com> Reviewed-by: NFrancisco Jerez <currojerez@riseup.net> [danvet: Pipe patch through sed -e 's/\<ring\>/engine/g' to make it apply.] Signed-off-by: NDaniel Vetter <daniel.vetter@ffwll.ch>
-
- 16 3月, 2016 5 次提交
-
-
由 Tvrtko Ursulin 提交于
This time using only sed and a few by hand. v2: Rename also intel_ring_id and intel_ring_initialized. v3: Fixed typo in intel_ring_initialized. Signed-off-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk> Link: http://patchwork.freedesktop.org/patch/msgid/1458126040-33105-1-git-send-email-tvrtko.ursulin@linux.intel.com
-
由 Tvrtko Ursulin 提交于
Some trivial ones, first pass done with Coccinelle: @@ @@ ( - I915_NUM_RINGS + I915_NUM_ENGINES | - intel_ring_flag + intel_engine_flag | - for_each_ring + for_each_engine | - i915_gem_request_get_ring + i915_gem_request_get_engine | - intel_ring_idle + intel_engine_idle | - i915_gem_reset_ring_status + i915_gem_reset_engine_status | - i915_gem_reset_ring_cleanup + i915_gem_reset_engine_cleanup | - init_ring_lists + init_engine_lists ) But that didn't fully work so I cleaned it up with: for f in *.[hc]; do sed -i -e s/I915_NUM_RINGS/I915_NUM_ENGINES/ $f; done for f in *.[hc]; do sed -i -e s/i915_gem_request_get_ring/i915_gem_request_get_engine/ $f; done for f in *.[hc]; do sed -i -e s/intel_ring_flag/intel_engine_flag/ $f; done for f in *.[hc]; do sed -i -e s/intel_ring_idle/intel_engine_idle/ $f; done for f in *.[hc]; do sed -i -e s/init_ring_lists/init_engine_lists/ $f; done for f in *.[hc]; do sed -i -e s/i915_gem_reset_ring_cleanup/i915_gem_reset_engine_cleanup/ $f; done for f in *.[hc]; do sed -i -e s/i915_gem_reset_ring_status/i915_gem_reset_engine_status/ $f; done v2: Rebase. Signed-off-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk>
-
由 Tvrtko Ursulin 提交于
below and a couple manual fixups. @@ identifier I, J; @@ struct I { ... - struct intel_engine_cs *J; + struct intel_engine_cs *engine; ... } @@ identifier I, J; @@ struct I { ... - struct intel_engine_cs J; + struct intel_engine_cs engine; ... } @@ struct drm_i915_private *d; @@ ( - d->ring + d->engine ) @@ struct i915_execbuffer_params *p; @@ ( - p->ring + p->engine ) @@ struct intel_ringbuffer *r; @@ ( - r->ring + r->engine ) @@ struct drm_i915_gem_request *req; @@ ( - req->ring + req->engine ) v2: Script missed the tracepoint code - fixed up by hand. Signed-off-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk>
-
由 Tvrtko Ursulin 提交于
@@ identifier func; @@ func(..., struct intel_engine_cs * - ring + engine , ...) { <... - ring + engine ...> } @@ identifier func; type T; @@ T func(..., struct intel_engine_cs * - ring + engine , ...); Signed-off-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk>
-
由 Tvrtko Ursulin 提交于
Done by the Coccinelle script below plus a manual intervention to GEN8_RING_SEMAPHORE_INIT. @@ expression E; @@ - struct intel_engine_cs *ring = E; + struct intel_engine_cs *engine = E; <+... - ring + engine ...+> @@ @@ - struct intel_engine_cs *ring; + struct intel_engine_cs *engine; <+... - ring + engine ...+> Signed-off-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk>
-
- 04 3月, 2016 1 次提交
-
-
由 Mika Kuoppala 提交于
With full-ppgtt, it takes the GPU an eon to traverse the entire 256PiB address space, causing a loop to be detected. Under the current scheme, if ACTHD walks off the end of a batch buffer and into an empty address space, we "never" detect the hang. If we always increment the score as the ACTHD is progressing then we will eventually timeout (after ~46.5s (31 * 1.5s) without advancing onto a new batch). To counter act this, increase the amount we reduce the score for good batches, so that only a series of almost-bad batches trigger a full reset. DoS detection suffers slightly but series of long running shader tests will benefit. Based on a patch from Chris Wilson. Testcase: igt/drv_hangman/hangcheck-unterminated Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: NMika Kuoppala <mika.kuoppala@intel.com> Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk> Link: http://patchwork.freedesktop.org/patch/msgid/1456930109-21532-1-git-send-email-mika.kuoppala@intel.com
-
- 01 3月, 2016 1 次提交
-
-
由 Tvrtko Ursulin 提交于
Assorted changes in the areas of code cleanup, reduction of invariant conditional in the interrupt handler and lock contention and MMIO access optimisation. * Remove needless initialization. * Improve cache locality by reorganizing code and/or using branch hints to keep unexpected or error conditions out of line. * Favor busy submit path vs. empty queue. * Less branching in hot-paths. v2: * Avoid mmio reads when possible. (Chris Wilson) * Use natural integer size for csb indices. * Remove useless return value from execlists_update_context. * Extract 32-bit ppgtt PDPs update so it is out of line and shared with two callers. * Grab forcewake across all mmio operations to ease the load on uncore lock and use chepear mmio ops. v3: * Removed some more pointless u8 data types. * Removed unused return from execlists_context_queue. * Commit message updates. v4: * Unclumsify the unqueue if statement. (Chris Wilson) * Hide forcewake from the queuing function. (Chris Wilson) Version 3 now makes the irq handling code path ~20% smaller on 48-bit PPGTT hardware, and a little bit less elsewhere. Hot paths are mostly in-line now and hammering on the uncore spinlock is greatly reduced together with mmio traffic to an extent. Benchmarking with "gem_latency -n 100" (keep submitting batches with 100 nop instruction) shows approximately 4% higher throughput, 2% less CPU time and 22% smaller latencies. This was on a big-core while small-cores could benefit even more. Most likely reason for the improvements are the MMIO optimization and uncore lock traffic reduction. One odd result is with "gem_latency -n 0" (dispatching empty batches) which shows 5% more throughput, 8% less CPU time, 25% better producer and consumer latencies, but 15% higher dispatch latency which is yet unexplained. Signed-off-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk> Link: http://patchwork.freedesktop.org/patch/msgid/1456505912-22286-1-git-send-email-tvrtko.ursulin@linux.intel.com
-
- 25 1月, 2016 1 次提交
-
-
由 Alex Dai 提交于
Previously GuC uses ring id as engine id because of same definition. But this is not true since this commit: commit de1add36 Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Date: Fri Jan 15 15:12:50 2016 +0000 drm/i915: Decouple execbuf uAPI from internal implementation Added GuC engine id into GuC interface to decouple it from ring id used by driver. v2: Keep ring name print out in debugfs; using for_each_ring() where possible to keep driver consistent. (Chris W.) Signed-off-by: NAlex Dai <yu.dai@intel.com> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1453579094-29860-1-git-send-email-yu.dai@intel.com
-
- 21 1月, 2016 4 次提交
-
-
由 Chris Wilson 提交于
Tvrtko was looking through the execbuffer-ioctl and noticed that the uABI was tightly coupled to our internal engine identifiers. Close inspection also revealed that we leak those internal engine identifiers through the busy-ioctl, and those internal identifiers already do not match the user identifiers. Fortuitiously, there is only one user of the set of busy rings from the busy-ioctl, and they only wish to choose between the RENDER and the BLT engines. Let's fix the userspace ABI while we still can. v2: Update the uAPI documentation to explain the identifiers. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Testcase: igt/gem_busy Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Acked-by: NDaniel Vetter <daniel.vetter@ffwll.ch> Link: http://patchwork.freedesktop.org/patch/msgid/1452876706-21620-1-git-send-email-chris@chris-wilson.co.uk
-
由 Tvrtko Ursulin 提交于
At the moment execbuf ring selection is fully coupled to internal ring ids which is not a good thing on its own. This dependency is also spread between two source files and not spelled out at either side which makes it hidden and fragile. This patch decouples this dependency by introducing an explicit translation table of execbuf uAPI to ring id close to the only call site (i915_gem_do_execbuffer). This way we are free to change driver internal implementation details without breaking userspace. All state relating to the uAPI is now contained in, or next to, i915_gem_do_execbuffer. As a side benefit, this patch decreases the compiled size of i915_gem_do_execbuffer. v2: Extract ring selection into eb_select_ring. (Chris Wilson) Signed-off-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk> Acked-by: NDaniel Vetter <daniel.vetter@ffwll.ch> Link: http://patchwork.freedesktop.org/patch/msgid/1452870770-13981-1-git-send-email-tvrtko.ursulin@linux.intel.com
-
由 Chris Wilson 提交于
Broadwell and later currently use the same unordered command sequence to update the seqno in the HWS status page and then assert the user interrupt. We should apply the w/a from legacy (where we do an mmio read to delay the seqno read after the interrupt), but this is not enough to enforce coherent seqno visibilty on Skylake. Rather than search for the proper post-interrupt seqno barrier, use a strongly ordered command sequence to write the seqno, then assert the user interrupt from the ring. v2: Move around the wa tail dwords to avoid adding duplicate code. v3: Add references, comments on workarounds and bit5 check. References: https://bugs.freedesktop.org/show_bug.cgi?id=93693 Testcase: igt/gem_ring_sync_loop #skl Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Reviewed-by: NMika Kuoppala <mika.kuoppala@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1453297415-17793-1-git-send-email-mika.kuoppala@intel.com
-
由 Dave Gordon 提交于
Now that we've eliminated a lot of uses of ring->default_context, we can eliminate the pointer itself. All the engines share the same default intel_context, so we can just keep a single reference to it in the dev_priv structure rather than one in each of the engine[] elements. This make refcounting more sensible too, as we now have a refcount of one for the one pointer, rather than a refcount of one but multiple pointers. From an idea by Chris Wilson. v2: transform an extra instance of ring->default_context introduced by 42f1cae8 drm/i915: Restore inhibiting the load of the default context That patch's commentary includes: v2: Mark the global default context as uninitialized on GPU reset so that the context-local workarounds are reloaded upon re-enabling The code implementing that now also benefits from the replacement of the multiple (per-ring) pointers to the default context with a single pointer to the unique kernel context. v4: Rebased, remove underused local (Nick Hoath) Signed-off-by: NDave Gordon <david.s.gordon@intel.com> Reviewed-by: NNick Hoath <nicholas.hoath@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Link: http://patchwork.freedesktop.org/patch/msgid/1453230175-19330-3-git-send-email-david.s.gordon@intel.comSigned-off-by: NDaniel Vetter <daniel.vetter@ffwll.ch>
-
- 20 1月, 2016 1 次提交
-
-
由 Jani Nikula 提交于
Apparently accidental or misplaced /** kernel-doc comments, confusing the tool. Turn them to normal comments. Reviewed-by: NDaniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: NJani Nikula <jani.nikula@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1453101588-18008-2-git-send-email-jani.nikula@intel.com
-
- 18 1月, 2016 2 次提交
-
-
由 Tvrtko Ursulin 提交于
Purpose is to avoid calling i915_gem_obj_ggtt_offset from the interrupt context without the big lock held. v2: Renamed gtt_start to gtt_offset. (Daniel Vetter) v3: Cache the VMA instead of address. (Chris Wilson) Signed-off-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Chris Wilson <chris@chris-wilson.co.uk> Link: http://patchwork.freedesktop.org/patch/msgid/1452870629-13830-2-git-send-email-tvrtko.ursulin@linux.intel.com
-
由 Tvrtko Ursulin 提交于
LRC code was calling GEM API like i915_gem_obj_ggtt_offset from places where the struct_mutex cannot be grabbed (irq handlers). To avoid that this patch caches some interesting bits and values in the engine and context structures. Some usages are also removed where they are not needed like a few asserts which are either impossible or have been checked already during engine initialization. Side benefit is also that interrupt handlers and command submission stop evaluating invariant conditionals, like what Gen we are running on, on every interrupt and every command submitted. This patch deals with logical ring context id and descriptors while subsequent patches will deal with the remaining issues. v2: * Cache the VMA instead of the address. (Chris Wilson) * Incorporate Dave Gordon's good comments and function name. v3: * Extract ctx descriptor template to a function and group functions dealing with ctx descriptor & co together near top of the file. (Dave Gordon) Signed-off-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dave Gordon <david.s.gordon@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1452870629-13830-1-git-send-email-tvrtko.ursulin@linux.intel.com
-
- 08 1月, 2016 1 次提交
-
-
由 Mika Kuoppala 提交于
If head seems stuck and engine in question is rcs, inspect subunit state transitions from undone to done, before deciding that this really is a hang instead of limited progress. Only account the transitions of subunits from undone to done once, to prevent unstable subunit states to keep us falsely active. As this adds one extra steps to hangcheck heuristics, before hang is declared, it adds 1500ms to to detect hang for render ring to a total of 7500ms. We could sample the subunit states on first head stuck condition but decide not to do so only in order to mimic old behaviour. This way the check order of promotion from seqno > atchd > instdone is consistently done. v2: Deal with unstable done states (Arun) Clear instdone progress on head and seqno movement (Chris) Report raw and accumulated instdone's in in debugfs (Chris) Return HANGCHECK_ACTIVE on undone->done References: https://bugs.freedesktop.org/show_bug.cgi?id=93029 Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Dave Gordon <david.s.gordon@intel.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Arun Siluvery <arun.siluvery@linux.intel.com> Reviewed-by: NChris Wilson <chris@chris-wilson.co.uk> Signed-off-by: NMika Kuoppala <mika.kuoppala@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1448985372-19535-1-git-send-email-mika.kuoppala@intel.com
-
- 10 12月, 2015 1 次提交
-
-
由 Dave Gordon 提交于
Based on Chris Wilson's patch from 6 months ago, rebased and adapted. The current implementation of intel_ring_initialized() is too heavyweight; it's a non-inlined function that chases several levels of pointers. This wouldn't matter too much if it were rarely called, but it's used inside the iterator test of for_each_ring() and is therefore called quite frequently. So let's make it simple and inline ... The idea here is to use ring->dev as an indicator showing which engines have been initialised and are therefore to be included in iterations that use for_each_ring(). This allows us to avoid multiple memory references and a (non-inlined) function call on each iteration of each such loop. Fixes regression from commit 48d82387 Author: Oscar Mateo <oscar.mateo@intel.com> Date: Thu Jul 24 17:04:23 2014 +0100 drm/i915/bdw: Generic logical ring init and cleanup Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Signed-off-by: NDave Gordon <david.s.gordon@intel.com> Signed-off-by: NDaniel Vetter <daniel.vetter@ffwll.ch> Link: http://patchwork.freedesktop.org/patch/msgid/1449586956-32360-2-git-send-email-david.s.gordon@intel.com
-