- 30 1月, 2019 2 次提交
-
-
由 Chris Wilson 提交于
A few years ago, see commit 688e6c72 ("drm/i915: Slaughter the thundering i915_wait_request herd"), the issue of handling multiple clients waiting in parallel was brought to our attention. The requirement was that every client should be woken immediately upon its request being signaled, without incurring any cpu overhead. To handle certain fragility of our hw meant that we could not do a simple check inside the irq handler (some generations required almost unbounded delays before we could be sure of seqno coherency) and so request completion checking required delegation. Before commit 688e6c72, the solution was simple. Every client waiting on a request would be woken on every interrupt and each would do a heavyweight check to see if their request was complete. Commit 688e6c72 introduced an rbtree so that only the earliest waiter on the global timeline would woken, and would wake the next and so on. (Along with various complications to handle requests being reordered along the global timeline, and also a requirement for kthread to provide a delegate for fence signaling that had no process context.) The global rbtree depends on knowing the execution timeline (and global seqno). Without knowing that order, we must instead check all contexts queued to the HW to see which may have advanced. We trim that list by only checking queued contexts that are being waited on, but still we keep a list of all active contexts and their active signalers that we inspect from inside the irq handler. By moving the waiters onto the fence signal list, we can combine the client wakeup with the dma_fence signaling (a dramatic reduction in complexity, but does require the HW being coherent, the seqno must be visible from the cpu before the interrupt is raised - we keep a timer backup just in case). Having previously fixed all the issues with irq-seqno serialisation (by inserting delays onto the GPU after each request instead of random delays on the CPU after each interrupt), we can rely on the seqno state to perfom direct wakeups from the interrupt handler. This allows us to preserve our single context switch behaviour of the current routine, with the only downside that we lose the RT priority sorting of wakeups. In general, direct wakeup latency of multiple clients is about the same (about 10% better in most cases) with a reduction in total CPU time spent in the waiter (about 20-50% depending on gen). Average herd behaviour is improved, but at the cost of not delegating wakeups on task_prio. v2: Capture fence signaling state for error state and add comments to warm even the most cold of hearts. v3: Check if the request is still active before busywaiting v4: Reduce the amount of pointer misdirection with list_for_each_safe and using a local i915_request variable inside the loops v5: Add a missing pluralisation to a purely informative selftest message. References: 688e6c72 ("drm/i915: Slaughter the thundering i915_wait_request herd") Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190129205230.19056-2-chris@chris-wilson.co.uk
-
由 Chris Wilson 提交于
To allow requests to forgo a common execution timeline, one question we need to be able to answer is "is this request running?". To track whether a request has started on HW, we can emit a breadcrumb at the beginning of the request and check its timeline's HWSP to see if the breadcrumb has advanced past the start of this request. (This is in contrast to the global timeline where we need only ask if we are on the global timeline and if the timeline has advanced past the end of the previous request.) There is still confusion from a preempted request, which has already started but relinquished the HW to a high priority request. For the common case, this discrepancy should be negligible. However, for identification of hung requests, knowing which one was running at the time of the hang will be much more important. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190129185452.20989-2-chris@chris-wilson.co.uk
-
- 29 1月, 2019 2 次提交
-
-
由 Chris Wilson 提交于
Now that we have allocated ourselves a cacheline to store a breadcrumb, we can emit a write from the GPU into the timeline's HWSP of the per-context seqno as we complete each request. This drops the mirroring of the per-engine HWSP and allows each context to operate independently. We do not need to unwind the per-context timeline, and so requests are always consistent with the timeline breadcrumb, greatly simplifying the completion checks as we no longer need to be concerned about the global_seqno changing mid check. One complication though is that we have to be wary that the request may outlive the HWSP and so avoid touching the potentially danging pointer after we have retired the fence. We also have to guard our access of the HWSP with RCU, the release of the obj->mm.pages should already be RCU-safe. At this point, we are emitting both per-context and global seqno and still using the single per-engine execution timeline for resolving interrupts. v2: s/fake_complete/mark_complete/ Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190128181812.22804-5-chris@chris-wilson.co.uk
-
由 Chris Wilson 提交于
Supplement the per-engine HWSP with a per-timeline HWSP. That is a per-request pointer through which we can check a local seqno, abstracting away the presumption of a global seqno. In this first step, we point each request back into the engine's HWSP so everything continues to work with the global timeline. v2: s/i915_request_hwsp/hwsp_seqno/ to emphasis that this is the current HW value and that we are accessing it via i915_request merely as a convenience. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: NJohn Harrison <John.C.Harrison@Intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190128181812.22804-1-chris@chris-wilson.co.uk
-
- 22 1月, 2019 1 次提交
-
-
由 Chris Wilson 提交于
A repeated pattern is to test the signaled bit of our request->fence.flags. Make this an inline to shorten a few lines and remove unnecessary line continuations. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20190121222117.23305-20-chris@chris-wilson.co.uk
-
- 27 12月, 2018 1 次提交
-
-
由 Brajeswar Ghosh 提交于
Remove i915_scheduler.h which is included more than once Signed-off-by: NBrajeswar Ghosh <brajeswar.linux@gmail.com> Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Link: https://patchwork.freedesktop.org/patch/msgid/20181225132340.GA2584@hp-pavilion-15-notebook-pc-brajeswar
-
- 02 10月, 2018 2 次提交
-
-
由 Chris Wilson 提交于
Latency is in the eye of the beholder. In the case where a client stops and waits for the gpu, give that request chain a small priority boost (not so that it overtakes higher priority clients, to preserve the external ordering) so that ideally the wait completes earlier. v2: Tvrtko recommends to keep the boost-from-user-stall as small as possible and to allow new client flows to be preferred for interactivity over stalls. Testcase: igt/gem_sync/switch-default Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20181001144755.7978-3-chris@chris-wilson.co.uk
-
由 Chris Wilson 提交于
Currently, the backend scheduling code abuses struct_mutex into order to have a global lock to manipulate a temporary list (without widespread allocation) and to protect against list modifications. This is an extraneous coupling to struct_mutex and further can not extend beyond the local device. Pull all the code that needs to be under the one true lock into i915_scheduler.c, and make it so. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20181001144755.7978-2-chris@chris-wilson.co.uk
-
- 14 9月, 2018 1 次提交
-
-
由 Chris Wilson 提交于
If we try and fail to allocate a i915_request, we apply some backpressure on the clients to throttle the memory allocations coming from i915.ko. Currently, we wait until completely idle, but this is far too heavy and leads to some situations where the only escape is to declare a client hung and reset the GPU. The intent is to only ratelimit the allocation requests and to allow ourselves to recycle requests and memory from any long queues built up by a client hog. Although the system memory is inherently a global resources, we don't want to overly penalize an unlucky client to pay the price of reaping a hog. To reduce the influence of one client on another, we can instead of waiting for the entire GPU to idle, impose a barrier on the local client. (One end goal for request allocation is for scalability to many concurrent allocators; simultaneous execbufs.) To prevent ourselves from getting caught out by long running requests (requests that may never finish without userspace intervention, whom we are blocking) we need to impose a finite timeout, ideally shorter than hangcheck. A long time ago Paul McKenney suggested that RCU users should ratelimit themselves using judicious use of cond_synchronize_rcu(). This gives us the opportunity to reduce our indefinite wait for the GPU to idle to a wait for the RCU grace period of the previous allocation along this timeline to expire, satisfying both the local and finite properties we desire for our ratelimiting. There are still a few global steps (reclaim not least amongst those!) when we exhaust the immediate slab pool, at least now the wait is itself decoupled from struct_mutex for our glorious highly parallel future! Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106680Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180914080017.30308-1-chris@chris-wilson.co.uk
-
- 07 8月, 2018 1 次提交
-
-
由 Chris Wilson 提交于
We have a few instances of checking seqno-1 to see if the HW has started the request. Pull those together under a helper. v2: Pull the !seqno assertion higher, as given seqno==1 we may indeed check to see if we have started using seqno==0. Suggested-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: NMika Kuoppala <mika.kuoppala@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180806112605.20725-1-chris@chris-wilson.co.uk
-
- 07 7月, 2018 2 次提交
-
-
由 Chris Wilson 提交于
In the next patch, we will want to be able to use more flexible request timelines that can hop between engines. From the vma pov, we can then not rely on the binding of this request to an engine and so can not ensure that different requests are ordered through a per-engine timeline, and so we must track activity of all timelines. (We track activity on the vma itself to prevent unbinding from HW before the HW has finished accessing it.) v2: Switch to a rbtree for 32b safety (since using u64 as a radixtree index is fraught with aliasing of unsigned longs). v3: s/lookup_active/active_instance/ because we can never agree on names Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180706103947.15919-5-chris@chris-wilson.co.uk
-
由 Chris Wilson 提交于
In the next patch, we will want to start skipping requests on failing to complete their payloads. So export the utility function current used to make requests inoperable following a failed gpu reset. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180706103947.15919-2-chris@chris-wilson.co.uk
-
- 14 6月, 2018 1 次提交
-
-
由 Chris Wilson 提交于
For symmetry, simplicity and ensuring the request is always truly idle upon its completion, always emit the closing flush prior to emitting the request breadcrumb. Previously, we would only emit the flush if we had started a user batch, but this just leaves all the other paths open to speculation (do they affect the GPU caches or not?) With mm switching, a key requirement is that the GPU is flushed and invalidated before hand, so for absolute safety, we want that closing flush be mandatory. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180612105135.4459-1-chris@chris-wilson.co.uk
-
- 11 6月, 2018 1 次提交
-
-
由 Chris Wilson 提交于
The discovery with trying to enable full-ppgtt was that we were completely failing to the load both the mm and context following the reset. Although we were performing mmio to set the PP_DIR (per-process GTT) and CCID (context), these were taking no effect (the assumption was that this would trigger reload of the context and restore the page tables). It was not until we performed the LRI + MI_SET_CONTEXT in a following context switch would anything occur. Since we are then required to reset the context image and PP_DIR using CS commands, we place those commands into every batch. The hardware should recognise the no-ops and eliminate the expensive context loads, but we still have to pay the cost of using cross-powerwell register writes. In practice, this has no effect on actual context switch times, and only adds a few hundred nanoseconds to no-op switches. We can improve the latter by eliminating the w/a around known no-op switches, but there is an ulterior motive to keeping them. Always emitting the context switch at the beginning of the request (and relying on HW to skip unneeded switches) does have one key advantage. Should we implement request reordering on Haswell, we will not know in advance what the previous executing context was on the GPU and so we would not be able to elide the MI_SET_CONTEXT commands ourselves and always have to emit them. Having our hand forced now actually prepares us for later. Now since that context and mm follow the request, we no longer (and not for a long time since requests took over!) require a trace point to tell when we write the switch into the ring, since it is always. (This is even more important when you remember that simply writing into the ring bears no relation to the current mm.) v2: Sandybridge has to agree to use LRI as well. Testcase: igt/drv_selftests/live_hangcheck Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Cc: Matthew Auld <matthew.william.auld@gmail.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180611110845.31890-1-chris@chris-wilson.co.uk
-
- 01 6月, 2018 1 次提交
-
-
由 Chris Wilson 提交于
During suspend we want to flush out all active contexts and their rendering. To do so we queue a request from the kernel's context, once we know that request is done, we know the GPU is completely idle. To speed up that switch bump the GPU clocks. Switching to the kernel context prior to idling is also used to enforce a barrier before changing OA properties, and when evicting active rendering from the global GTT. All cases where we do want to race-to-idle. v2: Limit the boosting to only the switch before suspend. v3: Limit it to the wait-for-idle on suspend. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: David Weinehall <david.weinehall@linux.intel.com> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Tested-by: David Weinehall <david.weinehall@linux.intel.com> #v1 Reviewed-by: NMika Kuoppala <mika.kuoppala@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180531082246.9763-2-chris@chris-wilson.co.uk
-
- 18 5月, 2018 2 次提交
-
-
由 Chris Wilson 提交于
To ease the frequent and ugly pointer dance of &request->gem_context->engine[request->engine->id] during request submission, store that pointer as request->hw_context. One major advantage that we will exploit later is that this decouples the logical context state from the engine itself. v2: Set mock_context->ops so we don't crash and burn in selftests. Cleanups from Tvrtko. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Acked-by: NZhenyu Wang <zhenyuw@linux.intel.com> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180517212633.24934-3-chris@chris-wilson.co.uk
-
由 Chris Wilson 提交于
In the next patch, we want to store the intel_context pointer inside i915_request, as it is frequently access via a convoluted dance when submitting the request to hw. Having two context pointers inside i915_request leads to confusion so first rename the existing i915_gem_context pointer to i915_request.gem_context. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180517212633.24934-1-chris@chris-wilson.co.uk
-
- 03 5月, 2018 1 次提交
-
-
由 Chris Wilson 提交于
We need to move to a more flexible timeline that doesn't assume one fence context per engine, and so allow for a single timeline to be used across a combination of engines. This means that preallocating a fence context per engine is now a hindrance, and so we want to introduce the singular timeline. From the code perspective, this has the notable advantage of clearing up a lot of mirky semantics and some clumsy pointer chasing. By splitting the timeline up into a single entity rather than an array of per-engine timelines, we can realise the goal of the previous patch of tracking the timeline alongside the ring. v2: Tweak wait_for_idle to stop the compiling thinking that ret may be uninitialised. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180502163839.3248-2-chris@chris-wilson.co.uk
-
- 19 4月, 2018 3 次提交
-
-
由 Chris Wilson 提交于
Today we only want to pass along the priority to engine->schedule(), but in the future we want to have much more control over the various aspects of the GPU during a context's execution, for example controlling the frequency allowed. As we need an ever growing number of parameters for scheduling, move those into a struct for convenience. v2: Move the anonymous struct into its own function for legibility and ye olde gcc. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180418184052.7129-3-chris@chris-wilson.co.uk
-
由 Chris Wilson 提交于
Having moved the priotree struct into i915_scheduler.h, identify it as the scheduling element and rebrand into i915_sched. This becomes more useful as we start attaching more information we require to propagate through the scheduler. v2: Use i915_sched_node for future distinctiveness Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180418184052.7129-2-chris@chris-wilson.co.uk
-
由 Chris Wilson 提交于
Over time the priotree has grown from a sorted list to a more complicated structure for propagating constraints along the dependency chain to try and resolve priority inversion. Start to segregate this information from the rest of the request/fence tracking. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180418184052.7129-1-chris@chris-wilson.co.uk
-
- 06 3月, 2018 1 次提交
-
-
由 Chris Wilson 提交于
The goal here is to try and reduce the latency of signaling additional requests following the wakeup from interrupt by reducing the list of to-be-signaled requests from an rbtree to a sorted linked list. The original choice of using an rbtree was to facilitate random insertions of request into the signaler while maintaining a sorted list. However, if we assume that most new requests are added when they are submitted, we see those new requests in execution order making a insertion sort fast, and the reduction in overhead of each signaler iteration significant. Since commit 56299fb7 ("drm/i915: Signal first fence from irq handler if complete"), we signal most fences directly from notify_ring() in the interrupt handler greatly reducing the amount of work that actually needs to be done by the signaler kthread. All the thread is then required to do is operate as the bottom-half, cleaning up after the interrupt handler and preparing the next waiter. This includes signaling all later completed fences in a saturated system, but on a mostly idle system we only have to rebuild the wait rbtree in time for the next interrupt. With this de-emphasis of the signaler's role, we want to rejig it's datastructures to reduce the amount of work we require to both setup the signal tree and maintain it on every interrupt. References: 56299fb7 ("drm/i915: Signal first fence from irq handler if complete") Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180222092545.17216-1-chris@chris-wilson.co.uk
-
- 22 2月, 2018 1 次提交
-
-
由 Chris Wilson 提交于
We want to de-emphasize the link between the request (dependency, execution and fence tracking) from GEM and so rename the struct from drm_i915_gem_request to i915_request. That is we may implement the GEM user interface on top of requests, but they are an abstraction for tracking execution rather than an implementation detail of GEM. (Since they are not tied to HW, we keep the i915 prefix as opposed to intel.) In short, the spatch: @@ @@ - struct drm_i915_gem_request + struct i915_request A corollary to contracting the type name, we also harmonise on using 'rq' shorthand for local variables where space if of the essence and repetition makes 'request' unwieldy. For globals and struct members, 'request' is still much preferred for its clarity. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Michał Winiarski <michal.winiarski@intel.com> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180221095636.6649-1-chris@chris-wilson.co.ukReviewed-by: NMika Kuoppala <mika.kuoppala@linux.intel.com> Reviewed-by: NMichał Winiarski <michal.winiarski@intel.com> Acked-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
-
- 19 1月, 2018 1 次提交
-
-
由 Chris Wilson 提交于
Watching a light workload on Baytrail (running glxgears and a 1080p decode), instead of the system remaining at low frequency, the glxgears would regularly trigger waitboosting after which it would have to spend a few seconds throttling back down. In this case, the waitboosting is counter productive as the minimal wait for glxgears doesn't prevent it from functioning correctly and delivering frames on time. In this case, glxgears happens to almost always be waiting on the current request, which we already expect to complete quickly (see i915_spin_request) and so avoiding the waitboost on the active request and spinning instead provides the best latency without overcommitting to upclocking. However, if the system falls behind we still force the waitboost. Similarly, we will also trigger upclocking if we detect the system is not delivering frames on time - again using a mechanism that tries to detect a miss and not preemptively upclock. v2: Also skip boosting for after missed vblank if the desired request is already active. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Radoslaw Szwichtenberg <radoslaw.szwichtenberg@intel.com> Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180118131609.16574-1-chris@chris-wilson.co.uk
-
- 03 1月, 2018 2 次提交
-
-
由 Chris Wilson 提交于
Back up our comment that all signalers should have been signaled before we ourselves were retired with an assert to that effect. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Reviewed-by: NMichał Winiarski <michal.winiarski@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180102151235.3949-9-chris@chris-wilson.co.uk
-
由 Chris Wilson 提交于
i915_gem_request_assign() is not used since commit 77f0d0e9 ("drm/i915/execlists: Pack the count into the low bits of the port.request"), so remove the defunct code References: 77f0d0e9 ("drm/i915/execlists: Pack the count into the low bits of the port.request") Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Reviewed-by: NRodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180102151235.3949-1-chris@chris-wilson.co.uk
-
- 13 12月, 2017 1 次提交
-
-
由 Chris Wilson 提交于
If a fence allocation fails in a blocking context, we will sleep on the fence as a last resort. We can therefore allow ourselves to fail and sleep on the fence instead of triggering a system-wide oom. This allows us to throttle malicious clients that are consuming lots of system resources by capping the amount of memory used by fences. Testcase: igt/gem_shrink/execbufX Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20171212180652.22061-2-chris@chris-wilson.co.uk
-
- 05 10月, 2017 1 次提交
-
-
由 Chris Wilson 提交于
Use a priority stored in the context as the initial value when submitting a request. This allows us to change the default priority on a per-context basis, allowing different contexts to be favoured with GPU time at the expense of lower importance work. The user can adjust the context's priority via I915_CONTEXT_PARAM_PRIORITY, with more positive values being higher priority (they will be serviced earlier, after their dependencies have been resolved). Any prerequisite work for an execbuf will have its priority raised to match the new request as required. Normal users can specify any value in the range of -1023 to 0 [default], i.e. they can reduce the priority of their workloads (and temporarily boost it back to normal if so desired). Privileged users can specify any value in the range of -1023 to 1023, [default is 0], i.e. they can raise their priority above all overs and so potentially starve the system. Note that the existing schedulers are not fair, nor load balancing, the execution is strictly by priority on a first-come, first-served basis, and the driver may choose to boost some requests above the range available to users. This priority was originally based around nice(2), but evolved to allow clients to adjust their priority within a small range, and allow for a privileged high priority range. For example, this can be used to implement EGL_IMG_context_priority https://www.khronos.org/registry/egl/extensions/IMG/EGL_IMG_context_priority.txt EGL_CONTEXT_PRIORITY_LEVEL_IMG determines the priority level of the context to be created. This attribute is a hint, as an implementation may not support multiple contexts at some priority levels and system policy may limit access to high priority contexts to appropriate system privilege level. The default value for EGL_CONTEXT_PRIORITY_LEVEL_IMG is EGL_CONTEXT_PRIORITY_MEDIUM_IMG." so we can map PRIORITY_HIGH -> 1023 [privileged, will failback to 0] PRIORITY_MED -> 0 [default] PRIORITY_LOW -> -1023 They also map onto the priorities used by VkQueue (and a VkQueue is essentially a timeline, our i915_gem_context under full-ppgtt). v2: s/CAP_SYS_ADMIN/CAP_SYS_NICE/ v3: Report min/max user priorities as defines in the uapi, and rebase internal priorities on the exposed values. Testcase: igt/gem_exec_schedule Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20171003203453.15692-9-chris@chris-wilson.co.uk
-
- 29 9月, 2017 1 次提交
-
-
由 Chris Wilson 提交于
We use INT_MIN to denote the priority of a request that has not been submitted to the scheduler; we treat INT_MIN as an invalid priority and initialise the request to it. Give the value a name so it stands out. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20170928193910.17988-3-chris@chris-wilson.co.ukReviewed-by: NMika Kuoppala <mika.kuoppala@linux.intel.com> Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com
-
- 23 9月, 2017 1 次提交
-
-
由 Chris Wilson 提交于
No users now outside of i915_wait_request(), so we can make it private to i915_gem_request.c, and assume the caller knows the seqno. In the process, also remove i915_gem_request_started() as that was only ever used by i915_spin_request(). Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Michal Winiarski <michal.winiarski@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20170922120333.25535-1-chris@chris-wilson.co.ukReviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
-
- 28 6月, 2017 1 次提交
-
-
由 Chris Wilson 提交于
Once a client has requested a waitboost, we keep that waitboost active until all clients are no longer waiting. This is because we don't distinguish which waiter deserves the boost. However, with the advent of fence signaling, the signaler threads appear as waiters to the RPS interrupt handler. So instead of using a single boolean to track when to keep the waitboost active, use a counter of all outstanding waitboosted requests. At this point, I have removed all vestiges of the rate limiting on clients. Whilst this means that compositors should remain more fluid, it also means that boosts are more prevalent. See commit b29c19b6 ("drm/i915: Boost RPS frequency for CPU stalls") for a longer discussion on the pros and cons of both approaches. A drawback of this implementation is that it requires constant request submission to keep the waitboost trimmed (as it is now cancelled when the request is completed). This will be fine for a busy system, but near idle the boosts may be kept for longer than desired (effectively tens of vblanks worstcase) and there is a reliance on rc6 instead. v2: Remove defunct rps.client_lock Reported-by: NMichał Winiarski <michal.winiarski@intel.com> Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Michał Winiarski <michal.winiarski@intel.com> Reviewed-by: NMichał Winiarski <michal.winiarski@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170628123548.9236-1-chris@chris-wilson.co.uk
-
- 20 6月, 2017 1 次提交
-
-
由 Ingo Molnar 提交于
Rename: wait_queue_t => wait_queue_entry_t 'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue", but in reality it's a queue *entry*. The 'real' queue is the wait queue head, which had to carry the name. Start sorting this out by renaming it to 'wait_queue_entry_t'. This also allows the real structure name 'struct __wait_queue' to lose its double underscore and become 'struct wait_queue_entry', which is the more canonical nomenclature for such data types. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 17 5月, 2017 2 次提交
-
-
由 Chris Wilson 提交于
All the requests at the same priority are executed in FIFO order. They do not need to be stored in the rbtree themselves, as they are a simple list within a level. If we move the requests at one priority into a list, we can then reduce the rbtree to the set of priorities. This should keep the height of the rbtree small, as the number of active priorities can not exceed the number of active requests and should be typically only a few. Currently, we have ~2k possible different priority levels, that may increase to allow even more fine grained selection. Allocating those in advance seems a waste (and may be impossible), so we opt for allocating upon first use, and freeing after its requests are depleted. To avoid the possibility of an allocation failure causing us to lose a request, we preallocate the default priority (0) and bump any request to that priority if we fail to allocate it the appropriate plist. Having a request (that is ready to run, so not leading to corruption) execute out-of-order is better than leaking the request (and its dependency tree) entirely. There should be a benefit to reducing execlists_dequeue() to principally using a simple list (and reducing the frequency of both rbtree iteration and balancing on erase) but for typical workloads, request coalescing should be small enough that we don't notice any change. The main gain is from improving PI calls to schedule, and the explicit list within a level should make request unwinding simpler (we just need to insert at the head of the list rather than the tail and not have to make the rbtree search more complicated). v2: Avoid use-after-free when deleting a depleted priolist v3: Michał found the solution to handling the allocation failure gracefully. If we disable all priority scheduling following the allocation failure, those requests will be executed in fifo and we will ensure that this request and its dependencies are in strict fifo (even when it doesn't realise it is only a single list). Normal scheduling is restored once we know the device is idle, until the next failure! Suggested-by: NMichał Wajdeczko <michal.wajdeczko@intel.com> Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Michał Winiarski <michal.winiarski@intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Reviewed-by: NMichał Winiarski <michal.winiarski@intel.com> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170517121007.27224-8-chris@chris-wilson.co.uk
-
由 Chris Wilson 提交于
Explicitly assign the default priority, and give it a name. After much discussion, we have chosen to call it I915_PRIORITY_NORMAL! Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NMika Kuoppala <mika.kuoppala@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170517121007.27224-7-chris@chris-wilson.co.uk
-
- 19 4月, 2017 1 次提交
-
-
由 Paul E. McKenney 提交于
A group of Linux kernel hackers reported chasing a bug that resulted from their assumption that SLAB_DESTROY_BY_RCU provided an existence guarantee, that is, that no block from such a slab would be reallocated during an RCU read-side critical section. Of course, that is not the case. Instead, SLAB_DESTROY_BY_RCU only prevents freeing of an entire slab of blocks. However, there is a phrase for this, namely "type safety". This commit therefore renames SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU in order to avoid future instances of this sort of confusion. Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: <linux-mm@kvack.org> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NVlastimil Babka <vbabka@suse.cz> [ paulmck: Add comments mentioning the old name, as requested by Eric Dumazet, in order to help people familiar with the old name find the new one. ] Acked-by: NDavid Rientjes <rientjes@google.com>
-
- 15 4月, 2017 1 次提交
-
-
由 Chris Wilson 提交于
Introduce a new execobject.flag (EXEC_OBJECT_CAPTURE) that userspace may use to indicate that it wants the contents of this buffer preserved in the error state (/sys/class/drm/cardN/error) following a GPU hang involving this batch. Use this at your discretion, the contents of the error state. although compressed, are allocated with GFP_ATOMIC (i.e. limited) and kept for all eternity (until the error state is destroyed). Based on an earlier patch by Ben Widawsky <ben@bwidawsk.net> Testcase: igt/gem_exec_capture Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Ben Widawsky <ben@bwidawsk.net> Cc: Matt Turner <mattst88@gmail.com> Acked-by: NBen Widawsky <ben@bwidawsk.net> Acked-by: NMatt Turner <mattst88@gmail.com> Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170415093902.22581-1-chris@chris-wilson.co.uk
-
- 17 3月, 2017 1 次提交
-
-
由 Chris Wilson 提交于
The only time we need to emit a flush inside request emission is after an execbuffer, for which we can use the full __i915_add_request(). All other instances want the simpler i915_add_request() without flushing, so remove the useless helper. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170317114709.8388-1-chris@chris-wilson.co.uk
-
- 09 3月, 2017 1 次提交
-
-
由 Chris Wilson 提交于
__i915_gem_request_started() asserts that the seqno is valid, but i915_spin_request() was not checking before querying whether the request had started. Reported-by: NMichał Winiarski <michal.winiarski@intel.com> Fixes: 754c9fd5 ("drm/i915: Protect the request->global_seqno with the engine->timeline lock") Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Michał Winiarski <michal.winiarski@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170308142238.22994-1-chris@chris-wilson.co.ukReviewed-by: NMichał Winiarski <michal.winiarski@intel.com> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
-
- 03 3月, 2017 1 次提交
-
-
由 Chris Wilson 提交于
Adding to the tail of the client request list as the only other user is in the throttle ioctl that iterates forwards over the list. It only needs protection against deletion of a request as it reads it, it simply won't see a new request added to the end of the list, or it would be too early and rejected. We can further reduce the number of spinlocks required when throttling by removing stale requests from the client_list as we throttle. Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Reviewed-by: NMika Kuoppala <mika.kuoppala@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170302122525.19675-1-chris@chris-wilson.co.uk
-
- 28 2月, 2017 1 次提交
-
-
由 Chris Wilson 提交于
As execlists and other non-semaphore multi-engine devices coordinate between engines using interrupts, we can shave off a few 10s of microsecond of scheduling latency by doing the fence signaling from the interrupt as opposed to a RT kthread. (Realistically the delay adds about 1% to an individual cross-engine workload.) We only signal the first fence in order to limit the amount of work we move into the interrupt handler. We also have to remember that our breadcrumbs may be unordered with respect to the interrupt and so we still require the waiter process to perform some heavyweight coherency fixups, as well as traversing the tree of waiters. v2: No need for early exit in irq handler - it breaks the flow between patches and prevents the tracepoint v3: Restore rcu hold across irq signaling of request Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170227205850.2828-2-chris@chris-wilson.co.uk
-