提交 · 7aa6d5fe6cdb4347c427caaba38f11cc88a8ed4d · openeuler / Kernel

16 12月, 2021 3 次提交

drm/i915/guc: Remove racey GEM_BUG_ON · 7aa6d5fe

由 Matthew Brost 提交于 12月 14, 2021

A full GT reset can race with the last context put resulting in the
context ref count being zero but the destroyed bit not yet being set.
Remove GEM_BUG_ON in scrub_guc_desc_for_outstanding_g2h that asserts the
destroyed bit must be set in ref count is zero.
Reviewed-by: NDaniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: NMatthew Brost <matthew.brost@intel.com>
Signed-off-by: NJohn Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211214170500.28569-4-matthew.brost@intel.com

7aa6d5fe

drm/i915/guc: Only assign guc_id.id when stealing guc_id · 939d8e9c

由 Matthew Brost 提交于 12月 14, 2021

Previously assigned whole guc_id structure (list, spin lock) which is
incorrect, only assign the guc_id.id.

Fixes: 0f797650 ("drm/i915/guc: Rework and simplify locking")
Signed-off-by: NMatthew Brost <matthew.brost@intel.com>
Reviewed-by: NJohn Harrison <John.C.Harrison@Intel.com>
Signed-off-by: NJohn Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211214170500.28569-3-matthew.brost@intel.com

939d8e9c

drm/i915/guc: Use correct context lock when callig clr_context_registered · b25db8c7

由 Matthew Brost 提交于 12月 14, 2021

s/ce/cn/ when grabbing guc_state.lock before calling
clr_context_registered.

Fixes: 0f797650 ("drm/i915/guc: Rework and simplify locking")
Signed-off-by: NMatthew Brost <matthew.brost@intel.com>
Reviewed-by: NDaniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Signed-off-by: NJohn Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211214170500.28569-2-matthew.brost@intel.com

b25db8c7

14 12月, 2021 3 次提交

drm/i915/guc: support bigger RSA keys · b2657ed0

由 Daniele Ceraolo Spurio 提交于 12月 10, 2021

Some of the newer HW will use bigger RSA keys to authenticate the GuC
binary. On those platforms the HW will read the key from memory instead
of the RSA registers, so we need to copy it in a dedicated vma, like we
do for the HuC. The address of the key is provided to the HW via the
first RSA register.

v2: clarify that the RSA behavior is hardcoded in the bootrom (Matt)
Signed-off-by: NDaniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: NMatthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211211000756.1698923-4-daniele.ceraolospurio@intel.com

b2657ed0

drm/i915/uc: Prepare for different firmware key sizes · 013005d9

由 Michal Wajdeczko 提交于 12月 10, 2021

Future GuC/HuC firmwares might be signed with different key sizes.
Don't assume that it must be always 2048 bits long.
Signed-off-by: NMichal Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: NDaniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: NMatthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211211000756.1698923-3-daniele.ceraolospurio@intel.com

013005d9

drm/i915/uc: correctly track uc_fw init failure · 35d4efec

由 Daniele Ceraolo Spurio 提交于 12月 10, 2021

The FAILURE state of uc_fw currently implies that the fw is loadable
(i.e init completed), so we can't use it for init failures and instead
need a dedicated error code.

Note that this currently does not cause any issues because if we fail to
init any of the firmwares we abort the load, but better be accurate
anyway in case things change in the future.
Signed-off-by: NDaniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: NMatthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211211000756.1698923-2-daniele.ceraolospurio@intel.com

35d4efec

11 12月, 2021 2 次提交

drm/i915/guc: Don't go bang in GuC log if no GuC · 76aee865

由 John Harrison 提交于 12月 09, 2021

If the GuC has failed to load for any reason and then the user pokes
the debugfs GuC log interface, a BUG and/or null pointer deref can
occur. Don't let that happen.
Signed-off-by: NJohn Harrison <John.C.Harrison@Intel.com>
Reviewed-by: NLucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211210044022.1842938-5-John.C.Harrison@Intel.com

76aee865

drm/i915/uc: Allow platforms to have GuC but not HuC · 3d832f37

由 John Harrison 提交于 12月 09, 2021

It is possible for platforms to require GuC but not HuC firmware.
Also, the firmware versions for GuC and HuC advance independently. So
split the macros up to allow the lists to be maintained separately.
Signed-off-by: NJohn Harrison <John.C.Harrison@Intel.com>
Reviewed-by: NLucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: NDaniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211210044022.1842938-2-John.C.Harrison@Intel.com

3d832f37

10 12月, 2021 1 次提交

drm/i915/pmu: Fix wakeref leak in PMU busyness during reset · 1ff9fc70

由 Umesh Nerlige Ramappa 提交于 12月 06, 2021

GuC PMU busyness gets gt wakeref if awake, but fails to release the
wakeref if a reset is in progress. Release the wakeref if it was
acquried successfully.

v2: Simplify the fix (Ashutosh)

Fixes: 2a67b18e ("drm/i915/pmu: Fix synchronization of PMU callback with reset")
Signed-off-by: NUmesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: NAshutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: NJohn Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211207020239.43402-1-umesh.nerlige.ramappa@intel.com

1ff9fc70

01 12月, 2021 1 次提交

drm/i915/pmu: Fix synchronization of PMU callback with reset · 2a67b18e

由 Umesh Nerlige Ramappa 提交于 11月 08, 2021

Since the PMU callback runs in irq context, it synchronizes with gt
reset using the reset count. We could run into a case where the PMU
callback could read the reset count before it is updated. This has a
potential of corrupting the busyness stats.

In addition to the reset count, check if the reset bit is set before
capturing busyness.

In addition save the previous stats only if you intend to update them.

v2:
- The 2 reset counts captured in the PMU callback can end up being the
  same if they were captured right after the count is incremented in the
  reset flow. This can lead to a bad busyness state. Ensure that reset
  is not in progress when the initial reset count is captured.
Signed-off-by: NUmesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: NMatthew Brost <matthew.brost@intel.com>
Signed-off-by: NJohn Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211108211057.68783-1-umesh.nerlige.ramappa@intel.com

2a67b18e

22 11月, 2021 1 次提交

drm/i915/pmu: Avoid with_intel_runtime_pm within spinlock · 865fbc0f

由 Umesh Nerlige Ramappa 提交于 11月 19, 2021

When guc timestamp ping worker runs it takes the spinlock and calls
with_intel_runtime_pm. Since with_intel_runtime_pm may sleep, move the
spinlock inside __update_guc_busyness_stats.
Signed-off-by: NUmesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211120014201.26480-1-umesh.nerlige.ramappa@intel.com

865fbc0f

17 11月, 2021 2 次提交

drm/i915/guc: fix NULL vs IS_ERR() checking · 8b2abf77

由 Dan Carpenter 提交于 11月 16, 2021

The intel_engine_create_virtual() function does not return NULL.  It
returns error pointers.

Fixes: e5e32171 ("drm/i915/guc: Connect UAPI to GuC multi-lrc interface")
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: NRodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: NJohn Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211116114916.GB11936@kili
(cherry picked from commit fc12b70d)
Signed-off-by: NRodrigo Vivi <rodrigo.vivi@intel.com>

8b2abf77

drm/i915/guc: fix NULL vs IS_ERR() checking · fc12b70d

由 Dan Carpenter 提交于 11月 16, 2021

The intel_engine_create_virtual() function does not return NULL.  It
returns error pointers.

Fixes: e5e32171 ("drm/i915/guc: Connect UAPI to GuC multi-lrc interface")
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: NRodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: NJohn Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211116114916.GB11936@kili

fc12b70d

13 11月, 2021 1 次提交

drm/i915/guc/slpc: Check GuC status before freq boost · 5f1176b4

由 Vinay Belgaumkar 提交于 11月 11, 2021

It's possible that i915 might get wedged between a boost
and un-boost. Validate the i915-GuC connection before trying
to send a H2G to change the min frequency.

Bug: https://gitlab.freedesktop.org/drm/intel/-/issues/4464

Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: NVinay Belgaumkar <vinay.belgaumkar@intel.com>
Reviewed-by: NAshutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: NJohn Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211112071016.9640-1-vinay.belgaumkar@intel.com

5f1176b4

10 11月, 2021 1 次提交

drm/i915/guc: Refcount context during error capture · 08d1ecd9

由 John Harrison 提交于 11月 08, 2021

When i915 receives a context reset notification from GuC, it triggers
an error capture before resetting any outstanding requsts of that
context. Unfortunately, the error capture is not a time bound
operation. In certain situations it can take a long time, particularly
when multiple large LMEM buffers must be read back and eoncoded. If
this delay is longer than other timeouts (heartbeat, test recovery,
etc.) then a full GT reset can be triggered in the middle.

That can result in the context being reset by GuC actually being
destroyed before the error capture completes and the GuC submission
code resumes. Thus, the GuC side can start dereferencing stale
pointers and Bad Things ensue.

So add a refcount get of the context during the entire reset
operation. That way, the context can't be destroyed part way through
no matter what other resets or user interactions occur.

v2:
 (Matthew Brost)
  - Update patch to work with async error capture
v3:
 (Matthew Brost)
  - Drop async capture support as that hasn't landed yet
Signed-off-by: NJohn Harrison <John.C.Harrison@Intel.com>
Signed-off-by: NMatthew Brost <matthew.brost@intel.com>
Reviewed-by: NMatthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211108164054.23588-1-matthew.brost@intel.com

08d1ecd9

04 11月, 2021 3 次提交

drm/i915/guc/slpc: Update boost sysfs hooks for SLPC · 1448d5c4

由 Vinay Belgaumkar 提交于 11月 01, 2021

Add a helper to sort through the SLPC/RPS paths of get/set methods.
Boost frequency will be modified as long as it is within the constraints
of RP0 and if it is different from the existing one. We will set min
freq to boost only if there is at least one active waiter.

v2: Add num_boosts to guc_slpc_info and changes for worker function
v3: Review comments (Ashutosh)

Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: NVinay Belgaumkar <vinay.belgaumkar@intel.com>
Reviewed-by: NAshutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: NJohn Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211102012608.8609-4-vinay.belgaumkar@intel.com

1448d5c4

drm/i915/guc/slpc: Add waitboost functionality for SLPC · 493043fe

由 Vinay Belgaumkar 提交于 11月 01, 2021

Add helper in RPS code for handling SLPC and non-SLPC paths.
When boost is requested in the SLPC path, we can ask GuC to ramp
up the frequency req by setting the minimum frequency to boost freq.
Reset freq back to the min softlimit when there are no more waiters.

v2: Schedule a worker thread which can boost freq from within
an interrupt context as well.

v3: No need to check against requested freq before scheduling boost
work (Ashutosh)

Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: NVinay Belgaumkar <vinay.belgaumkar@intel.com>
Reviewed-by: NAshutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: NJohn Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211102012608.8609-3-vinay.belgaumkar@intel.com

493043fe

drm/i915/guc/slpc: Define and initialize boost frequency · 292e4fb0

由 Vinay Belgaumkar 提交于 11月 01, 2021

Define helpers and struct members required to record boost info.
Boost frequency is initialized to RP0 at SLPC init. Also define num_waiters
which can track the pending boost requests.

Boost will be done by scheduling a worker thread. This will avoid
the need to make H2G calls inside an interrupt context. Initialize the
worker function during SLPC init as well. Had to move intel_guc_slpc_init
a few lines below to accommodate this.

v2: Add a workqueue to handle waitboost
v3: Code review comments (Ashutosh)

Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: NVinay Belgaumkar <vinay.belgaumkar@intel.com>
Reviewed-by: NAshutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: NJohn Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211102012608.8609-2-vinay.belgaumkar@intel.com

292e4fb0

29 10月, 2021 1 次提交

drm/i915/pmu: Connect engine busyness stats from GuC to pmu · 77cdd054

由 Umesh Nerlige Ramappa 提交于 10月 26, 2021

With GuC handling scheduling, i915 is not aware of the time that a
context is scheduled in and out of the engine. Since i915 pmu relies on
this info to provide engine busyness to the user, GuC shares this info
with i915 for all engines using shared memory. For each engine, this
info contains:

- total busyness: total time that the context was running (total)
- id: id of the running context (id)
- start timestamp: timestamp when the context started running (start)

At the time (now) of sampling the engine busyness, if the id is valid
(!= ~0), and start is non-zero, then the context is considered to be
active and the engine busyness is calculated using the below equation

engine busyness = total + (now - start)

All times are obtained from the gt clock base. For inactive contexts,
engine busyness is just equal to the total.

The start and total values provided by GuC are 32 bits and wrap around
in a few minutes. Since perf pmu provides busyness as 64 bit
monotonically increasing values, there is a need for this implementation
to account for overflows and extend the time to 64 bits before returning
busyness to the user. In order to do that, a worker runs periodically at
frequency = 1/8th the time it takes for the timestamp to wrap. As an
example, that would be once in 27 seconds for a gt clock frequency of
19.2 MHz.

Note:
There might be an over-accounting of busyness due to the fact that GuC
may be updating the total and start values while kmd is reading them.
(i.e kmd may read the updated total and the stale start). In such a
case, user may see higher busyness value followed by smaller ones which
would eventually catch up to the higher value.

v2: (Tvrtko)
- Include details in commit message
- Move intel engine busyness function into execlist code
- Use union inside engine->stats
- Use natural type for ping delay jiffies
- Drop active_work condition checks
- Use for_each_engine if iterating all engines
- Drop seq locking, use spinlock at GuC level to update engine stats
- Document worker specific details

v3: (Tvrtko/Umesh)
- Demarcate GuC and execlist stat objects with comments
- Document known over-accounting issue in commit
- Provide a consistent view of GuC state
- Add hooks to gt park/unpark for GuC busyness
- Stop/start worker in gt park/unpark path
- Drop inline
- Move spinlock and worker inits to GuC initialization
- Drop helpers that are called only once

v4: (Tvrtko/Matt/Umesh)
- Drop addressed opens from commit message
- Get runtime pm in ping, remove from the park path
- Use cancel_delayed_work_sync in disable_submission path
- Update stats during reset prepare
- Skip ping if reset in progress
- Explicitly name execlists and GuC stats objects
- Since disable_submission is called from many places, move resetting
stats to intel_guc_submission_reset_prepare

v5: (Tvrtko)
- Add a trylock helper that does not sleep and synchronize PMU event
callbacks and worker with gt reset

v6: (CI BAT failures)
- DUTs using execlist submission failed to boot since __gt_unpark is
called during i915 load. This ends up calling the GuC busyness unpark
hook and results in kick-starting an uninitialized worker. Let
park/unpark hooks check if GuC submission has been initialized.
- drop cant_sleep() from trylock helper since rcu_read_lock takes care
of that.

v7: (CI) Fix igt@i915_selftest@live@gt_engines
- For GuC mode of submission the engine busyness is derived from gt time
domain. Use gt time elapsed as reference in the selftest.
- Increase busyness calculation to 10ms duration to ensure batch runs
longer and falls within the busyness tolerances in selftest.

v8:
- Use ktime_get in selftest as before
- intel_reset_trylock_no_wait results in a lockdep splat that is not
trivial to fix since the PMU callback runs in irq context and the
reset paths are tightly knit into the driver. The test that uncovers
this is igt@perf_pmu@faulting-read. Drop intel_reset_trylock_no_wait,
instead use the reset_count to synchronize with gt reset during pmu
callback. For the ping, continue to use intel_reset_trylock since ping
is not run in irq context.

- GuC PM timestamp does not tick when GuC is idle. This can potentially
result in wrong busyness values when a context is active on the
engine, but GuC is idle. Use the RING TIMESTAMP as GPU timestamp to
process the GuC busyness stats. This works since both GuC timestamp and
RING timestamp are synced with the same clock.

- The busyness stats may get updated after the batch starts running.
This delay causes the busyness reported for 100us duration to fall
below 95% in the selftest. The only option at this time is to wait for
GuC busyness to change from idle to active before we sample busyness
over a 100us period.
Signed-off-by: NJohn Harrison <John.C.Harrison@Intel.com>
Signed-off-by: NUmesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Acked-by: NTvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: NMatthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211027004821.66097-2-umesh.nerlige.ramappa@intel.com

77cdd054

27 10月, 2021 1 次提交

drm/i915/guc: Fix recursive lock in GuC submission · 9ca8bb7a

由 Matthew Brost 提交于 10月 20, 2021

Use __release_guc_id (lock held) rather than release_guc_id (acquires
lock), add lockdep annotations.

213.280129] i915: Running i915_perf_live_selftests/live_noa_gpr
[ 213.283459] ============================================
[ 213.283462] WARNING: possible recursive locking detected
{{[ 213.283466] 5.15.0-rc6+ #18 Tainted: G U W }}
[ 213.283470] --------------------------------------------
[ 213.283472] kworker/u24:0/8 is trying to acquire lock:
[ 213.283475] ffff8ffc4f6cc1e8 (&guc->submission_state.lock){....}-{2:2}, at: destroyed_worker_func+0x2df/0x350 [i915]
{{[ 213.283618] }}
{{ but task is already holding lock:}}
[ 213.283621] ffff8ffc4f6cc1e8 (&guc->submission_state.lock){....}-{2:2}, at: destroyed_worker_func+0x4f/0x350 [i915]
{{[ 213.283720] }}
{{ other info that might help us debug this:}}
[ 213.283724] Possible unsafe locking scenario:[ 213.283727] CPU0
[ 213.283728] ----
[ 213.283730] lock(&guc->submission_state.lock);
[ 213.283734] lock(&guc->submission_state.lock);
{{[ 213.283737] }}
{{ *** DEADLOCK ***}}[ 213.283740] May be due to missing lock nesting notation[ 213.283744] 3 locks held by kworker/u24:0/8:
[ 213.283747] #0: ffff8ffb80059d38 ((wq_completion)events_unbound){..}-{0:0}, at: process_one_work+0x1f3/0x550
[ 213.283757] #1: ffffb509000e3e78 ((work_completion)(&guc->submission_state.destroyed_worker)){..}-{0:0}, at: process_one_work+0x1f3/0x550
[ 213.283766] #2: ffff8ffc4f6cc1e8 (&guc->submission_state.lock){....}-{2:2}, at: destroyed_worker_func+0x4f/0x350 [i915]
{{[ 213.283860] }}
{{ stack backtrace:}}
[ 213.283863] CPU: 8 PID: 8 Comm: kworker/u24:0 Tainted: G U W 5.15.0-rc6+ #18
[ 213.283868] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 0403 01/26/2021
[ 213.283873] Workqueue: events_unbound destroyed_worker_func [i915]
[ 213.283957] Call Trace:
[ 213.283960] dump_stack_lvl+0x57/0x72
[ 213.283966] __lock_acquire.cold+0x191/0x2d3
[ 213.283972] lock_acquire+0xb5/0x2b0
[ 213.283978] ? destroyed_worker_func+0x2df/0x350 [i915]
[ 213.284059] ? destroyed_worker_func+0x2d7/0x350 [i915]
[ 213.284139] ? lock_release+0xb9/0x280
[ 213.284143] _raw_spin_lock_irqsave+0x48/0x60
[ 213.284148] ? destroyed_worker_func+0x2df/0x350 [i915]
[ 213.284226] destroyed_worker_func+0x2df/0x350 [i915]
[ 213.284310] process_one_work+0x270/0x550
[ 213.284315] worker_thread+0x52/0x3b0
[ 213.284319] ? process_one_work+0x550/0x550
[ 213.284322] kthread+0x135/0x160
[ 213.284326] ? set_kthread_struct+0x40/0x40
[ 213.284331] ret_from_fork+0x1f/0x30

and a bit later in the trace:

{{ 227.499864] do_raw_spin_lock+0x94/0xa0}}
[ 227.499868] _raw_spin_lock_irqsave+0x50/0x60
[ 227.499871] ? guc_flush_destroyed_contexts+0x4f/0xf0 [i915]
[ 227.499995] guc_flush_destroyed_contexts+0x4f/0xf0 [i915]
[ 227.500104] intel_guc_submission_reset_prepare+0x99/0x4b0 [i915]
[ 227.500209] ? mark_held_locks+0x49/0x70
[ 227.500212] intel_uc_reset_prepare+0x46/0x50 [i915]
[ 227.500320] reset_prepare+0x78/0x90 [i915]
[ 227.500412] __intel_gt_set_wedged.part.0+0x13/0xe0 [i915]
[ 227.500485] intel_gt_set_wedged.part.0+0x54/0x100 [i915]
[ 227.500556] intel_gt_set_wedged_on_fini+0x1a/0x30 [i915]
[ 227.500622] intel_gt_driver_unregister+0x1e/0x60 [i915]
[ 227.500694] i915_driver_remove+0x4a/0xf0 [i915]
[ 227.500767] i915_pci_probe+0x84/0x170 [i915]
[ 227.500838] local_pci_probe+0x42/0x80
[ 227.500842] pci_device_probe+0xd9/0x190
[ 227.500844] really_probe+0x1f2/0x3f0
[ 227.500847] __driver_probe_device+0xfe/0x180
[ 227.500848] driver_probe_device+0x1e/0x90
[ 227.500850] __driver_attach+0xc4/0x1d0
[ 227.500851] ? __device_attach_driver+0xe0/0xe0
[ 227.500853] ? __device_attach_driver+0xe0/0xe0
[ 227.500854] bus_for_each_dev+0x64/0x90
[ 227.500856] bus_add_driver+0x12e/0x1f0
[ 227.500857] driver_register+0x8f/0xe0
[ 227.500859] i915_init+0x1d/0x8f [i915]
[ 227.500934] ? 0xffffffffc144a000
[ 227.500936] do_one_initcall+0x58/0x2d0
[ 227.500938] ? rcu_read_lock_sched_held+0x3f/0x80
[ 227.500940] ? kmem_cache_alloc_trace+0x238/0x2d0
[ 227.500944] do_init_module+0x5c/0x270
[ 227.500946] __do_sys_finit_module+0x95/0xe0
[ 227.500949] do_syscall_64+0x38/0x90
[ 227.500951] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 227.500953] RIP: 0033:0x7ffa59d2ae0d
[ 227.500954] Code: c8 0c 00 0f 05 eb a9 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 3b 80 0c 00 f7 d8 64 89 01 48
[ 227.500955] RSP: 002b:00007fff320bbf48 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 227.500956] RAX: ffffffffffffffda RBX: 00000000022ea710 RCX: 00007ffa59d2ae0d
[ 227.500957] RDX: 0000000000000000 RSI: 00000000022e1d90 RDI: 0000000000000004
[ 227.500958] RBP: 0000000000000020 R08: 00007ffa59df3a60 R09: 0000000000000070
[ 227.500958] R10: 00000000022e1d90 R11: 0000000000000246 R12: 00000000022e1d90
[ 227.500959] R13: 00000000022e58e0 R14: 0000000000000043 R15: 00000000022e42c0

v2:
 (CI build)
  - Fix build error

Fixes: 1a52faed ("drm/i915/guc: Take GT PM ref when deregistering context")
Signed-off-by: NMatthew Brost <matthew.brost@intel.com>
Cc: stable@vger.kernel.org
Reviewed-by: NThomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: NJohn Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211020192147.8048-1-matthew.brost@intel.com
(cherry picked from commit 12a9917e)
Signed-off-by: NRodrigo Vivi <rodrigo.vivi@intel.com>

9ca8bb7a

23 10月, 2021 1 次提交

drm/i915/guc: Fix recursive lock in GuC submission · 12a9917e

由 Matthew Brost 提交于 10月 20, 2021

Use __release_guc_id (lock held) rather than release_guc_id (acquires
lock), add lockdep annotations.

213.280129] i915: Running i915_perf_live_selftests/live_noa_gpr
[ 213.283459] ============================================
[ 213.283462] WARNING: possible recursive locking detected
{{[ 213.283466] 5.15.0-rc6+ #18 Tainted: G U W }}
[ 213.283470] --------------------------------------------
[ 213.283472] kworker/u24:0/8 is trying to acquire lock:
[ 213.283475] ffff8ffc4f6cc1e8 (&guc->submission_state.lock){....}-{2:2}, at: destroyed_worker_func+0x2df/0x350 [i915]
{{[ 213.283618] }}
{{ but task is already holding lock:}}
[ 213.283621] ffff8ffc4f6cc1e8 (&guc->submission_state.lock){....}-{2:2}, at: destroyed_worker_func+0x4f/0x350 [i915]
{{[ 213.283720] }}
{{ other info that might help us debug this:}}
[ 213.283724] Possible unsafe locking scenario:[ 213.283727] CPU0
[ 213.283728] ----
[ 213.283730] lock(&guc->submission_state.lock);
[ 213.283734] lock(&guc->submission_state.lock);
{{[ 213.283737] }}
{{ *** DEADLOCK ***}}[ 213.283740] May be due to missing lock nesting notation[ 213.283744] 3 locks held by kworker/u24:0/8:
[ 213.283747] #0: ffff8ffb80059d38 ((wq_completion)events_unbound){..}-{0:0}, at: process_one_work+0x1f3/0x550
[ 213.283757] #1: ffffb509000e3e78 ((work_completion)(&guc->submission_state.destroyed_worker)){..}-{0:0}, at: process_one_work+0x1f3/0x550
[ 213.283766] #2: ffff8ffc4f6cc1e8 (&guc->submission_state.lock){....}-{2:2}, at: destroyed_worker_func+0x4f/0x350 [i915]
{{[ 213.283860] }}
{{ stack backtrace:}}
[ 213.283863] CPU: 8 PID: 8 Comm: kworker/u24:0 Tainted: G U W 5.15.0-rc6+ #18
[ 213.283868] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 0403 01/26/2021
[ 213.283873] Workqueue: events_unbound destroyed_worker_func [i915]
[ 213.283957] Call Trace:
[ 213.283960] dump_stack_lvl+0x57/0x72
[ 213.283966] __lock_acquire.cold+0x191/0x2d3
[ 213.283972] lock_acquire+0xb5/0x2b0
[ 213.283978] ? destroyed_worker_func+0x2df/0x350 [i915]
[ 213.284059] ? destroyed_worker_func+0x2d7/0x350 [i915]
[ 213.284139] ? lock_release+0xb9/0x280
[ 213.284143] _raw_spin_lock_irqsave+0x48/0x60
[ 213.284148] ? destroyed_worker_func+0x2df/0x350 [i915]
[ 213.284226] destroyed_worker_func+0x2df/0x350 [i915]
[ 213.284310] process_one_work+0x270/0x550
[ 213.284315] worker_thread+0x52/0x3b0
[ 213.284319] ? process_one_work+0x550/0x550
[ 213.284322] kthread+0x135/0x160
[ 213.284326] ? set_kthread_struct+0x40/0x40
[ 213.284331] ret_from_fork+0x1f/0x30

and a bit later in the trace:

{{ 227.499864] do_raw_spin_lock+0x94/0xa0}}
[ 227.499868] _raw_spin_lock_irqsave+0x50/0x60
[ 227.499871] ? guc_flush_destroyed_contexts+0x4f/0xf0 [i915]
[ 227.499995] guc_flush_destroyed_contexts+0x4f/0xf0 [i915]
[ 227.500104] intel_guc_submission_reset_prepare+0x99/0x4b0 [i915]
[ 227.500209] ? mark_held_locks+0x49/0x70
[ 227.500212] intel_uc_reset_prepare+0x46/0x50 [i915]
[ 227.500320] reset_prepare+0x78/0x90 [i915]
[ 227.500412] __intel_gt_set_wedged.part.0+0x13/0xe0 [i915]
[ 227.500485] intel_gt_set_wedged.part.0+0x54/0x100 [i915]
[ 227.500556] intel_gt_set_wedged_on_fini+0x1a/0x30 [i915]
[ 227.500622] intel_gt_driver_unregister+0x1e/0x60 [i915]
[ 227.500694] i915_driver_remove+0x4a/0xf0 [i915]
[ 227.500767] i915_pci_probe+0x84/0x170 [i915]
[ 227.500838] local_pci_probe+0x42/0x80
[ 227.500842] pci_device_probe+0xd9/0x190
[ 227.500844] really_probe+0x1f2/0x3f0
[ 227.500847] __driver_probe_device+0xfe/0x180
[ 227.500848] driver_probe_device+0x1e/0x90
[ 227.500850] __driver_attach+0xc4/0x1d0
[ 227.500851] ? __device_attach_driver+0xe0/0xe0
[ 227.500853] ? __device_attach_driver+0xe0/0xe0
[ 227.500854] bus_for_each_dev+0x64/0x90
[ 227.500856] bus_add_driver+0x12e/0x1f0
[ 227.500857] driver_register+0x8f/0xe0
[ 227.500859] i915_init+0x1d/0x8f [i915]
[ 227.500934] ? 0xffffffffc144a000
[ 227.500936] do_one_initcall+0x58/0x2d0
[ 227.500938] ? rcu_read_lock_sched_held+0x3f/0x80
[ 227.500940] ? kmem_cache_alloc_trace+0x238/0x2d0
[ 227.500944] do_init_module+0x5c/0x270
[ 227.500946] __do_sys_finit_module+0x95/0xe0
[ 227.500949] do_syscall_64+0x38/0x90
[ 227.500951] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 227.500953] RIP: 0033:0x7ffa59d2ae0d
[ 227.500954] Code: c8 0c 00 0f 05 eb a9 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 3b 80 0c 00 f7 d8 64 89 01 48
[ 227.500955] RSP: 002b:00007fff320bbf48 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 227.500956] RAX: ffffffffffffffda RBX: 00000000022ea710 RCX: 00007ffa59d2ae0d
[ 227.500957] RDX: 0000000000000000 RSI: 00000000022e1d90 RDI: 0000000000000004
[ 227.500958] RBP: 0000000000000020 R08: 00007ffa59df3a60 R09: 0000000000000070
[ 227.500958] R10: 00000000022e1d90 R11: 0000000000000246 R12: 00000000022e1d90
[ 227.500959] R13: 00000000022e58e0 R14: 0000000000000043 R15: 00000000022e42c0

v2:
 (CI build)
  - Fix build error

Fixes: 1a52faed ("drm/i915/guc: Take GT PM ref when deregistering context")
Signed-off-by: NMatthew Brost <matthew.brost@intel.com>
Cc: stable@vger.kernel.org
Reviewed-by: NThomas Hellström <thomas.hellstrom@linux.intel.com>
Signed-off-by: NJohn Harrison <John.C.Harrison@Intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20211020192147.8048-1-matthew.brost@intel.com

12a9917e

16 10月, 2021 17 次提交

drm/i915/guc: Handle errors in multi-lrc requests · 28c70233