- 26 May 2018, 1 commit
-
-
Committed by Nayan Deshmukh

When checking whether a dependency fence belongs to the same entity, compare it with the scheduled fence as well as the finished fence. Earlier we were only comparing it with the scheduled fence.

Signed-off-by: Nayan Deshmukh <nayan26deshmukh@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
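As an illustration of that check, here is a minimal hedged sketch; it assumes the scheduler-fence convention that an entity's scheduled fences use entity->fence_context and its finished fences use entity->fence_context + 1, so the exact condition is an assumption rather than the literal upstream code:

    /* A dependency can be ignored if it comes from this same entity,
     * i.e. it is either the scheduled or the finished fence of one of
     * our own jobs. */
    static bool dependency_is_from_entity(struct drm_sched_entity *entity,
                                          struct dma_fence *fence)
    {
            return fence->context == entity->fence_context ||     /* scheduled fence */
                   fence->context == entity->fence_context + 1;   /* finished fence  */
    }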
-
- 19 May 2018, 1 commit
-
-
Committed by Andrey Grodzovsky

This spinlock is superfluous; any call to drm_sched_entity_push_job should already be under a lock, together with the matching drm_sched_job_init, so that the order of insertion into the queue matches the job's fence sequence number.

v2: Improve patch description. Add function documentation describing the locking considerations.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Chunming Zhou <david1.zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
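A rough sketch of the caller-side contract this relies on; the lock name is hypothetical and the exact drm_sched_job_init argument list varies between kernel versions, so read this only as an illustration of the ordering guarantee:

    /* Driver submission path: init and push are serialized by one
     * driver-side lock, so fence sequence numbers are assigned in the
     * same order the jobs enter the entity's queue. */
    mutex_lock(&ctx->submit_lock);                       /* hypothetical driver lock */
    r = drm_sched_job_init(&job->base, entity, owner);
    if (r == 0)
            drm_sched_entity_push_job(&job->base, entity);
    mutex_unlock(&ctx->submit_lock);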
-
- 16 May 2018, 5 commits
-
-
Committed by Nayan Deshmukh

This patch also affects the amdgpu and etnaviv drivers, which use the function drm_sched_entity_init.

Signed-off-by: Nayan Deshmukh <nayan26deshmukh@gmail.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Acked-by: Lucas Stach <l.stach@pengutronix.de>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Pixel Ding

The current sequence in the scheduler thread is:
1. update last sched fence
2. job begin (adding to mirror list)
3. job finish (remove from mirror list)
4. back to 1

Since we update last sched prior to joining the mirror list, the jobs in the mirror list have already passed the last sched fence. TDR just runs the jobs in the mirror list, so we should not update the last sched fences in TDR.

Signed-off-by: Pixel Ding <Pixel.Ding@amd.com>
Reviewed-by: Monk Liu <monk.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
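Sketched against the scheduler main loop of that era; the function names come from the drm_sched code base, but which step happens in which call is taken from the numbered sequence above, so treat the mapping as approximate:

    /* One iteration of the scheduler thread, per the sequence above: */
    sched_job = drm_sched_entity_pop_job(entity);  /* 1. last sched fence is updated     */
    drm_sched_job_begin(sched_job);                /* 2. job is added to the mirror list */
    fence = sched->ops->run_job(sched_job);        /* 3. on finish, job leaves the list  */
                                                   /* 4. back to 1                       */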
-
Committed by Pixel Ding

Make sure the main thread won't update the last_sched fence when the entity is being cleaned up. This fixes a racing issue caused by putting the last_sched fence twice. Running vulkaninfo in a tight loop can reproduce this issue, seen as a wild fence pointer.

v2: squash in build fix (Christian)

Signed-off-by: Pixel Ding <Pixel.Ding@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Monk Liu <Monk.Liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Pixel Ding

Fix the potential memory leak: the scheduler main thread always holds one last_sched fence reference.

Signed-off-by: Pixel Ding <Pixel.Ding@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Emily Deng

Issue: VMC page faults occur if the app is force-killed during a 3DMark test. The cause is that in entity_fini we manually signal all the jobs in the entity's queue, which confuses the sync/dep mechanism:

1) A page fault occurred in sdma's clear job, which operates on a shadow buffer, because the shadow buffer's GART table was cleaned by ttm_bo_release since the fence in its reservation was fake-signaled by entity_fini() in the SIGKILL case.
2) A page fault occurred in a gfx job because, during the lifetime of the gfx job, we manually fake-signal all jobs from its entity in entity_fini(); thus the unmapping/clear-PTE job that depends on those result fences is satisfied, sdma starts clearing the PTEs, and that leads to the GFX page fault.

Fix:
1) At least wait in entity_fini() for all already-scheduled jobs to complete if SIGKILL is the case.
2) If a fence is signaled to clear some entity's dependency, mark that entity guilty to prevent its jobs from really running, since the dependency was fake-signaled.

v2: split drm_sched_entity_fini() into two functions:
1) The first one does the waiting, removes the entity from the runqueue and returns an error when the process was killed.
2) The second one then goes over the entity, installs it as a completion signal for the remaining jobs and signals all jobs with an error code.

v3:
1) Replace fini1 and fini2 with better names.
2) Call the first part before the VM teardown in amdgpu_driver_postclose_kms() and the second part after the VM teardown.
3) Keep the original function drm_sched_entity_fini to refine the code.

v4:
1) Rename entity->finished to entity->last_scheduled.
2) Rename drm_sched_entity_fini_job_cb() to drm_sched_entity_kill_jobs_cb().
3) Pass NULL to drm_sched_entity_fini_job_cb() if -ENOENT.
4) Change the type of entity->fini_status to "int".
5) Remove the check on entity->finished.

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Signed-off-by: Emily Deng <Emily.Deng@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
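A hedged sketch of the driver-side ordering described in v3; the two helper names below are placeholders for the halves of the split drm_sched_entity_fini(), not the actual upstream names:

    /* In the file-close path (e.g. amdgpu_driver_postclose_kms): */
    r = entity_fini_first_half(entity);   /* hypothetical name: wait for already-scheduled
                                           * jobs, remove the entity from its runqueue,
                                           * return an error if the process was killed   */
    /* ... VM teardown happens here ... */
    entity_fini_second_half(entity);      /* hypothetical name: signal the remaining
                                           * queued jobs with an error code              */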
-
- 12 April 2018, 2 commits
-
-
Committed by Nayan Deshmukh

Move it with the scheduler code. This is mostly a straightforward rename with no code change, except for updating the TRACE_INCLUDE_PATH.

Signed-off-by: Nayan Deshmukh <nayan26deshmukh@gmail.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Acked-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Nayan Deshmukh

There is no @kernel parameter anymore; document the @guilty parameter instead.

Signed-off-by: Nayan Deshmukh <nayan26deshmukh@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 22 February 2018, 1 commit
-
-
Committed by Ingo Molnar

On lkml, suggestions were made to split up such trivial typo fixes into per-subsystem patches:

  --- a/arch/x86/boot/compressed/eboot.c
  +++ b/arch/x86/boot/compressed/eboot.c
  @@ -439,7 +439,7 @@ setup_uga32(void **uga_handle, unsigned long size, u32 *width, u32 *height)
   	struct efi_uga_draw_protocol *uga = NULL, *first_uga;
   	efi_guid_t uga_proto = EFI_UGA_PROTOCOL_GUID;
   	unsigned long nr_ugas;
  -	u32 *handles = (u32 *)uga_handle;;
  +	u32 *handles = (u32 *)uga_handle;
   	efi_status_t status = EFI_INVALID_PARAMETER;
   	int i;

This patch is the result of the following script:

  $ sed -i 's/;;$/;/g' $(git grep -E ';;$' | grep "\.[ch]:" | grep -vwE 'for|ia64' | cut -d: -f1 | sort | uniq)

... followed by manual review to make sure it's all good.

Splitting this up is just crazy talk, let's get over with this and just do it.

Reported-by: Pavel Machek <pavel@ucw.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
- 20 February 2018, 1 commit
-
-
Committed by Luis de Bethencourt

The trailing semicolon is an empty statement that does nothing; remove it.

Signed-off-by: Luis de Bethencourt <luisbg@kernel.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 08 December 2017, 1 commit
-
-
Committed by Lucas Stach

This moves and renames the AMDGPU scheduler to a common location in DRM in order to facilitate re-use by other drivers. This is mostly a straightforward rename with no code changes. One notable exception is the function to_drm_sched_fence(), which is no longer an inline header function, to avoid the need to export the drm_sched_fence_ops_scheduled and drm_sched_fence_ops_finished structures.

Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 07 December 2017, 2 commits
-
-
Committed by Chunming Zhou
Signed-off-by: Chunming Zhou <david1.zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Chunming Zhou

We must remove the fence callback.

Signed-off-by: Chunming Zhou <david1.zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 05 December 2017, 8 commits
-
-
Committed by Monk Liu

If the app closes the CTX right after IB submission, GPU recovery will fail to find the entity behind the guilty job, and thus no job skipping happens for that guilty job. To fix this corner case, move the increment of job->karma out of the entity iteration.

v2: only do the karma increment if bad->s_priority != KERNEL, because we always consider KERNEL jobs correct and always want to recover an unfinished kernel job (sometimes a kernel job is interrupted by a VF FLR or another GPU hang event).

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Xiangliang Yu <Xiangliang.Yu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
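A minimal sketch of the v2 rule; the priority enum and karma field follow the later drm_sched naming and are shown here only as an assumption about the shape of the check:

    /* Kernel-priority jobs are always trusted, so only bump karma for
     * non-kernel submissions, once, outside the entity iteration. */
    if (bad->s_priority != DRM_SCHED_PRIORITY_KERNEL)
            atomic_inc(&bad->karma);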
-
Committed by Monk Liu

Jobs are skipped in two cases:
1) When the entity behind a job is marked guilty, the job popped from that entity's queue is dropped in the sched_main loop.
2) In job_recovery(), the job being rescheduled is skipped if its karma is above the limit, and other jobs sharing the same fence context are skipped as well. This approach is used because job_recovery() cannot access job->entity, since the entity may already be dead.

v2: some logic fixes

v3: when an entity is detected as guilty, don't drop the job at the popping stage; instead set its fence error to -ECANCELED in run_job(), and skip scheduling it if either 1) fence->error < 0 or 2) a VRAM LOST occurred on this job. This way the job-skipping logic is unified. With this feature we can introduce a new GPU recovery feature.

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
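A hedged sketch of the unified skip condition described in v3; the vram_lost flag and the helper name are hypothetical, only the fence-error convention comes from the text above:

    /* Skip running the job if its finished fence already carries an
     * error (run_job() sets -ECANCELED for a guilty entity) or if the
     * job lost its VRAM contents during the reset. */
    static bool job_should_be_skipped(struct drm_sched_job *s_job, bool vram_lost)
    {
            return s_job->s_fence->finished.error < 0 || vram_lost;
    }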
-
Committed by Andrey Grodzovsky
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Andrey Grodzovsky

Bug: the kfifo is limited in size; during GPU reset it would fill up to its limit and the pushing thread (producer) would wait for the scheduler worker to consume the items in the fifo while holding the reservation lock on a BO. The GPU reset thread, on the other hand, blocks the scheduler during reset. Before it unblocks the scheduler it might want to recover VRAM, and so will try to reserve the same BO the producer thread is already holding, creating a deadlock.

Fix: switch from kfifo to an SPSC queue, which is unlimited in size.

Signed-off-by: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
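For illustration, a sketch of the producer/consumer use of a lock-free single-producer single-consumer queue; the spsc_queue_* helpers and field names mirror the in-tree SPSC queue but should be treated as assumptions:

    /* Producer (submission path): the push never blocks, no matter how
     * many jobs are already queued, so it cannot deadlock against the
     * GPU reset thread while holding a BO reservation. */
    spsc_queue_push(&entity->job_queue, &sched_job->queue_node);

    /* Consumer (scheduler thread): pop one job at a time. */
    node = spsc_queue_pop(&entity->job_queue);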
-
Committed by Monk Liu

Merge setting guilty on the context into this function to avoid implementing an extra routine.

v2: go through the entity list and compare the fence_ctx before operating on the entity, otherwise the entity may be just a wild pointer.

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Chunming Zhou <David1.Zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
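A sketch of the v2 safeguard, assuming the guilty flag is an atomic that the entity merely points to (see the following commit) and that the bad job's scheduled fence carries the entity's fence context; the exact field names are assumptions:

    /* Only mark an entity guilty if the bad job really came from it;
     * otherwise the entity pointer may be stale. */
    if (bad->s_fence->scheduled.context == entity->fence_context) {
            if (entity->guilty)
                    atomic_set(entity->guilty, 1);
    }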
-
Committed by Monk Liu

This member will be used later; it will point to the real variable inside the context, so CS_SUBMIT and the GPU scheduler can decide whether to skip a job based on context->guilty or *entity->guilty.

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Chunming Zhou <David1.Zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Monk Liu

Since the gpu_scheduler source domain cannot access amdgpu variables, create a hang_limit member for the scheduler so it can be referred to by the upcoming GPU RESET patches.

v2: make hang_limit a parameter of sched_init()

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Chunming Zhou <David1.Zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Andrey Grodzovsky

Bug: amdgpu_job_free_cb was accessing s_job->s_entity when the allocated amdgpu_ctx (and the entity inside it) had already been deallocated from amdgpu_cs_parser_fini.

Fix: save the job's priority at its creation instead of accessing it through s_entity later on.

Signed-off-by: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Reviewed-by: Andres Rodriguez <andresx7@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
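A hedged sketch of the fix: copy the priority onto the job while the entity is still known to be valid, e.g. at job-init time; the s_priority field and the runqueue arithmetic are assumptions about how the scheduler of that era derived the priority:

    /* At job-init time the entity is guaranteed to be alive, so record
     * the priority on the job itself for use in the free callback. */
    job->s_priority = entity->rq - sched->sched_rq;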
-
- 25 October 2017, 1 commit
-
-
Committed by Mark Rutland
locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()

Please do not apply this to mainline directly, instead please re-run the coccinelle script shown below and apply its output.

For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't harmful, and changing them results in churn.

However, for some features, the read/write distinction is critical to correct operation. To distinguish these cases, separate read/write accessors must be used. This patch migrates (most) remaining ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following coccinelle script:

  ----
  // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
  // WRITE_ONCE()

  // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
  virtual patch

  @ depends on patch @
  expression E1, E2;
  @@

  - ACCESS_ONCE(E1) = E2
  + WRITE_ONCE(E1, E2)

  @ depends on patch @
  expression E;
  @@

  - ACCESS_ONCE(E)
  + READ_ONCE(E)
  ----

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
- 20 October 2017, 1 commit
-
-
Committed by Christian König

Move the trace before we signal the scheduler fence, and drop the scheduler fence reference directly before we free the job.

v2: keep the extra s_fence reference

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Liu, Monk <Monk.Liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 19 October 2017, 1 commit
-
-
Committed by Alex Deucher
This causes instability in piglit. It's fixed in drm-next with:

  515c6faf
  1650c14b
  214a91e6
  29d25355
  79867462

This reverts commit 6af0883e.

Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 10 October 2017, 1 commit
-
-
Committed by Andres Rodriguez

This is useful for changing an entity's priority at runtime.

v2: don't modify the order of the amd_sched_entity members

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Andres Rodriguez <andresx7@gmail.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 07 October 2017, 5 commits
-
-
Committed by Nicolai Hähnle
Highly concurrent Piglit runs can trigger a race condition where a pending SDMA job on a buffer object is never executed because the corresponding process is killed (perhaps due to a crash). Since the job's fences were never signaled, the buffer object was effectively leaked. Worse, the buffer was stuck wherever it happened to be at the time, possibly in VRAM.

The symptom was user space processes stuck in interruptible waits with kernel stacks like:

  [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250
  [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0
  [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300
  [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm]
  [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm]
  [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm]
  [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm]
  [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
  [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu]
  [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu]
  [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu]
  [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu]
  [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm]
  [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
  [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0
  [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90
  [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad
  [<ffffffffffffffff>] 0xffffffffffffffff

Note: The correctness of this change depends on the earlier commit "drm/amd/sched: move adding finish callback to amd_sched_job_begin"

v2: set an error on the finished fence

Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Andres Rodriguez <andresx7@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Nicolai Hähnle

amd_sched_process_job drops the fence reference, so NULL out the s_fence field before adding it as a callback, to guard against accidentally using s_fence after it may have been freed.

v2: add a clarifying comment

Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Andres Rodriguez <andresx7@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
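A minimal sketch of the pattern, placed in the scheduler main loop; amd_sched_process_job is named in the text above, while the surrounding details are illustrative:

    struct amd_sched_fence *s_fence = sched_job->s_fence;

    /* The callback (amd_sched_process_job) drops the fence reference,
     * so s_fence must not be reached through the job after this point. */
    sched_job->s_fence = NULL;

    fence = sched->ops->run_job(sched_job);
    r = dma_fence_add_callback(fence, &s_fence->cb, amd_sched_process_job);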
-
Committed by Nicolai Hähnle

The finish callback is responsible for removing the job from the ring mirror list, among other things. It makes sense to add it as a callback in the place where the job is added to the ring mirror list.

Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Andres Rodriguez <andresx7@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Nicolai Hähnle

Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Andres Rodriguez <andresx7@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Nicolai Hähnle

The function does not actually remove the job from the FIFO, so "peek" describes it better.

Signed-off-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Andres Rodriguez <andresx7@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 30 August 2017, 1 commit
-
-
Committed by Christian König

When a process is killed we shouldn't submit all waiting jobs, but instead clean up as fast as possible.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 24 August 2017, 1 commit
-
-
Committed by Christian König

When a process is killed we shouldn't submit all waiting jobs, but instead clean up as fast as possible.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 25 May 2017, 1 commit
-
-
Committed by Monk Liu

1) TDR will kick out a guilty job if it hangs more often than the threshold given by the kernel parameter "job_hang_limit"; that way a bad command stream will not cause GPU hangs indefinitely. By default this threshold is 1, so a job is kicked out after it hangs once.
2) If a job times out, the TDR routine will not reset all scheds/rings; instead it will only reset the given one, indicated by @job of amdgpu_sriov_gpu_reset. That way we don't need to reset and recover every sched/ring if we already know which job caused the GPU hang.
3) Unblock sriov_gpu_reset for the AI family.

V2:
1) kick out the guilty job after the sched is parked.
2) since parking the scheduler prior to the kickout already takes a while, we can do a last check on the job in question before doing the hw_reset.

TODO:
1) when a job is considered guilty, we should mark some flag in its fence status flags and let the UMD side be aware that this fence signaling is not due to job completion but to a job hang.
2) if a GPU reset causes all video memory to be lost, we need to introduce a new policy to implement TDR, like dropping all jobs not yet signaled, and all IOCTLs on this device will return ERROR DEVICE_LOST. This will be implemented later.

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 11 May 2017, 2 commits
-
-
Committed by Chunming Zhou

The problem is that executing the jobs in the right order doesn't give you the right result, because consecutive jobs executed on the same engine are pipelined. In other words, job B does its buffer read before job A has written its result.

Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Chunming Zhou

Need to increment after the fence check.

Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
Reviewed-by: Junwei Zhang <Jerry.Zhang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 29 April 2017, 1 commit
-
-
Committed by Chunming Zhou
  [ 141.420491] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
  [ 141.420532] IP: [<ffffffff81579ee1>] fence_remove_callback+0x11/0x60
  [ 141.420563] PGD 20a030067
  [ 141.420575] PUD 2088ca067
  [ 141.420587] PMD 0
  [ 141.420599] Oops: 0000 [#1] SMP
  [ 141.420612] Modules linked in: amdgpu(OE) ttm(OE) drm_kms_helper(E) drm(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) rpcsec_gss_krb5(E) nfsv4(E) nfs(E) fscache(E) eeepc_wmi(E) asus_wmi(E) sparse_keymap(E) snd_hda_codec_realtek(E) video(E) snd_hda_codec_generic(E) snd_hda_codec_hdmi(E) snd_hda_intel(E) joydev(E) snd_hda_codec(E) snd_seq_midi(E) snd_seq_midi_event(E) snd_hda_core(E) snd_hwdep(E) snd_rawmidi(E) snd_pcm(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) snd_seq(E) crc32_pclmul(E) ghash_clmulni_intel(E) snd_seq_device(E) snd_timer(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ablk_helper(E) cryptd(E) snd(E) soundcore(E) serio_raw(E) shpchp(E) i2c_piix4(E) i2c_designware_platform(E) 8250_dw(E) i2c_designware_core(E) mac_hid(E) binfmt_misc(E)
  [ 141.420948] nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) parport_pc(E) ppdev(E) lp(E) parport(E) autofs4(E) hid_generic(E) usbhid(E) hid(E) psmouse(E) r8169(E) ahci(E) mii(E) libahci(E) wmi(E)
  [ 141.421042] CPU: 14 PID: 223 Comm: kworker/14:2 Tainted: G OE 4.9.0-custom #4
  [ 141.421074] Hardware name: System manufacturer System Product Name/PRIME B350-PLUS, BIOS 0606 04/06/2017
  [ 141.421146] Workqueue: events amd_sched_job_timedout [amdgpu]
  [ 141.421169] task: ffff88020b03ba80 task.stack: ffffc900016f4000
  [ 141.421193] RIP: 0010:[<ffffffff81579ee1>] [<ffffffff81579ee1>] fence_remove_callback+0x11/0x60
  [ 141.421229] RSP: 0018:ffffc900016f7d30 EFLAGS: 00010202
  [ 141.421250] RAX: ffff8801c049fc00 RBX: ffff8801d4d8dc00 RCX: 0000000000000000
  [ 141.421278] RDX: 0000000000000001 RSI: ffff8801c049fcc0 RDI: 0000000000000000
  [ 141.421307] RBP: ffffc900016f7d48 R08: 0000000000000000 R09: 0000000000000000
  [ 141.421334] R10: 00000020ed512a30 R11: 0000000000000001 R12: 0000000000000000
  [ 141.421362] R13: ffff880209ba4ba0 R14: ffff880209ba4c58 R15: ffff8801c055cc60
  [ 141.421390] FS: 0000000000000000(0000) GS:ffff88021ef80000(0000) knlGS:0000000000000000
  [ 141.421421] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 141.421443] CR2: 0000000000000030 CR3: 000000020b554000 CR4: 00000000003406e0
  [ 141.421471] Stack:
  [ 141.421480] ffff8801d4d8dc00 ffff880209ba4c48 ffff880209ba4ba0 ffffc900016f7d78
  [ 141.421513] ffffffffa0697920 ffff880209ba0000 0000000000000000 ffff880209ba2770
  [ 141.421549] ffff880209ba4b08 ffffc900016f7df0 ffffffffa05ce2ae ffffffffa0509eb7
  [ 141.421583] Call Trace:
  [ 141.421628] [<ffffffffa0697920>] amd_sched_hw_job_reset+0x50/0xb0 [amdgpu]
  [ 141.421676] [<ffffffffa05ce2ae>] amdgpu_gpu_reset+0x8e/0x690 [amdgpu]
  [ 141.421712] [<ffffffffa0509eb7>] ? drm_printk+0x97/0xa0 [drm]
  [ 141.421770] [<ffffffffa0698156>] amdgpu_job_timedout+0x46/0x50 [amdgpu]
  [ 141.421829] [<ffffffffa0696a07>] amd_sched_job_timedout+0x17/0x20 [amdgpu]
  [ 141.421859] [<ffffffff81095493>] process_one_work+0x153/0x3f0
  [ 141.421884] [<ffffffff81095c5b>] worker_thread+0x12b/0x4b0
  [ 141.421907] [<ffffffff81095b30>] ? rescuer_thread+0x350/0x350
  [ 141.421931] [<ffffffff8109b423>] kthread+0xd3/0xf0
  [ 141.421951] [<ffffffff8109b350>] ? kthread_park+0x60/0x60
  [ 141.421975] [<ffffffff817e1ee5>] ret_from_fork+0x25/0x30
  [ 141.421996] Code: ac 81 e8 a3 1f b0 ff 48 c7 c0 ea ff ff ff e9 48 ff ff ff 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 49 89 fc 53 <48> 8b 7f 30 48 89 f3 e8 73 7c 26 00 48 8b 13 48 39 d3 41 0f 95
  [ 141.422156] RIP [<ffffffff81579ee1>] fence_remove_callback+0x11/0x60
  [ 141.422183] RSP <ffffc900016f7d30>
  [ 141.422197] CR2: 0000000000000030
  [ 141.433483] ---[ end trace bc0949bf7ddd6d4b ]---

If the job is reset twice, then the parent could be NULL.

Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 30 March 2017, 2 commits
-
-
Committed by Chunming Zhou

A bigger number maps to a higher priority.

Signed-off-by: Chunming Zhou <David1.Zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Andres Rodriguez

A unique id is useful for debugging and tracing. Intended to replace pointers in ftrace output.

Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Andres Rodriguez <andresx7@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 02 March 2017, 1 commit
-
-
Committed by Ingo Molnar

We are going to move scheduler ABI details to <uapi/linux/sched/types.h>, which will be used from a number of .c files. Create an empty placeholder header that maps to <linux/types.h>. Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-