1. 21 Sep 2022, 2 commits
    • drm/amdgpu: add gang submit frontend v6 · 4624459c
      Committed by Christian König
      Allows submitting jobs as a gang which need to run on multiple engines
      at the same time.
      
      All members of the gang get the same implicit, explicit and VM
      dependencies, so no gang member will start running until everything
      else is ready.
      
      The last job is considered the gang leader (usually a submission to the
      GFX ring) and is used for signaling output dependencies.
      
      Each job is remembered individually as a user of a buffer object, so
      there is no joining of work at the end.
      
      v2: rebase and fix review comments from Andrey and Yogesh
      v3: use READ instead of BOOKKEEP for now because of VM unmaps, set gang
          leader only when necessary
      v4: fix order of pushing jobs and adding fences found by Trigger.
      v5: fix job index calculation and adding IBs to jobs
      v6: fix typo found by Alex
      Signed-off-by: Christian König <christian.koenig@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
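      For illustration, a minimal sketch of the push ordering described
      above, using the drm_sched interfaces from the scheduler commits
      further down this log (drm_sched_job_arm() and the entity-less
      drm_sched_entity_push_job()). The context structure and field names
      (gang_ctx, jobs, num_jobs, leader) are hypothetical, not the actual
      amdgpu_cs code:

          /* Arm all gang members first, then push the leader last so its
           * scheduler fence can stand in for the whole gang when signaling
           * output dependencies. */
          static struct dma_fence *gang_push(struct gang_ctx *ctx)
          {
                  unsigned int i;

                  for (i = 0; i < ctx->num_jobs; ++i)
                          drm_sched_job_arm(&ctx->jobs[i]->base);

                  /* Push the non-leader members... */
                  for (i = 0; i < ctx->num_jobs; ++i) {
                          if (ctx->jobs[i] == ctx->leader)
                                  continue;
                          drm_sched_entity_push_job(&ctx->jobs[i]->base);
                  }

                  /* ...and the gang leader last; its finished fence
                   * represents the gang for output dependencies. */
                  drm_sched_entity_push_job(&ctx->leader->base);
                  return dma_fence_get(&ctx->leader->base.s_fence->finished);
          }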
    • drm/amdgpu: add gang submit backend v2 · 68ce8b24
      Committed by Christian König
      Allows submitting jobs as a gang which need to run on multiple
      engines at the same time.
      
      The basic idea is that we have a global gang submit fence representing
      when the gang leader is finally pushed to run on the hardware (it is
      pushed last).
      
      Jobs submitted as a gang are never re-submitted after a GPU reset,
      since that won't work and would just deadlock the hardware again
      immediately.
      
      v2: fix logic inversion, improve documentation, fix rcu
      Signed-off-by: Christian König <christian.koenig@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
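      A hedged sketch of that global gang submit fence, closely following
      the mechanism this commit describes: a dma_fence __rcu pointer on the
      device (called gang_submit here) holds the fence of the currently
      running gang, and a new gang may only be installed once the previous
      one has signaled. The slot is assumed to be initialized with an
      already-signaled stub fence, so it is never NULL:

          /* Returns the old gang fence the caller must wait on before
           * retrying, or NULL once the new gang fence is installed. */
          struct dma_fence *switch_gang(struct amdgpu_device *adev,
                                        struct dma_fence *gang)
          {
                  struct dma_fence *old = NULL;

                  do {
                          dma_fence_put(old);
                          rcu_read_lock();
                          old = dma_fence_get_rcu_safe(&adev->gang_submit);
                          rcu_read_unlock();

                          if (old == gang)
                                  break;

                          /* Never allow two unsignaled gangs at once; that
                           * would deadlock the engines immediately. */
                          if (!dma_fence_is_signaled(old))
                                  return old;

                  } while (cmpxchg((struct dma_fence __force **)&adev->gang_submit,
                                   old, gang) != old);

                  dma_fence_put(old);
                  return NULL;
          }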
  2. 14 Sep 2022, 1 commit
  3. 31 Aug 2022, 1 commit
  4. 30 Aug 2022, 1 commit
  5. 17 Aug 2022, 3 commits
  6. 19 Jul 2022, 1 commit
  7. 13 Jul 2022, 1 commit
  8. 28 Jun 2022, 1 commit
  9. 11 Jun 2022, 1 commit
  10. 05 Mar 2022, 1 commit
  11. 15 Feb 2022, 1 commit
  12. 10 Feb 2022, 1 commit
  13. 09 Oct 2021, 1 commit
  14. 30 Aug 2021, 2 commits
    • drm/sched: drop entity parameter from drm_sched_push_job · 0e10e9a1
      Committed by Daniel Vetter
      Originally a job was only bound to the queue when we pushed it, but
      now that's done in drm_sched_job_init, making that parameter entirely
      redundant.
      
      Remove it.
      
      The same applies to the context parameter in
      lima_sched_context_queue_task; simplify that too.
      
      v2:
      Rebase on top of msm adopting drm/sched
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Acked-by: Emma Anholt <emma@anholt.net>
      Acked-by: Melissa Wen <mwen@igalia.com>
      Reviewed-by: Steven Price <steven.price@arm.com> (v1)
      Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> (v1)
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Russell King <linux+etnaviv@armlinux.org.uk>
      Cc: Christian Gmeiner <christian.gmeiner@gmail.com>
      Cc: Qiang Yu <yuq825@gmail.com>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
      Cc: Emma Anholt <emma@anholt.net>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: "Christian König" <christian.koenig@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Nirmoy Das <nirmoy.das@amd.com>
      Cc: Dave Airlie <airlied@redhat.com>
      Cc: Chen Li <chenli@uniontech.com>
      Cc: Lee Jones <lee.jones@linaro.org>
      Cc: Deepak R Varma <mh12gx2825@gmail.com>
      Cc: Kevin Wang <kevin1.wang@amd.com>
      Cc: Luben Tuikov <luben.tuikov@amd.com>
      Cc: "Marek Olšák" <marek.olsak@amd.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
      Cc: Dennis Li <Dennis.Li@amd.com>
      Cc: Boris Brezillon <boris.brezillon@collabora.com>
      Cc: etnaviv@lists.freedesktop.org
      Cc: lima@lists.freedesktop.org
      Cc: linux-media@vger.kernel.org
      Cc: linaro-mm-sig@lists.linaro.org
      Cc: Rob Clark <robdclark@gmail.com>
      Cc: Sean Paul <sean@poorly.run>
      Cc: Melissa Wen <mwen@igalia.com>
      Cc: linux-arm-msm@vger.kernel.org
      Cc: freedreno@lists.freedesktop.org
      Link: https://patchwork.freedesktop.org/patch/msgid/20210805104705.862416-6-daniel.vetter@ffwll.ch
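      For illustration, the driver-side call sequence before and after this
      change (a sketch with error handling omitted; job, entity and owner
      are whatever the driver already passes today):

          /* Before: the entity had to be passed twice. */
          drm_sched_job_init(&job->base, entity, owner);
          drm_sched_entity_push_job(&job->base, entity);

          /* After: the job is bound to its entity in drm_sched_job_init(),
           * so pushing no longer needs an entity parameter. */
          drm_sched_job_init(&job->base, entity, owner);
          drm_sched_entity_push_job(&job->base);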
    • drm/sched: Split drm_sched_job_init · dbe48d03
      Committed by Daniel Vetter
      This is a very confusingly named function: not only does it init an
      object, it also arms it and provides a point of no return for pushing
      a job into the scheduler. It would be nice if that were a bit clearer
      in the interface.
      
      But the real reason is that I want to push the dependency tracking
      helpers into the scheduler code, and that means drm_sched_job_init
      must be called a lot earlier, without arming the job.
      
      v2:
      - don't change .gitignore (Steven)
      - don't forget v3d (Emma)
      
      v3: Emma noticed that I leak the memory allocated in
      drm_sched_job_init if we bail out before the point of no return in
      subsequent driver patches. To be able to fix this, change
      drm_sched_job_cleanup() so it can handle being called both before and
      after drm_sched_job_arm().
      
      Also improve the kerneldoc for this.
      
      v4:
      - Fix the drm_sched_job_cleanup logic, I inverted the booleans, as
        usual (Melissa)
      
      - Christian pointed out that drm_sched_entity_select_rq() also needs
        to be moved into drm_sched_job_arm, which made me realize that the
        job->id definitely needs to be moved too.
      
        Shuffle things to fit between job_init and job_arm.
      
      v5:
      Reshuffle the split between init/arm once more, amdgpu abuses
      drm_sched.ready to signal gpu reset failures. Also document this
      somewhat. (Christian)
      
      v6:
      Rebase on top of the msm drm/sched support. Note that the
      drm_sched_job_init() call is completely misplaced, and hence also the
      split-out drm_sched_entity_push_job(). I've put in a FIXME which the next
      patch will address.
      
      v7: Drop the FIXME in msm, after discussions with Rob I agree it shouldn't
      be a problem where it is now.
      Acked-by: Christian König <christian.koenig@amd.com>
      Acked-by: Melissa Wen <mwen@igalia.com>
      Cc: Melissa Wen <melissa.srw@gmail.com>
      Acked-by: Emma Anholt <emma@anholt.net>
      Acked-by: Steven Price <steven.price@arm.com> (v2)
      Reviewed-by: Boris Brezillon <boris.brezillon@collabora.com> (v5)
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Russell King <linux+etnaviv@armlinux.org.uk>
      Cc: Christian Gmeiner <christian.gmeiner@gmail.com>
      Cc: Qiang Yu <yuq825@gmail.com>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: "Christian König" <christian.koenig@amd.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Adam Borowski <kilobyte@angband.pl>
      Cc: Nick Terrell <terrelln@fb.com>
      Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Cc: Paul Menzel <pmenzel@molgen.mpg.de>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Dave Airlie <airlied@redhat.com>
      Cc: Nirmoy Das <nirmoy.das@amd.com>
      Cc: Deepak R Varma <mh12gx2825@gmail.com>
      Cc: Lee Jones <lee.jones@linaro.org>
      Cc: Kevin Wang <kevin1.wang@amd.com>
      Cc: Chen Li <chenli@uniontech.com>
      Cc: Luben Tuikov <luben.tuikov@amd.com>
      Cc: "Marek Olšák" <marek.olsak@amd.com>
      Cc: Dennis Li <Dennis.Li@amd.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
      Cc: Sonny Jiang <sonny.jiang@amd.com>
      Cc: Boris Brezillon <boris.brezillon@collabora.com>
      Cc: Tian Tao <tiantao6@hisilicon.com>
      Cc: etnaviv@lists.freedesktop.org
      Cc: lima@lists.freedesktop.org
      Cc: linux-media@vger.kernel.org
      Cc: linaro-mm-sig@lists.linaro.org
      Cc: Emma Anholt <emma@anholt.net>
      Cc: Rob Clark <robdclark@gmail.com>
      Cc: Sean Paul <sean@poorly.run>
      Cc: linux-arm-msm@vger.kernel.org
      Cc: freedreno@lists.freedesktop.org
      Link: https://patchwork.freedesktop.org/patch/msgid/20210817084917.3555822-1-daniel.vetter@ffwll.ch
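      A hedged sketch of the resulting two-phase lifecycle on the driver
      side, using the final form of drm_sched_entity_push_job() from the
      commit above; prepare_deps() is a hypothetical stand-in for the
      driver's dependency and buffer setup:

          int ret = drm_sched_job_init(&job->base, entity, owner);
          if (ret)
                  return ret;

          /* Collect dependencies, reserve buffers, etc.  Bailing out here
           * is still fine: since v3, drm_sched_job_cleanup() handles being
           * called before drm_sched_job_arm() without leaking. */
          ret = prepare_deps(job);
          if (ret) {
                  drm_sched_job_cleanup(&job->base);
                  return ret;
          }

          drm_sched_job_arm(&job->base);          /* point of no return */
          drm_sched_entity_push_job(&job->base);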
  15. 17 Aug 2021, 1 commit
    • drm/amd/amdgpu embed hw_fence into amdgpu_job · c530b02f
      Committed by Jack Zhang
      Why: Previously the hw fence was allocated separately from the job.
      That caused historical lifetime issues and corner cases. The ideal
      situation is to let the fence manage both the job's and the fence's
      lifetime, and to simplify the design of the gpu-scheduler.
      
      How:
      We propose to embed the hw_fence into amdgpu_job.
      1. Normal job submission is covered by this method.
      2. For ib_test and submissions without a parent job, keep the legacy
      way of creating a hw fence separately.
      v2:
      use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is
      embedded in a job.
      v3:
      remove redundant variable ring in amdgpu_job
      v4:
      add tdr sequence support for this feature. Add a job_run_counter to
      indicate whether this job is a resubmit job.
      v5:
      add missing handling in amdgpu_fence_enable_signaling
      Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
      Signed-off-by: Jack Zhang <Jack.Zhang7@hotmail.com>
      Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
      Reviewed-by: Monk Liu <monk.liu@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
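      A hedged sketch of the embedding: the commit places a dma_fence inside
      amdgpu_job so the fence code can recover the owning job via
      container_of(), with AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT telling
      embedded fences apart from the legacy separately-allocated ones (other
      fields abbreviated, helper name hypothetical):

          struct amdgpu_job {
                  struct drm_sched_job    base;
                  /* ... */
                  struct dma_fence        hw_fence;   /* embedded: one
                                                       * lifetime for job
                                                       * and fence */
          };

          static struct amdgpu_job *job_from_fence(struct dma_fence *f)
          {
                  /* Only valid when AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT is
                   * set in f->flags. */
                  return container_of(f, struct amdgpu_job, hw_fence);
          }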
  16. 20 May 2021, 1 commit
  17. 10 Feb 2021, 2 commits
  18. 29 Jan 2021, 1 commit
    • drm/scheduler: Job timeout handler returns status (v3) · a6a1f036
      Committed by Luben Tuikov
      This patch does not change current behaviour.
      
      The driver's job timeout handler now returns status indicating back to
      the DRM layer whether the device (GPU) is no longer available, such as
      after it's been unplugged, or whether all is normal, i.e. current
      behaviour.
      
      All drivers which make use of the drm_sched_backend_ops' .timedout_job()
      callback have been accordingly renamed and return the would've-been
      default value of DRM_GPU_SCHED_STAT_NOMINAL to restart the task's
      timeout timer--this is the old behaviour, and is preserved by this
      patch.
      
      v2: Use an enum as the status of a driver's job timeout callback
          method.
      
      v3: Return scheduler/device information, rather than task information.
      
      Cc: Alexander Deucher <Alexander.Deucher@amd.com>
      Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Russell King <linux+etnaviv@armlinux.org.uk>
      Cc: Christian Gmeiner <christian.gmeiner@gmail.com>
      Cc: Qiang Yu <yuq825@gmail.com>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
      Cc: Eric Anholt <eric@anholt.net>
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
      Acked-by: Christian König <christian.koenig@amd.com>
      Acked-by: Steven Price <steven.price@arm.com>
      Signed-off-by: Christian König <christian.koenig@amd.com>
      Link: https://patchwork.freedesktop.org/patch/415095/
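      A hedged sketch of the new callback contract; the recovery work and
      the device_unplugged check are placeholders for whatever the driver
      actually does:

          static enum drm_gpu_sched_stat
          my_timedout_job(struct drm_sched_job *sched_job)
          {
                  /* ... attempt engine or GPU recovery ... */

                  if (device_unplugged)   /* hypothetical check */
                          return DRM_GPU_SCHED_STAT_ENODEV;

                  /* Old behaviour: restart the timeout timer. */
                  return DRM_GPU_SCHED_STAT_NOMINAL;
          }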
  19. 08 Dec 2020, 2 commits
  20. 19 Aug 2020, 1 commit
  21. 15 Aug 2020, 1 commit
  22. 28 Jul 2020, 1 commit
    • drm/amdgpu: fix system hang issue during GPU reset · df9c8d1a
      Committed by Dennis Li
      When the GPU hangs, the driver has multiple paths to enter
      amdgpu_device_gpu_recover; the atomic adev->in_gpu_reset and
      hive->in_reset are used to avoid re-entering GPU recovery.
      
      During GPU reset and resume it is unsafe for other threads to access
      the GPU, which may cause the GPU reset to fail. Therefore the new
      rw_semaphore adev->reset_sem is introduced, which protects the GPU
      from being accessed by external threads during recovery.
      
      v2:
      1. add rwlock for some ioctls, debugfs and file-close function.
      2. change to use dqm->is_resetting and dqm_lock for protection in kfd
      driver.
      3. remove try_lock and change adev->in_gpu_reset to an atomic, to
      avoid re-entering GPU recovery for the same GPU hang.
      
      v3:
      1. change back to using adev->reset_sem to protect the kfd callback
      functions, because dqm_lock couldn't protect all code paths; for
      example, free_mqd must be called outside of dqm_lock:
      
      [ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
      [ 1230.177221] Call Trace:
      [ 1230.178249]  dump_stack+0x98/0xd5
      [ 1230.179443]  amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
      [ 1230.180673]  gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
      [ 1230.181882]  amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
      [ 1230.183098]  amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
      [ 1230.184239]  ? ttm_bo_put+0x171/0x5f0 [ttm]
      [ 1230.185394]  ttm_tt_unbind+0x21/0x40 [ttm]
      [ 1230.186558]  ttm_tt_destroy.part.12+0x12/0x60 [ttm]
      [ 1230.187707]  ttm_tt_destroy+0x13/0x20 [ttm]
      [ 1230.188832]  ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
      [ 1230.189979]  ttm_bo_put+0x1be/0x5f0 [ttm]
      [ 1230.191230]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
      [ 1230.192522]  amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
      [ 1230.193833]  free_mqd+0x25/0x40 [amdgpu]
      [ 1230.195143]  destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
      [ 1230.196475]  pqm_destroy_queue+0x105/0x260 [amdgpu]
      [ 1230.197819]  kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
      [ 1230.199154]  kfd_ioctl+0x277/0x500 [amdgpu]
      [ 1230.200458]  ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
      [ 1230.201656]  ? tomoyo_file_ioctl+0x19/0x20
      [ 1230.202831]  ksys_ioctl+0x98/0xb0
      [ 1230.204004]  __x64_sys_ioctl+0x1a/0x20
      [ 1230.205174]  do_syscall_64+0x5f/0x250
      [ 1230.206339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      2. remove try_lock and introduce the atomic hive->in_reset, to avoid
      re-entering GPU recovery.
      
      v4:
      1. remove an unnecessary whitespace change in kfd_chardev.c
      2. remove commented-out code in amdgpu_device.c
      3. add a more detailed comment in the commit message
      4. define a wrapper function amdgpu_in_reset
      
      v5:
      1. Fix some style issues.
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
      Suggested-by: Christian König <christian.koenig@amd.com>
      Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Suggested-by: Lijo Lazar <Lijo.Lazar@amd.com>
      Suggested-by: Luben Tuikov <luben.tuikov@amd.com>
      Signed-off-by: Dennis Li <Dennis.Li@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
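      A hedged sketch of the guard pattern and the amdgpu_in_reset() wrapper
      from v4; the ioctl body is illustrative, and whether a given call site
      blocks or bails out early differs per path in the real patch:

          static inline bool amdgpu_in_reset(struct amdgpu_device *adev)
          {
                  return atomic_read(&adev->in_gpu_reset) != 0;
          }

          /* External path (ioctl/debugfs/file-close): hold reset_sem for
           * read while touching the hardware.  Recovery holds it for
           * write, excluding all such readers for the whole reset. */
          down_read(&adev->reset_sem);
          /* ... access the GPU safely here ... */
          up_read(&adev->reset_sem);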
  23. 11 Jul 2020, 1 commit
  24. 10 Jul 2020, 1 commit
  25. 01 Jul 2020, 1 commit
  26. 24 Apr 2020, 1 commit
  27. 14 Apr 2020, 1 commit
  28. 02 Apr 2020, 1 commit
  29. 10 Mar 2020, 1 commit
  30. 17 Jan 2020, 1 commit
  31. 10 Dec 2019, 1 commit
  32. 30 Oct 2019, 1 commit
  33. 26 Oct 2019, 1 commit
  34. 14 Sep 2019, 1 commit
    • drm/amdgpu: Avoid HW GPU reset for RAS. · 7c6e68c7
      Committed by Andrey Grodzovsky
      Problem:
      Under certain conditions, when some IP blocks take a RAS error, we can
      get into a situation where a GPU reset is not possible due to issues
      in RAS in SMU/PSP.
      
      Temporary fix until a proper solution in PSP/SMU is ready:
      When an uncorrectable error happens, the DF will unconditionally
      broadcast error event packets to all its clients/slaves upon receiving
      the fatal error event, freeze all its outbound queues, and trigger the
      err_event_athub interrupt. In such a case we use this interrupt to
      issue a GPU reset. The GPU reset code is modified for this case to
      avoid a HW reset: it only stops the schedulers, detaches the fences of
      all in-progress and not-yet-scheduled jobs, sets an error code on them
      and signals them. Any new incoming job submissions from user space are
      also rejected. All this is done to notify applications of the problem.
      
      v2:
      Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev
      Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c
      Remove print param from amdgpu_ras_query_error_count
      
      v3:
      Update based on the previous bug-fixing patch to properly call
      amdgpu_amdkfd_pre_reset for other XGMI hive members.
      Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
      Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
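      A hedged sketch of the 'detach and signal' step; the real helper is
      amdgpu_job_stop_all_jobs_on_sched (mentioned in v2), and the list and
      member names used here are illustrative:

          /* Give every pending job's fence an error and signal it, so user
           * space waiting on it sees the failure instead of hanging. */
          static void stop_all_jobs_sketch(struct drm_gpu_scheduler *sched)
          {
                  struct drm_sched_job *job;

                  list_for_each_entry(job, &sched->pending_list, list) {
                          struct dma_fence *f = &job->s_fence->finished;

                          dma_fence_set_error(f, -EHWPOISON);
                          dma_fence_signal(f);
                  }
          }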