1. 09 Feb, 2023 1 commit
    • drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini · 5ad7bbf3
      Committed by Guilherme G. Piccoli
      Currently amdgpu calls drm_sched_fini() from the fence driver sw fini
      routine - such function is expected to be called only after the
      respective init function - drm_sched_init() - was executed successfully.
      
      We recently hit a driver probe failure on the Steam Deck, and
      drm_sched_fini() was called even though its counterpart had never
      been called, causing the following oops:
      
      amdgpu: probe of 0000:04:00.0 failed with error -110
      BUG: kernel NULL pointer dereference, address: 0000000000000090
      PGD 0 P4D 0
      Oops: 0002 [#1] PREEMPT SMP NOPTI
      CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
      Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
      RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
      [...]
      Call Trace:
       <TASK>
       amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
       amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
       amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
       devm_drm_dev_init_release+0x49/0x70
       [...]
      
      To prevent that, check if the drm_sched was properly initialized for a
      given ring before calling its fini counter-part.
      
      Note that ideally we'd use sched.ready for that; this field is set as the last
      thing in drm_sched_init(). But amdgpu seems to "override" the meaning of this
      field - in the above oops for example, it was a GFX ring causing the crash, and
      the sched.ready field was set to true in the ring init routine, regardless of
      the state of the DRM scheduler. Hence, we ended-up using sched.ops as per
      Christian's suggestion [0], and also removed the no_scheduler check [1].
      
      [0] https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb136@amd.com/
      [1] https://lore.kernel.org/amd-gfx/cd0e2994-f85f-d837-609f-7056d5fb7231@amd.com/
      
      Fixes: 067f44c8 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)")
      Suggested-by: Christian König <christian.koenig@amd.com>
      Cc: Guchun Chen <guchun.chen@amd.com>
      Cc: Luben Tuikov <luben.tuikov@amd.com>
      Cc: Mario Limonciello <mario.limonciello@amd.com>
      Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
      Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Cc: stable@vger.kernel.org
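The guard described in the commit message above can be modelled in plain userspace C. This is only a minimal sketch under stated assumptions, not the driver code: `struct sched`, `sched_init()`, `sched_fini()` and `ring_teardown()` are hypothetical stand-ins for the DRM scheduler and the fence driver teardown path. It assumes, as the message says, that the `ops` pointer is only set by a successful init while `ready` may be set elsewhere by the driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the scheduler backend ops table. */
struct sched_ops { const char *name; };

/* Hypothetical stand-in for the per-ring scheduler state. */
struct sched {
    const struct sched_ops *ops; /* set only by a successful sched_init() */
    bool ready;                  /* may be set by the driver on its own */
    int fini_calls;              /* counts real teardowns, for illustration */
};

static int sched_init(struct sched *s, const struct sched_ops *ops)
{
    if (!ops)
        return -1;
    s->ops = ops;
    s->ready = true;
    return 0;
}

/* Must only run after a successful sched_init(); the assert models the oops. */
static void sched_fini(struct sched *s)
{
    assert(s->ops != NULL);
    s->fini_calls++;
    s->ops = NULL;   /* choice of this sketch: a second teardown is skipped too */
    s->ready = false;
}

/* Teardown path: check ops, not ready, before calling the fini counterpart. */
static bool ring_teardown(struct sched *s)
{
    if (!s->ops)
        return false; /* init never ran (e.g. probe failed early) */
    sched_fini(s);
    return true;
}
```

With this shape, a ring whose `ready` flag was set by the driver but whose scheduler was never initialized is simply skipped at teardown instead of reaching `sched_fini()`.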
  2. 02 Dec, 2022 1 commit
    • drm/amdgpu: MCBP based on DRM scheduler (v9) · 3f4c175d
      Committed by Jiadong.Zhu
      Trigger Mid-Command Buffer Preemption according to the priority of the software
      rings and the hw fence signalling condition.
      
      The muxer saves the locations of the indirect buffer frames from the software
      ring together with the fence sequence number in its fifo queue, and pops out
      those records when the fences are signalled. The locations are used to resubmit
      packages in preemption scenarios by copying the chunks from the software ring.
      
      v2: Update comment style.
      v3: Fix conflict caused by previous modifications.
      v4: Remove unnecessary prints.
      v5: Fix corner cases for resubmission cases.
      v6: Refactor functions for resubmission, calling fence_process in irq handler.
      v7: Solve conflict for removing amdgpu_sw_ring.c.
      v8: Add time threshold to judge if preemption request is needed.
      v9: Correct comment spelling. Set fence emit timestamp before rsu assignment.
      
      Cc: Christian Koenig <Christian.Koenig@amd.com>
      Cc: Luben Tuikov <Luben.Tuikov@amd.com>
      Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
      Cc: Michel Dänzer <michel@daenzer.net>
      Signed-off-by: Jiadong.Zhu <Jiadong.Zhu@amd.com>
      Acked-by: Luben Tuikov <luben.tuikov@amd.com>
      Acked-by: Huang Rui <ray.huang@amd.com>
      Acked-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
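The bookkeeping described in the MCBP commit message (save each indirect buffer frame's location together with its fence sequence number, drop records once the fence has signalled) can be sketched as a small ring-buffer FIFO. This is an illustrative model only: `struct ib_record`, `struct mux_fifo`, the function names and the fixed capacity are all hypothetical, and sequence-number wraparound is ignored for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MUX_FIFO_CAP 8 /* hypothetical capacity */

/* One record: where an IB frame sits in the software ring, plus its fence seq. */
struct ib_record {
    uint64_t start;
    uint64_t end;
    uint32_t seq;
};

struct mux_fifo {
    struct ib_record rec[MUX_FIFO_CAP];
    size_t head;  /* index of the oldest record */
    size_t count; /* number of records currently stored */
};

/* Remember a submitted frame; returns -1 when the FIFO is full. */
static int mux_push(struct mux_fifo *f, uint64_t start, uint64_t end, uint32_t seq)
{
    assert(end >= start);
    if (f->count == MUX_FIFO_CAP)
        return -1;
    size_t tail = (f->head + f->count) % MUX_FIFO_CAP;
    f->rec[tail] = (struct ib_record){ start, end, seq };
    f->count++;
    return 0;
}

/* Pop every record whose fence has signalled; returns how many were dropped. */
static size_t mux_retire(struct mux_fifo *f, uint32_t signalled_seq)
{
    size_t dropped = 0;
    while (f->count && f->rec[f->head].seq <= signalled_seq) {
        f->head = (f->head + 1) % MUX_FIFO_CAP;
        f->count--;
        dropped++;
    }
    return dropped;
}

/* Oldest frame that is still pending, i.e. the first resubmission candidate. */
static const struct ib_record *mux_oldest_pending(const struct mux_fifo *f)
{
    return f->count ? &f->rec[f->head] : NULL;
}
```

After a preemption, whatever remains in the FIFO identifies the chunks of the software ring that would have to be copied again for resubmission.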
  3. 29 Sep, 2022 1 commit
  4. 28 Sep, 2022 1 commit
  5. 25 Jul, 2022 1 commit
  6. 13 Jul, 2022 1 commit
  7. 28 Jun, 2022 3 commits
  8. 11 Jun, 2022 2 commits
  9. 04 May, 2022 1 commit
  10. 10 Feb, 2022 1 commit
  11. 10 Jan, 2022 1 commit
  12. 18 Dec, 2021 1 commit
    • drm/amdgpu: introduce new amdgpu_fence object to indicate the job embedded fence · bf67014d
      Committed by Huang Rui
      The job embedded fence doesn't initialize the flags at
      dma_fence_init(). Then we will go a wrong way in
      amdgpu_fence_get_timeline_name callback and trigger a null pointer panic
      once we enabled the trace event here. So introduce new amdgpu_fence
      object to indicate the job embedded fence.
      
      [  156.131790] BUG: kernel NULL pointer dereference, address: 00000000000002a0
      [  156.131804] #PF: supervisor read access in kernel mode
      [  156.131811] #PF: error_code(0x0000) - not-present page
      [  156.131817] PGD 0 P4D 0
      [  156.131824] Oops: 0000 [#1] PREEMPT SMP PTI
      [  156.131832] CPU: 6 PID: 1404 Comm: sdma0 Tainted: G           OE     5.16.0-rc1-custom #1
      [  156.131842] Hardware name: Gigabyte Technology Co., Ltd. Z170XP-SLI/Z170XP-SLI-CF, BIOS F20 11/04/2016
      [  156.131848] RIP: 0010:strlen+0x0/0x20
      [  156.131859] Code: 89 c0 c3 0f 1f 80 00 00 00 00 48 01 fe eb 0f 0f b6 07 38 d0 74 10 48 83 c7 01 84 c0 74 05 48 39 f7 75 ec 31 c0 c3 48 89 f8 c3 <80> 3f 00 74 10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31
      [  156.131872] RSP: 0018:ffff9bd0018dbcf8 EFLAGS: 00010206
      [  156.131880] RAX: 00000000000002a0 RBX: ffff8d0305ef01b0 RCX: 000000000000000b
      [  156.131888] RDX: ffff8d03772ab924 RSI: ffff8d0305ef01b0 RDI: 00000000000002a0
      [  156.131895] RBP: ffff9bd0018dbd60 R08: ffff8d03002094d0 R09: 0000000000000000
      [  156.131901] R10: 000000000000005e R11: 0000000000000065 R12: ffff8d03002094d0
      [  156.131907] R13: 000000000000001f R14: 0000000000070018 R15: 0000000000000007
      [  156.131914] FS:  0000000000000000(0000) GS:ffff8d062ed80000(0000) knlGS:0000000000000000
      [  156.131923] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  156.131929] CR2: 00000000000002a0 CR3: 000000001120a005 CR4: 00000000003706e0
      [  156.131937] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  156.131942] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  156.131949] Call Trace:
      [  156.131953]  <TASK>
      [  156.131957]  ? trace_event_raw_event_dma_fence+0xcc/0x200
      [  156.131973]  ? ring_buffer_unlock_commit+0x23/0x130
      [  156.131982]  dma_fence_init+0x92/0xb0
      [  156.131993]  amdgpu_fence_emit+0x10d/0x2b0 [amdgpu]
      [  156.132302]  amdgpu_ib_schedule+0x2f9/0x580 [amdgpu]
      [  156.132586]  amdgpu_job_run+0xed/0x220 [amdgpu]
      
      v2: fix mismatch warning between the prototype and function name (Ray, kernel test robot)
      Signed-off-by: Huang Rui <ray.huang@amd.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
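The failure mode in the commit message above is a callback that assumes the wrong containing object for a fence. One way to avoid it, sketched here in userspace C, is to give each kind of fence its own ops table so that every callback knows which container it is looking at. Whether the actual patch is structured exactly like this is not stated in the message; all type and function names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of() helper. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct fence;
struct fence_ops { const char *(*get_timeline_name)(struct fence *f); };
struct fence { const struct fence_ops *ops; };

struct ring { const char *name; };

/* A separately allocated fence: the fence is the first member. */
struct standalone_fence { struct fence base; struct ring *ring; };

/* A job that embeds its fence at a different offset. */
struct job { int id; struct ring *ring; struct fence hw_fence; };

static const char *standalone_timeline_name(struct fence *f)
{
    return container_of(f, struct standalone_fence, base)->ring->name;
}

static const char *job_timeline_name(struct fence *f)
{
    /* Using the standalone layout here would read a bogus ring pointer. */
    return container_of(f, struct job, hw_fence)->ring->name;
}

static const struct fence_ops standalone_ops = { standalone_timeline_name };
static const struct fence_ops job_ops = { job_timeline_name };

/* Generic entry point: dispatch through the fence's own ops table. */
static const char *fence_timeline_name(struct fence *f)
{
    assert(f && f->ops);
    return f->ops->get_timeline_name(f);
}
```

Because the dispatch goes through `f->ops`, a caller such as a trace event never needs to know whether the fence lives inside a job or on its own.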
  13. 17 Dec, 2021 1 commit
    • drm/amdgpu: introduce new amdgpu_fence object to indicate the job embedded fence · 5c1e6fa4
      Committed by Huang Rui
      The job embedded fence doesn't initialize the flags at
      dma_fence_init(). Then we will go a wrong way in
      amdgpu_fence_get_timeline_name callback and trigger a null pointer panic
      once we enabled the trace event here. So introduce new amdgpu_fence
      object to indicate the job embedded fence.
      
      [  156.131790] BUG: kernel NULL pointer dereference, address: 00000000000002a0
      [  156.131804] #PF: supervisor read access in kernel mode
      [  156.131811] #PF: error_code(0x0000) - not-present page
      [  156.131817] PGD 0 P4D 0
      [  156.131824] Oops: 0000 [#1] PREEMPT SMP PTI
      [  156.131832] CPU: 6 PID: 1404 Comm: sdma0 Tainted: G           OE     5.16.0-rc1-custom #1
      [  156.131842] Hardware name: Gigabyte Technology Co., Ltd. Z170XP-SLI/Z170XP-SLI-CF, BIOS F20 11/04/2016
      [  156.131848] RIP: 0010:strlen+0x0/0x20
      [  156.131859] Code: 89 c0 c3 0f 1f 80 00 00 00 00 48 01 fe eb 0f 0f b6 07 38 d0 74 10 48 83 c7 01 84 c0 74 05 48 39 f7 75 ec 31 c0 c3 48 89 f8 c3 <80> 3f 00 74 10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31
      [  156.131872] RSP: 0018:ffff9bd0018dbcf8 EFLAGS: 00010206
      [  156.131880] RAX: 00000000000002a0 RBX: ffff8d0305ef01b0 RCX: 000000000000000b
      [  156.131888] RDX: ffff8d03772ab924 RSI: ffff8d0305ef01b0 RDI: 00000000000002a0
      [  156.131895] RBP: ffff9bd0018dbd60 R08: ffff8d03002094d0 R09: 0000000000000000
      [  156.131901] R10: 000000000000005e R11: 0000000000000065 R12: ffff8d03002094d0
      [  156.131907] R13: 000000000000001f R14: 0000000000070018 R15: 0000000000000007
      [  156.131914] FS:  0000000000000000(0000) GS:ffff8d062ed80000(0000) knlGS:0000000000000000
      [  156.131923] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  156.131929] CR2: 00000000000002a0 CR3: 000000001120a005 CR4: 00000000003706e0
      [  156.131937] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  156.131942] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  156.131949] Call Trace:
      [  156.131953]  <TASK>
      [  156.131957]  ? trace_event_raw_event_dma_fence+0xcc/0x200
      [  156.131973]  ? ring_buffer_unlock_commit+0x23/0x130
      [  156.131982]  dma_fence_init+0x92/0xb0
      [  156.131993]  amdgpu_fence_emit+0x10d/0x2b0 [amdgpu]
      [  156.132302]  amdgpu_ib_schedule+0x2f9/0x580 [amdgpu]
      [  156.132586]  amdgpu_job_run+0xed/0x220 [amdgpu]
      
      v2: fix mismatch warning between the prototype and function name (Ray, kernel test robot)
      Signed-off-by: Huang Rui <ray.huang@amd.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  14. 09 Oct, 2021 1 commit
  15. 02 Sep, 2021 1 commit
  16. 01 Sep, 2021 1 commit
  17. 17 Aug, 2021 1 commit
    • drm/amd/amdgpu embed hw_fence into amdgpu_job · c530b02f
      Committed by Jack Zhang
      Why: Previously the hw fence was allocated separately from the job.
      This caused historical lifetime issues and corner cases.
      The ideal situation is to let the fence manage both the job's
      and the fence's lifetime, and to simplify the design of the gpu-scheduler.
      
      How:
      We propose to embed hw_fence into amdgpu_job.
      1. We cover the normal job submission by this method.
      2. For ib_test and submissions without a parent job, keep the
      legacy way of creating a hw fence separately.
      v2:
      use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is
      embedded in a job.
      v3:
      remove redundant variable ring in amdgpu_job
      v4:
      add tdr sequence support for this feature. Add a job_run_counter to
      indicate whether this job is a resubmit job.
      v5:
      add missing handling in amdgpu_fence_enable_signaling
      Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
      Signed-off-by: Jack Zhang <Jack.Zhang7@hotmail.com>
      Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
      Reviewed by: Monk Liu <monk.liu@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
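The two submission paths in the commit message above (fence embedded in the job versus a separately created fence when there is no parent job) can be modelled as follows. This is a hedged userspace sketch, not the driver's implementation: the flag bit value, the struct layouts and every function name are hypothetical, and reference counting is reduced to a single release call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

#define FENCE_FLAG_EMBED_IN_JOB_BIT 0 /* hypothetical bit index */

struct fence { unsigned long flags; };

/* The job owns its fence, so both share one allocation and one lifetime. */
struct job {
    unsigned int run_counter; /* counts submissions of this job in the sketch */
    struct fence hw_fence;
};

/* Normal submission embeds the fence; job == NULL takes the legacy path. */
static struct fence *fence_for_submission(struct job *job)
{
    struct fence *f;

    if (job) {
        f = &job->hw_fence;
        f->flags |= 1UL << FENCE_FLAG_EMBED_IN_JOB_BIT;
        job->run_counter++;
    } else {
        f = calloc(1, sizeof(*f)); /* separately allocated, flag stays clear */
    }
    return f;
}

static bool fence_is_embedded(const struct fence *f)
{
    return (f->flags >> FENCE_FLAG_EMBED_IN_JOB_BIT) & 1UL;
}

/* Free the right object; returns true when the containing job was freed. */
static bool fence_release(struct fence *f)
{
    assert(f != NULL);
    if (fence_is_embedded(f)) {
        free(container_of(f, struct job, hw_fence));
        return true;
    }
    free(f);
    return false;
}
```

The flag is what lets the release path decide between freeing the containing job and freeing a standalone fence, which is the lifetime question the commit message raises.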
  18. 06 Aug, 2021 1 commit
  19. 29 Jul, 2021 1 commit
  20. 01 Jul, 2021 1 commit
  21. 20 May, 2021 2 commits
  22. 21 Apr, 2021 1 commit
  23. 10 Apr, 2021 1 commit
  24. 06 Mar, 2021 1 commit
  25. 19 Feb, 2021 1 commit
  26. 05 Feb, 2021 1 commit
  27. 13 Nov, 2020 1 commit
    • drm/amd/amdgpu/amdgpu_fence: Fix some issues pertaining to function documentation · f02f8c32
      Committed by Lee Jones
      Fixes the following W=1 kernel build warning(s):
      
       drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:139: warning: Function parameter or member 'flags' not described in 'amdgpu_fence_emit'
       drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:197: warning: Function parameter or member 'timeout' not described in 'amdgpu_fence_emit_polling'
       drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:302: warning: Function parameter or member 't' not described in 'amdgpu_fence_fallback'
       drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:302: warning: Excess function parameter 'work' description in 'amdgpu_fence_fallback'
       drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:320: warning: Excess function parameter 'adev' description in 'amdgpu_fence_wait_empty'
       drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:649: warning: Function parameter or member 'f' not described in 'amdgpu_fence_enable_signaling'
       drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:649: warning: Excess function parameter 'fence' description in 'amdgpu_fence_enable_signaling'
       drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:684: warning: Function parameter or member 'f' not described in 'amdgpu_fence_release'
       drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:684: warning: Excess function parameter 'fence' description in 'amdgpu_fence_release'
       drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:749: warning: Function parameter or member 'm' not described in 'amdgpu_debugfs_gpu_recover'
       drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:749: warning: Function parameter or member 'data' not described in 'amdgpu_debugfs_gpu_recover'
      
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: "Christian König" <christian.koenig@amd.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Jerome Glisse <glisse@freedesktop.org>
      Cc: amd-gfx@lists.freedesktop.org
      Cc: dri-devel@lists.freedesktop.org
      Cc: linux-media@vger.kernel.org
      Cc: linaro-mm-sig@lists.linaro.org
      Signed-off-by: Lee Jones <lee.jones@linaro.org>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  28. 25 Aug, 2020 2 commits
  29. 15 Jul, 2020 1 commit
  30. 08 Jul, 2020 1 commit
    • gpu/drm: Remove debug info about CPU address · e241df69
      Committed by Tiezhu Yang
      After updating to the latest kernel, I see the following "____ptrval____" boot
      messages.
      
      [    1.872600] radeon 0000:01:05.0: fence driver on ring 0 use gpu addr 0x0000000048000c00 and cpu addr 0x(____ptrval____)
      [    1.879095] radeon 0000:01:05.0: fence driver on ring 5 use gpu addr 0x0000000040056038 and cpu addr 0x(____ptrval____)
      
      Both radeon_fence_driver_start_ring() and amdgpu_fence_driver_start_ring()
      have the same issue. There are two ways to solve it:
      (1) Use "%pK" instead of "%p" so that the CPU address can be printed when
      the kptr_restrict sysctl is set to 1.
      (2) Drop the CPU address completely, as suggested by Christian, because
      the CPU address was useful in the past, but isn't any more. We now have a
      debugfs file to read the current fence values.
      
      Since the CPU address is no longer very useful, just remove the debug info
      about the CPU address.
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  31. 01 Jul, 2020 5 commits