1. 29 January 2021 (1 commit)
    • drm/scheduler: Job timeout handler returns status (v3) · a6a1f036
      Luben Tuikov authored
      This patch does not change current behaviour.
      
      The driver's job timeout handler now returns
      status indicating back to the DRM layer whether
      the device (GPU) is no longer available, such as
      after it's been unplugged, or whether all is
      normal, i.e. current behaviour.
      
      All drivers that make use of the
      drm_sched_backend_ops .timedout_job() callback
      have been updated accordingly and now return the
      would-have-been default value of
      DRM_GPU_SCHED_STAT_NOMINAL to restart the task's
      timeout timer; this is the old behaviour, and it
      is preserved by this patch.
      
      v2: Use enum as the status of a driver's job
          timeout callback method.
      
      v3: Return scheduler/device information, rather
          than task information.
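
      As an illustration (not part of the patch text), a minimal sketch of
      what a driver's timeout callback could look like after this change;
      the callback name, ops struct and DRM_GPU_SCHED_STAT_NOMINAL come
      from the commit description, while the enum type name and the
      "mydrv" driver functions are assumptions:

          #include <drm/gpu_scheduler.h>

          /* Hypothetical driver callback showing the new return type. */
          static enum drm_gpu_sched_stat
          mydrv_job_timedout(struct drm_sched_job *sched_job)
          {
                  /* ... reset the hung engine, resubmit pending jobs ... */

                  /* Device still present: report nominal status so the
                   * scheduler restarts the task's timeout timer, i.e. the
                   * old behaviour. A driver would return a "device gone"
                   * status here instead after an unplug. */
                  return DRM_GPU_SCHED_STAT_NOMINAL;
          }

          static const struct drm_sched_backend_ops mydrv_sched_ops = {
                  /* ... .run_job, .free_job ... */
                  .timedout_job = mydrv_job_timedout,
          };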
      
      Cc: Alexander Deucher <Alexander.Deucher@amd.com>
      Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Russell King <linux+etnaviv@armlinux.org.uk>
      Cc: Christian Gmeiner <christian.gmeiner@gmail.com>
      Cc: Qiang Yu <yuq825@gmail.com>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
      Cc: Eric Anholt <eric@anholt.net>
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Acked-by: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
      Acked-by: Christian König <christian.koenig@amd.com>
      Acked-by: Steven Price <steven.price@arm.com>
      Signed-off-by: Christian König <christian.koenig@amd.com>
      Link: https://patchwork.freedesktop.org/patch/415095/
  2. 19 January 2021 (1 commit)
  3. 08 December 2020 (3 commits)
  4. 17 November 2020 (1 commit)
  5. 13 November 2020 (1 commit)
  6. 19 August 2020 (1 commit)
  7. 26 June 2020 (1 commit)
  8. 15 June 2020 (1 commit)
  9. 15 April 2020 (1 commit)
  10. 26 March 2020 (2 commits)
    • drm/scheduler: fix rare NULL ptr race · 3c0fdf33
      Yintian Tao authored
      There is one corner case in dma_fence_signal_locked
      which can raise a NULL pointer problem, as shown below.
      ->dma_fence_signal
          ->dma_fence_signal_locked
      	->test_and_set_bit
      Here dma_fence_release is triggered because the fence refcount has dropped to zero.
      
      ->dma_fence_put
          ->dma_fence_release
      	->drm_sched_fence_release_scheduled
      	    ->call_rcu
      Here the union field "cb_list" of the finished fence is
      set to NULL, because struct rcu_head contains two pointers
      with the same layout as struct list_head cb_list.
      
      Therefore, hold a reference to the finished fence in drm_sched_process_job
      to prevent the NULL pointer dereference while dma_fence_signal runs on the finished fence.
      
      [  732.912867] BUG: kernel NULL pointer dereference, address: 0000000000000008
      [  732.914815] #PF: supervisor write access in kernel mode
      [  732.915731] #PF: error_code(0x0002) - not-present page
      [  732.916621] PGD 0 P4D 0
      [  732.917072] Oops: 0002 [#1] SMP PTI
      [  732.917682] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G           OE     5.4.0-rc7 #1
      [  732.918980] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
      [  732.920906] RIP: 0010:dma_fence_signal_locked+0x3e/0x100
      [  732.938569] Call Trace:
      [  732.939003]  <IRQ>
      [  732.939364]  dma_fence_signal+0x29/0x50
      [  732.940036]  drm_sched_fence_finished+0x12/0x20 [gpu_sched]
      [  732.940996]  drm_sched_process_job+0x34/0xa0 [gpu_sched]
      [  732.941910]  dma_fence_signal_locked+0x85/0x100
      [  732.942692]  dma_fence_signal+0x29/0x50
      [  732.943457]  amdgpu_fence_process+0x99/0x120 [amdgpu]
      [  732.944393]  sdma_v4_0_process_trap_irq+0x81/0xa0 [amdgpu]
      
      v2: hold the finished fence at drm_sched_process_job instead of
          amdgpu_fence_process
      v3: restore the blank line
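
      A minimal sketch (not the patch itself) of the get/signal/put pattern
      described above for drm_sched_process_job; the helper name is
      hypothetical, the dma-fence calls are standard kernel API:

          #include <linux/dma-fence.h>

          /* Pin the finished fence across signalling so a concurrent
           * dma_fence_put() cannot drop the last reference and let
           * call_rcu() overwrite the cb_list union while
           * dma_fence_signal() is still walking the callback list. */
          static void sched_process_job_sketch(struct dma_fence *finished)
          {
                  dma_fence_get(finished);     /* take a reference */
                  dma_fence_signal(finished);  /* refcount > 0 here */
                  dma_fence_put(finished);     /* drop our reference */
          }
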
      Signed-off-by: Yintian Tao <yttao@amd.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/scheduler: fix rare NULL ptr race · 77bb2f20
      Yintian Tao authored
      There is one corner case in dma_fence_signal_locked
      which can raise a NULL pointer problem, as shown below.
      ->dma_fence_signal
          ->dma_fence_signal_locked
      	->test_and_set_bit
      Here dma_fence_release is triggered because the fence refcount has dropped to zero.
      
      ->dma_fence_put
          ->dma_fence_release
      	->drm_sched_fence_release_scheduled
      	    ->call_rcu
      Here the union field "cb_list" of the finished fence is
      set to NULL, because struct rcu_head contains two pointers
      with the same layout as struct list_head cb_list.
      
      Therefore, hold a reference to the finished fence in drm_sched_process_job
      to prevent the NULL pointer dereference while dma_fence_signal runs on the finished fence.
      
      [  732.912867] BUG: kernel NULL pointer dereference, address: 0000000000000008
      [  732.914815] #PF: supervisor write access in kernel mode
      [  732.915731] #PF: error_code(0x0002) - not-present page
      [  732.916621] PGD 0 P4D 0
      [  732.917072] Oops: 0002 [#1] SMP PTI
      [  732.917682] CPU: 7 PID: 0 Comm: swapper/7 Tainted: G           OE     5.4.0-rc7 #1
      [  732.918980] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
      [  732.920906] RIP: 0010:dma_fence_signal_locked+0x3e/0x100
      [  732.938569] Call Trace:
      [  732.939003]  <IRQ>
      [  732.939364]  dma_fence_signal+0x29/0x50
      [  732.940036]  drm_sched_fence_finished+0x12/0x20 [gpu_sched]
      [  732.940996]  drm_sched_process_job+0x34/0xa0 [gpu_sched]
      [  732.941910]  dma_fence_signal_locked+0x85/0x100
      [  732.942692]  dma_fence_signal+0x29/0x50
      [  732.943457]  amdgpu_fence_process+0x99/0x120 [amdgpu]
      [  732.944393]  sdma_v4_0_process_trap_irq+0x81/0xa0 [amdgpu]
      
      v2: hold the finished fence at drm_sched_process_job instead of
          amdgpu_fence_process
      v3: restore the blank line
      Signed-off-by: Yintian Tao <yttao@amd.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  11. 17 March 2020 (2 commits)
  12. 13 March 2020 (2 commits)
  13. 17 January 2020 (1 commit)
    • drm/scheduler: improve job distribution with multiple queues · 56822db1
      Nirmoy Das authored
      This patch uses score-based logic instead of num_jobs to select
      a new rq, for better load balancing between multiple rqs/scheds.
      
      Below are test results after running amdgpu_test from mesa drm
      
      Before this patch:
      
      sched_name     number of times it got scheduled
      =========      ==================================
      sdma0          314
      sdma1          32
      comp_1.0.0     56
      comp_1.0.1     0
      comp_1.1.0     0
      comp_1.1.1     0
      comp_1.2.0     0
      comp_1.2.1     0
      comp_1.3.0     0
      comp_1.3.1     0

      After this patch:
      
      sched_name     number of times it got scheduled
      =========      ==================================
      sdma0          216
      sdma1          185
      comp_1.0.0     39
      comp_1.0.1     9
      comp_1.1.0     12
      comp_1.1.1     0
      comp_1.2.0     12
      comp_1.2.1     0
      comp_1.3.0     12
      comp_1.3.1     0
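
      A hedged sketch (not taken from the patch) of the selection logic the
      description implies: pick the run queue whose scheduler currently has
      the lowest score instead of the lowest num_jobs. The rq/sched field
      names used here are assumptions:

          #include <drm/gpu_scheduler.h>
          #include <linux/atomic.h>
          #include <linux/limits.h>

          static struct drm_sched_rq *
          pick_least_loaded_rq(struct drm_sched_rq **rq_list, unsigned int num_rqs)
          {
                  struct drm_sched_rq *best = NULL;
                  unsigned int lowest = UINT_MAX;
                  unsigned int i;

                  for (i = 0; i < num_rqs; i++) {
                          /* Assumed per-scheduler load counter; a lower
                           * score means a less busy scheduler. */
                          unsigned int score = atomic_read(&rq_list[i]->sched->score);

                          if (score < lowest) {
                                  lowest = score;
                                  best = rq_list[i];
                          }
                  }
                  return best;
          }
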
      Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  14. 27 November 2019 (1 commit)
  15. 08 November 2019 (2 commits)
  16. 07 November 2019 (1 commit)
  17. 30 October 2019 (1 commit)
  18. 26 October 2019 (1 commit)
  19. 25 October 2019 (1 commit)
  20. 16 July 2019 (1 commit)
  21. 30 May 2019 (1 commit)
  22. 25 May 2019 (1 commit)
  23. 22 May 2019 (1 commit)
  24. 03 May 2019 (3 commits)
  25. 24 April 2019 (1 commit)
  26. 26 January 2019 (2 commits)
  27. 06 December 2018 (2 commits)
  28. 29 November 2018 (1 commit)
  29. 20 November 2018 (1 commit)
    • drm/scheduler: Fix bad job be re-processed in TDR · 85744e9c
      Trigger Huang authored
      A bad job is the one that triggered the TDR (in the current amdgpu
      implementation, actually all the jobs in the current job queue will
      be treated as bad jobs). In the recovery process, its fence
      will be fake signaled, and as a result the work scheduled afterwards
      will delete it from the mirror list; but if the TDR process is invoked
      before that work executes, this bad job may be processed again,
      and the dma_fence_set_error call on its fence in the TDR process
      will lead to a kernel warning trace:
      
      [  143.033605] WARNING: CPU: 2 PID: 53 at ./include/linux/dma-fence.h:437 amddrm_sched_job_recovery+0x1af/0x1c0 [amd_sched]
      kernel: [  143.033606] Modules linked in: amdgpu(OE) amdchash(OE) amdttm(OE) amd_sched(OE) amdkcl(OE) amd_iommu_v2 drm_kms_helper drm i2c_algo_bit fb_sys_fops syscopyarea sysfillrect sysimgblt kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 snd_hda_codec_generic crypto_simd glue_helper cryptd snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq joydev snd_seq_device snd_timer snd soundcore binfmt_misc input_leds mac_hid serio_raw nfsd auth_rpcgss nfs_acl lockd grace sunrpc sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 8139too floppy psmouse 8139cp mii i2c_piix4 pata_acpi
      [  143.033649] CPU: 2 PID: 53 Comm: kworker/2:1 Tainted: G           OE    4.15.0-20-generic #21-Ubuntu
      [  143.033650] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
      [  143.033653] Workqueue: events drm_sched_job_timedout [amd_sched]
      [  143.033656] RIP: 0010:amddrm_sched_job_recovery+0x1af/0x1c0 [amd_sched]
      [  143.033657] RSP: 0018:ffffa9f880fe7d48 EFLAGS: 00010202
      [  143.033659] RAX: 0000000000000007 RBX: ffff9b98f2b24c00 RCX: ffff9b98efef4f08
      [  143.033660] RDX: ffff9b98f2b27400 RSI: ffff9b98f2b24c50 RDI: ffff9b98efef4f18
      [  143.033660] RBP: ffffa9f880fe7d98 R08: 0000000000000001 R09: 00000000000002b6
      [  143.033661] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9b98efef3430
      [  143.033662] R13: ffff9b98efef4d80 R14: ffff9b98efef4e98 R15: ffff9b98eaf91c00
      [  143.033663] FS:  0000000000000000(0000) GS:ffff9b98ffd00000(0000) knlGS:0000000000000000
      [  143.033664] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  143.033665] CR2: 00007fc49c96d470 CR3: 000000001400a005 CR4: 00000000003606e0
      [  143.033669] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  143.033669] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  143.033670] Call Trace:
      [  143.033744]  amdgpu_device_gpu_recover+0x144/0x820 [amdgpu]
      [  143.033788]  amdgpu_job_timedout+0x9b/0xa0 [amdgpu]
      [  143.033791]  drm_sched_job_timedout+0xcc/0x150 [amd_sched]
      [  143.033795]  process_one_work+0x1de/0x410
      [  143.033797]  worker_thread+0x32/0x410
      [  143.033799]  kthread+0x121/0x140
      [  143.033801]  ? process_one_work+0x410/0x410
      [  143.033803]  ? kthread_create_worker_on_cpu+0x70/0x70
      [  143.033806]  ret_from_fork+0x35/0x40
      
      So just delete the bad job from the mirror list directly.
      
      Changes in v3:
      	- Add a helper function to delete the bad jobs from mirror list and call
      		it directly *before* the job's fence is signaled
      
      Changes in v2:
      	- delete the useless list node check
      	- also delete bad jobs in drm_sched_main because:
      		kthread_unpark(ring->sched.thread) will be invoked very early, before
      		amdgpu_device_gpu_recover returns, so drm_sched_main will have a
      		chance to pick up a new job from the job queue. This new job will be
      		added to the mirror list and processed by amdgpu_job_run, but it may
      		not be deleted from the mirror list in time due to the same reason,
      		and it will finally be re-processed by drm_sched_job_recovery.
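
      For illustration only, a hedged sketch of the helper described in v3:
      remove the bad job from the mirror list before its fence is fake
      signaled, so a TDR running before the deferred cleanup cannot pick it
      up a second time. The lock and list field names are assumptions:

          #include <drm/gpu_scheduler.h>
          #include <linux/list.h>
          #include <linux/spinlock.h>

          static void drm_sched_delete_bad_job(struct drm_gpu_scheduler *sched,
                                               struct drm_sched_job *bad)
          {
                  spin_lock(&sched->job_list_lock);
                  /* Unlink so neither the recovery path nor the deferred
                   * cleanup work can see this job again. */
                  list_del_init(&bad->node);
                  spin_unlock(&sched->job_list_lock);
          }
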
      Signed-off-by: Trigger Huang <Trigger.Huang@amd.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  30. 06 November 2018 (1 commit)