1. 17 Feb 2022 (2 commits)
  2. 15 Feb 2022 (17 commits)
  3. 12 Feb 2022 (6 commits)
  4. 10 Feb 2022 (4 commits)
  5. 08 Feb 2022 (11 commits)
    • drm/amdgpu: drop experimental flag on aldebaran · 3786a9bc
      Committed by Alex Deucher
      These have been at production level for a while. Drop
      the flag.
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: reserve the pd while cleaning up PRTs · b6fba4ec
      Committed by Christian König
      We want to have lockdep annotation here, so make sure that we reserve
      the PD while removing PRTs even if it isn't strictly necessary since the
      VM object is about to be destroyed anyway.
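      The pattern the commit describes can be sketched in userspace C. This is
      a hypothetical model, not the amdgpu code: a plain pthread mutex stands in
      for the BO reservation (amdgpu_bo_reserve()/amdgpu_bo_unreserve() in the
      kernel), and a plain assert() stands in for the lockdep annotation.

      ```c
      #include <assert.h>
      #include <pthread.h>
      #include <stdbool.h>

      /* Hypothetical stand-in for the amdgpu page directory: a mutex models
       * the reservation lock so the held-lock assertion can be demonstrated. */
      struct page_dir {
          pthread_mutex_t resv;   /* models the PD reservation */
          bool resv_held;         /* models lockdep's "is held" state */
          int prt_count;          /* pending PRT mappings to clean up */
      };

      /* Cleanup path asserts the reservation is held (the lockdep annotation). */
      static void free_prt_mappings(struct page_dir *pd)
      {
          assert(pd->resv_held);  /* would be a lockdep assert in the kernel */
          pd->prt_count = 0;
      }

      /* Even on the teardown path, reserve the PD first so the assert holds,
       * matching the commit's "even if it isn't strictly necessary". */
      static void vm_fini(struct page_dir *pd)
      {
          pthread_mutex_lock(&pd->resv);
          pd->resv_held = true;
          free_prt_mappings(pd);
          pd->resv_held = false;
          pthread_mutex_unlock(&pd->resv);
      }
      ```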
      Signed-off-by: Christian König <christian.koenig@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: move lockdep assert to the right place. · d7d7ddc1
      Committed by Christian König
      Since newly added BOs don't have any mappings, it's OK to add them
      without holding the VM lock. Only when we add per-VM BOs is the lock
      mandatory.
      Signed-off-by: Christian König <christian.koenig@amd.com>
      Reported-by: Bhardwaj, Rajneesh <Rajneesh.Bhardwaj@amd.com>
      Acked-by: Alex Deucher <alexander.deucher@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: check the GART table before invalidating TLB · 29ba7b16
      Committed by Aaron Liu
      Bypass group programming (utcl2_harvest) aims to forbid UTCL2 from
      sending invalidation commands to harvested SE/SAs. Once an invalidation
      command reaches a harvested SE/SA, the SE/SA does not respond and the
      system hangs.

      This patch adds a check that the GART table is already allocated before
      invalidating the TLB. The new procedure is as follows:
      1. Calling amdgpu_gtt_mgr_init() in amdgpu_ttm_init(). After this step GTT
         BOs can be allocated, but GART mappings are still ignored.
      2. Calling amdgpu_gart_table_vram_alloc() from the GMC code. This allocates
         the GART backing store.
      3. Initializing the hardware, and programming the backing store into VMID0
         for all VMHUBs.
      4. Calling amdgpu_gtt_mgr_recover() to make sure the table is updated with
         the GTT allocations done before it was allocated.
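      The guard this procedure enables can be sketched as follows. The struct
      names and the flush function are illustrative stand-ins, not the amdgpu
      API; the point is only the NULL check on the GART backing store before
      touching the TLB.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      /* Hypothetical stand-ins: gart.ptr stays NULL until the GART backing
       * store has been allocated (step 2 above). */
      struct gart { void *ptr; };
      struct adev { struct gart gart; bool tlb_flushed; };

      /* Flush the VM TLB only if the GART table already exists; otherwise
       * the invalidation could reach a harvested SE/SA and hang the system. */
      static bool flush_gpu_tlb(struct adev *adev)
      {
          if (adev->gart.ptr == NULL)
              return false;          /* GART not ready yet: skip the flush */
          adev->tlb_flushed = true;  /* real code would program the VMHUB here */
          return true;
      }
      ```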
      Signed-off-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Aaron Liu <aaron.liu@amd.com>
      Acked-by: Huang Rui <ray.huang@amd.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: add utcl2_harvest to gc 10.3.1 · 6d53b115
      Committed by Aaron Liu
      Confirmed with the hardware team that there is harvesting for gc 10.3.1.
      Signed-off-by: Aaron Liu <aaron.liu@amd.com>
      Reviewed-by: Huang Rui <ray.huang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: fix list add issue in vram reserve · 4e781873
      Committed by Tao Zhou
      The parameter order in the list_add_tail call is incorrect, which causes
      reuse of the RAS reserved page.
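      The fix comes down to argument order: in the kernel's list API,
      list_add_tail(new, head) inserts new before head, i.e. at the tail of
      head's list. A minimal userspace re-implementation (semantics copied from
      the kernel's include/linux/list.h; everything else is illustrative) shows
      the correct order; swapping the arguments would link head into the new
      node instead, so the reserved page never appears on head's list.

      ```c
      #include <assert.h>

      /* Minimal userspace version of the kernel's doubly linked list. */
      struct list_head { struct list_head *next, *prev; };

      static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

      /* Kernel signature: list_add_tail(new, head) inserts new before head,
       * i.e. appends it at the tail of the list anchored at head. */
      static void list_add_tail(struct list_head *new, struct list_head *head)
      {
          new->prev = head->prev;
          new->next = head;
          head->prev->next = new;
          head->prev = new;
      }
      ```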
      Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • Revert "drm/amdgpu: Add judgement to avoid infinite loop" · a50b0482
      Committed by yipechai
      Commit d5e8ff5f ("drm/amdgpu: Fixed the defect of soft lock caused by infinite loop")
      fixed this defect.

      Revert the workaround,
      commit a2170b4a ("drm/amdgpu: Add judgement to avoid infinite loop").
      Signed-off-by: yipechai <YiPeng.Chai@amd.com>
      Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: Fixed the defect of soft lock caused by infinite loop · d5e8ff5f
      Committed by yipechai
      1. The infinite loop only occurs when multiple cards support RAS
         functions.
      2. For the root-cause explanation, refer to commit 76641cbbf196
         ("drm/amdgpu: Add judgement to avoid infinite loop").
      3. Create a new node to manage each unique RAS instance, so that each
         device's .ras_list is completely independent.
      4. Fixes: commit 7a6b8ab3231b51 ("drm/amdgpu: Unify ras block
         interface for each ras block").
      5. The soft-lockup logs are as follows:
      [  262.165690] CPU: 93 PID: 758 Comm: kworker/93:1 Tainted: G           OE     5.13.0-27-generic #29~20.04.1-Ubuntu
      [  262.165695] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS T20200717143848 07/17/2020
      [  262.165698] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
      [  262.165980] RIP: 0010:amdgpu_ras_get_ras_block+0x86/0xd0 [amdgpu]
      [  262.166239] Code: 68 d8 4c 8d 71 d8 48 39 c3 74 54 49 8b 45 38 48 85 c0 74 32 44 89 fa 44 89 e6 4c 89 ef e8 82 e4 9b dc 85 c0 74 3c 49 8b 46 28 <49> 8d 56 28 4d 89 f5 48 83 e8 28 48 39 d3 74 25 49 89 c6 49 8b 45
      [  262.166243] RSP: 0018:ffffac908fa87d80 EFLAGS: 00000202
      [  262.166247] RAX: ffffffffc1394248 RBX: ffff91e4ab8d6e20 RCX: ffffffffc1394248
      [  262.166249] RDX: ffff91e4aa356e20 RSI: 000000000000000e RDI: ffff91e4ab8c0000
      [  262.166252] RBP: ffffac908fa87da8 R08: 0000000000000007 R09: 0000000000000001
      [  262.166254] R10: ffff91e4930b64ec R11: 0000000000000000 R12: 000000000000000e
      [  262.166256] R13: ffff91e4aa356df8 R14: ffffffffc1394320 R15: 0000000000000003
      [  262.166258] FS:  0000000000000000(0000) GS:ffff92238fb40000(0000) knlGS:0000000000000000
      [  262.166261] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  262.166264] CR2: 00000001004865d0 CR3: 000000406d796000 CR4: 0000000000350ee0
      [  262.166267] Call Trace:
      [  262.166272]  amdgpu_ras_do_recovery+0x130/0x290 [amdgpu]
      [  262.166529]  ? psi_task_switch+0xd2/0x250
      [  262.166537]  ? __switch_to+0x11d/0x460
      [  262.166542]  ? __switch_to_asm+0x36/0x70
      [  262.166549]  process_one_work+0x220/0x3c0
      [  262.166556]  worker_thread+0x4d/0x3f0
      [  262.166560]  ? process_one_work+0x3c0/0x3c0
      [  262.166563]  kthread+0x12b/0x150
      [  262.166568]  ? set_kthread_struct+0x40/0x40
      [  262.166571]  ret_from_fork+0x22/0x30
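      The core of the fix (point 3 above) can be sketched in userspace C. All
      names here are illustrative, not the amdgpu ones: instead of linking one
      shared, static ras-block node into several devices' lists (which corrupts
      its next/prev pointers and can make traversal loop forever), each device
      allocates its own node that points at the shared block.

      ```c
      #include <assert.h>
      #include <stdlib.h>

      /* Minimal doubly linked list in the kernel's style. */
      struct list_head { struct list_head *next, *prev; };
      static void list_init(struct list_head *h) { h->next = h->prev = h; }
      static void list_add_tail(struct list_head *n, struct list_head *h)
      {
          n->prev = h->prev; n->next = h;
          h->prev->next = n; h->prev = n;
      }

      struct ras_block { int id; };   /* shared block implementation */
      struct ras_node {               /* per-device wrapper node */
          struct list_head entry;
          struct ras_block *block;
      };

      /* Register a shared block on one device's ras_list via a fresh node,
       * so two devices never share list linkage. */
      static struct ras_node *ras_register(struct list_head *ras_list,
                                           struct ras_block *blk)
      {
          struct ras_node *n = malloc(sizeof(*n));
          n->block = blk;
          list_add_tail(&n->entry, ras_list);
          return n;
      }
      ```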
      Signed-off-by: yipechai <YiPeng.Chai@amd.com>
      Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: Set FRU bus for Aldebaran and Vega 20 · 00d6936d
      Committed by Luben Tuikov
      The FRU and RAS EEPROMs share the same I2C bus on Aldebaran and Vega 20
      ASICs. Set the FRU bus "pointer" to this single bus, as access to the FRU
      is sought through that bus "pointer" and not through the RAS bus "pointer".
      
      Cc: Roy Sun <Roy.Sun@amd.com>
      Cc: Alex Deucher <Alexander.Deucher@amd.com>
      Fixes: 2f60dd50 ("drm/amd: Expose the FRU SMU I2C bus")
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: Fix recursive locking warning · 447c7997
      Committed by Rajneesh Bhardwaj
      Noticed the warning below while running a PyTorch workload on vega10
      GPUs. Change to a trylock to avoid conflicts with already-held
      reservation locks.
      
      [  +0.000003] WARNING: possible recursive locking detected
      [  +0.000003] 5.13.0-kfd-rajneesh #1030 Not tainted
      [  +0.000004] --------------------------------------------
      [  +0.000002] python/4822 is trying to acquire lock:
      [  +0.000004] ffff932cd9a259f8 (reservation_ww_class_mutex){+.+.}-{3:3},
      at: amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
      [  +0.000203]
                    but task is already holding lock:
      [  +0.000003] ffff932cbb7181f8 (reservation_ww_class_mutex){+.+.}-{3:3},
      at: ttm_eu_reserve_buffers+0x270/0x470 [ttm]
      [  +0.000017]
                    other info that might help us debug this:
      [  +0.000002]  Possible unsafe locking scenario:
      
      [  +0.000003]        CPU0
      [  +0.000002]        ----
      [  +0.000002]   lock(reservation_ww_class_mutex);
      [  +0.000004]   lock(reservation_ww_class_mutex);
      [  +0.000003]
                     *** DEADLOCK ***
      
      [  +0.000002]  May be due to missing lock nesting notation
      
      [  +0.000003] 7 locks held by python/4822:
      [  +0.000003]  #0: ffff932c4ac028d0 (&process->mutex){+.+.}-{3:3}, at:
      kfd_ioctl_map_memory_to_gpu+0x10b/0x320 [amdgpu]
      [  +0.000232]  #1: ffff932c55e830a8 (&info->lock#2){+.+.}-{3:3}, at:
      amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x64/0xf60 [amdgpu]
      [  +0.000241]  #2: ffff932cc45b5e68 (&(*mem)->lock){+.+.}-{3:3}, at:
      amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0xdf/0xf60 [amdgpu]
      [  +0.000236]  #3: ffffb2b35606fd28
      (reservation_ww_class_acquire){+.+.}-{0:0}, at:
      amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x232/0xf60 [amdgpu]
      [  +0.000235]  #4: ffff932cbb7181f8
      (reservation_ww_class_mutex){+.+.}-{3:3}, at:
      ttm_eu_reserve_buffers+0x270/0x470 [ttm]
      [  +0.000015]  #5: ffffffffc045f700 (*(sspp++)){....}-{0:0}, at:
      drm_dev_enter+0x5/0xa0 [drm]
      [  +0.000038]  #6: ffff932c52da7078 (&vm->eviction_lock){+.+.}-{3:3},
      at: amdgpu_vm_bo_update_mapping+0xd5/0x4f0 [amdgpu]
      [  +0.000195]
                    stack backtrace:
      [  +0.000003] CPU: 11 PID: 4822 Comm: python Not tainted
      5.13.0-kfd-rajneesh #1030
      [  +0.000005] Hardware name: GIGABYTE MZ01-CE0-00/MZ01-CE0-00, BIOS F02
      08/29/2018
      [  +0.000003] Call Trace:
      [  +0.000003]  dump_stack+0x6d/0x89
      [  +0.000010]  __lock_acquire+0xb93/0x1a90
      [  +0.000009]  lock_acquire+0x25d/0x2d0
      [  +0.000005]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
      [  +0.000184]  ? lock_is_held_type+0xa2/0x110
      [  +0.000006]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
      [  +0.000184]  __ww_mutex_lock.constprop.17+0xca/0x1060
      [  +0.000007]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
      [  +0.000183]  ? lock_release+0x13f/0x270
      [  +0.000005]  ? lock_is_held_type+0xa2/0x110
      [  +0.000006]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
      [  +0.000183]  amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
      [  +0.000185]  ttm_bo_release+0x4c6/0x580 [ttm]
      [  +0.000010]  amdgpu_bo_unref+0x1a/0x30 [amdgpu]
      [  +0.000183]  amdgpu_vm_free_table+0x76/0xa0 [amdgpu]
      [  +0.000189]  amdgpu_vm_free_pts+0xb8/0xf0 [amdgpu]
      [  +0.000189]  amdgpu_vm_update_ptes+0x411/0x770 [amdgpu]
      [  +0.000191]  amdgpu_vm_bo_update_mapping+0x324/0x4f0 [amdgpu]
      [  +0.000191]  amdgpu_vm_bo_update+0x251/0x610 [amdgpu]
      [  +0.000191]  update_gpuvm_pte+0xcc/0x290 [amdgpu]
      [  +0.000229]  ? amdgpu_vm_bo_map+0xd7/0x130 [amdgpu]
      [  +0.000190]  amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x912/0xf60
      [amdgpu]
      [  +0.000234]  kfd_ioctl_map_memory_to_gpu+0x182/0x320 [amdgpu]
      [  +0.000218]  kfd_ioctl+0x2b9/0x600 [amdgpu]
      [  +0.000216]  ? kfd_ioctl_unmap_memory_from_gpu+0x270/0x270 [amdgpu]
      [  +0.000216]  ? lock_release+0x13f/0x270
      [  +0.000006]  ? __fget_files+0x107/0x1e0
      [  +0.000007]  __x64_sys_ioctl+0x8b/0xd0
      [  +0.000007]  do_syscall_64+0x36/0x70
      [  +0.000004]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [  +0.000007] RIP: 0033:0x7fbff90a7317
      [  +0.000004] Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00
      48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f
      05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
      [  +0.000005] RSP: 002b:00007fbe301fe648 EFLAGS: 00000246 ORIG_RAX:
      0000000000000010
      [  +0.000006] RAX: ffffffffffffffda RBX: 00007fbcc402d820 RCX:
      00007fbff90a7317
      [  +0.000003] RDX: 00007fbe301fe690 RSI: 00000000c0184b18 RDI:
      0000000000000004
      [  +0.000003] RBP: 00007fbe301fe690 R08: 0000000000000000 R09:
      00007fbcc402d880
      [  +0.000003] R10: 0000000002001000 R11: 0000000000000246 R12:
      00000000c0184b18
      [  +0.000003] R13: 0000000000000004 R14: 00007fbf689593a0 R15:
      00007fbcc402d820
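      The shape of the fix can be sketched in userspace C. This is a
      hypothetical model, not the amdgpu code: a plain pthread mutex stands in
      for the ww_mutex-based reservation object, and bo_release_notify() is an
      illustrative name. The release-notify path may run while the same
      reservation is already held further up the call chain, so it must use a
      trylock instead of an unconditional lock.

      ```c
      #include <assert.h>
      #include <pthread.h>
      #include <stdbool.h>

      /* Hypothetical BO with its reservation modeled as a mutex. */
      struct bo { pthread_mutex_t resv; };

      /* Returns true if we took the reservation here (and released it),
       * false if it was already held, in which case we skip re-acquiring
       * it instead of deadlocking on our own lock. */
      static bool bo_release_notify(struct bo *bo)
      {
          if (pthread_mutex_trylock(&bo->resv) != 0)
              return false;          /* already reserved: don't lock again */
          /* ... cleanup that must run under the reservation ... */
          pthread_mutex_unlock(&bo->resv);
          return true;
      }
      ```

      POSIX guarantees pthread_mutex_trylock() fails with EBUSY when the mutex
      is already locked, including by the calling thread, which is exactly the
      recursive case the lockdep splat above reports.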
      
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Alex Deucher <Alexander.Deucher@amd.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: Prevent random memory access in FRU code · 00b14ce0
      Committed by Luben Tuikov
      Prevent random memory access in the FRU EEPROM code by passing the size of
      the destination buffer to the reading routine, and reading no more than the
      size of the buffer.
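      The bounded-read idea can be sketched as follows. fru_read_field() and
      the flat field layout are illustrative assumptions, not the actual amdgpu
      FRU API; the point is that the routine takes the destination buffer's
      size and clamps the copy to it rather than trusting the on-EEPROM length.

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <string.h>

      /* Copy a FRU field of field_len bytes into buf, never writing more
       * than buf_size bytes (including the NUL terminator). Returns the
       * number of payload bytes actually copied. */
      static size_t fru_read_field(const unsigned char *eeprom, size_t field_len,
                                   char *buf, size_t buf_size)
      {
          if (buf_size == 0)
              return 0;
          /* Clamp to the destination size, reserving room for the NUL. */
          size_t n = field_len < buf_size - 1 ? field_len : buf_size - 1;
          memcpy(buf, eeprom, n);
          buf[n] = '\0';
          return n;
      }
      ```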
      
      Cc: Kent Russell <kent.russell@amd.com>
      Cc: Alex Deucher <Alexander.Deucher@amd.com>
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Acked-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
      Reviewed-by: Kent Russell <kent.russell@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>