1. 08 Dec 2021, 2 commits
  2. 03 Dec 2021, 1 commit
    • drm/amdgpu: skip query ecc info in gpu recovery · bab73f09
      Stanley.Yang committed
      This is a workaround for the ECC info query failing during GPU recovery:
      
      [  700.236122] amdgpu 0000:09:00.0: amdgpu: Failed to export SMU ecc table!
      [  700.236128] amdgpu 0000:09:00.0: amdgpu: GPU reset begin!
      [  704.331171] amdgpu: qcm fence wait loop timeout expired
      [  704.331194] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
      [  704.332445] amdgpu 0000:09:00.0: amdgpu: GPU reset begin!
      [  704.332448] amdgpu 0000:09:00.0: amdgpu: Bailing on TDR for s_job:ffffffffffffffff, as another already in progress
      [  704.332456] amdgpu: Pasid 0x8000 destroy queue 0 failed, ret -62
      [  710.360924] amdgpu 0000:09:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000013 SMN_C2PMSG_82:0x00000007
      [  710.360964] amdgpu 0000:09:00.0: amdgpu: Failed to disable smu features.
      [  710.361002] amdgpu 0000:09:00.0: amdgpu: Fail to disable dpm features!
      [  710.361014] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -62
      Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      bab73f09
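The workaround above amounts to not asking the firmware for ECC data while a reset is running. A minimal sketch of such a guard follows. Every name in it (`demo_dev`, `in_recovery`, `demo_get_ecc_info`, the `-EBUSY` return) is hypothetical and chosen for illustration; it is not the driver's actual code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from the driver. */
struct demo_dev {
    bool in_recovery;   /* true while a GPU reset is in progress */
    int  smu_queries;   /* counts how often the firmware was asked */
};

struct demo_ecc_info {
    unsigned long ce_count;
    unsigned long ue_count;
};

/* Pretend firmware query; stands in for the call that fails in the log. */
static int demo_query_smu_ecc(struct demo_dev *dev, struct demo_ecc_info *info)
{
    dev->smu_queries++;
    info->ce_count = 0;
    info->ue_count = 0;
    return 0;
}

/* Returns 0 on success, -EBUSY when skipped because a reset is running. */
int demo_get_ecc_info(struct demo_dev *dev, struct demo_ecc_info *info)
{
    if (!dev || !info)
        return -EINVAL;
    if (dev->in_recovery)
        return -EBUSY;  /* leave the firmware alone while it is being reset */
    return demo_query_smu_ecc(dev, info);
}
```

The point of the early return is that the firmware is never contacted during recovery, so the query cannot add another failed command to the reset sequence.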
  3. 02 Dec 2021, 1 commit
  4. 23 Nov 2021, 2 commits
  5. 14 Oct 2021, 1 commit
  6. 07 Oct 2021, 1 commit
  7. 05 Oct 2021, 1 commit
  8. 28 Sep 2021, 2 commits
  9. 24 Sep 2021, 2 commits
  10. 15 Sep 2021, 1 commit
  11. 02 Sep 2021, 1 commit
  12. 17 Aug 2021, 2 commits
  13. 13 Jul 2021, 1 commit
    • drm/amdgpu: Return error if no RAS · 43a44c53
      Luben Tuikov committed
      In amdgpu_ras_query_error_count(), return an error
      if the device doesn't support RAS. The function no
      longer has to set the values behind the integer
      pointers (when given) regardless of whether RAS is
      supported, which removes that side effect.
      
      Also, if no pointers are set, don't count, since
      we've no way of reporting the counts.
      
      Also, give this function a kernel-doc.
      
      Cc: Alexander Deucher <Alexander.Deucher@amd.com>
      Cc: John Clements <john.clements@amd.com>
      Cc: Hawking Zhang <Hawking.Zhang@amd.com>
      Reported-by: Tom Rix <trix@redhat.com>
      Fixes: a46751fb ("drm/amdgpu: Fix RAS function interface")
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      43a44c53
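The behaviour described in the commit above can be sketched as a small query function. This is an illustration under assumptions, not the driver's code: the struct, the function name `demo_query_error_count`, and the choice of `-EOPNOTSUPP` as the error code are all hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device state; the field names are illustrative only. */
struct demo_ras_dev {
    bool          ras_supported;
    unsigned long ce;   /* correctable errors seen so far */
    unsigned long ue;   /* uncorrectable errors seen so far */
};

/*
 * Either output pointer may be NULL. When RAS is unsupported the
 * function returns an error and leaves both outputs untouched.
 */
int demo_query_error_count(const struct demo_ras_dev *dev,
                           unsigned long *ce_count,
                           unsigned long *ue_count)
{
    if (!dev || !dev->ras_supported)
        return -EOPNOTSUPP;   /* assumed error code */

    if (!ce_count && !ue_count)
        return 0;             /* nowhere to report, so do not count */

    if (ce_count)
        *ce_count = dev->ce;
    if (ue_count)
        *ue_count = dev->ue;
    return 0;
}
```

A caller that checks the return value can then tell "no RAS" apart from "zero errors", which writing zeros into the outputs would not allow.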
  14. 09 Jul 2021, 1 commit
    • drm/amdgpu: Return error if no RAS · 4d9f771e
      Luben Tuikov committed
      In amdgpu_ras_query_error_count(), return an error
      if the device doesn't support RAS. The function no
      longer has to set the values behind the integer
      pointers (when given) regardless of whether RAS is
      supported, which removes that side effect.
      
      Also, if no pointers are set, don't count, since
      we've no way of reporting the counts.
      
      Also, give this function a kernel-doc.
      
      Cc: Alexander Deucher <Alexander.Deucher@amd.com>
      Cc: John Clements <john.clements@amd.com>
      Cc: Hawking Zhang <Hawking.Zhang@amd.com>
      Reported-by: Tom Rix <trix@redhat.com>
      Fixes: a46751fb ("drm/amdgpu: Fix RAS function interface")
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      4d9f771e
  15. 01 Jul 2021, 8 commits
    • drm/amdgpu: Fix koops when accessing RAS EEPROM · 1d9d2ca8
      Luben Tuikov committed
      Debugfs RAS EEPROM files are available when
      the ASIC supports RAS, and when the debugfs is
      enabled, and also when "ras_enable" module
      parameter is set to 0. However in this case,
      we get a kernel oops when accessing some of
      the "ras_..." controls in debugfs. The reason
      for this is that struct amdgpu_ras::adev is
      unset. This commit sets it, thus enabling access
      to those facilities. Note that this facilitates
      EEPROM access and not necessarily RAS features or
      functionality.
      
      Cc: Alexander Deucher <Alexander.Deucher@amd.com>
      Cc: John Clements <john.clements@amd.com>
      Cc: Hawking Zhang <Hawking.Zhang@amd.com>
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Acked-by: Alexander Deucher <Alexander.Deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      1d9d2ca8
    • drm/amdgpu: RAS EEPROM table is now in debugfs · c65b0805
      Luben Tuikov committed
      Add "ras_eeprom_size" file in debugfs, which
      reports the maximum size allocated to the RAS
      table in EEPROM, as the number of bytes and the
      number of records it could store. For instance,
      
      $cat /sys/kernel/debug/dri/0/ras/ras_eeprom_size
      262144 bytes or 10921 records
      $_
      
      Add "ras_eeprom_table" file in debugfs, which
      dumps the RAS table stored in EEPROM, in a formatted
      way. For instance,
      
      $cat ras_eeprom_table
       Signature    Version  FirstOffs       Size   Checksum
      0x414D4452 0x00010000 0x00000014 0x000000EC 0x000000DA
      Index  Offset ErrType Bank/CU          TimeStamp      Offs/Addr MemChl MCUMCID    RetiredPage
          0 0x00014      ue    0x00 0x00000000607608DC 0x000000000000   0x00    0x00 0x000000000000
          1 0x0002C      ue    0x00 0x00000000607608DC 0x000000001000   0x00    0x00 0x000000000001
          2 0x00044      ue    0x00 0x00000000607608DC 0x000000002000   0x00    0x00 0x000000000002
          3 0x0005C      ue    0x00 0x00000000607608DC 0x000000003000   0x00    0x00 0x000000000003
          4 0x00074      ue    0x00 0x00000000607608DC 0x000000004000   0x00    0x00 0x000000000004
          5 0x0008C      ue    0x00 0x00000000607608DC 0x000000005000   0x00    0x00 0x000000000005
          6 0x000A4      ue    0x00 0x00000000607608DC 0x000000006000   0x00    0x00 0x000000000006
          7 0x000BC      ue    0x00 0x00000000607608DC 0x000000007000   0x00    0x00 0x000000000007
          8 0x000D4      ue    0x00 0x00000000607608DD 0x000000008000   0x00    0x00 0x000000000008
      $_
      
      Cc: Alexander Deucher <Alexander.Deucher@amd.com>
      Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
      Cc: John Clements <john.clements@amd.com>
      Cc: Hawking Zhang <Hawking.Zhang@amd.com>
      Cc: Xinhui Pan <xinhui.pan@amd.com>
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Acked-by: Alexander Deucher <Alexander.Deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      c65b0805
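The numbers in the sample output above are consistent with a 20-byte table header (FirstOffs is 0x14) followed by 24-byte records (consecutive offsets differ by 0x18). Assuming exactly that layout, the reported record count and the per-record offsets can be reproduced with the arithmetic below. The macro and function names are hypothetical, and the sizes are read off the dump rather than taken from the driver headers.

```c
#include <stdint.h>

/* Sizes inferred from the dump above; treat them as assumptions. */
#define DEMO_TABLE_HEADER_SIZE 20u  /* FirstOffs = 0x14 */
#define DEMO_TABLE_RECORD_SIZE 24u  /* 0x2C - 0x14 between records */

/* How many whole records fit after the header in eeprom_bytes. */
uint32_t demo_max_record_count(uint32_t eeprom_bytes)
{
    if (eeprom_bytes < DEMO_TABLE_HEADER_SIZE)
        return 0;
    return (eeprom_bytes - DEMO_TABLE_HEADER_SIZE) / DEMO_TABLE_RECORD_SIZE;
}

/* Byte offset of the record with the given zero-based index. */
uint32_t demo_record_offset(uint32_t index)
{
    return DEMO_TABLE_HEADER_SIZE + index * DEMO_TABLE_RECORD_SIZE;
}
```

With 262144 bytes this gives 10921 records, and record 8 lands at offset 0xD4, both matching the output shown. The Size field of 0xEC likewise equals the header plus nine records.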
    • drm/amdgpu: Optimize EEPROM RAS table I/O · 63d4c081
      Luben Tuikov committed
      Split functionality between read and write, which
      simplifies the code and exposes where it can be
      optimized or simplified, and take advantage of
      that.
      
      Read and write the table in one go; use a separate
      stage to decode or encode the data, as opposed to
      on the fly, which keeps the I2C bus busy. Use a
      single read/write to read/write the table or at
      most two if the number of records we're
      reading/writing wraps around.
      
      Check the checksum of a table in EEPROM on init.
      
      Update the checksum at the same time as when
      updating the table header signature, when the
      threshold was increased on boot.
      
      Take advantage of arithmetic modulo 256, that is,
      use a byte!, to greatly simplify checksum
      arithmetic.
      
      Cc: Alexander Deucher <Alexander.Deucher@amd.com>
      Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Acked-by: Alexander Deucher <Alexander.Deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      63d4c081
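The remark about arithmetic modulo 256 can be illustrated with a one-byte accumulator, which wraps on its own and needs no explicit masking. The sketch below assumes the common convention that all table bytes plus the stored checksum sum to zero modulo 256. That convention and all function names are assumptions made for illustration, not taken from the commit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of all bytes, kept in a single byte so it wraps modulo 256. */
static uint8_t demo_byte_sum(const uint8_t *buf, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + buf[i]);
    return sum;
}

/* Checksum chosen so that (byte sum + checksum) == 0 modulo 256. */
uint8_t demo_checksum(const uint8_t *buf, size_t len)
{
    return (uint8_t)(0u - demo_byte_sum(buf, len));
}

/* Verify a stored checksum against the data it covers. */
bool demo_checksum_ok(const uint8_t *buf, size_t len, uint8_t stored)
{
    return (uint8_t)(demo_byte_sum(buf, len) + stored) == 0;
}

/* Update a stored checksum after appending rec, without rescanning. */
uint8_t demo_checksum_append(uint8_t stored, const uint8_t *rec, size_t len)
{
    return (uint8_t)(stored - demo_byte_sum(rec, len));
}
```

Because subtraction also wraps modulo 256, appending records only requires summing the new bytes, which is one way a byte-wide checksum keeps the update logic short.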
    • drm/amdgpu: Some renames · 0686627b
      Luben Tuikov committed
      Qualify with "ras_". Use kernel's own--don't
      redefine your own.
      
      Cc: Alexander Deucher <Alexander.Deucher@amd.com>
      Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      0686627b
    • drm/amdgpu: Use explicit cardinality for clarity · e4e6a589
      Luben Tuikov committed
      RAS_MAX_RECORD_NUM may mean the maximum record
      number, as in the maximum house number on your
      street, or it may mean the maximum number of
      records, as in the count of records, which is also
      a number. To make this distinction whether the
      number is ordinal (index) or cardinal (count),
      rename this macro to RAS_MAX_RECORD_COUNT.
      
      This makes it easy to understand what it refers
      to, especially when we compute quantities such as,
      how many records do we have left in the table,
      especially when there are so many other numbers,
      quantities and numerical macros around.
      
      Also rename the long,
      amdgpu_ras_eeprom_get_record_max_length() to the
      more succinct and clear,
      amdgpu_ras_eeprom_max_record_count().
      
      When computing the threshold, which also deals
      with counts, i.e. "how many", use the cardinal
      "max_eeprom_records_count" rather than the
      quantitative "max_eeprom_records_len".
      
      Simplify the logic here and there, as well.
      
      Cc: Guchun Chen <guchun.chen@amd.com>
      Cc: John Clements <john.clements@amd.com>
      Cc: Hawking Zhang <Hawking.Zhang@amd.com>
      Cc: Alexander Deucher <Alexander.Deucher@amd.com>
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Acked-by: Alexander Deucher <Alexander.Deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      e4e6a589
    • drm/amdgpu: Return result fix in RAS · cf696091
      Luben Tuikov committed
      The low-level EEPROM write method doesn't return
      1, but the number of bytes written. Thus do not
      compare to 1; instead, compare to greater than 0
      for success.
      
      Other cleanup: if the lower layers returned
      -errno, then return that, as opposed to
      overwriting the error code with a one-size-fits-all
      -EINVAL. For instance, some return -EAGAIN.
      
      Cc: Jean Delvare <jdelvare@suse.de>
      Cc: Alexander Deucher <Alexander.Deucher@amd.com>
      Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
      Cc: Lijo Lazar <Lijo.Lazar@amd.com>
      Cc: Stanley Yang <Stanley.Yang@amd.com>
      Cc: Hawking Zhang <Hawking.Zhang@amd.com>
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      cf696091
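The return-value handling described in the commit above can be sketched as follows: treat any positive byte count as success and pass a negative errno through unchanged. The callback type, the two stub writers, and the choice of `-EIO` for a zero-byte write are hypothetical and exist only to make the example self-contained.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical low-level write: returns bytes written, or -errno. */
typedef int (*demo_write_fn)(const unsigned char *buf, size_t len);

/* Stub that pretends every byte was written. */
int demo_write_all(const unsigned char *buf, size_t len)
{
    (void)buf;
    return (int)len;
}

/* Stub that pretends the bus was busy. */
int demo_write_busy(const unsigned char *buf, size_t len)
{
    (void)buf;
    (void)len;
    return -EAGAIN;
}

/* Returns 0 on success, or the lower layer's own -errno on failure. */
int demo_table_write(demo_write_fn write_fn,
                     const unsigned char *buf, size_t len)
{
    int res = write_fn(buf, len);

    if (res < 0)
        return res;     /* keep the specific error, e.g. -EAGAIN */
    if (res == 0)
        return -EIO;    /* assumption: nothing written counts as failure */
    return 0;           /* any positive byte count is success */
}
```

Comparing against 1 would have reported a 24-byte write as a failure; comparing against 0 accepts it, and callers can still distinguish a retryable `-EAGAIN` from other errors.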
    • drm/amdgpu: RAS xfer to read/write · 1fab841f
      Luben Tuikov committed
      Wrap amdgpu_ras_eeprom_xfer(..., bool write)
      in amdgpu_ras_eeprom_read() and
      amdgpu_ras_eeprom_write(), as that makes the
      code easier to read and understand.
      
      Cc: Jean Delvare <jdelvare@suse.de>
      Cc: Alexander Deucher <Alexander.Deucher@amd.com>
      Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
      Cc: Lijo Lazar <Lijo.Lazar@amd.com>
      Cc: Stanley Yang <Stanley.Yang@amd.com>
      Cc: Hawking Zhang <Hawking.Zhang@amd.com>
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Acked-by: Alexander Deucher <Alexander.Deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      1fab841f
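The wrapping pattern from the commit above, a single transfer routine with a direction flag hidden behind named read and write entry points, might look roughly like this. The in-memory "EEPROM", the record type, and all `demo_` names are hypothetical stand-ins and not the driver's definitions.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CAPACITY 8u

/* Hypothetical record and backing store, for illustration only. */
struct demo_record { uint64_t addr; };
struct demo_eeprom { struct demo_record slots[DEMO_CAPACITY]; };

/* One routine does both directions, selected by the trailing flag. */
static int demo_eeprom_xfer(struct demo_eeprom *e,
                            struct demo_record *recs,
                            const uint32_t num, bool write)
{
    if (!e || !recs || num > DEMO_CAPACITY)
        return -EINVAL;
    if (write)
        memcpy(e->slots, recs, num * sizeof(*recs));
    else
        memcpy(recs, e->slots, num * sizeof(*recs));
    return 0;
}

/* Thin wrappers so call sites state their intent by name. */
int demo_eeprom_read(struct demo_eeprom *e,
                     struct demo_record *recs, const uint32_t num)
{
    return demo_eeprom_xfer(e, recs, num, false);
}

int demo_eeprom_write(struct demo_eeprom *e,
                      struct demo_record *recs, const uint32_t num)
{
    return demo_eeprom_xfer(e, recs, num, true);
}
```

A call such as `demo_eeprom_write(&e, recs, 2)` reads more clearly at the call site than `demo_eeprom_xfer(&e, recs, 2, true)`, where the meaning of the bare `true` has to be looked up.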
    • drm/amdgpu: Rename misspelled function · a4399657
      Luben Tuikov committed
      Instead of fixing the spelling in
        amdgpu_ras_eeprom_process_recods(),
      rename it to,
        amdgpu_ras_eeprom_xfer(),
      to look similar to other I2C and protocol
      transfer (read/write) functions.
      
      Also to keep the column span to within reason by
      using a shorter name.
      
      Change the "num" function parameter from "int" to
      "const u32" since it is the number of items
      (records) to xfer, i.e. their count, which cannot
      be a negative number.
      
      Also swap the order of parameters, keeping the
      pointer to records and their number next to each
      other, while the direction now becomes the last
      parameter.
      
      Cc: Jean Delvare <jdelvare@suse.de>
      Cc: Alexander Deucher <Alexander.Deucher@amd.com>
      Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
      Cc: Lijo Lazar <Lijo.Lazar@amd.com>
      Cc: Stanley Yang <Stanley.Yang@amd.com>
      Cc: Hawking Zhang <Hawking.Zhang@amd.com>
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Acked-by: Alexander Deucher <Alexander.Deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      a4399657
  16. 19 Jun 2021, 2 commits
  17. 12 Jun 2021, 1 commit
  18. 28 May 2021, 2 commits
  19. 20 May 2021, 2 commits
  20. 11 May 2021, 6 commits