1. 21 Jun, 2019 1 commit
  2. 31 May, 2019 1 commit
  3. 29 Sep, 2018 1 commit
  4. 23 Sep, 2018 1 commit
  5. 15 Sep, 2018 1 commit
  6. 11 Sep, 2018 2 commits
    • EDAC, sb_edac: Fix reporting for patrol scrubber errors · 8489b17c
      Committed by Qiuxu Zhuo
      sb_edac sometimes reports the wrong DIMM for a memory error found by
      the patrol scrubber. That is because the hardware provides only a 4KB
      page-aligned address for the error case.
      
      This means that the EDAC driver will point at the DIMM matching offset
      0x0 in the 4KB page, but because of interleaving across channels and
      ranks, the actual DIMM involved may be different if the error is on some
      other cache line within the page.
      
      Therefore, reconstruct the socket/iMC/channel information from the "mce"
      structure passed to the EDAC driver. The DIMM cannot be determined, so
      pass "dimm=-1" to the EDAC core. It will report that all the DIMMs on
      that channel may be affected.
      Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
      Cc: Aristeu Rozanski <aris@redhat.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20180907230828.13901-3-tony.luck@intel.com
      [ Improve comments on the functions to convert bank number
        to memory controller number. Minor cleanup to commit message. ]
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      [ Massage commit message more. ]
      Signed-off-by: Borislav Petkov <bp@suse.de>
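The effect described in this commit can be illustrated with a toy model. The sketch below is not the real Sandy Bridge address decode (which depends on the SAD/TAD register contents); it assumes a hypothetical 4-channel system that rotates consecutive 64-byte cache lines across channels, and all names in it are made up for illustration. It shows how rounding an address down to its 4KB page can change the channel the decode lands on.

```c
#include <stdint.h>

/* Toy assumptions, not the real Sandy Bridge address decode. */
#define TOY_CACHE_LINE_SIZE 64u          /* bytes per cache line */
#define TOY_NUM_CHANNELS    4u           /* hypothetical channel count */
#define TOY_PAGE_MASK_4K    (~(uint64_t)0xfff)

/* Hypothetical interleave: consecutive cache lines rotate across channels. */
static unsigned int toy_channel_of(uint64_t addr)
{
	return (unsigned int)((addr / TOY_CACHE_LINE_SIZE) % TOY_NUM_CHANNELS);
}

/* What a patrol scrubber report carries: the address rounded down to 4KB. */
static uint64_t toy_page_aligned(uint64_t addr)
{
	return addr & TOY_PAGE_MASK_4K;
}
```

In this model an error at 0x1040 sits on the second cache line of its page, so `toy_channel_of()` gives a different channel for the true address than for `toy_page_aligned()` of that address. That is why the commit stops short of naming a DIMM.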
    • EDAC, sb_edac: Return early on ADDRV bit and address type test · dcc960b2
      Committed by Qiuxu Zhuo
      Users of the mce_register_decode_chain() are called for every logged
      error. EDAC drivers should check:
      
      1) Is this a memory error? [bit 7 in status register]
      2) Is there a valid address? [bit 58 in status register]
      3) Is the address a system address? [bitfield 8:6 in misc register]
      
      The sb_edac driver performed test "1" twice, waited far too long to
      perform check "2", and didn't do check "3" at all.
      
      Fix it by moving the test for valid address from
      sbridge_mce_output_error() into sbridge_mce_check_error() and add a test
      for the type immediately after. Delete the redundant check for the type
      of the error from sbridge_mce_output_error().
      Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
      Cc: Aristeu Rozanski <aris@redhat.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20180907230828.13901-2-tony.luck@intel.com
      [ Re-word commit message. ]
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
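The three checks listed in this commit message can be sketched in userspace C as follows. This is a minimal illustration, not the driver's code: the function names are hypothetical, the bit positions are taken from the commit message, and the value 2 for "system address" in the misc bitfield 8:6 is an assumption about the address-mode encoding.

```c
#include <stdbool.h>
#include <stdint.h>

/* Extract bits lo..hi (inclusive) from a 64-bit register value. */
static uint64_t get_bitfield(uint64_t v, unsigned int lo, unsigned int hi)
{
	unsigned int width = hi - lo + 1;
	uint64_t mask = (width >= 64) ? ~0ULL : ((1ULL << width) - 1);

	return (v >> lo) & mask;
}

/* Assumption: address-mode value meaning "system (physical) address". */
#define ADDR_MODE_SYSTEM 2

/* 1) Is this a memory error? [bit 7 in status register] */
static bool is_memory_error(uint64_t status)
{
	return get_bitfield(status, 7, 7) == 1;
}

/* 2) Is there a valid address? [bit 58 in status register] */
static bool has_valid_address(uint64_t status)
{
	return get_bitfield(status, 58, 58) == 1;
}

/* 3) Is the address a system address? [bitfield 8:6 in misc register] */
static bool is_system_address(uint64_t misc)
{
	return get_bitfield(misc, 6, 8) == ADDR_MODE_SYSTEM;
}

/* Run all three tests up front, before any decoding work is done. */
static bool should_decode(uint64_t status, uint64_t misc)
{
	if (!is_memory_error(status))
		return false;
	if (!has_valid_address(status))
		return false;
	return is_system_address(misc);
}
```

Doing all three tests in one early-exit function mirrors the intent of the fix: an event that fails any check is dropped before the output path runs.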
  7. 03 Sep, 2018 1 commit
  8. 25 Jul, 2018 1 commit
  9. 17 Mar, 2018 1 commit
  10. 23 Feb, 2018 1 commit
  11. 19 Oct, 2017 1 commit
  12. 11 Oct, 2017 1 commit
  13. 27 Sep, 2017 1 commit
    • EDAC, sb_edac: Don't create a second memory controller if HA1 is not present · 15cc3ae0
      Committed by Qiuxu Zhuo
      Yi Zhang reported the following failure on a 2-socket Haswell (E5-2603v3)
      server (DELL PowerEdge 730xd):
      
        EDAC sbridge: Some needed devices are missing
        EDAC MC: Removed device 0 for sb_edac.c Haswell SrcID#0_Ha#0: DEV 0000:7f:12.0
        EDAC MC: Removed device 1 for sb_edac.c Haswell SrcID#1_Ha#0: DEV 0000:ff:12.0
        EDAC sbridge: Couldn't find mci handler
        EDAC sbridge: Couldn't find mci handler
        EDAC sbridge: Failed to register device with error -19.
      
      The refactored sb_edac driver creates the IMC1 (the 2nd memory
      controller) if any IMC1 device is present. In this case only
      HA1_TA of IMC1 was present, but the driver expected to find
      HA1/HA1_TM/HA1_TAD[0-3] devices too, leading to the above failure.
      
      The document [1] says the 'E5-2603 v3' CPU has 4 memory channels max. Yi
      Zhang inserted one DIMM per channel for each CPU and ran a random error
      address injection test with this patch:
      
            4024  addresses fell in TOLM hole area
           12715  addresses fell in CPU_SrcID#0_Ha#0_Chan#0_DIMM#0
           12774  addresses fell in CPU_SrcID#0_Ha#0_Chan#1_DIMM#0
           12798  addresses fell in CPU_SrcID#0_Ha#0_Chan#2_DIMM#0
           12913  addresses fell in CPU_SrcID#0_Ha#0_Chan#3_DIMM#0
           12674  addresses fell in CPU_SrcID#1_Ha#0_Chan#0_DIMM#0
           12686  addresses fell in CPU_SrcID#1_Ha#0_Chan#1_DIMM#0
           12882  addresses fell in CPU_SrcID#1_Ha#0_Chan#2_DIMM#0
           12934  addresses fell in CPU_SrcID#1_Ha#0_Chan#3_DIMM#0
          106400  addresses were injected totally.
      
      The test result shows that all the 4 channels belong to IMC0 per CPU, so
      the server really only has one IMC per CPU.
      
      In the 1st page of chapter 2 in datasheet [2], it also says 'E5-2600 v3'
      implements either one or two IMCs. For CPUs with one IMC, IMC1 is not
      used and should be ignored.
      
      Thus, do not create a second memory controller if the key HA1 is absent.
      
      [1] http://ark.intel.com/products/83349/Intel-Xeon-Processor-E5-2603-v3-15M-Cache-1_60-GHz
      [2] https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xeon-e5-v3-datasheet-vol-2.pdf
      Reported-and-tested-by: Yi Zhang <yizhan@redhat.com>
      Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Fixes: e2f747b1 ("EDAC, sb_edac: Assign EDAC memory controller per h/w controller")
      Link: http://lkml.kernel.org/r/20170913104214.7325-1-qiuxu.zhuo@intel.com
      [ Massage commit message. ]
      Signed-off-by: Borislav Petkov <bp@suse.de>
  14. 25 Sep, 2017 1 commit
  15. 21 Sep, 2017 1 commit
  16. 02 Aug, 2017 1 commit
    • EDAC, sb_edac: Classify memory mirroring modes · 039d7af6
      Committed by Qiuxu Zhuo
      There are two memory mirroring modes: full memory mirroring and address
      range partial memory mirroring (supported by Haswell EX and Broadwell EX).
      
      a) In full memory mirroring, the memory behind each memory controller
         is mirrored, i.e. the memory is split into two identical mirrors
         (primary and secondary), half of the memory is reserved for redundancy.
      
      b) In address range partial memory mirroring, the memory size (range)
         of primary and secondary behind each memory controller can be user
         defined by the TAD0 register. The rest of memory ranges defined by
         TAD1/TAD2/... in that memory controller are non-mirrored.
      
      For more detail on memory mirroring, see the following link written by Tony Luck:
      
        https://01.org/lkp/blogs/tonyluck/2016/address-range-partial-memory-mirroring-linux
      
      Currently the sb_edac driver only supports address decoding in full
      memory mirroring and non-mirroring modes. In address range partial
      memory mirroring mode, it may fail to decode an address that falls in a
      non-mirrored area, as in the following log:
      
        mce: Uncorrected hardware memory error in user-access at 566d53a400
        Memory failure: 0x566d53a8: Killing einj_mem_uc:4647 due to hardware memory corruption
        Memory failure: 0x566d53a8: recovery action for dirty LRU page: Recovered
        mce: [Hardware Error]: Machine check events logged
        EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
        EDAC sbridge MC1: CPU 48: Machine Check Event: 0 Bank 7: ec00000000010090
        EDAC sbridge MC1: TSC 4b914aa5a99dab
        EDAC sbridge MC1: ADDR 566d53a400
        EDAC sbridge MC1: MISC 1443a0c86
        EDAC sbridge MC1: PROCESSOR 0:406f1 TIME 1499712764 SOCKET 2 APIC 80
        EDAC MC1: 0 UE Can't discover the memory rank for ch addr 0x7fb54e900 on any memory ( page:0x0 offset:0x0 grain:32)
        mce: [Hardware Error]: Machine check events logged
      
      Therefore, classify the memory mirroring modes and fix address decoding
      in address range partial memory mirroring mode.
      Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20170730180651.30060-1-qiuxu.zhuo@intel.com
      Signed-off-by: Borislav Petkov <bp@suse.de>
  17. 17 Jul, 2017 1 commit
  18. 14 Jun, 2017 1 commit
    • EDAC, sb_edac: Avoid creating SOCK memory controller · 133e4455
      Committed by Qiuxu Zhuo
      Xiaolong Ye reported the following failure on a Broadwell D server:
      
        EDAC sbridge: Some needed devices are missing
        EDAC MC: Removed device 0 for sbridge_edac.c Broadwell SrcID#0_Ha#0: DEV 0000:ff:12.0
        EDAC sbridge: Couldn't find mci handler
        EDAC sbridge: Failed to register device with error -19.
      
      Broadwell D (only IMC0 per socket) and Broadwell X (IMC0 and IMC1 per
      socket) use the same PCI device IDs for IMC0, so they share
      pci_dev_descr_broadwell_table (n_imcs_per_sock=2). As a result,
      Broadwell D wrongly creates the nonexistent SOCK EDAC memory controller
      and reports the above error messages, since it has no IMC1.
      
      Avoid creating the nonexistent SOCK memory controller.
      Reported-and-tested-by: Xiaolong Ye <xiaolong.ye@intel.com>
      Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20170608113351.25323-1-qiuxu.zhuo@intel.com
      [ Massage. ]
      Signed-off-by: Borislav Petkov <bp@suse.de>
  19. 25 May, 2017 8 commits
  20. 10 Apr, 2017 1 commit
  21. 24 Jan, 2017 1 commit
  22. 23 Jan, 2017 1 commit
  23. 15 Dec, 2016 1 commit
  24. 19 Oct, 2016 2 commits
  25. 13 Sep, 2016 1 commit
  26. 08 Aug, 2016 1 commit
    • EDAC, sb_edac: Fix channel reporting on Knights Landing · c5b48fa7
      Committed by Lukasz Odzioba
      On the Intel Xeon Phi Knights Landing processor family, the channels of
      the memory controllers have an atypical arrangement: MC0 is mapped to
      CH3,4,5 and MC1 is mapped to CH0,1,2. This causes the EDAC driver to
      report the channel name incorrectly.
      
      We missed this change earlier, so the code already contains a similar
      comment, but the translation function is incorrect.
      
      Without this patch:
        errors in DIMM_A and DIMM_D were reported in DIMM_D
        errors in DIMM_B and DIMM_E were reported in DIMM_E
        errors in DIMM_C and DIMM_F were reported in DIMM_F
      
      Correct this.
      
      Hubert Chrzaniuk:
       - rebased to 4.8
       - comments and code cleanup
      
      Fixes: d0cdf900 ("sb_edac: Add Knights Landing (Xeon Phi gen 2) support")
      Reviewed-by: Tony Luck <tony.luck@intel.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Cc: lukasz.anaczkowski@intel.com
      Cc: lukasz.odzioba@intel.com
      Cc: mchehab@kernel.org
      Cc: <stable@vger.kernel.org> # v4.5..
      Link: http://lkml.kernel.org/r/1469231089-22837-1-git-send-email-lukasz.odzioba@intel.com
      Signed-off-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
      [ Boris: Simplify a bit by removing char mc. ]
      Signed-off-by: Borislav Petkov <bp@suse.de>
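The channel arrangement described in this commit (MC0 mapped to CH3,4,5 and MC1 mapped to CH0,1,2) amounts to a small translation from a per-controller logical channel to a physical channel index. The sketch below is written from that description alone; the function name and signature are hypothetical and may differ from the driver's actual helper.

```c
/* Each Knights Landing memory controller drives three channels. */
#define KNL_CHANNELS_PER_MC 3

/*
 * Translate a logical channel (0..2) within memory controller `mc`
 * (0 or 1) to the physical channel index (0..5), following the
 * arrangement in the commit message:
 *   MC0 -> CH3,4,5
 *   MC1 -> CH0,1,2
 */
static int knl_channel_remap_sketch(int mc, int chan)
{
	return (mc == 0) ? chan + KNL_CHANNELS_PER_MC : chan;
}
```

For example, logical channel 0 on MC0 maps to physical channel 3, while logical channel 0 on MC1 stays at physical channel 0. Collapsing both controllers onto the same three indices would match the kind of misreporting listed under "Without this patch".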
  27. 16 Jul, 2016 1 commit
  28. 03 Jun, 2016 2 commits
  29. 03 May, 2016 1 commit
    • EDAC, sb_edac: Use cpu family/model in driver detection · 2c1ea4c7
      Committed by Tony Luck
      Instead of picking a random PCI ID from the dozen or so we need to
      access, just use x86_match_cpu() to pick based on CPU model number. The
      choosing of PCI devices has been problematic in the past, see
      
        11249e73 ("sb_edac: Fix detection on SNB machines")
      
      which fixed problems introduced by
      
        d0585cd8 ("sb_edac: Claim a different PCI device").
      
      This is especially ugly if future hardware might not even have
      EDAC-relevant registers in PCI config space and we would still be
      required to choose some "random" PCI devices to scan for just so our
      driver loads.
      
      Is this cleaner/clearer? It deletes much more code than it adds. Only
      tested on Broadwell. The driver loads/unloads and loads again. Still
      decodes errors too.
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Suggested-by: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
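The idea of selecting the driver variant by CPU family and model rather than by PCI ID can be sketched in userspace as a table lookup. This is only an illustration: in the kernel the matching is done by x86_match_cpu() against an x86_cpu_id table, whereas the table, enum, and function names below are hypothetical. The model numbers are listed as an assumption and should be checked against the kernel's Intel family definitions.

```c
#include <stddef.h>
#include <stdint.h>

enum sb_type_sketch {
	TYPE_UNKNOWN, SANDY_BRIDGE, IVY_BRIDGE, HASWELL, BROADWELL, KNIGHTS_LANDING
};

struct cpu_match_sketch {
	unsigned int family;
	unsigned int model;
	enum sb_type_sketch type;
};

/* Assumed family 6 model numbers for the server parts this driver covers. */
static const struct cpu_match_sketch match_table[] = {
	{ 6, 0x2d, SANDY_BRIDGE },
	{ 6, 0x3e, IVY_BRIDGE },
	{ 6, 0x3f, HASWELL },
	{ 6, 0x4f, BROADWELL },        /* Broadwell X */
	{ 6, 0x56, BROADWELL },        /* Broadwell D */
	{ 6, 0x57, KNIGHTS_LANDING },
};

/* Decode family and model from a CPUID signature such as 0x406f1. */
static void decode_signature(uint32_t sig, unsigned int *family, unsigned int *model)
{
	unsigned int base_family = (sig >> 8) & 0xf;

	*family = base_family;
	*model = (sig >> 4) & 0xf;
	if (base_family == 0x6 || base_family == 0xf)
		*model |= ((sig >> 16) & 0xf) << 4;   /* extended model */
	if (base_family == 0xf)
		*family += (sig >> 20) & 0xff;        /* extended family */
}

/* Pick the driver variant for a CPUID signature, or TYPE_UNKNOWN. */
static enum sb_type_sketch lookup_cpu(uint32_t sig)
{
	unsigned int family, model;
	size_t i;

	decode_signature(sig, &family, &model);
	for (i = 0; i < sizeof(match_table) / sizeof(match_table[0]); i++) {
		if (match_table[i].family == family && match_table[i].model == model)
			return match_table[i].type;
	}
	return TYPE_UNKNOWN;
}
```

A signature that is not in the table yields `TYPE_UNKNOWN`, which corresponds to the driver declining to load instead of probing for an arbitrary PCI device.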
  30. 29 Apr, 2016 1 commit