1. 23 2月, 2015 5 次提交
  2. 17 2月, 2015 1 次提交
    • D
      EDAC, amd64_edac: Prevent OOPS with >16 memory controllers · 0c510cc8
      Daniel J Blueman 提交于
      When DRAM errors occur on memory controllers after EDAC_MAX_MCS (16),
      the kernel fatally dereferences unallocated structures, see splat below;
      this occurs on at least NumaConnect systems.
      
      Fix by checking if a memory controller info structure was found.
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000320
      IP: [<ffffffff819f714f>] decode_bus_error+0x2f/0x2b0
      PGD 2f8b5a3067 PUD 2f8b5a2067 PMD 0
      Oops: 0000 [#2] SMP
      Modules linked in:
      CPU: 224 PID: 11930 Comm: stream_c.exe.gn Tainted: G   D    3.19.0 #1
      Hardware name: Supermicro H8QGL/H8QGL, BIOS 3.5b    01/28/2015
      task: ffff8807dbfb8c00 ti: ffff8807dd16c000 task.ti: ffff8807dd16c000
      RIP: 0010:[<ffffffff819f714f>] [<ffffffff819f714f>] decode_bus_error+0x2f/0x2b0
      RSP: 0000:ffff8907dfc03c48 EFLAGS: 00010297
      RAX: 0000000000000001 RBX: 9c67400010080a13 RCX: 0000000000001dc6
      RDX: 000000001dc61dc6 RSI: ffff8907dfc03df0 RDI: 000000000000001c
      RBP: ffff8907dfc03ce8 R08: 0000000000000000 R09: 0000000000000022
      R10: ffff891fffa30380 R11: 00000000001cfc90 R12: 0000000000000008
      R13: 0000000000000000 R14: 000000000000001c R15: 00009c6740001000
      FS: 00007fa97ee18700(0000) GS:ffff8907dfc00000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000320 CR3: 0000003f889b8000 CR4: 00000000000407e0
      Stack:
       0000000000000000 ffff8907dfc03df0 0000000000000008 9c67400010080a13
       000000000000001c 00009c6740001000 ffff8907dfc03c88 ffffffff810e4f9a
       ffff8907dfc03ce8 ffffffff81b375b9 0000000000000000 0000000000000010
      Call Trace:
       <IRQ>
       ? vprintk_default
       ? printk
       amd_decode_mce
       notifier_call_chain
       atomic_notifier_call_chain
       mce_log
       machine_check_poll
       mce_timer_fn
       ? mce_cpu_restart
       call_timer_fn.isra.29
       run_timer_softirq
       __do_softirq
       irq_exit
       smp_apic_timer_interrupt
       apic_timer_interrupt
       <EOI>
       ? down_read_trylock
       __do_page_fault
       ? __schedule
       do_page_fault
       page_fault
      Signed-off-by: NDaniel J Blueman <daniel@numascale.com>
      Link: http://lkml.kernel.org/r/1424144078-24589-1-git-send-email-daniel@numascale.com
      Cc: stable@vger.kernel.org
      [ Boris: massage commit message ]
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      0c510cc8
  3. 09 2月, 2015 1 次提交
    • B
      sb_edac: Fix detection on SNB machines · 11249e73
      Borislav Petkov 提交于
      d0585cd8 ("sb_edac: Claim a different PCI device") changed the
      probing of sb_edac to look for PCI device 0x3ca0:
      
      3f:0e.0 System peripheral: Intel Corporation Xeon E5/Core i7 Processor Home Agent (rev 07)
      00: 86 80 a0 3c 00 00 00 00 07 00 80 08 00 00 80 00
      ...
      
      but we're matching for 0x3ca8, i.e. PCI_DEVICE_ID_INTEL_SBRIDGE_IMC_TA
      in sbridge_probe() therefore the probing fails.
      
      Changing it to probe for 0x3ca0 (PCI_DEVICE_ID_INTEL_SBRIDGE_IMC_HA0),
      .i.e., the 14.0 device, fixes the issue and driver loads successfully
      again:
      
      [ 2449.013120] EDAC DEBUG: sbridge_init:
      [ 2449.017029] EDAC sbridge: Seeking for: PCI ID 8086:3ca0
      [ 2449.022368] EDAC DEBUG: sbridge_get_onedevice: Detected 8086:3ca0
      [ 2449.028498] EDAC sbridge: Seeking for: PCI ID 8086:3ca0
      [ 2449.033768] EDAC sbridge: Seeking for: PCI ID 8086:3ca8
      [ 2449.039028] EDAC DEBUG: sbridge_get_onedevice: Detected 8086:3ca8
      [ 2449.045155] EDAC sbridge: Seeking for: PCI ID 8086:3ca8
      ...
      
      Add a debug printk while at it to be able to catch the failure in the
      future and dump driver version on successful load.
      
      Fixes: d0585cd8 ("sb_edac: Claim a different PCI device")
      Cc: stable@vger.kernel.org # 3.18
      Acked-by: NAristeu Rozanski <aris@redhat.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Acked-by: NAndy Lutomirski <luto@amacapital.net>
      Acked-by: NMauro Carvalho Chehab <m.chehab@samsung.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      11249e73
  4. 31 1月, 2015 1 次提交
  5. 30 1月, 2015 2 次提交
  6. 12 1月, 2015 1 次提交
  7. 07 1月, 2015 1 次提交
  8. 02 1月, 2015 1 次提交
  9. 21 12月, 2014 1 次提交
  10. 03 12月, 2014 2 次提交
  11. 02 12月, 2014 3 次提交
    • T
      sb_edac: Fix discovery of top-of-low-memory for Haswell · f7cf2a22
      Tony Luck 提交于
      Haswell moved the TOLM/TOHM registers to a different device and offset.
      The sb_edac driver accounted for the change of device, but not for the
      new offset.  There was also a typo in the constant to fill in the low
      26 bits (was 0x1ffffff, should be 0x3ffffff).
      
      This resulted in a bogus value for the top of low memory:
      
        EDAC DEBUG: get_memory_layout: TOLM: 0.032 GB (0x0000000001ffffff)
      
      which would result in EDAC refusing to translate addresses for
      errors above the bogus value and below 4GB:
      
         sbridge MC3: HANDLING MCE MEMORY ERROR
         sbridge MC3: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090
         sbridge MC3: TSC 0
         sbridge MC3: ADDR 2000000
         sbridge MC3: MISC 523eac86
         sbridge MC3: PROCESSOR 0:306f3 TIME 1414600951 SOCKET 0 APIC 0
         MC3: 1 CE Error at TOLM area, on addr 0x02000000 on any memory ( page:0x0 offset:0x0 grain:32 syndrome:0x0)
      
      With the fix we see the correct TOLM value:
      
         DEBUG: get_memory_layout: TOLM: 2.048 GB (0x000000007fffffff)
      
      and we decode address 2000000 correctly:
      
         sbridge MC3: HANDLING MCE MEMORY ERROR
         sbridge MC3: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090
         sbridge MC3: TSC 0
         sbridge MC3: ADDR 2000000
         sbridge MC3: MISC 523e1086
         sbridge MC3: PROCESSOR 0:306f3 TIME 1414601319 SOCKET 0 APIC 0
         DEBUG: get_memory_error_data: SAD interleave package: 0 = CPU socket 0, HA 0, shiftup: 0
         DEBUG: get_memory_error_data: TAD#0: address 0x0000000002000000 < 0x000000007fffffff, socket interleave 1, channel interleave 4 (offset 0x00000000), index 0, base ch: 0, ch mask: 0x01
         DEBUG: get_memory_error_data: RIR#0, limit: 4.095 GB (0x00000000ffffffff), way: 1
         DEBUG: get_memory_error_data: RIR#0: channel address 0x00200000 < 0xffffffff, RIR interleave 0, index 0
         DEBUG: sbridge_mce_output_error:  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0
         MC3: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x2000 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0090 socket:0 channel_mask:1 rank:0)
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Acked-by: NAristeu Rozanski <aris@redhat.com>
      Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>
      f7cf2a22
    • J
    • J
      sb_edac: Fix off-by-one error in number of channels · 50043e25
      Jim Snow 提交于
      This prevented edac sysfs code from properly handling 6 channels
      per memory controller.
      Signed-off-by: NJim Snow <jim.snow@intel.com>
      Signed-off-by: NLukasz Anaczkowski <lukasz.anaczkowski@intel.com>
      Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>
      50043e25
  12. 25 11月, 2014 5 次提交
  13. 20 11月, 2014 1 次提交
  14. 19 11月, 2014 1 次提交
  15. 12 11月, 2014 2 次提交
  16. 05 11月, 2014 2 次提交
  17. 30 10月, 2014 1 次提交
    • A
      amd64_edac: Add F15h M60h support · a597d2a5
      Aravind Gopalakrishnan 提交于
      This patch adds support for ECC error decoding for F15h M60h processor.
      Aside from the usual changes, the patch adds support for some new features
      in the processor:
       - DDR4(unbuffered, registered); LRDIMM DDR3 support
         - relevant debug messages have been modified/added to report these
           memory types
       - new dbam_to_cs mappers
         - if (F15h M60h && LRDIMM); we need a 'multiplier' value to find
           cs_size. This multiplier value is obtained from the per-dimm
           DCSM register. So, change the interface to accept a 'cs_mask_nr'
           value to facilitate this calculation
       - switch-casing determine_memory_type()
         - done to cleanse the function of too many if-else statements
           and improve readability
         - This is now called early in read_mc_regs() to cache dram_type
      
      Misc cleanup:
       - amd64_pci_table[] is condensed by using PCI_VDEVICE macro.
      
      Testing details:
      Tested the patch by injecting 'ECC' type errors using mce_amd_inj
      and error decoding works fine.
      Signed-off-by: NAravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
      Link: http://lkml.kernel.org/r/1414617483-4941-1-git-send-email-Aravind.Gopalakrishnan@amd.com
      [ Boris: determine_memory_type() cleanups ]
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      a597d2a5
  18. 23 10月, 2014 4 次提交
  19. 20 10月, 2014 4 次提交
  20. 09 10月, 2014 1 次提交
    • A
      sb_edac: Claim a different PCI device · d0585cd8
      Andy Lutomirski 提交于
      sb_edac controls a large number of different PCI functions.  Rather
      than registering as a normal PCI driver for all of them, it
      registers for just one so that it gets probed and, at probe time, it
      looks for all the others.
      
      Coincidentally, the device it registers for also contains the SMBUS
      registers, so the PCI core will refuse to probe both sb_edac and a
      future iMC SMBUS driver.  The drivers don't actually conflict, so
      just change sb_edac's device table to probe a different device.
      
      An alternative fix would be to merge the two drivers, but sb_edac
      will also refuse to load on non-ECC systems, whereas i2c_imc would
      still be useful without ECC.
      
      The only user-visible change should be that sb_edac appears to bind
      a different device.
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      Cc: Rui Wang <ruiv.wang@gmail.com>
      Acked-by: NAristeu Rozanski <aris@redhat.com>
      Signed-off-by: NMauro Carvalho Chehab <mchehab@osg.samsung.com>
      d0585cd8