1. 24 10月, 2013 2 次提交
  2. 26 2月, 2013 8 次提交
    • W
      ghes_edac: fix to use list_for_each_entry_safe() when delete list items · 5dae92a7
      Wei Yongjun 提交于
      Since we will remove items off the list using list_del() we need
      to use a safe version of the list_for_each_entry() macro aptly named
      list_for_each_entry_safe().
      Signed-off-by: NWei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      5dae92a7
    • M
      ghes_edac: Fix RAS tracing · 8ae8f50a
      Mauro Carvalho Chehab 提交于
      With the current version of CPER, there's no way to associate an
      error with the memory error. So, the error location in EDAC
      layers is unused.
      
      As CPER has its own idea about memory architectural layers, just
      output whatever is there inside the driver's detail at the RAS
      tracepoint.
      
      The EDAC location keeps untouched, in the case that, in some future,
      we could actually map the error into the dimm labels.
      
      Now, the error message:
      
      [   72.396625] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
      [   72.396627] {1}[Hardware Error]: APEI generic hardware error status
      [   72.396628] {1}[Hardware Error]: severity: 2, corrected
      [   72.396630] {1}[Hardware Error]: section: 0, severity: 2, corrected
      [   72.396632] {1}[Hardware Error]: flags: 0x01
      [   72.396634] {1}[Hardware Error]: primary
      [   72.396635] {1}[Hardware Error]: section_type: memory error
      [   72.396637] {1}[Hardware Error]: error_status: 0x0000000000000400
      [   72.396638] {1}[Hardware Error]: node: 3
      [   72.396639] {1}[Hardware Error]: card: 0
      [   72.396640] {1}[Hardware Error]: module: 0
      [   72.396641] {1}[Hardware Error]: device: 0
      [   72.396643] {1}[Hardware Error]: error_type: 18, unknown
      [   72.396666] EDAC MC0: 1 CE reserved error (18) on unknown label (node:3 card:0 module:0 page:0x0 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in DRAM memory)
      
      Is properly represented on the trace event:
      
           kworker/0:2-584   [000] ....    72.396657: mc_event: 1 Corrected error: reserved error (18) on unknown label (mc:0 location:-1:-1:-1 address:0x00000000 grain:1 syndrome:0x00000000 APEI location: node:3 card:0 module:0 status(0x0000000000000400): Storage error in DRAM memory)
      
      Tested on a 4 sockets E5-4650 Sandy Bridge machine.
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      8ae8f50a
    • M
      ghes_edac: Make it compliant with UEFI spec 2.3.1 · 689c9cd8
      Mauro Carvalho Chehab 提交于
      The UEFI spec defines the memory error types ans the bits that
      validate each field on the memory error record, at
      Appendix N om items N.2.5 (Memory Error Section) and
      N.2.11 (Error Status). Make the error description compliant with
      it, only showing the valid fields.
      
      The EDAC error log is now properly reporting the error:
      
      [  281.556854] mce: [Hardware Error]: Machine check events logged
      [  281.557042] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
      [  281.557044] {2}[Hardware Error]: APEI generic hardware error status
      [  281.557046] {2}[Hardware Error]: severity: 2, corrected
      [  281.557048] {2}[Hardware Error]: section: 0, severity: 2, corrected
      [  281.557050] {2}[Hardware Error]: flags: 0x01
      [  281.557052] {2}[Hardware Error]: primary
      [  281.557053] {2}[Hardware Error]: section_type: memory error
      [  281.557055] {2}[Hardware Error]: error_status: 0x0000000000000400
      [  281.557056] {2}[Hardware Error]: node: 3
      [  281.557057] {2}[Hardware Error]: card: 0
      [  281.557058] {2}[Hardware Error]: module: 1
      [  281.557059] {2}[Hardware Error]: device: 0
      [  281.557061] {2}[Hardware Error]: error_type: 18, unknown
      [  281.557067] EDAC DEBUG: ghes_edac_report_mem_error: error validation_bits: 0x000040b9
      [  281.557084] EDAC MC0: 1 CE reserved error (18) on unknown label (node:3 card:0 module:1 page:0x0 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in DRAM memory)
      
      Tested on a 4 CPUs E5-4650 Sandy Bridge machine.
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      689c9cd8
    • M
      ghes_edac: Improve driver's printk messages · d2a68566
      Mauro Carvalho Chehab 提交于
      Provide a better infrastructure for printk's inside the driver:
      	- use edac_dbg() for debug messages;
      	- standardize the usage of pr_info();
      	- provide warning about the risk of relying on this
      	  driver.
      
      While here, changes the size of a fake memory to 1 page. This is
      as good or as bad as 1000 pages, but it is easier for userspace to
      detect, as I don't expect that any machine implementing GHES would
      provide just 1 page available ;)
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      
      Conflicts:
      	drivers/edac/ghes_edac.c
      d2a68566
    • M
      ghes_edac: Don't credit the same memory dimm twice · 5ee726db
      Mauro Carvalho Chehab 提交于
      On my tests on a 4xE5-4650 CPU's system, the GHES
      EDAC driver is called twice. As the SMBIOS DMI enumeration
      call will seek for the entire DIMM sockets in the system, on
      this machine, equipped with 128 GB of RAM, the memory is
      displayed twice:
      
                +-----------------------+
                |    mc0    |    mc1    |
      ----------+-----------------------+
      memory45: |  8192 MB  |  8192 MB  |
      memory44: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory43: |     0 MB  |     0 MB  |
      memory42: |  8192 MB  |  8192 MB  |
      ----------+-----------------------+
      memory41: |     0 MB  |     0 MB  |
      memory40: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory39: |  8192 MB  |  8192 MB  |
      memory38: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory37: |     0 MB  |     0 MB  |
      memory36: |  8192 MB  |  8192 MB  |
      ----------+-----------------------+
      memory35: |     0 MB  |     0 MB  |
      memory34: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory33: |  8192 MB  |  8192 MB  |
      memory32: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory31: |     0 MB  |     0 MB  |
      memory30: |  8192 MB  |  8192 MB  |
      ----------+-----------------------+
      memory29: |     0 MB  |     0 MB  |
      memory28: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory27: |  8192 MB  |  8192 MB  |
      memory26: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory25: |     0 MB  |     0 MB  |
      memory24: |  8192 MB  |  8192 MB  |
      ----------+-----------------------+
      memory23: |     0 MB  |     0 MB  |
      memory22: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory21: |  8192 MB  |  8192 MB  |
      memory20: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory19: |     0 MB  |     0 MB  |
      memory18: |  8192 MB  |  8192 MB  |
      ----------+-----------------------+
      memory17: |     0 MB  |     0 MB  |
      memory16: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory15: |  8192 MB  |  8192 MB  |
      memory14: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory13: |     0 MB  |     0 MB  |
      memory12: |  8192 MB  |  8192 MB  |
      ----------+-----------------------+
      memory11: |     0 MB  |     0 MB  |
      memory10: |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory9:  |  8192 MB  |  8192 MB  |
      memory8:  |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory7:  |     0 MB  |     0 MB  |
      memory6:  |  8192 MB  |  8192 MB  |
      ----------+-----------------------+
      memory5:  |     0 MB  |     0 MB  |
      memory4:  |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory3:  |  8192 MB  |  8192 MB  |
      memory2:  |     0 MB  |     0 MB  |
      ----------+-----------------------+
      memory1:  |     0 MB  |     0 MB  |
      memory0:  |  8192 MB  |  8192 MB  |
      ----------+-----------------------+
      
      Total sum of 256 GB.
      
      As there's no reliable way to credit DIMMS to the right memory
      controller, just put everything on memory controller 0 (with should
      always exist).
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      5ee726db
    • M
      ghes_edac: do a better job of filling EDAC DIMM info · 32fa1f53
      Mauro Carvalho Chehab 提交于
      Instead of just faking a random value for the DIMM data, get
      the information that it is available via DMI table.
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      32fa1f53
    • M
      ghes_edac: add support for reporting errors via EDAC · f04c62a7
      Mauro Carvalho Chehab 提交于
      Now that the EDAC core is capable of just forward the errors via
      the userspace API, add a report mechanism for the GHES errors.
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      f04c62a7
    • M
      ghes_edac: Register at EDAC core the BIOS report · 77c5f5d2
      Mauro Carvalho Chehab 提交于
      Register GHES at EDAC MC core, in order to avoid other
      drivers to also handle errors and mangle with error data.
      
      The edac core will warrant that just one driver will be used,
      so the first one to register (BIOS first) will be the one that
      will be reporting the hardware errors.
      
      For now, the EDAC driver does nothing but to register at the
      EDAC core, preventing the hardware-driven mechanism to
      interfere with GHES.
      Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
      77c5f5d2