1. 22 11月, 2016 1 次提交
    • Y
      x86/mce/AMD: Add system physical address translation for AMD Fam17h · f5382de9
      Yazen Ghannam 提交于
      The Unified Memory Controllers (UMCs) on Fam17h log a normalized address
      in their MCA_ADDR registers. We need to convert that normalized address
      to a system physical address in order to support a few facilities:
      
      1) To offline poisoned pages in DRAM proactively in the deferred error
         handler.
      
      2) To print sysaddr and page info for DRAM ECC errors in EDAC.
      
      [ Boris: fixes/cleanups ontop:
      
        * hi_addr_offset = 0 - no need for that branch. Stick it all under the
          HiAddrOffsetEn case. It confines hi_addr_offset's declaration too.
      
        * Move variables to the innermost scope they're used at so that we save
          on stack and not blow it up immediately on function entry.
      
        * Do not modify *sys_addr prematurely - we want to not exit early and
          have modified *sys_addr some, which callers get to see. We either
          convert to a sys_addr or we don't do anything. And we signal that with
          the retval of the function.
      
        * Rename label out -> out_err - because it is the error path.
      
        * No need to pr_err of the conversion failed case: imagine a
          sparsely-populated machine with UMCs which don't have DIMMs. Callers
          should look at the retval instead and issue a printk only when really
          necessary. No need for useless info in dmesg.
      
        * s/temp_reg/tmp/ and other variable names shortening => shorter code.
      
        * Use BIT() everywhere.
      
        * Make error messages more informative.
      
        *  Small build fix for the !CONFIG_X86_MCE_AMD case.
      
        * ... and more minor cleanups.
      ]
      Signed-off-by: NYazen Ghannam <Yazen.Ghannam@amd.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20161122111133.mjzpvzhf7o7yl2oa@pd.tnic
      [ Typo fixes. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      f5382de9
  2. 11 11月, 2016 1 次提交
    • Y
      x86/mce/AMD: Fix HWID_MCATYPE calculation by grouping arguments · 859af13a
      Yazen Ghannam 提交于
      The calculation of the hwid_mcatype value in get_smca_bank_info()
      became incorrect after applying the following commit:
      
        1ce9cd7f ("x86/RAS: Simplify SMCA HWID descriptor struct")
      
      This causes the function to not match a bank to its type.
      
      Disassembly of hwid_mcatype calculation after change:
      
            db:       8b 45 e0                mov    -0x20(%rbp),%eax
            de:       41 89 c4                mov    %eax,%r12d
            e1:       25 00 00 ff 0f          and    $0xfff0000,%eax
            e6:       41 c1 ec 10             shr    $0x10,%r12d
            ea:       41 09 c4                or     %eax,%r12d
      
      Disassembly of hwid_mcatype calculation in original code:
      
           286:       8b 45 d0                mov    -0x30(%rbp),%eax
           289:       41 89 c5                mov    %eax,%r13d
           28c:       c1 e8 10                shr    $0x10,%eax
           28f:       41 81 e5 ff 0f 00 00    and    $0xfff,%r13d
           296:       41 c1 e5 10             shl    $0x10,%r13d
           29a:       41 09 c5                or     %eax,%r13d
      
      Grouping the arguments to the HWID_MCATYPE() macro fixes the issue.
      
      ( Boris suggested adding parentheses in the macro. )
      Signed-off-by: NYazen Ghannam <Yazen.Ghannam@amd.com>
      Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux-edac@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      859af13a
  3. 09 11月, 2016 4 次提交
  4. 13 9月, 2016 2 次提交
  5. 12 5月, 2016 1 次提交
  6. 03 5月, 2016 1 次提交
    • Y
      x86/mce: Define vendor-specific MSR accessors · a9750a31
      Yazen Ghannam 提交于
      Scalable MCA processors have a whole new range of MSR addresses to
      obtain bank related info such as CTL, MISC, ADDR, STATUS. Therefore, we
      need a way to abstract the MSR addresses per vendor.
      
      Carved out from a patch by Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>.
      Signed-off-by: NYazen Ghannam <Yazen.Ghannam@amd.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
      Cc: Ashok Raj <ashok.raj@intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1462019637-16474-5-git-send-email-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a9750a31
  7. 08 3月, 2016 4 次提交
  8. 18 2月, 2016 1 次提交
  9. 01 11月, 2015 1 次提交
  10. 13 8月, 2015 4 次提交
  11. 07 6月, 2015 2 次提交
  12. 07 5月, 2015 2 次提交
  13. 24 3月, 2015 2 次提交
  14. 19 2月, 2015 1 次提交
    • B
      x86/MCE/intel: Cleanup CMCI storm logic · 3f2f0680
      Borislav Petkov 提交于
      Initially, this started with the yet another report about a race
      condition in the CMCI storm adaptive period length thing. Yes, we have
      to admit, it is fragile and error prone. So let's simplify it.
      
      The simpler logic is: now, after we enter storm mode, we go straight to
      polling with CMCI_STORM_INTERVAL, i.e. once a second. We remain in storm
      mode as long as we see errors being logged while polling.
      
      Theoretically, if we see an uninterrupted error stream, we will remain
      in storm mode indefinitely and keep polling the MSRs.
      
      However, when the storm is actually a burst of errors, once we have
      logged them all, we back out of it after ~5 mins of polling and no more
      errors logged.
      
      If we encounter an error during those 5 minutes, we reset the polling
      interval to 5 mins.
      
      Making machine_check_poll() return a bool and denoting whether it has
      seen an error or not lets us simplify a bunch of code and move the storm
      handling private to mce_intel.c.
      
      Some minor cleanups while at it.
      Reported-by: NCalvin Owens <calvinowens@fb.com>
      Tested-by: NTony Luck <tony.luck@intel.com>
      Link: http://lkml.kernel.org/r/1417746575-23299-1-git-send-email-calvinowens@fb.comSigned-off-by: NBorislav Petkov <bp@suse.de>
      3f2f0680
  15. 07 1月, 2015 1 次提交
  16. 20 11月, 2014 1 次提交
  17. 22 10月, 2014 1 次提交
  18. 05 6月, 2014 1 次提交
  19. 07 1月, 2014 1 次提交
  20. 24 10月, 2013 1 次提交
  21. 06 8月, 2013 1 次提交
  22. 09 7月, 2013 1 次提交
    • N
      mce: acpi/apei: Honour Firmware First for MCA banks listed in APEI HEST CMC · c3d1fb56
      Naveen N. Rao 提交于
      The Corrected Machine Check structure (CMC) in HEST has a flag which can be
      set by the firmware to indicate to the OS that it prefers to process the
      corrected error events first. In this scenario, the OS is expected to not
      monitor for corrected errors (through CMCI/polling). Instead, the firmware
      notifies the OS on corrected error events through GHES.
      
      Linux already has support for GHES. This patch adds support for parsing CMC
      structure and to disable CMCI/polling if the firmware first flag is set.
      
      Further, the list of machine check bank structures at the end of CMC is used
      to determine which MCA banks function in FF mode, so that we continue to
      monitor error events on the other banks.
      Signed-off-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Acked-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      c3d1fb56
  23. 13 6月, 2013 1 次提交
  24. 05 6月, 2013 1 次提交
  25. 03 4月, 2013 1 次提交
    • S
      x86/mce: Rework cmci_rediscover() to play well with CPU hotplug · 7a0c819d
      Srivatsa S. Bhat 提交于
      Dave Jones reports that offlining a CPU leads to this trace:
      
      numa_remove_cpu cpu 1 node 0: mask now 0,2-3
      smpboot: CPU 1 is now offline
      BUG: using smp_processor_id() in preemptible [00000000] code:
      cpu-offline.sh/10591
      caller is cmci_rediscover+0x6a/0xe0
      Pid: 10591, comm: cpu-offline.sh Not tainted 3.9.0-rc3+ #2
      Call Trace:
       [<ffffffff81333bbd>] debug_smp_processor_id+0xdd/0x100
       [<ffffffff8101edba>] cmci_rediscover+0x6a/0xe0
       [<ffffffff815f5b9f>] mce_cpu_callback+0x19d/0x1ae
       [<ffffffff8160ea66>] notifier_call_chain+0x66/0x150
       [<ffffffff8107ad7e>] __raw_notifier_call_chain+0xe/0x10
       [<ffffffff8104c2e3>] cpu_notify+0x23/0x50
       [<ffffffff8104c31e>] cpu_notify_nofail+0xe/0x20
       [<ffffffff815ef082>] _cpu_down+0x302/0x350
       [<ffffffff815ef106>] cpu_down+0x36/0x50
       [<ffffffff815f1c9d>] store_online+0x8d/0xd0
       [<ffffffff813edc48>] dev_attr_store+0x18/0x30
       [<ffffffff81226eeb>] sysfs_write_file+0xdb/0x150
       [<ffffffff811adfb2>] vfs_write+0xa2/0x170
       [<ffffffff811ae16c>] sys_write+0x4c/0xa0
       [<ffffffff81613019>] system_call_fastpath+0x16/0x1b
      
      However, a look at cmci_rediscover shows that it can be simplified quite
      a bit, apart from solving the above issue. It invokes functions that
      take spin locks with interrupts disabled, and hence it can run in atomic
      context. Also, it is run in the CPU_POST_DEAD phase, so the dying CPU
      is already dead and out of the cpu_online_mask. So take these points into
      account and simplify the code, and thereby also fix the above issue.
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      7a0c819d
  26. 09 1月, 2013 1 次提交
    • B
      x86, MCE: Retract most UAPI exports · f51bde6f
      Borislav Petkov 提交于
      Retract back most macro definitions which went into the
      user-visible mce.h header. Even though those bits are mostly
      hardware-defined/-architectural, their naming is not. If we export them
      to userspace, any kernel unification/renaming/cleanup cannot be done
      anymore since those are effectively cast in stone. Besides, if userspace
      wants those definitions, they can write their own defines and go crazy.
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      f51bde6f
  27. 15 12月, 2012 1 次提交