1. 22 6月, 2018 2 次提交
    • T
      x86/mce: Fix incorrect "Machine check from unknown source" message · 40c36e27
      Tony Luck 提交于
      Some injection testing resulted in the following console log:
      
        mce: [Hardware Error]: CPU 22: Machine Check Exception: f Bank 1: bd80000000100134
        mce: [Hardware Error]: RIP 10:<ffffffffc05292dd> {pmem_do_bvec+0x11d/0x330 [nd_pmem]}
        mce: [Hardware Error]: TSC c51a63035d52 ADDR 3234bc4000 MISC 88
        mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1526502199 SOCKET 0 APIC 38 microcode 2000043
        mce: [Hardware Error]: Run the above through 'mcelog --ascii'
        Kernel panic - not syncing: Machine check from unknown source
      
      This confused everybody because the first line quite clearly shows
      that we found a logged error in "Bank 1", while the last line says
      "unknown source".
      
      The problem is that the Linux code doesn't do the right thing
      for a local machine check that results in a fatal error.
      
      It turns out that we know very early in the handler whether the
      machine check is fatal. The call to mce_no_way_out() has checked
      all the banks for the CPU that took the local machine check. If
      it says we must crash, we can do so right away with the right
      messages.
      
      We do scan all the banks again. This means that we might initially
      not see a problem, but during the second scan find something fatal.
      If this happens we print a slightly different message (so I can
      see if it actually every happens).
      
      [ bp: Remove unneeded severity assignment. ]
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Ashok Raj <ashok.raj@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Cc: stable@vger.kernel.org # 4.2
      Link: http://lkml.kernel.org/r/52e049a497e86fd0b71c529651def8871c804df0.1527283897.git.tony.luck@intel.com
      40c36e27
    • B
      x86/mce: Do not overwrite MCi_STATUS in mce_no_way_out() · 1f74c8a6
      Borislav Petkov 提交于
      mce_no_way_out() does a quick check during #MC to see whether some of
      the MCEs logged would require the kernel to panic immediately. And it
      passes a struct mce where MCi_STATUS gets written.
      
      However, after having saved a valid status value, the next iteration
      of the loop which goes over the MCA banks on the CPU, overwrites the
      valid status value because we're using struct mce as storage instead of
      a temporary variable.
      
      Which leads to MCE records with an empty status value:
      
        mce: [Hardware Error]: CPU 0: Machine Check Exception: 6 Bank 0: 0000000000000000
        mce: [Hardware Error]: RIP 10:<ffffffffbd42fbd7> {trigger_mce+0x7/0x10}
      
      In order to prevent the loss of the status register value, return
      immediately when severity is a panic one so that we can panic
      immediately with the first fatal MCE logged. This is also the intention
      of this function and not to noodle over the banks while a fatal MCE is
      already logged.
      
      Tony: read the rest of the MCA bank to populate the struct mce fully.
      Suggested-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20180622095428.626-8-bp@alien8.de
      1f74c8a6
  2. 08 6月, 2018 1 次提交
    • T
      x86/mce: Improve error message when kernel cannot recover · c7d606f5
      Tony Luck 提交于
      Since we added support to add recovery from some errors inside the kernel in:
      
      commit b2f9d678 ("x86/mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries")
      
      we have done a less than stellar job at reporting the cause of recoverable
      machine checks that occur in other parts of the kernel. The user just gets
      the unhelpful message:
      
      	mce: [Hardware Error]: Machine check: Action required: unknown MCACOD
      
      doubly unhelpful when they check the manual for the reported IA32_MSR_STATUS.MCACOD
      and see that it is listed as one of the standard recoverable values.
      
      Add an extra rule to the MCE severity table to catch this case and report it
      as:
      
      	mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
      
      Fixes: b2f9d678 ("x86/mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries")
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
      Cc: Ashok Raj <ashok.raj@intel.com>
      Cc: stable@vger.kernel.org # 4.6+
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/4cc7c465150a9a48b8b9f45d0b840278e77eb9b5.1527283897.git.tony.luck@intel.com
      c7d606f5
  3. 23 5月, 2018 2 次提交
  4. 19 5月, 2018 8 次提交
  5. 18 5月, 2018 1 次提交
  6. 17 5月, 2018 13 次提交
  7. 14 5月, 2018 1 次提交
  8. 13 5月, 2018 5 次提交
  9. 12 5月, 2018 1 次提交
  10. 11 5月, 2018 2 次提交
  11. 10 5月, 2018 1 次提交
    • K
      x86/bugs: Rename _RDS to _SSBD · 9f65fb29
      Konrad Rzeszutek Wilk 提交于
      Intel collateral will reference the SSB mitigation bit in IA32_SPEC_CTL[2]
      as SSBD (Speculative Store Bypass Disable).
      
      Hence changing it.
      
      It is unclear yet what the MSR_IA32_ARCH_CAPABILITIES (0x10a) Bit(4) name
      is going to be. Following the rename it would be SSBD_NO but that rolls out
      to Speculative Store Bypass Disable No.
      
      Also fixed the missing space in X86_FEATURE_AMD_SSBD.
      
      [ tglx: Fixup x86_amd_rds_enable() and rds_tif_to_amd_ls_cfg() as well ]
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      9f65fb29
  12. 06 5月, 2018 3 次提交