1. 23 11月, 2016 1 次提交
  2. 22 11月, 2016 1 次提交
    • Y
      x86/mce/AMD: Add system physical address translation for AMD Fam17h · f5382de9
      Yazen Ghannam 提交于
      The Unified Memory Controllers (UMCs) on Fam17h log a normalized address
      in their MCA_ADDR registers. We need to convert that normalized address
      to a system physical address in order to support a few facilities:
      
      1) To offline poisoned pages in DRAM proactively in the deferred error
         handler.
      
      2) To print sysaddr and page info for DRAM ECC errors in EDAC.
      
      [ Boris: fixes/cleanups ontop:
      
        * hi_addr_offset = 0 - no need for that branch. Stick it all under the
          HiAddrOffsetEn case. It confines hi_addr_offset's declaration too.
      
        * Move variables to the innermost scope they're used at so that we save
          on stack and not blow it up immediately on function entry.
      
        * Do not modify *sys_addr prematurely - we want to not exit early and
          have modified *sys_addr some, which callers get to see. We either
          convert to a sys_addr or we don't do anything. And we signal that with
          the retval of the function.
      
        * Rename label out -> out_err - because it is the error path.
      
        * No need to pr_err of the conversion failed case: imagine a
          sparsely-populated machine with UMCs which don't have DIMMs. Callers
          should look at the retval instead and issue a printk only when really
          necessary. No need for useless info in dmesg.
      
        * s/temp_reg/tmp/ and other variable names shortening => shorter code.
      
        * Use BIT() everywhere.
      
        * Make error messages more informative.
      
        *  Small build fix for the !CONFIG_X86_MCE_AMD case.
      
        * ... and more minor cleanups.
      ]
      Signed-off-by: NYazen Ghannam <Yazen.Ghannam@amd.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Aravind Gopalakrishnan <aravindksg.lkml@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20161122111133.mjzpvzhf7o7yl2oa@pd.tnic
      [ Typo fixes. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      f5382de9
  3. 16 11月, 2016 1 次提交
  4. 11 11月, 2016 1 次提交
    • B
      x86/MCE: Correct TSC timestamping of error records · 54467353
      Borislav Petkov 提交于
      We did have logic in the MCE code which would TSC-timestamp an error
      record only when it is exact - i.e., when it wasn't detected by polling.
      This isn't the case anymore. So let's fix that:
      
      We have a valid TSC timestamp in the error record only when it has been
      a precise detection, i.e., either in the #MC handler or in one of the
      interrupt handlers (thresholding, deferred, ...).
      
      All other error records still have mce.time which contains the wall
      time in order to be able to place the error record in time at least
      approximately.
      
      Also, this fixes another bug where machine_check_poll() would clear
      mce.tsc unconditionally even if we requested precise MCP_TIMESTAMP
      logging.
      
      The proper fix would be to generate timestamp only when it has been
      requested and not always. But that would require a more thorough code
      audit of all mce_gather_info/mce_setup() users. Add a FIXME for now.
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony <tony.luck@intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: kernel test robot <xiaolong.ye@intel.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Cc: lkp@01.org
      Link: http://lkml.kernel.org/r/20161110131053.kybsijfs5venpjnf@pd.tnicSigned-off-by: NIngo Molnar <mingo@kernel.org>
      54467353
  5. 09 11月, 2016 6 次提交
  6. 13 9月, 2016 9 次提交
  7. 05 9月, 2016 2 次提交
  8. 08 7月, 2016 2 次提交
  9. 07 7月, 2016 1 次提交
  10. 14 6月, 2016 1 次提交
    • T
      x86/mce: Do not use bank 1 for APEI generated error logs · b2de4360
      Tony Luck 提交于
      BIOS can report a memory error to Linux using ACPI/APEI mechanism. When
      it does this, we create a fictitious machine check error record and
      feed it into the standard mce_log() function. The error record needs a
      machine check bank number, and for some reason we chose "1" for this.
      
      But "1" is a valid bank number, and this causes confusion and heartburn
      among h/w folks who are concerned that a memory error signature was
      somehow logged in bank 1.
      
      Change to use "-1" (field is a "u8" so will typically print as 255).
      This should make it clearer that this error did not originate in a
      machine check bank.
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/b7fffb2b326bc1dd150ffceb9919a803f9496e0e.1464805958.git.tony.luck@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b2de4360
  11. 12 5月, 2016 4 次提交
  12. 03 5月, 2016 7 次提交
  13. 13 4月, 2016 3 次提交
  14. 26 3月, 2016 1 次提交
    • S
      ACPI / processor: Request native thermal interrupt handling via _OSC · a2121167
      Srinivas Pandruvada 提交于
      There are several reports of freeze on enabling HWP (Hardware PStates)
      feature on Skylake-based systems by the Intel P-states driver. The root
      cause is identified as the HWP interrupts causing BIOS code to freeze.
      
      HWP interrupts use the thermal LVT which can be handled by Linux
      natively, but on the affected Skylake-based systems SMM will respond
      to it by default.  This is a problem for several reasons:
       - On the affected systems the SMM thermal LVT handler is broken (it
         will crash when invoked) and a BIOS update is necessary to fix it.
       - With thermal interrupt handled in SMM we lose all of the reporting
         features of the arch/x86/kernel/cpu/mcheck/therm_throt driver.
       - Some thermal drivers like x86-package-temp depend on the thermal
         threshold interrupts signaled via the thermal LVT.
       - The HWP interrupts are useful for debugging and tuning
         performance (if the kernel can handle them).
      The native handling of thermal interrupts needs to be enabled
      because of that.
      
      This requires some way to tell SMM that the OS can handle thermal
      interrupts.  That can be done by using _OSC/_PDC in processor
      scope very early during ACPI initialization.
      
      The meaning of _OSC/_PDC bit 12 in processor scope is whether or
      not the OS supports native handling of interrupts for Collaborative
      Processor Performance Control (CPPC) notifications.  Since on
      HWP-capable systems CPPC is a firmware interface to HWP, setting
      this bit effectively tells the firmware that the OS will handle
      thermal interrupts natively going forward.
      
      For details on _OSC/_PDC refer to:
      http://www.intel.com/content/www/us/en/standards/processor-vendor-specific-acpi-specification.html
      
      To implement the _OSC/_PDC handshake as described, introduce a new
      function, acpi_early_processor_osc(), that walks the ACPI
      namespace looking for ACPI processor objects and invokes _OSC for
      them with bit 12 in the capabilities buffer set and terminates the
      namespace walk on the first success.
      
      Also modify intel_thermal_interrupt() to clear HWP status bits in
      the HWP_STATUS MSR to acknowledge HWP interrupts (which prevents
      them from firing continuously).
      Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      [ rjw: Subject & changelog, function rename ]
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      a2121167