1. 08 7月, 2016 2 次提交
  2. 07 7月, 2016 1 次提交
  3. 14 6月, 2016 1 次提交
    • T
      x86/mce: Do not use bank 1 for APEI generated error logs · b2de4360
      Tony Luck 提交于
      BIOS can report a memory error to Linux using ACPI/APEI mechanism. When
      it does this, we create a fictitious machine check error record and
      feed it into the standard mce_log() function. The error record needs a
      machine check bank number, and for some reason we chose "1" for this.
      
      But "1" is a valid bank number, and this causes confusion and heartburn
      among h/w folks who are concerned that a memory error signature was
      somehow logged in bank 1.
      
      Change to use "-1" (field is a "u8" so will typically print as 255).
      This should make it clearer that this error did not originate in a
      machine check bank.
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Aristeu Rozanski <arozansk@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/b7fffb2b326bc1dd150ffceb9919a803f9496e0e.1464805958.git.tony.luck@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b2de4360
  4. 12 5月, 2016 4 次提交
  5. 03 5月, 2016 7 次提交
  6. 13 4月, 2016 3 次提交
  7. 26 3月, 2016 1 次提交
    • S
      ACPI / processor: Request native thermal interrupt handling via _OSC · a2121167
      Srinivas Pandruvada 提交于
      There are several reports of freeze on enabling HWP (Hardware PStates)
      feature on Skylake-based systems by the Intel P-states driver. The root
      cause is identified as the HWP interrupts causing BIOS code to freeze.
      
      HWP interrupts use the thermal LVT which can be handled by Linux
      natively, but on the affected Skylake-based systems SMM will respond
      to it by default.  This is a problem for several reasons:
       - On the affected systems the SMM thermal LVT handler is broken (it
         will crash when invoked) and a BIOS update is necessary to fix it.
       - With thermal interrupt handled in SMM we lose all of the reporting
         features of the arch/x86/kernel/cpu/mcheck/therm_throt driver.
       - Some thermal drivers like x86-package-temp depend on the thermal
         threshold interrupts signaled via the thermal LVT.
       - The HWP interrupts are useful for debugging and tuning
         performance (if the kernel can handle them).
      The native handling of thermal interrupts needs to be enabled
      because of that.
      
      This requires some way to tell SMM that the OS can handle thermal
      interrupts.  That can be done by using _OSC/_PDC in processor
      scope very early during ACPI initialization.
      
      The meaning of _OSC/_PDC bit 12 in processor scope is whether or
      not the OS supports native handling of interrupts for Collaborative
      Processor Performance Control (CPPC) notifications.  Since on
      HWP-capable systems CPPC is a firmware interface to HWP, setting
      this bit effectively tells the firmware that the OS will handle
      thermal interrupts natively going forward.
      
      For details on _OSC/_PDC refer to:
      http://www.intel.com/content/www/us/en/standards/processor-vendor-specific-acpi-specification.html
      
      To implement the _OSC/_PDC handshake as described, introduce a new
      function, acpi_early_processor_osc(), that walks the ACPI
      namespace looking for ACPI processor objects and invokes _OSC for
      them with bit 12 in the capabilities buffer set and terminates the
      namespace walk on the first success.
      
      Also modify intel_thermal_interrupt() to clear HWP status bits in
      the HWP_STATUS MSR to acknowledge HWP interrupts (which prevents
      them from firing continuously).
      Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      [ rjw: Subject & changelog, function rename ]
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      a2121167
  8. 08 3月, 2016 3 次提交
  9. 18 2月, 2016 2 次提交
  10. 03 2月, 2016 1 次提交
  11. 01 2月, 2016 6 次提交
  12. 19 12月, 2015 1 次提交
    • A
      x86/mce: Ensure offline CPUs don't participate in rendezvous process · d90167a9
      Ashok Raj 提交于
      Intel's MCA implementation broadcasts MCEs to all CPUs on the
      node. This poses a problem for offlined CPUs which cannot
      participate in the rendezvous process:
      
        Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
        Kernel Offset: disabled
        Rebooting in 100 seconds..
      
      More specifically, Linux does a soft offline of a CPU when
      writing a 0 to /sys/devices/system/cpu/cpuX/online, which
      doesn't prevent the #MC exception from being broadcasted to that
      CPU.
      
      Ensure that offline CPUs don't participate in the MCE rendezvous
      and clear the RIP valid status bit so that a second MCE won't
      cause a shutdown.
      
      Without the patch, mce_start() will increment mce_callin and
      wait for all CPUs. Offlined CPUs should avoid participating in
      the rendezvous process altogether.
      Signed-off-by: NAshok Raj <ashok.raj@intel.com>
      [ Massage commit message. ]
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Reviewed-by: NTony Luck <tony.luck@intel.com>
      Cc: <stable@vger.kernel.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1449742346-21470-2-git-send-email-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      d90167a9
  13. 24 11月, 2015 4 次提交
  14. 01 11月, 2015 2 次提交
  15. 21 10月, 2015 1 次提交
    • A
      x86/mce: Fix thermal throttling reporting after kexec · 81ffdcdd
      Andi Kleen 提交于
      The per CPU thermal vector init code checks if the thermal
      vector is already installed and complains and bails out if it
      is.
      
      This happens after kexec, as kernel shut down does not clear the
      thermal vector APIC register.
      
      This causes two problems:
      
      1. So we always do not fully initialize thermal reports after
         kexec. The CPU is still likely initialized, as the previous
         kernel should have done it. But we don't set up the software
         pointer to the thermal vector, so reporting may end up with a
         unknown thermal interrupt message.
      
      2. Also it complains for every logical CPU, even though the
         value is actually derived from BP only.
      
      The problem is that we end up with one message per CPU, so on
      larger systems it becomes very noisy and messes up the otherwise
      nicely formatted CPU bootup numbers in the kernel log.
      
      Just remove the check. I checked the code and there's no valid
      code paths where the thermal init code for a CPU could be called
      multiple times.
      
      Why the kernel does not clean up this value on shutdown:
      
      The thermal monitoring is controlled per logical CPU thread.
      Normal shutdown code is just running on one CPU. To disable it
      we would need a broadcast NMI to all CPUs on shut down. That's
      overkill for this. So we just ignore it after kexec.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1445246268-26285-9-git-send-email-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      81ffdcdd
  16. 28 9月, 2015 1 次提交
    • A
      x86/mce: Don't clear shared banks on Intel when offlining CPUs · 6e06780a
      Ashok Raj 提交于
      It is not safe to clear global MCi_CTL banks during CPU offline
      or suspend/resume operations. These MSRs are either
      thread-scoped (meaning private to a thread), or core-scoped
      (private to threads in that core only), or with a socket scope:
      visible and controllable from all threads in the socket.
      
      When we offline a single CPU, clearing those MCi_CTL bits will
      stop signaling for all the shared, i.e., socket-wide resources,
      such as LLC, iMC, etc.
      
      In addition, it might be possible to compromise the integrity of
      an Intel Secure Guard eXtentions (SGX) system if the attacker
      has control of the host system and is able to inject errors
      which would be otherwise ignored when MCi_CTL bits are cleared.
      
      Hence on SGX enabled systems, if MCi_CTL is cleared, SGX gets
      disabled.
      Tested-by: NSerge Ayoun <serge.ayoun@intel.com>
      Signed-off-by: NAshok Raj <ashok.raj@intel.com>
      [ Cleanup text. ]
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Reviewed-by: NTony Luck <tony.luck@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1441391390-16985-1-git-send-email-ashok.raj@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6e06780a