1. 24 1月, 2017 2 次提交
    • B
      x86/ras: Flip the TSC-adding logic · 669c00f0
      Borislav Petkov 提交于
      Add the TSC value to the MCE record only when the MCE being logged is
      precise, i.e., it is logged as an exception or an MCE-related interrupt.
      
      So it doesn't look particularly easy to do without touching/changing a
      bunch of places. That's why I'm trying tricks first.
      
      For example, the mce-apei.c case I'm addressing by setting ->tsc only
      for errors of panic severity. The idea there is, that, panic errors will
      have raised an #MC and not polled.
      
      And then instead of propagating a flag to mce_setup(), it seems
      easier/less code to set ->tsc depending on the call sites, i.e.,
      are we polling or are we preparing an MCE record in an exception
      handler/thresholding interrupt.
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20170123183514.13356-5-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      669c00f0
    • B
      x86/ras/therm_throt: Do not log a fake MCE for thermal events · 9b052ea4
      Borislav Petkov 提交于
      We log a fake bank 128 MCE to note that we're handling a CPU thermal
      event. However, this confuses people into thinking that their hardware
      generates MCEs. Hijacking MCA for logging thermal events is a gross
      misuse anyway and it shouldn't have been done in the first place. And
      besides we have other means for dealing with thermal events which are
      much more suitable.
      
      So let's kill the MCE logging part.
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Acked-by: NAshok Raj <ashok.raj@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20170105213846.GA12024@gmail.com
      Link: http://lkml.kernel.org/r/20170123183514.13356-3-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9b052ea4
  2. 23 11月, 2016 1 次提交
  3. 16 11月, 2016 5 次提交
  4. 11 11月, 2016 1 次提交
    • B
      x86/MCE: Correct TSC timestamping of error records · 54467353
      Borislav Petkov 提交于
      We did have logic in the MCE code which would TSC-timestamp an error
      record only when it is exact - i.e., when it wasn't detected by polling.
      This isn't the case anymore. So let's fix that:
      
      We have a valid TSC timestamp in the error record only when it has been
      a precise detection, i.e., either in the #MC handler or in one of the
      interrupt handlers (thresholding, deferred, ...).
      
      All other error records still have mce.time which contains the wall
      time in order to be able to place the error record in time at least
      approximately.
      
      Also, this fixes another bug where machine_check_poll() would clear
      mce.tsc unconditionally even if we requested precise MCP_TIMESTAMP
      logging.
      
      The proper fix would be to generate timestamp only when it has been
      requested and not always. But that would require a more thorough code
      audit of all mce_gather_info/mce_setup() users. Add a FIXME for now.
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony <tony.luck@intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: kernel test robot <xiaolong.ye@intel.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Cc: lkp@01.org
      Link: http://lkml.kernel.org/r/20161110131053.kybsijfs5venpjnf@pd.tnicSigned-off-by: NIngo Molnar <mingo@kernel.org>
      54467353
  5. 09 11月, 2016 1 次提交
  6. 18 9月, 2016 1 次提交
    • T
      mce, workqueue: remove keventd_up() usage · a2c2727d
      Tejun Heo 提交于
      Now that workqueue can handle work item queueing from very early
      during boot, there is no need to gate schedule_work() with
      keventd_up().  Remove it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: linux-edac@vger.kernel.org
      a2c2727d
  7. 13 9月, 2016 4 次提交
  8. 05 9月, 2016 2 次提交
  9. 08 7月, 2016 1 次提交
    • B
      x86/mce: Fix mce_rdmsrl() warning message · 38c54ccb
      Borislav Petkov 提交于
      The MSR address we're dumping in there should be in hex, otherwise we
      get funsies like:
      
        [    0.016000] WARNING: CPU: 1 PID: 0 at arch/x86/kernel/cpu/mcheck/mce.c:428 mce_rdmsrl+0xd9/0xe0
        [    0.016000] mce: Unable to read msr -1073733631!
      				       ^^^^^^^^^^^
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: http://lkml.kernel.org/r/1467968983-4874-5-git-send-email-bp@alien8.de
      [ Fixed capitalization of 'MSR'. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      38c54ccb
  10. 07 7月, 2016 1 次提交
  11. 12 5月, 2016 1 次提交
  12. 03 5月, 2016 6 次提交
  13. 13 4月, 2016 1 次提交
  14. 18 2月, 2016 2 次提交
  15. 01 2月, 2016 1 次提交
  16. 19 12月, 2015 1 次提交
    • A
      x86/mce: Ensure offline CPUs don't participate in rendezvous process · d90167a9
      Ashok Raj 提交于
      Intel's MCA implementation broadcasts MCEs to all CPUs on the
      node. This poses a problem for offlined CPUs which cannot
      participate in the rendezvous process:
      
        Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
        Kernel Offset: disabled
        Rebooting in 100 seconds..
      
      More specifically, Linux does a soft offline of a CPU when
      writing a 0 to /sys/devices/system/cpu/cpuX/online, which
      doesn't prevent the #MC exception from being broadcasted to that
      CPU.
      
      Ensure that offline CPUs don't participate in the MCE rendezvous
      and clear the RIP valid status bit so that a second MCE won't
      cause a shutdown.
      
      Without the patch, mce_start() will increment mce_callin and
      wait for all CPUs. Offlined CPUs should avoid participating in
      the rendezvous process altogether.
      Signed-off-by: NAshok Raj <ashok.raj@intel.com>
      [ Massage commit message. ]
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Reviewed-by: NTony Luck <tony.luck@intel.com>
      Cc: <stable@vger.kernel.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1449742346-21470-2-git-send-email-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      d90167a9
  17. 24 11月, 2015 4 次提交
  18. 01 11月, 2015 2 次提交
  19. 28 9月, 2015 1 次提交
    • A
      x86/mce: Don't clear shared banks on Intel when offlining CPUs · 6e06780a
      Ashok Raj 提交于
      It is not safe to clear global MCi_CTL banks during CPU offline
      or suspend/resume operations. These MSRs are either
      thread-scoped (meaning private to a thread), or core-scoped
      (private to threads in that core only), or with a socket scope:
      visible and controllable from all threads in the socket.
      
      When we offline a single CPU, clearing those MCi_CTL bits will
      stop signaling for all the shared, i.e., socket-wide resources,
      such as LLC, iMC, etc.
      
      In addition, it might be possible to compromise the integrity of
      an Intel Secure Guard eXtentions (SGX) system if the attacker
      has control of the host system and is able to inject errors
      which would be otherwise ignored when MCi_CTL bits are cleared.
      
      Hence on SGX enabled systems, if MCi_CTL is cleared, SGX gets
      disabled.
      Tested-by: NSerge Ayoun <serge.ayoun@intel.com>
      Signed-off-by: NAshok Raj <ashok.raj@intel.com>
      [ Cleanup text. ]
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Reviewed-by: NTony Luck <tony.luck@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1441391390-16985-1-git-send-email-ashok.raj@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6e06780a
  20. 13 8月, 2015 2 次提交