1. 24 Nov, 2015 1 commit
    • x86/mce: Do not enter deferred errors into the generic pool twice · 8b38937b
      Tony Luck committed
      We used to have a special ring buffer for deferred errors that
      was used to mark problem pages. We replaced that with a generic
      pool. Then later converted mce_log() to also use the same pool.
      As a result, we end up adding all deferred errors to the pool
      twice.
      
      Rearrange this code. Make sure to set the m.severity and
      m.usable_addr fields for deferred errors. Then, if flags and
      mca_cfg.dont_log_ce allow us to call mce_log(), we are done,
      because that adds this entry to the generic pool.
      
      If we skipped mce_log(), then we still want to take action for
      the deferred error, so add to the pool.
      
      Rename the boolean "error_logged" to "error_seen". We should
      set it whether or not we logged an error, because the return
      value from machine_check_poll() is used to decide whether
      storms have subsided.
      Reported-by: Gong Chen <gong.chen@linux.intel.com>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1448350880-5573-2-git-send-email-bp@alien8.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
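The control flow described in commit 8b38937b can be sketched in user space as follows. This is a simplified model, not the kernel code: `struct record`, `struct pool`, `FLAG_DONTLOG`, `pool_add()`, `log_record()` and `poll_deferred()` are hypothetical names, and gating the fallback on a usable address is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_CAP     8
#define FLAG_DONTLOG 0x1u	/* hypothetical stand-in for the caller's "do not log" flag */

struct record {
	int  severity;
	bool usable_addr;
};

struct pool {
	struct record slots[POOL_CAP];
	size_t        used;
};

/* Append one record; returns false when the pool is full. */
static bool pool_add(struct pool *p, const struct record *r)
{
	assert(p != NULL && r != NULL);
	if (p->used == POOL_CAP)
		return false;
	p->slots[p->used++] = *r;
	return true;
}

/* Models the logging path: logging already places the record in the pool. */
static void log_record(struct pool *p, const struct record *r)
{
	(void)pool_add(p, r);
}

/*
 * Handle one deferred error found while polling. The record reaches the
 * pool at most once: through logging, or directly when logging is skipped.
 * The return value reports that an error was seen, logged or not.
 */
static bool poll_deferred(struct pool *p, unsigned int flags,
			  bool dont_log_ce, const struct record *r)
{
	bool error_seen = true;

	if (!(flags & FLAG_DONTLOG) && !dont_log_ce)
		log_record(p, r);	/* pool entry comes from logging */
	else if (r->usable_addr)	/* assumption: only actionable records */
		(void)pool_add(p, r);	/* logging skipped: add it ourselves */

	return error_seen;
}
```

In this model a caller that watches for storms can rely on the return value even when logging is suppressed, which is the reason the commit gives for the rename to "error_seen".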
  2. 01 Nov, 2015 2 commits
  3. 21 Oct, 2015 1 commit
    • x86/mce: Fix thermal throttling reporting after kexec · 81ffdcdd
      Andi Kleen committed
      The per CPU thermal vector init code checks if the thermal
      vector is already installed and complains and bails out if it
      is.
      
      This happens after kexec, as kernel shut down does not clear the
      thermal vector APIC register.
      
      This causes two problems:
      
      1. We never fully initialize thermal reporting after kexec.
         The CPU itself is likely still initialized, as the previous
         kernel should have done it. But we don't set up the software
         pointer to the thermal vector, so reporting may end up with
         an unknown thermal interrupt message.
      
      2. Also it complains for every logical CPU, even though the
         value is actually derived from BP only.
      
      The problem is that we end up with one message per CPU, so on
      larger systems it becomes very noisy and messes up the otherwise
      nicely formatted CPU bootup numbers in the kernel log.
      
      Just remove the check. I checked the code and there are no
      valid code paths where the thermal init code for a CPU could be
      called multiple times.
      
      Why the kernel does not clean up this value on shutdown:
      
      The thermal monitoring is controlled per logical CPU thread.
      Normal shutdown code is just running on one CPU. To disable it
      we would need a broadcast NMI to all CPUs on shut down. That's
      overkill for this. So we just ignore it after kexec.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1445246268-26285-9-git-send-email-bp@alien8.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 28 Sep, 2015 1 commit
    • x86/mce: Don't clear shared banks on Intel when offlining CPUs · 6e06780a
      Ashok Raj committed
      It is not safe to clear global MCi_CTL banks during CPU offline
      or suspend/resume operations. These MSRs are either
      thread-scoped (meaning private to a thread), or core-scoped
      (private to threads in that core only), or with a socket scope:
      visible and controllable from all threads in the socket.
      
      When we offline a single CPU, clearing those MCi_CTL bits will
      stop signaling for all the shared, i.e., socket-wide resources,
      such as LLC, iMC, etc.
      
      In addition, it might be possible to compromise the integrity of
      an Intel Software Guard Extensions (SGX) system if the attacker
      has control of the host system and is able to inject errors
      which would be otherwise ignored when MCi_CTL bits are cleared.
      
      Hence on SGX enabled systems, if MCi_CTL is cleared, SGX gets
      disabled.
      Tested-by: Serge Ayoun <serge.ayoun@intel.com>
      Signed-off-by: Ashok Raj <ashok.raj@intel.com>
      [ Cleanup text. ]
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Tony Luck <tony.luck@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1441391390-16985-1-git-send-email-ashok.raj@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
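A minimal sketch of the idea behind commit 6e06780a: when a CPU goes offline, leave the bank control values alone on Intel, because they may be shared beyond the thread being offlined. The names `enum vendor` and `disable_error_reporting()` are hypothetical, the array merely stands in for the MCi_CTL registers, and the assumption that other vendors still clear the banks is made only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum vendor { VENDOR_INTEL, VENDOR_OTHER };

/*
 * Called when one CPU is offlined. ctl[] stands in for the per-bank
 * control registers visible from that CPU.
 */
static void disable_error_reporting(enum vendor v, uint64_t *ctl, size_t nbanks)
{
	assert(ctl != NULL || nbanks == 0);

	/* Banks may be core- or socket-scoped: clearing would affect siblings. */
	if (v == VENDOR_INTEL)
		return;

	/* Assumption for this sketch: other vendors keep the old behaviour. */
	for (size_t i = 0; i < nbanks; i++)
		ctl[i] = 0;
}
```

The early return keeps error signaling enabled for shared resources such as the LLC while the rest of the socket stays online.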
  5. 13 Aug, 2015 9 commits
  6. 23 Jul, 2015 1 commit
  7. 07 Jul, 2015 1 commit
    • x86/entry: Remove exception_enter() from most trap handlers · 8c84014f
      Andy Lutomirski committed
      On 64-bit kernels, we don't need it any more: we handle context
      tracking directly on entry from user mode and exit to user mode.
      
      On 32-bit kernels, we don't support context tracking at all, so
      these callbacks had no effect.
      
      Note: this doesn't change do_page_fault().  Before we do that,
      we need to make sure that there is no code that can page fault
      from kernel mode with CONTEXT_USER.  The 32-bit fast system call
      stack argument code is the only offender I'm aware of right now.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Denys Vlasenko <vda.linux@googlemail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: paulmck@linux.vnet.ibm.com
      Link: http://lkml.kernel.org/r/ae22f4dfebd799c916574089964592be218151f9.1435952415.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 06 Jul, 2015 2 commits
  9. 07 Jun, 2015 2 commits
  10. 28 May, 2015 2 commits
  11. 27 May, 2015 1 commit
  12. 18 May, 2015 1 commit
    • x86/mce: Fix MCE severity messages · 17fea54b
      Borislav Petkov committed
      Derek noticed that a critical MCE gets reported with the wrong
      error type description:
      
        [Hardware Error]: CPU 34: Machine Check Exception: 5 Bank 9: f200003f000100b0
        [Hardware Error]: RIP !INEXACT! 10:<ffffffff812e14c1> {intel_idle+0xb1/0x170}
        [Hardware Error]: TSC 49587b8e321cb
        [Hardware Error]: PROCESSOR 0:306e4 TIME 1431561296 SOCKET 1 APIC 29
        [Hardware Error]: Some CPUs didn't answer in synchronization
        [Hardware Error]: Machine check: Invalid
      				   ^^^^^^^
      
      The last line with 'Invalid' should have printed the high level
      MCE error type description we get from mce_severity, i.e.
      something like:
      
        [Hardware Error]: Machine check: Action required: data load error in a user process
      
      This happens because mce_no_way_out() iterates over all MCA
      banks and possibly overwrites the @msg argument which is used
      in the panic printing later.
      
      Change the behavior to take the message only of the (last)
      critical MCE it detects.
      Reported-by: Derek <denc716@gmail.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: <stable@vger.kernel.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: http://lkml.kernel.org/r/1431936437-25286-3-git-send-email-bp@alien8.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
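The fix in commit 17fea54b can be illustrated with a small user-space model: scan all banks, but let only a panic-level bank set the message. The types and names here (`enum severity`, `struct bank`, `bank_severity()`, `no_way_out()`) are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

enum severity { SEV_NONE, SEV_SOME, SEV_PANIC };

struct bank {
	enum severity sev;
	const char   *desc;
};

/* Grades one bank and always reports a description, even for "Invalid". */
static enum severity bank_severity(const struct bank *b, const char **msg)
{
	*msg = b->desc;
	return b->sev;
}

/*
 * Returns 1 if any bank holds a panic-level error. *msg is updated only
 * by panic-level banks, so a later non-critical bank cannot overwrite it.
 */
static int no_way_out(const struct bank *banks, size_t n, const char **msg)
{
	int ret = 0;

	assert(msg != NULL);
	for (size_t i = 0; i < n; i++) {
		const char *tmp;

		if (bank_severity(&banks[i], &tmp) >= SEV_PANIC) {
			*msg = tmp;
			ret = 1;
		}
	}
	return ret;
}
```

Routing the description through a temporary is what keeps a trailing bank with no valid error from replacing the critical message with "Invalid".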
  13. 07 May, 2015 6 commits
  14. 03 Apr, 2015 1 commit
  15. 24 Mar, 2015 2 commits
  16. 23 Mar, 2015 2 commits
  17. 19 Feb, 2015 4 commits
  18. 10 Feb, 2015 1 commit
    • x86/mce: Fix regression. All error records should report via /dev/mcelog · a2413d8b
      Tony Luck committed
      I'm getting complaints from validation teams that have updated their
      Linux kernels from ancient versions to current. They don't see the
      error logs they expect. I tell them to unload any EDAC drivers[1], and
      things start working again.  The problem is that we short-circuit
      the logging process if any function on the decoder chain claims to
      have dealt with the problem:
      
      	ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
      	if (ret == NOTIFY_STOP)
      		return;
      
      The logic we used when we added this code was that we did not want
      to confuse users with double reports of the same error.
      
      But it turns out users are not confused - they are upset that they
      don't see a log where their tools used to find a log.
      
      I could also get into a long description of how the consumer of this
      log does more than just decode model specific details of the error.
      It keeps counts, tracks thresholds, takes actions and runs scripts
      that can alert administrators to problems.
      
      [1] We've recently compounded the problem because the acpi_extlog
      driver also registers for this notifier and also returns NOTIFY_STOP.
      Signed-off-by: Tony Luck <tony.luck@intel.com>
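To illustrate the behavior that commit a2413d8b asks for, here is a user-space sketch in which a decoder may stop the chain walk but can no longer suppress the log entry. The names `decoder_fn`, `struct error_log`, `claiming_decoder()`, `log_error()` and the `DECODE_*` values are hypothetical and only model the kernel's notifier chain.

```c
#include <assert.h>
#include <stddef.h>

#define LOG_CAP 16

enum { DECODE_OK, DECODE_STOP };

typedef int (*decoder_fn)(unsigned long long status);

struct error_log {
	unsigned long long entries[LOG_CAP];
	size_t             count;
};

/* A decoder that claims every error, like a driver returning "stop". */
static int claiming_decoder(unsigned long long status)
{
	(void)status;
	return DECODE_STOP;
}

/*
 * Run the decoder chain, then record the error regardless of what the
 * decoders returned. The regressed behavior returned early on "stop".
 */
static void log_error(struct error_log *log, const decoder_fn *chain,
		      size_t n, unsigned long long status)
{
	assert(log != NULL);

	for (size_t i = 0; i < n; i++)
		if (chain[i](status) == DECODE_STOP)
			break;	/* ends the chain walk only, not the logging */

	if (log->count < LOG_CAP)
		log->entries[log->count++] = status;
}
```

With this shape, tools that read the log keep seeing every record even when an EDAC-style decoder is loaded.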