1. 26 Oct 2012, 2 commits
  2. 19 Oct 2012, 1 commit
  3. 28 Sep 2012, 1 commit
  4. 10 Aug 2012, 1 commit
    • x86/mce: Add CMCI poll mode · 55babd8f
      Committed by Chen Gong
      On Intel systems corrected machine check interrupts (CMCI) may be sent to
      multiple logical processors; possibly to all processors on the affected
      socket (SDM Volume 3B "15.5.1 CMCI Local APIC Interface").  This means
      that a persistent error (such as a stuck bit in ECC memory) may cause
      a storm of interrupts that greatly hinders or prevents forward progress
      (probably on many processors).
      
      To solve this we keep track of the rate at which each processor sees
      CMCI. If we exceed a threshold, we disable CMCI delivery and switch to
      polling the machine check banks. If the storm subsides (none of the
      affected processors see any more errors for a complete poll interval) we
      re-enable CMCI.
      
      [Tony: Added console messages when storm begins/ends and increased storm
      threshold from 5 to 15 so we have a few more logged entries before we
      disable interrupts and start dropping reports]
      Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Chen Gong <gong.chen@linux.intel.com>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
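The storm handling described in this commit message can be sketched as a small per-CPU state machine. This is a simplified userspace illustration, not the kernel code: the names `cpu_cmci_state`, `cmci_note_interrupt`, `cmci_poll_done` and the window length `STORM_WINDOW` are assumptions made for the example. Only the threshold of 15 comes from the message. The sketch also tracks a single CPU, whereas the commit re-enables CMCI only once none of the affected processors has seen an error for a full poll interval.

```c
#include <stdbool.h>
#include <stdint.h>

/* Threshold taken from the commit message; the window length is an
 * assumption for this sketch (arbitrary ticks). */
#define STORM_THRESHOLD 15
#define STORM_WINDOW    100

enum cmci_mode { CMCI_INTERRUPT, CMCI_POLL };

/* Hypothetical per-CPU bookkeeping. */
struct cpu_cmci_state {
    enum cmci_mode mode;
    uint64_t window_start;  /* tick at which the current window began */
    unsigned int count;     /* interrupts counted in the current window */
};

/* Call once per CMCI. Returns true when this interrupt pushes the CPU
 * over the threshold and it switches to poll mode. */
static bool cmci_note_interrupt(struct cpu_cmci_state *s, uint64_t now)
{
    if (s->mode == CMCI_POLL)
        return false;               /* interrupts are already disabled */

    if (now - s->window_start > STORM_WINDOW) {
        s->window_start = now;      /* old window expired, start over */
        s->count = 0;
    }

    if (++s->count <= STORM_THRESHOLD)
        return false;

    s->mode = CMCI_POLL;            /* storm: stop taking interrupts */
    return true;
}

/* Call at the end of each poll interval. A quiet interval ends the storm. */
static void cmci_poll_done(struct cpu_cmci_state *s, bool saw_error,
                           uint64_t now)
{
    if (s->mode == CMCI_POLL && !saw_error) {
        s->mode = CMCI_INTERRUPT;
        s->window_start = now;
        s->count = 0;
    }
}
```

With these values, the first 15 interrupts inside one window are handled normally and the 16th triggers the switch to polling.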
  5. 04 Aug 2012, 2 commits
  6. 26 Jul 2012, 1 commit
  7. 20 Jul 2012, 1 commit
  8. 12 Jul 2012, 1 commit
    • x86/mce: Fix siginfo_t->si_addr value for non-recoverable memory faults · 6751ed65
      Committed by Tony Luck
      In commit dad1743e ("x86/mce: Only restart instruction after machine
      check recovery if it is safe") we fixed mce_notify_process() to force a
      signal to the current process if it was not restartable (RIPV bit not
      set in MCG_STATUS). But doing it here means that the process doesn't
      get told the virtual address of the fault via siginfo_t->si_addr. This
      would prevent application level recovery from the fault.
      
      Make a new MF_MUST_KILL flag bit for memory_failure() et al. to use so
      that we will provide the right information with the signal.
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Acked-by: Borislav Petkov <borislav.petkov@amd.com>
      Cc: stable@kernel.org    # 3.4+
  9. 06 Jun 2012, 4 commits
  10. 31 May 2012, 1 commit
  11. 24 May 2012, 1 commit
  12. 23 May 2012, 1 commit
  13. 15 May 2012, 2 commits
  14. 30 Apr 2012, 1 commit
  15. 20 Apr 2012, 1 commit
    • x86/mce: Avoid reading every machine check bank register twice. · 95022b8c
      Committed by Tony Luck
      Reading machine check bank registers is slow. There is a trend of
      increasing the number of banks, and the number of cores. The main section
      of do_machine_check() is a serialized section where each cpu in turn
      checks every bank. Even on a little two socket SandyBridge-EP system
      that multiplies out as:
      
      	2 sockets * 8 cores * 2 hyperthreads * 20 banks = 640 MSRs
      
      We already scan the banks in parallel in mce_no_way_out() to see if there
      is a fatal error anywhere in the system. If we build a cache of VALID
      bits during this scan, we can avoid uselessly re-reading banks that have
      no data. Note that this cache is only a hint. If the valid bit is set in a
      shared bank, all cpus that share that bank will see it during the parallel
      scan, but the first to find it in the sequential scan will (usually) clear
      the bank.
      Acked-by: Borislav Petkov <borislav.petkov@amd.com>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
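The two-pass idea in this commit message can be sketched in userspace with simulated registers. This is an illustration under assumptions, not the kernel implementation: `fake_status`, `read_bank_status`, `scan_valid_banks` and `process_banks` are invented names, and a plain array stands in for the MSR reads. The VAL flag is bit 63 of a bank status register.

```c
#include <stdint.h>

#define NUM_BANKS    20
#define STATUS_VALID (1ULL << 63)   /* VAL bit of a bank status register */

/* Simulated bank status registers and a counter of "slow" reads. */
static uint64_t fake_status[NUM_BANKS];
static unsigned int msr_reads;

static uint64_t read_bank_status(int bank)
{
    msr_reads++;
    return fake_status[bank];
}

/* Pass 1 (runs in parallel on every CPU in the real handler): read each
 * bank once and remember which ones had VAL set. */
static uint32_t scan_valid_banks(void)
{
    uint32_t valid = 0;

    for (int i = 0; i < NUM_BANKS; i++)
        if (read_bank_status(i) & STATUS_VALID)
            valid |= 1u << i;
    return valid;
}

/* Pass 2 (the serialized section): re-read only the hinted banks.
 * The hint may be stale for shared banks, so VAL is checked again. */
static int process_banks(uint32_t valid_hint)
{
    int handled = 0;

    for (int i = 0; i < NUM_BANKS; i++) {
        if (!(valid_hint & (1u << i)))
            continue;               /* no data here: skip the slow read */

        uint64_t status = read_bank_status(i);
        if (!(status & STATUS_VALID))
            continue;               /* a sharing CPU already cleared it */

        fake_status[i] = 0;         /* log and clear the bank */
        handled++;
    }
    return handled;
}
```

In this sketch the serialized pass costs one read per hinted bank instead of 20 per CPU, which is where the saving against the 640-read figure comes from.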
  16. 07 Mar 2012, 1 commit
    • x86, mce: Fix rcu splat in drain_mce_log_buffer() · b11e3d78
      Committed by Srivatsa S. Bhat
      While booting, the following message is seen:
      
      [   21.665087] ===============================
      [   21.669439] [ INFO: suspicious RCU usage. ]
      [   21.673798] 3.2.0-0.0.0.28.36b5ec9-default #2 Not tainted
      [   21.681353] -------------------------------
      [   21.685864] arch/x86/kernel/cpu/mcheck/mce.c:194 suspicious rcu_dereference_index_check() usage!
      [   21.695013]
      [   21.695014] other info that might help us debug this:
      [   21.695016]
      [   21.703488]
      [   21.703489] rcu_scheduler_active = 1, debug_locks = 1
      [   21.710426] 3 locks held by modprobe/2139:
      [   21.714754]  #0:  (&__lockdep_no_validate__){......}, at: [<ffffffff8133afd3>] __driver_attach+0x53/0xa0
      [   21.725020]  #1:
      [   21.725323] ioatdma: Intel(R) QuickData Technology Driver 4.00
      [   21.733206]  (&__lockdep_no_validate__){......}, at: [<ffffffff8133afe1>] __driver_attach+0x61/0xa0
      [   21.743015]  #2:  (i7core_edac_lock){+.+.+.}, at: [<ffffffffa01cfa5f>] i7core_probe+0x1f/0x5c0 [i7core_edac]
      [   21.753708]
      [   21.753709] stack backtrace:
      [   21.758429] Pid: 2139, comm: modprobe Not tainted 3.2.0-0.0.0.28.36b5ec9-default #2
      [   21.768253] Call Trace:
      [   21.770838]  [<ffffffff810977cd>] lockdep_rcu_suspicious+0xcd/0x100
      [   21.777366]  [<ffffffff8101aa41>] drain_mcelog_buffer+0x191/0x1b0
      [   21.783715]  [<ffffffff8101aa78>] mce_register_decode_chain+0x18/0x20
      [   21.790430]  [<ffffffffa01cf8db>] i7core_register_mci+0x2fb/0x3e4 [i7core_edac]
      [   21.798003]  [<ffffffffa01cfb14>] i7core_probe+0xd4/0x5c0 [i7core_edac]
      [   21.804809]  [<ffffffff8129566b>] local_pci_probe+0x5b/0xe0
      [   21.810631]  [<ffffffff812957c9>] __pci_device_probe+0xd9/0xe0
      [   21.816650]  [<ffffffff813362e4>] ? get_device+0x14/0x20
      [   21.822178]  [<ffffffff81296916>] pci_device_probe+0x36/0x60
      [   21.828061]  [<ffffffff8133ac8a>] really_probe+0x7a/0x2b0
      [   21.833676]  [<ffffffff8133af23>] driver_probe_device+0x63/0xc0
      [   21.839868]  [<ffffffff8133b01b>] __driver_attach+0x9b/0xa0
      [   21.845718]  [<ffffffff8133af80>] ? driver_probe_device+0xc0/0xc0
      [   21.852027]  [<ffffffff81339168>] bus_for_each_dev+0x68/0x90
      [   21.857876]  [<ffffffff8133aa3c>] driver_attach+0x1c/0x20
      [   21.863462]  [<ffffffff8133a64d>] bus_add_driver+0x16d/0x2b0
      [   21.869377]  [<ffffffff8133b6dc>] driver_register+0x7c/0x160
      [   21.875220]  [<ffffffff81296bda>] __pci_register_driver+0x6a/0xf0
      [   21.881494]  [<ffffffffa01fe000>] ? 0xffffffffa01fdfff
      [   21.886846]  [<ffffffffa01fe047>] i7core_init+0x47/0x1000 [i7core_edac]
      [   21.893737]  [<ffffffff810001ce>] do_one_initcall+0x3e/0x180
      [   21.899670]  [<ffffffff810a9b95>] sys_init_module+0xc5/0x220
      [   21.905542]  [<ffffffff8149bc39>] system_call_fastpath+0x16/0x1b
      
      Fix this by using ACCESS_ONCE() instead of rcu_dereference_check_mce()
      over mcelog.next. Since the access to each entry is controlled by the
      ->finished field, ACCESS_ONCE() should work just fine. An rcu_dereference
      is unnecessary here.
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
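A rough userspace sketch of the pattern this fix relies on: read the log index once through a volatile access, then trust only entries whose `finished` flag is set. The `ACCESS_ONCE()` definition mirrors the kernel macro of that era, but `log_buf`, `log_entry` and `drain_log` are simplified stand-ins invented for this example. Memory barriers and the retry logic of the real function are left out.

```c
#include <stddef.h>

/* Force the compiler to emit exactly one load (or store) for x. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

#define LOG_LEN 32

/* Simplified stand-ins for the log structures. */
struct log_entry {
    int finished;               /* set by the writer once the entry is complete */
    unsigned long long status;
};

struct log_buf {
    unsigned int next;          /* index of the next free slot */
    struct log_entry entry[LOG_LEN];
};

/* Hand every completed entry below `next` to `consume`.
 * Returns the number of entries drained. */
static unsigned int drain_log(struct log_buf *log,
                              void (*consume)(const struct log_entry *))
{
    /* One snapshot of the index; a later writer bump is simply picked
     * up by the next drain. */
    unsigned int next = ACCESS_ONCE(log->next);
    unsigned int drained = 0;

    for (unsigned int i = 0; i < next && i < LOG_LEN; i++) {
        struct log_entry *e = &log->entry[i];

        if (!ACCESS_ONCE(e->finished))
            continue;           /* writer has not completed this slot */
        if (consume)
            consume(e);
        drained++;
    }
    return drained;
}
```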
  17. 23 Feb 2012, 2 commits
  18. 17 Jan 2012, 1 commit
  19. 14 Jan 2012, 1 commit
  20. 04 Jan 2012, 4 commits
    • x86/mce: Handle "action required" errors · a8c321fb
      Committed by Tony Luck
      All non-urgent actions (reporting low severity errors and handling
      "action-optional" errors) are now handled by a work queue. This
      means that TIF_MCE_NOTIFY can be used to block execution for a
      thread experiencing an "action-required" fault until we get all
      cpus out of the machine check handler (and the thread that hit
      the fault into mce_notify_process()).
      
      We use the new mce_{save,find,clear}_info() API to get information
      from do_machine_check() to mce_notify_process(), and then use the
      newly improved memory_failure(..., MF_ACTION_REQUIRED) to handle
      the error (possibly signalling the process).
      
      Update some comments to make the new code flows clearer.
      Signed-off-by: Tony Luck <tony.luck@intel.com>
    • x86/mce: Add mechanism to safely save information in MCE handler · af104e39
      Committed by Tony Luck
      Machine checks on Intel cpus interrupt execution on all cpus, regardless
      of interrupt masking.  We have a need to save some data about the cause
      of the machine check (physical address) in the machine check handler that
      can be retrieved later to attempt recovery in a more flexible execution
      state.
      Signed-off-by: Tony Luck <tony.luck@intel.com>
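One way to save data safely from a context that cannot take locks is a small fixed table of slots claimed with an atomic compare-and-swap. The sketch below illustrates that idea in portable C11. It is an assumption about the general shape, not a copy of the kernel code: the slot count, the integer `owner` standing in for a task pointer, and the names `save_info`, `find_info` and `clear_info` are all chosen for this example.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define INFO_MAX 16   /* assumed number of slots */

struct saved_info {
    atomic_int inuse;   /* 0 = free, 1 = claimed */
    int owner;          /* stand-in for the task that hit the fault */
    uint64_t paddr;     /* physical address of the error */
};

static struct saved_info info_slots[INFO_MAX];

/* Claim a free slot without locking. Returns 0 on success, -1 if full. */
static int save_info(int owner, uint64_t paddr)
{
    for (size_t i = 0; i < INFO_MAX; i++) {
        int expected = 0;

        if (atomic_compare_exchange_strong(&info_slots[i].inuse,
                                           &expected, 1)) {
            info_slots[i].owner = owner;
            info_slots[i].paddr = paddr;
            return 0;
        }
    }
    return -1;
}

/* Later, in a friendlier context, look up the slot saved for `owner`. */
static struct saved_info *find_info(int owner)
{
    for (size_t i = 0; i < INFO_MAX; i++)
        if (atomic_load(&info_slots[i].inuse) &&
            info_slots[i].owner == owner)
            return &info_slots[i];
    return NULL;
}

/* Release the slot once recovery has been attempted. */
static void clear_info(struct saved_info *si)
{
    atomic_store(&si->inuse, 0);
}
```

The sketch assumes the lookup runs after the save has completed on behalf of the same owner, so it does not guard against a reader seeing a claimed but not yet filled slot.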
    • x86/mce: Create helper function to save addr/misc when needed · 85f92694
      Committed by Tony Luck
      The MCI_STATUS_MISCV and MCI_STATUS_ADDRV bits in the bank status
      registers define whether the MISC and ADDR registers respectively
      contain valid data - provide a helper function to check these bits
      and read the registers when needed.
      
      In addition, processors that support software error recovery (as
      indicated by the MCG_SER_P bit in the MCG_CAP register) may include
      some undefined bits in the ADDR register - mask these out.
      Acked-by: Borislav Petkov <bp@amd64.org>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
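A helper of the kind described here might look roughly like the following userspace sketch. The register contents are passed in through a hypothetical `bank_regs` struct instead of being read from MSRs, and the names `mce_record` and `read_aux` are invented for the example. The bit positions (MISCV at bit 59, ADDRV at bit 58) and the use of the low six bits of MISC as the least significant valid address bit follow the Intel SDM as understood here; treat the masking step as an assumption about how the undefined bits are removed.

```c
#include <stdint.h>

#define MCI_STATUS_MISCV (1ULL << 59)   /* MISC register holds valid data */
#define MCI_STATUS_ADDRV (1ULL << 58)   /* ADDR register holds valid data */

/* Low six bits of MISC: least significant valid bit of the address. */
#define MISC_ADDR_LSB(misc) ((unsigned int)((misc) & 0x3f))

struct mce_record {
    uint64_t status;
    uint64_t addr;
    uint64_t misc;
};

/* Simulated register contents for one bank (hypothetical). */
struct bank_regs {
    uint64_t addr;
    uint64_t misc;
};

/* Copy MISC and ADDR into the record only when STATUS says they are
 * valid. With software error recovery support, clear the address bits
 * below the reported granularity. */
static void read_aux(struct mce_record *m, const struct bank_regs *regs,
                     int ser_supported)
{
    if (m->status & MCI_STATUS_MISCV)
        m->misc = regs->misc;

    if (m->status & MCI_STATUS_ADDRV) {
        m->addr = regs->addr;

        if (ser_supported && (m->status & MCI_STATUS_MISCV)) {
            unsigned int shift = MISC_ADDR_LSB(m->misc);

            m->addr >>= shift;
            m->addr <<= shift;
        }
    }
}
```

For example, with both valid bits set, an ADDR value of `0x12345fff` and a MISC LSB field of 12, the stored address becomes `0x12345000`.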
    • HWPOISON: Clean up memory_failure() vs. __memory_failure() · cd42f4a3
      Committed by Tony Luck
      There is only one caller of memory_failure(), all other users call
      __memory_failure() and pass in the flags argument explicitly. The
      lone user of memory_failure() will soon need to pass flags too.
      
      Add flags argument to the callsite in mce.c. Delete the old memory_failure()
      function, and then rename __memory_failure() without the leading "__".
      
      Provide clearer message when action optional memory errors are ignored.
      Acked-by: Borislav Petkov <bp@amd64.org>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
  21. 22 Dec 2011, 1 commit
    • cpu: convert 'cpu' and 'machinecheck' sysdev_class to a regular subsystem · 8a25a2fd
      Committed by Kay Sievers
      This moves the 'cpu sysdev_class' over to a regular 'cpu' subsystem
      and converts the devices to regular devices. The sysdev drivers are
      implemented as subsystem interfaces now.
      
      After all sysdev classes are ported to regular driver core entities, the
      sysdev implementation will be entirely removed from the kernel.
      
      Userspace relies on events and generic sysfs subsystem infrastructure
      from sysdev devices, which are made available with this conversion.
      
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@amd64.org>
      Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  22. 21 Dec 2011, 1 commit
  23. 14 Dec 2011, 2 commits
  24. 08 Nov 2011, 1 commit
  25. 01 Nov 2011, 3 commits
  26. 19 Oct 2011, 1 commit
  27. 14 Oct 2011, 1 commit