1. 09 7月, 2013 1 次提交
    • N
      mce: acpi/apei: Honour Firmware First for MCA banks listed in APEI HEST CMC · c3d1fb56
      Naveen N. Rao 提交于
      The Corrected Machine Check structure (CMC) in HEST has a flag which can be
      set by the firmware to indicate to the OS that it prefers to process the
      corrected error events first. In this scenario, the OS is expected to not
      monitor for corrected errors (through CMCI/polling). Instead, the firmware
      notifies the OS on corrected error events through GHES.
      
      Linux already has support for GHES. This patch adds support for parsing CMC
      structure and to disable CMCI/polling if the firmware first flag is set.
      
      Further, the list of machine check bank structures at the end of CMC is used
      to determine which MCA banks function in FF mode, so that we continue to
      monitor error events on the other banks.
      Signed-off-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Acked-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      c3d1fb56
  2. 03 4月, 2013 1 次提交
    • S
      x86/mce: Rework cmci_rediscover() to play well with CPU hotplug · 7a0c819d
      Srivatsa S. Bhat 提交于
      Dave Jones reports that offlining a CPU leads to this trace:
      
      numa_remove_cpu cpu 1 node 0: mask now 0,2-3
      smpboot: CPU 1 is now offline
      BUG: using smp_processor_id() in preemptible [00000000] code:
      cpu-offline.sh/10591
      caller is cmci_rediscover+0x6a/0xe0
      Pid: 10591, comm: cpu-offline.sh Not tainted 3.9.0-rc3+ #2
      Call Trace:
       [<ffffffff81333bbd>] debug_smp_processor_id+0xdd/0x100
       [<ffffffff8101edba>] cmci_rediscover+0x6a/0xe0
       [<ffffffff815f5b9f>] mce_cpu_callback+0x19d/0x1ae
       [<ffffffff8160ea66>] notifier_call_chain+0x66/0x150
       [<ffffffff8107ad7e>] __raw_notifier_call_chain+0xe/0x10
       [<ffffffff8104c2e3>] cpu_notify+0x23/0x50
       [<ffffffff8104c31e>] cpu_notify_nofail+0xe/0x20
       [<ffffffff815ef082>] _cpu_down+0x302/0x350
       [<ffffffff815ef106>] cpu_down+0x36/0x50
       [<ffffffff815f1c9d>] store_online+0x8d/0xd0
       [<ffffffff813edc48>] dev_attr_store+0x18/0x30
       [<ffffffff81226eeb>] sysfs_write_file+0xdb/0x150
       [<ffffffff811adfb2>] vfs_write+0xa2/0x170
       [<ffffffff811ae16c>] sys_write+0x4c/0xa0
       [<ffffffff81613019>] system_call_fastpath+0x16/0x1b
      
      However, a look at cmci_rediscover shows that it can be simplified quite
      a bit, apart from solving the above issue. It invokes functions that
      take spin locks with interrupts disabled, and hence it can run in atomic
      context. Also, it is run in the CPU_POST_DEAD phase, so the dying CPU
      is already dead and out of the cpu_online_mask. So take these points into
      account and simplify the code, and thereby also fix the above issue.
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      7a0c819d
  3. 22 3月, 2013 2 次提交
  4. 21 1月, 2013 1 次提交
  5. 29 12月, 2012 1 次提交
    • T
      x86/mce: don't use [delayed_]work_pending() · 4d899be5
      Tejun Heo 提交于
      There's no need to test whether a (delayed) work item in pending
      before queueing, flushing or cancelling it.  Most uses are unnecessary
      and quite a few of them are buggy.
      
      Remove unnecessary pending tests from x86/mce.  Only compile tested.
      
      v2: Local var work removed from mce_schedule_work() as suggested by
          Borislav.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NBorislav Petkov <bp@alien8.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux-edac@vger.kernel.org
      4d899be5
  6. 31 10月, 2012 1 次提交
    • T
      x86/mce: Do not change worker's running cpu in cmci_rediscover(). · 85b97637
      Tang Chen 提交于
      cmci_rediscover() used set_cpus_allowed_ptr() to change the current process's
      running cpu, and migrate itself to the dest cpu. But worker processes are not
      allowed to be migrated. If current is a worker, the worker will be migrated to
      another cpu, but the corresponding  worker_pool is still on the original cpu.
      
      In this case, the following BUG_ON in try_to_wake_up_local() will be triggered:
      BUG_ON(rq != this_rq());
      
      This will cause the kernel panic. The call trace is like the following:
      
      [ 6155.451107] ------------[ cut here ]------------
      [ 6155.452019] kernel BUG at kernel/sched/core.c:1654!
      ......
      [ 6155.452019] RIP: 0010:[<ffffffff810add15>]  [<ffffffff810add15>] try_to_wake_up_local+0x115/0x130
      ......
      [ 6155.452019] Call Trace:
      [ 6155.452019]  [<ffffffff8166fc14>] __schedule+0x764/0x880
      [ 6155.452019]  [<ffffffff81670059>] schedule+0x29/0x70
      [ 6155.452019]  [<ffffffff8166de65>] schedule_timeout+0x235/0x2d0
      [ 6155.452019]  [<ffffffff810db57d>] ? mark_held_locks+0x8d/0x140
      [ 6155.452019]  [<ffffffff810dd463>] ? __lock_release+0x133/0x1a0
      [ 6155.452019]  [<ffffffff81671c50>] ? _raw_spin_unlock_irq+0x30/0x50
      [ 6155.452019]  [<ffffffff810db8f5>] ? trace_hardirqs_on_caller+0x105/0x190
      [ 6155.452019]  [<ffffffff8166fefb>] wait_for_common+0x12b/0x180
      [ 6155.452019]  [<ffffffff810b0b30>] ? try_to_wake_up+0x2f0/0x2f0
      [ 6155.452019]  [<ffffffff8167002d>] wait_for_completion+0x1d/0x20
      [ 6155.452019]  [<ffffffff8110008a>] stop_one_cpu+0x8a/0xc0
      [ 6155.452019]  [<ffffffff810abd40>] ? __migrate_task+0x1a0/0x1a0
      [ 6155.452019]  [<ffffffff810a6ab8>] ? complete+0x28/0x60
      [ 6155.452019]  [<ffffffff810b0fd8>] set_cpus_allowed_ptr+0x128/0x130
      [ 6155.452019]  [<ffffffff81036785>] cmci_rediscover+0xf5/0x140
      [ 6155.452019]  [<ffffffff816643c0>] mce_cpu_callback+0x18d/0x19d
      [ 6155.452019]  [<ffffffff81676187>] notifier_call_chain+0x67/0x150
      [ 6155.452019]  [<ffffffff810a03de>] __raw_notifier_call_chain+0xe/0x10
      [ 6155.452019]  [<ffffffff81070470>] __cpu_notify+0x20/0x40
      [ 6155.452019]  [<ffffffff810704a5>] cpu_notify_nofail+0x15/0x30
      [ 6155.452019]  [<ffffffff81655182>] _cpu_down+0x262/0x2e0
      [ 6155.452019]  [<ffffffff81655236>] cpu_down+0x36/0x50
      [ 6155.452019]  [<ffffffff813d3eaa>] acpi_processor_remove+0x50/0x11e
      [ 6155.452019]  [<ffffffff813a6978>] acpi_device_remove+0x90/0xb2
      [ 6155.452019]  [<ffffffff8143cbec>] __device_release_driver+0x7c/0xf0
      [ 6155.452019]  [<ffffffff8143cd6f>] device_release_driver+0x2f/0x50
      [ 6155.452019]  [<ffffffff813a7870>] acpi_bus_remove+0x32/0x6d
      [ 6155.452019]  [<ffffffff813a7932>] acpi_bus_trim+0x87/0xee
      [ 6155.452019]  [<ffffffff813a7a21>] acpi_bus_hot_remove_device+0x88/0x16b
      [ 6155.452019]  [<ffffffff813a33ee>] acpi_os_execute_deferred+0x27/0x34
      [ 6155.452019]  [<ffffffff81090589>] process_one_work+0x219/0x680
      [ 6155.452019]  [<ffffffff81090528>] ? process_one_work+0x1b8/0x680
      [ 6155.452019]  [<ffffffff813a33c7>] ? acpi_os_wait_events_complete+0x23/0x23
      [ 6155.452019]  [<ffffffff810923be>] worker_thread+0x12e/0x320
      [ 6155.452019]  [<ffffffff81092290>] ? manage_workers+0x110/0x110
      [ 6155.452019]  [<ffffffff81098396>] kthread+0xc6/0xd0
      [ 6155.452019]  [<ffffffff8167c4c4>] kernel_thread_helper+0x4/0x10
      [ 6155.452019]  [<ffffffff81671f30>] ? retint_restore_args+0x13/0x13
      [ 6155.452019]  [<ffffffff810982d0>] ? __init_kthread_worker+0x70/0x70
      [ 6155.452019]  [<ffffffff8167c4c0>] ? gs_change+0x13/0x13
      
      This patch removes the set_cpus_allowed_ptr() call, and put the cmci rediscover
      jobs onto all the other cpus using system_wq. This could bring some delay for
      the jobs.
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      85b97637
  7. 30 10月, 2012 1 次提交
  8. 26 10月, 2012 4 次提交
  9. 19 10月, 2012 1 次提交
  10. 18 10月, 2012 1 次提交
  11. 28 9月, 2012 1 次提交
  12. 10 8月, 2012 2 次提交
    • C
      x86/mce: Add CMCI poll mode · 55babd8f
      Chen Gong 提交于
      On Intel systems corrected machine check interrupts (CMCI) may be sent to
      multiple logical processors; possibly to all processors on the affected
      socket (SDM Volume 3B "15.5.1 CMCI Local APIC Interface").  This means
      that a persistent error (such as a stuck bit in ECC memory) may cause
      a storm of interrupts that greatly hinders or prevents forward progress
      (probably on many processors).
      
      To solve this we keep track of the rate at which each processor sees
      CMCI. If we exceed a threshold, we disable CMCI delivery and switch to
      polling the machine check banks. If the storm subsides (none of the
      affected processors see any more errors for a complete poll interval) we
      re-enable CMCI.
      
      [Tony: Added console messages when storm begins/ends and increased storm
      threshold from 5 to 15 so we have a few more logged entries before we
      disable interrupts and start dropping reports]
      Signed-off-by: NChen Gong <gong.chen@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Tested-by: NChen Gong <gong.chen@linux.intel.com>
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      55babd8f
    • T
      x86/mce: Make cmci_discover() quiet · 4670a300
      Tony Luck 提交于
      cmci_discover() works out which machine check banks support CMCI, and
      which of those are shared by multiple logical processors. It uses this
      information to ensure that exactly one cpu is designated the owner of
      each bank so that when interrupts are broadcast to multiple cpus, only one
      of them will look in a shared bank to log the error and clear the bank.
      
      At boot time cmci_discover() performs this task silently. But during
      certain cpu hotplug operations it prints out a set of summary lines
      like this:
      
      CPU 35 MCA banks CMCI:0 CMCI:1 CMCI:3 CMCI:5 CMCI:6 CMCI:7 CMCI:8 CMCI:9 CMCI:10 CMCI:11
      CPU 1 MCA banks CMCI:0 CMCI:1 CMCI:3
      CPU 39 MCA banks CMCI:0 CMCI:1 CMCI:3
      CPU 38 MCA banks CMCI:0 CMCI:1 CMCI:3
      CPU 32 MCA banks CMCI:0 CMCI:1 CMCI:3
      CPU 37 MCA banks CMCI:0 CMCI:1 CMCI:3
      CPU 36 MCA banks CMCI:0 CMCI:1 CMCI:3
      CPU 34 MCA banks CMCI:0 CMCI:1 CMCI:3
      
      The value of these messages seems very low. A user might painstakingly
      cross-check against the data sheet for a processor to ensure that all
      CMCI supported banks are correctly reported, but this seems improbable.
      If users really wanted to do this, we should print the information at
      boot time too.
      
      Remove the messages.
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      4670a300
  13. 04 8月, 2012 4 次提交
  14. 26 7月, 2012 2 次提交
  15. 20 7月, 2012 2 次提交
  16. 12 7月, 2012 1 次提交
    • T
      x86/mce: Fix siginfo_t->si_addr value for non-recoverable memory faults · 6751ed65
      Tony Luck 提交于
      In commit dad1743e ("x86/mce: Only restart instruction after machine
      check recovery if it is safe") we fixed mce_notify_process() to force a
      signal to the current process if it was not restartable (RIPV bit not
      set in MCG_STATUS). But doing it here means that the process doesn't
      get told the virtual address of the fault via siginfo_t->si_addr. This
      would prevent application level recovery from the fault.
      
      Make a new MF_MUST_KILL flag bit for memory_failure() et al. to use so
      that we will provide the right information with the signal.
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Acked-by: NBorislav Petkov <borislav.petkov@amd.com>
      Cc: stable@kernel.org    # 3.4+
      6751ed65
  17. 07 6月, 2012 8 次提交
  18. 06 6月, 2012 4 次提交
  19. 31 5月, 2012 1 次提交
  20. 24 5月, 2012 1 次提交