1. 07 May, 2015 1 commit
    • x86/mce: Add support for deferred errors on AMD · 7559e13f
      Aravind Gopalakrishnan committed
      Deferred errors indicate error conditions that were not corrected, but
      those errors have not been consumed yet. They require no action from
      S/W (or action is optional). These errors provide info about a latent
      uncorrectable MCE that can occur when poisoned data is consumed by the
      processor.
      
      Newer AMD processors can generate deferred errors and can be configured
      to generate APIC interrupts on such events.
      
      SUCCOR stands for S/W UnCorrectable error COntainment and Recovery.
      It indicates support for data poisoning in HW and deferred error
      interrupts.
      
      Add new bitfield to mce_vendor_flags for this. We use this to verify
      presence of deferred error interrupts before we enable them in mce_amd.c.
      
      While at it, clarify comments in mce_vendor_flags to provide an
      indication of usages of the bitfields.
      Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: x86-ml <x86@kernel.org>
      Cc: linux-edac <linux-edac@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1430913538-1415-4-git-send-email-Aravind.Gopalakrishnan@amd.com
      [ beef up commit message, do CPUID(8000_0007) only once. ]
      Signed-off-by: Borislav Petkov <bp@suse.de>
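The capability check described in this commit can be illustrated with a minimal userspace sketch. It assumes the bit layout AMD documents for CPUID Fn8000_0007 EBX (bit 0 for MCA overflow recovery, bit 1 for SUCCOR). The struct and function names below are hypothetical stand-ins, not the kernel's `mce_vendor_flags` definition.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for a vendor flags bitfield; not the kernel's struct. */
struct vendor_flags_sketch {
    unsigned int overflow_recov : 1; /* assumed: CPUID Fn8000_0007 EBX bit 0 */
    unsigned int succor         : 1; /* assumed: CPUID Fn8000_0007 EBX bit 1 */
};

/* Decode an EBX value obtained from a single CPUID(0x80000007) read. */
static struct vendor_flags_sketch decode_ras_caps(uint32_t ebx)
{
    struct vendor_flags_sketch f = {0};

    f.overflow_recov = (ebx >> 0) & 1u;
    f.succor         = (ebx >> 1) & 1u;
    return f;
}

/* Only enable the deferred error interrupt when SUCCOR is advertised. */
static bool may_enable_deferred_irq(const struct vendor_flags_sketch *f)
{
    return f->succor;
}
```

Reading the CPUID leaf once and caching the decoded bits mirrors the "do CPUID(8000_0007) only once" note in the commit message. Later code then tests the flag rather than re-querying the hardware.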
  2. 24 March, 2015 2 commits
  3. 19 February, 2015 1 commit
    • x86/MCE/intel: Cleanup CMCI storm logic · 3f2f0680
      Borislav Petkov committed
      Initially, this started with yet another report about a race
      condition in the CMCI storm adaptive period length thing. Yes, we have
      to admit, it is fragile and error prone. So let's simplify it.
      
      The simpler logic is: now, after we enter storm mode, we go straight to
      polling with CMCI_STORM_INTERVAL, i.e. once a second. We remain in storm
      mode as long as we see errors being logged while polling.
      
      Theoretically, if we see an uninterrupted error stream, we will remain
      in storm mode indefinitely and keep polling the MSRs.
      
      However, when the storm is actually a burst of errors, once we have
      logged them all, we back out of it after ~5 mins of polling and no more
      errors logged.
      
      If we encounter an error during those 5 minutes, we reset the polling
      interval to 5 mins.
      
      Making machine_check_poll() return a bool and denoting whether it has
      seen an error or not lets us simplify a bunch of code and move the storm
      handling private to mce_intel.c.
      
      Some minor cleanups while at it.
      Reported-by: Calvin Owens <calvinowens@fb.com>
      Tested-by: Tony Luck <tony.luck@intel.com>
      Link: http://lkml.kernel.org/r/1417746575-23299-1-git-send-email-calvinowens@fb.com
      Signed-off-by: Borislav Petkov <bp@suse.de>
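The simplified storm handling described in this commit can be sketched as a small state machine. This is a hedged illustration only: the struct, function names and constants are hypothetical, and the real logic lives in mce_intel.c and is driven by kernel timers rather than explicit tick calls.

```c
#include <stdbool.h>

#define STORM_POLL_SECS   1    /* poll once a second while in storm mode */
#define STORM_QUIET_SECS  300  /* leave storm mode after ~5 quiet minutes */

struct storm_state {
    bool in_storm;
    int  quiet_left; /* seconds of error-free polling still required */
};

static void storm_enter(struct storm_state *s)
{
    s->in_storm = true;
    s->quiet_left = STORM_QUIET_SECS;
}

/* Called once per poll; saw_error is the bool returned by the poll routine. */
static void storm_poll_tick(struct storm_state *s, bool saw_error)
{
    if (!s->in_storm)
        return;

    if (saw_error) {
        s->quiet_left = STORM_QUIET_SECS; /* restart the 5 minute countdown */
        return;
    }

    s->quiet_left -= STORM_POLL_SECS;
    if (s->quiet_left <= 0)
        s->in_storm = false; /* burst is over, stop storm polling */
}
```

Because the poll routine reports whether it saw an error, the caller needs only one countdown. An uninterrupted error stream keeps resetting it, which matches the "remain in storm mode indefinitely" case in the message.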
  4. 07 January, 2015 1 commit
  5. 20 November, 2014 1 commit
  6. 22 October, 2014 1 commit
  7. 05 June, 2014 1 commit
  8. 07 January, 2014 1 commit
  9. 24 October, 2013 1 commit
  10. 06 August, 2013 1 commit
  11. 09 July, 2013 1 commit
    • mce: acpi/apei: Honour Firmware First for MCA banks listed in APEI HEST CMC · c3d1fb56
      Naveen N. Rao committed
      The Corrected Machine Check structure (CMC) in HEST has a flag which can be
      set by the firmware to indicate to the OS that it prefers to process the
      corrected error events first. In this scenario, the OS is expected to not
      monitor for corrected errors (through CMCI/polling). Instead, the firmware
      notifies the OS on corrected error events through GHES.
      
      Linux already has support for GHES. This patch adds support for parsing the CMC
      structure and for disabling CMCI/polling if the firmware first flag is set.
      
      Further, the list of machine check bank structures at the end of CMC is used
      to determine which MCA banks function in FF mode, so that we continue to
      monitor error events on the other banks.
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
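The per-bank decision described in this commit can be sketched as a bitmap lookup. This is a simplified illustration: the function names and the 32-bank limit are assumptions made for the example, and it does not show the actual HEST parsing code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BANKS_SKETCH 32 /* assumed limit for this example only */

/* Build a bitmap of the banks that firmware listed as firmware-first. */
static uint32_t ff_bank_mask(const uint8_t *banks, size_t n)
{
    uint32_t mask = 0;

    for (size_t i = 0; i < n; i++)
        if (banks[i] < MAX_BANKS_SKETCH)
            mask |= UINT32_C(1) << banks[i];
    return mask;
}

/* The OS keeps CMCI/polling only for banks that firmware does not own. */
static bool os_should_monitor(uint32_t ff_mask, unsigned int bank)
{
    if (bank >= MAX_BANKS_SKETCH)
        return false; /* out of range for this sketch */
    return !(ff_mask & (UINT32_C(1) << bank));
}
```

With banks 0 and 3 listed as firmware-first, the mask is `0x9`. The OS would then skip those two banks and continue monitoring the rest.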
  12. 13 June, 2013 1 commit
  13. 05 June, 2013 1 commit
  14. 03 April, 2013 1 commit
    • x86/mce: Rework cmci_rediscover() to play well with CPU hotplug · 7a0c819d
      Srivatsa S. Bhat committed
      Dave Jones reports that offlining a CPU leads to this trace:
      
      numa_remove_cpu cpu 1 node 0: mask now 0,2-3
      smpboot: CPU 1 is now offline
      BUG: using smp_processor_id() in preemptible [00000000] code:
      cpu-offline.sh/10591
      caller is cmci_rediscover+0x6a/0xe0
      Pid: 10591, comm: cpu-offline.sh Not tainted 3.9.0-rc3+ #2
      Call Trace:
       [<ffffffff81333bbd>] debug_smp_processor_id+0xdd/0x100
       [<ffffffff8101edba>] cmci_rediscover+0x6a/0xe0
       [<ffffffff815f5b9f>] mce_cpu_callback+0x19d/0x1ae
       [<ffffffff8160ea66>] notifier_call_chain+0x66/0x150
       [<ffffffff8107ad7e>] __raw_notifier_call_chain+0xe/0x10
       [<ffffffff8104c2e3>] cpu_notify+0x23/0x50
       [<ffffffff8104c31e>] cpu_notify_nofail+0xe/0x20
       [<ffffffff815ef082>] _cpu_down+0x302/0x350
       [<ffffffff815ef106>] cpu_down+0x36/0x50
       [<ffffffff815f1c9d>] store_online+0x8d/0xd0
       [<ffffffff813edc48>] dev_attr_store+0x18/0x30
       [<ffffffff81226eeb>] sysfs_write_file+0xdb/0x150
       [<ffffffff811adfb2>] vfs_write+0xa2/0x170
       [<ffffffff811ae16c>] sys_write+0x4c/0xa0
       [<ffffffff81613019>] system_call_fastpath+0x16/0x1b
      
      However, a look at cmci_rediscover shows that it can be simplified quite
      a bit, apart from solving the above issue. It invokes functions that
      take spin locks with interrupts disabled, and hence it can run in atomic
      context. Also, it is run in the CPU_POST_DEAD phase, so the dying CPU
      is already dead and out of the cpu_online_mask. So take these points into
      account and simplify the code, and thereby also fix the above issue.
      Reported-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
  15. 09 January, 2013 1 commit
    • x86, MCE: Retract most UAPI exports · f51bde6f
      Borislav Petkov committed
      Retract most macro definitions which went into the
      user-visible mce.h header. Even though those bits are mostly
      hardware-defined/-architectural, their naming is not. If we export them
      to userspace, any kernel unification/renaming/cleanup cannot be done
      anymore since those are effectively cast in stone. Besides, if userspace
      wants those definitions, they can write their own defines and go crazy.
      Signed-off-by: Borislav Petkov <bp@suse.de>
  16. 15 December, 2012 1 commit
  17. 26 October, 2012 4 commits
  18. 28 September, 2012 1 commit
  19. 18 September, 2012 1 commit
  20. 26 July, 2012 1 commit
  21. 23 February, 2012 1 commit
  22. 17 January, 2012 1 commit
  23. 22 December, 2011 1 commit
    • cpu: convert 'cpu' and 'machinecheck' sysdev_class to a regular subsystem · 8a25a2fd
      Kay Sievers committed
      This moves the 'cpu sysdev_class' over to a regular 'cpu' subsystem
      and converts the devices to regular devices. The sysdev drivers are
      implemented as subsystem interfaces now.
      
      After all sysdev classes are ported to regular driver core entities, the
      sysdev implementation will be entirely removed from the kernel.
      
      Userspace relies on events and generic sysfs subsystem infrastructure
      from sysdev devices, which are made available with this conversion.
      
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@amd64.org>
      Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  24. 17 December, 2011 1 commit
  25. 14 December, 2011 1 commit
  26. 08 November, 2011 1 commit
  27. 27 July, 2011 1 commit
  28. 16 June, 2011 2 commits
  29. 21 April, 2011 1 commit
  30. 04 January, 2011 1 commit
  31. 11 June, 2010 2 commits
  32. 20 May, 2010 1 commit
    • ACPI, APEI, Generic Hardware Error Source memory error support · d334a491
      Huang Ying committed
      Generic Hardware Error Source provides a way to report platform
      hardware errors (such as those from the chipset). It works in so-called
      "Firmware First" mode, that is, hardware errors are reported to
      firmware first, then reported to Linux by firmware. This way, some
      non-standard hardware error registers or non-standard hardware links
      can be checked by firmware to produce more valuable hardware error
      information for Linux.
      
      For now, only the SCI notification type and memory errors are supported.
      More notification types and hardware error types will be added later. These
      memory errors are reported to user space through /dev/mcelog by
      faking a corrected Machine Check, so that the faulty memory page can be
      offlined by /sbin/mcelog if the error count for one page is beyond the
      threshold.
      
      On some machines, Machine Check cannot report the physical address for
      some corrected memory errors, but GHES can. So this simplified
      GHES is implemented first.
      Signed-off-by: Huang Ying <ying.huang@intel.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Len Brown <len.brown@intel.com>
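The per-page threshold idea mentioned in this commit can be illustrated with a small counter table. This is a hypothetical sketch and not mcelog's actual implementation: the table size, the threshold value and the 4 KiB page size are all assumptions chosen for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT_SKETCH 12 /* assumes 4 KiB pages */
#define TRACKED_PAGES     64 /* assumed table size */
#define OFFLINE_THRESHOLD 10 /* assumed threshold */

struct page_count {
    uint64_t     pfn;
    unsigned int errors;
    bool         used;
};

struct page_tracker {
    struct page_count slot[TRACKED_PAGES];
};

/* Record one corrected error; return true once the page exceeds the threshold. */
static bool record_corrected_error(struct page_tracker *t, uint64_t phys_addr)
{
    uint64_t pfn = phys_addr >> PAGE_SHIFT_SKETCH;
    struct page_count *free_slot = NULL;

    for (size_t i = 0; i < TRACKED_PAGES; i++) {
        struct page_count *p = &t->slot[i];

        if (p->used && p->pfn == pfn)
            return ++p->errors > OFFLINE_THRESHOLD;
        if (!p->used && !free_slot)
            free_slot = p;
    }

    if (!free_slot)
        return false; /* table full: this sketch stops tracking new pages */

    free_slot->used = true;
    free_slot->pfn = pfn;
    free_slot->errors = 1;
    return free_slot->errors > OFFLINE_THRESHOLD;
}
```

A caller would start from a zero-initialized tracker and request page offlining when the function returns true. Counting per page is only possible with a physical address, which is why the message stresses that GHES can supply one where Machine Check cannot.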
  33. 13 January, 2010 1 commit
  34. 10 November, 2009 1 commit
    • x86: Under BIOS control, restore AP's APIC_LVTTHMR to the BSP value · a2202aa2
      Yong Wang committed
      On platforms where the BIOS handles the thermal monitor interrupt,
      APIC_LVTTHMR on each logical CPU is programmed to generate an SMI
      and the OS must not touch it.
      
      Unfortunately, the AP bringup sequence using INIT-SIPI-SIPI clears all
      the LVT entries except the mask bit. Essentially, this results in
      all LVT entries, including the thermal monitoring interrupt, being set
      to masked (clearing the BIOS-programmed value for APIC_LVTTHMR).
      
      This leads to the kernel taking over the thermal monitoring
      interrupt on the APs but not on the BSP (leaving the BIOS-programmed
      value only on the BSP).
      
      As a result, we have seen system hangs when the thermal
      monitoring interrupt is generated.
      
      Fix this by reading the initial value of the thermal LVT entry on
      the BSP and, if the BIOS has taken over control, programming the
      same value on all APs and leaving thermal monitoring
      interrupt control on all logical CPUs to the BIOS.
      Signed-off-by: Yong Wang <yong.y.wang@intel.com>
      Reviewed-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Arjan van de Ven <arjan@infradead.org>
      LKML-Reference: <20091110013824.GA24940@ywang-moblin2.bj.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Cc: stable@kernel.org
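The fix described in this commit comes down to one decision: which value to program into an AP's thermal LVT entry. The sketch below is a minimal illustration, not the kernel's code. It assumes the LVT layout from the Intel SDM (delivery mode in bits 10:8, with 010b meaning SMI), and the macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define LVT_DELIVERY_MODE_MASK 0x700u /* bits 10:8 of an LVT entry */
#define LVT_DELIVERY_MODE_SMI  0x200u /* delivery mode 010b = SMI */

/* The BIOS owns the thermal interrupt if the BSP's entry delivers an SMI. */
static bool bios_owns_thermal_lvt(uint32_t bsp_lvtthmr)
{
    return (bsp_lvtthmr & LVT_DELIVERY_MODE_MASK) == LVT_DELIVERY_MODE_SMI;
}

/* Value to program into an AP's thermal LVT entry after INIT-SIPI-SIPI. */
static uint32_t ap_lvtthmr_value(uint32_t bsp_lvtthmr, uint32_t os_value)
{
    return bios_owns_thermal_lvt(bsp_lvtthmr) ? bsp_lvtthmr : os_value;
}
```

The BSP's value is the reference because the BSP is not brought up through INIT-SIPI-SIPI, so its entry still holds what the BIOS programmed.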