1. 17 Aug, 2009 1 commit
    • B
      x86, mce: don't log boot MCEs on Pentium M (model == 13) CPUs · c7f6fa44
      Committed by Bartlomiej Zolnierkiewicz
      On my legacy Pentium M laptop (Acer Extensa 2900) I get bogus MCE on a cold
      boot with CONFIG_X86_NEW_MCE enabled, i.e. (after decoding it with mcelog):
      
      MCE 0
      HARDWARE ERROR. This is *NOT* a software problem!
      Please contact your hardware vendor
      CPU 0 BANK 1 MCG status:
      MCi status:
      Error overflow
      Uncorrected error
      Error enabled
      Processor context corrupt
      MCA: Data CACHE Level-1 UNKNOWN Error
      STATUS f200000000000195 MCGSTATUS 0
      
      [ The other STATUS values observed: f2000000000001b5 (... UNKNOWN error)
        and f200000000000115 (... READ Error).
      
        To verify that this is not a CONFIG_X86_NEW_MCE bug I also modified
        the CONFIG_X86_OLD_MCE code (which doesn't log any MCEs) to dump
        content of STATUS MSR before it is cleared during initialization. ]
      
      Since the bogus MCE results in a kernel taint (which in turn disables
      lockdep support) don't log boot MCEs on Pentium M (model == 13) CPUs
      by default (the "mce=bootlog" boot parameter can be used to get the old
      behavior).
      Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
      Reviewed-by: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
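The mcelog text in this report can be cross-checked against the raw STATUS value by hand. The sketch below decodes the flag bits of f200000000000195 using the architectural bit positions of IA32_MCi_STATUS; the macro names follow the kernel's asm/mce.h, while `mca_error_code` and `print_status_flags` are hypothetical helpers written for illustration only.

```c
#include <stdint.h>
#include <stdio.h>

/* Flag bits of an IA32_MCi_STATUS value. */
#define MCI_STATUS_VAL  (1ULL << 63) /* register holds a valid error */
#define MCI_STATUS_OVER (1ULL << 62) /* a previous error was overwritten */
#define MCI_STATUS_UC   (1ULL << 61) /* error was not corrected */
#define MCI_STATUS_EN   (1ULL << 60) /* reporting of this error was enabled */
#define MCI_STATUS_PCC  (1ULL << 57) /* processor context corrupt */

/* The low 16 bits hold the MCA error code. */
static unsigned int mca_error_code(uint64_t status)
{
	return (unsigned int)(status & 0xffff);
}

/* Hypothetical helper: print one line per flag that is set. */
static void print_status_flags(uint64_t status)
{
	if (!(status & MCI_STATUS_VAL)) {
		puts("No valid error logged");
		return;
	}
	if (status & MCI_STATUS_OVER)
		puts("Error overflow");
	if (status & MCI_STATUS_UC)
		puts("Uncorrected error");
	if (status & MCI_STATUS_EN)
		puts("Error enabled");
	if (status & MCI_STATUS_PCC)
		puts("Processor context corrupt");
	printf("MCA error code: %#06x\n", mca_error_code(status));
}
```

Calling `print_status_flags(0xf200000000000195ULL)` reports the same four flags that mcelog lists above, since the top byte f2 has bits 63, 62, 61, 60 and 57 set.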
  2. 16 Aug, 2009 1 commit
  3. 11 Aug, 2009 1 commit
    • D
      x86, mce: therm_throt - change when we print messages · 0d01f314
      Committed by Dmitry Torokhov
      My Latitude d630 seems to be handling thermal events in SMI by
      lowering the max frequency of the CPU till it cools down but
      still leaks the "everything is normal" events.
      
      This spams the console with high-priority printks.
      
      Adjust the therm_throt driver to print the message that the
      temperature has returned to normal only when leaving the
      throttling state.
      
      Also lower the severity of "back to normal" message from
      KERN_CRIT to KERN_INFO.
      Signed-off-by: Dmitry Torokhov <dtor@mail.ru>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      LKML-Reference: <20090810051513.0558F526EC9@mailhub.coreip.homeip.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
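The behaviour described in this commit amounts to printing on state transitions rather than on every event. A minimal user-space sketch of that idea follows; `struct throttle_state` and `therm_event` are hypothetical names, and staying silent on repeated "hot" events is a simplifying assumption (the actual driver may rate-limit them differently).

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-CPU state; the real driver keeps more than this. */
struct throttle_state {
	bool throttled;      /* last state that was recorded */
	unsigned long count; /* number of "hot" events seen so far */
};

/*
 * Handle one thermal event. `hot` is true when the hardware reports
 * throttling. Returns true if a message was printed.
 */
static bool therm_event(struct throttle_state *s, bool hot)
{
	bool was_throttled = s->throttled;

	s->throttled = hot;
	if (hot) {
		s->count++;
		if (was_throttled)
			return false; /* assumption: stay quiet while still hot */
		printf("CRIT: CPU temperature above threshold (events = %lu)\n",
		       s->count);
		return true;
	}
	if (!was_throttled)
		return false; /* stray "everything is normal" event: ignore */
	puts("INFO: CPU temperature/speed normal");
	return true;
}
```

With this structure, a machine that leaks "normal" events without ever entering the throttled state produces no output at all.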
  4. 22 Jul, 2009 1 commit
  5. 09 Jul, 2009 1 commit
  6. 26 Jun, 2009 1 commit
  7. 18 Jun, 2009 3 commits
  8. 17 Jun, 2009 19 commits
  9. 11 Jun, 2009 2 commits
    • H
      x86, mce: Add boot options for corrected errors · 62fdac59
      Committed by Hidetoshi Seto
      This patch introduces three boot options (no_cmci, dont_log_ce
      and ignore_ce) to control handling for corrected errors.
      
      The "mce=no_cmci" boot option disables the CMCI feature.
      
      Since CMCI is a new feature, having a boot control to disable
      it will help if the hardware is misbehaving.
      
      The "mce=dont_log_ce" boot option disables logging for corrected
      errors. All reported corrected errors will be cleared silently.
      This option will be useful if you never care about corrected
      errors.
      
      The "mce=ignore_ce" boot option disables features for corrected
      errors, i.e. the polling timer and CMCI.  Corrected events are
      not cleared and are kept in the bank MSRs.
      
      Usually disabling these is not recommended; however, it will
      help if there is a conflict with the BIOS or hardware
      monitoring applications etc. that clear corrected events in the
      banks instead of the OS.
      
      [ And trivial cleanup (space -> tab) for doc is included. ]
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      LKML-Reference: <4A30ACDF.5030408@jp.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
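The three option strings named in this commit could be matched by a parser along the following lines. This is a user-space sketch only: `struct mce_opts` and `parse_mce_option` are hypothetical names, and the argument is assumed to be the text after "mce=" on the kernel command line.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical container for the three switches described above. */
struct mce_opts {
	bool cmci_disabled; /* "mce=no_cmci": do not use CMCI */
	bool dont_log_ce;   /* "mce=dont_log_ce": clear corrected errors silently */
	bool ignore_ce;     /* "mce=ignore_ce": no polling, no CMCI, banks untouched */
};

/*
 * Parse the part after "mce=". Returns 0 when the option is one of the
 * three corrected-error controls, -1 otherwise.
 */
static int parse_mce_option(const char *arg, struct mce_opts *opts)
{
	if (strcmp(arg, "no_cmci") == 0)
		opts->cmci_disabled = true;
	else if (strcmp(arg, "dont_log_ce") == 0)
		opts->dont_log_ce = true;
	else if (strcmp(arg, "ignore_ce") == 0)
		opts->ignore_ce = true;
	else
		return -1;
	return 0;
}
```

Keeping each switch as a separate flag mirrors the commit text, where the three options are independent of one another.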
    • H
      x86, mce: Fix mce printing · 77e26cca
      Committed by Hidetoshi Seto
      This patch:
      
       - Adds print_mce_head() instead of first flag
       - Makes the header always be printed
       - Stops double printing of corrected errors
      
      [ This portion originates from Huang Ying's patch ]
      
      Originally-From: Huang Ying <ying.huang@intel.com>
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      LKML-Reference: <4A30AC83.5010708@jp.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  10. 10 Jun, 2009 1 commit
    • A
      KVM: Add VT-x machine check support · a0861c02
      Committed by Andi Kleen
      VT-x needs an explicit MC vector intercept to handle machine checks in the
      hypervisor.
      
      It also has a special option to catch machine checks that happen
      during VT entry.
      
      Do these interceptions and forward them to the Linux machine check
      handler. Make it always look like user space is interrupted because
      the machine check handler treats kernel/user space differently.
      
      Thanks to Jiang Yunhong for help and testing.
      
      Cc: stable@kernel.org
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Huang Ying <ying.huang@intel.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
  11. 09 Jun, 2009 1 commit
  12. 04 Jun, 2009 8 commits
    • A
      x86, mce: support action-optional machine checks · 9b1beaf2
      Committed by Andi Kleen
      Newer Intel CPUs support a new class of machine checks called recoverable
      action optional.
      
      Action Optional means that the CPU detected some form of corruption in
      the background and tells the OS about it using a machine check
      exception. The OS can then take appropriate action, like killing the
      process with the corrupted data or logging the event properly to disk.
      
      This is done by the new generic high level memory failure handler added
      in an earlier patch. The high level handler takes the address of the
      failed memory and takes the appropriate action, like killing the process.
      
      In this version of the patch the high level handler is stubbed out
      with a weak function to not create a direct dependency on the hwpoison
      branch.
      
      The high level handler cannot be directly called from the machine check
      exception though, because it has to run in a defined process context to
      be able to sleep when taking VM locks (it is not expected to sleep for a
      long time, just do so in some exceptional cases like lock contention)
      
      Thus the MCE handler has to queue a work item for process context,
      trigger process context and then call the high level handler from there.
      
      This patch adds two paths to process context: through a per thread kernel
      exit notify_user() callback or through a high priority work item.
      The first runs when the process exits back to user space, the other when
      it goes to sleep and there is no higher priority process.
      
      The machine check handler will schedule both, and whoever runs first
      will grab the event. This is done because quick reaction to this
      event is critical to avoid a potentially more fatal machine check
      when the corruption is consumed.
      
      There is a simple lockless ring buffer to queue the corrupted
      addresses between the exception handler and the process context handler.
      Then in process context it just calls the high level VM code with
      the corrupted PFNs.
      
      The code adds the required code to extract the failed address from
      the CPU's machine check registers. It doesn't try to handle all
      possible cases -- the specification has 6 different ways to specify
      memory address -- but only the linear address.
      
      Most of the required checking has been already done earlier in the
      mce_severity rule checking engine.  Following the Intel
      recommendations Action Optional errors are only enabled for known
      situations (encoded in MCACODs). The errors are ignored otherwise,
      because they are action optional.
      
      v2: Improve comment, disable preemption while processing ring buffer
          (reported by Ying Huang)
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
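The ring buffer mentioned in this commit hands corrupted page frame numbers from the exception handler to process context without taking a lock. The sketch below shows one common way to build such a single-producer, single-consumer ring in portable C11. It is not the kernel's implementation: the use of `<stdatomic.h>`, the size of 16, and the names `pfn_ring`, `ring_add` and `ring_get` are all assumptions made for illustration (the v2 note above indicates the kernel code relies on disabling preemption instead).

```c
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 16 /* assumed size; one slot always stays empty */

struct pfn_ring {
	_Atomic unsigned int head;   /* next slot to write, producer only */
	_Atomic unsigned int tail;   /* next slot to read, consumer only */
	unsigned long pfn[RING_SIZE];
};

/* Producer side (exception handler). Returns false when the ring is full. */
static bool ring_add(struct pfn_ring *r, unsigned long pfn)
{
	unsigned int head = atomic_load_explicit(&r->head, memory_order_relaxed);
	unsigned int next = (head + 1) % RING_SIZE;

	if (next == atomic_load_explicit(&r->tail, memory_order_acquire))
		return false;
	r->pfn[head] = pfn;
	/* Publish the slot only after its contents have been written. */
	atomic_store_explicit(&r->head, next, memory_order_release);
	return true;
}

/* Consumer side (process context). Returns false when the ring is empty. */
static bool ring_get(struct pfn_ring *r, unsigned long *pfn)
{
	unsigned int tail = atomic_load_explicit(&r->tail, memory_order_relaxed);

	if (tail == atomic_load_explicit(&r->head, memory_order_acquire))
		return false;
	*pfn = r->pfn[tail];
	/* Release the slot only after its contents have been read. */
	atomic_store_explicit(&r->tail, (tail + 1) % RING_SIZE,
			      memory_order_release);
	return true;
}
```

Because each index is written by exactly one side, neither function ever needs to retry or block, which is what makes the scheme usable from an exception handler.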
    • A
      x86, mce: rename mce_notify_user to mce_notify_irq · 9ff36ee9
      Committed by Andi Kleen
      Rename the mce_notify_user function to mce_notify_irq. The next
      patch will split the wakeup handling of interrupt context
      and of process context and it's better to give it a clearer
      name for this.
      
      Contains a fix from Ying Huang
      
      [ Impact: cleanup ]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • H
      x86, mce: export MCE severities coverage via debugfs · 4611a6fa
      Committed by Huang Ying
      The MCE severity judgement code is data-driven, so code coverage tools
      such as gcov can not be used for measuring coverage. Instead a dedicated
      coverage mechanism is implemented.  The kernel keeps track of rules
      executed and reports them in debugfs.
      
      This is useful for increasing coverage of the mce-test testsuite.
      
      Right now it's unconditionally enabled because it's very little code.
      Signed-off-by: Huang Ying <ying.huang@intel.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
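A data-driven grader is a table walk, so coverage can be tracked by counting how often each table entry matches. The following sketch illustrates the idea with a hypothetical three-rule table; `struct rule`, `grade` and `uncovered_rules` are invented names, and the real rule set and its debugfs reporting are considerably richer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rule: matches when (status & mask) == result. */
struct rule {
	uint64_t mask;        /* status bits the rule looks at */
	uint64_t result;      /* required value of those bits */
	const char *msg;      /* description of the grade */
	unsigned int covered; /* how many times the rule has matched */
};

static struct rule rules[] = {
	{ 1ULL << 63, 0, "no valid error", 0 }, /* VAL bit clear */
	{ 1ULL << 61, 0, "corrected", 0 },      /* UC bit clear */
	{ 0, 0, "uncorrected", 0 },             /* catch-all */
};

#define NR_RULES (sizeof(rules) / sizeof(rules[0]))

/* Return the message of the first matching rule and count the hit. */
static const char *grade(uint64_t status)
{
	for (size_t i = 0; i < NR_RULES; i++) {
		if ((status & rules[i].mask) != rules[i].result)
			continue;
		rules[i].covered++;
		return rules[i].msg;
	}
	return NULL; /* unreachable while the catch-all rule exists */
}

/* Number of rules that no test input has exercised yet. */
static size_t uncovered_rules(void)
{
	size_t n = 0;

	for (size_t i = 0; i < NR_RULES; i++)
		if (rules[i].covered == 0)
			n++;
	return n;
}
```

A test suite can call something like `uncovered_rules()` after a run to see which table entries still lack a test case, which is the kind of information gcov cannot provide for table-driven logic.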
    • A
      x86, mce: implement new status bits · ed7290d0
      Committed by Andi Kleen
      The x86 architecture recently added some new machine check status bits:
      S(ignalled) and AR (Action-Required). Signalled makes it possible to check
      whether a specific event caused an exception or was just logged through CMCI.
      AR allows the kernel to decide if an event needs immediate action
      or can be delayed or ignored.
      
      Implement support for these new status bits. mce_severity() uses
      the new bits to grade the machine check correctly and decide what
      to do. The exception handler uses AR to decide to kill or not.
      The S bit is used to separate events between the poll/CMCI handler
      and the exception handler.
      
      Classical UC always leads to panic. That was true before anyways
      because the existing CPUs always passed a PCC with it.
      
      Also corrects the rules whether to kill in user or kernel context
      and how to handle missing RIPV.
      
      The machine check handler largely uses the mce-severity grading
      engine now instead of making its own decisions. This means the logic
      is centralized in one place.  This is useful because it has to be
      evaluated multiple times.
      
      v2: Some rule fixes; Add AO events
      Fix RIPV, RIPV|EIPV order (Ying Huang)
      Fix UCNA with AR=1 message (Ying Huang)
      Add comment about panicking in m_c_p.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
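How the S and AR bits separate the classes of uncorrected errors can be sketched as a small classifier. The bit positions are the architectural ones for IA32_MCi_STATUS; the enum, the function name `classify` and the `ser_supported` parameter are hypothetical, and the real grading rules also consider RIPV/EIPV, the interrupted context and the MCA error code, all of which are left out here.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCI_STATUS_UC  (1ULL << 61) /* uncorrected error */
#define MCI_STATUS_PCC (1ULL << 57) /* processor context corrupt */
#define MCI_STATUS_S   (1ULL << 56) /* signalled via machine check exception */
#define MCI_STATUS_AR  (1ULL << 55) /* action required */

/* Hypothetical grades, ordered from least to most severe. */
enum sketch_severity {
	SEV_CORRECTED, /* nothing to recover from */
	SEV_UCNA,      /* uncorrected, no action required, not signalled */
	SEV_AO,        /* action optional */
	SEV_AR,        /* action required */
	SEV_PANIC,     /* cannot continue */
};

/* `ser_supported` says whether the CPU implements the S and AR bits at all. */
static enum sketch_severity classify(uint64_t status, bool ser_supported)
{
	if (!(status & MCI_STATUS_UC))
		return SEV_CORRECTED;
	if (status & MCI_STATUS_PCC)
		return SEV_PANIC;
	if (!ser_supported)
		return SEV_PANIC; /* classical UC always leads to panic */
	if (!(status & MCI_STATUS_S))
		return SEV_UCNA;  /* simplification: AR is not checked here */
	if (status & MCI_STATUS_AR)
		return SEV_AR;
	return SEV_AO;
}
```

The ordering of the checks matters: PCC is tested before S and AR because a corrupt processor context rules out recovery regardless of the other bits.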
    • A
      x86, mce: print header/footer only once for multiple MCEs · 86503560
      Committed by Andi Kleen
      When multiple MCEs are printed print the "HARDWARE ERROR" header
      and "This is not a software error" footer only once. This
      makes the output much more compact with many CPUs.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
    • A
      x86, mce: default to panic timeout for machine checks · 29b0f591
      Committed by Andi Kleen
      Fatal machine checks can be logged to disk after boot, but only if
      the system did a warm reboot. That's unfortunately difficult with the
      default panic behaviour, which waits forever, so the admin has to
      press the power button because modern systems usually lack a reset button.
      This clears the machine checks in the registers and makes
      it impossible to log them.
      
      This patch changes the default for machine check panic to always
      reboot after 30s. Then the mce can be successfully logged after
      reboot.
      
      I believe this will improve machine check experience for any
      system running the X server.
      
      This is dependent on successful boot logging of MCEs. This currently
      only works on Intel systems, on AMD there are quite a lot of systems
      around which leave junk in the machine check registers after boot,
      so it's disabled here. These systems will continue to default
      to endless waiting panic.
      
      v2: Only force panic timeout when it's shorter (H.Seto)
      v3: Only force timeout when there is no timeout
      (based on comment H.Seto)
      
      [ Fix changelog - HS ]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
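The v3 rule ("only force timeout when there is no timeout") together with the dependency on boot logging can be written down in a few lines. In this sketch the 30 second value comes from the commit text, while `effective_panic_timeout`, its parameters and the convention that 0 means "wait forever" are assumptions made for illustration.

```c
#include <stdbool.h>

#define MCE_PANIC_TIMEOUT 30 /* seconds, as stated in the commit message */

/*
 * Decide which panic timeout a fatal machine check should use.
 * `panic_timeout` is the administrator's setting (0 is assumed to mean
 * "wait forever"); `bootlog_works` says whether MCEs left in the
 * registers can be logged after reboot on this system.
 */
static int effective_panic_timeout(int panic_timeout, bool bootlog_works)
{
	if (bootlog_works && panic_timeout == 0)
		return MCE_PANIC_TIMEOUT;
	return panic_timeout; /* never override an explicit setting */
}
```

A system whose boot logging is disabled keeps the endless wait, which matches the behaviour the commit describes for the affected AMD machines.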
    • H
      x86, mce: improve mce_get_rip · 1b2797dc
      Committed by Huang Ying
      Assume IP on the stack is valid when either EIPV or RIPV are set.
      This influences whether the machine check exception handler decides
      to return or panic.
      
      This fixes a test case in the mce-test suite and is more compliant
      with the specification.
      
      This currently only makes a difference in an artificial testing
      scenario with the mce-test test suite.
      
      In addition, do not force EIPV to be valid when the exact
      register MSRs are used, and keep trusting the CS value on the stack
      even if the MSR is available.
      
      [AK: combination of patches from Huang Ying and Hidetoshi Seto, with
      new description by me]
      [add some description, no code changed - HS]
      Signed-off-by: Huang Ying <ying.huang@intel.com>
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
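The first sentence of this commit reduces to a single bit test on the global status register. The sketch below uses the architectural bit positions of IA32_MCG_STATUS; the helper name `stack_ip_valid` is hypothetical and the surrounding logic of mce_get_rip is not reproduced.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCG_STATUS_RIPV (1ULL << 0) /* restart IP on the stack is valid */
#define MCG_STATUS_EIPV (1ULL << 1) /* IP on the stack is related to the error */

/* The IP saved on the stack is trusted when either flag is set. */
static bool stack_ip_valid(uint64_t mcgstatus)
{
	return (mcgstatus & (MCG_STATUS_RIPV | MCG_STATUS_EIPV)) != 0;
}
```

Testing both flags with one mask means that a status with only EIPV set is treated the same way as one with only RIPV set.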
    • A
      x86, mce: make non Monarch panic message "Fatal machine check" too · ac960375
      Committed by Andi Kleen
      ... instead of "Machine check". This is for consistency with the Monarch
      panic message.
      
      Based on a report from Ying Huang.
      
      v2: But add a descriptive postfix so that the test suite can distinguish.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>