1. 22 7月, 2007 3 次提交
    • T
      x86_64: mcelog tolerant level cleanup · bd78432c
      Tim Hockin 提交于
      Background:
       The MCE handler has several paths that it can take, depending on various
       conditions of the MCE status and the value of the 'tolerant' knob.  The
       exact semantics are not well defined and the code is a bit twisty.
      
      Description:
       This patch makes the MCE handler's behavior more clear by documenting the
       behavior for various 'tolerant' levels.  It also fixes or enhances
       several small things in the handler.  Specifically:
           * If RIPV is set it is not safe to restart, so set the 'no way out'
             flag rather than the 'kill it' flag.
           * Don't panic() on correctable MCEs.
           * If the _OVER bit is set *and* the _UC bit is set (meaning possibly
             dropped uncorrected errors), set the 'no way out' flag.
           * Use EIPV for testing whether an app can be killed (SIGBUS) rather
             than RIPV.  According to docs, EIPV indicates that the error is
             related to the IP, while RIPV simply means the IP is valid to
             restart from.
           * Don't clear the MCi_STATUS registers until after the panic() path.
             This leaves the status bits set after the panic() so clever BIOSes
             can find them (and dumb BIOSes can do nothing).
      
       This patch also calls nonseekable_open() in mce_open (as suggested by akpm).
      
      Result:
       Tolerant levels behave almost identically to how they always have, but
       not it's well defined.  There's a slightly higher chance of panic()ing
       when multiple errors happen (a good thing, IMHO).  If you take an MBE and
       panic(), the error status bits are not cleared.
      
      Alternatives:
       None.
      
      Testing:
       I used software to inject correctable and uncorrectable errors.  With
       tolerant = 3, the system usually survives.  With tolerant = 2, the system
       usually panic()s (PCC) but not always.  With tolerant = 1, the system
       always panic()s.  When the system panic()s, the BIOS is able to detect
       that the cause of death was an MC4.  I was not able to reproduce the
       case of a non-PCC error in userspace, with EIPV, with (tolerant < 3).
       That will be rare at best.
      Signed-off-by: NTim Hockin <thockin@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bd78432c
    • T
      x86_64: support poll() on /dev/mcelog · e02e68d3
      Tim Hockin 提交于
      Background:
       /dev/mcelog is typically polled manually.  This is less than optimal for
       situations where accurate accounting of MCEs is important.  Calling
       poll() on /dev/mcelog does not work.
      
      Description:
       This patch adds support for poll() to /dev/mcelog.  This results in
       immediate wakeup of user apps whenever the poller finds MCEs.  Because
       the exception handler can not take any locks, it can not call the wakeup
       itself.  Instead, it uses a thread_info flag (TIF_MCE_NOTIFY) which is
       caught at the next return from interrupt or exit from idle, calling the
       mce_user_notify() routine.  This patch also disables the "fake panic"
       path of the mce_panic(), because it results in printk()s in the exception
       handler and crashy systems.
      
       This patch also does some small cleanup for essentially unused variables,
       and moves the user notification into the body of the poller, so it is
       only called once per poll, rather than once per CPU.
      
      Result:
       Applications can now poll() on /dev/mcelog.  When an error is logged
       (whether through the poller or through an exception) the applications are
       woken up promptly.  This should not affect any previous behaviors.  If no
       MCEs are being logged, there is no overhead.
      
      Alternatives:
       I considered simply supporting poll() through the poller and not using
       TIF_MCE_NOTIFY at all.  However, the time between an uncorrectable error
       happening and the user application being notified is *the*most* critical
       window for us.  Many uncorrectable errors can be logged to the network if
       given a chance.
      
       I also considered doing the MCE poll directly from the idle notifier, but
       decided that was overkill.
      
      Testing:
       I used an error-injecting DIMM to create lots of correctable DRAM errors
       and verified that my user app is woken up in sync with the polling interval.
       I also used the northbridge to inject uncorrectable ECC errors, and
       verified (printk() to the rescue) that the notify routine is called and the
       user app does wake up.  I built with PREEMPT on and off, and verified
       that my machine survives MCEs.
      
      [wli@holomorphy.com: build fix]
      Signed-off-by: NTim Hockin <thockin@google.com>
      Signed-off-by: NWilliam Irwin <bill.irwin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e02e68d3
    • T
      x86_64: O_EXCL on /dev/mcelog · f528e7ba
      Tim Hockin 提交于
      Background:
       /dev/mcelog is a clear-on-read interface.  It is currently possible for
       multiple users to open and read() the device.  Users are protected from
       each other during any one read, but not across reads.
      
      Description:
       This patch adds support for O_EXCL to /dev/mcelog.  If a user opens the
       device with O_EXCL, no other user may open the device (EBUSY).  Likewise,
       any user that tries to open the device with O_EXCL while another user has
       the device will fail (EBUSY).
      
      Result:
       Applications can get exclusive access to /dev/mcelog.  Applications that
       do not care will be unchanged.
      
      Alternatives:
       A simpler choice would be to only allow one open() at all, regardless of
       O_EXCL.
      
      Testing:
       I wrote an application that opens /dev/mcelog with O_EXCL and observed
       that any other app that tried to open /dev/mcelog would fail until the
       exclusive app had closed the device.
      
      Caveats:
       None.
      Signed-off-by: NTim Hockin <thockin@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f528e7ba
  2. 18 7月, 2007 1 次提交
    • J
      usermodehelper: Tidy up waiting · 86313c48
      Jeremy Fitzhardinge 提交于
      Rather than using a tri-state integer for the wait flag in
      call_usermodehelper_exec, define a proper enum, and use that.  I've
      preserved the integer values so that any callers I've missed should
      still work OK.
      Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: David Howells <dhowells@redhat.com>
      86313c48
  3. 24 6月, 2007 1 次提交
    • J
      x86_64: fix misplaced `continue' in mce.c · 4f84e4be
      Joshua Wise 提交于
      Background:
        When a userspace application wants to know about machine check events, it
        opens /dev/mcelog and does a read(). Usually, we found that this interface
        works well, but in some cases, when the system was taking large numbers of
        machine check exceptions, the read() would hang. The system would output a
        soft-lockup warning, and the daemon reading from /dev/mcelog would suck up
        as much of a single CPU as it could spinning in system space.
      
      Description:
        This patch fixes this bug. In particular, there was a "continue" inside a
        timeout loop that presumably was intended to break out of the outer loop,
        but instead caused the inner loop to continue. This patch also makes the
        condition for the break-out a little more evident by changing a
        !time_before to a time_after_eq.
      
      Result:
        The read() no longer hangs in this test case.
      
      Testing:
        On my system, I could replicate the bug with the following command:
          # for i in `seq 15000`; do ./inject_sbe.sh; done
        where inject_sbe.sh contains commands to inject a single-bit error into the
        next memory write transaction.
      
      Patch:
        This patch is against git f1518a08.
      Signed-off-by: NJoshua Wise <jwise@google.com>
      Signed-off-by: NTim Hockin <thockin@google.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4f84e4be
  4. 10 5月, 2007 1 次提交
    • R
      Add suspend-related notifications for CPU hotplug · 8bb78442
      Rafael J. Wysocki 提交于
      Since nonboot CPUs are now disabled after tasks and devices have been
      frozen and the CPU hotplug infrastructure is used for this purpose, we need
      special CPU hotplug notifications that will help the CPU-hotplug-aware
      subsystems distinguish normal CPU hotplug events from CPU hotplug events
      related to a system-wide suspend or resume operation in progress.  This
      patch introduces such notifications and causes them to be used during
      suspend and resume transitions.  It also changes all of the
      CPU-hotplug-aware subsystems to take these notifications into consideration
      (for now they are handled in the same way as the corresponding "normal"
      ones).
      
      [oleg@tv-sign.ru: cleanups]
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8bb78442
  5. 09 5月, 2007 1 次提交
    • C
      move die notifier handling to common code · 1eeb66a1
      Christoph Hellwig 提交于
      This patch moves the die notifier handling to common code.  Previous
      various architectures had exactly the same code for it.  Note that the new
      code is compiled unconditionally, this should be understood as an appel to
      the other architecture maintainer to implement support for it aswell (aka
      sprinkling a notify_die or two in the proper place)
      
      arm had a notifiy_die that did something totally different, I renamed it to
      arm_notify_die as part of the patch and made it static to the file it's
      declared and used at.  avr32 used to pass slightly less information through
      this interface and I brought it into line with the other architectures.
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: fix vmalloc_sync_all bustage]
      [bryan.wu@analog.com: fix vmalloc_sync_all in nommu]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: NBryan Wu <bryan.wu@analog.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1eeb66a1
  6. 03 5月, 2007 1 次提交
    • T
      [PATCH] x86-64: Dynamically adjust machine check interval · 8a336b0a
      Tim Hockin 提交于
      Background:
       We've found that MCEs (specifically DRAM SBEs) tend to come in bunches,
       especially when we are trying really hard to stress the system out.  The
       current MCE poller uses a static interval which does not care whether it
       has or has not found MCEs recently.
      
      Description:
       This patch makes the MCE poller adjust the polling interval dynamically.
       If we find an MCE, poll 2x faster (down to 10 ms).  When we stop finding
       MCEs, poll 2x slower (up to check_interval seconds).  The check_interval
       tunable becomes the max polling interval.  The "Machine check events
       logged" printk() is rate limited to the check_interval, which should be
       identical behavior to the old functionality.
      
      Result:
       If you start to take a lot of correctable errors (not exceptions), you
       log them faster and more accurately (less chance of overflowing the MCA
       registers).  If you don't take a lot of errors, you will see no change.
      
      Alternatives:
       I considered simply reducing the polling interval to 10 ms immediately
       and keeping it there as long as we continue to find errors.  This felt a
       bit heavy handed, but does perform significantly better for the default
       check_interval of 5 minutes (we're using a few seconds when testing for
       DRAM errors).  I could be convinced to go with this, if anyone felt it
       was not too aggressive.
      
      Testing:
       I used an error-injecting DIMM to create lots of correctable DRAM errors
       and verified that the polling interval accelerates.  The printk() only
       happens once per check_interval seconds.
      
      Patch:
       This patch is against 2.6.21-rc7.
      Signed-Off-By: NTim Hockin <thockin@google.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      8a336b0a
  7. 13 2月, 2007 2 次提交
  8. 08 12月, 2006 1 次提交
  9. 07 12月, 2006 1 次提交
  10. 22 11月, 2006 2 次提交
    • D
      WorkStruct: Pass the work_struct pointer instead of context data · 65f27f38
      David Howells 提交于
      Pass the work_struct pointer to the work function rather than context data.
      The work function can use container_of() to work out the data.
      
      For the cases where the container of the work_struct may go away the moment the
      pending bit is cleared, it is made possible to defer the release of the
      structure by deferring the clearing of the pending bit.
      
      To make this work, an extra flag is introduced into the management side of the
      work_struct.  This governs auto-release of the structure upon execution.
      
      Ordinarily, the work queue executor would release the work_struct for further
      scheduling or deallocation by clearing the pending bit prior to jumping to the
      work function.  This means that, unless the driver makes some guarantee itself
      that the work_struct won't go away, the work function may not access anything
      else in the work_struct or its container lest they be deallocated..  This is a
      problem if the auxiliary data is taken away (as done by the last patch).
      
      However, if the pending bit is *not* cleared before jumping to the work
      function, then the work function *may* access the work_struct and its container
      with no problems.  But then the work function must itself release the
      work_struct by calling work_release().
      
      In most cases, automatic release is fine, so this is the default.  Special
      initiators exist for the non-auto-release case (ending in _NAR).
      Signed-Off-By: NDavid Howells <dhowells@redhat.com>
      65f27f38
    • D
      WorkStruct: Separate delayable and non-delayable events. · 52bad64d
      David Howells 提交于
      Separate delayable work items from non-delayable work items be splitting them
      into a separate structure (delayed_work), which incorporates a work_struct and
      the timer_list removed from work_struct.
      
      The work_struct struct is huge, and this limits it's usefulness.  On a 64-bit
      architecture it's nearly 100 bytes in size.  This reduces that by half for the
      non-delayable type of event.
      Signed-Off-By: NDavid Howells <dhowells@redhat.com>
      52bad64d
  11. 26 9月, 2006 2 次提交
    • D
      [PATCH] x86: Refactor thermal throttle processing · 15d5f839
      Dmitriy Zavin 提交于
      Refactor the event processing (syslog messaging and rate limiting)
      into separate file therm_throt.c. This allows consistent reporting
      of CPU thermal throttle events.
      
      After ACK'ing the interrupt, if the event is current, the user
      (p4.c/mce_intel.c) calls therm_throt_process to log (and rate limit)
      the event. If that function returns 1, the user has the option to log
      things further (such as to mce_log in x86_64).
      
      AK: minor cleanup
      Signed-off-by: NDmitriy Zavin <dmitriyz@google.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      15d5f839
    • A
      [PATCH] Remove safe_smp_processor_id() · 151f8cc1
      Andi Kleen 提交于
      And replace all users with ordinary smp_processor_id.  The function
      was originally added to get some basic oops information out even
      if the GS register was corrupted. However that didn't
      work for some anymore because printk is needed to print the oops
      and it uses smp_processor_id() already. Also GS register corruptions
      are not particularly common anymore.
      
      This also helps the Xen port which would otherwise need to
      do this in a special way because it can't access the local APIC.
      
      Cc: Chris Wright <chrisw@sous-sol.org>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      151f8cc1
  12. 01 8月, 2006 1 次提交
  13. 28 6月, 2006 2 次提交
  14. 27 6月, 2006 1 次提交
  15. 26 4月, 2006 1 次提交
  16. 10 4月, 2006 1 次提交
  17. 01 4月, 2006 1 次提交
    • O
      [PATCH] Don't pass boot parameters to argv_init[] · 9b41046c
      OGAWA Hirofumi 提交于
      The boot cmdline is parsed in parse_early_param() and
      parse_args(,unknown_bootoption).
      
      And __setup() is used in obsolete_checksetup().
      
      	start_kernel()
      		-> parse_args()
      			-> unknown_bootoption()
      				-> obsolete_checksetup()
      
      If __setup()'s callback (->setup_func()) returns 1 in
      obsolete_checksetup(), obsolete_checksetup() thinks a parameter was
      handled.
      
      If ->setup_func() returns 0, obsolete_checksetup() tries other
      ->setup_func().  If all ->setup_func() that matched a parameter returns 0,
      a parameter is seted to argv_init[].
      
      Then, when runing /sbin/init or init=app, argv_init[] is passed to the app.
      If the app doesn't ignore those arguments, it will warning and exit.
      
      This patch fixes a wrong usage of it, however fixes obvious one only.
      Signed-off-by: NOGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      9b41046c
  18. 24 3月, 2006 1 次提交
    • A
      [PATCH] x86_64: {set,clear,test}_bit() related cleanup and pci_mmcfg_init() fix · 3d1712c9
      Akinobu Mita 提交于
      While working on these patch set, I found several possible cleanup on x86-64
      and ia64.
      
      akpm: I stole this from Andi's queue.
      
      Not only does it clean up bitops.  It also unrelatedly changes the prototype
      of pci_mmcfg_init() and removes its arch_initcall().  It seems that the wrong
      two patches got joined together, but this is the one which has been tested.
      
      This patch fixes the current x86_64 build error (the pci_mmcfg_init()
      declaration in arch/i386/pci/pci.h disagrees with the definition in
      arch/x86_64/pci/mmconfig.c)
      
      This also means that x86_64's pci_mmcfg_init() gets called in the same (new)
      manner as x86's: from arch/i386/pci/init.c:pci_access_init(), rather than via
      initcall.
      
      The bitops cleanups came along for free.
      
      All this worked OK in -mm testing (since 2.6.16-rc4-mm1) because x86_64 was
      tested with both patches applied.
      Signed-off-by: NAkinobu Mita <mita@miraclelinux.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Cc: Con Kolivas <kernel@kolivas.org>
      Cc: Jean Delvare <khali@linux-fr.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3d1712c9
  19. 05 2月, 2006 1 次提交
  20. 12 1月, 2006 5 次提交
  21. 15 11月, 2005 2 次提交
    • A
      [PATCH] x86_64: Log machine checks from boot on Intel systems · e583538f
      Andi Kleen 提交于
      The logging for boot errors was turned off because it was broken
      on some AMD systems. But give Intel EM64T systems a chance because they are
      supposed to be correct there.
      
      The advantage is that there is a chance to actually log uncorrected
      machine checks after the reset.
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      e583538f
    • J
      [PATCH] x86_64: Support for AMD specific MCE Threshold. · 89b831ef
      Jacob Shin 提交于
      MC4_MISC - DRAM Errors Threshold Register realized under AMD K8 Rev F.
      This register is used to count correctable and uncorrectable ECC errors that occur during DRAM read operations.
      The user may interface through sysfs files in order to change the threshold configuration.
      
      bank%d/error_count - reads current error count, write to clear.
      bank%d/interrupt_enable - set/clear interrupt enable.
      bank%d/threshold_limit - read/write the threshold limit.
      
      APIC vector 0xF9 in hw_irq.h.
      5 software defined bank ids in mce.h.
      new apic.c function to setup threshold apic lvt.
      defaults to interrupt off, count enabled, and threshold limit max.
      sysfs interface created on /sys/devices/system/threshold.
      
      AK: added some ifdefs to make it compile on UP
      Signed-off-by: NJacob Shin <jacob.shin@amd.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      89b831ef
  22. 30 9月, 2005 1 次提交
    • M
      [PATCH] x86_64: Fix mce_log · 7644143c
      Mike Waychison 提交于
      The attempt to fixup the lockless mce log buffer introduced an infinite loop
      when trying to find a free entry.
      
      And:
      
      Using rcu_dereference() to load mcelog.next doesn't seem to be sufficient
      enough to ensure that mcelog.next is loaded each time around the loop in
      mce_log().  Instead, use an explicit rmb() to ensure that the compiler gets it
      right.
      
      AK: turned the smp_wmbs into true wmbs to make sure they are not
      reordered by the compiler on UP.
      Signed-off-by: NMike Waychison <mikew@google.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      7644143c
  23. 13 9月, 2005 4 次提交
  24. 08 8月, 2005 1 次提交
  25. 29 7月, 2005 1 次提交
  26. 26 6月, 2005 1 次提交
    • P
      [PATCH] RCU: clean up a few remaining synchronize_kernel() calls · b2b18660
      Paul E. McKenney 提交于
      2.6.12-rc6-mm1 has a few remaining synchronize_kernel()s, some (but not
      all) in comments.  This patch changes these synchronize_kernel() calls (and
      comments) to synchronize_rcu() or synchronize_sched() as follows:
      
      - arch/x86_64/kernel/mce.c mce_read(): change to synchronize_sched() to
        handle races with machine-check exceptions (synchronize_rcu() would not cut
        it given RCU implementations intended for hardcore realtime use.
      
      - drivers/input/serio/i8042.c i8042_stop(): change to synchronize_sched() to
        handle races with i8042_interrupt() interrupt handler.  Again,
        synchronize_rcu() would not cut it given RCU implementations intended for
        hardcore realtime use.
      
      - include/*/kdebug.h comments: change to synchronize_sched() to handle races
        with NMIs.  As before, synchronize_rcu() would not cut it...
      
      - include/linux/list.h comment: change to synchronize_rcu(), since this
        comment is for list_del_rcu().
      
      - security/keys/key.c unregister_key_type(): change to synchronize_rcu(),
        since this is interacting with RCU read side.
      
      - security/keys/process_keys.c install_session_keyring(): change to
        synchronize_rcu(), since this is interacting with RCU read side.
      Signed-off-by: N"Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b2b18660