  1. 08 Sep 2005, 6 commits
    • [PATCH] optimize writer path in time_interpolator_get_counter() · 486d46ae
      Alex Williamson committed
            Christoph Lameter <clameter@engr.sgi.com>
      
      When using a time interpolator that is susceptible to jitter, there is
      potential contention over a cmpxchg used to prevent time from going
      backwards.  This is unnecessary when the caller holds the xtime write
      seqlock, as all readers will be blocked from returning until the write
      is complete.  We can therefore allow writers to insert a new value and
      exit rather than fight with CPUs that hold only a reader lock.
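
      A minimal, hedged C11 userspace sketch of the resulting pattern (the
      names last_cycle and publish_counter are illustrative; the real change
      lives in time_interpolator_get_counter()):

        #include <stdatomic.h>

        static _Atomic unsigned long last_cycle;  /* last value handed out */

        /* Readers must CAS so published time never moves backwards under
         * jitter; a caller already holding the write-side seqlock may store
         * directly, because readers retry until the seqlock write ends. */
        unsigned long publish_counter(unsigned long now, int holds_write_seqlock)
        {
            unsigned long lcycle = atomic_load(&last_cycle);

            for (;;) {
                if (lcycle >= now)            /* jitter: stay monotonic */
                    return lcycle;
                if (holds_write_seqlock) {    /* writer fast path added here */
                    atomic_store(&last_cycle, now);
                    return now;
                }
                /* reader slow path: contend on the compare-and-swap */
                if (atomic_compare_exchange_weak(&last_cycle, &lcycle, now))
                    return now;
                /* failed CAS reloaded lcycle; loop and re-evaluate */
            }
        }
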
      Signed-off-by: Alex Williamson <alex.williamson@hp.com>
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Provide better printk() support for SMP machines · fe21773d
      David Howells committed
      The attached patch prevents oopses from interleaving with characters
      from other printks on other CPUs, by breaking the lock only if the
      oops is happening on the CPU holding the lock.
      
      It might be better if the oops generator got the lock and then called an
      inner vprintk routine that assumed the caller holds the lock, thus
      making oops reports "atomic".
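
      A hedged sketch of the resulting logic (identifiers such as printk_cpu
      and logbuf_lock follow 2.6-era printk.c, but the body is a
      simplification with an assumed lock re-init helper, not the patch
      verbatim):

        static int printk_cpu = -1;   /* CPU currently printing, -1 if none */

        int vprintk_sketch(const char *fmt, va_list args)
        {
            unsigned long flags;

            /* Break the lock only when the oops is on the CPU that already
             * holds it (recursion would otherwise deadlock); an oops on any
             * other CPU now spins normally, so its output cannot interleave
             * with a printk in progress elsewhere. */
            if (oops_in_progress && printk_cpu == smp_processor_id())
                reinit_logbuf_lock();   /* assumed helper: re-init logbuf_lock */

            spin_lock_irqsave(&logbuf_lock, flags);
            printk_cpu = smp_processor_id();
            /* ... format the message into the log buffer ... */
            printk_cpu = -1;
            spin_unlock_irqrestore(&logbuf_lock, flags);
            return 0;
        }
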
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] detect soft lockups · 8446f1d3
      Ingo Molnar committed
      This patch adds a new kernel debug feature: CONFIG_DETECT_SOFTLOCKUP.
      
      When enabled, per-CPU watchdog threads are started, which try to run
      once per second.  If one gets delayed for more than 10 seconds, a
      callback from the timer interrupt detects the condition and prints out
      a warning message and a stack dump (once per lockup incident).  The
      feature is otherwise non-intrusive: it doesn't try to unlock the box
      in any way, it only gets the debug info out, automatically, on all
      CPUs affected by the lockup.
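
      A hedged sketch of the mechanism (the per-CPU variable and function
      names are illustrative; the real code also handles thread setup and
      priorities):

        static DEFINE_PER_CPU(unsigned long, touch_timestamp);

        static int watchdog_thread(void *arg)  /* one per CPU, aims for 1 Hz */
        {
            for (;;) {
                __get_cpu_var(touch_timestamp) = jiffies;  /* "still alive" */
                msleep_interruptible(1000);
            }
            return 0;
        }

        void softlockup_tick_sketch(void)  /* called from the timer interrupt */
        {
            unsigned long touched = __get_cpu_var(touch_timestamp);

            if (time_after(jiffies, touched + 10 * HZ)) {
                printk(KERN_ERR "BUG: soft lockup detected on CPU#%d!\n",
                       smp_processor_id());
                dump_stack();
                __get_cpu_var(touch_timestamp) = jiffies;  /* once per incident */
            }
        }
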
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: Matthias Urlichs <smurf@smurf.noris.de>
      Signed-off-by: Richard Purdie <rpurdie@rpsys.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] FUTEX_WAKE_OP: pthread_cond_signal() speedup · 4732efbe
      Jakub Jelinek committed
      At the moment pthread_cond_signal is unnecessarily slow, because it
      wakes one waiter (which, at least on UP, usually means an immediate
      context switch to one of the waiter threads).  This waiter wakes up
      and after a few instructions attempts to acquire the cv's internal
      lock, but that lock is still held by the thread calling
      pthread_cond_signal.  So it goes to sleep and eventually the
      signalling thread is scheduled in, unlocks the internal lock and
      wakes the waiter again.
      
      Now, before 2003-09-21 NPTL was using FUTEX_REQUEUE in pthread_cond_signal
      to avoid this performance issue, but it was removed when locks were
      redesigned to the 3 state scheme (unlocked, locked uncontended, locked
      contended).
      
      The following scenario shows why simply using FUTEX_REQUEUE in
      pthread_cond_signal together with using lll_mutex_unlock_force in place of
      lll_mutex_unlock is not enough and probably why it has been disabled at
      that time:
      
      The number in the left column is the value of cv->__data.__lock.
              thr1            thr2            thr3
      0       pthread_cond_wait
      1       lll_mutex_lock (cv->__data.__lock)
      0       lll_mutex_unlock (cv->__data.__lock)
      0       lll_futex_wait (&cv->__data.__futex, futexval)
      0                       pthread_cond_signal
      1                       lll_mutex_lock (cv->__data.__lock)
      1                                       pthread_cond_signal
      2                                       lll_mutex_lock (cv->__data.__lock)
      2                                         lll_futex_wait (&cv->__data.__lock, 2)
      2                       lll_futex_requeue (&cv->__data.__futex, 0, 1, &cv->__data.__lock)
                                # FUTEX_REQUEUE, not FUTEX_CMP_REQUEUE
      2                       lll_mutex_unlock_force (cv->__data.__lock)
      0                         cv->__data.__lock = 0
      0                         lll_futex_wake (&cv->__data.__lock, 1)
      1       lll_mutex_lock (cv->__data.__lock)
      0       lll_mutex_unlock (cv->__data.__lock)
                # Here, lll_mutex_unlock doesn't know there are threads waiting
                # on the internal cv's lock
      
      Now, I believe it is possible to use FUTEX_REQUEUE in pthread_cond_signal,
      but it will cost us not one, but 2 extra syscalls and, what's worse, one of
      these extra syscalls will be done for every single waiting loop in
      pthread_cond_*wait.
      
      We would need to use lll_mutex_unlock_force in pthread_cond_signal after
      requeue and lll_mutex_cond_lock in pthread_cond_*wait after lll_futex_wait.
      
      Another alternative is to do the unlocking pthread_cond_signal needs to do
      (the lock can't be unlocked before lll_futex_wake, as that is racy) in the
      kernel.
      
      I have implemented both variants; futex-requeue-glibc.patch is the
      first one and futex-wake_op{,-glibc}.patch is the unlocking inside the
      kernel.  The kernel interface lets userland specify exactly what the
      unlocking operation should look like: some atomic arithmetic operation
      with an optional constant argument, plus a comparison of the previous
      futex value with another constant.
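
      For illustration, a hedged userspace sketch of that interface (the
      wrapper and the condvar-style use are assumptions about usage;
      FUTEX_OP() and the FUTEX_OP_* constants are the encoding from
      <linux/futex.h>):

        #define _GNU_SOURCE
        #include <linux/futex.h>
        #include <stdint.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        static long futex_wake_op(uint32_t *uaddr, int nr_wake,
                                  uint32_t *uaddr2, int nr_wake2, uint32_t op)
        {
            return syscall(SYS_futex, uaddr, FUTEX_WAKE_OP, nr_wake,
                           nr_wake2, uaddr2, op);
        }

        /* pthread_cond_signal-style use: atomically set the internal lock
         * word to 0 (unlock), wake one waiter on the condvar futex, and also
         * wake one waiter on the lock if its old value showed contention. */
        static void signal_one(uint32_t *cond_futex, uint32_t *cond_lock)
        {
            futex_wake_op(cond_futex, 1, cond_lock, 1,
                          FUTEX_OP(FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1));
        }
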
      
      It has been implemented just for ppc*, x86_64 and i?86; for other
      architectures I'm including just a stub header, which can be used as a
      starting point by maintainers to write support for their arches and
      which at the moment will just return -ENOSYS for FUTEX_WAKE_OP.  The
      requeue patch has been (lightly) tested just on x86_64; the wake_op
      patch on a ppc64 kernel running 32-bit and 64-bit NPTL and an x86_64
      kernel running 32-bit and 64-bit NPTL.
      
      With the following benchmark on UP x86-64 I get:
      
      for i in nptl-orig nptl-requeue nptl-wake_op; do
        echo time elf/ld.so --library-path .:$i /tmp/bench
        for j in 1 2; do ( time elf/ld.so --library-path .:$i /tmp/bench ) 2>&1; done
      done
      time elf/ld.so --library-path .:nptl-orig /tmp/bench
      real 0m0.655s user 0m0.253s sys 0m0.403s
      real 0m0.657s user 0m0.269s sys 0m0.388s
      time elf/ld.so --library-path .:nptl-requeue /tmp/bench
      real 0m0.496s user 0m0.225s sys 0m0.271s
      real 0m0.531s user 0m0.242s sys 0m0.288s
      time elf/ld.so --library-path .:nptl-wake_op /tmp/bench
      real 0m0.380s user 0m0.176s sys 0m0.204s
      real 0m0.382s user 0m0.175s sys 0m0.207s
      
      The benchmark is at:
      http://sourceware.org/ml/libc-alpha/2005-03/txt00001.txt
      Older futex-requeue-glibc.patch version is at:
      http://sourceware.org/ml/libc-alpha/2005-03/txt00002.txt
      Older futex-wake_op-glibc.patch version is at:
      http://sourceware.org/ml/libc-alpha/2005-03/txt00003.txt
      I will post a new version (just x86-64 fixes, so that the patch
      applies against pthread_cond_signal.S) to the libc-hacker mailing
      list soon.
      
      Attached is the kernel FUTEX_WAKE_OP patch as well as a simple-minded
      testcase that will not test the atomicity of the operation, but at least
      check if the threads that should have been woken up are woken up and
      whether the arithmetic operation in the kernel gave the expected results.
      Acked-by: Ingo Molnar <mingo@redhat.com>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Jamie Lokier <jamie@shareable.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Yoichi Yuasa <yuasa@hh.iij4u.or.jp>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] swsusp: update documentation · d7ae79c7
      Pavel Machek committed
      This updates documentation a bit (mostly removing obsolete stuff), and
      marks swsusp as no longer experimental in config.
      Signed-off-by: Pavel Machek <pavel@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] x86/x86_64: deferred handling of writes to /proc/irqxx/smp_affinity · 54d5d424
      Ashok Raj committed
      When handling writes to /proc/irq, the current code reprograms RTE
      entries directly.  This is not recommended: it could potentially cause
      chipsets to lock up, or interrupts to go missing.

      CONFIG_IRQ_BALANCE does this correctly, reprogramming only when the
      interrupt is pending.  The same needs to be done for /proc/irq handling
      as well; otherwise userspace irq balancers are really not doing the
      right thing.
      
      - Changed pending_irq_balance_cpumask to pending_irq_migrate_cpumask,
        for lack of a generic name.
      - Moved move_irq out of IRQ_BALANCE, and added the same to X86_64.
      - Added a new proc handler for the write path, so the write can be
        deferred to interrupt-handling time (see the sketch after this list).
      - /proc/irq/XX/smp_affinity used to display CPU_MASKALL; it now shows
        only the active cpu mask, or exactly what was set.
      - Provided a common move_irq implementation, instead of duplicating it
        when using the generic irq framework.
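
      A hedged sketch of the deferred-write idea (pending_irq_migrate_cpumask
      and move_irq are named above; the pending bitmap and the
      set_rte_affinity() helper are assumptions):

        static cpumask_t pending_irq_migrate_cpumask[NR_IRQS];
        static DECLARE_BITMAP(pending_irq_move, NR_IRQS);

        /* /proc write handler: record the request, don't touch the RTE now */
        static void smp_affinity_write_sketch(int irq, cpumask_t new_mask)
        {
            pending_irq_migrate_cpumask[irq] = new_mask;
            set_bit(irq, pending_irq_move);
        }

        /* called on the interrupt path, where the interrupt is known to be
         * pending/in service and reprogramming the RTE is safe */
        void move_irq_sketch(int irq)
        {
            if (test_and_clear_bit(irq, pending_irq_move))
                set_rte_affinity(irq, pending_irq_migrate_cpumask[irq]);
        }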
      
      Tested on i386/x86_64 and ia64 with CONFIG_PCI_MSI turned on and off.
      Tested UP builds as well.
      
      MSI testing: TBD.  I have cards but need to look for an x-over cable,
      although I did test an earlier version of this patch.  Will test in a
      couple of days.
      Signed-off-by: Ashok Raj <ashok.raj@intel.com>
      Acked-by: Zwane Mwaikambo <zwane@holomorphy.com>
      Grudgingly-acked-by: Andi Kleen <ak@muc.de>
      Signed-off-by: Coywolf Qi Hunt <coywolf@lovecn.org>
      Signed-off-by: Ashok Raj <ashok.raj@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  2. 05 Sep 2005, 9 commits
    • [PATCH] UML Support - Ptrace: adds the host SYSEMU support, for UML and general usage · ed75e8d5
      Laurent Vivier committed
            Jeff Dike <jdike@addtoit.com>,
            Paolo 'Blaisorblade' Giarrusso <blaisorblade_spam@yahoo.it>,
            Bodo Stroesser <bstroesser@fujitsu-siemens.com>
      
      Adds a new ptrace(2) mode, called PTRACE_SYSEMU, resembling
      PTRACE_SYSCALL except that the kernel does not execute the requested
      syscall; this is useful for improving performance in virtual
      environments, like UML, which want to run the syscall themselves.
      
      In fact, using PTRACE_SYSCALL means stopping child execution twice, on entry
      and on exit, and each time you also have two context switches; with SYSEMU you
      avoid the 2nd stop and so save two context switches per syscall.
      
      Also, some architectures don't have support in the host for changing the
      syscall number via ptrace(), which is currently needed to skip syscall
      execution (UML turns any syscall into getpid() to avoid it being executed on
      the host).  Fixing that is hard, while SYSEMU is easier to implement.
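
      For illustration, a hedged i386 userspace sketch of a SYSEMU-style
      tracer loop (emulate_syscall() is a hypothetical helper; error handling
      omitted):

        #include <sys/ptrace.h>
        #include <sys/types.h>
        #include <sys/user.h>
        #include <sys/wait.h>

        #ifndef PTRACE_SYSEMU
        #define PTRACE_SYSEMU 31          /* i386 value from this series */
        #endif

        long emulate_syscall(pid_t pid, struct user_regs_struct *regs);

        void trace_loop(pid_t child)
        {
            int status;
            struct user_regs_struct regs;

            for (;;) {
                ptrace(PTRACE_SYSEMU, child, 0, 0);  /* run to next entry */
                waitpid(child, &status, 0);
                if (WIFEXITED(status))
                    break;
                ptrace(PTRACE_GETREGS, child, 0, &regs);
                /* the kernel will not run the syscall and will not stop
                 * again on exit: emulate it and inject the return value */
                regs.eax = emulate_syscall(child, &regs);
                ptrace(PTRACE_SETREGS, child, 0, &regs);
            }
        }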
      
      * This version of the patch includes some suggestions from Jeff Dike to
        avoid adding any instructions to the syscall fast path, plus some
        other little changes of mine to make it work even when the syscall is
        executed with SYSENTER (but I'm unsure about those).  It has been
        widely tested for quite a long time.

      * Various fixes were included to handle the switches between the
        various states, i.e. when for instance a syscall entry is traced with
        one of PT_SYSCALL / _SYSEMU / _SINGLESTEP and another one is used on
        exit.  Basically, this is done by remembering which one of them was
        used even after the call to ptrace_notify().
      
      * We're combining TIF_SYSCALL_EMU with TIF_SYSCALL_TRACE or TIF_SINGLESTEP
        to make do_syscall_trace() notice that the current syscall was started with
        SYSEMU on entry, so that no notification ought to be done in the exit path;
        this is a bit of a hack, so this problem is solved in another way in next
        patches.
      
      * Also, the effects of the patch:
      "Ptrace - i386: fix Syscall Audit interaction with singlestep"
      are cancelled; they are restored back in the last patch of this series.
      
      Detailed descriptions of the patches doing this kind of processing follow (but
      I've already summed everything up).
      
      * Fix behaviour when changing interception kind #1.
      
        In do_syscall_trace(), we check the status of the TIF_SYSCALL_EMU flag
        only after doing the debugger notification; but the debugger might
        have changed the status of this flag because it continued execution
        with PTRACE_SYSCALL, so this is wrong.  This patch fixes it by saving
        the flag status before calling ptrace_notify().
      
      * Fix behaviour when changing interception kind #2:
        avoid intercepting syscall on return when using SYSCALL again.
      
        A guest process switching from using PTRACE_SYSEMU to PTRACE_SYSCALL
        crashes.
      
        The problem is in arch/i386/kernel/entry.S.  The current SYSEMU patch
        inhibits the syscall-handler to be called, but does not prevent
        do_syscall_trace() to be called after this for syscall completion
        interception.
      
        The appended patch fixes this.  It reuses the flag TIF_SYSCALL_EMU to
        remember "we come from PTRACE_SYSEMU and now are in PTRACE_SYSCALL", since
        the flag is unused in the depicted situation.
      
      * Fix behaviour when changing interception kind #3:
        avoid intercepting syscall on return when using SINGLESTEP.
      
        When testing 2.6.9 with the skas3.v6 patch and my latest patch, I had
        problems with single-stepping on UML in SKAS with SYSEMU.  It looped
        receiving SIGTRAPs without moving forward.  The EIP of the traced
        process was the same for all SIGTRAPs.
      
      What's missing is to handle switching from PTRACE_SYSCALL_EMU to
      PTRACE_SINGLESTEP in a way very similar to what is done for the change from
      PTRACE_SYSCALL_EMU to PTRACE_SYSCALL_TRACE.
      
      I.e., after calling ptrace(PTRACE_SYSEMU), on the return path the
      debugger is notified and then wakes up the process; the syscall is
      executed (or skipped, when do_syscall_trace() returns 0, i.e. when
      using PTRACE_SYSEMU), and do_syscall_trace() is called again.  Since we
      are on the return path of a SYSEMU'd syscall, if the wake-up is
      performed through ptrace(PTRACE_SYSCALL), we must still avoid notifying
      the parent of the syscall exit.  Now, this behaviour is extended even
      to resuming with PTRACE_SINGLESTEP.
      Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Jeff Dike <jdike@addtoit.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] pm: clean up /sys/power/disk · 57c4ce3c
      Pavel Machek committed
      Clean code up a bit, and only show suspend to disk as available when
      it is configured in.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] pm: fix process freezing · 6161b2ce
      Pavel Machek committed
      If process freezing fails, some processes are frozen, and the rest are
      left in a "were asked to be frozen" state.  That's wrong; we should
      leave the system in a consistent state.
      Signed-off-by: Pavel Machek <pavel@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] swsusp: fix error handling and cleanups · 99dc7d63
      Pavel Machek committed
      Drop printing during normal boot (when no image exists in swap), print
      message when drivers fail, fix error paths and consolidate near-identical
      functions in disk.c (and functions with just one statement).
      Signed-off-by: Pavel Machek <pavel@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] swsusp: add locking to software_resume · dd5d666b
      Shaohua Li committed
      This adds locking to protect swsusp_resume_device and
      software_resume() from two users banging on them from userspace at the
      same time.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Pavel Machek <pavel@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] swsusp: simpler calculation of number of pages in PBE list · 56057e1a
      Michal Schmidt committed
      The function calc_nr uses an iterative algorithm to calculate the number of
      pages needed for the image and the pagedir.  Exactly the same result can be
      obtained with a one-line expression.
      
      Note that this was even proved correct ;-).
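
      A hedged sketch of that closed form: the old loop computes the fixed
      point n = nr_copy + ceil(n / PBES_PER_PAGE), whose solution adds
      ceil(nr_copy / (PBES_PER_PAGE - 1)) pagedir pages:

        static int calc_nr(int nr_copy)
        {
            /* image pages plus the pagedir pages needed to describe both
             * the image and the pagedir pages themselves */
            return nr_copy + (nr_copy + PBES_PER_PAGE - 2) / (PBES_PER_PAGE - 1);
        }
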
      Signed-off-by: Michal Schmidt <xschmi00@stud.feec.vutbr.cz>
      Signed-off-by: Pavel Machek <pavel@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] encrypt suspend data for easy wiping · c2ff18f4
      Andreas Steinmetz committed
      The patch protects against leaking sensitive data after resume from
      suspend.  During suspend a temporary key is created, and this key is
      used to encrypt the data written to disk.  When, during resume, the
      data has been read back into memory, the temporary key is destroyed,
      which simply means that all data written to disk during suspend become
      inaccessible, so they can't be stolen later on.
      
      Think of the following: you suspend while an application is running
      that keeps sensitive data in memory.  The application itself prevents
      the data from being swapped out.  Suspend, however, must write these
      data to swap to be able to resume later on.  Without suspend encryption
      your sensitive data are then stored in plaintext on disk.  This means
      that after resume your sensitive data are accessible to all
      applications having direct access to the swap device which was used
      for suspend.  If you don't need swap after resume, these data can
      remain on disk virtually forever.  Thus it can happen that your system
      gets broken into weeks later and sensitive data which you thought were
      encrypted and protected are retrieved and stolen from the swap device.
      Signed-off-by: Andreas Steinmetz <ast@domdv.de>
      Acked-by: Pavel Machek <pavel@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] remove busywait in refrigerator · 2a23b5d1
      Pavel Machek committed
      This should make the refrigerator sleep properly, not busy-wait after
      the first schedule() returns.
      Signed-off-by: Pavel Machek <pavel@suse.cz>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] swap: update swsusp use of swap_info · dae06ac4
      Hugh Dickins committed
      Aha, swsusp dips into swap_info[], better update it to swap_lock.  It's
      bitflipping flags with 0xFF, so get_swap_page will allocate from only the one
      chosen device: let's change that to flip SWP_WRITEOK.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  3. 30 Aug 2005, 3 commits
  4. 27 Aug 2005, 2 commits
    • [PATCH] completely disable cpu_exclusive sched domain · 212d6d22
      Paul Jackson committed
      At the suggestion of Nick Piggin and Dinakar, totally disable
      the facility to allow cpu_exclusive cpusets to define dynamic
      sched domains in Linux 2.6.13, in order to avoid problems
      first reported by John Hawkes (corrupt sched data structures
      and kernel oops).
      
      This has been built for ppc64, i386, ia64, x86_64, sparc, alpha.
      It has been built, booted and tested for cpuset functionality
      on an SN2 (ia64).
      
      Dinakar or Nick - could you verify that it does, for sure, avoid the
      problems Hawkes reported?  Hawkes is out of town, and I don't have the
      recipe to reproduce what he found.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Acked-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] undo partial cpu_exclusive sched domain disabling · ca2f3daf
      Paul Jackson committed
      The partial disabling of Dinakar's new facility to allow
      cpu_exclusive cpusets to define dynamic sched domains
      doesn't go far enough.  At the suggestion of Nick Piggin
      and Dinakar, let us instead totally disable this facility
      for 2.6.13, in order to avoid problems first reported
      by John Hawkes (corrupt sched data structures and kernel oops).
      
      This patch removes the partial disabling code in 2.6.13-rc7,
      in anticipation of the next patch, which will totally disable
      it instead.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  5. 25 Aug 2005, 1 commit
    • [PATCH] cpu_exclusive sched domains build fix · 3725822f
      Paul Jackson committed
      As reported by Paul Mackerras <paulus@samba.org>, the previous patch
      "cpu_exclusive sched domains fix" broke the ppc64 build with
      CONFIG_CPUSETS, yielding error messages:
      
      kernel/cpuset.c: In function 'update_cpu_domains':
      kernel/cpuset.c:648: error: invalid lvalue in unary '&'
      kernel/cpuset.c:648: error: invalid lvalue in unary '&'
      
      On some arches, node_to_cpumask() is a function returning a cpumask_t
      by value, but for_each_cpu_mask() requires an lvalue mask.

      The following patch fixes this build failure by making a copy of the
      cpumask_t on the stack.
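
      A hedged sketch of the pattern described (the surrounding context is
      illustrative, not the exact hunk):

        /* node_to_cpumask() may be a function returning cpumask_t by value;
         * for_each_cpu_mask() needs an lvalue, so copy the mask first. */
        cpumask_t mask = node_to_cpumask(node);
        int cpu;

        for_each_cpu_mask(cpu, mask) {
            /* ... per-cpu work ... */
        }
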
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  6. 24 Aug 2005, 2 commits
    • [PATCH] cpu_exclusive sched domains on partial nodes temp fix · d10689b6
      Paul Jackson committed
      This keeps the kernel/cpuset.c routine update_cpu_domains() from
      invoking the sched.c routine partition_sched_domains() if the cpuset in
      question doesn't fall on node boundaries.
      
      I have boot tested this on an SN2, and with the help of a couple of ad
      hoc printk's, determined that it does indeed avoid calling the
      partition_sched_domains() routine on partial nodes.
      
      I did not directly verify that this avoids setting up bogus sched
      domains or avoids the oops that Hawkes saw.
      
      This patch imposes a silent, artificial constraint on which cpusets can
      be used to define dynamic sched domains.

      It should allow proceeding with this new feature in 2.6.13 for the
      configurations in which it is useful (node-aligned sched domains),
      while avoiding trying to set up sched domains in the less useful cases
      that can cause kernel corruption and oopses.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Dinakar Guniguntala <dino@in.ibm.com>
      Acked-by: John Hawkes <hawkes@sgi.com>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] preempt race in getppid · 4c5640cb
      David Meybohm committed
      With CONFIG_PREEMPT && !CONFIG_SMP, it's possible for sys_getppid to
      return a bogus value if the parent's task_struct gets reallocated after
      current->group_leader->real_parent is read:
      
              asmlinkage long sys_getppid(void)
              {
                      int pid;
                      struct task_struct *me = current;
                      struct task_struct *parent;
      
                      parent = me->group_leader->real_parent;
      RACE HERE =>    for (;;) {
                              pid = parent->tgid;
              #ifdef CONFIG_SMP
              {
                              struct task_struct *old = parent;
      
                              /*
                               * Make sure we read the pid before re-reading the
                               * parent pointer:
                               */
                              smp_rmb();
                              parent = me->group_leader->real_parent;
                              if (old != parent)
                                      continue;
              }
              #endif
                              break;
                      }
                      return pid;
              }
      
      If the process gets preempted at the indicated point, the parent process
      can go ahead and call exit() and then get wait()'d on to reap its
      task_struct. When the preempted process gets resumed, it will not do any
      further checks of the parent pointer on !CONFIG_SMP: it will read the
      bad pid and return.
      
      So, the same algorithm used when SMP is enabled should be used when
      preempt is enabled, which will recheck ->real_parent in this case.
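
      A hedged sketch of that fix, applied to the loop quoted above: compile
      the re-check whenever the race is possible, not only on SMP.

        #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
        {
            struct task_struct *old = parent;

            /* re-read and verify the parent pointer, exactly as the SMP
             * path already does, so a preempted reader retries */
            smp_rmb();
            parent = me->group_leader->real_parent;
            if (old != parent)
                continue;
        }
        #endif
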
      Signed-off-by: David Meybohm <dmeybohmlkml@bellsouth.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  7. 19 Aug 2005, 1 commit
  8. 18 Aug 2005, 1 commit
    • [PATCH] NPTL signal delivery deadlock fix · dd12f48d
      Bhavesh P. Davda committed
      This bug is quite subtle and only happens in a very interesting
      situation where a real-time threaded process is in the middle of a
      coredump when someone whacks it with a SIGKILL.  However, this deadlock
      leaves the system pretty hosed and you have to reboot to recover.
      
      Not good for real-time priority-preemption applications like our
      telephony application, with 90+ real-time (SCHED_FIFO and SCHED_RR)
      processes, many of them multi-threaded, interacting with each other for
      high volume call processing.
      Acked-by: Roland McGrath <roland@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  9. 11 Aug 2005, 1 commit
    • [PATCH] remove name length check in a workqueue · 60686744
      James Bottomley committed
      We have a check in there to make sure that the name won't overflow
      task_struct.comm[], but it's triggering for scsi with lots of HBAs, and
      scsi is only using single-threaded workqueues, which don't append the
      "/%d" anyway.

      All too hard.  Just kill the BUG_ON.
      
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>

      [ kthread_create() uses vsnprintf() and limits the thing, so no
        overflow can actually happen regardless ]
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  10. 10 Aug 2005, 1 commit
    • [PATCH] cpuset release ABBA deadlock fix · 3077a260
      Paul Jackson committed
      Fix possible cpuset_sem ABBA deadlock if 'notify_on_release' set.
      
      For a particular usage pattern, creating and destroying cpusets fairly
      frequently using notify_on_release, on a very large system, this deadlock
      can be seen every few days.  If you are not using the cpuset
      notify_on_release feature, you will never see this deadlock.
      
      The existing code, on task exit (or cpuset deletion) did:
      
        get cpuset_sem
        if cpuset marked notify_on_release and is ready to release:
          compute cpuset path relative to /dev/cpuset mount point
          call_usermodehelper() forks /sbin/cpuset_release_agent with path
        drop cpuset_sem
      
      Unfortunately, the fork in call_usermodehelper can allocate memory, and
      allocating memory can require cpuset_sem, if the mems_generation values
      changed in the interim.  This results in an ABBA deadlock, trying to obtain
      cpuset_sem when it is already held by the current task.
      
      To fix this, I put the cpuset path (which must be computed while holding
      cpuset_sem) in a temporary buffer, to be used in the call_usermodehelper
      call of /sbin/cpuset_release_agent only _after_ dropping cpuset_sem.
      
      So the new logic is:
      
        get cpuset_sem
        if cpuset marked notify_on_release and is ready to release:
          compute cpuset path relative to /dev/cpuset mount point
          stash path in kmalloc'd buffer
        drop cpuset_sem
        call_usermodehelper() forks /sbin/cpuset_release_agent with path
        free path
      
      The sharp-eyed reader might notice that this patch does not contain any
      calls to kmalloc.  The existing code in the check_for_release() routine was
      already kmalloc'ing a buffer to hold the cpuset path.  In the old code, it
      just held the buffer for a few lines, over the cpuset_release_agent() call
      that in turn invoked call_usermodehelper().  In the new code, with the
      application of this patch, it returns that buffer via the new char
      **ppathbuf parameter, for later use and freeing in cpuset_release_agent(),
      which is called after cpuset_sem is dropped.  Whereas the old code has just
      one call to cpuset_release_agent(), right in the check_for_release()
      routine, the new code has three calls to cpuset_release_agent(), from the
      various places that a cpuset can be released.
      
      This patch has been build and booted on SN2, and passed a stress test that
      previously hit the deadlock within a few seconds.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  11. 05 Aug 2005, 1 commit
  12. 04 Aug 2005, 1 commit
    • [PATCH] Remove suspend() calls from shutdown path · c36f19e0
      Benjamin Herrenschmidt committed
      This removes the calls to device_suspend() from the shutdown path that
      were added sometime during 2.6.13-rc*.  They aren't working properly on
      a number of configs (I got reports from both ppc powerbook users and
      x86 users), causing the system to no longer shut down.

      I think it isn't the right approach at the moment anyway.  We already
      have a shutdown() callback for the drivers that actually care about
      shutdown, and the suspend() code isn't yet in good enough shape to be
      generalized this much.  Also, the semantics of suspend and shutdown are
      slightly different on a number of setups, and the way this was patched
      in gives drivers little means to differentiate cleanly.  It should have
      been at least a different message.
      
      For 2.6.13, I think we should revert to 2.6.12 behaviour and have a
      working suspend back.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  13. 02 Aug 2005, 2 commits
    • [PATCH] Module per-cpu alignment cannot always be met · 842bbaaa
      Rusty Russell committed
      The module code assumes no one will ever ask for a per-cpu area more
      than SMP_CACHE_BYTES aligned.  However, as these cases show, gcc
      sometimes asks for 32-byte alignment for the per-cpu section of a
      module, and if CONFIG_X86_L1_CACHE_SHIFT is 4, we hit that BUG_ON().
      This is obviously an unusual combination, as there have been few
      reports, but better to warn than die.
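
      A hedged sketch of the resulting check in the module per-cpu allocator
      (the message text is illustrative):

        if (align > SMP_CACHE_BYTES) {
            printk(KERN_WARNING "%s: per-cpu alignment %lu > %d\n",
                   name, align, SMP_CACHE_BYTES);
            align = SMP_CACHE_BYTES;  /* clamp and warn instead of BUG_ON() */
        }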
      
      See:
      	http://www.ussg.iu.edu/hypermail/linux/kernel/0409.0/0768.html
      
      And more recently:
      	http://bugs.gentoo.org/show_bug.cgi?id=97006
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] remove sys_set_zone_reclaim() · 6cb54819
      Ingo Molnar committed
      This removes sys_set_zone_reclaim() for now.  While I'm sure Martin is
      trying to solve a real problem, we must not hard-code an incomplete and
      insufficient approach into a syscall, because syscalls are pretty much
      for eternity.  I am quite strongly convinced that this syscall must not
      hit v2.6.13 in its current form.
      
      Firstly, the syscall lacks basic syscall design: e.g. it allows the
      global setting of VM policy for unprivileged users. (!) [ Imagine an
      Oracle installation and a SAP installation on the same NUMA box fighting
      over the 'optimal' setting for this flag. What will they do? Will they
      try to set the flag to their own preferred value every second or so? ]
      
      Secondly, it was added based on a single datapoint from Martin:
      
       http://marc.theaimsgroup.com/?l=linux-mm&m=111763597218177&w=2
      
      where Martin characterizes the numbers the following way:
      
       ' Run-to-run variability for "make -j" is huge, so these numbers aren't
         terribly useful except to see that with reclaim the benchmark still
         finishes in a reasonable amount of time. '
      
      in other words: the fundamental problem has likely not been solved;
      only a tendential move in the right direction has been observed, and a
      handful of numbers were picked out of a set of hugely variable results,
      without showing the variability data.  How much variance is there
      run-to-run?
      
      I'd really suggest first walking the walk and seeing what's needed to
      get stable & predictable kernel compilation numbers on that NUMA box,
      before adding random syscalls to tune a particular aspect of the VM ...
      an approach which might not even matter once the whole picture has been
      analyzed and understood!
      
      The third, most important point is that the syscall exposes VM-tuning
      internals in a completely unstructured way.  What sense does it make to
      have a _GLOBAL_ per-node setting for 'should we go to another node for
      reclaim'?  If anything, it might make sense to do this per-app, via
      numalib or so.
      
      The change is minimalistic in that it doesn't remove the syscall and
      the underlying infrastructure changes, only the user-visible ones.  We
      could perhaps add a CAP_SYS_ADMIN-only sysctl for this hack, a la
      /proc/sys/vm/swappiness, but even that looks quite counterproductive
      when the generic approach is that we are trying to reduce the number of
      external factors in the VM balance picture.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  14. 31 Jul 2005, 1 commit
  15. 30 Jul 2005, 1 commit
  16. 29 Jul 2005, 2 commits
  17. 28 Jul 2005, 5 commits