1. 11 Jul, 2008 1 commit
    • nohz: don't stop idle tick if softirqs are pending. · 857f3fd7
      Heiko Carstens committed
      If a cpu goes idle while softirqs are pending, only an error message is
      printed to the console. It may take a very long time until the pending
      softirqs are finally executed. The worst case would be a hanging system.
      
      With this patch the timer tick just continues and the softirqs will be
      executed after the next interrupt. Still a delay but better than a
      hanging system.
      
      Currently we have at least two device drivers on s390 which under certain
      circumstances schedule a tasklet from process context. This is a reason
      why we can end up with pending softirqs when going idle. Fixing these
      drivers seems to be non-trivial.
      However there is no question that the drivers should be fixed.
      This patch shouldn't be considered a bug fix. It is just intended to
      keep a system running even if device drivers are buggy.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jan Glauber <jan.glauber@de.ibm.com>
      Cc: Stefan Weinhuber <wein@de.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  2. 30 May, 2008 1 commit
  3. 29 May, 2008 7 commits
  4. 28 May, 2008 1 commit
  5. 27 May, 2008 2 commits
  6. 25 May, 2008 3 commits
  7. 23 May, 2008 3 commits
    • stop_machine: make stop_machine_run more virtualization friendly · 3401a61e
      Christian Borntraeger committed
      On kvm I have seen some rare hangs in stop_machine when I used more guest
      cpus than host cpus, e.g. 32 guest cpus on 1 host cpu triggered the
      hang quite often. I could also reproduce the problem on a 4 way z/VM host with
      a 64 way guest.
      
      It turned out that the guest was consuming all available cpus mostly for
      spinning on scheduler locks like rq->lock. This is expected as the threads are
      calling yield all the time.
      The problem is that the host scheduling decisions, together with the guest
      scheduling decisions and spinlocks not being fair, managed to create an
      interesting scenario similar to a live lock. (Sometimes the hang resolved
      itself after some minutes.)
      
      Changing stop_machine to yield the cpu to the hypervisor when yielding inside
      the guest fixed the problem for me. While I am not completely happy with this
      patch, I think it causes no harm and it really improves the situation for me.
      
      I used cpu_relax for yielding to the hypervisor, does that work on all
      architectures?
      
      p.s.: If you want to reproduce the problem, cpu hotplug and kprobes use
      stop_machine_run and both triggered the problem after some retries.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      CC: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
    • modules: proper cleanup of kobject without CONFIG_SYSFS · 34e4e2fe
      Denis V. Lunev committed
      kobject: '<NULL>' (ffffffffa0104050): is not initialized, yet kobject_put() is being called.
      ------------[ cut here ]------------
      WARNING: at /home/den/src/linux-netns26/lib/kobject.c:583 kobject_put+0x53/0x55()
      Modules linked in: ipv6 nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs ide_cd_mod cdrom button [last unloaded: pktgen]
      comm: rmmod Tainted: G        W 2.6.26-rc3 #585
      Call Trace:
        [<ffffffff802359ab>] warn_on_slowpath+0x58/0x7a
        [<ffffffff80236aca>] ? printk+0x67/0x69
        [<ffffffff80236aca>] ? printk+0x67/0x69
        [<ffffffff80324289>] kobject_put+0x53/0x55
        [<ffffffff8025e2ee>] free_module+0x87/0xfa
        [<ffffffff8025fee5>] sys_delete_module+0x178/0x1e1
        [<ffffffff804b1e70>] ? lockdep_sys_exit_thunk+0x35/0x67
        [<ffffffff804b1dff>] ? trace_hardirqs_on_thunk+0x35/0x3a
        [<ffffffff8020c0bb>] system_call_after_swapgs+0x7b/0x80
      ---[ end trace 8f5aafa7f6406cf8 ]---
      
      mod->mkobj.kobj is not initialized without CONFIG_SYSFS. Do not call
      kobject_put in this case.
      Signed-off-by: Denis V. Lunev <den@openvz.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
  8. 17 May, 2008 4 commits
  9. 15 May, 2008 2 commits
  10. 12 May, 2008 1 commit
    • Add new 'cond_resched_bkl()' helper function · c3921ab7
      Linus Torvalds committed
      It acts exactly like a regular 'cond_resched()', but will not get
      optimized away when CONFIG_PREEMPT is set.
      
      Normal kernel code is already preemptable in the presence of
      CONFIG_PREEMPT, so cond_resched() is optimized away (see commit
      02b67cc3 "sched: do not do
      cond_resched() when CONFIG_PREEMPT").
      
      But when wanting to conditionally reschedule while holding a lock, you
      need to use "cond_resched_lock(lock)", and the new function is the BKL
      equivalent of that.
      
      Also make fs/locks.c use it.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
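The difference between the two helpers described in the commit above can be sketched in user-space C. This is a toy model under stated assumptions, not the kernel implementation: every `toy_` name and the `TOY_CONFIG_PREEMPT` macro are hypothetical stand-ins, and "rescheduling" is reduced to bumping a counter.

```c
#include <stdbool.h>

/* Hypothetical stand-in for CONFIG_PREEMPT; defaults to "preemption enabled". */
#ifndef TOY_CONFIG_PREEMPT
#define TOY_CONFIG_PREEMPT 1
#endif

static bool toy_need_resched;   /* models a pending reschedule request */
static int toy_schedule_calls;  /* counts how often we "rescheduled" */

/* Reschedule only if a request is pending; returns 1 if we did. */
static int toy_resched_if_needed(void)
{
    if (!toy_need_resched)
        return 0;
    toy_need_resched = false;
    toy_schedule_calls++;
    return 1;
}

/* Regular helper: compiled down to a no-op when preemption is enabled. */
static int toy_cond_resched(void)
{
#if TOY_CONFIG_PREEMPT
    return 0;
#else
    return toy_resched_if_needed();
#endif
}

/* BKL variant: never optimized away, regardless of the preemption setting. */
static int toy_cond_resched_bkl(void)
{
    return toy_resched_if_needed();
}
```

With the default setting, `toy_cond_resched()` ignores a pending request while `toy_cond_resched_bkl()` still acts on it, which is the property the commit message asks for when a lock is held.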
  11. 11 May, 2008 2 commits
    • BKL: revert back to the old spinlock implementation · 8e3e076c
      Linus Torvalds committed
      The generic semaphore rewrite had a huge performance regression on AIM7
      (and potentially other BKL-heavy benchmarks) because the generic
      semaphores had been rewritten to be simple to understand and fair.  The
      latter, in particular, turns a semaphore-based BKL implementation into a
      mess of scheduling.
      
      The attempt to fix the performance regression failed miserably (see the
      previous commit 00b41ec2 'Revert
      "semaphore: fix"'), and so for now the simple and sane approach is to
      instead just go back to the old spinlock-based BKL implementation that
      never had any issues like this.
      
      This patch also has the advantage of being reported to fix the
      regression completely according to Yanmin Zhang, unlike the semaphore
      hack which still left a couple percentage point regression.
      
      As a spinlock, the BKL obviously has the potential to be a latency
      issue, but it's not really any different from any other spinlock in that
      respect.  We do want to get rid of the BKL asap, but that has been the
      plan for several years.
      
      These days, the biggest users are in the tty layer (open/release in
      particular) and Alan holds out some hope:
      
        "tty release is probably a few months away from getting cured - I'm
         afraid it will almost certainly be the very last user of the BKL in
         tty to get fixed as it depends on everything else being sanely locked."
      
      so while we're not there yet, we do have a plan of action.
      Tested-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Alexander Viro <viro@ftp.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "semaphore: fix" · 00b41ec2
      Linus Torvalds committed
      This reverts commit bf726eab, as it has
      been reported to cause a regression with processes stuck in __down(),
      apparently because of some missing wakeup.
      
      Quoth Sven Wegener:
       "I'm currently investigating a regression that has showed up with my
        last git pull yesterday.  Bisecting the commits showed bf726e
        "semaphore: fix" to be the culprit, reverting it fixed the issue.
      
        Symptoms: During heavy filesystem usage (e.g.  a kernel compile) I get
        several compiler processes in uninterruptible sleep, blocking all i/o
        on the filesystem.  System is an Intel Core 2 Quad running a 64bit
        kernel and userspace.  Filesystem is xfs on top of lvm.  See below for
        the output of sysrq-w."
      
      See
      
      	http://lkml.org/lkml/2008/5/10/45
      
      for full report.
      
      In the meantime, we can just fix the BKL performance regression by
      reverting back to the good old BKL spinlock implementation instead,
      since any sleeping lock will generally perform badly, especially if it
      tries to be fair.
      Reported-by: Sven Wegener <sven.wegener@stealer.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 09 May, 2008 3 commits
  13. 08 May, 2008 3 commits
    • sched: fix weight calculations · 46151122
      Mike Galbraith committed
      The conversion between virtual and real time is as follows:
      
        dvt = rw/w * dt <=> dt = w/rw * dvt
      
      Since we want the fair sleeper granularity to be in real time, we actually
      need to do:
      
        dvt = - rw/w * l
      
      This bug could be related to the regression reported by Yanmin Zhang:
      
      | Comparing with kernel 2.6.25, sysbench+mysql(oltp, readonly) has lots
      | of regressions with 2.6.26-rc1:
      |
      | 1) 8-core stoakley: 28%;
      | 2) 16-core tigerton: 20%;
      | 3) Itanium Montvale: 50%.
      Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
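The conversions quoted in the commit above can be written out as a small, self-contained C sketch. The commit message does not spell out its symbols, so the code simply treats `rw` and `w` as two positive weights, `dt` and `l` as real-time spans and `dvt` as a virtual-time span; the `toy_` function names are hypothetical and integer division truncates.

```c
#include <stdint.h>

/* dvt = rw/w * dt: scale a real-time span into virtual time. */
static uint64_t toy_real_to_virtual(uint64_t dt, uint64_t rw, uint64_t w)
{
    return dt * rw / w;
}

/* dt = w/rw * dvt: the inverse conversion, virtual time back to real time. */
static uint64_t toy_virtual_to_real(uint64_t dvt, uint64_t rw, uint64_t w)
{
    return dvt * w / rw;
}

/* dvt = -rw/w * l: a real-time granularity l expressed as a negative
 * virtual-time offset, as in the corrected formula. */
static int64_t toy_sleeper_offset(uint64_t l, uint64_t rw, uint64_t w)
{
    return -(int64_t)toy_real_to_virtual(l, rw, w);
}
```

For example, with `rw = 1024` and `w = 2048`, a real-time span of 1000 maps to 500 units of virtual time, and converting 500 back yields 1000 again.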
    • semaphore: fix · bf726eab
      Ingo Molnar committed
      Yanmin Zhang reported:
      
      | Comparing with kernel 2.6.25, AIM7 (use tmpfs) has more th
      | regression under 2.6.26-rc1 on my 8-core stoakley, 16-core tigerton,
      | and Itanium Montecito. Bisect located the patch below:
      |
      | 64ac24e7 is first bad commit
      | commit 64ac24e7
      | Author: Matthew Wilcox <matthew@wil.cx>
      | Date:   Fri Mar 7 21:55:58 2008 -0500
      |
      |     Generic semaphore implementation
      |
      | After I manually reverted the patch against 2.6.26-rc1 while fixing
      | lots of conflicts/errors, aim7 regression became less than 2%.
      
      I reproduced the AIM7 workload and can confirm Yanmin's findings that
      2.6.26-rc1 regresses over 2.6.25, by over 67% here.
      
      Looking at the workload i found and fixed what i believe to be the real
      bug causing the AIM7 regression: it was inefficient wakeup / scheduling
      / locking behavior of the new generic semaphore code, causing suboptimal
      performance.
      
      The problem comes from the following code. The new semaphore code does
      this on down():
      
              spin_lock_irqsave(&sem->lock, flags);
              if (likely(sem->count > 0))
                      sem->count--;
              else
                      __down(sem);
              spin_unlock_irqrestore(&sem->lock, flags);
      
      and this on up():
      
              spin_lock_irqsave(&sem->lock, flags);
              if (likely(list_empty(&sem->wait_list)))
                      sem->count++;
              else
                      __up(sem);
              spin_unlock_irqrestore(&sem->lock, flags);
      
      where __up() does:
      
              list_del(&waiter->list);
              waiter->up = 1;
              wake_up_process(waiter->task);
      
      and where __down() does this in essence:
      
              list_add_tail(&waiter.list, &sem->wait_list);
              waiter.task = task;
              waiter.up = 0;
              for (;;) {
                      [...]
                      spin_unlock_irq(&sem->lock);
                      timeout = schedule_timeout(timeout);
                      spin_lock_irq(&sem->lock);
                      if (waiter.up)
                              return 0;
              }
      
      the fastpath looks good and obvious, but note the following property of
      the contended path: if there's a task on the ->wait_list, the up() of
      the current owner will "pass over" ownership to that waiting task, in a
      wake-one manner, via the waiter->up flag and by removing the waiter from
      the wait list.
      
      That is all fine in principle, but as implemented in
      kernel/semaphore.c it also creates a nasty, hidden source of contention!
      
      The contention comes from the following property of the new semaphore
      code: the new owner owns the semaphore exclusively, even if it is not
      running yet.
      
      So if the old owner, even if just a few instructions later, does a
      down() [lock_kernel()] again, it will be blocked and will have to wait
      on the new owner to eventually be scheduled (possibly on another CPU)!
      Or if another task gets to lock_kernel() sooner than the "new owner" is
      scheduled, it will be blocked unnecessarily and for a very long time
      when there are 2000 tasks running.
      
      I.e. the implementation of the new semaphores code does wake-one and
      lock ownership in a very restrictive way - it does not allow
      opportunistic re-locking of the lock at all and keeps the scheduler from
      picking task order intelligently.
      
      This kind of scheduling, with 2000 AIM7 processes running, creates awful
      cross-scheduling between those 2000 tasks, causes reduced parallelism, a
      throttled runqueue length and a lot of idle time. With increasing number
      of CPUs it causes an exponentially worse behavior in AIM7, as the chance
      for a newly woken new-owner task to actually run anytime soon is less
      and less likely.
      
      Note that it takes just a tiny bit of contention for the 'new-semaphore
      catastrophe' to happen: the wakeup latencies get added to whatever small
      contention there is, and quickly snowball out of control!
      
      I believe Yanmin's findings and numbers support this analysis too.
      
      The best fix for this problem is to use the same scheduling logic that
      the kernel/mutex.c code uses: keep the wake-one behavior (that is OK and
      wanted because we do not want to over-schedule), but also allow
      opportunistic locking of the lock even if a wakee is already "in
      flight".
      
      The patch below implements this new logic. With this patch applied the
      AIM7 regression is largely fixed on my quad testbox:
      
        # v2.6.25 vanilla:
        ..................
        Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
        2000    56096.4         91      207.5   789.7   0.4675
        2000    55894.4         94      208.2   792.7   0.4658
      
        # v2.6.26-rc1-166-gc0a18111 vanilla:
        ...................................
        Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
        2000    33230.6         83      350.3   784.5   0.2769
        2000    31778.1         86      366.3   783.6   0.2648
      
        # v2.6.26-rc1-166-gc0a18111 + semaphore-speedup:
        ...............................................
        Tasks   Jobs/Min        JTI     Real    CPU     Jobs/sec/task
        2000    55707.1         92      209.0   795.6   0.4642
        2000    55704.4         96      209.0   796.0   0.4642
      
      i.e. a 67% speedup. We are now back to within 1% of the v2.6.25
      performance levels and have zero idle time during the test, as expected.
      
      Btw., interactivity also improved dramatically with the fix - for
      example console-switching became almost instantaneous during this
      workload (which after all is running 2000 tasks at once!), without the
      patch it was stuck for a minute at times.
      
      There's another nice side-effect of this speedup patch, the new generic
      semaphore code got even smaller:
      
         text    data     bss     dec     hex filename
         1241       0       0    1241     4d9 semaphore.o.before
         1207       0       0    1207     4b7 semaphore.o.after
      
      (because the waiter.up complication got removed.)
      
      Longer-term we should look into using the mutex code for the generic
      semaphore code as well - but it's not easy due to legacies and it's
      outside of the scope of v2.6.26 and outside the scope of this patch as
      well.
      Bisected-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
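The two release policies contrasted in the commit above can be compared with a toy, single-threaded state model in C. It tracks only the bookkeeping (free count, sleepers, woken-but-not-yet-running tasks) and does no real blocking or scheduling; the struct, its fields and all `toy_` functions are hypothetical names for this sketch, not the code in kernel/semaphore.c.

```c
#include <stdbool.h>

/* Toy bookkeeping for one semaphore; not kernel code. */
struct toy_sem {
    int count;      /* free slots */
    int waiters;    /* tasks sleeping on the wait list */
    int in_flight;  /* tasks woken up but not yet running */
};

/* Fast path shared by both policies: take a slot if one is free. */
static bool toy_try_down(struct toy_sem *s)
{
    if (s->count > 0) {
        s->count--;
        return true;
    }
    return false;
}

/* A task that failed the fast path goes to sleep on the wait list. */
static void toy_block(struct toy_sem *s)
{
    s->waiters++;
}

/* Hand-off policy: the slot passes directly to the woken waiter, so the
 * count stays at zero even though the new owner is not running yet. */
static void toy_up_handoff(struct toy_sem *s)
{
    if (s->waiters == 0) {
        s->count++;
        return;
    }
    s->waiters--;
    s->in_flight++;
}

/* Opportunistic policy: always free the slot and wake one waiter; whoever
 * reaches toy_try_down() first gets it, and the woken task must retry. */
static void toy_up_opportunistic(struct toy_sem *s)
{
    s->count++;
    if (s->waiters > 0) {
        s->waiters--;
        s->in_flight++;
    }
}
```

In this model, a task that releases under the hand-off policy and immediately calls `toy_try_down()` again fails while the wakee is still in flight, whereas under the opportunistic policy the same call succeeds. That is the re-locking behavior the commit message argues for.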
    • Revert "relay: fix splice problem" · 75065ff6
      Jens Axboe committed
      This reverts commit c3270e57.
  14. 06 May, 2008 7 commits