1. 23 Feb, 2009 (1 commit)
  2. 22 Feb, 2009 (5 commits)
  3. 19 Feb, 2009 (5 commits)
  4. 18 Feb, 2009 (2 commits)
    • block: fix bad definition of BIO_RW_SYNC · 93dbb393
      Committed by Jens Axboe
      We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO
      and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before
      213d9417.
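
      The issue, as a hedged illustration (the numeric bit positions here
      are assumed for the example, not the real header values):

       /* bio flag *positions*; call sites shift them into a mask */
       enum { BIO_RW_SYNCIO = 4, BIO_RW_UNPLUG = 5 };

       /* Broken: ORing two positions yields 4 | 5 == 5, i.e. a single
        * position, not two bits -- the SYNCIO part is silently lost:
        *   #define BIO_RW_SYNC (BIO_RW_SYNCIO | BIO_RW_UNPLUG)
        */

       /* Callers must therefore set both bits explicitly: */
       unsigned long rw = (1UL << BIO_RW_SYNCIO) | (1UL << BIO_RW_UNPLUG);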
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • tracing/function-graph-tracer: trace the idle tasks · 5b058bcd
      Committed by Frederic Weisbecker
      When the function graph tracer is activated, it iterates over the task_list
      to allocate a stack to store the return addresses.
      
      But the per-cpu idle tasks are not covered by the
      do_each_thread / while_each_thread iteration.
      
      So we have to iterate over them manually; a minimal sketch follows.
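
      A sketch of the fix, assuming the tracer's per-task stack setup is
      available as ftrace_graph_init_task() (as in kernel/trace/ftrace.c
      of the time):

       /* the per-cpu idle (swapper) tasks are not on the list walked by
        * do_each_thread(), so give each one its return stack explicitly */
       for_each_online_cpu(cpu)
               ftrace_graph_init_task(idle_task(cpu));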
      
      This fixes some weirdness in the traces and many lost traces.
      Examples on two cpus:
      
       0)   Xorg-4287    |   2.906 us    |              }
       0)   Xorg-4287    |   3.965 us    |            }
       0)   Xorg-4287    |   5.302 us    |          }
       ------------------------------------------
       0)   Xorg-4287    =>    <idle>-0
       ------------------------------------------
      
       0)    <idle>-0    |   2.861 us    |                        }
       0)    <idle>-0    |   0.526 us    |                        set_normalized_timespec();
       0)    <idle>-0    |   7.201 us    |                      }
       0)    <idle>-0    |   8.214 us    |                    }
       0)    <idle>-0    |               |                    clockevents_program_event() {
       0)    <idle>-0    |               |                      lapic_next_event() {
       0)    <idle>-0    |   0.510 us    |                        native_apic_mem_write();
       0)    <idle>-0    |   1.546 us    |                      }
       0)    <idle>-0    |   2.583 us    |                    }
       0)    <idle>-0    | + 12.435 us   |                  }
       0)    <idle>-0    | + 13.470 us   |                }
       0)    <idle>-0    |   0.608 us    |                _spin_unlock_irqrestore();
       0)    <idle>-0    | + 23.270 us   |              }
       0)    <idle>-0    | + 24.336 us   |            }
       0)    <idle>-0    | + 25.417 us   |          }
       0)    <idle>-0    |   0.593 us    |          _spin_unlock();
       0)    <idle>-0    | + 41.869 us   |        }
       0)    <idle>-0    | + 42.906 us   |      }
       0)    <idle>-0    | + 95.035 us   |    }
       0)    <idle>-0    |   0.540 us    |    menu_reflect();
       0)    <idle>-0    | ! 100.404 us  |  }
       0)    <idle>-0    |   0.564 us    |  mce_idle_callback();
       0)    <idle>-0    |               |  enter_idle() {
       0)    <idle>-0    |   0.526 us    |    mce_idle_callback();
       0)    <idle>-0    |   1.757 us    |  }
       0)    <idle>-0    |               |  cpuidle_idle_call() {
       0)    <idle>-0    |               |    menu_select() {
       0)    <idle>-0    |   0.525 us    |      pm_qos_requirement();
       0)    <idle>-0    |   0.518 us    |      tick_nohz_get_sleep_length();
       0)    <idle>-0    |   2.621 us    |    }
      [...]
       1)    <idle>-0    |   0.518 us    |              touch_softlockup_watchdog();
       1)    <idle>-0    | + 14.355 us   |            }
       1)    <idle>-0    | + 22.840 us   |          }
       1)    <idle>-0    | + 25.949 us   |        }
       1)    <idle>-0    |               |        handle_irq() {
       1)    <idle>-0    |   0.511 us    |          irq_to_desc();
       1)    <idle>-0    |               |          handle_edge_irq() {
       1)    <idle>-0    |   0.638 us    |            _spin_lock();
       1)    <idle>-0    |               |            ack_apic_edge() {
       1)    <idle>-0    |   0.510 us    |              irq_to_desc();
       1)    <idle>-0    |               |              move_native_irq() {
       1)    <idle>-0    |   0.510 us    |                irq_to_desc();
       1)    <idle>-0    |   1.532 us    |              }
       1)    <idle>-0    |   0.511 us    |              native_apic_mem_write();
       ------------------------------------------
       1)    <idle>-0    =>    cat-5073
       ------------------------------------------
      
       1)    cat-5073    |   3.731 us    |                    }
       1)    cat-5073    |               |                    run_local_timers() {
       1)    cat-5073    |   0.533 us    |                      hrtimer_run_queues();
       1)    cat-5073    |               |                      raise_softirq() {
       1)    cat-5073    |               |                        __raise_softirq_irqoff() {
       1)    cat-5073    |               |                          /* nr: 1 */
       1)    cat-5073    |   2.718 us    |                        }
       1)    cat-5073    |   3.814 us    |                      }
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  5. 16 Feb, 2009 (2 commits)
  6. 14 Feb, 2009 (1 commit)
  7. 13 Feb, 2009 (1 commit)
    • timers: more consistently use clock vs timer · 3997ad31
      Committed by Peter Zijlstra
      While reviewing the manpages, I noticed I'd missed some clock vs timer sites.
      
      Make sure that all timer functions call cpu_timer_sample_group() and not
      cpu_clock_sample_group(). This ensures that we enable the process-wide
      timer in time, and therefore pay the O(n) thread group cost from the
      syscall.
      
      Not doing it here would leave it to the first jiffy tick after setting
      the timer, resulting in a very expensive tick (but only once) and a
      delay in actually starting the timer.
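
      The distinction in code, sketched from kernel/posix-cpu-timers.c of
      the time (call sites elided and simplified):

       /* timer paths (e.g. timer_settime()): arms the process-wide
        * accounting up front, paying the O(n) group walk in the syscall */
       cpu_timer_sample_group(timer->it_clock, p, &val);

       /* clock-only paths (e.g. clock_gettime()): a plain sample, which
        * must not be used when arming a timer */
       cpu_clock_sample_group(which_clock, p, &cpu);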
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  8. 12 Feb, 2009 (4 commits)
  9. 11 Feb, 2009 (4 commits)
  10. 10 Feb, 2009 (1 commit)
  11. 09 Feb, 2009 (5 commits)
  12. 07 Feb, 2009 (1 commit)
  13. 06 Feb, 2009 (3 commits)
    • wait: prevent exclusive waiter starvation · 777c6c5f
      Committed by Johannes Weiner
      With exclusive waiters, every process woken up through the wait queue must
      ensure that the next waiter down the line is woken when it has finished.
      
      Interruptible waiters don't do that when aborting due to a signal.  And if
      an aborting waiter is concurrently woken up through the waitqueue, no one
      will ever wake up the next waiter.
      
      This has been observed with __wait_on_bit_lock() used by
      lock_page_killable(): the first contender on the queue was aborting when
      the actual lock holder woke it up concurrently.  The aborted contender
      didn't acquire the lock and therefore never did an unlock followed by
      waking up the next waiter.
      
      Add abort_exclusive_wait() which removes the process' wait descriptor from
      the waitqueue, iff still queued, or wakes up the next waiter otherwise.
      It does so under the waitqueue lock.  Racing with a wake up means the
      aborting process is either already woken (removed from the queue) and will
      wake up the next waiter, or it will remove itself from the queue and the
      concurrent wake up will apply to the next waiter after it.
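
      A sketch of the helper, close to the kernel/wait.c code this describes
      (lightly abridged):

       void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
                                 unsigned int mode, void *key)
       {
               unsigned long flags;

               __set_current_state(TASK_RUNNING);
               spin_lock_irqsave(&q->lock, flags);
               if (!list_empty(&wait->task_list))
                       /* still queued: no wake up raced us, just dequeue */
                       list_del_init(&wait->task_list);
               else if (waitqueue_active(q))
                       /* already woken concurrently: pass the wake up on */
                       __wake_up_common(q, mode, 1, 0, key);
               spin_unlock_irqrestore(&q->lock, flags);
       }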
      
      Use abort_exclusive_wait() in __wait_event_interruptible_exclusive() and
      __wait_on_bit_lock() when they were interrupted by means other than a wake
      up through the queue.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Reported-by: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Mentored-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Chuck Lever <cel@citi.umich.edu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>		["after some testing"]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • revert "rlimit: permit setting RLIMIT_NOFILE to RLIM_INFINITY" · 60fd760f
      Committed by Andrew Morton
      Revert commit 0c2d64fb because it causes
      (arguably poorly designed) existing userspace to spend interminable
      periods closing billions of not-open file descriptors.
      
      We could bring this back, with some sort of opt-in tunable in /proc, which
      defaults to "off".
      
      Peter's analysis follows:
      
      : I spent several hours trying to get to the bottom of a serious
      : performance issue that appeared on one of our servers after upgrading to
      : 2.6.28.  In the end it's what could be considered a userspace bug that
      : was triggered by a change in 2.6.28.  Since this might also affect other
      : people I figured I'd at least document what I found here, and maybe we
      : can even do something about it:
      :
      :
      : So, I upgraded some of debian.org's machines to 2.6.28.1 and immediately
      : the team maintaining our ftp archive complained that one of their
      : scripts that previously ran in a few minutes still hadn't even come
      : close to being done after an hour or so.  Downgrading to 2.6.27 fixed
      : that.
      :
      : Turns out that script is forking a lot and something in it or python or
      : wherever closes all the file descriptors it doesn't want to pass on.
      : That is, it starts at zero and goes up to ulimit -n/RLIMIT_NOFILE and
      : closes them all with a few exceptions.
      :
      : Turns out that takes a long time when your limit -n is now 2^20 (1048576).
      :
      : With 2.6.27.* the ulimit -n was the standard 1024, but with 2.6.28 it is
      : now a thousand times that.
      :
      : 2.6.28 included a patch titled "rlimit: permit setting RLIMIT_NOFILE to
      : RLIM_INFINITY" (0c2d64fb)[1] that
      : allows, as the title implies, to set the limit for number of files to
      : infinity.
      :
      : Closer investigation showed that the broken default ulimit did not apply
      : to "system" processes (like stuff started from init).  In the end I
      : could establish that all processes that passed through pam_limit at one
      : point had the bad resource limit.
      :
      : Apparently the pam library in Debian etch (4.0) initializes the limits
      : to some default values when it doesn't have any settings in limit.conf
      : to override them.  Turns out that for nofiles this is RLIM_INFINITY.
      : Commenting out "case RLIMIT_NOFILE" in pam_limit.c:267 of our pam
      : package version 0.79-5 fixes that - tho I'm not sure what side effects
      : that has.
      :
      : Debian lenny (the upcoming 5.0 version) doesn't have this issue as it
      : uses a different pam (version).
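
      The userspace pattern at issue, as an illustrative sketch (this is not
      the Debian script itself, just the common close-everything idiom):

       #include <sys/resource.h>
       #include <unistd.h>

       /* Close every fd we might have inherited before exec'ing a child.
        * With RLIMIT_NOFILE at 1024 this is cheap; at 2^20 it becomes a
        * million close() calls that almost all fail with EBADF. */
       static void close_inherited_fds(void)
       {
               struct rlimit rl;
               long fd;

               if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
                       return;
               for (fd = 3; fd < (long)rl.rlim_cur; fd++)
                       close(fd);
       }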
      Reported-by: Peter Palfrader <weasel@debian.org>
      Cc: Adam Tkac <vonsch@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: <stable@kernel.org>		[2.6.28.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/async.c: fix printk warnings · 58763a29
      Committed by Andrew Morton
      alpha:
      
      kernel/async.c: In function 'run_one_entry':
      kernel/async.c:141: warning: format '%lli' expects type 'long long int', but argument 2 has type 'async_cookie_t'
      kernel/async.c:149: warning: format '%lli' expects type 'long long int', but argument 2 has type 'async_cookie_t'
      kernel/async.c:149: warning: format '%lld' expects type 'long long int', but argument 4 has type 's64'
      kernel/async.c: In function 'async_synchronize_cookie_special':
      kernel/async.c:250: warning: format '%lli' expects type 'long long int', but argument 3 has type 's64'
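
      The usual fix for such warnings, sketched: cast the typedef'ed value
      so the argument always matches the format on every architecture (the
      entry fields shown are illustrative of async.c's style):

       /* async_cookie_t and s64 vary per-arch; make the type explicit */
       printk("calling  %lli_%pF @ %i\n",
              (long long)entry->cookie, entry->func, task_pid_nr(current));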
      
      Cc: Arjan van de Ven <arjan@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 05 Feb, 2009 (3 commits)
    • timers: split process wide cpu clocks/timers · 4cd4c1b4
      Committed by Peter Zijlstra
      Change the process wide cpu timers/clocks so that we:
      
       1) don't mess up the kernel with too many threads,
       2) don't have a per-cpu allocation for each process,
       3) have no impact when not used.
      
      In order to accomplish this we're going to split it into two parts:
      
       - clocks; which can take all the time they want since they run
                 from user context -- ie. sys_clock_gettime(CLOCK_PROCESS_CPUTIME_ID)
      
       - timers; which need constant time sampling but since they're
                 explicitly used, the user can pay the overhead.
      
      The clock readout will go back to a full sum of the thread group, while the
      timers will run off a global 'clock' that only runs when needed, so only
      programs that make use of the facility pay the price.
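
      The clock side, as a hedged sketch of the full-sum readout (field
      names follow the kernel's task_cputime/signal_struct of the time;
      locking and RCU elided):

       /* user-context readout: an O(n) walk of the thread group is fine */
       void thread_group_cputime(struct task_struct *tsk,
                                 struct task_cputime *times)
       {
               struct signal_struct *sig = tsk->signal;
               struct task_struct *t = tsk;

               /* start from the accumulated dead-task totals ... */
               times->utime = sig->utime;
               times->stime = sig->stime;
               times->sum_exec_runtime = sig->sum_sched_runtime;

               /* ... and add every live thread on top */
               do {
                       times->utime = cputime_add(times->utime, t->utime);
                       times->stime = cputime_add(times->stime, t->stime);
                       times->sum_exec_runtime += t->se.sum_exec_runtime;
               } while_each_thread(tsk, t);
       }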
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • signal: re-add dead task accumulation stats. · 32bd671d
      Committed by Peter Zijlstra
      We're going to split the process wide cpu accounting into two parts:
      
       - clocks; which can take all the time they want since they run
                 from user context.
      
       - timers; which need constant time tracing but can afford the overhead
                 because they're off by default -- and rare.
      
      The clock readout will go back to a full sum of the thread group, for this
      we need to re-add the exit stats that were removed in the initial itimer
      rework (f06febc9: timers: fix itimer/many thread hang).
      
      Furthermore, since that full sum can be rather slow for large thread groups
      and we have the complete dead task stats, revert the do_notify_parent time
      computation.
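
      What the re-added exit accounting looks like, as a hedged sketch of
      the __exit_signal()-style accumulation (the helper name here is
      illustrative):

       /* fold a dying thread's times into the signal_struct totals so the
        * thread-group clock stays a correct full sum afterwards */
       static void account_dead_task(struct signal_struct *sig,
                                     struct task_struct *tsk)
       {
               sig->utime = cputime_add(sig->utime, task_utime(tsk));
               sig->stime = cputime_add(sig->stime, task_stime(tsk));
               sig->sum_sched_runtime += tsk->se.sum_exec_runtime;
       }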
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: fix nohz load balancer on cpu offline · 483b4ee6
      Committed by Suresh Siddha
      Christian Borntraeger reports:
      
      > After a logical cpu offline, even on a complete idle system, there
      > is one cpu with full ticks. It turns out that nohz.cpu_mask still has
      > the offlined cpu set.
      >
      > In select_nohz_load_balancer() we check if the system is completely
      > idle to turn off load balancing. We compare cpu_online_map with
      > nohz.cpu_mask.  Since cpu_online_map is updated on cpu unplug,
      > but nohz.cpu_mask is not, the check fails and the scheduler believes
      > that we need an "idle load balancer" even on a fully idle system.
      > Since the ilb cpu does not deactivate the timer tick this breaks NOHZ.
      
      Fix select_nohz_load_balancer() so that it does not set the offlined
      cpu in nohz.cpu_mask while that cpu is going offline.
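
      The shape of the fix, as a hedged sketch (the actual code lives in
      select_nohz_load_balancer(); details simplified):

       /* a cpu that is going offline must neither join nohz.cpu_mask nor
        * remain the idle load balancer */
       if (!cpu_active(cpu)) {
               cpu_clear(cpu, nohz.cpu_mask);
               if (atomic_read(&nohz.load_balancer) == cpu &&
                   atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
                       BUG();
               return 0;
       }
       cpu_set(cpu, nohz.cpu_mask);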
      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  15. 04 Feb, 2009 (1 commit)
  16. 03 Feb, 2009 (1 commit)
    • modules: Use a better scheme for refcounting · 720eba31
      Committed by Eric Dumazet
      Current refcounting for modules (done if CONFIG_MODULE_UNLOAD=y) is
      using a lot of memory.
      
      Each 'struct module' contains an [NR_CPUS] array of full cache lines.
      
      This patch uses existing infrastructure (percpu_modalloc() &
      percpu_modfree()) to allocate percpu space for the refcount storage.
      
      Instead of wasting NR_CPUS*128 bytes (on i386), we now use
      nr_cpu_ids*sizeof(local_t) bytes.
      
      On a typical distro, where NR_CPUS=8, shipping 2000 modules, we reduce the
      size of module files by about 2 Mbytes (1 KB per module).
      
      Instead of having all refcounters in the same memory node - with TLB misses
      because of vmalloc() - this new implementation permits better NUMA
      properties, since each CPU will use storage on its preferred node, thanks
      to percpu storage.
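
      The mechanism, sketched close to the patch's helpers (SMP case only,
      allocation and error handling elided):

       /* refptr points at nr_cpu_ids local_t counters in the module percpu
        * area, replacing the old ref[NR_CPUS] array of full cache lines */
       static inline local_t *__module_ref_addr(struct module *mod, int cpu)
       {
               return (local_t *)(mod->refptr + per_cpu_offset(cpu));
       }

       static inline void __module_get(struct module *module)
       {
               if (module) {
                       local_inc(__module_ref_addr(module, get_cpu()));
                       put_cpu();
               }
       }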
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>