1. 26 Feb 2009 (1 commit)
    • rcu: Teach RCU that idle task is not quiescent state at boot · a6826048
      Paul E. McKenney committed
      This patch fixes a bug located by Vegard Nossum with the aid of
      kmemcheck, updated based on review comments from Nick Piggin,
      Ingo Molnar, and Andrew Morton.  It also cleans up the variable-name
      and function-name language.  ;-)
      
      The boot CPU runs in the context of its idle thread during boot-up.
      During this time, idle_cpu(0) will always return nonzero, which will
      fool Classic and Hierarchical RCU into deciding that a large chunk of
      the boot-up sequence is a big long quiescent state.  This in turn causes
      RCU to prematurely end grace periods during this time.
      
      This patch changes the rcutree.c and rcuclassic.c rcu_check_callbacks()
      function to ignore the idle task as a quiescent state until the
      system has started up the scheduler in rest_init(), introducing a
      new non-API function rcu_idle_now_means_idle() to inform RCU of this
      transition.  RCU maintains an internal rcu_idle_cpu_truthful variable
      to track this state, which is then used by rcu_check_callbacks() to
      determine whether it should believe idle_cpu().
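      
      A minimal sketch of that gating (simplified; the real
      rcu_check_callbacks() has more quiescent-state conditions than shown,
      and the names are the commit's own):
      
      	/* Sketch only: trust idle_cpu() only once the scheduler is up. */
      	static int rcu_idle_cpu_truthful __read_mostly;
      
      	void rcu_idle_now_means_idle(void)	/* called from rest_init() */
      	{
      		rcu_idle_cpu_truthful = 1;
      	}
      
      	void rcu_check_callbacks(int cpu, int user)
      	{
      		if (user ||
      		    (rcu_idle_cpu_truthful && idle_cpu(cpu) &&
      		     !in_softirq() &&
      		     hardirq_count() <= (1 << HARDIRQ_SHIFT))) {
      			rcu_qsctr_inc(cpu);	/* note a quiescent state */
      		}
      		/* ... rest of per-tick RCU processing ... */
      	}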
      
      Because this patch has the effect of disallowing RCU grace periods
      during long stretches of the boot-up sequence, this patch also introduces
      Josh Triplett's UP-only optimization that makes synchronize_rcu() be a
      no-op if num_online_cpus() returns 1.  This allows boot-time code that
      calls synchronize_rcu() to proceed normally.  Note, however, that RCU
      callbacks registered by call_rcu() will likely queue up until later in
      the boot sequence.  Although rcuclassic and rcutree can also use this
      same optimization after boot completes, rcupreempt must restrict its
      use of this optimization to the portion of the boot sequence before the
      scheduler starts up, given that an rcupreempt RCU read-side critical
      section may be preempted.
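      
      A sketch of that UP fast path, using the rcu_blocking_is_gp() helper
      described in the v2 changes below (close to the shape this implies,
      not the verbatim patch):
      
      	/* Sketch: on UP, blocking in synchronize_rcu() is itself a GP. */
      	static inline int rcu_blocking_is_gp(void)
      	{
      		return num_online_cpus() == 1;
      	}
      
      	void synchronize_rcu(void)
      	{
      		struct rcu_synchronize rcu;
      
      		if (rcu_blocking_is_gp())
      			return;		/* no-op with one online CPU */
      
      		init_completion(&rcu.completion);
      		/* wakeme_after_rcu() wakes us after a full grace period. */
      		call_rcu(&rcu.head, wakeme_after_rcu);
      		wait_for_completion(&rcu.completion);
      	}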
      
      In addition, this patch takes Nick Piggin's suggestion to make the
      system_state global variable be __read_mostly.
      
      Changes since v4:
      
      o	Changes the name of the introduced function and variable to
      	be less emotional.  ;-)
      
      Changes since v3:
      
      o	WARN_ON(nr_context_switches() > 0) to verify that RCU
      	switches out of boot-time mode before the first context
      	switch, as suggested by Nick Piggin.
      
      Changes since v2:
      
      o	Created rcu_blocking_is_gp() internal-to-RCU API that
      	determines whether a call to synchronize_rcu() is itself
      	a grace period.
      
      o	The definition of rcu_blocking_is_gp() for rcuclassic and
      	rcutree checks to see if but a single CPU is online.
      
      o	The definition of rcu_blocking_is_gp() for rcupreempt
      	checks both whether only a single CPU is online and whether
      	the system is still in early boot.
      
      	This allows rcupreempt to again work correctly if running
      	on a single CPU after booting is complete.
      
      o	Added check to rcupreempt's synchronize_sched() for there
      	being but one online CPU.
      
      Tested all three variants in both SMP and !SMP configurations; all
      booted fine and passed a short rcutorture test on both x86 and Power.
      Located-by: Vegard Nossum <vegard.nossum@gmail.com>
      Tested-by: Vegard Nossum <vegard.nossum@gmail.com>
      Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      a6826048
  2. 23 Feb 2009 (1 commit)
  3. 22 Feb 2009 (5 commits)
  4. 19 Feb 2009 (5 commits)
  5. 18 Feb 2009 (2 commits)
    • block: fix bad definition of BIO_RW_SYNC · 93dbb393
      Jens Axboe committed
      We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO
      and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before
      213d9417.
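      
      To see why OR-ing the shift values is broken (the numeric values below
      are made up for illustration, not the real bio.h constants):
      
      	/* Illustrative values only. BIO_RW_* are bit numbers, not masks. */
      	enum { BIO_RW_SYNCIO = 4, BIO_RW_UNPLUG = 5 };
      
      	/* Broken: 4 | 5 == 5, so the "combined" flag was just UNPLUG. */
      	#define BIO_RW_SYNC	(BIO_RW_SYNCIO | BIO_RW_UNPLUG)
      
      	/* Correct: shift each bit number into a mask before combining. */
      	#define BIO_SYNC_MASK	((1 << BIO_RW_SYNCIO) | (1 << BIO_RW_UNPLUG))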
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      93dbb393
    • tracing/function-graph-tracer: trace the idle tasks · 5b058bcd
      Frederic Weisbecker committed
      When the function graph tracer is activated, it iterates over the
      task list to allocate a stack to store the return addresses.
      
      But the per-cpu idle tasks are not visited by
      do_each_thread / while_each_thread, so we have to iterate over them
      manually.
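      
      A minimal sketch of that manual pass (the init helper name and the
      ret_stack check are assumed here for illustration):
      
      	/*
      	 * Sketch: give each per-cpu idle task a return-address stack,
      	 * since do_each_thread()/while_each_thread() never visits them.
      	 * ftrace_graph_init_task() is an assumed helper name.
      	 */
      	for_each_online_cpu(cpu) {
      		struct task_struct *idle = idle_task(cpu);
      
      		if (!idle->ret_stack)
      			ftrace_graph_init_task(idle);
      	}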
      
      This fixes some weirdness in the traces and many lost traces.
      Examples on two cpus:
      
       0)   Xorg-4287    |   2.906 us    |              }
       0)   Xorg-4287    |   3.965 us    |            }
       0)   Xorg-4287    |   5.302 us    |          }
       ------------------------------------------
       0)   Xorg-4287    =>    <idle>-0
       ------------------------------------------
      
       0)    <idle>-0    |   2.861 us    |                        }
       0)    <idle>-0    |   0.526 us    |                        set_normalized_timespec();
       0)    <idle>-0    |   7.201 us    |                      }
       0)    <idle>-0    |   8.214 us    |                    }
       0)    <idle>-0    |               |                    clockevents_program_event() {
       0)    <idle>-0    |               |                      lapic_next_event() {
       0)    <idle>-0    |   0.510 us    |                        native_apic_mem_write();
       0)    <idle>-0    |   1.546 us    |                      }
       0)    <idle>-0    |   2.583 us    |                    }
       0)    <idle>-0    | + 12.435 us   |                  }
       0)    <idle>-0    | + 13.470 us   |                }
       0)    <idle>-0    |   0.608 us    |                _spin_unlock_irqrestore();
       0)    <idle>-0    | + 23.270 us   |              }
       0)    <idle>-0    | + 24.336 us   |            }
       0)    <idle>-0    | + 25.417 us   |          }
       0)    <idle>-0    |   0.593 us    |          _spin_unlock();
       0)    <idle>-0    | + 41.869 us   |        }
       0)    <idle>-0    | + 42.906 us   |      }
       0)    <idle>-0    | + 95.035 us   |    }
       0)    <idle>-0    |   0.540 us    |    menu_reflect();
       0)    <idle>-0    | ! 100.404 us  |  }
       0)    <idle>-0    |   0.564 us    |  mce_idle_callback();
       0)    <idle>-0    |               |  enter_idle() {
       0)    <idle>-0    |   0.526 us    |    mce_idle_callback();
       0)    <idle>-0    |   1.757 us    |  }
       0)    <idle>-0    |               |  cpuidle_idle_call() {
       0)    <idle>-0    |               |    menu_select() {
       0)    <idle>-0    |   0.525 us    |      pm_qos_requirement();
       0)    <idle>-0    |   0.518 us    |      tick_nohz_get_sleep_length();
       0)    <idle>-0    |   2.621 us    |    }
      [...]
       1)    <idle>-0    |   0.518 us    |              touch_softlockup_watchdog();
       1)    <idle>-0    | + 14.355 us   |            }
       1)    <idle>-0    | + 22.840 us   |          }
       1)    <idle>-0    | + 25.949 us   |        }
       1)    <idle>-0    |               |        handle_irq() {
       1)    <idle>-0    |   0.511 us    |          irq_to_desc();
       1)    <idle>-0    |               |          handle_edge_irq() {
       1)    <idle>-0    |   0.638 us    |            _spin_lock();
       1)    <idle>-0    |               |            ack_apic_edge() {
       1)    <idle>-0    |   0.510 us    |              irq_to_desc();
       1)    <idle>-0    |               |              move_native_irq() {
       1)    <idle>-0    |   0.510 us    |                irq_to_desc();
       1)    <idle>-0    |   1.532 us    |              }
       1)    <idle>-0    |   0.511 us    |              native_apic_mem_write();
       ------------------------------------------
       1)    <idle>-0    =>    cat-5073
       ------------------------------------------
      
       1)    cat-5073    |   3.731 us    |                    }
       1)    cat-5073    |               |                    run_local_timers() {
       1)    cat-5073    |   0.533 us    |                      hrtimer_run_queues();
       1)    cat-5073    |               |                      raise_softirq() {
       1)    cat-5073    |               |                        __raise_softirq_irqoff() {
       1)    cat-5073    |               |                          /* nr: 1 */
       1)    cat-5073    |   2.718 us    |                        }
       1)    cat-5073    |   3.814 us    |                      }
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      5b058bcd
  6. 16 Feb 2009 (2 commits)
  7. 14 Feb 2009 (1 commit)
  8. 13 Feb 2009 (1 commit)
    • timers: more consistently use clock vs timer · 3997ad31
      Peter Zijlstra committed
      While reviewing the manpages, I noticed I'd missed some clock vs timer sites.
      
      Make sure that all timer functions call cpu_timer_sample_group() and not
      cpu_clock_sample_group(). This ensures that we enable the process-wide
      timer in time, and therefore pay the O(n) thread-group cost from the
      syscall.
      
      Not doing so would leave that work to the first jiffy tick after setting
      the timer, resulting in a very expensive tick (but only once) and a
      delay in actually starting the timer.
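      
      The distinction, as a hedged sketch (signatures simplified from the
      posix-cpu-timers code; not the verbatim helpers):
      
      	/* Clock path: full O(n) walk of the thread group on every read. */
      	static void cpu_clock_sample_group(struct task_struct *p,
      					   struct task_cputime *times)
      	{
      		thread_group_cputime(p, times);
      	}
      
      	/* Timer path: arms the shared, tick-maintained accumulator
      	 * first, so the O(n) sum is paid once, at the syscall. */
      	static void cpu_timer_sample_group(struct task_struct *p,
      					   struct task_cputime *times)
      	{
      		thread_group_cputimer(p, times);
      	}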
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      3997ad31
  9. 12 Feb 2009 (4 commits)
  10. 11 Feb 2009 (4 commits)
  11. 10 Feb 2009 (1 commit)
  12. 09 Feb 2009 (5 commits)
  13. 07 Feb 2009 (1 commit)
  14. 06 Feb 2009 (3 commits)
    • wait: prevent exclusive waiter starvation · 777c6c5f
      Johannes Weiner committed
      With exclusive waiters, every process woken up through the wait queue must
      ensure that the next waiter down the line is woken when it has finished.
      
      Interruptible waiters don't do that when aborting due to a signal.  And if
      an aborting waiter is concurrently woken up through the waitqueue, no
      one will ever wake up the next waiter.
      
      This has been observed with __wait_on_bit_lock() used by
      lock_page_killable(): the first contender on the queue was aborting when
      the actual lock holder woke it up concurrently.  The aborted contender
      didn't acquire the lock and therefore never did an unlock followed by
      waking up the next waiter.
      
      Add abort_exclusive_wait() which removes the process' wait descriptor from
      the waitqueue, iff still queued, or wakes up the next waiter otherwise.
      It does so under the waitqueue lock.  Racing with a wake up means the
      aborting process is either already woken (removed from the queue) and will
      wake up the next waiter, or it will remove itself from the queue and the
      concurrent wake up will apply to the next waiter after it.
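      
      The helper, roughly as described (a simplified sketch of the
      kernel/wait.c change, not the verbatim patch):
      
      	void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
      				  unsigned int mode, void *key)
      	{
      		unsigned long flags;
      
      		__set_current_state(TASK_RUNNING);
      		spin_lock_irqsave(&q->lock, flags);
      		if (!list_empty(&wait->task_list))
      			list_del_init(&wait->task_list);  /* still queued */
      		else if (waitqueue_active(q))
      			/* already woken: pass the wake-up down the line */
      			__wake_up_common(q, mode, 1, 0, key);
      		spin_unlock_irqrestore(&q->lock, flags);
      	}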
      
      Use abort_exclusive_wait() in __wait_event_interruptible_exclusive() and
      __wait_on_bit_lock() when they are interrupted by means other than a
      wake-up through the queue.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Reported-by: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Mentored-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Chuck Lever <cel@citi.umich.edu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>		["after some testing"]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      777c6c5f
    • revert "rlimit: permit setting RLIMIT_NOFILE to RLIM_INFINITY" · 60fd760f
      Andrew Morton committed
      Revert commit 0c2d64fb because it causes
      (arguably poorly designed) existing userspace to spend interminable
      periods closing billions of not-open file descriptors.
      
      We could bring this back, with some sort of opt-in tunable in /proc, which
      defaults to "off".
      
      Peter's analysis follows:
      
      : I spent several hours trying to get to the bottom of a serious
      : performance issue that appeared on one of our servers after upgrading to
      : 2.6.28.  In the end it's what could be considered a userspace bug that
      : was triggered by a change in 2.6.28.  Since this might also affect other
      : people I figured I'd at least document what I found here, and maybe we
      : can even do something about it:
      :
      :
      : So, I upgraded some of debian.org's machines to 2.6.28.1 and immediately
      : the team maintaining our ftp archive complained that one of their
      : scripts that previously ran in a few minutes still hadn't even come
      : close to being done after an hour or so.  Downgrading to 2.6.27 fixed
      : that.
      :
      : Turns out that script is forking a lot and something in it or python or
      : wherever closes all the file descriptors it doesn't want to pass on.
      : That is, it starts at zero and goes up to ulimit -n/RLIMIT_NOFILE and
      : closes them all with a few exceptions.
      :
      : Turns out that takes a long time when your ulimit -n is now 2^20 (1048576).
      :
      : With 2.6.27.* the ulimit -n was the standard 1024, but with 2.6.28 it is
      : now a thousand times that.
      :
      : 2.6.28 included a patch titled "rlimit: permit setting RLIMIT_NOFILE to
      : RLIM_INFINITY" (0c2d64fb)[1] that
      : allows, as the title implies, to set the limit for number of files to
      : infinity.
      :
      : Closer investigation showed that the broken default ulimit did not apply
      : to "system" processes (like stuff started from init).  In the end I
      : could establish that all processes that passed through pam_limit at one
      : point had the bad resource limit.
      :
      : Apparently the pam library in Debian etch (4.0) initializes the limits
      : to some default values when it doesn't have any settings in limit.conf
      : to override them.  Turns out that for nofiles this is RLIM_INFINITY.
      : Commenting out "case RLIMIT_NOFILE" in pam_limit.c:267 of our pam
      : package version 0.79-5 fixes that - tho I'm not sure what side effects
      : that has.
      :
      : Debian lenny (the upcoming 5.0 version) doesn't have this issue as it
      : uses a different pam (version).
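      
      The userspace pattern Peter describes looks roughly like this (an
      illustrative sketch, not the actual Debian script):
      
      	/* With RLIMIT_NOFILE at 2^20 instead of 1024, this classic
      	 * close-everything-after-fork loop makes about a million
      	 * close() calls per child, almost all on fds that were
      	 * never open. */
      	#include <sys/resource.h>
      	#include <unistd.h>
      
      	static void close_inherited_fds(void)
      	{
      		struct rlimit rl;
      		int fd;
      
      		if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
      			return;
      		for (fd = 3; fd < (int)rl.rlim_cur; fd++)
      			close(fd);	/* keep stdin/stdout/stderr */
      	}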
      Reported-by: Peter Palfrader <weasel@debian.org>
      Cc: Adam Tkac <vonsch@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: <stable@kernel.org>		[2.6.28.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60fd760f
    • kernel/async.c: fix printk warnings · 58763a29
      Andrew Morton committed
      alpha:
      
      kernel/async.c: In function 'run_one_entry':
      kernel/async.c:141: warning: format '%lli' expects type 'long long int', but argument 2 has type 'async_cookie_t'
      kernel/async.c:149: warning: format '%lli' expects type 'long long int', but argument 2 has type 'async_cookie_t'
      kernel/async.c:149: warning: format '%lld' expects type 'long long int', but argument 4 has type 's64'
      kernel/async.c: In function 'async_synchronize_cookie_special':
      kernel/async.c:250: warning: format '%lli' expects type 'long long int', but argument 3 has type 's64'
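      
      The warnings call for explicit casts, since async_cookie_t is u64-based
      and u64 is not 'long long' on alpha. A hedged sketch of one fixed call
      site:
      
      	printk("calling  %lli_%pF @ %i\n",
      	       (long long)entry->cookie, entry->func, task_pid_nr(current));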
      
      Cc: Arjan van de Ven <arjan@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58763a29
  15. 05 Feb 2009 (3 commits)
    • timers: split process wide cpu clocks/timers · 4cd4c1b4
      Peter Zijlstra committed
      Change the process wide cpu timers/clocks so that we:
      
       1) don't mess up the kernel with too many threads,
       2) don't have a per-cpu allocation for each process,
       3) have no impact when not used.
      
      In order to accomplish this we're going to split it into two parts:
      
       - clocks; which can take all the time they want since they run
                 from user context -- ie. sys_clock_gettime(CLOCK_PROCESS_CPUTIME_ID)
      
       - timers; which need constant time sampling but since they're
                 explicitly used, the user can pay the overhead.
      
      The clock readout will go back to a full sum of the thread group, while the
      timers will run off a global 'clock' that only runs when needed, so only
      programs that make use of the facility pay the price.
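      
      A sketch of the data-structure side of the split (field and function
      names follow the description above; the details are assumptions, not
      the verbatim patch):
      
      	/* Shared accumulator, updated from the tick only while running. */
      	struct thread_group_cputimer {
      		struct task_cputime	cputime;
      		int			running;
      		spinlock_t		lock;
      	};
      
      	/* Timer path: switch the accumulator on once, then sample it
      	 * cheaply from then on. */
      	void thread_group_cputimer(struct task_struct *tsk,
      				   struct task_cputime *times)
      	{
      		struct thread_group_cputimer *ct = &tsk->signal->cputimer;
      
      		if (!ct->running) {
      			thread_group_cputime(tsk, &ct->cputime); /* O(n), once */
      			ct->running = 1;
      		}
      		spin_lock(&ct->lock);
      		*times = ct->cputime;
      		spin_unlock(&ct->lock);
      	}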
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      4cd4c1b4
    • signal: re-add dead task accumulation stats. · 32bd671d
      Peter Zijlstra committed
      We're going to split the process wide cpu accounting into two parts:
      
       - clocks; which can take all the time they want since they run
                 from user context.
      
       - timers; which need constant time sampling but can afford the overhead
                 because they're default off -- and rare.
      
      The clock readout will go back to a full sum of the thread group, for this
      we need to re-add the exit stats that were removed in the initial itimer
      rework (f06febc9: timers: fix itimer/many thread hang).
      
      Furthermore, since that full sum can be rather slow for large thread groups
      and we have the complete dead task stats, revert the do_notify_parent time
      computation.
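      
      A hedged sketch of the re-added accumulation in __exit_signal()
      (simplified; field names per that era's signal_struct):
      
      	/* Fold a dying thread's times into signal_struct so the group
      	 * sum still covers already-exited tasks. Runs under the
      	 * sighand lock. */
      	sig->utime = cputime_add(sig->utime, tsk->utime);
      	sig->stime = cputime_add(sig->stime, tsk->stime);
      	sig->gtime = cputime_add(sig->gtime, task_gtime(tsk));
      	sig->sum_sched_runtime += tsk->se.sum_exec_runtime;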
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      32bd671d
    • sched: fix nohz load balancer on cpu offline · 483b4ee6
      Suresh Siddha committed
      Christian Borntraeger reports:
      
      > After a logical cpu offline, even on a completely idle system, there
      > is one cpu with full ticks. It turns out that nohz.cpu_mask has the
      > offlined cpu still set.
      >
      > In select_nohz_load_balancer() we check if the system is completely
      > idle to turn off load balancing. We compare cpu_online_map with
      > nohz.cpu_mask.  Since cpu_online_map is updated on cpu unplug,
      > but nohz.cpu_mask is not, the check fails and the scheduler believes
      > that we need an "idle load balancer" even on a fully idle system.
      > Since the ilb cpu does not deactivate the timer tick this breaks NOHZ.
      
      Fix select_nohz_load_balancer() so that it does not set nohz.cpu_mask
      while a cpu is going offline.
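      
      A condensed sketch of the fix's shape in select_nohz_load_balancer()
      (the resignation detail is an assumption based on the description):
      
      	if (stop_tick) {
      		if (!cpu_active(cpu)) {
      			/* Going offline: never join nohz.cpu_mask, and
      			 * resign if we were the idle load balancer. */
      			if (atomic_read(&nohz.load_balancer) == cpu)
      				atomic_set(&nohz.load_balancer, -1);
      			return 0;
      		}
      		cpumask_set_cpu(cpu, nohz.cpu_mask);
      		/* ... */
      	}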
      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      483b4ee6
  16. 04 Feb 2009 (1 commit)