1. 03 Apr, 2010 9 commits
    • sched: Fix nr_uninterruptible count · cc87f76a
      Committed by Peter Zijlstra
      The cpuload calculation in calc_load_account_active() assumes
      rq->nr_uninterruptible will not change on an offline cpu after
      migrate_nr_uninterruptible(). However the recent migrate on wakeup
      changes broke that and would result in decrementing the offline cpu's
      rq->nr_uninterruptible.
      
      Fix this by accounting the nr_uninterruptible on the waking cpu.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Optimize task_rq_lock() · 65cc8e48
      Committed by Peter Zijlstra
      Now that we hold the rq->lock over set_task_cpu() again, we can do
      away with most of the TASK_WAKING checks and reduce them again to
      set_cpus_allowed_ptr().
      
      Removes some conditionals from scheduling hot-paths.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Oleg Nesterov <oleg@redhat.com>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Fix TASK_WAKING vs fork deadlock · 0017d735
      Committed by Peter Zijlstra
      Oleg noticed a few races with the TASK_WAKING usage on fork.
      
       - since TASK_WAKING is basically a spinlock, it should be IRQ safe
       - since we set TASK_WAKING (*) without holding rq->lock it could
         be there still is a rq->lock holder, thereby not actually
         providing full serialization.
      
      (*) in fact we clear PF_STARTING, which in effect enables TASK_WAKING.
      
      Cure the second issue by not setting TASK_WAKING in sched_fork(), but
      only temporarily in wake_up_new_task() while calling select_task_rq().
      
      Cure the first by holding rq->lock around the select_task_rq() call,
      this will disable IRQs, this however requires that we push down the
      rq->lock release into select_task_rq_fair()'s cgroup stuff.
      
      Because select_task_rq_fair() still needs to drop the rq->lock we
      cannot fully get rid of TASK_WAKING.
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Make select_fallback_rq() cpuset friendly · 9084bb82
      Committed by Oleg Nesterov
      Introduce cpuset_cpus_allowed_fallback() helper to fix the cpuset problems
      with select_fallback_rq(). It can be called from any context and can't use
      any cpuset locks including task_lock(). It is called when the task doesn't
      have online cpus in ->cpus_allowed but ttwu/etc must be able to find a
      suitable cpu.
      
      I am not proud of this patch. Everything which needs such a fat comment
      can't be good even if correct. But I'd prefer to not change the locking
      rules in the code I hardly understand, and in any case I believe this
      simple change makes the code much more correct compared to the deadlocks we
      currently have.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100315091027.GA9155@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: _cpu_down(): Don't play with current->cpus_allowed · 6a1bdc1b
      Committed by Oleg Nesterov
      _cpu_down() changes the current task's affinity and then recovers it at
      the end. The problems are well known: we can't restore old_allowed if it
      was bound to the now-dead-cpu, and we can race with the userspace which
      can change cpu-affinity during unplug.
      
      _cpu_down() should not play with current->cpus_allowed at all. Instead,
      take_cpu_down() can migrate the caller of _cpu_down() after __cpu_disable()
      removes the dying cpu from cpu_online_mask.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100315091023.GA9148@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: sched_exec(): Remove the select_fallback_rq() logic · 30da688e
      Committed by Oleg Nesterov
      sched_exec()->select_task_rq() reads/updates ->cpus_allowed lockless.
      This can race with other CPUs updating our ->cpus_allowed, and this
      looks meaningless to me.
      
      The task is current and running, it must have online cpus in ->cpus_allowed,
      the fallback mode is bogus. And, if ->sched_class returns the "wrong" cpu,
      this likely means we raced with set_cpus_allowed() which was called
      for a reason, so why should sched_exec() retry and call ->select_task_rq()
      again?
      
      Change the code to call sched_class->select_task_rq() directly and do
      nothing if the returned cpu is wrong after re-checking under rq->lock.
      
      From now on task_struct->cpus_allowed is always stable under TASK_WAKING,
      select_fallback_rq() is always called under rq->lock or the caller
      owns TASK_WAKING (select_task_rq).
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100315091019.GA9141@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: move_task_off_dead_cpu(): Remove retry logic · c1804d54
      Committed by Oleg Nesterov
      The previous patch preserved the retry logic, but it looks unneeded.
      
      __migrate_task() can only fail if we raced with migration after we dropped
      the lock, but in this case the caller of set_cpus_allowed/etc must initiate
      migration itself if ->on_rq == T.
      
      We already fixed p->cpus_allowed, the changes in active/online masks must
      be visible to racer, it should migrate the task to online cpu correctly.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100315091014.GA9138@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: move_task_off_dead_cpu(): Take rq->lock around select_fallback_rq() · 1445c08d
      Committed by Oleg Nesterov
      move_task_off_dead_cpu()->select_fallback_rq() reads/updates ->cpus_allowed
      lockless. We can race with set_cpus_allowed() running in parallel.
      
      Change it to take rq->lock around select_fallback_rq(). Note that it is not
      trivial to move this spin_lock() into select_fallback_rq(), we must recheck
      the task was not migrated after we take the lock and other callers do not
      need this lock.
      
      To avoid the races with other callers of select_fallback_rq() which rely on
      TASK_WAKING, we also check p->state != TASK_WAKING and do nothing otherwise.
      The owner of TASK_WAKING must update ->cpus_allowed and choose the correct
      CPU anyway, and the subsequent __migrate_task() is just meaningless because
      p->se.on_rq must be false.
      
      Alternatively, we could change select_task_rq() to take rq->lock right
      after it calls sched_class->select_task_rq(), but this looks a bit ugly.
      
      Also, change it to not assume irqs are disabled and absorb __migrate_task_irq().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100315091010.GA9131@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Kill the broken and deadlockable cpuset_lock/cpuset_cpus_allowed_locked code · 897f0b3c
      Committed by Oleg Nesterov
      This patch just states the fact the cpusets/cpuhotplug interaction is
      broken and removes the deadlockable code which only pretends to work.
      
      - cpuset_lock() doesn't really work. It is needed for
        cpuset_cpus_allowed_locked() but we can't take this lock in
        try_to_wake_up()->select_fallback_rq() path.
      
      - cpuset_lock() is deadlockable. Suppose that a task T bound to CPU takes
        callback_mutex. If cpu_down(CPU) happens before T drops callback_mutex
        stop_machine() preempts T, then migration_call(CPU_DEAD) tries to take
        cpuset_lock() and hangs forever because CPU is already dead and thus
        T can't be scheduled.
      
      - cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock()
        which is not irq-safe, but try_to_wake_up() can be called from irq.
      
      Kill them, and change select_fallback_rq() to use cpu_possible_mask, like
      we currently do without CONFIG_CPUSETS.
      
      Also, with or without this patch, with or without CONFIG_CPUSETS, the
      callers of select_fallback_rq() can race with each other or with
      set_cpus_allowed() paths.
      
      The subsequent patches try to fix these problems.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100315091003.GA9123@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  2. 17 Mar, 2010 1 commit
  3. 16 Mar, 2010 1 commit
  4. 15 Mar, 2010 1 commit
    • sched: sched_getaffinity(): Allow less than NR_CPUS length · cd3d8031
      Committed by KOSAKI Motohiro
      [ Note, this commit changes the syscall ABI for > 1024 CPUs systems. ]
      
      Recently, some distro decided to use NR_CPUS=4096 for mysterious reasons.
      Unfortunately, glibc sched interface has the following definition:
      
      	# define __CPU_SETSIZE  1024
      	# define __NCPUBITS     (8 * sizeof (__cpu_mask))
      	typedef unsigned long int __cpu_mask;
      	typedef struct
      	{
      	  __cpu_mask __bits[__CPU_SETSIZE / __NCPUBITS];
      	} cpu_set_t;
      
      This means that if NR_CPUS is bigger than 1024, cpu_set_t causes an
      ABI issue ...
      
      More recently, Sharyathi Nagesh reported that the following test program
      fails with a mysterious syscall error:
      
       -----------------------------------------------------------------------
       #define _GNU_SOURCE
       #include<stdio.h>
       #include<errno.h>
       #include<sched.h>
      
       int main()
       {
           cpu_set_t set;
           if (sched_getaffinity(0, sizeof(cpu_set_t), &set) < 0)
               printf("\n Call is failing with:%d", errno);
       }
       -----------------------------------------------------------------------
      
      This happens because the kernel assumes that the len argument of
      sched_getaffinity() is bigger than NR_CPUS, which no longer holds.
      
      Now we are faced with the following annoying dilemma, due to
      the limitations of the glibc interface built in years ago:
      
       (1) if we change glibc's __CPU_SETSIZE definition, we lose
           binary compatibility of _all_ applications.
      
       (2) if we don't change it, we also lose binary compatibility of
           Sharyathi's use case.
      
      So I propose changing the rule for the len argument of
      sched_getaffinity().
      
      Old:
      	len should be bigger than NR_CPUS
      New:
      	len should be bigger than maximum possible cpu id
      
      This creates the following behavior:
      
       (A) On a real 4096-cpu machine, the above test program still
           returns -EINVAL.
      
       (B) If NR_CPUS=4096 but the machine has fewer than 1024 cpus (almost
           all machines in the world), the above can run successfully.
      
      Fortunately, big SGI machines are mainly used for HPC use cases. It means
      they can rebuild their programs.
      
      IOW we hope they are not annoyed by this issue ...
      Reported-by: Sharyathi Nagesh <sharyath@in.ibm.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Ulrich Drepper <drepper@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Russ Anderson <rja@sgi.com>
      Cc: Mike Travis <travis@sgi.com>
      LKML-Reference: <20100312161316.9520.A69D9226@jp.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
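For programs that may run on kernels built with a large NR_CPUS, the glibc CPU_ALLOC() family of macros lets the mask be sized at run time instead of relying on the fixed 1024-bit cpu_set_t from the sched_getaffinity() commit above. The following is only a sketch of that approach: the helper name count_allowed_cpus() and the growth range (1024 up to 1048576 bits) are assumptions made for illustration and are not part of the commit.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns the number of CPUs in the calling
 * thread's affinity mask, or -1 on failure. The mask is grown and the
 * call retried whenever the kernel rejects the length with EINVAL.
 */
int count_allowed_cpus(void)
{
    for (int ncpus = 1024; ncpus <= (1 << 20); ncpus *= 2) {
        cpu_set_t *set = CPU_ALLOC(ncpus);
        size_t size = CPU_ALLOC_SIZE(ncpus);

        if (set == NULL)
            return -1;
        CPU_ZERO_S(size, set);

        if (sched_getaffinity(0, size, set) == 0) {
            int n = CPU_COUNT_S(size, set);
            CPU_FREE(set);
            /* A running thread is always allowed on at least one CPU. */
            assert(n >= 1);
            return n;
        }

        int err = errno;      /* save errno before freeing the mask */
        CPU_FREE(set);
        if (err != EINVAL)
            return -1;        /* a real error, not a too-short mask */
    }
    return -1;
}
```

A caller would simply use `int n = count_allowed_cpus();` and treat a negative value as failure. Growing the buffer on EINVAL keeps the program working in both case (A) and case (B) described in the commit message.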
  5. 12 Mar, 2010 5 commits
    • sched: Remove SYNC_WAKEUPS feature · c6ee36c4
      Committed by Mike Galbraith
      Sync wakeups are critical functionality with a long history.  Remove it, we don't
      need the branch or icache footprint.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1268301817.6785.47.camel@marge.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Cleanup/optimize clock updates · a64692a3
      Committed by Mike Galbraith
      Now that we no longer depend on the clock being updated prior to enqueueing
      on migratory wakeup, we can clean up a bit, placing calls to update_rq_clock()
      exactly where they are needed, ie on enqueue, dequeue and schedule events.
      
      In the case of a freshly enqueued task immediately preempting, we can skip the
      update during preemption, as the clock was just updated by the enqueue event.
      We also save an unneeded call during a migratory wakeup by not updating the
      previous runqueue, where update_curr() won't be invoked.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1268301199.6785.32.camel@marge.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Remove avg_overlap · e12f31d3
      Committed by Mike Galbraith
      Both avg_overlap and avg_wakeup had an inherent problem in that their accuracy
      was detrimentally affected by cross-cpu wakeups, this because we are missing
      the necessary call to update_curr().  This can't be fixed without increasing
      overhead in our already too fat fastpath.
      
      Additionally, with recent load balancing changes making us prefer to place tasks
      in an idle cache domain (which is good for compute bound loads), communicating
      tasks suffer when a sync wakeup, which would enable affine placement, is turned
      into a non-sync wakeup by SYNC_LESS.  With one task on the runqueue, wake_affine()
      rejects the affine wakeup request, leaving the unfortunate where placed, taking
      frequent cache misses.
      
      Remove it, and recover some fastpath cycles.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1268301121.6785.30.camel@marge.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Remove avg_wakeup · b42e0c41
      Committed by Mike Galbraith
      Testing the load which led to this heuristic (nfs4 kbuild) shows that it has
      outlived its usefulness.  With intervening load balancing changes, I cannot
      see any difference with/without, so recover these fastpath cycles.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1268301062.6785.29.camel@marge.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Rate-limit nohz · 39c0cbe2
      Committed by Mike Galbraith
      Entering nohz code on every micro-idle is costing ~10% throughput for netperf
      TCP_RR when scheduling cross-cpu.  Rate limiting entry fixes this, but raises
      ticks a bit.  On my Q6600, an idle box goes from ~85 interrupts/sec to 128.
      
      The higher the context switch rate, the more nohz entry costs.  With this patch
      and some cycle recovery patches in my tree, max cross cpu context switch rate is
      improved by ~16%, a large portion of which is this ratelimiting.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1268301003.6785.28.camel@marge.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  6. 11 Mar, 2010 2 commits
  7. 08 Mar, 2010 1 commit
    • sysdev: Pass attribute in sysdev_class attributes show/store · c9be0a36
      Committed by Andi Kleen
      Passing the attribute to the low level IO functions allows all kinds
      of cleanups, by sharing low level IO code without requiring
      an own function for every piece of data.
      
      Also drivers can extend the attributes with own data fields
      and use that in the low level function.
      
      Similar to sysdev_attributes and normal attributes.
      
      This is a tree-wide sweep, converting everything in one go.
      
      No functional changes in this patch other than passing the new
      argument everywhere.
      
      Tested on x86, the non x86 parts are uncompiled.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  8. 07 Mar, 2010 1 commit
  9. 25 Feb, 2010 2 commits
    • sched: Better name for for_each_domain_rd · 497f0ab3
      Committed by Paul E. McKenney
      As suggested by Peter Zijlstra, make a better choice of name
      for for_each_domain_rd(), containing "rcu_dereference", given
      that it is but a wrapper for rcu_dereference_check().  The name
      rcu_dereference_check_sched_domain() does that and provides a
      separate per-subsystem name space.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josh@joshtriplett.org
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      LKML-Reference: <1266887105-1528-7-git-send-email-paulmck@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Use lockdep-based checking on rcu_dereference() · d11c563d
      Committed by Paul E. McKenney
      Update the rcu_dereference() usages to take advantage of the new
      lockdep-based checking.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josh@joshtriplett.org
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      LKML-Reference: <1266887105-1528-6-git-send-email-paulmck@linux.vnet.ibm.com>
      [ -v2: fix allmodconfig missing symbol export build failure on x86 ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  10. 17 Feb, 2010 2 commits
  11. 16 Feb, 2010 2 commits
    • sched: Fix race between ttwu() and task_rq_lock() · 0970d299
      Committed by Peter Zijlstra
      Thomas found that due to ttwu() changing a task's cpu without holding
      the rq->lock, task_rq_lock() might end up locking the wrong rq.
      
      Avoid this by serializing against TASK_WAKING.
      Reported-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1266241712.15770.420.camel@laptop>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
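The race described in the ttwu() commit above can be pictured with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: the struct layouts, the `waking` flag (standing in for p->state == TASK_WAKING), the toy spinlock, and NR_CPUS_MODEL are all hypothetical. It shows the general pattern of waiting out a concurrent wakeup, locking the runqueue the task appears to be on, and re-checking after the lock is held.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS_MODEL 4           /* hypothetical size for this model */

struct rq {
    atomic_bool locked;           /* toy spinlock standing in for rq->lock */
};

struct task {
    atomic_int cpu;               /* CPU whose runqueue the task is on */
    atomic_bool waking;           /* models p->state == TASK_WAKING */
};

struct rq runqueues[NR_CPUS_MODEL];

void rq_lock(struct rq *rq)
{
    /* Spin until we are the one who flips the flag from false to true. */
    while (atomic_exchange_explicit(&rq->locked, true, memory_order_acquire))
        ;
}

void rq_unlock(struct rq *rq)
{
    atomic_store_explicit(&rq->locked, false, memory_order_release);
}

/* Lock the runqueue the task is on, retrying if it moved meanwhile. */
struct rq *task_rq_lock(struct task *p)
{
    for (;;) {
        /* Serialize against a waker that may still change p->cpu. */
        while (atomic_load(&p->waking))
            ;

        int cpu = atomic_load(&p->cpu);
        assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
        struct rq *rq = &runqueues[cpu];

        rq_lock(rq);
        /* Re-check under the lock: same cpu and no wakeup in flight. */
        if (cpu == atomic_load(&p->cpu) && !atomic_load(&p->waking))
            return rq;
        rq_unlock(rq);
    }
}
```

The caller is expected to pair each successful task_rq_lock() with rq_unlock() on the returned runqueue. Without the re-check after taking the lock, a task migrated between the read of `cpu` and the lock acquisition would leave the caller holding the wrong runqueue's lock.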
    • sched: Fix SMT scheduler regression in find_busiest_queue() · 9000f05c
      Committed by Suresh Siddha
      Fix a SMT scheduler performance regression that is leading to a scenario
      where SMT threads in one core are completely idle while both the SMT threads
      in another core (on the same socket) are busy.
      
      This is caused by this commit (with the problematic code highlighted)
      
         commit bdb94aa5
         Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
         Date:   Tue Sep 1 10:34:38 2009 +0200
      
         sched: Try to deal with low capacity
      
         @@ -4203,15 +4223,18 @@ find_busiest_queue()
         ...
      	for_each_cpu(i, sched_group_cpus(group)) {
         +	unsigned long power = power_of(i);
      
         ...
      
         -	wl = weighted_cpuload(i);
         +	wl = weighted_cpuload(i) * SCHED_LOAD_SCALE;
         +	wl /= power;
      
         -	if (rq->nr_running == 1 && wl > imbalance)
         +	if (capacity && rq->nr_running == 1 && wl > imbalance)
      		continue;
      
      On a SMT system, power of the HT logical cpu will be 589 and
      the scheduler load imbalance (for scenarios like the one mentioned above)
      can be approximately 1024 (SCHED_LOAD_SCALE). The above change of scaling
      the weighted load with the power will result in "wl > imbalance",
      ultimately making find_busiest_queue() return NULL and causing
      load_balance() to think that the load is well balanced. But in fact
      one of the tasks can be moved to the idle core for optimal performance.
      
      We don't need to use the weighted load (wl) scaled by the cpu power to
      compare with imbalance. In that condition, we already know there is only a
      single task "rq->nr_running == 1" and the comparison between imbalance and
      wl is to make sure that we select the correct priority thread which matches
      imbalance. So we really need to compare the imbalance with the original
      weighted load of the cpu and not the scaled load.
      
      But in other conditions where we want the most hammered(busiest) cpu, we can
      use scaled load to ensure that we consider the cpu power in addition to the
      actual load on that cpu, so that we can move the load away from the
      guy that is getting most hammered with respect to the actual capacity,
      as compared with the rest of the cpu's in that busiest group.
      
      Fix it.
      Reported-by: Ma Ling <ling.ma@intel.com>
      Initial-Analysis-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1266023662.2808.118.camel@sbs-t61.sc.intel.com>
      Cc: stable@kernel.org [2.6.32.x]
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
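The arithmetic behind the SMT regression above can be checked with the numbers quoted in the commit message (power 589, imbalance about 1024). The sketch below is illustrative only: the function names are hypothetical, the weighted load of 1024 for the single running task is an assumption, and the `capacity` term from the quoted diff is left out for brevity. The actual patch is not shown in this log, so the "unscaled" variant merely follows the description that the imbalance should be compared with the original weighted load.

```c
#include <assert.h>
#include <stdbool.h>

#define SCHED_LOAD_SCALE 1024UL

/* Load scaled by cpu power, as in the diff quoted from commit bdb94aa5. */
unsigned long scaled_load(unsigned long weighted_load, unsigned long power)
{
    assert(power > 0);
    return weighted_load * SCHED_LOAD_SCALE / power;
}

/* Regressed check: a single-task runqueue is skipped on scaled load. */
bool skip_rq_scaled(unsigned long nr_running, unsigned long weighted_load,
                    unsigned long power, unsigned long imbalance)
{
    return nr_running == 1 && scaled_load(weighted_load, power) > imbalance;
}

/* Check as the commit message describes the fix: use the unscaled load. */
bool skip_rq_unscaled(unsigned long nr_running, unsigned long weighted_load,
                      unsigned long imbalance)
{
    return nr_running == 1 && weighted_load > imbalance;
}
```

With a weighted load of 1024 and power 589, scaled_load() yields 1024 * 1024 / 589 = 1780 in integer arithmetic, which exceeds an imbalance of 1024, so the runqueue is skipped. The unscaled comparison 1024 > 1024 is false, so the runqueue stays a candidate and its task can be moved to the idle core.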
  12. 08 Feb, 2010 2 commits
    • sched: cpuacct: Use bigger percpu counter batch values for stats counters · fa535a77
      Committed by Anton Blanchard
      Anton Blanchard 提交于
      When CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_CGROUP_CPUACCT are
      enabled we can call cpuacct_update_stats with values much larger
      than percpu_counter_batch.  This means the call to
      percpu_counter_add will always add to the global count which is
      protected by a spinlock and we end up with a global spinlock in
      the scheduler.
      
      Based on an idea by KOSAKI Motohiro, this patch scales the batch
      value by cputime_one_jiffy such that we have the same batch
      limit as we would if CONFIG_VIRT_CPU_ACCOUNTING was disabled.
      His patch did this once at boot but that initialisation happened
      too early on PowerPC (before time_init) and it was never updated
      at runtime as a result of a hotplug cpu add/remove.
      
      This patch instead scales percpu_counter_batch by
      cputime_one_jiffy at runtime, which keeps the batch correct even
      after cpu hotplug operations.  We cap it at INT_MAX in case of
      overflow.
      
      For architectures that do not support
      CONFIG_VIRT_CPU_ACCOUNTING, cputime_one_jiffy is the constant 1
      and gcc is smart enough to optimise min(s32
      percpu_counter_batch, INT_MAX) to just percpu_counter_batch at
      least on x86 and PowerPC.  So there is no need to add an #ifdef.
      
      On a 64 thread PowerPC box with CONFIG_VIRT_CPU_ACCOUNTING and
      CONFIG_CGROUP_CPUACCT enabled, a context switch microbenchmark
      is 234x faster and almost matches a CONFIG_CGROUP_CPUACCT
      disabled kernel:
      
       CONFIG_CGROUP_CPUACCT disabled:   16906698 ctx switches/sec
       CONFIG_CGROUP_CPUACCT enabled:       61720 ctx switches/sec
       CONFIG_CGROUP_CPUACCT + patch:	   16663217 ctx switches/sec
      
      Tested with:
      
       wget http://ozlabs.org/~anton/junkcode/context_switch.c
       make context_switch
       for i in `seq 0 63`; do taskset -c $i ./context_switch & done
       vmstat 1
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
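The effect of the batch scaling in the cpuacct commit above can be reproduced with a toy model of a batched per-cpu counter. This is a sketch, not the kernel implementation: the struct, the function names, the base batch of 32, and the value 1000000 for cputime_one_jiffy are all assumptions chosen for illustration. The only parts taken from the commit message are the scaling of the batch by cputime_one_jiffy and the cap at INT_MAX.

```c
#include <assert.h>
#include <limits.h>

/* Batch scaled by cputime_one_jiffy and capped at INT_MAX. */
int cpuacct_batch(int percpu_counter_batch, long long cputime_one_jiffy)
{
    assert(percpu_counter_batch > 0 && cputime_one_jiffy > 0);
    if (cputime_one_jiffy > INT_MAX / percpu_counter_batch)
        return INT_MAX;                  /* product would overflow an int */
    return (int)(percpu_counter_batch * cputime_one_jiffy);
}

/* Toy counter: one local delta plus a shared total. */
struct toy_counter {
    long long global;        /* in the kernel, protected by a spinlock */
    long long local;         /* per-cpu delta not yet folded in */
    int global_updates;      /* how often the locked path was taken */
};

void toy_counter_add(struct toy_counter *c, long long amount, int batch)
{
    long long count = c->local + amount;

    if (count >= batch || count <= -(long long)batch) {
        c->global += count;  /* slow path: would take the global lock */
        c->local = 0;
        c->global_updates++;
    } else {
        c->local = count;    /* fast path: stays cpu-local */
    }
}

/* Number of slow-path updates for n_adds additions of 'amount'. */
int count_global_updates(int batch, long long amount, int n_adds)
{
    struct toy_counter c = {0, 0, 0};

    for (int i = 0; i < n_adds; i++)
        toy_counter_add(&c, amount, batch);
    return c.global_updates;
}
```

With an unscaled batch of 32 and additions of 1000000 each, every one of 100 additions takes the slow path. With the batch scaled to 32000000, only every 32nd addition does, so the same 100 additions touch the shared total 3 times.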
    • kernel/sched.c: Suppress unused var warning · 50200df4
      Committed by Andrew Morton
      On UP:
      
       kernel/sched.c: In function 'wake_up_new_task':
       kernel/sched.c:2631: warning: unused variable 'cpu'
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  13. 04 Feb, 2010 1 commit
  14. 02 Feb, 2010 1 commit
  15. 23 Jan, 2010 2 commits
  16. 22 Jan, 2010 1 commit
    • sched: Fix fork vs hotplug vs cpuset namespaces · fabf318e
      Committed by Peter Zijlstra
      There are a number of issues:
      
      1) TASK_WAKING vs cgroup_clone (cpusets)
      
      copy_process():
      
        sched_fork()
          child->state = TASK_WAKING; /* waiting for wake_up_new_task() */
        if (current->nsproxy != p->nsproxy)
           ns_cgroup_clone()
             cgroup_clone()
               mutex_lock(inode->i_mutex)
               mutex_lock(cgroup_mutex)
               cgroup_attach_task()
      	   ss->can_attach()
                 ss->attach() [ -> cpuset_attach() ]
                   cpuset_attach_task()
                     set_cpus_allowed_ptr();
                       while (child->state == TASK_WAKING)
                         cpu_relax();
      will deadlock the system.
      
      
      2) cgroup_clone (cpusets) vs copy_process
      
      So even if the above would work we still have:
      
      copy_process():
      
        if (current->nsproxy != p->nsproxy)
           ns_cgroup_clone()
             cgroup_clone()
               mutex_lock(inode->i_mutex)
               mutex_lock(cgroup_mutex)
               cgroup_attach_task()
      	   ss->can_attach()
                 ss->attach() [ -> cpuset_attach() ]
                   cpuset_attach_task()
                     set_cpus_allowed_ptr();
        ...
      
        p->cpus_allowed = current->cpus_allowed
      
      over-writing the modified cpus_allowed.
      
      
      3) fork() vs hotplug
      
        if we unplug the child's cpu after the sanity check when the child
        gets attached to the task_list but before wake_up_new_task() shit
        will meet with fan.
      
      Solve all these issues by moving fork cpu selection into
      wake_up_new_task().
      Reported-by: Serge E. Hallyn <serue@us.ibm.com>
      Tested-by: Serge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1264106190.4283.1314.camel@laptop>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  17. 21 Jan, 2010 4 commits
  18. 13 Jan, 2010 1 commit
  19. 28 Dec, 2009 1 commit