1. 17 Oct 2007, 7 commits
    • migration_call(CPU_DEAD): use spin_lock_irq() instead of task_rq_lock() · d2da272a
      Oleg Nesterov authored
      Change migration_call(CPU_DEAD) to use a direct spin_lock_irq() instead of
      task_rq_lock(rq->idle); rq->idle can't change its task_rq().
      
      This makes the code a bit more symmetrical with migrate_dead_tasks()'s path
      which uses spin_lock_irq/spin_unlock_irq.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Cliff Wickman <cpw@sgi.com>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d2da272a
    • do CPU_DEAD migrating under read_lock(tasklist) instead of write_lock_irq(tasklist) · f7b4cddc
      Oleg Nesterov authored
      Currently move_task_off_dead_cpu() is called under
      write_lock_irq(tasklist).  This means it can't use task_lock(), which is
      needed to improve migration to take the task's ->cpuset into account.
      
      Change the code to call move_task_off_dead_cpu() with irqs enabled, and
      change migrate_live_tasks() to use read_lock(tasklist).
      
      This is all preparation for the further changes proposed by Cliff Wickman; see
      	http://marc.info/?t=117327786100003
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Cliff Wickman <cpw@sgi.com>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f7b4cddc
    • sched: fix new task startup crash · b9dca1e0
      Srivatsa Vaddagiri authored
      A child task may be placed on a different cpu than the one on which
      its parent is running.  In that case, task_new_fair() should check
      whether the newborn task's parent entity should also be added to that
      cfs_rq.
      
      The patch below fixes the problem in task_new_fair().
      
      This could fix the reported put_prev_task_fair() crashes.
      Reported-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
      Reported-by: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b9dca1e0
    • sched: fix improper load balance across sched domain · 908a7c1b
      Ken Chen authored
      We recently discovered a nasty performance bug in the kernel CPU load
      balancer in which we were hit by a 50% performance regression.
      
      When tasks are assigned via cpu affinity to a subset of CPUs that
      spans sched_domains (either a ccNUMA node or the new multi-core
      domain), the kernel fails to perform proper load balancing at these
      domains, because several pieces of logic in find_busiest_group()
      misidentify the busiest sched group within a given domain.  This
      leads to inadequate load balancing and causes the 50% performance hit.
      
      To give a concrete example: on a dual-core, 2-socket NUMA system,
      there are 4 logical cpus, organized as:
      
      CPU0 attaching sched-domain:
       domain 0: span 0003  groups: 0001 0002
       domain 1: span 000f  groups: 0003 000c
      CPU1 attaching sched-domain:
       domain 0: span 0003  groups: 0002 0001
       domain 1: span 000f  groups: 0003 000c
      CPU2 attaching sched-domain:
       domain 0: span 000c  groups: 0004 0008
       domain 1: span 000f  groups: 000c 0003
      CPU3 attaching sched-domain:
       domain 0: span 000c  groups: 0008 0004
       domain 1: span 000f  groups: 000c 0003
      
      If I run 2 tasks with CPU affinity set to 0x5, there are situations
      where cpu0 has a run queue length of 2 while cpu2 sits idle.  The
      kernel load balancer is unable to balance these two tasks across
      cpu0 and cpu2 because at least three pieces of logic in
      find_busiest_group() heavily bias load balancing towards power-saving
      mode.  For example, while determining the "busiest" variable, the
      kernel only sets it when "sum_nr_running > group_capacity".  This
      test is flawed because "sum_nr_running" is not necessarily the same
      as the number of tasks allowed to run within the sched group.  The
      end result is that the kernel "thinks" everything is balanced, but in
      reality we have an imbalance, which leaves one CPU over-subscribed
      and another idle.  Two other pieces of logic in the same function
      cause a similar effect.  The nastiness of this bug is that the kernel
      cannot get unstuck from this broken state: in our environment the
      kernel stays imbalanced for extended periods of time, and it is also
      very easy to get it into that state (it's pretty much 100%
      reproducible for us).
      
      The proposed fix: add additional logic in find_busiest_group() to
      detect intrinsic imbalance within the busiest group.  When such a
      condition is detected, load balancing goes into spread mode instead
      of the default grouping mode.
      Signed-off-by: Ken Chen <kenchen@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      908a7c1b
    • sched: more robust sd-sysctl entry freeing · cd790076
      Milton Miller authored
      It occurred to me this morning that the procname field was dynamically
      allocated and needed to be freed.  I started to put in break statements
      when allocation failed, but it was approaching 50% error-handling code.
      
      I came up with this alternative: loop while entry->mode is set and
      check proc_handler instead of ->table.  Alternatively, the string
      versions of the domain name and cpu number could be stored in the
      structs.
      
      I verified by compiling with CONFIG_DEBUG_SLAB and checking the
      allocation counts after taking a cpuset exclusive and back.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      cd790076
    • cpuset: remove sched domain hooks from cpusets · 607717a6
      Paul Jackson authored
      Remove the cpuset hooks that defined sched domains depending on the setting
      of the 'cpu_exclusive' flag.
      
      The cpu_exclusive flag can only be set on a child if it is set on the
      parent.
      
      This made that flag painfully unsuitable for use as a flag defining a
      partitioning of a system.
      
      It was entirely unobvious to a cpuset user what partitioning of sched
      domains they would be causing when they set that one cpu_exclusive bit on
      one cpuset, because it depended on what CPUs were in the remainder of that
      cpuset's siblings and child cpusets, after subtracting out other
      cpu_exclusive cpusets.
      
      Furthermore, there was no way on production systems to query the
      result.
      
      Using the cpu_exclusive flag for this was simply wrong from the get go.
      
      Fortunately, it was sufficiently borked that, so far as I know, almost
      no successful use has been made of it.  One real-time group did use it
      to effectively isolate CPUs from any load-balancing efforts.  They are
      willing to adapt to alternative mechanisms for this, such as some way
      to manipulate the list of isolated CPUs on a running system.  They can
      do without the present cpu_exclusive-based mechanism while we develop
      an alternative.
      
      There is a real risk, to the best of my understanding, of users
      accidentally setting up partitioned scheduler domains, inhibiting
      desired load balancing across all their CPUs, due to the nonobvious
      (from the cpuset perspective) side effects of the cpu_exclusive flag.
      
      Furthermore, since there was no way on a running system to see what
      one was doing with sched domains, this change will be invisible to any
      code using them.  Unless they have real insight into the scheduler's
      load-balancing choices, users will be unable to detect that this
      change has been made in the kernel's behaviour.
      
      Initial discussion on lkml of this patch has generated much comment.
      My (probably controversial) take on that discussion is that it has
      reached a rough consensus that the current cpuset cpu_exclusive
      mechanism for defining sched domains is borked.  There is no consensus
      on the replacement.  But since we can remove this mechanism, and since
      its continued presence risks causing unwanted partitioning of the
      scheduler's load balancing, we should remove it while we can, as we
      proceed to work on the replacement scheduler domain mechanisms.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Dinakar Guniguntala <dino@in.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      607717a6
    • Convert cpu_sibling_map to be a per cpu variable · d5a7430d
      Mike Travis authored
      Convert cpu_sibling_map from a static array sized by NR_CPUS to a
      per_cpu variable.  This saves sizeof(cpumask_t) bytes for every unused
      cpu.  Access is mostly from startup and CPU hotplug functions.
      Signed-off-by: Mike Travis <travis@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d5a7430d
  2. 15 Oct 2007, 33 commits