1. 28 6月, 2006 15 次提交
    • S
      [PATCH] sched: mc/smt power savings sched policy · 5c45bf27
      Siddha, Suresh B 提交于
      sysfs entries 'sched_mc_power_savings' and 'sched_smt_power_savings' in
      /sys/devices/system/cpu/ control the MC/SMT power savings policy for the
      scheduler.
      
      Based on the values (1-enable, 0-disable) for these controls, sched groups
      cpu power will be determined for different domains.  When power savings
      policy is enabled and under light load conditions, scheduler will minimize
      the physical packages/cpu cores carrying the load and thus conserving
      power(with a perf impact based on the workload characteristics...  see OLS
      2005 CMP kernel scheduler paper for more details..)
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Con Kolivas <kernel@kolivas.org>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      5c45bf27
    • S
      [PATCH] sched_domai: Allocate sched_group structures dynamically · 36938169
      Srivatsa Vaddagiri 提交于
      As explained here:
      	http://marc.theaimsgroup.com/?l=linux-kernel&m=114327539012323&w=2
      
      there is a problem with sharing sched_group structures between two
      separate sched_group structures for different sched_domains.
      
      The patch has been tested and found to avoid the kernel lockup problem
      described in above URL.
      Signed-off-by: NSrivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      36938169
    • S
      [PATCH] sched_domai: Use kmalloc_node · 15f0b676
      Srivatsa Vaddagiri 提交于
      The sched group structures used to represent various nodes need to be
      allocated from respective nodes (as suggested here also:
      
      	http://uwsg.ucs.indiana.edu/hypermail/linux/kernel/0603.3/0051.html)
      Signed-off-by: NSrivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      15f0b676
    • S
      [PATCH] sched_domai: Don't use GFP_ATOMIC · d3a5aa98
      Srivatsa Vaddagiri 提交于
      Replace GFP_ATOMIC allocation for sched_group_nodes with GFP_KERNEL based
      allocation.
      
      Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      d3a5aa98
    • S
      [PATCH] sched_domain: handle kmalloc failure · 51888ca2
      Srivatsa Vaddagiri 提交于
      Try to handle mem allocation failures in build_sched_domains by bailing out
      and cleaning up thus-far allocated memory.  The patch has a direct consequence
      that we disable load balancing completely (even at sibling level) upon *any*
      memory allocation failure.
      
      [Lee.Schermerhorn@hp.com: bugfix]
      Signed-off-by: NSrivatsa Vaddagir <vatsa@in.ibm.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      51888ca2
    • P
      [PATCH] sched: Avoid unnecessarily moving highest priority task move_tasks() · 615052dc
      Peter Williams 提交于
      Problem:
      
      To help distribute high priority tasks evenly across the available CPUs
      move_tasks() does not, under some circumstances, skip tasks whose load
      weight is bigger than the designated amount.  Because the highest priority
      task on the busiest queue may be on the expired array it may be moved as a
      result of this mechanism.  Apart from not being the most desirable way to
      redistribute the high priority tasks (we'd rather move the second highest
      priority task), there is a risk that this could set up a loop with this
      task bouncing backwards and forwards between the two queues.  (This latter
      possibility can be demonstrated by running a nice==-20 CPU bound task on an
      otherwise quiet 2 CPU system.)
      
      Solution:
      
      Modify the mechanism so that it does not override skip for the highest
      priority task on the CPU.  Of course, if there are more than one tasks at
      the highest priority then it will allow the override for one of them as
      this is a desirable redistribution of high priority tasks.
      Signed-off-by: NPeter Williams <pwil3058@bigpond.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      615052dc
    • P
      [PATCH] sched: modify move_tasks() to improve load balancing outcomes · 50ddd969
      Peter Williams 提交于
      Problem:
      
      The move_tasks() function is designed to move UP TO the amount of load it
      is asked to move and in doing this it skips over tasks looking for ones
      whose load weights are less than or equal to the remaining load to be
      moved.  This is (in general) a good thing but it has the unfortunate result
      of breaking one of the original load balancer's good points: namely, that
      (within the limits imposed by the active/expired array model and the fact
      the expired is processed first) it moves high priority tasks before low
      priority ones and this means there's a good chance (see active/expired
      problem for why it's only a chance) that the highest priority task on the
      queue but not actually on the CPU will be moved to the other CPU where (as
      a high priority task) it may preempt the current task.
      
      Solution:
      
      Modify move_tasks() so that high priority tasks are not skipped when moving
      them will make them the highest priority task on their new run queue.
      Signed-off-by: NPeter Williams <pwil3058@bigpond.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      50ddd969
    • P
      [PATCH] sched: implement smpnice · 2dd73a4f
      Peter Williams 提交于
      Problem:
      
      The introduction of separate run queues per CPU has brought with it "nice"
      enforcement problems that are best described by a simple example.
      
      For the sake of argument suppose that on a single CPU machine with a
      nice==19 hard spinner and a nice==0 hard spinner running that the nice==0
      task gets 95% of the CPU and the nice==19 task gets 5% of the CPU.  Now
      suppose that there is a system with 2 CPUs and 2 nice==19 hard spinners and
      2 nice==0 hard spinners running.  The user of this system would be entitled
      to expect that the nice==0 tasks each get 95% of a CPU and the nice==19
      tasks only get 5% each.  However, whether this expectation is met is pretty
      much down to luck as there are four equally likely distributions of the
      tasks to the CPUs that the load balancing code will consider to be balanced
      with loads of 2.0 for each CPU.  Two of these distributions involve one
      nice==0 and one nice==19 task per CPU and in these circumstances the users
      expectations will be met.  The other two distributions both involve both
      nice==0 tasks being on one CPU and both nice==19 being on the other CPU and
      each task will get 50% of a CPU and the user's expectations will not be
      met.
      
      Solution:
      
      The solution to this problem that is implemented in the attached patch is
      to use weighted loads when determining if the system is balanced and, when
      an imbalance is detected, to move an amount of weighted load between run
      queues (as opposed to a number of tasks) to restore the balance.  Once
      again, the easiest way to explain why both of these measures are necessary
      is to use a simple example.  Suppose that (in a slight variation of the
      above example) that we have a two CPU system with 4 nice==0 and 4 nice=19
      hard spinning tasks running and that the 4 nice==0 tasks are on one CPU and
      the 4 nice==19 tasks are on the other CPU.  The weighted loads for the two
      CPUs would be 4.0 and 0.2 respectively and the load balancing code would
      move 2 tasks resulting in one CPU with a load of 2.0 and the other with
      load of 2.2.  If this was considered to be a big enough imbalance to
      justify moving a task and that task was moved using the current
      move_tasks() then it would move the highest priority task that it found and
      this would result in one CPU with a load of 3.0 and the other with a load
      of 1.2 which would result in the movement of a task in the opposite
      direction and so on -- infinite loop.  If, on the other hand, an amount of
      load to be moved is calculated from the imbalance (in this case 0.1) and
      move_tasks() skips tasks until it find ones whose contributions to the
      weighted load are less than this amount it would move two of the nice==19
      tasks resulting in a system with 2 nice==0 and 2 nice=19 on each CPU with
      loads of 2.1 for each CPU.
      
      One of the advantages of this mechanism is that on a system where all tasks
      have nice==0 the load balancing calculations would be mathematically
      identical to the current load balancing code.
      
      Notes:
      
      struct task_struct:
      
      has a new field load_weight which (in a trade off of space for speed)
      stores the contribution that this task makes to a CPU's weighted load when
      it is runnable.
      
      struct runqueue:
      
      has a new field raw_weighted_load which is the sum of the load_weight
      values for the currently runnable tasks on this run queue.  This field
      always needs to be updated when nr_running is updated so two new inline
      functions inc_nr_running() and dec_nr_running() have been created to make
      sure that this happens.  This also offers a convenient way to optimize away
      this part of the smpnice mechanism when CONFIG_SMP is not defined.
      
      int try_to_wake_up():
      
      in this function the value SCHED_LOAD_BALANCE is used to represent the load
      contribution of a single task in various calculations in the code that
      decides which CPU to put the waking task on.  While this would be a valid
      on a system where the nice values for the runnable tasks were distributed
      evenly around zero it will lead to anomalous load balancing if the
      distribution is skewed in either direction.  To overcome this problem
      SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
      or by the average load_weight per task for the queue in question (as
      appropriate).
      
      int move_tasks():
      
      The modifications to this function were complicated by the fact that
      active_load_balance() uses it to move exactly one task without checking
      whether an imbalance actually exists.  This precluded the simple
      overloading of max_nr_move with max_load_move and necessitated the addition
      of the latter as an extra argument to the function.  The internal
      implementation is then modified to move up to max_nr_move tasks and
      max_load_move of weighted load.  This slightly complicates the code where
      move_tasks() is called and if ever active_load_balance() is changed to not
      use move_tasks() the implementation of move_tasks() should be simplified
      accordingly.
      
      struct sched_group *find_busiest_group():
      
      Similar to try_to_wake_up(), there are places in this function where
      SCHED_LOAD_SCALE is used to represent the load contribution of a single
      task and the same issues are created.  A similar solution is adopted except
      that it is now the average per task contribution to a group's load (as
      opposed to a run queue) that is required.  As this value is not directly
      available from the group it is calculated on the fly as the queues in the
      groups are visited when determining the busiest group.
      
      A key change to this function is that it is no longer to scale down
      *imbalance on exit as move_tasks() uses the load in its scaled form.
      
      void set_user_nice():
      
      has been modified to update the task's load_weight field when it's nice
      value and also to ensure that its run queue's raw_weighted_load field is
      updated if it was runnable.
      
      From: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      
      With smpnice, sched groups with highest priority tasks can mask the imbalance
      between the other sched groups with in the same domain.  This patch fixes some
      of the listed down scenarios by not considering the sched groups which are
      lightly loaded.
      
      a) on a simple 4-way MP system, if we have one high priority and 4 normal
         priority tasks, with smpnice we would like to see the high priority task
         scheduled on one cpu, two other cpus getting one normal task each and the
         fourth cpu getting the remaining two normal tasks.  but with current
         smpnice extra normal priority task keeps jumping from one cpu to another
         cpu having the normal priority task.  This is because of the
         busiest_has_loaded_cpus, nr_loaded_cpus logic..  We are not including the
         cpu with high priority task in max_load calculations but including that in
         total and avg_load calcuations..  leading to max_load < avg_load and load
         balance between cpus running normal priority tasks(2 Vs 1) will always show
         imbalanace as one normal priority and the extra normal priority task will
         keep moving from one cpu to another cpu having normal priority task..
      
      b) 4-way system with HT (8 logical processors).  Package-P0 T0 has a
         highest priority task, T1 is idle.  Package-P1 Both T0 and T1 have 1 normal
         priority task each..  P2 and P3 are idle.  With this patch, one of the
         normal priority tasks on P1 will be moved to P2 or P3..
      
      c) With the current weighted smp nice calculations, it doesn't always make
         sense to look at the highest weighted runqueue in the busy group..
         Consider a load balance scenario on a DP with HT system, with Package-0
         containing one high priority and one low priority, Package-1 containing one
         low priority(with other thread being idle)..  Package-1 thinks that it need
         to take the low priority thread from Package-0.  And find_busiest_queue()
         returns the cpu thread with highest priority task..  And ultimately(with
         help of active load balance) we move high priority task to Package-1.  And
         same continues with Package-0 now, moving high priority task from package-1
         to package-0..  Even without the presence of active load balance, load
         balance will fail to balance the above scenario..  Fix find_busiest_queue
         to use "imbalance" when it is lightly loaded.
      
      [kernel@kolivas.org: sched: store weighted load on up]
      [kernel@kolivas.org: sched: add discrete weighted cpu load function]
      [suresh.b.siddha@intel.com: sched: remove dead code]
      Signed-off-by: NPeter Williams <pwil3058@bigpond.com.au>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NCon Kolivas <kernel@kolivas.org>
      Cc: John Hawkes <hawkes@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      2dd73a4f
    • K
      [PATCH] sched: CPU hotplug race vs. set_cpus_allowed() · efc30814
      Kirill Korotaev 提交于
      There is a race between set_cpus_allowed() and move_task_off_dead_cpu().
      __migrate_task() doesn't report any err code, so task can be left on its
      runqueue if its cpus_allowed mask changed so that dest_cpu is not longer a
      possible target.  Also, chaning cpus_allowed mask requires rq->lock being
      held.
      Signed-off-by: NKirill Korotaev <dev@openvz.org>
      Acked-By: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      efc30814
    • S
      [PATCH] unnecessary long index i in sched · cc94abfc
      Steven Rostedt 提交于
      Unless we expect to have more than 2G CPUs, there's no reason to have 'i'
      as a long long here.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      cc94abfc
    • C
      [PATCH] sched: fix interactive ceiling code · 72d2854d
      Con Kolivas 提交于
      The relationship between INTERACTIVE_SLEEP and the ceiling is not perfect
      and not explicit enough.  The sleep boost is not supposed to be any larger
      than without this code and the comment is not clear enough about what
      exactly it does, just the reason it does it.  Fix it.
      
      There is a ceiling to the priority beyond which tasks that only ever sleep
      for very long periods cannot surpass.  Fix it.
      
      Prevent the on-runqueue bonus logic from defeating the idle sleep logic.
      
      Opportunity to micro-optimise.
      Signed-off-by: NCon Kolivas <kernel@kolivas.org>
      Signed-off-by: NMike Galbraith <efault@gmx.de>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NKen Chen <kenneth.w.chen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      72d2854d
    • S
      d444886e
    • C
      [PATCH] sched: fix smt nice lock contention and optimization · c96d145e
      Chen, Kenneth W 提交于
      Initial report and lock contention fix from Chris Mason:
      
      Recent benchmarks showed some performance regressions between 2.6.16 and
      2.6.5.  We tracked down one of the regressions to lock contention in
      schedule heavy workloads (~70,000 context switches per second)
      
      kernel/sched.c:dependent_sleeper() was responsible for most of the lock
      contention, hammering on the run queue locks.  The patch below is more of a
      discussion point than a suggested fix (although it does reduce lock
      contention significantly).  The dependent_sleeper code looks very expensive
      to me, especially for using a spinlock to bounce control between two
      different siblings in the same cpu.
      
      It is further optimized:
      
      * perform dependent_sleeper check after next task is determined
      * convert wake_sleeping_dependent to use trylock
      * skip smt runqueue check if trylock fails
      * optimize double_rq_lock now that smt nice is converted to trylock
      * early exit in searching first SD_SHARE_CPUPOWER domain
      * speedup fast path of dependent_sleeper
      
      [akpm@osdl.org: cleanup]
      Signed-off-by: NKen Chen <kenneth.w.chen@intel.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Acked-by: NCon Kolivas <kernel@kolivas.org>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Acked-by: NChris Mason <mason@suse.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c96d145e
    • C
      [PATCH] cpu hotplug: make cpu_notifier related notifier calls __cpuinit only · 26c2143b
      Chandra Seetharaman 提交于
      Make notifier_calls associated with cpu_notifier as __cpuinit.
      
      __cpuinit makes sure that the function is init time only unless
      CONFIG_HOTPLUG_CPU is defined.
      
      [akpm@osdl.org: section fix]
      Signed-off-by: NChandra Seetharaman <sekharan@us.ibm.com>
      Cc: Ashok Raj <ashok.raj@intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      26c2143b
    • C
      [PATCH] cpu hotplug: revert initdata patch submitted for 2.6.17 · 054cc8a2
      Chandra Seetharaman 提交于
      This patch reverts notifier_block changes made in 2.6.17
      Signed-off-by: NChandra Seetharaman <sekharan@us.ibm.com>
      Cc: Ashok Raj <ashok.raj@intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      054cc8a2
  2. 27 6月, 2006 2 次提交
    • A
      [PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status · 495ab9c0
      Andi Kleen 提交于
      During some profiling I noticed that default_idle causes a lot of
      memory traffic. I think that is caused by the atomic operations
      to clear/set the polling flag in thread_info. There is actually
      no reason to make this atomic - only the idle thread does it
      to itself, other CPUs only read it. So I moved it into ti->status.
      
      Converted i386/x86-64/ia64 for now because that was the easiest
      way to fix ACPI which also manipulates these flags in its idle
      function.
      
      Cc: Nick Piggin <npiggin@novell.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Len Brown <len.brown@intel.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      495ab9c0
    • P
      [PATCH] sched: fix SCHED_FIFO bug in sys_sched_rr_get_interval() · b78709cf
      Peter Williams 提交于
      The introduction of SCHED_BATCH scheduling class with a value of 3 means
      that the expression (p->policy & SCHED_FIFO) will return true if policy
      is SCHED_BATCH or SCHED_FIFO.
      
      Unfortunately, this expression is used in sys_sched_rr_get_interval()
      and in the absence of a comment to say that this is intentional I
      presume that it is unintentional and erroneous.
      
      The fix is to change the expression to (p->policy == SCHED_FIFO).
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b78709cf
  3. 26 6月, 2006 2 次提交
  4. 23 6月, 2006 2 次提交
  5. 22 5月, 2006 1 次提交
  6. 26 4月, 2006 1 次提交
  7. 11 4月, 2006 2 次提交
  8. 01 4月, 2006 7 次提交
  9. 29 3月, 2006 1 次提交
  10. 28 3月, 2006 4 次提交
    • S
      [PATCH] sched: fix group power for allnodes_domains · 08069033
      Siddha, Suresh B 提交于
      Current sched groups power calculation for allnodes_domains is wrong.  We
      should really be using cumulative power of the physical packages in that
      group (similar to the calculation in node_domains)
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      08069033
    • S
      [PATCH] sched: new sched domain for representing multi-core · 1e9f28fa
      Siddha, Suresh B 提交于
      Add a new sched domain for representing multi-core with shared caches
      between cores.  Consider a dual package system, each package containing two
      cores and with last level cache shared between cores with in a package.  If
      there are two runnable processes, with this appended patch those two
      processes will be scheduled on different packages.
      
      On such systems, with this patch we have observed 8% perf improvement with
      specJBB(2 warehouse) benchmark and 35% improvement with CFP2000 rate(with 2
      users).
      
      This new domain will come into play only on multi-core systems with shared
      caches.  On other systems, this sched domain will be removed by domain
      degeneration code.  This new domain can be also used for implementing power
      savings policy (see OLS 2005 CMP kernel scheduler paper for more details..
      I will post another patch for power savings policy soon)
      
      Most of the arch/* file changes are for cpu_coregroup_map() implementation.
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      1e9f28fa
    • A
      [PATCH] Small schedule() optimization · 77e4bfbc
      Andreas Mohr 提交于
      small schedule() microoptimization.
      Signed-off-by: NAndreas Mohr <andi@lisas.de>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      77e4bfbc
    • M
      [PATCH] sched: fix task interactivity calculation · 013d3868
      Martin Andersson 提交于
      Is a truncation error in kernel/sched.c triggered when the nice value is
      negative.  The affected code is used in the TASK_INTERACTIVE macro.
      
      The code is:
      #define SCALE(v1,v1_max,v2_max) \
      	(v1) * (v2_max) / (v1_max)
      
      which is used in this way:
      SCALE(TASK_NICE(p), 40, MAX_BONUS)
      
      Comments in the code says:
        * This part scales the interactivity limit depending on niceness.
        *
        * We scale it linearly, offset by the INTERACTIVE_DELTA delta.
        * Here are a few examples of different nice levels:
        *
        *  TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0]
        *  TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0]
        *  TASK_INTERACTIVE(  0): [1,1,1,1,0,0,0,0,0,0,0]
        *  TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0]
        *  TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0]
        *
        * (the X axis represents the possible -5 ... 0 ... +5 dynamic
        *  priority range a task can explore, a value of '1' means the
        *  task is rated interactive.)
      
      However, the current code does not scale it linearly and the result differs
      from the given examples.  If the mathematical function "floor" is used when
      the nice value is negative instead of the truncation one gets when using
      integer division, the result conforms to the documentation.
      
      Output of TASK_INTERACTIVE when using the kernel code:
      nice    dynamic priorities
      -20     1     1     1     1     1     1     1     1     1     0     0
      -19     1     1     1     1     1     1     1     1     0     0     0
      -18     1     1     1     1     1     1     1     1     0     0     0
      -17     1     1     1     1     1     1     1     1     0     0     0
      -16     1     1     1     1     1     1     1     1     0     0     0
      -15     1     1     1     1     1     1     1     0     0     0     0
      -14     1     1     1     1     1     1     1     0     0     0     0
      -13     1     1     1     1     1     1     1     0     0     0     0
      -12     1     1     1     1     1     1     1     0     0     0     0
      -11     1     1     1     1     1     1     0     0     0     0     0
      -10     1     1     1     1     1     1     0     0     0     0     0
        -9     1     1     1     1     1     1     0     0     0     0     0
        -8     1     1     1     1     1     1     0     0     0     0     0
        -7     1     1     1     1     1     0     0     0     0     0     0
        -6     1     1     1     1     1     0     0     0     0     0     0
        -5     1     1     1     1     1     0     0     0     0     0     0
        -4     1     1     1     1     1     0     0     0     0     0     0
        -3     1     1     1     1     0     0     0     0     0     0     0
        -2     1     1     1     1     0     0     0     0     0     0     0
        -1     1     1     1     1     0     0     0     0     0     0     0
        0      1     1     1     1     0     0     0     0     0     0     0
        1      1     1     1     1     0     0     0     0     0     0     0
        2      1     1     1     1     0     0     0     0     0     0     0
        3      1     1     1     1     0     0     0     0     0     0     0
        4      1     1     1     0     0     0     0     0     0     0     0
        5      1     1     1     0     0     0     0     0     0     0     0
        6      1     1     1     0     0     0     0     0     0     0     0
        7      1     1     1     0     0     0     0     0     0     0     0
        8      1     1     0     0     0     0     0     0     0     0     0
        9      1     1     0     0     0     0     0     0     0     0     0
      10      1     1     0     0     0     0     0     0     0     0     0
      11      1     1     0     0     0     0     0     0     0     0     0
      12      1     0     0     0     0     0     0     0     0     0     0
      13      1     0     0     0     0     0     0     0     0     0     0
      14      1     0     0     0     0     0     0     0     0     0     0
      15      1     0     0     0     0     0     0     0     0     0     0
      16      0     0     0     0     0     0     0     0     0     0     0
      17      0     0     0     0     0     0     0     0     0     0     0
      18      0     0     0     0     0     0     0     0     0     0     0
      19      0     0     0     0     0     0     0     0     0     0     0
      
      Output of TASK_INTERACTIVE when using "floor"
      nice    dynamic priorities
      -20     1     1     1     1     1     1     1     1     1     0     0
      -19     1     1     1     1     1     1     1     1     1     0     0
      -18     1     1     1     1     1     1     1     1     1     0     0
      -17     1     1     1     1     1     1     1     1     1     0     0
      -16     1     1     1     1     1     1     1     1     0     0     0
      -15     1     1     1     1     1     1     1     1     0     0     0
      -14     1     1     1     1     1     1     1     1     0     0     0
      -13     1     1     1     1     1     1     1     1     0     0     0
      -12     1     1     1     1     1     1     1     0     0     0     0
      -11     1     1     1     1     1     1     1     0     0     0     0
      -10     1     1     1     1     1     1     1     0     0     0     0
        -9     1     1     1     1     1     1     1     0     0     0     0
        -8     1     1     1     1     1     1     0     0     0     0     0
        -7     1     1     1     1     1     1     0     0     0     0     0
        -6     1     1     1     1     1     1     0     0     0     0     0
        -5     1     1     1     1     1     1     0     0     0     0     0
        -4     1     1     1     1     1     0     0     0     0     0     0
        -3     1     1     1     1     1     0     0     0     0     0     0
        -2     1     1     1     1     1     0     0     0     0     0     0
        -1     1     1     1     1     1     0     0     0     0     0     0
         0     1     1     1     1     0     0     0     0     0     0     0
         1     1     1     1     1     0     0     0     0     0     0     0
         2     1     1     1     1     0     0     0     0     0     0     0
         3     1     1     1     1     0     0     0     0     0     0     0
         4     1     1     1     0     0     0     0     0     0     0     0
         5     1     1     1     0     0     0     0     0     0     0     0
         6     1     1     1     0     0     0     0     0     0     0     0
         7     1     1     1     0     0     0     0     0     0     0     0
         8     1     1     0     0     0     0     0     0     0     0     0
         9     1     1     0     0     0     0     0     0     0     0     0
        10     1     1     0     0     0     0     0     0     0     0     0
        11     1     1     0     0     0     0     0     0     0     0     0
        12     1     0     0     0     0     0     0     0     0     0     0
        13     1     0     0     0     0     0     0     0     0     0     0
        14     1     0     0     0     0     0     0     0     0     0     0
        15     1     0     0     0     0     0     0     0     0     0     0
        16     0     0     0     0     0     0     0     0     0     0     0
        17     0     0     0     0     0     0     0     0     0     0     0
        18     0     0     0     0     0     0     0     0     0     0     0
        19     0     0     0     0     0     0     0     0     0     0     0
      Signed-off-by: NMartin Andersson <martin.andersson@control.lth.se>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Cc: Con Kolivas <kernel@kolivas.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      013d3868
  11. 27 3月, 2006 1 次提交
    • B
      [PATCH] kretprobe instance recycled by parent process · c6fd91f0
      bibo mao 提交于
      When kretprobe probes the schedule() function, if the probed process exits
      then schedule() will never return, so some kretprobe instances will never
      be recycled.
      
      In this patch the parent process will recycle retprobe instances of the
      probed function and there will be no memory leak of kretprobe instances.
      Signed-off-by: Nbibo mao <bibo.mao@intel.com>
      Cc: Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp>
      Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c6fd91f0
  12. 23 3月, 2006 2 次提交
    • I
      [PATCH] make bug messages more consistent · 91368d73
      Ingo Molnar 提交于
      Consolidate all kernel bug printouts to begin with the "BUG: " string.
      Makes it easier to find them in large bootup logs.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      91368d73
    • A
      [PATCH] fix scheduler deadlock · e9028b0f
      Anton Blanchard 提交于
      We have noticed lockups during boot when stress testing kexec on ppc64.
      Two cpus would deadlock in scheduler code trying to grab already taken
      spinlocks.
      
      The double_rq_lock code uses the address of the runqueue to order the
      taking of multiple locks.  This address is a per cpu variable:
      
      	if (rq1 < rq2) {
      		spin_lock(&rq1->lock);
      		spin_lock(&rq2->lock);
      	} else {
      		spin_lock(&rq2->lock);
      		spin_lock(&rq1->lock);
      	}
      
      On the other hand, the code in wake_sleeping_dependent uses the cpu id
      order to grab locks:
      
      	for_each_cpu_mask(i, sibling_map)
      		spin_lock(&cpu_rq(i)->lock);
      
      This means we rely on the address of per cpu data increasing as cpu ids
      increase.  While this will be true for the generic percpu implementation it
      may not be true for arch specific implementations.
      
      One way to solve this is to always take runqueues in cpu id order. To do
      this we add a cpu variable to the runqueue and check it in the
      double runqueue locking functions.
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      e9028b0f