1. 06 Feb 2009 (1 commit)
    • wait: prevent exclusive waiter starvation · 777c6c5f
      By Johannes Weiner
      With exclusive waiters, every process woken up through the wait queue must
      ensure that the next waiter down the line is woken when it has finished.
      
      Interruptible waiters don't do that when aborting due to a signal.  And if
      an aborting waiter is concurrently woken up through the waitqueue, no one
      will ever wake up the next waiter.
      
      This has been observed with __wait_on_bit_lock() used by
      lock_page_killable(): the first contender on the queue was aborting when
      the actual lock holder woke it up concurrently.  The aborted contender
      didn't acquire the lock and therefore never did an unlock followed by
      waking up the next waiter.
      
      Add abort_exclusive_wait() which removes the process' wait descriptor from
      the waitqueue, iff still queued, or wakes up the next waiter otherwise.
      It does so under the waitqueue lock.  Racing with a wake up means the
      aborting process is either already woken (removed from the queue) and will
      wake up the next waiter, or it will remove itself from the queue and the
      concurrent wake up will apply to the next waiter after it.
      
      Use abort_exclusive_wait() in __wait_event_interruptible_exclusive() and
      __wait_on_bit_lock() when they were interrupted by means other than a
      wake up through the queue (a sketch of the helper follows this entry).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Reported-by: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Mentored-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Chuck Lever <cel@citi.umich.edu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>		["after some testing"]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      777c6c5f
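      A minimal sketch of the new helper along the lines described above,
      using 2.6.29-era waitqueue primitives; __wake_up_locked_key() is
      assumed here to be a wake-up variant that runs under the already-held
      q->lock. Treat this as an illustration, not the verbatim patch:

        /* kernel/wait.c style; needs <linux/wait.h> and <linux/sched.h> */
        void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
                                  unsigned int mode, void *key)
        {
                unsigned long flags;

                __set_current_state(TASK_RUNNING);
                spin_lock_irqsave(&q->lock, flags);
                if (!list_empty(&wait->task_list))
                        /* not woken yet: just dequeue ourselves */
                        list_del_init(&wait->task_list);
                else if (waitqueue_active(q))
                        /* a wake up already consumed our entry: pass it on */
                        __wake_up_locked_key(q, mode, key);
                spin_unlock_irqrestore(&q->lock, flags);
        }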
  2. 01 Feb 2009 (2 commits)
  3. 15 Jan 2009 (2 commits)
    • sched: SCHED_IDLE weight change · cce7ade8
      By Peter Zijlstra
      Increase the SCHED_IDLE weight from 2 to 3; this gives much more stable
      vruntime numbers (a sketch of the change follows this entry).
      
      time advanced in 100ms:

        weight=2:  64765.988352   67012.881408   88501.412352
        weight=3:  35496.181411   34130.971298   35497.411573
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      cce7ade8
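      A plausible shape of the change, assuming the SCHED_IDLE weight lives
      in the WEIGHT_IDLEPRIO/WMULT_IDLEPRIO constants of kernel/sched.c
      (hedged from recollection, not quoted from the patch):

        /* SCHED_IDLE entities get a fixed, tiny load weight */
        #define WEIGHT_IDLEPRIO         3               /* was 2 */
        #define WMULT_IDLEPRIO          1431655765      /* 2^32 / 3, the
                                                           inverse weight */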
    • sched: fix bandwidth validation for UID grouping · 98a4826b
      By Peter Zijlstra
      Impact: make rt-limit tunables work again
      
      Mark Glines reported:
      
      > I've got an issue on x86-64 where I can't configure the system to allow
      > RT tasks for a non-root user.
      >
      > In 2.6.26.5, I was able to do the following to set things up nicely:
      > echo 450000 >/sys/kernel/uids/0/cpu_rt_runtime
      > echo 450000 >/sys/kernel/uids/1000/cpu_rt_runtime
      >
      > Seems like every value I try to echo into the /sys files returns EINVAL.
      
      For UID grouping we initialize the root group with infinite bandwidth,
      which by default is more than the global limit, so the bandwidth check
      always fails.

      Because the root group is a phantom group (for UID grouping) we cannot
      adjust it at runtime; instead we let it reflect the global bandwidth
      settings (sketched after this entry).
      Reported-by: Mark Glines <mark@glines.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      98a4826b
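      A hedged sketch of the core of such a fix (not the verbatim patch):
      the hierarchical admission test reads the global limits for the
      phantom root group instead of its own infinite defaults.

        /* inside the per-group bandwidth check, tg_schedulable()-like code */
        if (tg == &root_task_group) {
                /* phantom root group under UID grouping: mirror the
                 * global sysctl knobs */
                period  = global_rt_period();
                runtime = global_rt_runtime();
        } else {
                period  = ktime_to_ns(tg->rt_bandwidth.rt_period);
                runtime = tg->rt_bandwidth.rt_runtime;
        }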
  4. 14 Jan 2009 (3 commits)
  5. 12 Jan 2009 (1 commit)
  6. 11 Jan 2009 (2 commits)
  7. 07 Jan 2009 (1 commit)
    • sched: fix possible recursive rq->lock · da8d5089
      By Peter Zijlstra
      Vaidyanathan Srinivasan reported:
      
       > =============================================
       > [ INFO: possible recursive locking detected ]
       > 2.6.28-autotest-tip-sv #1
       > ---------------------------------------------
       > klogd/5062 is trying to acquire lock:
       >  (&rq->lock){++..}, at: [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
       >
       > but task is already holding lock:
       >  (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31
      
      This happens with sched_mc set to 2 (it is default-off).
      
      Strictly speaking we'll not deadlock, because ttwu will not be able to
      place the migration task on our rq, but since the code can deal with
      both rqs getting unlocked, this seems the easiest way out.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      da8d5089
  8. 06 Jan 2009 (2 commits)
  9. 05 Jan 2009 (2 commits)
  10. 04 Jan 2009 (1 commit)
    • sched: put back some stack hog changes that were undone in kernel/sched.c · 6ca09dfc
      By Mike Travis
      Impact: prevents panic from stack overflow on numa-capable machines.
      
      Some of the "removal of stack hogs" changes in kernel/sched.c that used
      node_to_cpumask_ptr were undone by the early cpumask API updates,
      causing a panic due to stack overflow.  This patch restores those
      changes by using cpumask_of_node(), which returns a
      'const struct cpumask *' (illustrated after this entry).

      In addition, cpu_coregroup_map is replaced with cpu_coregroup_mask,
      further reducing stack usage.  (Both of these updates removed 9 FIXMEs!)
      
      Also:
         Pick up some remaining changes from the old 'cpumask_t' functions to
         the new 'struct cpumask *' functions.
      
         Optimize memory traffic by allocating each percpu local_cpu_mask on the
         same node as the referring cpu.
      Signed-off-by: Mike Travis <travis@sgi.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      6ca09dfc
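      An illustrative before/after of the stack saving (my own minimal
      example, not lines from the patch):

        /* before: a full cpumask_t lands on the stack; with NR_CPUS=4096
         * that is 512 bytes per frame, easily overflowed on big NUMA boxes */
        cpumask_t mask = node_to_cpumask(node);

        /* after: only a pointer; the mask itself is preallocated per node */
        const struct cpumask *mask = cpumask_of_node(node);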
  11. 31 Dec 2008 (2 commits)
    • [PATCH] idle cputime accounting · 79741dd3
      By Martin Schwidefsky
      The cpu time spent by the idle process actually doing something is
      currently accounted as idle time.  This is plain wrong; the architectures
      that support VIRT_CPU_ACCOUNTING=y can do better: distinguish between the
      time spent doing nothing and the time spent by idle doing work.  The first
      is accounted with account_idle_time and the second with
      account_system_time.

      The architectures that use the account_xxx_time interface directly and
      not the account_xxx_ticks interface now need to do the check for the idle
      process in their arch code.  In particular, to improve the system vs.
      true idle time accounting, the arch code needs to measure the true idle
      time instead of just testing for the idle process.  To improve the
      tick-based accounting as well, we would need an architecture primitive
      that can tell us if the pt_regs of the interrupted context points to the
      magic instruction that halts the cpu.
      
      In addition, idle time is no longer added to the stime of the idle
      process.  This field now contains the system time of the idle process,
      as it should.  On systems without VIRT_CPU_ACCOUNTING this will always
      be zero, as every tick that occurs while idle is running will be
      accounted as idle time.
      
      This patch contains the common code changes necessary to distinguish
      idle system time from true idle time.  The architectures with support
      for VIRT_CPU_ACCOUNTING need some changes to exploit this (a hedged
      sketch of the arch-side check follows this entry).
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      79741dd3
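      A hedged sketch of the arch-side dispatch this asks for, assuming the
      architecture can measure a true-idle delta (in_true_idle, delta and
      delta_scaled are placeholder names of mine):

        /* accounting path on a VIRT_CPU_ACCOUNTING=y architecture */
        if (in_true_idle)
                /* the cpu was actually halted: real idle time */
                account_idle_time(delta);
        else
                /* the idle task was doing work: system time, not idle */
                account_system_time(current, hardirq_offset,
                                    delta, delta_scaled);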
    • [PATCH] fix scaled & unscaled cputime accounting · 457533a7
      By Martin Schwidefsky
      The utimescaled / stimescaled fields in the task structure and the
      global cpustat should be set on all architectures.  On s390 the calls
      to account_user_time_scaled and account_system_time_scaled have never
      been added.  In addition, system time that is accounted as guest time
      to the user time of a process is added to the scaled system time
      instead of the scaled user time.

      To fix the bugs and to prevent future forgetfulness, this patch merges
      account_system_time_scaled into account_system_time and
      account_user_time_scaled into account_user_time (the resulting
      signatures are sketched after this entry).
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Michael Neuling <mikey@neuling.org>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      457533a7
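      The post-merge signatures as I understand them (hedged): the scaled
      value now travels with the unscaled one, so an architecture can no
      longer forget it.

        void account_user_time(struct task_struct *p, cputime_t cputime,
                               cputime_t cputime_scaled);
        void account_system_time(struct task_struct *p, int hardirq_offset,
                                 cputime_t cputime, cputime_t cputime_scaled);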
  12. 26 Dec 2008 (1 commit)
  13. 25 Dec 2008 (1 commit)
  14. 24 Dec 2008 (1 commit)
  15. 19 Dec 2008 (6 commits)
    • sched: fix warning in kernel/sched.c · 9924da43
      By Ingo Molnar
      Impact: fix cpumask conversion bug
      
      This warning:
      
        kernel/sched.c: In function ‘find_busiest_group’:
        kernel/sched.c:3429: warning: passing argument 1 of ‘__first_cpu’ from incompatible pointer type
      
      shows that we forgot to convert a new patch to the new cpumask APIs.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9924da43
    • sched: activate active load balancing in new idle cpus · ad273b32
      By Vaidyanathan Srinivasan
      Impact: tweak task balancing to save power more aggressively
      
      Active load balancing is a process by which the migration thread is
      woken up on the target CPU in order to pull a task currently running
      on another package into this newly idle package.
      
      This method is already in use with normal load_balance(); this patch
      introduces it for newly idle cpus when sched_mc is set to
      POWERSAVINGS_BALANCE_WAKEUP (a hedged sketch follows this entry).

      This logic provides effective consolidation of short-running daemon
      jobs on an almost idle system.
      
      The side effect of this patch may be ping-ponging of tasks if the
      system is moderately utilised.  We may need to adjust the number of
      iterations before triggering.
      Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      ad273b32
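      A hedged sketch of the mechanism, modelled on the existing
      active-balance path of load_balance(); field and helper names follow
      my recollection of 2.6.28 and may differ from the patch:

        /* in load_balance_newidle(): nothing could be pulled normally */
        if (!pulled_task &&
            sched_mc_power_savings >= POWERSAVINGS_BALANCE_WAKEUP &&
            !busiest->active_balance) {
                busiest->active_balance = 1;
                busiest->push_cpu = this_cpu;
                /* the migration thread on the busy cpu pushes its
                 * running task over to this newly idle cpu */
                wake_up_process(busiest->migration_thread);
        }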
    • sched: nominate preferred wakeup cpu · 7a09b1a2
      By Vaidyanathan Srinivasan
      Impact: extend load-balancing code (no change in behavior yet)
      
      When system utilisation is low and more cpus are idle, a process
      waking up from sleep should prefer an idle cpu in a semi-idle cpu
      package (multi-core package) rather than one in a completely idle cpu
      package, which would waste power.
      
      Use the sched_mc balance logic in find_busiest_group() to
      nominate a preferred wakeup cpu.
      
      This info could be stored in the appropriate sched_domain, but
      updating it in all copies of sched_domain is not practical.  Hence it
      is stored in the root_domain struct, of which there is one copy per
      partitioned sched domain, reachable from each cpu's runqueue (sketched
      after this entry).
      Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      7a09b1a2
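      A hedged sketch of where the nomination lives (the field name is my
      recollection and may differ from the patch):

        struct root_domain {
                /* existing members (refcount, span, online, ...) */

                /* cpu nominated by find_busiest_group() as the preferred
                 * power-aware wakeup target for this partitioned domain */
                unsigned int sched_mc_preferred_wakeup_cpu;
        };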
    • sched: favour lower logical cpu number for sched_mc balance · d5679bd1
      By Vaidyanathan Srinivasan
      Impact: change load-balancing direction to match that of irqbalanced
      
      In case two groups have identical load, prefer to move load to the
      lower logical cpu number rather than, as the present logic does, to
      the higher one.
      
      find_busiest_group() tries to look for a group_leader that has spare
      capacity to take more tasks and to free up an appropriate least-loaded
      group.  In case of a tie, where the load is equal, the group with the
      higher logical number is currently favoured.  This conflicts with the
      user-space irqbalance daemon, which moves interrupts to the lower
      logical number when system utilisation is very low (the tie-break is
      sketched after this entry).
      Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d5679bd1
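      A hedged sketch of the tie-break alone (illustrative; not the patch's
      exact conditions):

        /* scanning for the least-loaded group in find_busiest_group() */
        if (sum_load < min_load ||
            (sum_load == min_load &&
             cpumask_first(sched_group_cpus(group)) <
             cpumask_first(sched_group_cpus(group_min)))) {
                min_load  = sum_load;
                group_min = group;      /* prefer the lower logical cpu */
        }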
    • sched: framework for sched_mc/smt_power_savings=N · afb8a9b7
      By Gautham R Shenoy
      Impact: extend range of /sys/devices/system/cpu/sched_mc_power_savings
      
      Currently the sched_mc/smt_power_savings variable is a boolean, which
      either enables or disables topology-based power savings.  This patch
      extends the variable from a boolean to a multivalued one, so that its
      value decides how aggressively we perform power-savings balance at the
      appropriate sched domain based on topology.
      
      Variable levels of the power-savings tunable let the end user match
      the required power-savings vs. performance trade-off to the system
      configuration and workload.
      
      This version lets the sched_mc_power_savings global variable take more
      values (0, 1, 2), as sketched after this entry.  Later versions can
      have a single tunable called sched_power_savings instead of
      sched_{mc,smt}_power_savings.
      Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
      Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      afb8a9b7
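      The three levels, as I recall them from this patch series (hedged;
      names may differ slightly):

        enum powersavings_balance_level {
                POWERSAVINGS_BALANCE_NONE = 0,  /* no power-aware balancing */
                POWERSAVINGS_BALANCE_BASIC,     /* consolidate load, but no
                                                   active migration */
                POWERSAVINGS_BALANCE_WAKEUP,    /* also bias wakeups and kick
                                                   active balance on newly
                                                   idle cpus */
                MAX_POWERSAVINGS_BALANCE_LEVELS
        };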
    • tracing: fix warnings in kernel/trace/trace_sched_switch.c · c71dd42d
      By Ingo Molnar
      These warnings:
      
        kernel/trace/trace_sched_switch.c: In function ‘tracing_sched_register’:
        kernel/trace/trace_sched_switch.c:96: warning: passing argument 1 of ‘register_trace_sched_wakeup_new’ from incompatible pointer type
        kernel/trace/trace_sched_switch.c:112: warning: passing argument 1 of ‘unregister_trace_sched_wakeup_new’ from incompatible pointer type
        kernel/trace/trace_sched_switch.c: In function ‘tracing_sched_unregister’:
        kernel/trace/trace_sched_switch.c:121: warning: passing argument 1 of ‘unregister_trace_sched_wakeup_new’ from incompatible pointer type
      
      They trigger because the sched_wakeup_new tracepoints need the same
      trace signature as sched_wakeup, which was changed recently.
      
      Fix it.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      c71dd42d
  16. 18 Dec 2008 (1 commit)
  17. 16 Dec 2008 (3 commits)
  18. 13 Dec 2008 (1 commit)
    • cpumask: change cpumask_scnprintf, cpumask_parse_user, cpulist_parse,
      and cpulist_scnprintf to take pointers · 29c0177e
      By Rusty Russell
      
      Impact: change calling convention of existing cpumask APIs
      
      Most cpumask functions started with cpus_; these have been replaced by
      cpumask_ ones which take struct cpumask pointers, as expected.

      These four functions don't have good replacement names; fortunately
      they're rarely used, so we just change them over (example after this
      entry).
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Mike Travis <travis@sgi.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: paulus@samba.org
      Cc: mingo@redhat.com
      Cc: tony.luck@intel.com
      Cc: ralf@linux-mips.org
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: cl@linux-foundation.org
      Cc: srostedt@redhat.com
      29c0177e
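      An illustrative caller-side change (my own example; assuming the new
      functions take const struct cpumask *):

        cpumask_t mask = CPU_MASK_ALL;  /* legacy on-stack mask */

        /* before: the old macro took the mask by name */
        len = cpumask_scnprintf(buf, sizeof(buf), mask);

        /* after: the function takes a pointer */
        len = cpumask_scnprintf(buf, sizeof(buf), &mask);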
  19. 12 Dec 2008 (3 commits)
  20. 10 Dec 2008 (1 commit)
    • sched: CPU remove deadlock fix · 9a2bd244
      By Brian King
      Impact: fix possible deadlock in CPU hot-remove path
      
      This patch fixes a possible deadlock scenario in the CPU remove path.
      migration_call() grabs rq->lock, then wakes up everything on
      rq->migration_queue with the lock held.  One of the tasks on the
      migration queue then ends up calling tg_shares_up(), which also tries
      to acquire the same rq->lock (the shape of a fix is sketched after the
      trace below).
      
      [c000000058eab2e0] c000000000502078 ._spin_lock_irqsave+0x98/0xf0
      [c000000058eab370] c00000000008011c .tg_shares_up+0x10c/0x20c
      [c000000058eab430] c00000000007867c .walk_tg_tree+0xc4/0xfc
      [c000000058eab4d0] c0000000000840c8 .try_to_wake_up+0xb0/0x3c4
      [c000000058eab590] c0000000000799a0 .__wake_up_common+0x6c/0xe0
      [c000000058eab640] c00000000007ada4 .complete+0x54/0x80
      [c000000058eab6e0] c000000000509fa8 .migration_call+0x5fc/0x6f8
      [c000000058eab7c0] c000000000504074 .notifier_call_chain+0x68/0xe0
      [c000000058eab860] c000000000506568 ._cpu_down+0x2b0/0x3f4
      [c000000058eaba60] c000000000506750 .cpu_down+0xa4/0x108
      [c000000058eabb10] c000000000507e54 .store_online+0x44/0xa8
      [c000000058eabba0] c000000000396260 .sysdev_store+0x3c/0x50
      [c000000058eabc10] c0000000001a39b8 .sysfs_write_file+0x124/0x18c
      [c000000058eabcd0] c00000000013061c .vfs_write+0xd0/0x1bc
      [c000000058eabd70] c0000000001308a4 .sys_write+0x68/0x114
      [c000000058eabe30] c0000000000086b4 syscall_exit+0x0/0x40
      Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9a2bd244
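      One plausible shape of such a fix (hedged; not verified against the
      actual patch): drain rq->migration_queue under rq->lock, but run the
      completions only after the lock is dropped, so that try_to_wake_up()
      -> tg_shares_up() can retake rq->lock safely.

        LIST_HEAD(local);
        struct migration_req *req, *next;

        spin_lock_irq(&rq->lock);
        /* detach all pending requests while holding the lock ... */
        list_splice_init(&rq->migration_queue, &local);
        spin_unlock_irq(&rq->lock);

        /* ... but complete the waiters without it */
        list_for_each_entry_safe(req, next, &local, list) {
                list_del_init(&req->list);
                complete(&req->done);
        }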
  21. 08 Dec 2008 (2 commits)
  22. 02 Dec 2008 (1 commit)