1. 11 March 2011, 1 commit
    • SUNRPC: Close a race in __rpc_wait_for_completion_task() · bf294b41
      Committed by Trond Myklebust
      Although they run as rpciod background tasks, under normal operation
      (i.e. no SIGKILL), functions like nfs_sillyrename(), nfs4_proc_unlck()
      and nfs4_do_close() want to be fully synchronous. This means that when we
      exit, we want all references to the rpc_task to be gone, and we want
      any dentry references etc. held by that task to be released.
      
      For this reason these functions call __rpc_wait_for_completion_task(),
      followed by rpc_put_task(), in the expectation that the latter will
      release the last reference to the rpc_task, thus ensuring that the
      callback_ops->rpc_release() has been called synchronously.
      
      This patch fixes a race that exists because rpciod calls
      rpc_complete_task() (in order to wake up the callers of
      __rpc_wait_for_completion_task()) and then calls rpc_put_task()
      without ensuring that these two steps are done atomically.
      
      In order to avoid adding new spin locks, the patch uses the existing
      waitqueue spin lock to order the rpc_task reference count releases
      between the waiting process and rpciod (a toy model of this ordering
      is sketched at the end of this entry).
      The common case, where nobody is waiting for completion, is optimised
      by checking whether the RPC_TASK_ASYNC flag is cleared and/or whether
      the rpc_task reference count is 1: in those cases we skip taking the
      spin lock and immediately free the rpc_task.
      
      Those few processes that need to put the rpc_task from inside an
      asynchronous context and that do not care about ordering are given a new
      helper: rpc_put_task_async().
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
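      A toy userspace model of the ordering described above, assuming
      pthreads; the names (toy_task, toy_complete_and_put, toy_wait_and_put)
      are illustrative and this is not the SUNRPC code. The point is only
      that the completer's reference drop and the wake-up happen under the
      same lock the waiter sleeps on, so the waiter's own put is guaranteed
      to be the last one:

        #include <pthread.h>
        #include <stdbool.h>
        #include <stdio.h>
        #include <stdlib.h>

        struct toy_task {
            pthread_mutex_t lock;     /* stands in for the waitqueue spin lock */
            pthread_cond_t  done_cv;
            bool            done;
            int             refcount; /* one ref for the worker, one for a waiter */
        };

        static void toy_free(struct toy_task *t)
        {
            /* "rpc_release": must only ever run after the last put */
            printf("last reference dropped, releasing task\n");
            free(t);
        }

        /* Worker ("rpciod") side: the completion and the worker's put are
         * ordered under the same lock the waiter sleeps on. */
        static void toy_complete_and_put(struct toy_task *t)
        {
            int refs;

            pthread_mutex_lock(&t->lock);
            t->done = true;
            refs = --t->refcount;
            pthread_cond_broadcast(&t->done_cv);
            pthread_mutex_unlock(&t->lock);
            if (refs == 0)            /* fast path: nobody was waiting */
                toy_free(t);
        }

        /* Waiter side: once this returns, the waiter's put is the last one. */
        static void toy_wait_and_put(struct toy_task *t)
        {
            int refs;

            pthread_mutex_lock(&t->lock);
            while (!t->done)
                pthread_cond_wait(&t->done_cv, &t->lock);
            refs = --t->refcount;
            pthread_mutex_unlock(&t->lock);
            if (refs == 0)
                toy_free(t);
        }

        static void *worker(void *arg)
        {
            toy_complete_and_put(arg);
            return NULL;
        }

        int main(void)
        {
            struct toy_task *t = malloc(sizeof(*t));
            pthread_t tid;

            pthread_mutex_init(&t->lock, NULL);
            pthread_cond_init(&t->done_cv, NULL);
            t->done = false;
            t->refcount = 2;          /* worker + waiter */

            pthread_create(&tid, NULL, worker, t);
            toy_wait_and_put(t);      /* the waiter's put frees t, not the worker's */
            pthread_join(tid, NULL);
            return 0;
        }

      rpc_put_task_async(), by contrast, is for callers that put the task
      from an asynchronous context and do not need this ordering.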
  2. 16 February 2011, 1 commit
  3. 12 February 2011, 1 commit
    • ftrace: Fix memory leak with function graph and cpu hotplug · 868baf07
      Committed by Steven Rostedt
      When the function graph tracer starts, it needs to make a special
      stack for each task to save the real return values of the tasks.
      All running tasks have this stack created, as well as any new
      tasks.
      
      On CPU hotplug, the idle task also allocates a stack when init_idle()
      is called. The problem is that CPU hotplug does not create a new idle
      task; instead it reuses the idle task that existed when the CPU went
      down.
      
      ftrace_graph_init_task() will add a new ret_stack to the task that is
      given to it. Because a cloned task starts out with its parent's stack,
      the function does not check whether the task's ret_stack is already
      set. When the CPU hotplug code brings a CPU back up, it therefore
      allocates a new stack even though one already exists for that idle
      task.
      
      The solution is to treat the idle task specially. In fact, the
      function_graph code already does, just not at init_idle().
      Instead of using ftrace_graph_init_task() for the idle task, since
      that function expects the task to be a clone, add a separate
      ftrace_graph_init_idle_task(). Also, create a per_cpu ret_stack that
      is used by the idle task. When ftrace_graph_init_idle_task() is
      called, it checks whether the idle task's ret_stack is NULL; if it is,
      it assigns the per_cpu ret_stack to it (see the sketch at the end of
      this entry).
      Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Stable Tree <stable@kernel.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
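      A rough sketch of the shape of the fix described above; the allocation
      of the per-CPU stack and the checks for whether the tracer is active
      are omitted, and the per-CPU variable name is an assumption rather
      than the exact kernel code:

        /* One persistent ret_stack per CPU, reused across hotplug cycles, so
         * bringing a CPU back up never leaks a previously allocated stack. */
        static DEFINE_PER_CPU(struct ftrace_ret_stack *, idle_ret_stack);

        void ftrace_graph_init_idle_task(struct task_struct *t, int cpu)
        {
            /* Unlike ftrace_graph_init_task(), do not assume a fresh clone:
             * a hotplugged CPU reuses the idle task that already existed. */
            if (!t->ret_stack)
                t->ret_stack = per_cpu(idle_ret_stack, cpu);
        }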
  4. 03 February 2011, 1 commit
    • perf: Cure task_oncpu_function_call() races · fe4b04fa
      Committed by Peter Zijlstra
      Oleg reported that on architectures with
      __ARCH_WANT_INTERRUPTS_ON_CTXSW the IPI from
      task_oncpu_function_call() can land before perf_event_task_sched_in()
      and cause interesting situations for, e.g., perf_install_in_context().
      
      This patch reworks the task_oncpu_function_call() interface into a
      more usable primitive, reworks all its users to hopefully be more
      obvious, and removes the races (an illustrative sketch of such a
      primitive follows this entry).
      
      While looking at the code I also found a number of races against
      perf_event_task_sched_out(), which can flip contexts between tasks, so
      plug those too.
      Reported-and-reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
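      An illustrative sketch of the kind of primitive described above, built
      on smp_call_function_single(); the names remote_call_data,
      remote_call_trampoline and task_call_on_cpu are invented for this
      sketch and are not the interface the patch introduces:

        struct remote_call_data {
            struct task_struct *p;
            int (*func)(void *info);
            void *info;
            int ret;
        };

        static void remote_call_trampoline(void *data)
        {
            struct remote_call_data *rcd = data;

            /* Re-check on the target CPU: the task may have migrated or been
             * scheduled out between the caller's check and the IPI landing. */
            if (task_curr(rcd->p))
                rcd->ret = rcd->func(rcd->info);
            else
                rcd->ret = -EAGAIN;
        }

        /* Run func on the CPU where p is running and report whether it did,
         * so the caller can re-check its state and retry if the task moved. */
        static int task_call_on_cpu(struct task_struct *p,
                                    int (*func)(void *info), void *info)
        {
            struct remote_call_data rcd = {
                .p = p, .func = func, .info = info, .ret = -EAGAIN,
            };

            smp_call_function_single(task_cpu(p), remote_call_trampoline,
                                     &rcd, 1 /* wait */);
            return rcd.ret;
        }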
  5. 19 January 2011, 1 commit
  6. 18 January 2011, 2 commits
  7. 07 January 2011, 3 commits
  8. 05 January 2011, 2 commits
    • sched: Change wait_for_completion_*_timeout() to return a signed long · 6bf41237
      Committed by NeilBrown
      wait_for_completion_*_timeout() can return:
      
         0: if the wait timed out
       -ve: if the wait was interrupted
       +ve: if the completion was completed.
      
      As they currently return an 'unsigned long', the last two cases are
      not easily distinguished, which can easily result in buggy code, as is
      the case for the recently added
      wait_for_completion_interruptible_timeout() call in
      net/sunrpc/cache.c.
      
      So change them both to return 'long'.  As MAX_SCHEDULE_TIMEOUT is
      LONG_MAX, a large +ve return value should never overflow (a usage
      sketch follows this entry).
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: J. Bruce Fields <bfields@fieldses.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <20110105125016.64ccab0e@notabene.brown>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
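      A usage sketch of the distinction this makes possible, assuming 'done'
      is an initialised struct completion and a five second timeout:

        long ret;

        ret = wait_for_completion_interruptible_timeout(&done,
                                                        msecs_to_jiffies(5000));
        if (ret < 0)
            return ret;            /* interrupted, e.g. -ERESTARTSYS */
        if (ret == 0)
            return -ETIMEDOUT;     /* the wait timed out */
        /* ret > 0: completed; ret is the time left, in jiffies */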
    • [S390] mutex: Introduce arch_mutex_cpu_relax() · 34b133f8
      Committed by Gerald Schaefer
      The spinning mutex implementation uses cpu_relax() in busy loops as a
      compiler barrier. Depending on the architecture, cpu_relax() may do
      more than needed in these specific mutex spin loops. On System z we
      also give up the time slice of the virtual cpu in cpu_relax(), which
      prevents effective spinning on the mutex.
      
      This patch replaces cpu_relax() in the spinning mutex code with
      arch_mutex_cpu_relax(), which can be defined by each architecture that
      selects HAVE_ARCH_MUTEX_CPU_RELAX. The default is still cpu_relax()
      (see the fallback sketched at the end of this entry), so this patch
      should not affect architectures other than System z for now.
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1290437256.7455.4.camel@thinkpad>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
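      The fallback presumably looks something like the following sketch,
      keyed off the Kconfig symbol named above (treat the exact spelling as
      an assumption):

        #ifndef CONFIG_HAVE_ARCH_MUTEX_CPU_RELAX
        #define arch_mutex_cpu_relax()  cpu_relax()
        #endif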
  9. 04 January 2011, 1 commit
  10. 20 December 2010, 1 commit
  11. 16 December 2010, 3 commits
  12. 09 December 2010, 3 commits
  13. 30 November 2010, 3 commits
    • sched: Add 'autogroup' scheduling feature: automated per session task groups · 5091faa4
      Committed by Mike Galbraith
      A recurring complaint from CFS users is that parallel kbuild has
      a negative impact on desktop interactivity.  This patch
      implements an idea from Linus, to automatically create task
      groups.  Currently, only per session autogroups are implemented,
      but the patch leaves the way open for enhancement.
      
      Implementation: each task's signal struct contains an inherited
      pointer to a refcounted autogroup struct containing a task group
      pointer; by default all tasks point to the init_task_group.  When a
      task calls setsid(), a new task group is created, the process is moved
      into the new task group, and the reference to the previous task group
      is dropped.  Child processes inherit this task group thereafter and
      increase its refcount.  When the last thread of a process exits, the
      process's reference is dropped, so that when the last process
      referencing an autogroup exits, the autogroup is destroyed (see the
      sketch at the end of this entry).
      
      At runqueue selection time, IFF a task has no cgroup assignment,
      its current autogroup is used.
      
      Autogroup bandwidth is controlled by setting its nice level through
      the proc filesystem:
      
        cat /proc/<pid>/autogroup
      
      Displays the task's group and the group's nice level.
      
        echo <nice level> > /proc/<pid>/autogroup
      
      Sets the task group's shares to the weight of a nice-<level> task.
      Setting the nice level is rate limited for !admin users due to the
      abuse risk of task group locking.
      
      The feature is enabled from boot by default if
      CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
      the boot option noautogroup, and can also be turned on/off on
      the fly via:
      
        echo [01] > /proc/sys/kernel/sched_autogroup_enabled
      
      ... which will automatically move tasks to/from the root task group.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      [ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
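      A sketch of the data relationship described above; the field names are
      approximate rather than the exact kernel definitions:

        /* Shared by every task in the session, reached via the signal struct:
         *   task_struct -> signal_struct -> autogroup -> task_group         */
        struct autogroup {
            struct kref        kref;  /* dropped as the session's tasks go away */
            struct task_group *tg;    /* the per-session task group */
        };

      At runqueue selection time the scheduler falls back to this tg only
      when the task has no explicit cgroup assignment.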
    • sched: Fix unregister_fair_sched_group() · 822bc180
      Committed by Paul Turner
      In the flipping and flopping between calling
      unregister_fair_sched_group() on a per-cpu versus per-group basis
      we ended up in a bad state.
      
      Remove from the list for the passed cpu as opposed to some
      arbitrary index.
      
      ( This fixes explosions w/ autogroup as well as a group
        creation/destruction stress test. )
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20101130005740.080828123@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • rcu,cleanup: move synchronize_sched_expedited() out of sched.c · 7b27d547
      Committed by Lai Jiangshan
      The first version of synchronize_sched_expedited() used the migration
      code in the scheduler, and was therefore implemented in kernel/sched.c.
      However, the more recent version of this code no longer uses the
      migration code, so this commit moves it to the main RCU source files.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  14. 26 November 2010, 2 commits
  15. 23 November 2010, 1 commit
  16. 18 November 2010, 7 commits
  17. 11 November 2010, 2 commits
    • sched: Fix cross-sched-class wakeup preemption · 1e5a7405
      Committed by Peter Zijlstra
      Instead of dealing with sched classes inside each check_preempt_curr()
      implementation, pull this logic out into the generic wakeup preemption
      path (sketched approximately at the end of this entry).
      
      This fixes a hang in KVM (and others) where we are waiting for the
      stop machine thread to run ...
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Tested-by: Marcelo Tosatti <mtosatti@redhat.com>
      Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1288891946.2039.31.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
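      An approximate sketch of the generic check this describes, simplified
      from what kernel/sched.c looks like in this era; treat it as
      illustrative rather than as the exact patch:

        static void check_preempt_curr(struct rq *rq, struct task_struct *p,
                                       int flags)
        {
            const struct sched_class *class;

            if (p->sched_class == rq->curr->sched_class) {
                /* same class: let the class decide */
                rq->curr->sched_class->check_preempt_curr(rq, p, flags);
            } else {
                /* walk classes from highest priority downwards */
                for_each_class(class) {
                    if (class == rq->curr->sched_class)
                        break;                  /* curr outranks p: no preempt */
                    if (class == p->sched_class) {
                        resched_task(rq->curr); /* p outranks curr: preempt */
                        break;
                    }
                }
            }
        }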
    • sched: Use group weight, idle cpu metrics to fix imbalances during idle · aae6d3dd
      Committed by Suresh Siddha
      Currently we consider a sched domain to be well balanced when the
      imbalance is less than the domain's imbalance_pct. As the number of
      cores and threads increases, the current values of imbalance_pct (for
      example 25% for a NUMA domain) are not enough to detect imbalances
      like:
      
      a) On a WSM-EP system (two sockets, each having 6 cores and 12 logical
      threads), 24 cpu-hogging tasks get scheduled as 13 on one socket and
      11 on the other, leaving one HT cpu idle.
      
      b) On a hypothetical 2-socket NHM-EX system (each socket having 8
      cores and 16 logical threads), 16 cpu-hogging tasks can get scheduled
      as 9 on one socket and 7 on the other, leaving one core in one socket
      idle while a core in the other socket has both of its HT siblings
      busy.
      
      While this issue can be fixed by decreasing the domain's imbalance_pct
      (by making it a function of the number of logical cpus in the domain),
      that can potentially cause more task migrations across sched groups in
      an overloaded case.
      
      Fix this by using imbalance_pct only during newly-idle and busy load
      balancing. During idle load balancing, instead check whether the
      number of idle cpus differs between the busiest group and this
      sched_group, or whether the busiest group has more tasks than its
      weight that the idle cpu in this group can pull (an approximate sketch
      of this check follows the entry).
      Reported-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1284760952.2676.11.camel@sbsiddha-MOBL3.sc.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
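      An approximate sketch of the idle-balance check described above; the
      sds.* names are meant to suggest the sched-domain statistics rather
      than quote the kernel code exactly:

        if (idle == CPU_IDLE) {
            /* Treat the domain as balanced only if the busiest group is not
             * overloaded (no more tasks than CPUs) and its idle-CPU count is
             * comparable to this group's. */
            if (sds.busiest_nr_running <= sds.busiest_group_weight &&
                sds.this_idle_cpus <= sds.busiest_idle_cpus + 1)
                goto out_balanced;
        } else {
            /* Newly-idle and busy balancing keep the imbalance_pct check. */
            if (100 * sds.max_load <= sd->imbalance_pct * sds.this_load)
                goto out_balanced;
        }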
  18. 02 November 2010, 1 commit
  19. 23 October 2010, 1 commit
  20. 22 October 2010, 1 commit
  21. 21 October 2010, 1 commit
  22. 19 October 2010, 1 commit
    • sched: Export account_system_vtime() · b7dadc38
      Committed by Ingo Molnar
      KVM, for example, fails to build as a module without the export (the
      likely one-line fix is sketched below):
      
       ERROR: "account_system_vtime" [arch/x86/kvm/kvm.ko] undefined!
      
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
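      The fix is essentially an export next to the function's definition;
      the exact flavour below is an assumption (a GPL-only export is enough
      for kvm.ko):

        EXPORT_SYMBOL_GPL(account_system_vtime);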