1. 31 Mar, 2011 1 commit
    • sched: Leave sched_setscheduler() earlier if possible, do not disturb SCHED_FIFO tasks · a51e9198
      Dario Faggioli committed
      sched_setscheduler() (in sched.c) is called in order to change the
      scheduling policy and/or the real-time priority of a task. Thus,
      if we find out that neither of those is actually being modified, it
      is possible to return earlier and save the overhead of a full
      deactivate+activate cycle of the task in question.
      
      Besides that, if we have more than one SCHED_FIFO task with the same
      priority on the same rq (which means they share the same priority queue),
      having one of them change its position in the priority queue because of
      a sched_setscheduler() call (as happens by means of the deactivate+activate)
      that does not actually change the priority violates POSIX, which states,
      for SCHED_FIFO:
      
        "If a thread whose policy or priority has been modified by
         pthread_setschedprio() is a running thread or is runnable, the effect on
         its position in the thread list depends on the direction of the
         modification, as follows: a. <...> b. If the priority is unchanged, the
         thread does not change position in the thread list. c. <...>"
      
           http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_08.html
      
       (ed: And the POSIX specification here does, briefly and somewhat unexpectedly,
            match what common sense tells us as well. )
      Signed-off-by: Dario Faggioli <raistlin@linux.it>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1300971618.3960.82.camel@Palantir>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
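The early-return idea described in this commit can be sketched in user-space C. This is a minimal illustration under stated assumptions, not the kernel code: `struct task`, `set_scheduler()` and the `requeues` counter are hypothetical stand-ins, and the real check in sched.c may compare more state than shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's task structure. */
struct task {
    int policy;       /* scheduling policy, e.g. SCHED_FIFO */
    int rt_priority;  /* real-time priority */
    int requeues;     /* counts deactivate+activate cycles (illustration only) */
};

/*
 * Returns true if the task went through a requeue cycle, false if the
 * call returned early because neither policy nor priority changed.
 */
static bool set_scheduler(struct task *t, int policy, int rt_priority)
{
    assert(t != NULL);

    /* Nothing would change: leave early and keep the queue position. */
    if (t->policy == policy && t->rt_priority == rt_priority)
        return false;

    /* Models the deactivate+activate cycle that moves the task in its queue. */
    t->requeues++;
    t->policy = policy;
    t->rt_priority = rt_priority;
    return true;
}
```

With this shape, a call that repeats the current policy and priority never touches the queue, so the task keeps its place among equal-priority SCHED_FIFO tasks.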
  2. 24 Mar, 2011 1 commit
  3. 23 Mar, 2011 1 commit
  4. 20 Mar, 2011 1 commit
  5. 17 Mar, 2011 1 commit
  6. 16 Mar, 2011 1 commit
  7. 11 Mar, 2011 1 commit
    • SUNRPC: Close a race in __rpc_wait_for_completion_task() · bf294b41
      Trond Myklebust committed
      Although they run as rpciod background tasks, under normal operation
      (i.e. no SIGKILL), functions like nfs_sillyrename(), nfs4_proc_unlck()
      and nfs4_do_close() want to be fully synchronous. This means that when we
      exit, we want all references to the rpc_task to be gone, and we want
      any dentry references etc. held by that task to be released.
      
      For this reason these functions call __rpc_wait_for_completion_task(),
      followed by rpc_put_task() in the expectation that the latter will be
      releasing the last reference to the rpc_task, and thus ensuring that the
      callback_ops->rpc_release() has been called synchronously.
      
      This patch fixes a race which exists due to the fact that
      rpciod calls rpc_complete_task() (in order to wake up the callers of
      __rpc_wait_for_completion_task()) and then subsequently calls
      rpc_put_task() without ensuring that these two steps are done atomically.
      
      In order to avoid adding new spin locks, the patch uses the existing
      waitqueue spin lock to order the rpc_task reference count releases between
      the waiting process and rpciod.
      The common case where nobody is waiting for completion is optimised for by
      checking if the RPC_TASK_ASYNC flag is cleared and/or if the rpc_task
      reference count is 1: in those cases we skip grabbing the spin lock
      and immediately free up the rpc_task.
      
      Those few processes that need to put the rpc_task from inside an
      asynchronous context and that do not care about ordering are given a new
      helper: rpc_put_task_async().
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
  8. 10 Mar, 2011 1 commit
    • block: initial patch for on-stack per-task plugging · 73c10101
      Jens Axboe committed
      This patch adds support for creating a queuing context outside
      of the queue itself. This enables us to batch up pieces of IO
      before grabbing the block device queue lock and submitting them to
      the IO scheduler.
      
      The context is created on the stack of the process and assigned in
      the task structure, so that we can auto-unplug it if we hit a schedule
      event.
      
      The current queue plugging happens implicitly if IO is submitted to
      an empty device, yet callers have to remember to unplug that IO when
      they are going to wait for it. This is an ugly API and has caused bugs
      in the past. Additionally, it requires hacks in the vm (->sync_page()
      callback) to handle that logic. By switching to an explicit plugging
      scheme we make the API a lot nicer and can get rid of the ->sync_page()
      hack in the vm.
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
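The batching idea behind on-stack plugging can be modelled roughly as follows. This is only a sketch under stated assumptions: `struct plug`, `struct queue`, `current_plug`, `start_plug()`, `submit()` and `finish_plug()` are hypothetical names for illustration, the queue lock is reduced to a counter, and the auto-unplug on schedule is not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define PLUG_MAX 16

/* Hypothetical on-stack queuing context: holds requests before dispatch. */
struct plug {
    int reqs[PLUG_MAX];
    size_t count;
};

/* Hypothetical device queue; the lock is modelled as a counter only. */
struct queue {
    int lock_acquisitions;
    int dispatched;
};

/* Stand-in for the pointer the commit stores in the task structure. */
static struct plug *current_plug;

/* Take the queue lock once and hand over everything that was batched. */
static void flush_plug(struct queue *q, struct plug *p)
{
    if (p->count == 0)
        return;
    q->lock_acquisitions++;
    q->dispatched += (int)p->count;
    p->count = 0;
}

static void start_plug(struct plug *p)
{
    p->count = 0;
    current_plug = p;
}

static void submit(struct queue *q, int req)
{
    struct plug *p = current_plug;

    if (p == NULL) {
        /* No plug active: every request pays for the lock on its own. */
        q->lock_acquisitions++;
        q->dispatched++;
        return;
    }
    if (p->count == PLUG_MAX)
        flush_plug(q, p);
    p->reqs[p->count++] = req;
}

static void finish_plug(struct queue *q, struct plug *p)
{
    flush_plug(q, p);
    current_plug = NULL;
}
```

In this model, three requests submitted between `start_plug()` and `finish_plug()` cost a single lock acquisition instead of three, which is the effect the commit message describes as batching up pieces of IO before grabbing the queue lock.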
  9. 05 Mar, 2011 1 commit
    • BKL: That's all, folks · 4ba8216c
      Arnd Bergmann committed
      This removes the implementation of the big kernel lock,
      at last. A lot of people have worked on this in the
      past, so the credit for this patch should be with
      everyone who participated in the hunt.
      
      The names on the Cc list are the people that were the
      most active in this, according to the recorded git
      history, in alphabetical order.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Alan Cox <alan@linux.intel.com>
      Cc: Alessio Igor Bogani <abogani@texware.it>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Hendry <andrew.hendry@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: Jan Blunck <jblunck@infradead.org>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Oliver Neukum <oliver@neukum.org>
      Cc: Paul Menage <menage@google.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
  10. 04 Mar, 2011 2 commits
    • sched: Resched proper CPU on yield_to() · 6d1cafd8
      Venkatesh Pallipadi committed
      yield_to_task_fair() has code to resched the CPU of the yielding task when
      the intention is to resched the CPU of the task that is being yielded to.
      
      This change fixes the problem and also makes the resched conditional on
      rq != p_rq.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1299025701-22168-1-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Allow users with sufficient RLIMIT_NICE to change from SCHED_IDLE policy · c02aa73b
      Darren Hart committed
      The current scheduler implementation returns -EPERM when trying to
      change from SCHED_IDLE to SCHED_OTHER or SCHED_BATCH. Since SCHED_IDLE
      is considered to be a nice 20 on steroids, changing to another policy
      should be allowed provided the RLIMIT_NICE is accounted for.
      
      This patch allows the following test-case to pass with RLIMIT_NICE=40,
      but still fail with RLIMIT_NICE=10 when the calling process is run
      from a typical shell (nice 0, or 20 in rlimit terms).
      
      #define _GNU_SOURCE	/* exposes SCHED_IDLE in <sched.h> on glibc */
      #include <sched.h>
      #include <stdio.h>
      
      int main(void)
      {
      	int ret;
      	struct sched_param sp;
      	sp.sched_priority = 0;
      
      	/* switch to SCHED_IDLE */
      	ret = sched_setscheduler(0, SCHED_IDLE, &sp);
      	printf("setscheduler IDLE: %d\n", ret);
      	if (ret) return ret;
      
      	/* switch back to SCHED_OTHER */
      	ret = sched_setscheduler(0, SCHED_OTHER, &sp);
      	printf("setscheduler OTHER: %d\n", ret);
      
      	return ret;
      }
      
       $ ulimit -e
       40
       $ ./test
       setscheduler IDLE: 0
       setscheduler OTHER: 0
      
       $ ulimit -e 10
       $ ulimit -e
       10
       $ ./test
       setscheduler IDLE: 0
       setscheduler OTHER: -1
      Signed-off-by: Darren Hart <dvhart@linux.intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Richard Purdie <richard.purdie@linuxfoundation.org>
      LKML-Reference: <4D657BEE.4040608@linux.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  11. 26 Feb, 2011 2 commits
  12. 25 Feb, 2011 1 commit
  13. 16 Feb, 2011 1 commit
  14. 12 Feb, 2011 1 commit
    • ftrace: Fix memory leak with function graph and cpu hotplug · 868baf07
      Steven Rostedt committed
      When the function graph tracer starts, it needs to make a special
      stack for each task to save the real return values of the tasks.
      All running tasks have this stack created, as well as any new
      tasks.
      
      On CPU hot plug, the new idle task will allocate a stack as well
      when init_idle() is called. The problem is that cpu hotplug does
      not create a new idle_task. Instead it uses the idle task that
      existed when the cpu went down.
      
      ftrace_graph_init_task() will add a new ret_stack to the task
      that is given to it. Because a clone will make the task
      have a stack of its parent it does not check if the task's
      ret_stack is already NULL or not. When the CPU hotplug code
      starts a CPU up again, it will allocate a new stack even
      though one already existed for it.
      
      The solution is to treat the idle_task specially. In fact, the
      function_graph code already does, just not at init_idle().
      Instead of using ftrace_graph_init_task() for the idle task,
      which expects the task to be a clone, have a
      separate ftrace_graph_init_idle_task(). Also, we will create a
      per_cpu ret_stack that is used by the idle task. When we call
      ftrace_graph_init_idle_task() it will check if the idle task's
      ret_stack is NULL; if it is, then it will assign it the per_cpu
      ret_stack.
      Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Stable Tree <stable@kernel.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
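The NULL check described in this commit can be illustrated with a small user-space model. It is a sketch under stated assumptions: `struct task`, `struct ret_entry`, `idle_ret_stack`, `graph_init_idle_task()` and the sizes are hypothetical simplifications, and a static array stands in for the per_cpu ret_stack.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4          /* illustrative value */
#define RET_STACK_DEPTH 50 /* illustrative value */

/* Hypothetical entry saved for each traced function call. */
struct ret_entry {
    unsigned long ret_addr;
};

/* Hypothetical, simplified stand-in for the kernel's task structure. */
struct task {
    struct ret_entry *ret_stack;
};

/* Stand-in for the per_cpu ret_stack: one statically allocated stack per CPU. */
static struct ret_entry idle_ret_stack[NR_CPUS][RET_STACK_DEPTH];

/*
 * Idle tasks are reused when a CPU comes back online, so only assign a
 * stack if the task does not have one yet. Repeated hotplug cycles then
 * reuse the same stack instead of leaking a fresh allocation each time.
 */
static void graph_init_idle_task(struct task *idle, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);

    if (idle->ret_stack == NULL)
        idle->ret_stack = idle_ret_stack[cpu];
}
```

Calling `graph_init_idle_task()` twice for the same idle task leaves the first assignment in place, which models why the leak on repeated CPU bring-up goes away.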
  15. 03 Feb, 2011 3 commits
    • sched: Add yield_to(task, preempt) functionality · d95f4122
      Mike Galbraith committed
      Currently only implemented for fair class tasks.
      
      Add a yield_to_task() method to the fair scheduling class, allowing the
      caller of yield_to() to accelerate another thread in its thread group,
      task group.
      
      Implemented via a scheduler hint, using cfs_rq->next to encourage the
      target being selected.  We can rely on pick_next_entity to keep things
      fair, so no one can accelerate a thread that has already used its fair
      share of CPU time.
      
      This also means callers should only call yield_to when they really
      mean it.  Calling it too often can result in the scheduler just
      ignoring the hint.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201095051.4ddb7738@annuminas.surriel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Use a buddy to implement yield_task_fair() · ac53db59
      Rik van Riel committed
      Use the buddy mechanism to implement yield_task_fair.  This
      allows us to skip onto the next highest priority se at every
      level in the CFS tree, unless doing so would introduce gross
      unfairness in CPU time distribution.
      
      We order the buddy selection in pick_next_entity to check
      yield first, then last, then next.  We need next to be able
      to override yield, because it is possible for the "next" and
      "yield" task to be different processes in the same sub-tree
      of the CFS tree.  When they are, we need to go into that
      sub-tree regardless of the "yield" hint, and pick the correct
      entity once we get to the right level.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201095103.3a79e92a@annuminas.surriel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
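The buddy ordering described in this commit (yield first, then last, then next, with later hints overriding earlier ones) can be sketched as below. This is a simplified model under stated assumptions, not the kernel's pick_next_entity(): the sorted array, the `granularity` threshold, the `acceptable()` fairness test and all structure names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scheduling entity; lower vruntime means more deserving. */
struct entity {
    long vruntime;
};

/* Hypothetical run queue: entities sorted by ascending vruntime, plus hints. */
struct runqueue {
    struct entity **sorted;
    size_t n;
    struct entity *skip;  /* the "yield" buddy: prefer not to run it */
    struct entity *last;
    struct entity *next;
    long granularity;     /* assumed fairness bound for honouring a hint */
};

/* A hint is honoured only if it does not lag the leftmost entity too far. */
static int acceptable(const struct runqueue *rq, const struct entity *cand,
                      const struct entity *left)
{
    return cand->vruntime - left->vruntime <= rq->granularity;
}

static struct entity *pick_next(const struct runqueue *rq)
{
    struct entity *left, *se;

    if (rq->n == 0)
        return NULL;
    left = se = rq->sorted[0];

    /* Yield hint: step past the leftmost entity if that stays fair. */
    if (rq->skip == se && rq->n > 1 && acceptable(rq, rq->sorted[1], left))
        se = rq->sorted[1];

    /* Checked later, so "last" and then "next" can override the yield hint. */
    if (rq->last && acceptable(rq, rq->last, left))
        se = rq->last;
    if (rq->next && acceptable(rq, rq->next, left))
        se = rq->next;

    return se;
}
```

Because `next` is evaluated last, it wins over the yield hint whenever honouring it stays within the fairness bound, which mirrors the override the commit message asks for.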
    • perf: Cure task_oncpu_function_call() races · fe4b04fa
      Peter Zijlstra committed
      Oleg reported that on architectures with
      __ARCH_WANT_INTERRUPTS_ON_CTXSW the IPI from
      task_oncpu_function_call() can land before perf_event_task_sched_in()
      and cause interesting situations for eg. perf_install_in_context().
      
      This patch reworks the task_oncpu_function_call() interface to give a
      more usable primitive as well as rework all its users to hopefully be
      more obvious as well as remove the races.
      
      While looking at the code I also found a number of races against
      perf_event_task_sched_out() which can flip contexts between tasks so
      plug those too.
      Reported-and-reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  16. 27 Jan, 2011 1 commit
  17. 26 Jan, 2011 7 commits
  18. 19 Jan, 2011 1 commit
  19. 18 Jan, 2011 2 commits
  20. 07 Jan, 2011 3 commits
  21. 05 Jan, 2011 2 commits
    • sched: Change wait_for_completion_*_timeout() to return a signed long · 6bf41237
      NeilBrown committed
      wait_for_completion_*_timeout() can return:
      
         0: if the wait timed out
       -ve: if the wait was interrupted
       +ve: if the completion was completed.
      
      As they currently return an 'unsigned long', the last two cases
      are not easily distinguished, which can easily result in buggy
      code, as is the case for the recently added
      wait_for_completion_interruptible_timeout() call in
      net/sunrpc/cache.c.
      
      So change them both to return 'long'.  As MAX_SCHEDULE_TIMEOUT
      is LONG_MAX, a large +ve return value should never overflow.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: J. Bruce Fields <bfields@fieldses.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <20110105125016.64ccab0e@notabene.brown>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
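Why the signed return type matters can be shown with a short sketch. The names `classify_wait()`, `looks_completed_unsigned()` and `enum wait_result` are hypothetical, introduced only for illustration. The point is that a negative error code stored in an 'unsigned long' passes a naive "greater than zero" test.

```c
#include <assert.h>

/* Hypothetical classification of the three outcomes listed in the commit. */
enum wait_result {
    WAIT_TIMED_OUT,
    WAIT_INTERRUPTED,
    WAIT_COMPLETED
};

/* With a signed 'long', the three cases are trivially distinguishable. */
static enum wait_result classify_wait(long ret)
{
    if (ret == 0)
        return WAIT_TIMED_OUT;
    if (ret < 0)
        return WAIT_INTERRUPTED;
    return WAIT_COMPLETED;
}

/*
 * With an 'unsigned long', a negative error code wraps around to a huge
 * positive number, so this naive check mistakes an interruption for a
 * successful completion.
 */
static int looks_completed_unsigned(unsigned long ret)
{
    return ret > 0;
}
```

For example, passing a negative value such as -512 through `unsigned long` makes `looks_completed_unsigned()` report success, while `classify_wait()` on the signed value reports an interruption.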
    • [S390] mutex: Introduce arch_mutex_cpu_relax() · 34b133f8
      Gerald Schaefer committed
      The spinning mutex implementation uses cpu_relax() in busy loops as a
      compiler barrier. Depending on the architecture, cpu_relax() may do more
      than needed in these specific mutex spin loops. On System z we also give
      up the time slice of the virtual cpu in cpu_relax(), which prevents
      effective spinning on the mutex.
      
      This patch replaces cpu_relax() in the spinning mutex code with
      arch_mutex_cpu_relax(), which can be defined by each architecture that
      selects HAVE_ARCH_MUTEX_CPU_RELAX. The default is still cpu_relax(), so
      this patch should not affect other architectures than System z for now.
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1290437256.7455.4.camel@thinkpad>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
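The "default falls back to cpu_relax()" pattern can be sketched with a preprocessor fallback. This is a user-space illustration under stated assumptions: the `cpu_relax()` body, the `relax_calls` counter and `spin_while_locked()` are hypothetical stand-ins, and the exact kernel header layout is not reproduced here.

```c
#include <assert.h>

/* Counts relax calls so the sketch has an observable effect (illustration only). */
static int relax_calls;

/* Stand-in for the architecture's cpu_relax(); the real one is arch-specific. */
static void cpu_relax(void)
{
    relax_calls++;
}

/*
 * An architecture that wants a cheaper relax in mutex spin loops would
 * define arch_mutex_cpu_relax() itself; everyone else gets cpu_relax().
 */
#ifndef arch_mutex_cpu_relax
#define arch_mutex_cpu_relax() cpu_relax()
#endif

/* Spin while the lock looks held, up to max_spins; returns the spin count. */
static int spin_while_locked(const volatile int *locked, int max_spins)
{
    int spins = 0;

    while (*locked && spins < max_spins) {
        arch_mutex_cpu_relax();
        spins++;
    }
    return spins;
}
```

Keeping the fallback in one place means the spin loop itself never needs to know which architecture it runs on.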
  22. 04 Jan, 2011 1 commit
  23. 20 Dec, 2010 1 commit
  24. 16 Dec, 2010 3 commits