1. 21 Jul 2011, 1 commit
  2. 05 Jul 2011, 1 commit
  3. 31 May 2011, 1 commit
  4. 30 May 2011, 1 commit
    • mm: Fix boot crash in mm_alloc() · 6345d24d
      Authored by Linus Torvalds
      Thomas Gleixner reports that we now have a boot crash triggered by
      CONFIG_CPUMASK_OFFSTACK=y:
      
          BUG: unable to handle kernel NULL pointer dereference at   (null)
          IP: [<c11ae035>] find_next_bit+0x55/0xb0
          Call Trace:
           [<c11addda>] cpumask_any_but+0x2a/0x70
           [<c102396b>] flush_tlb_mm+0x2b/0x80
           [<c1022705>] pud_populate+0x35/0x50
           [<c10227ba>] pgd_alloc+0x9a/0xf0
           [<c103a3fc>] mm_init+0xec/0x120
           [<c103a7a3>] mm_alloc+0x53/0xd0
      
      which was introduced by commit de03c72c ("mm: convert
      mm->cpu_vm_cpumask into cpumask_var_t"), and is due to the wrong
      ordering of mm_init() vs mm_init_cpumask().
      
      Thomas wrote a patch to just fix the ordering of initialization, but I
      hate the new double allocation in the fork path, so I ended up instead
      doing some more radical surgery to clean it all up.
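
      A minimal sketch of the intended ordering (not the verbatim upstream
      diff; allocate_mm() is assumed here as the usual slab helper): the
      per-mm cpumask has to be initialized before mm_init() reaches
      pgd_alloc() and flush_tlb_mm(), which read it.

       struct mm_struct *mm_alloc(void)
       {
               struct mm_struct *mm;

               mm = allocate_mm();
               if (!mm)
                       return NULL;

               memset(mm, 0, sizeof(*mm));
               mm_init_cpumask(mm);          /* set up cpu_vm_mask first ...          */
               return mm_init(mm, current);  /* ... before pgd_alloc()/flush_tlb_mm() */
       }
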
      Reported-by: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6345d24d
  5. 28 May 2011, 1 commit
  6. 27 May 2011, 1 commit
  7. 26 May 2011, 1 commit
    • ftrace: Add internal recursive checks · b1cff0ad
      Authored by Steven Rostedt
      Witold reported a reboot caused by the selftests of the dynamic function
      tracer. He sent me a config and I used ktest to do a config_bisect on it
      (as my config did not cause the crash). It pointed out that the problem
      config was CONFIG_PROVE_RCU.
      
      What happened was that if multiple callbacks are attached to the
      function tracer, we iterate a list of callbacks. Because the list is
      managed by synchronize_sched() and preempt_disable, the access to the
      pointers uses rcu_dereference_raw().
      
      When PROVE_RCU is enabled, the rcu_dereference_raw() calls some
      debugging functions, which happen to be traced. The tracing of the debug
      function would then call rcu_dereference_raw() which would then call the
      debug function and then... well you get the idea.
      
      I first wrote two different patches to solve this bug.
      
      1) add a __rcu_dereference_raw() that would not do any checks.
      2) add notrace to the offending debug functions.
      
      Both of these patches worked.
      
      Talking with Paul McKenney on IRC, he suggested to add recursion
      detection instead. This seemed to be a better solution, so I decided to
      implement it. As the task_struct already has a trace_recursion field to
      detect recursion in the ring buffer, and the number it allows is very
      small, I decided to use that same variable to add flags that can detect
      the recursion inside the infrastructure of the function tracer.
      
      I plan to change it so that the task struct bit can be checked in
      mcount, but as that requires changes to all archs, I will hold that off
      to the next merge window.
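
      To make the guard concrete, here is a small self-contained user-space
      model of the idea (the bit value and names are illustrative, not the
      upstream macros): a per-task flag is set while the trace callback list
      is being walked, and any re-entry returns immediately instead of
      recursing.

       #include <stdio.h>

       static __thread unsigned long trace_recursion;  /* stands in for the task_struct field */
       #define TRACE_INTERNAL_BIT (1UL << 11)           /* assumed bit position */

       static void debug_helper(void);

       static void trace_function(const char *name)
       {
               if (trace_recursion & TRACE_INTERNAL_BIT)
                       return;                          /* re-entered: refuse to recurse */
               trace_recursion |= TRACE_INTERNAL_BIT;

               debug_helper();                          /* may itself be "traced" */
               printf("traced %s\n", name);

               trace_recursion &= ~TRACE_INTERNAL_BIT;
       }

       static void debug_helper(void)
       {
               /* in the real bug, the RCU debug helper called back into the tracer */
               trace_function("debug_helper");
       }

       int main(void)
       {
               trace_function("main");
               return 0;
       }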
      
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1306348063.1465.116.camel@gandalf.stny.rr.com
      Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      b1cff0ad
  8. 25 May 2011, 2 commits
    • mm: convert mm->cpu_vm_cpumask into cpumask_var_t · de03c72c
      Authored by KOSAKI Motohiro
      cpumask_t is a very big struct and cpu_vm_mask is placed in the wrong
      position. This might reduce the cache hit ratio.

      This patch makes two changes:
      1) Move the cpumask to the end of mm_struct, because usually only the
         front bits of the cpumask are accessed when the system has
         cpu-hotplug capability.
      2) Convert cpu_vm_mask into cpumask_var_t. This may help reduce the
         memory footprint if cpumask_size() uses nr_cpumask_bits properly in
         the future.

      In addition, this patch renames cpu_vm_mask to cpu_vm_mask_var, which
      may help detect out-of-tree cpu_vm_mask users.
      
      This patch has no functional change.
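
      For readers not familiar with cpumask_var_t, a hedged kernel-style
      sketch of what the conversion implies for initialization (the helper
      below is illustrative and not part of this patch): with
      CONFIG_CPUMASK_OFFSTACK the field is a pointer whose storage must be
      allocated, otherwise it is an embedded bitmap that only needs clearing.

       static inline int mm_cpumask_init_sketch(struct mm_struct *mm)
       {
       #ifdef CONFIG_CPUMASK_OFFSTACK
               /* pointer form: allocate the bitmap storage explicitly */
               if (!zalloc_cpumask_var(&mm->cpu_vm_mask_var, GFP_KERNEL))
                       return -ENOMEM;
       #else
               /* array form: just clear the embedded bitmap */
               cpumask_clear(mm->cpu_vm_mask_var);
       #endif
               return 0;
       }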
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de03c72c
    • oom: replace PF_OOM_ORIGIN with toggling oom_score_adj · 72788c38
      Authored by David Rientjes
      There's a kernel-wide shortage of per-process flags, so it's always
      helpful to trim one when possible without incurring a significant
      penalty.  It's even more important when you're planning on adding a
      per-process flag yourself, which I plan to do shortly for transparent
      hugepages.
      
      PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a
      tendency to allocate large amounts of memory and should be preferred for
      killing over other tasks.  We'd rather immediately kill the task making
      the errant syscall than penalize an innocent task.
      
      This patch removes PF_OOM_ORIGIN since its behavior is equivalent to
      setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX.
      
      The process's old oom_score_adj is stored and then set to
      OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN.  The old
      value is then reinstated when the process should no longer be considered a
      high priority for oom killing.
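
      The toggle amounts to a save/restore pattern roughly like the following
      sketch (function names are illustrative; the oom_score_adj placement in
      signal_struct is assumed from contemporary kernels):

       static void mark_oom_origin(struct task_struct *p, int *saved_adj)
       {
               /* remember the old value, then make p the preferred OOM victim */
               *saved_adj = p->signal->oom_score_adj;
               p->signal->oom_score_adj = OOM_SCORE_ADJ_MAX;
       }

       static void clear_oom_origin(struct task_struct *p, int saved_adj)
       {
               /* reinstate the old value once p no longer deserves priority */
               p->signal->oom_score_adj = saved_adj;
       }
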
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72788c38
  9. 23 May 2011, 1 commit
  10. 20 May 2011, 2 commits
    • sched: Increase SCHED_LOAD_SCALE resolution · c8b28116
      Authored by Nikhil Rao
      Introduce SCHED_LOAD_RESOLUTION, which is added to
      SCHED_LOAD_SHIFT and increases the resolution of
      SCHED_LOAD_SCALE. This patch sets the value of
      SCHED_LOAD_RESOLUTION to 10, scaling up the weights for all
      sched entities by a factor of 1024. With this extra resolution,
      we can handle deeper cgroup hierarchies and the scheduler can do
      better shares distribution and load balancing on larger
      systems (especially for low weight task groups).
      
      This does not change the existing user interface, the scaled
      weights are only used internally. We do not modify
      prio_to_weight values or inverses, but use the original weights
      when calculating the inverse which is used to scale execution
      time delta in calc_delta_mine(). This ensures we do not lose
      accuracy when accounting time to the sched entities. Thanks to
      Nikunj Dadhania for fixing a bug in c_d_m() that broke fairness.
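
      The scaling itself boils down to a pair of shifts, sketched below as a
      small standalone program (macro names follow the description above and
      are not copied from the tree):

       #include <stdio.h>

       #define SCHED_LOAD_RESOLUTION 10
       #define scale_load(w)      ((unsigned long)(w) << SCHED_LOAD_RESOLUTION)
       #define scale_load_down(w) ((unsigned long)(w) >> SCHED_LOAD_RESOLUTION)

       int main(void)
       {
               unsigned long nice0 = 1024;                /* weight of a nice-0 task */
               unsigned long internal = scale_load(nice0);

               /* 1024 is used internally as 1048576; user-visible paths scale back */
               printf("internal=%lu user-visible=%lu\n",
                      internal, scale_load_down(internal));
               return 0;
       }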
      
      Below is some analysis of the performance costs/improvements of
      this patch.
      
      1. Micro-arch performance costs:
      
      Experiment was to run Ingo's pipe_test_100k 200 times with the
      task pinned to one cpu. I measured instruction, cycles and
      stalled-cycles for the runs. See:
      
         http://thread.gmane.org/gmane.linux.kernel/1129232/focus=1129389
      
      for more info.
      
      -tip (baseline):
      
       Performance counter stats for '/root/load-scale/pipe-test-100k' (200 runs):
      
             964,991,769 instructions             #    0.82  insns per cycle
                                                  #    0.33  stalled cycles per insn
                                                  #    ( +-  0.05% )
           1,171,186,635 cycles                   #    0.000 GHz                      ( +-  0.08% )
             306,373,664 stalled-cycles-backend   #   26.16% backend  cycles idle     ( +-  0.28% )
             314,933,621 stalled-cycles-frontend  #   26.89% frontend cycles idle     ( +-  0.34% )
      
              1.122405684  seconds time elapsed  ( +-  0.05% )
      
      -tip+patches:
      
       Performance counter stats for './load-scale/pipe-test-100k' (200 runs):
      
             963,624,821 instructions             #    0.82  insns per cycle
                                                  #    0.33  stalled cycles per insn
                                                  #    ( +-  0.04% )
           1,175,215,649 cycles                   #    0.000 GHz                      ( +-  0.08% )
             315,321,126 stalled-cycles-backend   #   26.83% backend  cycles idle     ( +-  0.28% )
             316,835,873 stalled-cycles-frontend  #   26.96% frontend cycles idle     ( +-  0.29% )
      
              1.122238659  seconds time elapsed  ( +-  0.06% )
      
      With this patch, instructions decrease by ~0.10% and cycles
      increase by 0.27%. This doesn't look statistically significant.
      The number of stalled cycles in the backend increased from
      26.16% to 26.83%. This can be attributed to the shifts we do in
      c_d_m() and other places. The fraction of stalled cycles in the
      frontend remains about the same, at 26.96% compared to 26.89% in -tip.
      
      2. Balancing low-weight task groups
      
      Test setup: run 50 tasks with random sleep/busy times (biased
      around 100ms) in a low weight container (with cpu.shares = 2).
      Measure %idle as reported by mpstat over a 10s window.
      
      -tip (baseline):
      
      06:47:48 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
      06:47:49 PM  all   94.32    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.62  15888.00
      06:47:50 PM  all   94.57    0.00    0.62    0.00    0.00    0.00    0.00    0.00    4.81  16180.00
      06:47:51 PM  all   94.69    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.25  15966.00
      06:47:52 PM  all   95.81    0.00    0.00    0.00    0.00    0.00    0.00    0.00    4.19  16053.00
      06:47:53 PM  all   94.88    0.06    0.00    0.00    0.00    0.00    0.00    0.00    5.06  15984.00
      06:47:54 PM  all   93.31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    6.69  15806.00
      06:47:55 PM  all   94.19    0.00    0.06    0.00    0.00    0.00    0.00    0.00    5.75  15896.00
      06:47:56 PM  all   92.87    0.00    0.00    0.00    0.00    0.00    0.00    0.00    7.13  15716.00
      06:47:57 PM  all   94.88    0.00    0.00    0.00    0.00    0.00    0.00    0.00    5.12  15982.00
      06:47:58 PM  all   95.44    0.00    0.00    0.00    0.00    0.00    0.00    0.00    4.56  16075.00
      Average:     all   94.49    0.01    0.08    0.00    0.00    0.00    0.00    0.00    5.42  15954.60
      
      -tip+patches:
      
      06:47:03 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle    intr/s
      06:47:04 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16630.00
      06:47:05 PM  all   99.69    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.31  16580.20
      06:47:06 PM  all   99.69    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.25  16596.00
      06:47:07 PM  all   99.20    0.00    0.74    0.00    0.00    0.06    0.00    0.00    0.00  17838.61
      06:47:08 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16540.00
      06:47:09 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16575.00
      06:47:10 PM  all  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  16614.00
      06:47:11 PM  all   99.94    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.06  16588.00
      06:47:12 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.00  16593.00
      06:47:13 PM  all   99.94    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.00  16551.00
      Average:     all   99.84    0.00    0.09    0.00    0.00    0.01    0.00    0.00    0.06  16711.58
      
      We see an improvement in idle% on the system (drops from 5.42% on -tip to 0.06%
      with the patches).
      
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
      Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Cc: Stephan Barwolf <stephan.baerwolf@tu-ilmenau.de>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1305754668-18792-1-git-send-email-ncrao@google.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      c8b28116
    • sched: Introduce SCHED_POWER_SCALE to scale cpu_power calculations · 1399fa78
      Authored by Nikhil Rao
      SCHED_LOAD_SCALE is used to increase nice resolution and to
      scale cpu_power calculations in the scheduler. This patch
      introduces SCHED_POWER_SCALE and converts all uses of
      SCHED_LOAD_SCALE for scaling cpu_power to use SCHED_POWER_SCALE
      instead.
      
      This is a preparatory patch for increasing the resolution of
      SCHED_LOAD_SCALE, and there is no need to increase resolution
      for cpu_power calculations.
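
      A short sketch of the split, with the values implied by the text (both
      scales stay at 1024 for now; only the load side is expected to grow
      later):

       /* before: one constant served both nice resolution and cpu_power */
       #define SCHED_LOAD_SHIFT        10
       #define SCHED_LOAD_SCALE        (1L << SCHED_LOAD_SHIFT)

       /* after: cpu_power gets its own unit, still 1024 for now */
       #define SCHED_POWER_SHIFT       10
       #define SCHED_POWER_SCALE       (1L << SCHED_POWER_SHIFT)

       /* a fully available CPU reports cpu_power == SCHED_POWER_SCALE */
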
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Nikunj A. Dadhania <nikunj@linux.vnet.ibm.com>
      Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Cc: Stephan Barwolf <stephan.baerwolf@tu-ilmenau.de>
      Cc: Mike Galbraith <efault@gmx.de>
      Link: http://lkml.kernel.org/r/1305738580-9924-3-git-send-email-ncrao@google.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      1399fa78
  11. 12 May 2011, 1 commit
  12. 25 Apr 2011, 1 commit
    • ptrace: Prepare to fix racy accesses on task breakpoints · bf26c018
      Authored by Frederic Weisbecker
      When a task is traced and is in a stopped state, the tracer
      may execute a ptrace request to examine the tracee state and
      get its task struct. Right after, the tracee can be killed
      and thus its breakpoints released.
      This can happen concurrently when the tracer is in the middle
      of reading or modifying these breakpoints, leading to dereferencing
      a freed pointer.
      
      Hence, to prepare the fix, create a generic breakpoint reference
      holding API. When a reference on the breakpoints of a task is
      held, the breakpoints won't be released until the last reference
      is dropped. After that, no more ptrace request on the task's
      breakpoints can be serviced for the tracer.
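
      A hedged sketch of such a reference-holding API (the refcount field and
      the exact function names are assumptions for illustration): the tracer
      takes a reference before touching the tracee's breakpoints, and the
      hardware breakpoints are only flushed once the last reference is
      dropped.

       int ptrace_get_breakpoints(struct task_struct *tsk)
       {
               if (atomic_inc_not_zero(&tsk->ptrace_bp_refcnt))
                       return 0;
               return -1;                /* breakpoints already being torn down */
       }

       void ptrace_put_breakpoints(struct task_struct *tsk)
       {
               if (atomic_dec_and_test(&tsk->ptrace_bp_refcnt))
                       flush_ptrace_hw_breakpoint(tsk);  /* last ref: safe to release */
       }
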
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Prasad <prasad@linux.vnet.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: v2.6.33.. <stable@kernel.org>
      Link: http://lkml.kernel.org/r/1302284067-7860-2-git-send-email-fweisbec@gmail.com
      bf26c018
  13. 24 Apr 2011, 1 commit
  14. 15 Apr 2011, 1 commit
  15. 14 Apr 2011, 8 commits
  16. 11 Apr 2011, 4 commits
  17. 04 Apr 2011, 1 commit
    • signal: Turn SIGNAL_STOP_DEQUEUED into GROUP_STOP_DEQUEUED · ee77f075
      Authored by Oleg Nesterov
      This patch moves SIGNAL_STOP_DEQUEUED from signal_struct->flags to
      task_struct->group_stop, and thus makes it per-thread.
      
      Like SIGNAL_STOP_DEQUEUED, GROUP_STOP_DEQUEUED can be false-positive
      after return from get_signal_to_deliver(); this is fine. The only
      purpose of this bit is: we can drop ->siglock after __dequeue_signal()
      returns the sig_kernel_stop() signal and before we call
      do_signal_stop(), in this case we must not miss SIGCONT if it comes in
      between.
      
      But, unlike SIGNAL_STOP_DEQUEUED, GROUP_STOP_DEQUEUED can not be
      false-positive in do_signal_stop() if multiple threads dequeue the
      sig_kernel_stop() signal at the same time.
      
      Consider two threads T1 and T2, where SIGTTIN has a handler.
      
      	- T1 dequeues SIGTSTP and sets SIGNAL_STOP_DEQUEUED, then
      	  it drops ->siglock
      
      	- SIGCONT comes and clears SIGNAL_STOP_DEQUEUED, SIGTSTP
      	  should be cancelled.
      
      	- T2 dequeues SIGTTIN and sets SIGNAL_STOP_DEQUEUED again.
      	  Since we have a handler we should not stop, T2 returns
      	  to usermode to run the handler.
      
      	- T1 continues, calls do_signal_stop() and wrongly starts
      	  the group stop because SIGNAL_STOP_DEQUEUED was restored
      	  in between.
      
      With or without this change:
      
      	- we need to do something with ptrace_signal() which can
      	  return SIGSTOP, but this needs another discussion
      
      	- SIGSTOP can be lost if it races with the mt exec, will
      	  be fixed later.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      ee77f075
  18. 31 Mar 2011, 1 commit
  19. 24 Mar 2011, 1 commit
  20. 23 Mar 2011, 5 commits
    • sched.h: Fix a typo ("its") · e815f0a8
      Authored by Jonathan Neuschäfer
      The sentence uses the possessive pronoun, which is spelled
      without an apostrophe.
      Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
      Cc: Jiri Kosina <trivial@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      LKML-Reference: <1300735487-2406-1-git-send-email-j.neuschaefer@gmx.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      e815f0a8
    • ptrace: Clean transitions between TASK_STOPPED and TRACED · d79fdd6d
      Authored by Tejun Heo
      Currently, if the task is STOPPED on ptrace attach, it's left alone
      and the state is silently changed to TRACED on the next ptrace call.
      The behavior breaks the assumption that arch_ptrace_stop() is called
      before any task is poked by ptrace and is ugly in that a task
      manipulates the state of another task directly.
      
      With GROUP_STOP_PENDING, the transitions between TASK_STOPPED and
      TRACED can be made clean.  The tracer can use the flag to tell the
      tracee to retry stop on attach and detach.  On retry, the tracee will
      enter the desired state in the correct way.  The lower 16 bits of
      task->group_stop are used to remember the signal number which caused
      the last group stop.  This is used while retrying for ptrace attach as
      the original group_exit_code could have been consumed with wait(2) by
      then.
      
      As the real parent may wait(2) and consume the group_exit_code
      anytime, the group_exit_code needs to be saved separately so that it
      can be used when switching from regular sleep to ptrace_stop().  This
      is recorded in the lower 16bits of task->group_stop.
      
      If a task is already stopped and there's no intervening SIGCONT, a
      ptrace request immediately following a successful PTRACE_ATTACH should
      always succeed even if the tracer doesn't wait(2) for attach
      completion; however, with this change, the tracee might still be
      TASK_RUNNING trying to enter TASK_TRACED which would cause the
      following request to fail with -ESRCH.
      
      This intermediate state is hidden from the ptracer by setting
      GROUP_STOP_TRAPPING on attach and making ptrace_check_attach() wait
      for it to clear on its signal->wait_chldexit.  Completing the
      transition or getting killed clears TRAPPING and wakes up the tracer.
      
      Note that the STOPPED -> RUNNING -> TRACED transition is still visible
      to other threads which are in the same group as the ptracer and the
      reverse transition is visible to all.  Please read the comments for
      details.
      
      Oleg:
      
      * Spotted a race condition where a task may retry group stop without
        proper bookkeeping.  Fixed by redoing bookkeeping on retry.
      
      * Spotted that the transition is visible to userland in several
        different ways.  Most are fixed with GROUP_STOP_TRAPPING.  Unhandled
        corner case is documented.
      
      * Pointed out not setting GROUP_STOP_SIGMASK on an already stopped
        task would result in more consistent behavior.
      
      * Pointed out that calling ptrace_stop() from do_signal_stop() in
        TASK_STOPPED can race with group stop start logic and then confuse
        the TRAPPING wait in ptrace_check_attach().  ptrace_stop() is now
        called with TASK_RUNNING.
      
      * Suggested using signal->wait_chldexit instead of bit wait.
      
      * Spotted a race condition between TRACED transition and clearing of
        TRAPPING.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      d79fdd6d
    • signal: Use GROUP_STOP_PENDING to stop once for a single group stop · 39efa3ef
      Authored by Tejun Heo
      Currently task->signal->group_stop_count is used to decide whether to
      stop for group stop.  However, if there is a task in the group which
      is taking a long time to stop, other tasks which are continued by
      ptrace would repeatedly stop for the same group stop until the group
      stop is complete.
      
      Conversely, if a ptraced task is in TASK_TRACED state, the debugger
      won't get notified of group stops which is inconsistent compared to
      the ptraced task in any other state.
      
      This patch introduces GROUP_STOP_PENDING which tracks whether a task
      is yet to stop for the group stop in progress.  The flag is set when a
      group stop starts and cleared when the task stops the first time for
      the group stop, and consulted whenever we need to determine whether the
      task should participate in a group stop.  Note that now
      tasks in TASK_TRACED also participate in group stop.
      
      This results in the following behavior changes.
      
      * For a single group stop, a ptracer would see at most one stop
        reported.
      
      * A ptracee in TASK_TRACED now also participates in group stop and the
        tracer would get the notification.  However, as a ptraced task could
        be in TASK_STOPPED state or any ptrace trap could consume group
        stop, the notification may still be missing.  These will be
        addressed with further patches.
      
      * A ptracee may start a group stop while one is still in progress if
        the tracer let it continue with stop signal delivery.  Group stop
        code handles this correctly.
      
      Oleg:
      
      * Spotted that a task might skip signal check even when its
        GROUP_STOP_PENDING is set.  Fixed by updating
        recalc_sigpending_tsk() to check GROUP_STOP_PENDING instead of
        group_stop_count.
      
      * Pointed out that task->group_stop should be cleared whenever
        task->signal->group_stop_count is cleared.  Fixed accordingly.
      
      * Pointed out the behavior inconsistency between TASK_TRACED and
        RUNNING and the last behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      39efa3ef
    • signal: Fix premature completion of group stop when interfered by ptrace · e5c1902e
      Authored by Tejun Heo
      task->signal->group_stop_count is used to track the progress of group
      stop.  It's initialized to the number of tasks which need to stop for
      group stop to finish, and each stopping or trapping task decrements it.
      However, each task doesn't keep track of whether it decremented the
      counter or not and if woken up before the group stop is complete and
      stops again, it can decrement the counter multiple times.
      
      Please consider the following example code.
      
       #include <pthread.h>
       #include <signal.h>
       #include <sys/ptrace.h>
       #include <sys/types.h>
       #include <sys/wait.h>
       #include <unistd.h>

       static void *worker(void *arg)
       {
      	 while (1) ;
      	 return NULL;
       }
      
       int main(void)
       {
      	 pthread_t thread;
      	 pid_t pid;
      	 int i;
      
      	 pid = fork();
      	 if (!pid) {
      		 for (i = 0; i < 5; i++)
      			 pthread_create(&thread, NULL, worker, NULL);
      		 while (1) ;
      		 return 0;
      	 }
      
      	 ptrace(PTRACE_ATTACH, pid, NULL, NULL);
      	 while (1) {
      		 waitid(P_PID, pid, NULL, WSTOPPED);
      		 ptrace(PTRACE_SINGLESTEP, pid, NULL, (void *)(long)SIGSTOP);
      	 }
      	 return 0;
       }
      
      The child creates five threads and the parent continuously traps the
      first thread and whenever the child gets a signal, SIGSTOP is
      delivered.  If an external process sends SIGSTOP to the child, all
      other threads in the process should reliably stop.  However, due to
      the above bug, the first thread will often end up consuming
      group_stop_count multiple times and SIGSTOP often ends up stopping
      none or part of the other four threads.
      
      This patch adds a new field task->group_stop which is protected by
      siglock and uses GROUP_STOP_CONSUME flag to track which task is still
      to consume group_stop_count to fix this bug.
      
      task_clear_group_stop_pending() and task_participate_group_stop() are
      added to help manipulating group stop states.  As ptrace_stop() now
      also uses task_participate_group_stop(), it will set
      SIGNAL_STOP_STOPPED if it completes a group stop.
      
      There still are many issues regarding the interaction between group
      stop and ptrace.  Patches to address them will follow.
      
      - Oleg spotted duplicate GROUP_STOP_CONSUME.  Dropped.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      e5c1902e
    • kthread: NUMA aware kthread_create_on_node() · 207205a2
      Authored by Eric Dumazet
      Since all kthreads are created from a single helper task, they all use
      memory from a single node for their kernel stack and task struct.

      This patch suite creates kthread_create_on_node(), adding a 'node'
      parameter to the parameters already used by kthread_create().

      This parameter is used to allocate memory for the new kthread on its
      memory node if possible.
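
      A brief usage sketch (my_thread_fn and cpu are placeholder names):
      callers pass the NUMA node they expect the kthread to serve, or -1 for
      no preference.

       struct task_struct *t;

       t = kthread_create_on_node(my_thread_fn, NULL,    /* threadfn, data      */
                                  cpu_to_node(cpu),      /* preferred NUMA node */
                                  "myworker/%d", cpu);   /* comm format         */
       if (!IS_ERR(t))
               wake_up_process(t);
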
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      207205a2
  21. 10 Mar 2011, 1 commit
    • block: initial patch for on-stack per-task plugging · 73c10101
      Authored by Jens Axboe
      This patch adds support for creating a queuing context outside
      of the queue itself. This enables us to batch up pieces of IO
      before grabbing the block device queue lock and submitting them to
      the IO scheduler.
      
      The context is created on the stack of the process and assigned in
      the task structure, so that we can auto-unplug it if we hit a schedule
      event.
      
      The current queue plugging happens implicitly if IO is submitted to
      an empty device, yet callers have to remember to unplug that IO when
      they are going to wait for it. This is an ugly API and has caused bugs
      in the past. Additionally, it requires hacks in the vm (->sync_page()
      callback) to handle that logic. By switching to an explicit plugging
      scheme we make the API a lot nicer and can get rid of the ->sync_page()
      hack in the vm.
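
      A minimal usage sketch of the explicit plugging API this series
      introduces (IO submission elided): the plug lives on the caller's stack
      and is flushed explicitly, or automatically if the task schedules.

       struct blk_plug plug;

       blk_start_plug(&plug);           /* open an on-stack queuing context   */
       /* ... submit a batch of IO here ... */
       blk_finish_plug(&plug);          /* hand the batch to the IO scheduler */
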
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      73c10101
  22. 17 Feb 2011, 1 commit
  23. 03 Feb 2011, 2 commits
    • sched: Add yield_to(task, preempt) functionality · d95f4122
      Authored by Mike Galbraith
      Currently only implemented for fair class tasks.
      
      Add a yield_to_task() method to the fair scheduling class, allowing the
      caller of yield_to() to accelerate another thread in its thread group /
      task group.
      
      Implemented via a scheduler hint, using cfs_rq->next to encourage the
      target to be selected.  We can rely on pick_next_entity to keep things
      fair, so no one can accelerate a thread that has already used its fair
      share of CPU time.
      
      This also means callers should only call yield_to when they really
      mean it.  Calling it too often can result in the scheduler just
      ignoring the hint.
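
      A hedged usage sketch (target_task is a placeholder, and the boolean
      return value indicating whether the yield took effect is assumed):

       bool yielded;

       /* hint the scheduler: run target_task next, preempting if necessary */
       yielded = yield_to(target_task, true);
       if (!yielded)
               pr_debug("yield_to: hint was not honoured\n");
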
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201095051.4ddb7738@annuminas.surriel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d95f4122
    • sched: Use a buddy to implement yield_task_fair() · ac53db59
      Authored by Rik van Riel
      Use the buddy mechanism to implement yield_task_fair.  This
      allows us to skip onto the next highest priority se at every
      level in the CFS tree, unless doing so would introduce gross
      unfairness in CPU time distribution.
      
      We order the buddy selection in pick_next_entity to check
      yield first, then last, then next.  We need next to be able
      to override yield, because it is possible for the "next" and
      "yield" task to be different processes in the same sub-tree
      of the CFS tree.  When they are, we need to go into that
      sub-tree regardless of the "yield" hint, and pick the correct
      entity once we get to the right level.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201095103.3a79e92a@annuminas.surriel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      ac53db59