1. 04 Apr, 2011 (1 commit)
    • signal: Turn SIGNAL_STOP_DEQUEUED into GROUP_STOP_DEQUEUED · ee77f075
      Committed by Oleg Nesterov
      This patch moves SIGNAL_STOP_DEQUEUED from signal_struct->flags to
      task_struct->group_stop, and thus makes it per-thread.
      
      Like SIGNAL_STOP_DEQUEUED, GROUP_STOP_DEQUEUED can be a false positive
      after return from get_signal_to_deliver(); this is fine. The only
      purpose of this bit is that we can drop ->siglock after __dequeue_signal()
      returns a sig_kernel_stop() signal and before we call
      do_signal_stop(); in this window we must not miss a SIGCONT that comes
      in between.
      
      But, unlike SIGNAL_STOP_DEQUEUED, GROUP_STOP_DEQUEUED cannot be a
      false positive in do_signal_stop() when multiple threads dequeue a
      sig_kernel_stop() signal at the same time.
      
      Consider two threads, T1 and T2, where SIGTTIN has a handler.
      
      	- T1 dequeues SIGTSTP and sets SIGNAL_STOP_DEQUEUED, then
      	  it drops ->siglock
      
      	- SIGCONT arrives and clears SIGNAL_STOP_DEQUEUED; the pending
      	  SIGTSTP stop should be cancelled.
      
      	- T2 dequeues SIGTTIN and sets SIGNAL_STOP_DEQUEUED again.
      	  Since we have a handler we should not stop, T2 returns
      	  to usermode to run the handler.
      
      	- T1 continues, calls do_signal_stop() and wrongly starts
      	  the group stop because SIGNAL_STOP_DEQUEUED was restored
      	  in between.
      
      With or without this change:
      
      	- we need to do something with ptrace_signal() which can
      	  return SIGSTOP, but this needs another discussion
      
      	- SIGSTOP can be lost if it races with the mt exec, will
      	  be fixed later.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
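
      For illustration (not part of the original commit), here is a minimal
      userspace sketch of the rule this flag protects: a SIGCONT arriving after
      a stop signal has been dequeued but before the stop completes must cancel
      the stop.  The program only replays the signal sequence; it cannot
      deterministically trigger the in-kernel race described above.

       #include <signal.h>
       #include <stdio.h>
       #include <sys/types.h>
       #include <sys/wait.h>
       #include <unistd.h>

       int main(void)
       {
               int status;
               pid_t pid = fork();

               if (pid == 0) {         /* child: sit in pause() so stop signals act on it */
                       for (;;)
                               pause();
               }

               sleep(1);               /* crude: let the child reach pause() */
               kill(pid, SIGTSTP);     /* request a group stop ...           */
               kill(pid, SIGCONT);     /* ... and cancel it right away       */
               sleep(1);

               if (waitpid(pid, &status, WUNTRACED | WCONTINUED | WNOHANG) == pid) {
                       if (WIFSTOPPED(status))
                               printf("child stopped by signal %d\n", WSTOPSIG(status));
                       else if (WIFCONTINUED(status))
                               printf("child stopped, then continued\n");
               } else {
                       printf("no stop/continue event observed\n");
               }
               kill(pid, SIGKILL);
               return 0;
       }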
  2. 23 Mar, 2011 (4 commits)
    • ptrace: Clean transitions between TASK_STOPPED and TRACED · d79fdd6d
      Committed by Tejun Heo
      Currently, if the task is STOPPED on ptrace attach, it's left alone
      and the state is silently changed to TRACED on the next ptrace call.
      The behavior breaks the assumption that arch_ptrace_stop() is called
      before any task is poked by ptrace and is ugly in that a task
      manipulates the state of another task directly.
      
      With GROUP_STOP_PENDING, the transitions between TASK_STOPPED and
      TRACED can be made clean.  The tracer can use the flag to tell the
      tracee to retry stop on attach and detach.  On retry, the tracee will
      enter the desired state in the correct way.  The lower 16 bits of
      task->group_stop are used to remember the signal number which caused
      the last group stop.  This is used while retrying for ptrace attach as
      the original group_exit_code could have been consumed with wait(2) by
      then.
      
      As the real parent may wait(2) and consume the group_exit_code
      anytime, the group_exit_code needs to be saved separately so that it
      can be used when switching from regular sleep to ptrace_stop().  This
      is recorded in the lower 16 bits of task->group_stop.
      
      If a task is already stopped and there's no intervening SIGCONT, a
      ptrace request immediately following a successful PTRACE_ATTACH should
      always succeed even if the tracer doesn't wait(2) for attach
      completion; however, with this change, the tracee might still be
      TASK_RUNNING trying to enter TASK_TRACED which would cause the
      following request to fail with -ESRCH.
      
      This intermediate state is hidden from the ptracer by setting
      GROUP_STOP_TRAPPING on attach and making ptrace_check_attach() wait
      for it to clear on its signal->wait_chldexit.  Completing the
      transition or getting killed clears TRAPPING and wakes up the tracer.
      
      Note that the STOPPED -> RUNNING -> TRACED transition is still visible
      to other threads which are in the same group as the ptracer and the
      reverse transition is visible to all.  Please read the comments for
      details.
      
      Oleg:
      
      * Spotted a race condition where a task may retry group stop without
        proper bookkeeping.  Fixed by redoing bookkeeping on retry.
      
      * Spotted that the transition is visible to userland in several
        different ways.  Most are fixed with GROUP_STOP_TRAPPING.  Unhandled
        corner case is documented.
      
      * Pointed out not setting GROUP_STOP_SIGMASK on an already stopped
        task would result in more consistent behavior.
      
      * Pointed out that calling ptrace_stop() from do_signal_stop() in
        TASK_STOPPED can race with group stop start logic and then confuse
        the TRAPPING wait in ptrace_check_attach().  ptrace_stop() is now
        called with TASK_RUNNING.
      
      * Suggested using signal->wait_chldexit instead of bit wait.
      
      * Spotted a race condition between TRACED transition and clearing of
        TRAPPING.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
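
      A userspace sketch (mine, not from the commit) of the guarantee above:
      with the tracee already stopped and no SIGCONT in between, a request
      issued immediately after PTRACE_ATTACH, without an intervening wait(2),
      is expected to succeed rather than return -ESRCH.

       #include <signal.h>
       #include <stdio.h>
       #include <sys/ptrace.h>
       #include <sys/types.h>
       #include <sys/wait.h>
       #include <unistd.h>

       int main(void)
       {
               pid_t pid = fork();

               if (pid == 0) {
                       raise(SIGSTOP);                 /* child stops itself (group stop) */
                       _exit(0);
               }
               waitpid(pid, NULL, WUNTRACED);          /* child is now in TASK_STOPPED */

               if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1) {
                       perror("PTRACE_ATTACH");
                       return 1;
               }
               /* No wait(2) between ATTACH and the next request: ptrace_check_attach()
                * now waits for GROUP_STOP_TRAPPING to clear, so the STOPPED -> TRACED
                * retry is hidden and this request should not fail with -ESRCH. */
               if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
                          (void *)(long)PTRACE_O_TRACESYSGOOD) == -1)
                       perror("PTRACE_SETOPTIONS");
               else
                       printf("request right after attach succeeded\n");

               kill(pid, SIGKILL);
               return 0;
       }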
    • signal: Use GROUP_STOP_PENDING to stop once for a single group stop · 39efa3ef
      Committed by Tejun Heo
      Currently task->signal->group_stop_count is used to decide whether to
      stop for group stop.  However, if there is a task in the group which
      is taking a long time to stop, other tasks which are continued by
      ptrace would repeatedly stop for the same group stop until the group
      stop is complete.
      
      Conversely, if a ptraced task is in TASK_TRACED state, the debugger
      won't get notified of group stops which is inconsistent compared to
      the ptraced task in any other state.
      
      This patch introduces GROUP_STOP_PENDING, which tracks whether a task
      has yet to stop for the group stop in progress.  The flag is set when a
      group stop starts, cleared when the task stops for the first time for
      that group stop, and consulted whenever it needs to be determined
      whether the task should participate in a group stop.  Note that tasks
      in TASK_TRACED now also participate in group stop.
      
      This results in the following behavior changes.
      
      * For a single group stop, a ptracer would see at most one stop
        reported.
      
      * A ptracee in TASK_TRACED now also participates in group stop and the
        tracer would get the notification.  However, as a ptraced task could
        be in TASK_STOPPED state or any ptrace trap could consume group
        stop, the notification may still be missing.  These will be
        addressed with further patches.
      
      * A ptracee may start a group stop while one is still in progress if
        the tracer lets it continue with stop signal delivery.  Group stop
        code handles this correctly.
      
      Oleg:
      
      * Spotted that a task might skip signal check even when its
        GROUP_STOP_PENDING is set.  Fixed by updating
        recalc_sigpending_tsk() to check GROUP_STOP_PENDING instead of
        group_stop_count.
      
      * Pointed out that task->group_stop should be cleared whenever
        task->signal->group_stop_count is cleared.  Fixed accordingly.
      
      * Pointed out the behavior inconsistency between TASK_TRACED and
        RUNNING and the last behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
    • signal: Fix premature completion of group stop when interfered by ptrace · e5c1902e
      Committed by Tejun Heo
      task->signal->group_stop_count is used to track the progress of group
      stop.  It's initialized to the number of tasks which need to stop for
      the group stop to finish, and each stopping or trapping task decrements
      it.  However, a task doesn't keep track of whether it has already
      decremented the counter, so if it is woken up before the group stop is
      complete and stops again, it can decrement the counter multiple times.
      
      Please consider the following example code.
      
       #include <pthread.h>
       #include <signal.h>
       #include <sys/ptrace.h>
       #include <sys/types.h>
       #include <sys/wait.h>
       #include <unistd.h>

       /* busy-loop worker so the threads never stop on their own */
       static void *worker(void *arg)
       {
               while (1) ;
               return NULL;
       }

       int main(void)
       {
               pthread_t thread;
               pid_t pid;
               int i;

               pid = fork();
               if (!pid) {
                       /* child: five busy threads plus the spinning main thread */
                       for (i = 0; i < 5; i++)
                               pthread_create(&thread, NULL, worker, NULL);
                       while (1) ;
                       return 0;
               }

               /* parent: trap the first thread and keep single-stepping it,
                * re-delivering SIGSTOP on every trap */
               ptrace(PTRACE_ATTACH, pid, NULL, NULL);
               while (1) {
                       waitid(P_PID, pid, NULL, WSTOPPED);
                       ptrace(PTRACE_SINGLESTEP, pid, NULL, (void *)(long)SIGSTOP);
               }
               return 0;
       }
      
      The child creates five threads and the parent continuously traps the
      first thread and whenever the child gets a signal, SIGSTOP is
      delivered.  If an external process sends SIGSTOP to the child, all
      other threads in the process should reliably stop.  However, due to
      the above bug, the first thread often ends up consuming
      group_stop_count multiple times, and SIGSTOP often ends up stopping
      none or only some of the other four threads.
      
      This patch adds a new field, task->group_stop, which is protected by
      siglock and uses the GROUP_STOP_CONSUME flag to track which tasks still
      need to consume group_stop_count, fixing this bug.

      task_clear_group_stop_pending() and task_participate_group_stop() are
      added to help manipulate group stop states.  As ptrace_stop() now
      also uses task_participate_group_stop(), it will set
      SIGNAL_STOP_STOPPED if it completes a group stop.
      
      There still are many issues regarding the interaction between group
      stop and ptrace.  Patches to address them will follow.
      
      - Oleg spotted duplicate GROUP_STOP_CONSUME.  Dropped.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
    • kthread: NUMA aware kthread_create_on_node() · 207205a2
      Committed by Eric Dumazet
      Because all kthreads are created from a single helper task, they all use
      memory from a single node for their kernel stack and task struct.

      This patch series adds kthread_create_on_node(), which takes a 'node'
      parameter in addition to the parameters already used by kthread_create().

      This parameter is used to allocate the memory for the new kthread on
      that memory node if possible.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
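
      An in-kernel sketch (not a standalone program; the worker function and
      thread name are hypothetical) of how a per-CPU kthread could be created
      node-locally with the new helper:

       #include <linux/err.h>
       #include <linux/kthread.h>
       #include <linux/sched.h>
       #include <linux/topology.h>

       /* example_worker and the "example/%d" name are made up for illustration. */
       static int example_worker(void *data)
       {
               while (!kthread_should_stop()) {
                       set_current_state(TASK_INTERRUPTIBLE);
                       schedule();
               }
               __set_current_state(TASK_RUNNING);
               return 0;
       }

       static struct task_struct *start_worker_on(int cpu)
       {
               struct task_struct *t;

               /* Allocate the kthread's stack and task_struct from the memory
                * node that backs 'cpu', then bind and start it. */
               t = kthread_create_on_node(example_worker, NULL,
                                          cpu_to_node(cpu), "example/%d", cpu);
               if (!IS_ERR(t)) {
                       kthread_bind(t, cpu);
                       wake_up_process(t);
               }
               return t;
       }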
  3. 17 Feb, 2011 (1 commit)
  4. 03 Feb, 2011 (3 commits)
    • sched: Add yield_to(task, preempt) functionality · d95f4122
      Committed by Mike Galbraith
      Currently only implemented for fair class tasks.
      
      Add a yield_to_task() method to the fair scheduling class, allowing the
      caller of yield_to() to accelerate another thread in its thread group /
      task group.
      
      Implemented via a scheduler hint, using cfs_rq->next to encourage the
      target being selected.  We can rely on pick_next_entity to keep things
      fair, so no one can accelerate a thread that has already used its fair
      share of CPU time.
      
      This also means callers should only call yield_to when they really
      mean it.  Calling it too often can result in the scheduler just
      ignoring the hint.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201095051.4ddb7738@annuminas.surriel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
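
      A brief in-kernel sketch of a caller (the 'target' pointer and the
      surrounding locking are assumed to be handled elsewhere); this is
      illustrative only, not code from the patch:

       #include <linux/sched.h>

       /* Hint the scheduler to run 'target' next on its runqueue; pick_next_entity()
        * still bounds the effect, so a task that already used its fair share of
        * CPU time cannot be accelerated this way. */
       static void boost_presumed_lock_holder(struct task_struct *target)
       {
               yield_to(target, true);         /* preempt: kick the target's CPU */
       }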
    • sched: Use a buddy to implement yield_task_fair() · ac53db59
      Committed by Rik van Riel
      Use the buddy mechanism to implement yield_task_fair.  This
      allows us to skip onto the next highest priority se at every
      level in the CFS tree, unless doing so would introduce gross
      unfairness in CPU time distribution.
      
      We order the buddy selection in pick_next_entity to check
      yield first, then last, then next.  We need next to be able
      to override yield, because it is possible for the "next" and
      "yield" task to be different processes in the same sub-tree
      of the CFS tree.  When they are, we need to go into that
      sub-tree regardless of the "yield" hint, and pick the correct
      entity once we get to the right level.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201095103.3a79e92a@annuminas.surriel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf: Cure task_oncpu_function_call() races · fe4b04fa
      Committed by Peter Zijlstra
      Oleg reported that on architectures with
      __ARCH_WANT_INTERRUPTS_ON_CTXSW the IPI from
      task_oncpu_function_call() can land before perf_event_task_sched_in()
      and cause interesting situations for e.g. perf_install_in_context().
      
      This patch reworks the task_oncpu_function_call() interface to give a
      more usable primitive as well as rework all its users to hopefully be
      more obvious as well as remove the races.
      
      While looking at the code I also found a number of races against
      perf_event_task_sched_out() which can flip contexts between tasks so
      plug those too.
      Reported-and-reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  5. 01 Feb, 2011 (1 commit)
  6. 31 Jan, 2011 (1 commit)
    • time: Provide xtime_update() · f0af911a
      Committed by Torben Hohn
      xtime_update() takes xtime_lock write-locked and calls do_timer().
      It is provided to replace the do_timer() calls in the architecture
      code.
      Signed-off-by: Torben Hohn <torbenh@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: johnstul@us.ibm.com
      Cc: yong.zhang0@gmail.com
      Cc: hch@infradead.org
      LKML-Reference: <20110127145910.23248.21379.stgit@localhost>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
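
      A hedged before/after sketch of the conversion this enables inside an
      architecture's timer interrupt handler (the exact surrounding code
      varies per arch):

       /* before: the arch timer interrupt took the lock itself */
       write_seqlock(&xtime_lock);
       do_timer(1);
       write_sequnlock(&xtime_lock);

       /* after: xtime_update() takes xtime_lock write-locked internally */
       xtime_update(1);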
  7. 26 Jan, 2011 (2 commits)
  8. 14 Jan, 2011 (4 commits)
    • thp: khugepaged · ba76149f
      Committed by Andrea Arcangeli
      Add khugepaged to relocate fragmented pages into hugepages if new
      hugepages become available.  (This is independent of the defrag logic
      that will have to make new hugepages available.)
      
      The fundamental reason why khugepaged is unavoidable, is that some memory
      can be fragmented and not everything can be relocated.  So when a virtual
      machine quits and releases gigabytes of hugepages, we want to use those
      freely available hugepages to create huge-pmd in the other virtual
      machines that may be running on fragmented memory, to maximize the CPU
      efficiency at all times.  The scan is slow; it takes nearly zero CPU
      time, except when it copies data (in which case we definitely want to
      pay for that CPU time), so it seems a good tradeoff.
      
      In addition to the hugepages being released by other processes releasing
      memory, we have the strong suspicion that the performance impact of
      potentially defragmenting hugepages during or before each page fault could
      lead to more performance inconsistency than allocating small pages at
      first and having them collapsed into large pages later... if they prove
      themselves to be long-lived mappings (the khugepaged scan is slow, so
      short-lived mappings have a low probability of running into khugepaged
      compared to long-lived mappings).
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
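
      As a userspace illustration (mine, not part of the patch), an application
      can mark a long-lived anonymous mapping with MADV_HUGEPAGE so khugepaged
      may later collapse its small pages into hugepages:

       #define _GNU_SOURCE
       #include <stdio.h>
       #include <string.h>
       #include <sys/mman.h>
       #include <unistd.h>

       int main(void)
       {
               size_t len = 16UL << 20;        /* 16 MB anonymous mapping */
               void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

               if (p == MAP_FAILED)
                       return 1;
               /* Hint that this region is a good candidate for huge pages;
                * khugepaged collapses it in the background if/when it can. */
               if (madvise(p, len, MADV_HUGEPAGE) != 0)
                       perror("madvise(MADV_HUGEPAGE)");
               memset(p, 0, len);              /* touch it so there is something to collapse */
               sleep(60);                      /* keep the mapping alive for a while */
               return 0;
       }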
    • oom: allow a non-CAP_SYS_RESOURCE process to oom_score_adj down · dabb16f6
      Committed by Mandeep Singh Baines
      We'd like to be able to oom_score_adj a process up/down as it
      enters/leaves the foreground.  Currently, it is not possible to oom_adj
      down without CAP_SYS_RESOURCE.  This patch allows a task to decrease its
      oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to
      or its inherited value at fork.  Assuming the thread that has forked it
      has oom_score_adj of 0, each process could decrease it back from 0 upon
      activation unless a CAP_SYS_RESOURCE thread elevated it to something
      higher.
      
      Alternative considered:
      
      * a setuid binary
      * a daemon with CAP_SYS_RESOURCE
      
      Since you don't want all processes to be able to reduce their oom_adj, a
      setuid or daemon implementation would be complex.  The alternatives also
      have much higher overhead.
      
      This patch was updated from the original based on feedback from David
      Rientjes.
      Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
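
      A sketch (my own; /proc/<pid>/oom_score_adj is the real interface) of the
      foreground/background pattern described above: a process raises its own
      oom_score_adj when backgrounded and, after this patch, can lower it back
      to its inherited value without CAP_SYS_RESOURCE.

       #include <stdio.h>

       /* Write 'value' to our own oom_score_adj; returns 0 on success. */
       static int set_self_oom_score_adj(int value)
       {
               FILE *f = fopen("/proc/self/oom_score_adj", "w");

               if (!f)
                       return -1;
               fprintf(f, "%d\n", value);
               return fclose(f);
       }

       int main(void)
       {
               set_self_oom_score_adj(500);    /* backgrounded: easier OOM victim */
               set_self_oom_score_adj(0);      /* foregrounded: back to the inherited
                                                * value; no CAP_SYS_RESOURCE needed */
               return 0;
       }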
    • sched: remove long deprecated CLONE_STOPPED flag · 43bb40c9
      Committed by Dave Jones
      This warning was added in commit bdff746a ("clone: prepare to recycle
      CLONE_STOPPED") three years ago.  2.6.26 came and went.  As far as I know,
      no-one is actually using CLONE_STOPPED.
      Signed-off-by: Dave Jones <davej@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: convert max_user_watches to long · 52bd19f7
      Committed by Robin Holt
      On a 16TB machine, max_user_watches has an integer overflow.  Convert it
      to use a long and handle the associated fallout.
      Signed-off-by: Robin Holt <holt@sgi.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
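
      For reference, the limit lives in /proc/sys/fs/epoll/max_user_watches;
      a tiny sketch that reads it as a long (which no longer wraps on very
      large machines):

       #include <stdio.h>

       int main(void)
       {
               long max_watches = 0;
               FILE *f = fopen("/proc/sys/fs/epoll/max_user_watches", "r");

               if (f && fscanf(f, "%ld", &max_watches) == 1)
                       printf("max_user_watches = %ld\n", max_watches);
               if (f)
                       fclose(f);
               return 0;
       }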
  9. 11 Jan, 2011 (2 commits)
  10. 07 Jan, 2011 (1 commit)
  11. 09 Dec, 2010 (2 commits)
  12. 30 Nov, 2010 (2 commits)
    • sched: Add 'autogroup' scheduling feature: automated per session task groups · 5091faa4
      Committed by Mike Galbraith
      A recurring complaint from CFS users is that parallel kbuild has
      a negative impact on desktop interactivity.  This patch
      implements an idea from Linus, to automatically create task
      groups.  Currently, only per session autogroups are implemented,
      but the patch leaves the way open for enhancement.
      
      Implementation: each task's signal struct contains an inherited
      pointer to a refcounted autogroup struct containing a task group
      pointer, the default for all tasks pointing to the
      init_task_group.  When a task calls setsid(), a new task group
      is created, the process is moved into the new task group, and a
      reference to the previous task group is dropped.  Child
      processes inherit this task group thereafter, and increase its
      refcount.  When the last thread of a process exits, the
      process's reference is dropped, such that when the last process
      referencing an autogroup exits, the autogroup is destroyed.
      
      At runqueue selection time, IFF a task has no cgroup assignment,
      its current autogroup is used.
      
      Autogroup bandwidth is controllable by setting its nice level
      through the proc filesystem:
      
        cat /proc/<pid>/autogroup
      
      Displays the task's group and the group's nice level.
      
        echo <nice level> > /proc/<pid>/autogroup
      
      Sets the task group's shares to the weight of a nice <level> task.
      Setting the nice level is rate-limited for !admin users due to the
      abuse risk of task group locking.
      
      The feature is enabled from boot by default if
      CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
      the boot option noautogroup, and can also be turned on/off on
      the fly via:
      
        echo [01] > /proc/sys/kernel/sched_autogroup_enabled
      
      ... which will automatically move tasks to/from the root task group.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      [ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
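
      A userspace sketch of the interface described above (the /proc paths are
      exactly those from the commit message; the program itself is illustrative):

       #include <stdio.h>
       #include <unistd.h>

       int main(void)
       {
               char buf[128];
               FILE *f;

               /* A new session means a new autogroup for this process tree.
                * setsid() fails if we are already a process group leader. */
               if (setsid() == (pid_t)-1)
                       perror("setsid");

               f = fopen("/proc/self/autogroup", "r");
               if (f && fgets(buf, sizeof(buf), f))
                       printf("autogroup: %s", buf);
               if (f)
                       fclose(f);

               /* Lower this autogroup's weight to that of a nice 10 task. */
               f = fopen("/proc/self/autogroup", "w");
               if (f) {
                       fputs("10\n", f);
                       fclose(f);
               }
               return 0;
       }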
    • rcu: priority boosting for TINY_PREEMPT_RCU · 24278d14
      Committed by Paul E. McKenney
      Add priority boosting, but only for TINY_PREEMPT_RCU.  This is enabled
      by the default-off RCU_BOOST kernel configuration option.  The priority to
      which to boost preempted RCU readers is controlled by the RCU_BOOST_PRIO
      option (defaulting to real-time priority 1), and the time to wait
      before boosting the readers blocking a given grace period is controlled
      by the RCU_BOOST_DELAY option (defaulting to 500 milliseconds).
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  13. 26 Nov, 2010 (1 commit)
  14. 18 Nov, 2010 (3 commits)
  15. 11 Nov, 2010 (1 commit)
    • sched: Use group weight, idle cpu metrics to fix imbalances during idle · aae6d3dd
      Committed by Suresh Siddha
      Currently we consider a sched domain to be well balanced when the imbalance
      is less than the domain's imbalance_pct.  As the number of cores and threads
      increases, the current values of imbalance_pct (for example 25% for a
      NUMA domain) are not enough to detect imbalances like:
      
      a) On a WSM-EP system (two sockets, each having 6 cores and 12 logical threads),
      24 cpu-hogging tasks get scheduled as 13 on one socket and 11 on another
      socket, leading to an idle HT cpu.

      b) On a hypothetical 2-socket NHM-EX system (each socket having 8 cores and
      16 logical threads), 16 cpu-hogging tasks can get scheduled as 9 on one
      socket and 7 on another socket, leaving one core in a socket idle
      whereas in the other socket one core has both its HT siblings busy.
      
      While this issue can be fixed by decreasing the domain's imbalance_pct
      (by making it a function of number of logical cpus in the domain), it
      can potentially cause more task migrations across sched groups in an
      overloaded case.
      
      Fix this by using imbalance_pct only during newly_idle and busy
      load balancing.  During idle load balancing, instead check whether there
      is an imbalance in the number of idle CPUs between the busiest group and
      this sched_group, or whether the busiest group has more tasks than its
      weight that the idle CPU in this group can pull.
      Reported-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1284760952.2676.11.camel@sbsiddha-MOBL3.sc.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  16. 29 Oct, 2010 (1 commit)
  17. 28 Oct, 2010 (2 commits)
  18. 27 Oct, 2010 (1 commit)
  19. 23 Oct, 2010 (1 commit)
  20. 22 Oct, 2010 (1 commit)
  21. 19 Oct, 2010 (3 commits)
    • sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time · b52bfee4
      Committed by Venkatesh Pallipadi
      s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does
      the fine granularity accounting of user, system, hardirq, softirq times.
      Adding that option on archs like x86 will be challenging, however, given
      the state of TSC reliability on various platforms and also the overhead it
      will add at syscall entry/exit.
      
      Instead, add a lighter variant that only does finer accounting of
      hardirq and softirq times, providing precise irq times (instead of timer tick
      based samples). This accounting is added with a new config option
      CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not
      interested in paying the perf penalty.
      
      This accounting is based on sched_clock, with the code being generic.
      So, other archs may find it useful as well.
      
      This patch just adds the core logic and does not enable this logic yet.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
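
      Once the option is enabled and wired up by the follow-on patches, the irq
      and softirq fields of the cpu line in /proc/stat reflect sched_clock-based
      measurement instead of tick samples.  A small sketch that reads those two
      fields:

       #include <stdio.h>

       int main(void)
       {
               unsigned long long user, nice, system, idle, iowait, irq, softirq;
               FILE *f = fopen("/proc/stat", "r");

               if (!f)
                       return 1;
               /* first line: "cpu  user nice system idle iowait irq softirq ..." */
               if (fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu",
                          &user, &nice, &system, &idle, &iowait, &irq, &softirq) == 7)
                       printf("irq=%llu softirq=%llu (in USER_HZ ticks)\n", irq, softirq);
               fclose(f);
               return 0;
       }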
    • sched: Add a PF flag for ksoftirqd identification · 6cdd5199
      Committed by Venkatesh Pallipadi
      To account softirq time cleanly in the scheduler, we need to identify
      whether softirq is invoked in ksoftirqd context or in softirq-at-hardirq-tail
      context.  Add PF_KSOFTIRQD for that purpose.
      
      As all PF flag bits are currently taken, create space by moving one of the
      infrequently used bits (PF_THREAD_BOUND) down in task_struct to be along
      with some other state fields.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-4-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Fix softirq time accounting · 75e1056f
      Committed by Venkatesh Pallipadi
      Peter Zijlstra found a bug in the way softirq time is accounted in
      VIRT_CPU_ACCOUNTING on this thread:
      
         http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html
      
      The problem is that softirq processing uses local_bh_disable internally.
      There is no way, later in the flow, to differentiate between softirq
      actually being processed and bh merely having been disabled.  So a hardirq
      taken while bh is disabled results in time being wrongly accounted as softirq.
      
      Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
      as well: account_system_time() in normal tick-based accounting also uses
      softirq_count, which will be set even when not in softirq but with bh disabled.
      
      Peter also suggested the solution of using 2*SOFTIRQ_OFFSET as the irq
      count for local_bh_{disable,enable} and just SOFTIRQ_OFFSET while softirq
      is being processed.  The patch below does that and adds the API
      in_serving_softirq(), which returns whether we are currently processing a
      softirq.
      
      Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
      to in_serving_softirq.
      
      Looks like many usages of in_softirq really want in_serving_softirq. Those
      changes can be made individually on a case by case basis.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
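
      A minimal in-kernel style sketch (the charge_* names are made up) of the
      distinction the new helper provides:

       #include <linux/hardirq.h>

       enum charge_target { CHARGE_SYSTEM, CHARGE_SOFTIRQ };

       /* in_serving_softirq() is true only while softirq handlers actually run;
        * in_softirq() is also true when bottom halves are merely disabled. */
       static enum charge_target classify_cpu_time(void)
       {
               if (in_serving_softirq())
                       return CHARGE_SOFTIRQ;  /* really running softirq handlers */
               if (in_softirq())
                       return CHARGE_SYSTEM;   /* only local_bh_disable(), no softirq */
               return CHARGE_SYSTEM;
       }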
  22. 25 Sep, 2010 (1 commit)
  23. 14 Sep, 2010 (1 commit)