1. 29 6月, 2009 1 次提交
  2. 19 6月, 2009 2 次提交
    • P
      perf_counter: Simplify and fix task migration counting · e5289d4a
      Peter Zijlstra 提交于
      The task migrations counter was causing rare and hard to decypher
      memory corruptions under load. After a day of debugging and bisection
      we found that the problem was introduced with:
      
        3f731ca6: perf_counter: Fix cpu migration counter
      
      Turning them off fixes the crashes. Incidentally, the whole
      perf_counter_task_migration() logic can be done simpler as well,
      by injecting a proper sw-counter event.
      
      This cleanup also fixed the crashes. The precise failure mode is
      not completely clear yet, but we are clearly not unhappy about
      having a fix ;-)
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e5289d4a
    • O
      kthreads: simplify migration_thread() exit path · 371cbb38
      Oleg Nesterov 提交于
      Now that kthread_stop() can be used even if the task has already exited,
      we can kill the "wait_to_die:" loop in migration_thread().  But we must
      pin rq->migration_thread after creation.
      
      Actually, I don't think CPU_UP_CANCELED or CPU_DEAD should wait for
      ->migration_thread exit.  Perhaps we can simplify this code a bit more.
      migration_call() can set ->should_stop and forget about this thread.  But
      we need a new helper in kthred.c for that.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Vitaliy Gusev <vgusev@openvz.org
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      371cbb38
  3. 18 6月, 2009 2 次提交
  4. 17 6月, 2009 1 次提交
  5. 15 6月, 2009 1 次提交
    • L
      sched: Introduce SCHED_RESET_ON_FORK scheduling policy flag · ca94c442
      Lennart Poettering 提交于
      This patch introduces a new flag SCHED_RESET_ON_FORK which can be passed
      to the kernel via sched_setscheduler(), ORed in the policy parameter. If
      set this will make sure that when the process forks a) the scheduling
      priority is reset to DEFAULT_PRIO if it was higher and b) the scheduling
      policy is reset to SCHED_NORMAL if it was either SCHED_FIFO or SCHED_RR.
      
      Why have this?
      
      Currently, if a process is real-time scheduled this will 'leak' to all
      its child processes. For security reasons it is often (always?) a good
      idea to make sure that if a process acquires RT scheduling this is
      confined to this process and only this process. More specifically this
      makes the per-process resource limit RLIMIT_RTTIME useful for security
      purposes, because it makes it impossible to use a fork bomb to
      circumvent the per-process RLIMIT_RTTIME accounting.
      
      This feature is also useful for tools like 'renice' which can then
      change the nice level of a process without having this spill to all its
      child processes.
      
      Why expose this via sched_setscheduler() and not other syscalls such as
      prctl() or sched_setparam()?
      
      prctl() does not take a pid parameter. Due to that it would be
      impossible to modify this flag for other processes than the current one.
      
      The struct passed to sched_setparam() can unfortunately not be extended
      without breaking compatibility, since sched_setparam() lacks a size
      parameter.
      
      How to use this from userspace? In your RT program simply replace this:
      
        sched_setscheduler(pid, SCHED_FIFO, &param);
      
      by this:
      
        sched_setscheduler(pid, SCHED_FIFO|SCHED_RESET_ON_FORK, &param);
      Signed-off-by: NLennart Poettering <lennart@poettering.net>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20090615152714.GA29092@tango.0pointer.de>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      ca94c442
  6. 12 6月, 2009 4 次提交
  7. 02 6月, 2009 2 次提交
    • P
      perf_counter: Fix cpu migration counter · 3f731ca6
      Paul Mackerras 提交于
      This fixes the cpu migration software counter to count
      correctly even when contexts get swapped from one task to
      another.  Previously the cpu migration counts reported by perf
      stat were bogus, ranging from negative to several thousand for
      a single "lat_ctx 2 8 32" run.  With this patch the cpu
      migration count reported for "lat_ctx 2 8 32" is almost always
      between 35 and 44.
      
      This fixes the problem by adding a call into the perf_counter
      code from set_task_cpu when tasks are migrated.  This enables
      us to use the generic swcounter code (with some modifications)
      for the cpu migration counter.
      
      This modifies the swcounter code to allow a NULL regs pointer
      to be passed in to perf_swcounter_ctx_event() etc.  The cpu
      migration counter does this because there isn't necessarily a
      pt_regs struct for the task available.  In this case, the
      counter will not have interrupt capability - but the migration
      counter didn't have interrupt capability before, so this is no
      loss.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      LKML-Reference: <18979.35006.819769.416327@cargo.ozlabs.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      3f731ca6
    • P
      perf_counter: Initialize per-cpu context earlier on cpu up · f38b0820
      Paul Mackerras 提交于
      This arranges for perf_counter's notifier for cpu hotplug
      operations to be called earlier than the migration notifier in
      sched.c by increasing its priority to 20, compared to the 10
      for the migration notifier.  The reason for doing this is that
      a subsequent commit to convert the cpu migration counter to use
      the generic swcounter infrastructure will add a call into the
      perf_counter subsystem when tasks get migrated.  Therefore the
      perf_counter subsystem needs a chance to initialize its per-cpu
      data for the new cpu before it can get called from the
      migration code.
      
      This also adds a comment to the migration notifier noting that
      its priority needs to be lower than that of the perf_counter
      notifier.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <18981.1900.792795.836858@cargo.ozlabs.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      f38b0820
  8. 24 5月, 2009 1 次提交
    • P
      perf_counter: Fix dynamic irq_period logging · e220d2dc
      Peter Zijlstra 提交于
      We call perf_adjust_freq() from perf_counter_task_tick() which
      is is called under the rq->lock causing lock recursion.
      However, it's no longer required to be called under the
      rq->lock, so remove it from under it.
      
      Also, fix up some related comments.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      LKML-Reference: <20090523163012.476197912@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e220d2dc
  9. 22 5月, 2009 1 次提交
    • P
      perf_counter: Optimize context switch between identical inherited contexts · 564c2b21
      Paul Mackerras 提交于
      When monitoring a process and its descendants with a set of inherited
      counters, we can often get the situation in a context switch where
      both the old (outgoing) and new (incoming) process have the same set
      of counters, and their values are ultimately going to be added together.
      In that situation it doesn't matter which set of counters are used to
      count the activity for the new process, so there is really no need to
      go through the process of reading the hardware counters and updating
      the old task's counters and then setting up the PMU for the new task.
      
      This optimizes the context switch in this situation.  Instead of
      scheduling out the perf_counter_context for the old task and
      scheduling in the new context, we simply transfer the old context
      to the new task and keep using it without interruption.  The new
      context gets transferred to the old task.  This means that both
      tasks still have a valid perf_counter_context, so no special case
      is introduced when the old task gets scheduled in again, either on
      this CPU or another CPU.
      
      The equivalence of contexts is detected by keeping a pointer in
      each cloned context pointing to the context it was cloned from.
      To cope with the situation where a context is changed by adding
      or removing counters after it has been cloned, we also keep a
      generation number on each context which is incremented every time
      a context is changed.  When a context is cloned we take a copy
      of the parent's generation number, and two cloned contexts are
      equivalent only if they have the same parent and the same
      generation number.  In order that the parent context pointer
      remains valid (and is not reused), we increment the parent
      context's reference count for each context cloned from it.
      
      Since we don't have individual fds for the counters in a cloned
      context, the only thing that can make two clones of a given parent
      different after they have been cloned is enabling or disabling all
      counters with prctl.  To account for this, we keep a count of the
      number of enabled counters in each context.  Two contexts must have
      the same number of enabled counters to be considered equivalent.
      
      Here are some measurements of the context switch time as measured with
      the lat_ctx benchmark from lmbench, comparing the times obtained with
      and without this patch series:
      
      		-----Unmodified-----		With this patch series
      Counters:	none	2 HW	4H+4S	none	2 HW	4H+4S
      
      2 processes:
      Average		3.44	6.45	11.24	3.12	3.39	3.60
      St dev		0.04	0.04	0.13	0.05	0.17	0.19
      
      8 processes:
      Average		6.45	8.79	14.00	5.57	6.23	7.57
      St dev		1.27	1.04	0.88	1.42	1.46	1.42
      
      32 processes:
      Average		5.56	8.43	13.78	5.28	5.55	7.15
      St dev		0.41	0.47	0.53	0.54	0.57	0.81
      
      The numbers are the mean and standard deviation of 20 runs of
      lat_ctx.  The "none" columns are lat_ctx run directly without any
      counters.  The "2 HW" columns are with lat_ctx run under perfstat,
      counting cycles and instructions.  The "4H+4S" columns are lat_ctx run
      under perfstat with 4 hardware counters and 4 software counters
      (cycles, instructions, cache references, cache misses, task
      clock, context switch, cpu migrations, and page faults).
      
      [ Impact: performance optimization of counter context-switches ]
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      LKML-Reference: <18966.10666.517218.332164@cargo.ozlabs.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      564c2b21
  10. 19 5月, 2009 1 次提交
  11. 15 5月, 2009 2 次提交
    • T
      sched, timers: cleanup avenrun users · 2d02494f
      Thomas Gleixner 提交于
      avenrun is an rough estimate so we don't have to worry about
      consistency of the three avenrun values. Remove the xtime lock
      dependency and provide a function to scale the values. Cleanup the
      users.
      
      [ Impact: cleanup ]
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      2d02494f
    • T
      sched, timers: move calc_load() to scheduler · dce48a84
      Thomas Gleixner 提交于
      Dimitri Sivanich noticed that xtime_lock is held write locked across
      calc_load() which iterates over all online CPUs. That can cause long
      latencies for xtime_lock readers on large SMP systems. 
      
      The load average calculation is an rough estimate anyway so there is
      no real need to protect the readers vs. the update. It's not a problem
      when the avenrun array is updated while a reader copies the values.
      
      Instead of iterating over all online CPUs let the scheduler_tick code
      update the number of active tasks shortly before the avenrun update
      happens. The avenrun update itself is handled by the CPU which calls
      do_timer().
      
      [ Impact: reduce xtime_lock write locked section ]
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      dce48a84
  12. 13 5月, 2009 3 次提交
  13. 07 5月, 2009 1 次提交
    • D
      sched: emit thread info flags with stack trace · aa47b7e0
      David Rientjes 提交于
      When a thread is oom killed and fails to exit, it's helpful to know which
      threads have access to memory reserves if the machine livelocks.  This is
      done by testing for the TIF_MEMDIE thread info flag and should be
      displayed alongside stack traces to identify tasks that have access to
      such reserves but are still stuck allocating pages, for instance.
      
      It would probably be helpful in other cases as well, so all thread info
      flags are emitted when showing a task.
      
      ( v2: fix warning reported by Stephen Rothwell )
      
      [ Impact: extend debug printout info ]
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      LKML-Reference: <alpine.DEB.2.00.0905040136390.15831@chino.kir.corp.google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      aa47b7e0
  14. 06 5月, 2009 2 次提交
  15. 05 5月, 2009 1 次提交
    • I
      perf_counter: initialize the per-cpu context earlier · 0d905bca
      Ingo Molnar 提交于
      percpu scheduling for perfcounters wants to take the context lock,
      but that lock first needs to be initialized. Currently it is an
      early_initcall() - but that is too late, the task tick runs much
      sooner than that.
      
      Call it explicitly from the scheduler init sequence instead.
      
      [ Impact: fix access-before-init crash ]
      
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      0d905bca
  16. 29 4月, 2009 2 次提交
    • E
      sched: account system time properly · f5f293a4
      Eric Dumazet 提交于
      Andrew Gallatin reported that IRQ and SOFTIRQ times were
      sometime not reported correctly on recent kernels, and even
      bisected to commit 457533a7
      ([PATCH] fix scaled & unscaled cputime accounting) as the first
      bad commit.
      
      Further analysis pointed that commit
      79741dd3 ([PATCH] idle cputime
      accounting) was the real cause of the problem.
      
      account_process_tick() was not taking into account timer IRQ
      interrupting the idle task servicing a hard or soft irq.
      
      On mostly idle cpu, irqs were thus not accounted and top or
      mpstat could tell user/admin that cpu was 100 % idle, 0.00 %
      irq, 0.00 % softirq, while it was not.
      
      [ Impact: fix occasionally incorrect CPU statistics in top/mpstat ]
      Reported-by: NAndrew Gallatin <gallatin@myri.com>
      Re-reported-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Acked-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: rick.jones2@hp.com
      Cc: brice@myri.com
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      LKML-Reference: <49F84BC1.7080602@cosmosbay.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      f5f293a4
    • D
      sched: Document memory barriers implied by sleep/wake-up primitives · 50fa610a
      David Howells 提交于
      Add a section to the memory barriers document to note the implied
      memory barriers of sleep primitives (set_current_state() and wrappers)
      and wake-up primitives (wake_up() and co.).
      
      Also extend the in-code comments on the wake_up() functions to note
      these implied barriers.
      
      [ Impact: add documentation ]
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <20090428140138.1192.94723.stgit@warthog.procyon.org.uk>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      50fa610a
  17. 21 4月, 2009 3 次提交
  18. 17 4月, 2009 1 次提交
  19. 15 4月, 2009 3 次提交
    • M
      sched: use group_first_cpu() instead of cpumask_first(sched_group_cpus()) · 13318a71
      Miao Xie 提交于
      Impact: cleanup
      
      This patch changes cpumask_first(sched_group_cpus()) to group_first_cpu()
      for maintainability.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      13318a71
    • S
      tracing/events: move trace point headers into include/trace/events · ad8d75ff
      Steven Rostedt 提交于
      Impact: clean up
      
      Create a sub directory in include/trace called events to keep the
      trace point headers in their own separate directory. Only headers that
      declare trace points should be defined in this directory.
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Zhao Lei <zhaolei@cn.fujitsu.com>
      Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      ad8d75ff
    • S
      tracing: create automated trace defines · a8d154b0
      Steven Rostedt 提交于
      This patch lowers the number of places a developer must modify to add
      new tracepoints. The current method to add a new tracepoint
      into an existing system is to write the trace point macro in the
      trace header with one of the macros TRACE_EVENT, TRACE_FORMAT or
      DECLARE_TRACE, then they must add the same named item into the C file
      with the macro DEFINE_TRACE(name) and then add the trace point.
      
      This change cuts out the needing to add the DEFINE_TRACE(name).
      Every file that uses the tracepoint must still include the trace/<type>.h
      file, but the one C file must also add a define before the including
      of that file.
      
       #define CREATE_TRACE_POINTS
       #include <trace/mytrace.h>
      
      This will cause the trace/mytrace.h file to also produce the C code
      necessary to implement the trace point.
      
      Note, if more than one trace/<type>.h is used to create the C code
      it is best to list them all together.
      
       #define CREATE_TRACE_POINTS
       #include <trace/foo.h>
       #include <trace/bar.h>
       #include <trace/fido.h>
      
      Thanks to Mathieu Desnoyers and Christoph Hellwig for coming up with
      the cleaner solution of the define above the includes over my first
      design to have the C code include a "special" header.
      
      This patch converts sched, irq and lockdep and skb to use this new
      method.
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Zhao Lei <zhaolei@cn.fujitsu.com>
      Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      a8d154b0
  20. 14 4月, 2009 4 次提交
    • J
      wait: don't use __wake_up_common() · 78ddb08f
      Johannes Weiner 提交于
      '777c6c5f wait: prevent exclusive waiter starvation' made
      __wake_up_common() global to be used from abort_exclusive_wait().
      
      It was needed to do a wake-up with the waitqueue lock held while
      passing down a key to the wake-up function.
      
      Since '4ede816a epoll keyed wakeups: add __wake_up_locked_key() and
      __wake_up_sync_key()' there is an appropriate wrapper for this case:
      __wake_up_locked_key().
      
      Use it here and make __wake_up_common() private to the scheduler
      again.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1239720785-19661-1-git-send-email-hannes@cmpxchg.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      78ddb08f
    • G
      sched: Nominate a power-efficient ilb in select_nohz_balancer() · e790fb0b
      Gautham R Shenoy 提交于
      The CPU that first goes idle becomes the idle-load-balancer and remains
      that until either it picks up a task or till all the CPUs of the system
      goes idle.
      
      Optimize this further to allow it to relinquish it's post
      once all it's siblings in the power-aware sched_domain go idle, thereby
      allowing the whole package-core to go idle. While relinquising the post,
      nominate another an idle-load balancer from a semi-idle core/package.
      Signed-off-by: NGautham R Shenoy <ego@in.ibm.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20090414045535.7645.31641.stgit@sofia.in.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e790fb0b
    • G
      sched: Nominate idle load balancer from a semi-idle package. · f711f609
      Gautham R Shenoy 提交于
      Currently the nomination of idle-load balancer is done by choosing the first
      idle cpu in the nohz.cpu_mask. This may not be power-efficient, since
      such an idle cpu could come from a completely idle core/package thereby
      preventing the whole core/package from being in a low-power state.
      
      For eg, consider a quad-core dual package system. The cpu numbering need
      not be sequential and can something like [0, 2, 4, 6] and [1, 3, 5, 7].
      With sched_mc/smt_power_savings and the power-aware IRQ balance, we try to keep
      as fewer Packages/Cores active. But the current idle load balancer logic
      goes against this by choosing the first_cpu in the nohz.cpu_mask and not
      taking the system topology into consideration.
      
      Improve the algorithm to nominate the idle load balancer from a semi idle
      cores/packages thereby increasing the probability of the cores/packages being
      in deeper sleep states for longer duration.
      
      The algorithm is activated only when sched_mc/smt_power_savings != 0.
      Signed-off-by: NGautham R Shenoy <ego@in.ibm.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20090414045530.7645.12175.stgit@sofia.in.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      f711f609
    • L
      tracing, sched: mark get_parent_ip() notrace · 132380a0
      Lai Jiangshan 提交于
      Impact: remove overly redundant tracing entries
      
      When tracer is "function" or "function_graph", way too much
      "get_parent_ip" entries are recorded in ring_buffer.
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Acked-by: NSteven Rostedt <srostedt@redhat.com>
      LKML-Reference: <49D458B1.5000703@cn.fujitsu.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      132380a0
  21. 07 4月, 2009 2 次提交