1. 21 Oct, 2010 1 commit
  2. 19 Oct, 2010 28 commits
    • S
      tracing: Fix compile issue for trace_sched_wakeup.c · 7e40798f
      Steven Rostedt committed
      The function start_func_tracer() was incorrectly added in the
       #ifdef CONFIG_FUNCTION_TRACER condition, but is still used even
      when function tracing is not enabled.
      
      The calls to register_ftrace_function() and register_ftrace_graph()
      become nops (and their arguments are even ignored), thus there is
      no reason to hide start_func_tracer() when function tracing is
      not enabled.
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • D
      futex: Fix errors in nested key ref-counting · 7ada876a
      Darren Hart committed
      futex_wait() is leaking key references due to futex_wait_setup()
      acquiring an additional reference via the queue_lock() routine. The
      nested key ref-counting has been masking bugs and complicating code
      analysis. queue_lock() is only called with a previously ref-counted
      key, so remove the additional ref-counting from the queue_(un)lock()
      functions.
      
      Also futex_wait_requeue_pi() drops one key reference too many in
      unqueue_me_pi(). Remove the key reference handling from
      unqueue_me_pi(). This was paired with a queue_lock() in
      futex_lock_pi(), so the count remains unchanged.
      
      Document remaining nested key ref-counting sites.
      Signed-off-by: Darren Hart <dvhart@linux.intel.com>
      Reported-and-tested-by: Matthieu Fertré <matthieu.fertre@kerlabs.com>
      Reported-by: Louis Rilling <louis.rilling@kerlabs.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      LKML-Reference: <4CBB17A8.70401@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@kernel.org
    • I
      sched: Export account_system_vtime() · b7dadc38
      Ingo Molnar committed
      KVM uses it for example:
      
       ERROR: "account_system_vtime" [arch/x86/kvm/kvm.ko] undefined!
      
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • V
      sched: Call tick_check_idle before __irq_enter · d267f87f
      Venkatesh Pallipadi committed
      When a CPU is idle, the first interrupt makes irq_enter call
      tick_check_idle() to signal the interruption from idle. There is a problem
      if this call is made after __irq_enter, because all routines in __irq_enter
      may see stale time until tick_check_idle has run.
      
      Specifically, this affects trace calls in __irq_enter when they use the
      global clock, and also the account_system_vtime change in this patch, which
      wants to use sched_clock_cpu() to do proper irq timing.
      
      But, tick_check_idle was moved after __irq_enter intentionally to
      prevent problem of unneeded ksoftirqd wakeups by the commit ee5f80a9:
      
          irq: call __irq_enter() before calling the tick_idle_check
          Impact: avoid spurious ksoftirqd wakeups
      
      Moving tick_check_idle() before __irq_enter and wrapping it with
      local_bh_enable/disable would solve both the problems.
      Fixed-by: Yong Zhang <yong.zhang0@gmail.com>
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-9-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • V
      sched: Remove irq time from available CPU power · aa483808
      Venkatesh Pallipadi committed
      The idea was suggested by Peter Zijlstra here:
      
        http://marc.info/?l=linux-kernel&m=127476934517534&w=2
      
      irq time is technically not available to the tasks running on the CPU.
      This patch removes irq time from CPU power piggybacking on
      sched_rt_avg_update().
      
      Tested this by keeping CPU X busy with a network-intensive task that spends
      75% of a single CPU on irq processing (hard+soft) on a 4-way system, then
      starting seven cycle soakers on the system. Without this change, there are
      two tasks on each CPU. With this change, there is a single task on the
      irq-busy CPU X and the remaining 7 tasks are spread among the other 3 CPUs.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-8-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
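The arithmetic behind the test in the commit above can be sketched in user-space C. This is a minimal illustration, not the kernel's implementation: `available_power()` and its parameters are hypothetical names, and the kernel obtains the irq share by piggybacking on sched_rt_avg_update() rather than taking it as an argument.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not kernel code): scale a CPU's nominal power by
 * the fraction of a measurement period that was NOT spent in hardirq or
 * softirq handling.
 */
uint64_t available_power(uint64_t full_power, uint64_t period_ns,
                         uint64_t irq_ns)
{
    if (period_ns == 0)
        return full_power;      /* nothing measured yet: assume full power */
    if (irq_ns >= period_ns)
        return 0;               /* the whole period went to irq handling */
    return full_power * (period_ns - irq_ns) / period_ns;
}
```

With 75% of the period spent in irq handling, a CPU with a nominal power of 1024 is left with 256 in this model, which is consistent with the load balancer placing only one task on the irq-busy CPU.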
    • V
      sched: Do not account irq time to current task · 305e6835
      Venkatesh Pallipadi committed
      Scheduler accounts both softirq and interrupt processing times to the
      currently running task. This means, if the interrupt processing was
      for some other task in the system, then the current task ends up being
      penalized as it gets shorter runtime than otherwise.
      
      Change sched task accounting to account only actual task time from the
      currently running task. update_curr() now modifies delta_exec to
      depend on rq->clock_task.
      
      Note that this change only handles the CONFIG_IRQ_TIME_ACCOUNTING case. We can
      extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort, but that's
      for later.
      
      This change will impact scheduling behavior in interrupt heavy conditions.
      
      Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
      task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2
      spending 75%+ of its time in irq processing. CPU 3 spending around 35%
      time running nc task.
      
      Now, if I run another CPU intensive task on CPU 2, without this change
      /proc/<pid>/schedstat shows 100% of time accounted to this task. With this
      change, it rightly shows less than 25% accounted to this task as remaining
      time is actually spent on irq processing.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-7-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
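The idea of a task clock that excludes irq time can be sketched as follows. This is a user-space model under stated assumptions: `struct rq_model`, `update_clock_task()` and `delta_exec()` are hypothetical names that only mirror the rq->clock_task concept named in the commit above.

```c
#include <stdint.h>

/* Hypothetical runqueue model (not the kernel's struct rq). */
struct rq_model {
    uint64_t clock;       /* time seen by the runqueue, in ns */
    uint64_t irq_time;    /* cumulative hardirq + softirq time, in ns */
    uint64_t clock_task;  /* clock with irq time removed */
};

/* Advance the model: the task clock only moves for non-irq time. */
void update_clock_task(struct rq_model *rq, uint64_t now, uint64_t irq_total)
{
    rq->clock = now;
    rq->irq_time = irq_total;
    rq->clock_task = now - irq_total;
}

/* Runtime charged to the current task since exec_start (a clock_task value). */
uint64_t delta_exec(const struct rq_model *rq, uint64_t exec_start)
{
    return rq->clock_task - exec_start;
}
```

If 750 ns of a 1000 ns window are spent in irq processing, the task is charged 250 ns in this model, matching the "less than 25%" observation in the test description.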
    • V
      sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time · b52bfee4
      Venkatesh Pallipadi committed
      s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does
      the fine granularity accounting of user, system, hardirq, softirq times.
      Adding that option on archs like x86 will be challenging however, given the
      state of TSC reliability on various platforms and also the overhead it will
      add in syscall entry exit.
      
      Instead, add a lighter variant that only does finer accounting of
      hardirq and softirq times, providing precise irq times (instead of timer tick
      based samples). This accounting is added with a new config option
      CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not
      interested in paying the perf penalty.
      
      This accounting is based on sched_clock, with the code being generic.
      So, other archs may find it useful as well.
      
      This patch just adds the core logic and does not enable this logic yet.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • V
      sched: Add a PF flag for ksoftirqd identification · 6cdd5199
      Venkatesh Pallipadi committed
      To account softirq time cleanly in scheduler, we need to identify whether
      softirq is invoked in ksoftirqd context or softirq at hardirq tail context.
      Add PF_KSOFTIRQD for that purpose.
      
      As all PF flag bits are currently taken, create space by moving one of the
      infrequently used bits (PF_THREAD_BOUND) down in task_struct to be along
      with some other state fields.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-4-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • V
      sched: Fix softirq time accounting · 75e1056f
      Venkatesh Pallipadi committed
      Peter Zijlstra found a bug in the way softirq time is accounted in
      VIRT_CPU_ACCOUNTING on this thread:
      
         http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html
      
      The problem is that softirq processing uses local_bh_disable internally.
      There is no way, later in the flow, to tell whether softirq is being
      processed or bh has merely been disabled. So a hardirq that arrives while
      bh is disabled results in time being wrongly accounted as softirq.
      
      Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
      as well, because account_system_time() in normal tick-based accounting also
      uses softirq_count, which is set even when bh is disabled and no softirq is
      being served.
      
      Peter also suggested the solution of using 2*SOFTIRQ_OFFSET as the irq count
      for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET during softirq
      processing. The patch below does that and adds the API in_serving_softirq(),
      which returns whether we are currently processing softirq or not.
      
      It also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
      to in_serving_softirq.
      
      Looks like many usages of in_softirq really want in_serving_softirq. Those
      changes can be made individually on a case by case basis.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
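The counting scheme described in the commit above can be modelled in user-space C. This is a simplified sketch, not the kernel's code: the `model_*` functions and the single global counter are assumptions made for illustration, and the field is assumed to start at bit 8 with a width of 8 bits.

```c
#include <stdbool.h>

/* Simplified model of the softirq field of preempt_count; not kernel code. */
#define SOFTIRQ_SHIFT          8
#define SOFTIRQ_OFFSET         (1U << SOFTIRQ_SHIFT)
#define SOFTIRQ_MASK           (0xffU << SOFTIRQ_SHIFT)
#define SOFTIRQ_DISABLE_OFFSET (2 * SOFTIRQ_OFFSET)

static unsigned int model_count;

/* Disabling bh adds an even multiple, so the lowest field bit stays clear. */
void model_local_bh_disable(void) { model_count += SOFTIRQ_DISABLE_OFFSET; }
void model_local_bh_enable(void)  { model_count -= SOFTIRQ_DISABLE_OFFSET; }

/* Serving a softirq adds exactly one offset (softirq handling does not nest). */
void model_softirq_enter(void) { model_count += SOFTIRQ_OFFSET; }
void model_softirq_exit(void)  { model_count -= SOFTIRQ_OFFSET; }

/* True when bh is disabled OR a softirq is being served. */
bool model_in_softirq(void) { return (model_count & SOFTIRQ_MASK) != 0; }

/* True only while a softirq is actually being served. */
bool model_in_serving_softirq(void)
{
    return (model_count & SOFTIRQ_OFFSET) != 0;
}
```

Because bh-disable sections contribute only even multiples of SOFTIRQ_OFFSET, the lowest bit of the field is set exactly while a softirq is being served, which is what lets the accounting code distinguish the two cases.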
    • N
      sched: Drop group_capacity to 1 only if local group has extra capacity · 75dd321d
      Nikhil Rao committed
      When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
      only if the local group has extra capacity. The extra check prevents the case
      where you always pull from the heaviest group when it is already under-utilized
      (possible when a large-weight task outweighs the other tasks on the system).
      
      For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA
      scheduling domains. Let's say we spawn 15 nice0 tasks and one nice -15 task,
      and each task is running on one core. In this case, we observe the following
      events when balancing at the NUMA domain:
      
      - find_busiest_group() will always pick the sched group containing the niced
        task to be the busiest group.
      - find_busiest_queue() will then always pick one of the cpus running the
        nice0 task (never picks the cpu with the nice -15 task since
        weighted_cpuload > imbalance).
      - The load balancer fails to migrate the task since it is the running task
        and increments sd->nr_balance_failed.
      - It repeats the above steps a few more times until sd->nr_balance_failed > 5,
        at which point it kicks off the active load balancer, wakes up the migration
        thread and kicks the nice 0 task off the cpu.
      
      The load balancer doesn't stop until we kick out all nice 0 tasks from
      the sched group, leaving you with 3 idle cpus and one cpu running the
      nice -15 task.
      
      When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child
      domain (in this case MC) has SD_PREFER_SIBLING set.  Subsequent load checks are
      not relevant because the niced task has a very large weight.
      
      In this patch, we add an extra condition to the "if(prefer_sibling)" check in
      update_sd_lb_stats(). We drop the capacity of a group only if the local group
      has extra capacity, i.e. nr_running < group_capacity. This patch preserves the
      original intent of the prefer_siblings check (to spread tasks across the system
      in low utilization scenarios) and fixes the case above.
      
      It helps in the following ways:
      - In low utilization cases (where nr_tasks << nr_cpus), we still drop
        group_capacity down to 1 if we prefer siblings.
      - On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most
        likely be > sgs.group_capacity.
      - When balancing large weight tasks, if the local group does not have extra
        capacity, we do not pick the group with the niced task as the busiest group.
        This prevents failed balances, active migration and the under-utilization
        described above.
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287173550-30365-5-git-send-email-ncrao@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
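The extra condition described in the commit above can be sketched as a small decision function. This is an illustrative model only: `struct group_stats`, `has_extra_capacity()` and `effective_capacity()` are hypothetical names, and the real check lives inside update_sd_lb_stats() with more state than is shown here.

```c
#include <stdbool.h>

/* Hypothetical per-group statistics (not the kernel's sg_lb_stats). */
struct group_stats {
    unsigned long nr_running;      /* runnable tasks in the group */
    unsigned long group_capacity;  /* tasks the group can comfortably run */
};

/* A group has extra capacity when nr_running < group_capacity. */
bool has_extra_capacity(const struct group_stats *g)
{
    return g->nr_running < g->group_capacity;
}

/*
 * Capacity to use for a candidate group: clamp it to 1 only when siblings
 * are preferred AND the local group can actually absorb more tasks.
 */
unsigned long effective_capacity(unsigned long candidate_capacity,
                                 bool prefer_sibling,
                                 const struct group_stats *local)
{
    if (prefer_sibling && has_extra_capacity(local))
        return candidate_capacity < 1 ? candidate_capacity : 1;
    return candidate_capacity;
}
```

When the local group is already full, the candidate keeps its real capacity, so the group holding the heavy niced task is no longer treated as over capacity merely because of the clamp.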
    • N
      sched: Force balancing on newidle balance if local group has capacity · fab47622
      Nikhil Rao committed
      This patch forces a load balance on a newly idle cpu when the local group has
      extra capacity and the busiest group does not have any. It improves system
      utilization when balancing tasks with a large weight differential.
      
      Under certain situations, such as a niced down task (i.e. nice = -15) in the
      presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and
      kicks away other tasks because of its large weight. This leads to sub-optimal
      utilization of the machine. Even though the sched group has capacity, it does
      not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL.
      
      With this patch, if the local group has extra capacity, we shortcut the checks
      in f_b_g() and try to pull a task over. A sched group has extra capacity if the
      group capacity is greater than the number of running tasks in that group.
      
      Thanks to Mike Galbraith for discussions leading to this patch and for the
      insight to reuse SD_NEWIDLE_BALANCE.
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287173550-30365-4-git-send-email-ncrao@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
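The shortcut described in the commit above reduces to a three-part condition. The sketch below is a hedged model, not the code in f_b_g(): `struct sched_group_model` and `force_newidle_balance()` are hypothetical names introduced for illustration.

```c
#include <stdbool.h>

/* Hypothetical summary of one sched group (not a kernel structure). */
struct sched_group_model {
    unsigned long nr_running;  /* runnable tasks in the group */
    unsigned long capacity;    /* group capacity, in tasks */
};

/* Extra capacity: group capacity is greater than the number of running tasks. */
static bool group_has_capacity(const struct sched_group_model *g)
{
    return g->capacity > g->nr_running;
}

/*
 * Force a balance attempt on a newly idle cpu when the local group can
 * take more tasks and the busiest group has no capacity left.
 */
bool force_newidle_balance(bool newly_idle,
                           const struct sched_group_model *local,
                           const struct sched_group_model *busiest)
{
    return newly_idle &&
           group_has_capacity(local) &&
           !group_has_capacity(busiest);
}
```

The load comparison between the groups is deliberately absent from this predicate: the point of the patch is that a large weight differential should not prevent the pull when these capacity conditions hold.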
    • N
      sched: Set group_imb only a task can be pulled from the busiest cpu · 2582f0eb
      Nikhil Rao committed
      When cycling through sched groups to determine the busiest group, set
      group_imb only if the busiest cpu has more than 1 runnable task. This patch
      fixes the case where two cpus in a group have one runnable task each, but there
      is a large weight differential between these two tasks. The load balancer is
      unable to migrate any task from this group, so do not consider this
      group to be imbalanced.
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286996978-7007-3-git-send-email-ncrao@google.com>
      [ small code readability edits ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
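A minimal sketch of the added condition, assuming the imbalance test compares the spread between the most and least loaded cpu against some threshold. The function name, the `spread_threshold` parameter and the load values in the usage note are illustrative assumptions, not taken from the patch.

```c
#include <stdbool.h>

/*
 * Hypothetical predicate: a group counts as imbalanced only if the load
 * spread is large AND the busiest cpu has more than one runnable task,
 * i.e. there is a task that could actually be pulled.
 */
bool group_is_imbalanced(unsigned long max_cpu_load,
                         unsigned long min_cpu_load,
                         unsigned long spread_threshold,
                         unsigned int busiest_nr_running)
{
    return (max_cpu_load - min_cpu_load) > spread_threshold &&
           busiest_nr_running > 1;
}
```

For two cpus that each run a single task with very different weights, for example loads of 29000 and 1000, the spread is large but `busiest_nr_running` is 1, so the group is not flagged.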
    • N
      sched: Do not consider SCHED_IDLE tasks to be cache hot · ef8002f6
      Nikhil Rao committed
      This patch adds a check in task_hot to return if the task has SCHED_IDLE
      policy. SCHED_IDLE tasks have very low weight, and when run with regular
      workloads, are typically scheduled many milliseconds apart. There is no
      need to consider these tasks hot for load balancing.
      Signed-off-by: Nikhil Rao <ncrao@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287173550-30365-2-git-send-email-ncrao@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • P
      perf: Optimize sw events · 7e54a5a0
      Peter Zijlstra committed
      Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • P
      perf: Use jump_labels to optimize the scheduler hooks · 82cd6def
      Peter Zijlstra committed
      Trades a call + conditional + ret for an unconditional jmp.
      Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20101014203625.501657727@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • P
      jump_label: Use more consistent naming · 3b6e901f
      Peter Zijlstra committed
      While there are still only a few users around, rename things to make
      them more consistent.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20101014203625.448565169@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • P
      perf, hw_breakpoint: Fix crash in hw_breakpoint creation · d580ff86
      Peter Zijlstra committed
      hw_breakpoint creation needs to account stuff per-task to ensure there
      are always sufficient hardware resources to back these things due to
      ptrace.
      
      With the perf per pmu context changes the event initialization no
      longer has access to the event context, for the simple reason that we
      need to first find the pmu (result of initialization) before we can
      find the context.
      
      This makes hw_breakpoints unhappy, because it can no longer do per-task
      accounting. Cure this by frobbing a task pointer in the event::hw
      bits for now...
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      LKML-Reference: <20101014203625.391543667@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • P
      perf: Find task before event alloc · c6be5a5c
      Peter Zijlstra committed
      Find the task first so that we can pass the task pointer to the event
      allocation and use task-associated data during event initialization.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20101014203625.340789919@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • P
      perf: Fix task refcount bugs · e7d0bc04
      Peter Zijlstra committed
      Currently it looks like find_lively_task_by_vpid() takes a task ref
      and relies on find_get_context() to drop it.
      
      The problem is that perf_event_create_kernel_counter() shouldn't be
      dropping task refs.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
      Acked-by: Matt Helsley <matthltc@us.ibm.com>
      LKML-Reference: <20101014203625.278436085@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • P
      perf: Fix group moving · 74c3337c
      Peter Zijlstra committed
      Matt found we trigger the WARN_ON_ONCE() in perf_group_attach() when we take
      the move_group path in perf_event_open().
      
      Since we cannot de-construct the group (we rely on it to move the events), we
      have to simply ignore the double attach. The group state is context invariant
      and doesn't need changing.
      Reported-by: Matt Fleming <matt@console-pimps.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1287135757.29097.1368.camel@twins>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • P
      irq_work: Add generic hardirq context callbacks · e360adbe
      Peter Zijlstra committed
      Provide a mechanism that allows running code in IRQ context. It is
      most useful for NMI code that needs to interact with the rest of the
      system, such as waking up a task to drain buffers.
      
      Perf currently has such a mechanism, so extract that and provide it as
      a generic feature, independent of perf so that others may also
      benefit.
      
      The IRQ context callback is generated through self-IPIs where
      possible, or on architectures like powerpc the decrementer (the
      built-in timer facility) is set to generate an interrupt immediately.
      
      Architectures that don't have anything like this make do with a
      callback from the timer tick. These architectures can call
      irq_work_run() at the tail of any IRQ handlers that might enqueue such
      work (like the perf IRQ handler) to avoid undue latencies in
      processing the work.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Kyle McMartin <kyle@mcmartin.ca>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      [ various fixes ]
      Signed-off-by: Huang Ying <ying.huang@intel.com>
      LKML-Reference: <1287036094.7768.291.camel@yhuang-dev>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • S
      perf_events: Fix transaction recovery in group_sched_in() · 8e5fc1a7
      Stephane Eranian committed
      The group_sched_in() function uses a transactional approach to schedule
      a group of events. In a group, either all events can be scheduled or
      none are. To schedule each event in, the function calls event_sched_in().
      In case of error, event_sched_out() is called on each event in the group.
      
      The problem is that event_sched_out() does not completely cancel the
      effects of event_sched_in(). Furthermore, event_sched_out() changes the
      state of the event as if it had run, which is not true in this particular
      case.
      
      Those inconsistencies impact time tracking fields and may lead to events
      in a group not all reporting the same time_enabled and time_running values.
      This is demonstrated with the example below:
      
      $ task -eunhalted_core_cycles,baclears,baclears -e unhalted_core_cycles,baclears,baclears sleep 5
      1946101 unhalted_core_cycles (32.85% scaling, ena=829181, run=556827)
        11423 baclears (32.85% scaling, ena=829181, run=556827)
         7671 baclears (0.00% scaling, ena=556827, run=556827)
      
      2250443 unhalted_core_cycles (57.83% scaling, ena=962822, run=405995)
        11705 baclears (57.83% scaling, ena=962822, run=405995)
        11705 baclears (57.83% scaling, ena=962822, run=405995)
      
      Notice that in the first group, the last baclears event does not
      report the same timings as its siblings.
      
      This issue comes from the fact that tstamp_stopped is updated
      by event_sched_out() as if the event had actually run.
      
      To solve the issue, we must ensure that, in case of error, there is
      no change in the event state whatsoever. That means timings must
      remain as they were when entering group_sched_in().
      
      To do this we defer updating tstamp_running until we know the
      transaction succeeded. Therefore, we have split event_sched_in()
      in two parts separating the update to tstamp_running.
      
      Similarly, in case of error, we do not want to update tstamp_stopped.
      Therefore, we have split event_sched_out() in two parts separating
      the update to tstamp_stopped.
      
      With this patch, we now get the following output:
      
      $ task -eunhalted_core_cycles,baclears,baclears -e unhalted_core_cycles,baclears,baclears sleep 5
      2492050 unhalted_core_cycles (71.75% scaling, ena=1093330, run=308841)
        11243 baclears (71.75% scaling, ena=1093330, run=308841)
        11243 baclears (71.75% scaling, ena=1093330, run=308841)
      
      1852746 unhalted_core_cycles (0.00% scaling, ena=784489, run=784489)
         9253 baclears (0.00% scaling, ena=784489, run=784489)
         9253 baclears (0.00% scaling, ena=784489, run=784489)
      
      Note that the uneven timing between groups is a side effect of
      the process spending most of its time sleeping, i.e., not enough
      event rotations (but that's a separate issue).
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <4cb86b4c.41e9d80a.44e9.3e19@mx.google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
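The "% scaling" figures in the output above appear to be consistent with the fraction of enabled time during which the event was not running, (ena - run) / ena. The sketch below computes that value; `scaling_percent()` is a hypothetical helper written for this illustration and is not taken from the task tool or from libpfm4.

```c
/*
 * Hypothetical helper: percentage of time_enabled during which the event
 * was not actually counting. Returns 0 when the event ran the whole time.
 */
double scaling_percent(unsigned long long time_enabled,
                       unsigned long long time_running)
{
    if (time_enabled == 0 || time_running >= time_enabled)
        return 0.0;
    return 100.0 * (double)(time_enabled - time_running) /
           (double)time_enabled;
}
```

For ena=829181 and run=556827 this gives roughly 32.85, and for ena=556827, run=556827 it gives 0, matching the mismatched sibling in the first group of the pre-patch output. Events in one group should share both timing values, so they should also share this percentage.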
    • S
      perf_events: Fix bogus context time tracking · c530ccd9
      Stephane Eranian committed
      You can only call update_context_time() when the context
      is active, i.e., the thread it is attached to is still running.
      
      However, perf_event_read() can be called even when the context
      is inactive, e.g., when the user read()s the counters. The call to
      update_context_time() must be conditioned on the status of
      the context, otherwise, bogus time_enabled, time_running may
      be returned. Here is an example on AMD64. The task program
      is an example from libpfm4. The -p prints deltas every 1s.
      
      $ task -p -e cpu_clk_unhalted sleep 5
          2,266,610 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
      	    0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
      	    0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
      	    0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
      	    0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
      5,242,358,071 cpu_clk_unhalted (99.95% scaling, ena=5,000,359,984, run=2,319,270)
      
      Whereas if you don't read deltas, e.g., no call to perf_event_read() until
      the process terminates:
      
      $ task -e cpu_clk_unhalted sleep 5
          2,497,783 cpu_clk_unhalted (0.00% scaling, ena=2,376,899, run=2,376,899)
      
      Notice that time_enabled and time_running are bogus in the first example,
      causing bogus scaling.
      
      This patch fixes the problem, by conditionally calling update_context_time()
      in perf_event_read().
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@kernel.org
      LKML-Reference: <4cb856dc.51edd80a.5ae0.38fb@mx.google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • H
      lockdep: Check the depth of subclass · 4ba053c0
      Hitoshi Mitake committed
      The current look_up_lock_class() doesn't check the parameter "subclass".
      This rarely causes problems because the main caller of this function,
      register_lock_class(), checks it.
      
      But register_lock_class() is not the only function which calls
      look_up_lock_class(). lock_set_class() and its callees also call it.
      And lock_set_class() doesn't check this parameter.
      
      This causes problems when the value of subclass is larger than
      MAX_LOCKDEP_SUBCLASSES, because the address (used as the key of the class)
      calculated with too large a subclass may point to
      another key in a different lock_class_key.
      
      Of course this problem depends on the memory layout and
      occurs with really low probability.
      Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
      Cc: Dmitry Torokhov <dtor@mail.ru>
      Cc: Vojtech Pavlik <vojtech@ucw.cz>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286958626-986-1-git-send-email-mitake@dcl.info.waseda.ac.jp>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
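The aliasing risk described in the commit above comes from using an address inside a small per-key array as the class key. The following user-space sketch shows a bounds-checked lookup; the `*_model` types and `class_lookup_key()` are hypothetical names, and treating any index at or beyond the limit as invalid is an assumption of this sketch.

```c
#include <stddef.h>

#define MAX_SUBCLASSES_MODEL 8U

/* One byte per subclass: the ADDRESS of each byte serves as a unique key. */
struct subclass_key_model {
    char one_byte;
};

struct lock_key_model {
    struct subclass_key_model subkeys[MAX_SUBCLASSES_MODEL];
};

/*
 * Return the key address for (key, subclass), or NULL if subclass is out
 * of range. Without the check, an index of 8 would yield the address just
 * past this key's array, which may be subkey 0 of a neighbouring key.
 */
const struct subclass_key_model *
class_lookup_key(const struct lock_key_model *key, unsigned int subclass)
{
    if (subclass >= MAX_SUBCLASSES_MODEL)
        return NULL;
    return &key->subkeys[subclass];
}
```

Whether an unchecked out-of-range address collides with another key depends on memory layout, which is why the commit message describes the failure as low-probability.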
    • H
      lockdep: Add improved subclass caching · 62016250
      Hitoshi Mitake committed
      Current lockdep_map only caches one class with subclass == 0,
      and looks up hash table of classes when subclass != 0.
      
      This seems harmless because the case of subclass != 0 is rare. But locks
      of struct rq are acquired with subclass == 1 when task migration is
      executed. Task migration is a highly frequent event, so I modified lockdep
      to cache subclasses.
      
      I measured the score of perf bench sched messaging.
      This patch has a small but consistent effect (on the order of milliseconds
      to tens of milliseconds) when lots of tasks are running.
      The results are at the end of this description.
      
      NR_LOCKDEP_CACHING_CLASSES specifies how many classes can be
      cached in each instance of lockdep_map.
      I discussed this approach with Peter Zijlstra at LinuxCon Japan,
      and he pointed out that caching every subclass (8)
      is clearly a waste of memory. So the number of cached classes
      should be configurable.
      
      === Score comparison of benchmarks ===
      # "min" means best score, and "max" means worst score
      
      for i in `seq 1 10`; do ./perf bench -f simple sched messaging; done
      
      before: min: 0.565000, max: 0.583000, avg: 0.572500
      after:  min: 0.559000, max: 0.568000, avg: 0.563300
      
      # with more processes
      for i in `seq 1 10`; do ./perf bench -f simple sched messaging -g 40; done
      
      before: min: 2.274000, max: 2.298000, avg: 2.286300
      after:  min: 2.242000, max: 2.270000, avg: 2.259700
      Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286269311-28336-2-git-send-email-mitake@dcl.info.waseda.ac.jp>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
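The caching idea in the commit above can be sketched as a small per-map array that short-circuits the slow lookup for the first few subclasses. Everything here is a user-space model: the `*_model` types, `NR_CACHED_MODEL`, `slow_lookup()` and the lookup counter are hypothetical stand-ins for the hash-table path and for NR_LOCKDEP_CACHING_CLASSES.

```c
#include <stddef.h>

#define NR_CACHED_MODEL      2U  /* stand-in for NR_LOCKDEP_CACHING_CLASSES */
#define MAX_SUBCLASSES_MODEL 8U

struct lock_class_model {
    unsigned int subclass;
};

struct lockdep_map_model {
    struct lock_class_model *class_cache[NR_CACHED_MODEL];
};

static struct lock_class_model class_table[MAX_SUBCLASSES_MODEL];
unsigned int slow_lookup_count;  /* counts trips through the slow path */

/* Stand-in for the hash-table lookup that the cache is meant to avoid. */
static struct lock_class_model *slow_lookup(unsigned int subclass)
{
    slow_lookup_count++;
    class_table[subclass].subclass = subclass;
    return &class_table[subclass];
}

struct lock_class_model *lookup_class(struct lockdep_map_model *map,
                                      unsigned int subclass)
{
    struct lock_class_model *cls;

    if (subclass >= MAX_SUBCLASSES_MODEL)
        return NULL;
    if (subclass < NR_CACHED_MODEL && map->class_cache[subclass])
        return map->class_cache[subclass];      /* cache hit */

    cls = slow_lookup(subclass);
    if (subclass < NR_CACHED_MODEL)
        map->class_cache[subclass] = cls;       /* remember for next time */
    return cls;
}
```

With two cached slots, repeated lookups for subclass 1 (the struct rq case mentioned in the commit) hit the cache after the first call, while higher subclasses still take the slow path every time. The slot count trades memory per map against the number of subclasses that benefit.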
    • L
      sched: Drop all load weight manipulation for RT tasks · 17bdcf94
      Linus Walleij committed
      Load weights are for CFS; they do not belong in RT tasks. This makes all
      RT scheduling classes leave the CFS weights alone.
      
      This fixes a real bug as well. I noticed the following phenomenon: a process
      elevated to SCHED_RR forks with SCHED_RESET_ON_FORK set, and the child is
      indeed SCHED_OTHER, and the niceval is indeed reset to 0. However, the weight
      inserted by set_load_weight() remains at 0, giving the task insignificant
      priority.
      
      With this fix, the weight is reset to what the task had before being elevated
      to SCHED_RR/SCHED_FIFO.
      
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: stable@kernel.org
      Signed-off-by: Linus Walleij <linus.walleij@stericsson.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286807811-10568-1-git-send-email-linus.walleij@stericsson.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • P
      sched: Create special class for stop/migrate work · 34f971f6
      Peter Zijlstra committed
      In order to separate the stop/migrate work thread from the SCHED_FIFO
      implementation, create a special class for it that is of higher priority than
      SCHED_FIFO itself.
      
      This currently solves a problem where cpu-hotplug consumes so much cpu-time
      that the SCHED_FIFO class gets throttled, but has the bandwidth replenishment
      timer pending on the now dead cpu.
      
      It is also required for when we add the planned deadline scheduling class above
      SCHED_FIFO, as the stop/migrate thread still needs to transcend those tasks.
      Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1285165776.2275.1022.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • P
      sched: Unindent labels · 49246274
      Peter Zijlstra committed
      Labels should be on column 0.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  3. 18 Oct, 2010 7 commits
  4. 17 Oct, 2010 4 commits