1. 17 Nov, 2011 1 commit
  2. 16 Nov, 2011 3 commits
  3. 14 Nov, 2011 2 commits
  4. 06 Oct, 2011 3 commits
    • sched: Wrap scheduler p->cpus_allowed access · fa17b507
      Committed by Peter Zijlstra
      This change is preparatory for the migrate_disable() implementation, but
      stands on its own and provides a cleanup.
      
      It currently converts only those sites required for task placement.
      Kosaki-san once mentioned replacing cpus_allowed with a proper
      cpumask_t instead of the NR_CPUS-sized array it currently is; that
      would also require something like this.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Link: http://lkml.kernel.org/n/tip-e42skvaddos99psip0vce41o@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
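A minimal user-space sketch of the wrapping idea in the commit above: every placement check reads the affinity mask through one accessor, so a later change (such as a migrate_disable() implementation) would only need to touch that accessor. The names `cpumask_sketch_t`, `task_sketch`, `task_allowed_mask()` and `task_may_run_on()` are hypothetical and are not the kernel's definitions; the single 64-bit mask is a simplifying assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the kernel's types. */
typedef struct {
    uint64_t bits;               /* assumption: at most 64 cpus */
} cpumask_sketch_t;

struct task_sketch {
    cpumask_sketch_t cpus_allowed;
};

/* The single accessor: callers never touch p->cpus_allowed directly. */
static inline const cpumask_sketch_t *
task_allowed_mask(const struct task_sketch *p)
{
    return &p->cpus_allowed;
}

/* A placement-style check written against the accessor. */
static inline bool task_may_run_on(const struct task_sketch *p,
                                   unsigned int cpu)
{
    /* The range check runs first, so the shift below is always valid. */
    return cpu < 64 && ((task_allowed_mask(p)->bits >> cpu) & 1u) != 0;
}
```

With this shape, changing how the mask is stored or computed is a one-function change rather than an edit at every call site.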
    • sched: Request for idle balance during nohz idle load balance · 6eb57e0d
      Committed by Suresh Siddha
      rq's idle_at_tick is set to idle/busy during the timer tick,
      depending on whether the cpu was idle or not. This is used later in the load
      balance that is done in softirq context (which is a process
      context in -RT kernels).
      
      For nohz kernels, on the cpu doing nohz idle load balance on behalf of
      all the idle cpus, rq->idle_at_tick might hold a stale value (recorded
      when the cpu got the timer tick, presumably while it was busy).
      
      As the nohz idle load balancing is done at the same place
      as the regular load balancing, nohz idle load balancing was bailing out
      when it saw rq's idle_at_tick not set, leading to poor system utilization.
      
      Rename rq's idle_at_tick to idle_balance and set it when someone requests
      nohz idle balance on an idle cpu.
      Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20111003220934.892350549@sbsiddha-desk.sc.intel.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
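The stale-flag problem described in the commit above can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel code: `rq_sketch` and the three functions are hypothetical names, and the use of a `nohz_balance_kick` flag alongside `idle_balance` is an assumption borrowed from the neighbouring commit message.

```c
#include <stdbool.h>

/* Hypothetical, reduced run-queue state; only the two flags discussed. */
struct rq_sketch {
    bool idle_balance;       /* formerly idle_at_tick */
    bool nohz_balance_kick;  /* assumption: set by the requesting cpu */
};

/* Timer tick: record whether this cpu was idle when the tick fired. */
void tick_sketch(struct rq_sketch *rq, bool cpu_is_idle)
{
    rq->idle_balance = cpu_is_idle;
}

/* Kick handling: refresh the possibly stale flag if the cpu is idle now. */
void handle_nohz_kick_sketch(struct rq_sketch *rq, bool cpu_is_idle_now)
{
    if (cpu_is_idle_now && rq->nohz_balance_kick)
        rq->idle_balance = true;
}

/* Balance-side check: bail out unless the flag says "idle". */
bool should_run_nohz_idle_balance_sketch(const struct rq_sketch *rq)
{
    return rq->idle_balance && rq->nohz_balance_kick;
}
```

If the last tick arrived while the cpu was busy, the check fails until the kick handler refreshes the flag, which mirrors the bail-out the commit message describes.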
    • sched: Use resched IPI to kick off the nohz idle balance · ca38062e
      Committed by Suresh Siddha
      The current use of the smp call function to kick the nohz idle balance can deadlock
      in the following scenario.
      
      1. cpu-A did a generic_exec_single() to cpu-B and after queuing its call single
      data (csd) to the call single queue, cpu-A took a timer interrupt.  Actual IPI
      to cpu-B to process the call single queue is not yet sent.
      
      2. As part of the timer interrupt handler, cpu-A decided to kick cpu-B
      for the idle load balancing (sets cpu-B's rq->nohz_balance_kick to 1)
      and __smp_call_function_single() with nowait will queue the csd to the
      cpu-B's queue. But the generic_exec_single() won't send an IPI to cpu-B
      as the call single queue was not empty.
      
      3. cpu-A is busy with a lot of interrupts.
      
      4. Meanwhile cpu-B is entering and exiting idle and notices that it has
      its rq->nohz_balance_kick set to '1'. So it goes ahead and does the
      idle load balancing and clears its rq->nohz_balance_kick.
      
      5. At this point, csd queued as part of the step-2 above is still locked
      and waiting to be serviced on cpu-B.
      
      6. cpu-A is still busy with interrupt load and now it got another timer
      interrupt and as part of it decided to kick cpu-B for another idle load
      balancing (as it finds cpu-B's rq->nohz_balance_kick cleared in step-4
      above) and does __smp_call_function_single() with the same csd that is
      still locked.
      
      7. And we get a deadlock waiting for the csd_lock() in the
      __smp_call_function_single().
      
      The main issue here is that cpu-B can service the idle load balancer kick
      request from cpu-A even without receiving the IPI, and this leads to
      multiple __smp_call_function_single() calls on the same csd, resulting in a
      deadlock.
      
      To kick a cpu, the scheduler already has the reschedule vector reserved. Use
      that mechanism (kick_process()) instead of the generic smp call function
      mechanism to kick off the nohz idle load balancing and avoid the deadlock.
      
         [ This issue is present in 2.6.35+ kernels, but it is marked for -stable
           only from v3.0+ as the proposed fix depends on scheduler_ipi(),
           which was introduced recently. ]
      Reported-by: Prarit Bhargava <prarit@redhat.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: stable@kernel.org # v3.0+
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20111003220934.834943260@sbsiddha-desk.sc.intel.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
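The numbered scenario in the commit above can be replayed in a small single-threaded model. This is only a sketch: `csd_sketch`, `cpu_sketch` and the helper functions are hypothetical simplifications, a "trylock" stands in for the spinning csd_lock(), and no real IPIs or interrupts are involved.

```c
#include <stdbool.h>

struct csd_sketch { bool locked; };
struct cpu_sketch { int queue_len; bool nohz_balance_kick; };

/* Stand-in for csd_lock(): the real one would spin instead of failing. */
static bool csd_trylock_sketch(struct csd_sketch *csd)
{
    if (csd->locked)
        return false;
    csd->locked = true;
    return true;
}

/* Queue a csd; report whether an IPI is needed (queue was empty before). */
static bool enqueue_csd_sketch(struct cpu_sketch *target)
{
    return target->queue_len++ == 0;
}

/* Replays steps 1-7; returns true if step 6 would block forever. */
bool replay_deadlock_sketch(void)
{
    struct cpu_sketch b = { 0, false };
    struct csd_sketch other = { false }, kick = { false };

    /* Step 1: cpu-A queues a csd, then is interrupted before the IPI. */
    csd_trylock_sketch(&other);
    bool first_needs_ipi = enqueue_csd_sketch(&b);
    (void)first_needs_ipi;      /* IPI is never sent in this replay */

    /* Step 2: timer handler kicks cpu-B; queue not empty, so no IPI. */
    b.nohz_balance_kick = true;
    csd_trylock_sketch(&kick);
    bool kick_needs_ipi = enqueue_csd_sketch(&b);

    /* Step 4: cpu-B sees the flag while leaving idle and clears it. */
    b.nohz_balance_kick = false;

    /* Steps 5-7: the kick csd is still locked, and cpu-A reuses it. */
    bool wants_kick_again = !b.nohz_balance_kick;
    return !kick_needs_ipi && wants_kick_again && !csd_trylock_sketch(&kick);
}
```

In this model `replay_deadlock_sketch()` returns true: the second kick finds its csd still locked because nothing ever drained cpu-B's queue.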
  5. 26 Sep, 2011 1 commit
  6. 14 Aug, 2011 14 commits
  7. 22 Jul, 2011 6 commits
  8. 21 Jul, 2011 1 commit
  9. 01 Jul, 2011 1 commit
  10. 28 May, 2011 1 commit
  11. 20 May, 2011 1 commit
  12. 04 May, 2011 1 commit
  13. 19 Apr, 2011 2 commits
    • sched: Next buddy hint on sleep and preempt path · 2f36825b
      Committed by Venkatesh Pallipadi
      When a task in a taskgroup sleeps, pick_next_task starts all the way back at
      the root and picks the task/taskgroup with the min vruntime across all
      runnable tasks.
      
      But when there are many frequently sleeping tasks across different taskgroups,
      it makes better sense to stay with the same taskgroup for its slice period (or
      until all tasks in the taskgroup sleep) instead of switching across taskgroups
      on each sleep after a short runtime.
      
      This helps specifically where a taskgroup corresponds to a process with
      multiple threads. The change reduces the number of CR3 switches in this case.
      
      Example:
      
      Two taskgroups with 2 threads each which are running for 2ms and
      sleeping for 1ms. Looking at sched:sched_switch shows:
      
      BEFORE: taskgroup_1 threads [5004, 5005], taskgroup_2 threads [5016, 5017]
            cpu-soaker-5004  [003]  3683.391089
            cpu-soaker-5016  [003]  3683.393106
            cpu-soaker-5005  [003]  3683.395119
            cpu-soaker-5017  [003]  3683.397130
            cpu-soaker-5004  [003]  3683.399143
            cpu-soaker-5016  [003]  3683.401155
            cpu-soaker-5005  [003]  3683.403168
            cpu-soaker-5017  [003]  3683.405170
      
      AFTER: taskgroup_1 threads [21890, 21891], taskgroup_2 threads [21934, 21935]
            cpu-soaker-21890 [003]   865.895494
            cpu-soaker-21935 [003]   865.897506
            cpu-soaker-21934 [003]   865.899520
            cpu-soaker-21935 [003]   865.901532
            cpu-soaker-21934 [003]   865.903543
            cpu-soaker-21935 [003]   865.905546
            cpu-soaker-21891 [003]   865.907548
            cpu-soaker-21890 [003]   865.909560
            cpu-soaker-21891 [003]   865.911571
            cpu-soaker-21890 [003]   865.913582
            cpu-soaker-21891 [003]   865.915594
            cpu-soaker-21934 [003]   865.917606
      
      A similar problem exists when there are multiple taskgroups and, say, a task A
      preempts the currently running task B of taskgroup_1. On schedule, pick_next_task
      can pick an unrelated task in taskgroup_2. Here it would be better to give some
      preference to task B in pick_next_task.
      
      A simple (maybe extreme) benchmark I tried was tbench with 2 tbench
      client processes with 2 threads each running on a single CPU. Average throughput
      across five 50-second runs was:
      
       BEFORE: 105.84 MB/sec
       AFTER:  112.42 MB/sec
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1302802253-25760-1-git-send-email-venki@google.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
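A rough sketch of what a "next buddy" hint means for the pick decision in the commit above: the entity with the smallest vruntime normally wins, but a hinted entity is preferred when it is not too far behind. The `slack` threshold, the flat array of entities, and the names `entity_sketch` and `pick_next_sketch()` are assumptions made for illustration; the commit message does not spell out the kernel's actual fairness check.

```c
#include <stddef.h>
#include <stdint.h>

struct entity_sketch {
    uint64_t vruntime;
};

/*
 * Pick the entity with the minimum vruntime, unless next_buddy is set
 * and its vruntime is within `slack` of that minimum.
 */
const struct entity_sketch *
pick_next_sketch(const struct entity_sketch *ents, size_t n,
                 const struct entity_sketch *next_buddy, uint64_t slack)
{
    if (n == 0)
        return NULL;

    const struct entity_sketch *left = &ents[0];
    for (size_t i = 1; i < n; i++) {
        if (ents[i].vruntime < left->vruntime)
            left = &ents[i];
    }

    /* Prefer the hinted entity only while it stays reasonably fair. */
    if (next_buddy != NULL && next_buddy->vruntime <= left->vruntime + slack)
        return next_buddy;

    return left;
}
```

Keeping the hint bounded by a threshold is one way to stay with the same taskgroup for a while without letting it starve the others.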
    • sched: Make set_*_buddy() work on non-task entities · 69c80f3e
      Committed by Venkatesh Pallipadi
      Make set_*_buddy() work on non-task sched_entities, to facilitate the
      use of next_buddy to cache a group entity in cases where one of the
      tasks within that entity sleeps or gets preempted.
      
      set_skip_buddy() incorrectly required that the policy of the yielding
      task not be SCHED_IDLE. Yielding should happen even when the yielding
      task is SCHED_IDLE, so this change removes the policy check
      on the yielding task.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1302744070-30079-2-git-send-email-venki@google.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
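One way to picture a buddy setter that works for group entities as well as tasks is a walk up the entity hierarchy, recording the hint on each level's run queue. The sketch below assumes a simple parent pointer and one `next` slot per queue; `entity_sketch`, `cfs_rq_sketch` and `set_next_buddy_sketch()` are hypothetical names, not the kernel's structures.

```c
#include <stddef.h>

struct cfs_rq_sketch;

/* An entity is either a task or a group; both are handled identically. */
struct entity_sketch {
    struct entity_sketch *parent;   /* NULL at the top level */
    struct cfs_rq_sketch *cfs_rq;   /* queue this entity is enqueued on */
};

struct cfs_rq_sketch {
    struct entity_sketch *next;     /* next-buddy hint for this queue */
};

/* Record the hint at every level, starting from any entity kind. */
void set_next_buddy_sketch(struct entity_sketch *se)
{
    for (; se != NULL; se = se->parent)
        se->cfs_rq->next = se;
}
```

Because nothing in the walk inspects task-only fields, the same call can start from a task entity or directly from a group entity.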
  14. 14 Apr, 2011 3 commits