1. 06 Oct 2011, 3 commits
    • sched: Wrap scheduler p->cpus_allowed access · fa17b507
      Peter Zijlstra authored
      This task is preparatory for the migrate_disable() implementation, but
      stands on its own and provides a cleanup.
      
      It currently only converts those sites required for task placement.
      Kosaki-san once mentioned replacing cpus_allowed with a proper
      cpumask_t instead of the NR_CPUS-sized array it currently is; that
      would also require something like this. A sketch of the accessor
      pattern follows this entry.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Link: http://lkml.kernel.org/n/tip-e42skvaddos99psip0vce41o@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
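
      A minimal sketch of the accessor pattern being introduced. The
      tsk_cpus_allowed() helper matches the scheduler code of that era but is
      quoted from memory; the wake_cpu_allowed() call site is purely
      hypothetical, shown only to illustrate the conversion:

      	/* Accessor that hides the representation of the allowed-CPUs mask. */
      	#define tsk_cpus_allowed(tsk)	(&(tsk)->cpus_allowed)

      	/* Call sites stop dereferencing p->cpus_allowed directly, e.g.
      	   this illustrative helper: */
      	static int wake_cpu_allowed(struct task_struct *p, int cpu)
      	{
      		return cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
      	}

      With every access funneled through one helper, later switching
      cpus_allowed to a different representation (e.g. a cpumask_var_t) only
      requires touching the helper, not every call site.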
    • sched: Request for idle balance during nohz idle load balance · 6eb57e0d
      Suresh Siddha authored
      rq's idle_at_tick is set to idle/busy during the timer tick, depending
      on whether the cpu was idle or not. This is used later by the load
      balancing done in softirq context (which is process context in -RT
      kernels).
      
      For nohz kernels, the cpu doing nohz idle load balancing on behalf of
      all the idle cpus may have a stale value in its rq->idle_at_tick
      (recorded at its last timer tick, presumably while it was busy).
      
      As the nohz idle load balancing is done in the same place as the
      regular load balancing, it was bailing out when it saw that the rq's
      idle_at_tick was not set, leading to poor system utilization.
      
      Rename rq's idle_at_tick to idle_balance and set it when someone
      requests nohz idle balancing on an idle cpu. A sketch of this flow
      follows this entry.
      Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20111003220934.892350549@sbsiddha-desk.sc.intel.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
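
      A sketch of the flow described above, with the surrounding kernel
      context trimmed; the function and field names follow the scheduler code
      of that era but are quoted from memory:

      	/* Whoever requests nohz idle balancing marks the target rq as idle. */
      	static void nohz_balancer_kick(int ilb_cpu)
      	{
      		struct rq *rq = cpu_rq(ilb_cpu);

      		rq->idle_balance = 1;	/* replaces the stale idle_at_tick */
      		/* ... kick ilb_cpu so it runs the SCHED_SOFTIRQ balance pass ... */
      	}

      	/* The balance softirq now trusts idle_balance instead of idle_at_tick. */
      	static void run_rebalance_domains(struct softirq_action *h)
      	{
      		struct rq *this_rq = this_rq();
      		enum cpu_idle_type idle = this_rq->idle_balance ?
      						CPU_IDLE : CPU_NOT_IDLE;

      		rebalance_domains(smp_processor_id(), idle);
      	}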
    • sched: Use resched IPI to kick off the nohz idle balance · ca38062e
      Suresh Siddha authored
      The current use of the smp call function mechanism to kick off the nohz
      idle balance can deadlock in the following scenario:
      
      1. cpu-A did a generic_exec_single() to cpu-B and, after queuing its call
      single data (csd) to the call single queue, cpu-A took a timer interrupt.
      The actual IPI to cpu-B to process the call single queue has not yet been sent.
      
      2. As part of the timer interrupt handler, cpu-A decided to kick cpu-B
      for idle load balancing (setting cpu-B's rq->nohz_balance_kick to 1),
      and __smp_call_function_single() with nowait queued the csd to cpu-B's
      queue. But generic_exec_single() didn't send an IPI to cpu-B, as the
      call single queue was not empty.
      
      3. cpu-A is busy with a lot of interrupts.
      
      4. Meanwhile cpu-B enters and exits idle and notices that its
      rq->nohz_balance_kick is set to '1'. So it goes ahead, runs the idle
      load balancer and clears its rq->nohz_balance_kick.
      
      5. At this point, the csd queued in step 2 above is still locked and
      waiting to be serviced on cpu-B.
      
      6. cpu-A is still busy with interrupt load; it gets another timer
      interrupt and, as part of it, decides to kick cpu-B for another round of
      idle load balancing (since it finds cpu-B's rq->nohz_balance_kick
      cleared in step 4 above) and does __smp_call_function_single() with the
      same csd, which is still locked.
      
      7. We get a deadlock waiting for the csd_lock() in
      __smp_call_function_single().
      
      The main issue here is that cpu-B can service the idle load balancer
      kick request from cpu-A even without receiving the IPI, and this leads
      to multiple __smp_call_function_single() calls on the same csd,
      resulting in deadlock.
      
      To kick a cpu, the scheduler already has the reschedule vector reserved.
      Use that mechanism (kick_process()) instead of the generic smp call
      function mechanism to kick off the nohz idle load balancing, and avoid
      the deadlock. A sketch of the new kick path follows this entry.
      
         [ This issue is present from 2.6.35+ kernels, but marking it -stable
           only from v3.0+ as the proposed fix depends on the recently
           introduced scheduler_ipi(). ]
      Reported-by: Prarit Bhargava <prarit@redhat.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: stable@kernel.org # v3.0+
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20111003220934.834943260@sbsiddha-desk.sc.intel.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
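
      A sketch of the replacement kick path; kick_process() boils down to
      smp_send_reschedule() for a remote cpu, which is what is shown here. The
      find_new_ilb() helper and the rq->nohz_balance_kick flag follow the
      scheduler code of that era and are quoted from memory:

      	static void nohz_balancer_kick(int cpu)
      	{
      		int ilb_cpu = find_new_ilb(cpu);	/* an idle cpu to do the work */

      		if (ilb_cpu >= nr_cpu_ids)
      			return;

      		if (cpu_rq(ilb_cpu)->nohz_balance_kick)
      			return;

      		cpu_rq(ilb_cpu)->nohz_balance_kick = 1;

      		/*
      		 * A plain reschedule IPI is enough: scheduler_ipi() on the idle
      		 * target notices nohz_balance_kick and raises SCHED_SOFTIRQ, so
      		 * no call-single-data is queued and the csd re-lock deadlock
      		 * described above cannot happen.
      		 */
      		smp_send_reschedule(ilb_cpu);
      	}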
  2. 04 Oct 2011, 2 commits
  3. 30 Sep 2011, 1 commit
    • posix-cpu-timers: Cure SMP wobbles · d670ec13
      Peter Zijlstra authored
      David reported:
      
        Attached below is a watered-down version of rt/tst-cpuclock2.c from
        GLIBC.  Just build it with "gcc -o test test.c -lpthread -lrt" or
        similar.
      
        Run it several times, and you will see cases where the main thread
        will measure a process clock difference before and after the nanosleep
        which is smaller than the cpu-burner thread's individual thread clock
        difference.  This doesn't make any sense since the cpu-burner thread
        is part of the top-level process's thread group.
      
        I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
        64-bit binaries).
      
        For example:
      
        [davem@boricha build-x86_64-linux]$ ./test
        process: before(0.001221967) after(0.498624371) diff(497402404)
        thread:  before(0.000081692) after(0.498316431) diff(498234739)
        self:    before(0.001223521) after(0.001240219) diff(16698)
        [davem@boricha build-x86_64-linux]$ 
      
        The diff of 'process' should always be >= the diff of 'thread'.
      
        I make sure to wrap the 'thread' clock measurements the most tightly
        around the nanosleep() call, and that the 'process' clock measurements
        are the outer-most ones.
      
        ---
        #include <unistd.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>
        #include <fcntl.h>
        #include <string.h>
        #include <errno.h>
        #include <pthread.h>
      
        static pthread_barrier_t barrier;
      
        static void *chew_cpu(void *arg)
        {
      	  pthread_barrier_wait(&barrier);
      	  while (1)
      		  __asm__ __volatile__("" : : : "memory");
      	  return NULL;
        }
      
        int main(void)
        {
      	  clockid_t process_clock, my_thread_clock, th_clock;
      	  struct timespec process_before, process_after;
      	  struct timespec me_before, me_after;
      	  struct timespec th_before, th_after;
      	  struct timespec sleeptime;
      	  unsigned long diff;
      	  pthread_t th;
      	  int err;
      
      	  err = clock_getcpuclockid(0, &process_clock);
      	  if (err)
      		  return 1;
      
      	  err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
      	  if (err)
      		  return 1;
      
      	  pthread_barrier_init(&barrier, NULL, 2);
      	  err = pthread_create(&th, NULL, chew_cpu, NULL);
      	  if (err)
      		  return 1;
      
      	  err = pthread_getcpuclockid(th, &th_clock);
      	  if (err)
      		  return 1;
      
      	  pthread_barrier_wait(&barrier);
      
      	  err = clock_gettime(process_clock, &process_before);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(my_thread_clock, &me_before);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(th_clock, &th_before);
      	  if (err)
      		  return 1;
      
      	  sleeptime.tv_sec = 0;
      	  sleeptime.tv_nsec = 500000000;
      	  nanosleep(&sleeptime, NULL);
      
      	  err = clock_gettime(th_clock, &th_after);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(my_thread_clock, &me_after);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(process_clock, &process_after);
      	  if (err)
      		  return 1;
      
      	  diff = process_after.tv_nsec - process_before.tv_nsec;
      	  printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 process_before.tv_sec, process_before.tv_nsec,
      		 process_after.tv_sec, process_after.tv_nsec, diff);
      	  diff = th_after.tv_nsec - th_before.tv_nsec;
      	  printf("thread:  before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 th_before.tv_sec, th_before.tv_nsec,
      		 th_after.tv_sec, th_after.tv_nsec, diff);
      	  diff = me_after.tv_nsec - me_before.tv_nsec;
      	  printf("self:    before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 me_before.tv_sec, me_before.tv_nsec,
      		 me_after.tv_sec, me_after.tv_nsec, diff);
      
      	  return 0;
        }
      
      This is due to us using p->se.sum_exec_runtime in
      thread_group_cputime() where we iterate the thread group and sum all
      data. This does not take time since the last schedule operation (tick
      or otherwise) into account. We can cure this by using
      task_sched_runtime() at the cost of having to take locks.
      
      This also means we can (and must) do away with
      thread_group_sched_runtime() since the modified thread_group_cputime()
      is now more accurate and would deadlock when called from
      thread_group_sched_runtime().
      
      Aside from that, it makes the function safe on 32-bit systems. The old
      code added t->se.sum_exec_runtime unprotected; sum_exec_runtime is a
      64-bit value and could be changed on another cpu at the same time. A
      sketch of the adjusted summation loop follows this entry.
      Reported-by: David Miller <davem@davemloft.net>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@kernel.org
      Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
      Tested-by: David Miller <davem@davemloft.net>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
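
      A sketch of the adjusted summation loop mentioned above; the structure
      follows thread_group_cputime() of that era, quoted from memory, with
      task_sched_runtime() replacing the raw t->se.sum_exec_runtime read:

      	void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
      	{
      		struct signal_struct *sig = tsk->signal;
      		struct task_struct *t;

      		times->utime = sig->utime;
      		times->stime = sig->stime;
      		times->sum_exec_runtime = sig->sum_sched_runtime;

      		rcu_read_lock();
      		/* Make sure we can trust the tsk->thread_group list. */
      		if (!likely(pid_alive(tsk)))
      			goto out;

      		t = tsk;
      		do {
      			times->utime = cputime_add(times->utime, t->utime);
      			times->stime = cputime_add(times->stime, t->stime);
      			/*
      			 * task_sched_runtime() takes the task's rq lock and adds
      			 * the time run since the last tick, so the group total
      			 * can no longer lag behind an individual thread's clock,
      			 * and the 64-bit value is no longer read unlocked.
      			 */
      			times->sum_exec_runtime += task_sched_runtime(t);
      		} while_each_thread(tsk, t);
      	out:
      		rcu_read_unlock();
      	}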
  4. 26 Sep 2011, 1 commit
  5. 29 Aug 2011, 4 commits
    • perf events: Fix slow and broken cgroup context switch code · a8d757ef
      Stephane Eranian authored
      The current cgroup context switch code was incorrect, leading to bogus
      counts. Furthermore, as soon as there was an active cgroup event on a
      CPU, the context switch cost on that CPU would increase by a significant
      amount, as demonstrated by a simple ping/pong example:
      
       $ ./pong
       Both processes pinned to CPU1, running for 10s
       10684.51 ctxsw/s
      
      Now start a cgroup perf stat:
       $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 100
      
      $ ./pong
       Both processes pinned to CPU1, running for 10s
       6674.61 ctxsw/s
      
      That's a 37% penalty.
      
      Note that pong is not even in the monitored cgroup.
      
      The results shown by perf stat are bogus:
       $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 100
      
       Performance counter stats for 'sleep 100':
      
       CPU1 <not counted> cycles   test
       CPU1 16,984,189,138 cycles  #    0.000 GHz
      
      The second 'cycles' event should report a count @ CPU clock
      (here 2.4GHz) as it is counting across all cgroups.
      
      The patch below fixes the bogus accounting and bypasses any cgroup
      switch in case the outgoing and incoming tasks are in the same cgroup
      (a sketch of the bypass check follows this entry).
      
      With this patch the same test now yields:
       $ ./pong
       Both processes pinned to CPU1, running for 10s
       10775.30 ctxsw/s
      
      Start perf stat with cgroup:
      
       $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
      
      Run pong outside the cgroup:
       $ ./pong
       Both processes pinned to CPU1, running for 10s
       10687.80 ctxsw/s
      
      The penalty is now less than 2%.
      
      And the results for perf stat are correct:
      
      $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
      
       Performance counter stats for 'sleep 10':
      
       CPU1 <not counted> cycles test #    0.000 GHz
       CPU1 23,933,981,448 cycles      #    0.000 GHz
      
      Now perf stat reports the correct counts for the non-cgroup event.
      
      If we run pong inside the cgroup, then we also get the
      correct counts:
      
      $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
      
       Performance counter stats for 'sleep 10':
      
       CPU1 22,297,726,205 cycles test #    0.000 GHz
       CPU1 23,933,981,448 cycles      #    0.000 GHz
      
            10.001457237 seconds time elapsed
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
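
      A sketch of the bypass check mentioned above; the helper names mirror
      the perf cgroup code of that era and are quoted from memory:

      	static inline void perf_cgroup_sched_out(struct task_struct *task,
      						 struct task_struct *next)
      	{
      		struct perf_cgroup *cgrp1;
      		struct perf_cgroup *cgrp2 = NULL;

      		cgrp1 = perf_cgroup_from_task(task);
      		if (next)
      			cgrp2 = perf_cgroup_from_task(next);

      		/*
      		 * Only schedule out the current cgroup events if we are
      		 * switching to a task in a different cgroup; otherwise the
      		 * whole cgroup switch machinery is skipped.
      		 */
      		if (cgrp1 != cgrp2)
      			perf_cgroup_switch(task, PERF_CGROUP_SWOUT);
      	}

      The sched-in side does the symmetric check against the previous task.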
    • sched: Fix a memory leak in __sdt_free() · feff8fa0
      WANG Cong authored
      This patch fixes the following memory leak (a sketch of the
      corresponding fix follows this entry):
      
      unreferenced object 0xffff880107266800 (size 512):
        comm "sched-powersave", pid 3718, jiffies 4323097853 (age 27495.450s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff81133940>] create_object+0x187/0x28b
          [<ffffffff814ac103>] kmemleak_alloc+0x73/0x98
          [<ffffffff811232ba>] __kmalloc_node+0x104/0x159
          [<ffffffff81044b98>] kzalloc_node.clone.97+0x15/0x17
          [<ffffffff8104cb90>] build_sched_domains+0xb7/0x7f3
          [<ffffffff8104d4df>] partition_sched_domains+0x1db/0x24a
          [<ffffffff8109ee4a>] do_rebuild_sched_domains+0x3b/0x47
          [<ffffffff810a00c7>] rebuild_sched_domains+0x10/0x12
          [<ffffffff8104d5ba>] sched_power_savings_store+0x6c/0x7b
          [<ffffffff8104d5df>] sched_mc_power_savings_store+0x16/0x18
          [<ffffffff8131322c>] sysdev_class_store+0x20/0x22
          [<ffffffff81193876>] sysfs_write_file+0x108/0x144
          [<ffffffff81135b10>] vfs_write+0xaf/0x102
          [<ffffffff81135d23>] sys_write+0x4d/0x74
          [<ffffffff814c8a42>] system_call_fastpath+0x16/0x1b
          [<ffffffffffffffff>] 0xffffffffffffffff
      Signed-off-by: WANG Cong <amwang@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@kernel.org # 3.0
      Link: http://lkml.kernel.org/r/1313671017-4112-1-git-send-email-amwang@redhat.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
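
      A sketch of the kind of fix implied by the trace, assuming the leaked
      objects are the per-cpu sched_group_power buffers allocated in
      __sdt_alloc(); the field names follow the sched domain code of that era
      and are quoted from memory:

      	static void __sdt_free(const struct cpumask *cpu_map)
      	{
      		struct sched_domain_topology_level *tl;
      		int j;

      		for (tl = sched_domain_topology; tl->init; tl++) {
      			struct sd_data *sdd = &tl->data;

      			for_each_cpu(j, cpu_map) {
      				kfree(*per_cpu_ptr(sdd->sd, j));
      				kfree(*per_cpu_ptr(sdd->sg, j));
      				kfree(*per_cpu_ptr(sdd->sgp, j));	/* previously leaked */
      			}
      			free_percpu(sdd->sd);
      			free_percpu(sdd->sg);
      			free_percpu(sdd->sgp);
      		}
      	}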
    • sched: Move blk_schedule_flush_plug() out of __schedule() · 9c40cef2
      Thomas Gleixner authored
      There is no real reason to run blk_schedule_flush_plug() with
      interrupts and preemption disabled.
      
      Move it into schedule() and call it when the task is going voluntarily
      to sleep. There might be false positives when the task is woken
      between that call and actually scheduling, but that's not really
      different from being woken immediately after switching away.
      
      This fixes a deadlock in the scheduler where the
      blk_schedule_flush_plug() callchain enables interrupts and thereby
      allows a wakeup of the task that is about to sleep. A sketch of the
      resulting schedule() wrapper follows this entry.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: stable@kernel.org # 2.6.39+
      Link: http://lkml.kernel.org/n/tip-dwfxtra7yg1b5r65m32ywtct@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
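
      A sketch of the resulting schedule() wrapper; this mirrors the
      sched_submit_work() helper added around this time, quoted from memory:

      	static inline void sched_submit_work(struct task_struct *tsk)
      	{
      		if (!tsk->state)
      			return;		/* not going to sleep, nothing to flush */

      		/*
      		 * If we are going to sleep and we have plugged IO queued, make
      		 * sure to submit it now, while interrupts and preemption are
      		 * still enabled, to avoid the deadlock described above.
      		 */
      		if (blk_needs_flush_plug(tsk))
      			blk_schedule_flush_plug(tsk);
      	}

      	asmlinkage void __sched schedule(void)
      	{
      		struct task_struct *tsk = current;

      		sched_submit_work(tsk);
      		__schedule();
      	}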
    • sched: Separate the scheduler entry for preemption · c259e01a
      Thomas Gleixner authored
      Block-IO and workqueues call into notifier functions from the
      scheduler core code with interrupts and preemption disabled. These
      calls should be made before entering the scheduler core.
      
      To simplify this, separate the scheduler core code out into
      __schedule(). __schedule() is directly called from the places which set
      PREEMPT_ACTIVE and from schedule(). This allows us to add the work
      checks to schedule(), so they are only called when a task voluntarily
      goes to sleep. A sketch of the preemption entry follows this entry.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: stable@kernel.org # 2.6.39+
      Link: http://lkml.kernel.org/r/20110622174918.813258321@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
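
      A sketch of how the preemption entry points call the split-out core
      directly, bypassing the work checks that now live in schedule(); the
      shape follows preempt_schedule() of that era, quoted from memory:

      	asmlinkage void __sched notrace preempt_schedule(void)
      	{
      		struct thread_info *ti = current_thread_info();

      		/* No preemption while preemption is disabled or irqs are off. */
      		if (likely(ti->preempt_count || irqs_disabled()))
      			return;

      		do {
      			add_preempt_count_notrace(PREEMPT_ACTIVE);
      			__schedule();		/* core entry, no submit-work hooks */
      			sub_preempt_count_notrace(PREEMPT_ACTIVE);

      			/* Re-check in case we missed a preemption opportunity. */
      			barrier();
      		} while (need_resched());
      	}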
  6. 14 Aug 2011, 15 commits
  7. 22 Jul 2011, 5 commits
  8. 21 Jul 2011, 4 commits
  9. 16 Jul 2011, 1 commit
  10. 14 Jul 2011, 2 commits
    • sched: adjust scheduler cpu power for stolen time · 095c0aa8
      Glauber Costa authored
      This patch makes update_rq_clock() aware of steal time. The mechanism of
      operation is no different from irq_time and follows the same principles.
      This lives behind its own CONFIG option and can be compiled out
      independently of the rest of the steal time reporting. The effect of
      disabling it is that the scheduler will still report steal time (that
      cannot be disabled), but won't use this information for cpu power
      adjustments.
      
      Every time update_rq_clock_task() is invoked, we query how much time was
      stolen since the last call and feed it into sched_rt_avg_update().
      
      Although steal time reporting in account_process_tick() keeps track of
      the last time we read the steal clock, in prev_steal_time, this patch
      does it independently using another field, prev_steal_time_rq. This is
      because otherwise, information about time accounted in
      account_process_tick() would never reach us in update_rq_clock(). A
      sketch of the update path follows this entry.
      Signed-off-by: Glauber Costa <glommer@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Tested-by: Eric B Munson <emunson@mgebm.net>
      CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      CC: Anthony Liguori <aliguori@us.ibm.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
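
      A simplified sketch of the update path, assuming the paravirt
      steal-clock hook of that era; the real code also uses a static branch
      and folds in irq time, both omitted here, and the names are quoted from
      memory:

      	static void update_rq_clock_task(struct rq *rq, s64 delta)
      	{
      		s64 steal = 0;

      	#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
      		/* How much time the hypervisor stole since the last update. */
      		steal = paravirt_steal_clock(cpu_of(rq)) - rq->prev_steal_time_rq;
      		if (steal > delta)
      			steal = delta;
      		rq->prev_steal_time_rq += steal;

      		delta -= steal;
      	#endif
      		rq->clock_task += delta;

      		/* Stolen time counts against cpu power, like irq time does. */
      		if (steal)
      			sched_rt_avg_update(rq, steal);
      	}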
    • KVM guest: Steal time accounting · e6e6685a
      Glauber Costa authored
      This patch accounts steal time in account_process_tick. If one or more
      ticks are considered stolen in the current accounting cycle, user/system
      accounting is skipped. Idle is fine, since the hypervisor does not
      report steal time if the guest is halted.
      
      Accounting steal time from the core scheduler gives us the advantage of
      direct access to the runqueue data. At a later point, it can be used to
      tweak cpu power and make the scheduler aware of the time it lost. A
      sketch of the tick-time check follows this entry.
      
      [avi: <asm/paravirt.h> doesn't exist on many archs]
      Signed-off-by: Glauber Costa <glommer@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Tested-by: Eric B Munson <emunson@mgebm.net>
      CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      CC: Anthony Liguori <aliguori@us.ibm.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
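
      A sketch of the tick-time check described above; the steal-clock hook
      and the prev_steal_time bookkeeping follow the code of that era but are
      quoted from memory, and the runtime enable check and irq details are
      trimmed:

      	static bool steal_account_process_tick(void)
      	{
      	#ifdef CONFIG_PARAVIRT
      		u64 steal, ticks;

      		/* Cumulative stolen time reported by the hypervisor for this cpu. */
      		steal = paravirt_steal_clock(smp_processor_id());
      		steal -= this_rq()->prev_steal_time;

      		ticks = div_u64(steal, TICK_NSEC);	/* whole ticks stolen so far */
      		this_rq()->prev_steal_time += ticks * TICK_NSEC;

      		account_steal_time(jiffies_to_cputime(ticks));
      		return ticks > 0;
      	#else
      		return false;
      	#endif
      	}

      	void account_process_tick(struct task_struct *p, int user_tick)
      	{
      		/* If the tick was stolen by the host, skip user/system accounting. */
      		if (steal_account_process_tick())
      			return;

      		if (user_tick)
      			account_user_time(p, cputime_one_jiffy, cputime_one_jiffy);
      		else if (p == this_rq()->idle)
      			account_idle_time(cputime_one_jiffy);
      		else
      			account_system_time(p, HARDIRQ_OFFSET, cputime_one_jiffy,
      					    cputime_one_jiffy);
      	}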
  11. 09 Jul 2011, 1 commit
  12. 08 Jul 2011, 1 commit