1. 14 Jun, 2016 (1 commit)
  2. 11 Jun, 2016 (1 commit)
  3. 08 Jun, 2016 (1 commit)
    • sched/debug: Fix 'schedstats=enable' cmdline option · 4698f88c
      Committed by Josh Poimboeuf
      The 'schedstats=enable' option doesn't work, and also produces the
      following warning during boot:
      
        WARNING: CPU: 0 PID: 0 at /home/jpoimboe/git/linux/kernel/jump_label.c:61 static_key_slow_inc+0x8c/0xa0
        static_key_slow_inc used before call to jump_label_init
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted 4.7.0-rc1+ #25
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
         0000000000000086 3ae3475a4bea95d4 ffffffff81e03da8 ffffffff8143fc83
         ffffffff81e03df8 0000000000000000 ffffffff81e03de8 ffffffff810b1ffb
         0000003d00000096 ffffffff823514d0 ffff88007ff197c8 0000000000000000
        Call Trace:
         [<ffffffff8143fc83>] dump_stack+0x85/0xc2
         [<ffffffff810b1ffb>] __warn+0xcb/0xf0
         [<ffffffff810b207f>] warn_slowpath_fmt+0x5f/0x80
         [<ffffffff811e9c0c>] static_key_slow_inc+0x8c/0xa0
         [<ffffffff810e07c6>] static_key_enable+0x16/0x40
         [<ffffffff8216d633>] setup_schedstats+0x29/0x94
         [<ffffffff82148a05>] unknown_bootoption+0x89/0x191
         [<ffffffff810d8617>] parse_args+0x297/0x4b0
         [<ffffffff82148d61>] start_kernel+0x1d8/0x4a9
         [<ffffffff8214897c>] ? set_init_arg+0x55/0x55
         [<ffffffff82148120>] ? early_idt_handler_array+0x120/0x120
         [<ffffffff821482db>] x86_64_start_reservations+0x2f/0x31
         [<ffffffff82148427>] x86_64_start_kernel+0x14a/0x16d
      
      The problem is that it tries to update the 'sched_schedstats' static key
      before jump labels have been initialized.
      
      Changing jump_label_init() to be called earlier before
      parse_early_param() wouldn't fix it: it would still fail trying to
      poke_text() because mm isn't yet initialized.
      
      Instead, just create a temporary '__sched_schedstats' variable which can
      be copied to the static key later during sched_init() after jump labels
      have been initialized.
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: cb251765 ("sched/debug: Make schedstats a runtime tunable that is disabled by default")
      Link: http://lkml.kernel.org/r/453775fe3433bed65731a583e228ccea806d18cd.1465322027.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
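
      A minimal sketch of the pattern described above, assuming slightly
      simplified signatures (the merged patch may differ in detail): the early
      param handler only records the request in a plain variable, and the
      static key is flipped later, in sched_init(), once jump labels work.

        static bool __sched_schedstats;         /* plain variable: safe before jump_label_init() */

        static int __init setup_schedstats(char *str)
        {
                /* Too early to touch the static key here: just remember the request. */
                if (!strcmp(str, "enable"))
                        __sched_schedstats = true;
                else if (!strcmp(str, "disable"))
                        __sched_schedstats = false;
                return 1;
        }
        __setup("schedstats=", setup_schedstats);

        void __init sched_init(void)
        {
                /* ... */
                /* jump labels are initialized by now, so poking the key is safe */
                if (__sched_schedstats)
                        static_branch_enable(&sched_schedstats);
                /* ... */
        }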
  4. 25 May, 2016 (1 commit)
    • sched/core: Fix remote wakeups · b7e7ade3
      Committed by Peter Zijlstra
      Commit:
      
        b5179ac7 ("sched/fair: Prepare to fix fairness problems on migration")
      
      ... introduced a bug: Mike Galbraith found that it caused a
      performance regression, while Paul E. McKenney reported lost
      wakeups and bisected it to this commit.
      
      The reason is that I mis-read ttwu_queue() such that I assumed any
      wakeup that got a remote queue must have had the task migrated.
      
      Since this is not so, we need to transfer this information between
      queueing the wakeup and actually doing the wakeup. Use a new
      task_struct::sched_flag for this; we already write to
      sched_contributes_to_load in the wakeup path, so this is a hot and
      modified cacheline.
      Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Pavan Kondeti <pkondeti@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: byungchul.park@lge.com
      Fixes: b5179ac7 ("sched/fair: Prepare to fix fairness problems on migration")
      Link: http://lkml.kernel.org/r/20160523091907.GD15728@worktop.ger.corp.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
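
      A simplified sketch of the idea (the field and call names here are
      illustrative assumptions, not necessarily the exact ones the patch adds):
      the queueing side records whether the wakeup migrated the task, and the
      remote CPU consumes that bit when it actually performs the wakeup.

        /* in struct task_struct, next to sched_contributes_to_load
         * (the hot, already-modified cacheline mentioned above) */
        unsigned sched_remote_wakeup:1;

        /* queueing side: runs on the CPU doing try_to_wake_up() */
        static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
        {
                p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
                /* ... add p to cpu's wake_list and send the reschedule IPI ... */
        }

        /* wakeup side: runs on the remote (target) CPU */
        void sched_ttwu_pending(void)
        {
                /* ... for each task p taken off the local wake_list ... */
                int wake_flags = p->sched_remote_wakeup ? WF_MIGRATED : 0;

                ttwu_do_activate(rq, p, wake_flags);
        }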
  5. 12 May, 2016 (4 commits)
    • sched/core: Provide a tsk_nr_cpus_allowed() helper · 50605ffb
      Committed by Thomas Gleixner
      tsk_nr_cpus_allowed() is an accessor for task->nr_cpus_allowed which allows
      us to change the representation of ->nr_cpus_allowed if required.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1462969411-17735-2-git-send-email-bigeasy@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
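
      The helper is a trivial wrapper; a sketch of what such an accessor looks
      like (callers then use tsk_nr_cpus_allowed(p) instead of dereferencing
      p->nr_cpus_allowed directly, so the representation can change later):

        static inline int tsk_nr_cpus_allowed(struct task_struct *p)
        {
                return p->nr_cpus_allowed;
        }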
    • sched/nohz: Fix affine unpinned timers mess · 44496922
      Committed by Wanpeng Li
      The following commit:
      
        9642d18e ("nohz: Affine unpinned timers to housekeepers")
      
      intended to affine unpinned timers to housekeepers:
      
        unpinned timers (full dynticks, idle)   =>   nearest busy housekeeper (otherwise, fall back to any housekeeper)
        unpinned timers (full dynticks, busy)   =>   nearest busy housekeeper (otherwise, fall back to any housekeeper)
        unpinned timers (housekeeper, idle)     =>   nearest busy housekeeper (otherwise, fall back to itself)

      However, the !idle_cpu(i) && is_housekeeping_cpu(cpu) check modified the
      intention to:

        unpinned timers (full dynticks, idle)   =>   any housekeeper (no matter the CPU topology)
        unpinned timers (full dynticks, busy)   =>   any housekeeper (no matter the CPU topology)
        unpinned timers (housekeeper, idle)     =>   any busy CPU (otherwise, fall back to any housekeeper)

      This patch fixes it by checking whether there is a busy housekeeper nearby,
      otherwise falling back to any housekeeper/itself. After the patch:

        unpinned timers (full dynticks, idle)   =>   nearest busy housekeeper (otherwise, fall back to any housekeeper)
        unpinned timers (full dynticks, busy)   =>   nearest busy housekeeper (otherwise, fall back to any housekeeper)
        unpinned timers (housekeeper, idle)     =>   nearest busy housekeeper (otherwise, fall back to itself)
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      [ Fixed the changelog. ]
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Fixes: 9642d18e ("nohz: Affine unpinned timers to housekeepers")
      Link: http://lkml.kernel.org/r/1462344334-8303-1-git-send-email-wanpeng.li@hotmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
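
      A sketch of the target-selection logic after the fix (modelled on
      get_nohz_timer_target() in kernel/sched/core.c, simplified; treat it as an
      illustration of the rules above rather than the literal diff):

        int get_nohz_timer_target(void)
        {
                int i, cpu = smp_processor_id();
                struct sched_domain *sd;

                /* A busy housekeeper keeps its own timers. */
                if (!idle_cpu(cpu) && is_housekeeping_cpu(cpu))
                        return cpu;

                /* Walk outwards through the topology for the nearest busy housekeeper. */
                rcu_read_lock();
                for_each_domain(cpu, sd) {
                        for_each_cpu(i, sched_domain_span(sd)) {
                                if (!idle_cpu(i) && is_housekeeping_cpu(i)) {
                                        cpu = i;
                                        goto unlock;
                                }
                        }
                }

                /* No busy housekeeper found: fall back to any housekeeping CPU. */
                if (!is_housekeeping_cpu(cpu))
                        cpu = housekeeping_any_cpu();
        unlock:
                rcu_read_unlock();
                return cpu;
        }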
    • sched/core: Kill sched_class::task_waking to clean up the migration logic · 59efa0ba
      Committed by Peter Zijlstra
      With sched_class::task_waking being called only when we do
      set_task_cpu(), we can make sched_class::migrate_task_rq() do the work
      and eliminate sched_class::task_waking entirely.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Pavan Kondeti <pkondeti@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: byungchul.park@lge.com
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
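
      In effect this removes one hook from struct sched_class and folds its work
      into another; a rough sketch of the resulting interface (heavily trimmed,
      and the comment reflects the changelog's reasoning rather than the exact
      diff):

        struct sched_class {
                /* ... */
        #ifdef CONFIG_SMP
                int  (*select_task_rq)(struct task_struct *p, int task_cpu,
                                       int sd_flag, int flags);
                /*
                 * ->task_waking() is gone: since it was only ever needed when the
                 * wakeup migrates the task, its work is now done from
                 * ->migrate_task_rq(), which set_task_cpu() already calls on a
                 * real migration.
                 */
                void (*migrate_task_rq)(struct task_struct *p);
                /* ... */
        #endif
        };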
    • sched/fair: Prepare to fix fairness problems on migration · b5179ac7
      Committed by Peter Zijlstra
      Mike reported that our recent attempt to fix migration problems:
      
        3a47d512 ("sched/fair: Fix fairness issue on migration")
      
      broke interactivity and the signal starve test. We reverted that
      commit and now let's try it again more carefully, with some other
      underlying problems fixed first.
      
      One problem is that I assumed ENQUEUE_WAKING was only set when we do a
      cross-cpu wakeup (migration), which isn't true. This means we now
      destroy the vruntime history of tasks and wakeup-preemption suffers.
      
      Cure this by making my assumption true: only call
      sched_class::task_waking() when we do a cross-cpu wakeup. This avoids
      the indirect call in the case of a local wakeup.
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Pavan Kondeti <pkondeti@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: byungchul.park@lge.com
      Cc: linux-kernel@vger.kernel.org
      Fixes: 3a47d512 ("sched/fair: Fix fairness issue on migration")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
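
      A condensed sketch of the wakeup-path shape the changelog describes (the
      surrounding try_to_wake_up() code is omitted, and the exact placement in
      the merged patch may differ): the indirect ->task_waking() call only
      happens when the task really changes CPU.

        cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
        if (task_cpu(p) != cpu) {
                /* cross-cpu wakeup: this is an actual migration */
                wake_flags |= WF_MIGRATED;

                /* only now do we pay for the indirect call */
                if (p->sched_class->task_waking)
                        p->sched_class->task_waking(p);

                set_task_cpu(p, cpu);
        }
        /* a purely local wakeup skips ->task_waking() entirely */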
  6. 09 May, 2016 (2 commits)
  7. 06 May, 2016 (13 commits)
  8. 05 May, 2016 (3 commits)
    • locking/lockdep, sched/core: Implement a better lock pinning scheme · e7904a28
      Committed by Peter Zijlstra
      The problem with the existing lock pinning is that each pin is of
      value 1; this means you can simply unpin if you know it's pinned,
      without having any extra information.
      
      This scheme generates a random (16 bit) cookie for each pin and
      requires this same cookie to unpin. This means you have to keep the
      cookie in context.
      
      No objsize difference for !LOCKDEP kernels.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
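
      A sketch of the cookie-based interface this describes (simplified; the
      real helpers are lockdep-only and compile away otherwise): pinning hands
      back a cookie that the caller must keep in context and return to unpin.

        struct pin_cookie {
                unsigned int val;       /* random 16-bit value, 0 without lockdep */
        };

        /* pin: generate a cookie and remember it in the lock's lockdep state */
        struct pin_cookie lockdep_pin_lock(struct lockdep_map *lock);

        /* unpin: only silent if the very same cookie is handed back */
        void lockdep_unpin_lock(struct lockdep_map *lock, struct pin_cookie cookie);

        /* re-establish a pin with a previously obtained cookie */
        void lockdep_repin_lock(struct lockdep_map *lock, struct pin_cookie cookie);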
    • sched/core: Introduce 'struct rq_flags' · eb580751
      Committed by Peter Zijlstra
      In order to be able to pass around more than just the IRQ flags in the
      future, add a rq_flags structure.
      
      No difference in code generation for the x86_64-defconfig build I
      tested.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
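
      At this point the structure is little more than a wrapper around the saved
      IRQ state; a sketch of its shape (the pin cookie field assumes the lock
      pinning scheme from the patch above):

        struct rq_flags {
                unsigned long flags;            /* saved IRQ state for *_irqsave() */
                struct pin_cookie cookie;       /* rq->lock pin cookie */
        };

      Lock/unlock helpers such as task_rq_lock()/task_rq_unlock() then take a
      struct rq_flags * instead of a bare unsigned long *flags, which is what
      makes it possible to pass more state around later.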
    • sched/core: Move task_rq_lock() out of line · 3e71a462
      Committed by Peter Zijlstra
      It's a rather large function; inlining it doesn't seem to make much sense:
      
       $ size defconfig-build/kernel/sched/core.o{.orig,}
          text    data     bss     dec     hex filename
         56533   21037    2320   79890   13812 defconfig-build/kernel/sched/core.o.orig
         55733   21037    2320   79090   134f2 defconfig-build/kernel/sched/core.o
      
      The 'perf bench sched messaging' micro-benchmark shows a visible improvement
      of 4-5%:
      
        $ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance > $i ; done
        $ perf stat --null --repeat 25 -- perf bench sched messaging -g 40 -l 5000
      
        pre:
             4.582798193 seconds time elapsed          ( +-  1.41% )
             4.733374877 seconds time elapsed          ( +-  2.10% )
             4.560955136 seconds time elapsed          ( +-  1.43% )
             4.631062303 seconds time elapsed          ( +-  1.40% )
      
        post:
             4.364765213 seconds time elapsed          ( +-  0.91% )
             4.454442734 seconds time elapsed          ( +-  1.18% )
             4.448893817 seconds time elapsed          ( +-  1.41% )
             4.424346872 seconds time elapsed          ( +-  0.97% )
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  9. 28 Apr, 2016 (2 commits)
  10. 23 Apr, 2016 (3 commits)
  11. 13 Apr, 2016 (1 commit)
  12. 31 Mar, 2016 (2 commits)
    • sched/fair: Initiate a new task's util avg to a bounded value · 2b8c41da
      Committed by Yuyang Du
      A new task's util_avg is set to full utilization of a CPU (100% time
      running). This accelerates the new task's utilization ramp-up, which is
      useful to boost its execution early on. However, it may result in
      (insanely) high utilization for a transient period when a flood of
      tasks is spawned. Importantly, it violates the "fundamentally bounded"
      CPU utilization, and its side effect is negative if we don't take any
      measure to bound it.
      
      This patch proposes an algorithm to address this issue. It has
      two methods to approach a sensible initial util_avg:
      
      (1) An expected (or average) util_avg based on its cfs_rq's util_avg:
      
        util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight
      
      (2) A trajectory of how successive new tasks' util develops, which
      gives 1/2 of the remaining utilization budget to a new task such that
      the additional util is noticeably large (when overall util is low) or
      unnoticeably small (when overall util is high enough). In the meantime,
      the aggregate utilization is well bounded:
      
        util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n
      
      where n denotes the nth task.
      
      If util_avg is larger than util_avg_cap, then the effective util is
      clamped to the util_avg_cap.
      Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: steve.muckle@linaro.org
      Link: http://lkml.kernel.org/r/1459283456-21682-1-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
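
      A condensed sketch of the initialization rule the two methods combine
      into (modelled on post_init_entity_util_avg() in kernel/sched/fair.c, with
      1024 standing in for full CPU capacity as in the formula above; simplified,
      so it may not match the merged code line for line):

        void post_init_entity_util_avg(struct sched_entity *se)
        {
                struct cfs_rq *cfs_rq = cfs_rq_of(se);
                struct sched_avg *sa = &se->avg;
                long cap = (long)(1024 - cfs_rq->avg.util_avg) / 2;

                if (cap > 0) {
                        if (cfs_rq->avg.util_avg != 0) {
                                /* (1) expected util, derived from the cfs_rq's averages */
                                sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
                                sa->util_avg /= (cfs_rq->avg.load_avg + 1);

                                /* clamp by (2): half of the remaining utilization budget */
                                if (sa->util_avg > cap)
                                        sa->util_avg = cap;
                        } else {
                                /* empty cfs_rq: just take half of the remaining budget */
                                sa->util_avg = cap;
                        }
                        sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
                }
        }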
    • sched/core: Add preempt checks in preempt_schedule() code · 47252cfb
      Committed by Steven Rostedt
      While testing the tracer preemptoff, I hit this strange trace:
      
         <...>-259     0...1    0us : schedule <-worker_thread
         <...>-259     0d..1    0us : rcu_note_context_switch <-__schedule
         <...>-259     0d..1    0us : rcu_sched_qs <-rcu_note_context_switch
         <...>-259     0d..1    0us : rcu_preempt_qs <-rcu_note_context_switch
         <...>-259     0d..1    0us : _raw_spin_lock <-__schedule
         <...>-259     0d..1    0us : preempt_count_add <-_raw_spin_lock
         <...>-259     0d..2    0us : do_raw_spin_lock <-_raw_spin_lock
         <...>-259     0d..2    1us : deactivate_task <-__schedule
         <...>-259     0d..2    1us : update_rq_clock.part.84 <-deactivate_task
         <...>-259     0d..2    1us : dequeue_task_fair <-deactivate_task
         <...>-259     0d..2    1us : dequeue_entity <-dequeue_task_fair
         <...>-259     0d..2    1us : update_curr <-dequeue_entity
         <...>-259     0d..2    1us : update_min_vruntime <-update_curr
         <...>-259     0d..2    1us : cpuacct_charge <-update_curr
         <...>-259     0d..2    1us : __rcu_read_lock <-cpuacct_charge
         <...>-259     0d..2    1us : __rcu_read_unlock <-cpuacct_charge
         <...>-259     0d..2    1us : clear_buddies <-dequeue_entity
         <...>-259     0d..2    1us : account_entity_dequeue <-dequeue_entity
         <...>-259     0d..2    2us : update_min_vruntime <-dequeue_entity
         <...>-259     0d..2    2us : update_cfs_shares <-dequeue_entity
         <...>-259     0d..2    2us : hrtick_update <-dequeue_task_fair
         <...>-259     0d..2    2us : wq_worker_sleeping <-__schedule
         <...>-259     0d..2    2us : kthread_data <-wq_worker_sleeping
         <...>-259     0d..2    2us : pick_next_task_fair <-__schedule
         <...>-259     0d..2    2us : check_cfs_rq_runtime <-pick_next_task_fair
         <...>-259     0d..2    2us : pick_next_entity <-pick_next_task_fair
         <...>-259     0d..2    2us : clear_buddies <-pick_next_entity
         <...>-259     0d..2    2us : pick_next_entity <-pick_next_task_fair
         <...>-259     0d..2    2us : clear_buddies <-pick_next_entity
         <...>-259     0d..2    2us : set_next_entity <-pick_next_task_fair
         <...>-259     0d..2    3us : put_prev_entity <-pick_next_task_fair
         <...>-259     0d..2    3us : check_cfs_rq_runtime <-put_prev_entity
         <...>-259     0d..2    3us : set_next_entity <-pick_next_task_fair
      gnome-sh-1031    0d..2    3us : finish_task_switch <-__schedule
      gnome-sh-1031    0d..2    3us : _raw_spin_unlock_irq <-finish_task_switch
      gnome-sh-1031    0d..2    3us : do_raw_spin_unlock <-_raw_spin_unlock_irq
      gnome-sh-1031    0...2    3us!: preempt_count_sub <-_raw_spin_unlock_irq
      gnome-sh-1031    0...1  582us : do_raw_spin_lock <-_raw_spin_lock
      gnome-sh-1031    0...1  583us : _raw_spin_unlock <-drm_gem_object_lookup
      gnome-sh-1031    0...1  583us : do_raw_spin_unlock <-_raw_spin_unlock
      gnome-sh-1031    0...1  583us : preempt_count_sub <-_raw_spin_unlock
      gnome-sh-1031    0...1  584us : _raw_spin_unlock <-drm_gem_object_lookup
      gnome-sh-1031    0...1  584us+: trace_preempt_on <-drm_gem_object_lookup
      gnome-sh-1031    0...1  603us : <stack trace>
       => preempt_count_sub
       => _raw_spin_unlock
       => drm_gem_object_lookup
       => i915_gem_madvise_ioctl
       => drm_ioctl
       => do_vfs_ioctl
       => SyS_ioctl
       => entry_SYSCALL_64_fastpath
      
      As I'm tracing preemption disabled, it seemed incorrect that the trace
      would go across a schedule and report not being in the scheduler.
      Looking into this I discovered the problem.
      
      schedule() calls preempt_disable(), but preempt_schedule() calls
      preempt_enable_notrace(). What happened above was that the gnome-shell
      task was preempted on another CPU and migrated over to the idle CPU. The
      tracer started with idle calling schedule(), which called
      preempt_disable(), but then gnome-shell finished, and it enabled
      preemption with preempt_enable_notrace(), which does not stop the trace,
      even though preemption was enabled.

      The purpose of the preempt_disable_notrace() in preempt_schedule() is to
      prevent function tracing from going into an infinite loop, because
      function tracing can trace the preempt_enable/disable() calls themselves.
      The problem with function tracing is:
      
        NEED_RESCHED set
        preempt_schedule()
          preempt_disable()
            preempt_count_inc()
              function trace (before incrementing preempt count)
                preempt_disable_notrace()
                preempt_enable_notrace()
                  sees NEED_RESCHED set
                     preempt_schedule() (repeat)
      
      Now by breaking out the preempt off/on tracing into their own code:
      preempt_disable_check() and preempt_enable_check(), we can add these to
      the preempt_schedule() code. As preemption would then be disabled, even
      if they were to be traced by the function tracer, the disabled
      preemption would prevent the recursion.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160321112339.6dc78ad6@gandalf.local.home
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
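
      A sketch of the resulting shape of the common preempt_schedule() path,
      using the helper names from the changelog above (the merged code may use
      different names): the preempt off/on trace hooks run while the preempt
      count is already raised, so a recursing function trace cannot re-enter
      preempt_schedule().

        static void __sched notrace preempt_schedule_common(void)
        {
                do {
                        /* raise the preempt count without tracing it ... */
                        preempt_disable_notrace();
                        /* ... then emit the "preempt off" event by hand */
                        preempt_disable_check();

                        __schedule(true);

                        /* mirror image on the way out */
                        preempt_enable_check();
                        preempt_enable_no_resched_notrace();
                } while (need_resched());
        }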
  13. 29 Mar, 2016 (1 commit)
  14. 21 Mar, 2016 (1 commit)
  15. 10 Mar, 2016 (1 commit)
  16. 08 Mar, 2016 (1 commit)
  17. 05 Mar, 2016 (1 commit)
    • sched/cputime: Fix steal time accounting vs. CPU hotplug · e9532e69
      Committed by Thomas Gleixner
      On CPU hotplug the steal time accounting can keep a stale rq->prev_steal_time
      value over CPU down and up. So after the CPU comes up again the delta
      calculation in steal_account_process_tick() wrecks itself due to the
      unsigned math:
      
      	 u64 steal = paravirt_steal_clock(smp_processor_id());
      
      	 steal -= this_rq()->prev_steal_time;
      
      So if steal is smaller than rq->prev_steal_time we end up with an insanely
      large value which then gets added to rq->prev_steal_time, resulting in
      permanent wreckage of the accounting. As a consequence the per-CPU stats
      in /proc/stat become stale.
      
      Nice trick to tell the world how idle the system is (100%) while the CPU is
      100% busy running tasks. Though we prefer realistic numbers.
      
      None of the accounting values which use a previous value to account for
      fractions is reset at CPU hotplug time. update_rq_clock_task() has a sanity
      check for prev_irq_time and prev_steal_time_rq, but that sanity check solely
      deals with clock warps and limits the /proc/stat visible wreckage. The
      prev_time values are still wrong.
      
      Solution is simple: Reset rq->prev_*_time when the CPU is plugged in again.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Fixes: 095c0aa8 ("sched: adjust scheduler cpu power for stolen time")
      Fixes: aa483808 ("sched: Remove irq time from available CPU power")
      Fixes: e6e6685a ("KVM guest: Steal time accounting")
      Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603041539490.3686@nanos
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
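
      A sketch of the reset described above (the helper name and the exact
      hotplug hook it is called from are assumptions, not necessarily what the
      patch uses): clear every per-rq prev_* baseline when the CPU is plugged
      in again, before it starts accounting ticks.

        /* invoked from the scheduler's CPU-up hotplug path */
        static inline void account_reset_rq(struct rq *rq)
        {
        #ifdef CONFIG_IRQ_TIME_ACCOUNTING
                rq->prev_irq_time = 0;
        #endif
        #ifdef CONFIG_PARAVIRT
                rq->prev_steal_time = 0;
        #endif
        #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
                rq->prev_steal_time_rq = 0;
        #endif
        }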
  18. 02 Mar, 2016 (1 commit)
    • sched: Migrate sched to use new tick dependency mask model · 76d92ac3
      Committed by Frederic Weisbecker
      Instead of providing asynchronous checks for the nohz subsystem to verify
      sched tick dependency, migrate sched to the new mask.
      
      Every time a task is enqueued or dequeued, we evaluate the state of the
      tick dependency based on the policy of the tasks in the runqueue, in
      order of priority:
      
      SCHED_DEADLINE: Need the tick in order to periodically check for runtime
      SCHED_FIFO    : Don't need the tick (no round-robin)
      SCHED_RR      : Need the tick if more than 1 task of the same priority
                      for round robin (simplified with checking if more than
                      one SCHED_RR task no matter what priority).
      SCHED_NORMAL  : Need the tick if more than 1 task for round-robin.
      
      We could optimize that further with one flag per sched policy on the tick
      dependency mask and perform only the checks relevant to the policy
      concerned by an enqueue/dequeue operation.
      
      Since the checks aren't based on the current task anymore, we could get
      rid of the task switch hook but it's still needed for posix cpu
      timers.
      Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
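
      A sketch of the per-runqueue check these rules boil down to (modelled on
      sched_can_stop_tick(); simplified, so treat it as an illustration of the
      policy ordering above rather than the exact merged function):

        bool sched_can_stop_tick(struct rq *rq)
        {
                int fifo_nr_running;

                /* SCHED_DEADLINE: even a single task needs the periodic runtime check. */
                if (rq->dl.dl_nr_running)
                        return false;

                /* SCHED_RR: more than one RR task needs the tick for round-robin. */
                if (rq->rt.rr_nr_running)
                        return rq->rt.rr_nr_running == 1;

                /* SCHED_FIFO: no round-robin, so FIFO-only realtime load can stop the tick. */
                fifo_nr_running = rq->rt.rt_nr_running - rq->rt.rr_nr_running;
                if (fifo_nr_running)
                        return true;

                /* SCHED_NORMAL: more than one runnable task needs the tick for preemption. */
                if (rq->nr_running > 1)
                        return false;

                return true;
        }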