1. 14 June 2016 (2 commits)
    • sched/debug: Fix deadlock when enabling sched events · eda8dca5
      Authored by Josh Poimboeuf
      I see a hang when enabling sched events:
      
        echo 1 > /sys/kernel/debug/tracing/events/sched/enable
      
      The printk buffer shows:
      
        BUG: spinlock recursion on CPU#1, swapper/1/0
         lock: 0xffff88007d5d8c00, .magic: dead4ead, .owner: swapper/1/0, .owner_cpu: 1
        CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc2+ #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
        ...
        Call Trace:
         <IRQ>  [<ffffffff8143d663>] dump_stack+0x85/0xc2
         [<ffffffff81115948>] spin_dump+0x78/0xc0
         [<ffffffff81115aea>] do_raw_spin_lock+0x11a/0x150
         [<ffffffff81891471>] _raw_spin_lock+0x61/0x80
         [<ffffffff810e5466>] ? try_to_wake_up+0x256/0x4e0
         [<ffffffff810e5466>] try_to_wake_up+0x256/0x4e0
         [<ffffffff81891a0a>] ? _raw_spin_unlock_irqrestore+0x4a/0x80
         [<ffffffff810e5705>] wake_up_process+0x15/0x20
         [<ffffffff810cebb4>] insert_work+0x84/0xc0
         [<ffffffff810ced7f>] __queue_work+0x18f/0x660
         [<ffffffff810cf9a6>] queue_work_on+0x46/0x90
         [<ffffffffa00cd95b>] drm_fb_helper_dirty.isra.11+0xcb/0xe0 [drm_kms_helper]
         [<ffffffffa00cdac0>] drm_fb_helper_sys_imageblit+0x30/0x40 [drm_kms_helper]
         [<ffffffff814babcd>] soft_cursor+0x1ad/0x230
         [<ffffffff814ba379>] bit_cursor+0x649/0x680
         [<ffffffff814b9d30>] ? update_attr.isra.2+0x90/0x90
         [<ffffffff814b5e6a>] fbcon_cursor+0x14a/0x1c0
         [<ffffffff81555ef8>] hide_cursor+0x28/0x90
         [<ffffffff81558b6f>] vt_console_print+0x3bf/0x3f0
         [<ffffffff81122c63>] call_console_drivers.constprop.24+0x183/0x200
         [<ffffffff811241f4>] console_unlock+0x3d4/0x610
         [<ffffffff811247f5>] vprintk_emit+0x3c5/0x610
         [<ffffffff81124bc9>] vprintk_default+0x29/0x40
         [<ffffffff811e965b>] printk+0x57/0x73
         [<ffffffff810f7a9e>] enqueue_entity+0xc2e/0xc70
         [<ffffffff810f7b39>] enqueue_task_fair+0x59/0xab0
         [<ffffffff8106dcd9>] ? kvm_sched_clock_read+0x9/0x20
         [<ffffffff8103fb39>] ? sched_clock+0x9/0x10
         [<ffffffff810e3fcc>] activate_task+0x5c/0xa0
         [<ffffffff810e4514>] ttwu_do_activate+0x54/0xb0
         [<ffffffff810e5cea>] sched_ttwu_pending+0x7a/0xb0
         [<ffffffff810e5e51>] scheduler_ipi+0x61/0x170
         [<ffffffff81059e7f>] smp_trace_reschedule_interrupt+0x4f/0x2a0
         [<ffffffff81893ba6>] trace_reschedule_interrupt+0x96/0xa0
         <EOI>  [<ffffffff8106e0d6>] ? native_safe_halt+0x6/0x10
         [<ffffffff8110fb1d>] ? trace_hardirqs_on+0xd/0x10
         [<ffffffff81040ac0>] default_idle+0x20/0x1a0
         [<ffffffff8104147f>] arch_cpu_idle+0xf/0x20
         [<ffffffff81102f8f>] default_idle_call+0x2f/0x50
         [<ffffffff8110332e>] cpu_startup_entry+0x37e/0x450
         [<ffffffff8105af70>] start_secondary+0x160/0x1a0
      
      Note the hang only occurs when echoing the above from a physical serial
      console, not from an ssh session.
      
      The bug is caused by a deadlock where the task tries to grab the rq
      lock twice: printk() is not safe in sched code, because the console
      path can wake up a task and re-enter the scheduler while the rq lock
      is still held.
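
      The trace above shows the recursion: enqueue_entity() runs with the rq
      lock held, its printk() drives the console through the fbcon path,
      which queues work and calls wake_up_process(), and try_to_wake_up()
      then tries to take the rq lock we already own. Purely as a hedged
      illustration of how messages are normally emitted from such a context
      (this is not the change this commit makes), the deferred printk
      variant queues the text and prints it later from irq_work context
      instead of driving the console synchronously:

        /*
         * Illustrative sketch only, not this commit's fix: with rq->lock
         * held, printk_deferred() avoids the console -> fbcon -> queue_work
         * -> try_to_wake_up() chain that would re-take rq->lock.
         */
        printk_deferred("sched: example message emitted under rq->lock\n");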
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Fixes: cb251765 ("sched/debug: Make schedstats a runtime tunable that is disabled by default")
      Link: http://lkml.kernel.org/r/20160613073209.gdvdybiruljbkn3p@treble
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix post_init_entity_util_avg() serialization · b7fa30c9
      Authored by Peter Zijlstra
      Chris Wilson reported a divide by 0 at:
      
       post_init_entity_util_avg():
      
       >    725	if (cfs_rq->avg.util_avg != 0) {
       >    726		sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
       > -> 727		sa->util_avg /= (cfs_rq->avg.load_avg + 1);
       >    728
       >    729		if (sa->util_avg > cap)
       >    730			sa->util_avg = cap;
       >    731	} else {
      
      Which, given the lack of serialization and the code generated from
      update_cfs_rq_load_avg(), is entirely possible:
      
      	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
      		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
      		sa->load_avg = max_t(long, sa->load_avg - r, 0);
      		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
      		removed_load = 1;
      	}
      
      turns into:
      
        ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
        ffffffff8108706b:       48 85 c0                test   %rax,%rax
        ffffffff8108706e:       74 40                   je     ffffffff810870b0
        ffffffff81087070:       4c 89 f8                mov    %r15,%rax
        ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
        ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
        ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
        ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
        ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
        ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx
      
      Which you'll note ends up with 'sa->load_avg - r' in memory at
      ffffffff8108707a.
      
      By calling post_init_entity_util_avg() under rq->lock we're sure to be
      fully serialized against PELT updates and cannot observe intermediate
      state like this.
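
      The failure mode follows from that intermediate store: load_avg is an
      unsigned long, so a concurrent reader that observes 'sa->load_avg - r'
      before the clamp to zero can see (unsigned long)-1, and the '+ 1' in
      post_init_entity_util_avg() then wraps the divisor to 0. A minimal
      user-space sketch of just that arithmetic (assuming a 64-bit unsigned
      long; this is not kernel code):

        #include <stdio.h>

        int main(void)
        {
                unsigned long load_avg = 5, r = 6;

                /* the intermediate value the compiler is allowed to store */
                load_avg = load_avg - r;              /* wraps to ULONG_MAX */

                unsigned long divisor = load_avg + 1; /* wraps back to 0 */
                printf("divisor = %lu\n", divisor);   /* 0 -> divide by zero upstream */
                return 0;
        }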
      Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Cc: bsegall@google.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: steve.muckle@linaro.org
      Fixes: 2b8c41da ("sched/fair: Initiate a new task's util avg to a bounded value")
      Link: http://lkml.kernel.org/r/20160609130750.GQ30909@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 12 May 2016 (6 commits)
    • sched/fair: Correct unit of load_above_capacity · cfa10334
      Authored by Morten Rasmussen
      In calculate_imbalance(), load_above_capacity currently has the unit
      [capacity] while it is used as being [load/capacity]. Not only is it
      wrong, it also makes it unlikely that load_above_capacity is ever
      used, as the subsequent code picks the smaller of load_above_capacity
      and the avg_load.
      
      This patch ensures that load_above_capacity has the right unit
      [load/capacity].
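
      As a rough, standalone illustration of the conversion in question (the
      numbers are made up and this is not the patch itself), a surplus
      measured in [capacity] units is rescaled by NICE_0_LOAD and divided by
      the group capacity so that it becomes comparable with avg_load:

        #include <stdio.h>

        int main(void)
        {
                /* hypothetical values, purely to show the unit arithmetic */
                unsigned long surplus        = 512;   /* [capacity] above what the group can serve */
                unsigned long nice_0_load    = 1024;  /* NICE_0_LOAD, [load] */
                unsigned long group_capacity = 2048;  /* [capacity] */

                /* [capacity] * [load] / [capacity] -> the same scale as avg_load */
                unsigned long load_above_capacity = surplus * nice_0_load / group_capacity;

                printf("load_above_capacity = %lu\n", load_above_capacity);
                return 0;
        }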
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      [ Changed changelog to note it was in capacity unit; +rebase. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1461958364-675-4-git-send-email-dietmar.eggemann@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Clean up scale confusion · 1be0eb2a
      Authored by Peter Zijlstra
      Wanpeng noted that the scale_load_down() in calculate_imbalance() was
      weird. I agree, it should be SCHED_CAPACITY_SCALE, since we're going
      to compare against busiest->group_capacity, which is in [capacity]
      units.
      Reported-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix fairness issue on migration · 2f950354
      Authored by Peter Zijlstra
      Pavan reported that in the presence of very light tasks (or cgroups)
      the placement of migrated tasks can cause severe fairness issues.
      
      The problem is that enqueue_entity() places the task before it updates
      time, so it can place the task far in the past (remember that light
      tasks will shoot virtual time forward at high speed, so relative to
      the pre-existing light task, we can land far in the past).
      
      This is done because update_curr() needs the current task, and we
      might be placing the current task.
      
      The obvious solution is to differentiate between the current and any
      other task; placing the current before we update time, and placing any
      other task after, such that !curr tasks end up at the current moment
      in time, and not in the past.
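
      A simplified sketch of that ordering (condensed from the shape of
      enqueue_entity(); the renormalization condition is approximate and
      this is not the literal diff):

        static void enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
        {
                bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);
                bool curr   = cfs_rq->curr == se;

                /* curr is part of the clock update itself, so renormalize it first */
                if (renorm && curr)
                        se->vruntime += cfs_rq->min_vruntime;

                update_curr(cfs_rq);    /* advance the clock and min_vruntime */

                /* every other task is placed only now, so it lands at the present */
                if (renorm && !curr)
                        se->vruntime += cfs_rq->min_vruntime;

                /* account load, place_entity(), insert into the rbtree, etc. */
        }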
      
      This commit re-introduces the previously reverted commit:
      
        3a47d512 ("sched/fair: Fix fairness issue on migration")
      
      ... which is now safe to do, after we've also fixed another
      underlying bug first, in:
      
        sched/fair: Prepare to fix fairness problems on migration
      
      and cleaned up other details in the migration code:
      
        sched/core: Kill sched_class::task_waking
      Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Kill sched_class::task_waking to clean up the migration logic · 59efa0ba
      Authored by Peter Zijlstra
      With sched_class::task_waking being called only when we do
      set_task_cpu(), we can make sched_class::migrate_task_rq() do the work
      and eliminate sched_class::task_waking entirely.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Pavan Kondeti <pkondeti@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: byungchul.park@lge.com
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Prepare to fix fairness problems on migration · b5179ac7
      Authored by Peter Zijlstra
      Mike reported that our recent attempt to fix migration problems:
      
        3a47d512 ("sched/fair: Fix fairness issue on migration")
      
      broke interactivity and the signal starve test. We reverted that
      commit and now let's try it again more carefully, with some other
      underlying problems fixed first.
      
      One problem is that I assumed ENQUEUE_WAKING was only set when we do a
      cross-cpu wakeup (migration), which isn't true. This means we now
      destroy the vruntime history of tasks and wakeup-preemption suffers.
      
      Cure this by making my assumption true: only call
      sched_class::task_waking() when we do a cross-cpu wakeup. This avoids
      the indirect call in the case of a local wakeup.
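
      A condensed sketch of the resulting wakeup path (simplified, not the
      literal diff):

        /* in try_to_wake_up(), once the task is off its old runqueue */
        cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
        if (task_cpu(p) != cpu) {
                wake_flags |= WF_MIGRATED;

                /* only a genuine cross-CPU wakeup pays for the indirect call */
                if (p->sched_class->task_waking)
                        p->sched_class->task_waking(p);

                set_task_cpu(p, cpu);
        }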
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Pavan Kondeti <pkondeti@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: byungchul.park@lge.com
      Cc: linux-kernel@vger.kernel.org
      Fixes: 3a47d512 ("sched/fair: Fix fairness issue on migration")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Move record_wakee() · c58d25f3
      Authored by Peter Zijlstra
      Since I want to make ->task_waking() conditional on the task getting
      migrated, we cannot use it to call record_wakee().
      
      Move it to select_task_rq_fair(), which gets called in almost all the
      same conditions. The only exception is if the woken task (@p) is
      CPU-bound (as per the nr_cpus_allowed test in select_task_rq()).
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Pavan Kondeti <pkondeti@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: byungchul.park@lge.com
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 11 May 2016 (1 commit)
  4. 07 May 2016 (1 commit)
  5. 06 May 2016 (1 commit)
  6. 05 May 2016 (7 commits)
  7. 23 April 2016 (7 commits)
    • sched/fair: Optimize !CONFIG_NO_HZ_COMMON CPU load updates · 9fd81dd5
      Authored by Frederic Weisbecker
      Some code in the CPU load update path only concerns NO_HZ configs but
      is built in all configurations. When NO_HZ isn't built, that code is
      harmless but wastes some CPU and memory resources:
      
      1) one useless field in struct rq
      2) jiffies record on every tick that is never used (cpu_load_update_periodic)
      3) decay_load_missed() is called twice on every tick only to return
         immediately with no action taken. In these configurations it is
         dead code.
      
      For pure optimization purposes, let's conditionally build the NO_HZ
      related code.
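
      A sketch of the shape of the change (the hunks are condensed;
      cpu_load_update_periodic() and decay_load_missed() are the helpers
      named above, while the rq field shown and the cpu_load_update() call
      are approximations):

        struct rq {
        #ifdef CONFIG_NO_HZ_COMMON
                /* only the tickless load path needs to remember the last update */
                unsigned long last_load_update_tick;
        #endif
                /* ... */
        };

        static void cpu_load_update_periodic(struct rq *this_rq, unsigned long load)
        {
        #ifdef CONFIG_NO_HZ_COMMON
                /* the jiffies snapshot and decay_load_missed() only matter
                 * when ticks can actually be skipped */
                this_rq->last_load_update_tick = READ_ONCE(jiffies);
        #endif
                cpu_load_update(this_rq, load, 1);
        }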
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1461080211-16271-1-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Correctly handle nohz ticks CPU load accounting · 1f41906a
      Authored by Frederic Weisbecker
      Ticks can happen while the CPU is in dynticks-idle or dynticks-singletask
      mode. In fact "nohz" or "dynticks" only mean that we exit the periodic
      mode and we try to minimize the ticks as much as possible. The nohz
      subsystem uses a confusing terminology with the internal state
      "ts->tick_stopped" which is also available through its public interface
      with tick_nohz_tick_stopped(). This is a misnomer, as the tick is
      reduced on a best-effort basis rather than stopped. In the best case
      the tick can indeed be stopped entirely, but there is no guarantee of
      that.
      If a timer needs to fire one second later, a tick will fire while the
      CPU is in nohz mode and this is a very common scenario.
      
      Now this confusion happens to be a problem with CPU load updates:
      cpu_load_update_active() doesn't handle nohz ticks correctly because it
      assumes that ticks are completely stopped in nohz mode and that
      cpu_load_update_active() can't be called in dynticks mode. When that
      happens, the whole previous tickless load is ignored and the function
      just records the load for the current tick, ignoring the potentially
      long idle period that preceded it.
      
      In order to solve this, we could account the current load for the
      previous nohz time but there is a risk that we account the load of a
      task that got freshly enqueued for the whole nohz period.
      
      So instead, let's record the dynticks load on nohz frame entry so we
      know what to record in case of nohz ticks, then use this record to
      account the tickless load on nohz ticks and at the nohz frame end.
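
      A condensed sketch of the entry hook (the helper name follows the
      cpu_load_update_*() convention from the previous patch; this is a
      simplification, not necessarily the exact final code):

        /* entering the nohz frame: snapshot the load we are leaving behind */
        void cpu_load_update_nohz_start(void)
        {
                struct rq *this_rq = this_rq();

                /*
                 * Later nohz ticks, and the nohz frame end, account the whole
                 * tickless period against this snapshot instead of treating
                 * the previous tick as if it were only one jiffy ago.
                 */
                this_rq->cpu_load[0] = weighted_cpuload(cpu_of(this_rq));
        }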
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1460555812-25375-3-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Gather CPU load functions under a more conventional namespace · cee1afce
      Authored by Frederic Weisbecker
      The CPU load update related functions have a weak naming convention
      currently, starting with update_cpu_load_*() which isn't ideal as
      "update" is a very generic concept.
      
      Since two of these functions are public already (and a third is to come)
      that's enough to introduce a more conventional naming scheme. So let's
      do the following rename instead:
      
      	update_cpu_load_*() -> cpu_load_update_*()
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1460555812-25375-2-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Call cpufreq hook in additional paths · a2c6c91f
      Authored by Steve Muckle
      The cpufreq hook should be called any time the root CFS rq utilization
      changes. This can occur when a task is switched to or from the fair
      class, or a task moves between groups or CPUs, but these paths
      currently do not call the cpufreq hook.
      
      Fix this by adding the hook to attach_entity_load_avg() and
      detach_entity_load_avg().
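
      A hedged sketch of the pattern (cfs_rq_util_change() is a hypothetical
      helper name used here for illustration, and the cpufreq hook signature
      is approximate, not taken from the patch):

        /* notify cpufreq whenever the root cfs_rq's utilization may have moved */
        static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
        {
                struct rq *rq = rq_of(cfs_rq);

                if (&rq->cfs == cfs_rq)
                        cpufreq_update_util(rq_clock(rq), cfs_rq->avg.util_avg,
                                            rq->cpu_capacity_orig);
        }

        static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
        {
                /* ... fold se's averages into cfs_rq->avg ... */
                cfs_rq_util_change(cfs_rq);     /* root utilization changed */
        }

        static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
        {
                /* ... subtract se's averages from cfs_rq->avg ... */
                cfs_rq_util_change(cfs_rq);
        }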
      Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Steve Muckle <smuckle@linaro.org>
      [ Added the .update_freq argument to update_cfs_rq_load_avg() to avoid a double cpufreq call. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Juri Lelli <Juri.Lelli@arm.com>
      Cc: Michael Turquette <mturquette@baylibre.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1458858367-2831-1-git-send-email-smuckle@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Do not call cpufreq hook unless util changed · 41e0d37f
      Authored by Steve Muckle
      There's no reason to call the cpufreq hook if the root cfs_rq
      utilization has not been modified.
      Signed-off-by: Steve Muckle <smuckle@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Juri Lelli <Juri.Lelli@arm.com>
      Cc: Michael Turquette <mturquette@baylibre.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/1458606068-7476-2-git-send-email-smuckle@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Move cpufreq hook to update_cfs_rq_load_avg() · 21e96f88
      Authored by Steve Muckle
      The cpufreq hook should be called whenever the root cfs_rq
      utilization changes so update_cfs_rq_load_avg() is a better
      place for it. The current location is not invoked in the
      enqueue_entity() or update_blocked_averages() paths.
      Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Steve Muckle <smuckle@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Juri Lelli <Juri.Lelli@arm.com>
      Cc: Michael Turquette <mturquette@baylibre.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1458606068-7476-1-git-send-email-smuckle@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix asym packing to select correct CPU · 1f621e02
      Authored by Srikar Dronamraju
      When asymmetric packing is set in the sched_domain and the target CPU
      is busy, update_sd_pick_busiest() may not select the busiest runqueue.
      When the target CPU is busy, find_busiest_group() will ignore the asym
      packing checks and may continue to load balance using the currently
      selected, not-the-busiest runqueue as the source runqueue. Selecting
      the busiest runqueue as the source when the target CPU is busy should
      result in a much better load balance.
      
      Also, when the target CPU is not busy and asymmetric packing is set in
      the sd, select a higher-numbered CPU as the source CPU for load
      balancing.
      
      While doing this change, move the check to see if target CPU is busy
      into check_asym_packing().
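
      A condensed sketch of where that check ends up (simplified; the
      imbalance computation and some guards are elided):

        static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds)
        {
                if (!(env->sd->flags & SD_ASYM_PACKING))
                        return 0;

                /* moved here: a busy target CPU must not short-circuit the
                 * normal busiest-group selection via asym packing */
                if (env->idle == CPU_NOT_IDLE)
                        return 0;

                if (!sds->busiest)
                        return 0;

                /* only pull towards a lower-numbered CPU */
                if (env->dst_cpu > group_first_cpu(sds->busiest))
                        return 0;

                /* set env->imbalance from the busiest group's load (elided) */
                return 1;
        }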
      
      The extent of performance benefit from this change decreases with the
      increasing load. However there is benefit in undercommit as well as
      overcommit conditions.
      
      1. Record per second ebizzy (32 threads) on a 64 CPU power 7 box. (5 iterations)
      4.6.0-rc2
      	Testcase:         Min         Max         Avg      StdDev
      	  ebizzy:  5223767.00 10368236.00  7946971.00  1753094.76
      
      4.6.0-rc2+asym-changes
      	Testcase:         Min         Max         Avg      StdDev     %Change
      	  ebizzy:  8617191.00 13872356.00 11383980.00  1783400.89     +24.78%
      
      2. Record per second ebizzy (64 threads) on a 64 CPU power 7 box. (5 iterations)
      4.6.0-rc2
      	Testcase:         Min         Max         Avg      StdDev
      	  ebizzy:  6497666.00 18399783.00 10818093.20  4051452.08
      
      4.6.0-rc2+asym-changes
      	Testcase:         Min         Max         Avg      StdDev     %Change
      	  ebizzy:  7567365.00 19456937.00 11674063.60  4295407.48      +4.40%
      
      3. Record per second ebizzy (128 threads) on a 64 CPU power 7 box. (5 iterations)
      4.6.0-rc2
      	Testcase:         Min         Max         Avg      StdDev
      	  ebizzy: 37073983.00 40341911.00 38776241.80  1259766.82
      
      4.6.0-rc2+asym-changes
      	Testcase:         Min         Max         Avg      StdDev     %Change
      	  ebizzy: 38030399.00 41333378.00 39827404.40  1255001.86      +2.54%
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1459948660-16073-1-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 31 March 2016 (4 commits)
  9. 21 March 2016 (2 commits)
  10. 09 March 2016 (1 commit)
  11. 29 February 2016 (3 commits)
  12. 09 February 2016 (2 commits)
    • sched/numa: Spread memory according to CPU and memory use · 4142c3eb
      Authored by Rik van Riel
      The pseudo-interleaving in NUMA placement has a fundamental problem:
      using hard usage thresholds to spread memory equally between nodes
      can prevent workloads from converging, or keep memory "trapped" on
      nodes where the workload is barely running any more.
      
      In order for workloads to properly converge, the memory migration
      should not be stopped when nodes reach parity, but instead be
      distributed according to how heavily memory is used from each node.
      This way memory migration and task migration reinforce each other,
      instead of one putting the brakes on the other.
      
      Remove the hard thresholds from the pseudo-interleaving code, and
      instead use a more gradual policy on memory placement. This also
      seems to improve convergence of workloads that do not run flat out,
      but sleep in between bursts of activity.
      
      We still want to slow down NUMA scanning and migration once a workload
      has settled on a few actively used nodes, so keep the 3/4 hysteresis
      in place. Keep track of whether a workload is actively running on
      multiple nodes, so task_numa_migrate does a full scan of the system
      for better task placement.
      
      In the case of running 3 SPECjbb2005 instances on a 4 node system,
      this code seems to result in fairer distribution of memory between
      nodes, with more memory bandwidth for each instance.
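
      A rough sketch of the resulting migrate-or-not comparison (the wrapper
      function is hypothetical and the exact factors are approximate;
      group_faults() and group_faults_cpu() are the NUMA fault-statistics
      helpers in kernel/sched/fair.c):

        /* should a page move from src_nid to dst_nid for task p in group ng? */
        static bool prefer_dst_node(struct task_struct *p, struct numa_group *ng,
                                    int src_nid, int dst_nid)
        {
                /*
                 * No hard "nodes are equal, stop" threshold: weigh the
                 * destination node's CPU use and fault rate against the
                 * source's, keeping a 3/4 factor as hysteresis so memory
                 * does not bounce between nearly-equal nodes.
                 */
                return group_faults_cpu(ng, dst_nid) * group_faults(p, src_nid) * 3 >
                       group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4;
        }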
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: mgorman@suse.de
      Link: http://lkml.kernel.org/r/20160125170739.2fc9a641@annuminas.surriel.com
      [ Minor readability tweaks. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Make schedstats a runtime tunable that is disabled by default · cb251765
      Authored by Mel Gorman
      schedstats is very useful during debugging and performance tuning but it
      incurs overhead to calculate the stats. As such, even though it can be
      disabled at build time, it is often enabled as the information is useful.
      
      This patch adds a kernel command-line and sysctl tunable to enable or
      disable schedstats on demand (when it's built in). It is disabled
      by default as someone who knows they need it can also learn to enable
      it when necessary.
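
      A hedged sketch of the gating pattern (the macro bodies are simplified
      and not the exact kernel definitions); at runtime the switch is the
      schedstats= boot parameter or the schedstats sysctl mentioned below:

        /* a static key keeps the disabled case close to free */
        DEFINE_STATIC_KEY_FALSE(sched_schedstats);

        #define schedstat_enabled()     static_branch_unlikely(&sched_schedstats)

        #define schedstat_inc(rq, field)                        \
        do {                                                    \
                if (schedstat_enabled())                        \
                        (rq)->field++;                          \
        } while (0)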
      
      The benefits are dependent on how scheduler-intensive the workload is.
      If it is then the patch reduces the number of cycles spent calculating
      the stats with a small benefit from reducing the cache footprint of the
      scheduler.
      
      These measurements were taken on a 48-core 2-socket machine with
      Xeon(R) E5-2670 v3 CPUs, although they were also run on a
      single-socket 8-core machine with Intel i7-3770 processors.
      
      netperf-tcp
                                 4.5.0-rc1             4.5.0-rc1
                                   vanilla          nostats-v3r1
      Hmean    64         560.45 (  0.00%)      575.98 (  2.77%)
      Hmean    128        766.66 (  0.00%)      795.79 (  3.80%)
      Hmean    256        950.51 (  0.00%)      981.50 (  3.26%)
      Hmean    1024      1433.25 (  0.00%)     1466.51 (  2.32%)
      Hmean    2048      2810.54 (  0.00%)     2879.75 (  2.46%)
      Hmean    3312      4618.18 (  0.00%)     4682.09 (  1.38%)
      Hmean    4096      5306.42 (  0.00%)     5346.39 (  0.75%)
      Hmean    8192     10581.44 (  0.00%)    10698.15 (  1.10%)
      Hmean    16384    18857.70 (  0.00%)    18937.61 (  0.42%)
      
      Small gains here; UDP_STREAM showed nothing interesting and neither did
      the TCP_RR tests. The gains on the 8-core machine were very similar.
      
      tbench4
                                       4.5.0-rc1             4.5.0-rc1
                                         vanilla          nostats-v3r1
      Hmean    mb/sec-1         500.85 (  0.00%)      522.43 (  4.31%)
      Hmean    mb/sec-2         984.66 (  0.00%)     1018.19 (  3.41%)
      Hmean    mb/sec-4        1827.91 (  0.00%)     1847.78 (  1.09%)
      Hmean    mb/sec-8        3561.36 (  0.00%)     3611.28 (  1.40%)
      Hmean    mb/sec-16       5824.52 (  0.00%)     5929.03 (  1.79%)
      Hmean    mb/sec-32      10943.10 (  0.00%)    10802.83 ( -1.28%)
      Hmean    mb/sec-64      15950.81 (  0.00%)    16211.31 (  1.63%)
      Hmean    mb/sec-128     15302.17 (  0.00%)    15445.11 (  0.93%)
      Hmean    mb/sec-256     14866.18 (  0.00%)    15088.73 (  1.50%)
      Hmean    mb/sec-512     15223.31 (  0.00%)    15373.69 (  0.99%)
      Hmean    mb/sec-1024    14574.25 (  0.00%)    14598.02 (  0.16%)
      Hmean    mb/sec-2048    13569.02 (  0.00%)    13733.86 (  1.21%)
      Hmean    mb/sec-3072    12865.98 (  0.00%)    13209.23 (  2.67%)
      
      Small gains of 2-4% at low thread counts and otherwise flat. The
      gains on the 8-core machine were slightly different:
      
      tbench4 on 8-core i7-3770 single socket machine
      Hmean    mb/sec-1        442.59 (  0.00%)      448.73 (  1.39%)
      Hmean    mb/sec-2        796.68 (  0.00%)      794.39 ( -0.29%)
      Hmean    mb/sec-4       1322.52 (  0.00%)     1343.66 (  1.60%)
      Hmean    mb/sec-8       2611.65 (  0.00%)     2694.86 (  3.19%)
      Hmean    mb/sec-16      2537.07 (  0.00%)     2609.34 (  2.85%)
      Hmean    mb/sec-32      2506.02 (  0.00%)     2578.18 (  2.88%)
      Hmean    mb/sec-64      2511.06 (  0.00%)     2569.16 (  2.31%)
      Hmean    mb/sec-128     2313.38 (  0.00%)     2395.50 (  3.55%)
      Hmean    mb/sec-256     2110.04 (  0.00%)     2177.45 (  3.19%)
      Hmean    mb/sec-512     2072.51 (  0.00%)     2053.97 ( -0.89%)
      
      In contrast, this shows a relatively steady 2-3% gain at higher thread
      counts. Due to the nature of the patch and the type of workload, it's
      not a surprise that the result depends on the CPU used.
      
      hackbench-pipes
                               4.5.0-rc1             4.5.0-rc1
                                 vanilla          nostats-v3r1
      Amean    1        0.0637 (  0.00%)      0.0660 ( -3.59%)
      Amean    4        0.1229 (  0.00%)      0.1181 (  3.84%)
      Amean    7        0.1921 (  0.00%)      0.1911 (  0.52%)
      Amean    12       0.3117 (  0.00%)      0.2923 (  6.23%)
      Amean    21       0.4050 (  0.00%)      0.3899 (  3.74%)
      Amean    30       0.4586 (  0.00%)      0.4433 (  3.33%)
      Amean    48       0.5910 (  0.00%)      0.5694 (  3.65%)
      Amean    79       0.8663 (  0.00%)      0.8626 (  0.43%)
      Amean    110      1.1543 (  0.00%)      1.1517 (  0.22%)
      Amean    141      1.4457 (  0.00%)      1.4290 (  1.16%)
      Amean    172      1.7090 (  0.00%)      1.6924 (  0.97%)
      Amean    192      1.9126 (  0.00%)      1.9089 (  0.19%)
      
      Some small gains and losses, and while the variance data is not
      included, it's close to the noise. The UMA machine did not show
      anything particularly different:
      
      pipetest
                                   4.5.0-rc1             4.5.0-rc1
                                     vanilla          nostats-v2r2
      Min         Time        4.13 (  0.00%)        3.99 (  3.39%)
      1st-qrtle   Time        4.38 (  0.00%)        4.27 (  2.51%)
      2nd-qrtle   Time        4.46 (  0.00%)        4.39 (  1.57%)
      3rd-qrtle   Time        4.56 (  0.00%)        4.51 (  1.10%)
      Max-90%     Time        4.67 (  0.00%)        4.60 (  1.50%)
      Max-93%     Time        4.71 (  0.00%)        4.65 (  1.27%)
      Max-95%     Time        4.74 (  0.00%)        4.71 (  0.63%)
      Max-99%     Time        4.88 (  0.00%)        4.79 (  1.84%)
      Max         Time        4.93 (  0.00%)        4.83 (  2.03%)
      Mean        Time        4.48 (  0.00%)        4.39 (  1.91%)
      Best99%Mean Time        4.47 (  0.00%)        4.39 (  1.91%)
      Best95%Mean Time        4.46 (  0.00%)        4.38 (  1.93%)
      Best90%Mean Time        4.45 (  0.00%)        4.36 (  1.98%)
      Best50%Mean Time        4.36 (  0.00%)        4.25 (  2.49%)
      Best10%Mean Time        4.23 (  0.00%)        4.10 (  3.13%)
      Best5%Mean  Time        4.19 (  0.00%)        4.06 (  3.20%)
      Best1%Mean  Time        4.13 (  0.00%)        4.00 (  3.39%)
      
      Small improvement and similar gains were seen on the UMA machine.
      
      The gain is small but it stands to reason that doing less work in the
      scheduler is a good thing. The downside is that the lack of schedstats and
      tracepoints may be surprising to experts doing performance analysis until
      they find the existence of the schedstats= parameter or schedstats sysctl.
      It will be automatically activated for latencytop and sleep profiling
      to alleviate the problem. For tracepoints, there is a simple warning,
      as it's not safe to activate schedstats in the context where it becomes
      known that the tracepoint may be wanted but its data is unavailable.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <mgalbraith@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1454663316-22048-1-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  13. 22 January 2016 (1 commit)
    • sched/numa: Fix use-after-free bug in the task_numa_compare · 1dff76b9
      Authored by Gavin Guo
      The following message can be observed on an Ubuntu v3.13.0-65 kernel
      with KASan backported:
      
        ==================================================================
        BUG: KASan: use after free in task_numa_find_cpu+0x64c/0x890 at addr ffff880dd393ecd8
        Read of size 8 by task qemu-system-x86/3998900
        =============================================================================
        BUG kmalloc-128 (Tainted: G    B        ): kasan: bad access detected
        -----------------------------------------------------------------------------
      
        INFO: Allocated in task_numa_fault+0xc1b/0xed0 age=41980 cpu=18 pid=3998890
      	__slab_alloc+0x4f8/0x560
      	__kmalloc+0x1eb/0x280
      	task_numa_fault+0xc1b/0xed0
      	do_numa_page+0x192/0x200
      	handle_mm_fault+0x808/0x1160
      	__do_page_fault+0x218/0x750
      	do_page_fault+0x1a/0x70
      	page_fault+0x28/0x30
      	SyS_poll+0x66/0x1a0
      	system_call_fastpath+0x1a/0x1f
        INFO: Freed in task_numa_free+0x1d2/0x200 age=62 cpu=18 pid=0
      	__slab_free+0x2ab/0x3f0
      	kfree+0x161/0x170
      	task_numa_free+0x1d2/0x200
      	finish_task_switch+0x1d2/0x210
      	__schedule+0x5d4/0xc60
      	schedule_preempt_disabled+0x40/0xc0
      	cpu_startup_entry+0x2da/0x340
      	start_secondary+0x28f/0x360
        Call Trace:
         [<ffffffff81a6ce35>] dump_stack+0x45/0x56
         [<ffffffff81244aed>] print_trailer+0xfd/0x170
         [<ffffffff8124ac36>] object_err+0x36/0x40
         [<ffffffff8124cbf9>] kasan_report_error+0x1e9/0x3a0
         [<ffffffff8124d260>] kasan_report+0x40/0x50
         [<ffffffff810dda7c>] ? task_numa_find_cpu+0x64c/0x890
         [<ffffffff8124bee9>] __asan_load8+0x69/0xa0
         [<ffffffff814f5c38>] ? find_next_bit+0xd8/0x120
         [<ffffffff810dda7c>] task_numa_find_cpu+0x64c/0x890
         [<ffffffff810de16c>] task_numa_migrate+0x4ac/0x7b0
         [<ffffffff810de523>] numa_migrate_preferred+0xb3/0xc0
         [<ffffffff810e0b88>] task_numa_fault+0xb88/0xed0
         [<ffffffff8120ef02>] do_numa_page+0x192/0x200
         [<ffffffff81211038>] handle_mm_fault+0x808/0x1160
         [<ffffffff810d7dbd>] ? sched_clock_cpu+0x10d/0x160
         [<ffffffff81068c52>] ? native_load_tls+0x82/0xa0
         [<ffffffff81a7bd68>] __do_page_fault+0x218/0x750
         [<ffffffff810c2186>] ? hrtimer_try_to_cancel+0x76/0x160
         [<ffffffff81a6f5e7>] ? schedule_hrtimeout_range_clock.part.24+0xf7/0x1c0
         [<ffffffff81a7c2ba>] do_page_fault+0x1a/0x70
         [<ffffffff81a772e8>] page_fault+0x28/0x30
         [<ffffffff8128cbd4>] ? do_sys_poll+0x1c4/0x6d0
         [<ffffffff810e64f6>] ? enqueue_task_fair+0x4b6/0xaa0
         [<ffffffff810233c9>] ? sched_clock+0x9/0x10
         [<ffffffff810cf70a>] ? resched_task+0x7a/0xc0
         [<ffffffff810d0663>] ? check_preempt_curr+0xb3/0x130
         [<ffffffff8128b5c0>] ? poll_select_copy_remaining+0x170/0x170
         [<ffffffff810d3bc0>] ? wake_up_state+0x10/0x20
         [<ffffffff8112a28f>] ? drop_futex_key_refs.isra.14+0x1f/0x90
         [<ffffffff8112d40e>] ? futex_requeue+0x3de/0xba0
         [<ffffffff8112e49e>] ? do_futex+0xbe/0x8f0
         [<ffffffff81022c89>] ? read_tsc+0x9/0x20
         [<ffffffff8111bd9d>] ? ktime_get_ts+0x12d/0x170
         [<ffffffff8108f699>] ? timespec_add_safe+0x59/0xe0
         [<ffffffff8128d1f6>] SyS_poll+0x66/0x1a0
         [<ffffffff81a830dd>] system_call_fastpath+0x1a/0x1f
      
      As commit 1effd9f1 ("sched/numa: Fix unsafe get_task_struct() in
      task_numa_assign()") points out, rcu_read_lock() cannot protect the
      task_struct from being freed in finish_task_switch(). The bug happens
      during the calculation of imp, which accesses p->numa_faults after it
      has been freed in the following path:
      
      do_exit()
              current->flags |= PF_EXITING;
          release_task()
              ~~delayed_put_task_struct()~~
          schedule()
          ...
          ...
      rq->curr = next;
          context_switch()
              finish_task_switch()
                  put_task_struct()
                      __put_task_struct()
      		    task_numa_free()
      
      The fix here is to call get_task_struct() early, before dst_rq->lock
      is dropped, to protect the calculation, and to call put_task_struct()
      at the corresponding point if dst_rq->curr ultimately cannot be
      assigned.
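
      A simplified sketch of the reference-counting pattern (not the literal
      diff): pin dst_rq->curr while it is still known to be current, use it
      for the imp calculation, and drop the reference if it does not end up
      as the best task:

        cur = dst_rq->curr;
        if ((cur->flags & PF_EXITING) || is_idle_task(cur))
                cur = NULL;
        else
                get_task_struct(cur);   /* keep cur->numa_faults alive across a context switch */

        /* ... compute imp, possibly task_numa_assign(env, cur, imp) ... */

        if (cur && cur != env->best_task)
                put_task_struct(cur);   /* not chosen: release the pin */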
      
      Additional credit to Liang Chen, who helped fix the error logic and
      add the put_task_struct() call in the place it was missing.
      Signed-off-by: Gavin Guo <gavin.guo@canonical.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: jay.vosburgh@canonical.com
      Cc: liang.chen@canonical.com
      Link: http://lkml.kernel.org/r/1453264618-17645-1-git-send-email-gavin.guo@canonical.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  14. 06 January 2016 (2 commits)