1. 27 June 2016 (2 commits)
    • sched/cgroup: Fix cpu_cgroup_fork() handling · ea86cb4b
      Committed by Vincent Guittot
      A new fair task is detached and attached from/to task_group with:
      
        cgroup_post_fork()
          ss->fork(child) := cpu_cgroup_fork()
            sched_move_task()
              task_move_group_fair()
      
      Which is wrong, because at this point in fork() the task isn't fully
      initialized and it cannot 'move' to another group, because it's not
      attached to any group yet.
      
      In fact, cpu_cgroup_fork() only needs a small part of sched_move_task(),
      so we can just call that small part directly instead of
      sched_move_task(). And the task doesn't really migrate because it is
      not yet attached, so we need the following sequence:
      
        do_fork()
          sched_fork()
            __set_task_cpu()
      
          cgroup_post_fork()
            set_task_rq() # set task group and runqueue
      
          wake_up_new_task()
            select_task_rq() can select a new cpu
            __set_task_cpu
            post_init_entity_util_avg
              attach_task_cfs_rq()
            activate_task
              enqueue_task
      
      This patch makes that happen.
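The ordering above can be modeled with a tiny user-space sketch (the structs and toy_-prefixed helpers are invented for illustration; this is not kernel code): cgroup_post_fork() only records the group, the set_task_rq() step, and the attach work happens later in wake_up_new_task().

```c
#include <assert.h>

/* Toy model of the fork-time ordering the patch establishes. */

struct task_group { int id; };

struct toy_task {
	struct task_group *group;	/* set by the set_task_rq() step */
	int cpu;			/* set by the __set_task_cpu() step */
	int attached;			/* set by the attach_task_cfs_rq() step */
};

static void toy_set_task_cpu(struct toy_task *p, int cpu) { p->cpu = cpu; }

/* The "small part" of sched_move_task(): just point at the group. */
static void toy_set_task_rq(struct toy_task *p, struct task_group *tg)
{
	p->group = tg;
}

static void toy_sched_fork(struct toy_task *p) { toy_set_task_cpu(p, 0); }

static void toy_cgroup_post_fork(struct toy_task *p, struct task_group *tg)
{
	/* No 'move': the task was never attached anywhere yet. */
	assert(!p->attached);
	toy_set_task_rq(p, tg);
}

static void toy_wake_up_new_task(struct toy_task *p, int cpu)
{
	toy_set_task_cpu(p, cpu);	/* select_task_rq() may pick a new CPU */
	p->attached = 1;		/* attach_task_cfs_rq() */
}
```

The point of the model is only the ordering: group assignment precedes (and is separate from) attachment.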
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      [ Added TASK_SET_GROUP to set depth properly. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix and optimize the fork() path · e210bffd
      Committed by Peter Zijlstra
      The task_fork_fair() callback already calls __set_task_cpu() and takes
      rq->lock.
      
      If we move the sched_class::task_fork callback in sched_fork() under
      the existing p->pi_lock, right after its set_task_cpu() call, we can
      avoid doing two such calls and omit the IRQ disabling on the rq->lock.
      
      Change to __set_task_cpu() to skip the migration bits; this is a new
      task, not a migration. Similarly, make wake_up_new_task() use
      __set_task_cpu() for the same reason: the task hasn't actually
      migrated as it hasn't ever run.
      
      This cures the problem of calling migrate_task_rq_fair(), which does
      remove_entity_load_avg() on tasks that have never been added to
      the load avg to begin with.
      
      This bug would result in transiently messed up load_avg values, averaged
      out after a few dozen milliseconds. This is probably the reason why
      this bug was not found for such a long time.
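A minimal user-space model of the distinction (toy_-prefixed names invented for illustration): __set_task_cpu() only records placement, while set_task_cpu() would additionally run migration callbacks that a task which has never run must not see.

```c
#include <assert.h>

/* Counts "migration" callbacks, standing in for migrate_task_rq_fair(). */
static int migrate_callback_calls;

struct toy_task { int cpu; };

static void toy___set_task_cpu(struct toy_task *p, int cpu)
{
	p->cpu = cpu;			/* just record placement */
}

static void toy_set_task_cpu(struct toy_task *p, int cpu)
{
	if (p->cpu != cpu)
		migrate_callback_calls++;  /* would remove load-avg state */
	toy___set_task_cpu(p, cpu);
}

/* Fork path after the patch: placement without "migration". */
static void toy_fork_and_wake(struct toy_task *p, int fork_cpu, int wake_cpu)
{
	toy___set_task_cpu(p, fork_cpu);	/* sched_fork() */
	toy___set_task_cpu(p, wake_cpu);	/* wake_up_new_task() */
}
```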
      Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 24 June 2016 (1 commit)
    • sched/core: Allow kthreads to fall back to online && !active cpus · feb245e3
      Committed by Tejun Heo
      During CPU hotplug, CPU_ONLINE callbacks are run while the CPU is
      online but not active.  A CPU_ONLINE callback may create or bind a
      kthread so that its cpus_allowed mask only allows the CPU which is
      being brought online.  The kthread may start executing before the CPU
      is made active and can end up in select_fallback_rq().
      
      In such cases, the expected behavior is selecting the CPU which is
      coming online; however, because select_fallback_rq() only chooses from
      active CPUs, it determines that the task doesn't have any viable CPU
      in its allowed mask and ends up overriding it to cpu_possible_mask.
      
      CPU_ONLINE callbacks should be able to put kthreads on the CPU which
      is coming online.  Update select_fallback_rq() so that it follows
      cpu_online() rather than cpu_active() for kthreads.
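The rule can be sketched with plain bitmasks standing in for struct cpumask (toy helpers, not the kernel implementation):

```c
#include <assert.h>

#define NR_CPUS 4

static unsigned online_mask;	/* CPUs that are online */
static unsigned active_mask;	/* CPUs that are online && active */

/* kthreads may fall back to any online CPU; normal tasks only to
 * active CPUs. Returns -1 where the kernel would widen the mask. */
static int toy_select_fallback_rq(unsigned allowed, int is_kthread)
{
	unsigned usable = allowed & (is_kthread ? online_mask : active_mask);
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (usable & (1u << cpu))
			return cpu;
	return -1;	/* caller would fall back to cpu_possible_mask */
}
```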
      Reported-by: Gautham R Shenoy <ego@linux.vnet.ibm.com>
      Tested-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel-team@fb.com
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20160616193504.GB3262@mtj.duckdns.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 14 June 2016 (3 commits)
    • kernel/sysrq, watchdog, sched/core: Reset watchdog on all CPUs while processing sysrq-w · 57675cb9
      Committed by Andrey Ryabinin
      Lengthy output of sysrq-w may take a lot of time on a slow serial console.
      
      Currently we reset NMI-watchdog on the current CPU to avoid spurious
      lockup messages. Sometimes this doesn't work since softlockup watchdog
      might trigger on another CPU which is waiting for an IPI to proceed.
      We reset softlockup watchdogs on all CPUs, but we do this only after
      listing all tasks, and this may be too late on a busy system.
      
      So, reset the watchdogs on all CPUs earlier, in the for_each_process_thread() loop.
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1465474805-14641-1-git-send-email-aryabinin@virtuozzo.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/cputime: Fix prev steal time accounting during CPU hotplug · 3d89e547
      Committed by Wanpeng Li
      Commit:
      
        e9532e69 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")
      
      ... set rq->prev_* to 0 after a CPU hotplug comes back, in order to
      fix the case where (after CPU hotplug) steal time is smaller than
      rq->prev_steal_time.
      
      However, this should never happen. Steal time was only smaller because of the
      KVM-specific bug fixed by the previous patch.  Worse, the previous patch
      triggers a bug on CPU hot-unplug/plug operation: because
      rq->prev_steal_time is cleared, all of the CPU's past steal time will be
      accounted again on hot-plug.
      
      Since the root cause has been fixed, we can just revert commit e9532e69.
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: e9532e69 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")
      Link: http://lkml.kernel.org/r/1465813966-3116-3-git-send-email-wanpeng.li@hotmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix post_init_entity_util_avg() serialization · b7fa30c9
      Committed by Peter Zijlstra
      Chris Wilson reported a divide by 0 at:
      
       post_init_entity_util_avg():
      
       >    725	if (cfs_rq->avg.util_avg != 0) {
       >    726		sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
       > -> 727		sa->util_avg /= (cfs_rq->avg.load_avg + 1);
       >    728
       >    729		if (sa->util_avg > cap)
       >    730			sa->util_avg = cap;
       >    731	} else {
      
      Which, given the lack of serialization and the code generated from
      update_cfs_rq_load_avg(), is entirely possible:
      
      	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
      		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
      		sa->load_avg = max_t(long, sa->load_avg - r, 0);
      		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
      		removed_load = 1;
      	}
      
      turns into:
      
        ffffffff81087064:       49 8b 85 98 00 00 00    mov    0x98(%r13),%rax
        ffffffff8108706b:       48 85 c0                test   %rax,%rax
        ffffffff8108706e:       74 40                   je     ffffffff810870b0
        ffffffff81087070:       4c 89 f8                mov    %r15,%rax
        ffffffff81087073:       49 87 85 98 00 00 00    xchg   %rax,0x98(%r13)
        ffffffff8108707a:       49 29 45 70             sub    %rax,0x70(%r13)
        ffffffff8108707e:       4c 89 f9                mov    %r15,%rcx
        ffffffff81087081:       bb 01 00 00 00          mov    $0x1,%ebx
        ffffffff81087086:       49 83 7d 70 00          cmpq   $0x0,0x70(%r13)
        ffffffff8108708b:       49 0f 49 4d 70          cmovns 0x70(%r13),%rcx
      
      Which you'll note ends up with 'sa->load_avg - r' in memory at
      ffffffff8108707a.
      
      By calling post_init_entity_util_avg() under rq->lock we're sure to be
      fully serialized against PELT updates and cannot observe intermediate
      state like this.
      Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Cc: bsegall@google.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: steve.muckle@linaro.org
      Fixes: 2b8c41da ("sched/fair: Initiate a new task's util avg to a bounded value")
      Link: http://lkml.kernel.org/r/20160609130750.GQ30909@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 11 June 2016 (1 commit)
  5. 08 June 2016 (1 commit)
    • sched/debug: Fix 'schedstats=enable' cmdline option · 4698f88c
      Committed by Josh Poimboeuf
      The 'schedstats=enable' option doesn't work, and also produces the
      following warning during boot:
      
        WARNING: CPU: 0 PID: 0 at /home/jpoimboe/git/linux/kernel/jump_label.c:61 static_key_slow_inc+0x8c/0xa0
        static_key_slow_inc used before call to jump_label_init
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted 4.7.0-rc1+ #25
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
         0000000000000086 3ae3475a4bea95d4 ffffffff81e03da8 ffffffff8143fc83
         ffffffff81e03df8 0000000000000000 ffffffff81e03de8 ffffffff810b1ffb
         0000003d00000096 ffffffff823514d0 ffff88007ff197c8 0000000000000000
        Call Trace:
         [<ffffffff8143fc83>] dump_stack+0x85/0xc2
         [<ffffffff810b1ffb>] __warn+0xcb/0xf0
         [<ffffffff810b207f>] warn_slowpath_fmt+0x5f/0x80
         [<ffffffff811e9c0c>] static_key_slow_inc+0x8c/0xa0
         [<ffffffff810e07c6>] static_key_enable+0x16/0x40
         [<ffffffff8216d633>] setup_schedstats+0x29/0x94
         [<ffffffff82148a05>] unknown_bootoption+0x89/0x191
         [<ffffffff810d8617>] parse_args+0x297/0x4b0
         [<ffffffff82148d61>] start_kernel+0x1d8/0x4a9
         [<ffffffff8214897c>] ? set_init_arg+0x55/0x55
         [<ffffffff82148120>] ? early_idt_handler_array+0x120/0x120
         [<ffffffff821482db>] x86_64_start_reservations+0x2f/0x31
         [<ffffffff82148427>] x86_64_start_kernel+0x14a/0x16d
      
      The problem is that it tries to update the 'sched_schedstats' static key
      before jump labels have been initialized.
      
      Changing jump_label_init() to be called earlier before
      parse_early_param() wouldn't fix it: it would still fail trying to
      poke_text() because mm isn't yet initialized.
      
      Instead, just create a temporary '__sched_schedstats' variable which can
      be copied to the static key later during sched_init() after jump labels
      have been initialized.
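The two-phase pattern can be sketched as follows (the static key is modeled as a plain int; all toy_ names are invented): the early-param handler only writes an ordinary variable, and the copy into the key happens once jump labels are usable.

```c
#include <assert.h>
#include <string.h>

static int jump_labels_ready;
static int __sched_schedstats;		/* plain variable, safe early */
static int sched_schedstats_key;	/* stands in for the static key */

/* Early-boot cmdline handler: must NOT touch the static key yet. */
static int setup_schedstats(const char *arg)
{
	__sched_schedstats = !strcmp(arg, "enable");
	return 1;
}

static void toy_jump_label_init(void) { jump_labels_ready = 1; }

/* Later, in sched_init(): now it is safe to poke the key. */
static void toy_sched_init(void)
{
	assert(jump_labels_ready);
	sched_schedstats_key = __sched_schedstats;
}
```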
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: cb251765 ("sched/debug: Make schedstats a runtime tunable that is disabled by default")
      Link: http://lkml.kernel.org/r/453775fe3433bed65731a583e228ccea806d18cd.1465322027.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  6. 25 May 2016 (1 commit)
    • sched/core: Fix remote wakeups · b7e7ade3
      Committed by Peter Zijlstra
      Commit:
      
        b5179ac7 ("sched/fair: Prepare to fix fairness problems on migration")
      
      ... introduced a bug: Mike Galbraith found that it introduced a
      performance regression, while Paul E. McKenney reported lost
      wakeups and bisected it to this commit.
      
      The reason is that I mis-read ttwu_queue() such that I assumed any
      wakeup that got a remote queue must have had the task migrated.
      
      Since this is not so, we need to transfer this information between
      queueing the wakeup and actually doing the wakeup. Use a new
      task_struct::sched_flag for this; we already write to
      sched_contributes_to_load in the wakeup path, so this is a hot and
      already-modified cacheline.
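The information transfer can be modeled with a toy task structure (field names invented; the changelog only says a new per-task sched flag is used): the migrated/not-migrated decision is made when the wakeup is queued, and read back when the remote CPU executes it.

```c
#include <assert.h>

struct toy_task {
	int cpu;
	unsigned sched_remote_wakeup:1;	/* the transferred bit */
};

/* Wakeup side: queue to a remote CPU, recording whether we migrated.
 * Note a remote queue does NOT imply migration: the target may already
 * be the task's own CPU (the mis-read the changelog describes). */
static void toy_ttwu_queue_remote(struct toy_task *p, int target_cpu)
{
	p->sched_remote_wakeup = (p->cpu != target_cpu);
	p->cpu = target_cpu;
}

/* Remote side: reconstruct the "migrated" wake flag. */
static int toy_wake_flags(const struct toy_task *p)
{
	return p->sched_remote_wakeup ? 1 /* WF_MIGRATED */ : 0;
}
```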
      Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Pavan Kondeti <pkondeti@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: byungchul.park@lge.com
      Fixes: b5179ac7 ("sched/fair: Prepare to fix fairness problems on migration")
      Link: http://lkml.kernel.org/r/20160523091907.GD15728@worktop.ger.corp.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. 12 May 2016 (4 commits)
    • sched/core: Provide a tsk_nr_cpus_allowed() helper · 50605ffb
      Committed by Thomas Gleixner
      tsk_nr_cpus_allowed() is an accessor for task->nr_cpus_allowed which allows
      us to change the representation of ->nr_cpus_allowed if required.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1462969411-17735-2-git-send-email-bigeasy@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/nohz: Fix affine unpinned timers mess · 44496922
      Committed by Wanpeng Li
      The following commit:
      
        9642d18e ("nohz: Affine unpinned timers to housekeepers")
      
      intended to affine unpinned timers to housekeepers:
      
        unpinned timers(full dynticks, idle)   =>   nearest busy housekeepers(otherwise, fallback to any housekeepers)
        unpinned timers(full dynticks, busy)   =>   nearest busy housekeepers(otherwise, fallback to any housekeepers)
        unpinned timers(housekeepers, idle)    =>   nearest busy housekeepers(otherwise, fallback to itself)
      
      However, the !idle_cpu(i) && is_housekeeping_cpu(cpu) check modified the
      intention to:
      
        unpinned timers(full dynticks, idle)   =>   any housekeepers(no matter cpu topology)
        unpinned timers(full dynticks, busy)   =>   any housekeepers(no matter cpu topology)
        unpinned timers(housekeepers, idle)    =>   any busy cpus(otherwise, fallback to any housekeepers)
      
      This patch fixes it by checking for busy housekeepers nearby,
      otherwise falling back to any housekeeper/itself. After the patch:
      
        unpinned timers(full dynticks, idle)   =>   nearest busy housekeepers(otherwise, fallback to any housekeepers)
        unpinned timers(full dynticks, busy)   =>   nearest busy housekeepers(otherwise, fallback to any housekeepers)
        unpinned timers(housekeepers, idle)    =>   nearest busy housekeepers(otherwise, fallback to itself)
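The intended selection order can be sketched like this (toy helpers; a simple index distance stands in for CPU topology levels): prefer the nearest busy housekeeping CPU, then any housekeeper, then the current CPU itself.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 4

static int is_housekeeping[NR_CPUS];
static int is_idle[NR_CPUS];

static int toy_timer_target(int from)
{
	int cpu, best = -1, best_dist = NR_CPUS + 1;

	/* 1) nearest busy housekeeper */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!is_housekeeping[cpu] || is_idle[cpu])
			continue;
		int dist = abs(cpu - from);	/* toy topology distance */
		if (dist < best_dist) {
			best_dist = dist;
			best = cpu;
		}
	}
	if (best >= 0)
		return best;
	/* 2) any housekeeper */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (is_housekeeping[cpu])
			return cpu;
	/* 3) fallback to itself */
	return from;
}
```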
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      [ Fixed the changelog. ]
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Fixes: 9642d18e ("nohz: Affine unpinned timers to housekeepers")
      Link: http://lkml.kernel.org/r/1462344334-8303-1-git-send-email-wanpeng.li@hotmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Kill sched_class::task_waking to clean up the migration logic · 59efa0ba
      Committed by Peter Zijlstra
      With sched_class::task_waking being called only when we do
      set_task_cpu(), we can make sched_class::migrate_task_rq() do the work
      and eliminate sched_class::task_waking entirely.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Pavan Kondeti <pkondeti@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: byungchul.park@lge.com
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Prepare to fix fairness problems on migration · b5179ac7
      Committed by Peter Zijlstra
      Mike reported that our recent attempt to fix migration problems:
      
        3a47d512 ("sched/fair: Fix fairness issue on migration")
      
      broke interactivity and the signal starve test. We reverted that
      commit and now let's try it again more carefully, with some other
      underlying problems fixed first.
      
      One problem is that I assumed ENQUEUE_WAKING was only set when we do a
      cross-cpu wakeup (migration), which isn't true. This means we now
      destroy the vruntime history of tasks and wakeup-preemption suffers.
      
      Cure this by making my assumption true: only call
      sched_class::task_waking() when we do a cross-CPU wakeup. This avoids
      the indirect call in the case of a local wakeup.
      Reported-by: Mike Galbraith <mgalbraith@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Pavan Kondeti <pkondeti@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: byungchul.park@lge.com
      Cc: linux-kernel@vger.kernel.org
      Fixes: 3a47d512 ("sched/fair: Fix fairness issue on migration")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 09 May 2016 (2 commits)
  9. 06 May 2016 (13 commits)
  10. 05 May 2016 (3 commits)
    • locking/lockdep, sched/core: Implement a better lock pinning scheme · e7904a28
      Committed by Peter Zijlstra
      The problem with the existing lock pinning is that each pin is of
      value 1; this means you can simply unpin if you know it's pinned,
      without having any extra information.
      
      This scheme generates a random (16 bit) cookie for each pin and
      requires this same cookie to unpin. This means you have to keep the
      cookie in context.
      
      No objsize difference for !LOCKDEP kernels.
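The cookie scheme can be sketched as follows (a simplified toy model, with no lockdep integration): pinning returns a random 16-bit cookie and unpinning must present the same cookie, so code can no longer blindly unpin a lock someone else pinned.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_lock {
	unsigned short pin_cookie;	/* 0 means "not pinned" */
};

static unsigned short toy_lock_pin(struct toy_lock *l)
{
	unsigned short cookie;

	do {
		cookie = (unsigned short)(rand() & 0xffff);
	} while (!cookie);		/* keep 0 reserved for "unpinned" */
	l->pin_cookie = cookie;
	return cookie;			/* caller must keep this in context */
}

static int toy_lock_unpin(struct toy_lock *l, unsigned short cookie)
{
	if (l->pin_cookie != cookie)
		return -1;		/* lockdep would complain here */
	l->pin_cookie = 0;
	return 0;
}
```

The cost of the scheme is exactly what the changelog says: the cookie has to be carried through the call chain, which is what the rq_flags structure introduced below this series makes convenient.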
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Introduce 'struct rq_flags' · eb580751
      Committed by Peter Zijlstra
      In order to be able to pass around more than just the IRQ flags in the
      future, add a rq_flags structure.
      
      No difference in code generation for the x86_64-defconfig build I
      tested.
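A sketch of the shape such a structure might take (the fields are an assumption based on this changelog and the lock-pinning patch in this same series; types are simplified): callers pass one struct around instead of a bare unsigned long of IRQ flags, leaving room to grow.

```c
#include <assert.h>

/* Hypothetical stand-in for the lockdep pin cookie. */
struct toy_pin_cookie { unsigned short val; };

struct rq_flags {
	unsigned long flags;		/* saved IRQ flags, as before */
	struct toy_pin_cookie cookie;	/* room for more than IRQ flags */
};
```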
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Move task_rq_lock() out of line · 3e71a462
      Committed by Peter Zijlstra
      It's a rather large function; inlining it doesn't seem to make much sense:
      
       $ size defconfig-build/kernel/sched/core.o{.orig,}
          text    data     bss     dec     hex filename
         56533   21037    2320   79890   13812 defconfig-build/kernel/sched/core.o.orig
         55733   21037    2320   79090   134f2 defconfig-build/kernel/sched/core.o
      
      The 'perf bench sched messaging' micro-benchmark shows a visible improvement
      of 4-5%:
      
        $ for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor ; do echo performance > $i ; done
        $ perf stat --null --repeat 25 -- perf bench sched messaging -g 40 -l 5000
      
        pre:
             4.582798193 seconds time elapsed          ( +-  1.41% )
             4.733374877 seconds time elapsed          ( +-  2.10% )
             4.560955136 seconds time elapsed          ( +-  1.43% )
             4.631062303 seconds time elapsed          ( +-  1.40% )
      
        post:
             4.364765213 seconds time elapsed          ( +-  0.91% )
             4.454442734 seconds time elapsed          ( +-  1.18% )
             4.448893817 seconds time elapsed          ( +-  1.41% )
             4.424346872 seconds time elapsed          ( +-  0.97% )
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  11. 28 April 2016 (2 commits)
  12. 23 April 2016 (3 commits)
  13. 13 April 2016 (1 commit)
  14. 31 March 2016 (2 commits)
    • sched/fair: Initiate a new task's util avg to a bounded value · 2b8c41da
      Committed by Yuyang Du
      A new task's util_avg is set to full utilization of a CPU (100% time
      running). This accelerates a new task's utilization ramp-up, useful for
      boosting its execution early on. However, it may result in
      (insanely) high utilization for a transient time period when a flood
      of tasks are spawned. Importantly, it violates the "fundamentally
      bounded" CPU utilization, and its side effect is negative if we don't
      take any measure to bound it.
      
      This patch proposes an algorithm to address this issue. It has
      two methods to approach a sensible initial util_avg:
      
      (1) An expected (or average) util_avg based on its cfs_rq's util_avg:
      
        util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight
      
      (2) A trajectory of how successive new tasks' util develops, which
      gives 1/2 of the remaining utilization budget to a new task such that
      the additional util is noticeably large (when overall util is low) or
      unnoticeably small (when overall util is high enough). In the meantime,
      the aggregate utilization is well bounded:
      
        util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n
      
      where n denotes the nth task.
      
      If util_avg is larger than util_avg_cap, then the effective util is
      clamped to the util_avg_cap.
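Method (2) can be checked numerically; a minimal sketch of the cap for successive tasks spawned on an initially idle cfs_rq (toy function, integer arithmetic as in the formula above):

```c
#include <assert.h>

/* util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2: each new task gets
 * half of the remaining utilization budget. */
static int util_avg_cap(int cfs_util_avg)
{
	return (1024 - cfs_util_avg) / 2;
}
```

Starting from an empty cfs_rq, successive caps are 512, 256, 128, ..., so the aggregate contribution of a flood of new tasks stays strictly below 1024 (full capacity).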
      Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: steve.muckle@linaro.org
      Link: http://lkml.kernel.org/r/1459283456-21682-1-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Add preempt checks in preempt_schedule() code · 47252cfb
      Committed by Steven Rostedt
      While testing the tracer preemptoff, I hit this strange trace:
      
         <...>-259     0...1    0us : schedule <-worker_thread
         <...>-259     0d..1    0us : rcu_note_context_switch <-__schedule
         <...>-259     0d..1    0us : rcu_sched_qs <-rcu_note_context_switch
         <...>-259     0d..1    0us : rcu_preempt_qs <-rcu_note_context_switch
         <...>-259     0d..1    0us : _raw_spin_lock <-__schedule
         <...>-259     0d..1    0us : preempt_count_add <-_raw_spin_lock
         <...>-259     0d..2    0us : do_raw_spin_lock <-_raw_spin_lock
         <...>-259     0d..2    1us : deactivate_task <-__schedule
         <...>-259     0d..2    1us : update_rq_clock.part.84 <-deactivate_task
         <...>-259     0d..2    1us : dequeue_task_fair <-deactivate_task
         <...>-259     0d..2    1us : dequeue_entity <-dequeue_task_fair
         <...>-259     0d..2    1us : update_curr <-dequeue_entity
         <...>-259     0d..2    1us : update_min_vruntime <-update_curr
         <...>-259     0d..2    1us : cpuacct_charge <-update_curr
         <...>-259     0d..2    1us : __rcu_read_lock <-cpuacct_charge
         <...>-259     0d..2    1us : __rcu_read_unlock <-cpuacct_charge
         <...>-259     0d..2    1us : clear_buddies <-dequeue_entity
         <...>-259     0d..2    1us : account_entity_dequeue <-dequeue_entity
         <...>-259     0d..2    2us : update_min_vruntime <-dequeue_entity
         <...>-259     0d..2    2us : update_cfs_shares <-dequeue_entity
         <...>-259     0d..2    2us : hrtick_update <-dequeue_task_fair
         <...>-259     0d..2    2us : wq_worker_sleeping <-__schedule
         <...>-259     0d..2    2us : kthread_data <-wq_worker_sleeping
         <...>-259     0d..2    2us : pick_next_task_fair <-__schedule
         <...>-259     0d..2    2us : check_cfs_rq_runtime <-pick_next_task_fair
         <...>-259     0d..2    2us : pick_next_entity <-pick_next_task_fair
         <...>-259     0d..2    2us : clear_buddies <-pick_next_entity
         <...>-259     0d..2    2us : pick_next_entity <-pick_next_task_fair
         <...>-259     0d..2    2us : clear_buddies <-pick_next_entity
         <...>-259     0d..2    2us : set_next_entity <-pick_next_task_fair
         <...>-259     0d..2    3us : put_prev_entity <-pick_next_task_fair
         <...>-259     0d..2    3us : check_cfs_rq_runtime <-put_prev_entity
         <...>-259     0d..2    3us : set_next_entity <-pick_next_task_fair
      gnome-sh-1031    0d..2    3us : finish_task_switch <-__schedule
      gnome-sh-1031    0d..2    3us : _raw_spin_unlock_irq <-finish_task_switch
      gnome-sh-1031    0d..2    3us : do_raw_spin_unlock <-_raw_spin_unlock_irq
      gnome-sh-1031    0...2    3us!: preempt_count_sub <-_raw_spin_unlock_irq
      gnome-sh-1031    0...1  582us : do_raw_spin_lock <-_raw_spin_lock
      gnome-sh-1031    0...1  583us : _raw_spin_unlock <-drm_gem_object_lookup
      gnome-sh-1031    0...1  583us : do_raw_spin_unlock <-_raw_spin_unlock
      gnome-sh-1031    0...1  583us : preempt_count_sub <-_raw_spin_unlock
      gnome-sh-1031    0...1  584us : _raw_spin_unlock <-drm_gem_object_lookup
      gnome-sh-1031    0...1  584us+: trace_preempt_on <-drm_gem_object_lookup
      gnome-sh-1031    0...1  603us : <stack trace>
       => preempt_count_sub
       => _raw_spin_unlock
       => drm_gem_object_lookup
       => i915_gem_madvise_ioctl
       => drm_ioctl
       => do_vfs_ioctl
       => SyS_ioctl
       => entry_SYSCALL_64_fastpath
      
      As I'm tracing preemption disabled, it seemed incorrect that the trace
      would go across a schedule and report not being in the scheduler.
      Looking into this I discovered the problem.
      
      schedule() calls preempt_disable() but the preempt_schedule() calls
      preempt_enable_notrace(). What happened above was that the gnome-shell
      task was preempted on another CPU and migrated over to the idle CPU. The
      tracer started with idle calling schedule(), which called
      preempt_disable(), but then gnome-shell finished, and it enabled
      preemption with preempt_enable_notrace(), which does not stop the trace
      even though preemption was enabled.
      
      The purpose of the preempt_disable_notrace() in preempt_schedule()
      is to prevent function tracing from going into an infinite loop,
      because function tracing would otherwise trace the
      preempt_enable()/preempt_disable() calls themselves. The problem with
      function tracing is:
      
        NEED_RESCHED set
        preempt_schedule()
          preempt_disable()
            preempt_count_inc()
              function trace (before incrementing preempt count)
                preempt_disable_notrace()
                preempt_enable_notrace()
                  sees NEED_RESCHED set
                     preempt_schedule() (repeat)
      
      Now by breaking out the preempt off/on tracing into their own code:
      preempt_disable_check() and preempt_enable_check(), we can add these to
      the preempt_schedule() code. As preemption would then be disabled, even
      if they were to be traced by the function tracer, the disabled
      preemption would prevent the recursion.
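The fixed ordering can be modeled in user space (toy_ names invented): because the preempt count is raised before the tracer can fire, the tracer's re-entry check fails and the recursion is bounded.

```c
#include <assert.h>

static int preempt_count;
static int need_resched = 1;
static int depth, max_depth;

static void toy_preempt_schedule(void);

/* Toy function tracer: on its way out it would preempt again if
 * NEED_RESCHED is set and preemption appears enabled. */
static void toy_function_trace(void)
{
	if (need_resched && !preempt_count)
		toy_preempt_schedule();	/* the recursion the text describes */
}

static void toy_preempt_schedule(void)
{
	if (++depth > max_depth)
		max_depth = depth;
	preempt_count++;		/* disable preemption FIRST */
	toy_function_trace();		/* tracer sees preempt_count != 0 */
	need_resched = 0;		/* stands in for __schedule() */
	preempt_count--;
	depth--;
}
```

Flipping the first two statements of toy_preempt_schedule() would reproduce the unbounded recursion sketched in the changelog.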
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160321112339.6dc78ad6@gandalf.local.home
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  15. 29 March 2016 (1 commit)