1. 11 Mar, 2016 1 commit
  2. 09 Mar, 2016 1 commit
  3. 05 Mar, 2016 1 commit
    • sched/cputime: Fix steal time accounting vs. CPU hotplug · e9532e69
      Thomas Gleixner committed
      On CPU hotplug the steal time accounting can keep a stale rq->prev_steal_time
      value over CPU down and up. So after the CPU comes up again the delta
      calculation in steal_account_process_tick() breaks due to the
      unsigned math:
      
      	 u64 steal = paravirt_steal_clock(smp_processor_id());
      
      	 steal -= this_rq()->prev_steal_time;
      
      So if steal is smaller than rq->prev_steal_time we end up with an insanely large
      value which then gets added to rq->prev_steal_time, permanently corrupting
      the accounting. As a consequence the per-CPU stats in /proc/stat
      become stale.
      
      Nice trick to tell the world how idle the system is (100%) while the CPU is
      100% busy running tasks. Though we prefer realistic numbers.
      
      None of the accounting values which use a previous value to account for
      fractions is reset at CPU hotplug time. update_rq_clock_task() has a sanity
      check for prev_irq_time and prev_steal_time_rq, but that sanity check solely
      deals with clock warps and limits the /proc/stat visible wreckage. The
      prev_time values are still wrong.
      
      Solution is simple: Reset rq->prev_*_time when the CPU is plugged in again.
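
To illustrate the wrap-around described above, here is a minimal user-space sketch in C. The `demo_rq` structure and the `demo_*` helpers are hypothetical stand-ins, not kernel code, and the sketch models only the subtraction and the reset; the rest of steal_account_process_tick() is left out.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-runqueue state; not the kernel's struct rq. */
struct demo_rq {
	uint64_t prev_steal_time;
};

/* Same shape as the two quoted lines: an unsigned subtraction that can wrap. */
static uint64_t demo_steal_delta(const struct demo_rq *rq, uint64_t steal_clock)
{
	return steal_clock - rq->prev_steal_time;
}

/* The fix described above: drop the stale snapshot when the CPU is plugged in. */
static void demo_cpu_online(struct demo_rq *rq)
{
	rq->prev_steal_time = 0;
}
```

With a stale snapshot of 1000 and a steal clock reading of 400, demo_steal_delta() returns 2^64 - 600; after demo_cpu_online() the same reading yields 400.
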
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Fixes: commit 095c0aa8 "sched: adjust scheduler cpu power for stolen time"
      Fixes: commit aa483808 "sched: Remove irq time from available CPU power"
      Fixes: commit e6e6685a "KVM guest: Steal time accounting"
      Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603041539490.3686@nanos
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 02 Mar, 2016 2 commits
    • sched: Migrate sched to use new tick dependency mask model · 76d92ac3
      Frederic Weisbecker committed
      Instead of providing asynchronous checks for the nohz subsystem to verify
      sched tick dependency, migrate sched to the new mask.
      
      Every time a task is enqueued or dequeued, we evaluate the state of the
      tick dependency on top of the policy of the tasks in the runqueue, by
      order of priority:
      
      SCHED_DEADLINE: Need the tick in order to periodically check for runtime
      SCHED_FIFO    : Don't need the tick (no round-robin)
      SCHED_RR      : Need the tick if more than 1 task of the same priority
                      for round robin (simplified with checking if more than
                      one SCHED_RR task no matter what priority).
      SCHED_NORMAL  : Need the tick if more than 1 task for round-robin.
      
      We could optimize that further with one flag per sched policy on the tick
      dependency mask and perform only the checks relevant to the policy
      concerned by an enqueue/dequeue operation.
      
      Since the checks aren't based on the current task anymore, we could get
      rid of the task switch hook but it's still needed for posix cpu
      timers.
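
The per-policy rules listed above can be sketched as a single predicate in C. This is a simplified user-space illustration, not the kernel function: the `demo_rq_counts` structure, its field names and `demo_can_stop_tick()` are assumptions made for the example, and checking SCHED_RR before SCHED_FIFO is a conservative choice of this sketch.

```c
#include <stdbool.h>

/* Hypothetical per-runqueue counters; the names are illustrative only. */
struct demo_rq_counts {
	unsigned int dl_nr_running;	/* SCHED_DEADLINE tasks */
	unsigned int rt_nr_running;	/* SCHED_FIFO + SCHED_RR tasks */
	unsigned int rr_nr_running;	/* SCHED_RR tasks only */
	unsigned int nr_running;	/* all runnable tasks on the runqueue */
};

/* Returns true when, by the rules above, nothing on the runqueue needs the tick. */
static bool demo_can_stop_tick(const struct demo_rq_counts *c)
{
	/* SCHED_DEADLINE needs the tick to check runtime periodically. */
	if (c->dl_nr_running)
		return false;

	/* More than one SCHED_RR task, whatever the priority: round-robin. */
	if (c->rr_nr_running)
		return c->rr_nr_running == 1;

	/* Only SCHED_FIFO left in the RT class: no round-robin, no tick. */
	if (c->rt_nr_running - c->rr_nr_running)
		return true;

	/* SCHED_NORMAL: the tick is needed once tasks have to share the CPU. */
	return c->nr_running <= 1;
}
```

Such a predicate would be re-evaluated on every enqueue and dequeue, which is why it only looks at counters and never at the current task.
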
      Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    • sched: Account rr tasks · 01d36d0a
      Frederic Weisbecker committed
      In order to evaluate the scheduler tick dependency without probing
      context switches, we need to know how many SCHED_RR and SCHED_FIFO tasks
      are enqueued as those policies don't have the same preemption
      requirements.
      
      To prepare for that, let's account SCHED_RR tasks; we'll be able to
      deduce SCHED_FIFO tasks as well from it and the total RT tasks in the
      runqueue.
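
The accounting idea can be sketched in a few lines of C: count every RT task, count SCHED_RR tasks separately, and derive the SCHED_FIFO count by subtraction. The `demo_*` names and the policy enum below are hypothetical and only serve the example; they are not the kernel's definitions.

```c
/* Illustrative policy tags; not the kernel's SCHED_* constants. */
enum demo_policy { DEMO_SCHED_FIFO, DEMO_SCHED_RR };

/* Hypothetical subset of an RT runqueue. */
struct demo_rt_rq {
	unsigned int rt_nr_running;	/* all RT tasks: FIFO + RR */
	unsigned int rr_nr_running;	/* SCHED_RR tasks only */
};

static void demo_inc_rt_tasks(struct demo_rt_rq *rt_rq, enum demo_policy policy)
{
	rt_rq->rt_nr_running++;
	if (policy == DEMO_SCHED_RR)
		rt_rq->rr_nr_running++;
}

static void demo_dec_rt_tasks(struct demo_rt_rq *rt_rq, enum demo_policy policy)
{
	rt_rq->rt_nr_running--;
	if (policy == DEMO_SCHED_RR)
		rt_rq->rr_nr_running--;
}

/* SCHED_FIFO needs no counter of its own: it is the remainder. */
static unsigned int demo_fifo_nr_running(const struct demo_rt_rq *rt_rq)
{
	return rt_rq->rt_nr_running - rt_rq->rr_nr_running;
}
```

Keeping one extra counter instead of two keeps the enqueue and dequeue paths cheap while still answering both questions.
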
      Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
  5. 29 Feb, 2016 4 commits
    • sched/debug: Move sched_domain_sysctl to debug.c · 3866e845
      Steven Rostedt (Red Hat) committed
      The sched_domain_sysctl setup is only enabled when SCHED_DEBUG is
      configured. As debug.c is only compiled when SCHED_DEBUG is configured as
      well, move the setup of sched_domain_sysctl into that file.
      
      Note, the (un)register_sched_domain_sysctl() functions had to be changed
      from static to allow access to them from core.c.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160222212825.599278093@goodmis.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/rt: Fix PI handling vs. sched_setscheduler() · ff77e468
      Peter Zijlstra committed
      Andrea Parri reported:
      
      > I found that the following scenario (with CONFIG_RT_GROUP_SCHED=y) is not
      > handled correctly:
      >
      >     T1 (prio = 20)
      >        lock(rtmutex);
      >
      >     T2 (prio = 20)
      >        blocks on rtmutex  (rt_nr_boosted = 0 on T1's rq)
      >
      >     T1 (prio = 20)
      >        sys_set_scheduler(prio = 0)
      >           [new_effective_prio == oldprio]
      >           T1 prio = 20    (rt_nr_boosted = 0 on T1's rq)
      >
      > The last step is incorrect as T1 is now boosted (c.f., rt_se_boosted());
      > in particular, if we continue with
      >
      >    T1 (prio = 20)
      >       unlock(rtmutex)
      >          wakeup(T2)
      >          adjust_prio(T1)
      >             [prio != rt_mutex_getprio(T1)]
      >             dequeue(T1)
      >                rt_nr_boosted = (unsigned long)(-1)
      >                ...
      >             T1 prio = 0
      >
      > then we end up leaving rt_nr_boosted in an "inconsistent" state.
      >
      > The simple program attached could reproduce the previous scenario; note
      > that, as a consequence of the presence of this state, the "assertion"
      >
      >     WARN_ON(!rt_nr_running && rt_nr_boosted)
      >
      > from dec_rt_group() may trigger.
      
      So normally we dequeue/enqueue tasks in sched_setscheduler(), which
      would ensure the accounting stays correct. However in the early PI path
      we fail to do so.
      
      So this was introduced at around v3.14, by:
      
        c365c292 ("sched: Consider pi boosting in setscheduler()")
      
      which fixed another problem exactly because of that dequeue/enqueue, joy.
      
      Fix this by teaching rt about DEQUEUE_SAVE/ENQUEUE_RESTORE and have it
      preserve runqueue location with that option. This requires decoupling
      the on_rt_rq() state from being on the list.
      
      In order to allow for explicit movement during the SAVE/RESTORE,
      introduce {DE,EN}QUEUE_MOVE. We still must use SAVE/RESTORE in these
      cases to preserve other invariants.
      
      Respecting the SAVE/RESTORE flags also has the (nice) side-effect that
      things like sys_nice()/sys_sched_setaffinity() also do not reorder
      FIFO tasks (whereas they used to before this patch).
      Reported-by: Andrea Parri <parri.andrea@gmail.com>
      Tested-by: Andrea Parri <parri.andrea@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Remove duplicated sched_group_set_shares() prototype · 41d93397
      Dongsheng Yang committed
      Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <lizefan@huawei.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1452674558-31897-1-git-send-email-yangds.fnst@cn.fujitsu.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/cgroup: Fix cgroup entity load tracking tear-down · 6fe1f348
      Peter Zijlstra committed
      When a cgroup's CPU runqueue is destroyed, it should remove its
      remaining load accounting from its parent cgroup.
      
      The current site for doing so is unsuited because it's far too late and
      unordered against other cgroup removal (->css_free() will be, but we're also
      in an RCU callback).
      
      Put it in the ->css_offline() callback, which is the start of cgroup
      destruction, right after the group has been made unavailable to
      userspace. The ->css_offline() callbacks are called in hierarchical order
      after the following v4.4 commit:
      
        aa226ff4 ("cgroup: make sure a parent css isn't offlined before its children")
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160121212416.GL6357@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  6. 09 Feb, 2016 1 commit
    • sched/debug: Make schedstats a runtime tunable that is disabled by default · cb251765
      Mel Gorman committed
      schedstats is very useful during debugging and performance tuning but it
      incurs overhead to calculate the stats. As such, even though it can be
      disabled at build time, it is often enabled as the information is useful.
      
      This patch adds a kernel command-line and sysctl tunable to enable or
      disable schedstats on demand (when it's built in). It is disabled
      by default as someone who knows they need it can also learn to enable
      it when necessary.
      
      The benefits are dependent on how scheduler-intensive the workload is.
      If it is then the patch reduces the number of cycles spent calculating
      the stats with a small benefit from reducing the cache footprint of the
      scheduler.
      
      These measurements were taken from a 48-core 2-socket
      machine with Xeon(R) E5-2670 v3 CPUs, although they were also tested on a
      single-socket 8-core machine with an Intel i7-3770 processor.
      
      netperf-tcp
                                 4.5.0-rc1             4.5.0-rc1
                                   vanilla          nostats-v3r1
      Hmean    64         560.45 (  0.00%)      575.98 (  2.77%)
      Hmean    128        766.66 (  0.00%)      795.79 (  3.80%)
      Hmean    256        950.51 (  0.00%)      981.50 (  3.26%)
      Hmean    1024      1433.25 (  0.00%)     1466.51 (  2.32%)
      Hmean    2048      2810.54 (  0.00%)     2879.75 (  2.46%)
      Hmean    3312      4618.18 (  0.00%)     4682.09 (  1.38%)
      Hmean    4096      5306.42 (  0.00%)     5346.39 (  0.75%)
      Hmean    8192     10581.44 (  0.00%)    10698.15 (  1.10%)
      Hmean    16384    18857.70 (  0.00%)    18937.61 (  0.42%)
      
      Small gains here; UDP_STREAM showed nothing interesting and neither did
      the TCP_RR tests. The gains on the 8-core machine were very similar.
      
      tbench4
                                       4.5.0-rc1             4.5.0-rc1
                                         vanilla          nostats-v3r1
      Hmean    mb/sec-1         500.85 (  0.00%)      522.43 (  4.31%)
      Hmean    mb/sec-2         984.66 (  0.00%)     1018.19 (  3.41%)
      Hmean    mb/sec-4        1827.91 (  0.00%)     1847.78 (  1.09%)
      Hmean    mb/sec-8        3561.36 (  0.00%)     3611.28 (  1.40%)
      Hmean    mb/sec-16       5824.52 (  0.00%)     5929.03 (  1.79%)
      Hmean    mb/sec-32      10943.10 (  0.00%)    10802.83 ( -1.28%)
      Hmean    mb/sec-64      15950.81 (  0.00%)    16211.31 (  1.63%)
      Hmean    mb/sec-128     15302.17 (  0.00%)    15445.11 (  0.93%)
      Hmean    mb/sec-256     14866.18 (  0.00%)    15088.73 (  1.50%)
      Hmean    mb/sec-512     15223.31 (  0.00%)    15373.69 (  0.99%)
      Hmean    mb/sec-1024    14574.25 (  0.00%)    14598.02 (  0.16%)
      Hmean    mb/sec-2048    13569.02 (  0.00%)    13733.86 (  1.21%)
      Hmean    mb/sec-3072    12865.98 (  0.00%)    13209.23 (  2.67%)
      
      Small gains of 2-4% at low thread counts and otherwise flat.  The
      gains on the 8-core machine were slightly different.
      
      tbench4 on 8-core i7-3770 single socket machine
      Hmean    mb/sec-1        442.59 (  0.00%)      448.73 (  1.39%)
      Hmean    mb/sec-2        796.68 (  0.00%)      794.39 ( -0.29%)
      Hmean    mb/sec-4       1322.52 (  0.00%)     1343.66 (  1.60%)
      Hmean    mb/sec-8       2611.65 (  0.00%)     2694.86 (  3.19%)
      Hmean    mb/sec-16      2537.07 (  0.00%)     2609.34 (  2.85%)
      Hmean    mb/sec-32      2506.02 (  0.00%)     2578.18 (  2.88%)
      Hmean    mb/sec-64      2511.06 (  0.00%)     2569.16 (  2.31%)
      Hmean    mb/sec-128     2313.38 (  0.00%)     2395.50 (  3.55%)
      Hmean    mb/sec-256     2110.04 (  0.00%)     2177.45 (  3.19%)
      Hmean    mb/sec-512     2072.51 (  0.00%)     2053.97 ( -0.89%)
      
      In contrast, this shows a relatively steady 2-3% gain at higher thread
      counts. Due to the nature of the patch and the type of workload, it's
      not a surprise that the result will depend on the CPU used.
      
      hackbench-pipes
                               4.5.0-rc1             4.5.0-rc1
                                 vanilla          nostats-v3r1
      Amean    1        0.0637 (  0.00%)      0.0660 ( -3.59%)
      Amean    4        0.1229 (  0.00%)      0.1181 (  3.84%)
      Amean    7        0.1921 (  0.00%)      0.1911 (  0.52%)
      Amean    12       0.3117 (  0.00%)      0.2923 (  6.23%)
      Amean    21       0.4050 (  0.00%)      0.3899 (  3.74%)
      Amean    30       0.4586 (  0.00%)      0.4433 (  3.33%)
      Amean    48       0.5910 (  0.00%)      0.5694 (  3.65%)
      Amean    79       0.8663 (  0.00%)      0.8626 (  0.43%)
      Amean    110      1.1543 (  0.00%)      1.1517 (  0.22%)
      Amean    141      1.4457 (  0.00%)      1.4290 (  1.16%)
      Amean    172      1.7090 (  0.00%)      1.6924 (  0.97%)
      Amean    192      1.9126 (  0.00%)      1.9089 (  0.19%)
      
      Some small gains and losses and, while the variance data is not included,
      it's close to the noise. The UMA machine did not show anything particularly
      different.
      
      pipetest
                                   4.5.0-rc1             4.5.0-rc1
                                     vanilla          nostats-v2r2
      Min         Time        4.13 (  0.00%)        3.99 (  3.39%)
      1st-qrtle   Time        4.38 (  0.00%)        4.27 (  2.51%)
      2nd-qrtle   Time        4.46 (  0.00%)        4.39 (  1.57%)
      3rd-qrtle   Time        4.56 (  0.00%)        4.51 (  1.10%)
      Max-90%     Time        4.67 (  0.00%)        4.60 (  1.50%)
      Max-93%     Time        4.71 (  0.00%)        4.65 (  1.27%)
      Max-95%     Time        4.74 (  0.00%)        4.71 (  0.63%)
      Max-99%     Time        4.88 (  0.00%)        4.79 (  1.84%)
      Max         Time        4.93 (  0.00%)        4.83 (  2.03%)
      Mean        Time        4.48 (  0.00%)        4.39 (  1.91%)
      Best99%Mean Time        4.47 (  0.00%)        4.39 (  1.91%)
      Best95%Mean Time        4.46 (  0.00%)        4.38 (  1.93%)
      Best90%Mean Time        4.45 (  0.00%)        4.36 (  1.98%)
      Best50%Mean Time        4.36 (  0.00%)        4.25 (  2.49%)
      Best10%Mean Time        4.23 (  0.00%)        4.10 (  3.13%)
      Best5%Mean  Time        4.19 (  0.00%)        4.06 (  3.20%)
      Best1%Mean  Time        4.13 (  0.00%)        4.00 (  3.39%)
      
      Small improvement and similar gains were seen on the UMA machine.
      
      The gain is small but it stands to reason that doing less work in the
      scheduler is a good thing. The downside is that the lack of schedstats and
      tracepoints may be surprising to experts doing performance analysis until
      they find the existence of the schedstats= parameter or schedstats sysctl.
      It will be automatically activated for latencytop and sleep profiling to
      alleviate the problem. For tracepoints, there is a simple warning as it's
      not safe to activate schedstats in the context when it's known the tracepoint
      may be wanted but is unavailable.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <mgalbraith@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1454663316-22048-1-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. 04 Dec, 2015 5 commits
    • sched/fair: Move the cache-hot 'load_avg' variable into its own cacheline · b0367629
      Waiman Long committed
      If a system with large number of sockets was driven to full
      utilization, it was found that the clock tick handling occupied a
      rather significant proportion of CPU time when fair group scheduling
      and autogroup were enabled.
      
      Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
      profile looked like:
      
        10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
         9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
         8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
         8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
         8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
         6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
         5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares
      
      In particular, the high CPU time consumed by update_cfs_shares()
      was mostly due to contention on the cacheline that contained the
      task_group's load_avg statistical counter. This cacheline may also
      contain variables like shares, cfs_rq & se which are accessed rather
      frequently during clock tick processing.
      
      This patch moves the load_avg variable into another cacheline
      separated from the other frequently accessed variables. It also
      creates a cacheline aligned kmemcache for task_group to make sure
      that all the allocated task_group's are cacheline aligned.
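
The layout change can be pictured with a small user-space sketch in C11. The structure below is a hypothetical miniature, not the kernel's struct task_group, and the 64-byte line size is an assumption of the example; in user space, objects of this type would also have to come from an aligned allocator (for instance aligned_alloc()) for the alignment to hold on the heap.

```c
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_CACHELINE 64	/* assumed cacheline size for this sketch */

/* Hypothetical miniature of a task group: hot read-mostly fields first. */
struct demo_task_group {
	unsigned long shares;
	void *se;
	void *cfs_rq;
	/* The contended counter starts on a cacheline of its own. */
	alignas(DEMO_CACHELINE) long load_avg;
};

/* True when two byte offsets within the structure share a cacheline. */
static bool demo_same_cacheline(size_t off_a, size_t off_b)
{
	return off_a / DEMO_CACHELINE == off_b / DEMO_CACHELINE;
}
```

Because the member alignment also raises the alignment of the whole structure, its size is padded to a multiple of the line size, so writers of load_avg no longer invalidate the line holding shares, se and cfs_rq.
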
      
      By doing so, the perf profile became:
      
         9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
         8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
         7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
         7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
         7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
         5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
         4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares
      
      The %cpu time is still pretty high, but it is better than before. The
      benchmark results before and after the patch were as follows:
      
        Before patch - Max-jOPs: 907533    Critical-jOps: 134877
        After patch  - Max-jOPs: 916011    Critical-jOps: 142366
      Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Douglas Hatch <doug.hatch@hpe.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott J Norton <scott.norton@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Link: http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-Waiman.Long@hpe.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Move the sched_to_prio[] arrays out of line · ed82b8a1
      Andi Kleen committed
      When building a kernel with a gcc 6 snapshot the compiler complains
      about unused const static variables for prio_to_weight and prio_to_mult
      for multiple scheduler files (all but core.c and autogroup.c).
      
      The way the array is currently declared it will be duplicated in
      every scheduler file that includes sched.h, which seems rather wasteful.
      
      Move the array out of line into core.c. I also added a sched_ prefix
      to avoid any potential name space collisions.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1448859583-3252-1-git-send-email-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Make it possible to account fair load avg consistently · ad936d86
      Byungchul Park committed
      The current code accounts for the time a task was absent from the fair
      class (per ATTACH_AGE_LOAD). However it does not work correctly when a
      task got migrated or moved to another cgroup while outside of the fair
      class.
      
      This patch tries to address that by aging on migration. We locklessly
      read the 'last_update_time' stamp from both the old and new cfs_rq,
      age the load up to the old time, and set it to the new time.
      
      These timestamps should in general not be more than 1 tick apart from
      one another, so there is a definite bound on things.
      Signed-off-by: Byungchul Park <byungchul.park@lge.com>
      [ Changelog, a few edits and !SMP build fix ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1445616981-29904-2-git-send-email-byungchul.park@lge.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking, sched: Introduce smp_cond_acquire() and use it · b3e0b1b6
      Peter Zijlstra committed
      Introduce smp_cond_acquire() which combines a control dependency and a
      read barrier to form acquire semantics.
      
      This primitive has two benefits:
      
       - it documents control dependencies,
       - it's typically cheaper than using smp_load_acquire() in a loop.
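
As a rough user-space analogue of the idea, the sketch below spins with relaxed loads and issues one acquire fence once the condition holds, instead of performing an acquire load on every iteration. It uses C11 atomics rather than kernel primitives; C11 cannot express a control dependency, so the fence stands in for the "control dependency plus read barrier" pairing, and all `demo_*` names are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical analogue: wait until *flag reads 0, then order later reads. */
static inline void demo_cond_acquire(atomic_int *flag)
{
	while (atomic_load_explicit(flag, memory_order_relaxed) != 0)
		;	/* spin using cheap relaxed loads */

	/* One fence after the loop upgrades the final load to an acquire. */
	atomic_thread_fence(memory_order_acquire);
}

static int demo_payload;		/* data handed over by the publisher */
static atomic_int demo_busy = 1;	/* non-zero while the payload is not ready */

/* Publisher side: write the data, then release the flag. */
static void demo_publish(int value)
{
	demo_payload = value;
	atomic_store_explicit(&demo_busy, 0, memory_order_release);
}

/* Consumer side: wait for the flag, after which the payload is visible. */
static int demo_consume(void)
{
	demo_cond_acquire(&demo_busy);
	return demo_payload;
}
```

The release store in demo_publish() pairs with the fence in demo_cond_acquire(), so a consumer that sees the flag cleared also sees the payload written before it.
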
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Better document the try_to_wake_up() barriers · b75a2253
      Peter Zijlstra committed
      Explain how the control dependency and smp_rmb() end up providing
      ACQUIRE semantics and pair with smp_store_release() in
      finish_lock_switch().
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 23 Nov, 2015 1 commit
  9. 06 Oct, 2015 3 commits
    • sched/core: Remove a parameter in the migrate_task_rq() function · 5a4fd036
      xiaofeng.yan committed
      The parameter "int next_cpu" in the following function is unused:
      
        migrate_task_rq(struct task_struct *p, int next_cpu)
      
      Remove it.
      Signed-off-by: xiaofeng.yan <yanxiaofeng@inspur.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1442991360-31945-1-git-send-email-yanxiaofeng@inspur.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Fix task and run queue sched_info::run_delay inconsistencies · 1de64443
      Peter Zijlstra committed
      Mike Meyer reported the following bug:
      
      > During evaluation of some performance data, it was discovered thread
      > and run queue run_delay accounting data was inconsistent with the other
      > accounting data that was collected.  Further investigation found under
      > certain circumstances execution time was leaking into the task and
      > run queue accounting of run_delay.
      >
      > Consider the following sequence:
      >
      >     a. thread is running.
      >     b. thread moves between cgroups, changes scheduling class or priority.
      >     c. thread sleeps OR
      >     d. thread involuntarily gives up cpu.
      >
      > a. implies:
      >
      >     thread->sched_info.last_queued = 0
      >
      > a. and b. results in the following:
      >
      >     1. dequeue_task(rq, thread)
      >
      >            sched_info_dequeued(rq, thread)
      >                delta = 0
      >
      >                sched_info_reset_dequeued(thread)
      >                    thread->sched_info.last_queued = 0
      >
      >                thread->sched_info.run_delay += delta
      >
      >     2. enqueue_task(rq, thread)
      >
      >            sched_info_queued(rq, thread)
      >
      >                /* thread is still on cpu at this point. */
      >                thread->sched_info.last_queued = task_rq(thread)->clock;
      >
      > c. results in:
      >
      >     dequeue_task(rq, thread)
      >
      >         sched_info_dequeued(rq, thread)
      >
      >             /* delta is execution time not run_delay. */
      >             delta = task_rq(thread)->clock - thread->sched_info.last_queued
      >
      >         sched_info_reset_dequeued(thread)
      >             thread->sched_info.last_queued = 0
      >
      >         thread->sched_info.run_delay += delta
      >
      >     Since thread was running between enqueue_task(rq, thread) and
      >     dequeue_task(rq, thread), the delta above is really execution
      >     time and not run_delay.
      >
      > d. results in:
      >
      >     __sched_info_switch(thread, next_thread)
      >
      >         sched_info_depart(rq, thread)
      >
      >             sched_info_queued(rq, thread)
      >
      >                 /* last_queued not updated due to being non-zero */
      >                 return
      >
      >     Since thread was running between enqueue_task(rq, thread) and
      >     __sched_info_switch(thread, next_thread), the execution time
      >     between enqueue_task(rq, thread) and
      >     __sched_info_switch(thread, next_thread) now will become
      >     associated with run_delay due to when last_queued was last updated.
      >
      
      This alternative patch solves the problem by not calling
      sched_info_{de,}queued() in {de,en}queue_task(). Therefore the
      sched_info state is preserved and things work as expected.
      
      By inlining the {de,en}queue_task() functions the new condition
      becomes (mostly) a compile-time constant and we'll not emit any new
      branch instructions.
      
      It even shrinks the code (due to inlining {en,de}queue_task()):
      
      $ size defconfig-build/kernel/sched/core.o defconfig-build/kernel/sched/core.o.orig
         text    data     bss     dec     hex filename
        64019   23378    2344   89741   15e8d defconfig-build/kernel/sched/core.o
        64149   23378    2344   89871   15f0f defconfig-build/kernel/sched/core.o.orig
      Reported-by: Mike Meyer <Mike.Meyer@Teradata.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/20150930154413.GO3604@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Fix TASK_DEAD race in finish_task_switch() · 95913d97
      Peter Zijlstra committed
      So the problem this patch is trying to address is as follows:
      
              CPU0                            CPU1
      
              context_switch(A, B)
                                              ttwu(A)
                                                LOCK A->pi_lock
                                                A->on_cpu == 0
              finish_task_switch(A)
                prev_state = A->state  <-.
                WMB                      |
                A->on_cpu = 0;           |
                UNLOCK rq0->lock         |
                                         |    context_switch(C, A)
                                         `--  A->state = TASK_DEAD
                prev_state == TASK_DEAD
                  put_task_struct(A)
                                              context_switch(A, C)
                                              finish_task_switch(A)
                                                A->state == TASK_DEAD
                                                  put_task_struct(A)
      
      The argument being that the WMB will allow the load of A->state on CPU0
      to cross over and observe CPU1's store of A->state, which will then
      result in a double-drop and use-after-free.
      
      Now the comment states (and this was true once upon a long time ago)
      that we need to observe A->state while holding rq->lock because that
      will order us against the wakeup; however the wakeup will not in fact
      acquire (that) rq->lock; it takes A->pi_lock these days.
      
      We can obviously fix this by upgrading the WMB to an MB, but that is
      expensive, so we'd rather avoid that.
      
      The alternative this patch takes is: smp_store_release(&A->on_cpu, 0),
      which avoids the MB on some archs, but not important ones like ARM.
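
The ordering argument above can be sketched in userspace with C11 atomics. This is only a rough analogue, not the kernel code: `struct task_model`, `MODEL_TASK_DEAD` and both function names are hypothetical, `atomic_store_explicit(..., memory_order_release)` stands in for smp_store_release(), and the acquire load on the wakeup side is an assumption about how a waiter would pair with it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two task fields involved in the race. */
struct task_model {
    _Atomic long state;
    atomic_int   on_cpu;
};

#define MODEL_TASK_DEAD 64L   /* arbitrary illustrative value */

/*
 * Switch-out side: load prev->state first, then publish on_cpu = 0 with
 * release semantics. A release store keeps every earlier load and store
 * of this thread ordered before it, so the state load cannot be delayed
 * past the point where another CPU may run the task again.
 */
static bool model_finish_task_switch(struct task_model *prev)
{
    long prev_state;

    /* Precondition of the model: the outgoing task is still on a CPU. */
    assert(atomic_load_explicit(&prev->on_cpu, memory_order_relaxed) != 0);

    prev_state = atomic_load_explicit(&prev->state, memory_order_relaxed);
    atomic_store_explicit(&prev->on_cpu, 0, memory_order_release);

    /* true means the caller would drop the task reference exactly once. */
    return prev_state == MODEL_TASK_DEAD;
}

/* Wakeup side (assumed pairing): wait until the task has left its CPU. */
static void model_wait_on_cpu(struct task_model *p)
{
    while (atomic_load_explicit(&p->on_cpu, memory_order_acquire) != 0)
        ;   /* spin */
}
```

With a plain store preceded only by a write barrier, nothing would prevent the earlier load of the state from being satisfied after the store to on_cpu became visible, which is the window the diagram shows.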
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@vger.kernel.org> # v3.1+
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Cc: manfred@colorfullife.com
      Cc: will.deacon@arm.com
      Fixes: e4a52bcb ("sched: Remove rq->lock from the first half of ttwu()")
      Link: http://lkml.kernel.org/r/20150929124509.GG3816@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  10. 23 Sep, 2015 1 commit
  11. 18 Sep, 2015 1 commit
  12. 13 Sep, 2015 6 commits
  13. 12 Aug, 2015 1 commit
  14. 04 Aug, 2015 1 commit
  15. 03 Aug, 2015 4 commits
    • Y
      sched/fair: Provide runnable_load_avg back to cfs_rq · 13962234
      Yuyang Du committed
      The cfs_rq's load_avg is composed of runnable_load_avg and blocked_load_avg.
      Before this series, runnable_load_avg was used in some places and load_avg
      in others. Completely replacing all uses of runnable_load_avg with load_avg
      may be too big a leap; in particular, there is concern that blocked_load_avg
      results in overrated load. Therefore, we bring runnable_load_avg back.
      
      The new cfs_rq's runnable_load_avg is updated with all of the runnable
      sched_entities at the same time, which solves the problem of one
      sched_entity being updated while the others go stale.
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1436918682-4971-7-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • Y
      sched/fair: Init cfs_rq's sched_entity load average · 540247fb
      Yuyang Du committed
      The runnable load and utilization averages of a cfs_rq's sched_entity
      were not initialized. As is done for a task, give a new cfs_rq's
      sched_entity initial values that weight its load heavily in its infancy.
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1436918682-4971-5-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • Y
      sched/fair: Rewrite runnable load and utilization average tracking · 9d89c257
      Yuyang Du committed
      The idea of runnable load average (let runnable time contribute to weight)
      was proposed by Paul Turner and Ben Segall, and it is still followed by
      this rewrite. This rewrite aims to solve the following issues:
      
      1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is
         updated at the granularity of one entity at a time, which leaves the
         cfs_rq's load average stale or partially updated: at any time, only
         one entity is up to date, all other entities are effectively lagging
         behind. This is undesirable.
      
         To illustrate, if we have n runnable entities in the cfs_rq, as time
         elapses, they certainly become outdated:
      
           t0: cfs_rq { e1_old, e2_old, ..., en_old }
      
         and when we update:
      
           t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old }
      
           t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old }
      
           ...
      
         We solve this by combining all runnable entities' load averages together
         in cfs_rq's avg, and update the cfs_rq's avg as a whole. This is based
         on the fact that if we regard the update as a function, then:
      
         w * update(e) = update(w * e) and
      
         update(e1) + update(e2) = update(e1 + e2), then
      
         w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2)
      
         therefore, by this rewrite, we have an entirely updated cfs_rq at the
         time we update it:
      
           t1: update cfs_rq { e1_new, e2_new, ..., en_new }
      
           t2: update cfs_rq { e1_new, e2_new, ..., en_new }
      
           ...
      
      2. cfs_rq's load average is different between top rq->cfs_rq and other
         task_group's per CPU cfs_rqs in whether or not blocked_load_average
         contributes to the load.
      
         The basic idea behind runnable load average (the same for utilization)
         is that the blocked state is taken into account as opposed to only
         accounting for the currently runnable state. Therefore, the average
         should include both the runnable/running and blocked load averages.
         This rewrite does that.
      
         In addition, we also combine runnable/running and blocked averages
         of all entities into the cfs_rq's average, and update it together at
         once. This is based on the fact that:
      
           update(runnable) + update(blocked) = update(runnable + blocked)
      
         This significantly reduces the code as we don't need to separately
         maintain/update runnable/running load and blocked load.
      
      3. How task_group entities' share is calculated is complex and imprecise.
      
         We reduce the complexity in this rewrite to allow a very simple rule:
         the task_group's load_avg is aggregated from its per CPU cfs_rqs's
         load_avgs. Then group entity's weight is simply proportional to its
         own cfs_rq's load_avg / task_group's load_avg. To illustrate,
      
         if a task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn }, then,
      
         task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg, then
      
         cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share
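
The share rule in point 3 can be written out as a small integer computation. This is a sketch of the formula as stated, not the kernel's implementation: the function name, the flat array of per-CPU averages, and the handling of an all-zero group are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: weight of a group's entity on one CPU, following
 *
 *   share_x = cfs_rq_x_avg / task_group_avg * task_group_share
 *
 * The multiplication is done before the division so that integer math
 * keeps as much precision as possible.
 */
static uint64_t group_entity_share(const uint64_t *cfs_rq_avg, size_t ncpus,
                                   size_t cpu, uint64_t tg_shares)
{
    uint64_t tg_avg = 0;

    assert(cpu < ncpus);

    /* task_group_avg is the sum of the per-CPU cfs_rq averages. */
    for (size_t i = 0; i < ncpus; i++)
        tg_avg += cfs_rq_avg[i];

    /* Assumption for this sketch: a group with no load keeps its full share. */
    if (tg_avg == 0)
        return tg_shares;

    return tg_shares * cfs_rq_avg[cpu] / tg_avg;
}
```

For example, with per-CPU averages of 100, 300 and 600 and a group share of 1024, the three entities get 102, 307 and 614 respectively (rounded down).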
      
      To sum up, this rewrite in principle is equivalent to the current one, but
      fixes the issues described above. Turns out, it significantly reduces the
      code complexity and hence increases clarity and efficiency. In addition,
      the new averages are more smooth/continuous (no spurious spikes and valleys)
      and updated more consistently and quickly to reflect the load dynamics.
      
      As a result, we have less load tracking overhead, better performance,
      and especially better power efficiency due to more balanced load.
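
The linearity that points 1 and 2 rely on can be checked numerically with a simplified model. The sketch below treats the update as pure geometric decay with a factor chosen so that a contribution halves after 32 periods; the accumulation of new contributions is left out, floating point replaces the kernel's fixed-point math, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Approximation of 0.5^(1/32): a contribution halves after 32 periods. */
#define MODEL_DECAY_Y 0.9785720620877

/* update(x): decay a load value over a number of elapsed periods. */
static double model_decay(double load, unsigned int periods)
{
    assert(load >= 0.0);
    while (periods--)
        load *= MODEL_DECAY_Y;
    return load;
}

/* Relative comparison, since the two sides round differently. */
static bool nearly_equal(double a, double b)
{
    double diff = a > b ? a - b : b - a;
    double scale = a > b ? a : b;

    return diff <= 1e-9 * (scale > 1.0 ? scale : 1.0);
}

/* Old scheme: decay every entity separately, then sum the weighted results. */
static double decay_each(const double *w, const double *e, int n,
                         unsigned int periods)
{
    double sum = 0.0;

    for (int i = 0; i < n; i++)
        sum += w[i] * model_decay(e[i], periods);
    return sum;
}

/* New scheme: keep one weighted sum in the cfs_rq and decay it once. */
static double decay_sum(const double *w, const double *e, int n,
                        unsigned int periods)
{
    double sum = 0.0;

    for (int i = 0; i < n; i++)
        sum += w[i] * e[i];
    return model_decay(sum, periods);
}
```

Both functions produce the same value up to rounding, which is why a single per-cfs_rq update can stand in for n per-entity updates.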
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1436918682-4971-3-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • Y
      sched/fair: Remove rq's runnable avg · cd126afe
      Yuyang Du committed
      The current rq->avg is not used at all since its merge into the kernel,
      and the code is in the scheduler's hot path, so remove it.
      Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1436918682-4971-2-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  16. 04 Jul, 2015 2 commits
  17. 19 Jun, 2015 3 commits
  18. 18 May, 2015 1 commit
    • P
      sched,perf: Fix periodic timers · 4cfafd30
      Peter Zijlstra committed
      In the below two commits (see Fixes) we have periodic timers that can
      stop themselves when they're no longer required, but need to be
      (re)-started when their idle condition changes.
      
      A further complication is that we want the timer handler to always do
      the forward such that it will always correctly deal with the overruns,
      and we do not want to race such that the handler has already decided
      to stop, but the (external) restart sees the timer still active and we
      end up with a 'lost' timer.
      
      The problem with the current code is that the re-start can come before
      the callback does the forward, at which point the forward from the
      callback will WARN about forwarding an enqueued timer.
      
      Now, conceptually it's easy to detect if you're before or after the fwd
      by comparing the expiration time against the current time. Of course,
      that's expensive (and racy) because we don't have the current time.
      
      Alternatively one could cache this state inside the timer, but then
      everybody pays the overhead of maintaining this extra state, and that
      is undesired.
      
      The only other option that I could see is the external timer_active
      variable, which I tried to kill before. I would love a nicer interface
      for this seemingly simple 'problem' but alas.
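
The external-state approach can be modelled in a few lines. This is a single-threaded sketch with hypothetical names, not the hrtimer API: the lock that would serialize the handler against the restart path is omitted, and the `enqueued` flag merely stands in for the timer core's own queued state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a periodic timer that can stop itself. */
struct periodic_timer {
    bool     timer_active;  /* external state; lock-protected in real code */
    bool     enqueued;      /* stands in for the timer core's queued state */
    uint64_t expires;
    uint64_t period;
};

/* Move expires past now in whole periods; returns the number of overruns. */
static uint64_t timer_forward(struct periodic_timer *t, uint64_t now)
{
    uint64_t overruns;

    assert(!t->enqueued);   /* forwarding an enqueued timer is the bug */
    if (now < t->expires)
        return 0;
    overruns = (now - t->expires) / t->period + 1;
    t->expires += overruns * t->period;
    return overruns;
}

/* Handler: always forwards, then decides whether the timer stays armed. */
static bool timer_handler(struct periodic_timer *t, uint64_t now, bool idle)
{
    t->enqueued = false;    /* the core dequeues before calling the handler */
    timer_forward(t, now);
    if (idle) {
        t->timer_active = false;  /* restarters must re-arm from now on */
        return false;
    }
    t->enqueued = true;
    return true;
}

/* External (re)start: consults timer_active, never the queued state. */
static void timer_start(struct periodic_timer *t, uint64_t now)
{
    if (t->timer_active)
        return;             /* the handler will re-arm the timer itself */
    t->timer_active = true;
    timer_forward(t, now);
    t->enqueued = true;
}
```

Because the restart path only forwards when `timer_active` is clear, it cannot forward a timer that the handler is about to forward, and a handler that has decided to stop always leaves the flag in a state that lets the next restart re-arm the timer.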
      
      Fixes: 272325c4 ("perf: Fix mux_interval hrtimer wreckage")
      Fixes: 77a4d1a1 ("sched: Cleanup bandwidth timers")
      Cc: pjt@google.com
      Cc: tglx@linutronix.de
      Cc: klamm@yandex-team.ru
      Cc: mingo@kernel.org
      Cc: bsegall@google.com
      Cc: hpa@zytor.com
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20150514102311.GX21418@twins.programming.kicks-ass.net
  19. 08 May, 2015 1 commit