1. 03 May 2018: 1 commit
  2. 05 April 2018: 1 commit
  3. 20 March 2018: 3 commits
    • sched/fair: Update util_est only on util_avg updates · d519329f
      Patrick Bellasi authored
      The estimated utilization of a task is currently updated every time the
      task is dequeued. However, to keep overheads under control, PELT signals
      are effectively updated at maximum once every 1ms.
      
      Thus, for really short running tasks, it can happen that their util_avg
      value has not been updated since their last enqueue.  If such tasks are
      also frequently running tasks (e.g. the kind of workload generated by
      hackbench) it can also happen that their util_avg is updated only every
      few activations.
      
      This means that updating util_est at every dequeue potentially introduces
      unnecessary overhead, and it's also conceptually wrong if the util_avg
      signal has never been updated during a task activation.
      
      Let's introduce a throttling mechanism on a task's util_est updates
      to sync them with util_avg updates. To make the solution memory
      efficient, both in terms of space and load/store operations, we encode a
      synchronization flag into the LSB of util_est.enqueued.
      This makes util_est an even-values-only metric, which is still
      considered good enough for its purpose.
      The synchronization bit is (re)set by __update_load_avg_se() once the
      PELT signal of a task has been updated during its last activation.
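      As an aside, the flag encoding can be illustrated with a minimal sketch
      (illustrative only: the flag name UTIL_EST_UPDATED and the helpers below
      are made up for this example, they are not the patch's actual identifiers):

        #include <stdbool.h>

        /* Hypothetical flag living in the LSB of util_est.enqueued. */
        #define UTIL_EST_UPDATED        0x1u

        /* PELT update path: mark that util_avg changed in this activation. */
        static inline unsigned int util_est_mark_updated(unsigned int enqueued)
        {
                return enqueued | UTIL_EST_UPDATED;
        }

        /* Dequeue path: only refresh util_est if PELT actually produced a
         * new util_avg since the last sync. */
        static inline bool util_est_should_update(unsigned int enqueued)
        {
                return enqueued & UTIL_EST_UPDATED;
        }

        /* Storing a new estimate clears the flag, so util_est remains an
         * even-values-only metric. */
        static inline unsigned int util_est_store(unsigned int new_est)
        {
                return new_est & ~UTIL_EST_UPDATED;
        }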
      
      Such a throttling mechanism keeps util_est overheads in the wakeup hot
      path under control, making it suitable to be enabled even on systems
      running high-intensity workloads. The utilization estimation scheduler
      feature is therefore now switched on by default.
      Suggested-by: Chris Redpath <chris.redpath@arm.com>
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@android.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: http://lkml.kernel.org/r/20180309095245.11071-5-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d519329f
    • sched/fair: Use util_est in LB and WU paths · f9be3e59
      Patrick Bellasi authored
      When the scheduler looks at the CPU utilization, the current PELT value
      for a CPU is returned straight away. In certain scenarios this can have
      undesired side effects on task placement.
      
      For example, since the task utilization is decayed at wakeup time, when
      a long-sleeping big task is enqueued it does not immediately add a
      significant contribution to the target CPU.
      As a result, a race window opens in which other tasks can be placed
      on the same CPU while it is still considered relatively empty.
      
      In order to reduce this kind of race condition, this patch introduces the
      required support to integrate the CPU's estimated utilization
      into the wakeup path, via cpu_util_wake(), as well as into the load-balance
      path, via cpu_util(), which is used by update_sg_lb_stats().
      
      The estimated utilization of a CPU is defined to be the maximum between
      its PELT utilization and the sum of the estimated utilization (at
      previous dequeue time) of all the tasks currently RUNNABLE on that CPU.
      This properly represents the spare capacity of a CPU which, for
      example, has just started running a big task after a long sleep period.
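      For reference, the max rule above can be sketched in a few lines
      (illustrative only; the struct and helper names are assumptions, not the
      actual cpu_util_wake()/cpu_util() implementation):

        /* Hypothetical per-CPU view of the two signals discussed above. */
        struct cpu_util_view {
                unsigned long util_avg;          /* PELT utilization of the CPU */
                unsigned long util_est_enqueued; /* sum of RUNNABLE tasks' estimates */
        };

        /* Estimated CPU utilization: never lower than the live PELT value,
         * never lower than what the RUNNABLE tasks claimed at last dequeue. */
        static unsigned long cpu_util_est(const struct cpu_util_view *v)
        {
                return v->util_avg > v->util_est_enqueued ?
                       v->util_avg : v->util_est_enqueued;
        }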
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@android.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: http://lkml.kernel.org/r/20180309095245.11071-3-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f9be3e59
    • sched/fair: Add util_est on top of PELT · 7f65ea42
      Patrick Bellasi authored
      The util_avg signal computed by PELT is too variable for some use-cases.
      For example, a big task waking up after a long sleep period will have its
      utilization almost completely decayed. This introduces some latency before
      schedutil will be able to pick the best frequency to run a task.
      
      The same issue can affect task placement. Indeed, since the task
      utilization is already decayed at wakeup, when the task is enqueued on a
      CPU, that CPU can temporarily be represented as almost empty even though
      it is running a big task. This leads to a race condition where other
      tasks can potentially be allocated on a CPU which has just started to run
      a big task that slept for a relatively long period.
      
      Moreover, the PELT utilization of a task can be updated roughly every
      millisecond, making it a continuously changing value for certain longer
      running tasks. This means that the instantaneous PELT utilization of a
      RUNNING task is not really meaningful for properly supporting scheduler
      decisions.
      
      For all these reasons, a more stable signal can do a better job of
      representing the expected/estimated utilization of a task/cfs_rq.
      Such a signal can be easily created on top of PELT by still using it as
      an estimator which produces values to be aggregated on meaningful
      events.
      
      This patch adds a simple implementation of util_est, a new signal built on
      top of PELT's util_avg where:
      
          util_est(task) = max(task::util_avg, f(task::util_avg@dequeue))
      
      This allows us to remember how big a task was reported to be by PELT in its
      previous activations, via f(task::util_avg@dequeue), which is the new
      _task_util_est(struct task_struct*) function added by this patch.
      
      If a task changes its behavior and runs longer in a new activation,
      then after a certain time its util_est will simply track the
      original PELT signal (i.e. task::util_avg).
      
      The estimated utilization of a cfs_rq is defined only for root cfs_rqs.
      That's because the only sensible consumers of this signal are the
      scheduler and schedutil, when looking for the overall CPU utilization
      due to FAIR tasks.
      
      For this reason, the estimated utilization of a root cfs_rq is simply
      defined as:
      
          util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est::enqueued)
      
      where:
      
          cfs_rq::util_est::enqueued = sum(_task_util_est(task))
                                       for each RUNNABLE task on that root cfs_rq
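      The per-task rule can be sketched similarly (illustrative; the field
      layout below is an assumption that mirrors the formulas above rather
      than the exact kernel data structures):

        /* Hypothetical per-task fields; 'dequeued' holds f(util_avg@dequeue). */
        struct task_util_view {
                unsigned long util_avg;   /* live PELT signal                */
                unsigned long dequeued;   /* estimate kept from last dequeue */
        };

        /* util_est(task) = max(task::util_avg, f(task::util_avg@dequeue)) */
        static unsigned long task_util_est(const struct task_util_view *t)
        {
                return t->util_avg > t->dequeued ? t->util_avg : t->dequeued;
        }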
      
      It's worth noting that the estimated utilization is tracked only for
      objects of interest, specifically:
      
       - Tasks: to better support tasks placement decisions
       - root cfs_rqs: to better support both tasks placement decisions as
                       well as frequencies selection
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@android.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: http://lkml.kernel.org/r/20180309095245.11071-2-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7f65ea42
  4. 09 March 2018: 15 commits
  5. 04 March 2018: 1 commit
    • sched/headers: Simplify and clean up header usage in the scheduler · 325ea10c
      Ingo Molnar authored
      Do the following cleanups and simplifications:
      
       - sched/sched.h already includes <asm/paravirt.h>, so no need to
         include it in sched/core.c again.
      
       - order the <linux/sched/*.h> headers alphabetically
      
       - add all <linux/sched/*.h> headers to kernel/sched/sched.h
      
       - remove all unnecessary includes from the .c files that
         are already included in kernel/sched/sched.h.
      
      Finally, make all scheduler .c files use a single common header:
      
        #include "sched.h"
      
      ... which now contains a union of the relied upon headers.
      
      This makes the various .c files easier to read and easier to handle.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      325ea10c
  6. 03 March 2018: 1 commit
    • sched: Clean up and harmonize the coding style of the scheduler code base · 97fb7a0a
      Ingo Molnar authored
      A good number of small style inconsistencies have accumulated
      in the scheduler core, so do a pass over them to harmonize
      all these details:
      
       - fix spelling in comments,
      
       - use curly braces for multi-line statements,
      
       - remove unnecessary parentheses from integer literals,
      
       - capitalize consistently,
      
       - remove stray newlines,
      
       - add comments where necessary,
      
       - remove invalid/unnecessary comments,
      
       - align structure definitions and other data types vertically,
      
       - add missing newlines for increased readability,
      
       - fix vertical tabulation where it's misaligned,
      
       - harmonize preprocessor conditional block labeling
         and vertical alignment,
      
       - remove line-breaks where they uglify the code,
      
       - add newline after local variable definitions,
      
      No change in functionality:
      
        md5:
           1191fa0a890cfa8132156d2959d7e9e2  built-in.o.before.asm
           1191fa0a890cfa8132156d2959d7e9e2  built-in.o.after.asm
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      97fb7a0a
  7. 21 February 2018: 7 commits
    • sched/isolation: Offload residual 1Hz scheduler tick · d84b3131
      Frederic Weisbecker authored
      When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
      keep the scheduler stats alive. However, this residual tick is a burden
      for bare-metal tasks that can't tolerate any interruption at all, or that
      want to minimize interruptions.
      
      The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now
      outsource these scheduler ticks to the global workqueue so that a
      housekeeping CPU handles those remotely. The sched_class::task_tick()
      implementations have been audited and look safe to be called remotely
      as the target runqueue and its current task are passed as parameters
      and don't seem to be accessed locally.
      
      Note that when using isolcpus, it's still up to the user to
      affine the global workqueues to the housekeeping CPUs through
      /sys/devices/virtual/workqueue/cpumask or domain isolation
      ("isolcpus=nohz,domain").
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Link: http://lkml.kernel.org/r/1519186649-3242-6-git-send-email-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d84b3131
    • sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine() · 7347fc87
      Mel Gorman authored
      If wake_affine() pulls a task to another node for any reason and the node is
      no longer preferred, then temporarily stop automatic NUMA balancing from
      pulling the task back. Otherwise, tasks with a strong waker/wakee relationship
      may constantly fight automatic NUMA balancing over where a task should
      be placed.
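      The backoff can be pictured with a small sketch (illustrative; the field
      name, timestamp type and backoff period are assumptions, not necessarily
      the patch's exact choices):

        /* Hypothetical per-task NUMA state used by the sketch. */
        struct task_numa_state {
                unsigned long numa_migrate_retry;  /* earliest next retry  */
                int numa_preferred_nid;            /* preferred NUMA node  */
        };

        /* Called after wake_affine() pulled the task to @new_nid: if that
         * node is no longer preferred, pause NUMA balancing's pull-back. */
        static void delay_numa_retry(struct task_numa_state *ts, int new_nid,
                                     unsigned long now, unsigned long backoff)
        {
                if (ts->numa_preferred_nid != new_nid)
                        ts->numa_migrate_retry = now + backoff;
        }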
      
      Once again netperf is interesting here. The performance barely changes
      but automatic NUMA balancing is interesting:
      
       Hmean     send-64         354.67 (   0.00%)      352.15 (  -0.71%)
       Hmean     send-128        702.91 (   0.00%)      693.84 (  -1.29%)
       Hmean     send-256       1350.07 (   0.00%)     1344.19 (  -0.44%)
       Hmean     send-1024      5124.38 (   0.00%)     4941.24 (  -3.57%)
       Hmean     send-2048      9687.44 (   0.00%)     9624.45 (  -0.65%)
       Hmean     send-3312     14577.64 (   0.00%)    14514.35 (  -0.43%)
       Hmean     send-4096     16393.62 (   0.00%)    16488.30 (   0.58%)
       Hmean     send-8192     26877.26 (   0.00%)    26431.63 (  -1.66%)
       Hmean     send-16384    38683.43 (   0.00%)    38264.91 (  -1.08%)
       Hmean     recv-64         354.67 (   0.00%)      352.15 (  -0.71%)
       Hmean     recv-128        702.91 (   0.00%)      693.84 (  -1.29%)
       Hmean     recv-256       1350.07 (   0.00%)     1344.19 (  -0.44%)
       Hmean     recv-1024      5124.38 (   0.00%)     4941.24 (  -3.57%)
       Hmean     recv-2048      9687.43 (   0.00%)     9624.45 (  -0.65%)
       Hmean     recv-3312     14577.59 (   0.00%)    14514.35 (  -0.43%)
       Hmean     recv-4096     16393.55 (   0.00%)    16488.20 (   0.58%)
       Hmean     recv-8192     26876.96 (   0.00%)    26431.29 (  -1.66%)
       Hmean     recv-16384    38682.41 (   0.00%)    38263.94 (  -1.08%)
      
       NUMA alloc hit                 1465986     1423090
       NUMA alloc miss                      0           0
       NUMA interleave hit                  0           0
       NUMA alloc local               1465897     1423003
       NUMA base PTE updates             1473        1420
       NUMA huge PMD updates                0           0
       NUMA page range updates           1473        1420
       NUMA hint faults                  1383        1312
       NUMA hint local faults             451         124
       NUMA hint local percent             32           9
      
      There is a slight degradation in performance but there are slightly fewer
      NUMA faults. There is a large drop in the percentage of local faults, but
      the bulk of migrations for netperf are in small shared libraries, so this
      reflects the fact that automatic NUMA balancing has backed off. This is
      a case where, despite wake_affine() and automatic NUMA balancing fighting
      over placement, there is a marginal benefit to rescheduling to local
      data quickly. However, it should be noted that wake_affine() and automatic
      NUMA balancing constantly fighting each other is undesirable.
      
      However, the benefit in other cases is large. This is the result for NAS
      with the D class sizing on a 4-socket machine:
      
       nas-mpi
                                 4.15.0                 4.15.0
                           sdnuma-v1r23       delayretry-v1r23
       Time cg.D      557.00 (   0.00%)      431.82 (  22.47%)
       Time ep.D       77.83 (   0.00%)       79.01 (  -1.52%)
       Time is.D       26.46 (   0.00%)       26.64 (  -0.68%)
       Time lu.D      727.14 (   0.00%)      597.94 (  17.77%)
       Time mg.D      191.35 (   0.00%)      146.85 (  23.26%)
      
                     4.15.0      4.15.0
                sdnuma-v1r23  delayretry-v1r23
       User        75665.20    70413.30
       System      20321.59     8861.67
       Elapsed       766.13      634.92
      
       Minor Faults                  16528502     7127941
       Major Faults                      4553        5068
       NUMA alloc local               6963197     6749135
       NUMA base PTE updates        366409093   107491434
       NUMA huge PMD updates           687556      198880
       NUMA page range updates      718437765   209317994
       NUMA hint faults              13643410     4601187
       NUMA hint local faults         9212593     3063996
       NUMA hint local percent             67          66
      
      Note the massive reduction in system CPU usage even though the percentage
      of local faults is barely affected. There is a massive reduction in the
      number of PTE updates showing that automatic NUMA balancing has backed off.
      A critical observation is also that there is a massive reduction in minor
      faults which is due to far fewer NUMA hinting faults being trapped.
      
      There were questions on NAS OMP and how it behaved related to threads
      being bound to CPUs. First, there are more gains than losses with this
      patch applied and a reduction in system CPU usage:
      
      nas-omp
                            4.16.0-rc1             4.16.0-rc1
                           sdnuma-v2r1        delayretry-v2r1
      Time bt.D      436.71 (   0.00%)      430.05 (   1.53%)
      Time cg.D      201.02 (   0.00%)      180.87 (  10.02%)
      Time ep.D       32.84 (   0.00%)       32.68 (   0.49%)
      Time is.D        9.63 (   0.00%)        9.64 (  -0.10%)
      Time lu.D      331.20 (   0.00%)      304.80 (   7.97%)
      Time mg.D       54.87 (   0.00%)       52.72 (   3.92%)
      Time sp.D     1108.78 (   0.00%)      917.10 (  17.29%)
      Time ua.D      378.81 (   0.00%)      398.83 (  -5.28%)
      
                4.16.0-rc1  4.16.0-rc1
                sdnuma-v2r1  delayretry-v2r1
      User       305633.08   296751.91
      System        451.75      357.80
      Elapsed      2595.73     2368.13
      
      However, it does not close the gap between binding and being unbound. There
      is negligible difference between the performance of the baseline and a
      patched kernel when threads are bound so it is not presented here:
      
                            4.16.0-rc1             4.16.0-rc1
                       delayretry-bind     delayretry-unbound
      Time bt.D      385.02 (   0.00%)      430.05 ( -11.70%)
      Time cg.D      144.02 (   0.00%)      180.87 ( -25.59%)
      Time ep.D       32.85 (   0.00%)       32.68 (   0.52%)
      Time is.D       10.52 (   0.00%)        9.64 (   8.37%)
      Time lu.D      285.31 (   0.00%)      304.80 (  -6.83%)
      Time mg.D       43.21 (   0.00%)       52.72 ( -22.01%)
      Time sp.D      820.24 (   0.00%)      917.10 ( -11.81%)
      Time ua.D      337.09 (   0.00%)      398.83 ( -18.32%)
      
                4.16.0-rc1  4.16.0-rc1
               delayretry-bind  delayretry-unbound
      User       277731.25   296751.91
      System        261.29      357.80
      Elapsed      2100.55     2368.13
      
      Unfortunately, while performance is improved by the patch, there is still
      quite a long way to go before it's equivalent to hard binding.
      
      Other workloads like hackbench, tbench, dbench and schbench are barely
      affected. dbench shows a mix of gains and losses depending on the machine
      although in general, the results are more stable.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180213133730.24064-7-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7347fc87
    • sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on · 2c833627
      Mel Gorman authored
      find_idlest_group() compares a local group with each other group to select
      the one that is most idle. When comparing groups in different NUMA domains,
      a very slight imbalance is enough to select a remote NUMA node even if the
      runnable load on both groups is 0 or close to 0. This ignores the cost of
      remote accesses entirely and is a problem when selecting the CPU for a
      newly forked task to run on. In that case a forking server is almost
      guaranteed to run on a remote node, incurring numerous remote accesses
      and potentially causing automatic NUMA balancing to try to migrate
      the task back or migrate the data to another node. Similar weirdness is
      observed if a basic shell command pipes output to another, as each process
      in the pipeline is likely to start on a different node and then get adjusted
      later by wake_affine().
      
      This patch adds imbalance to remote domains when considering whether to
      select CPUs from remote domains. If the local domain is selected, imbalance
      will still be used to try to select a CPU from a lower scheduler domain's group
      instead of stacking tasks on the same CPU.
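      The idea reduces to adding a margin when the candidate group sits in a
      different NUMA domain; a minimal sketch (the helper name and the way the
      margin is applied are assumptions, not the exact find_idlest_group()
      change):

        #include <stdbool.h>

        /* Prefer a remote group only if it is idler by more than a margin;
         * without the margin, near-zero load differences could tip the
         * decision towards a remote NUMA node for no real benefit. */
        static bool prefer_remote_group(unsigned long local_load,
                                        unsigned long remote_load,
                                        unsigned long imbalance,
                                        bool crosses_numa)
        {
                if (crosses_numa)
                        return remote_load + imbalance < local_load;

                return remote_load < local_load;
        }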
      
      A variety of workloads and machines were tested and as expected, there is no
      difference on UMA. The difference on NUMA can be dramatic. This is a comparison
      of elapsed times running the git regression test suite. It's fork-intensive with
      short-lived processes:
      
                                        4.15.0                 4.15.0
                                  noexit-v1r23           sdnuma-v1r23
       Elapsed min          1706.06 (   0.00%)     1435.94 (  15.83%)
       Elapsed mean         1709.53 (   0.00%)     1436.98 (  15.94%)
       Elapsed stddev          2.16 (   0.00%)        1.01 (  53.38%)
       Elapsed coeffvar        0.13 (   0.00%)        0.07 (  44.54%)
       Elapsed max          1711.59 (   0.00%)     1438.01 (  15.98%)
      
                     4.15.0      4.15.0
               noexit-v1r23 sdnuma-v1r23
       User         5434.12     5188.41
       System       4878.77     3467.09
       Elapsed     10259.06     8624.21
      
      That shows a considerable reduction in elapsed times. It's important to
      note that automatic NUMA balancing does not affect this load as processes
      are too short-lived.
      
      There is also a noticeable impact on hackbench, such as this example using
      processes and pipes:
      
       hackbench-process-pipes
                                     4.15.0                 4.15.0
                               noexit-v1r23           sdnuma-v1r23
       Amean     1        1.0973 (   0.00%)      0.9393 (  14.40%)
       Amean     4        1.3427 (   0.00%)      1.3730 (  -2.26%)
       Amean     7        1.4233 (   0.00%)      1.6670 ( -17.12%)
       Amean     12       3.0250 (   0.00%)      3.3013 (  -9.13%)
       Amean     21       9.0860 (   0.00%)      9.5343 (  -4.93%)
       Amean     30      14.6547 (   0.00%)     13.2433 (   9.63%)
       Amean     48      22.5447 (   0.00%)     20.4303 (   9.38%)
       Amean     79      29.2010 (   0.00%)     26.7853 (   8.27%)
       Amean     110     36.7443 (   0.00%)     35.8453 (   2.45%)
       Amean     141     45.8533 (   0.00%)     42.6223 (   7.05%)
       Amean     172     55.1317 (   0.00%)     50.6473 (   8.13%)
       Amean     203     64.4420 (   0.00%)     58.3957 (   9.38%)
       Amean     234     73.2293 (   0.00%)     67.1047 (   8.36%)
       Amean     265     80.5220 (   0.00%)     75.7330 (   5.95%)
       Amean     296     88.7567 (   0.00%)     82.1533 (   7.44%)
      
      It's not a universal win as there are occasions when spreading wide and
      quickly is a benefit but it's more of a win than it is a loss. For other
      workloads, there is little difference but netperf is interesting. Without
      the patch, the server and client start on different nodes but quickly get
      migrated due to wake_affine(). Hence, the difference in overall performance
      is marginal but detectable:
      
                                            4.15.0                 4.15.0
                                      noexit-v1r23           sdnuma-v1r23
       Hmean     send-64         349.09 (   0.00%)      354.67 (   1.60%)
       Hmean     send-128        699.16 (   0.00%)      702.91 (   0.54%)
       Hmean     send-256       1316.34 (   0.00%)     1350.07 (   2.56%)
       Hmean     send-1024      5063.99 (   0.00%)     5124.38 (   1.19%)
       Hmean     send-2048      9705.19 (   0.00%)     9687.44 (  -0.18%)
       Hmean     send-3312     14359.48 (   0.00%)    14577.64 (   1.52%)
       Hmean     send-4096     16324.20 (   0.00%)    16393.62 (   0.43%)
       Hmean     send-8192     26112.61 (   0.00%)    26877.26 (   2.93%)
       Hmean     send-16384    37208.44 (   0.00%)    38683.43 (   3.96%)
       Hmean     recv-64         349.09 (   0.00%)      354.67 (   1.60%)
       Hmean     recv-128        699.16 (   0.00%)      702.91 (   0.54%)
       Hmean     recv-256       1316.34 (   0.00%)     1350.07 (   2.56%)
       Hmean     recv-1024      5063.99 (   0.00%)     5124.38 (   1.19%)
       Hmean     recv-2048      9705.16 (   0.00%)     9687.43 (  -0.18%)
       Hmean     recv-3312     14359.42 (   0.00%)    14577.59 (   1.52%)
       Hmean     recv-4096     16323.98 (   0.00%)    16393.55 (   0.43%)
       Hmean     recv-8192     26111.85 (   0.00%)    26876.96 (   2.93%)
       Hmean     recv-16384    37206.99 (   0.00%)    38682.41 (   3.97%)
      
      However, what is very interesting is how automatic NUMA balancing behaves.
      Each netperf instance runs long enough for balancing to activate:
      
       NUMA base PTE updates             4620        1473
       NUMA huge PMD updates                0           0
       NUMA page range updates           4620        1473
       NUMA hint faults                  4301        1383
       NUMA hint local faults            1309         451
       NUMA hint local percent             30          32
       NUMA pages migrated               1335         491
       AutoNUMA cost                      21%          6%
      
      There is an unfortunate number of remote faults although tracing indicated
      that the vast majority are in shared libraries. However, the tendency to
      start tasks on the same node if there is capacity means that there were
      far fewer PTE updates and faults incurred overall.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180213133730.24064-6-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2c833627
    • sched/fair: Do not migrate due to a sync wakeup on exit · 24d0c1d6
      Peter Zijlstra authored
      When a task exits, it notifies the parent that it has exited. This is a
      sync wakeup and the exiting task may pull the parent towards the waker's
      CPU. For simple workloads like using a shell, it was observed that the
      shell is pulled across nodes by exiting processes. This is daft as the
      parent may be long-lived and properly placed. This patch special-cases a
      sync wakeup on exit to avoid pulling tasks across nodes. Testing on a range
      of workloads and machines showed very little difference in performance,
      although there was a small 3% boost on some machines running a
      shellscript-intensive workload (git regression test suite).
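      The special case boils down to not treating a wakeup as sync when the
      waker is exiting; a hedged sketch (the flag values and function name
      below are illustrative stand-ins):

        /* Assumed flag values, for this sketch only. */
        #define SKETCH_WF_SYNC          0x1u    /* wakeup is synchronous */
        #define SKETCH_PF_EXITING       0x4u    /* the waker is exiting  */

        /* A sync wakeup from an exiting waker should not pull the wakee
         * (often a long-lived, well-placed parent) towards the waker's CPU. */
        static int wakeup_is_sync(unsigned int wake_flags,
                                  unsigned int waker_flags)
        {
                return (wake_flags & SKETCH_WF_SYNC) &&
                       !(waker_flags & SKETCH_PF_EXITING);
        }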
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180213133730.24064-5-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      24d0c1d6
    • sched/fair: Do not migrate on wake_affine_weight() if weights are equal · 082f764a
      Mel Gorman authored
      wake_affine_weight() will consider migrating a task to, or near, the current
      CPU if there is a load imbalance. If the CPUs share LLC then either CPU
      is valid as a search-for-idle-sibling target and equally appropriate for
      stacking two tasks on one CPU if an idle sibling is unavailable. If they do
      not share cache then a cross-node migration potentially impacts locality
      so while they are equal from a CPU capacity point of view, they are not
      equal in terms of memory locality. In either case, it's more appropriate
      to migrate only if there is a difference in their effective load.
      
      This patch modifies wake_affine_weight() so that it only considers
      migrating a task if there is a load imbalance for normal wakeups, but
      allows potential stacking if the loads are equal and it's a sync wakeup.
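      In effect the load comparison becomes strict for normal wakeups while a
      sync wakeup may still stack on equal load; a minimal sketch (names and
      return convention are illustrative, not wake_affine_weight()'s exact
      interface):

        #include <stdbool.h>

        /* Return true if migrating towards the waking CPU is justified. */
        static bool prefer_this_cpu(unsigned long this_eff_load,
                                    unsigned long prev_eff_load, bool sync)
        {
                if (sync && this_eff_load == prev_eff_load)
                        return true;    /* equal load: allow stacking on sync */

                /* otherwise require a real imbalance in this CPU's favour */
                return this_eff_load < prev_eff_load;
        }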
      
      For the most part, the difference in performance is marginal. For example,
      on a 4-socket server running netperf UDP_STREAM on localhost the differences
      are as follows:
      
                                            4.15.0                 4.15.0
                                             16rc0          noequal-v1r23
       Hmean     send-64         355.47 (   0.00%)      349.50 (  -1.68%)
       Hmean     send-128        697.98 (   0.00%)      693.35 (  -0.66%)
       Hmean     send-256       1328.02 (   0.00%)     1318.77 (  -0.70%)
       Hmean     send-1024      5051.83 (   0.00%)     5051.11 (  -0.01%)
       Hmean     send-2048      9637.02 (   0.00%)     9601.34 (  -0.37%)
       Hmean     send-3312     14355.37 (   0.00%)    14414.51 (   0.41%)
       Hmean     send-4096     16464.97 (   0.00%)    16301.37 (  -0.99%)
       Hmean     send-8192     26722.42 (   0.00%)    26428.95 (  -1.10%)
       Hmean     send-16384    38137.81 (   0.00%)    38046.11 (  -0.24%)
       Hmean     recv-64         355.47 (   0.00%)      349.50 (  -1.68%)
       Hmean     recv-128        697.98 (   0.00%)      693.35 (  -0.66%)
       Hmean     recv-256       1328.02 (   0.00%)     1318.77 (  -0.70%)
       Hmean     recv-1024      5051.83 (   0.00%)     5051.11 (  -0.01%)
       Hmean     recv-2048      9636.95 (   0.00%)     9601.30 (  -0.37%)
       Hmean     recv-3312     14355.32 (   0.00%)    14414.48 (   0.41%)
       Hmean     recv-4096     16464.74 (   0.00%)    16301.16 (  -0.99%)
       Hmean     recv-8192     26721.63 (   0.00%)    26428.17 (  -1.10%)
       Hmean     recv-16384    38136.00 (   0.00%)    38044.88 (  -0.24%)
       Stddev    send-64           7.30 (   0.00%)        4.75 (  34.96%)
       Stddev    send-128         15.15 (   0.00%)       22.38 ( -47.66%)
       Stddev    send-256         13.99 (   0.00%)       19.14 ( -36.81%)
       Stddev    send-1024       105.73 (   0.00%)       67.38 (  36.27%)
       Stddev    send-2048       294.57 (   0.00%)      223.88 (  24.00%)
       Stddev    send-3312       302.28 (   0.00%)      271.74 (  10.10%)
       Stddev    send-4096       195.92 (   0.00%)      121.10 (  38.19%)
       Stddev    send-8192       399.71 (   0.00%)      563.77 ( -41.04%)
       Stddev    send-16384     1163.47 (   0.00%)     1103.68 (   5.14%)
       Stddev    recv-64           7.30 (   0.00%)        4.75 (  34.96%)
       Stddev    recv-128         15.15 (   0.00%)       22.38 ( -47.66%)
       Stddev    recv-256         13.99 (   0.00%)       19.14 ( -36.81%)
       Stddev    recv-1024       105.73 (   0.00%)       67.38 (  36.27%)
       Stddev    recv-2048       294.59 (   0.00%)      223.89 (  24.00%)
       Stddev    recv-3312       302.24 (   0.00%)      271.75 (  10.09%)
       Stddev    recv-4096       196.03 (   0.00%)      121.14 (  38.20%)
       Stddev    recv-8192       399.86 (   0.00%)      563.65 ( -40.96%)
       Stddev    recv-16384     1163.79 (   0.00%)     1103.86 (   5.15%)
      
      The difference in overall performance is marginal but note that most
      measurements are less variable. There were similar observations for other
      netperf comparisons. hackbench with sockets or threads, and with processes or
      threads, showed minor differences with some reduction in migrations. tbench
      showed only marginal differences that were within the noise. dbench,
      regardless of filesystem, showed minor differences all of which are
      within noise. Multiple machines, both UMA and NUMA were tested without
      any regressions showing up.
      
      The biggest risk with a patch like this is affecting wakeup latencies.
      However, the schbench load from Facebook which is very sensitive to wakeup
      latency showed a mixed result with mostly improvements in wakeup latency:
      
                                            4.15.0                 4.15.0
                                             16rc0          noequal-v1r23
       Lat 50.00th-qrtle-1        38.00 (   0.00%)       38.00 (   0.00%)
       Lat 75.00th-qrtle-1        49.00 (   0.00%)       41.00 (  16.33%)
       Lat 90.00th-qrtle-1        52.00 (   0.00%)       50.00 (   3.85%)
       Lat 95.00th-qrtle-1        54.00 (   0.00%)       51.00 (   5.56%)
       Lat 99.00th-qrtle-1        63.00 (   0.00%)       60.00 (   4.76%)
       Lat 99.50th-qrtle-1        66.00 (   0.00%)       61.00 (   7.58%)
       Lat 99.90th-qrtle-1        78.00 (   0.00%)       65.00 (  16.67%)
       Lat 50.00th-qrtle-2        38.00 (   0.00%)       38.00 (   0.00%)
       Lat 75.00th-qrtle-2        42.00 (   0.00%)       43.00 (  -2.38%)
       Lat 90.00th-qrtle-2        46.00 (   0.00%)       48.00 (  -4.35%)
       Lat 95.00th-qrtle-2        49.00 (   0.00%)       50.00 (  -2.04%)
       Lat 99.00th-qrtle-2        55.00 (   0.00%)       57.00 (  -3.64%)
       Lat 99.50th-qrtle-2        58.00 (   0.00%)       60.00 (  -3.45%)
       Lat 99.90th-qrtle-2        65.00 (   0.00%)       68.00 (  -4.62%)
       Lat 50.00th-qrtle-4        41.00 (   0.00%)       41.00 (   0.00%)
       Lat 75.00th-qrtle-4        45.00 (   0.00%)       46.00 (  -2.22%)
       Lat 90.00th-qrtle-4        50.00 (   0.00%)       50.00 (   0.00%)
       Lat 95.00th-qrtle-4        54.00 (   0.00%)       53.00 (   1.85%)
       Lat 99.00th-qrtle-4        61.00 (   0.00%)       61.00 (   0.00%)
       Lat 99.50th-qrtle-4        65.00 (   0.00%)       64.00 (   1.54%)
       Lat 99.90th-qrtle-4        76.00 (   0.00%)       82.00 (  -7.89%)
       Lat 50.00th-qrtle-8        48.00 (   0.00%)       46.00 (   4.17%)
       Lat 75.00th-qrtle-8        55.00 (   0.00%)       54.00 (   1.82%)
       Lat 90.00th-qrtle-8        60.00 (   0.00%)       59.00 (   1.67%)
       Lat 95.00th-qrtle-8        63.00 (   0.00%)       63.00 (   0.00%)
       Lat 99.00th-qrtle-8        71.00 (   0.00%)       69.00 (   2.82%)
       Lat 99.50th-qrtle-8        74.00 (   0.00%)       73.00 (   1.35%)
       Lat 99.90th-qrtle-8        98.00 (   0.00%)       90.00 (   8.16%)
       Lat 50.00th-qrtle-16       56.00 (   0.00%)       55.00 (   1.79%)
       Lat 75.00th-qrtle-16       68.00 (   0.00%)       67.00 (   1.47%)
       Lat 90.00th-qrtle-16       77.00 (   0.00%)       78.00 (  -1.30%)
       Lat 95.00th-qrtle-16       82.00 (   0.00%)       84.00 (  -2.44%)
       Lat 99.00th-qrtle-16       90.00 (   0.00%)       93.00 (  -3.33%)
       Lat 99.50th-qrtle-16       93.00 (   0.00%)       97.00 (  -4.30%)
       Lat 99.90th-qrtle-16      110.00 (   0.00%)      110.00 (   0.00%)
       Lat 50.00th-qrtle-32       68.00 (   0.00%)       62.00 (   8.82%)
       Lat 75.00th-qrtle-32       90.00 (   0.00%)       83.00 (   7.78%)
       Lat 90.00th-qrtle-32      110.00 (   0.00%)      100.00 (   9.09%)
       Lat 95.00th-qrtle-32      122.00 (   0.00%)      111.00 (   9.02%)
       Lat 99.00th-qrtle-32      145.00 (   0.00%)      133.00 (   8.28%)
       Lat 99.50th-qrtle-32      154.00 (   0.00%)      143.00 (   7.14%)
       Lat 99.90th-qrtle-32     2316.00 (   0.00%)      515.00 (  77.76%)
       Lat 50.00th-qrtle-35       69.00 (   0.00%)       72.00 (  -4.35%)
       Lat 75.00th-qrtle-35       92.00 (   0.00%)       95.00 (  -3.26%)
       Lat 90.00th-qrtle-35      111.00 (   0.00%)      114.00 (  -2.70%)
       Lat 95.00th-qrtle-35      122.00 (   0.00%)      124.00 (  -1.64%)
       Lat 99.00th-qrtle-35      142.00 (   0.00%)      144.00 (  -1.41%)
       Lat 99.50th-qrtle-35      150.00 (   0.00%)      154.00 (  -2.67%)
       Lat 99.90th-qrtle-35     6104.00 (   0.00%)     5640.00 (   7.60%)
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180213133730.24064-4-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      082f764a
    • sched/fair: Defer calculation of 'prev_eff_load' in wake_affine_weight() until needed · eeb60398
      Mel Gorman authored
      On sync wakeups, the previous CPU effective load may not be used so delay
      the calculation until it's needed.
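      The change amounts to computing the previous CPU's effective load only
      after the sync shortcut has had a chance to return early; a rough sketch
      under assumed names and simplified load arithmetic:

        /* Sketch only: prev_eff_load is computed lazily, past the sync path. */
        static int wake_affine_weight_sketch(long this_load, long waker_load,
                                             long prev_load, long task_load,
                                             int sync)
        {
                long this_eff_load = this_load;
                long prev_eff_load;

                if (sync) {
                        /* The waker's load is about to go away; if the waking
                         * CPU would then be empty, pick it without ever
                         * touching the previous CPU's load. */
                        this_eff_load -= waker_load;
                        if (this_eff_load <= 0)
                                return 1;
                }

                this_eff_load += task_load;
                prev_eff_load = prev_load;      /* deferred until needed */

                return this_eff_load < prev_eff_load;
        }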
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180213133730.24064-3-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      eeb60398
    • sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine · 7ebb66a1
      Mel Gorman authored
      The only caller of wake_affine() knows the CPU ID. Pass it in instead of
      rechecking it.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180213133730.24064-2-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7ebb66a1
  8. 13 February 2018: 1 commit
  9. 06 February 2018: 5 commits
    • sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS · 32e839dd
      Mel Gorman authored
      The select_idle_sibling() (SIS) rewrite in commit:
      
        10e2f1ac ("sched/core: Rewrite and improve select_idle_siblings()")
      
      ... replaced a domain iteration with a search that broadly speaking
      does a wrapped walk of the scheduler domain sharing a last-level-cache.
      
      While this had a number of improvements, one consequence is that two tasks
      that share a waker/wakee relationship push each other around a socket. Even
      though only two tasks may be active, all cores end up being used evenly. This
      is great from a search perspective and spreads the load across individual
      cores, but it has adverse consequences for cpufreq. As each CPU has
      relatively low utilisation, cpufreq may decide the utilisation is too low
      to use a higher P-state and overall computation throughput suffers.
      
      While individual cpufreq and cpuidle drivers may compensate by artificially
      boosting the P-state (at C0) or avoiding lower C-states (during idle), it does
      not help if hardware-based cpufreq (e.g. HWP) is used.
      
      This patch tracks a recently used CPU, based on the CPU a task was running
      on when it last acted as a waker; in other words, a CPU the task was
      recently using when it is now a wakee. During SIS, the recently used CPU
      is used as a target if it's still allowed by the task and is idle.
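      A hedged sketch of the bookkeeping and the extra SIS candidate check (the
      recent_used_cpu field follows the description above; the predicates are
      stand-ins so the example is self-contained):

        #include <stdbool.h>

        /* Stand-in predicates; the kernel has its own versions of these. */
        static bool cpu_is_idle(int cpu)            { (void)cpu; return true; }
        static bool cpus_share_llc(int a, int b)    { (void)a; (void)b; return true; }
        static bool cpu_allowed_for_task(int cpu)   { (void)cpu; return true; }

        /* Remembered at wakeup time: the CPU the task ran on as a waker. */
        struct task_placement {
                int recent_used_cpu;
        };

        /* During select_idle_sibling(): try it before the full LLC scan. */
        static int try_recent_used_cpu(const struct task_placement *tp,
                                       int prev, int target)
        {
                int recent = tp->recent_used_cpu;

                if (recent != prev && recent != target &&
                    cpus_share_llc(recent, target) &&
                    cpu_is_idle(recent) &&
                    cpu_allowed_for_task(recent))
                        return recent;

                return -1;      /* fall back to the regular wrapped walk */
        }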
      
      The benefit may be non-obvious so consider an example of two tasks
      communicating back and forth. Task A may be an application doing IO where
      task B is a kworker or kthread like journald. Task A may issue IO, wake
      B and B wakes up A on completion.  With the existing scheme this may look
      like the following (potentially different IDs if SMT is in use, but a
      similar principle applies).
      
       A (cpu 0)	wake	B (wakes on cpu 1)
       B (cpu 1)	wake	A (wakes on cpu 2)
       A (cpu 2)	wake	B (wakes on cpu 3)
       etc.
      
      A careful reader may wonder why CPU 0 was not idle when B wakes A the
      first time; it's simply due to the fact that A can be rescheduled to
      another CPU, the pattern is that prev == target when B tries to wake up A,
      and the information about CPU 0 has been lost.
      
      With this patch, the pattern is more likely to be:
      
       A (cpu 0)	wake	B (wakes on cpu 1)
       B (cpu 1)	wake	A (wakes on cpu 0)
       A (cpu 0)	wake	B (wakes on cpu 1)
       etc
      
      i.e. two communicating tasks are more likely to use just two cores instead
      of all available cores sharing an LLC.
      
      The most dramatic speedup was noticed on dbench using the XFS filesystem on
      UMA as clients interact heavily with workqueues in that configuration. Note
      that a similar speedup is not observed on ext4 as the wakeup pattern
      is different:
      
                                4.15.0-rc9             4.15.0-rc9
                                 waprev-v1        biasancestor-v1
       Hmean      1      287.54 (   0.00%)      817.01 ( 184.14%)
       Hmean      2     1268.12 (   0.00%)     1781.24 (  40.46%)
       Hmean      4     1739.68 (   0.00%)     1594.47 (  -8.35%)
       Hmean      8     2464.12 (   0.00%)     2479.56 (   0.63%)
       Hmean     64     1455.57 (   0.00%)     1434.68 (  -1.44%)
      
      The results can be less dramatic on NUMA where automatic balancing interferes
      with the test. It's also known that network benchmarks running on localhost
      also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
      and TCP depending on the machine). Hackbench also sees small improvements
      (6-11% depending on machine and thread count). The Facebook schbench was also
      tested but in most cases showed little or no difference in wakeup latencies.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180130104555.4125-5-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      32e839dd
    • sched/fair: Do not migrate if the prev_cpu is idle · 806486c3
      Mel Gorman authored
      wake_affine_idle() prefers to move a task to the current CPU if the
      wakeup is due to an interrupt. The expectation is that the interrupt
      data is cache hot and relevant to the waking task as well as avoiding
      a search. However, there is no way to determine if there was cache hot
      data on the previous CPU that may exceed the interrupt data. Furthermore,
      round-robin delivery of interrupts can migrate tasks around a socket where
      each CPU is under-utilised.  This can interact badly with cpufreq which
      makes decisions based on per-cpu data. It has been observed on machines
      with HWP that p-states are not boosted to their maximum levels even though
      the workload is latency and throughput sensitive.
      
      This patch uses the previous CPU for the task if it's idle and cache-affine
      with the current CPU, even if the current CPU is idle due to the wakeup
      being related to the interrupt. This reduces migrations at the cost of
      the interrupt data not being cache hot when the task wakes.
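      A minimal sketch of the adjusted decision (illustrative names and return
      convention; a negative return stands for "defer to wake_affine_weight()"):

        #include <stdbool.h>

        /* When the waking CPU is idle (typically an interrupt wakeup) and
         * shares cache with the previous CPU, prefer the idle previous CPU
         * instead of unconditionally pulling the task. */
        static int wake_affine_idle_sketch(int this_cpu, int prev_cpu,
                                           bool this_idle, bool prev_idle,
                                           bool share_cache)
        {
                if (this_idle && share_cache)
                        return prev_idle ? prev_cpu : this_cpu;

                return -1;      /* no decision here */
        }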
      
      A variety of workloads were tested on various machines and no adverse
      impact was noticed that was outside noise. dbench on ext4 on UMA showed
      roughly a 10% reduction in the number of CPU migrations, and it is a case
      where interrupts are frequent for IO completions. In most cases, the
      difference in performance is quite small but variability is often
      reduced. For example, this is the result for pgbench running on a UMA
      machine with different numbers of clients.
      
                                4.15.0-rc9             4.15.0-rc9
                                  baseline              waprev-v1
       Hmean     1     22096.28 (   0.00%)    22734.86 (   2.89%)
       Hmean     4     74633.42 (   0.00%)    75496.77 (   1.16%)
       Hmean     7    115017.50 (   0.00%)   113030.81 (  -1.73%)
       Hmean     12   126209.63 (   0.00%)   126613.40 (   0.32%)
       Hmean     16   131886.91 (   0.00%)   130844.35 (  -0.79%)
       Stddev    1       636.38 (   0.00%)      417.11 (  34.46%)
       Stddev    4       614.64 (   0.00%)      583.24 (   5.11%)
       Stddev    7       542.46 (   0.00%)      435.45 (  19.73%)
       Stddev    12      173.93 (   0.00%)      171.50 (   1.40%)
       Stddev    16      671.42 (   0.00%)      680.30 (  -1.32%)
       CoeffVar  1         2.88 (   0.00%)        1.83 (  36.26%)
      
      Note that the difference in performance is marginal, but for low utilisation,
      there is less variability.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180130104555.4125-4-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      806486c3
    • sched/fair: Restructure wake_affine*() to return a CPU id · 3b76c4a3
      Mel Gorman authored
      This is a preparation patch that has wake_affine*() return a CPU ID instead of
      a boolean. The intent is to allow the wake_affine() helpers to be avoided
      if a decision is already made. This patch has no functional change.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180130104555.4125-3-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3b76c4a3
    • sched/fair: Remove unnecessary parameters from wake_affine_idle() · 89a55f56
      Mel Gorman authored
      wake_affine_idle() takes parameters it never uses so clean it up.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180130104555.4125-2-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      89a55f56
    • sched/core: Optimize update_stats_*() · 2ed41a55
      Peter Zijlstra authored
      These functions are already gated by schedstats_enabled(); there is no
      point in then issuing another static_branch check for every individual
      update inside them.
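      The pattern being optimized looks roughly like the following sketch
      (made-up helper names; the point is that the caller is already behind
      the gate, so the callee can update unconditionally):

        #include <stdbool.h>

        static bool stats_enabled(void)  { return true; }   /* stand-in gate */

        static void stat_add(unsigned long *s, unsigned long v) { *s += v; }

        /* Before: every individual update re-tests the (static) branch. */
        static void update_stats_before(unsigned long *wait, unsigned long *slp)
        {
                if (stats_enabled()) stat_add(wait, 1);
                if (stats_enabled()) stat_add(slp, 1);
        }

        /* After: the caller is gated once, updates inside are unconditional. */
        static void update_stats_after(unsigned long *wait, unsigned long *slp)
        {
                stat_add(wait, 1);
                stat_add(slp, 1);
        }

        static void caller(unsigned long *wait, unsigned long *slp)
        {
                if (stats_enabled())            /* single gate for the block */
                        update_stats_after(wait, slp);
        }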
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2ed41a55
  10. 24 January 2018: 1 commit
  11. 10 January 2018: 4 commits
    • sched/deadline: Make bandwidth enforcement scale-invariant · 07881166
      Juri Lelli authored
      Apply frequency and CPU scale-invariance correction factor to bandwidth
      enforcement (similar to what we already do to fair utilization tracking).
      
      Each delta_exec gets scaled considering the current frequency and maximum
      CPU capacity, which means that the reservation runtime parameter (which
      needs to be specified by profiling the task execution at max frequency on
      the biggest-capacity core) gets scaled accordingly.
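      The scaling can be sketched as two successive capacity-style
      multiplications of delta_exec (illustrative; a SCHED_CAPACITY_SCALE-like
      1024 fixed point is assumed and the helpers stand in for the arch hooks):

        #define CAP_UNIT        1024UL  /* assumed fixed-point unit (2^10) */

        static unsigned long long cap_scale_sketch(unsigned long long delta,
                                                   unsigned long scale)
        {
                return (delta * scale) >> 10;   /* divide by CAP_UNIT */
        }

        /* Charge runtime scaled by current frequency and by the CPU's max
         * capacity, so a reservation profiled at max frequency on the
         * biggest-capacity core is accounted consistently everywhere. */
        static unsigned long long scale_dl_runtime(unsigned long long delta_exec,
                                                   unsigned long freq_scale,
                                                   unsigned long cpu_scale)
        {
                delta_exec = cap_scale_sketch(delta_exec, freq_scale);
                delta_exec = cap_scale_sketch(delta_exec, cpu_scale);

                return delta_exec;
        }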
      Signed-off-by: Juri Lelli <juri.lelli@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Claudio Scordino <claudio@evidence.eu.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: alessio.balsini@arm.com
      Cc: bristot@redhat.com
      Cc: dietmar.eggemann@arm.com
      Cc: joelaf@google.com
      Cc: juri.lelli@redhat.com
      Cc: mathieu.poirier@linaro.org
      Cc: morten.rasmussen@arm.com
      Cc: patrick.bellasi@arm.com
      Cc: rjw@rjwysocki.net
      Cc: rostedt@goodmis.org
      Cc: tkjos@android.com
      Cc: tommaso.cucinotta@santannapisa.it
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/20171204102325.5110-9-juri.lelli@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      07881166
    • sched/cpufreq: Remove arch_scale_freq_capacity()'s 'sd' parameter · 7673c8a4
      Juri Lelli authored
      The 'sd' parameter is never used in arch_scale_freq_capacity() (and it's hard to
      see where information coming from scheduling domains might help with
      frequency-invariance scaling).
      
      Remove it; also in anticipation of moving arch_scale_freq_capacity()
      outside CONFIG_SMP.
      Signed-off-by: Juri Lelli <juri.lelli@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: alessio.balsini@arm.com
      Cc: bristot@redhat.com
      Cc: claudio@evidence.eu.com
      Cc: dietmar.eggemann@arm.com
      Cc: joelaf@google.com
      Cc: juri.lelli@redhat.com
      Cc: luca.abeni@santannapisa.it
      Cc: mathieu.poirier@linaro.org
      Cc: morten.rasmussen@arm.com
      Cc: patrick.bellasi@arm.com
      Cc: rjw@rjwysocki.net
      Cc: rostedt@goodmis.org
      Cc: tkjos@android.com
      Cc: tommaso.cucinotta@santannapisa.it
      Cc: vincent.guittot@linaro.org
      Cc: viresh.kumar@linaro.org
      Link: http://lkml.kernel.org/r/20171204102325.5110-7-juri.lelli@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7673c8a4
    • sched/fair: Only immediately migrate tasks due to interrupts if prev and target CPUs share cache · 7332dec0
      Mel Gorman authored
      If waking from an idle CPU due to an interrupt then it's possible that
      the waker task will be pulled to wake on the current CPU. Unfortunately,
      depending on the type of interrupt and IRQ configuration, there may not
      be a strong relationship between the CPU an interrupt was delivered on
      and the CPU a task was running on. For example, the interrupts could all
      be delivered to CPUs on one particular node due to the machine topology
      or IRQ affinity configuration. Another example is an interrupt for an IO
      completion which can be delivered to any CPU where there is no guarantee
      the data is either cache hot or even local.
      
      This patch was motivated by the observation that an IO workload was
      being pulled cross-node on a frequent basis when IO completed.  From a
      wakeup latency perspective, it's still useful to know that an idle CPU is
      immediately available for use, but let's only consider an automatic migration
      if the CPUs share cache, to limit damage due to NUMA migrations. Migrations
      may still occur if wake_affine_weight() determines it's appropriate.
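      The narrowed condition is small enough to sketch directly (illustrative
      shape of the check, not the kernel's exact wake_affine_idle()):

        #include <stdbool.h>

        /* Only treat "this CPU is idle because we are handling an interrupt"
         * as a reason to pull the task if this CPU and the task's previous
         * CPU share a cache; otherwise IRQ topology could drag tasks across
         * NUMA nodes. */
        static bool pull_to_interrupted_cpu(bool this_cpu_idle, bool share_cache)
        {
                return this_cpu_idle && share_cache;
        }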
      
      These are the throughput results for dbench running on ext4 comparing
      4.15-rc3 and this patch on a 2-socket machine where interrupts due to IO
      completions can happen on any CPU.
      
                                4.15.0-rc3             4.15.0-rc3
                                   vanilla            lessmigrate
      Hmean     1        854.64 (   0.00%)      865.01 (   1.21%)
      Hmean     2       1229.60 (   0.00%)     1274.44 (   3.65%)
      Hmean     4       1591.81 (   0.00%)     1628.08 (   2.28%)
      Hmean     8       1845.04 (   0.00%)     1831.80 (  -0.72%)
      Hmean     16      2038.61 (   0.00%)     2091.44 (   2.59%)
      Hmean     32      2327.19 (   0.00%)     2430.29 (   4.43%)
      Hmean     64      2570.61 (   0.00%)     2568.54 (  -0.08%)
      Hmean     128     2481.89 (   0.00%)     2499.28 (   0.70%)
      Stddev    1         14.31 (   0.00%)        5.35 (  62.65%)
      Stddev    2         21.29 (   0.00%)       11.09 (  47.92%)
      Stddev    4          7.22 (   0.00%)        6.80 (   5.92%)
      Stddev    8         26.70 (   0.00%)        9.41 (  64.76%)
      Stddev    16        22.40 (   0.00%)       20.01 (  10.70%)
      Stddev    32        45.13 (   0.00%)       44.74 (   0.85%)
      Stddev    64        93.10 (   0.00%)       93.18 (  -0.09%)
      Stddev    128      184.28 (   0.00%)      177.85 (   3.49%)
      
      Note the small increase in throughput for low thread counts, but also
      note that the standard deviation for each sample during the test run is
      lower. The throughput figures for dbench can be misleading, so the benchmark
      was modified to time the latency of processing one load file, with many
      samples taken. The difference in latency is:
      
                                 4.15.0-rc3             4.15.0-rc3
                                    vanilla            lessmigrate
      Amean      1         21.71 (   0.00%)       21.47 (   1.08%)
      Amean      2         30.89 (   0.00%)       29.58 (   4.26%)
      Amean      4         47.54 (   0.00%)       46.61 (   1.97%)
      Amean      8         82.71 (   0.00%)       82.81 (  -0.12%)
      Amean      16       149.45 (   0.00%)      145.01 (   2.97%)
      Amean      32       265.49 (   0.00%)      248.43 (   6.42%)
      Amean      64       463.23 (   0.00%)      463.55 (  -0.07%)
      Amean      128      933.97 (   0.00%)      935.50 (  -0.16%)
      Stddev     1          1.58 (   0.00%)        1.54 (   2.26%)
      Stddev     2          2.84 (   0.00%)        2.95 (  -4.15%)
      Stddev     4          6.78 (   0.00%)        6.85 (  -0.99%)
      Stddev     8         16.85 (   0.00%)       16.37 (   2.85%)
      Stddev     16        41.59 (   0.00%)       41.04 (   1.32%)
      Stddev     32       111.05 (   0.00%)      105.11 (   5.35%)
      Stddev     64       285.94 (   0.00%)      288.01 (  -0.72%)
      Stddev     128      803.39 (   0.00%)      809.73 (  -0.79%)
      
      It's a small improvement, which is not surprising given that migrations
      to a different node are not that common. However, it is noticeable in
      the CPU migration statistics, which are reduced by 24%.
      
      There was a query about NAS for v1 of this patch, so here are the results
      for C-class using MPI for parallelisation on the same machine:
      
      nas-mpi
                            4.15.0-rc3             4.15.0-rc3
                               vanilla                  noirq
      Time cg.C       24.25 (   0.00%)       23.17 (   4.45%)
      Time ep.C        8.22 (   0.00%)        8.29 (  -0.85%)
      Time ft.C       22.67 (   0.00%)       20.34 (  10.28%)
      Time is.C        1.42 (   0.00%)        1.47 (  -3.52%)
      Time lu.C       55.62 (   0.00%)       54.81 (   1.46%)
      Time mg.C        7.93 (   0.00%)        7.91 (   0.25%)
      
                4.15.0-rc3  4.15.0-rc3
                   vanilla  noirq-v1r1
      User         3799.96     3748.34
      System        672.10      626.15
      Elapsed        91.91       79.49
      
      lu.C sees a small gain, ft.C a large gain, and ep.C and is.C see small
      regressions, but in terms of absolute time the difference is small and
      likely within run-to-run variance. System CPU usage is slightly reduced.
      
      schbench from Facebook was also requested. This is a bit of a mixed bag but
      it's important to note that this workload should not be heavily impacted
      by wakeups from interrupt context.
      
                                       4.15.0-rc3             4.15.0-rc3
                                          vanilla             noirq-v1r1
      Lat 50.00th-qrtle-1        41.00 (   0.00%)       41.00 (   0.00%)
      Lat 75.00th-qrtle-1        42.00 (   0.00%)       42.00 (   0.00%)
      Lat 90.00th-qrtle-1        43.00 (   0.00%)       44.00 (  -2.33%)
      Lat 95.00th-qrtle-1        44.00 (   0.00%)       46.00 (  -4.55%)
      Lat 99.00th-qrtle-1        57.00 (   0.00%)       58.00 (  -1.75%)
      Lat 99.50th-qrtle-1        59.00 (   0.00%)       59.00 (   0.00%)
      Lat 99.90th-qrtle-1        67.00 (   0.00%)       78.00 ( -16.42%)
      Lat 50.00th-qrtle-2        40.00 (   0.00%)       51.00 ( -27.50%)
      Lat 75.00th-qrtle-2        45.00 (   0.00%)       56.00 ( -24.44%)
      Lat 90.00th-qrtle-2        53.00 (   0.00%)       59.00 ( -11.32%)
      Lat 95.00th-qrtle-2        57.00 (   0.00%)       61.00 (  -7.02%)
      Lat 99.00th-qrtle-2        67.00 (   0.00%)       71.00 (  -5.97%)
      Lat 99.50th-qrtle-2        69.00 (   0.00%)       74.00 (  -7.25%)
      Lat 99.90th-qrtle-2        83.00 (   0.00%)       77.00 (   7.23%)
      Lat 50.00th-qrtle-4        51.00 (   0.00%)       51.00 (   0.00%)
      Lat 75.00th-qrtle-4        57.00 (   0.00%)       56.00 (   1.75%)
      Lat 90.00th-qrtle-4        60.00 (   0.00%)       59.00 (   1.67%)
      Lat 95.00th-qrtle-4        62.00 (   0.00%)       62.00 (   0.00%)
      Lat 99.00th-qrtle-4        73.00 (   0.00%)       72.00 (   1.37%)
      Lat 99.50th-qrtle-4        76.00 (   0.00%)       74.00 (   2.63%)
      Lat 99.90th-qrtle-4        85.00 (   0.00%)       78.00 (   8.24%)
      Lat 50.00th-qrtle-8        54.00 (   0.00%)       58.00 (  -7.41%)
      Lat 75.00th-qrtle-8        59.00 (   0.00%)       62.00 (  -5.08%)
      Lat 90.00th-qrtle-8        65.00 (   0.00%)       66.00 (  -1.54%)
      Lat 95.00th-qrtle-8        67.00 (   0.00%)       70.00 (  -4.48%)
      Lat 99.00th-qrtle-8        78.00 (   0.00%)       79.00 (  -1.28%)
      Lat 99.50th-qrtle-8        81.00 (   0.00%)       80.00 (   1.23%)
      Lat 99.90th-qrtle-8       116.00 (   0.00%)       83.00 (  28.45%)
      Lat 50.00th-qrtle-16       65.00 (   0.00%)       64.00 (   1.54%)
      Lat 75.00th-qrtle-16       77.00 (   0.00%)       71.00 (   7.79%)
      Lat 90.00th-qrtle-16       83.00 (   0.00%)       82.00 (   1.20%)
      Lat 95.00th-qrtle-16       87.00 (   0.00%)       87.00 (   0.00%)
      Lat 99.00th-qrtle-16       95.00 (   0.00%)       96.00 (  -1.05%)
      Lat 99.50th-qrtle-16       99.00 (   0.00%)      103.00 (  -4.04%)
      Lat 99.90th-qrtle-16      104.00 (   0.00%)      122.00 ( -17.31%)
      Lat 50.00th-qrtle-32       71.00 (   0.00%)       73.00 (  -2.82%)
      Lat 75.00th-qrtle-32       91.00 (   0.00%)       92.00 (  -1.10%)
      Lat 90.00th-qrtle-32      108.00 (   0.00%)      107.00 (   0.93%)
      Lat 95.00th-qrtle-32      118.00 (   0.00%)      115.00 (   2.54%)
      Lat 99.00th-qrtle-32      134.00 (   0.00%)      129.00 (   3.73%)
      Lat 99.50th-qrtle-32      138.00 (   0.00%)      133.00 (   3.62%)
      Lat 99.90th-qrtle-32      149.00 (   0.00%)      146.00 (   2.01%)
      Lat 50.00th-qrtle-39       83.00 (   0.00%)       81.00 (   2.41%)
      Lat 75.00th-qrtle-39      105.00 (   0.00%)      102.00 (   2.86%)
      Lat 90.00th-qrtle-39      120.00 (   0.00%)      119.00 (   0.83%)
      Lat 95.00th-qrtle-39      129.00 (   0.00%)      128.00 (   0.78%)
      Lat 99.00th-qrtle-39      153.00 (   0.00%)      149.00 (   2.61%)
      Lat 99.50th-qrtle-39      166.00 (   0.00%)      156.00 (   6.02%)
      Lat 99.90th-qrtle-39    12304.00 (   0.00%)    12848.00 (  -4.42%)
      
      When heavily loaded (e.g. 99.50th-qrtle-39 indicates 39 threads), there
      are small gains in many cases. Otherwise it depends on the quartile used,
      where results can be bad -- e.g. 75.00th-qrtle-2. However, even these results
      are probably a coincidence. For this workload, much depends on which node
      the threads get placed on and their relative locality, not on wakeups from
      interrupt context. A larger factor in how it behaves would be automatic
      NUMA balancing, where a fault incurred to measure locality would be a much
      larger contributor to latency than the wakeup path.
      
      These are the results from an almost identical machine that happened to run
      the same test. The machines only differ in terms of storage, which is
      irrelevant for this test.
      
                                       4.15.0-rc3             4.15.0-rc3
                                          vanilla             noirq-v1r1
      Lat 50.00th-qrtle-1        41.00 (   0.00%)       41.00 (   0.00%)
      Lat 75.00th-qrtle-1        42.00 (   0.00%)       42.00 (   0.00%)
      Lat 90.00th-qrtle-1        44.00 (   0.00%)       43.00 (   2.27%)
      Lat 95.00th-qrtle-1        53.00 (   0.00%)       45.00 (  15.09%)
      Lat 99.00th-qrtle-1        59.00 (   0.00%)       58.00 (   1.69%)
      Lat 99.50th-qrtle-1        60.00 (   0.00%)       59.00 (   1.67%)
      Lat 99.90th-qrtle-1        86.00 (   0.00%)       61.00 (  29.07%)
      Lat 50.00th-qrtle-2        52.00 (   0.00%)       41.00 (  21.15%)
      Lat 75.00th-qrtle-2        57.00 (   0.00%)       46.00 (  19.30%)
      Lat 90.00th-qrtle-2        60.00 (   0.00%)       53.00 (  11.67%)
      Lat 95.00th-qrtle-2        62.00 (   0.00%)       57.00 (   8.06%)
      Lat 99.00th-qrtle-2        73.00 (   0.00%)       68.00 (   6.85%)
      Lat 99.50th-qrtle-2        74.00 (   0.00%)       71.00 (   4.05%)
      Lat 99.90th-qrtle-2        90.00 (   0.00%)       75.00 (  16.67%)
      Lat 50.00th-qrtle-4        57.00 (   0.00%)       52.00 (   8.77%)
      Lat 75.00th-qrtle-4        60.00 (   0.00%)       58.00 (   3.33%)
      Lat 90.00th-qrtle-4        62.00 (   0.00%)       62.00 (   0.00%)
      Lat 95.00th-qrtle-4        65.00 (   0.00%)       65.00 (   0.00%)
      Lat 99.00th-qrtle-4        76.00 (   0.00%)       75.00 (   1.32%)
      Lat 99.50th-qrtle-4        77.00 (   0.00%)       77.00 (   0.00%)
      Lat 99.90th-qrtle-4        87.00 (   0.00%)       81.00 (   6.90%)
      Lat 50.00th-qrtle-8        59.00 (   0.00%)       57.00 (   3.39%)
      Lat 75.00th-qrtle-8        63.00 (   0.00%)       62.00 (   1.59%)
      Lat 90.00th-qrtle-8        66.00 (   0.00%)       67.00 (  -1.52%)
      Lat 95.00th-qrtle-8        68.00 (   0.00%)       70.00 (  -2.94%)
      Lat 99.00th-qrtle-8        79.00 (   0.00%)       80.00 (  -1.27%)
      Lat 99.50th-qrtle-8        80.00 (   0.00%)       84.00 (  -5.00%)
      Lat 99.90th-qrtle-8        84.00 (   0.00%)       90.00 (  -7.14%)
      Lat 50.00th-qrtle-16       65.00 (   0.00%)       65.00 (   0.00%)
      Lat 75.00th-qrtle-16       77.00 (   0.00%)       75.00 (   2.60%)
      Lat 90.00th-qrtle-16       84.00 (   0.00%)       83.00 (   1.19%)
      Lat 95.00th-qrtle-16       88.00 (   0.00%)       87.00 (   1.14%)
      Lat 99.00th-qrtle-16       97.00 (   0.00%)       96.00 (   1.03%)
      Lat 99.50th-qrtle-16      100.00 (   0.00%)      104.00 (  -4.00%)
      Lat 99.90th-qrtle-16      110.00 (   0.00%)      126.00 ( -14.55%)
      Lat 50.00th-qrtle-32       70.00 (   0.00%)       71.00 (  -1.43%)
      Lat 75.00th-qrtle-32       92.00 (   0.00%)       94.00 (  -2.17%)
      Lat 90.00th-qrtle-32      110.00 (   0.00%)      110.00 (   0.00%)
      Lat 95.00th-qrtle-32      121.00 (   0.00%)      118.00 (   2.48%)
      Lat 99.00th-qrtle-32      135.00 (   0.00%)      137.00 (  -1.48%)
      Lat 99.50th-qrtle-32      140.00 (   0.00%)      146.00 (  -4.29%)
      Lat 99.90th-qrtle-32      150.00 (   0.00%)      160.00 (  -6.67%)
      Lat 50.00th-qrtle-39       80.00 (   0.00%)       71.00 (  11.25%)
      Lat 75.00th-qrtle-39      102.00 (   0.00%)       91.00 (  10.78%)
      Lat 90.00th-qrtle-39      118.00 (   0.00%)      108.00 (   8.47%)
      Lat 95.00th-qrtle-39      128.00 (   0.00%)      117.00 (   8.59%)
      Lat 99.00th-qrtle-39      149.00 (   0.00%)      133.00 (  10.74%)
      Lat 99.50th-qrtle-39      160.00 (   0.00%)      139.00 (  13.12%)
      Lat 99.90th-qrtle-39    13808.00 (   0.00%)     4920.00 (  64.37%)
      
      Despite the machines being nearly identical, the results show a variety of
      major gains, so I'm not convinced that heavy emphasis should be placed on
      this particular workload when evaluating this particular patch. Further
      evidence of this is the fact that testing on a UMA machine showed small
      gains/losses even though the patch should be a no-op on UMA.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20171219085947.13136-2-mgorman@techsingularity.net
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      7332dec0
    • J
      sched/fair: Correct obsolete comment about cpufreq_update_util() · 9783be2c
      Joel Fernandes 提交于
      Since the remote cpufreq callback work was merged, cpufreq_update_util() can
      be called from remote CPUs. The comment stating that it can only happen on
      local CPUs is thus obsolete. Update it accordingly.
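
      For context, a simplified sketch of the call path the comment describes,
      loosely based on cpufreq_update_util() in kernel/sched/sched.h of this era
      (exact field and helper details may differ between kernel versions): the
      governor callback is looked up for the CPU that owns the runqueue, which,
      after the remote callback work, is not necessarily the CPU executing this
      code.

      static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
      {
              struct update_util_data *data;

              /* Callback registered for rq's CPU, which may be a remote CPU. */
              data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
                                                        cpu_of(rq)));
              if (data)
                      data->func(data, rq_clock(rq), flags);
      }
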
      Signed-off-by: NJoel Fernandes <joelaf@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NViresh Kumar <viresh.kumar@linaro.org>
      Cc: Android Kernel <kernel-team@android.com>
      Cc: Atish Patra <atish.patra@oracle.com>
      Cc: Chris Redpath <Chris.Redpath@arm.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: EAS Dev <eas-dev@lists.linaro.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: Rohit Jain <rohit.k.jain@oracle.com>
      Cc: Saravana Kannan <skannan@quicinc.com>
      Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vikram Mulukutla <markivx@codeaurora.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/20171215153944.220146-2-joelaf@google.com
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      9783be2c