1. 20 Dec, 2018 1 commit
  2. 01 Dec, 2018 1 commit
    • P
      sched/fair: Fix cpu_util_wake() for 'execl' type workloads · 08fbd4e0
      Committed by Patrick Bellasi
      [ Upstream commit c469933e772132aad040bd6a2adc8edf9ad6f825 ]
      
      A ~10% regression has been reported for UnixBench's execl throughput
      test by Aaron Lu and Ye Xiaolong:
      
        https://lkml.org/lkml/2018/10/30/765
      
      That test is pretty simple, it does a "recursive" execve() syscall on the
      same binary. Starting from the syscall, this sequence is possible:
      
         do_execve()
           do_execveat_common()
             __do_execve_file()
               sched_exec()
                 select_task_rq_fair()          <==| Task already enqueued
                   find_idlest_cpu()
                     find_idlest_group()
                       capacity_spare_wake()    <==| Functions not called from
                         cpu_util_wake()           | the wakeup path
      
      which means we can end up calling cpu_util_wake() not only from the
      "wakeup path", as its name would suggest. Indeed, the task doing an
      execve() syscall is already enqueued on the CPU we want to get the
      cpu_util_wake() for.
      
      The estimated utilization for a CPU computed in cpu_util_wake() was
      written under the assumption that function can be called only from the
      wakeup path. If instead the task is already enqueued, we end up with a
      utilization which does not remove the current task's contribution from
      the estimated utilization of the CPU.
      This will wrongly assume a reduced spare capacity on the current CPU and
      increase the chances to migrate the task on execve.
      
      The regression is tracked down to:
      
       commit d519329f ("sched/fair: Update util_est only on util_avg updates")
      
      because in that patch we turn on by default the UTIL_EST sched feature.
      However, the real issue is introduced by:
      
       commit f9be3e59 ("sched/fair: Use util_est in LB and WU paths")
      
      Let's fix this by ensuring to always discount the task estimated
      utilization from the CPU's estimated utilization when the task is also
      the current one. The same benchmark of the bug report, executed on a
      dual socket 40 CPUs Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz machine,
      reports these "Execl Throughput" figures (higher is better):
      
         mainline     : 48136.5 lps
         mainline+fix : 55376.5 lps
      
      which corresponds to a 15% speedup.
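The discounting described above can be sketched in user-space C. This is a minimal illustration, not the kernel implementation: `struct cpu_state`, `struct task_state` and the `*_sketch` function names are hypothetical stand-ins, and the real code works on PELT signals inside the run queue rather than on plain integers.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for kernel state (assumptions). */
struct cpu_state {
	unsigned long util_est;  /* estimated utilization of all enqueued tasks */
	unsigned long capacity;  /* maximum capacity of the CPU */
};

struct task_state {
	unsigned long util_est;  /* estimated utilization of this task */
	bool enqueued_here;      /* true if the task is enqueued on this CPU */
};

/* Estimated CPU utilization with the task's own contribution removed. */
static unsigned long cpu_util_without_sketch(const struct cpu_state *cpu,
					     const struct task_state *p)
{
	unsigned long util = cpu->util_est;

	/* Discount the task only when it is counted in the CPU estimate. */
	if (p->enqueued_here)
		util = util > p->util_est ? util - p->util_est : 0;

	/* Never report more than the CPU can provide. */
	return util < cpu->capacity ? util : cpu->capacity;
}

/* Spare capacity the task would see if it were placed on this CPU. */
static unsigned long capacity_spare_without_sketch(const struct cpu_state *cpu,
						   const struct task_state *p)
{
	return cpu->capacity - cpu_util_without_sketch(cpu, p);
}
```

With a CPU estimate of 600, a task estimate of 200 and a capacity of 1024, the sketch reports a spare capacity of 624 when the task is already enqueued on that CPU. Without the discount the figure would be 424, which is the underestimate that made migration on execve() more likely.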
      
      Moreover, since {cpu_util,capacity_spare}_wake() are not really only
      used from the wakeup path, let's remove this ambiguity by using a better
      matching name: {cpu_util,capacity_spare}_without().
      
      While we are at it, let's also improve the existing documentation.

      Reported-by: Aaron Lu <aaron.lu@intel.com>
      Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
      Tested-by: Aaron Lu <aaron.lu@intel.com>
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Fixes: f9be3e59 (sched/fair: Use util_est in LB and WU paths)
      Link: https://lore.kernel.org/lkml/20181025093100.GB13236@e110439-lin/
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  3. 16 Oct, 2018 1 commit
  4. 11 Oct, 2018 1 commit
    • P
      sched/fair: Fix throttle_list starvation with low CFS quota · baa9be4f
      Committed by Phil Auld
      With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
      distribute_cfs_runtime may not empty the throttled_list before it runs
      out of runtime to distribute. In that case, due to the change from
      c06f04c7 to put throttled entries at the head of the list, later entries
      on the list will starve.  Essentially, the same X processes will get pulled
      off the list, given CPU time and then, when expired, get put back on the
      head of the list where distribute_cfs_runtime will give runtime to the same
      set of processes leaving the rest.
      
      Fix the issue by setting a bit in struct cfs_bandwidth when
      distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
      decide to put the throttled entry on the tail or the head of the list.  The
      bit is set/cleared by the callers of distribute_cfs_runtime while they hold
      cfs_bandwidth->lock.
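The head-or-tail decision can be illustrated with a small user-space list. This is a sketch under stated assumptions: `struct node`, `struct bandwidth` and `throttle_sketch()` are hypothetical names, locking and RCU are omitted, and the rule shown (head while a distribution pass is running, tail otherwise) is how the fix is understood to behave.

```c
#include <stdbool.h>

/* Hypothetical circular doubly linked list node (assumption). */
struct node {
	struct node *prev, *next;
	int id;
};

/* Hypothetical stand-in for struct cfs_bandwidth. */
struct bandwidth {
	struct node throttled;    /* sentinel heading the throttled list */
	bool distribute_running;  /* set by callers of the distribution pass */
};

static void bandwidth_init(struct bandwidth *b)
{
	b->throttled.prev = b->throttled.next = &b->throttled;
	b->throttled.id = -1;
	b->distribute_running = false;
}

static void insert_between(struct node *n, struct node *prev, struct node *next)
{
	n->prev = prev;
	n->next = next;
	prev->next = n;
	next->prev = n;
}

/* Queue a newly throttled entry. */
static void throttle_sketch(struct bandwidth *b, struct node *n)
{
	if (b->distribute_running)
		/* Head: a pass already in progress will not revisit it. */
		insert_between(n, &b->throttled, b->throttled.next);
	else
		/* Tail: older entries are served first, so none starve. */
		insert_between(n, b->throttled.prev, &b->throttled);
}
```

Entries throttled while no distribution is running therefore queue in FIFO order, and only entries throttled during a pass go to the head.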
      
      This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
      the live system. In some cases you can simply look at the throttled list and
      see the later entries are not changing:
      
        crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
          1     ffff90b56cb2d200  -976050
          2     ffff90b56cb2cc00  -484925
          3     ffff90b56cb2bc00  -658814
          4     ffff90b56cb2ba00  -275365
          5     ffff90b166a45600  -135138
          6     ffff90b56cb2da00  -282505
          7     ffff90b56cb2e000  -148065
          8     ffff90b56cb2fa00  -872591
          9     ffff90b56cb2c000  -84687
         10     ffff90b56cb2f000  -87237
         11     ffff90b166a40a00  -164582
      
        crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
          1     ffff90b56cb2d200  -994147
          2     ffff90b56cb2cc00  -306051
          3     ffff90b56cb2bc00  -961321
          4     ffff90b56cb2ba00  -24490
          5     ffff90b166a45600  -135138
          6     ffff90b56cb2da00  -282505
          7     ffff90b56cb2e000  -148065
          8     ffff90b56cb2fa00  -872591
          9     ffff90b56cb2c000  -84687
         10     ffff90b56cb2f000  -87237
         11     ffff90b166a40a00  -164582
      
      Sometimes it is easier to see by finding a process getting starved and looking
      at the sched_info:
      
        crash> task ffff8eb765994500 sched_info
        PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
          sched_info = {
            pcount = 8,
            run_delay = 697094208,
            last_arrival = 240260125039,
            last_queued = 240260327513
          },
        crash> task ffff8eb765994500 sched_info
        PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
          sched_info = {
            pcount = 8,
            run_delay = 697094208,
            last_arrival = 240260125039,
            last_queued = 240260327513
          },

      Signed-off-by: Phil Auld <pauld@redhat.com>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Fixes: c06f04c7 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
      Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 02 Oct, 2018 6 commits
    • M
      sched/numa: Migrate pages to local nodes quicker early in the lifetime of a task · 37355bdc
      Committed by Mel Gorman
      Automatic NUMA Balancing uses a multi-stage pass to decide whether a page
      should migrate to a local node. This filter avoids excessive ping-ponging
      if a page is shared or used by threads that migrate cross-node frequently.
      
      Threads inherit both page tables and the preferred node ID from the
      parent. This means that threads can trigger hinting faults earlier than
      a new task which delays scanning for a number of seconds. As it can be
      load balanced very early in its lifetime there can be an unnecessary delay
      before it starts migrating thread-local data. This patch migrates private
      pages faster early in the lifetime of a thread using the sequence counter
      as an identifier of new tasks.
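The idea of using the scan sequence counter to recognise a young task can be sketched as follows. This is illustrative only: `struct task_numa`, `EARLY_SCAN_SEQ_LIMIT` and its value are hypothetical and are not taken from the patch, which applies its check inside the existing multi-stage migration filter.

```c
#include <stdbool.h>

/* Hypothetical cut-off for "early in the lifetime" (assumption). */
#define EARLY_SCAN_SEQ_LIMIT 4

/* Hypothetical stand-in for the per-task NUMA state. */
struct task_numa {
	unsigned int scan_seq;  /* number of completed NUMA scan passes */
};

/*
 * Decide whether a faulting page may bypass the multi-stage filter.
 * Only private (thread-local) faults of a young task qualify.
 */
static bool migrate_early_sketch(const struct task_numa *t, bool private_fault)
{
	bool young = t->scan_seq <= EARLY_SCAN_SEQ_LIMIT;

	return young && private_fault;
}
```

Shared faults and faults from long-running tasks still go through the normal filter in this sketch, which keeps the protection against ping-ponging.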
      
      With this patch applied, STREAM performance is the same as 4.17 even though
      processes are not spread cross-node prematurely. Other workloads showed
      a mix of minor gains and losses. This is somewhat expected, as most workloads
      are not very sensitive to the starting conditions of a process.
      
                               4.19.0-rc5             4.19.0-rc5                 4.17.0
                               numab-v1r1       fastmigrate-v1r1                vanilla
      MB/sec copy     43298.52 (   0.00%)    47335.46 (   9.32%)    47219.24 (   9.06%)
      MB/sec scale    30115.06 (   0.00%)    32568.12 (   8.15%)    32527.56 (   8.01%)
      MB/sec add      32825.12 (   0.00%)    36078.94 (   9.91%)    35928.02 (   9.45%)
      MB/sec triad    32549.52 (   0.00%)    35935.94 (  10.40%)    35969.88 (  10.51%)

      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linux-MM <linux-mm@kvack.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20181001100525.29789-3-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Avoid task migration for small NUMA improvement · 6fd98e77
      Committed by Srikar Dronamraju
      If NUMA improvement from the task migration is going to be very
      minimal, then avoid task migration.
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     198512  205910   3.72673
      1     313559  318491   1.57291
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev     Current  %Change
      8     74761.9  74935.9  0.232739
      1     214874   226796   5.54837
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     180536  189780   5.12031
      1     210281  205695   -2.18089
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     56511.4  60370    6.828
      1     104899   108100   3.05151
      
      1/7 cases is regressing. Looking at the events, migrate_pages seems
      to vary the most, especially in the regressing case. Some amount
      of variance is also expected between different runs of
      Specjbb2005.
      
      Some events stats before and after applying the patch.
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,818,546      13,801,554
      migrations                1,149,960       1,151,541
      faults                    385,583         433,246
      cache-misses              55,259,546,768  55,168,691,835
      sched:sched_move_numa     2,257           2,551
      sched:sched_stick_numa    9               24
      sched:sched_swap_numa     512             904
      migrate:mm_migrate_pages  2,225           1,571
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        72692   113682
      numa_hint_faults_local  62270   102163
      numa_hit                238762  240181
      numa_huge_pte_updates   48      36
      numa_interleave         75      64
      numa_local              238676  240103
      numa_other              86      78
      numa_pages_migrated     2225    1564
      numa_pte_updates        98557   134080
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,173,490       3,079,150
      migrations                36,966          31,455
      faults                    108,776         99,081
      cache-misses              12,200,075,320  11,588,126,740
      sched:sched_move_numa     1,264           1
      sched:sched_stick_numa    0               0
      sched:sched_swap_numa     0               0
      migrate:mm_migrate_pages  899             36
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        21109   430
      numa_hint_faults_local  17120   77
      numa_hit                72934   71277
      numa_huge_pte_updates   42      0
      numa_interleave         33      22
      numa_local              72866   71218
      numa_other              68      59
      numa_pages_migrated     915     23
      numa_pte_updates        42326   0
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,312,022    8,707,565
      migrations                231,705      171,342
      faults                    310,242      310,820
      cache-misses              402,324,573  136,115,400
      sched:sched_move_numa     193          215
      sched:sched_stick_numa    0            6
      sched:sched_swap_numa     3            24
      migrate:mm_migrate_pages  93           162
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        11838   8985
      numa_hint_faults_local  11216   8154
      numa_hit                90689   93819
      numa_huge_pte_updates   0       0
      numa_interleave         1579    882
      numa_local              89634   93496
      numa_other              1055    323
      numa_pages_migrated     92      169
      numa_pte_updates        12109   9217
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before      After
      cs                        2,170,481   2,152,072
      migrations                10,126      10,704
      faults                    160,962     164,376
      cache-misses              10,834,845  3,818,437
      sched:sched_move_numa     10          16
      sched:sched_stick_numa    0           0
      sched:sched_swap_numa     0           7
      migrate:mm_migrate_pages  2           199
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        403     2248
      numa_hint_faults_local  358     1666
      numa_hit                25898   25704
      numa_huge_pte_updates   0       0
      numa_interleave         207     200
      numa_local              25860   25679
      numa_other              38      25
      numa_pages_migrated     2       197
      numa_pte_updates        400     2234
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        110,339,633      93,330,595
      migrations                4,139,812        4,122,061
      faults                    863,622          865,979
      cache-misses              231,838,045,660  225,395,083,479
      sched:sched_move_numa     2,196            2,372
      sched:sched_stick_numa    33               24
      sched:sched_swap_numa     544              769
      migrate:mm_migrate_pages  2,469            1,677
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        85748   91638
      numa_hint_faults_local  66831   78096
      numa_hit                242213  242225
      numa_huge_pte_updates   0       0
      numa_interleave         0       2
      numa_local              242211  242219
      numa_other              2       6
      numa_pages_migrated     2376    1515
      numa_pte_updates        86233   92274
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        59,331,057      51,487,271
      migrations                552,019         537,170
      faults                    266,586         256,921
      cache-misses              73,796,312,990  70,073,831,187
      sched:sched_move_numa     981             576
      sched:sched_stick_numa    54              24
      sched:sched_swap_numa     286             327
      migrate:mm_migrate_pages  713             726
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        14807   12000
      numa_hint_faults_local  5738    5024
      numa_hit                36230   36470
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              36228   36465
      numa_other              2       5
      numa_pages_migrated     703     726
      numa_pte_updates        14742   11930

      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-7-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • M
      sched/numa: Limit the conditions where scan period is reset · 05cbdf4f
      Committed by Mel Gorman
      migrate_task_rq_fair() resets the scan rate for NUMA balancing on every
      cross-node migration. In the event of excessive load balancing due to
      saturation, this may result in the scan rate being pegged at maximum and
      further overloading the machine.
      
      This patch only resets the scan if NUMA balancing is active, a preferred
      node has been selected and the task is being migrated from the preferred
      node as these are the most harmful. For example, a migration to the preferred
      node does not justify a faster scan rate. Similarly, a migration between two
      nodes that are not preferred is probably bouncing due to over-saturation of
      the machine.  In that case, scanning faster and trapping more NUMA faults
      will further overload the machine.
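The three conditions listed above can be written as a single predicate. This is a sketch based only on the description in this message: `struct numa_task`, `NO_PREFERRED_NODE` and `should_reset_scan_sketch()` are hypothetical names, and the kernel code may apply further checks that are not shown here.

```c
#include <stdbool.h>

/* Hypothetical marker for "no preferred node selected yet" (assumption). */
#define NO_PREFERRED_NODE (-1)

/* Hypothetical stand-in for the per-task NUMA state. */
struct numa_task {
	int preferred_nid;  /* preferred node, or NO_PREFERRED_NODE */
};

/* Return true if a migration from src_nid to dst_nid should reset the scan. */
static bool should_reset_scan_sketch(bool balancing_active,
				     const struct numa_task *t,
				     int src_nid, int dst_nid)
{
	if (!balancing_active)
		return false;  /* NUMA balancing is off */
	if (src_nid == dst_nid)
		return false;  /* not a cross-node migration */
	if (t->preferred_nid == NO_PREFERRED_NODE)
		return false;  /* no preferred node has been selected */

	/* Only a move away from the preferred node justifies a reset. */
	return src_nid == t->preferred_nid;
}
```

A move to the preferred node, or a bounce between two non-preferred nodes, returns false here, matching the two examples given in the text.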
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     203370  205332   0.964744
      1     328431  319785   -2.63252
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev    Current  %Change
      1     206070  206585   0.249915
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     188386  189162   0.41192
      1     201566  213760   6.04963
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     59157.4  58736.8  -0.710985
      1     105495   105419   -0.0720413
      
      Some events stats before and after applying the patch.
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,825,492      14,285,708
      migrations                1,152,509       1,180,621
      faults                    371,948         339,114
      cache-misses              55,654,206,041  55,205,631,894
      sched:sched_move_numa     1,856           843
      sched:sched_stick_numa    4               6
      sched:sched_swap_numa     428             219
      migrate:mm_migrate_pages  898             365
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        57146   26907
      numa_hint_faults_local  51612   24279
      numa_hit                238164  239771
      numa_huge_pte_updates   16      0
      numa_interleave         63      68
      numa_local              238085  239688
      numa_other              79      83
      numa_pages_migrated     883     363
      numa_pte_updates        67540   27415
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,288,525       3,202,779
      migrations                38,652          37,186
      faults                    111,678         106,076
      cache-misses              12,111,197,376  12,024,873,744
      sched:sched_move_numa     900             931
      sched:sched_stick_numa    0               0
      sched:sched_swap_numa     5               1
      migrate:mm_migrate_pages  714             637
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        18572   17409
      numa_hint_faults_local  14850   14367
      numa_hit                73197   73953
      numa_huge_pte_updates   11      20
      numa_interleave         25      25
      numa_local              73138   73892
      numa_other              59      61
      numa_pages_migrated     712     668
      numa_pte_updates        24021   27276
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,451,543    8,474,013
      migrations                202,804      254,934
      faults                    310,024      320,506
      cache-misses              253,522,507  110,580,458
      sched:sched_move_numa     213          725
      sched:sched_stick_numa    0            0
      sched:sched_swap_numa     2            7
      migrate:mm_migrate_pages  88           145
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        11830   22797
      numa_hint_faults_local  11301   21539
      numa_hit                90038   89308
      numa_huge_pte_updates   0       0
      numa_interleave         855     865
      numa_local              89796   88955
      numa_other              242     353
      numa_pages_migrated     88      149
      numa_pte_updates        12039   22930
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before     After
      cs                        2,049,153  2,195,628
      migrations                11,405     11,179
      faults                    162,309    149,656
      cache-misses              7,203,343  8,117,515
      sched:sched_move_numa     22         49
      sched:sched_stick_numa    0          0
      sched:sched_swap_numa     0          0
      migrate:mm_migrate_pages  1          5
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        1693    3577
      numa_hint_faults_local  1669    3476
      numa_hit                25177   26142
      numa_huge_pte_updates   0       0
      numa_interleave         194     358
      numa_local              24993   26042
      numa_other              184     100
      numa_pages_migrated     1       5
      numa_pte_updates        1577    3587
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        94,515,937       100,602,296
      migrations                4,203,554        4,135,630
      faults                    832,697          789,256
      cache-misses              226,248,698,331  226,160,621,058
      sched:sched_move_numa     1,730            1,366
      sched:sched_stick_numa    14               16
      sched:sched_swap_numa     432              374
      migrate:mm_migrate_pages  1,398            1,350
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        80079   47857
      numa_hint_faults_local  68620   39768
      numa_hit                241187  240165
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              241186  240165
      numa_other              1       0
      numa_pages_migrated     1347    1224
      numa_pte_updates        80729   48354
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        63,704,961      58,515,496
      migrations                573,404         564,845
      faults                    230,878         245,807
      cache-misses              76,568,222,781  73,603,757,976
      sched:sched_move_numa     509             996
      sched:sched_stick_numa    31              10
      sched:sched_swap_numa     182             193
      migrate:mm_migrate_pages  541             646
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        8501    13422
      numa_hint_faults_local  2960    5619
      numa_hit                35526   36118
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              35526   36116
      numa_other              0       2
      numa_pages_migrated     539     616
      numa_pte_updates        8433    13374

      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-5-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Reset scan rate whenever task moves across nodes · 3f9672ba
      Committed by Srikar Dronamraju
      Currently task scan rate is reset when NUMA balancer migrates the task
      to a different node. If NUMA balancer initiates a swap, reset is only
      applicable to the task that initiates the swap. Similarly no scan rate
      reset is done if the task is migrated across nodes by traditional load
      balancer.
      
      Instead, move the scan reset to migrate_task_rq. This ensures that a
      task moved out of its preferred node either gets back to its preferred
      node quickly or finds a new preferred node. Doing so is fair to
      all tasks migrating across nodes.
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     200668  203370   1.3465
      1     321791  328431   2.06345
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev    Current  %Change
      1     204848  206070   0.59654
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     188098  188386   0.153112
      1     200351  201566   0.606436
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     58145.9  59157.4  1.73959
      1     103798   105495   1.63491
      
      Some events stats before and after applying the patch.
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,912,183      13,825,492
      migrations                1,155,931       1,152,509
      faults                    367,139         371,948
      cache-misses              54,240,196,814  55,654,206,041
      sched:sched_move_numa     1,571           1,856
      sched:sched_stick_numa    9               4
      sched:sched_swap_numa     463             428
      migrate:mm_migrate_pages  703             898
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        50155   57146
      numa_hint_faults_local  45264   51612
      numa_hit                239652  238164
      numa_huge_pte_updates   36      16
      numa_interleave         68      63
      numa_local              239576  238085
      numa_other              76      79
      numa_pages_migrated     680     883
      numa_pte_updates        71146   67540
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,156,720       3,288,525
      migrations                30,354          38,652
      faults                    97,261          111,678
      cache-misses              12,400,026,826  12,111,197,376
      sched:sched_move_numa     4               900
      sched:sched_stick_numa    0               0
      sched:sched_swap_numa     1               5
      migrate:mm_migrate_pages  20              714
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        272     18572
      numa_hint_faults_local  186     14850
      numa_hit                71362   73197
      numa_huge_pte_updates   0       11
      numa_interleave         23      25
      numa_local              71299   73138
      numa_other              63      59
      numa_pages_migrated     2       712
      numa_pte_updates        0       24021
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,606,824    8,451,543
      migrations                155,352      202,804
      faults                    301,409      310,024
      cache-misses              157,759,224  253,522,507
      sched:sched_move_numa     168          213
      sched:sched_stick_numa    0            0
      sched:sched_swap_numa     3            2
      migrate:mm_migrate_pages  125          88
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        4650    11830
      numa_hint_faults_local  3946    11301
      numa_hit                90489   90038
      numa_huge_pte_updates   0       0
      numa_interleave         892     855
      numa_local              90034   89796
      numa_other              455     242
      numa_pages_migrated     124     88
      numa_pte_updates        4818    12039
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before     After
      cs                        2,113,167  2,049,153
      migrations                10,533     11,405
      faults                    142,727    162,309
      cache-misses              5,594,192  7,203,343
      sched:sched_move_numa     10         22
      sched:sched_stick_numa    0          0
      sched:sched_swap_numa     0          0
      migrate:mm_migrate_pages  6          1
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        744     1693
      numa_hint_faults_local  584     1669
      numa_hit                25551   25177
      numa_huge_pte_updates   0       0
      numa_interleave         263     194
      numa_local              25302   24993
      numa_other              249     184
      numa_pages_migrated     6       1
      numa_pte_updates        744     1577
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        101,227,352      94,515,937
      migrations                4,151,829        4,203,554
      faults                    745,233          832,697
      cache-misses              224,669,561,766  226,248,698,331
      sched:sched_move_numa     617              1,730
      sched:sched_stick_numa    2                14
      sched:sched_swap_numa     187              432
      migrate:mm_migrate_pages  316              1,398
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        24195   80079
      numa_hint_faults_local  21639   68620
      numa_hit                238331  241187
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              238331  241186
      numa_other              0       1
      numa_pages_migrated     204     1347
      numa_pte_updates        24561   80729
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        62,738,978      63,704,961
      migrations                562,702         573,404
      faults                    228,465         230,878
      cache-misses              75,778,067,952  76,568,222,781
      sched:sched_move_numa     648             509
      sched:sched_stick_numa    13              31
      sched:sched_swap_numa     137             182
      migrate:mm_migrate_pages  733             541
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        10281   8501
      numa_hint_faults_local  3242    2960
      numa_hit                36338   35526
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              36338   35526
      numa_other              0       0
      numa_pages_migrated     706     539
      numa_pte_updates        10176   8433
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-4-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Pass destination CPU as a parameter to migrate_task_rq · 1327237a
      Srikar Dronamraju committed
      This additional parameter (new_cpu) is used later for identifying if
      task migration is across nodes.
      
      No functional change.
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     203353  200668   -1.32036
      1     328205  321791   -1.95427
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev    Current  %Change
      1     214384  204848   -4.44809
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     188553  188098   -0.241311
      1     196273  200351   2.07772
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     57581.2  58145.9  0.980702
      1     103468   103798   0.318939
      
      Since the patch makes no functional change, the differences above
      reflect the variance between different specjbb2005 runs.
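
The %Change column in the tables above is consistent with the relative change from Prev to Current, expressed in percent. A minimal sketch of that arithmetic, assuming this is how the column was produced (the function name `percent_change` is illustrative, not from the patch):

```python
def percent_change(prev: float, current: float) -> float:
    """Relative change from prev to current, in percent."""
    return (current - prev) / prev * 100.0


# 2 Socket - 2 Node Haswell, 4 JVMs: Prev 203353, Current 200668
print(f"{percent_change(203353, 200668):.5f}")

# 2 Socket - 2 Node Power9, 1 JVM: Prev 196273, Current 200351
print(f"{percent_change(196273, 200351):.5f}")
```

Because higher bops are better, a negative value here means throughput dropped relative to the previous patch in the series.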
      
      Event stats before and after applying the patch:
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,941,377      13,912,183
      migrations                1,157,323       1,155,931
      faults                    382,175         367,139
      cache-misses              54,993,823,500  54,240,196,814
      sched:sched_move_numa     2,005           1,571
      sched:sched_stick_numa    14              9
      sched:sched_swap_numa     529             463
      migrate:mm_migrate_pages  1,573           703
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        67099   50155
      numa_hint_faults_local  58456   45264
      numa_hit                240416  239652
      numa_huge_pte_updates   18      36
      numa_interleave         65      68
      numa_local              240339  239576
      numa_other              77      76
      numa_pages_migrated     1574    680
      numa_pte_updates        77182   71146
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,176,453       3,156,720
      migrations                30,238          30,354
      faults                    87,869          97,261
      cache-misses              12,544,479,391  12,400,026,826
      sched:sched_move_numa     23              4
      sched:sched_stick_numa    0               0
      sched:sched_swap_numa     6               1
      migrate:mm_migrate_pages  10              20
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        236     272
      numa_hint_faults_local  201     186
      numa_hit                72293   71362
      numa_huge_pte_updates   0       0
      numa_interleave         26      23
      numa_local              72233   71299
      numa_other              60      63
      numa_pages_migrated     8       2
      numa_pte_updates        0       0
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,478,820    8,606,824
      migrations                171,323      155,352
      faults                    307,499      301,409
      cache-misses              240,353,599  157,759,224
      sched:sched_move_numa     214          168
      sched:sched_stick_numa    0            0
      sched:sched_swap_numa     4            3
      migrate:mm_migrate_pages  89           125
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        5301    4650
      numa_hint_faults_local  4745    3946
      numa_hit                92943   90489
      numa_huge_pte_updates   0       0
      numa_interleave         899     892
      numa_local              92345   90034
      numa_other              598     455
      numa_pages_migrated     88      124
      numa_pte_updates        5505    4818
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before      After
      cs                        2,066,172   2,113,167
      migrations                11,076      10,533
      faults                    149,544     142,727
      cache-misses              10,398,067  5,594,192
      sched:sched_move_numa     43          10
      sched:sched_stick_numa    0           0
      sched:sched_swap_numa     0           0
      migrate:mm_migrate_pages  6           6
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        3552    744
      numa_hint_faults_local  3347    584
      numa_hit                25611   25551
      numa_huge_pte_updates   0       0
      numa_interleave         213     263
      numa_local              25583   25302
      numa_other              28      249
      numa_pages_migrated     6       6
      numa_pte_updates        3535    744
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        99,358,136       101,227,352
      migrations                4,041,607        4,151,829
      faults                    749,653          745,233
      cache-misses              225,562,543,251  224,669,561,766
      sched:sched_move_numa     771              617
      sched:sched_stick_numa    14               2
      sched:sched_swap_numa     204              187
      migrate:mm_migrate_pages  1,180            316
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        27409   24195
      numa_hint_faults_local  20677   21639
      numa_hit                239988  238331
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              239983  238331
      numa_other              5       0
      numa_pages_migrated     1016    204
      numa_pte_updates        27916   24561
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        60,899,307      62,738,978
      migrations                544,668         562,702
      faults                    270,834         228,465
      cache-misses              74,543,455,635  75,778,067,952
      sched:sched_move_numa     735             648
      sched:sched_stick_numa    25              13
      sched:sched_swap_numa     174             137
      migrate:mm_migrate_pages  816             733
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        11059   10281
      numa_hint_faults_local  4733    3242
      numa_hit                41384   36338
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              41383   36338
      numa_other              1       0
      numa_pages_migrated     815     706
      numa_pte_updates        11323   10176
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-3-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Stop multiple tasks from moving to the CPU at the same time · a4739eca
      Srikar Dronamraju committed
      Task migration under NUMA balancing can happen in parallel. More than
      one task might choose to migrate to the same CPU at the same time. This
      can result in:
      
      - During task swap, choosing a task that was not part of the evaluation.
      - During task swap, a task that just moved into its preferred node
        moving to a completely different node.
      - During task swap, a task that fails to move to its preferred node
        having to wait an extra interval for the next migration opportunity.
      - During task movement, multiple task movements causing load imbalance.
      
      This problem is more likely if there are more cores per node or more
      nodes in the system.
      
      Use a per run-queue variable to check if NUMA-balance is active on the
      run-queue.
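
The idea of a per run-queue "migration in progress" marker can be sketched as below. This is a toy model, not the kernel code: the class `RunQueue`, the flag name `numa_migrate_active`, and the helper `pick_destination` are all hypothetical, and a Python lock stands in for whatever atomic primitive the kernel uses.

```python
import threading


class RunQueue:
    """Toy stand-in for a per-CPU run-queue (not the kernel structure)."""

    def __init__(self, cpu):
        self.cpu = cpu
        self._lock = threading.Lock()
        self.numa_migrate_active = False  # hypothetical flag name

    def try_claim(self):
        """Atomically test-and-set the flag; True only for the first claimer."""
        with self._lock:
            if self.numa_migrate_active:
                return False
            self.numa_migrate_active = True
            return True

    def release(self):
        """Clear the flag once the migration to this CPU has finished."""
        with self._lock:
            self.numa_migrate_active = False


def pick_destination(candidates):
    """Return the first run-queue that is not already a migration target."""
    for rq in candidates:
        if rq.try_claim():
            return rq
    return None
```

With two candidate run-queues, two concurrent callers of `pick_destination()` end up on different CPUs, and a third caller gets `None` until one of the claims is released.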
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     200194  203353   1.57797
      1     311331  328205   5.41995
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev    Current  %Change
      1     197654  214384   8.46429
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     192605  188553   -2.10379
      1     213402  196273   -8.02664
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     52227.1  57581.2  10.2516
      1     102529   103468   0.915838
      
      There is a regression on the Power9 box. Looking at the details,
      that box shows a sudden jump in cache-misses with this patch.
      All other parameters seem to point towards NUMA
      consolidation.
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,345,784      13,941,377
      migrations                1,127,820       1,157,323
      faults                    374,736         382,175
      cache-misses              55,132,054,603  54,993,823,500
      sched:sched_move_numa     1,923           2,005
      sched:sched_stick_numa    52              14
      sched:sched_swap_numa     595             529
      migrate:mm_migrate_pages  1,932           1,573
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        60605   67099
      numa_hint_faults_local  51804   58456
      numa_hit                239945  240416
      numa_huge_pte_updates   14      18
      numa_interleave         60      65
      numa_local              239865  240339
      numa_other              80      77
      numa_pages_migrated     1931    1574
      numa_pte_updates        67823   77182
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,016,467       3,176,453
      migrations                37,326          30,238
      faults                    115,342         87,869
      cache-misses              11,692,155,554  12,544,479,391
      sched:sched_move_numa     965             23
      sched:sched_stick_numa    8               0
      sched:sched_swap_numa     35              6
      migrate:mm_migrate_pages  1,168           10
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        16286   236
      numa_hint_faults_local  11863   201
      numa_hit                112482  72293
      numa_huge_pte_updates   33      0
      numa_interleave         20      26
      numa_local              112419  72233
      numa_other              63      60
      numa_pages_migrated     1144    8
      numa_pte_updates        32859   0
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,629,724    8,478,820
      migrations                221,052      171,323
      faults                    308,661      307,499
      cache-misses              135,574,913  240,353,599
      sched:sched_move_numa     147          214
      sched:sched_stick_numa    0            0
      sched:sched_swap_numa     2            4
      migrate:mm_migrate_pages  64           89
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        11481   5301
      numa_hint_faults_local  10968   4745
      numa_hit                89773   92943
      numa_huge_pte_updates   0       0
      numa_interleave         1116    899
      numa_local              89220   92345
      numa_other              553     598
      numa_pages_migrated     62      88
      numa_pte_updates        11694   5505
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before     After
      cs                        2,272,887  2,066,172
      migrations                12,206     11,076
      faults                    163,704    149,544
      cache-misses              4,801,186  10,398,067
      sched:sched_move_numa     44         43
      sched:sched_stick_numa    0          0
      sched:sched_swap_numa     0          0
      migrate:mm_migrate_pages  17         6
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        2261    3552
      numa_hint_faults_local  1993    3347
      numa_hit                25726   25611
      numa_huge_pte_updates   0       0
      numa_interleave         239     213
      numa_local              25498   25583
      numa_other              228     28
      numa_pages_migrated     17      6
      numa_pte_updates        2266    3535
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        117,980,962      99,358,136
      migrations                3,950,220        4,041,607
      faults                    736,979          749,653
      cache-misses              224,976,072,879  225,562,543,251
      sched:sched_move_numa     504              771
      sched:sched_stick_numa    50               14
      sched:sched_swap_numa     239              204
      migrate:mm_migrate_pages  1,260            1,180
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        18293   27409
      numa_hint_faults_local  11969   20677
      numa_hit                240854  239988
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              240851  239983
      numa_other              3       5
      numa_pages_migrated     1190    1016
      numa_pte_updates        18106   27916
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        61,053,158      60,899,307
      migrations                551,586         544,668
      faults                    244,174         270,834
      cache-misses              74,326,766,973  74,543,455,635
      sched:sched_move_numa     344             735
      sched:sched_stick_numa    24              25
      sched:sched_swap_numa     140             174
      migrate:mm_migrate_pages  568             816
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        6461    11059
      numa_hint_faults_local  2283    4733
      numa_hit                35661   41384
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              35661   41383
      numa_other              0       1
      numa_pages_migrated     568     815
      numa_pte_updates        6518    11323
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-2-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  6. 10 Sep, 2018 5 commits
  7. 25 Jul, 2018 12 commits
    • S
      sched/numa: Move task_numa_placement() closer to numa_migrate_preferred() · b6a60cf3
      Srikar Dronamraju committed
      numa_migrate_preferred() is called periodically or when the task's
      preferred node changes. Preferred node evaluations happen once per scan
      sequence.
      
      If the scan completes just after the periodic NUMA migration, we try to
      migrate to the preferred node, and then the preferred node might
      change, requiring another node migration.
      
      Avoid this by checking for scan sequence completion only when checking
      for periodic migration.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25862.6     26158.1     1.14258
      1     74357       72725       -2.19482
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     117019      113992      -2.58
      1     179095      174947      -2.31
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      449.46      770.77      615.22      101.70
      numa01.sh       Sys:      132.72      208.17      170.46       24.96
      numa01.sh      User:    39185.26    60290.89    50066.76     6807.84
      numa02.sh      Real:       60.85       61.79       61.28        0.37
      numa02.sh       Sys:       15.34       24.71       21.08        3.61
      numa02.sh      User:     5204.41     5249.85     5231.21       17.60
      numa03.sh      Real:      785.50      916.97      840.77       44.98
      numa03.sh       Sys:      108.08      133.60      119.43        8.82
      numa03.sh      User:    61422.86    70919.75    64720.87     3310.61
      numa04.sh      Real:      429.57      587.37      480.80       57.40
      numa04.sh       Sys:      240.61      321.97      290.84       33.58
      numa04.sh      User:    34597.65    40498.99    37079.48     2060.72
      numa05.sh      Real:      392.09      431.25      414.65       13.82
      numa05.sh       Sys:      229.41      372.48      297.54       53.14
      numa05.sh      User:    33390.86    34697.49    34222.43      556.42
      
      Testcase       Time:         Min         Max         Avg      StdDev 	%Change
      numa01.sh      Real:      424.63      566.18      498.12       59.26 	 23.50%
      numa01.sh       Sys:      160.19      256.53      208.98       37.02 	 -18.4%
      numa01.sh      User:    37320.00    46225.58    42001.57     3482.45 	 19.20%
      numa02.sh      Real:       60.17       62.47       60.91        0.85 	 0.607%
      numa02.sh       Sys:       15.30       22.82       17.04        2.90 	 23.70%
      numa02.sh      User:     5202.13     5255.51     5219.08       20.14 	 0.232%
      numa03.sh      Real:      823.91      844.89      833.86        8.46 	 0.828%
      numa03.sh       Sys:      130.69      148.29      140.47        6.21 	 -14.9%
      numa03.sh      User:    62519.15    64262.20    63613.38      620.05 	 1.740%
      numa04.sh      Real:      515.30      603.74      548.56       30.93 	 -12.3%
      numa04.sh       Sys:      459.73      525.48      489.18       21.63 	 -40.5%
      numa04.sh      User:    40561.96    44919.18    42047.87     1526.85 	 -11.8%
      numa05.sh      Real:      396.58      454.37      421.13       19.71 	 -1.53%
      numa05.sh       Sys:      208.72      422.02      348.90       73.60 	 -14.7%
      numa05.sh      User:    33124.08    36109.35    34846.47     1089.74 	 -1.79%
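
The %Change column in the second table appears to be derived from the Avg columns as (before - after) / after, so a positive value means the test took less time with the patch. This is inferred from the numbers rather than stated in the commit message; the function name `time_improvement` is illustrative.

```python
def time_improvement(before_avg: float, after_avg: float) -> float:
    """Percent improvement in run time, relative to the 'after' average.

    Positive: the patched kernel was faster. Negative: it was slower.
    """
    return (before_avg - after_avg) / after_avg * 100.0


# numa01.sh Real: Avg 615.22 before, 498.12 after
print(f"{time_improvement(615.22, 498.12):.2f}")

# numa04.sh Sys: Avg 290.84 before, 489.18 after
print(f"{time_improvement(290.84, 489.18):.2f}")
```

Note that this sign convention is the opposite of the bops tables, where higher values are better.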
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-20-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Use group_weights to identify if migration degrades locality · f35678b6
      Srikar Dronamraju committed
      On NUMA_BACKPLANE and NUMA_GLUELESS_MESH systems, tasks/memory should be
      consolidated to the closest group of nodes. In such a case, relying on
      the group_fault metric may not always help to consolidate. There can
      always be a case where a node closer to the preferred node has fewer
      faults than a node further away from the preferred node. In such a case,
      moving to the node with more faults might prevent NUMA consolidation.
      
      Using group_weight would help to consolidate task/memory around the
      preferred_node.
      
      While here, to be on the conservative side, don't override the migrate
      thread degrades locality logic for CPU_NEWLY_IDLE load balancing.
      
      Note: Similar problems exist with should_numa_migrate_memory and will be
      dealt with separately.
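
The difference between comparing raw fault counts and comparing a distance-aware weight can be illustrated with a toy model. Everything below is an assumption made for illustration: the scoring formula, the helper names `node_weight` and `degrades_locality`, the distance matrix, and the fault counts are not taken from the patch, and the model assumes a topology with more than one node.

```python
LOCAL_DISTANCE = 10  # distance of a node to itself in this toy topology


def node_weight(nid, faults, dist):
    """Own faults plus faults on other nodes, discounted by distance."""
    max_dist = max(max(row) for row in dist)
    score = faults[nid]
    for other, f in enumerate(faults):
        if other == nid:
            continue
        # Nodes at max_dist contribute nothing; nearer nodes contribute more.
        score += f * (max_dist - dist[nid][other]) // (max_dist - LOCAL_DISTANCE)
    return score


def degrades_locality(src, dst, faults, dist):
    """A move degrades locality if the destination has the lower weight."""
    return node_weight(dst, faults, dist) < node_weight(src, faults, dist)


# Node 0 is the preferred node; node 1 is near it, node 3 is far from it.
dist = [
    [10, 20, 40, 40],
    [20, 10, 40, 40],
    [40, 40, 10, 20],
    [40, 40, 20, 10],
]
faults = [1000, 50, 0, 100]

print(faults[3] < faults[1])                  # raw fault comparison, 1 -> 3
print(degrades_locality(1, 3, faults, dist))  # weighted comparison, 1 -> 3
```

In this example node 3 has more faults of its own than node 1, so a raw fault comparison does not flag a move from node 1 to node 3. The weighted comparison does flag it, because node 1 sits next to the preferred node where most of the faults are.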
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25645.4     25960       1.22
      1     72142       73550       1.95
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     110199      120071      8.958
      1     176303      176249      -0.03
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      490.04      774.86      596.26       96.46
      numa01.sh       Sys:      151.52      242.88      184.82       31.71
      numa01.sh      User:    41418.41    60844.59    48776.09     6564.27
      numa02.sh      Real:       60.14       62.94       60.98        1.00
      numa02.sh       Sys:       16.11       30.77       21.20        5.28
      numa02.sh      User:     5184.33     5311.09     5228.50       44.24
      numa03.sh      Real:      790.95      856.35      826.41       24.11
      numa03.sh       Sys:      114.93      118.85      117.05        1.63
      numa03.sh      User:    60990.99    64959.28    63470.43     1415.44
      numa04.sh      Real:      434.37      597.92      504.87       59.70
      numa04.sh       Sys:      237.63      397.40      289.74       55.98
      numa04.sh      User:    34854.87    41121.83    38572.52     2615.84
      numa05.sh      Real:      386.77      448.90      417.22       22.79
      numa05.sh       Sys:      149.23      379.95      303.04       79.55
      numa05.sh      User:    32951.76    35959.58    34562.18     1034.05
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      493.19      672.88      597.51       59.38 	 -0.20%
      numa01.sh       Sys:      150.09      245.48      207.76       34.26 	 -11.0%
      numa01.sh      User:    41928.51    53779.17    48747.06     3901.39 	 0.059%
      numa02.sh      Real:       60.63       62.87       61.22        0.83 	 -0.39%
      numa02.sh       Sys:       16.64       27.97       20.25        4.06 	 4.691%
      numa02.sh      User:     5222.92     5309.60     5254.03       29.98 	 -0.48%
      numa03.sh      Real:      821.52      902.15      863.60       32.41 	 -4.30%
      numa03.sh       Sys:      112.04      130.66      118.35        7.08 	 -1.09%
      numa03.sh      User:    62245.16    69165.14    66443.04     2450.32 	 -4.47%
      numa04.sh      Real:      414.53      519.57      476.25       37.00 	 6.009%
      numa04.sh       Sys:      181.84      335.67      280.41       54.07 	 3.327%
      numa04.sh      User:    33924.50    39115.39    37343.78     1934.26 	 3.290%
      numa05.sh      Real:      408.30      441.45      417.90       12.05 	 -0.16%
      numa05.sh       Sys:      233.41      381.60      295.58       57.37 	 2.523%
      numa05.sh      User:    33301.31    35972.50    34335.19      938.94 	 0.661%
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-16-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Update the scan period without holding the numa_group lock · 30619c89
      Srikar Dronamraju committed
      The metrics for updating scan periods are local or task specific.
      Currently this update happens under the numa_group lock, which seems
      unnecessary. Hence move this update outside the lock.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25355.9     25645.4     1.141
      1     72812       72142       -0.92
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-15-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Remove numa_has_capacity() · 2d4056fa
      Srikar Dronamraju committed
      task_numa_find_cpu() helps to find the CPU to swap/move the task to.
      It's guarded by numa_has_capacity(). However, a node not having capacity
      shouldn't deter a task swap if it helps NUMA placement.
      
      Further, load_too_imbalanced(), which evaluates possibilities of move/swap,
      provides similar checks to numa_has_capacity().
      
      Hence remove numa_has_capacity() to enhance possibilities of task
      swapping even if load is imbalanced.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25657.9     25804.1     0.569
      1     74435       73413       -1.37
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Rik van Riel <riel@surriel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-13-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Modify migrate_swap() to accept additional parameters · 0ad4e3df
      Srikar Dronamraju committed
      There are checks in migrate_swap_stop() that check if the task/CPU
      combination is as per migrate_swap_arg before migrating.
      
      However, at least one of the two tasks to be swapped by migrate_swap() could
      have migrated to a completely different CPU before updating the
      migrate_swap_arg. The new CPU where the task is currently running could
      be on a different node too. If the task has migrated, the NUMA balancer
      might end up placing a task on the wrong node. Instead of achieving node
      consolidation, it may end up spreading the load across nodes.
      
      To avoid that, pass the CPUs as additional parameters.
      
      While here, place migrate_swap under CONFIG_NUMA_BALANCING.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25377.3     25226.6     -0.59
      1     72287       73326       1.437
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-10-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0ad4e3df
    • S
      sched/numa: Remove unused task_capacity from 'struct numa_stats' · 10864a9e
      Srikar Dronamraju committed
      The task_capacity field in 'struct numa_stats' is redundant.
      Also move nr_running for better packing within the struct.
      
      No functional changes.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25308.6     25377.3     0.271
      1     72964       72287       -0.92
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Rik van Riel <riel@surriel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-9-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      10864a9e
    • S
      sched/numa: Skip nodes that are at 'hoplimit' · 0ee7e74d
      Srikar Dronamraju committed
      When comparing two nodes at a distance of 'hoplimit', we should consider
      nodes only up to 'hoplimit'. Currently we also consider nodes at 'hoplimit'
      distance. Hence two nodes at a distance of 'hoplimit' will have the same
      groupweight. Fix this by skipping nodes at 'hoplimit'.
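      
      A minimal user-space sketch of the boundary condition described above.
      The helper name count_nodes_within() and the flattened distance table are
      assumptions made for illustration; they are not the kernel's code or data
      structures. The point is only the comparison: nodes at exactly 'hoplimit'
      are skipped ('>=' rather than '>').

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the nodes that would contribute to the
 * score of node `nid` when only nodes strictly closer than `hoplimit`
 * are considered. `dist` is a flattened n x n distance table
 * (an assumption of this sketch).
 */
static size_t count_nodes_within(const int *dist, size_t n, size_t nid,
				 int hoplimit)
{
	size_t node, count = 0;

	assert(nid < n);

	for (node = 0; node < n; node++) {
		if (node == nid)
			continue;
		/* Skip nodes at or beyond the limit: '>=', not '>'. */
		if (dist[nid * n + node] >= hoplimit)
			continue;
		count++;
	}
	return count;
}
```

      With distances 20 and 40 from node 0 and a 'hoplimit' of 40, only the
      node at distance 20 is counted in this sketch.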
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25375.3     25308.6     -0.26
      1     72617       72964       0.477
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     113372      108750      -4.07684
      1     177403      183115      3.21979
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      478.45      565.90      515.11       30.87
      numa01.sh       Sys:      207.79      271.04      232.94       21.33
      numa01.sh      User:    39763.93    47303.12    43210.73     2644.86
      numa02.sh      Real:       60.00       61.46       60.78        0.49
      numa02.sh       Sys:       15.71       25.31       20.69        3.42
      numa02.sh      User:     5175.92     5265.86     5235.97       32.82
      numa03.sh      Real:      776.42      834.85      806.01       23.22
      numa03.sh       Sys:      114.43      128.75      121.65        5.49
      numa03.sh      User:    60773.93    64855.25    62616.91     1576.39
      numa04.sh      Real:      456.93      511.95      482.91       20.88
      numa04.sh       Sys:      178.09      460.89      356.86       94.58
      numa04.sh      User:    36312.09    42553.24    39623.21     2247.96
      numa05.sh      Real:      393.98      493.48      436.61       35.59
      numa05.sh       Sys:      164.49      329.15      265.87       61.78
      numa05.sh      User:    33182.65    36654.53    35074.51     1187.71
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      414.64      819.20      556.08      147.70 	 -7.36%
      numa01.sh       Sys:       77.52      205.04      139.40       52.05 	 67.10%
      numa01.sh      User:    37043.24    61757.88    45517.48     9290.38 	 -5.06%
      numa02.sh      Real:       60.80       63.32       61.63        0.88 	 -1.37%
      numa02.sh       Sys:       17.35       39.37       25.71        7.33 	 -19.5%
      numa02.sh      User:     5213.79     5374.73     5268.90       55.09 	 -0.62%
      numa03.sh      Real:      780.09      948.64      831.43       63.02 	 -3.05%
      numa03.sh       Sys:      104.96      136.92      116.31       11.34 	 4.591%
      numa03.sh      User:    60465.42    73339.78    64368.03     4700.14 	 -2.72%
      numa04.sh      Real:      412.60      681.92      521.29       96.64 	 -7.36%
      numa04.sh       Sys:      210.32      314.10      251.77       37.71 	 41.74%
      numa04.sh      User:    34026.38    45581.20    38534.49     4198.53 	 2.825%
      numa05.sh      Real:      394.79      439.63      411.35       16.87 	 6.140%
      numa05.sh       Sys:      238.32      330.09      292.31       38.32 	 -9.04%
      numa05.sh      User:    33456.45    34876.07    34138.62      609.45 	 2.741%
      
      While there is a regression with this change, this change is needed from a
      correctness perspective. Also it helps consolidation as seen from perf bench
      output.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-8-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0ee7e74d
    • S
      sched/numa: Use task faults only if numa_group is not yet set up · f03bb676
      Srikar Dronamraju committed
      When numa_group faults are available, task_numa_placement() only uses
      numa_group faults to evaluate the preferred node. However, it still accounts
      task faults and even evaluates a preferred node based on task
      faults alone, only to discard it in favour of the preferred node chosen on the
      basis of numa_group.
      
      Instead, use task faults only if numa_group is not set.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25549.6     25215.7     -1.30
      1     73190       72107       -1.47
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     113437      113372      -0.05
      1     196130      177403      -9.54
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      506.35      794.46      599.06      104.26
      numa01.sh       Sys:      150.37      223.56      195.99       24.94
      numa01.sh      User:    43450.69    61752.04    49281.50     6635.33
      numa02.sh      Real:       60.33       62.40       61.31        0.90
      numa02.sh       Sys:       18.12       31.66       24.28        5.89
      numa02.sh      User:     5203.91     5325.32     5260.29       49.98
      numa03.sh      Real:      696.47      853.62      745.80       57.28
      numa03.sh       Sys:       85.68      123.71       97.89       13.48
      numa03.sh      User:    55978.45    66418.63    59254.94     3737.97
      numa04.sh      Real:      444.05      514.83      497.06       26.85
      numa04.sh       Sys:      230.39      375.79      316.23       48.58
      numa04.sh      User:    35403.12    41004.10    39720.80     2163.08
      numa05.sh      Real:      423.09      460.41      439.57       13.92
      numa05.sh       Sys:      287.38      480.15      369.37       68.52
      numa05.sh      User:    34732.12    38016.80    36255.85     1070.51
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      478.45      565.90      515.11       30.87 	 16.29%
      numa01.sh       Sys:      207.79      271.04      232.94       21.33 	 -15.8%
      numa01.sh      User:    39763.93    47303.12    43210.73     2644.86 	 14.04%
      numa02.sh      Real:       60.00       61.46       60.78        0.49 	 0.871%
      numa02.sh       Sys:       15.71       25.31       20.69        3.42 	 17.35%
      numa02.sh      User:     5175.92     5265.86     5235.97       32.82 	 0.464%
      numa03.sh      Real:      776.42      834.85      806.01       23.22 	 -7.47%
      numa03.sh       Sys:      114.43      128.75      121.65        5.49 	 -19.5%
      numa03.sh      User:    60773.93    64855.25    62616.91     1576.39 	 -5.36%
      numa04.sh      Real:      456.93      511.95      482.91       20.88 	 2.930%
      numa04.sh       Sys:      178.09      460.89      356.86       94.58 	 -11.3%
      numa04.sh      User:    36312.09    42553.24    39623.21     2247.96 	 0.246%
      numa05.sh      Real:      393.98      493.48      436.61       35.59 	 0.677%
      numa05.sh       Sys:      164.49      329.15      265.87       61.78 	 38.92%
      numa05.sh      User:    33182.65    36654.53    35074.51     1187.71 	 3.368%
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-6-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f03bb676
    • S
      sched/numa: Set preferred_node based on best_cpu · 8cd45eee
      Srikar Dronamraju committed
      Currently the preferred node is set to dst_nid, which is the last node in the
      iteration whose group weight or task weight is greater than the current
      node. However, it doesn't guarantee that dst_nid has the NUMA capacity
      to move. It also doesn't guarantee that dst_nid has the best_cpu, which
      is the CPU/node ideal for node migration.
      
      Let's consider faults on a 4 node system with group weight numbers
      in different nodes being in 0 < 1 < 2 < 3 proportion. Consider the task
      is running on 3 and 0 is its preferred node but its capacity is full.
      Consider nodes 1, 2 and 3 have capacity. Then the task should be
      migrated to node 1. Currently the task gets moved to node 2. env.dst_nid
      points to the last node whose faults were greater than current node.
      
      Modify to set the preferred node based on best_cpu. Earlier, setting the
      preferred node was skipped if nr_active_nodes is 1. This could result in
      the task being moved out of the preferred node to a random node during
      regular load balancing.
      
      Also, while modifying task_numa_migrate(), use sched_setnuma() to set the
      preferred node. This ensures our NUMA accounting is correct.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25122.9     25549.6     1.698
      1     73850       73190       -0.89
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     105930      113437      7.08676
      1     178624      196130      9.80047
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      435.78      653.81      534.58       83.20
      numa01.sh       Sys:      121.93      187.18      145.90       23.47
      numa01.sh      User:    37082.81    51402.80    43647.60     5409.75
      numa02.sh      Real:       60.64       61.63       61.19        0.40
      numa02.sh       Sys:       14.72       25.68       19.06        4.03
      numa02.sh      User:     5210.95     5266.69     5233.30       20.82
      numa03.sh      Real:      746.51      808.24      780.36       23.88
      numa03.sh       Sys:       97.26      108.48      105.07        4.28
      numa03.sh      User:    58956.30    61397.05    60162.95     1050.82
      numa04.sh      Real:      465.97      519.27      484.81       19.62
      numa04.sh       Sys:      304.43      359.08      334.68       20.64
      numa04.sh      User:    37544.16    41186.15    39262.44     1314.91
      numa05.sh      Real:      411.57      457.20      433.29       16.58
      numa05.sh       Sys:      230.05      435.48      339.95       67.58
      numa05.sh      User:    33325.54    36896.31    35637.84     1222.64
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      506.35      794.46      599.06      104.26 	 -10.76%
      numa01.sh       Sys:      150.37      223.56      195.99       24.94 	 -25.55%
      numa01.sh      User:    43450.69    61752.04    49281.50     6635.33 	 -11.43%
      numa02.sh      Real:       60.33       62.40       61.31        0.90 	 -0.195%
      numa02.sh       Sys:       18.12       31.66       24.28        5.89 	 -21.49%
      numa02.sh      User:     5203.91     5325.32     5260.29       49.98 	 -0.513%
      numa03.sh      Real:      696.47      853.62      745.80       57.28 	 4.6339%
      numa03.sh       Sys:       85.68      123.71       97.89       13.48 	 7.3347%
      numa03.sh      User:    55978.45    66418.63    59254.94     3737.97 	 1.5323%
      numa04.sh      Real:      444.05      514.83      497.06       26.85 	 -2.464%
      numa04.sh       Sys:      230.39      375.79      316.23       48.58 	 5.8343%
      numa04.sh      User:    35403.12    41004.10    39720.80     2163.08 	 -1.153%
      numa05.sh      Real:      423.09      460.41      439.57       13.92 	 -1.428%
      numa05.sh       Sys:      287.38      480.15      369.37       68.52 	 -7.964%
      numa05.sh      User:    34732.12    38016.80    36255.85     1070.51 	 -1.704%
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-5-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8cd45eee
    • S
      sched/numa: Simplify load_too_imbalanced() · 5f95ba7a
      Srikar Dronamraju committed
      Currently load_too_imbalanced() cares about the slope of the imbalance.
      It doesn't care about the direction of the imbalance.
      
      However, this may not work if the nodes being compared have
      dissimilar capacities. A few nodes might have more cores than other nodes
      in the system. Also, unlike traditional load balancing at a NUMA sched
      domain, multiple requests to migrate from the same source node to the same
      destination node may run in parallel. This can cause a huge load
      imbalance. This is especially true on larger machines with either many
      cores per node or a large number of nodes in the system. Hence allow
      move/swap only if the imbalance is going to reduce.
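      
      The rule above can be sketched in user space as follows. This is an
      illustration under stated assumptions, not the kernel function: the name
      move_worsens_imbalance(), its parameters, and the choice to cross-multiply
      loads by capacities (so nodes of different sizes compare fairly) are all
      hypothetical. A move is rejected when the capacity-weighted imbalance after
      the move is larger than before it.

```c
#include <assert.h>
#include <stdbool.h>

/* Absolute difference of two unsigned values, avoiding signed overflow. */
static unsigned long abs_diff(unsigned long a, unsigned long b)
{
	return a > b ? a - b : b - a;
}

/*
 * Hypothetical sketch: return true when going from the old loads to the
 * new loads would increase the capacity-weighted imbalance between a
 * source and a destination node.
 */
static bool move_worsens_imbalance(unsigned long old_src_load,
				   unsigned long old_dst_load,
				   unsigned long new_src_load,
				   unsigned long new_dst_load,
				   unsigned long src_capacity,
				   unsigned long dst_capacity)
{
	unsigned long old_imb, new_imb;

	assert(src_capacity > 0 && dst_capacity > 0);

	/* Cross-multiply to compare load/capacity ratios without dividing. */
	old_imb = abs_diff(old_dst_load * src_capacity,
			   old_src_load * dst_capacity);
	new_imb = abs_diff(new_dst_load * src_capacity,
			   new_src_load * dst_capacity);

	return new_imb > old_imb;
}
```

      With equal capacities, moving 100 units of load from a node at 300 to a
      node at 100 is allowed by this sketch, while the reverse move is rejected.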
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25058.2     25122.9     0.25
      1     72950       73850       1.23
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      516.14      892.41      739.84      151.32
      numa01.sh       Sys:      153.16      192.99      177.70       14.58
      numa01.sh      User:    39821.04    69528.92    57193.87    10989.48
      numa02.sh      Real:       60.91       62.35       61.58        0.63
      numa02.sh       Sys:       16.47       26.16       21.20        3.85
      numa02.sh      User:     5227.58     5309.61     5265.17       31.04
      numa03.sh      Real:      739.07      917.73      795.75       64.45
      numa03.sh       Sys:       94.46      136.08      109.48       14.58
      numa03.sh      User:    57478.56    72014.09    61764.48     5343.69
      numa04.sh      Real:      442.61      715.43      530.31       96.12
      numa04.sh       Sys:      224.90      348.63      285.61       48.83
      numa04.sh      User:    35836.84    47522.47    40235.41     3985.26
      numa05.sh      Real:      386.13      489.17      434.94       43.59
      numa05.sh       Sys:      144.29      438.56      278.80      105.78
      numa05.sh      User:    33255.86    36890.82    34879.31     1641.98
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      435.78      653.81      534.58       83.20 	 38.39%
      numa01.sh       Sys:      121.93      187.18      145.90       23.47 	 21.79%
      numa01.sh      User:    37082.81    51402.80    43647.60     5409.75 	 31.03%
      numa02.sh      Real:       60.64       61.63       61.19        0.40 	 0.637%
      numa02.sh       Sys:       14.72       25.68       19.06        4.03 	 11.22%
      numa02.sh      User:     5210.95     5266.69     5233.30       20.82 	 0.608%
      numa03.sh      Real:      746.51      808.24      780.36       23.88 	 1.972%
      numa03.sh       Sys:       97.26      108.48      105.07        4.28 	 4.197%
      numa03.sh      User:    58956.30    61397.05    60162.95     1050.82 	 2.661%
      numa04.sh      Real:      465.97      519.27      484.81       19.62 	 9.385%
      numa04.sh       Sys:      304.43      359.08      334.68       20.64 	 -14.6%
      numa04.sh      User:    37544.16    41186.15    39262.44     1314.91 	 2.478%
      numa05.sh      Real:      411.57      457.20      433.29       16.58 	 0.380%
      numa05.sh       Sys:      230.05      435.48      339.95       67.58 	 -17.9%
      numa05.sh      User:    33325.54    36896.31    35637.84     1222.64 	 -2.12%
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-4-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5f95ba7a
    • S
      sched/numa: Evaluate move once per node · 305c1fac
      Srikar Dronamraju committed
      task_numa_compare() helps choose the best CPU to move or swap the
      selected task. To achieve this, task_numa_compare() is called for every
      CPU in the node. Currently it evaluates whether the task can be moved/swapped
      for each of the CPUs. However, the move evaluation is mostly independent
      of the CPU. Evaluating the move logic once per node provides scope for
      simplifying task_numa_compare().
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25705.2     25058.2     -2.51
      1     74433       72950       -1.99
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     96589.6     105930      9.670
      1     181830      178624      -1.76
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      440.65      941.32      758.98      189.17
      numa01.sh       Sys:      183.48      320.07      258.42       50.09
      numa01.sh      User:    37384.65    71818.14    60302.51    13798.96
      numa02.sh      Real:       61.24       65.35       62.49        1.49
      numa02.sh       Sys:       16.83       24.18       21.40        2.60
      numa02.sh      User:     5219.59     5356.34     5264.03       49.07
      numa03.sh      Real:      822.04      912.40      873.55       37.35
      numa03.sh       Sys:      118.80      140.94      132.90        7.60
      numa03.sh      User:    62485.19    70025.01    67208.33     2967.10
      numa04.sh      Real:      690.66      872.12      778.49       65.44
      numa04.sh       Sys:      459.26      563.03      494.03       42.39
      numa04.sh      User:    51116.44    70527.20    58849.44     8461.28
      numa05.sh      Real:      418.37      562.28      525.77       54.27
      numa05.sh       Sys:      299.45      481.00      392.49       64.27
      numa05.sh      User:    34115.09    41324.02    39105.30     2627.68
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      516.14      892.41      739.84      151.32 	 2.587%
      numa01.sh       Sys:      153.16      192.99      177.70       14.58 	 45.42%
      numa01.sh      User:    39821.04    69528.92    57193.87    10989.48 	 5.435%
      numa02.sh      Real:       60.91       62.35       61.58        0.63 	 1.477%
      numa02.sh       Sys:       16.47       26.16       21.20        3.85 	 0.943%
      numa02.sh      User:     5227.58     5309.61     5265.17       31.04 	 -0.02%
      numa03.sh      Real:      739.07      917.73      795.75       64.45 	 9.776%
      numa03.sh       Sys:       94.46      136.08      109.48       14.58 	 21.39%
      numa03.sh      User:    57478.56    72014.09    61764.48     5343.69 	 8.813%
      numa04.sh      Real:      442.61      715.43      530.31       96.12 	 46.79%
      numa04.sh       Sys:      224.90      348.63      285.61       48.83 	 72.97%
      numa04.sh      User:    35836.84    47522.47    40235.41     3985.26 	 46.26%
      numa05.sh      Real:      386.13      489.17      434.94       43.59 	 20.88%
      numa05.sh       Sys:      144.29      438.56      278.80      105.78 	 40.77%
      numa05.sh      User:    33255.86    36890.82    34879.31     1641.98 	 12.11%
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-3-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      305c1fac
    • V
      sched/fair: Remove #ifdefs from scale_rt_capacity() · 2e62c474
      Vincent Guittot committed
      Reuse cpu_util_irq() that has been defined for schedutil and set irq util
      to 0 when !CONFIG_IRQ_TIME_ACCOUNTING.
      
      But the compiler is not able to optimize the sequence (at least with
      aarch64 GCC 7.2.1):
      
      	free *= (max - irq);
      	free /= max;
      
      when irq is fixed to 0.
      
      Add a new inline function scale_irq_capacity() that will scale utilization
      when irq is accounted. Reuse this function in schedutil, which applies a
      similar formula.
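      
      As a user-space illustration of the scaling sequence quoted above, the
      helper might look like the sketch below. The signature and the assert()
      are assumptions of this sketch rather than the patch itself; a build
      without IRQ time accounting could instead use a variant that returns util
      unchanged, which sidesteps the multiply/divide the compiler fails to
      optimize away.

```c
#include <assert.h>

/*
 * Sketch: scale `util` by the fraction of time not spent in IRQ
 * context. `irq` and `max` are on the same capacity scale.
 */
static inline unsigned long scale_irq_capacity(unsigned long util,
					       unsigned long irq,
					       unsigned long max)
{
	assert(max > 0 && irq <= max);

	util *= (max - irq);
	util /= max;

	return util;
}
```

      For example, with max = 1024 and irq = 256, a utilization of 1024 scales
      down to 768 in this sketch.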
      Suggested-by: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: rjw@rjwysocki.net
      Link: http://lkml.kernel.org/r/1532001606-6689-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2e62c474
  8. 16 Jul, 2018 7 commits
    • V
      sched/core: Remove the rt_avg code · bbb62c0b
      Vincent Guittot committed
      rt_avg is not used anywhere anymore, so we can remove all related code.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: claudio@evidence.eu.com
      Cc: daniel.lezcano@linaro.org
      Cc: dietmar.eggemann@arm.com
      Cc: joel@joelfernandes.org
      Cc: juri.lelli@redhat.com
      Cc: luca.abeni@santannapisa.it
      Cc: patrick.bellasi@arm.com
      Cc: quentin.perret@arm.com
      Cc: rjw@rjwysocki.net
      Cc: valentin.schneider@arm.com
      Cc: viresh.kumar@linaro.org
      Link: http://lkml.kernel.org/r/1530200714-4504-11-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      bbb62c0b
    • V
      sched/core: Use PELT for scale_rt_capacity() · 523e979d
      Vincent Guittot committed
      The utilization of the CPU by RT, DL and IRQs is now tracked with
      PELT, so we can use these metrics instead of rt_avg to evaluate the remaining
      capacity available for the CFS class.
      
      scale_rt_capacity() behavior has been changed and now returns the remaining
      capacity available for CFS instead of a scaling factor, because RT, DL and
      IRQ now provide absolute utilization values.
      
      The same formula as schedutil is used:
      
        IRQ util_avg + (1 - IRQ util_avg / max capacity ) * /Sum rq util_avg
      
      but the implementation is different because it doesn't return the same value
      and doesn't benefit from the same optimization.
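      
      A hedged user-space sketch of what "remaining capacity for CFS" can mean
      under the formula above: subtracting the formula's result from max
      capacity gives (max - irq) * (max - rt - dl) / max. The function name
      remaining_cfs_capacity(), its parameters, and the floor of 1 returned when
      the CPU is saturated are assumptions of this sketch, not a copy of
      scale_rt_capacity().

```c
#include <assert.h>

/*
 * Hypothetical sketch: capacity left for CFS once RT, DL and IRQ
 * utilization are accounted. All inputs share the scale of `max`.
 */
static unsigned long remaining_cfs_capacity(unsigned long rt,
					    unsigned long dl,
					    unsigned long irq,
					    unsigned long max)
{
	unsigned long used, free;

	assert(max > 0);

	/* IRQ time alone saturates the CPU: keep a floor of 1 (sketch choice). */
	if (irq >= max)
		return 1;

	used = rt + dl;
	if (used >= max)
		return 1;

	free = max - used;

	/* Scale what is left by the share of time not spent in IRQ context. */
	return free * (max - irq) / max;
}
```

      With max = 1024, rt = dl = 256 and irq = 512, this sketch reports 256
      units of capacity left for CFS.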
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: claudio@evidence.eu.com
      Cc: daniel.lezcano@linaro.org
      Cc: dietmar.eggemann@arm.com
      Cc: joel@joelfernandes.org
      Cc: juri.lelli@redhat.com
      Cc: luca.abeni@santannapisa.it
      Cc: patrick.bellasi@arm.com
      Cc: quentin.perret@arm.com
      Cc: rjw@rjwysocki.net
      Cc: valentin.schneider@arm.com
      Cc: viresh.kumar@linaro.org
      Link: http://lkml.kernel.org/r/1530200714-4504-10-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      523e979d
    • V
      sched/irq: Add IRQ utilization tracking · 91c27493
      Vincent Guittot committed
      Interrupt and steal time are the only remaining activities tracked by
      rt_avg. Like for sched classes, we can use PELT to track their average
      utilization of the CPU. But unlike sched classes, we don't track when
      entering/leaving interrupt; instead, we take into account the time spent
      under interrupt context when we update rqs' clock (rq_clock_task).
      This also means that we have to decay the normal context time and account
      for interrupt time during the update.
      
      It is also important to note that because:
      
        rq_clock == rq_clock_task + interrupt time
      
      and rq_clock_task is used by a sched class to compute its utilization, the
      util_avg of a sched class only reflects the utilization of the time spent
      in normal context and not of the whole time of the CPU. The utilization of
      interrupt gives a more accurate level of utilization of the CPU.
      
      The CPU utilization is:
      
        avg_irq + (1 - avg_irq / max capacity) * /Sum avg_rq
      
      Most of the time, avg_irq is small and negligible, so the use of the
      approximation CPU utilization = /Sum avg_rq was enough.
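      
      The CPU utilization formula above can be evaluated in integer arithmetic
      as in the sketch below. The function name cpu_util_with_irq() and its
      parameters are hypothetical; sum_avg_rq stands for the summed avg_rq of
      the sched classes, and all values are assumed to share the scale of max.

```c
#include <assert.h>

/*
 * Hypothetical sketch of:
 *   avg_irq + (1 - avg_irq / max) * sum_avg_rq
 * rewritten as avg_irq + (max - avg_irq) * sum_avg_rq / max to stay
 * in integer arithmetic.
 */
static unsigned long cpu_util_with_irq(unsigned long avg_irq,
				       unsigned long sum_avg_rq,
				       unsigned long max)
{
	assert(max > 0 && avg_irq <= max);

	return avg_irq + (max - avg_irq) * sum_avg_rq / max;
}
```

      When avg_irq is 0 the result equals sum_avg_rq, which matches the
      approximation mentioned above; with max = 1024, avg_irq = 256 and
      sum_avg_rq = 512 the sketch yields 640.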
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: claudio@evidence.eu.com
      Cc: daniel.lezcano@linaro.org
      Cc: dietmar.eggemann@arm.com
      Cc: joel@joelfernandes.org
      Cc: juri.lelli@redhat.com
      Cc: luca.abeni@santannapisa.it
      Cc: patrick.bellasi@arm.com
      Cc: quentin.perret@arm.com
      Cc: rjw@rjwysocki.net
      Cc: valentin.schneider@arm.com
      Cc: viresh.kumar@linaro.org
      Link: http://lkml.kernel.org/r/1530200714-4504-7-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      91c27493
    • V
      sched/dl: Add dl_rq utilization tracking · 3727e0e1
      Vincent Guittot committed
      Similarly to what happens with RT tasks, CFS tasks can be preempted by DL
      tasks and the CFS's utilization might no longer describe the real
      utilization level.
      
      Current DL bandwidth reflects the requirements to meet deadlines when tasks are
      enqueued but not the current utilization of the DL sched class. We track
      DL class utilization to estimate the system utilization.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: claudio@evidence.eu.com
      Cc: daniel.lezcano@linaro.org
      Cc: dietmar.eggemann@arm.com
      Cc: joel@joelfernandes.org
      Cc: juri.lelli@redhat.com
      Cc: luca.abeni@santannapisa.it
      Cc: patrick.bellasi@arm.com
      Cc: quentin.perret@arm.com
      Cc: rjw@rjwysocki.net
      Cc: valentin.schneider@arm.com
      Cc: viresh.kumar@linaro.org
      Link: http://lkml.kernel.org/r/1530200714-4504-5-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3727e0e1
    • V
      sched/rt: Add rt_rq utilization tracking · 371bf427
      Vincent Guittot committed
      The schedutil governor relies on cfs_rq's util_avg to choose the OPP when CFS
      tasks are running. When the CPU is overloaded by CFS and RT tasks, CFS tasks
      are preempted by RT tasks and in this case util_avg reflects the remaining
      capacity but not what CFS wants to use. In such a case, schedutil can select a
      lower OPP even though the CPU is overloaded. In order to have a more accurate
      view of the utilization of the CPU, we track the utilization of RT tasks.
      Only util_avg is correctly tracked but not load_avg and runnable_load_avg,
      which are useless for rt_rq.
      
      rt_rq uses rq_clock_task and cfs_rq uses cfs_rq_clock_task but they are
      the same at the root group level, so the PELT windows of the util_sum are
      aligned.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: claudio@evidence.eu.com
      Cc: daniel.lezcano@linaro.org
      Cc: dietmar.eggemann@arm.com
      Cc: joel@joelfernandes.org
      Cc: juri.lelli@redhat.com
      Cc: luca.abeni@santannapisa.it
      Cc: patrick.bellasi@arm.com
      Cc: quentin.perret@arm.com
      Cc: rjw@rjwysocki.net
      Cc: valentin.schneider@arm.com
      Cc: viresh.kumar@linaro.org
      Link: http://lkml.kernel.org/r/1530200714-4504-3-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
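The effect described in the rt_rq commit message can be illustrated with a minimal userspace sketch. The helper names (`demo_cpu_util`, `demo_next_freq`) are hypothetical and not kernel functions; the frequency mapping only mirrors the 25% headroom that schedutil applies and ignores everything else the governor does.

```c
#include <stdint.h>

/* Hypothetical helper: total utilization seen by the governor,
 * clamped to the CPU capacity. */
uint32_t demo_cpu_util(uint32_t cfs_util, uint32_t rt_util, uint32_t capacity)
{
	uint32_t util = cfs_util + rt_util;

	return util < capacity ? util : capacity;
}

/* Hypothetical schedutil-style mapping:
 * freq = 1.25 * max_freq * util / capacity. */
uint32_t demo_next_freq(uint32_t max_freq_khz, uint32_t util, uint32_t capacity)
{
	uint64_t freq = (uint64_t)max_freq_khz + (max_freq_khz >> 2);

	return (uint32_t)(freq * util / capacity);
}
```

With a CFS util_avg of 300 squeezed by an RT utilization of 600 on a CPU of capacity 1024, the CFS-only input requests a much lower frequency than the combined input, which is the under-provisioning the commit aims to avoid.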
    • V
      sched/pelt: Move PELT related code in a dedicated file · c0796298
      Vincent Guittot committed
      We want to track rt_rq's utilization as a part of the estimation of the
      whole rq's utilization. This is necessary because rt tasks can steal
      utilization from cfs tasks and make them look lighter than they are.
      As we want to use the same load tracking mechanism for both and prevent
      a useless dependency between cfs and rt code, the PELT code is moved into a
      dedicated file.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: claudio@evidence.eu.com
      Cc: daniel.lezcano@linaro.org
      Cc: dietmar.eggemann@arm.com
      Cc: joel@joelfernandes.org
      Cc: juri.lelli@redhat.com
      Cc: luca.abeni@santannapisa.it
      Cc: patrick.bellasi@arm.com
      Cc: quentin.perret@arm.com
      Cc: rjw@rjwysocki.net
      Cc: valentin.schneider@arm.com
      Cc: viresh.kumar@linaro.org
      Link: http://lkml.kernel.org/r/1530200714-4504-2-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • Q
      sched/fair: Fix util_avg of new tasks for asymmetric systems · 8fe5c5a9
      Quentin Perret committed
      When a new task wakes up for the first time, its initial utilization
      is set to half of the spare capacity of its CPU. The current
      implementation of post_init_entity_util_avg() uses SCHED_CAPACITY_SCALE
      directly as a capacity reference. As a result, on a big.LITTLE system, a
      new task waking up on an idle little CPU will be given ~512 of util_avg,
      even if the CPU's capacity is significantly less than that.
      
      Fix this by computing the spare capacity with arch_scale_cpu_capacity().
      Signed-off-by: Quentin Perret <quentin.perret@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: morten.rasmussen@arm.com
      Cc: patrick.bellasi@arm.com
      Link: http://lkml.kernel.org/r/20180612112215.25448-1-quentin.perret@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
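The arithmetic behind the asymmetric-capacity fix can be sketched as follows. This is a simplified model of the "half of the spare capacity" rule only, not the body of post_init_entity_util_avg(); `demo_initial_util` is a hypothetical name and the little-CPU capacity of 446 is an illustrative value.

```c
#define DEMO_CAPACITY_SCALE 1024L /* same value as SCHED_CAPACITY_SCALE */

/* Hypothetical helper: initial util_avg of a new task, taken as half of
 * the spare capacity of the CPU it wakes up on (0 if there is none). */
long demo_initial_util(long cpu_capacity, long cfs_rq_util)
{
	long spare = cpu_capacity - cfs_rq_util;

	return spare > 0 ? spare / 2 : 0;
}
```

Using DEMO_CAPACITY_SCALE as the reference gives 512 for an idle CPU regardless of its size, whereas passing the real capacity of a little CPU (446 here) gives 223.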
  9. 03 Jul, 2018 3 commits
    • V
      sched/util_est: Fix util_est_dequeue() for throttled cfs_rq · 3482d98b
      Vincent Guittot committed
      When a cfs_rq is throttled, the parent cfs_rq->nr_running is decreased and
      everything happens at cfs_rq level. Currently util_est stays unchanged
      in such a case and keeps accounting the utilization of throttled tasks.
      This can somewhat make sense, as we don't dequeue the tasks but only the
      throttled cfs_rq.
      
      If a task of another group is enqueued/dequeued and the root cfs_rq becomes
      idle during the dequeue, util_est will be cleared even though it was
      accounting the util_est of throttled tasks before. So the behavior of util_est
      is not always the same regarding throttled tasks and depends on side
      activity. Furthermore, util_est will not be updated when the cfs_rq is
      unthrottled, as everything happens at cfs_rq level. The main result is that
      util_est will stay null even though we now have running tasks. We have to wait
      for the next dequeue/enqueue of the previously throttled tasks to get an
      up-to-date util_est.
      
      Remove the assumption that cfs_rq's estimated utilization of a CPU is 0
      if there is no running task so the util_est of a task remains until the
      latter is dequeued even if its cfs_rq has been throttled.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 7f65ea42 ("sched/fair: Add util_est on top of PELT")
      Link: http://lkml.kernel.org/r/1528972380-16268-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
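The util_est inconsistency described above can be modelled with a small sketch. The struct and function names are hypothetical, and the code only captures the "estimate is 0 when nothing is running" shortcut and its removal, not the full util_est_dequeue() logic.

```c
/* Hypothetical, reduced view of the root cfs_rq. */
struct demo_root_rq {
	unsigned int nr_running;        /* entities currently counted as runnable */
	unsigned int util_est_enqueued; /* sum of the estimates of enqueued tasks */
};

static unsigned int demo_min(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Old behaviour: an empty runqueue forces the estimate to 0, which also
 * drops the contribution of tasks sitting in a throttled group. */
void demo_dequeue_old(struct demo_root_rq *rq, unsigned int task_est)
{
	unsigned int ue = 0;

	if (rq->nr_running) {
		ue = rq->util_est_enqueued;
		ue -= demo_min(ue, task_est);
	}
	rq->util_est_enqueued = ue;
}

/* Fixed behaviour: only the dequeued task's own estimate is removed. */
void demo_dequeue_new(struct demo_root_rq *rq, unsigned int task_est)
{
	rq->util_est_enqueued -= demo_min(rq->util_est_enqueued, task_est);
}
```

Take a throttled task with an estimate of 200 plus a task from another group with an estimate of 100, so 300 is accounted in total. Dequeuing the second task while nr_running is 0 leaves 0 under the old rule and 200 under the new one.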
    • X
      sched/fair: Advance global expiration when period timer is restarted · f1d1be8a
      Xunlei Pang committed
      When the period timer gets restarted after some idle time,
      start_cfs_bandwidth() doesn't update the expiration information, so
      expire_cfs_rq_runtime() will see cfs_rq->runtime_expires smaller than the
      rq clock and go to the clock drift logic, wasting CPU cycles on the
      scheduler hot path.
      
      Update the global expiration in start_cfs_bandwidth() to avoid frequent
      expire_cfs_rq_runtime() calls once a new period begins.
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180620101834.24455-2-xlpang@linux.alibaba.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • X
      sched/fair: Fix bandwidth timer clock drift condition · 512ac999
      Xunlei Pang committed
      I noticed that cgroup task groups constantly get throttled even
      if they have low CPU usage. This causes jitter in the response
      time of some of our business containers when CPU quotas are enabled.
      
      It's very simple to reproduce:
      
        mkdir /sys/fs/cgroup/cpu/test
        cd /sys/fs/cgroup/cpu/test
        echo 100000 > cpu.cfs_quota_us
        echo $$ > tasks
      
      then repeat:
      
        cat cpu.stat | grep nr_throttled  # nr_throttled will increase steadily
      
      After some analysis, we found that cfs_rq::runtime_remaining will
      be cleared by expire_cfs_rq_runtime() due to two equal but stale
      "cfs_{b|q}->runtime_expires" after period timer is re-armed.
      
      The current condition to judge clock drift in expire_cfs_rq_runtime()
      is wrong: the two runtime_expires are actually the same when clock
      drift happens, so this condition can never hit. The original design was
      correctly done by this commit:
      
        a9cf55b2 ("sched: Expire invalid runtime")
      
      ... but was changed to be the current implementation due to its locking bug.
      
      This patch introduces another way: it adds a new field in both structures
      cfs_rq and cfs_bandwidth to record the expiration update sequence, and
      uses them to figure out if clock drift happens (true if they are equal).
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 51f2176d ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
      Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlpang@linux.alibaba.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
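The sequence-based drift check can be sketched roughly as below. The struct names, field layout and `DEMO_TICK_NSEC` value are assumptions made for illustration; only the rule "equal sequence numbers mean clock drift, otherwise the local runtime has expired" comes from the commit message.

```c
#include <stdint.h>

#define DEMO_TICK_NSEC 4000000ULL /* hypothetical tick length (250 Hz) */

/* Hypothetical stand-in for the global bandwidth pool (cfs_bandwidth). */
struct demo_global_bw {
	uint64_t runtime_expires;
	int expires_seq; /* bumped each time the global expiration is updated */
};

/* Hypothetical stand-in for the per-CPU cfs_rq. */
struct demo_local_rq {
	uint64_t runtime_expires;
	int64_t runtime_remaining;
	int expires_seq; /* copy taken when runtime was last assigned */
};

void demo_expire_runtime(struct demo_local_rq *rq,
			 const struct demo_global_bw *bw, uint64_t now)
{
	/* Local deadline not reached yet: nothing to do. */
	if ((int64_t)(now - rq->runtime_expires) < 0)
		return;
	if (rq->runtime_remaining < 0)
		return;

	if (rq->expires_seq == bw->expires_seq) {
		/* Same period: the local clock drifted ahead, extend the deadline. */
		rq->runtime_expires += DEMO_TICK_NSEC;
	} else {
		/* The global pool moved to a newer period: local quota is stale. */
		rq->runtime_remaining = 0;
	}
}
```

Comparing sequence numbers rather than the two runtime_expires values avoids the situation described above, where both timestamps are equal but stale after the period timer is re-armed.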
  10. 21 Jun, 2018 2 commits
  11. 13 Jun, 2018 1 commit
    • K
      treewide: kzalloc() -> kcalloc() · 6396bb22
      Kees Cook committed
      The kzalloc() function has a 2-factor argument form, kcalloc(). This
      patch replaces cases of:
      
              kzalloc(a * b, gfp)
      
      with:
              kcalloc(a, b, gfp)
      
      as well as handling cases of:
      
              kzalloc(a * b * c, gfp)
      
      with:
      
              kzalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kzalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kzalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kzalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kzalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kzalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kzalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kzalloc
      + kcalloc
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kzalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kzalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kzalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kzalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kzalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kzalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kzalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kzalloc(C1 * C2 * C3, ...)
      |
        kzalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kzalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kzalloc(sizeof(THING) * C2, ...)
      |
        kzalloc(sizeof(TYPE) * C2, ...)
      |
        kzalloc(C1 * C2 * C3, ...)
      |
        kzalloc(C1 * C2, ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kzalloc
      + kcalloc
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
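The reason the 2-factor and array3_size() forms are preferred is that they can detect multiplication overflow instead of silently allocating a too-small buffer. A userspace sketch of that idea, with hypothetical helpers `demo_kcalloc` and `demo_array3_size` (the kernel versions also take gfp flags), and assuming a compiler that provides `__builtin_mul_overflow` (GCC or Clang):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace analogue of kcalloc(): return NULL rather than
 * letting n * size wrap around. */
void *demo_kcalloc(size_t n, size_t size)
{
	size_t bytes;

	if (__builtin_mul_overflow(n, size, &bytes))
		return NULL;
	return calloc(1, bytes);
}

/* Hypothetical analogue of array3_size(): saturate to SIZE_MAX on
 * overflow so that the allocation that follows is guaranteed to fail. */
size_t demo_array3_size(size_t a, size_t b, size_t c)
{
	size_t bytes;

	if (__builtin_mul_overflow(a, b, &bytes) ||
	    __builtin_mul_overflow(bytes, c, &bytes))
		return SIZE_MAX;
	return bytes;
}
```

An open-coded `a * b` passed as a single size argument would wrap for large inputs, while both helpers above turn the same inputs into an allocation failure.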