1. 24 April 2020: 4 commits
  2. 22 April 2020: 1 commit
    • sched: Avoid scale real weight down to zero · 9b83fd88
      Authored by Michael Wang
      fix #26198889
      
      commit 26cf52229efc87e2effa9d788f9b33c40fb3358a linux-next
      
      During our testing, we found a case where shares no longer
      work correctly; the cgroup topology is:
      
        /sys/fs/cgroup/cpu/A		(shares=102400)
        /sys/fs/cgroup/cpu/A/B	(shares=2)
        /sys/fs/cgroup/cpu/A/B/C	(shares=1024)
      
        /sys/fs/cgroup/cpu/D		(shares=1024)
        /sys/fs/cgroup/cpu/D/E	(shares=1024)
        /sys/fs/cgroup/cpu/D/E/F	(shares=1024)
      
      The same benchmark runs in groups C and F, no other tasks are
      running, and the benchmark is capable of consuming all the CPUs.
      
      We expected group C to win more CPU resources, since it can
      enjoy all the shares of group A, but it is F that wins much more.
      
      The reason is that group B has shares set to 2. Since
      A->cfs_rq.load.weight == B->se.load.weight == B->shares/nr_cpus,
      A->cfs_rq.load.weight becomes very small.
      
      And in calc_group_shares() we calculate shares as:
      
        load = max(scale_load_down(cfs_rq->load.weight),
                   cfs_rq->avg.load_avg);
        shares = (tg_shares * load) / tg_weight;
      
      Since 'cfs_rq->load.weight' is too small, the load becomes 0
      after the scale-down; although 'tg_shares' is 102400, the shares
      of the se that stands for group A on the root cfs_rq become 2.
      
      Meanwhile the se of D on the root cfs_rq is far bigger than 2,
      so it wins the battle.
      
      Thus when scale_load_down() scales the real weight down to 0, it
      no longer tells the real story; the caller gets the wrong
      information and the calculation becomes buggy.
      
      This patch adds a check in scale_load_down() so that the real
      weight is >= MIN_SHARES after scaling; with it applied, group C
      wins as expected.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lkml.kernel.org/r/38e8e212-59a1-64b2-b247-b6d0b52d8dc1@linux.alibaba.com
      Acked-by: Shanpei Chen <shanpeic@linux.alibaba.com>
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      9b83fd88
  3. 17 January 2020: 2 commits
    • alinux: hotfix: Add Cloud Kernel hotfix enhancement · f94e5b1a
      Authored by Xunlei Pang
      We reserve some fields beforehand for core structures prone to
      change, so that we won't get hurt when extra fields have to be
      added for a hotfix, thereby increasing the success rate; with this
      enhancement we can even hot-add features (see the sketch after the
      structure list below).
      
      After reserving, the cache normally does not matter, as the
      reserved fields (usually at the tail) are not accessed at all.
      
      The following structures are currently involved:
          MM:
          struct zone
          struct pglist_data
          struct mm_struct
          struct vm_area_struct
          struct mem_cgroup
          struct writeback_control
      
          Block:
          struct gendisk
          struct backing_dev_info
          struct bio
          struct queue_limits
          struct request_queue
          struct blkcg
          struct blkcg_policy
          struct blk_mq_hw_ctx
          struct blk_mq_tag_set
          struct blk_mq_queue_data
          struct blk_mq_ops
          struct elevator_mq_ops
          struct inode
          struct dentry
          struct address_space
          struct block_device
          struct hd_struct
          struct bio_set
      
          Network:
          struct sk_buff
          struct sock
          struct net_device_ops
          struct xt_target
          struct dst_entry
          struct dst_ops
          struct fib_rule
      
          Scheduler:
          struct task_struct
          struct cfs_rq
          struct rq
          struct sched_statistics
          struct sched_entity
          struct signal_struct
          struct task_group
          struct cpuacct
      
          cgroup:
          struct cgroup_root
          struct cgroup_subsys_state
          struct cgroup_subsys
          struct css_set
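
      A minimal sketch of the reservation pattern (the macro name and
      slot count are assumptions for illustration, not necessarily the
      exact macros used by this patch):

        /* Reserve padding slots at the tail of a structure so a hotfix
         * can later turn a slot into a real field without changing the
         * structure's size or layout. */
        #define CK_KABI_RESERVE(n)      unsigned long ck_reserved##n;

        struct example_core_struct {
                int     real_field_a;
                void    *real_field_b;

                /* reserved for hotfix use; unused slots cost a little
                 * space but no cache traffic, since nothing reads them */
                CK_KABI_RESERVE(1)
                CK_KABI_RESERVE(2)
                CK_KABI_RESERVE(3)
                CK_KABI_RESERVE(4)
        };
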
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      [ caspar: use SPDX-License-Identifier ]
      Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
      f94e5b1a
    • alinux: psi: Support PSI under cgroup v1 · 1f49a738
      Authored by Xunlei Pang
      Export "cpu|io|memory.pressure" under cgroup v1 "cpuacct" subsystem.
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      1f49a738
  4. 27 December 2019: 4 commits
  5. 21 November 2019: 1 commit
  6. 13 November 2019: 1 commit
    • sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices · 502bd151
      Authored by Dave Chiluk
      commit de53fd7aedb100f03e5d2231cfce0e4993282425 upstream.
      
      It has been observed that highly-threaded, non-cpu-bound
      applications running under cpu.cfs_quota_us constraints can hit a
      high percentage of periods throttled while simultaneously not
      consuming the allocated amount of quota. This use case is typical
      of user-interactive, non-cpu-bound applications, such as those
      running in Kubernetes or Mesos on multiple cpu cores.
      
      This has been root-caused to cpu-local run queues being allocated
      per-cpu bandwidth slices and then not fully using a slice within
      the period, at which point the slice and quota expire. This
      expiration of unused slices results in applications not being able
      to utilize the quota they have been allocated.
      
      The non-expiration of per-cpu slices was recently fixed by
      'commit 512ac999 ("sched/fair: Fix bandwidth timer clock drift
      condition")'. Prior to that it appears that this had been broken since
      at least 'commit 51f2176d ("sched/fair: Fix unlocked reads of some
      cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
      added the following conditional which resulted in slices never being
      expired.
      
        if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
                /* extend local deadline, drift is bounded above by 2 ticks */
                cfs_rq->runtime_expires += TICK_NSEC;
        }
      
      Because this was broken for nearly 5 years, and has recently been fixed
      and is now being noticed by many users running kubernetes
      (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
      that the mechanisms around expiring runtime should be removed
      altogether.
      
      This allows quota already allocated to per-cpu run-queues to live longer
      than the period boundary. This allows threads on runqueues that do not
      use much CPU to continue to use their remaining slice over a longer
      period of time than cpu.cfs_period_us. However, this helps prevent the
      above condition of hitting throttling while also not fully utilizing
      your cpu quota.
      
      This theoretically allows a machine to use slightly more than its
      allotted quota in some periods. This overflow would be bounded by the
      remaining quota left on each per-cpu runqueue. This is typically no
      more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
      change nothing, as they should theoretically fully utilize all of their
      quota in each period. For user-interactive tasks as described above this
      provides a much better user/application experience as their cpu
      utilization will more closely match the amount they requested when they
      hit throttling. This means that cpu limits no longer strictly apply per
      period for non-cpu bound applications, but that they are still accurate
      over longer timeframes.
      
      This greatly improves performance of high-thread-count, non-cpu bound
      applications with low cfs_quota_us allocation on high-core-count
      machines. In the case of an artificial testcase (10ms/100ms of quota on
      80 CPU machine), this commit resulted in almost 30x performance
      improvement, while still maintaining correct cpu quota restrictions.
      That testcase is available at https://github.com/indeedeng/fibtest.
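
      The quota from that testcase can be configured like this (paths
      assume the v1 cpu controller; the group name is hypothetical):

        mkdir /sys/fs/cgroup/cpu/test
        echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
        echo 10000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us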
      
      Fixes: 512ac999 ("sched/fair: Fix bandwidth timer clock drift condition")
      Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Phil Auld <pauld@redhat.com>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: John Hammond <jhammond@indeed.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kyle Anderson <kwa@yelp.com>
      Cc: Gabriel Munos <gmunoz@netflix.com>
      Cc: Peter Oskolkov <posk@posk.io>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Brendan Gregg <bgregg@netflix.com>
      Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      502bd151
  7. 04 June 2019: 1 commit
  8. 06 April 2019: 1 commit
  9. 20 December 2018: 1 commit
  10. 06 December 2018: 1 commit
    • sched/smt: Expose sched_smt_present static key · 340693ee
      Authored by Thomas Gleixner
      commit 321a874a upstream
      
      Make the scheduler's 'sched_smt_present' static key globally
      available, so it can be used in the x86 speculation control code.
      
      Provide a query function and a stub for the CONFIG_SMP=n case.
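
      A sketch of the resulting interface (modeled on
      include/linux/sched/smt.h; the exact config guard in the header is
      approximate):

        #ifdef CONFIG_SCHED_SMT
        extern struct static_key_false sched_smt_present;

        static __always_inline bool sched_smt_active(void)
        {
                return static_branch_likely(&sched_smt_present);
        }
        #else
        static inline bool sched_smt_active(void) { return false; }
        #endif
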
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185004.430168326@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      340693ee
  11. 11 October 2018: 1 commit
    • sched/fair: Fix throttle_list starvation with low CFS quota · baa9be4f
      Authored by Phil Auld
      With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
      distribute_cfs_runtime may not empty the throttled_list before it runs
      out of runtime to distribute. In that case, due to the change from
      c06f04c7 to put throttled entries at the head of the list, later entries
      on the list will starve. Essentially, the same X processes get
      pulled off the list, are given CPU time, and then, when expired,
      are put back on the head of the list, where distribute_cfs_runtime
      gives runtime to the same set of processes and leaves out the rest.
      
      Fix the issue by setting a bit in struct cfs_bandwidth when
      distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
      decide to put the throttled entry on the tail or the head of the list.  The
      bit is set/cleared by the callers of distribute_cfs_runtime while they hold
      cfs_bandwidth->lock.
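
      A sketch of the mechanism (field and list names follow
      kernel/sched/fair.c; the exact code may differ):

        /* in throttle_cfs_rq(), with cfs_b->lock held (sketch) */
        if (cfs_b->distribute_running)
                /* a distribution pass is in flight: add at the head so
                 * the already-started pass does not see us */
                list_add_rcu(&cfs_rq->throttled_list,
                             &cfs_b->throttled_cfs_rq);
        else
                /* otherwise add at the tail so later rqs don't starve */
                list_add_tail_rcu(&cfs_rq->throttled_list,
                                  &cfs_b->throttled_cfs_rq);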
      
      This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
      the live system. In some cases you can simply look at the throttled list and
      see the later entries are not changing:
      
        crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
          1     ffff90b56cb2d200  -976050
          2     ffff90b56cb2cc00  -484925
          3     ffff90b56cb2bc00  -658814
          4     ffff90b56cb2ba00  -275365
          5     ffff90b166a45600  -135138
          6     ffff90b56cb2da00  -282505
          7     ffff90b56cb2e000  -148065
          8     ffff90b56cb2fa00  -872591
          9     ffff90b56cb2c000  -84687
         10     ffff90b56cb2f000  -87237
         11     ffff90b166a40a00  -164582
      
        crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
          1     ffff90b56cb2d200  -994147
          2     ffff90b56cb2cc00  -306051
          3     ffff90b56cb2bc00  -961321
          4     ffff90b56cb2ba00  -24490
          5     ffff90b166a45600  -135138
          6     ffff90b56cb2da00  -282505
          7     ffff90b56cb2e000  -148065
          8     ffff90b56cb2fa00  -872591
          9     ffff90b56cb2c000  -84687
         10     ffff90b56cb2f000  -87237
         11     ffff90b166a40a00  -164582
      
      Sometimes it is easier to see by finding a process getting starved and looking
      at the sched_info:
      
        crash> task ffff8eb765994500 sched_info
        PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
          sched_info = {
            pcount = 8,
            run_delay = 697094208,
            last_arrival = 240260125039,
            last_queued = 240260327513
          },
        crash> task ffff8eb765994500 sched_info
        PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
          sched_info = {
            pcount = 8,
            run_delay = 697094208,
            last_arrival = 240260125039,
            last_queued = 240260327513
          },
      Signed-off-by: Phil Auld <pauld@redhat.com>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Fixes: c06f04c7 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
      Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      baa9be4f
  12. 02 October 2018: 2 commits
    • sched/numa: Pass destination CPU as a parameter to migrate_task_rq · 1327237a
      Authored by Srikar Dronamraju
      This additional parameter (new_cpu) is used later for identifying if
      task migration is across nodes.
      
      No functional change.
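
      A sketch of the interface change (per the commit title; details
      approximate):

        /* the sched_class hook gains the destination CPU (sketch) */
        void (*migrate_task_rq)(struct task_struct *p, int new_cpu);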
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     203353  200668   -1.32036
      1     328205  321791   -1.95427
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev    Current  %Change
      1     214384  204848   -4.44809
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     188553  188098   -0.241311
      1     196273  200351   2.07772
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     57581.2  58145.9  0.980702
      1     103468   103798   0.318939
      
      Brings out the variance between different specjbb2005 runs.
      
      Some events stats before and after applying the patch.
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,941,377      13,912,183
      migrations                1,157,323       1,155,931
      faults                    382,175         367,139
      cache-misses              54,993,823,500  54,240,196,814
      sched:sched_move_numa     2,005           1,571
      sched:sched_stick_numa    14              9
      sched:sched_swap_numa     529             463
      migrate:mm_migrate_pages  1,573           703
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        67099   50155
      numa_hint_faults_local  58456   45264
      numa_hit                240416  239652
      numa_huge_pte_updates   18      36
      numa_interleave         65      68
      numa_local              240339  239576
      numa_other              77      76
      numa_pages_migrated     1574    680
      numa_pte_updates        77182   71146
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,176,453       3,156,720
      migrations                30,238          30,354
      faults                    87,869          97,261
      cache-misses              12,544,479,391  12,400,026,826
      sched:sched_move_numa     23              4
      sched:sched_stick_numa    0               0
      sched:sched_swap_numa     6               1
      migrate:mm_migrate_pages  10              20
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        236     272
      numa_hint_faults_local  201     186
      numa_hit                72293   71362
      numa_huge_pte_updates   0       0
      numa_interleave         26      23
      numa_local              72233   71299
      numa_other              60      63
      numa_pages_migrated     8       2
      numa_pte_updates        0       0
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,478,820    8,606,824
      migrations                171,323      155,352
      faults                    307,499      301,409
      cache-misses              240,353,599  157,759,224
      sched:sched_move_numa     214          168
      sched:sched_stick_numa    0            0
      sched:sched_swap_numa     4            3
      migrate:mm_migrate_pages  89           125
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        5301    4650
      numa_hint_faults_local  4745    3946
      numa_hit                92943   90489
      numa_huge_pte_updates   0       0
      numa_interleave         899     892
      numa_local              92345   90034
      numa_other              598     455
      numa_pages_migrated     88      124
      numa_pte_updates        5505    4818
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before      After
      cs                        2,066,172   2,113,167
      migrations                11,076      10,533
      faults                    149,544     142,727
      cache-misses              10,398,067  5,594,192
      sched:sched_move_numa     43          10
      sched:sched_stick_numa    0           0
      sched:sched_swap_numa     0           0
      migrate:mm_migrate_pages  6           6
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        3552    744
      numa_hint_faults_local  3347    584
      numa_hit                25611   25551
      numa_huge_pte_updates   0       0
      numa_interleave         213     263
      numa_local              25583   25302
      numa_other              28      249
      numa_pages_migrated     6       6
      numa_pte_updates        3535    744
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        99,358,136       101,227,352
      migrations                4,041,607        4,151,829
      faults                    749,653          745,233
      cache-misses              225,562,543,251  224,669,561,766
      sched:sched_move_numa     771              617
      sched:sched_stick_numa    14               2
      sched:sched_swap_numa     204              187
      migrate:mm_migrate_pages  1,180            316
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        27409   24195
      numa_hint_faults_local  20677   21639
      numa_hit                239988  238331
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              239983  238331
      numa_other              5       0
      numa_pages_migrated     1016    204
      numa_pte_updates        27916   24561
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        60,899,307      62,738,978
      migrations                544,668         562,702
      faults                    270,834         228,465
      cache-misses              74,543,455,635  75,778,067,952
      sched:sched_move_numa     735             648
      sched:sched_stick_numa    25              13
      sched:sched_swap_numa     174             137
      migrate:mm_migrate_pages  816             733
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        11059   10281
      numa_hint_faults_local  4733    3242
      numa_hit                41384   36338
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              41383   36338
      numa_other              1       0
      numa_pages_migrated     815     706
      numa_pte_updates        11323   10176
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-3-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1327237a
    • sched/numa: Stop multiple tasks from moving to the CPU at the same time · a4739eca
      Authored by Srikar Dronamraju
      Task migration under NUMA balancing can happen in parallel. More than
      one task might choose to migrate to the same CPU at the same time. This
      can result in:
      
      - During task swap, choosing a task that was not part of the evaluation.
      - During task swap, a task which has just moved to its preferred
        node may move again to a completely different node.
      - During task swap, a task failing to move to its preferred node
        has to wait an extra interval for the next migrate opportunity.
      - During task movement, multiple task movements can cause load imbalance.
      
      This problem is more likely if there are more cores per node or more
      nodes in the system.
      
      Use a per-run-queue variable to check whether NUMA balancing is
      active on the run-queue.
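
      A sketch of the guard (the numa_migrate_on field follows the
      upstream change; treat details as approximate):

        /* task_numa_assign() path in kernel/sched/fair.c (sketch) */
        struct rq *rq = cpu_rq(env->dst_cpu);

        /* claim the destination; bail out if another NUMA migration
         * already targets this CPU */
        if (xchg(&rq->numa_migrate_on, 1))
                return;
        /* ... move/swap, then clear rq->numa_migrate_on ... */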
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     200194  203353   1.57797
      1     311331  328205   5.41995
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev    Current  %Change
      1     197654  214384   8.46429
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     192605  188553   -2.10379
      1     213402  196273   -8.02664
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     52227.1  57581.2  10.2516
      1     102529   103468   0.915838
      
      There is a regression on the Power9 box. Looking at the details,
      that box shows a sudden jump in cache-misses with this patch; all
      other parameters point towards NUMA consolidation.
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,345,784      13,941,377
      migrations                1,127,820       1,157,323
      faults                    374,736         382,175
      cache-misses              55,132,054,603  54,993,823,500
      sched:sched_move_numa     1,923           2,005
      sched:sched_stick_numa    52              14
      sched:sched_swap_numa     595             529
      migrate:mm_migrate_pages  1,932           1,573
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        60605   67099
      numa_hint_faults_local  51804   58456
      numa_hit                239945  240416
      numa_huge_pte_updates   14      18
      numa_interleave         60      65
      numa_local              239865  240339
      numa_other              80      77
      numa_pages_migrated     1931    1574
      numa_pte_updates        67823   77182
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,016,467       3,176,453
      migrations                37,326          30,238
      faults                    115,342         87,869
      cache-misses              11,692,155,554  12,544,479,391
      sched:sched_move_numa     965             23
      sched:sched_stick_numa    8               0
      sched:sched_swap_numa     35              6
      migrate:mm_migrate_pages  1,168           10
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        16286   236
      numa_hint_faults_local  11863   201
      numa_hit                112482  72293
      numa_huge_pte_updates   33      0
      numa_interleave         20      26
      numa_local              112419  72233
      numa_other              63      60
      numa_pages_migrated     1144    8
      numa_pte_updates        32859   0
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,629,724    8,478,820
      migrations                221,052      171,323
      faults                    308,661      307,499
      cache-misses              135,574,913  240,353,599
      sched:sched_move_numa     147          214
      sched:sched_stick_numa    0            0
      sched:sched_swap_numa     2            4
      migrate:mm_migrate_pages  64           89
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        11481   5301
      numa_hint_faults_local  10968   4745
      numa_hit                89773   92943
      numa_huge_pte_updates   0       0
      numa_interleave         1116    899
      numa_local              89220   92345
      numa_other              553     598
      numa_pages_migrated     62      88
      numa_pte_updates        11694   5505
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before     After
      cs                        2,272,887  2,066,172
      migrations                12,206     11,076
      faults                    163,704    149,544
      cache-misses              4,801,186  10,398,067
      sched:sched_move_numa     44         43
      sched:sched_stick_numa    0          0
      sched:sched_swap_numa     0          0
      migrate:mm_migrate_pages  17         6
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        2261    3552
      numa_hint_faults_local  1993    3347
      numa_hit                25726   25611
      numa_huge_pte_updates   0       0
      numa_interleave         239     213
      numa_local              25498   25583
      numa_other              228     28
      numa_pages_migrated     17      6
      numa_pte_updates        2266    3535
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        117,980,962      99,358,136
      migrations                3,950,220        4,041,607
      faults                    736,979          749,653
      cache-misses              224,976,072,879  225,562,543,251
      sched:sched_move_numa     504              771
      sched:sched_stick_numa    50               14
      sched:sched_swap_numa     239              204
      migrate:mm_migrate_pages  1,260            1,180
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        18293   27409
      numa_hint_faults_local  11969   20677
      numa_hit                240854  239988
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              240851  239983
      numa_other              3       5
      numa_pages_migrated     1190    1016
      numa_pte_updates        18106   27916
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        61,053,158      60,899,307
      migrations                551,586         544,668
      faults                    244,174         270,834
      cache-misses              74,326,766,973  74,543,455,635
      sched:sched_move_numa     344             735
      sched:sched_stick_numa    24              25
      sched:sched_swap_numa     140             174
      migrate:mm_migrate_pages  568             816
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        6461    11059
      numa_hint_faults_local  2283    4733
      numa_hit                35661   41384
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              35661   41383
      numa_other              0       1
      numa_pages_migrated     568     815
      numa_pte_updates        6518    11323
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-2-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a4739eca
  13. 25 July 2018: 2 commits
  14. 16 July 2018: 9 commits
  15. 03 July 2018: 2 commits
    • sched/fair: Fix bandwidth timer clock drift condition · 512ac999
      Authored by Xunlei Pang
      I noticed that cgroup task groups constantly get throttled even
      if they have low CPU usage; this causes jitter in the response
      time of some of our business containers when CPU quotas are
      enabled.
      
      It's very simple to reproduce:
      
        mkdir /sys/fs/cgroup/cpu/test
        cd /sys/fs/cgroup/cpu/test
        echo 100000 > cpu.cfs_quota_us
        echo $$ > tasks
      
      then repeat:
      
        cat cpu.stat | grep nr_throttled  # nr_throttled will increase steadily
      
      After some analysis, we found that cfs_rq::runtime_remaining will
      be cleared by expire_cfs_rq_runtime() due to two equal but stale
      "cfs_{b|q}->runtime_expires" after period timer is re-armed.
      
      The current condition used to judge clock drift in
      expire_cfs_rq_runtime() is wrong: the two runtime_expires are
      actually the same when clock drift happens, so this condition can
      never hit. The original design was
      correctly done by this commit:
      
        a9cf55b2 ("sched: Expire invalid runtime")
      
      ... but was changed to be the current implementation due to its locking bug.
      
      This patch introduces another way, it adds a new field in both structures
      cfs_rq and cfs_bandwidth to record the expiration update sequence, and
      uses them to figure out if clock drift happens (true if they are equal).
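
      A sketch of the new check (the expires_seq field follows the
      upstream change; approximate):

        /* expire_cfs_rq_runtime() (sketch) */
        if (cfs_rq->expires_seq == cfs_b->expires_seq) {
                /* same period, so the difference is clock drift: extend
                 * the local deadline, drift is bounded above by 2 ticks */
                cfs_rq->runtime_expires += TICK_NSEC;
        } else {
                /* a new period has started: the runtime really expired */
                cfs_rq->runtime_remaining = 0;
        }
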
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 51f2176d ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
      Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlpang@linux.alibaba.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      512ac999
    • sched/rt: Fix call to cpufreq_update_util() · 296b2ffe
      Authored by Vincent Guittot
      With commit:
      
        8f111bc3 ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
      
      the schedutil governor uses rq->rt.rt_nr_running to detect whether an
      RT task is currently running on the CPU and to set frequency to max
      if necessary.
      
      cpufreq_update_util() is called in enqueue/dequeue_top_rt_rq() but
      rq->rt.rt_nr_running has not been updated yet when dequeue_top_rt_rq() is
      called so schedutil still considers that an RT task is running when the
      last task is dequeued. The update of rq->rt.rt_nr_running happens later
      in dequeue_rt_stack().
      
      In fact, we can take advantage of the fact that rt entities are
      dequeued and then re-enqueued whenever an rt task is enqueued or
      dequeued. As a result, enqueue_top_rt_rq() is always called when a
      task is enqueued or dequeued, and also when groups are throttled
      or unthrottled. The only place that does not use
      enqueue_top_rt_rq() is when the root rt_rq is throttled.
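
      A sketch of the resulting call placement (illustrative; the patch
      also handles the throttled-root case separately):

        static void dequeue_top_rt_rq(struct rt_rq *rt_rq)
        {
                /* ... no cpufreq_update_util() here:
                 * rt_nr_running is not final yet ... */
        }

        static void enqueue_top_rt_rq(struct rt_rq *rt_rq)
        {
                /* ... rt_nr_running now reflects reality, so let
                 * schedutil re-evaluate the frequency ... */
                cpufreq_update_util(rq_of_rt_rq(rt_rq), 0);
        }
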
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: efault@gmx.de
      Cc: juri.lelli@redhat.com
      Cc: patrick.bellasi@arm.com
      Cc: viresh.kumar@linaro.org
      Fixes: 8f111bc3 ('cpufreq/schedutil: Rewrite CPUFREQ_RT support')
      Link: http://lkml.kernel.org/r/1530021202-21695-1-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      296b2ffe
  16. 31 May 2018: 1 commit
  17. 18 May 2018: 1 commit
  18. 14 May 2018: 1 commit
    • sched/numa: Stagger NUMA balancing scan periods for new threads · 13784475
      Authored by Mel Gorman
      Threads share an address space and each can change the protections of the
      same address space to trap NUMA faults. This is redundant and potentially
      counter-productive as any thread doing the update will suffice. Potentially
      only one thread is required but that thread may be idle or it may not have
      any locality concerns and pick an unsuitable scan rate.
      
      This patch keeps the scan periods independent, but staggers them
      based on the number of address-space users when a thread is
      created. The intent is that threads avoid scanning at the same
      time and get a chance to adapt their scan rate later if necessary.
      This reduces total scan activity early in the threads' lifetime.
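
      A simplified sketch of the idea at thread creation (illustrative
      only; the helper below is hypothetical and the actual delay
      formula in the patch differs):

        /* stagger the first NUMA scan in proportion to the number of
         * existing address-space users, capped at the max scan period
         * (illustrative sketch, not the exact patch) */
        if (mm) {
                unsigned int delay;

                delay = min(task_scan_max(current),
                            current->numa_scan_period * mm_users);
                defer_first_numa_scan(p, delay); /* hypothetical helper */
        }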
      
      The difference in headline performance across a range of machines
      and workloads is marginal, but system CPU usage is reduced, as is
      overall scan activity. The following is the time reported by the
      NAS Parallel Benchmark using unbound openmp threads and a D size
      class:
      
      			      4.17.0-rc1             4.17.0-rc1
      				 vanilla           stagger-v1r1
      	Time bt.D      442.77 (   0.00%)      419.70 (   5.21%)
      	Time cg.D      171.90 (   0.00%)      180.85 (  -5.21%)
      	Time ep.D       33.10 (   0.00%)       32.90 (   0.60%)
      	Time is.D        9.59 (   0.00%)        9.42 (   1.77%)
      	Time lu.D      306.75 (   0.00%)      304.65 (   0.68%)
      	Time mg.D       54.56 (   0.00%)       52.38 (   4.00%)
      	Time sp.D     1020.03 (   0.00%)      903.77 (  11.40%)
      	Time ua.D      400.58 (   0.00%)      386.49 (   3.52%)
      
      Note it is not a universal win; we have no prior knowledge of
      which thread matters, and the number of threads created often
      exceeds the size of the node when the threads are not bound.
      However, there is a reduction in overall system CPU usage:
      
      				    4.17.0-rc1             4.17.0-rc1
      				       vanilla           stagger-v1r1
      	sys-time-bt.D         48.78 (   0.00%)       48.22 (   1.15%)
      	sys-time-cg.D         25.31 (   0.00%)       26.63 (  -5.22%)
      	sys-time-ep.D          1.65 (   0.00%)        0.62 (  62.42%)
      	sys-time-is.D         40.05 (   0.00%)       24.45 (  38.95%)
      	sys-time-lu.D         37.55 (   0.00%)       29.02 (  22.72%)
      	sys-time-mg.D         47.52 (   0.00%)       34.92 (  26.52%)
      	sys-time-sp.D        119.01 (   0.00%)      109.05 (   8.37%)
      	sys-time-ua.D         51.52 (   0.00%)       45.13 (  12.40%)
      
      NUMA scan activity is also reduced:
      
      	NUMA alloc local               1042828     1342670
      	NUMA base PTE updates        140481138    93577468
      	NUMA huge PMD updates           272171      180766
      	NUMA page range updates      279832690   186129660
      	NUMA hint faults               1395972     1193897
      	NUMA hint local faults          877925      855053
      	NUMA hint local percent             62          71
      	NUMA pages migrated           12057909     9158023
      
      Similar observations are made for other thread-intensive workloads. System
      CPU usage is lower even though the headline gains in performance tend to be
      small. For example, specjbb 2005 shows almost no difference in performance
      but scan activity is reduced by a third on a 4-socket box. I didn't find
      a workload (thread intensive or otherwise) that suffered badly.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180504154109.mvrha2qo5wdl65vr@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      13784475
  19. 05 April 2018: 1 commit
  20. 20 March 2018: 1 commit
    • sched/cpufreq/schedutil: Use util_est for OPP selection · a07630b8
      Authored by Patrick Bellasi
      When schedutil looks at the CPU utilization, the current PELT value for
      that CPU is returned straight away. In certain scenarios this can have
      undesired side effects and delays on frequency selection.
      
      For example, since the task utilization is decayed at wakeup time, a
      long sleeping big task newly enqueued does not add immediately a
      significant contribution to the target CPU. This introduces some latency
      before schedutil will be able to detect the best frequency required by
      that task.
      
      Moreover, the PELT signal build-up time is a function of the current
      frequency, because of the scale invariant load tracking support. Thus,
      starting from a lower frequency, the utilization build-up time will
      increase even more and further delays the selection of the actual
      frequency which better serves the task requirements.
      
      To reduce these kinds of latencies, we integrate the CPU's
      estimated utilization into the sugov_get_util() function.
      
      This allows us to properly consider the expected utilization of a
      CPU which, for example, has just got a big task running after a
      long sleep period. Ultimately this allows selecting the best
      frequency to run a task right after its wake-up.
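
      A sketch of the idea (the helper name here is hypothetical; field
      names follow struct sched_avg):

        /* schedutil's CFS utilization input (illustrative sketch) */
        static unsigned long sugov_cfs_util(struct rq *rq)
        {
                /* take the max of the decayed PELT signal and the
                 * estimated utilization, so a freshly woken big task is
                 * accounted for before its PELT signal rebuilds */
                return max_t(unsigned long,
                             READ_ONCE(rq->cfs.avg.util_avg),
                             READ_ONCE(rq->cfs.avg.util_est.enqueued));
        }
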
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@android.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/20180309095245.11071-4-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a07630b8
  21. 09 March 2018: 2 commits