  1. 02 Sep, 2020 15 commits
  2. 29 Jun, 2020 4 commits
  3. 24 Jun, 2020 14 commits
    • sched/fair: Remove sgs->sum_weighted_load · 980d06cf
      Committed by Dietmar Eggemann
      to #28739709
      
      commit af75d1a9a9f75bf030c2f35705f1ff6d226f96fe upstream
      
      Since sg_lb_stats::sum_weighted_load is now identical with
      sg_lb_stats::group_load remove it and replace its use case
      (calculating load per task) with the latter.

      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Rik van Riel <riel@surriel.com>
      Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Link: https://lkml.kernel.org/r/20190527062116.11512-7-dietmar.eggemann@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
    • sched/core: Remove sd->*_idx · b21904d4
      Committed by Dietmar Eggemann
      to #28739709
      
      commit 0e1fef63d92d61ed561e504c3a078a827a0f9bfe upstream
      
      The sched domain per rq load index files also disappear from the
      /proc/sys/kernel/sched_domain/cpuX/domainY directories.

      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Rik van Riel <riel@surriel.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lkml.kernel.org/r/20190527062116.11512-6-dietmar.eggemann@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
    • sched/core: Remove rq->cpu_load[] · c3055505
      Committed by Dietmar Eggemann
      to #28739709
      
      commit 55627e3cd22c315c4a02fe3bbbb7234ec439cb1d upstream
      
      The per rq load array values also disappear from the cpu#X sections in
      /proc/sched_debug.

      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Rik van Riel <riel@surriel.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lkml.kernel.org/r/20190527062116.11512-5-dietmar.eggemann@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
    • sched/debug: Remove sd->*_idx range on sysctl · 606eca95
      Committed by Dietmar Eggemann
      to #28739709
      
      commit 3d8d53554405952993bb0279ef3ebebc51740074 upstream
      
      This reverts:
      
        commit 201c373e ("sched/debug: Limit sd->*_idx range on sysctl")
      
      Load indexes (sd->*_idx) are no longer needed without rq->cpu_load[].
      The range check for load indexes can be removed as well. Get rid of it
      before the rq->cpu_load[] since it uses CPU_LOAD_IDX_MAX.
      
      At the same time, fix the following coding style issues detected by
      scripts/checkpatch.pl:
      
        ERROR: space prohibited before that ','
        ERROR: space prohibited before that close parenthesis ')'

      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Rik van Riel <riel@surriel.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lkml.kernel.org/r/20190527062116.11512-4-dietmar.eggemann@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
    • sched/fair: Replace source_load() & target_load() with weighted_cpuload() · d24a239d
      Committed by Dietmar Eggemann
      to #28739709
      
      commit 1c1b8a7b03ef50f80f5d0c871ee261c04a6c967e upstream
      
      With LB_BIAS disabled, source_load() & target_load() return
      weighted_cpuload(). Replace both with calls to weighted_cpuload().
      
      The function to obtain the load index (sd->*_idx) for an sd,
      get_sd_load_idx(), can be removed as well.
      
      Finally, get rid of the sched feature LB_BIAS.

      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Rik van Riel <riel@surriel.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lkml.kernel.org/r/20190527062116.11512-3-dietmar.eggemann@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
    • sched/fair: Remove the rq->cpu_load[] update code · 112598d6
      Committed by Dietmar Eggemann
      to #28739709
      
      commit 5e83eafbfd3b351537c0d74467fc43e8a88f4ae4 upstream
      
      With LB_BIAS disabled, there is no need to update the rq->cpu_load[idx]
      any more.

      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Rik van Riel <riel@surriel.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Valentin Schneider <valentin.schneider@arm.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lkml.kernel.org/r/20190527062116.11512-2-dietmar.eggemann@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
    • sched/fair: Remove rq->load · be11b02d
      Committed by Dietmar Eggemann
      to #28739709
      
      commit f2bedc4705659216bd60948029ad8dfedf923ad9 upstream
      
      The CFS class is the only one maintaining and using the CPU wide load
      (rq->load(.weight)). The last use case of the CPU wide load in CFS's
      set_next_entity() can be replaced by using the load of the CFS class
      (rq->cfs.load(.weight)) instead.

      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190424084556.604-1-dietmar.eggemann@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
    • cpuidle: menu: Remove get_loadavg() from the performance multiplier · 800bf05d
      Committed by Daniel Lezcano
      to #28739709
      
      commit a7fe5190c03f8137ef08db84a58dd4daf2c4785d upstream
      
      The function get_loadavg() almost always returns zero. To be more
      precise: out of a total of 1023379 passes through the function, the
      load was equal to zero 1020728 times and greater than 100 610 times;
      the remaining samples were between 0 and 5.
      
      In 2011, the get_loadavg() was removed from the Android tree because
      of the above [1]. At this time, the load was:
      
      unsigned long this_cpu_load(void)
      {
              struct rq *this = this_rq();
              return this->cpu_load[0];
      }
      
      In 2014, the code was changed by commit 372ba8cb (cpuidle: menu: Lookup CPU
      runqueues less) and the load is:
      
      void get_iowait_load(unsigned long *nr_waiters, unsigned long *load)
      {
              struct rq *rq = this_rq();
              *nr_waiters = atomic_read(&rq->nr_iowait);
              *load = rq->load.weight;
      }
      
      with the same result.
      
      Both measurements show that using the load in this code path does not
      matter anymore. Remove it.

      [1] https://android.googlesource.com/kernel/common/+/4dedd9f124703207895777ac6e91dacde0f7cc17

      Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
    • sched/fair: Disable LB_BIAS by default · bae52979
      Committed by Dietmar Eggemann
      to #28739709
      
      commit fdf5f315d5cfaefb7bb8a62ec4bf37b9891837aa upstream
      
      LB_BIAS allows adjusting how conservatively load is balanced.
      
      The rq->cpu_load[idx] array is used for this functionality. It contains
      weighted CPU load decayed average values over different intervals
      (idx = 1..4). Idx = 0 is the weighted CPU load itself.
      
      The values are updated during scheduler_tick, before idle balance and at
      nohz exit.
      
      There are 5 different types of idx's per sched domain (sd). Each of them
      is used to index into the rq->cpu_load[idx] array in a specific scenario
      (busy, idle and newidle for load balancing, forkexec for wake-up
      slow-path load balancing and wake for affine wakeup based on weight).
      Only the sd idx's for busy and idle load balancing are set to 2,3 or 1,2
      respectively. All the other sd idx's are set to 0.
      
      Conservative load balancing is achieved for sd idx's >= 1 by using the
      min/max (source_load()/target_load()) value between the current weighted
      CPU load and the rq->cpu_load[sd idx -1] for the busiest(idlest)/local
      CPU load in load balancing or vice versa in the wake-up slow-path load
      balancing.
      There is no conservative balancing for sd idx = 0 since only current
      weighted CPU load is used in this case.
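
The min/max biasing described above can be sketched in a few lines. This is a simplified illustration, not the kernel code: the function names mirror source_load() and target_load(), while the list-based `cpu_load` history, the `lb_bias` parameter, and the sample values are assumptions made for the example.

```python
def source_load(cpu_load, current_load, sd_idx, lb_bias=True):
    """Load estimate for the CPU that tasks would be pulled from (biased low)."""
    if sd_idx == 0 or not lb_bias:
        # No conservative balancing: only the current weighted load is used.
        return current_load
    return min(cpu_load[sd_idx - 1], current_load)


def target_load(cpu_load, current_load, sd_idx, lb_bias=True):
    """Load estimate for the CPU that tasks would be pushed to (biased high)."""
    if sd_idx == 0 or not lb_bias:
        return current_load
    return max(cpu_load[sd_idx - 1], current_load)


# Hypothetical decayed history; index 0 is the current weighted load.
history = [900, 700, 600, 550, 500]
print(source_load(history, 900, sd_idx=2))  # min(700, 900)
print(target_load(history, 900, sd_idx=2))  # max(700, 900)
```

Underestimating the source and overestimating the target makes a migration look less attractive, which is what makes the balancing conservative. With `lb_bias=False` both helpers collapse to the current weighted load.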
      
      It is very likely that LB_BIAS' influence on load balancing can be
      neglected (see test results below). This is further supported by:
      
      (1) Weighted CPU load today is by itself a decayed average value (PELT)
          (cfs_rq->avg->runnable_load_avg) and not the instantaneous load
          (rq->load.weight) it was when LB_BIAS was introduced.
      
      (2) Sd imbalance_pct is used for CPU_NEWLY_IDLE and CPU_NOT_IDLE (relate
          to sd's newidle and busy idx) in find_busiest_group() when comparing
          busiest and local avg load to make load balancing even more
          conservative.
      
      (3) The sd forkexec and newidle idx are always set to 0 so there is no
          adjustment on how conservatively load balancing is done here.
      
      (4) Affine wakeup based on weight (wake_affine_weight()) will not be
          impacted since the sd wake idx is always set to 0.
      
      Let's disable LB_BIAS by default for a few kernel releases to make sure
      that no workload and no scheduler topology is affected. The benefit of
      being able to remove the LB_BIAS dependency from source_load() and
      target_load() is that the entire rq->cpu_load[idx] code could be removed
      in this case.
      
      It is really hard to say if there is no regression w/o testing this with
      a lot of different workloads on a lot of different platforms, especially
      NUMA machines.
      The following 104 LKP (Linux Kernel Performance) tests were run by the
      0-Day guys mostly on multi-socket hosts with a larger number of logical
      cpus (88, 192).
      The base for the test was commit b3dae109 ("sched/swait: Rename to
      exclusive") (tip/sched/core v4.18-rc1).
      Only 2 out of the 104 tests had a significant change in one of the
      metrics (fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-NoSync-performance +7%
      files_per_sec, unixbench/300s-100%-syscall-performance -11% score).
      Tests which showed a change in one of the metrics are marked with a '*'
      and this change is listed as well.
      
      (a) lkp-bdw-ep3:
            88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz 64G
      
          dd-write/10m-1HDD-cfq-btrfs-100dd-performance
          fsmark/1x-1t-1HDD-xfs-nfsv4-4M-60G-NoSync-performance
        * fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-NoSync-performance
            7.50  +7%  8.00  ±  6%  fsmark.files_per_sec
          fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-fsyncBeforeClose-performance
          fsmark/1x-1t-1HDD-btrfs-4M-60G-NoSync-performance
          fsmark/1x-1t-1HDD-btrfs-4M-60G-fsyncBeforeClose-performance
          kbuild/300s-50%-vmlinux_prereq-performance
          kbuild/300s-200%-vmlinux_prereq-performance
          kbuild/300s-50%-vmlinux_prereq-performance-1HDD-ext4
          kbuild/300s-200%-vmlinux_prereq-performance-1HDD-ext4
      
      (b) lkp-skl-4sp1:
            192 threads Intel(R) Xeon(R) Platinum 8160 768G
      
          dbench/100%-performance
          ebizzy/200%-100x-10s-performance
          hackbench/1600%-process-pipe-performance
          iperf/300s-cs-localhost-tcp-performance
          iperf/300s-cs-localhost-udp-performance
          perf-bench-numa-mem/2t-300M-performance
          perf-bench-sched-pipe/10000000ops-process-performance
          perf-bench-sched-pipe/10000000ops-threads-performance
          schbench/2-16-300-30000-30000-performance
          tbench/100%-cs-localhost-performance
      
      (c) lkp-bdw-ep6:
            88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz 128G
      
          stress-ng/100%-60s-pipe-performance
          unixbench/300s-1-whetstone-double-performance
          unixbench/300s-1-shell1-performance
          unixbench/300s-1-shell8-performance
          unixbench/300s-1-pipe-performance
        * unixbench/300s-1-context1-performance
            312  315  unixbench.score
          unixbench/300s-1-spawn-performance
          unixbench/300s-1-syscall-performance
          unixbench/300s-1-dhry2reg-performance
          unixbench/300s-1-fstime-performance
          unixbench/300s-1-fsbuffer-performance
          unixbench/300s-1-fsdisk-performance
          unixbench/300s-100%-whetstone-double-performance
          unixbench/300s-100%-shell1-performance
          unixbench/300s-100%-shell8-performance
          unixbench/300s-100%-pipe-performance
          unixbench/300s-100%-context1-performance
          unixbench/300s-100%-spawn-performance
        * unixbench/300s-100%-syscall-performance
            3571  ±  3%  -11%  3183  ±  4%  unixbench.score
          unixbench/300s-100%-dhry2reg-performance
          unixbench/300s-100%-fstime-performance
          unixbench/300s-100%-fsbuffer-performance
          unixbench/300s-100%-fsdisk-performance
          unixbench/300s-1-execl-performance
          unixbench/300s-100%-execl-performance
        * will-it-scale/brk1-performance
            365004  360387  will-it-scale.per_thread_ops
        * will-it-scale/dup1-performance
            432401  437596  will-it-scale.per_thread_ops
          will-it-scale/eventfd1-performance
          will-it-scale/futex1-performance
          will-it-scale/futex2-performance
          will-it-scale/futex3-performance
          will-it-scale/futex4-performance
          will-it-scale/getppid1-performance
          will-it-scale/lock1-performance
          will-it-scale/lseek1-performance
          will-it-scale/lseek2-performance
        * will-it-scale/malloc1-performance
            47025  45817  will-it-scale.per_thread_ops
            77499  76529  will-it-scale.per_process_ops
          will-it-scale/malloc2-performance
        * will-it-scale/mmap1-performance
            123399  120815  will-it-scale.per_thread_ops
            152219  149833  will-it-scale.per_process_ops
        * will-it-scale/mmap2-performance
            107327  104714  will-it-scale.per_thread_ops
            136405  133765  will-it-scale.per_process_ops
          will-it-scale/open1-performance
        * will-it-scale/open2-performance
            171570  168805  will-it-scale.per_thread_ops
            532644  526202  will-it-scale.per_process_ops
          will-it-scale/page_fault1-performance
          will-it-scale/page_fault2-performance
          will-it-scale/page_fault3-performance
          will-it-scale/pipe1-performance
          will-it-scale/poll1-performance
        * will-it-scale/poll2-performance
            176134  172848  will-it-scale.per_thread_ops
            281361  275053  will-it-scale.per_process_ops
          will-it-scale/posix_semaphore1-performance
          will-it-scale/pread1-performance
          will-it-scale/pread2-performance
          will-it-scale/pread3-performance
          will-it-scale/pthread_mutex1-performance
          will-it-scale/pthread_mutex2-performance
          will-it-scale/pwrite1-performance
          will-it-scale/pwrite2-performance
          will-it-scale/pwrite3-performance
        * will-it-scale/read1-performance
            1190563  1174833  will-it-scale.per_thread_ops
        * will-it-scale/read2-performance
            1105369  1080427  will-it-scale.per_thread_ops
          will-it-scale/readseek1-performance
        * will-it-scale/readseek2-performance
            261818  259040  will-it-scale.per_thread_ops
          will-it-scale/readseek3-performance
        * will-it-scale/sched_yield-performance
            2408059  2382034  will-it-scale.per_thread_ops
          will-it-scale/signal1-performance
          will-it-scale/unix1-performance
          will-it-scale/unlink1-performance
          will-it-scale/unlink2-performance
        * will-it-scale/write1-performance
            976701  961588  will-it-scale.per_thread_ops
        * will-it-scale/writeseek1-performance
            831898  822448  will-it-scale.per_thread_ops
        * will-it-scale/writeseek2-performance
            228248  225065  will-it-scale.per_thread_ops
        * will-it-scale/writeseek3-performance
            226670  224058  will-it-scale.per_thread_ops
          will-it-scale/context_switch1-performance
          aim7/performance-fork_test-2000
        * aim7/performance-brk_test-3000
            74869  76676  aim7.jobs-per-min
          aim7/performance-disk_cp-3000
          aim7/performance-disk_rd-3000
          aim7/performance-sieve-3000
          aim7/performance-page_test-3000
          aim7/performance-creat-clo-3000
          aim7/performance-mem_rtns_1-8000
          aim7/performance-disk_wrt-8000
          aim7/performance-pipe_cpy-8000
          aim7/performance-ram_copy-8000
      
      (d) lkp-avoton3:
            8 threads Intel(R) Atom(TM) CPU C2750 @ 2.40GHz 16G
      
          netperf/ipv4-900s-200%-cs-localhost-TCP_STREAM-performance

      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Li Zhijian <zhijianx.li@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180809135753.21077-1-dietmar.eggemann@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
    • alinux: sched: Finer grain of sched latency · fa418988
      Committed by Yihao Wu
      to #28739709
      
      Many samples fall between 10ms and 50ms. To display a more informative
      latency distribution, divide 10ms-50ms into 5 parts uniformly.
      
      Example:
      
        $ cat /sys/fs/cgroup/cpuacct/a/cpuacct.wait_latency
      	0-1ms: 	59726433
      	1-4ms: 	167
      	4-7ms: 	0
      	7-10ms: 	0
      	10-20ms: 	5
      	20-30ms: 	0
      	30-40ms: 	3
      	40-50ms: 	0
      	50-100ms: 	0
      	100-500ms: 	0
      	500-1000ms: 	0
      	1000-5000ms: 	0
      	5000-10000ms: 	0
      	>=10000ms: 	0
      	total(ms): 	45554
      	nr: 	59726600
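
As a rough illustration of how a sample maps onto these finer buckets, the sketch below looks up the bucket label for a latency value. The bucket edges are read off the example output; treating each lower edge as inclusive and the `bucket_label` helper itself are assumptions for the example, not the patch's implementation.

```python
from bisect import bisect_right

# Bucket boundaries in ms, taken from the labels in the example output.
EDGES_MS = [1, 4, 7, 10, 20, 30, 40, 50, 100, 500, 1000, 5000, 10000]
LABELS = ["0-1ms", "1-4ms", "4-7ms", "7-10ms", "10-20ms", "20-30ms",
          "30-40ms", "40-50ms", "50-100ms", "100-500ms", "500-1000ms",
          "1000-5000ms", "5000-10000ms", ">=10000ms"]


def bucket_label(latency_ms):
    """Return the histogram bucket a latency sample would be counted in."""
    # bisect_right sends a value equal to an edge into the upper bucket,
    # which matches the ">=10000ms" label for the last bucket.
    return LABELS[bisect_right(EDGES_MS, latency_ms)]


print(bucket_label(15))  # a sample in the newly split 10-50ms range
```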

      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
    • alinux: sched: Add "nr" to sched latency histogram · 2abfd07b
      Committed by Yihao Wu
      to #28739709
      
      A histogram is sometimes not precise enough, because each sample is
      only coarsely accounted to a histogram bar. For some users, average
      latency is more practical.

      This patch adds a "nr" field to the 4 latency histogram interfaces, so

      	lat(avg) = total(ms) / nr

      Compared to a histogram, average latency is also simpler to use as an
      SLI.

      Example:
      
          $ cat /sys/fs/cgroup/cpuacct/a/cpuacct.wait_latency
            0-1ms:  4139
            1-4ms:  317
            4-7ms:  568
            7-10ms:         0
            10-100ms:       42324
            100-500ms:      9131
            500-1000ms:     95
            1000-5000ms:    134
            5000-10000ms:   0
            >=10000ms:      0
            total(ms):      4256455
            nr:      182128
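
A minimal sketch of the lat(avg) calculation, applied to the example output above. The `parse_latency` helper is hypothetical; it only assumes the "key: value" line layout shown in the example.

```python
SAMPLE = """
0-1ms:  4139
1-4ms:  317
4-7ms:  568
7-10ms:         0
10-100ms:       42324
100-500ms:      9131
500-1000ms:     95
1000-5000ms:    134
5000-10000ms:   0
>=10000ms:      0
total(ms):      4256455
nr:      182128
"""


def parse_latency(text):
    """Turn 'key: value' lines into a dict of integer counters."""
    stats = {}
    for line in text.strip().splitlines():
        key, _, value = line.rpartition(":")
        stats[key.strip()] = int(value)
    return stats


stats = parse_latency(SAMPLE)
average_ms = stats["total(ms)"] / stats["nr"]
print(f"{average_ms:.2f} ms")  # prints "23.37 ms"
```

On a live system the text would come from reading the cpuacct.wait_latency file of the cgroup of interest rather than from an embedded string.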

      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
    • alinux: sched: Add cgroup's scheduling latency histograms · 6dbaddaa
      Committed by Yihao Wu
      to #28739709
      
      This patch adds the cpuacct.cgroup_wait_latency interface. It exports the
      histogram of the sched entity's scheduling latency. Unlike wait_latency,
      the sched entity here is a cgroup rather than a task.

      This is useful when tasks are not directly clustered under one cgroup.
      For example:
      
      cgroup1 --- cgroupA --- task1
              --- cgroupB --- task2
      cgroup2 --- cgroupC --- task3
              --- cgroupD --- task4
      
      This is a common cgroup hierarchy used by many applications. With
      cgroup_wait_latency, we can just read from cgroup1 to know aggregated
      wait latency information of task1 and task2.
      
      The interface output format is identical to cpuacct.wait_latency.

      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
    • alinux: sched: Add cgroup-level blocked time histograms · a055ee2c
      Committed by Yihao Wu
      to #28739709
      
      This patch measures the time that tasks in a cpuacct cgroup spend
      blocked. There are two types: blocked on IO, and blocked on anything
      else, such as locks. They are exported in "cpuacct.ioblock_latency" and
      "cpuacct.block_latency" respectively.

      The histogram gives the detailed distribution of the blocked
      durations, and total(ms) gives the percentage of time tasks spent off
      the rq, waiting for resources:

      (Δioblock_latency.total(ms) + Δblock_latency.total(ms)) / Δwall_time
      
      The interface output format is identical to cpuacct.wait_latency.
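
The ratio above can be computed from two readings of each total(ms) counter. The sketch below is only an illustration: the function name, its parameters, and the sample numbers are hypothetical, and it assumes both counters and the wall-clock interval are in milliseconds.

```python
def blocked_fraction(io_before, io_after, blk_before, blk_after, wall_ms):
    """Fraction of a wall-clock interval that tasks spent blocked.

    Each *_before/*_after pair is the total(ms) value of
    cpuacct.ioblock_latency or cpuacct.block_latency, sampled at the
    start and at the end of the interval.
    """
    blocked_ms = (io_after - io_before) + (blk_after - blk_before)
    return blocked_ms / wall_ms


# Hypothetical readings taken 10 seconds apart.
print(blocked_fraction(1000, 1600, 200, 600, wall_ms=10_000))
```

Since the counters accumulate over every task in the cgroup, the result can exceed 1 when several tasks are blocked at the same time.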

      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
    • alinux: sched: Introduce cfs scheduling latency histograms · 76d98609
      Committed by Yihao Wu
      to #28739709
      
      Export wait_latency in "cpuacct.wait_latency", which indicates the
      time that tasks in a cpuacct cgroup wait on a cfs_rq to be scheduled.

      This is like "perf sched", but with lower overhead, so it can be used
      for constant monitoring.

      wait_latency is useful for debugging an application's high response
      time (RT) problems. It can tell whether they are caused by scheduling.
      If they are, loadavg can tell whether the cause is bad scheduling
      behaviour or system overload.

      System admins can also use wait_latency to define an SLA. To ensure the
      SLA is met, there are various ways to decrease wait_latency.

      This feature is disabled by default for performance reasons. It can
      be switched on dynamically by "echo 0 > /proc/cpusli/sched_lat_enable"
      
      Example:
      
        $ cat /sys/fs/cgroup/cpuacct/a/cpuacct.wait_latency
          0-1ms:  4139
          1-4ms:  317
          4-7ms:  568
          7-10ms:         0
          10-100ms:       42324
          100-500ms:      9131
          500-1000ms:     95
          1000-5000ms:    134
          5000-10000ms:   0
          >=10000ms:      0
          total(ms):      4256455

      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
  4. 23 Jun, 2020 2 commits
    • alinux: sched: Add switch for scheduler_tick load tracking · bcaf8afd
      Committed by Yihao Wu
      to #28739709
      
      Assume workloads are composed of a massive number of short tasks. Then
      periodic load tracking is unnecessary, because load tracking is already
      guaranteed by the frequent sleeps and wake-ups.

      If these short tasks run in their individual cgroups, the load
      tracking becomes extremely heavy.

      This patch adds a switch to bypass scheduler_tick load tracking, in
      order to reduce scheduler overhead without sacrificing much balance
      in this scenario.
      
      Performance Tests:
      
      1) 1100+ tasks in their individual cgroups, on a 96-HT Skylake machine
      
      	sched overhead(each HT): 0.74% -> 0.48%
      
      	(This test's baseline is from the previous patch)
      
      2) sysbench-threads with 96 threads, running for 5min
      
      	latency_ms 95th: 63.07 -> 54.01
      
      Besides these, no regression is found on our test platform.

      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
    • alinux: sched: Add switch for update_blocked_averages · bb48b716
      Committed by Yihao Wu
      to #28739709
      
      Unless the workloads are IO-bound, update_blocked_averages doesn't help
      load balancing. This patch adds a switch to bypass update_blocked_averages
      when prior knowledge about the workloads indicates IO is negligible.
      
      Performance Tests:
      
      1) 1100+ tasks in their individual cgroups, on a 96-HT Skylake machine
      
      	sched overhead(each HT): 3.78% -> 0.74%
      
      2) cgroup-overhead benchmark in our sched-test suite on a 96-HT Skylake
      
      	overhead: 21.06 -> 18.08
      
      3) unixbench context1 with 96 threads running for 1min
      
      	Score: 15409.40 -> 16821.77
      
      Besides these, UnixBench shows some performance ups and downs, but
      overall its performance is unchanged.

      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
  5. 22 Jun, 2020 1 commit
    • alinux: sched: Fix %steal in cpuacct.proc_stat in guest OS · 1c5ab7a7
      Committed by Yihao Wu
      to #28143829
      
      rq_clock_task is less than rq_clock when in a VM, or when IRQ_TIME_ACCOUNTING
      is on, so the two are not comparable when accounting elapsed time. This bug
      has not been observed on a host yet, because neither of these two conditions
      is met there.

      Use rq_clock at both the beginning and the end of exec_start_raw accumulation
      to fix this bug, because we expect steal% in cpuacct.proc_stat of a VM's
      cgroups to reflect the CPU time the host steals from the guest.
      
      Fixes: c7552980 ("alinux: sched: Introduce per-cgroup steal accounting")
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
  6. 09 Jun, 2020 4 commits
    • psi: Move PF_MEMSTALL out of task->flags · e7b88a8a
      Committed by Yafang Shao
      task #28327019
      
      commit 1066d1b6974e095d5a6c472ad9180a957b496cd6 upstream
      
      The task->flags is a 32-bit flag, in which 31 bits have already been
      consumed, so it is hard to introduce another new per-process flag.
      Currently there is still enough space in the bit-field section of
      task_struct, so we can define the memstall state as a single bit in
      task_struct instead.
      This patch also removes an out-of-date comment pointed out by Matthew.

      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Link: https://lkml.kernel.org/r/1584408485-1921-1-git-send-email-laoar.shao@gmail.com
      Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • psi: Optimize switching tasks inside shared cgroups · 0e5c5cd8
      Committed by Johannes Weiner
      task #28327019
      
      commit 36b238d5717279163859fb6ba0f4360abcafab83 upstream
      
      When switching tasks running on a CPU, the psi state of a cgroup
      containing both of these tasks does not change. Right now, we don't
      exploit that, and can perform many unnecessary state changes in nested
      hierarchies, especially when most activity comes from one leaf cgroup.
      
      This patch implements an optimization where we only update cgroups
      whose state actually changes during a task switch. These are all
      cgroups that contain one task but not the other, up to the first
      shared ancestor. When both tasks are in the same group, we don't need
      to update anything at all.
      
      We can identify the first shared ancestor by walking the groups of the
      incoming task until we see TSK_ONCPU set on the local CPU; that's the
      first group that also contains the outgoing task.
      
      The new psi_task_switch() is similar to psi_task_change(). To allow
      code reuse, move the task flag maintenance code into a new function
      and the poll/avg worker wakeups into the shared psi_group_change().
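
The ancestor walk described above can be sketched as a small model. Everything here is hypothetical (struct demo_group, its fields, demo_task_switch()); it is not the kernel's psi_task_switch(). It assumes that on entry every group containing the outgoing task already has its on-CPU marker set.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of one CPU's view of a cgroup hierarchy. These are
 * not the kernel's structures; "oncpu" stands in for TSK_ONCPU being set
 * for the group on the local CPU.
 */
struct demo_group {
	struct demo_group *parent;	/* NULL at the root */
	int oncpu;			/* 1 if the running task is inside this group */
	int updates;			/* state changes applied, for illustration */
};

/* Returns how many groups had to be updated for the switch prev -> next. */
static int demo_task_switch(struct demo_group *prev, struct demo_group *next)
{
	struct demo_group *g, *common = NULL;
	int touched = 0;

	assert(prev && next);

	/* Walk up from the incoming task until a group is already on-CPU. */
	for (g = next; g; g = g->parent) {
		if (g->oncpu) {
			common = g;	/* first group that also contains prev */
			break;
		}
		g->oncpu = 1;
		g->updates++;
		touched++;
	}

	/* Clear the outgoing task's groups below the shared ancestor. */
	for (g = prev; g && g != common; g = g->parent) {
		g->oncpu = 0;
		g->updates++;
		touched++;
	}

	return touched;
}
```

With two sibling leaf groups under a common parent, a switch between them touches only the two leaves; the parent and the root are left alone. A switch within one group touches nothing.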
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200316191333.115523-3-hannes@cmpxchg.org
      Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • J
      sched/psi: Fix sampling error and rare div0 crashes with cgroups and high uptime · 837b1ac1
      Johannes Weiner authored
      task #28327019
      
      commit 3dfbe25c27eab7c90c8a7e97b4c354a9d24dd985 upstream
      
      Jingfeng reports rare div0 crashes in psi on systems with some uptime:
      
      [58914.066423] divide error: 0000 [#1] SMP
      [58914.070416] Modules linked in: ipmi_poweroff ipmi_watchdog toa overlay fuse tcp_diag inet_diag binfmt_misc aisqos(O) aisqos_hotfixes(O)
      [58914.083158] CPU: 94 PID: 140364 Comm: kworker/94:2 Tainted: G W OE K 4.9.151-015.ali3000.alios7.x86_64 #1
      [58914.093722] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.23.34 02/14/2019
      [58914.102728] Workqueue: events psi_update_work
      [58914.107258] task: ffff8879da83c280 task.stack: ffffc90059dcc000
      [58914.113336] RIP: 0010:[] [] psi_update_stats+0x1c1/0x330
      [58914.122183] RSP: 0018:ffffc90059dcfd60 EFLAGS: 00010246
      [58914.127650] RAX: 0000000000000000 RBX: ffff8858fe98be50 RCX: 000000007744d640
      [58914.134947] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00003594f700648e
      [58914.142243] RBP: ffffc90059dcfdf8 R08: 0000359500000000 R09: 0000000000000000
      [58914.149538] R10: 0000000000000000 R11: 0000000000000000 R12: 0000359500000000
      [58914.156837] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8858fe98bd78
      [58914.164136] FS: 0000000000000000(0000) GS:ffff887f7f380000(0000) knlGS:0000000000000000
      [58914.172529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [58914.178467] CR2: 00007f2240452090 CR3: 0000005d5d258000 CR4: 00000000007606f0
      [58914.185765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [58914.193061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [58914.200360] PKRU: 55555554
      [58914.203221] Stack:
      [58914.205383] ffff8858fe98bd48 00000000000002f0 0000002e81036d09 ffffc90059dcfde8
      [58914.213168] ffff8858fe98bec8 0000000000000000 0000000000000000 0000000000000000
      [58914.220951] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
      [58914.228734] Call Trace:
      [58914.231337] [] psi_update_work+0x22/0x60
      [58914.237067] [] process_one_work+0x189/0x420
      [58914.243063] [] worker_thread+0x4e/0x4b0
      [58914.248701] [] ? process_one_work+0x420/0x420
      [58914.254869] [] kthread+0xe6/0x100
      [58914.259994] [] ? kthread_park+0x60/0x60
      [58914.265640] [] ret_from_fork+0x39/0x50
      [58914.271193] Code: 41 29 c3 4d 39 dc 4d 0f 42 dc <49> f7 f1 48 8b 13 48 89 c7 48 c1
      [58914.279691] RIP [] psi_update_stats+0x1c1/0x330
      
      The crashing instruction is trying to divide the observed stall time
      by the sampling period. The period, stored in R8, is not 0, but we are
      dividing by the lower 32 bits only, which are all 0 in this instance.
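
The truncation can be reproduced in a few lines of userspace C. The helper names are hypothetical, and the assert() merely stands in for the hardware divide fault; the value 0x0000359500000000 is the R8 content from the trace above.

```c
#include <assert.h>
#include <stdint.h>

/* Low 32 bits of a 64-bit period: what a 32-bit divisor would hold. */
static inline uint32_t demo_low32(uint64_t period)
{
	return (uint32_t)period;
}

/* True when the period is non-zero but its truncated form is zero. */
static inline int demo_trunc_div_is_unsafe(uint64_t period)
{
	return period != 0 && demo_low32(period) == 0;
}

/*
 * Divide a stall time by the truncated period. The assert stands in for
 * the divide error that the CPU raises when the divisor is zero.
 */
static inline uint64_t demo_div_by_low32(uint64_t stall, uint64_t period)
{
	uint32_t divisor = demo_low32(period);

	assert(divisor != 0);
	return stall / divisor;
}
```

A period of about 2s in nanoseconds (2000000000) fits in 32 bits and divides normally, while 0x0000359500000000 truncates to zero.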
      
      We could switch to a 64-bit division, but the period shouldn't be that
      big in the first place. It's the time between the last update and the
      next scheduled one, and so should always be around 2s and comfortably
      fit into 32 bits.
      
      The bug is in the initialization of new cgroups: we schedule the first
      sampling event in a cgroup as an offset of sched_clock(), but fail to
      initialize the last_update timestamp, and it defaults to 0. That
      results in a bogusly large sampling period the first time we run the
      sampling code, and consequently we underreport pressure for the first
      2s of a cgroup's life. But worse, if sched_clock() is sufficiently
      advanced on the system, and the user gets unlucky, the period's lower
      32 bits can all be 0 and the sampling division will crash.
      
      Fix this by initializing the last update timestamp to the creation
      time of the cgroup, thus correctly marking the start of the first
      pressure sampling period in a new cgroup.
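
A minimal sketch of the initialization problem and its fix follows. It assumes a 2s sampling interval and takes the period as the distance between the last update and the next scheduled one, as described above. The struct, the constant and the "fixed" switch are hypothetical and exist only to contrast the two behaviours.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PSI_PERIOD_NS 2000000000ULL	/* assumed 2s sampling interval */

/* Hypothetical per-cgroup sampling state; not the kernel's psi_group. */
struct demo_psi_group {
	uint64_t last_update;	/* start of the current sampling period */
	uint64_t next_update;	/* when the next sample is scheduled */
};

/*
 * Initialize a new group at clock value "now". With fixed == 0 the
 * last_update timestamp is left at 0, as in the buggy behaviour; with
 * fixed != 0 it is set to the creation time of the group.
 */
static inline void demo_group_init(struct demo_psi_group *g, uint64_t now,
				   int fixed)
{
	assert(g);
	g->last_update = fixed ? now : 0;
	g->next_update = now + DEMO_PSI_PERIOD_NS;
}

/* Length of the first sampling period as the sampling code would see it. */
static inline uint64_t demo_first_period(const struct demo_psi_group *g)
{
	return g->next_update - g->last_update;
}
```

Without the fix, the first period equals the whole clock value plus 2s, so a sufficiently advanced clock can produce a period whose lower 32 bits are all zero. With the fix, the first period is exactly the sampling interval.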
      Reported-by: Jingfeng Xie <xiejingfeng@linux.alibaba.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Link: https://lkml.kernel.org/r/20191203183524.41378-2-hannes@cmpxchg.org
      Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • J
      psi: Fix cpu.pressure for cpu.max and competing cgroups · 9bf3d89c
      Johannes Weiner authored
      task #28327019
      
      commit b05e75d611380881e73edc58a20fd8c6bb71720b upstream
      
      For simplicity, cpu pressure is defined as having more than one
      runnable task on a given CPU. This works on the system-level, but it
      has limitations in a cgrouped reality: When cpu.max is in use, it
      doesn't capture the time in which a task is not executing on the CPU
      due to throttling. Likewise, it doesn't capture the time in which a
      competing cgroup is occupying the CPU - meaning it only reflects
      cgroup-internal competitive pressure, not outside pressure.
      
      Enable tracking of currently executing tasks, and then change the
      definition of cpu pressure in a cgroup from
      
      	NR_RUNNING > 1
      
      to
      
      	NR_RUNNING > ON_CPU
      
      which will capture the effects of cpu.max as well as competition from
      outside the cgroup.
      
      After this patch, a cgroup running `stress -c 1` with a cpu.max
      setting of 5000 10000 shows ~50% continuous CPU pressure.
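
The two definitions can be compared with a small sketch. The struct and function names are hypothetical. The counters stand in for a cgroup's NR_RUNNING and ON_CPU task counts on one CPU, and the sketch assumes a throttled task still counts as runnable.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-cgroup, per-CPU task counts. */
struct demo_counts {
	unsigned nr_running;	/* runnable tasks in the cgroup */
	unsigned nr_oncpu;	/* of those, tasks currently executing */
};

/* Old definition: pressure only when tasks inside the cgroup compete. */
static inline bool demo_cpu_pressure_old(const struct demo_counts *c)
{
	assert(c);
	return c->nr_running > 1;
}

/* New definition: pressure whenever a runnable task is not executing. */
static inline bool demo_cpu_pressure_new(const struct demo_counts *c)
{
	assert(c);
	return c->nr_running > c->nr_oncpu;
}
```

For a single runnable task that is throttled or preempted by another cgroup (nr_running = 1, nr_oncpu = 0), the old check reports no pressure and the new one reports pressure. When that task is executing (1 and 1), neither reports pressure.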
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200316191333.115523-2-hannes@cmpxchg.org
      Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>