1. 24 Jun, 2020 7 commits
    • cpuidle: menu: Remove get_loadavg() from the performance multiplier · 800bf05d
      Committed by Daniel Lezcano
      to #28739709
      
      commit a7fe5190c03f8137ef08db84a58dd4daf2c4785d upstream
      
      The function get_loadavg() almost always returns zero. More
      precisely, out of a total of 1023379 calls to the function, the load
      was equal to zero 1020728 times and greater than 100 610 times; the
      remaining values were between 0 and 5.
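
As a quick check of the figures above, the shares can be worked out directly. This is only an arithmetic sketch; the variable and function names are illustrative and do not come from the kernel.

```python
# Sample counts quoted in the commit message.
total_calls = 1023379
zero_load = 1020728
above_100 = 610

# Everything else fell between 0 and 5.
remaining = total_calls - zero_load - above_100


def share(count, total=total_calls):
    """Return count as a percentage of total."""
    return 100.0 * count / total


for name, count in (("zero", zero_load),
                    ("> 100", above_100),
                    ("0..5", remaining)):
    print(f"{name:>6}: {count:>8} samples, {share(count):.2f}%")
```

The zero-load case accounts for roughly 99.7% of the samples, which is why the multiplier term contributes almost nothing in practice.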
      
      In 2011, get_loadavg() was removed from the Android tree for this
      reason [1]. At that time, the load was computed as:
      
      unsigned long this_cpu_load(void)
      {
              struct rq *this = this_rq();
              return this->cpu_load[0];
      }
      
      In 2014, the code was changed by commit 372ba8cb (cpuidle: menu: Lookup CPU
      runqueues less) and the load is:
      
      void get_iowait_load(unsigned long *nr_waiters, unsigned long *load)
      {
              struct rq *rq = this_rq();
              *nr_waiters = atomic_read(&rq->nr_iowait);
              *load = rq->load.weight;
      }
      
      with the same result.
      
      Both measurements show that using the load in this code path no
      longer matters, so remove it.
      
      [1] https://android.googlesource.com/kernel/common/+/4dedd9f124703207895777ac6e91dacde0f7cc17

      Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
    • sched/fair: Disable LB_BIAS by default · bae52979
      Committed by Dietmar Eggemann
      to #28739709
      
      commit fdf5f315d5cfaefb7bb8a62ec4bf37b9891837aa upstream
      
      LB_BIAS allows adjusting how conservatively load is balanced.
      
      The rq->cpu_load[idx] array is used for this functionality. It contains
      weighted CPU load decayed average values over different intervals
      (idx = 1..4). Idx = 0 is the weighted CPU load itself.
      
      The values are updated during scheduler_tick, before idle balance and at
      nohz exit.
      
      There are 5 different types of idx's per sched domain (sd). Each of them
      is used to index into the rq->cpu_load[idx] array in a specific scenario
      (busy, idle and newidle for load balancing, forkexec for wake-up
      slow-path load balancing and wake for affine wakeup based on weight).
      Only the sd idx's for busy and idle load balancing are set to 2,3 or 1,2
      respectively. All the other sd idx's are set to 0.
      
      Conservative load balancing is achieved for sd idx's >= 1 by using the
      min/max (source_load()/target_load()) value between the current weighted
      CPU load and the rq->cpu_load[sd idx -1] for the busiest(idlest)/local
      CPU load in load balancing or vice versa in the wake-up slow-path load
      balancing.
      There is no conservative balancing for sd idx = 0 since only current
      weighted CPU load is used in this case.
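
The min/max rule described above can be modelled in a few lines. This is a simplified Python sketch, not the kernel implementation: plain integers stand in for runqueue state, `cpu_load` is a list playing the role of rq->cpu_load[], and the `lb_bias` parameter is a hypothetical stand-in for the LB_BIAS scheduler feature.

```python
def source_load(current_load, cpu_load, sd_idx, lb_bias=True):
    """Low-biased estimate: the smaller of the current weighted load and
    the decayed average cpu_load[sd_idx - 1]."""
    if sd_idx == 0 or not lb_bias:
        # No conservative balancing: only the current load is used.
        return current_load
    return min(cpu_load[sd_idx - 1], current_load)


def target_load(current_load, cpu_load, sd_idx, lb_bias=True):
    """High-biased estimate: the larger of the current weighted load and
    the decayed average cpu_load[sd_idx - 1]."""
    if sd_idx == 0 or not lb_bias:
        return current_load
    return max(cpu_load[sd_idx - 1], current_load)


# Made-up example: the CPU just became busier than its decayed history.
history = [900, 600, 400, 300, 250]   # stands in for cpu_load[0..4]
now = 900
print(source_load(now, history, sd_idx=2))   # uses history[1]
print(target_load(now, history, sd_idx=2))
print(source_load(now, history, sd_idx=2, lb_bias=False))
```

With the bias disabled, both helpers collapse to the current weighted load, which is what makes the cpu_load[] history removable later.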
      
      It is very likely that LB_BIAS' influence on load balancing can be
      neglected (see test results below). This is further supported by:
      
      (1) Weighted CPU load today is by itself a decayed average value (PELT)
          (cfs_rq->avg->runnable_load_avg) and not the instantaneous load
          (rq->load.weight) it was when LB_BIAS was introduced.
      
      (2) Sd imbalance_pct is used for CPU_NEWLY_IDLE and CPU_NOT_IDLE (related
          to sd's newidle and busy idx) in find_busiest_group() when comparing
          busiest and local avg load to make load balancing even more
          conservative.
      
      (3) The sd forkexec and newidle idx are always set to 0 so there is no
          adjustment on how conservatively load balancing is done here.
      
      (4) Affine wakeup based on weight (wake_affine_weight()) will not be
          impacted since the sd wake idx is always set to 0.
      
      Let's disable LB_BIAS by default for a few kernel releases to make sure
      that no workload and no scheduler topology is affected. The benefit of
      being able to remove the LB_BIAS dependency from source_load() and
      target_load() is that the entire rq->cpu_load[idx] code could be removed
      in this case.
      
      It is hard to rule out regressions without testing this with many
      different workloads on many different platforms, especially NUMA
      machines.
      The following 104 LKP (Linux Kernel Performance) tests were run by the
      0-Day team, mostly on multi-socket hosts with a large number of logical
      CPUs (88, 192).
      The base for the test was commit b3dae109 ("sched/swait: Rename to
      exclusive") (tip/sched/core v4.18-rc1).
      Only 2 out of the 104 tests had a significant change in one of the
      metrics (fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-NoSync-performance +7%
      files_per_sec, unixbench/300s-100%-syscall-performance -11% score).
      Tests which showed a change in one of the metrics are marked with a '*'
      and this change is listed as well.
      
      (a) lkp-bdw-ep3:
            88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz 64G
      
          dd-write/10m-1HDD-cfq-btrfs-100dd-performance
          fsmark/1x-1t-1HDD-xfs-nfsv4-4M-60G-NoSync-performance
        * fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-NoSync-performance
            7.50  +7%  8.00  ±  6%  fsmark.files_per_sec
          fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-fsyncBeforeClose-performance
          fsmark/1x-1t-1HDD-btrfs-4M-60G-NoSync-performance
          fsmark/1x-1t-1HDD-btrfs-4M-60G-fsyncBeforeClose-performance
          kbuild/300s-50%-vmlinux_prereq-performance
          kbuild/300s-200%-vmlinux_prereq-performance
          kbuild/300s-50%-vmlinux_prereq-performance-1HDD-ext4
          kbuild/300s-200%-vmlinux_prereq-performance-1HDD-ext4
      
      (b) lkp-skl-4sp1:
            192 threads Intel(R) Xeon(R) Platinum 8160 768G
      
          dbench/100%-performance
          ebizzy/200%-100x-10s-performance
          hackbench/1600%-process-pipe-performance
          iperf/300s-cs-localhost-tcp-performance
          iperf/300s-cs-localhost-udp-performance
          perf-bench-numa-mem/2t-300M-performance
          perf-bench-sched-pipe/10000000ops-process-performance
          perf-bench-sched-pipe/10000000ops-threads-performance
          schbench/2-16-300-30000-30000-performance
          tbench/100%-cs-localhost-performance
      
      (c) lkp-bdw-ep6:
            88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz 128G
      
          stress-ng/100%-60s-pipe-performance
          unixbench/300s-1-whetstone-double-performance
          unixbench/300s-1-shell1-performance
          unixbench/300s-1-shell8-performance
          unixbench/300s-1-pipe-performance
        * unixbench/300s-1-context1-performance
            312  315  unixbench.score
          unixbench/300s-1-spawn-performance
          unixbench/300s-1-syscall-performance
          unixbench/300s-1-dhry2reg-performance
          unixbench/300s-1-fstime-performance
          unixbench/300s-1-fsbuffer-performance
          unixbench/300s-1-fsdisk-performance
          unixbench/300s-100%-whetstone-double-performance
          unixbench/300s-100%-shell1-performance
          unixbench/300s-100%-shell8-performance
          unixbench/300s-100%-pipe-performance
          unixbench/300s-100%-context1-performance
          unixbench/300s-100%-spawn-performance
        * unixbench/300s-100%-syscall-performance
            3571  ±  3%  -11%  3183  ±  4%  unixbench.score
          unixbench/300s-100%-dhry2reg-performance
          unixbench/300s-100%-fstime-performance
          unixbench/300s-100%-fsbuffer-performance
          unixbench/300s-100%-fsdisk-performance
          unixbench/300s-1-execl-performance
          unixbench/300s-100%-execl-performance
        * will-it-scale/brk1-performance
            365004  360387  will-it-scale.per_thread_ops
        * will-it-scale/dup1-performance
            432401  437596  will-it-scale.per_thread_ops
          will-it-scale/eventfd1-performance
          will-it-scale/futex1-performance
          will-it-scale/futex2-performance
          will-it-scale/futex3-performance
          will-it-scale/futex4-performance
          will-it-scale/getppid1-performance
          will-it-scale/lock1-performance
          will-it-scale/lseek1-performance
          will-it-scale/lseek2-performance
        * will-it-scale/malloc1-performance
            47025  45817  will-it-scale.per_thread_ops
            77499  76529  will-it-scale.per_process_ops
          will-it-scale/malloc2-performance
        * will-it-scale/mmap1-performance
            123399  120815  will-it-scale.per_thread_ops
            152219  149833  will-it-scale.per_process_ops
        * will-it-scale/mmap2-performance
            107327  104714  will-it-scale.per_thread_ops
            136405  133765  will-it-scale.per_process_ops
          will-it-scale/open1-performance
        * will-it-scale/open2-performance
            171570  168805  will-it-scale.per_thread_ops
            532644  526202  will-it-scale.per_process_ops
          will-it-scale/page_fault1-performance
          will-it-scale/page_fault2-performance
          will-it-scale/page_fault3-performance
          will-it-scale/pipe1-performance
          will-it-scale/poll1-performance
        * will-it-scale/poll2-performance
            176134  172848  will-it-scale.per_thread_ops
            281361  275053  will-it-scale.per_process_ops
          will-it-scale/posix_semaphore1-performance
          will-it-scale/pread1-performance
          will-it-scale/pread2-performance
          will-it-scale/pread3-performance
          will-it-scale/pthread_mutex1-performance
          will-it-scale/pthread_mutex2-performance
          will-it-scale/pwrite1-performance
          will-it-scale/pwrite2-performance
          will-it-scale/pwrite3-performance
        * will-it-scale/read1-performance
            1190563  1174833  will-it-scale.per_thread_ops
        * will-it-scale/read2-performance
            1105369  1080427  will-it-scale.per_thread_ops
          will-it-scale/readseek1-performance
        * will-it-scale/readseek2-performance
            261818  259040  will-it-scale.per_thread_ops
          will-it-scale/readseek3-performance
        * will-it-scale/sched_yield-performance
            2408059  2382034  will-it-scale.per_thread_ops
          will-it-scale/signal1-performance
          will-it-scale/unix1-performance
          will-it-scale/unlink1-performance
          will-it-scale/unlink2-performance
        * will-it-scale/write1-performance
            976701  961588  will-it-scale.per_thread_ops
        * will-it-scale/writeseek1-performance
            831898  822448  will-it-scale.per_thread_ops
        * will-it-scale/writeseek2-performance
            228248  225065  will-it-scale.per_thread_ops
        * will-it-scale/writeseek3-performance
            226670  224058  will-it-scale.per_thread_ops
          will-it-scale/context_switch1-performance
          aim7/performance-fork_test-2000
        * aim7/performance-brk_test-3000
            74869  76676  aim7.jobs-per-min
          aim7/performance-disk_cp-3000
          aim7/performance-disk_rd-3000
          aim7/performance-sieve-3000
          aim7/performance-page_test-3000
          aim7/performance-creat-clo-3000
          aim7/performance-mem_rtns_1-8000
          aim7/performance-disk_wrt-8000
          aim7/performance-pipe_cpy-8000
          aim7/performance-ram_copy-8000
      
      (d) lkp-avoton3:
            8 threads Intel(R) Atom(TM) CPU C2750 @ 2.40GHz 16G
      
          netperf/ipv4-900s-200%-cs-localhost-TCP_STREAM-performance

      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Li Zhijian <zhijianx.li@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180809135753.21077-1-dietmar.eggemann@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
    • alinux: sched: Finer grain of sched latency · fa418988
      Committed by Yihao Wu
      to #28739709
      
      Many samples fall between 10ms and 50ms. To show a more informative
      latency distribution, split the former 10-100ms bucket into five:
      four uniform 10ms buckets covering 10-50ms, plus 50-100ms.
      
      Example:
      
        $ cat /sys/fs/cgroup/cpuacct/a/cpuacct.wait_latency
      	0-1ms: 	59726433
      	1-4ms: 	167
      	4-7ms: 	0
      	7-10ms: 	0
      	10-20ms: 	5
      	20-30ms: 	0
      	30-40ms: 	3
      	40-50ms: 	0
      	50-100ms: 	0
      	100-500ms: 	0
      	500-1000ms: 	0
      	1000-5000ms: 	0
      	5000-10000ms: 	0
      	>=10000ms: 	0
      	total(ms): 	45554
      	nr: 	59726600
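
To make the new bucket layout concrete, the sketch below maps a latency in milliseconds to the label it would be counted under. The labels are copied from the example output; the boundary handling (lower bound inclusive, upper bound exclusive) is an assumption inferred from the ">=10000ms" label, and the names `EDGES_MS`, `LABELS` and `bucket_label` are illustrative, not kernel identifiers.

```python
from bisect import bisect_right

# Upper edges (in ms) of every bucket except the last, open-ended one.
EDGES_MS = [1, 4, 7, 10, 20, 30, 40, 50, 100, 500, 1000, 5000, 10000]

LABELS = [
    "0-1ms", "1-4ms", "4-7ms", "7-10ms",
    "10-20ms", "20-30ms", "30-40ms", "40-50ms", "50-100ms",
    "100-500ms", "500-1000ms", "1000-5000ms", "5000-10000ms",
    ">=10000ms",
]


def bucket_label(latency_ms):
    """Return the histogram label for a latency, assuming [low, high) buckets."""
    return LABELS[bisect_right(EDGES_MS, latency_ms)]


for sample in (0.3, 12, 35, 75, 20000):
    print(sample, "->", bucket_label(sample))
```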

      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
    • alinux: sched: Add "nr" to sched latency histogram · 2abfd07b
      Committed by Yihao Wu
      to #28739709
      
      Sometimes the histogram is not precise enough, because each sample is
      only coarsely accounted into a histogram bucket. An average latency is
      also more practical for some users.

      This patch adds a "nr" field to the 4 latency histogram interfaces, so that
      
      	lat(avg) = total(ms) / nr
      
      Compared to the histogram, the average latency is also simpler to use
      as an SLI.

      Example:
      
          $ cat /sys/fs/cgroup/cpuacct/a/cpuacct.wait_latency
            0-1ms:  4139
            1-4ms:  317
            4-7ms:  568
            7-10ms:         0
            10-100ms:       42324
            100-500ms:      9131
            500-1000ms:     95
            1000-5000ms:    134
            5000-10000ms:   0
            >=10000ms:      0
            total(ms):      4256455
            nr:      182128
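
The average can be computed from the interface output with a small userspace helper. The following is a sketch that assumes the "label: value" layout shown in the example; `parse_latency` and `average_latency_ms` are hypothetical helper names, and the sample text is the example output above rather than a live read of the cgroup file.

```python
SAMPLE = """\
0-1ms:  4139
1-4ms:  317
4-7ms:  568
7-10ms:         0
10-100ms:       42324
100-500ms:      9131
500-1000ms:     95
1000-5000ms:    134
5000-10000ms:   0
>=10000ms:      0
total(ms):      4256455
nr:      182128
"""


def parse_latency(text):
    """Parse 'label: value' lines into a dict mapping label -> int."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().rpartition(":")
        if sep and value.strip().isdigit():
            fields[key.strip()] = int(value)
    return fields


def average_latency_ms(fields):
    """lat(avg) = total(ms) / nr, or 0.0 when no samples were recorded."""
    nr = fields.get("nr", 0)
    return fields["total(ms)"] / nr if nr else 0.0


fields = parse_latency(SAMPLE)
print(f"average wait latency: {average_latency_ms(fields):.2f} ms")
```

For the example output this gives an average of about 23.37 ms per sample.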

      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
    • alinux: sched: Add cgroup's scheduling latency histograms · 6dbaddaa
      Committed by Yihao Wu
      to #28739709
      
      This patch adds the cpuacct.cgroup_wait_latency interface. It exports
      the histogram of the sched entity's scheduling latency. Unlike
      wait_latency, the sched entity here is a cgroup rather than a task.

      This is useful when tasks are not directly clustered under one cgroup.
      For example:
      
      cgroup1 --- cgroupA --- task1
              --- cgroupB --- task2
      cgroup2 --- cgroupC --- task3
              --- cgroupD --- task4
      
      This is a common cgroup hierarchy used by many applications. With
      cgroup_wait_latency, we can simply read from cgroup1 to get the
      aggregated wait latency of task1 and task2.
      
      The interface output format is identical to cpuacct.wait_latency.

      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
    • alinux: sched: Add cgroup-level blocked time histograms · a055ee2c
      Committed by Yihao Wu
      to #28739709
      
      This patch measures the time that tasks in a cpuacct cgroup spend
      blocked. There are two types: blocked on IO, and blocked for other
      reasons such as locks. They are exported in "cpuacct.ioblock_latency"
      and "cpuacct.block_latency" respectively.

      The histogram gives the detailed distribution of the durations, and
      total(ms) gives the percentage of time tasks spent off the rq,
      waiting for resources:
      
      (△ioblock_latency.total(ms) + △block_latency.total(ms)) / △wall_time
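
The ratio above can be evaluated from two snapshots of the total(ms) fields. This is a minimal sketch under stated assumptions: the caller has already read total(ms) from cpuacct.ioblock_latency and cpuacct.block_latency at two points in time, and `blocked_fraction` and its parameter names are hypothetical.

```python
def blocked_fraction(io_before_ms, io_after_ms,
                     blk_before_ms, blk_after_ms,
                     wall_time_ms):
    """(delta ioblock total + delta block total) / delta wall time."""
    if wall_time_ms <= 0:
        raise ValueError("wall_time_ms must be positive")
    delta_io = io_after_ms - io_before_ms
    delta_blk = blk_after_ms - blk_before_ms
    return (delta_io + delta_blk) / wall_time_ms


# Made-up snapshot values taken 2000 ms apart.
ratio = blocked_fraction(io_before_ms=1000, io_after_ms=1600,
                         blk_before_ms=200, blk_after_ms=400,
                         wall_time_ms=2000)
print(f"time spent blocked: {ratio:.0%}")
```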
      
      The interface output format is identical to cpuacct.wait_latency.

      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
    • alinux: sched: Introduce cfs scheduling latency histograms · 76d98609
      Committed by Yihao Wu
      to #28739709
      
      Export wait_latency in "cpuacct.wait_latency", which indicates the
      time that tasks in a cpuacct cgroup wait on a cfs_rq to be scheduled.
      
      This is like "perf sched", but with lower overhead, so it can be
      used for continuous monitoring.

      wait_latency is useful for debugging an application's high
      response-time (RT) problems. It can tell whether the problem is caused
      by scheduling. If it is, loadavg can tell whether it is caused by bad
      scheduling behaviour or by system overload.

      System admins can also use wait_latency to define an SLA. To ensure
      the SLA is met, there are various ways to decrease wait_latency.
      
      This feature is disabled by default for performance concerns. It can
      be switched on dynamically by "echo 0 > /proc/cpusli/sched_lat_enable"
      
      Example:
      
        $ cat /sys/fs/cgroup/cpuacct/a/cpuacct.wait_latency
          0-1ms:  4139
          1-4ms:  317
          4-7ms:  568
          7-10ms:         0
          10-100ms:       42324
          100-500ms:      9131
          500-1000ms:     95
          1000-5000ms:    134
          5000-10000ms:   0
          >=10000ms:      0
          total(ms):      4256455

      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
      Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
  2. 23 Jun, 2020 13 commits
  3. 22 Jun, 2020 3 commits
  4. 19 Jun, 2020 1 commit
    • configs: arm64: use 48-bit virtual address · f44f084b
      Committed by Xu Yu
      fix #28506983
      
      Some ARM machines may have large memory capacity (e.g., more than 256G),
      or large hole(s) in memory layout among nodes.
      
      A kernel with CONFIG_ARM64_VA_BITS set to 39 has a linear region of
      256G, and any memory that the linear mapping cannot cover is removed.
      On such ARM machines this may make part of the physical memory
      unavailable, or cause the system to deadlock on memory or even fail
      to boot.

      This change sets CONFIG_ARM64_VA_BITS to 48, which supports a 128T
      linear mapping, in order to cover most scenarios.
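
The 256G and 128T figures follow if the linear region is taken to be half of the 2^VA_BITS kernel virtual address space. The sketch below checks that arithmetic; the helper names are illustrative, and the "half of the VA space" rule is stated here as an assumption that matches the numbers in this commit message.

```python
GIB = 1 << 30
TIB = 1 << 40


def linear_region_bytes(va_bits):
    """Size of the linear region, assuming it spans half of 2**va_bits."""
    return 1 << (va_bits - 1)


def fits_linear_map(phys_span_bytes, va_bits):
    """True if a physical address span (holes included) fits the linear region."""
    return phys_span_bytes <= linear_region_bytes(va_bits)


print(linear_region_bytes(39) // GIB, "GiB with 39-bit VA")
print(linear_region_bytes(48) // TIB, "TiB with 48-bit VA")

# Hypothetical machine whose memory layout spans 512 GiB including holes.
print(fits_linear_map(512 * GIB, 39), fits_linear_map(512 * GIB, 48))
```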

      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
  5. 16 Jun, 2020 16 commits