1. 18 November 2021, 1 commit
    • sched: Add cluster scheduler level in core and related Kconfig for ARM64 · ac032ae3
      Committed by Barry Song
      mainline inclusion
      from tip/sched/core for v5.16
      commit: 778c558f
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4H33U
      CVE: NA
      Reference: https://lore.kernel.org/lkml/20210924085104.44806-1-21cnbao@gmail.com/
      
      ------------------------------------------------------------------------
      
      This patch adds a scheduler level for clusters and automatically enables
      load balancing among clusters. It directly benefits workloads that are
      hungry for shared resources such as memory bandwidth and caches.
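
      As a rough illustration (not the patch itself, and assuming the mainline
      option name CONFIG_SCHED_CLUSTER), the new level slots into the
      scheduler's default topology table between SMT and MC roughly like this:

          static struct sched_domain_topology_level default_topology[] = {
          #ifdef CONFIG_SCHED_SMT
                  { cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
          #endif
          #ifdef CONFIG_SCHED_CLUSTER
                  /* new: CPUs sharing a cluster-level cache/interconnect */
                  { cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
          #endif
          #ifdef CONFIG_SCHED_MC
                  { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
          #endif
                  { cpu_cpu_mask, SD_INIT_NAME(DIE) },
                  { NULL, },
          };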
      
      Testing has been done extensively on two different hardware
      configurations of Kunpeng920:

       24 cores in one NUMA node (6 clusters in each NUMA node);
       32 cores in one NUMA node (8 clusters in each NUMA node)

      Workloads run on either one NUMA node or four NUMA nodes, so this
      estimates the effect of cluster spreading both with and without NUMA
      load balancing.
      
      * Stream benchmark:
      
      4threads stream (on 1NUMA * 24cores = 24cores)
                      stream                 stream
                      w/o patch              w/ patch
      MB/sec copy     29929.64 (   0.00%)    32932.68 (  10.03%)
      MB/sec scale    29861.10 (   0.00%)    32710.58 (   9.54%)
      MB/sec add      27034.42 (   0.00%)    32400.68 (  19.85%)
      MB/sec triad    27225.26 (   0.00%)    31965.36 (  17.41%)
      
      6threads stream (on 1NUMA * 24cores = 24cores)
                      stream                 stream
                      w/o patch              w/ patch
      MB/sec copy     40330.24 (   0.00%)    42377.68 (   5.08%)
      MB/sec scale    40196.42 (   0.00%)    42197.90 (   4.98%)
      MB/sec add      37427.00 (   0.00%)    41960.78 (  12.11%)
      MB/sec triad    37841.36 (   0.00%)    42513.64 (  12.35%)
      
      12threads stream (on 1NUMA * 24cores = 24cores)
                      stream                 stream
                      w/o patch              w/ patch
      MB/sec copy     52639.82 (   0.00%)    53818.04 (   2.24%)
      MB/sec scale    52350.30 (   0.00%)    53253.38 (   1.73%)
      MB/sec add      53607.68 (   0.00%)    55198.82 (   2.97%)
      MB/sec triad    54776.66 (   0.00%)    56360.40 (   2.89%)
      
      Thus, it can help memory-bound workloads, especially under medium load.
      A similar improvement is also seen in lkp-pbzip2:
      
      * lkp-pbzip2 benchmark
      
      2-96 threads (on 4NUMA * 24cores = 96cores)
                        lkp-pbzip2              lkp-pbzip2
                        w/o patch               w/ patch
      Hmean     tput-2   11062841.57 (   0.00%)  11341817.51 *   2.52%*
      Hmean     tput-5   26815503.70 (   0.00%)  27412872.65 *   2.23%*
      Hmean     tput-8   41873782.21 (   0.00%)  43326212.92 *   3.47%*
      Hmean     tput-12  61875980.48 (   0.00%)  64578337.51 *   4.37%*
      Hmean     tput-21 105814963.07 (   0.00%) 111381851.01 *   5.26%*
      Hmean     tput-30 150349470.98 (   0.00%) 156507070.73 *   4.10%*
      Hmean     tput-48 237195937.69 (   0.00%) 242353597.17 *   2.17%*
      Hmean     tput-79 360252509.37 (   0.00%) 362635169.23 *   0.66%*
      Hmean     tput-96 394571737.90 (   0.00%) 400952978.48 *   1.62%*
      
      2-24 threads (on 1NUMA * 24cores = 24cores)
                       lkp-pbzip2               lkp-pbzip2
                       w/o patch                w/ patch
      Hmean     tput-2   11071705.49 (   0.00%)  11296869.10 *   2.03%*
      Hmean     tput-4   20782165.19 (   0.00%)  21949232.15 *   5.62%*
      Hmean     tput-6   30489565.14 (   0.00%)  33023026.96 *   8.31%*
      Hmean     tput-8   40376495.80 (   0.00%)  42779286.27 *   5.95%*
      Hmean     tput-12  61264033.85 (   0.00%)  62995632.78 *   2.83%*
      Hmean     tput-18  86697139.39 (   0.00%)  86461545.74 (  -0.27%)
      Hmean     tput-24 104854637.04 (   0.00%) 104522649.46 *  -0.32%*
      
      The greatest performance improvement is seen in the 6-thread and
      8-thread cases.
      
      A similar, though smaller, improvement can be seen on lkp-pixz:
      
      * lkp-pixz benchmark
      
      2-24 threads lkp-pixz (on 1NUMA * 24cores = 24cores)
                        lkp-pixz               lkp-pixz
                        w/o patch              w/ patch
      Hmean     tput-2   6486981.16 (   0.00%)  6561515.98 *   1.15%*
      Hmean     tput-4  11645766.38 (   0.00%) 11614628.43 (  -0.27%)
      Hmean     tput-6  15429943.96 (   0.00%) 15957350.76 *   3.42%*
      Hmean     tput-8  19974087.63 (   0.00%) 20413746.98 *   2.20%*
      Hmean     tput-12 28172068.18 (   0.00%) 28751997.06 *   2.06%*
      Hmean     tput-18 39413409.54 (   0.00%) 39896830.55 *   1.23%*
      Hmean     tput-24 49101815.85 (   0.00%) 49418141.47 *   0.64%*
      
      * SPECrate benchmark
      
      4,8,16 copies mcf_r(on 1NUMA * 32cores = 32cores)
      		Base     	 	Base
      		Run Time   	 	Rate
      		-------  	 	---------
      4 Copies	w/o 580 (w/ 570)       	w/o 11.1 (w/ 11.3)
      8 Copies	w/o 647 (w/ 605)       	w/o 20.0 (w/ 21.4, +7%)
      16 Copies	w/o 844 (w/ 844)       	w/o 30.6 (w/ 30.6)
      
      32 Copies(on 4NUMA * 32 cores = 128cores)
      [w/o patch]
                       Base     Base        Base
      Benchmarks       Copies  Run Time     Rate
      --------------- -------  ---------  ---------
      500.perlbench_r      32        584       87.2  *
      502.gcc_r            32        503       90.2  *
      505.mcf_r            32        745       69.4  *
      520.omnetpp_r        32       1031       40.7  *
      523.xalancbmk_r      32        597       56.6  *
      525.x264_r            1         --            CE
      531.deepsjeng_r      32        336      109    *
      541.leela_r          32        556       95.4  *
      548.exchange2_r      32        513      163    *
      557.xz_r             32        530       65.2  *
       Est. SPECrate2017_int_base              80.3
      
      [w/ patch]
                        Base     Base        Base
      Benchmarks       Copies  Run Time     Rate
      --------------- -------  ---------  ---------
      500.perlbench_r      32        580      87.8 (+0.688%)  *
      502.gcc_r            32        477      95.1 (+5.432%)  *
      505.mcf_r            32        644      80.3 (+13.574%) *
      520.omnetpp_r        32        942      44.6 (+9.58%)   *
523.xalancbmk_r      32        560      60.4 (+6.714%)  *
      525.x264_r            1         --           CE
      531.deepsjeng_r      32        337      109  (+0.000%) *
      541.leela_r          32        554      95.6 (+0.210%) *
      548.exchange2_r      32        515      163  (+0.000%) *
      557.xz_r             32        524      66.0 (+1.227%) *
       Est. SPECrate2017_int_base              83.7 (+4.062%)
      
      On the other hand, it is slightly helpful to CPU-bound tasks like
      kernbench:
      
      * 24-96 threads kernbench (on 4NUMA * 24cores = 96cores)
                           kernbench              kernbench
                           w/o cluster            w/ cluster
      Min       user-24    12054.67 (   0.00%)    12024.19 (   0.25%)
      Min       syst-24     1751.51 (   0.00%)     1731.68 (   1.13%)
      Min       elsp-24      600.46 (   0.00%)      598.64 (   0.30%)
      Min       user-48    12361.93 (   0.00%)    12315.32 (   0.38%)
      Min       syst-48     1917.66 (   0.00%)     1892.73 (   1.30%)
      Min       elsp-48      333.96 (   0.00%)      332.57 (   0.42%)
      Min       user-96    12922.40 (   0.00%)    12921.17 (   0.01%)
      Min       syst-96     2143.94 (   0.00%)     2110.39 (   1.56%)
      Min       elsp-96      211.22 (   0.00%)      210.47 (   0.36%)
      Amean     user-24    12063.99 (   0.00%)    12030.78 *   0.28%*
      Amean     syst-24     1755.20 (   0.00%)     1735.53 *   1.12%*
      Amean     elsp-24      601.60 (   0.00%)      600.19 (   0.23%)
      Amean     user-48    12362.62 (   0.00%)    12315.56 *   0.38%*
      Amean     syst-48     1921.59 (   0.00%)     1894.95 *   1.39%*
      Amean     elsp-48      334.10 (   0.00%)      332.82 *   0.38%*
      Amean     user-96    12925.27 (   0.00%)    12922.63 (   0.02%)
      Amean     syst-96     2146.66 (   0.00%)     2122.20 *   1.14%*
      Amean     elsp-96      211.96 (   0.00%)      211.79 (   0.08%)
      
      Note this patch isn't a universal win; it might hurt workloads that
      benefit from packing. Although tasks that want the lower communication
      latency of a single cluster are not guaranteed to be packed into one
      cluster while the kernel is unaware of clusters, they do have some
      chance of being packed randomly. With this patch they are more likely
      to be spread out.
      Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
      Tested-by: Yicong Yang <yangyicong@hisilicon.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
      Reviewed-by: tao zeng <prime.zeng@hisilicon.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      ac032ae3
  2. 21 October 2021, 1 commit
  3. 31 August 2021, 2 commits
    • arm64: Add memmap parameter and register pmem · 94dc364f
      Committed by ZhuLing
      hulk inclusion
      category: feature
      bugzilla: 48159
      CVE: NA
      
      ------------------------------
      
      Register pmem on arm64:
      Use the memmap parameter (memmap=nn[KMG]!ss[KMG]) to reserve memory and
      the e820 helper (drivers/nvdimm/e820.c) to register persistent memory
      on arm64. When the kernel is restarted or updated, the data in PMEM is
      not lost and can be loaded faster. This is a general feature.
      
      drivers/nvdimm/e820.c:
      This file scans "iomem_resource" and takes advantage of the nvdimm
      resource discovery mechanism by registering resources named
      "Persistent Memory (legacy)"; it does not depend on the architecture.
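
      A hedged sketch of that discovery step (the function names here are
      illustrative assumptions, not the driver's real ones): walk
      iomem_resource for resources carrying the legacy persistent-memory
      descriptor.

          #include <linux/init.h>
          #include <linux/ioport.h>
          #include <linux/printk.h>

          static int report_legacy_pmem(struct resource *res, void *data)
          {
                  pr_info("legacy pmem range: %pR\n", res);
                  /* the real driver registers an nvdimm region covering 'res' */
                  return 0;
          }

          static int __init scan_legacy_pmem(void)
          {
                  return walk_iomem_res_desc(IORES_DESC_PERSISTENT_MEMORY_LEGACY,
                                             IORESOURCE_MEM, 0, -1, NULL,
                                             report_legacy_pmem);
          }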
      
      We will push the feature to the Linux kernel community and discuss
      renaming the file, because the name e820.c gives the mistaken
      impression that it depends on x86.
      
      To use this feature:
      1. Reserve memory: add a memmap entry to grub.cfg,
         memmap=nn[KMG]!ss[KMG], e.g. memmap=100K!0x1a0000000.
      2. Load nd_e820.ko: modprobe nd_e820.
      3. Check for the pmem device in /dev, e.g. /dev/pmem0.
      Signed-off-by: ZhuLing <zhuling8@huawei.com>
      Signed-off-by: Sang Yan <sangyan@huawei.com>
      Reviewed-by: Chen Wandun <chenwandun@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      94dc364f
    • arm64: smp: Add support for cpu park · 0262388a
      Committed by Sang Yan
      hulk inclusion
      category: feature
      bugzilla: 48159
      CVE: N/A
      
      ------------------------------
      
      Introduce a CPU PARK feature to save the time spent taking CPUs down
      and bringing them back up during kexec, which can cost about 250 ms
      per CPU for the down path and 30 ms for the up path.

      As a result, for 128 cores it takes more than 30 seconds to take the
      CPUs down and bring them up during kexec; with 256 cores or more it
      only gets worse.
      
      CPU PARK is a state in which a CPU stays powered on in a spin loop,
      polling for a chance to exit, such as an exit address being written.

      A block of memory is reserved and filled with the cpu park text
      section, exit address and park-magic-flag of each CPU. In this
      implementation, one page is reserved per CPU core.

      CPUs go into the park state instead of being taken down in
      machine_shutdown(), and come out of the park state in smp_init()
      instead of being brought up from scratch.

      Layout of one CPU's park section in the pre-reserved memory block:
      +--------------+
      + exit address +
      +--------------+
      + park magic   +
      +--------------+
      + park codes   +
      +      .       +
      +      .       +
      +      .       +
      +--------------+
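
      A minimal sketch of the idea (illustrative only; the structure and
      function names below are assumptions, not the patch's own): a parked
      CPU spins on its per-CPU slot in the reserved memory until an exit
      address is written, then jumps to it.

          struct cpu_park_slot {
                  unsigned long exit_addr;        /* written to release the CPU */
                  unsigned long park_magic;       /* park-magic-flag */
                  /* park code text fills the rest of the reserved page */
          };

          static void cpu_park_loop(struct cpu_park_slot *slot)
          {
                  void (*exit_fn)(void);

                  while (!READ_ONCE(slot->exit_addr))
                          cpu_relax();

                  exit_fn = (void (*)(void))slot->exit_addr;
                  exit_fn();      /* enter the new kernel's startup code */
          }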
      Signed-off-by: Sang Yan <sangyan@huawei.com>
      Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      0262388a
  4. 16 July 2021, 1 commit
  5. 14 July 2021, 2 commits
    • locking/qspinlock: Add CNA support for ARM64 · 0532ec6d
      Committed by Wei Li
      hulk inclusion
      category: feature
      bugzilla: 169576
      CVE: NA
      
      -------------------------------------------------
      
      Enabling CNA is controlled via a new configuration option
      (NUMA_AWARE_SPINLOCKS). Add it for arm64.
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      0532ec6d
    • arm64: mremap speedup - enable HAVE_MOVE_PUD · 44c0d24d
      Committed by Kalesh Singh
      mainline inclusion
      from mainline-v5.11-rc1
      commit f5308c89
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFUI
      CVE: NA
      
      -------------------------------------------------
      
      HAVE_MOVE_PUD enables remapping pages at the PUD level if both the source
      and destination addresses are PUD-aligned.
      
      With HAVE_MOVE_PUD enabled it can be inferred that there is approximately
      a 19x improvement in performance on arm64.  (See data below).
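
      As a hedged illustration of the gating condition (not the mremap code
      itself), a whole-PUD move only makes sense when both ends are
      PUD-aligned and the region spans at least one PUD:

          #include <linux/kernel.h>
          #include <linux/pgtable.h>

          static inline bool pud_level_move_ok(unsigned long old_addr,
                                               unsigned long new_addr,
                                               unsigned long len)
          {
                  return IS_ALIGNED(old_addr, PUD_SIZE) &&
                         IS_ALIGNED(new_addr, PUD_SIZE) &&
                         len >= PUD_SIZE;
          }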
      
      ------- Test Results ---------
      
      The following results were obtained using a 5.4 kernel, by remapping a
      PUD-aligned, 1GB sized region to a PUD-aligned destination.  The results
      from 10 iterations of the test are given below:
      
      Total mremap times for 1GB data on arm64. All times are in nanoseconds.
      
        Control          HAVE_MOVE_PUD
      
        1247761          74271
        1219896          46771
        1094792          59687
        1227760          48385
        1043698          76666
        1101771          50365
        1159896          52500
        1143594          75261
        1025833          61354
        1078125          48697
      
        1134312.6        59395.7    <-- Mean time in nanoseconds
      
      A 1GB mremap completion time drops from ~1.1 milliseconds to ~59
      microseconds on arm64.  (~19x speed up).
      
      Link: https://lkml.kernel.org/r/20201014005320.2233162-5-kaleshsingh@google.com
      Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Hassan Naveed <hnaveed@wavecomp.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Krzysztof Kozlowski <krzk@kernel.org>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Reviewed-by: Chen Wandun <chenwandun@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      44c0d24d
  6. 07 July 2021, 1 commit
    • arm64: errata: add option to disable cache readunique prefetch on HIP08 · 3b876a78
      Committed by Kai Shen
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFV2
      CVE: NA
      
      -----------------------------------------------------------
      
      Random performance drops appear in Hackbench cases that test pipe or
      socket communication among multiple threads on the HiSilicon HIP08 SoC.
      Both cache sharing caused by the change of data layout and the cache
      readunique prefetch mechanism lead to this problem.

      The readunique mechanism, which can be triggered by store operations,
      invalidates cachelines on other cores during the data fetching stage,
      so cacheline invalidation happens frequently in shared-data access
      situations.

      Disabling cache readunique prefetch tackles this problem.
      Test cases look like:
          for i in 20;do
              echo "--------pipe thread num=$i----------"
              for j in $(seq 1 10);do
                  ./hackbench -pipe $i thread 1000
              done
          done
      
      We disable readunique prefetch only in EL2, because disabling it in
      EL1 may cause a panic due to the lack of the related privilege, which
      is usually set in the BIOS.
      
      Introduce CONFIG_HISILICON_ERRATUM_HIP08_RU_PREFETCH and disable RU
      prefetch using boot cmdline 'readunique_prefetch=off'.
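
      A minimal sketch of how such a boot switch is usually wired up (the
      parameter name comes from the commit text; the variable and function
      names are illustrative assumptions):

          #include <linux/init.h>
          #include <linux/string.h>

          static bool readunique_prefetch_enabled = true;

          static int __init readunique_prefetch_switch(char *data)
          {
                  if (data && !strcmp(data, "off"))
                          readunique_prefetch_enabled = false;
                  return 0;
          }
          early_param("readunique_prefetch", readunique_prefetch_switch);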
      Signed-off-by: Kai Shen <shenkai8@huawei.com>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      [XQ: adjusted context]
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      3b876a78
  7. 06 July 2021, 1 commit
  8. 04 June 2021, 2 commits
  9. 26 April 2021, 1 commit
  10. 13 April 2021, 5 commits
  11. 09 April 2021, 4 commits
  12. 08 February 2021, 1 commit
  13. 07 January 2021, 3 commits
  14. 01 December 2020, 1 commit
  15. 03 November 2020, 1 commit
  16. 29 October 2020, 1 commit
  17. 15 October 2020, 1 commit
  18. 14 October 2020, 1 commit
  19. 09 October 2020, 1 commit
  20. 29 September 2020, 1 commit
    • arm64: Remove Spectre-related CONFIG_* options · 6e5f0927
      Committed by Will Deacon
      The spectre mitigations are too configurable for their own good, leading
      to confusing logic trying to figure out when we should mitigate and when
      we shouldn't. Although the plethora of command-line options need to stick
      around for backwards compatibility, the default-on CONFIG options that
      depend on EXPERT can be dropped, as the mitigations only do anything if
      the system is vulnerable, a mitigation is available and the command-line
      hasn't disabled it.
      
      Remove CONFIG_HARDEN_BRANCH_PREDICTOR and CONFIG_ARM64_SSBD in favour of
      enabling this code unconditionally.
      Signed-off-by: Will Deacon <will@kernel.org>
      6e5f0927
  21. 18 September 2020, 1 commit
  22. 14 September 2020, 1 commit
    • arm64: Allow IPIs to be handled as normal interrupts · d3afc7f1
      Committed by Marc Zyngier
      In order to deal with IPIs as normal interrupts, let's add
      a new way to register them with the architecture code.
      
      set_smp_ipi_range() takes a range of interrupts, and allows
      the arch code to request them as if they were normal interrupts.
      A standard handler is then called by the core IRQ code to deal
      with the IPI.
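
      A hedged sketch of the registration flow (set_smp_ipi_range() is the
      interface named here; the surrounding irqchip code is illustrative,
      loosely modelled on how GIC drivers allocate their SGIs):

          #include <linux/init.h>
          #include <linux/irqdomain.h>
          #include <linux/numa.h>
          #include <linux/smp.h>

          static void __init example_irqchip_smp_init(struct irq_domain *d)
          {
                  /* allocate a contiguous range of Linux IRQs backing the IPIs */
                  int base_ipi = irq_domain_alloc_irqs(d, 8, NUMA_NO_NODE, NULL);

                  if (base_ipi <= 0)
                          return;

                  /*
                   * hand the range to the arch code, which then requests
                   * each IPI like a normal interrupt
                   */
                  set_smp_ipi_range(base_ipi, 8);
          }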
      
      This means that we don't need to call irq_enter/irq_exit, and
      that we don't need to deal with set_irq_regs either. So let's
      move the dispatcher into its own function, and leave handle_IPI()
      as a compatibility function.
      
      On the sending side, let's make use of ipi_send_mask, which
      already exists for this purpose.
      
      One of the major differences is that, in some cases (such as when
      performing IRQ time accounting on the scheduler IPI), we end up with
      nested irq_enter()/irq_exit() pairs.
      Other than the (relatively small) overhead, there should be
      no consequences to it (these pairs are designed to nest
      correctly, and the accounting shouldn't be off).
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      d3afc7f1
  23. 11 September 2020, 3 commits
  24. 09 September 2020, 1 commit
  25. 04 September 2020, 1 commit
  26. 29 July 2020, 1 commit