1. 15 Oct 2021 (5 commits)
    • sched: Add cluster scheduler level for x86 · 66558b73
      Committed by Tim Chen
      There are x86 CPU architectures (e.g. Jacobsville) where the L2 cache
      is shared among a cluster of cores instead of being exclusive to one
      single core.
      
      To prevent oversubscription of L2 cache, load should be balanced
      between such L2 clusters, especially for tasks with no shared data.
      On benchmarks such as the SPECrate mcf test, this change provides a
      performance boost, especially on medium-load systems. On a
      Jacobsville that has 24 Atom cores, arranged into 6 clusters of 4
      cores each, the benchmark numbers are as follows:
      
       Improvement over baseline kernel for mcf_r
       copies		run time	base rate
       1		-0.1%		-0.2%
       6		25.1%		25.1%
       12		18.8%		19.0%
       24		0.3%		0.3%
      
      So this looks pretty good. In terms of the system's task distribution,
      some pretty bad clumping can be seen for the vanilla kernel without
      the L2 cluster domain in the 6- and 12-copy cases. With the extra
      cluster domain, the load does get evened out between the clusters.
      
      Note this patch isn't a universal win, as spreading isn't necessarily
      a win, particularly for workloads that benefit from packing.
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20210924085104.44806-4-21cnbao@gmail.com
      66558b73
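      As context for how the level is wired in: the x86 sched-domain
      topology table gains a CLS entry between the SMT and MC levels,
      roughly as below. This is a sketch, not the exact diff; the
      x86_cluster_flags helper and the exact flag names are assumptions
      based on the existing x86_smt_flags/x86_core_flags pattern.

        static struct sched_domain_topology_level x86_topology[] = {
        #ifdef CONFIG_SCHED_SMT
                { cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
        #endif
        #ifdef CONFIG_SCHED_CLUSTER
                /* New: CPUs sharing an L2 cache form one cluster. */
                { cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS) },
        #endif
        #ifdef CONFIG_SCHED_MC
                { cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
        #endif
                { cpu_cpu_mask, SD_INIT_NAME(DIE) },
                { NULL, },
        };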
    • sched: Add cluster scheduler level in core and related Kconfig for ARM64 · 778c558f
      Committed by Barry Song
      This patch adds a scheduler level for clusters and automatically
      enables load balancing among clusters. It directly benefits workloads
      that want more resources, such as memory bandwidth and caches.
      
      Testing has been done widely on two different hardware configurations
      of Kunpeng920:
      
       24 cores in one NUMA node (6 clusters per NUMA node);
       32 cores in one NUMA node (8 clusters per NUMA node)
      
      The workload runs on either one NUMA node or four NUMA nodes, so the
      effect of cluster spreading can be estimated both with and without
      NUMA load balancing.
      
      * Stream benchmark:
      
      4threads stream (on 1NUMA * 24cores = 24cores)
                      stream                 stream
                      w/o patch              w/ patch
      MB/sec copy     29929.64 (   0.00%)    32932.68 (  10.03%)
      MB/sec scale    29861.10 (   0.00%)    32710.58 (   9.54%)
      MB/sec add      27034.42 (   0.00%)    32400.68 (  19.85%)
      MB/sec triad    27225.26 (   0.00%)    31965.36 (  17.41%)
      
      6threads stream (on 1NUMA * 24cores = 24cores)
                      stream                 stream
                      w/o patch              w/ patch
      MB/sec copy     40330.24 (   0.00%)    42377.68 (   5.08%)
      MB/sec scale    40196.42 (   0.00%)    42197.90 (   4.98%)
      MB/sec add      37427.00 (   0.00%)    41960.78 (  12.11%)
      MB/sec triad    37841.36 (   0.00%)    42513.64 (  12.35%)
      
      12threads stream (on 1NUMA * 24cores = 24cores)
                      stream                 stream
                      w/o patch              w/ patch
      MB/sec copy     52639.82 (   0.00%)    53818.04 (   2.24%)
      MB/sec scale    52350.30 (   0.00%)    53253.38 (   1.73%)
      MB/sec add      53607.68 (   0.00%)    55198.82 (   2.97%)
      MB/sec triad    54776.66 (   0.00%)    56360.40 (   2.89%)
      
      Thus, it can help memory-bound workloads, especially under medium
      load. A similar improvement is also seen in lkp-pbzip2:
      
      * lkp-pbzip2 benchmark
      
      2-96 threads (on 4NUMA * 24cores = 96cores)
                        lkp-pbzip2              lkp-pbzip2
                        w/o patch               w/ patch
      Hmean     tput-2   11062841.57 (   0.00%)  11341817.51 *   2.52%*
      Hmean     tput-5   26815503.70 (   0.00%)  27412872.65 *   2.23%*
      Hmean     tput-8   41873782.21 (   0.00%)  43326212.92 *   3.47%*
      Hmean     tput-12  61875980.48 (   0.00%)  64578337.51 *   4.37%*
      Hmean     tput-21 105814963.07 (   0.00%) 111381851.01 *   5.26%*
      Hmean     tput-30 150349470.98 (   0.00%) 156507070.73 *   4.10%*
      Hmean     tput-48 237195937.69 (   0.00%) 242353597.17 *   2.17%*
      Hmean     tput-79 360252509.37 (   0.00%) 362635169.23 *   0.66%*
      Hmean     tput-96 394571737.90 (   0.00%) 400952978.48 *   1.62%*
      
      2-24 threads (on 1NUMA * 24cores = 24cores)
                       lkp-pbzip2               lkp-pbzip2
                       w/o patch                w/ patch
      Hmean     tput-2   11071705.49 (   0.00%)  11296869.10 *   2.03%*
      Hmean     tput-4   20782165.19 (   0.00%)  21949232.15 *   5.62%*
      Hmean     tput-6   30489565.14 (   0.00%)  33023026.96 *   8.31%*
      Hmean     tput-8   40376495.80 (   0.00%)  42779286.27 *   5.95%*
      Hmean     tput-12  61264033.85 (   0.00%)  62995632.78 *   2.83%*
      Hmean     tput-18  86697139.39 (   0.00%)  86461545.74 (  -0.27%)
      Hmean     tput-24 104854637.04 (   0.00%) 104522649.46 *  -0.32%*
      
      In the case of 6 threads and 8 threads, we see the greatest performance
      improvement.
      
      A similar improvement can be seen on lkp-pixz, though it is
      smaller:
      
      * lkp-pixz benchmark
      
      2-24 threads lkp-pixz (on 1NUMA * 24cores = 24cores)
                        lkp-pixz               lkp-pixz
                        w/o patch              w/ patch
      Hmean     tput-2   6486981.16 (   0.00%)  6561515.98 *   1.15%*
      Hmean     tput-4  11645766.38 (   0.00%) 11614628.43 (  -0.27%)
      Hmean     tput-6  15429943.96 (   0.00%) 15957350.76 *   3.42%*
      Hmean     tput-8  19974087.63 (   0.00%) 20413746.98 *   2.20%*
      Hmean     tput-12 28172068.18 (   0.00%) 28751997.06 *   2.06%*
      Hmean     tput-18 39413409.54 (   0.00%) 39896830.55 *   1.23%*
      Hmean     tput-24 49101815.85 (   0.00%) 49418141.47 *   0.64%*
      
      * SPECrate benchmark
      
      4,8,16 copies mcf_r (on 1NUMA * 32cores = 32cores)
      		Base     	 	Base
      		Run Time   	 	Rate
      		-------  	 	---------
      4 Copies	w/o 580 (w/ 570)       	w/o 11.1 (w/ 11.3)
      8 Copies	w/o 647 (w/ 605)       	w/o 20.0 (w/ 21.4, +7%)
      16 Copies	w/o 844 (w/ 844)       	w/o 30.6 (w/ 30.6)
      
      32 Copies(on 4NUMA * 32 cores = 128cores)
      [w/o patch]
                       Base     Base        Base
      Benchmarks       Copies  Run Time     Rate
      --------------- -------  ---------  ---------
      500.perlbench_r      32        584       87.2  *
      502.gcc_r            32        503       90.2  *
      505.mcf_r            32        745       69.4  *
      520.omnetpp_r        32       1031       40.7  *
      523.xalancbmk_r      32        597       56.6  *
      525.x264_r            1         --            CE
      531.deepsjeng_r      32        336      109    *
      541.leela_r          32        556       95.4  *
      548.exchange2_r      32        513      163    *
      557.xz_r             32        530       65.2  *
       Est. SPECrate2017_int_base              80.3
      
      [w/ patch]
                        Base     Base        Base
      Benchmarks       Copies  Run Time     Rate
      --------------- -------  ---------  ---------
      500.perlbench_r      32        580      87.8 (+0.688%)  *
      502.gcc_r            32        477      95.1 (+5.432%)  *
      505.mcf_r            32        644      80.3 (+13.574%) *
      520.omnetpp_r        32        942      44.6 (+9.58%)   *
      523.xalancbmk_r      32        560      60.4 (+6.714%)  *
      525.x264_r            1         --           CE
      531.deepsjeng_r      32        337      109  (+0.000%) *
      541.leela_r          32        554      95.6 (+0.210%) *
      548.exchange2_r      32        515      163  (+0.000%) *
      557.xz_r             32        524      66.0 (+1.227%) *
       Est. SPECrate2017_int_base              83.7 (+4.062%)
      
      On the other hand, it is slightly helpful to CPU-bound tasks like
      kernbench:
      
      * 24-96 threads kernbench (on 4NUMA * 24cores = 96cores)
                           kernbench              kernbench
                           w/o cluster            w/ cluster
      Min       user-24    12054.67 (   0.00%)    12024.19 (   0.25%)
      Min       syst-24     1751.51 (   0.00%)     1731.68 (   1.13%)
      Min       elsp-24      600.46 (   0.00%)      598.64 (   0.30%)
      Min       user-48    12361.93 (   0.00%)    12315.32 (   0.38%)
      Min       syst-48     1917.66 (   0.00%)     1892.73 (   1.30%)
      Min       elsp-48      333.96 (   0.00%)      332.57 (   0.42%)
      Min       user-96    12922.40 (   0.00%)    12921.17 (   0.01%)
      Min       syst-96     2143.94 (   0.00%)     2110.39 (   1.56%)
      Min       elsp-96      211.22 (   0.00%)      210.47 (   0.36%)
      Amean     user-24    12063.99 (   0.00%)    12030.78 *   0.28%*
      Amean     syst-24     1755.20 (   0.00%)     1735.53 *   1.12%*
      Amean     elsp-24      601.60 (   0.00%)      600.19 (   0.23%)
      Amean     user-48    12362.62 (   0.00%)    12315.56 *   0.38%*
      Amean     syst-48     1921.59 (   0.00%)     1894.95 *   1.39%*
      Amean     elsp-48      334.10 (   0.00%)      332.82 *   0.38%*
      Amean     user-96    12925.27 (   0.00%)    12922.63 (   0.02%)
      Amean     syst-96     2146.66 (   0.00%)     2122.20 *   1.14%*
      Amean     elsp-96      211.96 (   0.00%)      211.79 (   0.08%)
      
      Note this patch isn't a universal win; it might hurt workloads that
      benefit from packing. Although tasks that want to take advantage of
      the lower communication latency within one cluster won't necessarily
      be packed into one cluster while the kernel is unaware of clusters,
      they do have some chance of being randomly packed. This patch makes
      it more likely that they will be spread.
      Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
      Tested-by: Yicong Yang <yangyicong@hisilicon.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      778c558f
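      For reference, the cluster mask the scheduler consumes comes from the
      arch topology bookkeeping; a minimal sketch, assuming the
      cpu_topology[].cluster_sibling mask this series populates from the
      firmware description:

        /* drivers/base/arch_topology.c (sketch) */
        const struct cpumask *cpu_clustergroup_mask(int cpu)
        {
                /* CPUs under the same cluster node in PPTT/DT */
                return &cpu_topology[cpu].cluster_sibling;
        }

      The generic scheduler then builds a CLS domain from this mask when
      CONFIG_SCHED_CLUSTER is enabled, between the SMT and MC levels.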
    • topology: Represent clusters of CPUs within a die · c5e22fef
      Committed by Jonathan Cameron
      Both ACPI and DT provide the ability to describe additional layers of
      topology between that of individual cores and higher-level constructs
      such as the level at which the last-level cache is shared.
      In ACPI this can be represented in the PPTT as a Processor Hierarchy
      Node Structure [1] that is the parent of the CPU cores and in turn
      has a parent Processor Hierarchy Node Structure representing
      a higher level of topology.
      
      For example, Kunpeng 920 has 6 or 8 clusters in each NUMA node, and
      each cluster has 4 CPUs. All clusters share L3 cache data, but each
      cluster has a local L3 tag. In addition, the clusters share some
      internal system bus.
      
      +-----------------------------------+                          +---------+
      |  +------+    +------+             +--------------------------+         |
      |  | CPU0 |    | CPU1 |             |    +-----------+         |         |
      |  +------+    +------+             |    |           |         |         |
      |                                   +----+    L3     |         |         |
      |  +------+    +------+   cluster   |    |    tag    |         |         |
      |  | CPU2 |    | CPU3 |             |    |           |         |         |
      |  +------+    +------+             |    +-----------+         |         |
      |                                   |                          |         |
      +-----------------------------------+                          |         |
      +-----------------------------------+                          |         |
      |  +------+    +------+             +--------------------------+         |
      |  |      |    |      |             |    +-----------+         |         |
      |  +------+    +------+             |    |           |         |         |
      |                                   |    |    L3     |         |         |
      |  +------+    +------+             +----+    tag    |         |         |
      |  |      |    |      |             |    |           |         |         |
      |  +------+    +------+             |    +-----------+         |         |
      |                                   |                          |         |
      +-----------------------------------+                          |   L3    |
                                                                     |   data  |
      +-----------------------------------+                          |         |
      |  +------+    +------+             |    +-----------+         |         |
      |  |      |    |      |             |    |           |         |         |
      |  +------+    +------+             +----+    L3     |         |         |
      |                                   |    |    tag    |         |         |
      |  +------+    +------+             |    |           |         |         |
      |  |      |    |      |             |    +-----------+         |         |
      |  +------+    +------+             +--------------------------+         |
      +-----------------------------------+                          |         |
      +-----------------------------------+                          |         |
      |  +------+    +------+             +--------------------------+         |
      |  |      |    |      |             |    +-----------+         |         |
      |  +------+    +------+             |    |           |         |         |
      |                                   +----+    L3     |         |         |
      |  +------+    +------+             |    |    tag    |         |         |
      |  |      |    |      |             |    |           |         |         |
      |  +------+    +------+             |    +-----------+         |         |
      |                                   |                          |         |
      +-----------------------------------+                          |         |
      +-----------------------------------+                          |         |
      |  +------+    +------+             +--------------------------+         |
      |  |      |    |      |             |   +-----------+          |         |
      |  +------+    +------+             |   |           |          |         |
      |                                   |   |    L3     |          |         |
      |  +------+    +------+             +---+    tag    |          |         |
      |  |      |    |      |             |   |           |          |         |
      |  +------+    +------+             |   +-----------+          |         |
      |                                   |                          |         |
      +-----------------------------------+                          |         |
      +-----------------------------------+                          |         |
      |  +------+    +------+             +--------------------------+         |
      |  |      |    |      |             |  +-----------+           |         |
      |  +------+    +------+             |  |           |           |         |
      |                                   |  |    L3     |           |         |
      |  +------+    +------+             +--+    tag    |           |         |
      |  |      |    |      |             |  |           |           |         |
      |  +------+    +------+             |  +-----------+           |         |
      |                                   |                          +---------+
      +-----------------------------------+
      
      That means spreading tasks among clusters brings more bandwidth,
      while packing tasks within one cluster leads to lower cache
      synchronization latency. So both the kernel and userspace have
      a chance to leverage this topology and place tasks accordingly, to
      achieve either lower cache latency within one cluster or an even
      distribution of load among clusters for higher throughput.
      
      This patch exposes the cluster topology to both kernel and userspace.
      Libraries like hwloc will learn about clusters via cluster_cpus and
      the related sysfs attributes. A PoC of hwloc support is at [2].
      
      Note this patch only handles the ACPI case.
      
      Special consideration is needed for SMT processors, where it is
      necessary to move 2 levels up the hierarchy from the leaf nodes
      (thus skipping the processor core level).
      
      Note that arm64 / ACPI does not provide any means of identifying
      a die level in the topology, but that may be unrelated to the cluster
      level.
      
      [1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node
          structure (Type 0)
      [2] https://github.com/hisilicon/hwloc/tree/linux-cluster
      Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
      Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20210924085104.44806-2-21cnbao@gmail.com
      c5e22fef
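      From userspace, the exposed layout can be checked by reading the new
      sysfs attributes; a minimal sketch in C, probing only the first 4
      CPUs (paths follow the cluster_cpus_list attribute this patch adds):

        #include <stdio.h>

        int main(void)
        {
                char path[128], buf[256];
                FILE *f;

                for (int cpu = 0; cpu < 4; cpu++) {
                        snprintf(path, sizeof(path),
                                 "/sys/devices/system/cpu/cpu%d/topology/cluster_cpus_list",
                                 cpu);
                        f = fopen(path, "r");
                        if (!f)
                                continue;  /* no cluster info from ACPI/DT */
                        if (fgets(buf, sizeof(buf), f))
                                printf("cpu%d cluster: %s", cpu, buf);
                        fclose(f);
                }
                return 0;
        }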
    • sched: Add wrapper for get_wchan() to keep task blocked · 42a20f86
      Committed by Kees Cook
      Having a stable wchan means the process must be blocked, and must
      stay that way while the stack unwinding is performed.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm]
      Tested-by: Mark Rutland <mark.rutland@arm.com> [arm64]
      Link: https://lkml.kernel.org/r/20211008111626.332092234@infradead.org
      42a20f86
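      The wrapper in kernel/sched/core.c looks roughly like this (a sketch
      following the commit; __get_wchan() is the per-arch unwinder that
      this series introduces):

        unsigned long get_wchan(struct task_struct *p)
        {
                unsigned long ip = 0;
                unsigned int state;

                if (!p || p == current)
                        return 0;

                /* Only get wchan if task is blocked and we can keep it that way. */
                raw_spin_lock_irq(&p->pi_lock);
                state = READ_ONCE(p->__state);
                smp_rmb(); /* see try_to_wake_up() */
                if (state != TASK_RUNNING && state != TASK_WAKING && !p->on_rq)
                        ip = __get_wchan(p);
                raw_spin_unlock_irq(&p->pi_lock);

                return ip;
        }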
    • x86: Fix get_wchan() to support the ORC unwinder · bc9bbb81
      Committed by Qi Zheng
      Currently, the kernel CONFIG_UNWINDER_ORC option is enabled by default
      on x86, but the implementation of get_wchan() is still based on the
      frame pointer unwinder, so /proc/<pid>/wchan usually returns 0
      regardless of whether task <pid> is running.
      
      Reimplement get_wchan() by calling stack_trace_save_tsk(), which is
      adapted to the ORC and frame pointer unwinders.
      
      Fixes: ee9f8fce ("x86/unwind: Add the ORC unwinder")
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20211008111626.271115116@infradead.org
      bc9bbb81
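      The reimplementation is short; roughly (a sketch of the approach:
      stack_trace_save_tsk() takes the task, the storage array, the maximum
      number of entries, and the number of entries to skip):

        unsigned long get_wchan(struct task_struct *p)
        {
                unsigned long entry = 0;

                if (p == current || task_is_running(p))
                        return 0;

                /* Works with both the ORC and frame-pointer unwinders;
                 * save at most one entry from the blocked task's stack. */
                stack_trace_save_tsk(p, &entry, 1, 0);
                return entry;
        }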
  2. 05 Oct 2021 (1 commit)
  3. 04 Oct 2021 (1 commit)
    • kvm: fix objtool relocation warning · 291073a5
      Committed by Linus Torvalds
      The recent change to make objtool aware of more symbol relocation types
      (commit 24ff6525: "objtool: Teach get_alt_entry() about more
      relocation types") also added another check, and resulted in this
      objtool warning when building kvm on x86:
      
          arch/x86/kvm/emulate.o: warning: objtool: __ex_table+0x4: don't know how to handle reloc symbol type: kvm_fastop_exception
      
      The reason seems to be that kvm_fastop_exception() is marked as a global
      symbol, which causes the relocation to be kept around for objtool.  And
      at the same time, the kvm_fastop_exception definition (which is done as
      an inline asm statement) doesn't actually set the type of the global,
      which then makes objtool unhappy.
      
      The minimal fix is to just not mark kvm_fastop_exception as being a
      global symbol.  It's only used in that one compilation unit anyway, so
      it was always pointless.  That's how all the other local exception table
      labels are done.
      
      I'm not entirely happy about the kinds of games that the kvm code plays
      with doing its own exception handling, and the fact that it confused
      objtool is most definitely a symptom of the code being a bit too subtle
      and ad-hoc.  But at least this trivial one-liner makes objtool no longer
      upset about what is going on.
      
      Fixes: 24ff6525 ("objtool: Teach get_alt_entry() about more relocation types")
      Link: https://lore.kernel.org/lkml/CAHk-=wiZwq-0LknKhXN4M+T8jbxn_2i9mcKpO+OaBSSq_Eh7tg@mail.gmail.com/
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      291073a5
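      The shape of the one-liner, sketched (illustrative of the inline-asm
      pattern in emulate.c, not a verbatim diff):

        /* Before: a needlessly global label, whose relocation objtool
         * then has to classify without a symbol type to go on. */
        asm(".pushsection .fixup, \"ax\"\n"
            ".global kvm_fastop_exception \n"
            "kvm_fastop_exception: xor %esi, %esi; ret\n"
            ".popsection");

        /* After: file-local, like other exception-table labels. */
        asm(".pushsection .fixup, \"ax\"\n"
            "kvm_fastop_exception: xor %esi, %esi; ret\n"
            ".popsection");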
  4. 01 Oct 2021 (2 commits)
  5. 30 Sep 2021 (2 commits)
    • KVM: x86: Swap order of CPUID entry "index" vs. "significant flag" checks · e8a747d0
      Committed by Sean Christopherson
      Check whether a CPUID entry's index is significant before checking for a
      matching index to hack-a-fix an undefined behavior bug due to consuming
      uninitialized data.  RESET/INIT emulation uses kvm_cpuid() to retrieve
      CPUID.0x1, which does _not_ have a significant index, and fails to
      initialize the dummy variable that doubles as EBX/ECX/EDX output _and_
      ECX, a.k.a. index, input.
      
      Practically speaking, it's _extremely_ unlikely any compiler will yield
      code that causes problems, as the compiler would need to inline the
      kvm_cpuid() call to detect the uninitialized data, and intentionally hose
      the kernel, e.g. insert ud2, instead of simply ignoring the result of
      the index comparison.
      
      Although the sketchy "dummy" pattern was introduced in SVM by commit
      66f7b72e ("KVM: x86: Make register state after reset conform to
      specification"), it wasn't actually broken until commit 7ff6c035
      ("KVM: x86: Remove stateful CPUID handling") arbitrarily swapped the
      order of operations such that "index" was checked before the significant
      flag.
      
      Avoid consuming uninitialized data by reverting to checking the flag
      before the index purely so that the fix can be easily backported; the
      offending RESET/INIT code has been refactored, moved, and consolidated
      from vendor code to common x86 since the bug was introduced.  A future
      patch will directly address the bad RESET/INIT behavior.
      
      The undefined behavior was detected by syzbot + KernelMemorySanitizer.
      
        BUG: KMSAN: uninit-value in cpuid_entry2_find arch/x86/kvm/cpuid.c:68
        BUG: KMSAN: uninit-value in kvm_find_cpuid_entry arch/x86/kvm/cpuid.c:1103
        BUG: KMSAN: uninit-value in kvm_cpuid+0x456/0x28f0 arch/x86/kvm/cpuid.c:1183
         cpuid_entry2_find arch/x86/kvm/cpuid.c:68 [inline]
         kvm_find_cpuid_entry arch/x86/kvm/cpuid.c:1103 [inline]
         kvm_cpuid+0x456/0x28f0 arch/x86/kvm/cpuid.c:1183
         kvm_vcpu_reset+0x13fb/0x1c20 arch/x86/kvm/x86.c:10885
         kvm_apic_accept_events+0x58f/0x8c0 arch/x86/kvm/lapic.c:2923
         vcpu_enter_guest+0xfd2/0x6d80 arch/x86/kvm/x86.c:9534
         vcpu_run+0x7f5/0x18d0 arch/x86/kvm/x86.c:9788
         kvm_arch_vcpu_ioctl_run+0x245b/0x2d10 arch/x86/kvm/x86.c:10020
      
        Local variable ----dummy@kvm_vcpu_reset created at:
         kvm_vcpu_reset+0x1fb/0x1c20 arch/x86/kvm/x86.c:10812
         kvm_apic_accept_events+0x58f/0x8c0 arch/x86/kvm/lapic.c:2923
      
      Reported-by: syzbot+f3985126b746b3d59c9d@syzkaller.appspotmail.com
      Reported-by: Alexander Potapenko <glider@google.com>
      Fixes: 2a24be79 ("KVM: VMX: Set EDX at INIT with CPUID.0x1, Family-Model-Stepping")
      Fixes: 7ff6c035 ("KVM: x86: Remove stateful CPUID handling")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Message-Id: <20210929222426.1855730-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e8a747d0
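      The restored ordering relies on short-circuit evaluation so that an
      entry's index is never compared unless it is significant; a sketch of
      the matching loop in cpuid_entry2_find() (KVM_CPUID_FLAG_SIGNIFCANT_INDEX
      is the UAPI spelling of the flag):

        for (i = 0; i < nent; i++) {
                e = &entries[i];

                /* Check the flag first: if the index isn't significant,
                 * e->index (and the caller's possibly-uninitialized
                 * index argument) is never read. */
                if (e->function == function &&
                    (!(e->flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX) ||
                     e->index == index))
                        return e;
        }
        return NULL;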
    • x86/kvmclock: Move this_cpu_pvti into kvmclock.h · ad9af930
      Committed by Zelin Deng
      Other modules such as ptp_kvm may use the hv_clock_per_cpu variable,
      so move it into kvmclock.h and export the symbol to make it visible
      to other modules.
      Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com>
      Cc: <stable@vger.kernel.org>
      Message-Id: <1632892429-101194-2-git-send-email-zelin.deng@linux.alibaba.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ad9af930
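      After the move, kvmclock.h carries roughly the following (a sketch;
      the helper simply dereferences the now-exported per-CPU pointer):

        /* arch/x86/include/asm/kvmclock.h (sketch) */
        DECLARE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);

        static inline struct pvclock_vcpu_time_info *this_cpu_pvti(void)
        {
                return &this_cpu_read(hv_clock_per_cpu)->pvti;
        }

      so that modules like ptp_kvm can use this_cpu_pvti() once
      hv_clock_per_cpu is exported from kvmclock.c.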
  6. 28 Sep 2021 (1 commit)
    • bpf, x86: Fix bpf mapping of atomic fetch implementation · ced18582
      Committed by Johan Almbladh
      Fix the case where the dst register maps to %rax, as otherwise this
      produces an incorrect mapping with the implementation in 981f94c3
      ("bpf: Add bitwise atomic instructions"), since %rax is clobbered as
      the implicit operand of cmpxchg.
      
      The issue is similar to b29dd96b ("bpf, x86: Fix BPF_FETCH atomic and/or/
      xor with r0 as src") just that the case of dst register was missed.
      
      Before, dst=r0 (%rax) src=r2 (%rsi):
      
        [...]
        c5:   mov    %rax,%r10
        c8:   mov    0x0(%rax),%rax       <---+ (broken)
        cc:   mov    %rax,%r11                |
        cf:   and    %rsi,%r11                |
        d2:   lock cmpxchg %r11,0x0(%rax) <---+
        d8:   jne    0x00000000000000c8       |
        da:   mov    %rax,%rsi                |
        dd:   mov    %r10,%rax                |
        [...]                                 |
                                              |
      After, dst=r0 (%rax) src=r2 (%rsi):     |
                                              |
        [...]                                 |
        da:	mov    %rax,%r10                |
        dd:	mov    0x0(%r10),%rax       <---+ (fixed)
        e1:	mov    %rax,%r11                |
        e4:	and    %rsi,%r11                |
        e7:	lock cmpxchg %r11,0x0(%r10) <---+
        ed:	jne    0x00000000000000dd
        ef:	mov    %rax,%rsi
        f2:	mov    %r10,%rax
        [...]
      
      The remaining combinations were fine as-is though:
      
      After, dst=r9 (%r15) src=r0 (%rax):
      
        [...]
        dc:	mov    %rax,%r10
        df:	mov    0x0(%r15),%rax
        e3:	mov    %rax,%r11
        e6:	and    %r10,%r11
        e9:	lock cmpxchg %r11,0x0(%r15)
        ef:	jne    0x00000000000000df      _
        f1:	mov    %rax,%r10                | (unneeded, but
        f4:	mov    %r10,%rax               _|  not a problem)
        [...]
      
      After, dst=r9 (%r15) src=r4 (%rcx):
      
        [...]
        de:	mov    %rax,%r10
        e1:	mov    0x0(%r15),%rax
        e5:	mov    %rax,%r11
        e8:	and    %rcx,%r11
        eb:	lock cmpxchg %r11,0x0(%r15)
        f1:	jne    0x00000000000000e1
        f3:	mov    %rax,%rcx
        f6:	mov    %r10,%rax
        [...]
      
      The case of dst == src register is rejected by the verifier and
      therefore not supported, but the x86 JIT handles this case just
      fine anyway.
      
      After, dst=r0 (%rax) src=r0 (%rax):
      
        [...]
        eb:	mov    %rax,%r10
        ee:	mov    0x0(%r10),%rax
        f2:	mov    %rax,%r11
        f5:	and    %r10,%r11
        f8:	lock cmpxchg %r11,0x0(%r10)
        fe:	jne    0x00000000000000ee
       100:	mov    %rax,%r10
       103:	mov    %r10,%rax
        [...]
      
      Fixes: 981f94c3 ("bpf: Add bitwise atomic instructions")
      Reported-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
      Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
      Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Brendan Jackman <jackmanb@google.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      ced18582
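      Conceptually, the fix remaps either operand away from %rax before the
      cmpxchg loop is emitted; a sketch of the JIT logic (names follow the
      existing real_src_reg handling introduced by b29dd96b):

        u32 real_src_reg = src_reg;
        u32 real_dst_reg = dst_reg;

        /* %rax is implicitly consumed by cmpxchg, so its original value
         * is stashed in the auxiliary register (BPF_REG_AX -> %r10) and
         * any operand that mapped to %rax is redirected there. */
        if (src_reg == BPF_REG_0)
                real_src_reg = BPF_REG_AX;
        if (dst_reg == BPF_REG_0)
                real_dst_reg = BPF_REG_AX;   /* the case this patch adds */
        /* ... the load/op/cmpxchg loop then uses real_{src,dst}_reg ... */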
  7. 27 Sep 2021 (2 commits)
  8. 25 Sep 2021 (1 commit)
  9. 24 Sep 2021 (13 commits)
  10. 23 Sep 2021 (7 commits)
  11. 22 Sep 2021 (5 commits)