1. 11 Mar 2022, 3 commits
  2. 10 Mar 2022, 1 commit
    • bpf: Add "live packet" mode for XDP in BPF_PROG_RUN · b530e9e1
      Committed by Toke Høiland-Jørgensen
      This adds support for running XDP programs through BPF_PROG_RUN in a mode
      that enables live packet processing of the resulting frames. Previous uses
      of BPF_PROG_RUN for XDP returned the XDP program return code and the
      modified packet data to userspace, which is useful for unit testing of XDP
      programs.
      
      The existing BPF_PROG_RUN for XDP allows userspace to set the ingress
      ifindex and RXQ number as part of the context object being passed to the
      kernel. This patch reuses that code, but adds a new mode with different
      semantics, which can be selected with the new BPF_F_TEST_XDP_LIVE_FRAMES
      flag.
      
      When running BPF_PROG_RUN in this mode, the XDP program return codes will
      be honoured: returning XDP_PASS will result in the frame being injected
      into the networking stack as if it came from the selected networking
      interface, while returning XDP_TX and XDP_REDIRECT will result in the frame
      being transmitted out that interface. XDP_TX is translated into an
      XDP_REDIRECT operation to the same interface, since the real XDP_TX action
      is only possible from within the network drivers themselves, not from the
      process context where BPF_PROG_RUN is executed.
      
      Internally, this new mode of operation creates a page pool instance while
      setting up the test run, and feeds pages from that into the XDP program.
      The setup cost of this is amortised over the number of repetitions
      specified by userspace.
      
      To support the performance testing use case, we further optimise the setup
      step so that all pages in the pool are pre-initialised with the packet
      data, and pre-computed context and xdp_frame objects stored at the start of
      each page. This makes it possible to entirely avoid touching the page
      content on each XDP program invocation, and enables sending up to 9
      Mpps/core on my test box.
      
      Because the data pages are recycled by the page pool, and the test runner
      doesn't re-initialise them for each run, subsequent invocations of the XDP
      program will see the packet data in the state it was after the last time it
      ran on that particular page. This means that an XDP program that modifies
      the packet before redirecting it has to be careful about which assumptions
      it makes about the packet content, but that is only an issue for the most
      naively written programs.
      
      Enabling the new flag is only allowed when not setting ctx_out and data_out
      in the test specification, since using it means frames will be redirected
      somewhere else, so they can't be returned.
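      The flag described above is exercised from userspace through the bpf(2)
      syscall. The following is a hedged sketch (not from the patch itself) of
      the attr layout a caller would fill in; it deliberately passes an invalid
      prog_fd of -1, since a real fd would require loading an XDP program first,
      so the call is expected to fail while still showing the shape of the API:

      ```c
      #include <errno.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      #include <sys/syscall.h>
      #include <linux/bpf.h>

      /* Guard for older uapi headers; value as defined by this patch. */
      #ifndef BPF_F_TEST_XDP_LIVE_FRAMES
      #define BPF_F_TEST_XDP_LIVE_FRAMES (1U << 1)
      #endif

      int main(void)
      {
          unsigned char pkt[64] = { 0 };  /* dummy Ethernet frame */
          union bpf_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.test.prog_fd = -1;                   /* stand-in for a loaded XDP prog fd */
          attr.test.data_in = (unsigned long)pkt;
          attr.test.data_size_in = sizeof(pkt);
          attr.test.repeat = 100;                   /* setup cost amortised over repeats */
          attr.test.flags = BPF_F_TEST_XDP_LIVE_FRAMES;
          /* Note: data_out/ctx_out must stay unset in live-frames mode. */

          long ret = syscall(__NR_bpf, BPF_PROG_TEST_RUN, &attr, sizeof(attr));
          printf("ret=%ld errno=%d\n", ret, ret < 0 ? errno : 0);
          return ret < 0 ? 0 : 1;  /* failure is expected with an invalid fd */
      }
      ```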
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20220309105346.100053-2-toke@redhat.com
  3. 03 Mar 2022, 1 commit
    • bpf: Add __sk_buff->delivery_time_type and bpf_skb_set_skb_delivery_time() · 8d21ec0e
      Committed by Martin KaFai Lau
      * __sk_buff->delivery_time_type:
      This patch adds __sk_buff->delivery_time_type.  It tells if the
      delivery_time is stored in __sk_buff->tstamp or not.
      
      It will be most useful for ingress to tell if the __sk_buff->tstamp
      has the (rcv) timestamp or delivery_time.  If delivery_time_type
      is 0 (BPF_SKB_DELIVERY_TIME_NONE), it has the (rcv) timestamp.
      
      Two non-zero types are defined for the delivery_time_type,
      BPF_SKB_DELIVERY_TIME_MONO and BPF_SKB_DELIVERY_TIME_UNSPEC.  For UNSPEC,
      it can only happen in egress because only mono delivery_time can be
      forwarded to ingress now.  The clock of an UNSPEC delivery_time
      can be deduced from skb->sk->sk_clockid, which is also how
      sch_etf does it.
      
      * Provide forwarded delivery_time to tc-bpf@ingress:
      With the help of the new delivery_time_type, the tc-bpf has a way
      to tell if the __sk_buff->tstamp has the (rcv) timestamp or
      the delivery_time.  During bpf load time, the verifier will learn if
      the bpf prog has accessed the new __sk_buff->delivery_time_type.
      If it does, it means the tc-bpf@ingress is expecting the
      skb->tstamp could have the delivery_time.  The kernel will then
      read the skb->tstamp as-is during bpf insn rewrite without
      checking the skb->mono_delivery_time.  This is done by adding a
      new prog->delivery_time_access bit.  The same goes for
      writing skb->tstamp.
      
      * bpf_skb_set_delivery_time():
      The bpf_skb_set_delivery_time() helper is added to allow setting both
      delivery_time and the delivery_time_type at the same time.  If the
      tc-bpf does not need to change the delivery_time_type, it can directly
      write to the __sk_buff->tstamp as the existing tc-bpf has already been
      doing.  It will be most useful at ingress to change the
      __sk_buff->tstamp from the (rcv) timestamp to
      a mono delivery_time and then bpf_redirect_*().
      
      bpf only has a mono clock helper (bpf_ktime_get_ns), the
      currently known use case is the mono EDT for fq, and only a
      mono delivery_time can be kept during forwarding now, so
      bpf_skb_set_delivery_time() only supports setting
      BPF_SKB_DELIVERY_TIME_MONO.  It can be extended later when use cases
      come up and the forwarding path also supports other clock bases.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 22 Feb 2022, 1 commit
  5. 21 Feb 2022, 1 commit
  6. 17 Feb 2022, 1 commit
  7. 10 Feb 2022, 1 commit
  8. 06 Feb 2022, 2 commits
  9. 02 Feb 2022, 1 commit
    • tools headers UAPI: Sync linux/prctl.h with the kernel sources · fc45e658
      Committed by Arnaldo Carvalho de Melo
      To pick the changes in:
      
        9a10064f ("mm: add a field to store names for private anonymous memory")
      
      This doesn't result in any changes in tooling:
      
        $ tools/perf/trace/beauty/prctl_option.sh > before
        $ cp include/uapi/linux/prctl.h tools/include/uapi/linux/prctl.h
        $ tools/perf/trace/beauty/prctl_option.sh > after
        $ diff -u before after
        $
      
      This actually adds a new prctl arg, but it has to be dealt with
      differently, as it is not in sequence with the other arguments.
      
      Just silences this perf tools build warning:
      
        Warning: Kernel ABI header at 'tools/include/uapi/linux/prctl.h' differs from latest version at 'include/uapi/linux/prctl.h'
        diff -u tools/include/uapi/linux/prctl.h include/uapi/linux/prctl.h
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Colin Cross <ccross@google.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  10. 01 Feb 2022, 3 commits
  11. 28 Jan 2022, 1 commit
  12. 26 Jan 2022, 1 commit
  13. 25 Jan 2022, 1 commit
  14. 22 Jan 2022, 3 commits
  15. 20 Jan 2022, 5 commits
    • tools headers UAPI: Sync files changed by new set_mempolicy_home_node syscall · 6e10e219
      Committed by Arnaldo Carvalho de Melo
      To pick the changes in these csets:
      
        21b084fd ("mm/mempolicy: wire up syscall set_mempolicy_home_node")
      
      This adds support for the new syscall in tools such as 'perf trace'.
      
      For instance, this is now possible:
      
        [root@five ~]# perf trace -e set_mempolicy_home_node
        ^C[root@five ~]#
        [root@five ~]# perf trace -v -e set_mempolicy_home_node
        Using CPUID AuthenticAMD-25-21-0
        event qualifier tracepoint filter: (common_pid != 253729 && common_pid != 3585) && (id == 450)
        mmap size 528384B
        ^C[root@five ~]
        [root@five ~]# perf trace -v -e set*  --max-events 5
        Using CPUID AuthenticAMD-25-21-0
        event qualifier tracepoint filter: (common_pid != 253734 && common_pid != 3585) && (id == 38 || id == 54 || id == 105 || id == 106 || id == 109 || id == 112 || id == 113 || id == 114 || id == 116 || id == 117 || id == 119 || id == 122 || id == 123 || id == 141 || id == 160 || id == 164 || id == 170 || id == 171 || id == 188 || id == 205 || id == 218 || id == 238 || id == 273 || id == 308 || id == 450)
        mmap size 528384B
             0.000 ( 0.008 ms): bash/253735 setpgid(pid: 253735 (bash), pgid: 253735 (bash))      = 0
          6849.011 ( 0.008 ms): bash/16046 setpgid(pid: 253736 (bash), pgid: 253736 (bash))       = 0
          6849.080 ( 0.005 ms): bash/253736 setpgid(pid: 253736 (bash), pgid: 253736 (bash))      = 0
          7437.718 ( 0.009 ms): gnome-shell/253737 set_robust_list(head: 0x7f34b527e920, len: 24) = 0
         13445.986 ( 0.010 ms): bash/16046 setpgid(pid: 253738 (bash), pgid: 253738 (bash))       = 0
        [root@five ~]#
      
      That is the filter expression attached to the raw_syscalls:sys_{enter,exit}
      tracepoints.
      
        $ find tools/perf/arch/ -name "syscall*tbl" | xargs grep -w set_mempolicy_home_node
        tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl:450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
        tools/perf/arch/powerpc/entry/syscalls/syscall.tbl:450 	nospu	set_mempolicy_home_node		sys_set_mempolicy_home_node
        tools/perf/arch/s390/entry/syscalls/syscall.tbl:450  common	set_mempolicy_home_node	sys_set_mempolicy_home_node	sys_set_mempolicy_home_node
        tools/perf/arch/x86/entry/syscalls/syscall_64.tbl:450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
        $
      
        $ grep -w set_mempolicy_home_node /tmp/build/perf/arch/x86/include/generated/asm/syscalls_64.c
      	[450] = "set_mempolicy_home_node",
        $
      
      This addresses these perf build warnings:
      
        Warning: Kernel ABI header at 'tools/include/uapi/asm-generic/unistd.h' differs from latest version at 'include/uapi/asm-generic/unistd.h'
        diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h
        Warning: Kernel ABI header at 'tools/perf/arch/x86/entry/syscalls/syscall_64.tbl' differs from latest version at 'arch/x86/entry/syscalls/syscall_64.tbl'
        diff -u tools/perf/arch/x86/entry/syscalls/syscall_64.tbl arch/x86/entry/syscalls/syscall_64.tbl
        Warning: Kernel ABI header at 'tools/perf/arch/powerpc/entry/syscalls/syscall.tbl' differs from latest version at 'arch/powerpc/kernel/syscalls/syscall.tbl'
        diff -u tools/perf/arch/powerpc/entry/syscalls/syscall.tbl arch/powerpc/kernel/syscalls/syscall.tbl
        Warning: Kernel ABI header at 'tools/perf/arch/s390/entry/syscalls/syscall.tbl' differs from latest version at 'arch/s390/kernel/syscalls/syscall.tbl'
        diff -u tools/perf/arch/s390/entry/syscalls/syscall.tbl arch/s390/kernel/syscalls/syscall.tbl
        Warning: Kernel ABI header at 'tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl' differs from latest version at 'arch/mips/kernel/syscalls/syscall_n64.tbl'
        diff -u tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl arch/mips/kernel/syscalls/syscall_n64.tbl
      
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    • hash.h: remove unused define directive · fd0a1462
      Committed by Isabella Basso
      Patch series "test_hash.c: refactor into KUnit", v3.
      
      We refactored the lib/test_hash.c file into KUnit as part of the student
      group LKCAMP [1] introductory hackathon for kernel development.
      
      This test was pointed out to our group by Daniel Latypov [2], so its full
      conversion into a pure KUnit test was our goal in this patch series, but
      we ran into many problems because it was not split into unit tests,
      which complicated matters a bit, as the reasoning behind the original
      tests is quite cryptic for those unfamiliar with hash implementations.
      
      Some interesting developments we'd like to highlight are:
      
       - In patch 1/5 we noticed that there was an unused define directive
         that could be removed.
      
       - In patch 4/5 we noticed how stringhash and hash tests are all under
         the lib/test_hash.c file, which might cause some confusion, and we
         also broke those kernel config entries up.
      
      Overall KUnit developments have been made in the other patches in this
      series:
      
      In patches 2/5, 3/5 and 5/5 we refactored the lib/test_hash.c file so as
      to make it more compatible with the KUnit style, whilst preserving the
      original idea of the maintainer who designed it (i.e.  George Spelvin),
      which might be undesirable for unit tests, but we assume it is enough
      for a first patch.
      
      This patch (of 5):
      
      Currently, there exist hash_32() and __hash_32() functions, which were
      introduced in a patch [1] targeting architecture specific optimizations.
      These functions can be overridden on a per-architecture basis to achieve
      such optimizations.  They must set their corresponding define directive
      (HAVE_ARCH_HASH_32 and HAVE_ARCH__HASH_32, respectively) so that header
      files can deal with these overrides properly.
      
      As the supported 32-bit architectures that have their own hash function
      implementation (i.e.  m68k, Microblaze, H8/300, pa-risc) have only been
      making use of the (more general) __hash_32() function (which only lacks
      a right shift operation when compared to the hash_32() function), remove
      the define directive corresponding to the arch-specific hash_32()
      implementation.
      
      [1] https://lore.kernel.org/lkml/20160525073311.5600.qmail@ns.sciencehorizons.net/
      
      [akpm@linux-foundation.org: hash_32_generic() becomes hash_32()]
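      The relationship described above ("__hash_32() only lacks a right shift
      compared to hash_32()") can be sketched as a userspace reimplementation of
      the generic helpers; the multiplier mirrors GOLDEN_RATIO_32 from
      include/linux/hash.h, everything else is illustrative:

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <stdio.h>

      #define GOLDEN_RATIO_32 0x61C88647u  /* ~2^32 / golden ratio */

      static inline uint32_t my__hash_32(uint32_t val)
      {
          return val * GOLDEN_RATIO_32;
      }

      static inline uint32_t my_hash_32(uint32_t val, unsigned int bits)
      {
          /* The high bits are the best mixed, so keep those. */
          return my__hash_32(val) >> (32 - bits);
      }

      int main(void)
      {
          uint32_t v = 0xdeadbeef;

          /* hash_32() is exactly __hash_32() plus a right shift. */
          assert(my_hash_32(v, 8) == (my__hash_32(v) >> 24));
          assert(my_hash_32(v, 8) < 256);  /* fits in 8 bits */
          printf("hash_32 ok: 0x%08x -> 0x%02x\n", v, my_hash_32(v, 8));
          return 0;
      }
      ```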
      
      Link: https://lkml.kernel.org/r/20211208183711.390454-1-isabbasso@riseup.net
      Link: https://lkml.kernel.org/r/20211208183711.390454-2-isabbasso@riseup.net
      Reviewed-by: David Gow <davidgow@google.com>
      Tested-by: David Gow <davidgow@google.com>
      Co-developed-by: Augusto Durães Camargo <augusto.duraes33@gmail.com>
      Signed-off-by: Augusto Durães Camargo <augusto.duraes33@gmail.com>
      Co-developed-by: Enzo Ferreira <ferreiraenzoa@gmail.com>
      Signed-off-by: Enzo Ferreira <ferreiraenzoa@gmail.com>
      Signed-off-by: Isabella Basso <isabbasso@riseup.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Brendan Higgins <brendanhiggins@google.com>
      Cc: Daniel Latypov <dlatypov@google.com>
      Cc: Shuah Khan <skhan@linuxfoundation.org>
      Cc: Rodrigo Siqueira <rodrigosiqueiramelo@gmail.com>
      Cc: kernel test robot <lkp@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bpf: Add cgroup helpers bpf_{get,set}_retval to get/set syscall return value · b44123b4
      Committed by YiFei Zhu
      The helpers continue to use int for retval because all the hooks
      are int-returning rather than long-returning. The return value of
      bpf_set_retval is int for future-proofing, in case in the future
      there may be errors trying to set the retval.
      
      After the previous patch, if a program rejects a syscall by
      returning 0, an -EPERM will be generated no matter if the retval
      is already set to -err. This patch changes that: -EPERM is forced
      only if the retval is not already -err. This is because we want to
      support, for example, invoking bpf_set_retval(-EINVAL) and returning 0,
      and have the syscall return value be -EINVAL, not -EPERM.
      
      For BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY, the prior behavior is
      that, if the return value is NET_XMIT_DROP, the packet is silently
      dropped. We preserve this behavior for backward compatibility
      reasons, so even if an errno is set, the errno does not return to
      the caller. However, a non-error retval cannot be propagated, so
      setting one is not allowed and we return -EFAULT in that case.
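      The precedence between an explicit bpf_set_retval() and the default
      -EPERM can be sketched as a plain-C model; this is a hypothetical
      userspace illustration of the rule, not the kernel implementation:

      ```c
      #include <assert.h>
      #include <errno.h>
      #include <stdio.h>

      /* Model: prog_ret is the cgroup hook's return value (1 = allow,
       * 0 = reject); stored_retval is what bpf_set_retval() stored. */
      static int syscall_result(int prog_ret, int stored_retval)
      {
          if (prog_ret == 1)          /* allowed: syscall proceeds */
              return 0;
          if (stored_retval < 0)      /* already -err: it wins over -EPERM */
              return stored_retval;
          return -EPERM;              /* default rejection */
      }

      int main(void)
      {
          assert(syscall_result(1, 0) == 0);              /* allowed */
          assert(syscall_result(0, 0) == -EPERM);         /* plain reject */
          assert(syscall_result(0, -EINVAL) == -EINVAL);  /* bpf_set_retval(-EINVAL) */
          printf("retval precedence model ok\n");
          return 0;
      }
      ```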
      Signed-off-by: YiFei Zhu <zhuyifei@google.com>
      Reviewed-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/b4013fd5d16bed0b01977c1fafdeae12e1de61fb.1639619851.git.zhuyifei@google.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • kvm: selftests: sync uapi/linux/kvm.h with Linux header · fa681181
      Committed by Paolo Bonzini
      KVM_CAP_XSAVE2 is out of sync due to a conflict.  Copy the whole
      file while at it.
      Reported-by: Yang Zhong <yang.zhong@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • uapi/bpf: Add missing description and returns for helper documentation · e40fbbf0
      Committed by Usama Arif
      Both the description and returns sections will become mandatory
      for helpers and syscalls in a later commit, to generate man pages.
      
      This commit also adds in the documentation that BPF_PROG_RUN is
      an alias for BPF_PROG_TEST_RUN for anyone searching for the
      syscall in the generated man pages.
      Signed-off-by: Usama Arif <usama.arif@bytedance.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20220119114442.1452088-1-usama.arif@bytedance.com
  16. 16 Jan 2022, 1 commit
  17. 15 Jan 2022, 1 commit
  18. 13 Jan 2022, 1 commit
  19. 06 Jan 2022, 1 commit
  20. 22 Dec 2021, 1 commit
    • tools headers UAPI: Add new macros for mem_hops field to perf_event.h · 7fbddf40
      Committed by Kajol Jain
      Add new macros for mem_hops field which can be used to represent
      remote-node, socket and board level details.
      
      Currently the code has a macro for HOPS_0, which corresponds to data
      coming from another core on the same node.  Add new macros HOPS_1 to
      HOPS_3 to represent remote-node, socket and board level data.
      
      Also add corresponding strings in the mem_hops array to represent
      the mem_hop field data in the perf_mem__lvl_scnprintf function.
      
      In case the mem_hops field is used, the PERF_MEM_LVLNUM field also needs
      to be set in order to represent the data source. Hence printing the data
      source via the PERF_MEM_LVL field can be skipped in that scenario.
      
      For ex: Encodings for mem_hops fields with L2 cache:
      
        L2                      - local L2
        L2 | REMOTE | HOPS_0    - remote core, same node L2
        L2 | REMOTE | HOPS_1    - remote node, same socket L2
        L2 | REMOTE | HOPS_2    - remote socket, same board L2
        L2 | REMOTE | HOPS_3    - remote board L2
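      One of those encodings can be sketched as bit composition of a
      perf_mem_data_src value. The constants below are mirrored locally from
      include/uapi/linux/perf_event.h for self-containment (treat them as
      illustrative, and prefer the real header in actual tooling):

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <stdio.h>

      #define PERF_MEM_LVLNUM_L2     0x02ULL
      #define PERF_MEM_LVLNUM_SHIFT  33
      #define PERF_MEM_REMOTE_REMOTE 0x01ULL
      #define PERF_MEM_REMOTE_SHIFT  37
      #define PERF_MEM_HOPS_1        0x02ULL  /* remote node, same socket */
      #define PERF_MEM_HOPS_SHIFT    43

      int main(void)
      {
          /* "remote node, same socket L2" == L2 | REMOTE | HOPS_1 */
          uint64_t src = (PERF_MEM_LVLNUM_L2 << PERF_MEM_LVLNUM_SHIFT) |
                         (PERF_MEM_REMOTE_REMOTE << PERF_MEM_REMOTE_SHIFT) |
                         (PERF_MEM_HOPS_1 << PERF_MEM_HOPS_SHIFT);

          /* Decode the fields back out of data_src. */
          assert(((src >> PERF_MEM_LVLNUM_SHIFT) & 0xf) == PERF_MEM_LVLNUM_L2);
          assert((src >> PERF_MEM_REMOTE_SHIFT) & 0x1);
          assert(((src >> PERF_MEM_HOPS_SHIFT) & 0x7) == PERF_MEM_HOPS_1);
          printf("data_src=0x%llx\n", (unsigned long long)src);
          return 0;
      }
      ```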
      Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Athira Jajeev <atrajeev@linux.vnet.ibm.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Jin Yao <yao.jin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Kan Liang <kan.liang@linux.intel.com>
      Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nageswara R Sastry <rnsastry@linux.ibm.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lore.kernel.org/lkml/20211206091749.87585-3-kjain@linux.ibm.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  21. 14 Dec 2021, 1 commit
    • bpf: Add get_func_[arg|ret|arg_cnt] helpers · f92c1e18
      Committed by Jiri Olsa
      Add the following helpers for tracing programs:
      
      Get n-th argument of the traced function:
        long bpf_get_func_arg(void *ctx, u32 n, u64 *value)
      
      Get return value of the traced function:
        long bpf_get_func_ret(void *ctx, u64 *value)
      
      Get arguments count of the traced function:
        long bpf_get_func_arg_cnt(void *ctx)
      
      The trampoline now stores the number of arguments at the ctx-8
      address, so it's easy to verify the argument index and find the
      return value argument's position.
      
      The function's IP address is moved behind the number of function
      arguments on the trampoline stack, so it's now stored at the ctx-16
      address if it's needed.
      
      All the helpers above are inlined by the verifier.
      
      Also a small, somewhat unrelated change: use the newly added function
      bpf_prog_has_trampoline in check_get_func_ip.
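      The ctx layout described above (argument count at ctx-8, IP at ctx-16,
      arguments at ctx[0..n-1]) can be sketched as a userspace model. This is a
      hypothetical illustration of where the helpers look, not the kernel code:

      ```c
      #include <assert.h>
      #include <stdint.h>
      #include <stdio.h>

      /* Model of bpf_get_func_arg(): nr_args sits one slot below ctx. */
      static long model_get_func_arg(uint64_t *ctx, uint32_t n, uint64_t *value)
      {
          uint64_t nr_args = ctx[-1];  /* stored at ctx-8 by the trampoline */

          if (n >= nr_args)
              return -22;              /* -EINVAL */
          *value = ctx[n];
          return 0;
      }

      int main(void)
      {
          /* Fake trampoline stack: [ip][nr_args][arg0][arg1][arg2] */
          uint64_t stack[5] = { 0x400000 /* ip */, 3 /* nr_args */, 10, 20, 30 };
          uint64_t *ctx = &stack[2];   /* ctx points at the first argument */
          uint64_t v = 0;

          assert(model_get_func_arg(ctx, 1, &v) == 0 && v == 20);
          assert(model_get_func_arg(ctx, 3, &v) == -22);  /* out of range */
          printf("nr_args=%llu ip=0x%llx\n",
                 (unsigned long long)ctx[-1], (unsigned long long)ctx[-2]);
          return 0;
      }
      ```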
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20211208193245.172141-5-jolsa@kernel.org
  22. 12 Dec 2021, 1 commit
  23. 11 Dec 2021, 1 commit
    • tools: fix ARRAY_SIZE defines in tools and selftests hdrs · 066b34aa
      Committed by Shuah Khan
      tools/include/linux/kernel.h and kselftest_harness.h are missing an
      ifndef guard around the ARRAY_SIZE define. Fix them to avoid duplicate
      define errors during compilation when another file defines it. This
      problem was found when compiling selftests that include a header
      with an ARRAY_SIZE define.
      
      ARRAY_SIZE is defined in several selftests. There are about 25+
      duplicate defines in various selftests source and header files.
      Add ARRAY_SIZE to kselftest.h in preparation for removing duplicate
      ARRAY_SIZE defines from individual test files.
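      The guard pattern the fix adds can be sketched as follows: once wrapped,
      a second definition attempt from another header becomes harmless.

      ```c
      #include <assert.h>
      #include <stdio.h>

      #ifndef ARRAY_SIZE
      #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
      #endif

      /* A later, duplicate definition attempt no longer errors out: */
      #ifndef ARRAY_SIZE
      #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
      #endif

      int main(void)
      {
          int fds[16];
          const char *names[] = { "a", "b", "c" };

          assert(ARRAY_SIZE(fds) == 16);
          assert(ARRAY_SIZE(names) == 3);
          printf("ARRAY_SIZE works: %zu %zu\n",
                 ARRAY_SIZE(fds), ARRAY_SIZE(names));
          return 0;
      }
      ```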
      Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
  24. 10 Dec 2021, 1 commit
  25. 03 Dec 2021, 2 commits
  26. 01 Dec 2021, 3 commits
    • tools/nolibc: Implement gettid() · b0fe9dec
      Committed by Mark Brown
      Allow test programs to determine their thread ID.
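      What a gettid() wrapper boils down to can be sketched as a thin wrapper
      over the gettid syscall; nolibc emits the syscall instruction directly,
      while this illustration goes through libc's syscall() for simplicity:

      ```c
      #include <assert.h>
      #include <stdio.h>
      #include <unistd.h>
      #include <sys/syscall.h>

      static pid_t my_gettid(void)
      {
          return (pid_t)syscall(SYS_gettid);
      }

      int main(void)
      {
          /* In a single-threaded process the thread ID equals the PID. */
          assert(my_gettid() == getpid());
          printf("tid=%d\n", (int)my_gettid());
          return 0;
      }
      ```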
      Signed-off-by: Mark Brown <broonie@kernel.org>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • tools/nolibc: x86-64: Use `mov $60,%eax` instead of `mov $60,%rax` · 7bdc0e7a
      Committed by Ammar Faizi
      Note that a mov to a 32-bit register zero-extends into the full 64-bit
      register. Thus `mov $60,%eax` has the same effect as `mov $60,%rax`. Use
      the shorter opcode to achieve the same thing.
      ```
        b8 3c 00 00 00       	mov    $60,%eax (5 bytes) [1]
        48 c7 c0 3c 00 00 00 	mov    $60,%rax (7 bytes) [2]
      ```
      Currently, we use [2]. Change it to [1] for shorter code.
      Signed-off-by: Ammar Faizi <ammar.faizi@students.amikom.ac.id>
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
    • tools/nolibc: x86: Remove `r8`, `r9` and `r10` from the clobber list · bf916669
      Committed by Ammar Faizi
      Linux x86-64 syscall only clobbers rax, rcx and r11 (and "memory").
      
        - rax for the return value.
        - rcx to save the return address.
        - r11 to save the rflags.
      
      Other registers are preserved.
      
      Having r8, r9 and r10 in the syscall clobber list is harmless, but it
      results in a missed optimization.
      
      As the syscall doesn't clobber r8-r10, GCC should be allowed to reuse
      their values after the syscall returns to userspace. But since they are
      in the clobber list, GCC will always miss this opportunity.
      
      Remove them from the x86-64 syscall clobber list to help GCC generate
      better code and fix the comment.
      
      See also the x86-64 ABI, section A.2 AMD64 Linux Kernel Conventions,
      A.2.1 Calling Conventions [1].
      
      Extra note:
      Some people may think removing r8, r9 and r10 from the syscall
      clobber list gives no real benefit, because they picture a syscall
      as a C function call, and a function call always clobbers those 3.
      
      However, that is not the case for nolibc.h, because we have the
      potential to inline the "syscall" instruction (whose opcode is
      "0f 05") into the user functions.
      
      All syscalls in nolibc.h are written as static functions with inline
      ASM and are likely always inlined when optimizing, so it is a win
      not to have r8, r9 and r10 in the clobber list.
      
      Here is the example where this matters.
      
      Consider the following C code:
      ```
        #include "tools/include/nolibc/nolibc.h"
        #define read_abc(a, b, c) __asm__ volatile("nop"::"r"(a),"r"(b),"r"(c))
      
        int main(void)
        {
        	int a = 0xaa;
        	int b = 0xbb;
        	int c = 0xcc;
      
        	read_abc(a, b, c);
        	write(1, "test\n", 5);
        	read_abc(a, b, c);
      
        	return 0;
        }
      ```
      
      Compile with:
          gcc -Os test.c -o test -nostdlib
      
      With r8, r9, r10 in the clobber list, GCC generates this:
      
      0000000000001000 <main>:
          1000:	f3 0f 1e fa          	endbr64
          1004:	41 54                	push   %r12
          1006:	41 bc cc 00 00 00    	mov    $0xcc,%r12d
          100c:	55                   	push   %rbp
          100d:	bd bb 00 00 00       	mov    $0xbb,%ebp
          1012:	53                   	push   %rbx
          1013:	bb aa 00 00 00       	mov    $0xaa,%ebx
          1018:	90                   	nop
          1019:	b8 01 00 00 00       	mov    $0x1,%eax
          101e:	bf 01 00 00 00       	mov    $0x1,%edi
          1023:	ba 05 00 00 00       	mov    $0x5,%edx
          1028:	48 8d 35 d1 0f 00 00 	lea    0xfd1(%rip),%rsi
          102f:	0f 05                	syscall
          1031:	90                   	nop
          1032:	31 c0                	xor    %eax,%eax
          1034:	5b                   	pop    %rbx
          1035:	5d                   	pop    %rbp
          1036:	41 5c                	pop    %r12
          1038:	c3                   	ret
      
      GCC thinks that syscall will clobber r8, r9, r10. So it spills 0xaa,
      0xbb and 0xcc to callee saved registers (r12, rbp and rbx). This is
      clearly extra memory access and extra stack size for preserving them.
      
      But syscall does not actually clobber them, so this is a missed
      optimization.
      
      Now without r8, r9, r10 in the clobber list, GCC generates better code:
      
      0000000000001000 <main>:
          1000:	f3 0f 1e fa          	endbr64
          1004:	41 b8 aa 00 00 00    	mov    $0xaa,%r8d
          100a:	41 b9 bb 00 00 00    	mov    $0xbb,%r9d
          1010:	41 ba cc 00 00 00    	mov    $0xcc,%r10d
          1016:	90                   	nop
          1017:	b8 01 00 00 00       	mov    $0x1,%eax
          101c:	bf 01 00 00 00       	mov    $0x1,%edi
          1021:	ba 05 00 00 00       	mov    $0x5,%edx
          1026:	48 8d 35 d3 0f 00 00 	lea    0xfd3(%rip),%rsi
          102d:	0f 05                	syscall
          102f:	90                   	nop
          1030:	31 c0                	xor    %eax,%eax
          1032:	c3                   	ret
      
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: x86@kernel.org
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Ammar Faizi <ammar.faizi@students.amikom.ac.id>
      Link: https://gitlab.com/x86-psABIs/x86-64-ABI/-/wikis/x86-64-psABI [1]
      Link: https://lore.kernel.org/lkml/20211011040344.437264-1-ammar.faizi@students.amikom.ac.id/
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>