1. 14 Dec 2021, 1 commit
  2. 01 Dec 2021, 1 commit
  3. 16 Nov 2021, 1 commit
    • bpf: Change value of MAX_TAIL_CALL_CNT from 32 to 33 · ebf7f6f0
      Committed by Tiezhu Yang
      In the current code, the actual max tail call count is 33, which is
      greater than MAX_TAIL_CALL_CNT (defined as 32). The actual limit is not
      consistent with the meaning of MAX_TAIL_CALL_CNT and is thus confusing
      at first glance. We can see the historical evolution from commit
      04fd61ab ("bpf: allow bpf programs to tail-call other bpf programs")
      and commit f9dabe01 ("bpf: Undo off-by-one in interpreter tail call
      count limit"). In order to avoid changing existing behavior, the actual
      limit remains 33, which is reasonable.
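
      A minimal sketch of what the check looks like with the new value
      (illustrative only, not the verbatim kernel diff; the old code used
      MAX_TAIL_CALL_CNT == 32 together with a '>' comparison, which allowed
      the same 33 calls but hid the real limit):

        #define MAX_TAIL_CALL_CNT 33

        /* in the tail call path, with tail_call_cnt starting at 0 */
        if (unlikely(tail_call_cnt >= MAX_TAIL_CALL_CNT))
                goto out;               /* the 34th tail call is rejected */
        tail_call_cnt++;                /* calls 1..33 get here */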
      
      After commit 874be05f ("bpf, tests: Add tail call test suite"), we can
      see that a failing testcase exists.
      
      On all archs when CONFIG_BPF_JIT_ALWAYS_ON is not set:
       # echo 0 > /proc/sys/net/core/bpf_jit_enable
       # modprobe test_bpf
       # dmesg | grep -w FAIL
       Tail call error path, max count reached jited:0 ret 34 != 33 FAIL
      
      On some archs:
       # echo 1 > /proc/sys/net/core/bpf_jit_enable
       # modprobe test_bpf
       # dmesg | grep -w FAIL
       Tail call error path, max count reached jited:1 ret 34 != 33 FAIL
      
      Although the above failed testcase has been fixed in commit 18935a72
      ("bpf/tests: Fix error in tail call limit tests"), it would still be good
      to change the value of MAX_TAIL_CALL_CNT from 32 to 33 to make the code
      more readable.
      
      The 32-bit x86 JIT was using a limit of 32, so just fix its wrong
      comments and limit to 33 tail calls, matching the updated
      MAX_TAIL_CALL_CNT constant. For the mips64 JIT, use "ori" instead of
      "addiu" as suggested by Johan Almbladh. For the riscv JIT, use
      RV_REG_TCC directly to save one register move as suggested by Björn
      Töpel. The other implementations need no functional change: the current
      limit of 33 is kept, the new value of MAX_TAIL_CALL_CNT now reflects
      the actual max tail call count, and the related tail call testcases in
      the test_bpf module and selftests keep working for both the interpreter
      and the JIT.
      
      Here are the test results on x86_64:
      
       # uname -m
       x86_64
       # echo 0 > /proc/sys/net/core/bpf_jit_enable
       # modprobe test_bpf test_suite=test_tail_calls
       # dmesg | tail -1
       test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [0/8 JIT'ed]
       # rmmod test_bpf
       # echo 1 > /proc/sys/net/core/bpf_jit_enable
       # modprobe test_bpf test_suite=test_tail_calls
       # dmesg | tail -1
       test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [8/8 JIT'ed]
       # rmmod test_bpf
       # ./test_progs -t tailcalls
       #142 tailcalls:OK
       Summary: 1/11 PASSED, 0 SKIPPED, 0 FAILED
      Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Tested-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
      Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Acked-by: Björn Töpel <bjorn@kernel.org>
      Acked-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
      Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
      Link: https://lore.kernel.org/bpf/1636075800-3264-1-git-send-email-yangtiezhu@loongson.cn
  4. 29 Oct 2021, 3 commits
  5. 08 Oct 2021, 1 commit
  6. 06 Oct 2021, 1 commit
  7. 28 Sep 2021, 1 commit
    • bpf, x86: Fix bpf mapping of atomic fetch implementation · ced18582
      Committed by Johan Almbladh
      Fix the case where the dst register maps to %rax, as otherwise this
      produces an incorrect mapping with the implementation in 981f94c3
      ("bpf: Add bitwise atomic instructions"): %rax is clobbered given it is
      used as an operand of the cmpxchg.

      The issue is similar to b29dd96b ("bpf, x86: Fix BPF_FETCH atomic and/or/
      xor with r0 as src"), just that the case of the dst register was missed.
      
      Before, dst=r0 (%rax) src=r2 (%rsi):
      
        [...]
        c5:   mov    %rax,%r10
        c8:   mov    0x0(%rax),%rax       <---+ (broken)
        cc:   mov    %rax,%r11                |
        cf:   and    %rsi,%r11                |
        d2:   lock cmpxchg %r11,0x0(%rax) <---+
        d8:   jne    0x00000000000000c8       |
        da:   mov    %rax,%rsi                |
        dd:   mov    %r10,%rax                |
        [...]                                 |
                                              |
      After, dst=r0 (%rax) src=r2 (%rsi):     |
                                              |
        [...]                                 |
        da:	mov    %rax,%r10                |
        dd:	mov    0x0(%r10),%rax       <---+ (fixed)
        e1:	mov    %rax,%r11                |
        e4:	and    %rsi,%r11                |
        e7:	lock cmpxchg %r11,0x0(%r10) <---+
        ed:	jne    0x00000000000000dd
        ef:	mov    %rax,%rsi
        f2:	mov    %r10,%rax
        [...]
      
      The remaining combinations were fine as-is though:
      
      After, dst=r9 (%r15) src=r0 (%rax):
      
        [...]
        dc:	mov    %rax,%r10
        df:	mov    0x0(%r15),%rax
        e3:	mov    %rax,%r11
        e6:	and    %r10,%r11
        e9:	lock cmpxchg %r11,0x0(%r15)
        ef:	jne    0x00000000000000df      _
        f1:	mov    %rax,%r10                | (unneeded, but
        f4:	mov    %r10,%rax               _|  not a problem)
        [...]
      
      After, dst=r9 (%r15) src=r4 (%rcx):
      
        [...]
        de:	mov    %rax,%r10
        e1:	mov    0x0(%r15),%rax
        e5:	mov    %rax,%r11
        e8:	and    %rcx,%r11
        eb:	lock cmpxchg %r11,0x0(%r15)
        f1:	jne    0x00000000000000e1
        f3:	mov    %rax,%rcx
        f6:	mov    %r10,%rax
        [...]
      
      The case of dst == src register is rejected by the verifier and
      therefore not supported, but x86 JIT also handles this case just
      fine.
      
      After, dst=r0 (%rax) src=r0 (%rax):
      
        [...]
        eb:	mov    %rax,%r10
        ee:	mov    0x0(%r10),%rax
        f2:	mov    %rax,%r11
        f5:	and    %r10,%r11
        f8:	lock cmpxchg %r11,0x0(%r10)
        fe:	jne    0x00000000000000ee
       100:	mov    %rax,%r10
       103:	mov    %r10,%rax
        [...]
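
      For reference, a hypothetical BPF snippet that exercises the fixed path
      (dst_reg == r0, i.e. the pointer operand lands in %rax), written with
      the kernel's insn macros; this is an illustrative repro sketch, not
      taken from the patch itself:

        BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0x0f),      /* mem = 0x0f */
        BPF_MOV64_IMM(BPF_REG_2, 0x10),                /* src value in r2 */
        BPF_MOV64_REG(BPF_REG_0, BPF_REG_10),
        BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, -8),         /* dst pointer in r0 */
        BPF_ATOMIC_OP(BPF_DW, BPF_AND | BPF_FETCH, BPF_REG_0, BPF_REG_2, 0),
        BPF_MOV64_IMM(BPF_REG_0, 0),
        BPF_EXIT_INSN(),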
      
      Fixes: 981f94c3 ("bpf: Add bitwise atomic instructions")
      Reported-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
      Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
      Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Brendan Jackman <jackmanb@google.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
  8. 15 Sep 2021, 2 commits
  9. 13 Sep 2021, 1 commit
    • x86/extable: Rework the exception table mechanics · 46d28947
      Committed by Thomas Gleixner
      The exception table entries contain the instruction address, the fixup
      address and the handler address. All addresses are relative. Storing the
      handler address has a few downsides:
      
       1) Most handlers need to be exported
      
       2) Handlers can be defined everywhere and there is no overview about the
          handler types
      
       3) MCE needs to check the handler type to decide whether an in kernel #MC
          can be recovered. The functionality of the handler itself is not in any
          way special, but for these checks there need to be separate functions
          which in the worst case have to be exported.
      
          Some of these 'recoverable' exception fixups are pretty obscure and
          just reuse some other handler to spare code. That obfuscates e.g. the
          #MC safe copy functions. Cleaning that up would require more handlers
          and exports.
      
      Rework the exception fixup mechanics by storing a fixup type number instead
      of the handler address and invoke the proper handler for each fixup
      type. Also teach the extable sort to leave the type field alone.
      
      This makes most handlers static except for special cases like the MCE
      MSR fixup and the BPF fixup. This allows to add more types for cleaning up
      the obscure places without adding more handler code and exports.
      
      There is a marginal code size reduction for a production config and it
      removes _eight_ exported symbols.
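
      A rough sketch of the direction (type and handler names are assumed for
      illustration; the actual series may differ in detail):

        /* Relative extable entry that stores a fixup type, not a handler
         * address, so dispatch can be a simple switch and most handlers can
         * stay static.
         */
        struct exception_table_entry {
                int insn, fixup, type;
        };

        static bool fixup_dispatch(const struct exception_table_entry *e,
                                   struct pt_regs *regs)
        {
                switch (e->type) {
                case EX_TYPE_DEFAULT:
                        return ex_handler_default(e, regs);
                case EX_TYPE_BPF:
                        return ex_handler_bpf(e, regs);
                /* more types can be added without new exports */
                default:
                        return false;
                }
        }
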
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lkml.kernel.org/r/20210908132525.211958725@linutronix.de
  10. 29 Jul 2021, 1 commit
    • bpf: Introduce BPF nospec instruction for mitigating Spectre v4 · f5e81d11
      Committed by Daniel Borkmann
      In case of JITs, each of the JIT backends compiles the BPF nospec instruction
      /either/ to a machine instruction which emits a speculation barrier /or/ to
      /no/ machine instruction in case the underlying architecture is not affected
      by Speculative Store Bypass or has different mitigations in place already.
      
      This covers both x86 and (implicitly) arm64: In case of x86, we use 'lfence'
      instruction for mitigation. In case of arm64, we rely on the firmware mitigation
      as controlled via the ssbd kernel parameter. Whenever the mitigation is enabled,
      it works for all of the kernel code with no need to provide any additional
      instructions here (hence only comment in arm64 JIT). Other archs can follow
      as needed. The BPF nospec instruction is specifically targeting Spectre v4
      since i) we don't use a serialization barrier for the Spectre v1 case, and
      ii) mitigation instructions for v1 and v4 might be different on some archs.
      
      The BPF nospec instruction is required for a future commit, where the
      BPF verifier annotates intermediate BPF programs with speculation
      barriers.
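
      A minimal sketch of the x86-64 lowering described above (illustrative;
      the exact guard condition and emit helpers may differ):

        /* BPF_NOSPEC: emit a speculation barrier on affected CPUs. */
        case BPF_ST | BPF_NOSPEC:
                if (boot_cpu_has(X86_FEATURE_XMM2))
                        EMIT3(0x0F, 0xAE, 0xE8);        /* lfence */
                break;
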
      Co-developed-by: Piotr Krysiuk <piotras@gmail.com>
      Co-developed-by: Benedict Schlueter <benedict.schlueter@rub.de>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
      Signed-off-by: Benedict Schlueter <benedict.schlueter@rub.de>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
  11. 16 Jul 2021, 1 commit
  12. 09 Jul 2021, 1 commit
    • bpf: Track subprog poke descriptors correctly and fix use-after-free · f263a814
      Committed by John Fastabend
      Subprograms are calling map_poke_track(), but on program release there is no
      hook to call map_poke_untrack(). However, on program release, the aux memory
      (and poke descriptor table) is freed even though we still have a reference to
      it in the element list of the map aux data. When we run map_poke_run(), we then
      end up accessing free'd memory, triggering KASAN in prog_array_map_poke_run():
      
        [...]
        [  402.824689] BUG: KASAN: use-after-free in prog_array_map_poke_run+0xc2/0x34e
        [  402.824698] Read of size 4 at addr ffff8881905a7940 by task hubble-fgs/4337
        [  402.824705] CPU: 1 PID: 4337 Comm: hubble-fgs Tainted: G          I       5.12.0+ #399
        [  402.824715] Call Trace:
        [  402.824719]  dump_stack+0x93/0xc2
        [  402.824727]  print_address_description.constprop.0+0x1a/0x140
        [  402.824736]  ? prog_array_map_poke_run+0xc2/0x34e
        [  402.824740]  ? prog_array_map_poke_run+0xc2/0x34e
        [  402.824744]  kasan_report.cold+0x7c/0xd8
        [  402.824752]  ? prog_array_map_poke_run+0xc2/0x34e
        [  402.824757]  prog_array_map_poke_run+0xc2/0x34e
        [  402.824765]  bpf_fd_array_map_update_elem+0x124/0x1a0
        [...]
      
      The elements concerned are walked as follows:
      
          for (i = 0; i < elem->aux->size_poke_tab; i++) {
                 poke = &elem->aux->poke_tab[i];
          [...]
      
      The access to size_poke_tab is a 4 byte read, verified by checking offsets
      in the KASAN dump:
      
        [  402.825004] The buggy address belongs to the object at ffff8881905a7800
                       which belongs to the cache kmalloc-1k of size 1024
        [  402.825008] The buggy address is located 320 bytes inside of
                       1024-byte region [ffff8881905a7800, ffff8881905a7c00)
      
      The pahole output of bpf_prog_aux:
      
        struct bpf_prog_aux {
          [...]
          /* --- cacheline 5 boundary (320 bytes) --- */
          u32                        size_poke_tab;        /*   320     4 */
          [...]
      
      In general, subprograms do not necessarily manage their own data structures.
      For example, BTF func_info and linfo are just pointers to the main program
      structure. This allows reference counting and cleanup to be done on the latter
      which simplifies their management a bit. The aux->poke_tab struct, however,
      did not follow this logic. The initial proposed fix for this use-after-free
      bug further embedded poke data tracking into the subprogram with proper
      reference counting. However, Daniel and Alexei questioned why we were
      treating these objects as special; I agree, it's unnecessary. The fix here
      removes the per subprogram poke table allocation and map tracking and
      instead simply points the aux->poke_tab pointer at the main program's
      poke table. This way, map tracking is simplified to the main program and
      we do not need to manage them per subprogram.
      
      This also means that bpf_prog_free_deferred(), which unwinds the program
      reference counting and kfrees objects, needs to ensure that we don't try
      to double free the poke_tab when freeing the subprog structures. This is
      easily solved by NULL'ing the poke_tab pointer. The second detail is to
      ensure that per
      subprogram JIT logic only does fixups on poke_tab[] entries it owns. To do
      this, we add a pointer in the poke structure to point at the subprogram value
      so JITs can easily check while walking the poke_tab structure if the current
      entry belongs to the current program. The aux pointer is stable and therefore
      suitable for such comparison. On the jit_subprogs() error path, we omit
      cleaning up the poke->aux field because these are only ever referenced from
      the JIT side, but on error we will never make it to the JIT, so it's fine to
      leave them dangling. Removing these pointers would complicate the error path
      for no reason. However, we do need to untrack all poke descriptors from the
      main program as otherwise they could race with the freeing of JIT memory from
      the subprograms. Lastly, a748c697 ("bpf: propagate poke descriptors to
      subprograms") had an off-by-one on the subprogram instruction index range
      check as it was testing 'insn_idx >= subprog_start && insn_idx <= subprog_end'.
      However, subprog_end is the next subprogram's start instruction.
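
      A sketch of the ownership check the JIT can now do while walking the
      shared table (illustrative, derived from the description above):

        /* All subprogs share the main prog's poke_tab; only fix up the
         * entries whose poke->aux matches this program's own aux.
         */
        for (i = 0; i < prog->aux->size_poke_tab; i++) {
                struct bpf_jit_poke_descriptor *poke = &prog->aux->poke_tab[i];

                if (poke->aux != prog->aux)
                        continue;       /* owned by another subprogram */
                /* ... patch tailcall target/bypass for this entry ... */
        }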
      
      Fixes: a748c697 ("bpf: propagate poke descriptors to subprograms")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210707223848.14580-2-john.fastabend@gmail.com
  13. 28 Jun 2021, 1 commit
  14. 24 Jun 2021, 1 commit
  15. 08 Apr 2021, 1 commit
    • bpf, x86: Validate computation of branch displacements for x86-64 · e4d4d456
      Committed by Piotr Krysiuk
      The branch displacement logic in the BPF JIT compilers for x86 assumes
      that, for any generated branch instruction, the distance cannot
      increase between optimization passes.
      
      But this assumption can be violated due to how the distances are
      computed. Specifically, whenever a backward branch is processed in
      do_jit(), the distance is computed by subtracting the positions in the
      machine code from different optimization passes. This is because part
      of addrs[] is already updated for the current optimization pass, before
      the branch instruction is visited.
      
      And so the optimizer can expand blocks of machine code in some cases.
      
      This can confuse the optimizer logic, where it assumes that a fixed
      point has been reached for all machine code blocks once the total
      program size stops changing. And then the JIT compiler can output
      abnormal machine code containing incorrect branch displacements.
      
      To mitigate this issue, we assert that a fixed point is reached while
      populating the output image. This rejects any problematic programs.
      The issue affects both x86-32 and x86-64. We mitigate separately to
      ease backporting.
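
      Conceptually, the mitigation amounts to a convergence check in the final
      pass, along these lines (sketch; the error-path label is a placeholder):

        if (image) {
                /* While writing the final image, this pass must produce
                 * exactly the length the image was sized for; otherwise no
                 * fixed point was reached, so reject the program.
                 */
                if (proglen != oldproglen) {
                        pr_err("bpf_jit: proglen=%d != oldproglen=%d\n",
                               proglen, oldproglen);
                        prog = orig_prog;       /* fall back / reject */
                        goto out_addrs;
                }
                break;
        }
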
      Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
      Reviewed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  16. 27 Mar 2021, 1 commit
    • bpf: Support bpf program calling kernel function · e6ac2450
      Committed by Martin KaFai Lau
      This patch adds support to the BPF verifier to allow a bpf program to
      call kernel functions directly.
      
      The use case included in this set is to allow bpf-tcp-cc to directly
      call some tcp-cc helper functions (e.g. "tcp_cong_avoid_ai()").  Those
      functions have already been used by some kernel tcp-cc implementations.
      
      This set will also allow the bpf-tcp-cc program to directly call the
      kernel tcp-cc implementation. For example, a bpf_dctcp may only want to
      implement its own dctcp_cwnd_event() and reuse other dctcp_*() directly
      from the kernel tcp_dctcp.c instead of reimplementing (or
      copy-and-pasting) them.
      
      The tcp-cc kernel functions mentioned above will be whitelisted
      for the struct_ops bpf-tcp-cc programs to use in a later patch.
      The whitelisted functions are not bound to a fixed ABI contract.
      Those functions have already been used by the existing kernel tcp-cc.
      If any of them has changed, both in-tree and out-of-tree kernel tcp-cc
      implementations have to be changed.  The same goes for the struct_ops
      bpf-tcp-cc programs which have to be adjusted accordingly.
      
      This patch is to make the required changes in the bpf verifier.
      
      The first change is in btf.c: it adds a case in "btf_check_func_arg_match()".
      When the passed in "btf->kernel_btf == true", it means matching the
      verifier regs' states with a kernel function.  This will handle the
      PTR_TO_BTF_ID reg.  It also maps PTR_TO_SOCK_COMMON, PTR_TO_SOCKET,
      and PTR_TO_TCP_SOCK to their kernel btf_ids.
      
      In the later libbpf patch, the insn calling a kernel function will
      look like:
      
      insn->code == (BPF_JMP | BPF_CALL)
      insn->src_reg == BPF_PSEUDO_KFUNC_CALL /* <- new in this patch */
      insn->imm == func_btf_id /* btf_id of the running kernel */
      
      [ For the future calling function-in-kernel-module support, an array
        of module btf_fds can be passed at the load time and insn->off
        can be used to index into this array. ]
      
      At an early stage, the verifier will collect all kernel
      function calls into "struct bpf_kfunc_desc".  Those
      descriptors are stored in "prog->aux->kfunc_tab" and will
      be available to the JIT.  Since this "add" operation is similar
      to the current "add_subprog()" and looks for the same insn->code,
      they are done together in the new "add_subprog_and_kfunc()".
      
      In the "do_check()" stage, the new "check_kfunc_call()" is added
      to verify the kernel function call instruction:
      1. Ensure the kernel function can be used by a particular BPF_PROG_TYPE.
         A new bpf_verifier_ops "check_kfunc_call" is added to do that.
         The bpf-tcp-cc struct_ops program will implement this function in
         a later patch.
      2. Call "btf_check_kfunc_args_match()" to ensure the regs can be
         used as the args of a kernel function.
      3. Mark the regs' type, subreg_def, and zext_dst.
      
      At the later do_misc_fixups() stage, the new fixup_kfunc_call()
      will replace the insn->imm with the function address (relative
      to __bpf_call_base).  If needed, the jit can find the btf_func_model
      by calling the new bpf_jit_find_kfunc_model(prog, insn).
      With the imm set to the function address, "bpftool prog dump xlated"
      will be able to display the kernel function calls the same way as
      it displays other bpf helper calls.
      
      A gpl_compatible program is required in order to call kernel functions.
      
      This feature currently requires JIT.
      
      The verifier selftests are adjusted because of the changes in
      the verbose log in add_subprog_and_kfunc().
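
      A small sketch of how such a call can be recognized, following the
      encoding above (helper name is illustrative):

        /* A kfunc call reuses BPF_CALL but is tagged via src_reg;
         * insn->imm then holds the kernel function's BTF id.
         */
        static bool insn_is_kfunc_call(const struct bpf_insn *insn)
        {
                return insn->code == (BPF_JMP | BPF_CALL) &&
                       insn->src_reg == BPF_PSEUDO_KFUNC_CALL;
        }
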
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210325015142.1544736-1-kafai@fb.com
  17. 20 Mar 2021, 1 commit
  18. 18 Mar 2021, 2 commits
    • x86: Fix various typos in comments · d9f6e12f
      Committed by Ingo Molnar
      Fix ~144 single-word typos in arch/x86/ code comments.
      
      Doing this in a single commit should reduce the churn.
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: linux-kernel@vger.kernel.org
    • bpf: Fix fexit trampoline. · e21aa341
      Committed by Alexei Starovoitov
      The fexit/fmod_ret programs can be attached to kernel functions that can sleep.
      The synchronize_rcu_tasks() will not wait for such tasks to complete.
      In such a case the trampoline image will be freed and, when the task
      wakes up, the return IP will point to freed memory, causing a crash.
      Solve this by adding percpu_ref_get/put for the duration of the
      trampoline and by separating the trampoline's lifetime from that of its
      image.
      The "half page" optimization has to be removed, since
      first_half->second_half->first_half transition cannot be guaranteed to
      complete in deterministic time. Every trampoline update becomes a new image.
      The image with fmod_ret or fexit progs will be freed via percpu_ref_kill and
      call_rcu_tasks. Together they will wait for the original function and
      trampoline asm to complete. The trampoline is patched from nop to jmp to skip
      fexit progs. They are freed independently from the trampoline. The image with
      fentry progs only will be freed via call_rcu_tasks_trace+call_rcu_tasks which
      will wait for both sleepable and non-sleepable progs to complete.
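
      A sketch of the enter/exit pinning described above (illustrative; names
      follow the description rather than the final code):

        /* Keep the trampoline image alive while any task executes it. */
        void notrace __bpf_tramp_enter(struct bpf_tramp_image *im)
        {
                percpu_ref_get(&im->pcref);
        }

        void notrace __bpf_tramp_exit(struct bpf_tramp_image *im)
        {
                percpu_ref_put(&im->pcref);
        }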
      
      Fixes: fec56f58 ("bpf: Introduce BPF trampoline")
      Reported-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Paul E. McKenney <paulmck@kernel.org>  # for RCU
      Link: https://lore.kernel.org/bpf/20210316210007.38949-1-alexei.starovoitov@gmail.com
  19. 15 Mar 2021, 1 commit
    • x86: Remove dynamic NOP selection · a89dfde3
      Committed by Peter Zijlstra
      This ensures that a NOP is a NOP and not a random other instruction that
      is also a NOP. It allows simplification of dynamic code patching that
      wants to verify existing code before writing new instructions (ftrace,
      jump_label, static_call, etc..).
      
      Differentiating on NOPs is not a feature.
      
      This pessimises 32bit (DONTCARE) and 32bit on 64bit CPUs (CARELESS).
      32bit is not a performance target.
      
      Everything x86_64 since AMD K10 (2007) and Intel IvyBridge (2012) is
      fine with using NOPL (as opposed to prefix NOP). And per FEATURE_NOPL
      being required for x86_64, all x86_64 CPUs can use NOPL. So stop
      caring about NOPs, simplify things and get on with life.
      
      [ The problem seems to be that some uarchs can only decode NOPL on a
      single front-end port while others have severe decode penalties for
      excessive prefixes. All modern uarchs can handle both, except Atom,
      which has prefix penalties. ]
      
      [ Also, much doubt you can actually measure any of this on normal
      workloads. ]
      
      After this, FEATURE_NOPL is unused except for required-features for
      x86_64. FEATURE_K8 is only used for PTI.
      
       [ bp: Kernel build measurements showed ~0.3s slowdown on Sandybridge
         which is hardly a slowdown. Get rid of X86_FEATURE_K7, while at it. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> # bpf
      Acked-by: Linus Torvalds <torvalds@linuxfoundation.org>
      Link: https://lkml.kernel.org/r/20210312115749.065275711@infradead.org
  20. 10 Mar 2021, 1 commit
    • bpf, x86: Use kvmalloc_array instead kmalloc_array in bpf_jit_comp · de920fc6
      Committed by Yonghong Song
      x86 bpf_jit_comp.c used kmalloc_array to store jited addresses
      for each bpf insn. With a large bpf program, we have seen the
      following allocation failures on our production servers:
      
          page allocation failure: order:5, mode:0x40cc0(GFP_KERNEL|__GFP_COMP),
                                   nodemask=(null),cpuset=/,mems_allowed=0"
          Call Trace:
          dump_stack+0x50/0x70
          warn_alloc.cold.120+0x72/0xd2
          ? __alloc_pages_direct_compact+0x157/0x160
          __alloc_pages_slowpath+0xcdb/0xd00
          ? get_page_from_freelist+0xe44/0x1600
          ? vunmap_page_range+0x1ba/0x340
          __alloc_pages_nodemask+0x2c9/0x320
          kmalloc_order+0x18/0x80
          kmalloc_order_trace+0x1d/0xa0
          bpf_int_jit_compile+0x1e2/0x484
          ? kmalloc_order_trace+0x1d/0xa0
          bpf_prog_select_runtime+0xc3/0x150
          bpf_prog_load+0x480/0x720
          ? __mod_memcg_lruvec_state+0x21/0x100
          __do_sys_bpf+0xc31/0x2040
          ? close_pdeo+0x86/0xe0
          do_syscall_64+0x42/0x110
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
          RIP: 0033:0x7f2f300f7fa9
          Code: Bad RIP value.
      
      Dumped assembly:
      
          ffffffff810b6d70 <bpf_int_jit_compile>:
          ; {
          ffffffff810b6d70: e8 eb a5 b4 00        callq   0xffffffff81c01360 <__fentry__>
          ffffffff810b6d75: 41 57                 pushq   %r15
          ...
          ffffffff810b6f39: e9 72 fe ff ff        jmp     0xffffffff810b6db0 <bpf_int_jit_compile+0x40>
          ;       addrs = kmalloc_array(prog->len + 1, sizeof(*addrs), GFP_KERNEL);
          ffffffff810b6f3e: 8b 45 0c              movl    12(%rbp), %eax
          ;       return __kmalloc(bytes, flags);
          ffffffff810b6f41: be c0 0c 00 00        movl    $3264, %esi
          ;       addrs = kmalloc_array(prog->len + 1, sizeof(*addrs), GFP_KERNEL);
          ffffffff810b6f46: 8d 78 01              leal    1(%rax), %edi
          ;       if (unlikely(check_mul_overflow(n, size, &bytes)))
          ffffffff810b6f49: 48 c1 e7 02           shlq    $2, %rdi
          ;       return __kmalloc(bytes, flags);
          ffffffff810b6f4d: e8 8e 0c 1d 00        callq   0xffffffff81287be0 <__kmalloc>
          ;       if (!addrs) {
          ffffffff810b6f52: 48 85 c0              testq   %rax, %rax
      
      Change kmalloc_array() to kvmalloc_array() to avoid potential
      allocation error for big bpf programs.
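
      The substance of the change, as a sketch:

        /* Before: physically contiguous allocation, may fail at order-5. */
        addrs = kmalloc_array(prog->len + 1, sizeof(*addrs), GFP_KERNEL);

        /* After: falls back to vmalloc for large programs ... */
        addrs = kvmalloc_array(prog->len + 1, sizeof(*addrs), GFP_KERNEL);
        /* ... and is paired with kvfree(addrs) on the free paths. */
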
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210309015647.3657852-1-yhs@fb.com
  21. 23 Feb 2021, 1 commit
  22. 11 Feb 2021, 2 commits
  23. 04 Feb 2021, 1 commit
  24. 21 Jan 2021, 1 commit
    • bpf,x64: Pad NOPs to make images converge more easily · 93c5aecc
      Committed by Gary Lin
      The x64 bpf jit expects bpf images to converge within the given passes,
      but it could fail to do so in some corner cases. For example:
      
            l0:     ja 40
            l1:     ja 40
      
              [... repeated ja 40 ]
      
            l39:    ja 40
            l40:    ret #0
      
      This bpf program contains 40 "ja 40" instructions which are effectively
      NOPs and designed to be replaced with valid code dynamically. Ideally,
      bpf jit should optimize those "ja 40" instructions out when translating
      the bpf instructions into x64 machine code. However, do_jit() can only
      remove one "ja 40" for offset==0 on each pass, so it requires at least
      40 runs to eliminate those JMPs and exceeds the current limit of
      passes(20). In the end, the program got rejected when BPF_JIT_ALWAYS_ON
      is set even though it's legit as a classic socket filter.
      
      To make bpf images more likely to converge within 20 passes, this commit
      pads some instructions with NOPs in the last 5 passes:
      
      1. conditional jumps
        A possible size variance comes from the adoption of imm8 JMP. If the
        offset is imm8, we calculate the size difference of this BPF instruction
        between the previous and the current pass and fill the gap with NOPs.
        To avoid the recalculation of jump offset, those NOPs are inserted before
        the JMP code, so we have to subtract the 2 bytes of imm8 JMP when
        calculating the NOP number.
      
      2. BPF_JA
        There are two conditions for BPF_JA.
        a.) nop jumps
          If this instruction is not optimized out in the previous pass,
          instead of removing it, we insert the equivalent size of NOPs.
        b.) label jumps
          Similar to condition jumps, we prepend NOPs right before the JMP
          code.
      
      To make the code concise, emit_nops() is modified to use the signed len and
      return the number of inserted NOPs.
      
      For bpf-to-bpf, we always enable padding for the extra pass since there
      is only one extra run and the jump padding doesn't affect the images
      that converge without padding.
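
      The padding itself boils down to something like the following sketch
      (the names for the per-pass sizes are placeholders):

        /* If this insn shrank relative to the previous pass (e.g. an imm32
         * jump became imm8), emit NOPs in front of it so that addrs[] from
         * the previous pass stays valid.
         */
        nops = size_in_prev_pass - size_in_this_pass;
        if (padding && nops > 0)
                emit_nops(&prog, nops);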
      
      After applying this patch, the corner case was loaded with the following
      jit code:
      
          flen=45 proglen=77 pass=17 image=ffffffffc03367d4 from=jump pid=10097
          JIT code: 00000000: 0f 1f 44 00 00 55 48 89 e5 53 41 55 31 c0 45 31
          JIT code: 00000010: ed 48 89 fb eb 30 eb 2e eb 2c eb 2a eb 28 eb 26
          JIT code: 00000020: eb 24 eb 22 eb 20 eb 1e eb 1c eb 1a eb 18 eb 16
          JIT code: 00000030: eb 14 eb 12 eb 10 eb 0e eb 0c eb 0a eb 08 eb 06
          JIT code: 00000040: eb 04 eb 02 66 90 31 c0 41 5d 5b c9 c3
      
           0: 0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
           5: 55                      push   rbp
           6: 48 89 e5                mov    rbp,rsp
           9: 53                      push   rbx
           a: 41 55                   push   r13
           c: 31 c0                   xor    eax,eax
           e: 45 31 ed                xor    r13d,r13d
          11: 48 89 fb                mov    rbx,rdi
          14: eb 30                   jmp    0x46
          16: eb 2e                   jmp    0x46
              ...
          3e: eb 06                   jmp    0x46
          40: eb 04                   jmp    0x46
          42: eb 02                   jmp    0x46
          44: 66 90                   xchg   ax,ax
          46: 31 c0                   xor    eax,eax
          48: 41 5d                   pop    r13
          4a: 5b                      pop    rbx
          4b: c9                      leave
          4c: c3                      ret
      
      At the 16th pass, 15 jumps were already optimized out, and one jump was
      replaced with NOPs at 44 and the image converged at the 17th pass.
      
      v4:
        - Add the detailed comments about the possible padding bytes
      
      v3:
        - Copy the instructions of prologue separately or the size calculation
          of the first BPF instruction would include the prologue.
        - Replace WARN_ONCE() with pr_err() and EFAULT
        - Use MAX_PASSES in the for loop condition check
        - Remove the "padded" flag from x64_jit_data. For the extra pass of
          subprogs, padding is always enabled since it won't hurt the images
          that converge without padding.
      
      v2:
        - Simplify the sample code in the description and provide the jit code
        - Check the expected padding bytes with WARN_ONCE
        - Move the 'padded' flag to 'struct x64_jit_data'
      Signed-off-by: Gary Lin <glin@suse.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210119102501.511-2-glin@suse.com
  25. 15 Jan 2021, 7 commits
  26. 30 Sep 2020, 2 commits
    • bpf: x64: Do not emit sub/add 0, %rsp when !stack_depth · 4d0b8c0b
      Committed by Maciej Fijalkowski
      There is no particular reason for keeping the "sub 0, %rsp" insn within
      the BPF's x64 JIT prologue.
      
      When the tail call code was skipping the whole prologue section, these 7
      bytes that represent the rsp subtraction could not simply be discarded,
      as the jump target address would be broken. An option to address that
      would be to substitute it with a nop7.

      Right now the tail call skips only the first 11 bytes of the target
      program's prologue, and "sub X, %rsp" is the first insn that is
      processed, so if the stack depth is zero then this insn can be omitted
      without the need for the nop7 swap.
      
      Therefore, do not emit the "sub 0, %rsp" in the prologue when the program
      is not making use of the R10 register. Also, make the emission of
      "add X, %rsp" conditional in the tail call code logic and take into
      account the presence of the mentioned insn when calculating the jump
      offsets.
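
      The prologue side of this is essentially (sketch):

        /* Only emit 'sub $x, %rsp' when the program actually uses stack;
         * with stack_depth == 0 nothing is emitted, and the tail call jump
         * offsets account for the missing 7 bytes.
         */
        if (stack_depth)
                EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
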
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200929204653.4325-3-maciej.fijalkowski@intel.com
    • bpf, x64: Drop "pop %rcx" instruction on BPF JIT epilogue · d207929d
      Committed by Maciej Fijalkowski
      Back when all of the callee-saved registers were always pushed to stack
      in the x64 JIT prologue, the tail call counter was placed at the bottom
      of the BPF program's stack frame that had the following layout:
      
      +-------------+
      |  ret addr   |
      +-------------+
      |     rbp     | <- rbp
      +-------------+
      |             |
      | free space  |
      | from:       |
      | sub $x,%rsp |
      |             |
      +-------------+
      |     rbx     |
      +-------------+
      |     r13     |
      +-------------+
      |     r14     |
      +-------------+
      |     r15     |
      +-------------+
      |  tail call  | <- rsp
      |   counter   |
      +-------------+
      
      In order to restore the callee saved registers, epilogue needed to
      explicitly toss away the tail call counter via "pop %rbx" insn, so that
      %rsp would be back at the place where %r15 was stored.
      
      Currently, the tail call counter is placed on stack *before* the callee
      saved registers (brackets on rbx through r15 mean that they are now
      pushed to stack only if they are used):
      
      +-------------+
      |  ret addr   |
      +-------------+
      |     rbp     | <- rbp
      +-------------+
      |             |
      | free space  |
      | from:       |
      | sub $x,%rsp |
      |             |
      +-------------+
      |  tail call  |
      |   counter   |
      +-------------+
      (     rbx     )
      +-------------+
      (     r13     )
      +-------------+
      (     r14     )
      +-------------+
      (     r15     ) <- rsp
      +-------------+
      
      For the record, the epilogue insns consist of (assuming all of the
      callee saved registers are used by program):
      pop    %r15
      pop    %r14
      pop    %r13
      pop    %rbx
      pop    %rcx
      leaveq
      retq
      
      "pop %rbx" for getting rid of tail call counter was not an option
      anymore as it would overwrite the restored value of %rbx register, so it
      was changed to use the %rcx register.
      
      Since the epilogue can start popping the callee saved registers right
      away without any additional work, the "pop %rcx" can be dropped
      altogether, as the "leave" insn will simply move %rbp to %rsp. IOW, the
      tail call counter does not need explicit handling.

      Having in mind the explanation above and the actual reason for the
      change, let's piggyback on the "leave" insn for discarding the tail call
      counter from the stack and remove the "pop %rcx" from the epilogue.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200929204653.4325-2-maciej.fijalkowski@intel.com
  27. 18 Sep 2020, 2 commits
    • bpf, x64: rework pro/epilogue and tailcall handling in JIT · ebf7d1f5
      Committed by Maciej Fijalkowski
      This commit serves two purposes:
      1) it optimizes BPF prologue/epilogue generation
      2) it makes it possible to have tailcalls within a BPF subprogram

      Both points are related to each other, since without 1), 2) could not be
      achieved.
      
      In [1], Alexei says:
      "The prologue will look like:
      nop5
      xor eax,eax  // two new bytes if bpf_tail_call() is used in this
                   // function
      push rbp
      mov rbp, rsp
      sub rsp, rounded_stack_depth
      push rax // zero init tail_call counter
      variable number of push rbx,r13,r14,r15
      
      Then bpf_tail_call will pop variable number rbx,..
      and final 'pop rax'
      Then 'add rsp, size_of_current_stack_frame'
      jmp to next function and skip over 'nop5; xor eax,eax; push rpb; mov
      rbp, rsp'
      
      This way new function will set its own stack size and will init tail
      call
      counter with whatever value the parent had.
      
      If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
      Instead it would need to have 'nop2' in there."
      
      Implement that suggestion.
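
      A sketch of the resulting prologue emission (illustrative, with assumed
      flag names; opcodes as in the quoted layout above):

        emit_nops(&prog, X86_PATCH_SIZE);       /* nop5, skipped by bpf2bpf calls */
        if (tail_call_reachable && is_entry_prog)
                EMIT2(0x31, 0xC0);              /* xor eax, eax: zero tail call cnt */
        EMIT1(0x55);                            /* push rbp */
        EMIT3(0x48, 0x89, 0xE5);                /* mov rbp, rsp */
        if (stack_depth)
                EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
        if (tail_call_reachable)
                EMIT1(0x50);                    /* push rax: carry the counter */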
      
      Since the stack layout is changed, tail call counter handling can no
      longer rely on popping it to rbx just like it has been handled for the
      constant prologue case, with a later overwrite of rbx with the actual
      value of rbx pushed to the stack. Therefore, let's use one of the
      registers (%rcx) that is considered volatile/caller-saved and pop the
      value of the tail call counter into it in the epilogue.
      
      Drop the BUILD_BUG_ON in emit_prologue and in
      emit_bpf_tail_call_indirect where instruction layout is not constant
      anymore.
      
      Introduce a new poke target, 'tailcall_bypass', to the poke descriptor;
      it is dedicated to skipping the register pops and stack unwind that are
      generated right before the actual jump to the target program.
      For the case when the target program is not present, the BPF program
      will skip the pop instructions and the nop5 dedicated to the
      jmpq $target. An example of such a state when only R6 of the callee
      saved registers is used by the program:
      
      ffffffffc0513aa1:       e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
      ffffffffc0513aa6:       5b                      pop    %rbx
      ffffffffc0513aa7:       58                      pop    %rax
      ffffffffc0513aa8:       48 81 c4 00 00 00 00    add    $0x0,%rsp
      ffffffffc0513aaf:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      ffffffffc0513ab4:       48 89 df                mov    %rbx,%rdi
      
      When target program is inserted, the jump that was there to skip
      pops/nop5 will become the nop5, so CPU will go over pops and do the
      actual tailcall.
      
      One might ask why there simply cannot be pushes after the nop5.
      Consider the following example snippet:
      
      ffffffffc037030c:       48 89 fb                mov    %rdi,%rbx
      (...)
      ffffffffc0370332:       5b                      pop    %rbx
      ffffffffc0370333:       58                      pop    %rax
      ffffffffc0370334:       48 81 c4 00 00 00 00    add    $0x0,%rsp
      ffffffffc037033b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      ffffffffc0370340:       48 81 ec 00 00 00 00    sub    $0x0,%rsp
      ffffffffc0370347:       50                      push   %rax
      ffffffffc0370348:       53                      push   %rbx
      ffffffffc0370349:       48 89 df                mov    %rbx,%rdi
      ffffffffc037034c:       e8 f7 21 00 00          callq  0xffffffffc0372548
      
      There is a bpf2bpf call (at ffffffffc037034c) right after the tailcall
      and the jump target is not present. ctx is in the %rbx register and the
      BPF subprogram that we will call into at ffffffffc037034c relies on it,
      e.g. it will pick ctx up from there. Such a code layout is therefore
      broken, as we would overwrite the content of %rbx with the value that
      was pushed in the prologue. That is the reason for the 'bypass' approach.
      
      Special care needs to be taken during the install/update/remove of
      tailcall target. In case when target program is not present, the CPU
      must not execute the pop instructions that precede the tailcall.
      
      To address that, the following states can be defined:
      A nop, unwind, nop
      B nop, unwind, tail
      C skip, unwind, nop
      D skip, unwind, tail
      
      A is forbidden (it leads to incorrectness). The state transitions between
      tailcall install/update/remove will work as follows:
      
      First install tail call f: C->D->B(f)
       * poke the tailcall, after that get rid of the skip
      Update tail call f to f': B(f)->B(f')
       * poke the tailcall (poke->tailcall_target) and do NOT touch the
         poke->tailcall_bypass
      Remove tail call: B(f')->C(f')
       * poke->tailcall_bypass is poked back to jump, then we wait the RCU
         grace period so that other programs will finish its execution and
         after that we are safe to remove the poke->tailcall_target
      Install new tail call (f''): C(f')->D(f'')->B(f'').
       * same as first step
      
      This way CPU can never be exposed to "unwind, tail" state.
      
      Last but not least, when tailcalls get mixed with bpf2bpf calls, it
      would be possible to run into an endless loop due to clearing the
      tailcall counter if, for example, we used a subprogram-based variant of
      the tailcall3-like program from the BPF selftests, meaning the tailcall
      would be present within the BPF subprogram.
      
      This test, broken down to particular steps, would do:
      entry -> set tailcall counter to 0, bump it by 1, tailcall to func0
      func0 -> call subprog_tail
      (we are NOT skipping the first 11 bytes of prologue and this subprogram
      has a tailcall, therefore we clear the counter...)
      subprog -> do the same thing as entry
      
      and then loop forever.
      
      To address this, the idea is to go through the call chain of bpf2bpf
      progs and look for a tailcall presence throughout the whole chain. If we
      see a single tail call, then each node in this call chain needs to be
      marked as a subprog that can reach the tailcall. We later feed the JIT
      with this info and:
      - set eax to 0 only when tailcall is reachable and this is the entry prog
      - if tailcall is reachable but there's no tailcall in insns of currently
        JITed prog then push rax anyway, so that it will be possible to
        propagate further down the call chain
      - finally if tailcall is reachable, then we need to precede the 'call'
        insn with mov rax, [rbp - (stack_depth + 8)]
      
      Tail call related cases from test_verifier kselftest are also working
      fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
      work properly as well.
      
      [1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: rename poke descriptor's 'ip' member to 'tailcall_target' · cf71b174
      Committed by Maciej Fijalkowski
      Reflect the actual purpose of poke->ip and rename it to
      poke->tailcall_target so that it will not be confused with another
      poke target that will be introduced in the next commit.
      
      While at it, do the same thing with poke->ip_stable - rename it to
      poke->tailcall_target_stable.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>