1. 19 Aug 2021 (1 commit)
  2. 29 Jul 2021 (1 commit)
    • bpf: Introduce BPF nospec instruction for mitigating Spectre v4 · f5e81d11
      Committed by Daniel Borkmann
      In case of JITs, each of the JIT backends compiles the BPF nospec instruction
      /either/ to a machine instruction which emits a speculation barrier /or/ to
      /no/ machine instruction in case the underlying architecture is not affected
      by Speculative Store Bypass or has different mitigations in place already.
      
      This covers both x86 and (implicitly) arm64: In case of x86, we use 'lfence'
      instruction for mitigation. In case of arm64, we rely on the firmware mitigation
      as controlled via the ssbd kernel parameter. Whenever the mitigation is enabled,
      it works for all of the kernel code with no need to provide any additional
      instructions here (hence only comment in arm64 JIT). Other archs can follow
      as needed. The BPF nospec instruction is specifically targeting Spectre v4
      since i) we don't use a serialization barrier for the Spectre v1 case, and
      ii) mitigation instructions for v1 and v4 might be different on some archs.
      
      The BPF nospec instruction is required for a future commit, where the BPF
      verifier will annotate intermediate BPF programs with speculation barriers.
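      To make the /either-or/ choice concrete, here is a hedged sketch of what a
      JIT backend does for the new instruction. The jit_emit_nospec() helper and
      the 'affected' predicate are hypothetical stand-ins for the real per-arch
      JIT plumbing; only the three lfence bytes are taken from the x86 encoding.

        #include <stddef.h>
        #include <stdint.h>
        #include <string.h>

        /* Lower BPF nospec: emit a speculation barrier, or nothing at all. */
        static size_t jit_emit_nospec(uint8_t *image, size_t off, int affected)
        {
                static const uint8_t lfence[] = { 0x0f, 0xae, 0xe8 }; /* x86 lfence */

                /* Arch not affected by Speculative Store Bypass, or already
                 * mitigated elsewhere (e.g. arm64 firmware/ssbd): emit nothing. */
                if (!affected)
                        return off;

                memcpy(image + off, lfence, sizeof(lfence));
                return off + sizeof(lfence);
        }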
      Co-developed-by: Piotr Krysiuk <piotras@gmail.com>
      Co-developed-by: Benedict Schlueter <benedict.schlueter@rub.de>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
      Signed-off-by: Benedict Schlueter <benedict.schlueter@rub.de>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
  3. 25 Jun 2021 (1 commit)
    • xdp: Add proper __rcu annotations to redirect map entries · 782347b6
      Committed by Toke Høiland-Jørgensen
      XDP_REDIRECT works by a three-step process: the bpf_redirect() and
      bpf_redirect_map() helpers will lookup the target of the redirect and store
      it (along with some other metadata) in a per-CPU struct bpf_redirect_info.
      Next, when the program returns the XDP_REDIRECT return code, the driver
      will call xdp_do_redirect() which will use the information thus stored to
      actually enqueue the frame into a bulk queue structure (that differs
      slightly by map type, but shares the same principle). Finally, before
      exiting its NAPI poll loop, the driver will call xdp_do_flush(), which will
      flush all the different bulk queues, thus completing the redirect.
      
      Pointers to the map entries will be kept around for this whole sequence of
      steps, protected by RCU. However, there is no top-level rcu_read_lock() in
      the core code; instead drivers add their own rcu_read_lock() around the XDP
      portions of the code, but somewhat inconsistently as Martin discovered[0].
      However, things still work because everything happens inside a single NAPI
      poll sequence, which means it's between a pair of calls to
      local_bh_disable()/local_bh_enable(). So Paul suggested[1] that we could
      document this intention by using rcu_dereference_check() with
      rcu_read_lock_bh_held() as a second parameter, thus allowing sparse and
      lockdep to verify that everything is done correctly.
      
      This patch does just that: we add an __rcu annotation to the map entry
      pointers and remove the various comments explaining the NAPI poll assurance
      strewn through devmap.c in favour of a longer explanation in filter.c. The
      goal is to have one coherent documentation of the entire flow, and rely on
      the RCU annotations as a "standard" way of communicating the flow in the
      map code (which can additionally be understood by sparse and lockdep).
      
      The RCU annotation replacements are fairly straightforward: READ_ONCE()
      becomes rcu_dereference_check(), WRITE_ONCE() becomes rcu_assign_pointer(),
      and xchg() and cmpxchg() get wrapped in the proper constructs to cast the
      pointer back and forth between the __rcu and __kernel address spaces (for
      the benefit of sparse). The one complication is
      that xskmap has a few constructions where double-pointers are passed back
      and forth; these simply all gain __rcu annotations, and only the final
      reference/dereference to the inner-most pointer gets changed.
      
      With this, everything can be run through sparse without eliciting
      complaints, and lockdep can verify correctness even without the use of
      rcu_read_lock() in the drivers. Subsequent patches will clean these up from
      the drivers.
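      Schematically, the substitution in the map code looks like this (a sketch
      in devmap style, not the literal diff):

        /* Before: a plain load, relying on the implicit NAPI guarantee. */
        obj = READ_ONCE(dtab->netdev_map[key]);

        /* After: the same load, now checkable by sparse and lockdep. The
         * second argument documents that running under local_bh_disable()
         * (i.e. inside the NAPI poll loop) is an acceptable substitute
         * for rcu_read_lock(). */
        obj = rcu_dereference_check(dtab->netdev_map[key],
                                    rcu_read_lock_bh_held());

        /* Stores gain matching publication semantics. */
        rcu_assign_pointer(dtab->netdev_map[key], dev);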
      
      [0] https://lore.kernel.org/bpf/20210415173551.7ma4slcbqeyiba2r@kafai-mbp.dhcp.thefacebook.com/
      [1] https://lore.kernel.org/bpf/20210419165837.GA975577@paulmck-ThinkPad-P17-Gen-1/
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210624160609.292325-6-toke@redhat.com
  4. 16 Jun 2021 (1 commit)
  5. 26 May 2021 (1 commit)
    • xdp: Extend xdp_redirect_map with broadcast support · e624d4ed
      Committed by Hangbin Liu
      This patch adds two flags BPF_F_BROADCAST and BPF_F_EXCLUDE_INGRESS to
      extend xdp_redirect_map for broadcast support.
      
      With BPF_F_BROADCAST the packet will be broadcast to all the interfaces
      in the map.  With BPF_F_EXCLUDE_INGRESS the ingress interface will be
      excluded when broadcasting.
      
      When getting the devices in the dev hash map via dev_map_hash_get_next_key(),
      there is a possibility that we fall back to the first key when a device
      has been removed.  This would duplicate packets on some interfaces, so
      walk all the buckets instead to avoid this issue.  For the dev array map,
      we also walk the whole map to find valid interfaces.
      
      Function bpf_clear_redirect_map() was removed in
      commit ee75aef2 ("bpf, xdp: Restructure redirect actions").
      Add it back as we need to use ri->map again.
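      For reference, a minimal XDP program exercising the new flags could look
      like this (a sketch in the spirit of samples/bpf/xdp_redirect_map_multi;
      the map name and size are illustrative):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        struct {
                __uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
                __uint(key_size, sizeof(int));
                __uint(value_size, sizeof(int));
                __uint(max_entries, 4);
        } forward_map SEC(".maps");

        SEC("xdp")
        int xdp_broadcast(struct xdp_md *ctx)
        {
                /* The key is ignored in broadcast mode: every interface in
                 * the map except the ingress one gets a clone of the packet. */
                return bpf_redirect_map(&forward_map, 0,
                                        BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS);
        }

        char _license[] SEC("license") = "GPL";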
      
      With test topology:
        +-------------------+             +-------------------+
        | Host A (i40e 10G) |  ---------- | eno1(i40e 10G)    |
        +-------------------+             |                   |
                                          |   Host B          |
        +-------------------+             |                   |
        | Host C (i40e 10G) |  ---------- | eno2(i40e 10G)    |
        +-------------------+             |                   |
                                          |          +------+ |
                                          | veth0 -- | Peer | |
                                          | veth1 -- |      | |
                                          | veth2 -- |  NS  | |
                                          |          +------+ |
                                          +-------------------+
      
      On Host A:
       # pktgen/pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -s 64
      
      On Host B (Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 128G Memory):
      Use xdp_redirect_map and xdp_redirect_map_multi in samples/bpf for testing.
      All the veth peers in the NS have an XDP_DROP program loaded. The
      forward_map max_entries in xdp_redirect_map_multi is modified to 4.
      
      Testing the performance impact on the regular xdp_redirect path with and
      without patch (to check impact of additional check for broadcast mode):
      
      Version          | Test                                | Generic | Native
      5.12 rc4         | redirect_map        i40e->i40e      |    2.0M |  9.7M
      5.12 rc4         | redirect_map        i40e->veth      |    1.7M | 11.8M
      5.12 rc4 + patch | redirect_map        i40e->i40e      |    2.0M |  9.6M
      5.12 rc4 + patch | redirect_map        i40e->veth      |    1.7M | 11.7M
      
      Testing the performance when cloning packets with the redirect_map_multi
      test, using a redirect map size of 4, filled with 1-3 devices:
      
      5.12 rc4 + patch | redirect_map multi  i40e->veth (x1) |    1.7M | 11.4M
      5.12 rc4 + patch | redirect_map multi  i40e->veth (x2) |    1.1M |  4.3M
      5.12 rc4 + patch | redirect_map multi  i40e->veth (x3) |    0.8M |  2.6M
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Link: https://lore.kernel.org/bpf/20210519090747.1655268-3-liuhangbin@gmail.com
  6. 31 Mar 2021 (1 commit)
  7. 27 Mar 2021 (2 commits)
    • bpf: Support bpf program calling kernel function · e6ac2450
      Committed by Martin KaFai Lau
      This patch adds support to the BPF verifier to allow a bpf program
      to call kernel functions directly.
      
      The use case included in this set is to allow bpf-tcp-cc to directly
      call some tcp-cc helper functions (e.g. "tcp_cong_avoid_ai()").  Those
      functions have already been used by some kernel tcp-cc implementations.
      
      This set will also allow the bpf-tcp-cc program to directly call the
      kernel tcp-cc implementation.  For example, a bpf_dctcp may only want to
      implement its own dctcp_cwnd_event() and reuse the other dctcp_*() functions
      directly from the kernel's tcp_dctcp.c instead of reimplementing (or
      copy-and-pasting) them.
      
      The tcp-cc kernel functions mentioned above will be whitelisted
      for the struct_ops bpf-tcp-cc programs to use in a later patch.
      The whitelisted functions are not bound to a fixed ABI contract.
      Those functions have already been used by the existing kernel tcp-cc.
      If any of them changes, both in-tree and out-of-tree kernel tcp-cc
      implementations have to be changed.  The same goes for the struct_ops
      bpf-tcp-cc programs, which have to be adjusted accordingly.
      
      This patch is to make the required changes in the bpf verifier.
      
      The first change is in btf.c: it adds a case to "btf_check_func_arg_match()".
      When the passed-in "btf->kernel_btf == true", it means matching the
      verifier regs' states against a kernel function.  This will handle the
      PTR_TO_BTF_ID reg.  It also maps PTR_TO_SOCK_COMMON, PTR_TO_SOCKET,
      and PTR_TO_TCP_SOCK to their kernel btf_ids.
      
      In the later libbpf patch, the insn calling a kernel function will
      look like:
      
      insn->code == (BPF_JMP | BPF_CALL)
      insn->src_reg == BPF_PSEUDO_KFUNC_CALL /* <- new in this patch */
      insn->imm == func_btf_id /* btf_id of the running kernel */
      
      [ For the future calling function-in-kernel-module support, an array
        of module btf_fds can be passed at the load time and insn->off
        can be used to index into this array. ]
      
      At the early stage of verifier, the verifier will collect all kernel
      function calls into "struct bpf_kfunc_desc".  Those
      descriptors are stored in "prog->aux->kfunc_tab" and will
      be available to the JIT.  Since this "add" operation is similar
      to the current "add_subprog()" and looking for the same insn->code,
      they are done together in the new "add_subprog_and_kfunc()".
      
      In the "do_check()" stage, the new "check_kfunc_call()" is added
      to verify the kernel function call instruction:
      1. Ensure the kernel function can be used by a particular BPF_PROG_TYPE.
         A new bpf_verifier_ops "check_kfunc_call" is added to do that.
         The bpf-tcp-cc struct_ops program will implement this function in
         a later patch.
      2. Call "btf_check_kfunc_args_match()" to ensure the regs can be
         used as the args of a kernel function.
      3. Mark the regs' type, subreg_def, and zext_dst.
      
      At the later do_misc_fixups() stage, the new fixup_kfunc_call()
      will replace the insn->imm with the function address (relative
      to __bpf_call_base).  If needed, the jit can find the btf_func_model
      by calling the new bpf_jit_find_kfunc_model(prog, insn).
      With the imm set to the function address, "bpftool prog dump xlated"
      will be able to display the kernel function calls the same way as
      it displays other bpf helper calls.
      
      A gpl_compatible program is required to call kernel functions.
      
      This feature currently requires a JIT.
      
      The verifier selftests are adjusted because of the changes in
      the verbose log in add_subprog_and_kfunc().
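      For illustration, a struct_ops bpf-tcp-cc program calling a whitelisted
      kernel function would look roughly like this once the libbpf pieces land
      (a sketch modeled on the selftests later in this set; the prog name is
      made up):

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        /* Resolved against the running kernel's BTF at load time. */
        extern void tcp_cong_avoid_ai(struct tcp_sock *tp, __u32 w,
                                      __u32 acked) __ksym;

        SEC("struct_ops/my_cong_avoid")
        void BPF_PROG(my_cong_avoid, struct sock *sk, __u32 ack, __u32 acked)
        {
                struct tcp_sock *tp = (struct tcp_sock *)sk;

                /* Reuse the kernel's additive-increase helper instead of
                 * reimplementing it in BPF. */
                tcp_cong_avoid_ai(tp, tp->snd_cwnd, acked);
        }

        char _license[] SEC("license") = "GPL"; /* gpl_compatible required */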
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210325015142.1544736-1-kafai@fb.com
    • bpf: Simplify freeing logic in linfo and jited_linfo · e16301fb
      Committed by Martin KaFai Lau
      This patch simplifies the linfo freeing logic by combining
      "bpf_prog_free_jited_linfo()" and "bpf_prog_free_unused_jited_linfo()"
      into the new "bpf_prog_jit_attempt_done()".
      It is a prep work for the kernel function call support.  In a later
      patch, freeing the kernel function call descriptors will also
      be done in the "bpf_prog_jit_attempt_done()".
      
      "bpf_prog_free_linfo()" is removed since it is only called by
      "__bpf_prog_put_noref()".  The kvfree() are directly called
      instead.
      
      It also takes this chance to s/kcalloc/kvcalloc/ for the jited_linfo
      allocation.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210325015130.1544323-1-kafai@fb.com
  8. 10 Mar 2021 (2 commits)
  9. 11 Feb 2021 (3 commits)
  10. 21 Jan 2021 (1 commit)
    • bpf: Try to avoid kzalloc in cgroup/{s,g}etsockopt · 20f2505f
      Committed by Stanislav Fomichev
      When we attach a bpf program to cgroup/getsockopt, every getsockopt()
      syscall starts incurring kzalloc/kfree cost.
      
      Let's add a small buffer on the stack and use it for small (majority of)
      {s,g}etsockopt values. The buffer is small enough to fit into
      a cache line and covers the majority of simple options (most
      of them are 4-byte ints).
      
      It seems natural to do the same for setsockopt, but it's a bit more
      involved when the BPF program modifies the data (where we have to
      kmalloc). The assumption is that for the majority of setsockopt
      calls (which either set pure BPF options or apply policy) this
      will bring some benefit as well.
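      The core of the optimization is roughly the following allocation path
      (a sketch; the buffer size and helper shape approximate the actual patch):

        #define BPF_SOCKOPT_KERN_BUF_SIZE 32    /* fits in a cache line */

        struct bpf_sockopt_buf {
                u8 data[BPF_SOCKOPT_KERN_BUF_SIZE];
        };

        static void *sockopt_alloc_buf(int max_optlen, struct bpf_sockopt_buf *buf)
        {
                /* Common case: a small option value served from the stack. */
                if (max_optlen <= sizeof(buf->data))
                        return buf->data;

                /* Rare case: a large option value falls back to the heap. */
                return kzalloc(max_optlen, GFP_USER);
        }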
      
      Without this patch (we remove about 1% __kmalloc):
           3.38%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
                  |
                   --3.30%--__cgroup_bpf_run_filter_getsockopt
                             |
                              --0.81%--__kmalloc
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20210115163501.805133-3-sdf@google.com
  11. 15 Jan 2021 (4 commits)
  12. 13 Jan 2021 (1 commit)
  13. 20 Nov 2020 (1 commit)
    • crypto: sha - split sha.h into sha1.h and sha2.h · a24d22b2
      Committed by Eric Biggers
      Currently <crypto/sha.h> contains declarations for both SHA-1 and SHA-2,
      and <crypto/sha3.h> contains declarations for SHA-3.
      
      This organization is inconsistent, but more importantly SHA-1 is no
      longer considered to be cryptographically secure.  So to the extent
      possible, SHA-1 shouldn't be grouped together with any of the other SHA
      versions, and usage of it should be phased out.
      
      Therefore, split <crypto/sha.h> into two headers <crypto/sha1.h> and
      <crypto/sha2.h>, and make everyone explicitly specify whether they want
      the declarations for SHA-1, SHA-2, or both.
      
      This avoids making the SHA-1 declarations visible to files that don't
      want anything to do with SHA-1.  It also prepares for potentially moving
      sha1.h into a new insecure/ or dangerous/ directory.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Acked-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
  14. 27 Oct 2020 (1 commit)
  15. 22 Oct 2020 (1 commit)
  16. 11 Sep 2020 (1 commit)
    • bpf: Plug hole in struct bpf_sk_lookup_kern · d66423fb
      Committed by Lorenz Bauer
      As Alexei points out, struct bpf_sk_lookup_kern has two 4-byte holes.
      This leads to suboptimal instructions being generated (IPv4, x86):
      
          1372                    struct bpf_sk_lookup_kern ctx = {
             0xffffffff81b87f30 <+624>:   xor    %eax,%eax
             0xffffffff81b87f32 <+626>:   mov    $0x6,%ecx
             0xffffffff81b87f37 <+631>:   lea    0x90(%rsp),%rdi
             0xffffffff81b87f3f <+639>:   movl   $0x110002,0x88(%rsp)
             0xffffffff81b87f4a <+650>:   rep stos %rax,%es:(%rdi)
             0xffffffff81b87f4d <+653>:   mov    0x8(%rsp),%eax
             0xffffffff81b87f51 <+657>:   mov    %r13d,0x90(%rsp)
             0xffffffff81b87f59 <+665>:   incl   %gs:0x7e4970a0(%rip)
             0xffffffff81b87f60 <+672>:   mov    %eax,0x8c(%rsp)
             0xffffffff81b87f67 <+679>:   movzwl 0x10(%rsp),%eax
             0xffffffff81b87f6c <+684>:   mov    %ax,0xa8(%rsp)
             0xffffffff81b87f74 <+692>:   movzwl 0x38(%rsp),%eax
             0xffffffff81b87f79 <+697>:   mov    %ax,0xaa(%rsp)
      
      Fix this by moving around sport and dport. pahole confirms there
      are no more holes:
      
          struct bpf_sk_lookup_kern {
              u16                        family;       /*     0     2 */
              u16                        protocol;     /*     2     2 */
              __be16                     sport;        /*     4     2 */
              u16                        dport;        /*     6     2 */
              struct {
                      __be32             saddr;        /*     8     4 */
                      __be32             daddr;        /*    12     4 */
              } v4;                                    /*     8     8 */
              struct {
                      const struct in6_addr  * saddr;  /*    16     8 */
                      const struct in6_addr  * daddr;  /*    24     8 */
              } v6;                                    /*    16    16 */
              struct sock *              selected_sk;  /*    32     8 */
              bool                       no_reuseport; /*    40     1 */
      
              /* size: 48, cachelines: 1, members: 8 */
              /* padding: 7 */
              /* last cacheline: 48 bytes */
          };
      
      The assembly also doesn't contain the pesky rep stos anymore:
      
          1372                    struct bpf_sk_lookup_kern ctx = {
             0xffffffff81b87f60 <+624>:   movzwl 0x10(%rsp),%eax
             0xffffffff81b87f65 <+629>:   movq   $0x0,0xa8(%rsp)
             0xffffffff81b87f71 <+641>:   movq   $0x0,0xb0(%rsp)
             0xffffffff81b87f7d <+653>:   mov    %ax,0x9c(%rsp)
             0xffffffff81b87f85 <+661>:   movzwl 0x38(%rsp),%eax
             0xffffffff81b87f8a <+666>:   movq   $0x0,0xb8(%rsp)
             0xffffffff81b87f96 <+678>:   mov    %ax,0x9e(%rsp)
             0xffffffff81b87f9e <+686>:   mov    0x8(%rsp),%eax
             0xffffffff81b87fa2 <+690>:   movq   $0x0,0xc0(%rsp)
             0xffffffff81b87fae <+702>:   movl   $0x110002,0x98(%rsp)
             0xffffffff81b87fb9 <+713>:   mov    %eax,0xa0(%rsp)
             0xffffffff81b87fc0 <+720>:   mov    %r13d,0xa4(%rsp)
      
      1: https://lore.kernel.org/bpf/CAADnVQKE6y9h2fwX6OS837v-Uf+aBXnT_JXiN_bbo2gitZQ3tA@mail.gmail.com/
      
      Fixes: e9ddbb77 ("bpf: Introduce SK_LOOKUP program type with a dedicated attach point")
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Link: https://lore.kernel.org/bpf/20200910110248.198326-1-lmb@cloudflare.com
  17. 25 Aug 2020 (2 commits)
    • bpf: tcp: Allow bpf prog to write and parse TCP header option · 0813a841
      Committed by Martin KaFai Lau
      [ Note: The TCP changes here are mainly to implement the bpf
        pieces into the bpf_skops_*() functions introduced
        in the earlier patches. ]
      
      The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
      algorithm to be written in BPF.  It opens up opportunities to allow
      a faster turnaround time in testing/releasing new congestion control
      ideas to production environment.
      
      The same flexibility can be extended to writing TCP header options.
      It is not uncommon that people want to test a new TCP header option
      to improve TCP performance.  Another use case is for data centers
      that have a more controlled environment and more freedom to deploy
      header options for internal-only use.
      
      For example, we want to test the idea in putting maximum delay
      ACK in TCP header option which is similar to a draft RFC proposal [1].
      
      This patch introduces the necessary BPF API and uses it in the
      TCP stack to allow a BPF_PROG_TYPE_SOCK_OPS program to parse
      and write TCP header options.  It currently supports most
      TCP packets except RST.
      
      Supported TCP header option:
      ───────────────────────────
      This patch allows the bpf-prog to write any option kind.
      Different bpf-progs can write their own options by calling the new helper
      bpf_store_hdr_opt().  The helper will ensure there is no duplicated
      option in the header.
      
      By allowing the bpf-prog to write any option kind, this gives a lot of
      flexibility to the bpf-prog.  Different bpf-progs can write their
      own option kinds.  It could also allow a bpf-prog to support a
      recently standardized option on an older kernel.
      
      Sockops Callback Flags:
      ──────────────────────
      The bpf program will only be called to parse/write tcp header option
      if the following newly added callback flags are enabled
      in tp->bpf_sock_ops_cb_flags:
      BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG
      BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG
      BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
      
      A few words on the PARSE CB flags.  When the above PARSE CB flags are
      turned on, the bpf-prog will be called on packets received
      at a sk that has at least reached the ESTABLISHED state.
      The parsing of the SYN-SYNACK-ACK will be discussed in the
      "3 Way HandShake" section.
      
      The default is off for all of the above new CB flags, i.e. the bpf prog
      will not be called to parse or write bpf hdr options.  There are
      detailed comments on these new cb flags in the UAPI bpf.h.
      
      sock_ops->skb_data and bpf_load_hdr_opt()
      ─────────────────────────────────────────
      sock_ops->skb_data and sock_ops->skb_data_end covers the whole
      TCP header and its options.  They are read only.
      
      The new bpf_load_hdr_opt() helps to read a particular option "kind"
      from the skb_data.
      
      Please refer to the comment in UAPI bpf.h.  It has details
      on what skb_data contains under different sock_ops->op.
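      Putting the flags and helpers together, a sockops program might look like
      this (a sketch distilled from the selftests in this series; option kind
      0xfd is from the experimental range, bpf_reserve_hdr_opt() comes from the
      same series, and error handling is trimmed):

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>

        SEC("sockops")
        int tcp_hdr_opt(struct bpf_sock_ops *skops)
        {
                __u8 opt[4] = { 0xfd, 4, 0xea, 0x00 }; /* kind, len, data */

                switch (skops->op) {
                case BPF_SOCK_OPS_TCP_CONNECT_CB:
                        /* Opt in to the new write/parse callbacks. */
                        bpf_sock_ops_cb_flags_set(skops,
                                skops->bpf_sock_ops_cb_flags |
                                BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG |
                                BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG);
                        break;
                case BPF_SOCK_OPS_HDR_OPT_LEN_CB:
                        /* Reserve room for our option in the outgoing header. */
                        bpf_reserve_hdr_opt(skops, sizeof(opt), 0);
                        break;
                case BPF_SOCK_OPS_WRITE_HDR_OPT_CB:
                        /* Write it; the helper rejects duplicated kinds. */
                        bpf_store_hdr_opt(skops, opt, sizeof(opt), 0);
                        break;
                case BPF_SOCK_OPS_PARSE_HDR_OPT_CB:
                        /* Search the received header for our kind (opt[0]). */
                        bpf_load_hdr_opt(skops, opt, sizeof(opt), 0);
                        break;
                }
                return 1;
        }

        char _license[] SEC("license") = "GPL";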
      
      3 Way HandShake
      ───────────────
      The bpf-prog can learn if it is sending SYN or SYNACK by reading the
      sock_ops->skb_tcp_flags.
      
      * Passive side
      
      When writing SYNACK (i.e. sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB),
      the received SYN skb will be available to the bpf prog.  The bpf prog can
      use the SYN skb (which may carry the header option sent from the remote bpf
      prog) to decide what bpf header option should be written to the outgoing
      SYNACK skb.  The SYN packet can be obtained by getsockopt(TCP_BPF_SYN*).
      More on this later.  Also, the bpf prog can learn if it is in syncookie
      mode (by checking sock_ops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE).
      
      The bpf prog can store the received SYN pkt by using the existing
      bpf_setsockopt(TCP_SAVE_SYN).  The example in a later patch does it.
      [ Note that the fullsock here is a listen sk, bpf_sk_storage
        is not very useful here since the listen sk will be shared
        by many concurrent connection requests.
      
        Extending bpf_sk_storage support to request_sock would add weight
        to the minisock and is not necessarily better than storing the
        whole ~100-byte SYN pkt. ]
      
      When the connection is established, the bpf prog will be called
      in the existing PASSIVE_ESTABLISHED_CB callback.  At that time,
      the bpf prog can get the header option from the saved syn and
      then apply the needed operation to the newly established socket.
      The later patch will use the max delay ack specified in the SYN
      header and set the RTO of this newly established connection
      as an example.
      
      The received ACK (that concludes the 3WHS) will also be available to
      the bpf prog during PASSIVE_ESTABLISHED_CB through the sock_ops->skb_data.
      It could be useful in syncookie scenario.  More on this later.
      
      There is an existing getsockopt "TCP_SAVED_SYN" to return the whole
      saved syn pkt, which includes the IP[46] header and the TCP header.
      A few "TCP_BPF_SYN*" getsockopts have been added to allow specifying where
      to start reading from, e.g. starting from the TCP header, or from the IP[46] header.
      
      The new getsockopt(TCP_BPF_SYN*) will also know where it can get
      the SYN's packet from:
        - (a) the just received syn (available when the bpf prog is writing SYNACK)
              and it is the only way to get SYN during syncookie mode.
        or
        - (b) the saved syn (available in PASSIVE_ESTABLISHED_CB and also other
              existing CB).
      
      The bpf prog does not need to know where the SYN pkt is coming from.
      The getsockopt(TCP_BPF_SYN*) will hide these details.
      
      Similarly, a flag "BPF_LOAD_HDR_OPT_TCP_SYN" is also added to
      bpf_load_hdr_opt() to read a particular header option from the SYN packet.
      
      * Fastopen
      
      Fastopen should work the same as the regular non-fastopen case.
      This is tested in a later patch.
      
      * Syncookie
      
      For syncookie, the later example patch asks the active
      side's bpf prog to resend the header options in ACK.  The server
      can use bpf_load_hdr_opt() to look at the options in this
      received ACK during PASSIVE_ESTABLISHED_CB.
      
      * Active side
      
      The bpf prog will get a chance to write the bpf header option
      in the SYN packet during WRITE_HDR_OPT_CB.  The received SYNACK
      pkt will also be available to the bpf prog during the existing
      ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data
      and bpf_load_hdr_opt().
      
      * Turn off header CB flags after 3WHS
      
      If the bpf prog does not need to write/parse header options
      beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags
      to avoid being called for header options.
      Or the bpf-prog can select to leave the UNKNOWN_HDR_OPT_CB_FLAG on
      so that the kernel will only call it when there is option that
      the kernel cannot handle.
      
      [1]: draft-wang-tcpm-low-latency-opt-00
           https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200820190104.2885895-1-kafai@fb.com
    • bpf: sock_ops: Change some members of sock_ops_kern from u32 to u8 · c9985d09
      Committed by Martin KaFai Lau
      A later patch needs to add a few pointers and a few u8 members to
      sock_ops_kern.  Hence, this patch saves some space by moving
      some of the existing members from u32 to u8 so that the later
      patch can still fit everything in a cacheline.
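      Schematically (field list abridged; exact layout per the actual patch):

        struct bpf_sock_ops_kern {
                struct sock *sk;
                u32 args[4];     /* stays u32: exposed via the uapi context */
                u8  op;          /* was u32: BPF_SOCK_OPS_* values fit in a byte */
                u8  is_fullsock; /* was u32: a boolean flag */
                /* ... room left for the pointers the later patch adds ... */
        };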
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200820190058.2885640-1-kafai@fb.com
  18. 24 Aug 2020 (1 commit)
  19. 26 Jul 2020 (1 commit)
  20. 25 Jul 2020 (1 commit)
  21. 20 Jul 2020 (1 commit)
  22. 18 Jul 2020 (3 commits)
    • inet6: Run SK_LOOKUP BPF program on socket lookup · 1122702f
      Committed by Jakub Sitnicki
      Following the ipv4 stack changes, run a BPF program attached to the netns
      before looking up a listening socket. The program can return a listening
      socket to use as the result of the socket lookup, fail the lookup, or take
      no action.
      Suggested-by: Marek Majkowski <marek@cloudflare.com>
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200717103536.397595-7-jakub@cloudflare.com
    • inet: Run SK_LOOKUP BPF program on socket lookup · 1559b4aa
      Committed by Jakub Sitnicki
      Run a BPF program before looking up a listening socket on the receive path.
      Program selects a listening socket to yield as result of socket lookup by
      calling bpf_sk_assign() helper and returning SK_PASS code. Program can
      revert its decision by assigning a NULL socket with bpf_sk_assign().
      
      Alternatively, BPF program can also fail the lookup by returning with
      SK_DROP, or let the lookup continue as usual with SK_PASS on return, when
      no socket has been selected with bpf_sk_assign().
      
      This lets the user match packets with listening sockets freely at the last
      possible point on the receive path, where we know that packets are destined
      for local delivery after undergoing policing, filtering, and routing.
      
      With BPF code selecting the socket, directing packets destined to an IP
      range or to a port range to a single socket becomes possible.
      
      In case multiple programs are attached, they are run in series in the order
      in which they were attached. The end result is determined from the return
      codes of all the programs according to the following rules (a minimal
      example program follows the list):
      
       1. If any program returned SK_PASS and selected a valid socket, the socket
          is used as result of socket lookup.
       2. If more than one program returned SK_PASS and selected a socket,
          last selection takes effect.
       3. If any program returned SK_DROP, and no program returned SK_PASS and
          selected a socket, socket lookup fails with -ECONNREFUSED.
       4. If all programs returned SK_PASS and none of them selected a socket,
          socket lookup continues to htable-based lookup.
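      A minimal program of this kind, with an illustrative map and port range
      (hedged sketch; see the selftests in this series for complete examples):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        struct {
                __uint(type, BPF_MAP_TYPE_SOCKMAP);
                __uint(max_entries, 1);
                __type(key, __u32);
                __type(value, __u64);
        } redir_map SEC(".maps");

        SEC("sk_lookup")
        int select_sock(struct bpf_sk_lookup *ctx)
        {
                const __u32 zero = 0;
                struct bpf_sock *sk;
                long err;

                if (ctx->local_port < 80 || ctx->local_port > 90)
                        return SK_PASS;         /* not ours: lookup continues */

                sk = bpf_map_lookup_elem(&redir_map, &zero);
                if (!sk)
                        return SK_DROP;         /* fail the lookup */

                err = bpf_sk_assign(ctx, sk, 0);
                bpf_sk_release(sk);
                return err ? SK_DROP : SK_PASS; /* SK_PASS + selection wins */
        }

        char _license[] SEC("license") = "GPL";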
      Suggested-by: Marek Majkowski <marek@cloudflare.com>
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200717103536.397595-5-jakub@cloudflare.com
    • bpf: Introduce SK_LOOKUP program type with a dedicated attach point · e9ddbb77
      Committed by Jakub Sitnicki
      Add a new program type BPF_PROG_TYPE_SK_LOOKUP with a dedicated attach type
      BPF_SK_LOOKUP. The new program kind is to be invoked by the transport layer
      when looking up a listening socket for a new connection request for
      connection oriented protocols, or when looking up an unconnected socket for
      a packet for connection-less protocols.
      
      When called, the SK_LOOKUP BPF program can select a socket that will receive
      the packet. This serves as a mechanism to overcome the limits of what the
      bind() API allows to express. Two use-cases driving this work are:
      
       (1) steer packets destined to an IP range, on fixed port to a socket
      
           192.0.2.0/24, port 80 -> NGINX socket
      
       (2) steer packets destined to an IP address, on any port to a socket
      
           198.51.100.1, any port -> L7 proxy socket
      
      In its run-time context, the program receives information about the packet
      that triggered the socket lookup: namely, the IP version, the L4 protocol
      identifier, and the address 4-tuple. The context can be further extended to
      include the ingress interface identifier.
      
      To select a socket, the BPF program fetches it from a map holding socket
      references, like SOCKMAP or SOCKHASH, and calls the bpf_sk_assign(ctx, sk, ...)
      helper to record the selection. The transport layer then uses the selected
      socket as the result of the socket lookup.
      
      In its basic form, SK_LOOKUP acts as a filter and hence must return either
      SK_PASS or SK_DROP. If the program returns with SK_PASS, transport should
      look for a socket to receive the packet, or use the one selected by the
      program if available, while SK_DROP informs the transport layer that the
      lookup should fail.
      
      This patch only enables the user to attach an SK_LOOKUP program to a
      network namespace. Subsequent patches hook it up to run on local delivery
      path in ipv4 and ipv6 stacks.
      Suggested-by: Marek Majkowski <marek@cloudflare.com>
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200717103536.397595-3-jakub@cloudflare.com
  23. 09 Jul 2020 (2 commits)
    • bpf: Check correct cred for CAP_SYSLOG in bpf_dump_raw_ok() · 63960260
      Committed by Kees Cook
      When evaluating access control over kallsyms visibility, credentials at
      open() time need to be used, not the "current" creds (though in BPF's
      case, this has likely always been the same). Plumb access to the associated
      file->f_cred down through bpf_dump_raw_ok() and its callers now that
      kallsyms_show_value() has been refactored to take struct cred.
      
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: bpf@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: 7105e828 ("bpf: allow for correlation of maps and helpers in dump")
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • kallsyms: Refactor kallsyms_show_value() to take cred · 16025184
      Committed by Kees Cook
      In order to perform future tests against the cred saved during open(),
      switch kallsyms_show_value() to operate on a cred, and have all current
      callers pass current_cred(). This makes it very obvious where callers
      are checking the wrong credential in their "read" contexts. These will
      be fixed in the coming patches.
      
      Additionally switch return value to bool, since it is always used as a
      direct permission check, not a 0-on-success, negative-on-error style
      function return.
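      The interface change, approximately (a sketch; surrounding kernel context
      omitted):

        /* Before: implicit check against the current task's creds. */
        int kallsyms_show_value(void);

        /* After: explicit credential, boolean result. */
        bool kallsyms_show_value(const struct cred *cred);

        /* Existing callers keep today's behaviour by passing current_cred();
         * "read" contexts will switch to file->f_cred in later patches. */
        if (!kallsyms_show_value(current_cred()))
                value = 0;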
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
  24. 08 May 2020 (2 commits)
    • crypto: lib/sha1 - fold linux/cryptohash.h into crypto/sha.h · 228c4f26
      Committed by Eric Biggers
      <linux/cryptohash.h> sounds very generic and important, like it's the
      header to include if you're doing cryptographic hashing in the kernel.
      But actually it only includes the library implementation of the SHA-1
      compression function (not even the full SHA-1).  This should basically
      never be used anymore; SHA-1 is no longer considered secure, and there
      are much better ways to do cryptographic hashing in the kernel.
      
      Remove this header and fold it into <crypto/sha.h> which already
      contains constants and functions for SHA-1 (along with SHA-2).
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • crypto: lib/sha1 - rename "sha" to "sha1" · 6b0b0fa2
      Committed by Eric Biggers
      The library implementation of the SHA-1 compression function is
      confusingly called just "sha_transform()".  Alongside it are some "SHA_"
      constants and "sha_init()".  Presumably these are left over from a time
      when SHA just meant SHA-1.  But now there are also SHA-2 and SHA-3, and
      moreover SHA-1 is now considered insecure and thus shouldn't be used.
      
      Therefore, rename these functions and constants to make it very clear
      that they are for SHA-1.  Also add a comment to make it clear that these
      shouldn't be used.
      
      For the extra-misleadingly named "SHA_MESSAGE_BYTES", rename it to
      SHA1_BLOCK_SIZE and define it to just '64' rather than '(512/8)' so that
      it matches the same definition in <crypto/sha.h>.  This prepares for
      merging <linux/cryptohash.h> into <crypto/sha.h>.
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
  25. 05 May 2020 (1 commit)
    • bpf: Avoid gcc-10 stringop-overflow warning in struct bpf_prog · d26c0cc5
      Committed by Arnd Bergmann
      gcc-10 warns about accesses to zero-length arrays:
      
      kernel/bpf/core.c: In function 'bpf_patch_insn_single':
      cc1: warning: writing 8 bytes into a region of size 0 [-Wstringop-overflow=]
      In file included from kernel/bpf/core.c:21:
      include/linux/filter.h:550:20: note: at offset 0 to object 'insnsi' with size 0 declared here
        550 |   struct bpf_insn  insnsi[0];
            |                    ^~~~~~
      
      In this case, we really want to have two flexible-array members,
      but that is not possible. Removing the union to make insnsi a
      flexible-array member while leaving insns as a zero-length array
      fixes the warning, as nothing writes to the other one in that way.
      
      This trick only works on linux-3.18 or higher, as older versions
      had additional members in the union.
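      The resulting layout, abridged (a sketch of struct bpf_prog after the
      change; unrelated members elided):

        struct bpf_prog {
                /* ... */
                /* insnsi is now a true flexible-array member, which silences
                 * the stringop-overflow warning for writes through it; insns
                 * stays a zero-length array at the same offset for the
                 * classic-BPF side. */
                struct sock_filter      insns[0];
                struct bpf_insn         insnsi[];
        };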
      
      Fixes: 60a3b225 ("net: bpf: make eBPF interpreter images read-only")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200430213101.135134-6-arnd@arndb.de
  26. 26 Apr 2020 (1 commit)
  27. 14 Mar 2020 (2 commits)