1. 18 September 2020, 4 commits
    • bpf, x64: rework pro/epilogue and tailcall handling in JIT · ebf7d1f5
      Committed by Maciej Fijalkowski
      This commit does two things:
      1) it optimizes BPF prologue/epilogue generation
      2) it makes it possible to have tailcalls within BPF subprograms
      
      Both points are related to each other since without 1), 2) could not be
      achieved.
      
      In [1], Alexei says:
      "The prologue will look like:
      nop5
      xor eax,eax  // two new bytes if bpf_tail_call() is used in this
                   // function
      push rbp
      mov rbp, rsp
      sub rsp, rounded_stack_depth
      push rax // zero init tail_call counter
      variable number of push rbx,r13,r14,r15
      
      Then bpf_tail_call will pop variable number rbx,..
      and final 'pop rax'
      Then 'add rsp, size_of_current_stack_frame'
      jmp to next function and skip over 'nop5; xor eax,eax; push rbp; mov
      rbp, rsp'
      
      This way new function will set its own stack size and will init tail
      call
      counter with whatever value the parent had.
      
      If next function doesn't use bpf_tail_call it won't have 'xor eax,eax'.
      Instead it would need to have 'nop2' in there."
      
      Implement that suggestion.
      
      Since the stack layout has changed, tail call counter handling can no
      longer rely on popping it to rbx, as was done in the constant-prologue
      case where rbx was later overwritten with the actual value pushed to
      the stack. Therefore, use one of the registers that is considered
      volatile/caller-saved (%rcx) and pop the value of the tail call
      counter into it in the epilogue.
      
      Drop the BUILD_BUG_ON in emit_prologue and in
      emit_bpf_tail_call_indirect, where the instruction layout is no longer
      constant.
      
      Introduce a new poke target, 'tailcall_bypass', in the poke descriptor.
      It is dedicated to skipping the register pops and stack unwind that are
      generated right before the actual jump to the target program.
      When the target program is not present, the BPF program will skip the
      pop instructions and the nop5 dedicated for 'jmpq $target'. An example
      of such a state, when only R6 of the callee-saved registers is used by
      the program:
      
      ffffffffc0513aa1:       e9 0e 00 00 00          jmpq   0xffffffffc0513ab4
      ffffffffc0513aa6:       5b                      pop    %rbx
      ffffffffc0513aa7:       58                      pop    %rax
      ffffffffc0513aa8:       48 81 c4 00 00 00 00    add    $0x0,%rsp
      ffffffffc0513aaf:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      ffffffffc0513ab4:       48 89 df                mov    %rbx,%rdi
      
      When a target program is inserted, the jump that was there to skip the
      pops/nop5 becomes a nop5 itself, so the CPU will go through the pops
      and perform the actual tailcall.
      
      One might ask why the pushes cannot simply be placed after the nop5.
      Consider the following example snippet:
      
      ffffffffc037030c:       48 89 fb                mov    %rdi,%rbx
      (...)
      ffffffffc0370332:       5b                      pop    %rbx
      ffffffffc0370333:       58                      pop    %rax
      ffffffffc0370334:       48 81 c4 00 00 00 00    add    $0x0,%rsp
      ffffffffc037033b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      ffffffffc0370340:       48 81 ec 00 00 00 00    sub    $0x0,%rsp
      ffffffffc0370347:       50                      push   %rax
      ffffffffc0370348:       53                      push   %rbx
      ffffffffc0370349:       48 89 df                mov    %rbx,%rdi
      ffffffffc037034c:       e8 f7 21 00 00          callq  0xffffffffc0372548
      
      There is a bpf2bpf call (at ffffffffc037034c) right after the tailcall,
      and the jump target is not present. ctx is in the %rbx register, and
      the BPF subprogram called at ffffffffc037034c relies on it, e.g. it
      will pick up ctx from there. Such a code layout is therefore broken,
      as we would overwrite the content of %rbx with the value that was
      pushed in the prologue. That is the reason for the 'bypass' approach.
      
      Special care needs to be taken during the install/update/remove of the
      tailcall target. When the target program is not present, the CPU must
      not execute the pop instructions that precede the tailcall.
      
      To address that, the following states can be defined:
      A nop, unwind, nop
      B nop, unwind, tail
      C skip, unwind, nop
      D skip, unwind, tail
      
      State A is forbidden (it leads to incorrectness). The state transitions
      between tailcall install/update/remove work as follows:
      
      First install tail call f: C->D->B(f)
       * poke the tailcall, after that get rid of the skip
      Update tail call f to f': B(f)->B(f')
       * poke the tailcall (poke->tailcall_target) and do NOT touch the
         poke->tailcall_bypass
      Remove tail call: B(f')->C(f')
       * poke->tailcall_bypass is poked back to a jump; then we wait for an
         RCU grace period so that other programs can finish their execution,
         after which it is safe to remove the poke->tailcall_target
      Install new tail call (f''): C(f')->D(f'')->B(f'').
       * same as first step
      
      This way the CPU can never be exposed to the "unwind, tail" state.
      
      Last but not least, when tailcalls get mixed with bpf2bpf calls, an
      endless loop could occur due to the clearing of the tailcall counter.
      Consider, for example, a subprogram-based variant of the tailcall3
      program from the BPF selftests, where the tailcall lives within a BPF
      subprogram.
      
      This test, broken down into particular steps, would do:
      entry -> set tailcall counter to 0, bump it by 1, tailcall to func0
      func0 -> call subprog_tail
      (we are NOT skipping the first 11 bytes of the prologue, and this
      subprogram has a tailcall, therefore we clear the counter...)
      subprog -> do the same thing as entry
      
      and then loop forever.
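
      A minimal sketch of that problematic pattern in BPF C, loosely modeled
      on the tailcall selftests (map setup, section names and prog names
      here are illustrative, not the actual selftest source):

      struct {
              __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
              __uint(max_entries, 1);
              __uint(key_size, sizeof(__u32));
              __uint(value_size, sizeof(__u32));
      } jmp_table SEC(".maps");

      static __noinline int subprog_tail(struct __sk_buff *skb)
      {
              /* tailcall inside a subprogram: without tail call counter
               * propagation, the prologue of the target prog re-zeroes
               * the counter on every iteration
               */
              bpf_tail_call(skb, &jmp_table, 0);
              return 0;
      }

      SEC("classifier")
      int entry(struct __sk_buff *skb)
      {
              return subprog_tail(skb);
      }

      With slot 0 of jmp_table pointing back at entry, the tail call limit
      is never reached and the program loops forever.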
      
      To address this, the idea is to go through the call chain of bpf2bpf
      progs and look for a tailcall presence throughout the whole chain. If
      we see a single tail call, then each node in this call chain needs to
      be marked as a subprog that can reach the tailcall. We later feed the
      JIT with this info and:
      - set eax to 0 only when the tailcall is reachable and this is the
        entry prog
      - if the tailcall is reachable but there is no tailcall in the insns
        of the currently JITed prog, then push rax anyway, so that it can be
        propagated further down the call chain
      - finally, if the tailcall is reachable, precede the 'call' insn with
        mov rax, [rbp - (stack_depth + 8)]
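
      A rough sketch of how such a conditional prologue can be emitted,
      following the x64 JIT's EMIT* conventions (an approximation of the
      patch, not the exact diff):

      static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf,
                                bool tail_call_reachable, bool is_subprog)
      {
              u8 *prog = *pprog;
              int cnt = X86_PATCH_SIZE;

              /* nop5 reserved for the BPF trampoline */
              memcpy(prog, ideal_nops[NOP_ATOMIC5], cnt);
              prog += cnt;
              if (!ebpf_from_cbpf) {
                      if (tail_call_reachable && !is_subprog)
                              EMIT2(0x31, 0xC0); /* xor eax, eax: init tail call cnt */
                      else
                              EMIT2(0x66, 0x90); /* nop2: keep prologue size fixed */
              }
              EMIT1(0x55);             /* push rbp */
              EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */
              /* sub rsp, rounded_stack_depth */
              EMIT3_off32(0x48, 0x81, 0xEC, round_up(stack_depth, 8));
              if (tail_call_reachable)
                      EMIT1(0x50);     /* push rax: carry the counter on stack */
              *pprog = prog;
      }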
      
      Tail call related cases from test_verifier kselftest are also working
      fine. Sample BPF programs that utilize tail calls (sockex3, tracex5)
      work properly as well.
      
      [1]: https://lore.kernel.org/bpf/20200517043227.2gpq22ifoq37ogst@ast-mbp.dhcp.thefacebook.com/
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Limit caller's stack depth 256 for subprogs with tailcalls · 7f6e4312
      Committed by Maciej Fijalkowski
      Protect against a potential stack overflow that might happen when
      bpf2bpf calls get combined with tailcalls. Limit the caller's stack
      depth in such a case to 256 bytes, so that the worst case results in
      an 8k stack size (32, the tail call limit, * 256 = 8k).
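
      A sketch of where such a check can sit in the verifier's stack-depth
      walk (check_max_stack_depth(); the exact message and surrounding logic
      may differ from the patch):

      /* a subprog containing a tail call must not sit on top of more
       * than 256 bytes of accumulated caller stack
       */
      if (idx && subprog[idx].has_tail_call && depth >= 256) {
              verbose(env,
                      "tail_calls are not allowed when call stack of previous frames is %d bytes. Too large\n",
                      depth);
              return -EACCES;
      }
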
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: rename poke descriptor's 'ip' member to 'tailcall_target' · cf71b174
      Committed by Maciej Fijalkowski
      Reflect the actual purpose of poke->ip and rename it to
      poke->tailcall_target so that it will not be confused with another
      poke target that will be introduced in the next commit.
      
      While at it, do the same thing with poke->ip_stable - rename it to
      poke->tailcall_target_stable.
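
      After the rename, the descriptor reads roughly like this (a sketch;
      other members elided):

      struct bpf_jit_poke_descriptor {
              void *tailcall_target;        /* was: ip */
              /* ... */
              bool tailcall_target_stable;  /* was: ip_stable */
              /* ... */
      };
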
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: propagate poke descriptors to subprograms · a748c697
      Committed by Maciej Fijalkowski
      Previously, there was no need for poke descriptors to be present in
      the subprogram's bpf_prog_aux struct, since tailcalls were simply not
      allowed in subprograms. Each subprog is JITed independently, so in
      order to enable JITing subprograms that use tailcalls, do the
      following:
      
      - in fixup_bpf_calls(), store the index of the tailcall insn in the
        generated poke descriptor,
      - when insn patching occurs, adjust the tailcall insn idx from
        bpf_patch_insn_data,
      - then in jit_subprogs(), check whether a given poke descriptor belongs
        to the current subprog by checking whether the previously stored
        absolute index of the tail call insn falls within the insns of the
        given subprog,
      - update insn->imm with the new poke descriptor slot, so that the
        proper poke descriptor will be grabbed while JITing
      
      This way, each of the main program's poke descriptors is distributed
      across the subprograms' poke descriptor arrays, so the main program's
      descriptors can be untracked from the prog array map.
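
      A sketch of the ownership check in jit_subprogs() described above
      (variable and field names are illustrative):

      /* claim the poke descriptors whose tail call insn falls within
       * this subprog's [subprog_start, subprog_end) insn range
       */
      for (j = 0; j < prog->aux->size_poke_tab; j++) {
              struct bpf_jit_poke_descriptor *poke = &prog->aux->poke_tab[j];

              if (poke->insn_idx < subprog_start ||
                  poke->insn_idx >= subprog_end)
                      continue;
              /* copy the descriptor into the subprog's aux and store the
               * new slot in insn->imm for the JIT to pick up
               */
      }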
      
      Also add the subprog's aux struct to the BPF map's poke_progs list by
      calling map_poke_track() on it.
      
      In case of any error, call map_poke_untrack() on the subprog aux
      structs that have already been registered with the prog array map.
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 16 September 2020, 1 commit
  3. 11 September 2020, 1 commit
    • bpf: Plug hole in struct bpf_sk_lookup_kern · d66423fb
      Committed by Lorenz Bauer
      As Alexei points out, struct bpf_sk_lookup_kern has two 4-byte holes.
      This leads to suboptimal instructions being generated (IPv4, x86):
      
          1372                    struct bpf_sk_lookup_kern ctx = {
             0xffffffff81b87f30 <+624>:   xor    %eax,%eax
             0xffffffff81b87f32 <+626>:   mov    $0x6,%ecx
             0xffffffff81b87f37 <+631>:   lea    0x90(%rsp),%rdi
             0xffffffff81b87f3f <+639>:   movl   $0x110002,0x88(%rsp)
             0xffffffff81b87f4a <+650>:   rep stos %rax,%es:(%rdi)
             0xffffffff81b87f4d <+653>:   mov    0x8(%rsp),%eax
             0xffffffff81b87f51 <+657>:   mov    %r13d,0x90(%rsp)
             0xffffffff81b87f59 <+665>:   incl   %gs:0x7e4970a0(%rip)
             0xffffffff81b87f60 <+672>:   mov    %eax,0x8c(%rsp)
             0xffffffff81b87f67 <+679>:   movzwl 0x10(%rsp),%eax
             0xffffffff81b87f6c <+684>:   mov    %ax,0xa8(%rsp)
             0xffffffff81b87f74 <+692>:   movzwl 0x38(%rsp),%eax
             0xffffffff81b87f79 <+697>:   mov    %ax,0xaa(%rsp)
      
      Fix this by moving around sport and dport. pahole confirms there
      are no more holes:
      
          struct bpf_sk_lookup_kern {
              u16                        family;       /*     0     2 */
              u16                        protocol;     /*     2     2 */
              __be16                     sport;        /*     4     2 */
              u16                        dport;        /*     6     2 */
              struct {
                      __be32             saddr;        /*     8     4 */
                      __be32             daddr;        /*    12     4 */
              } v4;                                    /*     8     8 */
              struct {
                      const struct in6_addr  * saddr;  /*    16     8 */
                      const struct in6_addr  * daddr;  /*    24     8 */
              } v6;                                    /*    16    16 */
              struct sock *              selected_sk;  /*    32     8 */
              bool                       no_reuseport; /*    40     1 */
      
              /* size: 48, cachelines: 1, members: 8 */
              /* padding: 7 */
              /* last cacheline: 48 bytes */
          };
      
      The assembly also doesn't contain the pesky rep stos anymore:
      
          1372                    struct bpf_sk_lookup_kern ctx = {
             0xffffffff81b87f60 <+624>:   movzwl 0x10(%rsp),%eax
             0xffffffff81b87f65 <+629>:   movq   $0x0,0xa8(%rsp)
             0xffffffff81b87f71 <+641>:   movq   $0x0,0xb0(%rsp)
             0xffffffff81b87f7d <+653>:   mov    %ax,0x9c(%rsp)
             0xffffffff81b87f85 <+661>:   movzwl 0x38(%rsp),%eax
             0xffffffff81b87f8a <+666>:   movq   $0x0,0xb8(%rsp)
             0xffffffff81b87f96 <+678>:   mov    %ax,0x9e(%rsp)
             0xffffffff81b87f9e <+686>:   mov    0x8(%rsp),%eax
             0xffffffff81b87fa2 <+690>:   movq   $0x0,0xc0(%rsp)
             0xffffffff81b87fae <+702>:   movl   $0x110002,0x98(%rsp)
             0xffffffff81b87fb9 <+713>:   mov    %eax,0xa0(%rsp)
             0xffffffff81b87fc0 <+720>:   mov    %r13d,0xa4(%rsp)
      
      1: https://lore.kernel.org/bpf/CAADnVQKE6y9h2fwX6OS837v-Uf+aBXnT_JXiN_bbo2gitZQ3tA@mail.gmail.com/
      
      Fixes: e9ddbb77 ("bpf: Introduce SK_LOOKUP program type with a dedicated attach point")
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Link: https://lore.kernel.org/bpf/20200910110248.198326-1-lmb@cloudflare.com
  4. 01 September 2020, 6 commits
  5. 29 August 2020, 2 commits
    • bpf: Add bpf_copy_from_user() helper. · 07be4c4a
      Committed by Alexei Starovoitov
      Sleepable BPF programs can now use copy_from_user() to access user memory.
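
      A minimal usage sketch from a sleepable program's point of view (the
      pointer source user_ptr is illustrative):

      /* helper signature:
       * long bpf_copy_from_user(void *dst, u32 size, const void *user_ptr);
       */
      char buf[16] = {};

      if (bpf_copy_from_user(buf, sizeof(buf), user_ptr))
              return 0; /* copy failed, e.g. invalid user address */
      /* buf now holds up to 16 bytes copied from user memory */
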
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: KP Singh <kpsingh@google.com>
      Link: https://lore.kernel.org/bpf/20200827220114.69225-4-alexei.starovoitov@gmail.com
    • bpf: Introduce sleepable BPF programs · 1e6c62a8
      Committed by Alexei Starovoitov
      Introduce sleepable BPF programs that can request this property for
      themselves via the BPF_F_SLEEPABLE flag at program load time. Such
      programs will be able to use helpers like bpf_copy_from_user() that
      might sleep. At present only fentry/fexit/fmod_ret and lsm programs
      can request to be sleepable, and only when they are attached to kernel
      functions that are known to allow sleeping.
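
      Requesting the flag boils down to setting prog_flags in the load
      attributes; a raw-syscall sketch (insns, license and attach target
      fields elided):

      union bpf_attr attr = {};

      attr.prog_type = BPF_PROG_TYPE_TRACING;
      attr.expected_attach_type = BPF_TRACE_FENTRY;
      attr.prog_flags = BPF_F_SLEEPABLE; /* request sleepable semantics */
      /* attr.insns, attr.insn_cnt, attr.license, attr.attach_btf_id, ... */
      prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));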
      
      Non-sleepable programs rely on implicit rcu_read_lock() and
      migrate_disable() to protect the lifetime of programs, the maps they
      use, and the per-cpu kernel structures used to pass info between bpf
      programs and the kernel. Sleepable programs cannot be enclosed in
      rcu_read_lock(). migrate_disable() maps to preempt_disable() in non-RT
      kernels, so the progs should not be enclosed in migrate_disable()
      either. Therefore rcu_read_lock_trace is used to protect the lifetime
      of sleepable progs.
      
      There are many networking and tracing program types. In many cases the
      'struct bpf_prog *' pointer itself is rcu protected within some other kernel
      data structure and the kernel code is using rcu_dereference() to load that
      program pointer and call BPF_PROG_RUN() on it. All these cases are not touched.
      Instead, sleepable bpf programs are allowed with the bpf trampoline
      only. The program pointers are hard-coded into the generated assembly
      of the bpf trampoline, and synchronize_rcu_tasks_trace() is used to
      protect the lifetime of the program.
      The same trampoline can hold both sleepable and non-sleepable progs.
      
      When rcu_read_lock_trace is held, it means that some sleepable bpf
      program is running from a bpf trampoline. Such programs can use bpf
      arrays and preallocated hash/lru maps. These map types wait for
      programs to complete via synchronize_rcu_tasks_trace().
      
      Updates to a trampoline now have to do synchronize_rcu_tasks_trace()
      and synchronize_rcu_tasks() to wait for sleepable progs to finish and
      for the trampoline assembly to finish.
      
      This is the first step of introducing sleepable progs. Eventually dynamically
      allocated hash maps can be allowed and networking program types can become
      sleepable too.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: KP Singh <kpsingh@google.com>
      Link: https://lore.kernel.org/bpf/20200827220114.69225-3-alexei.starovoitov@gmail.com
  6. 28 August 2020, 2 commits
    • net: add option to not create fall-back tunnels in root-ns as well · 316cdaa1
      Committed by Mahesh Bandewar
      Commit 79134e6c ("net: do not create fallback tunnels for non-default
      namespaces") earlier added a sysctl to create fallback tunnels only in
      the root namespace. This patch enhances that behavior to provide the
      option of not creating fallback tunnels in the root namespace either.
      Since the modules that create fallback tunnels can be built-in,
      setting the sysctl value after booting is pointless, so a kernel
      command-line option is added to change this default. The default
      setting is preserved for backward compatibility. The kernel
      command-line option fb_tunnels=initns sets the sysctl value to 1 and
      creates fallback tunnels only in the init namespace, while
      fb_tunnels=none sets the sysctl value to 2 and skips fallback tunnels
      in every netns.
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Maciej Zenczykowski <maze@google.com>
      Cc: Jian Yang <jianyang@google.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: Add map_meta_equal map ops · f4d05259
      Committed by Martin KaFai Lau
      Some properties of the inner map are used at verification time. When
      an inner map is inserted into an outer map at runtime,
      bpf_map_meta_equal() is currently used to ensure those properties of
      the inserted inner map stay the same as at verification time.
      
      In particular, the current bpf_map_meta_equal() checks max_entries,
      which turns out to be too restrictive for most maps, which do not use
      max_entries during verification.  It rules out the use case of
      replacing a smaller inner map with a larger inner map.  Some maps do
      use max_entries during verification, though.  For example, the
      map_gen_lookup in array_map_ops uses max_entries to generate the
      inline lookup code.
      
      To accommodate differences between maps, the map_meta_equal op is
      added to bpf_map_ops.  Each map type can decide what to check when its
      map is used as an inner map at runtime.
      
      Also, some map types cannot be used as an inner map, and they are
      currently blacklisted in bpf_map_meta_alloc() in map_in_map.c.  It is
      not unusual for new map types to be unaware that such a blacklist
      exists.  This patch enforces an explicit opt-in and only allows a map
      to be used as an inner map if it has implemented the map_meta_equal
      op.  It is based on the discussion in [1].
      
      In this patch, all maps that support being used as an inner map have
      their map_meta_equal pointing to bpf_map_meta_equal.  A later patch
      will relax the max_entries check for most maps.  bpf_types.h counts 28
      map types.  This patch adds 23 ".map_meta_equal" entries by using
      coccinelle; the 5 exceptions are:
      	BPF_MAP_TYPE_PROG_ARRAY
      	BPF_MAP_TYPE_(PERCPU)_CGROUP_STORAGE
      	BPF_MAP_TYPE_STRUCT_OPS
      	BPF_MAP_TYPE_ARRAY_OF_MAPS
      	BPF_MAP_TYPE_HASH_OF_MAPS
      
      The "if (inner_map->inner_map_meta)" check in bpf_map_meta_alloc()
      is moved such that the same error is returned.
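
      The shape of the new op and a typical opt-in look roughly like this
      (a sketch; surrounding members elided):

      /* in struct bpf_map_ops */
      bool (*map_meta_equal)(const struct bpf_map *meta0,
                             const struct bpf_map *meta1);

      /* a map type opting in to being used as an inner map */
      const struct bpf_map_ops array_map_ops = {
              .map_meta_equal = bpf_map_meta_equal,
              /* ... */
      };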
      
      [1]: https://lore.kernel.org/bpf/20200522022342.899756-1-kafai@fb.com/
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200828011806.1970400-1-kafai@fb.com
  7. 27 August 2020, 4 commits
  8. 26 August 2020, 7 commits
  9. 25 August 2020, 10 commits
    • qed: align adjacent indent · c5c642c5
      Committed by Igor Russkikh
      Remove extra indent on some adjacent declarations.
      Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
      Signed-off-by: Alexander Lobakin <alobakin@marvell.com>
      Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qed: use devlink logic to report errors · 4f5a8db2
      Committed by Igor Russkikh
      Use devlink_health_report to push error indications. We implement this
      in qede via a callback function, to make it possible to reuse the same
      mechanism for other drivers sitting on top of qed in the future.
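
      For reference, pushing an error indication is a single call against a
      registered reporter (a sketch; the reporter and context names are
      illustrative):

      /* int devlink_health_report(struct devlink_health_reporter *reporter,
       *                           const char *msg, void *priv_ctx);
       */
      devlink_health_report(qdev->fw_reporter,
                            "firmware fault detected", &fault_ctx);
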
      Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
      Signed-off-by: Alexander Lobakin <alobakin@marvell.com>
      Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qed: health reporter init deinit seq · 9524067b
      Committed by Igor Russkikh
      Here we declare the health reporter ops (empty for now) and register
      them in the qed probe and remove callbacks.

      This way we get devlink attached to all kinds of qed* PCI device
      entities: networking or storage offload.
      Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
      Signed-off-by: Alexander Lobakin <alobakin@marvell.com>
      Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qed/qede: make devlink survive recovery · 755f982b
      Committed by Igor Russkikh
      The devlink instance lifecycle was linked to the qed_dev object, which
      caused devlink to be recreated on each recovery.

      Change this by making the higher-level driver (qede) responsible for
      its lifetime. This way devlink now survives recoveries.
      
      qede now stores the devlink structure pointer as part of its device
      object; the devlink private data contains a linkage structure,
      qed_devlink.
      Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
      Signed-off-by: Alexander Lobakin <alobakin@marvell.com>
      Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • io_uring: allow tcp ancillary data for __sys_recvmsg_sock() · 583bbf06
      Committed by Luke Hsiao
      For TCP tx zero-copy, the kernel notifies the process of completions by
      queuing completion notifications on the socket error queue. This patch
      allows reading these notifications via recvmsg to support TCP tx
      zero-copy.
      
      Ancillary data was originally disallowed due to privilege escalation
      via io_uring's offloading of sendmsg() onto a kernel thread with kernel
      credentials (https://crbug.com/project-zero/1975). So, we must ensure
      that the socket type is one where the ancillary data types that are
      delivered on recvmsg are plain data (no file descriptors or values that
      are translated based on the identity of the calling process).
      
      This was tested by using io_uring to call recvmsg on the MSG_ERRQUEUE
      with tx zero-copy enabled. Before this patch, we received -EINVAL from
      this specific code path. After this patch, we could read tcp tx
      zero-copy completion notifications from the MSG_ERRQUEUE.
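
      A sketch of that test flow using liburing (socket setup and completion
      handling elided; ring, sockfd and cbuf are illustrative):

      struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
      struct msghdr msg = { .msg_control = cbuf,
                            .msg_controllen = sizeof(cbuf) };

      /* read tx zero-copy completion notifications off the error queue */
      io_uring_prep_recvmsg(sqe, sockfd, &msg, MSG_ERRQUEUE);
      io_uring_submit(&ring);
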
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Jann Horn <jannh@google.com>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Luke Hsiao <lukehsiao@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: bpf: Optionally store mac header in TCP_SAVE_SYN · 267cf9fa
      Committed by Martin KaFai Lau
      This patch is adapted from Eric's patch in an earlier discussion [1].
      
      TCP_SAVE_SYN currently only stores the network header and the tcp
      header.  This patch allows it to optionally store the mac header as
      well, if the setsockopt's optval is 2.
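
      From userspace, opting in looks like this (a sketch; error handling
      elided):

      /* optval 1: save network + tcp header (existing behavior)
       * optval 2: additionally save the mac header
       */
      int save = 2;

      setsockopt(fd, IPPROTO_TCP, TCP_SAVE_SYN, &save, sizeof(save));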
      
      It requires one more bit for the "save_syn" bit field in tcp_sock.
      This patch achieves that by moving the syn_smc bit next to is_mptcp.
      syn_smc is currently used with the TCP experimental option.  Since
      syn_smc is only used when CONFIG_SMC is enabled, this patch also puts
      an "IS_ENABLED(CONFIG_SMC)" guard around it, like is_mptcp does with
      "IS_ENABLED(CONFIG_MPTCP)".
      
      The mac_hdrlen is also stored in "struct saved_syn" to allow a quick
      offset from the bpf prog if it chooses to start reading from the
      network header or the tcp header.
      
      [1]: https://lore.kernel.org/netdev/CANn89iLJNWh6bkH7DNhy_kmcAexuUCccqERqe7z2QsvPhGrYPQ@mail.gmail.com/
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/bpf/20200820190123.2886935-1-kafai@fb.com
    • bpf: tcp: Allow bpf prog to write and parse TCP header option · 0813a841
      Committed by Martin KaFai Lau
      [ Note: The TCP changes here are mainly to implement the bpf
        pieces into the bpf_skops_*() functions introduced
        in the earlier patches. ]
      
      The earlier effort in BPF-TCP-CC allows the TCP congestion control
      algorithm to be written in BPF.  It opens up opportunities for a
      faster turnaround time in testing/releasing new congestion control
      ideas to the production environment.
      
      The same flexibility can be extended to writing TCP header options.
      It is not uncommon that people want to test a new TCP header option
      to improve TCP performance.  Another use case is a data center that
      has a more controlled environment and more flexibility in putting
      header options for internal-only use.
      
      For example, we want to test the idea of putting a maximum delay
      ACK in a TCP header option, which is similar to a draft RFC proposal [1].
      
      This patch introduces the necessary BPF API and uses it in the
      TCP stack to allow BPF_PROG_TYPE_SOCK_OPS programs to parse
      and write TCP header options.  It currently supports most
      TCP packets except RST.
      
      Supported TCP header option:
      ───────────────────────────
      This patch allows the bpf-prog to write any option kind.
      Different bpf-progs can write their own options by calling the new
      helper bpf_store_hdr_opt().  The helper will ensure there is no
      duplicated option in the header.
      
      By allowing the bpf-prog to write any option kind, this gives a lot
      of flexibility to the bpf-prog.  Each bpf-prog can write its own
      option kind.  It could also allow the bpf-prog to support a
      recently standardized option on an older kernel.
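
      A sketch of writing an option from a sock_ops prog (the option kind
      0xfd, an experimental kind, and the payload are illustrative):

      /* long bpf_store_hdr_opt(struct bpf_sock_ops *skops,
       *                        const void *from, u32 len, u64 flags);
       */
      struct {
              __u8 kind;
              __u8 len;
              __u8 data;
      } __attribute__((packed)) opt = { .kind = 0xfd, .len = 3, .data = 42 };

      err = bpf_store_hdr_opt(skops, &opt, sizeof(opt), 0);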
      
      Sockops Callback Flags:
      ──────────────────────
      The bpf program will only be called to parse/write tcp header option
      if the following newly added callback flags are enabled
      in tp->bpf_sock_ops_cb_flags:
      BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG
      BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG
      BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
      
      A few words on the PARSE CB flags.  When the above PARSE CB flags are
      turned on, the bpf-prog will be called on packets received
      at a sk that has at least reached the ESTABLISHED state.
      The parsing of the SYN-SYNACK-ACK will be discussed in the
      "3 Way HandShake" section.
      
      The default is off for all of the above new CB flags, i.e. the bpf
      prog will not be called to parse or write bpf hdr options.  There are
      detailed comments on these new cb flags in the UAPI bpf.h.
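
      Enabling the flags from within a sock_ops prog can use the existing
      bpf_sock_ops_cb_flags_set() helper, e.g.:

      bpf_sock_ops_cb_flags_set(skops,
                                skops->bpf_sock_ops_cb_flags |
                                BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG |
                                BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG);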
      
      sock_ops->skb_data and bpf_load_hdr_opt()
      ─────────────────────────────────────────
      sock_ops->skb_data and sock_ops->skb_data_end cover the whole
      TCP header and its options.  They are read only.
      
      The new bpf_load_hdr_opt() helps to read a particular option "kind"
      from the skb_data.
      
      Please refer to the comment in UAPI bpf.h.  It has details
      on what skb_data contains under different sock_ops->op.
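
      Reading back an option by kind can look like this (a sketch; the kind
      0xfd is illustrative):

      /* long bpf_load_hdr_opt(struct bpf_sock_ops *skops,
       *                       void *searchby_res, u32 len, u64 flags);
       */
      __u8 opt[3] = {};

      opt[0] = 0xfd; /* search by option kind */
      err = bpf_load_hdr_opt(skops, opt, sizeof(opt), 0);
      /* on success, opt[] holds the whole option: kind, len, payload */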
      
      3 Way HandShake
      ───────────────
      The bpf-prog can learn if it is sending SYN or SYNACK by reading the
      sock_ops->skb_tcp_flags.
      
      * Passive side
      
      When writing SYNACK (i.e. sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB),
      the received SYN skb will be available to the bpf prog.  The bpf prog can
      use the SYN skb (which may carry the header option sent from the remote bpf
      prog) to decide what bpf header option should be written to the outgoing
      SYNACK skb.  The SYN packet can be obtained by getsockopt(TCP_BPF_SYN*).
      More on this later.  Also, the bpf prog can learn if it is in syncookie
      mode (by checking sock_ops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE).
      
      The bpf prog can store the received SYN pkt by using the existing
      bpf_setsockopt(TCP_SAVE_SYN).  The example in a later patch does it.
      [ Note that the fullsock here is a listen sk, bpf_sk_storage
        is not very useful here since the listen sk will be shared
        by many concurrent connection requests.
      
        Extending bpf_sk_storage support to request_sock would add weight
        to the minisock, and it is not necessarily better than storing the
        whole ~100-byte SYN pkt. ]
      
      When the connection is established, the bpf prog will be called
      in the existing PASSIVE_ESTABLISHED_CB callback.  At that time,
      the bpf prog can get the header option from the saved syn and
      then apply the needed operation to the newly established socket.
      The later patch will use the max delay ack specified in the SYN
      header and set the RTO of this newly established connection
      as an example.
      
      The received ACK (that concludes the 3WHS) will also be available to
      the bpf prog during PASSIVE_ESTABLISHED_CB through the sock_ops->skb_data.
      It could be useful in syncookie scenario.  More on this later.
      
      There is an existing getsockopt "TCP_SAVED_SYN" to return the whole
      saved syn pkt, which includes the IP[46] header and the TCP header.
      A few "TCP_BPF_SYN*" getsockopts have been added to allow specifying
      where to start reading from, e.g. starting from the TCP header, or
      from the IP[46] header.
      
      The new getsockopt(TCP_BPF_SYN*) will also know where it can get
      the SYN's packet from:
        - (a) the just received syn (available when the bpf prog is writing SYNACK)
              and it is the only way to get SYN during syncookie mode.
        or
        - (b) the saved syn (available in PASSIVE_ESTABLISHED_CB and also other
              existing CB).
      
      The bpf prog does not need to know where the SYN pkt is coming from.
      The getsockopt(TCP_BPF_SYN*) will hide these details.
      
      Similarly, a flag "BPF_LOAD_HDR_OPT_TCP_SYN" is also added to
      bpf_load_hdr_opt() to read a particular header option from the SYN packet.
      
      * Fastopen
      
      Fastopen should work the same as the regular non-fastopen case.
      This is tested in a later patch.
      
      * Syncookie
      
      For syncookie, the later example patch asks the active
      side's bpf prog to resend the header options in ACK.  The server
      can use bpf_load_hdr_opt() to look at the options in this
      received ACK during PASSIVE_ESTABLISHED_CB.
      
      * Active side
      
      The bpf prog will get a chance to write the bpf header option
      in the SYN packet during WRITE_HDR_OPT_CB.  The received SYNACK
      pkt will also be available to the bpf prog during the existing
      ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data
      and bpf_load_hdr_opt().
      
      * Turn off header CB flags after 3WHS
      
      If the bpf prog does not need to write/parse header options beyond
      the 3WHS, it can clear the bpf_sock_ops_cb_flags to avoid being
      called for header options.  Alternatively, the bpf-prog can choose to
      leave the UNKNOWN_HDR_OPT_CB_FLAG on, so that the kernel will only
      call it when there is an option that the kernel cannot handle.
      
      [1]: draft-wang-tcpm-low-latency-opt-00
           https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200820190104.2885895-1-kafai@fb.com
    • bpf: sock_ops: Change some members of sock_ops_kern from u32 to u8 · c9985d09
      Committed by Martin KaFai Lau
      A later patch needs to add a few pointers and a few u8 members to
      sock_ops_kern.  Hence, this patch saves some space by moving some of
      the existing members from u32 to u8, so that the later patch can
      still fit everything in a cacheline.
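
      The kind of change involved, sketched on struct bpf_sock_ops_kern
      (which members get narrowed is illustrative here):

      struct bpf_sock_ops_kern {
              struct sock *sk;
              /* ... */
              u8 op;          /* was: u32 op */
              u8 is_fullsock; /* was: u32 is_fullsock */
              /* ... */
      };
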
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200820190058.2885640-1-kafai@fb.com
    • tcp: Add saw_unknown to struct tcp_options_received · 7656d684
      Committed by Martin KaFai Lau
      In a later patch, the bpf prog only wants to be called to handle a
      header option if that particular header option cannot be handled by
      the kernel.  Such an unknown option could be written by the peer's
      bpf-prog.  It could also be a new standard option that the running
      kernel does not support yet, while a bpf-prog can handle it.
      
      This patch adds a "saw_unknown" bit to "struct tcp_options_received",
      using an existing one-byte hole.  "saw_unknown" will be set in
      tcp_parse_options() if it sees an option that the kernel cannot
      handle.
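
      The addition itself is a one-bit field in the existing hole (a sketch):

      struct tcp_options_received {
              /* ... existing members ... */
              u8 saw_unknown:1; /* set by tcp_parse_options() on an option
                                 * the kernel cannot handle
                                 */
      };
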
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200820190033.2884430-1-kafai@fb.com
    • tcp: Use a struct to represent a saved_syn · 70a217f1
      Committed by Martin KaFai Lau
      The TCP_SAVE_SYN data has both the network header and the tcp header.
      The total length of the saved syn packet is currently stored in the
      first 4 bytes (u32) of an array, and the actual packet data is stored
      after that.
      
      A later patch will add a bpf helper that allows getting the tcp
      header alone from the saved syn, without the network header.  It will
      be more convenient to have a direct offset to a specific header
      instead of re-parsing it.  This requires storing the network hdrlen
      separately.  The total header length (i.e. network + tcp) is still
      needed for the current usage in getsockopt.  Although this total
      length can be obtained by looking into the tcphdr and computing
      (th->doff << 2), this patch chooses to directly store the tcp hdrlen
      in the second four bytes of the newly created "struct saved_syn".
      By using a new struct, each individual header length gets a readable
      name.
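
      The resulting struct then looks roughly like this (a sketch matching
      the description above):

      struct saved_syn {
              u32 network_hdrlen; /* IP[46] header length */
              u32 tcp_hdrlen;     /* th->doff << 2 */
              u8 data[];          /* the saved syn packet itself */
      };
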
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200820190014.2883694-1-kafai@fb.com
  10. 22 August 2020, 3 commits