1. 27 Jan, 2018 17 commits
    • bpf: fix kernel page fault in lpm map trie_get_next_key · 6dd1ec6c
      Committed by Yonghong Song
      Commit b471f2f1 ("bpf: implement MAP_GET_NEXT_KEY command
      for LPM_TRIE map") introduced a bug, shown below:
      
          if (!rcu_dereference(trie->root))
              return -ENOENT;
          if (!key || key->prefixlen > trie->max_prefixlen) {
              root = &trie->root;
              goto find_leftmost;
          }
          ......
        find_leftmost:
          for (node = rcu_dereference(*root); node;) {
      
      In the code after the label find_leftmost, it is assumed
      that *root is not NULL, but this does not hold, because
      trie->root can be changed to NULL by an asynchronous
      delete operation.
      
      The issue is reported by syzbot and Eric Dumazet with the
      below error log:
        ......
        kasan: CONFIG_KASAN_INLINE enabled
        kasan: GPF could be caused by NULL-ptr deref or user memory access
        general protection fault: 0000 [#1] SMP KASAN
        Dumping ftrace buffer:
           (ftrace buffer empty)
        Modules linked in:
        CPU: 1 PID: 8033 Comm: syz-executor3 Not tainted 4.15.0-rc8+ #4
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:trie_get_next_key+0x3c2/0xf10 kernel/bpf/lpm_trie.c:682
        ......
      
      This patch fixes the issue by using the local rcu_dereferenced
      pointer instead of re-reading *(&trie->root) later on.
      
      Fixes: b471f2f1 ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
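The idea behind the fix in 6dd1ec6c (load the shared root pointer once, then work only with that local copy) can be illustrated outside the kernel. The following is a minimal user-space sketch, not the kernel code: it uses a C11 atomic load as a stand-in for rcu_dereference(), the types and the function name leftmost_value are hypothetical, and it does not model RCU grace periods or node freeing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's trie node and map types. */
struct node {
    struct node *left;
    int value;
};

struct trie {
    _Atomic(struct node *) root; /* a concurrent delete may store NULL here */
};

/*
 * Return the value of the leftmost node, or -1 if the trie is empty.
 * The root is loaded exactly once into a local variable. Every later
 * step uses that snapshot, so a concurrent store of NULL to t->root
 * cannot invalidate the NULL check that was already performed.
 */
int leftmost_value(struct trie *t)
{
    struct node *n = atomic_load_explicit(&t->root, memory_order_acquire);

    if (!n)
        return -1;
    while (n->left)
        n = n->left;
    return n->value;
}
```

The buggy pattern would be to test the root for NULL and then read t->root a second time before walking, which leaves a window in which the second read can observe NULL.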
    • Merge branch 'bpf-improvements-and-fixes' · 1651e39e
      Committed by Alexei Starovoitov
      Daniel Borkmann says:
      
      ====================
      This set contains a small cleanup in cBPF prologue generation and
      otherwise fixes an outstanding issue related to BPF to BPF calls
      and exception handling. For details please see related patches.
      Last but not least, BPF selftests are extended with several new
      test cases.
      
      Thanks!
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: add further test cases around div/mod and others · 21ccaf21
      Committed by Daniel Borkmann
      Update selftests to reflect recent changes and add various new
      test cases.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, arm: remove obsolete exception handling from div/mod · 73ae3c04
      Committed by Daniel Borkmann
      Since we've changed div/mod exception handling for src_reg in
      eBPF verifier itself, remove the leftovers from arm32 JIT.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Shubham Bansal <illusionist.neo@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, mips64: remove unneeded zero check from div/mod with k · e472d5d8
      Committed by Daniel Borkmann
      The verifier in both cBPF and eBPF rejects div/mod by 0 imm,
      so such a program can never load. Remove emitting such a test
      and reject it from being JITed instead (the latter is not
      strictly needed either, but it matches current practice in
      sparc64 and ppc64, so it doesn't hurt to add it here).
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David Daney <david.daney@cavium.com>
      Reviewed-by: David Daney <david.daney@cavium.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, mips64: remove obsolete exception handling from div/mod · 1fb5c9c6
      Committed by Daniel Borkmann
      Since we've changed div/mod exception handling for src_reg in
      eBPF verifier itself, remove the leftovers from mips64 JIT.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David Daney <david.daney@cavium.com>
      Reviewed-by: David Daney <david.daney@cavium.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, sparc64: remove obsolete exception handling from div/mod · 740d52c6
      Committed by Daniel Borkmann
      Since we've changed div/mod exception handling for src_reg in
      eBPF verifier itself, remove the leftovers from sparc64 JIT.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, ppc64: remove obsolete exception handling from div/mod · 53fbf571
      Committed by Daniel Borkmann
      Since we've changed div/mod exception handling for src_reg in
      eBPF verifier itself, remove the leftovers from ppc64 JIT.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, s390x: remove obsolete exception handling from div/mod · a3212b8f
      Committed by Daniel Borkmann
      Since we've changed div/mod exception handling for src_reg in
      eBPF verifier itself, remove the leftovers from s390x JIT.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, arm64: remove obsolete exception handling from div/mod · 96a71005
      Committed by Daniel Borkmann
      Since we've changed div/mod exception handling for src_reg in
      eBPF verifier itself, remove the leftovers from arm64 JIT.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, x86_64: remove obsolete exception handling from div/mod · 3e5b1a39
      Committed by Daniel Borkmann
      Since we've changed div/mod exception handling for src_reg in
      eBPF verifier itself, remove the leftovers from x86_64 JIT.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: fix subprog verifier bypass by div/mod by 0 exception · f6b1b3bf
      Committed by Daniel Borkmann
      One of the ugly leftovers from the early eBPF days is that div/mod
      operations based on registers have a hard-coded src_reg == 0 test
      in the interpreter as well as in JIT code generators that would
      return from the BPF program with exit code 0. This was basically
      adopted from cBPF interpreter for historical reasons.
      
      There are multiple reasons why this is very suboptimal and prone
      to bugs. To name one: the return code mapping for such abnormal
      program exit of 0 does not always match with a suitable program
      type's exit code mapping. For example, '0' in tc means action 'ok'
      where the packet gets passed further up the stack, which is just
      undesirable for such cases (e.g. when implementing policy) and
      also does not match with other program types.
      
      While trying to work out an exception handling scheme, I also
      noticed that programs crafted like the following will currently
      pass the verifier:
      
        0: (bf) r6 = r1
        1: (85) call pc+8
        caller:
         R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
        callee:
         frame1: R1=ctx(id=0,off=0,imm=0) R10=fp0,call_1
        10: (b4) (u32) r2 = (u32) 0
        11: (b4) (u32) r3 = (u32) 1
        12: (3c) (u32) r3 /= (u32) r2
        13: (61) r0 = *(u32 *)(r1 +76)
        14: (95) exit
        returning from callee:
         frame1: R0_w=pkt(id=0,off=0,r=0,imm=0)
                 R1=ctx(id=0,off=0,imm=0) R2_w=inv0
                 R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
                 R10=fp0,call_1
        to caller at 2:
         R0_w=pkt(id=0,off=0,r=0,imm=0) R6=ctx(id=0,off=0,imm=0)
         R10=fp0,call_-1
      
        from 14 to 2: R0=pkt(id=0,off=0,r=0,imm=0)
                      R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
        2: (bf) r1 = r6
        3: (61) r1 = *(u32 *)(r1 +80)
        4: (bf) r2 = r0
        5: (07) r2 += 8
        6: (2d) if r2 > r1 goto pc+1
         R0=pkt(id=0,off=0,r=8,imm=0) R1=pkt_end(id=0,off=0,imm=0)
         R2=pkt(id=0,off=8,r=8,imm=0) R6=ctx(id=0,off=0,imm=0)
         R10=fp0,call_-1
        7: (71) r0 = *(u8 *)(r0 +0)
        8: (b7) r0 = 1
        9: (95) exit
      
        from 6 to 8: safe
        processed 16 insns (limit 131072), stack depth 0+0
      
      Basically what happens is that in the subprog we make use of a
      div/mod by 0 exception and in the 'normal' subprog's exit path
      we just return skb->data back to the main prog. This has the
      implication that the verifier thinks we always get a pkt pointer
      in R0 while we still have the implicit 'return 0' from the div
      as an alternative unconditional return path earlier. Thus, R0
      then contains 0, meaning back in the parent prog we get the
      address range of [0x0, skb->data_end] as read and writeable.
      Similar programs can be crafted with other pointer register types.
      
      Since i) BPF_ABS/IND is not allowed in programs that contain
      BPF to BPF calls (and generally its use is also discouraged in
      native eBPF context), ii) unknown opcodes don't return zero
      anymore, iii) we don't return an exception code in dead branches,
      the only remaining affected case left to fix is the div/mod
      handling.
      
      What we would really need is some infrastructure to propagate
      exceptions all the way to the original prog unwinding the
      current stack and returning that code to the caller of the
      BPF program. In user space such exception handling for similar
      runtimes is typically implemented with setjmp(3) and longjmp(3)
      as one possibility which is not available in the kernel,
      though (kgdb used to implement it in kernel long time ago). I
      implemented a PoC exception handling mechanism into the BPF
      interpreter with porting setjmp()/longjmp() into x86_64 and
      adding a new internal BPF_ABRT opcode that can use a program
      specific exception code for all exception cases we have (e.g.
      div/mod by 0, unknown opcodes, etc). While this seems to work
      in the constrained BPF environment (meaning, here, we don't
      need to deal with state e.g. from memory allocations that we
      would need to undo before going into exception state), it still
      has various drawbacks: i) we would need to implement the
      setjmp()/longjmp() for every arch supported in the kernel and
      for x86_64, arm64, sparc64 JITs currently supporting calls,
      ii) it has unconditional additional cost on main program
      entry to store CPU register state in initial setjmp() call,
      and we would need some way to pass the jmp_buf down into
      ___bpf_prog_run() for main prog and all subprogs, but also
      storing on stack is not really nice (other option would be
      per-cpu storage for this, but it also has the drawback that
      we need to disable preemption for every BPF program types).
      All in all this approach would add a lot of complexity.
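For readers unfamiliar with the user-space pattern referred to above, the following is a minimal sketch of exception-style unwinding with setjmp(3) and longjmp(3). It is not the PoC described in this commit message: the function names, the exception code EXC_DIV_BY_ZERO and the single global jmp_buf are assumptions made for illustration only.

```c
#include <assert.h>
#include <setjmp.h>

#define EXC_DIV_BY_ZERO (-1) /* hypothetical exception code */

static jmp_buf exc_env; /* one global buffer: simple, but not thread-safe */

/* Innermost "subprog": aborts via longjmp() instead of returning a value. */
static unsigned int checked_div(unsigned int x, unsigned int y)
{
    if (y == 0)
        longjmp(exc_env, 1);
    return x / y;
}

/* Intermediate frame that has no error handling of its own. */
static unsigned int subprog(unsigned int x, unsigned int y)
{
    return checked_div(x, y) + 1;
}

/*
 * "Main prog" entry: setjmp() saves the register state once on entry.
 * A later longjmp() unwinds through all intermediate frames and makes
 * setjmp() return a second time with a non-zero value.
 */
int run_prog(unsigned int x, unsigned int y)
{
    if (setjmp(exc_env) != 0)
        return EXC_DIV_BY_ZERO;
    return (int)subprog(x, y);
}
```

The sketch also shows the cost noted in the text: setjmp() runs unconditionally on every entry, even when no exception is ever raised.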
      
      Another poor-man's solution would be to have some sort of
      additional shared register or scratch buffer to hold state
      for exceptions, and test that after every call return to
      chain returns and pass R0 all the way down to BPF prog caller.
      This is also problematic in various ways: i) an additional
      register doesn't map well into JITs, and some other scratch
      space could only be on per-cpu storage, which, again has the
      side-effect that this only works when we disable preemption,
      or somewhere in the input context which is not available
      everywhere either, and ii) this adds significant runtime
      overhead by putting conditionals after each and every call,
      as well as implementation complexity.
      
      Yet another option is to teach the verifier that div/mod can
      return an integer, which however is also complex to implement,
      as the verifier would need to walk such a fake 'mov r0,<code>; exit;'
      sequence and there would still be no guarantee of propagating
      this further down to the BPF caller as a proper exception code.
      For the parent prog, it is also not distinguishable from a
      normal return of a constant scalar value.
      
      The approach taken here is a completely different one with
      little complexity and no additional overhead involved in
      that we make use of the fact that a div/mod by 0 is undefined
      behavior. Instead of bailing out, we adopt the same behavior
      as on some major archs like ARMv8 [0] into eBPF as well:
      X div 0 results in 0, and X mod 0 results in X. aarch64 and
      aarch32 ISA do not generate any traps or otherwise aborts
      of program execution for unsigned divides. I verified this
      also with a test program compiled by gcc and clang, and the
      behavior matches with the spec. Going forward we adapt the
      eBPF verifier to emit such rewrites once div/mod by register
      was seen. cBPF is not touched and will keep existing 'return 0'
      semantics. Given the options, it seems the most suitable from
      all of them, also since major archs have similar schemes in
      place. Given this is all in the realm of undefined behavior,
      we still have the option to adapt if deemed necessary and
      this way we would also have the option of more flexibility
      from LLVM code generation side (which is then fully visible
      to verifier). Thus, this patch i) fixes the panic seen in
      above program and ii) doesn't bypass the verifier observations.
      
        [0] ARM Architecture Reference Manual, ARMv8 [ARM DDI 0487B.b]
            http://infocenter.arm.com/help/topic/com.arm.doc.ddi0487b.b/DDI0487B_b_armv8_arm.pdf
            1) aarch64 instruction set: section C3.4.7 and C6.2.279 (UDIV)
               "A division by zero results in a zero being written to
                the destination register, without any indication that
                the division by zero occurred."
            2) aarch32 instruction set: section F1.4.8 and F5.1.263 (UDIV)
               "For the SDIV and UDIV instructions, division by zero
                always returns a zero result."
      
      Fixes: f4d7e40a ("bpf: introduce function calls (verification)")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
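The register div/mod semantics chosen in f6b1b3bf (X div 0 results in 0, X mod 0 results in X) can be written down as two small C helpers. This is only an illustration of the resulting arithmetic for 32-bit unsigned operands. The helper names are hypothetical, and the sketch says nothing about how the verifier actually emits its rewrites.

```c
#include <assert.h>
#include <stdint.h>

/* X div 0 results in 0; otherwise ordinary unsigned division. */
uint32_t sketch_div32(uint32_t x, uint32_t y)
{
    return y ? x / y : 0;
}

/* X mod 0 results in X; otherwise ordinary unsigned remainder. */
uint32_t sketch_mod32(uint32_t x, uint32_t y)
{
    return y ? x % y : x;
}
```

With these semantics there is no hidden early exit from the subprog, so the only return path is the one the verifier has already walked.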
    • bpf: make unknown opcode handling more robust · 5e581dad
      Committed by Daniel Borkmann
      Recent findings by syzkaller fixed in 7891a87e ("bpf: arsh is
      not supported in 32 bit alu thus reject it") triggered a warning
      in the interpreter due to unknown opcode not being rejected by
      the verifier. The 'return 0' for an unknown opcode is really not
      optimal, since with BPF to BPF calls, this would go untracked by
      the verifier.
      
      Do two things here to improve the situation: i) perform basic insn
      sanity check early on in the verification phase and reject every
      non-uapi insn right there. The bpf_opcode_in_insntable() table
      reuses the same mapping as the jumptable in ___bpf_prog_run() sans
      the non-public mappings. And ii) in ___bpf_prog_run() we do need
      to BUG in the case where the verifier would ever create an unknown
      opcode due to some rewrites.
      
      Note that JITs do not have such issues since they would punt to
      interpreter in these situations. Moreover, the BPF_JIT_ALWAYS_ON
      would also help to avoid such unknown opcodes in the first place.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
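The early sanity check described in 5e581dad can be pictured as a lookup in a table indexed by the opcode byte. The sketch below is a simplified assumption of that idea and not the kernel's bpf_opcode_in_insntable(): the table holds only a handful of opcode bytes that appear in the verifier log quoted earlier on this page, and the struct, function name and return values are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative subset only; a real table would cover every public opcode. */
static const bool public_opcode[256] = {
    [0x07] = true, /* r2 += 8  */
    [0xb7] = true, /* r0 = 1   */
    [0xbf] = true, /* r6 = r1  */
    [0x85] = true, /* call     */
    [0x95] = true, /* exit     */
};

/* Hypothetical, reduced instruction: only the opcode byte matters here. */
struct sketch_insn {
    uint8_t code;
};

/*
 * Reject the whole program as soon as one instruction carries an
 * opcode that is not in the table. Returns 0 if all opcodes are known
 * and -1 otherwise.
 */
int check_opcodes(const struct sketch_insn *prog, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (!public_opcode[prog[i].code])
            return -1;
    return 0;
}
```

Checking up front means the interpreter never has to decide at run time what an unknown opcode should return.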
    • bpf: improve dead code sanitizing · 2a5418a1
      Committed by Daniel Borkmann
      Given we recently had c131187d ("bpf: fix branch pruning
      logic") and 95a762e2 ("bpf: fix incorrect sign extension in
      check_alu_op()") in particular where before verifier skipped
      verification of the wrongly assumed dead branch, we should not
      just replace the dead code parts with nops (mov r0,r0). If there
      is a bug such as fixed in 95a762e2 in future again, where
      runtime could execute those insns, then one of the potential
      issues with the current setting would be that given the nops
      would be at the end of the program, we could execute out of
      bounds at some point.
      
      The best in such case would be to just exit the BPF program
      altogether and return an exception code. However, given this
      would require two instructions, and such a dead code gap could
      just be a single insn long, we would need to place 'r0 = X; ret'
      snippet at the very end after the user program or at the start
      before the program (where we'd skip that region on prog entry),
      and then place unconditional ja's into the dead code gap.
      
      While more complex but possible, there is still another
      roadblock that currently prevents this, namely BPF to
      BPF calls. The issue here is that such an exception could be
      returned from a callee, but the caller would not know that
      it's an exception that needs to be propagated further down.
      An alternative with little complexity is to just use a ja-1
      code for now, which will trap the execution here instead of
      silently doing bad things if we ever get there due to bugs.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
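The "ja-1" replacement from 2a5418a1 can be sketched as a pass over the instruction array that overwrites every instruction the verifier never visited. BPF jump offsets are relative to the instruction that follows the jump, so an unconditional jump with offset -1 targets itself and execution spins there instead of running on. The struct layout, the seen[] array and the function name below are simplified assumptions, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define OP_JA 0x05 /* BPF_JMP | BPF_JA: unconditional jump */

/* Hypothetical, reduced instruction layout. */
struct sketch_insn {
    uint8_t code;
    int16_t off;
    int32_t imm;
};

/*
 * Overwrite every instruction that was not visited during
 * verification with a jump-to-self. seen[i] is assumed to be true
 * if instruction i was reached by the verifier.
 */
void sketch_sanitize_dead_code(struct sketch_insn *prog, const bool *seen,
                               size_t len)
{
    const struct sketch_insn trap = { .code = OP_JA, .off = -1, .imm = 0 };

    for (size_t i = 0; i < len; i++)
        if (!seen[i])
            prog[i] = trap;
}
```

A single trap instruction fits into a dead gap of any length, which is why it avoids the two-instruction 'r0 = X; ret' problem discussed above.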
    • bpf: xor of a/x in cbpf can be done in 32 bit alu · 1d621674
      Committed by Daniel Borkmann
      Very minor optimization; saves 1 byte per program in x86_64
      JIT in cBPF prologue.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • samples/bpf: Partially fixes the bpf.o build · c25ef6a5
      Committed by Mickaël Salaün
      Do not build lib/bpf/bpf.o with this Makefile but use the one from the
      library directory.  This avoids making a buggy bpf.o file (e.g. missing
      symbols).
      
      This patch is useful if some code (e.g. Landlock tests) needs both the
      bpf.o (from tools/lib/bpf) and the bpf_load.o (from samples/bpf).
      Signed-off-by: Mickaël Salaün <mic@digikod.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: clean up from test_tcpbpf_kern.c · 771fc607
      Committed by Lawrence Brakmo
      Removed commented lines from test_tcpbpf_kern.c
      
      Fixes: d6d4f60c ("bpf: add selftest for tcpbpf")
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  2. 26 Jan, 2018 14 commits
    • bpf: Use the IS_FD_ARRAY() macro in map_update_elem() · 9c147b56
      Committed by Mickaël Salaün
      Make the code more readable.
      Signed-off-by: Mickaël Salaün <mic@digikod.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'bpf-more-sock_ops-callbacks' · 82f1e0f3
      Committed by Alexei Starovoitov
      Lawrence Brakmo says:
      
      ====================
      This patchset adds support for:
      
      - direct R or R/W access to many tcp_sock fields
      - passing up to 4 arguments to sock_ops BPF functions
      - tcp_sock field bpf_sock_ops_cb_flags for controlling callbacks
      - optionally calling sock_ops BPF program when RTO fires
      - optionally calling sock_ops BPF program when packet is retransmitted
      - optionally calling sock_ops BPF program when TCP state changes
      - access to tclass and sk_txhash
      - new selftest
      
      v2: Fixed commit message 0/11. The commit is to "bpf-next" but the patch
          below used "bpf" and Patchwork didn't work correctly.
      v3: Cleaned RTO callback as per Yuchung's comment
          Added BPF enum for TCP states as per Alexei's comment
      v4: Fixed compile warnings related to detecting changes between TCP
          internal states and the BPF defined states.
      v5: Fixed comment issues in some selftest files
          Fixed access issue with u64 fields in bpf_sock_ops struct
      v6: Made fixes based on comments from Eric Dumazet:
          The field bpf_sock_ops_cb_flags was added in a hole on 64bit kernels
          Field bpf_sock_ops_cb_flags is now set through a helper function
          which returns an error when a BPF program tries to set bits for
          callbacks that are not supported in the current kernel.
          Added a comment indicating that when adding fields to bpf_sock_ops_kern
          they should be added before the field named "temp" if they need to be
          cleared before calling the BPF function.
      v7: Enforced fields "op" and "replylong[1] .. replylong[3]" not be writable
          based on comments from Eric Dumazet and Alexei Starovoitov.
          Filled 32 bit hole in bpf_sock_ops struct with sk_txhash based on
          comments from Daniel Borkmann.
          Removed unused functions (tcp_call_bpf_1arg, tcp_call_bpf_4arg) based
          on comments from Daniel Borkmann.
      v8: Add commit message 00/12
          Add Acked-by as appropriate
      v9: Moved the bug fix to the front of the patchset
          Changed RETRANS_CB so it is always called (before it was only called if
          the retransmit succeeded). It is now called with an extra argument, the
          return value of tcp_transmit_skb (0 => success). Based on comments
          from Yuchung Cheng.
          Added support for reading 2 new fields, sacked_out and lost_out, based on
          comments from Yuchung Cheng.
      v10: Moved the callback flags from include/uapi/linux/tcp.h to
           include/uapi/linux/bpf.h
           Cleaned up the test in selftest. Added a timeout so it always completes,
           even if the client is not communicating with the server. Made it faster
           by removing the sleeps. Made sure it works even when called back-to-back
           20 times.
      
      Consists of the following patches:
      [PATCH bpf-next v10 01/12] bpf: Only reply field should be writeable
      [PATCH bpf-next v10 02/12] bpf: Make SOCK_OPS_GET_TCP size
      [PATCH bpf-next v10 03/12] bpf: Make SOCK_OPS_GET_TCP struct
      [PATCH bpf-next v10 04/12] bpf: Add write access to tcp_sock and sock
      [PATCH bpf-next v10 05/12] bpf: Support passing args to sock_ops bpf
      [PATCH bpf-next v10 06/12] bpf: Adds field bpf_sock_ops_cb_flags to
      [PATCH bpf-next v10 07/12] bpf: Add sock_ops RTO callback
      [PATCH bpf-next v10 08/12] bpf: Add support for reading sk_state and
      [PATCH bpf-next v10 09/12] bpf: Add sock_ops R/W access to tclass
      [PATCH bpf-next v10 10/12] bpf: Add BPF_SOCK_OPS_RETRANS_CB
      [PATCH bpf-next v10 11/12] bpf: Add BPF_SOCK_OPS_STATE_CB
      [PATCH bpf-next v10 12/12] bpf: add selftest for tcpbpf
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: add selftest for tcpbpf · d6d4f60c
      Committed by Lawrence Brakmo
      Added a selftest for tcpbpf (sock_ops) that checks that the appropriate
      callbacks occurred and that it can access tcp_sock fields and that their
      values are correct.
      
      Run with command: ./test_tcpbpf_user
      Adding the flag "-d" will show why it did not pass.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add BPF_SOCK_OPS_STATE_CB · d4487491
      Committed by Lawrence Brakmo
      Adds support for calling sock_ops BPF program when there is a TCP state
      change. Two arguments are used; one for the old state and another for
      the new state.
      
      There is a new enum in include/uapi/linux/bpf.h that exports the TCP
      states by prepending BPF_ to the current TCP state names. If it is ever
      necessary to change the internal TCP state values (other than adding
      more to the end), then it will become necessary to convert from the
      internal TCP state value to the BPF value before calling the BPF
      sock_ops function. There is a set of compile-time checks added in tcp.c
      to detect if the internal and BPF values differ so we can make the
      necessary fixes.
      
      New op: BPF_SOCK_OPS_STATE_CB.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
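The compile-time checks mentioned in d4487491 can be approximated in plain C11 with _Static_assert. The sketch below is an assumption about the general shape of such checks, not the code added to tcp.c: it declares only three states in two local stand-in enums, and the helper to_bpf_state is hypothetical.

```c
#include <assert.h>

/* Stand-in for the kernel-internal TCP states (subset). */
enum {
    TCP_ESTABLISHED = 1,
    TCP_SYN_SENT,
    TCP_SYN_RECV,
};

/* Stand-in for the exported enum: same names with a BPF_ prefix. */
enum {
    BPF_TCP_ESTABLISHED = 1,
    BPF_TCP_SYN_SENT,
    BPF_TCP_SYN_RECV,
};

/* The build fails if the two numberings ever drift apart. */
_Static_assert((int)BPF_TCP_ESTABLISHED == (int)TCP_ESTABLISHED,
               "BPF and internal ESTABLISHED values differ");
_Static_assert((int)BPF_TCP_SYN_SENT == (int)TCP_SYN_SENT,
               "BPF and internal SYN_SENT values differ");
_Static_assert((int)BPF_TCP_SYN_RECV == (int)TCP_SYN_RECV,
               "BPF and internal SYN_RECV values differ");

/*
 * While the assertions hold, passing a state to the BPF program needs
 * no translation. If they ever fail, this is where a conversion table
 * would have to go.
 */
int to_bpf_state(int tcp_state)
{
    return tcp_state;
}
```

Renumbering one enum without the other then turns into a build error rather than a silent mismatch seen by BPF programs.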
    • bpf: Add BPF_SOCK_OPS_RETRANS_CB · a31ad29e
      Committed by Lawrence Brakmo
      Adds support for calling sock_ops BPF program when there is a
      retransmission. Three arguments are used; one for the sequence number,
      another for the number of segments retransmitted, and the last one for
      the return value of tcp_transmit_skb (0 => success).
      Does not include syn-ack retransmissions.
      
      New op: BPF_SOCK_OPS_RETRANS_CB.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add sock_ops R/W access to tclass · 6f9bd3d7
      Committed by Lawrence Brakmo
      Adds direct write access to sk_txhash and access to tclass for ipv6
      flows through getsockopt and setsockopt. Sample usage for tclass:
      
        bpf_getsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v, sizeof(v))
      
      where skops is a pointer to the ctx (struct bpf_sock_ops).
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add support for reading sk_state and more · 44f0e430
      Committed by Lawrence Brakmo
      Add support for reading many more tcp_sock fields:
      
        state          same as sk->sk_state
        rtt_min        same as sk->rtt_min.s[0].v (current rtt_min)
        snd_ssthresh
        rcv_nxt
        snd_nxt
        snd_una
        mss_cache
        ecn_flags
        rate_delivered
        rate_interval_us
        packets_out
        retrans_out
        total_retrans
        segs_in
        data_segs_in
        segs_out
        data_segs_out
        lost_out
        sacked_out
        sk_txhash
        bytes_received (__u64)
        bytes_acked    (__u64)
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add sock_ops RTO callback · f89013f6
      Committed by Lawrence Brakmo
      Adds an optional call to sock_ops BPF program based on whether the
      BPF_SOCK_OPS_RTO_CB_FLAG is set in bpf_sock_ops_cb_flags.
      The BPF program is passed 2 arguments: icsk_retransmits and whether the
      RTO has expired.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock · b13d8807
      Committed by Lawrence Brakmo
      Adds field bpf_sock_ops_cb_flags to tcp_sock and bpf_sock_ops. Its primary
      use is to determine if there should be calls to sock_ops bpf program at
      various points in the TCP code. The field is initialized to zero,
      disabling the calls. A sock_ops BPF program can set it, per connection and
      as necessary, when the connection is established.
      
      It also adds support for reading and writing the field within a
      sock_ops BPF program. Reading is done by accessing the field directly.
      However, writing is done through the helper function
      bpf_sock_ops_cb_flags_set, in order to return an error if a BPF program
      is trying to set a callback that is not supported in the current kernel
      (i.e. running an older kernel). The helper function returns 0 if it was
      able to set all of the bits set in the argument, a positive number
      containing the bits that could not be set, or -EINVAL if the socket is
      not a full TCP socket.
      
      Examples of where one could call the bpf program:
      
      1) When RTO fires
      2) When a packet is retransmitted
      3) When the connection terminates
      4) When a packet is sent
      5) When a packet is received
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
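The return-value contract of the helper described in b13d8807 (0 when every requested bit is supported, a positive value holding the unsupported bits, -EINVAL for a socket that is not a full TCP socket) can be modelled in a few lines of user-space C. Everything below is a sketch under assumptions: the flag values, the struct and the function name are hypothetical, and storing only the supported subset of bits is an assumption that the text above does not state.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag values for the callbacks this "kernel" supports. */
#define SKETCH_CB_RTO_FLAG     (1u << 0)
#define SKETCH_CB_RETRANS_FLAG (1u << 1)
#define SKETCH_CB_ALL_FLAGS    (SKETCH_CB_RTO_FLAG | SKETCH_CB_RETRANS_FLAG)

/* Hypothetical stand-in for a socket carrying the new field. */
struct sketch_sock {
    bool full_tcp;         /* true for a full TCP socket */
    unsigned int cb_flags; /* models bpf_sock_ops_cb_flags */
};

/*
 * Store the supported subset of the requested bits and report the
 * rest: 0 if all bits could be set, otherwise the bits that could
 * not be set, or -EINVAL if the socket is not a full TCP socket.
 */
int sketch_cb_flags_set(struct sketch_sock *sk, unsigned int argval)
{
    if (!sk->full_tcp)
        return -EINVAL;
    sk->cb_flags = argval & SKETCH_CB_ALL_FLAGS;
    return (int)(argval & ~SKETCH_CB_ALL_FLAGS);
}
```

A program built against newer headers can use a positive return value to learn which callbacks an older kernel lacks.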
    • bpf: Support passing args to sock_ops bpf function · de525be2
      Committed by Lawrence Brakmo
      Adds support for passing up to 4 arguments to sock_ops bpf functions. It
      reuses the reply union, so the bpf_sock_ops structures are not
      increased in size.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
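Why reusing the reply union keeps the structure size unchanged, as de525be2 states, can be seen from a reduced model. The struct below is a hypothetical cut-down version with one leading field and the union; it is not the real bpf_sock_ops layout, and the helper sketch_set_args is an illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced context: one op field followed by the union. */
struct sketch_sock_ops {
    uint32_t op;
    union {
        uint32_t args[4];      /* arguments passed into the program */
        uint32_t reply;        /* value returned by the program */
        uint32_t replylong[4]; /* same storage, viewed as four words */
    };
};

/* args[] overlays replylong[], so the struct does not grow. */
_Static_assert(sizeof(struct sketch_sock_ops) == 5 * sizeof(uint32_t),
               "adding args[] to the union must not change the size");

/* Fill in the arguments before the program is invoked. */
void sketch_set_args(struct sketch_sock_ops *ctx, uint32_t a0, uint32_t a1,
                     uint32_t a2, uint32_t a3)
{
    ctx->args[0] = a0;
    ctx->args[1] = a1;
    ctx->args[2] = a2;
    ctx->args[3] = a3;
}
```

The overlay also means that once the program writes reply, the first argument is overwritten.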
    • bpf: Add write access to tcp_sock and sock fields · b73042b8
      Committed by Lawrence Brakmo
      This patch adds a macro, SOCK_OPS_SET_FIELD, for writing to
      struct tcp_sock or struct sock fields. This required adding a new
      field "temp" to struct bpf_sock_ops_kern for temporary storage that
      is used by sock_ops_convert_ctx_access. It is used to store and recover
      the contents of a register, so the register can be used to store the
      address of the sk. Since we cannot overwrite the dst_reg because it
      contains the pointer to ctx, nor the src_reg since it contains the value
      we want to store, we need an extra register to contain the address
      of the sk.
      
      Also adds the macro SOCK_OPS_GET_OR_SET_FIELD that calls one of the
      GET or SET macros depending on the value of the TYPE field.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Make SOCK_OPS_GET_TCP struct independent · 34d367c5
      Committed by Lawrence Brakmo
      Changed SOCK_OPS_GET_TCP to SOCK_OPS_GET_FIELD and added 2
      arguments so now it can also work with struct sock fields.
      The first argument is the name of the field in the bpf_sock_ops
      struct, the 2nd argument is the name of the field in the OBJ struct.
      
      Previous: SOCK_OPS_GET_TCP(FIELD_NAME)
      New:      SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)
      
      Where OBJ is either "struct tcp_sock" or "struct sock" (without
      the quotation marks). BPF_FIELD is the name of the field in the
      bpf_sock_ops struct and OBJ_FIELD is the name of the field in the
      OBJ struct.
      
      Although the field names are currently the same, the kernel struct names
      could change in the future and this change makes it easier to support
      that.
      
      Note that adding access to tcp_sock fields in sock_ops programs does
      not preclude the tcp_sock fields from being removed as long as we are
      willing to do one of the following:
      
        1) Return a fixed value (e.g. 0 or 0xffffffff), or
        2) Make the verifier fail if that field is accessed (i.e. the
           program fails to load) so the user will know that the field
           is no longer supported.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Make SOCK_OPS_GET_TCP size independent · a33de397
      Committed by Lawrence Brakmo
      Make SOCK_OPS_GET_TCP helper macro size independent (before, it only
      worked with 4-byte fields).
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Only reply field should be writeable · 2585cd62
      Committed by Lawrence Brakmo
      Currently, a sock_ops BPF program can write the op field and all the
      reply fields (reply and replylong). This is a bug. The op field should
      not have been writeable and there is currently no way to use replylong
      field for indices >= 1. This patch enforces that only the reply field
      (which equals replylong[0]) is writeable.
      
      Fixes: 40304b2a ("bpf: BPF support for sock_ops")
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
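The rule enforced by 2585cd62 (only the reply field, which equals replylong[0], may be written) amounts to an offset check on write accesses. The following sketch assumes a reduced, hypothetical context struct and a hypothetical function name. It also assumes writes must be exactly four bytes wide, which the text above does not say.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced context: one op field followed by the union. */
struct sketch_sock_ops {
    uint32_t op;
    union {
        uint32_t reply;
        uint32_t replylong[4];
    };
};

/*
 * Allow a write only if it targets exactly the reply field.
 * Writes to op, or to replylong[1] .. replylong[3], are rejected.
 */
bool sketch_write_allowed(size_t off, size_t size)
{
    return off == offsetof(struct sketch_sock_ops, reply) &&
           size == sizeof(uint32_t);
}
```

Since reply and replylong[0] share the same offset, a write addressed as replylong[0] passes the same check.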
  3. 24 Jan, 2018 9 commits