1. 02 Nov, 2019 1 commit
  2. 31 Oct, 2019 1 commit
    • A
      bpf: Replace prog_raw_tp+btf_id with prog_tracing · f1b9509c
      Authored by Alexei Starovoitov
      The bpf program type raw_tp together with 'expected_attach_type'
      was the most appropriate api to indicate BTF-enabled raw_tp programs.
      But during development it became apparent that 'expected_attach_type'
      cannot be used and new 'attach_btf_id' field had to be introduced.
      Which means that the information is duplicated in two fields where
      one of them is ignored.
      Clean it up by introducing new program type where both
      'expected_attach_type' and 'attach_btf_id' fields have
      specific meaning.
      In the future 'expected_attach_type' will be extended
      with other attach points that have similar semantics to raw_tp.
      This patch is replacing BTF-enabled BPF_PROG_TYPE_RAW_TRACEPOINT with
        prog_type = BPF_PROG_TYPE_TRACING
        expected_attach_type = BPF_TRACE_RAW_TP
        attach_btf_id = btf_id of raw tracepoint inside the kernel
      Future patches will add
        expected_attach_type = BPF_TRACE_FENTRY or BPF_TRACE_FEXIT
      where programs have the same input context and the same helpers,
      but different attach points.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20191030223212.953010-2-ast@kernel.org
      f1b9509c
  3. 25 Oct, 2019 1 commit
    • M
      bpf: Prepare btf_ctx_access for non raw_tp use case · 38207291
      Authored by Martin KaFai Lau
      This patch makes a few changes to btf_ctx_access() to prepare
      it for the non raw_tp use case where the attach_btf_id is not
      necessarily a BTF_KIND_TYPEDEF.
      
      It moves the "btf_trace_" prefix check and typedef-follow logic to a new
      function "check_attach_btf_id()" which is called only once during
      bpf_check().  btf_ctx_access() only operates on a BTF_KIND_FUNC_PROTO
      type now. That should also be more efficient since it is done only
      once instead of every time check_ctx_access() is called.
      
      "check_attach_btf_id()" needs to find the func_proto type from
      the attach_btf_id.  It needs to store the result into the
      newly added prog->aux->attach_func_proto.  func_proto
      btf type has no name, so a proper name should be stored into
      "attach_func_name" also.
      
      v2:
      - Move the "btf_trace_" check to an earlier verifier phase (Alexei)
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191025001811.1718491-1-kafai@fb.com
      38207291
  4. 17 Oct, 2019 4 commits
    • A
      bpf: Check types of arguments passed into helpers · a7658e1a
      Authored by Alexei Starovoitov
      Introduce new helper that reuses existing skb perf_event output
      implementation, but can be called from raw_tracepoint programs
      that receive 'struct sk_buff *' as tracepoint argument or
      can walk other kernel data structures to skb pointer.
      
      In order to do that teach verifier to resolve true C types
      of bpf helpers into in-kernel BTF ids.
      The type of kernel pointer passed by raw tracepoint into bpf
      program will be tracked by the verifier all the way until
      it's passed into helper function.
      For example:
      kfree_skb() kernel function calls trace_kfree_skb(skb, loc);
      a bpf program receives that skb pointer and may eventually
      pass it into bpf_skb_output() bpf helper which in-kernel is
      implemented via bpf_skb_event_output() kernel function.
      Its first argument in the kernel is 'struct sk_buff *'.
      The verifier makes sure that types match all the way.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20191016032505.2089704-11-ast@kernel.org
      a7658e1a
    • A
      bpf: Add support for BTF pointers to x86 JIT · 3dec541b
      Authored by Alexei Starovoitov
      Pointer to BTF object is a pointer to kernel object or NULL.
      Such pointers can only be used by BPF_LDX instructions.
      The verifier changed their opcode from LDX|MEM|size
      to LDX|PROBE_MEM|size to make JITing easier.
      The number of entries in extable is the number of BPF_LDX insns
      that access kernel memory via "pointer to BTF type".
      Only these load instructions can fault.
      Since x86 extable is relative it has to be allocated in the same
      memory region as JITed code.
      Allocate it prior to last pass of JITing and let the last pass populate it.
      Pointer to extable in bpf_prog_aux is necessary to make page fault
      handling fast.
      Page fault handling is done in two steps:
      1. bpf_prog_kallsyms_find() finds BPF program that page faulted.
         It's done by walking rb tree.
      2. then extable for given bpf program is binary searched.
      This process is similar to how page faulting is done for kernel modules.
      The exception handler skips over faulting x86 instruction and
      initializes destination register with zero. This mimics exact
      behavior of bpf_probe_read (when probe_kernel_read faults dest is zeroed).
      
      JITs for other architectures can add support in similar way.
      Until then they will reject the unknown opcode and fall back to the interpreter.
      
      Since extable should be aligned and placed near JITed code
      make bpf_jit_binary_alloc() return 4 byte aligned image offset,
      so that extable aligning formula in bpf_int_jit_compile() doesn't need
      to rely on internal implementation of bpf_jit_binary_alloc().
      On x86 gcc defaults to 16-byte alignment for regular kernel functions
      due to better performance. JITed code may be aligned to 16 in the future,
      but it will use 4 in the meantime.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20191016032505.2089704-10-ast@kernel.org
      3dec541b
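      The two-step fault handling described in the commit above ends in a
      binary search of a sorted exception table, followed by a fixup that
      zeroes the destination register and skips the faulting instruction.
      The sketch below only illustrates that idea. The names toy_regs,
      ex_entry, ex_search and toy_fixup are hypothetical, and the entries
      hold absolute addresses and explicit instruction lengths for
      simplicity, whereas the commit notes that the real x86 extable is
      relative.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register file for the sketch; not a kernel structure. */
struct toy_regs {
	uintptr_t ip;    /* address of the instruction that faulted */
	uint64_t r[8];   /* general purpose registers */
};

/* Hypothetical, simplified exception table entry (sorted by insn). */
struct ex_entry {
	uintptr_t insn;     /* address of a load that may fault */
	uint8_t insn_len;   /* length of that load in bytes */
	uint8_t dst_reg;    /* register the load would have written */
};

/* Binary search over a table sorted by ascending insn address. */
const struct ex_entry *ex_search(const struct ex_entry *tbl, size_t n,
				 uintptr_t addr)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tbl[mid].insn == addr)
			return &tbl[mid];
		if (tbl[mid].insn < addr)
			lo = mid + 1;
		else
			hi = mid;
	}
	return NULL;
}

/* Returns 1 if the fault was fixed up, 0 if no entry covers regs->ip. */
int toy_fixup(const struct ex_entry *tbl, size_t n, struct toy_regs *regs)
{
	const struct ex_entry *e = ex_search(tbl, n, regs->ip);

	if (!e)
		return 0;
	regs->r[e->dst_reg] = 0;   /* mimic the zeroed destination */
	regs->ip += e->insn_len;   /* skip over the faulting load */
	return 1;
}
```

      Keeping the table sorted by instruction address is what makes the
      lookup logarithmic in the number of loads that can fault.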
    • A
      bpf: Implement accurate raw_tp context access via BTF · 9e15db66
      Authored by Alexei Starovoitov
      libbpf analyzes bpf C program, searches in-kernel BTF for given type name
      and stores it into expected_attach_type.
      The kernel verifier expects this btf_id to point to something like:
      typedef void (*btf_trace_kfree_skb)(void *, struct sk_buff *skb, void *loc);
      which represents signature of raw_tracepoint "kfree_skb".
      
      Then btf_ctx_access() matches ctx+0 access in bpf program with 'skb'
      and 'ctx+8' access with 'loc' arguments of "kfree_skb" tracepoint.
      In first case it passes btf_id of 'struct sk_buff *' back to the verifier core
      and 'void *' in second case.
      
      Then the verifier tracks PTR_TO_BTF_ID as any other pointer type.
      Like PTR_TO_SOCKET points to 'struct bpf_sock',
      PTR_TO_TCP_SOCK points to 'struct bpf_tcp_sock', and so on.
      PTR_TO_BTF_ID points to in-kernel structs.
      If 1234 is btf_id of 'struct sk_buff' in vmlinux's BTF
      then PTR_TO_BTF_ID#1234 points to one of the in-kernel skbs.
      
      When PTR_TO_BTF_ID#1234 is dereferenced (like r2 = *(u64 *)(r1 + 32))
      the btf_struct_access() checks which field of 'struct sk_buff' is
      at offset 32. Checks that size of access matches type definition
      of the field and continues to track the dereferenced type.
      If that field was a pointer to 'struct net_device' the r2's type
      will be PTR_TO_BTF_ID#456. Where 456 is btf_id of 'struct net_device'
      in vmlinux's BTF.
      
      Such verifier analysis prevents "cheating" in BPF C program.
      The program cannot cast arbitrary pointer to 'struct sk_buff *'
      and access it. C compiler would allow type cast, of course,
      but the verifier will notice type mismatch based on BPF assembly
      and in-kernel BTF.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20191016032505.2089704-7-ast@kernel.org
      9e15db66
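      The dereference check described in the commit above (find the field
      at the accessed offset, compare the access size, then keep tracking
      the pointed-to type) can be modelled with a small table lookup. This
      is a toy model, not the kernel's btf_struct_access(): toy_field,
      toy_struct and toy_struct_access are hypothetical names, real BTF
      handles nested members, arrays and bitfields, and the ids 1234 and
      456 are simply the example numbers used in the message.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flattened description of one struct member. */
struct toy_field {
	uint32_t off;      /* byte offset inside the struct */
	uint32_t size;     /* size of the member in bytes */
	int ptr_type_id;   /* type id of the pointee, 0 for a scalar */
};

/* Hypothetical description of a struct known by its type id. */
struct toy_struct {
	int id;
	const struct toy_field *fields;
	size_t nfields;
};

/*
 * Check a load of 'size' bytes at offset 'off'.
 * Returns -1 to reject the access, 0 if the result is a scalar, or the
 * type id of the struct the loaded pointer refers to.
 */
int toy_struct_access(const struct toy_struct *t, uint32_t off,
		      uint32_t size)
{
	size_t i;

	for (i = 0; i < t->nfields; i++) {
		const struct toy_field *f = &t->fields[i];

		if (f->off != off)
			continue;
		/* The access size must match the member definition. */
		return f->size == size ? f->ptr_type_id : -1;
	}
	return -1; /* no member starts at this offset */
}
```

      With a member of type id 456 at offset 32, an 8-byte load from
      offset 32 yields 456 in this model, mirroring the way r2 becomes
      PTR_TO_BTF_ID#456 in the example from the message.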
    • A
      bpf: Add attach_btf_id attribute to program load · ccfe29eb
      Authored by Alexei Starovoitov
      Add attach_btf_id attribute to prog_load command.
      It's similar to existing expected_attach_type attribute which is
      used in several cgroup based program types.
      Unfortunately expected_attach_type is ignored for
      tracing programs and cannot be reused for new purpose.
      Hence introduce attach_btf_id to verify bpf programs against
      given in-kernel BTF type id at load time.
      It is strictly checked to be valid for raw_tp programs only.
      In later patches it will become:
        btf_id == 0: semantics of existing raw_tp progs.
        btf_id > 0: raw_tp with BTF and additional type safety.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20191016032505.2089704-5-ast@kernel.org
      ccfe29eb
  5. 12 Oct, 2019 1 commit
  6. 21 Aug, 2019 1 commit
  7. 18 Aug, 2019 1 commit
  8. 30 Jul, 2019 2 commits
  9. 28 Jun, 2019 1 commit
    • S
      bpf: implement getsockopt and setsockopt hooks · 0d01da6a
      Authored by Stanislav Fomichev
      Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
      BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
      
      BPF_CGROUP_SETSOCKOPT can modify user setsockopt arguments before
      passing them down to the kernel or bypass kernel completely.
      BPF_CGROUP_GETSOCKOPT can inspect/modify getsockopt arguments that
      kernel returns.
      Both hooks reuse existing PTR_TO_PACKET{,_END} infrastructure.
      
      The buffer memory is pre-allocated (because I don't think there is
      a precedent for working with __user memory from bpf). This might be
      slow to do for each {s,g}etsockopt call, that's why I've added
      __cgroup_bpf_prog_array_is_empty that exits early if there is nothing
      attached to a cgroup. Note, however, that there is a race between
      __cgroup_bpf_prog_array_is_empty and BPF_PROG_RUN_ARRAY where cgroup
      program layout might have changed; this should not be a problem
      because in general there is a race between multiple calls to
      {s,g}etsockopt and user adding/removing bpf progs from a cgroup.
      
      The return code of the BPF program is handled as follows:
      * 0: EPERM
      * 1: success, continue with next BPF program in the cgroup chain
      
      v9:
      * allow overwriting setsockopt arguments (Alexei Starovoitov):
        * use set_fs (same as kernel_setsockopt)
        * buffer is always kzalloc'd (no small on-stack buffer)
      
      v8:
      * use s32 for optlen (Andrii Nakryiko)
      
      v7:
      * return only 0 or 1 (Alexei Starovoitov)
      * always run all progs (Alexei Starovoitov)
      * use optval=0 as kernel bypass in setsockopt (Alexei Starovoitov)
        (decided to use optval=-1 instead, optval=0 might be a valid input)
      * call getsockopt hook after kernel handlers (Alexei Starovoitov)
      
      v6:
      * rework cgroup chaining; stop as soon as bpf program returns
        0 or 2; see patch with the documentation for the details
      * drop Andrii's and Martin's Acked-by (not sure they are comfortable
        with the new state of things)
      
      v5:
      * skip copy_to_user() and put_user() when ret == 0 (Martin Lau)
      
      v4:
      * don't export bpf_sk_fullsock helper (Martin Lau)
      * size != sizeof(__u64) for uapi pointers (Martin Lau)
      * offsetof instead of bpf_ctx_range when checking ctx access (Martin Lau)
      
      v3:
      * typos in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY comments (Andrii Nakryiko)
      * reverse christmas tree in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY (Andrii
        Nakryiko)
      * use __bpf_md_ptr instead of __u32 for optval{,_end} (Martin Lau)
      * use BPF_FIELD_SIZEOF() for consistency (Martin Lau)
      * new CG_SOCKOPT_ACCESS macro to wrap repeated parts
      
      v2:
      * moved bpf_sockopt_kern fields around to remove a hole (Martin Lau)
      * aligned bpf_sockopt_kern->buf to 8 bytes (Martin Lau)
      * bpf_prog_array_is_empty instead of bpf_prog_array_length (Martin Lau)
      * added [0,2] return code check to verifier (Martin Lau)
      * dropped unused buf[64] from the stack (Martin Lau)
      * use PTR_TO_SOCKET for bpf_sockopt->sk (Martin Lau)
      * dropped bpf_target_off from ctx rewrites (Martin Lau)
      * use return code for kernel bypass (Martin Lau & Andrii Nakryiko)
      
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: Martin Lau <kafai@fb.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      0d01da6a
  10. 15 Jun, 2019 2 commits
  11. 11 Jun, 2019 1 commit
  12. 01 Jun, 2019 4 commits
    • R
      bpf: move memory size checks to bpf_map_charge_init() · c85d6913
      Authored by Roman Gushchin
      Most bpf map types do similar checks and bytes-to-pages
      conversion during memory allocation and charging.
      
      Let's unify these checks by moving them into bpf_map_charge_init().
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c85d6913
    • R
      bpf: rework memlock-based memory accounting for maps · b936ca64
      Authored by Roman Gushchin
      In order to unify the existing memlock charging code with the
      memcg-based memory accounting, which will be added later, let's
      rework the current scheme.
      
      Currently the following design is used:
        1) .alloc() callback optionally checks if the allocation will likely
           succeed using bpf_map_precharge_memlock()
        2) .alloc() performs actual allocations
        3) .alloc() callback calculates map cost and sets map.memory.pages
        4) map_create() calls bpf_map_init_memlock() which sets map.memory.user
           and performs actual charging; in case of failure the map is
           destroyed
        <map is in use>
        1) bpf_map_free_deferred() calls bpf_map_release_memlock(), which
           performs uncharge and releases the user
        2) .map_free() callback releases the memory
      
      The scheme can be simplified and made more robust:
        1) .alloc() calculates map cost and calls bpf_map_charge_init()
        2) bpf_map_charge_init() sets map.memory.user and performs actual
          charge
        3) .alloc() performs actual allocations
        <map is in use>
        1) .map_free() callback releases the memory
        2) bpf_map_charge_finish() performs uncharge and releases the user
      
      The new scheme also allows reusing the bpf_map_charge_init()/finish()
      functions for memcg-based accounting. Because charges are performed
      before actual allocations and uncharges after freeing the memory,
      no bogus memory pressure can be created.
      
      In cases when the map structure is not available (e.g. it's not
      created yet, or is already destroyed), on-stack bpf_map_memory
      structure is used. The charge can be transferred with the
      bpf_map_charge_move() function.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      b936ca64
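      The simplified scheme in the commit above boils down to an ordering
      rule: charge first, allocate second, and on teardown free first,
      uncharge last. The sketch below models only that ordering in user
      space. All toy_* names, the per-user page limit and the 4096-byte
      page size are assumptions made for the illustration. The real
      bpf_map_charge_init() and bpf_map_charge_finish() operate on kernel
      user accounting, which is not reproduced here.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-user memlock accounting. */
struct toy_user {
	uint64_t locked_pages;
	uint64_t limit_pages;
};

/* Mirrors the idea of grouping "user" and "pages" in one structure. */
struct toy_map_memory {
	struct toy_user *user;
	uint32_t pages;
};

/* Charge before any allocation happens; fail early if over the limit. */
int toy_charge_init(struct toy_map_memory *mem, struct toy_user *user,
		    uint32_t pages)
{
	if (user->locked_pages + pages > user->limit_pages)
		return -1;
	user->locked_pages += pages;
	mem->user = user;
	mem->pages = pages;
	return 0;
}

/* Uncharge and release the user; called only after memory is freed. */
void toy_charge_finish(struct toy_map_memory *mem)
{
	mem->user->locked_pages -= mem->pages;
	mem->user = NULL;
	mem->pages = 0;
}

/* 1) charge, 2) allocate; undo the charge if the allocation fails. */
void *toy_map_alloc(struct toy_map_memory *mem, struct toy_user *user,
		    uint32_t pages)
{
	void *p;

	if (toy_charge_init(mem, user, pages))
		return NULL;
	p = calloc(pages, 4096);
	if (!p)
		toy_charge_finish(mem);
	return p;
}

/* 1) free the memory, 2) uncharge. */
void toy_map_free(void *p, struct toy_map_memory *mem)
{
	free(p);
	toy_charge_finish(mem);
}
```

      Because the charge is taken before the allocation and dropped after
      the free, the accounted amount never falls below what is actually
      allocated, which is the property the message relies on.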
    • R
      bpf: group memory related fields in struct bpf_map_memory · 3539b96e
      Authored by Roman Gushchin
      Group "user" and "pages" fields of bpf_map into the bpf_map_memory
      structure. Later it can be extended with "memcg" and other related
      information.
      
      The main reason for such a change (besides cosmetics) is to pass
      bpf_map_memory structure to charging functions before the actual
      allocation of bpf_map.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      3539b96e
    • B
      bpf: Create BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY · 1f52f6c0
      Authored by brakmo
      Create new macro BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY() to be used by
      __cgroup_bpf_run_filter_skb for EGRESS BPF progs so BPF programs can
      request cwr for TCP packets.
      
      Current cgroup skb programs can only return 0 or 1 (0 to drop the
      packet). This macro changes the behavior so the low order bit
      indicates whether the packet should be dropped (0) or not (1)
      and the next bit is used for congestion notification (cn).
      
      Hence, new allowed return values of CGROUP EGRESS BPF programs are:
        0: drop packet
        1: keep packet
        2: drop packet and call cwr
        3: keep packet and call cwr
      
      This macro then converts it to one of NET_XMIT values or -EPERM
      that has the effect of dropping the packet with no cn.
        0: NET_XMIT_SUCCESS  skb should be transmitted (no cn)
        1: NET_XMIT_DROP     skb should be dropped and cwr called
        2: NET_XMIT_CN       skb should be transmitted and cwr called
        3: -EPERM            skb should be dropped (no cn)
      
      Note that when more than one BPF program is called, the packet is
      dropped if at least one of programs requests it be dropped, and
      there is cn if at least one program returns cn.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1f52f6c0
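      The return-value handling described in the commit above has two
      stages: the per-program values (0 to 3) are folded into a "keep" bit
      and a "cn" bit, and the pair is then mapped to a NET_XMIT code or
      -EPERM. In the second table of the message the leading number is the
      numeric value of the NET_XMIT code, not a program return value. The
      sketch below is a plain function rather than the kernel macro.
      egress_combine is a hypothetical name, and the XMIT_* constants are
      local stand-ins whose values are assumed to match NET_XMIT_SUCCESS,
      NET_XMIT_DROP and NET_XMIT_CN.

```c
#include <errno.h>
#include <stddef.h>

/* Local stand-ins for the kernel's NET_XMIT_* codes (assumed values). */
enum { XMIT_SUCCESS = 0, XMIT_DROP = 1, XMIT_CN = 2 };

/* Combine the return values (0..3) of every program in the chain. */
int egress_combine(const unsigned int *rets, size_t n)
{
	unsigned int keep = 1; /* cleared once any program asks to drop */
	unsigned int cn = 0;   /* set once any program asks for cn */
	size_t i;

	for (i = 0; i < n; i++) {
		keep &= rets[i] & 1; /* low order bit: keep (1) or drop (0) */
		cn |= rets[i] & 2;   /* next bit: congestion notification */
	}

	if (keep)
		return cn ? XMIT_CN : XMIT_SUCCESS;
	return cn ? XMIT_DROP : -EPERM;
}
```

      For a chain returning {1, 3} the packet is kept and cn is signalled,
      while {0, 1} drops the packet without cn and yields -EPERM.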
  13. 31 May, 2019 1 commit
  14. 29 May, 2019 1 commit
    • S
      bpf: remove __rcu annotations from bpf_prog_array · 54e9c9d4
      Authored by Stanislav Fomichev
      Drop __rcu annotations and rcu read sections from bpf_prog_array
      helper functions. They are not needed since all existing callers
      call those helpers from the rcu update side while holding a mutex.
      This guarantees that use-after-free could not happen.
      
      In the next patches I'll fix the callers with missing
      rcu_dereference_protected to make sparse/lockdep happy, the proper
      way to use these helpers is:
      
      	struct bpf_prog_array __rcu *progs = ...;
      	struct bpf_prog_array *p;
      
      	mutex_lock(&mtx);
      	p = rcu_dereference_protected(progs, lockdep_is_held(&mtx));
      	bpf_prog_array_length(p);
      	bpf_prog_array_copy_to_user(p, ...);
      	bpf_prog_array_delete_safe(p, ...);
      	bpf_prog_array_copy_info(p, ...);
      	bpf_prog_array_copy(p, ...);
      	bpf_prog_array_free(p);
      	mutex_unlock(&mtx);
      
      No functional changes! rcu_dereference_protected with lockdep_is_held
      should catch any cases where we update prog array without a mutex
      (I've looked at existing call sites and I think we hold a mutex
      everywhere).
      
      Motivation is to fix sparse warnings:
      kernel/bpf/core.c:1803:9: warning: incorrect type in argument 1 (different address spaces)
      kernel/bpf/core.c:1803:9:    expected struct callback_head *head
      kernel/bpf/core.c:1803:9:    got struct callback_head [noderef] <asn:4> *
      kernel/bpf/core.c:1877:44: warning: incorrect type in initializer (different address spaces)
      kernel/bpf/core.c:1877:44:    expected struct bpf_prog_array_item *item
      kernel/bpf/core.c:1877:44:    got struct bpf_prog_array_item [noderef] <asn:4> *
      kernel/bpf/core.c:1901:26: warning: incorrect type in assignment (different address spaces)
      kernel/bpf/core.c:1901:26:    expected struct bpf_prog_array_item *existing
      kernel/bpf/core.c:1901:26:    got struct bpf_prog_array_item [noderef] <asn:4> *
      kernel/bpf/core.c:1935:26: warning: incorrect type in assignment (different address spaces)
      kernel/bpf/core.c:1935:26:    expected struct bpf_prog_array_item *[assigned] existing
      kernel/bpf/core.c:1935:26:    got struct bpf_prog_array_item [noderef] <asn:4> *
      
      v2:
      * remove comment about potential race; that can't happen
        because all callers are in rcu-update section
      
      Cc: Roman Gushchin <guro@fb.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      54e9c9d4
  15. 25 May, 2019 1 commit
    • J
      bpf: verifier: insert zero extension according to analysis result · a4b1d3c1
      Authored by Jiong Wang
      After previous patches, verifier will mark an insn if it really needs zero
      extension on dst_reg.
      
      It is then for back-ends to decide how to use such information to eliminate
      unnecessary zero extension code-gen during JIT compilation.
      
      One approach is for the verifier to insert explicit zero extension for
      those insns that need zero extension in a generic way; JIT back-ends then
      do not generate zero extension for sub-register writes by default.
      
      However, only those back-ends which do not have hardware zero extension
      want this optimization. Back-ends like x86_64 and AArch64 have hardware
      zero extension support, so the insertion should be disabled for them.
      
      This patch introduces new target hook "bpf_jit_needs_zext" which returns
      false by default, meaning verifier zero extension insertion is disabled by
      default. A back-end could override this hook to return true if it doesn't
      have hardware support and wants the verifier to insert zero extension
      explicitly.
      
      Offload targets do not use this native target hook, instead, they could
      get the optimization results using bpf_prog_offload_ops.finalize.
      
      NOTE: arches could have diversified features, it is possible for one arch
      to have hardware zero extension support for some sub-register write insns
      but not for all. For example, PowerPC and SPARC have zero extended loads,
      but not for alu32. So when verifier zero extension insertion is enabled,
      these JIT back-ends need to peephole insns to remove the zero extensions
      inserted for insns that actually have hardware zero extension support. The
      peephole could be as simple as looking at the next insn: if it is a special
      zero extension insn then it is safe to eliminate it if the current insn has
      hardware zero extension support.
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      a4b1d3c1
  16. 15 May, 2019 1 commit
  17. 28 Apr, 2019 1 commit
    • M
      bpf: Introduce bpf sk local storage · 6ac99e8f
      Authored by Martin KaFai Lau
      After allowing a bpf prog to
      - directly read the skb->sk ptr
      - get the fullsock bpf_sock by "bpf_sk_fullsock()"
      - get the bpf_tcp_sock by "bpf_tcp_sock()"
      - get the listener sock by "bpf_get_listener_sock()"
      - avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock"
        into different bpf running context.
      
      this patch is another effort to make bpf's network programming
      more intuitive to do (together with memory and performance benefit).
      
      When bpf prog needs to store data for a sk, the current practice is to
      define a map with the usual 4-tuples (src/dst ip/port) as the key.
      If multiple bpf progs need to store different sk data, multiple maps
      have to be defined, wasting memory to store the duplicated
      keys (i.e. 4 tuples here) in each of the bpf maps.
      [ The smallest key could be the sk pointer itself which requires
        some enhancement in the verifier and it is a separate topic. ]
      
      Also, the bpf prog needs to clean up the elem when sk is freed.
      Otherwise, the bpf map will become full and un-usable quickly.
      The sk-free tracking currently could be done during sk state
      transition (e.g. BPF_SOCK_OPS_STATE_CB).
      
      The size of the map needs to be predefined, which usually ends up
      with an over-provisioned map in production.  Even if the map were resizable,
      sks naturally come and go already, so this potential resize
      operation is arguably redundant if the data can be directly connected
      to the sk itself instead of proxying through a bpf map.
      
      This patch introduces sk->sk_bpf_storage to provide local storage space
      at sk for bpf prog to use.  The space will be allocated when the first bpf
      prog has created data for this particular sk.
      
      The design optimizes the bpf prog's lookup (and then optionally followed by
      an inline update).  bpf_spin_lock should be used if the inline update needs
      to be protected.
      
      BPF_MAP_TYPE_SK_STORAGE:
      -----------------------
      To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in
      this patch) needs to be created.  Multiple BPF_MAP_TYPE_SK_STORAGE maps can
      be created to fit different bpf progs' needs.  The map enforces
      BTF to allow printing the sk-local-storage during a system-wide
      sk dump (e.g. "ss -ta") in the future.
      
      The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not to lookup/update/delete
      "sk-local-storage" data from a particular sk.
      Think of the map as a meta-data (or "type") of a "sk-local-storage".  This
      particular "type" of "sk-local-storage" data can then be stored in any sk.
      
      The main purposes of this map are mostly:
      1. Define the size of a "sk-local-storage" type.
      2. Provide a similar syscall userspace API as the map (e.g. lookup/update,
         map-id, map-btf...etc.)
      3. Keep track of all sk's storages of this "type" and clean them up
         when the map is freed.
      
      sk->sk_bpf_storage:
      ------------------
      The main lookup/update/delete is done on sk->sk_bpf_storage (which
      is a "struct bpf_sk_storage").  When doing a lookup,
      the "map" pointer is now used as the "key" to search on the
      sk_storage->list.  The "map" pointer is actually serving
      as the "type" of the "sk-local-storage" that is being
      requested.
      
      To allow very fast lookup, it should be as fast as looking up an
      array at a stable-offset.  At the same time, it is not ideal to
      set a hard limit on the number of sk-local-storage "type" that the
      system can have.  Hence, this patch takes a cache approach.
      The last search result from sk_storage->list is cached in
      sk_storage->cache[] which is a stable sized array.  Each
      "sk-local-storage" type has a stable offset to the cache[] array.
      In the future, a map's flag could be introduced to do cache
      opt-out/enforcement if it became necessary.
      
      The cache size is 16 (i.e. 16 types of "sk-local-storage").
      Programs can share map.  On the program side, having a few bpf_progs
      running in the networking hotpath is already a lot.  The bpf_prog
      should have already consolidated the existing sock-key-ed map usage
      to minimize the map lookup penalty.  16 has enough runway to grow.
      
      All sk-local-storage data will be removed from sk->sk_bpf_storage
      during sk destruction.
      
      bpf_sk_storage_get() and bpf_sk_storage_delete():
      ------------------------------------------------
      Instead of using bpf_map_(lookup|update|delete)_elem(),
      the bpf prog needs to use the new helper bpf_sk_storage_get() and
      bpf_sk_storage_delete().  The verifier can then enforce the
      ARG_PTR_TO_SOCKET argument.  The bpf_sk_storage_get() also allows to
      "create" new elem if one does not exist in the sk.  It is done by
      the new BPF_SK_STORAGE_GET_F_CREATE flag.  An optional value can also be
      provided as the initial value during BPF_SK_STORAGE_GET_F_CREATE.
      The BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock.  Together,
      it has eliminated the potential use cases for an equivalent
      bpf_map_update_elem() API (for bpf_prog) in this patch.
      
      Misc notes:
      ----------
      1. map_get_next_key is not supported.  From the userspace syscall
         perspective,  the map has the socket fd as the key while the map
         can be shared by pinned-file or map-id.
      
         Since btf is enforced, the existing "ss" could be enhanced to pretty
         print the local-storage.
      
         Supporting a kernel defined btf with 4 tuples as the return key could
         be explored later also.
      
      2. The sk->sk_lock cannot be acquired.  Atomic operations are used instead.
         e.g. cmpxchg is done on the sk->sk_bpf_storage ptr.
         Please refer to the source code comments for the details in
         synchronization cases and considerations.
      
      3. The mem is charged to the sk->sk_omem_alloc as the sk filter does.
      
      Benchmark:
      ---------
      Here is the benchmark data collected by turning on
      the "kernel.bpf_stats_enabled" sysctl.
      Two bpf progs are tested:
      
      One bpf prog with the usual bpf hashmap (max_entries = 8192) with the
      sk ptr as the key. (The verifier is modified to support the sk ptr as the
      key. That should have shortened the key lookup time.)
      
      Another bpf prog is with the new BPF_MAP_TYPE_SK_STORAGE.
      
      Both are storing a "u32 cnt", do a lookup on "egress_skb/cgroup" for
      each egress skb and then bump the cnt.  netperf is used to drive
      data with 4096 connected UDP sockets.
      
      BPF_MAP_TYPE_HASH with a modified verifier (152ns per bpf run)
      27: cgroup_skb  name egress_sk_map  tag 74f56e832918070b run_time_ns 58280107540 run_cnt 381347633
          loaded_at 2019-04-15T13:46:39-0700  uid 0
          xlated 344B  jited 258B  memlock 4096B  map_ids 16
          btf_id 5
      
      BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run)
      30: cgroup_skb  name egress_sk_stora  tag d4aa70984cc7bbf6 run_time_ns 25617093319 run_cnt 390989739
          loaded_at 2019-04-15T13:47:54-0700  uid 0
          xlated 168B  jited 156B  memlock 4096B  map_ids 17
          btf_id 6
      
      Here is a high-level picture of how the objects are organized:
      
             sk
          ┌──────┐
          │      │
          │      │
          │      │
          │*sk_bpf_storage───── bpf_sk_storage
          └──────┘                 ┌───────┐
                       ┌───────────┤ list  │
                       │           │       │
                       │           │       │
                       │           │       │
                       │           └───────┘
                       │
                       │     elem
                       │  ┌────────┐
                       ├─│ snode  │
                       │  ├────────┤
                       │  │  data  │          bpf_map
                       │  ├────────┤        ┌─────────┐
                       │  │map_node│─┬─────┤  list   │
                       │  └────────┘  │     │         │
                       │              │     │         │
                       │     elem     │     │         │
                       │  ┌────────┐  │     └─────────┘
                       └─│ snode  │  │
                          ├────────┤  │
         bpf_map          │  data  │  │
       ┌─────────┐        ├────────┤  │
       │  list   ├───────│map_node│  │
       │         │        └────────┘  │
       │         │                    │
       │         │           elem     │
       └─────────┘        ┌────────┐  │
                       ┌─│ snode  │  │
                       │  ├────────┤  │
                       │  │  data  │  │
                       │  ├────────┤  │
                       │  │map_node│─┘
                       │  └────────┘
                       │
                       │
                       │          ┌───────┐
           sk          └──────────│ list  │
        ┌──────┐                  │       │
        │      │                  │       │
        │      │                  │       │
        │      │                  └───────┘
        │*sk_bpf_storage───────bpf_sk_storage
        └──────┘
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      6ac99e8f
  18. 27 Apr, 2019 1 commit
    • M
      bpf: add writable context for raw tracepoints · 9df1c28b
      Matt Mullins committed
      This is an opt-in interface that allows a tracepoint to provide a safe
      buffer that can be written from a BPF_PROG_TYPE_RAW_TRACEPOINT program.
      The size of the buffer must be a compile-time constant, and is checked
      before allowing a BPF program to attach to a tracepoint that uses this
      feature.
      
      The pointer to this buffer will be the first argument of tracepoints
      that opt in; the pointer is valid and can be read with bpf_probe_read()
      by both BPF_PROG_TYPE_RAW_TRACEPOINT and
      BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE programs that attach to such a
      tracepoint, but the buffer to which it points may only be written by
      the latter.
      Signed-off-by: Matt Mullins <mmullins@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9df1c28b
  19. 26 Apr, 2019 1 commit
  20. 13 Apr, 2019 2 commits
    • A
      bpf: Introduce bpf_strtol and bpf_strtoul helpers · d7a4cb9b
      Andrey Ignatov committed
      Add bpf_strtol and bpf_strtoul to convert a string to long and unsigned
      long respectively. They are similar to user space strtol(3) and
      strtoul(3) with a few changes to the API:
      
      * instead of NUL-terminated C string the helpers expect buffer and
        buffer length;
      
      * resulting long or unsigned long is returned in a separate
        result-argument;
      
      * return value is used to indicate success or failure; on success the
        number of consumed bytes is returned, which can be used to identify the
        position to read next if the buffer is expected to contain multiple
        integers;
      
      * instead of *base* argument, *flags* is used that provides base in 5
        LSB, other bits are reserved for future use;
      
      * number of supported bases is limited.
      
      Documentation for the new helpers is provided in bpf.h UAPI.
      
      The helpers are made available to BPF_PROG_TYPE_CGROUP_SYSCTL programs to
      be able to convert string input to e.g. "ulongvec" output.
      
      E.g. "net/ipv4/tcp_mem" consists of three ulong integers. They can be
      parsed by calling bpf_strtoul three times.
      
      Implementation notes:
      
      Implementation includes "../../lib/kstrtox.h" to reuse integer parsing
      functions. It's done exactly the same way as fs/proc/base.c already does.
      
      Unfortunately the existing kstrtoX functions can't be used directly since
      they fail if any invalid character is present right after the integer in
      the string. The existing simple_strtoX functions can't be used either
      since they're obsolete and don't handle overflow properly.
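      
      The calling convention described above can be sketched in plain user
      space C. This is only an illustration, not the kernel helper: the name
      sketch_strtoul(), the SKETCH_BASE_MASK constant, the accepted bases
      (8, 10, 16), the whitespace skipping and the error codes are all
      assumptions of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical: the base is carried in the 5 least significant flag bits. */
#define SKETCH_BASE_MASK 0x1fUL

static int sketch_digit(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Parse an unsigned long from a (buffer, length) pair instead of a
 * NUL-terminated string. The value goes to *res; the return value is the
 * number of consumed bytes on success or a negative errno on failure.
 */
static int sketch_strtoul(const char *buf, size_t buf_len,
			  unsigned long flags, unsigned long *res)
{
	unsigned long base = flags & SKETCH_BASE_MASK;
	unsigned long acc = 0;
	size_t i = 0, start;

	assert(buf && res);
	if (flags & ~SKETCH_BASE_MASK)
		return -EINVAL;		/* reserved bits must be zero */
	if (base != 8 && base != 10 && base != 16)
		return -EINVAL;		/* limited set of bases (assumed) */

	while (i < buf_len && (buf[i] == ' ' || buf[i] == '\t'))
		i++;			/* skip leading blanks (assumed) */
	start = i;
	for (; i < buf_len; i++) {
		int d = sketch_digit(buf[i]);

		if (d < 0 || (unsigned long)d >= base)
			break;		/* trailing garbage ends the number */
		if (acc > (ULONG_MAX - (unsigned long)d) / base)
			return -ERANGE;	/* would overflow unsigned long */
		acc = acc * base + (unsigned long)d;
	}
	if (i == start)
		return -EINVAL;		/* no digits found */
	*res = acc;
	return (int)i;
}

/* Parse three integers, tcp_mem style, by chaining the consumed counts. */
static int sketch_parse_three(const char *buf, size_t len,
			      unsigned long out[3])
{
	size_t off = 0;

	for (int i = 0; i < 3; i++) {
		int n = sketch_strtoul(buf + off, len - off, 10, &out[i]);

		if (n < 0)
			return n;
		off += (size_t)n;
	}
	return 0;
}
```

      Returning the consumed byte count is what makes the three-call pattern
      work: each call resumes exactly where the previous one stopped, and a
      non-digit after the number ends the parse instead of failing it.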
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d7a4cb9b
    • A
      bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types · 57c3bb72
      Andrey Ignatov committed
      Currently the way to pass a result from a BPF helper to a BPF program is
      to provide a memory area defined by pointer and size: func(void *, size_t).
      
      It works great for the generic use-case, but for simple types, such as int,
      it's overkill and consumes two arguments when it could use just one.
      
      Introduce new argument types ARG_PTR_TO_INT and ARG_PTR_TO_LONG to be
      able to pass a result from helper to program via pointer to int and long
      respectively: func(int *) or func(long *).
      
      New argument types are similar to ARG_PTR_TO_MEM with the following
      differences:
      * they don't require a corresponding ARG_CONST_SIZE argument; predefined
        access sizes are used instead (32bit for int, 64bit for long);
      * it's possible to use more than one such argument in a helper;
      * provided pointers have to be aligned.
      
      It would be easy to introduce similar ARG_PTR_TO_CHAR and ARG_PTR_TO_SHORT
      argument types. It's not done due to lack of a use-case though.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      57c3bb72
  21. 12 Apr, 2019 1 commit
  22. 10 Apr, 2019 3 commits
    • D
      bpf: add syscall side map freeze support · 87df15de
      Daniel Borkmann committed
      This patch adds a new BPF_MAP_FREEZE command which allows
      "freezing" the map globally as read-only / immutable from the syscall
      side.
      
      Map permission handling has been refactored into map_get_sys_perms()
      and drops FMODE_CAN_WRITE in case of a locked map. The main use case is
      to allow for setting up .rodata sections from the BPF ELF which
      are loaded into the kernel, meaning the BPF loader first allocates the
      map, sets up the map value by copying the .rodata section into it and,
      once complete, calls BPF_MAP_FREEZE on the map fd to prevent further
      modifications.
      
      Right now BPF_MAP_FREEZE only takes a map fd as argument while the
      remaining bpf_attr members are required to be zero. I didn't add write-only
      locking here as a counterpart since I don't have a concrete use-case
      for it on my side, and I think it probably makes more sense to wait
      until there actually is one. In that case bpf_attr can be extended
      as usual with a flag field and/or others where flag 0 means that
      we lock the map read-only, hence this doesn't prevent adding further
      extensions to BPF_MAP_FREEZE upon need.
      
      A map creation flag like BPF_F_WRONCE was not considered for a couple
      of reasons: i) in case of a generic implementation, a map can consist
      of more than just one element, thus there could be multiple map
      updates needed to set the map into a state where it can then be
      made immutable, ii) WRONCE indicates exact one-time write before
      it is then set immutable. A generic implementation would set a bit
      atomically on map update entry (if unset), indicating that every
      subsequent update from then onwards will need to bail out there.
      However, map updates can fail, so upon failure that flag would need
      to be unset again and the update attempt would need to be repeated
      for it to be eventually made immutable. While this can be made
      race-free, this approach feels less clean and in combination with
      reason i), it's not generic enough. A dedicated BPF_MAP_FREEZE
      command directly sets the flag and caller has the guarantee that
      map is immutable from syscall side upon successful return for any
      future syscall invocations that would alter the map state, which
      is also more intuitive from an API point of view. A command name
      such as BPF_MAP_LOCK has been avoided as it's too close to BPF
      map spin locks (which already have the BPF_F_LOCK flag). BPF_MAP_FREEZE
      is so far only enabled for privileged users.
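      
      The loader flow described above (allocate, copy .rodata, freeze) can be
      modelled with a toy user space structure. This is a sketch only: struct
      toy_map and the toy_* functions are hypothetical names, nothing here
      calls bpf(2), and the chosen error codes are assumptions of the model.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for a single-value map; not a kernel structure. */
struct toy_map {
	uint8_t value[64];
	bool frozen;		/* set once by toy_map_freeze(), never cleared */
};

/* Models a syscall-side update: rejected once the map has been frozen. */
static int toy_map_update(struct toy_map *m, const void *src, size_t len)
{
	assert(m && src);
	if (m->frozen)
		return -EPERM;
	if (len > sizeof(m->value))
		return -E2BIG;
	memcpy(m->value, src, len);
	return 0;
}

/* Models the freeze command: a one-way switch to read-only. */
static int toy_map_freeze(struct toy_map *m)
{
	assert(m);
	if (m->frozen)
		return -EBUSY;	/* already frozen (error code assumed) */
	m->frozen = true;
	return 0;
}

/* Loader flow: start from an empty map, copy .rodata in, then freeze. */
static int toy_load_rodata(struct toy_map *m, const void *rodata, size_t len)
{
	int err;

	memset(m, 0, sizeof(*m));
	err = toy_map_update(m, rodata, len);
	if (err)
		return err;
	return toy_map_freeze(m);
}
```

      Because freezing is a separate step, the loader may issue as many
      updates as it needs before the map becomes immutable, which is the
      property a write-once creation flag could not offer.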
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      87df15de
    • D
      bpf: add program side {rd, wr}only support for maps · 591fe988
      Daniel Borkmann committed
      This work adds two new map creation flags BPF_F_RDONLY_PROG
      and BPF_F_WRONLY_PROG in order to allow for read-only or
      write-only BPF maps from a BPF program side.
      
      Today we have BPF_F_RDONLY and BPF_F_WRONLY, but these only
      apply to the system call side, meaning the BPF program has full
      read/write access to the map as usual while bpf(2) calls with
      map fd can either only read or write into the map depending
      on the flags. BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG allow
      for the exact opposite such that the verifier is going to reject
      program loads if a write into a read-only map or a read from a
      write-only map is detected. For the read-only map case, some
      helpers that would alter the map state, such as map deletion,
      update, etc., are also forbidden for programs. As opposed to the two
      BPF_F_RDONLY / BPF_F_WRONLY flags, BPF_F_RDONLY_PROG as well
      as BPF_F_WRONLY_PROG really do correspond to the map lifetime.
      
      We've enabled this generic map extension for various non-special
      maps holding normal user data: array, hash, lru, lpm, local
      storage, queue and stack. Further generic map types could
      follow in the future depending on use-case. The main use case
      here is to forbid writes into .rodata map values from the verifier
      side.
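      
      The program-side check can be pictured as deriving a capability mask
      from the creation flags and testing each access against it. The sketch
      below is a user space model under stated assumptions: the TOY_* names
      and their numeric values are illustrative, not the UAPI constants, and
      rejecting both flags at once is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins; the values are NOT the UAPI flag values. */
#define TOY_F_RDONLY_PROG	(1u << 0)
#define TOY_F_WRONLY_PROG	(1u << 1)

#define TOY_CAN_READ		(1u << 0)
#define TOY_CAN_WRITE		(1u << 1)

enum toy_access { TOY_READ, TOY_WRITE };

/* Both flags together would leave the program with no access at all. */
static bool toy_flags_valid(uint32_t map_flags)
{
	uint32_t both = TOY_F_RDONLY_PROG | TOY_F_WRONLY_PROG;

	return (map_flags & both) != both;
}

/* Translate creation flags into what a program may do with the map. */
static uint32_t toy_prog_caps(uint32_t map_flags)
{
	assert(toy_flags_valid(map_flags));
	if (map_flags & TOY_F_RDONLY_PROG)
		return TOY_CAN_READ;
	if (map_flags & TOY_F_WRONLY_PROG)
		return TOY_CAN_WRITE;
	return TOY_CAN_READ | TOY_CAN_WRITE;
}

/* A verifier-style decision: false means the program load is rejected. */
static bool toy_access_ok(uint32_t map_flags, enum toy_access access)
{
	uint32_t caps = toy_prog_caps(map_flags);

	if (access == TOY_READ)
		return (caps & TOY_CAN_READ) != 0;
	return (caps & TOY_CAN_WRITE) != 0;
}
```

      Since the mask depends only on creation flags, the decision is fixed
      for the whole lifetime of the map, matching the statement above that
      these flags correspond to the map lifetime.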
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      591fe988
    • D
      bpf: implement lookup-free direct value access for maps · d8eca5bb
      Daniel Borkmann committed
      This generic extension to BPF maps allows for directly loading
      an address residing inside a BPF map value as a single BPF
      ldimm64 instruction!
      
      The idea is similar to what BPF_PSEUDO_MAP_FD does today, which
      is a special src_reg flag for the ldimm64 instruction that indicates
      that the first part of the double insn's imm field holds a
      file descriptor which the verifier then replaces with the full 64bit
      address of the map, written into both imm parts. For the newly added
      BPF_PSEUDO_MAP_VALUE src_reg flag, the idea is the following:
      the first part of the double insn's imm field is again a file
      descriptor corresponding to the map, and the second part of the
      imm field is an offset into the value. The verifier will then
      replace both imm parts with an address that points into the BPF
      map value at the given value offset for maps that support this
      operation. Currently supported is an array map with a single entry.
      It is possible to support more than just a single map element by
      reusing both 16bit off fields of the insns as a map index, so
      a full array map lookup could be expressed that way. It hasn't
      been implemented here due to lack of a concrete use case, but
      could easily be done in the future in a compatible way, since
      both off fields right now have to be 0 and would correctly
      denote a map index 0.
      
      BPF_PSEUDO_MAP_VALUE is a distinct flag as otherwise with
      BPF_PSEUDO_MAP_FD we could not differentiate offset 0 between a load of
      the map pointer versus a load of the map's value at offset 0, and changing
      BPF_PSEUDO_MAP_FD's encoding into off by one to differentiate between
      a regular map pointer and a map value pointer would add unnecessary
      complexity and raise the barrier for debuggability, making it less
      suitable. Using the second part of the imm field as an offset
      into the value does /not/ come with limitations since the maximum
      possible value size is in the u32 universe anyway.
      
      This optimization allows for efficiently retrieving an address
      to a map value memory area without having to issue a helper call
      which needs to prepare registers according to calling convention,
      etc, without needing the extra NULL test, and without having to
      add the offset in an additional instruction to the value base
      pointer. The verifier then treats the destination register as
      PTR_TO_MAP_VALUE with constant reg->off from the user passed
      offset from the second imm field, and guarantees that this is
      within bounds of the map value. Any subsequent operations are
      normally treated as typical map value handling without anything
      extra needed from verification side.
      
      The two map operations for direct value access have been added to
      array map for now. In future other types could be supported as
      well depending on the use case. The main use case for this commit
      is to allow for BPF loader support for global variables that
      reside in .data/.rodata/.bss sections such that we can directly
      load the address of them with minimal additional infrastructure
      required. Loader support has been added in subsequent commits for
      libbpf library.
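      
      The two-slot encoding described above can be sketched as follows. The
      struct below is a local mirror of the 8-byte BPF instruction layout so
      that the example builds without kernel headers; the toy_* function
      names are hypothetical, and toy_rewrite() only models the idea of the
      verifier's substitution (bounds check, then address split across both
      imm fields), not its actual code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Local mirror of the BPF instruction layout (8 bytes per slot). */
struct toy_insn {
	uint8_t code;
	uint8_t dst_reg : 4;
	uint8_t src_reg : 4;
	int16_t off;
	int32_t imm;
};

#define TOY_LD_IMM64		0x18	/* BPF_LD | BPF_DW | BPF_IMM */
#define TOY_PSEUDO_MAP_VALUE	2	/* src_reg marker for map value loads */

/* Emit the double insn: slot 0 carries the map fd, slot 1 the offset. */
static void toy_ld_map_value(struct toy_insn insn[2], uint8_t dst,
			     int map_fd, uint32_t value_off)
{
	assert(dst <= 10);	/* BPF registers are r0..r10 */
	insn[0] = (struct toy_insn){
		.code = TOY_LD_IMM64,
		.dst_reg = dst,
		.src_reg = TOY_PSEUDO_MAP_VALUE,
		.off = 0,	/* must be 0: would denote map index 0 */
		.imm = map_fd,
	};
	insn[1] = (struct toy_insn){ .imm = (int32_t)value_off };
}

/*
 * Model of the rewrite: check the offset against the value size, then
 * replace both imm parts with the 64-bit address of value + offset.
 */
static int toy_rewrite(struct toy_insn insn[2], uint64_t value_base,
		       uint32_t value_size)
{
	uint32_t off = (uint32_t)insn[1].imm;
	uint64_t addr;

	if (insn[0].src_reg != TOY_PSEUDO_MAP_VALUE ||
	    insn[0].off != 0 || insn[1].off != 0)
		return -EINVAL;
	if (off >= value_size)
		return -EINVAL;	/* offset must stay inside the map value */
	addr = value_base + off;
	insn[0].imm = (int32_t)(uint32_t)addr;		/* low 32 bits */
	insn[1].imm = (int32_t)(uint32_t)(addr >> 32);	/* high 32 bits */
	return 0;
}
```

      After the rewrite the program holds a plain pointer into the value, so
      no helper call, NULL test or extra add instruction is needed at run
      time.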
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d8eca5bb
  23. 04 Apr, 2019 1 commit
    • A
      bpf: increase complexity limit and maximum program size · c04c0d2b
      Alexei Starovoitov committed
      Large verifier speed improvements allow increasing the
      verifier complexity limit.
      Now regardless of the program composition and its size it takes
      little time for the verifier to hit insn_processed limit.
      On typical x86 machine non-debug kernel processes 1M instructions
      in 1/10 of a second.
      (before these speed improvements specially crafted programs
      could be hitting multi-second verification times)
      Full kasan kernel with debug takes ~1 second for the same 1M insns.
      Hence bump the BPF_COMPLEXITY_LIMIT_INSNS limit to 1M.
      Also increase the number of instructions per program
      from 4k to internal BPF_COMPLEXITY_LIMIT_INSNS limit.
      4k limit was confusing to users, since small programs with hundreds
      of insns could be hitting BPF_COMPLEXITY_LIMIT_INSNS limit.
      Sometimes adding more insns and bpf_trace_printk debug statements
      would make the verifier accept the program while removing
      code would make the verifier reject it.
      Some user space applications started to add #define MAX_FOO to
      their programs and do:
        MAX_FOO=100;
      again:
        compile with MAX_FOO;
        try to load;
        if (fails_to_load) { reduce MAX_FOO; goto again; }
      to be able to fit maximum amount of processing into single program.
      Other users artificially split their single program into a set of programs
      and use all 32 iterations of tail_calls to increase compute limits.
      And the most advanced folks used unlimited tc-bpf filter list
      to execute many bpf programs.
      Essentially the users managed to work around the 4k insn limit.
      This patch removes the limit for root programs from uapi.
      BPF_COMPLEXITY_LIMIT_INSNS is the kernel internal limit
      and successfully loading the program no longer depends on program size,
      but only on the 'smartness' of the verifier.
      The verifier will continue to get smarter with every kernel release.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      c04c0d2b
  24. 22 Mar, 2019 1 commit
  25. 14 Mar, 2019 1 commit
    • M
      bpf: Fix bpf_tcp_sock and bpf_sk_fullsock issue related to bpf_sk_release · 1b986589
      Martin KaFai Lau committed
      Lorenz Bauer [thanks!] reported that a ptr returned by bpf_tcp_sock(sk)
      can still be accessed after bpf_sk_release(sk).
      Both bpf_tcp_sock() and bpf_sk_fullsock() have the same issue.
      This patch addresses them together.
      
      A simple reproducer looks like this:
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) ... */
      	tp = bpf_tcp_sock(sk);
      	/* if (!tp) ... */
      	bpf_sk_release(sk);
      	snd_cwnd = tp->snd_cwnd; /* oops! The verifier does not complain. */
      
      The problem is the verifier did not scrub the register's states of
      the tcp_sock ptr (tp) after bpf_sk_release(sk).
      
      [ Note that when calling bpf_tcp_sock(sk), the sk is not always
        refcount-acquired. e.g. bpf_tcp_sock(skb->sk). The verifier works
        fine for this case. ]
      
      Currently, the verifier does not track if a helper's return ptr (in REG_0)
      is "carry"-ing one of its argument's refcount status. To carry this info,
      the reg1->id needs to be stored in reg0.
      
      One approach was tried, like "reg0->id = reg1->id", when calling
      "bpf_tcp_sock()".  The main idea was to avoid adding another "ref_obj_id"
      for the same reg.  However, overlapping the NULL marking and ref
      tracking purpose in one "id" does not work well:
      
      	ref_sk = bpf_sk_lookup_tcp();
      	fullsock = bpf_sk_fullsock(ref_sk);
      	tp = bpf_tcp_sock(ref_sk);
      	if (!fullsock) {
      	     bpf_sk_release(ref_sk);
      	     return 0;
      	}
      	/* fullsock_reg->id is marked for NOT-NULL.
      	 * Same for tp_reg->id because they have the same id.
      	 */
      
      	/* oops. verifier did not complain about the missing !tp check */
      	snd_cwnd = tp->snd_cwnd;
      
      Hence, a new "ref_obj_id" is needed in "struct bpf_reg_state".
      With a new ref_obj_id, when bpf_sk_release(sk) is called, the verifier can
      scrub all reg states which have a ref_obj_id match.  It is done with the
      changes in release_reg_references() in this patch.
      
      While fixing it, sk_to_full_sk() is removed from bpf_tcp_sock() and
      bpf_sk_fullsock() to prevent these helpers from returning
      another ptr. It will make bpf_sk_release(tp) possible:
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) ... */
      	tp = bpf_tcp_sock(sk);
      	/* if (!tp) ... */
      	bpf_sk_release(tp);
      
      A separate helper "bpf_get_listener_sock()" will be added in a later
      patch to do sk_to_full_sk().
      
      Misc change notes:
      - To allow bpf_sk_release(tp), the arg of bpf_sk_release() is changed
        from ARG_PTR_TO_SOCKET to ARG_PTR_TO_SOCK_COMMON.  ARG_PTR_TO_SOCKET
        is removed from bpf.h since no helper is using it.
      
      - arg_type_is_refcounted() is renamed to arg_type_may_be_refcounted()
        because ARG_PTR_TO_SOCK_COMMON is the only one and skb->sk is not
        refcounted.  All bpf_sk_release(), bpf_sk_fullsock() and bpf_tcp_sock()
        take ARG_PTR_TO_SOCK_COMMON.
      
      - check_refcount_ok() ensures is_acquire_function() cannot take
        arg_type_may_be_refcounted() as its argument.
      
      - The check_func_arg() can only allow one refcount-ed arg.  It is
        guaranteed by check_refcount_ok() which ensures at most one arg can be
        refcounted.  Hence, it is a verifier internal error if >1 refcount arg
        found in check_func_arg().
      
      - In release_reference(), release_reference_state() is called
        first to ensure a match on "reg->ref_obj_id" can be found before
        scrubbing the reg states with release_reg_references().
      
      - reg_is_refcounted() is no longer needed.
        1. In mark_ptr_or_null_regs(), its usage is replaced by
           "ref_obj_id && ref_obj_id == id" because,
           when is_null == true, release_reference_state() should only be
           called on the ref_obj_id obtained by an acquire helper (i.e.
           is_acquire_function() == true).  Otherwise, the following
           would happen:
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) { ... } */
      	fullsock = bpf_sk_fullsock(sk);
      	if (!fullsock) {
      		/*
      		 * release_reference_state(fullsock_reg->ref_obj_id)
      		 * where fullsock_reg->ref_obj_id == sk_reg->ref_obj_id.
      		 *
      		 * Hence, the following bpf_sk_release(sk) will fail
      		 * because the ref state has already been released in the
      		 * earlier release_reference_state(fullsock_reg->ref_obj_id).
      		 */
      		bpf_sk_release(sk);
      	}
      
        2. In release_reg_references(), the current reg_is_refcounted() call
           is unnecessary because the id check is enough.
      
      - The type_is_refcounted() and type_is_refcounted_or_null()
        are no longer needed also because reg_is_refcounted() is removed.
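      
      The ref_obj_id bookkeeping described in this commit can be illustrated
      with a toy register file in user space C. This is a simplified model,
      not verifier code: struct toy_reg, struct toy_state and the toy_*
      functions are hypothetical, and NULL tracking (the separate "id") is
      left out on purpose.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NREGS 4

enum toy_reg_type { TOY_NOT_INIT = 0, TOY_PTR_TO_SOCKET, TOY_PTR_TO_TCP_SOCK };

struct toy_reg {
	enum toy_reg_type type;
	unsigned int ref_obj_id;	/* 0: not tied to an acquired reference */
};

struct toy_state {
	struct toy_reg regs[TOY_NREGS];
	unsigned int live_refs[TOY_NREGS];	/* acquired ids, 0 = free slot */
	unsigned int next_id;
};

/* Acquire helper (think bpf_sk_lookup_tcp): hands out a fresh reference id. */
static int toy_acquire(struct toy_state *st, size_t dst)
{
	assert(st && dst < TOY_NREGS);
	for (size_t i = 0; i < TOY_NREGS; i++) {
		if (st->live_refs[i] == 0) {
			st->live_refs[i] = ++st->next_id;
			st->regs[dst].type = TOY_PTR_TO_SOCKET;
			st->regs[dst].ref_obj_id = st->live_refs[i];
			return 0;
		}
	}
	return -ENOMEM;
}

/* Derive helper (think bpf_tcp_sock): result carries the arg's ref_obj_id. */
static void toy_derive(struct toy_state *st, size_t dst, size_t src)
{
	assert(st && dst < TOY_NREGS && src < TOY_NREGS);
	st->regs[dst].type = TOY_PTR_TO_TCP_SOCK;
	st->regs[dst].ref_obj_id = st->regs[src].ref_obj_id;
}

/* Release (think bpf_sk_release): drop the ref, then scrub every alias. */
static int toy_release(struct toy_state *st, size_t arg)
{
	unsigned int id;
	size_t i;

	assert(st && arg < TOY_NREGS);
	id = st->regs[arg].ref_obj_id;
	if (id == 0)
		return -EINVAL;		/* nothing was ever acquired */
	for (i = 0; i < TOY_NREGS; i++)
		if (st->live_refs[i] == id)
			break;
	if (i == TOY_NREGS)
		return -EINVAL;		/* reference already released */
	st->live_refs[i] = 0;
	for (i = 0; i < TOY_NREGS; i++) {
		if (st->regs[i].ref_obj_id == id) {
			st->regs[i].type = TOY_NOT_INIT;
			st->regs[i].ref_obj_id = 0;
		}
	}
	return 0;
}

/* A dereference is only allowed while the register is still initialized. */
static bool toy_can_deref(const struct toy_state *st, size_t reg)
{
	assert(st && reg < TOY_NREGS);
	return st->regs[reg].type != TOY_NOT_INIT;
}
```

      In this model the reproducer at the top of the commit message is
      rejected: after releasing sk, the register holding tp is scrubbed as
      well, so the later tp->snd_cwnd access has nothing valid to read from.
      Releasing through the derived pointer works the same way, which mirrors
      the bpf_sk_release(tp) case.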
      
      Fixes: 655a51e5 ("bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock")
      Reported-by: Lorenz Bauer <lmb@cloudflare.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1b986589
  26. 28 Feb, 2019 1 commit
    • A
      bpf: enable program stats · 492ecee8
      Alexei Starovoitov committed
      JITed BPF programs are indistinguishable from kernel functions, but unlike
      kernel code BPF code can be changed often.
      Typical approach of "perf record" + "perf report" profiling and tuning of
      kernel code works just as well for BPF programs, but kernel code doesn't
      need to be monitored whereas BPF programs do.
      Users load and run a large number of BPF programs.
      These BPF stats allow tools to monitor the usage of BPF on the server.
      The monitoring tools will turn sysctl kernel.bpf_stats_enabled
      on and off for a few seconds to sample the average cost of the programs.
      Aggregated data over hours and days will provide an insight into cost of BPF
      and alarms can trigger in case given program suddenly gets more expensive.
      
      The cost of two sched_clock() per program invocation adds ~20 nsec.
      Fast BPF progs (like selftests/bpf/progs/test_pkt_access.c) will slow down
      from ~10 nsec to ~30 nsec.
      static_key minimizes the cost of the stats collection.
      There is no measurable difference before/after this patch
      with kernel.bpf_stats_enabled=0
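      
      A monitoring tool would typically turn the two counters into an average
      cost per run, either over the program's lifetime or over a sampling
      window. The sketch below shows that arithmetic only; struct
      toy_prog_stats and both function names are hypothetical, and how the
      counters are read from the kernel is outside its scope.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of the two per-program counters. */
struct toy_prog_stats {
	uint64_t run_time_ns;	/* accumulated run time */
	uint64_t run_cnt;	/* number of invocations */
};

/* Lifetime average in nanoseconds (integer division truncates). */
static uint64_t toy_avg_ns(const struct toy_prog_stats *s)
{
	assert(s);
	return s->run_cnt ? s->run_time_ns / s->run_cnt : 0;
}

/* Average over a sampling window, from snapshots taken before and after. */
static uint64_t toy_window_avg_ns(const struct toy_prog_stats *before,
				  const struct toy_prog_stats *after)
{
	uint64_t runs, ns;

	assert(before && after);
	runs = after->run_cnt - before->run_cnt;
	ns = after->run_time_ns - before->run_time_ns;
	return runs ? ns / runs : 0;
}
```

      The windowed form fits the usage described above, where stats are only
      enabled for a few seconds at a time: the difference between two
      snapshots gives the cost during that window alone.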
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      492ecee8
  27. 13 Feb, 2019 1 commit
  28. 11 Feb, 2019 2 commits
    • M
      bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock · 655a51e5
      Martin KaFai Lau committed
      This patch adds a helper function BPF_FUNC_tcp_sock and it
      is currently available for cg_skb and sched_(cls|act):
      
      struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk);
      
      int cg_skb_foo(struct __sk_buff *skb) {
      	struct bpf_tcp_sock *tp;
      	struct bpf_sock *sk;
      	__u32 snd_cwnd;
      
      	sk = skb->sk;
      	if (!sk)
      		return 1;
      
      	tp = bpf_tcp_sock(sk);
      	if (!tp)
      		return 1;
      
      	snd_cwnd = tp->snd_cwnd;
      	/* ... */
      
      	return 1;
      }
      
      A 'struct bpf_tcp_sock' is also added to the uapi bpf.h to provide
      read-only access.  bpf_tcp_sock has all the existing tcp_sock's fields
      that have already been exposed by the bpf_sock_ops.
      i.e. no new tcp_sock's fields are exposed in bpf.h.
      
      This helper returns a pointer to the tcp_sock.  If it is not a tcp_sock
      or it cannot be traced back to a tcp_sock by sk_to_full_sk(), it
      returns NULL.  Hence, the caller needs to check for NULL before
      accessing it.
      
      The current use case is to expose members from tcp_sock
      to allow a cg_skb_bpf_prog to provide per cgroup traffic
      policing/shaping.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      655a51e5
    • M
      bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper · 46f8bc92
      Martin KaFai Lau committed
      In kernel, it is common to check "skb->sk && sk_fullsock(skb->sk)"
      before accessing the fields in sock.  For example, in __netdev_pick_tx:
      
      static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
      			    struct net_device *sb_dev)
      {
      	/* ... */
      
      	struct sock *sk = skb->sk;
      
      		if (queue_index != new_index && sk &&
      		    sk_fullsock(sk) &&
      		    rcu_access_pointer(sk->sk_dst_cache))
      			sk_tx_queue_set(sk, new_index);
      
      	/* ... */
      
      	return queue_index;
      }
      
      This patch adds a "struct bpf_sock *sk" pointer to the "struct __sk_buff"
      where a few of the convert_ctx_access() in filter.c have already been
      accessing the skb->sk sock_common's fields,
      e.g. sock_ops_convert_ctx_access().
      
      "__sk_buff->sk" is a PTR_TO_SOCK_COMMON_OR_NULL in the verifier.
      Some of the fields in "bpf_sock" will not be directly
      accessible through the "__sk_buff->sk" pointer.  It is limited
      by the new "bpf_sock_common_is_valid_access()".
      e.g. The existing "type", "protocol", "mark" and "priority" in bpf_sock
           are not allowed.
      
      The newly added "struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)"
      can be used to get a sk with all accessible fields in "bpf_sock".
      This helper is added to both cg_skb and sched_(cls|act).
      
      int cg_skb_foo(struct __sk_buff *skb) {
      	struct bpf_sock *sk;
      
      	sk = skb->sk;
      	if (!sk)
      		return 1;
      
      	sk = bpf_sk_fullsock(sk);
      	if (!sk)
      		return 1;
      
      	if (sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP)
      		return 1;
      
      	/* some_traffic_shaping(); */
      
      	return 1;
      }
      
      (1) The sk is read only
      
      (2) There is no new "struct bpf_sock_common" introduced.
      
      (3) Future kernel sock's members could be added to bpf_sock only
          instead of repeatedly adding at multiple places like currently
          in bpf_sock_ops_md, bpf_sock_addr_md, sk_reuseport_md...etc.
      
      (4) After "sk = skb->sk", the reg holding sk is in type
          PTR_TO_SOCK_COMMON_OR_NULL.
      
      (5) After bpf_sk_fullsock(), the return type will be in type
          PTR_TO_SOCKET_OR_NULL which is the same as the return type of
          bpf_sk_lookup_xxx().
      
          However, bpf_sk_fullsock() does not take refcnt.  The
          acquire_reference_state() call currently depends only on the return
          type.  To avoid it, a new is_acquire_function() is checked before
          calling acquire_reference_state().
      
      (6) The WARN_ON in "release_reference_state()" is no longer an
          internal verifier bug.
      
          When reg->id is not found in state->refs[], it means the
          bpf_prog does something wrong like
          "bpf_sk_release(bpf_sk_fullsock(skb->sk))" where reference has
          never been acquired by calling "bpf_sk_fullsock(skb->sk)".
      
          A -EINVAL and a verbose are done instead of WARN_ON.  A test is
          added to the test_verifier in a later patch.
      
          Since the WARN_ON in "release_reference_state()" is no longer
          needed, "__release_reference_state()" is folded into
          "release_reference_state()" also.
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      46f8bc92