1. 30 Mar 2020, 4 commits
  2. 29 Mar 2020, 4 commits
  3. 28 Mar 2020, 3 commits
    • bpf: Add selftest cases for ctx_or_null argument type · 23599ada
      Daniel Borkmann authored
      Add various tests to make sure the verifier keeps catching these cases:
      
        # ./test_verifier
        [...]
        #230/p pass ctx or null check, 1: ctx OK
        #231/p pass ctx or null check, 2: null OK
        #232/p pass ctx or null check, 3: 1 OK
        #233/p pass ctx or null check, 4: ctx - const OK
        #234/p pass ctx or null check, 5: null (connect) OK
        #235/p pass ctx or null check, 6: null (bind) OK
        #236/p pass ctx or null check, 7: ctx (bind) OK
        #237/p pass ctx or null check, 8: null (bind) OK
        [...]
        Summary: 1595 PASSED, 0 SKIPPED, 0 FAILED
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/c74758d07b1b678036465ef7f068a49e9efd3548.1585323121.git.daniel@iogearbox.net
    • bpf: Enable bpf cgroup hooks to retrieve cgroup v2 and ancestor id · 0f09abd1
      Daniel Borkmann authored
      Enable the bpf_get_current_cgroup_id() helper for connect(), sendmsg(),
      recvmsg() and bind-related hooks in order to retrieve the cgroup v2
      context which can then be used as part of the key for BPF map lookups,
      for example. Given these hooks operate in process context, 'current' is
      always valid and points to the app performing the mentioned syscalls if
      it's subject to a v2 cgroup. Also, with the same motivation as commit
      77236281 ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper"), enable
      retrieval of the ancestor cgroup id from current so it can be used for
      policy lookups which can then, for example, forbid connect() / bind().
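
      As a hedged sketch of the lookup pattern described above (the map
      layout and policy semantics are illustrative assumptions, not part
      of this commit):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        struct {
                __uint(type, BPF_MAP_TYPE_HASH);
                __uint(max_entries, 1024);
                __type(key, __u64);   /* cgroup v2 id */
                __type(value, __u32); /* non-zero means deny */
        } policy SEC(".maps");

        SEC("cgroup/connect4")
        int deny_by_cgroup(struct bpf_sock_addr *ctx)
        {
                __u64 id = bpf_get_current_cgroup_id();
                __u32 *deny = bpf_map_lookup_elem(&policy, &id);

                /* returning 0 rejects the connect() with EPERM */
                return deny && *deny ? 0 : 1;
        }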
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/d2a7ef42530ad299e3cbb245e6c12374b72145ef.1585323121.git.daniel@iogearbox.net
    • bpf: Add netns cookie and enable it for bpf cgroup hooks · f318903c
      Daniel Borkmann authored
      In Cilium we're mainly using BPF cgroup hooks today in order to implement
      kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*),
      ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic
      between Cilium managed nodes. While this works in its current shape and avoids
      packet-level NAT for traffic between Cilium managed nodes, there is one
      major limitation we're facing today: a lack of netns awareness.
      
      In Kubernetes, the concept of Pods (which hold one or multiple containers)
      has been built around network namespaces, so while we can use the global scope
      of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing
      NodePort ports on loopback addresses), we also need to differentiate
      between the initial network namespace and non-initial ones. ExternalIP
      services, for example, mandate that non-local service IPs are not to be
      translated from the host (initial) network namespace. Right now, we have an ugly
      work-around in place where non-local service IPs for ExternalIP services are
      not xlated from connect() and friends BPF hooks but instead via less efficient
      packet-level NAT on the veth tc ingress hook for Pod traffic.
      
      On top of determining whether we're in initial or non-initial network namespace
      we also need a socket-cookie-like mechanism at network namespace
      scope. Socket cookies have the nice property that they can be combined as part
      of the key structure e.g. for BPF LRU maps without having to worry that the
      cookie could be recycled. We are planning to use this for our sessionAffinity
      implementation for services. Therefore, add a new bpf_get_netns_cookie() helper
      which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would
      provide the cookie for the initial network namespace while passing the context
      instead of NULL would provide the cookie from the application's network namespace.
      We're using an existing hole, so there is no size increase; the cookie
      assignment happens only once. This allows both a comparison against the
      initial namespace and regular cookie usage as we have today with socket
      cookies. We could later enable this helper for other program types as
      the need arises.
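
      A minimal sketch of the intended usage (hook choice and naming are
      illustrative assumptions):

        SEC("cgroup/connect4")
        int svc_xlate(struct bpf_sock_addr *ctx)
        {
                __u64 init_ns = bpf_get_netns_cookie(NULL);
                __u64 app_ns = bpf_get_netns_cookie(ctx);

                if (app_ns == init_ns) {
                        /* host (initial) netns: e.g. skip ExternalIP xlation */
                } else {
                        /* Pod netns: app_ns can be part of an LRU map key */
                }
                return 1; /* allow */
        }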
      
        (*) Both externalTrafficPolicy={Local|Cluster} types
        [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
  4. 26 Mar 2020, 3 commits
    • bpf: Test_verifier, #70 error message updates for 32-bit right shift · aa131ed4
      John Fastabend authored
      After the changes to add update_reg_bounds after ALU ops and to add
      ALU32 bounds tracking, the error message changes in the 32-bit right
      shift tests.
      
      Test "#70/u bounds check after 32-bit right shift with 64-bit input FAIL"
      now fails with,
      
      Unexpected error message!
      	EXP: R0 invalid mem access
      	RES: func#0 @0
      
      7: (b7) r1 = 2
      8: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=invP2 R10=fp0 fp-8_w=mmmmmmmm
      8: (67) r1 <<= 31
      9: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=invP4294967296 R10=fp0 fp-8_w=mmmmmmmm
      9: (74) w1 >>= 31
      10: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=invP0 R10=fp0 fp-8_w=mmmmmmmm
      10: (14) w1 -= 2
      11: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=invP4294967294 R10=fp0 fp-8_w=mmmmmmmm
      11: (0f) r0 += r1
      math between map_value pointer and 4294967294 is not allowed
      
      And test "#70/p bounds check after 32-bit right shift with 64-bit input
      FAIL" now fails with,
      
      Unexpected error message!
      	EXP: R0 invalid mem access
      	RES: func#0 @0
      
      7: (b7) r1 = 2
      8: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=inv2 R10=fp0 fp-8_w=mmmmmmmm
      8: (67) r1 <<= 31
      9: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=inv4294967296 R10=fp0 fp-8_w=mmmmmmmm
      9: (74) w1 >>= 31
      10: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=inv0 R10=fp0 fp-8_w=mmmmmmmm
      10: (14) w1 -= 2
      11: R0_w=map_value(id=0,off=0,ks=8,vs=8,imm=0) R1_w=inv4294967294 R10=fp0 fp-8_w=mmmmmmmm
      11: (0f) r0 += r1
      last_idx 11 first_idx 0
      regs=2 stack=0 before 10: (14) w1 -= 2
      regs=2 stack=0 before 9: (74) w1 >>= 31
      regs=2 stack=0 before 8: (67) r1 <<= 31
      regs=2 stack=0 before 7: (b7) r1 = 2
      math between map_value pointer and 4294967294 is not allowed
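
      For reference, a sketch of the instruction sequence this trace
      corresponds to, reconstructed from the log above rather than copied
      from the test source:

        BPF_MOV64_IMM(BPF_REG_1, 2),                  /* 7:  r1 = 2 */
        BPF_ALU64_IMM(BPF_LSH, BPF_REG_1, 31),        /* 8:  r1 <<= 31 */
        BPF_ALU32_IMM(BPF_RSH, BPF_REG_1, 31),        /* 9:  w1 >>= 31 */
        BPF_ALU32_IMM(BPF_SUB, BPF_REG_1, 2),         /* 10: w1 -= 2 */
        BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1), /* 11: r0 += r1 */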
      
      Before this series we did not trip the "math between map_value pointer..."
      error because check_reg_sane_offset is never called in
      adjust_ptr_min_max_vals(). Instead we have a register state that looks
      like this at line 11*,
      
      11: R0_w=map_value(id=0,off=0,ks=8,vs=8,
                         smin_value=0,smax_value=0,
                         umin_value=0,umax_value=0,
                         var_off=(0x0; 0x0))
          R1_w=invP(id=0,
                    smin_value=0,smax_value=4294967295,
                    umin_value=0,umax_value=4294967295,
                    var_off=(0xfffffffe; 0x0))
          R10=fp(id=0,off=0,
                 smin_value=0,smax_value=0,
                 umin_value=0,umax_value=0,
                 var_off=(0x0; 0x0)) fp-8_w=mmmmmmmm
      11: (0f) r0 += r1
      
      In R1 'smin_val != smax_val', yet we have a tnum_const, as seen by
      'var_off=(0xfffffffe; 0x0)' with a 0x0 mask. So we hit this check
      in adjust_ptr_min_max_vals():
      
       if ((known && (smin_val != smax_val || umin_val != umax_val)) ||
            smin_val > smax_val || umin_val > umax_val) {
             /* Taint dst register if offset had invalid bounds derived from
              * e.g. dead branches.
              */
             __mark_reg_unknown(env, dst_reg);
             return 0;
       }
      
      So we don't throw an error here and instead only throw an error
      later in the verification when the memory access is made.
      
      The root cause, on a verifier without alu32 bounds tracking, is ending
      up with 'umin_value = 0' and 'umax_value = U64_MAX' from BPF_SUB, which
      we set when 'umin_value < umax_val' here,
      
       if (dst_reg->umin_value < umax_val) {
          /* Overflow possible, we know nothing */
          dst_reg->umin_value = 0;
          dst_reg->umax_value = U64_MAX;
       } else { ...}
      
      Later, in adjust_scalar_min_max_vals(), we previously did a
      coerce_reg_to_size() which clamps the U64_MAX to U32_MAX by truncating
      to 32 bits. But either way, without a call to update_reg_bounds, the
      less precise bounds tracking falls out of the ALU op verification.
      
      After the latest changes we now exit adjust_scalar_min_max_vals() with
      the more precise umin value, due to zero extension propagating bounds
      from alu32 into alu64 bounds and then calling update_reg_bounds. This
      causes the verifier to trigger the error earlier, and we get the error
      in the output above.
      
      This patch updates the tests to reflect the new error message.
      
      * I have a local patch to print the entire verifier state regardless of
       whether we believe it is a constant, so we can get a full picture of
       the state. Usually if tnum_is_const() then bounds are also smin=smax,
       etc., but this is not always true and is a bit subtle. Being able to
       see these states helps understand the dataflow imo. Let me know if we
       want something similar upstream.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/158507161475.15666.3061518385241144063.stgit@john-Precision-5820-Tower
    • libbpf: Don't allocate 16M for log buffer by default · 8395f320
      Stanislav Fomichev authored
      For each prog/btf load we allocate and free 16 megs of verifier log
      buffer. On production systems this doesn't really make sense because
      the programs/btf have gone through extensive testing and are (mostly)
      guaranteed to load successfully.
      
      Let's assume the successful case by default and skip buffer allocation
      on the first try. If there is an error, start with BPF_LOG_BUF_SIZE
      and double it on each ENOSPC iteration.
      
      v3:
      * Return -ENOMEM when can't allocate log buffer (Andrii Nakryiko)
      
      v2:
      * Don't allocate the buffer at all on the first try (Andrii Nakryiko)
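
      A hedged sketch of the retry strategy described above; load_once()
      is a hypothetical stand-in for the underlying bpf syscall wrapper:

        char *log_buf = NULL;
        size_t log_size = 0;
        int fd;

        fd = load_once(attr, NULL, 0); /* optimistic first try, no buffer */
        while (fd < 0 && (!log_buf || errno == ENOSPC)) {
                log_size = log_size ? log_size * 2 : BPF_LOG_BUF_SIZE;
                free(log_buf);
                log_buf = malloc(log_size);
                if (!log_buf)
                        return -ENOMEM; /* per the v3 note above */
                fd = load_once(attr, log_buf, log_size);
        }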
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200325195521.112210-1-sdf@google.com
    • libbpf: Remove unused parameter `def` to get_map_field_int · 9fc9aad9
      Tobias Klauser authored
      The parameter has been unused since commit ef99b02b ("libbpf: capture value in BTF
      type info for BTF-defined map defs").
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Quentin Monnet <quentin@isovalent.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200325113655.19341-1-tklauser@distanz.ch
  5. 24 Mar 2020, 2 commits
  6. 21 Mar 2020, 1 commit
  7. 20 Mar 2020, 4 commits
    • bpftool: Add struct_ops support · 65c93628
      Martin KaFai Lau authored
      This patch adds struct_ops support to the bpftool.
      
      To recap a bit on the recent bpf_struct_ops feature on the kernel side:
      at a high level, a bpf_struct_ops is a struct_ops map populated with a
      number of bpf progs.  It currently supports implementing
      "struct tcp_congestion_ops" in bpf.  However, the bpf_struct_ops design
      is generic enough that other kernel struct ops can be supported in
      the future.
      
      Although struct_ops is map+progs at a high level, there are differences
      in the details.  For example,
      1) After registering a struct_ops, the struct_ops is held by the kernel
         subsystem (e.g. tcp-cc).  Thus, there is no need to pin a
         struct_ops map or its progs in order to keep them around.
      2) To iterate all struct_ops in a system, it iterates all maps of
         type BPF_MAP_TYPE_STRUCT_OPS, which is the usual filter today.
         In the future, it may need to filter by other struct_ops specific
         properties, e.g. by tcp_congestion_ops or other kernel subsystem
         ops.
      3) struct_ops requires the running kernel to have BTF info.  That
         allows more flexibility in handling other kernel structs, e.g. it
         can always dump the latest bpf_map_info.
      4) Also, the "struct_ops" command is not intended to repeat all
         features already provided by "map" or "prog".  For example, if
         there really is a need to pin the struct_ops map, the user can use
         the "map" cmd to do that.
      
      While the first attempt was to reuse parts of map/prog.c, it turned
      out there was not a lot to share.  The only obvious candidate is
      map_parse_fds(), but even that would require modifications to
      accommodate struct_ops map specific filtering (for both immediate and
      future needs).  Together with the differences mentioned earlier, it is
      better to keep this separate from map/prog.c.
      
      The initial set of subcmds is: register, unregister, show, and dump.
      
      For register, it registers all struct_ops maps that can be found in an
      obj file.  An option can be added in the future to specify a
      particular struct_ops map.  Also, the common bpf_tcp_cc
      implementations are stateless (e.g. bpf_cubic.c and bpf_dctcp.c); the
      "reuse map" feature is not implemented in this patch and can be
      considered later.
      
      For other subcmds, please see the man doc for details.
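
      For illustration, the subcmds would be invoked along these lines
      (the object and map names are examples):

        # bpftool struct_ops register bpf_cubic.o
        # bpftool struct_ops show
        # bpftool struct_ops dump name cubic
        # bpftool struct_ops unregister name cubic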
      
      A sample output of dump:
      [root@arch-fb-vm1 bpf]# bpftool struct_ops dump name cubic
      [{
              "bpf_map_info": {
                  "type": 26,
                  "id": 64,
                  "key_size": 4,
                  "value_size": 256,
                  "max_entries": 1,
                  "map_flags": 0,
                  "name": "cubic",
                  "ifindex": 0,
                  "btf_vmlinux_value_type_id": 18452,
                  "netns_dev": 0,
                  "netns_ino": 0,
                  "btf_id": 52,
                  "btf_key_type_id": 0,
                  "btf_value_type_id": 0
              }
          },{
              "bpf_struct_ops_tcp_congestion_ops": {
                  "refcnt": {
                      "refs": {
                          "counter": 1
                      }
                  },
                  "state": "BPF_STRUCT_OPS_STATE_INUSE",
                  "data": {
                      "list": {
                          "next": 0,
                          "prev": 0
                      },
                      "key": 0,
                      "flags": 0,
                      "init": "void (struct sock *) bictcp_init/prog_id:138",
                      "release": "void (struct sock *) 0",
                      "ssthresh": "u32 (struct sock *) bictcp_recalc_ssthresh/prog_id:141",
                      "cong_avoid": "void (struct sock *, u32, u32) bictcp_cong_avoid/prog_id:140",
                      "set_state": "void (struct sock *, u8) bictcp_state/prog_id:142",
                      "cwnd_event": "void (struct sock *, enum tcp_ca_event) bictcp_cwnd_event/prog_id:139",
                      "in_ack_event": "void (struct sock *, u32) 0",
                      "undo_cwnd": "u32 (struct sock *) tcp_reno_undo_cwnd/prog_id:144",
                      "pkts_acked": "void (struct sock *, const struct ack_sample *) bictcp_acked/prog_id:143",
                      "min_tso_segs": "u32 (struct sock *) 0",
                      "sndbuf_expand": "u32 (struct sock *) 0",
                      "cong_control": "void (struct sock *, const struct rate_sample *) 0",
                      "get_info": "size_t (struct sock *, u32, int *, union tcp_cc_info *) 0",
                      "name": "bpf_cubic",
                      "owner": 0
                  }
              }
          }
      ]
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Quentin Monnet <quentin@isovalent.com>
      Link: https://lore.kernel.org/bpf/20200318171656.129650-1-kafai@fb.com
    • bpftool: Translate prog_id to its bpf prog_name · d5ae04da
      Martin KaFai Lau authored
      A kernel struct_ops obj has kernel func ptrs implemented by bpf progs.
      The bpf prog_id is stored as the value of the func ptr for
      introspection purposes.  In a later patch, a struct_ops dump subcmd
      will be added to introspect these func ptrs.  It is desirable to print
      the actual bpf prog_name instead of only printing the prog_id.
      
      Since struct_ops is the only use case storing a prog_id in the func
      ptr, this patch adds a prog_id_as_func_ptr bool (default false) to
      "struct btf_dumper" so as not to misinterpret the ptr value for the
      other existing use cases.
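
      A sketch of the knob as described; the surrounding fields follow
      bpftool's existing dumper, though the exact layout may differ:

        struct btf_dumper {
                const struct btf *btf;
                json_writer_t *jw;
                bool is_plain_text;
                bool prog_id_as_func_ptr; /* new: treat ptrs as prog IDs */
        };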
      
      While printing a func_ptr as a bpf prog_name, this patch also prefixes
      the bpf prog_name with the ptr's func_proto.
      [ Note that it is the ptr's func_proto instead of the bpf prog's
        func_proto ]
      It reuses the current btf_dump_func() to obtain the ptr's func_proto
      string.
      
      Here is an example from the bpf_cubic.c:
      "void (struct sock *, u32, u32) bictcp_cong_avoid/prog_id:140"
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Quentin Monnet <quentin@isovalent.com>
      Link: https://lore.kernel.org/bpf/20200318171650.129252-1-kafai@fb.com
    • bpftool: Print as a string for char array · 30255d31
      Martin KaFai Lau authored
      A char[] is currently printed as an integer array.
      This patch will print it as a string when
      1) the array element type is a one-byte int,
      2) the array element type has a BTF_INT_CHAR encoding or
         the array element type's name is "char", and
      3) all characters are within (0x1f, 0x7f) and the array is
         terminated by a null character.
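
      A sketch of check 3); the helper name is hypothetical:

        static bool is_printable_str(const char *s, int len)
        {
                int i;

                for (i = 0; i < len && s[i]; i++)
                        if (s[i] <= 0x1f || s[i] >= 0x7f)
                                return false;
                return i < len; /* the NUL must appear within the array */
        }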
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Quentin Monnet <quentin@isovalent.com>
      Link: https://lore.kernel.org/bpf/20200318171643.129021-1-kafai@fb.com
    • bpftool: Print the enum's name instead of value · ca7e6e45
      Martin KaFai Lau authored
      This patch prints the enum's name if one is found in the array of
      btf_enum.

      Commit 9eea9849 ("bpf: fix BTF verification of enums") has details on
      how an enum can have any power-of-2 size (up to 8 bytes).  This patch
      also takes the chance to accommodate these non-4-byte enums.
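
      A sketch of reading an enum value whose size may be 1, 2, 4 or 8
      bytes; the helper name is hypothetical:

        static __s64 enum_value(const void *data, __u16 size)
        {
                switch (size) {
                case 1: return *(const __s8 *)data;
                case 2: return *(const __s16 *)data;
                case 4: return *(const __s32 *)data;
                case 8: return *(const __s64 *)data;
                }
                return 0; /* other sizes fail BTF verification */
        }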
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Quentin Monnet <quentin@isovalent.com>
      Link: https://lore.kernel.org/bpf/20200318171637.128862-1-kafai@fb.com
  8. 18 Mar 2020, 6 commits
  9. 15 Mar 2020, 3 commits
  10. 14 Mar 2020, 10 commits