1. 03 July 2019 (3 commits)
  2. 29 June 2019 (1 commit)
  3. 28 June 2019 (1 commit)
    • bpf: implement getsockopt and setsockopt hooks · 0d01da6a
      Committed by Stanislav Fomichev
      Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
      BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
      
      BPF_CGROUP_SETSOCKOPT can modify user setsockopt arguments before
      passing them down to the kernel or bypass kernel completely.
      BPF_CGROUP_GETSOCKOPT can inspect/modify getsockopt arguments that
      the kernel returns.
      Both hooks reuse existing PTR_TO_PACKET{,_END} infrastructure.
      
      The buffer memory is pre-allocated (because I don't think there is
      a precedent for working with __user memory from bpf). This might be
      slow to do for each {s,g}etsockopt call, which is why I've added
      __cgroup_bpf_prog_array_is_empty that exits early if there is nothing
      attached to a cgroup. Note, however, that there is a race between
      __cgroup_bpf_prog_array_is_empty and BPF_PROG_RUN_ARRAY where cgroup
      program layout might have changed; this should not be a problem
      because in general there is a race between multiple calls to
      {s,g}etsockopt and user adding/removing bpf progs from a cgroup.
      
      The return code of the BPF program is handled as follows:
      * 0: EPERM
      * 1: success, continue with next BPF program in the cgroup chain
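
      As an illustration (not part of this patch), a minimal sketch of a
      setsockopt hook could look as follows; the SEC() names follow libbpf
      conventions and the SOL_IP/IP_TTL values are assumptions to keep the
      snippet self-contained:

      	#include <linux/bpf.h>
      	#include <bpf/bpf_helpers.h>

      	#define SOL_IP 0	/* assumed values, avoids pulling in socket headers */
      	#define IP_TTL 2

      	SEC("cgroup/setsockopt")
      	int deny_ip_ttl(struct bpf_sockopt *ctx)
      	{
      		/* Anything other than IP_TTL passes through to the kernel untouched. */
      		if (ctx->level != SOL_IP || ctx->optname != IP_TTL)
      			return 1;

      		/* Returning 0 makes the setsockopt() call fail with EPERM. */
      		return 0;
      	}

      	char _license[] SEC("license") = "GPL";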
      
      v9:
      * allow overwriting setsockopt arguments (Alexei Starovoitov):
        * use set_fs (same as kernel_setsockopt)
        * buffer is always kzalloc'd (no small on-stack buffer)
      
      v8:
      * use s32 for optlen (Andrii Nakryiko)
      
      v7:
      * return only 0 or 1 (Alexei Starovoitov)
      * always run all progs (Alexei Starovoitov)
      * use optval=0 as kernel bypass in setsockopt (Alexei Starovoitov)
        (decided to use optval=-1 instead, optval=0 might be a valid input)
      * call getsockopt hook after kernel handlers (Alexei Starovoitov)
      
      v6:
      * rework cgroup chaining; stop as soon as bpf program returns
        0 or 2; see patch with the documentation for the details
      * drop Andrii's and Martin's Acked-by (not sure they are comfortable
        with the new state of things)
      
      v5:
      * skip copy_to_user() and put_user() when ret == 0 (Martin Lau)
      
      v4:
      * don't export bpf_sk_fullsock helper (Martin Lau)
      * size != sizeof(__u64) for uapi pointers (Martin Lau)
      * offsetof instead of bpf_ctx_range when checking ctx access (Martin Lau)
      
      v3:
      * typos in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY comments (Andrii Nakryiko)
      * reverse christmas tree in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY (Andrii
        Nakryiko)
      * use __bpf_md_ptr instead of __u32 for optval{,_end} (Martin Lau)
      * use BPF_FIELD_SIZEOF() for consistency (Martin Lau)
      * new CG_SOCKOPT_ACCESS macro to wrap repeated parts
      
      v2:
      * moved bpf_sockopt_kern fields around to remove a hole (Martin Lau)
      * aligned bpf_sockopt_kern->buf to 8 bytes (Martin Lau)
      * bpf_prog_array_is_empty instead of bpf_prog_array_length (Martin Lau)
      * added [0,2] return code check to verifier (Martin Lau)
      * dropped unused buf[64] from the stack (Martin Lau)
      * use PTR_TO_SOCKET for bpf_sockopt->sk (Martin Lau)
      * dropped bpf_target_off from ctx rewrites (Martin Lau)
      * use return code for kernel bypass (Martin Lau & Andrii Nakryiko)
      
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: Martin Lau <kafai@fb.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  4. 15 June 2019 (2 commits)
  5. 14 June 2019 (1 commit)
  6. 11 June 2019 (1 commit)
  7. 07 June 2019 (1 commit)
    • bpf: fix unconnected udp hooks · 983695fa
      Committed by Daniel Borkmann
      Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
      to applications as also stated in original motivation in 7828f20e ("Merge
      branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
      two hooks into Cilium to enable host based load-balancing with Kubernetes,
      I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
      typically sets up DNS as a service and is thus subject to load-balancing.
      
      Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
      is currently insufficient and thus not usable as-is for standard applications
      shipped with most distros. To break down the issue we ran into with a simple
      example:
      
        # cat /etc/resolv.conf
        nameserver 147.75.207.207
        nameserver 147.75.207.208
      
      For the purpose of a simple test, we set up above IPs as service IPs and
      transparently redirect traffic to a different DNS backend server for that
      node:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      The attached BPF program is basically selecting one of the backends if the
      service IP/port matches on the cgroup hook. DNS breaks here, because the
      hooks are not transparent enough to applications which have built-in msg_name
      address checks:
      
        # nslookup 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
        ;; connection timed out; no servers could be reached
      
        # dig 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; connection timed out; no servers could be reached
      
      For comparison, if none of the service IPs is used, and we tell nslookup
      to use 8.8.8.8 directly it works just fine, of course:
      
        # nslookup 1.1.1.1 8.8.8.8
        1.1.1.1.in-addr.arpa	name = one.one.one.one.
      
      In order to fix this and thus act more transparent to the application,
      this needs reverse translation on recvmsg() side. A minimal fix for this
      API is to add similar recvmsg() hooks behind the BPF cgroups static key
      such that the program can track state and replace the current sockaddr_in{,6}
      with the original service IP. From BPF side, this basically tracks the
      service tuple plus socket cookie in an LRU map where the reverse NAT can
      then be retrieved via map value as one example. Side-note: the BPF cgroups
      static key should be converted to a per-hook static key in future.
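
      As a rough, hypothetical sketch of such a recvmsg4 program (the map
      layout, the names and the use of bpf_get_socket_cookie() are
      illustrative assumptions, not part of this patch):

      	#include <linux/bpf.h>
      	#include <bpf/bpf_helpers.h>

      	struct revnat_val {
      		__be32 addr;	/* original service IP */
      		__be16 port;	/* original service port */
      	};

      	struct {
      		__uint(type, BPF_MAP_TYPE_LRU_HASH);
      		__uint(max_entries, 65536);
      		__type(key, __u64);	/* socket cookie recorded at sendmsg time */
      		__type(value, struct revnat_val);
      	} revnat SEC(".maps");

      	SEC("cgroup/recvmsg4")
      	int revnat_recvmsg4(struct bpf_sock_addr *ctx)
      	{
      		__u64 cookie = bpf_get_socket_cookie(ctx);
      		struct revnat_val *val = bpf_map_lookup_elem(&revnat, &cookie);

      		/* Rewrite the sender address back to the service IP/port
      		 * the application expects to see in msg->msg_name.
      		 */
      		if (val) {
      			ctx->user_ip4 = val->addr;
      			ctx->user_port = val->port;
      		}
      		return 1;	/* recvmsg hooks may only return 1 */
      	}

      	char _license[] SEC("license") = "GPL";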
      
      Same example after this fix:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      Lookups work fine now:
      
        # nslookup 1.1.1.1
        1.1.1.1.in-addr.arpa    name = one.one.one.one.
      
        Authoritative answers can be found from:
      
        # dig 1.1.1.1
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; Got answer:
        ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
        ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
      
        ;; OPT PSEUDOSECTION:
        ; EDNS: version: 0, flags:; udp: 512
        ;; QUESTION SECTION:
        ;1.1.1.1.                       IN      A
      
        ;; AUTHORITY SECTION:
        .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400
      
        ;; Query time: 17 msec
        ;; SERVER: 147.75.207.207#53(147.75.207.207)
        ;; WHEN: Tue May 21 12:59:38 UTC 2019
        ;; MSG SIZE  rcvd: 111
      
      And at the actual packet level it shows that we're using the backend
      server when talking via the 147.75.207.20{7,8} frontend:
      
        # tcpdump -i any udp
        [...]
        12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        [...]
      
      In order to be flexible and to have same semantics as in sendmsg BPF
      programs, we only allow return codes in [1,1] range. In the sendmsg case
      the program is called if msg->msg_name is present which can be the case
      in both connected and unconnected UDP.
      
      The former only relies on the sockaddr_in{,6} passed via connect(2) if
      the passed msg->msg_name was NULL. Therefore, on the recvmsg side, we
      act in a similar way and call into the BPF program whenever a non-NULL
      msg->msg_name was passed, independent of whether sk->sk_state is
      TCP_ESTABLISHED or not. Note
      that for TCP case, the msg->msg_name is ignored in the regular recvmsg
      path and therefore not relevant.
      
      For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
      the hook is not called. This is intentional as it aligns with the same
      semantics as in case of TCP cgroup BPF hooks right now. This might be
      better addressed in future through a different bpf_attach_type such
      that this case can be distinguished from the regular recvmsg paths,
      for example.
      
      Fixes: 1cedee13 ("bpf: Hooks for sys_sendmsg")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Martynas Pumputis <m@lambda.lt>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  8. 25 May 2019 (2 commits)
    • bpf: introduce new bpf prog load flags "BPF_F_TEST_RND_HI32" · c240eff6
      Committed by Jiong Wang
      x86_64 and AArch64 are perhaps the two arches that run the bpf testsuite
      most frequently; however, the zero extension insertion pass is not enabled
      for them because of their hardware support.
      
      It is critical to guarantee the correctness of this pass, as it is supposed
      to be enabled by default for a couple of other arches, for example PowerPC,
      SPARC, arm, NFP etc. Therefore, it would be very useful if there were a way
      to test this pass on, for example, x86_64.
      
      The test methodology employed by this set is "poisoning" useless bits. The
      high 32 bits of a definition are randomized if they are identified as not
      used by any later insn. Such randomization is only enabled under a testing
      mode which is gated by the new bpf prog load flag "BPF_F_TEST_RND_HI32".
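
      For illustration, a userspace sketch of turning the testing mode on when
      loading a program via the raw bpf(2) syscall (insns, insn_cnt and the
      program type are placeholders):

      	#include <linux/bpf.h>
      	#include <sys/syscall.h>
      	#include <unistd.h>

      	int load_with_hi32_poisoning(const struct bpf_insn *insns, int insn_cnt)
      	{
      		union bpf_attr attr = {};

      		attr.prog_type  = BPF_PROG_TYPE_SOCKET_FILTER;
      		attr.insns      = (unsigned long)insns;
      		attr.insn_cnt   = insn_cnt;
      		attr.license    = (unsigned long)"GPL";
      		attr.prog_flags = BPF_F_TEST_RND_HI32;	/* randomize unused high 32 bits */

      		return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
      	}
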
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: implement bpf_send_signal() helper · 8b401f9e
      Committed by Yonghong Song
      This patch tries to solve the following specific use case.
      
      Currently, bpf program can already collect stack traces
      through kernel function get_perf_callchain()
      when certain events happens (e.g., cache miss counter or
      cpu clock counter overflows). But such stack traces are
      not enough for jitted programs, e.g., hhvm (jited php).
      To get real stack trace, jit engine internal data structures
      need to be traversed in order to get the real user functions.
      
      bpf program itself may not be the best place to traverse
      the jit engine as the traversing logic could be complex and
      it is not a stable interface either.
      
      Instead, hhvm implements a signal handler, e.g. for SIGALRM, and a set
      of program locations at which it can dump stack traces. When it receives
      a signal, it will dump the stack at the next such program location.
      
      Such a mechanism can be implemented in the following way:
        . a perf ring buffer is created between bpf program
          and tracing app.
        . once a particular event happens, bpf program writes
          to the ring buffer and the tracing app gets notified.
        . the tracing app sends a signal SIGALARM to the hhvm.
      
      But this method could incur large delays, causing the profiling results
      to be skewed.
      
      This patch implements bpf_send_signal() helper to send
      a signal to hhvm in real time, resulting in intended stack traces.
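
      A hedged sketch of how the helper might be used from a perf_event
      program (the signal number and the tgid gating are illustrative):

      	#include <linux/bpf.h>
      	#include <linux/bpf_perf_event.h>
      	#include <bpf/bpf_helpers.h>

      	#define SIGUSR1		10	/* assumed signal number */
      	#define TARGET_TGID	12345	/* tgid of the profiled process; illustrative */

      	SEC("perf_event")
      	int notify_on_event(struct bpf_perf_event_data *ctx)
      	{
      		/* Only poke the process being profiled. */
      		if ((bpf_get_current_pid_tgid() >> 32) != TARGET_TGID)
      			return 0;

      		/* Deliver the signal to the current task in real time. */
      		bpf_send_signal(SIGUSR1);
      		return 0;
      	}

      	char _license[] SEC("license") = "GPL";
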
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  9. 13 May 2019 (2 commits)
  10. 28 April 2019 (1 commit)
    • bpf: Introduce bpf sk local storage · 6ac99e8f
      Committed by Martin KaFai Lau
      After allowing a bpf prog to
      - directly read the skb->sk ptr
      - get the fullsock bpf_sock by "bpf_sk_fullsock()"
      - get the bpf_tcp_sock by "bpf_tcp_sock()"
      - get the listener sock by "bpf_get_listener_sock()"
      - avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock"
        into different bpf running contexts,
      
      this patch is another effort to make bpf's network programming
      more intuitive to do (together with memory and performance benefits).
      
      When bpf prog needs to store data for a sk, the current practice is to
      define a map with the usual 4-tuples (src/dst ip/port) as the key.
      If multiple bpf progs require to store different sk data, multiple maps
      have to be defined.  Hence, memory is wasted storing the duplicated
      keys (i.e. 4 tuples here) in each of the bpf maps.
      [ The smallest key could be the sk pointer itself which requires
        some enhancement in the verifier and it is a separate topic. ]
      
      Also, the bpf prog needs to clean up the elem when sk is freed.
      Otherwise, the bpf map will become full and unusable quickly.
      The sk-free tracking currently could be done during sk state
      transition (e.g. BPF_SOCK_OPS_STATE_CB).
      
      The size of the map needs to be predefined, which usually ends up
      with an over-provisioned map in production.  Even if the map were
      resizable, since sockets naturally come and go, this potential resize
      operation is arguably redundant if the data can be directly connected
      to the sk itself instead of proxy-ing through a bpf map.
      
      This patch introduces sk->sk_bpf_storage to provide local storage space
      at sk for bpf prog to use.  The space will be allocated when the first bpf
      prog has created data for this particular sk.
      
      The design optimizes the bpf prog's lookup (and then optionally followed by
      an inline update).  bpf_spin_lock should be used if the inline update needs
      to be protected.
      
      BPF_MAP_TYPE_SK_STORAGE:
      -----------------------
      To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in
      this patch) needs to be created.  Multiple BPF_MAP_TYPE_SK_STORAGE maps can
      be created to fit different bpf progs' needs.  The map enforces
      BTF to allow printing the sk-local-storage during a system-wise
      sk dump (e.g. "ss -ta") in the future.
      
      The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not to lookup/update/delete
      "sk-local-storage" data for a particular sk.
      Think of the map as a meta-data (or "type") of a "sk-local-storage".  This
      particular "type" of "sk-local-storage" data can then be stored in any sk.
      
      The main purposes of this map are mostly:
      1. Define the size of a "sk-local-storage" type.
      2. Provide a similar syscall userspace API as the map (e.g. lookup/update,
         map-id, map-btf...etc.)
      3. Keep track of all sk's storages of this "type" and clean them up
         when the map is freed.
      
      sk->sk_bpf_storage:
      ------------------
      The main lookup/update/delete is done on sk->sk_bpf_storage (which
      is a "struct bpf_sk_storage").  When doing a lookup,
      the "map" pointer is now used as the "key" to search on the
      sk_storage->list.  The "map" pointer is actually serving
      as the "type" of the "sk-local-storage" that is being
      requested.
      
      To allow very fast lookup, it should be as fast as looking up an
      array at a stable-offset.  At the same time, it is not ideal to
      set a hard limit on the number of sk-local-storage "type" that the
      system can have.  Hence, this patch takes a cache approach.
      The last search result from sk_storage->list is cached in
      sk_storage->cache[] which is a stable sized array.  Each
      "sk-local-storage" type has a stable offset to the cache[] array.
      In the future, a map's flag could be introduced to do cache
      opt-out/enforcement if it became necessary.
      
      The cache size is 16 (i.e. 16 types of "sk-local-storage").
      Programs can share a map.  On the program side, having a few bpf_progs
      running in the networking hotpath is already a lot.  The bpf_prog
      should have already consolidated the existing sock-key-ed map usage
      to minimize the map lookup penalty.  16 has enough runway to grow.
      
      All sk-local-storage data will be removed from sk->sk_bpf_storage
      during sk destruction.
      
      bpf_sk_storage_get() and bpf_sk_storage_delete():
      ------------------------------------------------
      Instead of using bpf_map_(lookup|update|delete)_elem(),
      the bpf prog needs to use the new helper bpf_sk_storage_get() and
      bpf_sk_storage_delete().  The verifier can then enforce the
      ARG_PTR_TO_SOCKET argument.  The bpf_sk_storage_get() also allows to
      "create" new elem if one does not exist in the sk.  It is done by
      the new BPF_SK_STORAGE_GET_F_CREATE flag.  An optional value can also be
      provided as the initial value during BPF_SK_STORAGE_GET_F_CREATE.
      The BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock.  Together,
      it has eliminated the potential use cases for an equivalent
      bpf_map_update_elem() API (for bpf_prog) in this patch.
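
      A sketch of how a program might use these helpers to keep a per-socket
      counter on the egress path (all names are illustrative):

      	#include <linux/bpf.h>
      	#include <bpf/bpf_helpers.h>

      	struct {
      		__uint(type, BPF_MAP_TYPE_SK_STORAGE);
      		__uint(map_flags, BPF_F_NO_PREALLOC);
      		__type(key, int);
      		__type(value, __u32);	/* a "u32 cnt" per socket */
      	} sk_cnt SEC(".maps");

      	SEC("cgroup_skb/egress")
      	int count_egress(struct __sk_buff *skb)
      	{
      		struct bpf_sock *sk = skb->sk;
      		__u32 *cnt;

      		if (!sk)
      			return 1;
      		sk = bpf_sk_fullsock(sk);
      		if (!sk)
      			return 1;

      		/* Create the per-sk storage on first use, then bump it. */
      		cnt = bpf_sk_storage_get(&sk_cnt, sk, 0, BPF_SK_STORAGE_GET_F_CREATE);
      		if (cnt)
      			__sync_fetch_and_add(cnt, 1);
      		return 1;
      	}

      	char _license[] SEC("license") = "GPL";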
      
      Misc notes:
      ----------
      1. map_get_next_key is not supported.  From the userspace syscall
         perspective,  the map has the socket fd as the key while the map
         can be shared by pinned-file or map-id.
      
         Since btf is enforced, the existing "ss" could be enhanced to pretty
         print the local-storage.
      
         Supporting a kernel defined btf with 4 tuples as the return key could
         be explored later also.
      
      2. The sk->sk_lock cannot be acquired.  Atomic operations are used instead.
         e.g. cmpxchg is done on the sk->sk_bpf_storage ptr.
         Please refer to the source code comments for the details in
         synchronization cases and considerations.
      
      3. The mem is charged to the sk->sk_omem_alloc as the sk filter does.
      
      Benchmark:
      ---------
      Here is the benchmark data collected by turning on
      the "kernel.bpf_stats_enabled" sysctl.
      Two bpf progs are tested:
      
      One bpf prog with the usual bpf hashmap (max_entries = 8192) with the
      sk ptr as the key. (The verifier is modified to support sk ptr as the key.
      That should have shortened the key lookup time.)
      
      Another bpf prog is with the new BPF_MAP_TYPE_SK_STORAGE.
      
      Both are storing a "u32 cnt", do a lookup on "egress_skb/cgroup" for
      each egress skb and then bump the cnt.  netperf is used to drive
      data with 4096 connected UDP sockets.
      
      BPF_MAP_TYPE_HASH with a modified verifier (152ns per bpf run)
      27: cgroup_skb  name egress_sk_map  tag 74f56e832918070b run_time_ns 58280107540 run_cnt 381347633
          loaded_at 2019-04-15T13:46:39-0700  uid 0
          xlated 344B  jited 258B  memlock 4096B  map_ids 16
          btf_id 5
      
      BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run)
      30: cgroup_skb  name egress_sk_stora  tag d4aa70984cc7bbf6 run_time_ns 25617093319 run_cnt 390989739
          loaded_at 2019-04-15T13:47:54-0700  uid 0
          xlated 168B  jited 156B  memlock 4096B  map_ids 17
          btf_id 6
      
      Here is a high-level picture on how are the objects organized:
      
             sk
          ┌──────┐
          │      │
          │      │
          │      │
          │*sk_bpf_storage───── bpf_sk_storage
          └──────┘                 ┌───────┐
                       ┌───────────┤ list  │
                       │           │       │
                       │           │       │
                       │           │       │
                       │           └───────┘
                       │
                       │     elem
                       │  ┌────────┐
                       ├─│ snode  │
                       │  ├────────┤
                       │  │  data  │          bpf_map
                       │  ├────────┤        ┌─────────┐
                       │  │map_node│─┬─────┤  list   │
                       │  └────────┘  │     │         │
                       │              │     │         │
                       │     elem     │     │         │
                       │  ┌────────┐  │     └─────────┘
                       └─│ snode  │  │
                          ├────────┤  │
         bpf_map          │  data  │  │
       ┌─────────┐        ├────────┤  │
       │  list   ├───────│map_node│  │
       │         │        └────────┘  │
       │         │                    │
       │         │           elem     │
       └─────────┘        ┌────────┐  │
                       ┌─│ snode  │  │
                       │  ├────────┤  │
                       │  │  data  │  │
                       │  ├────────┤  │
                       │  │map_node│─┘
                       │  └────────┘
                       │
                       │
                       │          ┌───────┐
           sk          └──────────│ list  │
        ┌──────┐                  │       │
        │      │                  │       │
        │      │                  │       │
        │      │                  └───────┘
        │*sk_bpf_storage───────bpf_sk_storage
        └──────┘
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  11. 27 April 2019 (1 commit)
    • bpf: add writable context for raw tracepoints · 9df1c28b
      Committed by Matt Mullins
      This is an opt-in interface that allows a tracepoint to provide a safe
      buffer that can be written from a BPF_PROG_TYPE_RAW_TRACEPOINT program.
      The size of the buffer must be a compile-time constant, and is checked
      before allowing a BPF program to attach to a tracepoint that uses this
      feature.
      
      The pointer to this buffer will be the first argument of tracepoints
      that opt in; the pointer is valid and can be bpf_probe_read() by both
      BPF_PROG_TYPE_RAW_TRACEPOINT and BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE
      programs that attach to such a tracepoint, but the buffer to which it
      points may only be written by the latter.
      Signed-off-by: Matt Mullins <mmullins@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  12. 17 April 2019 (1 commit)
  13. 16 April 2019 (1 commit)
  14. 13 April 2019 (6 commits)
    • bpf: Introduce bpf_strtol and bpf_strtoul helpers · d7a4cb9b
      Committed by Andrey Ignatov
      Add bpf_strtol and bpf_strtoul to convert a string to long and unsigned
      long respectively. It's similar to user space strtol(3) and
      strtoul(3) with a few changes to the API:
      
      * instead of NUL-terminated C string the helpers expect buffer and
        buffer length;
      
      * resulting long or unsigned long is returned in a separate
        result-argument;
      
      * return value is used to indicate success or failure, on success number
        of consumed bytes is returned that can be used to identify position to
        read next if the buffer is expected to contain multiple integers;
      
      * instead of a *base* argument, *flags* is used, which provides the base
        in the 5 LSBs; other bits are reserved for future use;
      
      * number of supported bases is limited.
      
      Documentation for the new helpers is provided in bpf.h UAPI.
      
      The helpers are made available to BPF_PROG_TYPE_CGROUP_SYSCTL programs to
      be able to convert string input to e.g. "ulongvec" output.
      
      E.g. "net/ipv4/tcp_mem" consists of three ulong integers. They can be
      parsed by calling to bpf_strtoul three times.
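
      A hedged sketch of such parsing inside a BPF_PROG_TYPE_CGROUP_SYSCTL
      program (buffer sizes, separator handling and the final policy are
      illustrative, not part of this patch):

      	#include <linux/bpf.h>
      	#include <bpf/bpf_helpers.h>

      	SEC("cgroup/sysctl")
      	int check_tcp_mem(struct bpf_sysctl *ctx)
      	{
      		unsigned long tcp_mem[3] = {};
      		char value[64] = {};
      		int i, off = 0, ret;

      		if (!ctx->write)
      			return 1;	/* only inspect writes */

      		ret = bpf_sysctl_get_new_value(ctx, value, sizeof(value));
      		if (ret < 0)
      			return 0;

      		/* "net/ipv4/tcp_mem" holds three separated ulongs. */
      		for (i = 0; i < 3; i++) {
      			ret = bpf_strtoul(value + off, sizeof(value) - off,
      					  0 /* 0: auto-detect base (flags carry the base in the 5 LSBs) */,
      					  &tcp_mem[i]);
      			if (ret <= 0)
      				return 0;	/* malformed value: reject the write */
      			off += ret + 1;		/* skip the number and one separator */
      		}

      		/* e.g. reject suspiciously small limits */
      		return tcp_mem[0] >= 4096 ? 1 : 0;
      	}

      	char _license[] SEC("license") = "GPL";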
      
      Implementation notes:
      
      Implementation includes "../../lib/kstrtox.h" to reuse integer parsing
      functions. It's done exactly the same way as fs/proc/base.c already does.
      
      Unfortunately the existing kstrtoX functions can't be used directly since
      they fail if any invalid character is present right after the integer in
      the string. The existing simple_strtoX functions can't be used either since
      they're obsolete and don't handle overflow properly.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add file_pos field to bpf_sysctl ctx · e1550bfe
      Committed by Andrey Ignatov
      Add file_pos field to bpf_sysctl context to read and write sysctl file
      position at which sysctl is being accessed (read or written).
      
      The field can be used to e.g. override whole sysctl value on write to
      sysctl even when sys_write is called by user space with file_pos > 0. Or
      BPF program may reject such accesses.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Introduce bpf_sysctl_{get,set}_new_value helpers · 4e63acdf
      Committed by Andrey Ignatov
      Add helpers to work with new value being written to sysctl by user
      space.
      
      bpf_sysctl_get_new_value() copies the value being written to the sysctl
      into the provided buffer.
      
      bpf_sysctl_set_new_value() overrides the new value being written by user
      space with the one from the provided buffer. The buffer should contain a
      string representation of the value, similar to what can be seen in /proc/sys/.
      
      Both helpers can be used only on sysctl write.
      
      File position matters and can be managed by an interface that will be
      introduced separately. E.g. if user space calls sys_write to a file in
      /proc/sys/ at file position = X, where X > 0, then the value set by
      bpf_sysctl_set_new_value() will be written starting from X. If the program
      wants to override the whole value with the specified buffer, the file
      position has to be set to zero, as in the sketch below.
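
      A small, hypothetical sketch of overriding the whole value (the
      replacement value and the policy are illustrative; file_pos comes from
      the companion patch in this series):

      	#include <linux/bpf.h>
      	#include <bpf/bpf_helpers.h>

      	SEC("cgroup/sysctl")
      	int clamp_value(struct bpf_sysctl *ctx)
      	{
      		char clamped[] = "1";

      		if (!ctx->write)
      			return 1;

      		/* Start at position 0 so the buffer replaces the whole value. */
      		ctx->file_pos = 0;
      		bpf_sysctl_set_new_value(ctx, clamped, sizeof(clamped) - 1);
      		return 1;
      	}

      	char _license[] SEC("license") = "GPL";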
      
      Documentation for the new helpers is provided in bpf.h UAPI.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Introduce bpf_sysctl_get_current_value helper · 1d11b301
      Committed by Andrey Ignatov
      Add bpf_sysctl_get_current_value() helper to copy the current sysctl value
      into a buffer provided by the BPF_PROG_TYPE_CGROUP_SYSCTL program.
      
      It provides the same string as user space can see by reading the
      corresponding file in /proc/sys/, including new line, etc.
      
      Documentation for the new helper is provided in bpf.h UAPI.
      
      Since current value is kept in ctl_table->data in a parsed form,
      ctl_table->proc_handler() with write=0 is called to read that data and
      convert it to a string. Such a string can later be parsed by a program
      using helpers that will be introduced separately.
      
      Unfortunately it's not trivial to provide API to access parsed data due to
      variety of data representations (string, intvec, uintvec, ulongvec,
      custom structures, even NULL, etc). Instead it's assumed that users know
      how to handle the specific sysctl they're interested in and the
      appropriate helpers can be used.
      
      Since ctl_table->proc_handler() expects a __user buffer, a conversion to
      __user is done for the kernel-allocated buffer where the value is stored.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Introduce bpf_sysctl_get_name helper · 808649fb
      Committed by Andrey Ignatov
      Add bpf_sysctl_get_name() helper to copy the sysctl name (/proc/sys/ entry)
      into a buffer provided by the BPF_PROG_TYPE_CGROUP_SYSCTL program.
      
      By default full name (w/o /proc/sys/) is copied, e.g. "net/ipv4/tcp_mem".
      
      If BPF_F_SYSCTL_BASE_NAME flag is set, only base name will be copied,
      e.g. "tcp_mem".
      
      Documentation for the new helper is provided in bpf.h UAPI.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Sysctl hook · 7b146ceb
      Committed by Andrey Ignatov
      Containerized applications may run as root and this may create problems
      for the whole host. Specifically, such applications may change a sysctl
      and affect applications in other containers.
      
      Furthermore, in existing infrastructure it may not be possible to just
      completely disable writing to sysctls; instead, such a process should be
      gradual, with the ability to log which sysctls are being changed by a
      container, investigate, limit the set of writable sysctls to the
      currently used ones (so that new ones cannot be changed) and eventually
      reduce this set to zero.
      
      The patch introduces new program type BPF_PROG_TYPE_CGROUP_SYSCTL and
      attach type BPF_CGROUP_SYSCTL to solve these problems on cgroup basis.
      
      The new program type has access to the following minimal context:
      	struct bpf_sysctl {
      		__u32	write;
      	};
      
      Where @write indicates whether the sysctl is being read (= 0) or
      written (= 1).
      
      Helpers to access sysctl name and value will be introduced separately.
      
      BPF_CGROUP_SYSCTL attach point is added to the sysctl code right before
      passing control to ctl_table->proc_handler so that the BPF program can
      either allow or deny access to the sysctl, as in the minimal example below.
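
      A minimal illustrative program for this hook, denying all sysctl writes
      from the cgroup while still allowing reads:

      	#include <linux/bpf.h>
      	#include <bpf/bpf_helpers.h>

      	SEC("cgroup/sysctl")
      	int read_only_sysctl(struct bpf_sysctl *ctx)
      	{
      		/* 1 = allow the access, 0 = deny it. */
      		return ctx->write ? 0 : 1;
      	}

      	char _license[] SEC("license") = "GPL";
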
      Suggested-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  15. 12 April 2019 (1 commit)
    • bpf: add layer 2 encap support to bpf_skb_adjust_room · 58dfc900
      Committed by Alan Maguire
      commit 868d5235 ("bpf: add bpf_skb_adjust_room encap flags")
      introduced support to bpf_skb_adjust_room for GSO-friendly GRE
      and UDP encapsulation.
      
      For GSO to work for skbs, the inner headers (mac and network) need to
      be marked.  For L3 encapsulation using bpf_skb_adjust_room, the mac
      and network headers are identical.  Here we provide a way of specifying
      the inner mac header length for cases where L2 encap is desired.  Such
      an approach can support encapsulated ethernet headers, MPLS headers etc.
      For example to convert from a packet of form [eth][ip][tcp] to
      [eth][ip][udp][inner mac][ip][tcp], something like the following could
      be done:
      
      	headroom = sizeof(iph) + sizeof(struct udphdr) + inner_maclen;
      
      	ret = bpf_skb_adjust_room(skb, headroom, BPF_ADJ_ROOM_MAC,
      				  BPF_F_ADJ_ROOM_ENCAP_L4_UDP |
      				  BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 |
      				  BPF_F_ADJ_ROOM_ENCAP_L2(inner_maclen));
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  16. 11 April 2019 (1 commit)
    • bpf: support input __sk_buff context in BPF_PROG_TEST_RUN · b0b9395d
      Committed by Stanislav Fomichev
      Add new set of arguments to bpf_attr for BPF_PROG_TEST_RUN:
      * ctx_in/ctx_size_in - input context
      * ctx_out/ctx_size_out - output context
      
      The intended use case is to pass some meta data to the test runs that
      operate on skb (this has been brought up at a recent LPC).
      
      For programs that use bpf_prog_test_run_skb, support __sk_buff input and
      output. Initially, from input __sk_buff, copy _only_ cb and priority into
      skb, all other non-zero fields are prohibited (with EINVAL).
      If the user has set ctx_out/ctx_size_out, copy the potentially modified
      __sk_buff back to the userspace.
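
      A userspace sketch of exercising the new attributes via the raw bpf(2)
      syscall (prog_fd, pkt and pkt_len are placeholders):

      	#include <linux/bpf.h>
      	#include <sys/syscall.h>
      	#include <unistd.h>

      	int test_run_with_ctx(int prog_fd, void *pkt, __u32 pkt_len)
      	{
      		struct __sk_buff ctx_in = {}, ctx_out = {};
      		union bpf_attr attr = {};

      		ctx_in.cb[0] = 42;	/* meta data consumed by the program */
      		ctx_in.priority = 1;

      		attr.test.prog_fd      = prog_fd;
      		attr.test.data_in      = (unsigned long)pkt;
      		attr.test.data_size_in = pkt_len;
      		attr.test.ctx_in       = (unsigned long)&ctx_in;
      		attr.test.ctx_size_in  = sizeof(ctx_in);
      		attr.test.ctx_out      = (unsigned long)&ctx_out;
      		attr.test.ctx_size_out = sizeof(ctx_out);

      		return syscall(__NR_bpf, BPF_PROG_TEST_RUN, &attr, sizeof(attr));
      	}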
      
      We require all fields of input __sk_buff except the ones we explicitly
      support to be set to zero. The expectation is that in the future we might
      add support for more fields and we want to fail explicitly if the user
      runs the program on the kernel where we don't yet support them.
      
      The API is intentionally vague (i.e. we don't explicitly add __sk_buff
      to bpf_attr, but ctx_in) to potentially let other test_run types use
      this interface in the future (this can be xdp_md for xdp types for
      example).
      
      v4:
        * don't copy more than allowed in bpf_ctx_init [Martin]
      
      v3:
        * handle case where ctx_in is NULL, but ctx_out is not [Martin]
        * convert size==0 checks to ptr==NULL checks and add some extra ptr
          checks [Martin]
      
      v2:
        * Addressed comments from Martin Lau
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  17. 10 April 2019 (3 commits)
    • bpf: add syscall side map freeze support · 87df15de
      Committed by Daniel Borkmann
      This patch adds a new BPF_MAP_FREEZE command which allows to
      "freeze" the map globally as read-only / immutable from syscall
      side.
      
      Map permission handling has been refactored into map_get_sys_perms()
      and drops FMODE_CAN_WRITE in case of locked map. Main use case is
      to allow for setting up .rodata sections from the BPF ELF which
      are loaded into the kernel, meaning BPF loader first allocates
      map, sets up map value by copying .rodata section into it and once
      complete, it calls BPF_MAP_FREEZE on the map fd to prevent further
      modifications.
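
      As a sketch, a loader could freeze a populated map roughly like this
      (map_fd is a placeholder; the remaining bpf_attr fields stay zero as
      required):

      	#include <linux/bpf.h>
      	#include <sys/syscall.h>
      	#include <unistd.h>

      	int freeze_map(int map_fd)
      	{
      		union bpf_attr attr = {};

      		attr.map_fd = map_fd;	/* only the fd is consumed by BPF_MAP_FREEZE */

      		return syscall(__NR_bpf, BPF_MAP_FREEZE, &attr, sizeof(attr));
      	}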
      
      Right now BPF_MAP_FREEZE only takes map fd as argument while remaining
      bpf_attr members are required to be zero. I didn't add write-only
      locking here as a counterpart since I don't have a concrete use-case
      for it on my side, and I think it probably makes more sense to wait
      until there actually is one. In that case bpf_attr can be extended
      as usual with a flag field and/or others where flag 0 means that
      we lock the map read-only hence this doesn't prevent to add further
      extensions to BPF_MAP_FREEZE upon need.
      
      A map creation flag like BPF_F_WRONCE was not considered for a couple
      of reasons: i) in case of a generic implementation, a map can consist
      of more than just one element, thus there could be multiple map
      updates needed to set the map into a state where it can then be
      made immutable, ii) WRONCE indicates exact one-time write before
      it is then set immutable. A generic implementation would set a bit
      atomically on map update entry (if unset), indicating that every
      subsequent update from then onwards will need to bail out there.
      However, map updates can fail, so upon failure that flag would need
      to be unset again and the update attempt would need to be repeated
      for it to be eventually made immutable. While this can be made
      race-free, this approach feels less clean and in combination with
      reason i), it's not generic enough. A dedicated BPF_MAP_FREEZE
      command directly sets the flag and caller has the guarantee that
      map is immutable from syscall side upon successful return for any
      future syscall invocations that would alter the map state, which
      is also more intuitive from an API point of view. A command name
      such as BPF_MAP_LOCK has been avoided as it's too close with BPF
      map spin locks (which already has BPF_F_LOCK flag). BPF_MAP_FREEZE
      is so far only enabled for privileged users.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: add program side {rd, wr}only support for maps · 591fe988
      Committed by Daniel Borkmann
      This work adds two new map creation flags BPF_F_RDONLY_PROG
      and BPF_F_WRONLY_PROG in order to allow for read-only or
      write-only BPF maps from a BPF program side.
      
      Today we have BPF_F_RDONLY and BPF_F_WRONLY, but this only
      applies to system call side, meaning the BPF program has full
      read/write access to the map as usual while bpf(2) calls with
      map fd can either only read or write into the map depending
      on the flags. BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG allow
      for the exact opposite, such that the verifier is going to reject
      program loads if a write into a read-only map or a read from a
      write-only map is detected. For the read-only map case, helpers
      that would alter the map state, such as map deletion or update,
      are also forbidden. As opposed to the two
      BPF_F_RDONLY / BPF_F_WRONLY flags, BPF_F_RDONLY_PROG as well
      as BPF_F_WRONLY_PROG really do correspond to the map lifetime.
      
      We've enabled this generic map extension to various non-special
      maps holding normal user data: array, hash, lru, lpm, local
      storage, queue and stack. Further generic map types could be
      followed up in future depending on use-case. Main use case
      here is to forbid writes into .rodata map values from verifier
      side.
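
      For instance, a map could be declared program-side read-only roughly as
      follows (a libbpf-style declaration, shown only as an illustration):

      	#include <linux/bpf.h>
      	#include <bpf/bpf_helpers.h>

      	struct {
      		__uint(type, BPF_MAP_TYPE_ARRAY);
      		__uint(max_entries, 16);
      		__uint(map_flags, BPF_F_RDONLY_PROG);	/* writable via bpf(2), read-only from programs */
      		__type(key, __u32);
      		__type(value, __u64);
      	} config SEC(".maps");

      The verifier will then reject any program that tries to write to or
      delete elements of this map, while userspace keeps full access.
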
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: implement lookup-free direct value access for maps · d8eca5bb
      Committed by Daniel Borkmann
      This generic extension to BPF maps allows for directly loading
      an address residing inside a BPF map value as a single BPF
      ldimm64 instruction!
      
      The idea is similar to what BPF_PSEUDO_MAP_FD does today, which
      is a special src_reg flag for ldimm64 instruction that indicates
      that inside the first part of the double insns's imm field is a
      file descriptor which the verifier then replaces as a full 64bit
      address of the map into both imm parts. For the newly added
      BPF_PSEUDO_MAP_VALUE src_reg flag, the idea is the following:
      the first part of the double insns's imm field is again a file
      descriptor corresponding to the map, and the second part of the
      imm field is an offset into the value. The verifier will then
      replace both imm parts with an address that points into the BPF
      map value at the given value offset for maps that support this
      operation. Currently supported is array map with single entry.
      It is possible to support more than just single map element by
      reusing both 16bit off fields of the insns as a map index, so
      full array map lookup could be expressed that way. It hasn't
      been implemented here due to lack of concrete use case, but
      could easily be done so in future in a compatible way, since
      both off fields right now have to be 0 and would correctly
      denote a map index 0.
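
      Expressed as raw instructions, the encoding described above could look
      roughly like this (a sketch; map_fd and off are placeholders):

      	#include <linux/bpf.h>

      	/* Build the two halves of a ldimm64 that loads &value[0] + off into R1. */
      	void emit_ld_map_value(struct bpf_insn *insn, int map_fd, __u32 off)
      	{
      		insn[0] = (struct bpf_insn) {
      			.code    = BPF_LD | BPF_DW | BPF_IMM,
      			.dst_reg = BPF_REG_1,
      			.src_reg = BPF_PSEUDO_MAP_VALUE,	/* new src_reg flag */
      			.imm     = map_fd,			/* first imm: map fd */
      		};
      		insn[1] = (struct bpf_insn) {
      			.imm     = off,				/* second imm: offset into the value */
      		};
      	}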
      
      The BPF_PSEUDO_MAP_VALUE is a distinct flag as otherwise with
      BPF_PSEUDO_MAP_FD we could not differ offset 0 between load of
      map pointer versus load of map's value at offset 0, and changing
      BPF_PSEUDO_MAP_FD's encoding into off by one to differ between
      regular map pointer and map value pointer would add unnecessary
      complexity and increase the barrier for debuggability, and is thus less
      suitable. Using the second part of the imm field as an offset
      into the value does /not/ come with limitations since maximum
      possible value size is in u32 universe anyway.
      
      This optimization allows for efficiently retrieving an address
      to a map value memory area without having to issue a helper call
      which needs to prepare registers according to calling convention,
      etc, without needing the extra NULL test, and without having to
      add the offset in an additional instruction to the value base
      pointer. The verifier then treats the destination register as
      PTR_TO_MAP_VALUE with constant reg->off from the user passed
      offset from the second imm field, and guarantees that this is
      within bounds of the map value. Any subsequent operations are
      normally treated as typical map value handling without anything
      extra needed from verification side.
      
      The two map operations for direct value access have been added to
      array map for now. In future other types could be supported as
      well depending on the use case. The main use case for this commit
      is to allow for BPF loader support for global variables that
      reside in .data/.rodata/.bss sections such that we can directly
      load the address of them with minimal additional infrastructure
      required. Loader support has been added in subsequent commits for
      libbpf library.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  18. 23 March 2019 (3 commits)
  19. 22 March 2019 (2 commits)
  20. 15 March 2019 (2 commits)
  21. 14 March 2019 (1 commit)
    • bpf: Add bpf_get_listener_sock(struct bpf_sock *sk) helper · dbafd7dd
      Committed by Martin KaFai Lau
      Add a new helper "struct bpf_sock *bpf_get_listener_sock(struct bpf_sock *sk)"
      which returns a bpf_sock in TCP_LISTEN state.  It will trace back to
      the listener sk from a request_sock if possible.  It returns NULL
      for all other cases.
      
      No reference is taken because the helper ensures the sk is
      in SOCK_RCU_FREE (where the TCP_LISTEN sock should be in).
      Hence, bpf_sk_release() is unnecessary and the verifier does not
      allow bpf_sk_release(listen_sk) to be called either.
      
      The following is also allowed because the bpf_prog is run under
      rcu_read_lock():
      
      	sk = bpf_sk_lookup_tcp();
      	/* if (!sk) { ... } */
      	listen_sk = bpf_get_listener_sock(sk);
      	/* if (!listen_sk) { ... } */
      	bpf_sk_release(sk);
      	src_port = listen_sk->src_port; /* Allowed */
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  22. 03 March 2019 (1 commit)
    • bpf: add bpf helper bpf_skb_ecn_set_ce · f7c917ba
      Committed by brakmo
      This patch adds a new bpf helper BPF_FUNC_skb_ecn_set_ce
      "int bpf_skb_ecn_set_ce(struct sk_buff *skb)". It is added to
      BPF_PROG_TYPE_CGROUP_SKB typed bpf_prog which currently can
      be attached to the ingress and egress path. The helper is needed
      because this type of bpf_prog cannot modify the skb directly.
      
      This helper is used to set the ECN field of ECN capable IP packets to ce
      (congestion encountered) in the IPv6 or IPv4 header of the skb. It can be
      used by a bpf_prog to manage egress or ingress network bandwidth limit
      per cgroupv2 by inducing an ECN response in the TCP sender.
      This works best when using DCTCP.
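
      A bare-bones sketch of calling the helper from a cgroup skb program
      (a real program would gate the call on a bandwidth estimate):

      	#include <linux/bpf.h>
      	#include <bpf/bpf_helpers.h>

      	SEC("cgroup_skb/egress")
      	int mark_ce(struct __sk_buff *skb)
      	{
      		/* Only ECN-capable (ECT) IPv4/IPv6 packets are actually modified. */
      		bpf_skb_ecn_set_ce(skb);
      		return 1;
      	}

      	char _license[] SEC("license") = "GPL";
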
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  23. 28 February 2019 (1 commit)
  24. 14 February 2019 (1 commit)
    • bpf: add plumbing for BPF_LWT_ENCAP_IP in bpf_lwt_push_encap · 3e0bd37c
      Committed by Peter Oskolkov
      This patch adds all needed plumbing in preparation for allowing
      bpf programs to do IP encapsulation via bpf_lwt_push_encap. The actual
      implementation is added in the next patch in the patchset.
      
      Of note:
      - bpf_lwt_push_encap can now be called from BPF_PROG_TYPE_LWT_XMIT
        prog types in addition to BPF_PROG_TYPE_LWT_IN;
      - if the skb being encapped has GSO set, encapsulation is limited
        to IPIP/IP+GRE/IP+GUE (both IPv4 and IPv6);
      - as route lookups are different for ingress vs egress, the single
        external bpf_lwt_push_encap BPF helper is routed internally to
        either bpf_lwt_in_push_encap or bpf_lwt_xmit_push_encap BPF_CALLs,
        depending on prog type.
      
      v8 changes: fixed a typo.
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>