1. 29 4月, 2020 2 次提交
  2. 27 4月, 2020 1 次提交
  3. 31 3月, 2020 2 次提交
    • A
      bpf: Implement bpf_prog replacement for an active bpf_cgroup_link · 0c991ebc
      Andrii Nakryiko 提交于
      Add new operation (LINK_UPDATE), which allows to replace active bpf_prog from
      under given bpf_link. Currently this is only supported for bpf_cgroup_link,
      but will be extended to other kinds of bpf_links in follow-up patches.
      
      For bpf_cgroup_link, implemented functionality matches existing semantics for
      direct bpf_prog attachment (including BPF_F_REPLACE flag). User can either
      unconditionally set new bpf_prog regardless of which bpf_prog is currently
      active under given bpf_link, or, optionally, can specify expected active
      bpf_prog. If active bpf_prog doesn't match expected one, no changes are
      performed, old bpf_link stays intact and attached, operation returns
      a failure.
      
      cgroup_bpf_replace() operation is resolving race between auto-detachment and
      bpf_prog update in the same fashion as it's done for bpf_link detachment,
      except in this case update has no way of succeeding because of target cgroup
      marked as dying. So in this case error is returned.
      Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200330030001.2312810-3-andriin@fb.com
      0c991ebc
    • A
      bpf: Implement bpf_link-based cgroup BPF program attachment · af6eea57
      Andrii Nakryiko 提交于
      Implement new sub-command to attach cgroup BPF programs and return FD-based
      bpf_link back on success. bpf_link, once attached to cgroup, cannot be
      replaced, except by owner having its FD. Cgroup bpf_link supports only
      BPF_F_ALLOW_MULTI semantics. Both link-based and prog-based BPF_F_ALLOW_MULTI
      attachments can be freely intermixed.
      
      To prevent bpf_cgroup_link from keeping cgroup alive past the point when no
      BPF program can be executed, implement auto-detachment of link. When
      cgroup_bpf_release() is called, all attached bpf_links are forced to release
      cgroup refcounts, but they leave bpf_link otherwise active and allocated, as
      well as still owning underlying bpf_prog. This is because user-space might
      still have FDs open and active, so bpf_link as a user-referenced object can't
      be freed yet. Once last active FD is closed, bpf_link will be freed and
      underlying bpf_prog refcount will be dropped. But cgroup refcount won't be
      touched, because cgroup is released already.
      
      The inherent race between bpf_cgroup_link release (from closing last FD) and
      cgroup_bpf_release() is resolved by both operations taking cgroup_mutex. So
      the only additional check required is when bpf_cgroup_link attempts to detach
      itself from cgroup. At that time we need to check whether there is still
      cgroup associated with that link. And if not, exit with success, because
      bpf_cgroup_link was already successfully detached.
      Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Link: https://lore.kernel.org/bpf/20200330030001.2312810-2-andriin@fb.com
      af6eea57
  4. 28 2月, 2020 1 次提交
  5. 20 12月, 2019 1 次提交
    • A
      bpf: Support replacing cgroup-bpf program in MULTI mode · 7dd68b32
      Andrey Ignatov 提交于
      The common use-case in production is to have multiple cgroup-bpf
      programs per attach type that cover multiple use-cases. Such programs
      are attached with BPF_F_ALLOW_MULTI and can be maintained by different
      people.
      
      Order of programs usually matters, for example imagine two egress
      programs: the first one drops packets and the second one counts packets.
      If they're swapped the result of counting program will be different.
      
      It brings operational challenges with updating cgroup-bpf program(s)
      attached with BPF_F_ALLOW_MULTI since there is no way to replace a
      program:
      
      * One way to update is to detach all programs first and then attach the
        new version(s) again in the right order. This introduces an
        interruption in the work a program is doing and may not be acceptable
        (e.g. if it's egress firewall);
      
      * Another way is attach the new version of a program first and only then
        detach the old version. This introduces the time interval when two
        versions of same program are working, what may not be acceptable if a
        program is not idempotent. It also imposes additional burden on
        program developers to make sure that two versions of their program can
        co-exist.
      
      Solve the problem by introducing a "replace" mode in BPF_PROG_ATTACH
      command for cgroup-bpf programs being attached with BPF_F_ALLOW_MULTI
      flag. This mode is enabled by newly introduced BPF_F_REPLACE attach flag
      and bpf_attr.replace_bpf_fd attribute to pass fd of the old program to
      replace
      
      That way user can replace any program among those attached with
      BPF_F_ALLOW_MULTI flag without the problems described above.
      
      Details of the new API:
      
      * If BPF_F_REPLACE is set but replace_bpf_fd doesn't have valid
        descriptor of BPF program, BPF_PROG_ATTACH will return corresponding
        error (EINVAL or EBADF).
      
      * If replace_bpf_fd has valid descriptor of BPF program but such a
        program is not attached to specified cgroup, BPF_PROG_ATTACH will
        return ENOENT.
      
      BPF_F_REPLACE is introduced to make the user intent clear, since
      replace_bpf_fd alone can't be used for this (its default value, 0, is a
      valid fd). BPF_F_REPLACE also makes it possible to extend the API in the
      future (e.g. add BPF_F_BEFORE and BPF_F_AFTER if needed).
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAndrii Narkyiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/30cd850044a0057bdfcaaf154b7d2f39850ba813.1576741281.git.rdna@fb.com
      7dd68b32
  6. 18 12月, 2019 1 次提交
  7. 28 6月, 2019 1 次提交
    • S
      bpf: implement getsockopt and setsockopt hooks · 0d01da6a
      Stanislav Fomichev 提交于
      Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
      BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
      
      BPF_CGROUP_SETSOCKOPT can modify user setsockopt arguments before
      passing them down to the kernel or bypass kernel completely.
      BPF_CGROUP_GETSOCKOPT can can inspect/modify getsockopt arguments that
      kernel returns.
      Both hooks reuse existing PTR_TO_PACKET{,_END} infrastructure.
      
      The buffer memory is pre-allocated (because I don't think there is
      a precedent for working with __user memory from bpf). This might be
      slow to do for each {s,g}etsockopt call, that's why I've added
      __cgroup_bpf_prog_array_is_empty that exits early if there is nothing
      attached to a cgroup. Note, however, that there is a race between
      __cgroup_bpf_prog_array_is_empty and BPF_PROG_RUN_ARRAY where cgroup
      program layout might have changed; this should not be a problem
      because in general there is a race between multiple calls to
      {s,g}etsocktop and user adding/removing bpf progs from a cgroup.
      
      The return code of the BPF program is handled as follows:
      * 0: EPERM
      * 1: success, continue with next BPF program in the cgroup chain
      
      v9:
      * allow overwriting setsockopt arguments (Alexei Starovoitov):
        * use set_fs (same as kernel_setsockopt)
        * buffer is always kzalloc'd (no small on-stack buffer)
      
      v8:
      * use s32 for optlen (Andrii Nakryiko)
      
      v7:
      * return only 0 or 1 (Alexei Starovoitov)
      * always run all progs (Alexei Starovoitov)
      * use optval=0 as kernel bypass in setsockopt (Alexei Starovoitov)
        (decided to use optval=-1 instead, optval=0 might be a valid input)
      * call getsockopt hook after kernel handlers (Alexei Starovoitov)
      
      v6:
      * rework cgroup chaining; stop as soon as bpf program returns
        0 or 2; see patch with the documentation for the details
      * drop Andrii's and Martin's Acked-by (not sure they are comfortable
        with the new state of things)
      
      v5:
      * skip copy_to_user() and put_user() when ret == 0 (Martin Lau)
      
      v4:
      * don't export bpf_sk_fullsock helper (Martin Lau)
      * size != sizeof(__u64) for uapi pointers (Martin Lau)
      * offsetof instead of bpf_ctx_range when checking ctx access (Martin Lau)
      
      v3:
      * typos in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY comments (Andrii Nakryiko)
      * reverse christmas tree in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY (Andrii
        Nakryiko)
      * use __bpf_md_ptr instead of __u32 for optval{,_end} (Martin Lau)
      * use BPF_FIELD_SIZEOF() for consistency (Martin Lau)
      * new CG_SOCKOPT_ACCESS macro to wrap repeated parts
      
      v2:
      * moved bpf_sockopt_kern fields around to remove a hole (Martin Lau)
      * aligned bpf_sockopt_kern->buf to 8 bytes (Martin Lau)
      * bpf_prog_array_is_empty instead of bpf_prog_array_length (Martin Lau)
      * added [0,2] return code check to verifier (Martin Lau)
      * dropped unused buf[64] from the stack (Martin Lau)
      * use PTR_TO_SOCKET for bpf_sockopt->sk (Martin Lau)
      * dropped bpf_target_off from ctx rewrites (Martin Lau)
      * use return code for kernel bypass (Martin Lau & Andrii Nakryiko)
      
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: Martin Lau <kafai@fb.com>
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      0d01da6a
  8. 07 6月, 2019 1 次提交
    • D
      bpf: fix unconnected udp hooks · 983695fa
      Daniel Borkmann 提交于
      Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
      to applications as also stated in original motivation in 7828f20e ("Merge
      branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
      two hooks into Cilium to enable host based load-balancing with Kubernetes,
      I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes
      typically sets up DNS as a service and is thus subject to load-balancing.
      
      Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
      is currently insufficient and thus not usable as-is for standard applications
      shipped with most distros. To break down the issue we ran into with a simple
      example:
      
        # cat /etc/resolv.conf
        nameserver 147.75.207.207
        nameserver 147.75.207.208
      
      For the purpose of a simple test, we set up above IPs as service IPs and
      transparently redirect traffic to a different DNS backend server for that
      node:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      The attached BPF program is basically selecting one of the backends if the
      service IP/port matches on the cgroup hook. DNS breaks here, because the
      hooks are not transparent enough to applications which have built-in msg_name
      address checks:
      
        # nslookup 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
        ;; connection timed out; no servers could be reached
      
        # dig 1.1.1.1
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
        ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
        [...]
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; connection timed out; no servers could be reached
      
      For comparison, if none of the service IPs is used, and we tell nslookup
      to use 8.8.8.8 directly it works just fine, of course:
      
        # nslookup 1.1.1.1 8.8.8.8
        1.1.1.1.in-addr.arpa	name = one.one.one.one.
      
      In order to fix this and thus act more transparent to the application,
      this needs reverse translation on recvmsg() side. A minimal fix for this
      API is to add similar recvmsg() hooks behind the BPF cgroups static key
      such that the program can track state and replace the current sockaddr_in{,6}
      with the original service IP. From BPF side, this basically tracks the
      service tuple plus socket cookie in an LRU map where the reverse NAT can
      then be retrieved via map value as one example. Side-note: the BPF cgroups
      static key should be converted to a per-hook static key in future.
      
      Same example after this fix:
      
        # cilium service list
        ID   Frontend            Backend
        1    147.75.207.207:53   1 => 8.8.8.8:53
        2    147.75.207.208:53   1 => 8.8.8.8:53
      
      Lookups work fine now:
      
        # nslookup 1.1.1.1
        1.1.1.1.in-addr.arpa    name = one.one.one.one.
      
        Authoritative answers can be found from:
      
        # dig 1.1.1.1
      
        ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
        ;; global options: +cmd
        ;; Got answer:
        ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
        ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
      
        ;; OPT PSEUDOSECTION:
        ; EDNS: version: 0, flags:; udp: 512
        ;; QUESTION SECTION:
        ;1.1.1.1.                       IN      A
      
        ;; AUTHORITY SECTION:
        .                       23426   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400
      
        ;; Query time: 17 msec
        ;; SERVER: 147.75.207.207#53(147.75.207.207)
        ;; WHEN: Tue May 21 12:59:38 UTC 2019
        ;; MSG SIZE  rcvd: 111
      
      And from an actual packet level it shows that we're using the back end
      server when talking via 147.75.207.20{7,8} front end:
      
        # tcpdump -i any udp
        [...]
        12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67)
        [...]
      
      In order to be flexible and to have same semantics as in sendmsg BPF
      programs, we only allow return codes in [1,1] range. In the sendmsg case
      the program is called if msg->msg_name is present which can be the case
      in both, connected and unconnected UDP.
      
      The former only relies on the sockaddr_in{,6} passed via connect(2) if
      passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar
      way to call into the BPF program whenever a non-NULL msg->msg_name was
      passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
      that for TCP case, the msg->msg_name is ignored in the regular recvmsg
      path and therefore not relevant.
      
      For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
      the hook is not called. This is intentional as it aligns with the same
      semantics as in case of TCP cgroup BPF hooks right now. This might be
      better addressed in future through a different bpf_attach_type such
      that this case can be distinguished from the regular recvmsg paths,
      for example.
      
      Fixes: 1cedee13 ("bpf: Hooks for sys_sendmsg")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAndrey Ignatov <rdna@fb.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NMartynas Pumputis <m@lambda.lt>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      983695fa
  9. 29 5月, 2019 2 次提交
    • S
      bpf: cgroup: properly use bpf_prog_array api · dbcc1ba2
      Stanislav Fomichev 提交于
      Now that we don't have __rcu markers on the bpf_prog_array helpers,
      let's use proper rcu_dereference_protected to obtain array pointer
      under mutex.
      
      We also don't need __rcu annotations on cgroup_bpf.inactive since
      it's not read/updated concurrently.
      
      v4:
      * drop cgroup_rcu_xyz wrappers and use rcu APIs directly; presumably
        should be more clear to understand which mutex/refcount protects
        each particular place
      
      v3:
      * amend cgroup_rcu_dereference to include percpu_ref_is_dying;
        cgroup_bpf is now reference counted and we don't hold cgroup_mutex
        anymore in cgroup_bpf_release
      
      v2:
      * replace xchg with rcu_swap_protected
      
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      dbcc1ba2
    • R
      bpf: decouple the lifetime of cgroup_bpf from cgroup itself · 4bfc0bb2
      Roman Gushchin 提交于
      Currently the lifetime of bpf programs attached to a cgroup is bound
      to the lifetime of the cgroup itself. It means that if a user
      forgets (or intentionally avoids) to detach a bpf program before
      removing the cgroup, it will stay attached up to the release of the
      cgroup. Since the cgroup can stay in the dying state (the state
      between being rmdir()'ed and being released) for a very long time, it
      leads to a waste of memory. Also, it blocks a possibility to implement
      the memcg-based memory accounting for bpf objects, because a circular
      reference dependency will occur. Charged memory pages are pinning the
      corresponding memory cgroup, and if the memory cgroup is pinning
      the attached bpf program, nothing will be ever released.
      
      A dying cgroup can not contain any processes, so the only chance for
      an attached bpf program to be executed is a live socket associated
      with the cgroup. So in order to release all bpf data early, let's
      count associated sockets using a new percpu refcounter. On cgroup
      removal the counter is transitioned to the atomic mode, and as soon
      as it reaches 0, all bpf programs are detached.
      
      Because cgroup_bpf_release() can block, it can't be called from
      the percpu ref counter callback directly, so instead an asynchronous
      work is scheduled.
      
      The reference counter is not socket specific, and can be used for any
      other types of programs, which can be executed from a cgroup-bpf hook
      outside of the process context, had such a need arise in the future.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Cc: jolsa@redhat.com
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      4bfc0bb2
  10. 13 4月, 2019 3 次提交
    • A
      bpf: Add file_pos field to bpf_sysctl ctx · e1550bfe
      Andrey Ignatov 提交于
      Add file_pos field to bpf_sysctl context to read and write sysctl file
      position at which sysctl is being accessed (read or written).
      
      The field can be used to e.g. override whole sysctl value on write to
      sysctl even when sys_write is called by user space with file_pos > 0. Or
      BPF program may reject such accesses.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      e1550bfe
    • A
      bpf: Introduce bpf_sysctl_{get,set}_new_value helpers · 4e63acdf
      Andrey Ignatov 提交于
      Add helpers to work with new value being written to sysctl by user
      space.
      
      bpf_sysctl_get_new_value() copies value being written to sysctl into
      provided buffer.
      
      bpf_sysctl_set_new_value() overrides new value being written by user
      space with a one from provided buffer. Buffer should contain string
      representation of the value, similar to what can be seen in /proc/sys/.
      
      Both helpers can be used only on sysctl write.
      
      File position matters and can be managed by an interface that will be
      introduced separately. E.g. if user space calls sys_write to a file in
      /proc/sys/ at file position = X, where X > 0, then the value set by
      bpf_sysctl_set_new_value() will be written starting from X. If program
      wants to override whole value with specified buffer, file position has
      to be set to zero.
      
      Documentation for the new helpers is provided in bpf.h UAPI.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      4e63acdf
    • A
      bpf: Sysctl hook · 7b146ceb
      Andrey Ignatov 提交于
      Containerized applications may run as root and it may create problems
      for whole host. Specifically such applications may change a sysctl and
      affect applications in other containers.
      
      Furthermore in existing infrastructure it may not be possible to just
      completely disable writing to sysctl, instead such a process should be
      gradual with ability to log what sysctl are being changed by a
      container, investigate, limit the set of writable sysctl to currently
      used ones (so that new ones can not be changed) and eventually reduce
      this set to zero.
      
      The patch introduces new program type BPF_PROG_TYPE_CGROUP_SYSCTL and
      attach type BPF_CGROUP_SYSCTL to solve these problems on cgroup basis.
      
      New program type has access to following minimal context:
      	struct bpf_sysctl {
      		__u32	write;
      	};
      
      Where @write indicates whether sysctl is being read (= 0) or written (=
      1).
      
      Helpers to access sysctl name and value will be introduced separately.
      
      BPF_CGROUP_SYSCTL attach point is added to sysctl code right before
      passing control to ctl_table->proc_handler so that BPF program can
      either allow or deny access to sysctl.
      Suggested-by: NRoman Gushchin <guro@fb.com>
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      7b146ceb
  11. 09 3月, 2019 1 次提交
  12. 31 1月, 2019 1 次提交
  13. 01 10月, 2018 3 次提交
    • R
      bpf: introduce per-cpu cgroup local storage · b741f163
      Roman Gushchin 提交于
      This commit introduced per-cpu cgroup local storage.
      
      Per-cpu cgroup local storage is very similar to simple cgroup storage
      (let's call it shared), except all the data is per-cpu.
      
      The main goal of per-cpu variant is to implement super fast
      counters (e.g. packet counters), which don't require neither
      lookups, neither atomic operations.
      
      >From userspace's point of view, accessing a per-cpu cgroup storage
      is similar to other per-cpu map types (e.g. per-cpu hashmaps and
      arrays).
      
      Writing to a per-cpu cgroup storage is not atomic, but is performed
      by copying longs, so some minimal atomicity is here, exactly
      as with other per-cpu maps.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      b741f163
    • R
      bpf: rework cgroup storage pointer passing · f294b37e
      Roman Gushchin 提交于
      To simplify the following introduction of per-cpu cgroup storage,
      let's rework a bit a mechanism of passing a pointer to a cgroup
      storage into the bpf_get_local_storage(). Let's save a pointer
      to the corresponding bpf_cgroup_storage structure, instead of
      a pointer to the actual buffer.
      
      It will help us to handle per-cpu storage later, which has
      a different way of accessing to the actual data.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      f294b37e
    • R
      bpf: extend cgroup bpf core to allow multiple cgroup storage types · 8bad74f9
      Roman Gushchin 提交于
      In order to introduce per-cpu cgroup storage, let's generalize
      bpf cgroup core to support multiple cgroup storage types.
      Potentially, per-node cgroup storage can be added later.
      
      This commit is mostly a formal change that replaces
      cgroup_storage pointer with a array of cgroup_storage pointers.
      It doesn't actually introduce a new storage type,
      it will be done later.
      
      Each bpf program is now able to have one cgroup storage of each type.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      8bad74f9
  14. 03 8月, 2018 3 次提交
  15. 09 7月, 2018 1 次提交
    • R
      bpf: include errno.h from bpf-cgroup.h · f292b87d
      Roman Gushchin 提交于
      Commit fdb5c453 ("bpf: fix attach type BPF_LIRC_MODE2 dependency
      wrt CONFIG_CGROUP_BPF") caused some build issues, detected by 0-DAY
      kernel test infrastructure.
      
      The problem is that cgroup_bpf_prog_attach/detach/query() functions
      can return -EINVAL error code, which is not defined. Fix this adding
      errno.h to includes.
      
      Fixes: fdb5c453 ("bpf: fix attach type BPF_LIRC_MODE2 dependency wrt CONFIG_CGROUP_BPF")
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Cc: Sean Young <sean@mess.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      f292b87d
  16. 26 6月, 2018 1 次提交
    • S
      bpf: fix attach type BPF_LIRC_MODE2 dependency wrt CONFIG_CGROUP_BPF · fdb5c453
      Sean Young 提交于
      If the kernel is compiled with CONFIG_CGROUP_BPF not enabled, it is not
      possible to attach, detach or query IR BPF programs to /dev/lircN devices,
      making them impossible to use. For embedded devices, it should be possible
      to use IR decoding without cgroups or CONFIG_CGROUP_BPF enabled.
      
      This change requires some refactoring, since bpf_prog_{attach,detach,query}
      functions are now always compiled, but their code paths for cgroups need
      moving out. Rather than a #ifdef CONFIG_CGROUP_BPF in kernel/bpf/syscall.c,
      moving them to kernel/bpf/cgroup.c and kernel/bpf/sockmap.c does not
      require #ifdefs since that is already conditionally compiled.
      
      Fixes: f4364dcf ("media: rc: introduce BPF_PROG_LIRC_MODE2")
      Signed-off-by: NSean Young <sean@mess.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      fdb5c453
  17. 28 5月, 2018 2 次提交
    • A
      bpf: Hooks for sys_sendmsg · 1cedee13
      Andrey Ignatov 提交于
      In addition to already existing BPF hooks for sys_bind and sys_connect,
      the patch provides new hooks for sys_sendmsg.
      
      It leverages existing BPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR`
      that provides access to socket itlself (properties like family, type,
      protocol) and user-passed `struct sockaddr *` so that BPF program can
      override destination IP and port for system calls such as sendto(2) or
      sendmsg(2) and/or assign source IP to the socket.
      
      The hooks are implemented as two new attach types:
      `BPF_CGROUP_UDP4_SENDMSG` and `BPF_CGROUP_UDP6_SENDMSG` for UDPv4 and
      UDPv6 correspondingly.
      
      UDPv4 and UDPv6 separate attach types for same reason as sys_bind and
      sys_connect hooks, i.e. to prevent reading from / writing to e.g.
      user_ip6 fields when user passes sockaddr_in since it'd be out-of-bound.
      
      The difference with already existing hooks is sys_sendmsg are
      implemented only for unconnected UDP.
      
      For TCP it doesn't make sense to change user-provided `struct sockaddr *`
      at sendto(2)/sendmsg(2) time since socket either was already connected
      and has source/destination set or wasn't connected and call to
      sendto(2)/sendmsg(2) would lead to ENOTCONN anyway.
      
      Connected UDP is already handled by sys_connect hooks that can override
      source/destination at connect time and use fast-path later, i.e. these
      hooks don't affect UDP fast-path.
      
      Rewriting source IP is implemented differently than that in sys_connect
      hooks. When sys_sendmsg is used with unconnected UDP it doesn't work to
      just bind socket to desired local IP address since source IP can be set
      on per-packet basis by using ancillary data (cmsg(3)). So no matter if
      socket is bound or not, source IP has to be rewritten on every call to
      sys_sendmsg.
      
      To do so two new fields are added to UAPI `struct bpf_sock_addr`;
      * `msg_src_ip4` to set source IPv4 for UDPv4;
      * `msg_src_ip6` to set source IPv6 for UDPv6.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      1cedee13
    • A
      bpf: Define cgroup_bpf_enabled for CONFIG_CGROUP_BPF=n · 13193b0f
      Andrey Ignatov 提交于
      Static key is used to enable/disable cgroup-bpf related code paths at
      run time.
      
      Though it's not defined when cgroup-bpf is disabled at compile time,
      i.e. CONFIG_CGROUP_BPF=n, and if some code wants to use it, it has to do
      this:
      
      	#ifdef CONFIG_CGROUP_BPF
      		if (cgroup_bpf_enabled) {
      			/* ... some work ... */
      		}
      	#endif
      
      This code can be simplified by setting cgroup_bpf_enabled to 0 for
      CONFIG_CGROUP_BPF=n case:
      
      	if (cgroup_bpf_enabled) {
      		/* ... some work ... */
      	}
      
      And it aligns well with existing BPF_CGROUP_RUN_PROG_* macros that
      defined for both states of CONFIG_CGROUP_BPF.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      13193b0f
  18. 31 3月, 2018 3 次提交
    • A
      bpf: Post-hooks for sys_bind · aac3fc32
      Andrey Ignatov 提交于
      "Post-hooks" are hooks that are called right before returning from
      sys_bind. At this time IP and port are already allocated and no further
      changes to `struct sock` can happen before returning from sys_bind but
      BPF program has a chance to inspect the socket and change sys_bind
      result.
      
      Specifically it can e.g. inspect what port was allocated and if it
      doesn't satisfy some policy, BPF program can force sys_bind to fail and
      return EPERM to user.
      
      Another example of usage is recording the IP:port pair to some map to
      use it in later calls to sys_connect. E.g. if some TCP server inside
      cgroup was bound to some IP:port_n, it can be recorded to a map. And
      later when some TCP client inside same cgroup is trying to connect to
      127.0.0.1:port_n, BPF hook for sys_connect can override the destination
      and connect application to IP:port_n instead of 127.0.0.1:port_n. That
      helps forcing all applications inside a cgroup to use desired IP and not
      break those applications if they e.g. use localhost to communicate
      between each other.
      
      == Implementation details ==
      
      Post-hooks are implemented as two new attach types
      `BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND` for
      existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`.
      
      Separate attach types for IPv4 and IPv6 are introduced to avoid access
      to IPv6 field in `struct sock` from `inet_bind()` and to IPv4 field from
      `inet6_bind()` since those fields might not make sense in such cases.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      aac3fc32
    • A
      bpf: Hooks for sys_connect · d74bad4e
      Andrey Ignatov 提交于
      == The problem ==
      
      See description of the problem in the initial patch of this patch set.
      
      == The solution ==
      
      The patch provides much more reliable in-kernel solution for the 2nd
      part of the problem: making outgoing connecttion from desired IP.
      
      It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
      `BPF_CGROUP_INET6_CONNECT` for program type
      `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
      source and destination of a connection at connect(2) time.
      
      Local end of connection can be bound to desired IP using newly
      introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though,
      and doesn't support binding to port, i.e. leverages
      `IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
      * looking for a free port is expensive and can affect performance
        significantly;
      * there is no use-case for port.
      
      As for remote end (`struct sockaddr *` passed by user), both parts of it
      can be overridden, remote IP and remote port. It's useful if an
      application inside cgroup wants to connect to another application inside
      same cgroup or to itself, but knows nothing about IP assigned to the
      cgroup.
      
      Support is added for IPv4 and IPv6, for TCP and UDP.
      
      IPv4 and IPv6 have separate attach types for same reason as sys_bind
      hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields
      when user passes sockaddr_in since it'd be out-of-bound.
      
      == Implementation notes ==
      
      The patch introduces new field in `struct proto`: `pre_connect` that is
      a pointer to a function with same signature as `connect` but is called
      before it. The reason is in some cases BPF hooks should be called way
      before control is passed to `sk->sk_prot->connect`. Specifically
      `inet_dgram_connect` autobinds socket before calling
      `sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
      hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
      it'd cause double-bind. On the other hand `proto.pre_connect` provides a
      flexible way to add BPF hooks for connect only for necessary `proto` and
      call them at desired time before `connect`. Since `bpf_bind()` is
      allowed to bind only to IP and autobind in `inet_dgram_connect` binds
      only port there is no chance of double-bind.
      
      bpf_bind() sets `force_bind_address_no_port` to bind to only IP despite
      of value of `bind_address_no_port` socket field.
      
      bpf_bind() sets `with_lock` to `false` when calling to __inet_bind()
      and __inet6_bind() since all call-sites, where bpf_bind() is called,
      already hold socket lock.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      d74bad4e
    • A
      bpf: Hooks for sys_bind · 4fbac77d
      Andrey Ignatov 提交于
      == The problem ==
      
      There is a use-case when all processes inside a cgroup should use one
      single IP address on a host that has multiple IP configured.  Those
      processes should use the IP for both ingress and egress, for TCP and UDP
      traffic. So TCP/UDP servers should be bound to that IP to accept
      incoming connections on it, and TCP/UDP clients should make outgoing
      connections from that IP. It should not require changing application
      code since it's often not possible.
      
      Currently it's solved by intercepting glibc wrappers around syscalls
      such as `bind(2)` and `connect(2)`. It's done by a shared library that
      is preloaded for every process in a cgroup so that whenever TCP/UDP
      server calls `bind(2)`, the library replaces IP in sockaddr before
      passing arguments to syscall. When application calls `connect(2)` the
      library transparently binds the local end of connection to that IP
      (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty).
      
      Shared library approach is fragile though, e.g.:
      * some applications clear env vars (incl. `LD_PRELOAD`);
      * `/etc/ld.so.preload` doesn't help since some applications are linked
        with option `-z nodefaultlib`;
      * other applications don't use glibc and there is nothing to intercept.
      
      == The solution ==
      
      The patch provides much more reliable in-kernel solution for the 1st
      part of the problem: binding TCP/UDP servers on desired IP. It does not
      depend on application environment and implementation details (whether
      glibc is used or not).
      
      It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and
      attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND`
      (similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`).
      
      The new program type is intended to be used with sockets (`struct sock`)
      in a cgroup and provided by user `struct sockaddr`. Pointers to both of
      them are parts of the context passed to programs of newly added types.
      
      The new attach types provides hooks in `bind(2)` system call for both
      IPv4 and IPv6 so that one can write a program to override IP addresses
      and ports user program tries to bind to and apply such a program for
      whole cgroup.
      
      == Implementation notes ==
      
      [1]
      Separate attach types for `AF_INET` and `AF_INET6` are added
      intentionally to prevent reading/writing to offsets that don't make
      sense for corresponding socket family. E.g. if user passes `sockaddr_in`
      it doesn't make sense to read from / write to `user_ip6[]` context
      fields.
      
      [2]
      The write access to `struct bpf_sock_addr_kern` is implemented using
      special field as an additional "register".
      
      There are just two registers in `sock_addr_convert_ctx_access`: `src`
      with value to write and `dst` with pointer to context that can't be
      changed not to break later instructions. But the fields, allowed to
      write to, are not available directly and to access them address of
      corresponding pointer has to be loaded first. To get additional register
      the 1st not used by `src` and `dst` one is taken, its content is saved
      to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load
      address of pointer field, and finally the register's content is restored
      from the temporary field after writing `src` value.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      4fbac77d
  19. 23 2月, 2018 1 次提交
  20. 05 11月, 2017 1 次提交
  21. 02 11月, 2017 1 次提交
    • G
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman 提交于
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  22. 05 10月, 2017 2 次提交
    • A
      bpf: introduce BPF_PROG_QUERY command · 468e2f64
      Alexei Starovoitov 提交于
      introduce BPF_PROG_QUERY command to retrieve a set of either
      attached programs to given cgroup or a set of effective programs
      that will execute for events within a cgroup
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      for cgroup bits
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      468e2f64
    • A
      bpf: multi program support for cgroup+bpf · 324bda9e
      Alexei Starovoitov 提交于
      introduce BPF_F_ALLOW_MULTI flag that can be used to attach multiple
      bpf programs to a cgroup.
      
      The difference between three possible flags for BPF_PROG_ATTACH command:
      - NONE(default): No further bpf programs allowed in the subtree.
      - BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program,
        the program in this cgroup yields to sub-cgroup program.
      - BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program,
        that cgroup program gets run in addition to the program in this cgroup.
      
      NONE and BPF_F_ALLOW_OVERRIDE existed before. This patch doesn't
      change their behavior. It only clarifies the semantics in relation
      to new flag.
      
      Only one program is allowed to be attached to a cgroup with
      NONE or BPF_F_ALLOW_OVERRIDE flag.
      Multiple programs are allowed to be attached to a cgroup with
      BPF_F_ALLOW_MULTI flag. They are executed in FIFO order
      (those that were attached first, run first)
      The programs of sub-cgroup are executed first, then programs of
      this cgroup and then programs of parent cgroup.
      All eligible programs are executed regardless of return code from
      earlier programs.
      
      To allow efficient execution of multiple programs attached to a cgroup
      and to avoid penalizing cgroups without any programs attached
      introduce 'struct bpf_prog_array' which is RCU protected array
      of pointers to bpf programs.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      for cgroup bits
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      324bda9e
  23. 18 7月, 2017 1 次提交
  24. 02 7月, 2017 1 次提交
    • L
      bpf: BPF support for sock_ops · 40304b2a
      Lawrence Brakmo 提交于
      Created a new BPF program type, BPF_PROG_TYPE_SOCK_OPS, and a corresponding
      struct that allows BPF programs of this type to access some of the
      socket's fields (such as IP addresses, ports, etc.). It uses the
      existing bpf cgroups infrastructure so the programs can be attached per
      cgroup with full inheritance support. The program will be called at
      appropriate times to set relevant connections parameters such as buffer
      sizes, SYN and SYN-ACK RTOs, etc., based on connection information such
      as IP addresses, port numbers, etc.
      
      Alghough there are already 3 mechanisms to set parameters (sysctls,
      route metrics and setsockopts), this new mechanism provides some
      distinct advantages. Unlike sysctls, it can set parameters per
      connection. In contrast to route metrics, it can also use port numbers
      and information provided by a user level program. In addition, it could
      set parameters probabilistically for evaluation purposes (i.e. do
      something different on 10% of the flows and compare results with the
      other 90% of the flows). Also, in cases where IPv6 addresses contain
      geographic information, the rules to make changes based on the distance
      (or RTT) between the hosts are much easier than route metric rules and
      can be global. Finally, unlike setsockopt, it oes not require
      application changes and it can be updated easily at any time.
      
      Although the bpf cgroup framework already contains a sock related
      program type (BPF_PROG_TYPE_CGROUP_SOCK), I created the new type
      (BPF_PROG_TYPE_SOCK_OPS) beccause the existing type expects to be called
      only once during the connections's lifetime. In contrast, the new
      program type will be called multiple times from different places in the
      network stack code.  For example, before sending SYN and SYN-ACKs to set
      an appropriate timeout, when the connection is established to set
      congestion control, etc. As a result it has "op" field to specify the
      type of operation requested.
      
      The purpose of this new program type is to simplify setting connection
      parameters, such as buffer sizes, TCP's SYN RTO, etc. For example, it is
      easy to use facebook's internal IPv6 addresses to determine if both hosts
      of a connection are in the same datacenter. Therefore, it is easy to
      write a BPF program to choose a small SYN RTO value when both hosts are
      in the same datacenter.
      
      This patch only contains the framework to support the new BPF program
      type, following patches add the functionality to set various connection
      parameters.
      
      This patch defines a new BPF program type: BPF_PROG_TYPE_SOCKET_OPS
      and a new bpf syscall command to load a new program of this type:
      BPF_PROG_LOAD_SOCKET_OPS.
      
      Two new corresponding structs (one for the kernel one for the user/BPF
      program):
      
      /* kernel version */
      struct bpf_sock_ops_kern {
              struct sock *sk;
              __u32  op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
      };
      
      /* user version
       * Some fields are in network byte order reflecting the sock struct
       * Use the bpf_ntohl helper macro in samples/bpf/bpf_endian.h to
       * convert them to host byte order.
       */
      struct bpf_sock_ops {
              __u32 op;
              union {
                      __u32 reply;
                      __u32 replylong[4];
              };
              __u32 family;
              __u32 remote_ip4;     /* In network byte order */
              __u32 local_ip4;      /* In network byte order */
              __u32 remote_ip6[4];  /* In network byte order */
              __u32 local_ip6[4];   /* In network byte order */
              __u32 remote_port;    /* In network byte order */
              __u32 local_port;     /* In host byte horder */
      };
      
      Currently there are two types of ops. The first type expects the BPF
      program to return a value which is then used by the caller (or a
      negative value to indicate the operation is not supported). The second
      type expects state changes to be done by the BPF program, for example
      through a setsockopt BPF helper function, and they ignore the return
      value.
      
      The reply fields of the bpf_sockt_ops struct are there in case a bpf
      program needs to return a value larger than an integer.
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      40304b2a
  25. 13 2月, 2017 1 次提交
    • A
      bpf: introduce BPF_F_ALLOW_OVERRIDE flag · 7f677633
      Alexei Starovoitov 提交于
      If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command
      to the given cgroup the descendent cgroup will be able to override
      effective bpf program that was inherited from this cgroup.
      By default it's not passed, therefore override is disallowed.
      
      Examples:
      1.
      prog X attached to /A with default
      prog Y fails to attach to /A/B and /A/B/C
      Everything under /A runs prog X
      
      2.
      prog X attached to /A with allow_override.
      prog Y fails to attach to /A/B with default (non-override)
      prog M attached to /A/B with allow_override.
      Everything under /A/B runs prog M only.
      
      3.
      prog X attached to /A with allow_override.
      prog Y fails to attach to /A with default.
      The user has to detach first to switch the mode.
      
      In the future this behavior may be extended with a chain of
      non-overridable programs.
      
      Also fix the bug where detach from cgroup where nothing is attached
      was not throwing error. Return ENOENT in such case.
      
      Add several testcases and adjust libbpf.
      
      Fixes: 30070984 ("cgroup: add support for eBPF programs")
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NDaniel Mack <daniel@zonque.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7f677633
  26. 18 12月, 2016 1 次提交
  27. 03 12月, 2016 1 次提交