1. 20 5月, 2020 1 次提交
    • D
      bpf: Add get{peer, sock}name attach types for sock_addr · 1b66d253
      Daniel Borkmann 提交于
      As stated in 983695fa ("bpf: fix unconnected udp hooks"), the objective
      for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be
      transparent to applications. In Cilium we make use of these hooks [0] in
      order to enable E-W load balancing for existing Kubernetes service types
      for all Cilium managed nodes in the cluster. Those backends can be local
      or remote. The main advantage of this approach is that it operates as close
      as possible to the socket, and therefore allows to avoid packet-based NAT
      given in connect/sendmsg/recvmsg hooks we only need to xlate sock addresses.
      
      This also allows to expose NodePort services on loopback addresses in the
      host namespace, for example. As another advantage, this also efficiently
      blocks bind requests for applications in the host namespace for exposed
      ports. However, one missing item is that we also need to perform reverse
      xlation for inet{,6}_getname() hooks such that we can return the service
      IP/port tuple back to the application instead of the remote peer address.
      
      The vast majority of applications does not bother about getpeername(), but
      in a few occasions we've seen breakage when validating the peer's address
      since it returns unexpectedly the backend tuple instead of the service one.
      Therefore, this trivial patch allows to customise and adds a getpeername()
      as well as getsockname() BPF cgroup hook for both IPv4 and IPv6 in order
      to address this situation.
      
      Simple example:
      
        # ./cilium/cilium service list
        ID   Frontend     Service Type   Backend
        1    1.2.3.4:80   ClusterIP      1 => 10.0.0.10:80
      
      Before; curl's verbose output example, no getpeername() reverse xlation:
      
        # curl --verbose 1.2.3.4
        * Rebuilt URL to: 1.2.3.4/
        *   Trying 1.2.3.4...
        * TCP_NODELAY set
        * Connected to 1.2.3.4 (10.0.0.10) port 80 (#0)
        > GET / HTTP/1.1
        > Host: 1.2.3.4
        > User-Agent: curl/7.58.0
        > Accept: */*
        [...]
      
      After; with getpeername() reverse xlation:
      
        # curl --verbose 1.2.3.4
        * Rebuilt URL to: 1.2.3.4/
        *   Trying 1.2.3.4...
        * TCP_NODELAY set
        * Connected to 1.2.3.4 (1.2.3.4) port 80 (#0)
        > GET / HTTP/1.1
        >  Host: 1.2.3.4
        > User-Agent: curl/7.58.0
        > Accept: */*
        [...]
      
      Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed
      peer to the context similar as in inet{,6}_getname() fashion, but API-wise
      this is suboptimal as it always enforces programs having to test for ctx->peer
      which can easily be missed, hence BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split.
      Similarly, the checked return code is on tnum_range(1, 1), but if a use case
      comes up in future, it can easily be changed to return an error code instead.
      Helper and ctx member access is the same as with connect/sendmsg/etc hooks.
      
        [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.cSigned-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Acked-by: NAndrey Ignatov <rdna@fb.com>
      Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net
      1b66d253
  2. 15 5月, 2020 2 次提交
  3. 12 5月, 2020 1 次提交
  4. 10 5月, 2020 4 次提交
    • Y
      bpf: Add bpf_seq_printf and bpf_seq_write helpers · 492e639f
      Yonghong Song 提交于
      Two helpers bpf_seq_printf and bpf_seq_write, are added for
      writing data to the seq_file buffer.
      
      bpf_seq_printf supports common format string flag/width/type
      fields so at least I can get identical results for
      netlink and ipv6_route targets.
      
      For bpf_seq_printf and bpf_seq_write, return value -EOVERFLOW
      specifically indicates a write failure due to overflow, which
      means the object will be repeated in the next bpf invocation
      if object collection stays the same. Note that if the object
      collection is changed, depending how collection traversal is
      done, even if the object still in the collection, it may not
      be visited.
      
      For bpf_seq_printf, format %s, %p{i,I}{4,6} needs to
      read kernel memory. Reading kernel memory may fail in
      the following two cases:
        - invalid kernel address, or
        - valid kernel address but requiring a major fault
      If reading kernel memory failed, the %s string will be
      an empty string and %p{i,I}{4,6} will be all 0.
      Not returning error to bpf program is consistent with
      what bpf_trace_printk() does for now.
      
      bpf_seq_printf may return -EBUSY meaning that internal percpu
      buffer for memory copy of strings or other pointees is
      not available. Bpf program can return 1 to indicate it
      wants the same object to be repeated. Right now, this should not
      happen on no-RT kernels since migrate_disable(), which guards
      bpf prog call, calls preempt_disable().
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200509175914.2476661-1-yhs@fb.com
      492e639f
    • Y
      bpf: Create anonymous bpf iterator · ac51d99b
      Yonghong Song 提交于
      A new bpf command BPF_ITER_CREATE is added.
      
      The anonymous bpf iterator is seq_file based.
      The seq_file private data are referenced by targets.
      The bpf_iter infrastructure allocated additional space
      at seq_file->private before the space used by targets
      to store some meta data, e.g.,
        prog:       prog to run
        session_id: an unique id for each opened seq_file
        seq_num:    how many times bpf programs are queried in this session
        done_stop:  an internal state to decide whether bpf program
                    should be called in seq_ops->stop() or not
      
      The seq_num will start from 0 for valid objects.
      The bpf program may see the same seq_num more than once if
       - seq_file buffer overflow happens and the same object
         is retried by bpf_seq_read(), or
       - the bpf program explicitly requests a retry of the
         same object
      
      Since module is not supported for bpf_iter, all target
      registeration happens at __init time, so there is no
      need to change bpf_iter_unreg_target() as it is used
      mostly in error path of the init function at which time
      no bpf iterators have been created yet.
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200509175905.2475770-1-yhs@fb.com
      ac51d99b
    • Y
      bpf: Support bpf tracing/iter programs for BPF_LINK_CREATE · de4e05ca
      Yonghong Song 提交于
      Given a bpf program, the step to create an anonymous bpf iterator is:
        - create a bpf_iter_link, which combines bpf program and the target.
          In the future, there could be more information recorded in the link.
          A link_fd will be returned to the user space.
        - create an anonymous bpf iterator with the given link_fd.
      
      The bpf_iter_link can be pinned to bpffs mount file system to
      create a file based bpf iterator as well.
      
      The benefit to use of bpf_iter_link:
        - using bpf link simplifies design and implementation as bpf link
          is used for other tracing bpf programs.
        - for file based bpf iterator, bpf_iter_link provides a standard
          way to replace underlying bpf programs.
        - for both anonymous and free based iterators, bpf link query
          capability can be leveraged.
      
      The patch added support of tracing/iter programs for BPF_LINK_CREATE.
      A new link type BPF_LINK_TYPE_ITER is added to facilitate link
      querying. Currently, only prog_id is needed, so there is no
      additional in-kernel show_fdinfo() and fill_link_info() hook
      is needed for BPF_LINK_TYPE_ITER link.
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200509175901.2475084-1-yhs@fb.com
      de4e05ca
    • Y
      bpf: Allow loading of a bpf_iter program · 15d83c4d
      Yonghong Song 提交于
      A bpf_iter program is a tracing program with attach type
      BPF_TRACE_ITER. The load attribute
        attach_btf_id
      is used by the verifier against a particular kernel function,
      which represents a target, e.g., __bpf_iter__bpf_map
      for target bpf_map which is implemented later.
      
      The program return value must be 0 or 1 for now.
        0 : successful, except potential seq_file buffer overflow
            which is handled by seq_file reader.
        1 : request to restart the same object
      
      In the future, other return values may be used for filtering or
      teminating the iterator.
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200509175900.2474947-1-yhs@fb.com
      15d83c4d
  5. 09 5月, 2020 1 次提交
    • S
      bpf: Allow any port in bpf_bind helper · 8086fbaf
      Stanislav Fomichev 提交于
      We want to have a tighter control on what ports we bind to in
      the BPF_CGROUP_INET{4,6}_CONNECT hooks even if it means
      connect() becomes slightly more expensive. The expensive part
      comes from the fact that we now need to call inet_csk_get_port()
      that verifies that the port is not used and allocates an entry
      in the hash table for it.
      
      Since we can't rely on "snum || !bind_address_no_port" to prevent
      us from calling POST_BIND hook anymore, let's add another bind flag
      to indicate that the call site is BPF program.
      
      v5:
      * fix wrong AF_INET (should be AF_INET6) in the bpf program for v6
      
      v3:
      * More bpf_bind documentation refinements (Martin KaFai Lau)
      * Add UDP tests as well (Martin KaFai Lau)
      * Don't start the thread, just do socket+bind+listen (Martin KaFai Lau)
      
      v2:
      * Update documentation (Andrey Ignatov)
      * Pass BIND_FORCE_ADDRESS_NO_PORT conditionally (Andrey Ignatov)
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAndrey Ignatov <rdna@fb.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200508174611.228805-5-sdf@google.com
      8086fbaf
  6. 02 5月, 2020 2 次提交
    • S
      bpf: Bpf_{g,s}etsockopt for struct bpf_sock_addr · beecf11b
      Stanislav Fomichev 提交于
      Currently, bpf_getsockopt and bpf_setsockopt helpers operate on the
      'struct bpf_sock_ops' context in BPF_PROG_TYPE_SOCK_OPS program.
      Let's generalize them and make them available for 'struct bpf_sock_addr'.
      That way, in the future, we can allow those helpers in more places.
      
      As an example, let's expose those 'struct bpf_sock_addr' based helpers to
      BPF_CGROUP_INET{4,6}_CONNECT hooks. That way we can override CC before the
      connection is made.
      
      v3:
      * Expose custom helpers for bpf_sock_addr context instead of doing
        generic bpf_sock argument (as suggested by Daniel). Even with
        try_socket_lock that doesn't sleep we have a problem where context sk
        is already locked and socket lock is non-nestable.
      
      v2:
      * s/BPF_PROG_TYPE_CGROUP_SOCKOPT/BPF_PROG_TYPE_SOCK_OPS/
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200430233152.199403-1-sdf@google.com
      beecf11b
    • S
      bpf: Sharing bpf runtime stats with BPF_ENABLE_STATS · d46edd67
      Song Liu 提交于
      Currently, sysctl kernel.bpf_stats_enabled controls BPF runtime stats.
      Typical userspace tools use kernel.bpf_stats_enabled as follows:
      
        1. Enable kernel.bpf_stats_enabled;
        2. Check program run_time_ns;
        3. Sleep for the monitoring period;
        4. Check program run_time_ns again, calculate the difference;
        5. Disable kernel.bpf_stats_enabled.
      
      The problem with this approach is that only one userspace tool can toggle
      this sysctl. If multiple tools toggle the sysctl at the same time, the
      measurement may be inaccurate.
      
      To fix this problem while keep backward compatibility, introduce a new
      bpf command BPF_ENABLE_STATS. On success, this command enables stats and
      returns a valid fd. BPF_ENABLE_STATS takes argument "type". Currently,
      only one type, BPF_STATS_RUN_TIME, is supported. We can extend the
      command to support other types of stats in the future.
      
      With BPF_ENABLE_STATS, user space tool would have the following flow:
      
        1. Get a fd with BPF_ENABLE_STATS, and make sure it is valid;
        2. Check program run_time_ns;
        3. Sleep for the monitoring period;
        4. Check program run_time_ns again, calculate the difference;
        5. Close the fd.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200430071506.1408910-2-songliubraving@fb.com
      d46edd67
  7. 29 4月, 2020 1 次提交
  8. 28 4月, 2020 1 次提交
  9. 27 4月, 2020 1 次提交
  10. 25 4月, 2020 1 次提交
  11. 14 4月, 2020 5 次提交
    • A
      tools headers kvm: Sync linux/kvm.h with the kernel sources · b8fc2280
      Arnaldo Carvalho de Melo 提交于
      To pick up the changes from:
      
        9a5788c6 ("KVM: PPC: Book3S HV: Add a capability for enabling secure guests")
        3c9bd400 ("KVM: x86: enable dirty log gradually in small chunks")
        13da9ae1 ("KVM: s390: protvirt: introduce and enable KVM_CAP_S390_PROTECTED")
        e0d2773d ("KVM: s390: protvirt: UV calls in support of diag308 0, 1")
        19e12277 ("KVM: S390: protvirt: Introduce instruction data area bounce buffer")
        29b40f10 ("KVM: s390: protvirt: Add initial vm and cpu lifecycle handling")
      
      So far we're ignoring those arch specific ioctls, we need to revisit
      this at some time to have arch specific tables, etc:
      
        $ grep S390 tools/perf/trace/beauty/kvm_ioctl.sh
            egrep -v " ((ARM|PPC|S390)_|[GS]ET_(DEBUGREGS|PIT2|XSAVE|TSC_KHZ)|CREATE_SPAPR_TCE_64)" | \
        $
      
      This addresses these tools/perf build warnings:
      
        Warning: Kernel ABI header at 'tools/arch/arm/include/uapi/asm/kvm.h' differs from latest version at 'arch/arm/include/uapi/asm/kvm.h'
        diff -u tools/arch/arm/include/uapi/asm/kvm.h arch/arm/include/uapi/asm/kvm.h
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Jay Zhou <jianjay.zhou@huawei.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      b8fc2280
    • A
      tools headers UAPI: Sync linux/fscrypt.h with the kernel sources · 1abcb9d9
      Arnaldo Carvalho de Melo 提交于
      To pick the changes from:
      
        e98ad464 ("fscrypt: add FS_IOC_GET_ENCRYPTION_NONCE ioctl")
      
      That don't trigger any changes in tooling.
      
      This silences this perf build warning:
      
        Warning: Kernel ABI header at 'tools/include/uapi/linux/fscrypt.h' differs from latest version at 'include/uapi/linux/fscrypt.h'
        diff -u tools/include/uapi/linux/fscrypt.h include/uapi/linux/fscrypt.h
      
      In time we should come up with something like:
      
        $ tools/perf/trace/beauty/fsconfig.sh
        static const char *fsconfig_cmds[] = {
        	[0] = "SET_FLAG",
        	[1] = "SET_STRING",
        	[2] = "SET_BINARY",
        	[3] = "SET_PATH",
        	[4] = "SET_PATH_EMPTY",
        	[5] = "SET_FD",
        	[6] = "CMD_CREATE",
        	[7] = "CMD_RECONFIGURE",
        };
        $
      
      And:
      
        $ tools/perf/trace/beauty/drm_ioctl.sh | head
        #ifndef DRM_COMMAND_BASE
        #define DRM_COMMAND_BASE                0x40
        #endif
        static const char *drm_ioctl_cmds[] = {
        	[0x00] = "VERSION",
        	[0x01] = "GET_UNIQUE",
        	[0x02] = "GET_MAGIC",
        	[0x03] = "IRQ_BUSID",
        	[0x04] = "GET_MAP",
        	[0x05] = "GET_CLIENT",
        $
      
      For fscrypt's ioctls.
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      1abcb9d9
    • A
      tools include UAPI: Sync linux/vhost.h with the kernel sources · 3df4d4bf
      Arnaldo Carvalho de Melo 提交于
      To get the changes in:
      
        4c8cf318 ("vhost: introduce vDPA-based backend")
      
      Silencing this perf build warning:
      
        Warning: Kernel ABI header at 'tools/include/uapi/linux/vhost.h' differs from latest version at 'include/uapi/linux/vhost.h'
        diff -u tools/include/uapi/linux/vhost.h include/uapi/linux/vhost.h
      
      This automatically picks these new ioctls, making tools such as 'perf
      trace' aware of them and possibly allowing to use the strings in
      filters, etc:
      
        $ tools/perf/trace/beauty/vhost_virtio_ioctl.sh > before
        $ cp include/uapi/linux/vhost.h tools/include/uapi/linux/vhost.h
        $ tools/perf/trace/beauty/vhost_virtio_ioctl.sh > after
        $ diff -u before after
        --- before	2020-04-14 09:12:28.559748968 -0300
        +++ after	2020-04-14 09:12:38.781696242 -0300
        @@ -24,9 +24,16 @@
         	[0x44] = "SCSI_GET_EVENTS_MISSED",
         	[0x60] = "VSOCK_SET_GUEST_CID",
         	[0x61] = "VSOCK_SET_RUNNING",
        +	[0x72] = "VDPA_SET_STATUS",
        +	[0x74] = "VDPA_SET_CONFIG",
        +	[0x75] = "VDPA_SET_VRING_ENABLE",
         };
         static const char *vhost_virtio_ioctl_read_cmds[] = {
         	[0x00] = "GET_FEATURES",
         	[0x12] = "GET_VRING_BASE",
         	[0x26] = "GET_BACKEND_FEATURES",
        +	[0x70] = "VDPA_GET_DEVICE_ID",
        +	[0x71] = "VDPA_GET_STATUS",
        +	[0x73] = "VDPA_GET_CONFIG",
        +	[0x76] = "VDPA_GET_VRING_NUM",
         };
        $
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Tiwei Bie <tiwei.bie@intel.com>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      3df4d4bf
    • A
      tools headers UAPI: Sync linux/mman.h with the kernel · f60b3878
      Arnaldo Carvalho de Melo 提交于
      To get the changes in:
      
        e346b381 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
      
      Add that to 'perf trace's mremap 'flags' decoder.
      
      This silences this perf build warning:
      
        Warning: Kernel ABI header at 'tools/include/uapi/linux/mman.h' differs from latest version at 'include/uapi/linux/mman.h'
        diff -u tools/include/uapi/linux/mman.h include/uapi/linux/mman.h
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      f60b3878
    • A
      tools headers UAPI: Sync sched.h with the kernel · 027fa8fb
      Arnaldo Carvalho de Melo 提交于
      To get the changes in:
      
        ef2c41cf ("clone3: allow spawning processes into cgroups")
      
      Add that to 'perf trace's clone 'flags' decoder.
      
      This silences this perf build warning:
      
        Warning: Kernel ABI header at 'tools/include/uapi/linux/sched.h' differs from latest version at 'include/uapi/linux/sched.h'
        diff -u tools/include/uapi/linux/sched.h include/uapi/linux/sched.h
      
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      027fa8fb
  12. 02 4月, 2020 1 次提交
  13. 31 3月, 2020 3 次提交
    • A
      libbpf: Add support for bpf_link-based cgroup attachment · cc4f864b
      Andrii Nakryiko 提交于
      Add bpf_program__attach_cgroup(), which uses BPF_LINK_CREATE subcommand to
      create an FD-based kernel bpf_link. Also add low-level bpf_link_create() API.
      
      If expected_attach_type is not specified explicitly with
      bpf_program__set_expected_attach_type(), libbpf will try to determine proper
      attach type from BPF program's section definition.
      
      Also add support for bpf_link's underlying BPF program replacement:
        - unconditional through high-level bpf_link__update_program() API;
        - cmpxchg-like with specifying expected current BPF program through
          low-level bpf_link_update() API.
      Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200330030001.2312810-4-andriin@fb.com
      cc4f864b
    • A
      bpf: Implement bpf_link-based cgroup BPF program attachment · af6eea57
      Andrii Nakryiko 提交于
      Implement new sub-command to attach cgroup BPF programs and return FD-based
      bpf_link back on success. bpf_link, once attached to cgroup, cannot be
      replaced, except by owner having its FD. Cgroup bpf_link supports only
      BPF_F_ALLOW_MULTI semantics. Both link-based and prog-based BPF_F_ALLOW_MULTI
      attachments can be freely intermixed.
      
      To prevent bpf_cgroup_link from keeping cgroup alive past the point when no
      BPF program can be executed, implement auto-detachment of link. When
      cgroup_bpf_release() is called, all attached bpf_links are forced to release
      cgroup refcounts, but they leave bpf_link otherwise active and allocated, as
      well as still owning underlying bpf_prog. This is because user-space might
      still have FDs open and active, so bpf_link as a user-referenced object can't
      be freed yet. Once last active FD is closed, bpf_link will be freed and
      underlying bpf_prog refcount will be dropped. But cgroup refcount won't be
      touched, because cgroup is released already.
      
      The inherent race between bpf_cgroup_link release (from closing last FD) and
      cgroup_bpf_release() is resolved by both operations taking cgroup_mutex. So
      the only additional check required is when bpf_cgroup_link attempts to detach
      itself from cgroup. At that time we need to check whether there is still
      cgroup associated with that link. And if not, exit with success, because
      bpf_cgroup_link was already successfully detached.
      Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Link: https://lore.kernel.org/bpf/20200330030001.2312810-2-andriin@fb.com
      af6eea57
    • J
      bpf: Add socket assign support · cf7fbe66
      Joe Stringer 提交于
      Add support for TPROXY via a new bpf helper, bpf_sk_assign().
      
      This helper requires the BPF program to discover the socket via a call
      to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
      helper takes its own reference to the socket in addition to any existing
      reference that may or may not currently be obtained for the duration of
      BPF processing. For the destination socket to receive the traffic, the
      traffic must be routed towards that socket via local route. The
      simplest example route is below, but in practice you may want to route
      traffic more narrowly (eg by CIDR):
      
        $ ip route add local default dev lo
      
      This patch avoids trying to introduce an extra bit into the skb->sk, as
      that would require more invasive changes to all code interacting with
      the socket to ensure that the bit is handled correctly, such as all
      error-handling cases along the path from the helper in BPF through to
      the orphan path in the input. Instead, we opt to use the destructor
      variable to switch on the prefetch of the socket.
      Signed-off-by: NJoe Stringer <joe@wand.net.nz>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-2-joe@wand.net.nz
      cf7fbe66
  14. 30 3月, 2020 2 次提交
  15. 29 3月, 2020 1 次提交
  16. 28 3月, 2020 2 次提交
    • D
      bpf: Enable bpf cgroup hooks to retrieve cgroup v2 and ancestor id · 0f09abd1
      Daniel Borkmann 提交于
      Enable the bpf_get_current_cgroup_id() helper for connect(), sendmsg(),
      recvmsg() and bind-related hooks in order to retrieve the cgroup v2
      context which can then be used as part of the key for BPF map lookups,
      for example. Given these hooks operate in process context 'current' is
      always valid and pointing to the app that is performing mentioned
      syscalls if it's subject to a v2 cgroup. Also with same motivation of
      commit 77236281 ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper")
      enable retrieval of ancestor from current so the cgroup id can be used
      for policy lookups which can then forbid connect() / bind(), for example.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/d2a7ef42530ad299e3cbb245e6c12374b72145ef.1585323121.git.daniel@iogearbox.net
      0f09abd1
    • D
      bpf: Add netns cookie and enable it for bpf cgroup hooks · f318903c
      Daniel Borkmann 提交于
      In Cilium we're mainly using BPF cgroup hooks today in order to implement
      kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*),
      ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic
      between Cilium managed nodes. While this works in its current shape and avoids
      packet-level NAT for inter Cilium managed node traffic, there is one major
      limitation we're facing today, that is, lack of netns awareness.
      
      In Kubernetes, the concept of Pods (which hold one or multiple containers)
      has been built around network namespaces, so while we can use the global scope
      of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing
      NodePort ports on loopback addresses), we also have the need to differentiate
      between initial network namespaces and non-initial one. For example, ExternalIP
      services mandate that non-local service IPs are not to be translated from the
      host (initial) network namespace as one example. Right now, we have an ugly
      work-around in place where non-local service IPs for ExternalIP services are
      not xlated from connect() and friends BPF hooks but instead via less efficient
      packet-level NAT on the veth tc ingress hook for Pod traffic.
      
      On top of determining whether we're in initial or non-initial network namespace
      we also have a need for a socket-cookie like mechanism for network namespaces
      scope. Socket cookies have the nice property that they can be combined as part
      of the key structure e.g. for BPF LRU maps without having to worry that the
      cookie could be recycled. We are planning to use this for our sessionAffinity
      implementation for services. Therefore, add a new bpf_get_netns_cookie() helper
      which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would
      provide the cookie for the initial network namespace while passing the context
      instead of NULL would provide the cookie from the application's network namespace.
      We're using a hole, so no size increase; the assignment happens only once.
      Therefore this allows for a comparison on initial namespace as well as regular
      cookie usage as we have today with socket cookies. We could later on enable
      this helper for other program types as well as we would see need.
      
        (*) Both externalTrafficPolicy={Local|Cluster} types
        [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.cSigned-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
      f318903c
  17. 27 3月, 2020 1 次提交
  18. 24 3月, 2020 1 次提交
    • A
      tools headers uapi: Update linux/in.h copy · 29f36c16
      Arnaldo Carvalho de Melo 提交于
      To get the changes in:
      
        26776253 ("seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number")
      
      That ends up automatically adding the new IPPROTO_ETHERNET to the socket
      args beautifiers:
      
        $ tools/perf/trace/beauty/socket_ipproto.sh > before
      
      Apply this patch:
      
        $ tools/perf/trace/beauty/socket_ipproto.sh > after
        $ diff -u before after
        --- before	2020-03-19 11:48:36.876673819 -0300
        +++ after	2020-03-19 11:49:00.148541377 -0300
        @@ -6,6 +6,7 @@
         	[132] = "SCTP",
         	[136] = "UDPLITE",
         	[137] = "MPLS",
        +	[143] = "ETHERNET",
         	[17] = "UDP",
         	[1] = "ICMP",
         	[22] = "IDP",
        $
      
      Addresses this tools/perf build warning:
      
        Warning: Kernel ABI header at 'tools/include/uapi/linux/in.h' differs from latest version at 'include/uapi/linux/in.h'
        diff -u tools/include/uapi/linux/in.h include/uapi/linux/in.h
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Paolo Lungaroni <paolo.lungaroni@cnit.it>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      29f36c16
  19. 19 3月, 2020 1 次提交
    • A
      tools headers uapi: Update linux/in.h copy · 564200ed
      Arnaldo Carvalho de Melo 提交于
      To get the changes in:
      
        26776253 ("seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number")
      
      That ends up automatically adding the new IPPROTO_ETHERNET to the socket
      args beautifiers:
      
        $ tools/perf/trace/beauty/socket_ipproto.sh > before
      
      Apply this patch:
      
        $ tools/perf/trace/beauty/socket_ipproto.sh > after
        $ diff -u before after
        --- before	2020-03-19 11:48:36.876673819 -0300
        +++ after	2020-03-19 11:49:00.148541377 -0300
        @@ -6,6 +6,7 @@
         	[132] = "SCTP",
         	[136] = "UDPLITE",
         	[137] = "MPLS",
        +	[143] = "ETHERNET",
         	[17] = "UDP",
         	[1] = "ICMP",
         	[22] = "IDP",
        $
      
      Addresses this tools/perf build warning:
      
        Warning: Kernel ABI header at 'tools/include/uapi/linux/in.h' differs from latest version at 'include/uapi/linux/in.h'
        diff -u tools/include/uapi/linux/in.h include/uapi/linux/in.h
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Paolo Lungaroni <paolo.lungaroni@cnit.it>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      564200ed
  20. 14 3月, 2020 1 次提交
  21. 13 3月, 2020 2 次提交
  22. 05 3月, 2020 3 次提交
    • A
      tools headers UAPI: Update tools's copy of linux/perf_event.h · 6339998d
      Arnaldo Carvalho de Melo 提交于
      To get the changes in:
      
        bbfd5e4f ("perf/core: Add new branch sample type for HW index of raw branch records")
      
      This silences this perf tools build warning:
      
        Warning: Kernel ABI header at 'tools/include/uapi/linux/perf_event.h' differs from latest version at 'include/uapi/linux/perf_event.h'
        diff -u tools/include/uapi/linux/perf_event.h include/uapi/linux/perf_event.h
      
      This update is a prerequisite to adding support for the HW index of raw
      branch records.
      Acked-by: NKan Liang <kan.liang@linux.intel.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Pavel Gerasimov <pavel.gerasimov@intel.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vitaly Slobodskoy <vitaly.slobodskoy@intel.com>
      Link: http://lore.kernel.org/lkml/20200304134902.GB12612@kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      6339998d
    • K
      bpf: Introduce BPF_MODIFY_RETURN · ae240823
      KP Singh 提交于
      When multiple programs are attached, each program receives the return
      value from the previous program on the stack and the last program
      provides the return value to the attached function.
      
      The fmod_ret bpf programs are run after the fentry programs and before
      the fexit programs. The original function is only called if all the
      fmod_ret programs return 0 to avoid any unintended side-effects. The
      success value, i.e. 0 is not currently configurable but can be made so
      where user-space can specify it at load time.
      
      For example:
      
      int func_to_be_attached(int a, int b)
      {  <--- do_fentry
      
      do_fmod_ret:
         <update ret by calling fmod_ret>
         if (ret != 0)
              goto do_fexit;
      
      original_function:
      
          <side_effects_happen_here>
      
      }  <--- do_fexit
      
      The fmod_ret program attached to this function can be defined as:
      
      SEC("fmod_ret/func_to_be_attached")
      int BPF_PROG(func_name, int a, int b, int ret)
      {
              // This will skip the original function logic.
              return 1;
      }
      
      The first fmod_ret program is passed 0 in its return argument.
      Signed-off-by: NKP Singh <kpsingh@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200304191853.1529-4-kpsingh@chromium.org
      ae240823
    • A
      bpf: Switch BPF UAPI #define constants used from BPF program side to enums · 1aae4bdd
      Andrii Nakryiko 提交于
      Switch BPF UAPI constants, previously defined as #define macro, to anonymous
      enum values. This preserves constants values and behavior in expressions, but
      has added advantaged of being captured as part of DWARF and, subsequently, BTF
      type info. Which, in turn, greatly improves usefulness of generated vmlinux.h
      for BPF applications, as it will not require BPF users to copy/paste various
      flags and constants, which are frequently used with BPF helpers. Only those
      constants that are used/useful from BPF program side are converted.
      Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200303003233.3496043-2-andriin@fb.com
      1aae4bdd
  23. 04 3月, 2020 1 次提交
  24. 20 2月, 2020 1 次提交