1. 23 5月, 2018 3 次提交
    • M
      bpf: btf: Rename btf_key_id and btf_value_id in bpf_map_info · 9b2cf328
      Martin KaFai Lau 提交于
      In "struct bpf_map_info", the name "btf_id", "btf_key_id" and "btf_value_id"
      could cause confusion because the "id" of "btf_id" means the BPF obj id
      given to the BTF object while
      "btf_key_id" and "btf_value_id" means the BTF type id within
      that BTF object.
      
      To make it clear, btf_key_id and btf_value_id are
      renamed to btf_key_type_id and btf_value_type_id.
      Suggested-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      9b2cf328
    • M
      bpf: btf: Remove unused bits from uapi/linux/btf.h · aea2f7b8
      Martin KaFai Lau 提交于
      This patch does the followings:
      1. Limit BTF_MAX_TYPES and BTF_MAX_NAME_OFFSET to 64k.  We can
         raise it later.
      
      2. Remove the BTF_TYPE_PARENT and BTF_STR_TBL_ELF_ID.  They are
         currently encoded at the highest bit of a u32.
         It is because the current use case does not require supporting
         parent type (i.e type_id referring to a type in another BTF file).
         It also does not support referring to a string in ELF.
      
         The BTF_TYPE_PARENT and BTF_STR_TBL_ELF_ID checks are replaced
         by BTF_TYPE_ID_CHECK and BTF_STR_OFFSET_CHECK which are
         defined in btf.c instead of uapi/linux/btf.h.
      
      3. Limit the BTF_INFO_KIND from 5 bits to 4 bits which is enough.
         There is unused bits headroom if we ever needed it later.
      
      4. The root bit in BTF_INFO is also removed because it is not
         used in the current use case.
      
      5. Remove BTF_INT_VARARGS since func type is not supported now.
         The BTF_INT_ENCODING is limited to 4 bits instead of 8 bits.
      
      The above can be added back later because the verifier
      ensures the unused bits are zeros.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      aea2f7b8
    • M
      bpf: btf: Change how section is supported in btf_header · f80442a4
      Martin KaFai Lau 提交于
      There are currently unused section descriptions in the btf_header.  Those
      sections are here to support future BTF use cases.  For example, the
      func section (func_off) is to support function signature (e.g. the BPF
      prog function signature).
      
      Instead of spelling out all potential sections up-front in the btf_header.
      This patch makes changes to btf_header such that extending it (e.g. adding
      a section) is possible later.  The unused ones can be removed for now and
      they can be added back later.
      
      This patch:
      1. adds a hdr_len to the btf_header.  It will allow adding
      sections (and other info like parent_label and parent_name)
      later.  The check is similar to the existing bpf_attr.
      If a user passes in a longer hdr_len, the kernel
      ensures the extra tailing bytes are 0.
      
      2. allows the section order in the BTF object to be
      different from its sec_off order in btf_header.
      
      3. each sec_off is followed by a sec_len.  It must not have gap or
      overlapping among sections.
      
      The string section is ensured to be at the end due to the 4 bytes
      alignment requirement of the type section.
      
      The above changes will allow enough flexibility to
      add new sections (and other info) to the btf_header later.
      
      This patch also removes an unnecessary !err check
      at the end of btf_parse().
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      f80442a4
  2. 22 5月, 2018 2 次提交
  3. 19 5月, 2018 1 次提交
  4. 18 5月, 2018 1 次提交
  5. 16 5月, 2018 1 次提交
  6. 15 5月, 2018 2 次提交
    • M
      sched: cls: enable verbose logging · 81c7288b
      Marcelo Ricardo Leitner 提交于
      Currently, when the rule is not to be exclusively executed by the
      hardware, extack is not passed along and offloading failures don't
      get logged. The idea was that hardware failures are okay because the
      rule will get executed in software then and this way it doesn't confuse
      unware users.
      
      But this is not helpful in case one needs to understand why a certain
      rule failed to get offloaded. Considering it may have been a temporary
      failure, like resources exceeded or so, reproducing it later and knowing
      that it is triggering the same reason may be challenging.
      
      The ultimate goal is to improve Open vSwitch debuggability when using
      flower offloading.
      
      This patch adds a new flag to enable verbose logging. With the flag set,
      extack will be passed to the driver, which will be able to log the
      error. As the operation itself probably won't fail (not because of this,
      at least), current iproute will already log it as a Warning.
      
      The flag is generic, so it can be reused later. No need to restrict it
      just for HW offloading. The command line will follow the syntax that
      tc-ebpf already uses, tc ... [ verbose ] ... , and extend its meaning.
      
      For example:
      # ./tc qdisc add dev p7p1 ingress
      # ./tc filter add dev p7p1 parent ffff: protocol ip prio 1 \
      	flower verbose \
      	src_mac ed:13:db:00:00:00 dst_mac 01:80:c2:00:00:d0 \
      	src_ip 56.0.0.0 dst_ip 55.0.0.0 action drop
      Warning: TC offload is disabled on net device.
      # echo $?
      0
      # ./tc filter add dev p7p1 parent ffff: protocol ip prio 1 \
      	flower \
      	src_mac ff:13:db:00:00:00 dst_mac 01:80:c2:00:00:d0 \
      	src_ip 56.0.0.0 dst_ip 55.0.0.0 action drop
      # echo $?
      0
      Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      81c7288b
    • R
      vmcore: add API to collect hardware dump in second kernel · 2724273e
      Rahul Lakkireddy 提交于
      The sequence of actions done by device drivers to append their device
      specific hardware/firmware logs to /proc/vmcore are as follows:
      
      1. During probe (before hardware is initialized), device drivers
      register to the vmcore module (via vmcore_add_device_dump()), with
      callback function, along with buffer size and log name needed for
      firmware/hardware log collection.
      
      2. vmcore module allocates the buffer with requested size. It adds
      an Elf note and invokes the device driver's registered callback
      function.
      
      3. Device driver collects all hardware/firmware logs into the buffer
      and returns control back to vmcore module.
      
      Ensure that the device dump buffer size is always aligned to page size
      so that it can be mmaped.
      
      Also, rename alloc_elfnotes_buf() to vmcore_alloc_buf() to make it more
      generic and reserve NT_VMCOREDD note type to indicate vmcore device
      dump.
      
      Suggested-by: Eric Biederman <ebiederm@xmission.com>.
      Signed-off-by: NRahul Lakkireddy <rahul.lakkireddy@chelsio.com>
      Signed-off-by: NGanesh Goudar <ganeshgr@chelsio.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2724273e
  7. 11 5月, 2018 1 次提交
    • D
      bpf: Provide helper to do forwarding lookups in kernel FIB table · 87f5fc7e
      David Ahern 提交于
      Provide a helper for doing a FIB and neighbor lookup in the kernel
      tables from an XDP program. The helper provides a fastpath for forwarding
      packets. If the packet is a local delivery or for any reason is not a
      simple lookup and forward, the packet continues up the stack.
      
      If it is to be forwarded, the forwarding can be done directly if the
      neighbor is already known. If the neighbor does not exist, the first
      few packets go up the stack for neighbor resolution. Once resolved, the
      xdp program provides the fast path.
      
      On successful lookup the nexthop dmac, current device smac and egress
      device index are returned.
      
      The API supports IPv4, IPv6 and MPLS protocols, but only IPv4 and IPv6
      are implemented in this patch. The API includes layer 4 parameters if
      the XDP program chooses to do deep packet inspection to allow compare
      against ACLs implemented as FIB rules.
      
      Header rewrite is left to the XDP program.
      
      The lookup takes 2 flags:
      - BPF_FIB_LOOKUP_DIRECT to do a lookup that bypasses FIB rules and goes
        straight to the table associated with the device (expert setting for
        those looking to maximize throughput)
      
      - BPF_FIB_LOOKUP_OUTPUT to do a lookup from the egress perspective.
        Default is an ingress lookup.
      
      Initial performance numbers collected by Jesper, forwarded packets/sec:
      
             Full stack    XDP FIB lookup    XDP Direct lookup
      IPv4   1,947,969       7,074,156          7,415,333
      IPv6   1,728,000       6,165,504          7,262,720
      
      These number are single CPU core forwarding on a Broadwell
      E5-1650 v4 @ 3.60GHz.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      87f5fc7e
  8. 09 5月, 2018 2 次提交
    • M
      bpf: btf: Add struct bpf_btf_info · 62dab84c
      Martin KaFai Lau 提交于
      During BPF_OBJ_GET_INFO_BY_FD on a btf_fd, the current bpf_attr's
      info.info is directly filled with the BTF binary data.  It is
      not extensible.  In this case, we want to add BTF ID.
      
      This patch adds "struct bpf_btf_info" which has the BTF ID as
      one of its member.  The BTF binary data itself is exposed through
      the "btf" and "btf_size" members.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@fb.com>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      62dab84c
    • M
      bpf: btf: Introduce BTF ID · 78958fca
      Martin KaFai Lau 提交于
      This patch gives an ID to each loaded BTF.  The ID is allocated by
      the idr like the existing prog-id and map-id.
      
      The bpf_put(map->btf) is moved to __bpf_map_put() so that the
      userspace can stop seeing the BTF ID ASAP when the last BTF
      refcnt is gone.
      
      It also makes BTF accessible from userspace through the
      1. new BPF_BTF_GET_FD_BY_ID command.  It is limited to CAP_SYS_ADMIN
         which is inline with the BPF_BTF_LOAD cmd and the existing
         BPF_[MAP|PROG]_GET_FD_BY_ID cmd.
      2. new btf_id (and btf_key_id + btf_value_id) in "struct bpf_map_info"
      
      Once the BTF ID handler is accessible from userspace, freeing a BTF
      object has to go through a rcu period.  The BPF_BTF_GET_FD_BY_ID cmd
      can then be done under a rcu_read_lock() instead of taking
      spin_lock.
      [Note: A similar rcu usage can be done to the existing
             bpf_prog_get_fd_by_id() in a follow up patch]
      
      When processing the BPF_BTF_GET_FD_BY_ID cmd,
      refcount_inc_not_zero() is needed because the BTF object
      could be already in the rcu dead row .  btf_get() is
      removed since its usage is currently limited to btf.c
      alone.  refcount_inc() is used directly instead.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@fb.com>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      78958fca
  9. 07 5月, 2018 5 次提交
  10. 04 5月, 2018 9 次提交
  11. 02 5月, 2018 4 次提交
    • M
      Revert "vhost: make msg padding explicit" · c818aa88
      Michael S. Tsirkin 提交于
      This reverts commit 93c0d549c4c5a7382ad70de6b86610b7aae57406.
      
      Unfortunately the padding will break 32 bit userspace.
      Ouch. Need to add some compat code, revert for now.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c818aa88
    • S
      tcp: send in-queue bytes in cmsg upon read · b75eba76
      Soheil Hassas Yeganeh 提交于
      Applications with many concurrent connections, high variance
      in receive queue length and tight memory bounds cannot
      allocate worst-case buffer size to drain sockets. Knowing
      the size of receive queue length, applications can optimize
      how they allocate buffers to read from the socket.
      
      The number of bytes pending on the socket is directly
      available through ioctl(FIONREAD/SIOCINQ) and can be
      approximated using getsockopt(MEMINFO) (rmem_alloc includes
      skb overheads in addition to application data). But, both of
      these options add an extra syscall per recvmsg. Moreover,
      ioctl(FIONREAD/SIOCINQ) takes the socket lock.
      
      Add the TCP_INQ socket option to TCP. When this socket
      option is set, recvmsg() relays the number of bytes available
      on the socket for reading to the application via the
      TCP_CM_INQ control message.
      
      Calculate the number of bytes after releasing the socket lock
      to include the processed backlog, if any. To avoid an extra
      branch in the hot path of recvmsg() for this new control
      message, move all cmsg processing inside an existing branch for
      processing receive timestamps. Since the socket lock is not held
      when calculating the size of receive queue, TCP_INQ is a hint.
      For example, it can overestimate the queue size by one byte,
      if FIN is received.
      
      With this method, applications can start reading from the socket
      using a small buffer, and then use larger buffers based on the
      remaining data when needed.
      
      V3 change-log:
      	As suggested by David Miller, added loads with barrier
      	to check whether we have multiple threads calling recvmsg
      	in parallel. When that happens we lock the socket to
      	calculate inq.
      V4 change-log:
      	Removed inline from a static function.
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NNeal Cardwell <ncardwell@google.com>
      Suggested-by: NDavid Miller <davem@davemloft.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b75eba76
    • S
      connector: add parent pid and tgid to coredump and exit events · b086ff87
      Stefan Strogin 提交于
      The intention is to get notified of process failures as soon
      as possible, before a possible core dumping (which could be very long)
      (e.g. in some process-manager). Coredump and exit process events
      are perfect for such use cases (see 2b5faa4c "connector: Added
      coredumping event to the process connector").
      
      The problem is that for now the process-manager cannot know the parent
      of a dying process using connectors. This could be useful if the
      process-manager should monitor for failures only children of certain
      parents, so we could filter the coredump and exit events by parent
      process and/or thread ID.
      
      Add parent pid and tgid to coredump and exit process connectors event
      data.
      Signed-off-by: NStefan Strogin <sstrogin@cisco.com>
      Acked-by: NEvgeniy Polyakov <zbr@ioremap.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b086ff87
    • M
      vhost: make msg padding explicit · de08481a
      Michael S. Tsirkin 提交于
      There's a 32 bit hole just after type. It's best to
      give it a name, this way compiler is forced to initialize
      it with rest of the structure.
      Reported-by: NKevin Easton <kevin@guarana.org>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      de08481a
  12. 30 4月, 2018 3 次提交
    • Q
      bpf: fix formatting for bpf_get_stack() helper doc · 79552fbc
      Quentin Monnet 提交于
      Fix formatting (indent) for bpf_get_stack() helper documentation, so
      that the doc is rendered correctly with the Python script.
      
      Fixes: c195651e ("bpf: add bpf_get_stack helper")
      Cc: Yonghong Song <yhs@fb.com>
      Signed-off-by: NQuentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      79552fbc
    • Q
      bpf: fix formatting for bpf_perf_event_read() helper doc · 3bd5a09b
      Quentin Monnet 提交于
      Some edits brought to the last iteration of BPF helper functions
      documentation introduced an error with RST formatting. As a result, most
      of one paragraph is rendered in bold text when only the name of a helper
      should be. Fix it, and fix formatting of another function name in the
      same paragraph.
      
      Fixes: c6b5fb86 ("bpf: add documentation for eBPF helpers (42-50)")
      Signed-off-by: NQuentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      3bd5a09b
    • E
      tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive · 05255b82
      Eric Dumazet 提交于
      When adding tcp mmap() implementation, I forgot that socket lock
      had to be taken before current->mm->mmap_sem. syzbot eventually caught
      the bug.
      
      Since we can not lock the socket in tcp mmap() handler we have to
      split the operation in two phases.
      
      1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
        This operation does not involve any TCP locking.
      
      2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
       the transfert of pages from skbs to one VMA.
        This operation only uses down_read(&current->mm->mmap_sem) after
        holding TCP lock, thus solving the lockdep issue.
      
      This new implementation was suggested by Andy Lutomirski with great details.
      
      Benefits are :
      
      - Better scalability, in case multiple threads reuse VMAS
         (without mmap()/munmap() calls) since mmap_sem wont be write locked.
      
      - Better error recovery.
         The previous mmap() model had to provide the expected size of the
         mapping. If for some reason one part could not be mapped (partial MSS),
         the whole operation had to be aborted.
         With the tcp_zerocopy_receive struct, kernel can report how
         many bytes were successfuly mapped, and how many bytes should
         be read to skip the problematic sequence.
      
      - No more memory allocation to hold an array of page pointers.
        16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/
      
      - skbs are freed while mmap_sem has been released
      
      Following patch makes the change in tcp_mmap tool to demonstrate
      one possible use of mmap() and setsockopt(... TCP_ZEROCOPY_RECEIVE ...)
      
      Note that memcg might require additional changes.
      
      Fixes: 93ab6cc6 ("tcp: implement mmap() for zero copy receive")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Suggested-by: NAndy Lutomirski <luto@kernel.org>
      Cc: linux-mm@kvack.org
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      05255b82
  13. 29 4月, 2018 2 次提交
    • A
      bpf: Fix helpers ctx struct types in uapi doc · a3ef8e9a
      Andrey Ignatov 提交于
      Helpers may operate on two types of ctx structures: user visible ones
      (e.g. `struct bpf_sock_ops`) when used in user programs, and kernel ones
      (e.g. `struct bpf_sock_ops_kern`) in kernel implementation.
      
      UAPI documentation must refer to only user visible structures.
      
      The patch replaces references to `_kern` structures in BPF helpers
      description by corresponding user visible structures.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      a3ef8e9a
    • Y
      bpf: add bpf_get_stack helper · c195651e
      Yonghong Song 提交于
      Currently, stackmap and bpf_get_stackid helper are provided
      for bpf program to get the stack trace. This approach has
      a limitation though. If two stack traces have the same hash,
      only one will get stored in the stackmap table,
      so some stack traces are missing from user perspective.
      
      This patch implements a new helper, bpf_get_stack, will
      send stack traces directly to bpf program. The bpf program
      is able to see all stack traces, and then can do in-kernel
      processing or send stack traces to user space through
      shared map or bpf_perf_event_output.
      Acked-by: NAlexei Starovoitov <ast@fb.com>
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      c195651e
  14. 28 4月, 2018 1 次提交
  15. 27 4月, 2018 3 次提交
    • J
      tipc: introduce ioctl for fetching node identity · 3e5cf362
      Jon Maloy 提交于
      After the introduction of a 128-bit node identity it may be difficult
      for a user to correlate between this identity and the generated node
      hash address.
      
      We now try to make this easier by introducing a new ioctl() call for
      fetching a node identity by using the hash value as key. This will
      be particularly useful when we extend some of the commands in the
      'tipc' tool, but we also expect regular user applications to need
      this feature.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3e5cf362
    • Q
      bpf: add documentation for eBPF helpers (65-66) · 2d020dd7
      Quentin Monnet 提交于
      Add documentation for eBPF helper functions to bpf.h user header file.
      This documentation can be parsed with the Python script provided in
      another commit of the patch series, in order to provide a RST document
      that can later be converted into a man page.
      
      The objective is to make the documentation easily understandable and
      accessible to all eBPF developers, including beginners.
      
      This patch contains descriptions for the following helper functions:
      
      Helper from Nikita:
      - bpf_xdp_adjust_tail()
      
      Helper from Eyal:
      - bpf_skb_get_xfrm_state()
      
      v4:
      - New patch (helpers did not exist yet for previous versions).
      
      Cc: Nikita V. Shirokov <tehnerd@tehnerd.com>
      Cc: Eyal Birger <eyal.birger@gmail.com>
      Signed-off-by: NQuentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      2d020dd7
    • Q
      bpf: add documentation for eBPF helpers (58-64) · ab127040
      Quentin Monnet 提交于
      Add documentation for eBPF helper functions to bpf.h user header file.
      This documentation can be parsed with the Python script provided in
      another commit of the patch series, in order to provide a RST document
      that can later be converted into a man page.
      
      The objective is to make the documentation easily understandable and
      accessible to all eBPF developers, including beginners.
      
      This patch contains descriptions for the following helper functions, all
      written by John:
      
      - bpf_redirect_map()
      - bpf_sk_redirect_map()
      - bpf_sock_map_update()
      - bpf_msg_redirect_map()
      - bpf_msg_apply_bytes()
      - bpf_msg_cork_bytes()
      - bpf_msg_pull_data()
      
      v4:
      - bpf_redirect_map(): Fix typos: "XDP_ABORT" changed to "XDP_ABORTED",
        "his" to "this". Also add a paragraph on performance improvement over
        bpf_redirect() helper.
      
      v3:
      - bpf_sk_redirect_map(): Improve description of BPF_F_INGRESS flag.
      - bpf_msg_redirect_map(): Improve description of BPF_F_INGRESS flag.
      - bpf_redirect_map(): Fix note on CPU redirection, not fully implemented
        for generic XDP but supported on native XDP.
      - bpf_msg_pull_data(): Clarify comment about invalidated verifier
        checks.
      
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NQuentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      ab127040