1. 24 May, 2018 4 commits
    • M
      ipv6: sr: Add seg6local action End.BPF · 004d4b27
      Mathieu Xhonneux committed
      This patch adds the End.BPF action to the LWT seg6local infrastructure.
      This action works like any other seg6local End action, meaning that an IPv6
      header with SRH is needed, whose DA has to be equal to the SID of the
      action. It also advances the SRH to the next segment, so the BPF program
      does not have to take care of this.
      
      Since the BPF program must not be a source of instability in the kernel, it
      is important to ensure that the integrity of the packet is maintained
      before yielding it back to the IPv6 layer. The hook hence keeps track of
      whether the SRH has been altered through the helpers, and re-validates its
      content if needed with seg6_validate_srh. The state kept for validation is
      stored in a per-CPU buffer. The BPF program is not allowed to directly
      write into the packet, and only some fields of the SRH can be altered
      through the helper bpf_lwt_seg6_store_bytes.
      
      Performance profiling has shown that the SRH re-validation does not induce
      a significant overhead. If the altered SRH is deemed invalid, the packet
      is dropped.
      
      This validation is also done before executing any action through
      bpf_lwt_seg6_action, and will not be performed again if the SRH is not
      modified after calling the action.
      
      The BPF program may return 3 types of return codes:
          - BPF_OK: the End.BPF action will look up the next destination through
                   seg6_lookup_nexthop.
          - BPF_REDIRECT: if an action has been executed through the
                bpf_lwt_seg6_action helper, the BPF program should return this
                value, as the skb's destination is already set and the default
                lookup should not be performed.
          - BPF_DROP: the packet will be dropped.
      Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
      Acked-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      004d4b27
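The return-code handling described in the End.BPF commit above can be sketched in plain userspace C. This is only an illustration under stated assumptions: the enum names and the function end_bpf_verdict() are hypothetical stand-ins, not kernel identifiers, and the real hook operates on an skb rather than on booleans.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the three return codes named in the commit. */
enum sketch_ret { SKETCH_OK, SKETCH_REDIRECT, SKETCH_DROP };

/* Hypothetical outcomes of the hook after the BPF program has run. */
enum sketch_verdict {
    VERDICT_LOOKUP_NEXTHOP,   /* default lookup (seg6_lookup_nexthop in the commit) */
    VERDICT_USE_EXISTING_DST, /* an action already set the destination */
    VERDICT_DROP
};

/*
 * Decide what happens to the packet.
 * srh_altered:     the program changed the SRH through the helpers
 * srh_still_valid: result of re-validating the altered SRH
 */
static enum sketch_verdict end_bpf_verdict(enum sketch_ret ret,
                                           bool srh_altered,
                                           bool srh_still_valid)
{
    if (ret == SKETCH_DROP)
        return VERDICT_DROP;
    /* An altered SRH that fails re-validation causes a drop. */
    if (srh_altered && !srh_still_valid)
        return VERDICT_DROP;
    if (ret == SKETCH_REDIRECT)
        return VERDICT_USE_EXISTING_DST;
    return VERDICT_LOOKUP_NEXTHOP;
}
```

For instance, end_bpf_verdict(SKETCH_OK, true, false) yields VERDICT_DROP, matching the statement that a packet with an invalid altered SRH is dropped.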
    • M
      bpf: Add IPv6 Segment Routing helpers · fe94cc29
      Mathieu Xhonneux committed
      The BPF seg6local hook should be powerful enough to enable users to
      implement most of the use-cases one could think of. After some thinking,
      we figured out that the following actions should be possible on an SRv6
      packet, requiring 3 specific helpers:
          - bpf_lwt_seg6_store_bytes: Modify non-sensitive fields of the SRH
          - bpf_lwt_seg6_adjust_srh: Grow or shrink an SRH
                                     (to add/delete TLVs)
          - bpf_lwt_seg6_action: Apply some SRv6 network programming actions
                                 (specifically End.X, End.T, End.B6 and
                                  End.B6.Encap)
      
      The specifications of these helpers are provided in the patch (see
      include/uapi/linux/bpf.h).
      
      The non-sensitive fields of the SRH are the following: flags, tag and
      TLVs. The other fields cannot be modified, to maintain the SRH
      integrity. Flags, tag and TLVs can easily be modified as their validity
      can be checked afterwards via seg6_validate_srh. It is not allowed to
      modify the segments directly. To add segments on the path, one should
      stack a new SRH using the End.B6 action via
      bpf_lwt_seg6_action.
      
      Growing, shrinking or editing TLVs via the helpers will flag the SRH as
      invalid, and it will have to be re-validated before re-entering the IPv6
      layer. This flag is stored in a per-CPU buffer, along with the current
      header length in bytes.
      
      Storing the SRH len in bytes in the control block is mandatory when using
      bpf_lwt_seg6_adjust_srh. The Header Ext. Length field contains the SRH
      len in units of 8 bytes (a padding TLV can be inserted to ensure the 8-byte
      boundary). When adding/deleting TLVs within the BPF program, the SRH may
      temporarily be in an invalid state where its length cannot be rounded to 8
      bytes without remainder, hence the need to store the length in bytes
      separately. The caller of the BPF program can then ensure that the SRH's
      final length is valid using this value. Again, a final SRH modified by a
      BPF program which doesn't respect the 8-byte boundary will be discarded
      as it will be considered invalid.
      
      Finally, a fourth helper is provided, bpf_lwt_push_encap, which is
      available from the LWT BPF IN hook, but not from the seg6local BPF one.
      This helper allows encapsulating a Segment Routing Header (either with
      a new outer IPv6 header, or by inlining it directly in the existing IPv6
      header) into a non-SRv6 packet. This helper is required if we want to
      offer the possibility to dynamically encapsulate an SRH for non-SRv6 packets,
      as the BPF seg6local hook only works on traffic already containing an SRH.
      This is the BPF equivalent of the seg6 LWT infrastructure, which achieves
      the same purpose but with a static SRH per route.
      
      These helpers require CONFIG_IPV6=y (and not =m).
      Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
      Acked-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      fe94cc29
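The 8-byte boundary rule from the helpers commit above can be made concrete with a little arithmetic. The sketch below is a userspace illustration with hypothetical function names. It assumes the usual IPv6 extension-header convention that the Hdr Ext Len field counts 8-octet units and excludes the first 8 octets.

```c
#include <stdbool.h>
#include <stdint.h>

/* A byte length is acceptable only if it is a non-zero multiple of 8. */
static bool srh_len_is_aligned(uint32_t len_bytes)
{
    return len_bytes >= 8 && len_bytes % 8 == 0;
}

/* Number of padding bytes needed to reach the next 8-byte boundary. */
static uint32_t srh_pad_needed(uint32_t len_bytes)
{
    return (8 - len_bytes % 8) % 8;
}

/*
 * Value of the Hdr Ext Len field for an aligned length: 8-octet units,
 * not counting the first 8 octets. Only meaningful when
 * srh_len_is_aligned(len_bytes) is true.
 */
static uint8_t srh_hdr_ext_len(uint32_t len_bytes)
{
    return (uint8_t)(len_bytes / 8 - 1);
}
```

A 24-byte SRH is aligned and has Hdr Ext Len 2. After a 4-byte TLV is added the byte length is 28, which is the temporarily invalid state the commit describes, and 4 more bytes of padding bring it to 32.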
    • S
      bpf: get JITed image lengths of functions via syscall · 815581c1
      Sandipan Das committed
      This adds two new fields to struct bpf_prog_info. For
      multi-function programs, these fields can be used to pass
      a list of the JITed image lengths of each function for a
      given program to userspace using the bpf system call with
      the BPF_OBJ_GET_INFO_BY_FD command.
      
      This can be used by userspace applications like bpftool
      to split up the contiguous JITed dump, also obtained via
      the system call, into more relatable chunks corresponding
      to each function.
      Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      815581c1
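Splitting a contiguous JITed dump into per-function chunks, as the commit above suggests a tool like bpftool can do, comes down to a running sum over the length list. The following is a minimal userspace sketch. The function name and parameters are hypothetical, and the lengths are assumed to have already been fetched from the program info.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Given the per-function image lengths and the total size of the
 * contiguous dump, fill offs[i] with the start offset of function i.
 * Returns 0 on success, -1 if the lengths do not add up to total_len.
 */
static int split_jited_image(const uint32_t *lens, size_t nr,
                             uint32_t total_len, uint32_t *offs)
{
    uint32_t off = 0;

    for (size_t i = 0; i < nr; i++) {
        /* Reject a chunk that would run past the end of the dump. */
        if (lens[i] > total_len - off)
            return -1;
        offs[i] = off;
        off += lens[i];
    }
    return off == total_len ? 0 : -1;
}
```

With lengths {64, 32, 16} and a 112-byte dump, the chunks start at offsets 0, 64 and 96.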
    • S
      bpf: get kernel symbol addresses via syscall · dbecd738
      Sandipan Das committed
      This adds two new fields to struct bpf_prog_info. For
      multi-function programs, these fields can be used to pass
      a list of kernel symbol addresses for all functions in a
      given program to userspace using the bpf system call with
      the BPF_OBJ_GET_INFO_BY_FD command.
      
      When bpf_jit_kallsyms is enabled, we can get the address
      of the corresponding kernel symbol for a callee function
      and resolve the symbol's name. The address is determined
      by adding the value of the call instruction's imm field
      to __bpf_call_base. This offset gets assigned to the imm
      field by the verifier.
      
      For some architectures, such as powerpc64, the imm field
      is not large enough to hold this offset.
      
      We resolve this by:
      
      [1] Assigning the subprog id to the imm field of a call
          instruction in the verifier instead of the offset of
          the callee's symbol's address from __bpf_call_base.
      
      [2] Determining the address of a callee's corresponding
          symbol by using the imm field as an index for the
          list of kernel symbol addresses now available from
          the program info.
      Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      dbecd738
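The two ways of resolving a callee address described in the commit above can be contrasted in a few lines. This is a hedged userspace sketch: both function names are hypothetical, call_base stands in for __bpf_call_base, and ksyms stands in for the address list exposed through the program info.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Offset scheme: the imm field holds a signed offset from the call base.
 * This breaks down when the offset does not fit in 32 bits.
 */
static uint64_t callee_addr_from_offset(uint64_t call_base, int32_t imm)
{
    return call_base + (uint64_t)(int64_t)imm;
}

/*
 * Index scheme: the imm field holds the subprog id, used as an index
 * into the list of kernel symbol addresses. Returns 0 when the index
 * is out of range.
 */
static uint64_t callee_addr_from_index(const uint64_t *ksyms, size_t nr_ksyms,
                                       int32_t imm)
{
    if (imm < 0 || (size_t)imm >= nr_ksyms)
        return 0;
    return ksyms[imm];
}
```

The index scheme keeps imm small regardless of how far apart the addresses are, which is why it also works on architectures where the offset would overflow.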
  2. 23 May, 2018 3 commits
    • M
      bpf: btf: Rename btf_key_id and btf_value_id in bpf_map_info · 9b2cf328
      Martin KaFai Lau committed
      In "struct bpf_map_info", the names "btf_id", "btf_key_id" and "btf_value_id"
      could cause confusion because the "id" of "btf_id" means the BPF obj id
      given to the BTF object, while "btf_key_id" and "btf_value_id" mean the
      BTF type id within that BTF object.
      
      To make it clear, btf_key_id and btf_value_id are
      renamed to btf_key_type_id and btf_value_type_id.
      Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      9b2cf328
    • M
      bpf: btf: Remove unused bits from uapi/linux/btf.h · aea2f7b8
      Martin KaFai Lau committed
      This patch does the following:
      1. Limit BTF_MAX_TYPES and BTF_MAX_NAME_OFFSET to 64k.  We can
         raise it later.
      
      2. Remove the BTF_TYPE_PARENT and BTF_STR_TBL_ELF_ID.  They are
         currently encoded at the highest bit of a u32.
         They are removed because the current use case does not require
         supporting a parent type (i.e. type_id referring to a type in another
         BTF file).  It also does not support referring to a string in ELF.
      
         The BTF_TYPE_PARENT and BTF_STR_TBL_ELF_ID checks are replaced
         by BTF_TYPE_ID_CHECK and BTF_STR_OFFSET_CHECK which are
         defined in btf.c instead of uapi/linux/btf.h.
      
      3. Limit the BTF_INFO_KIND from 5 bits to 4 bits which is enough.
         The unused bits remain as headroom if we ever need them later.
      
      4. The root bit in BTF_INFO is also removed because it is not
         used in the current use case.
      
      5. Remove BTF_INT_VARARGS since func type is not supported now.
         The BTF_INT_ENCODING is limited to 4 bits instead of 8 bits.
      
      The above can be added back later because the verifier
      ensures the unused bits are zeros.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      aea2f7b8
    • M
      bpf: btf: Change how section is supported in btf_header · f80442a4
      Martin KaFai Lau committed
      There are currently unused section descriptions in the btf_header.  Those
      sections are here to support future BTF use cases.  For example, the
      func section (func_off) is to support function signature (e.g. the BPF
      prog function signature).
      
      Instead of spelling out all potential sections up-front in the btf_header,
      this patch makes changes to btf_header such that extending it (e.g. adding
      a section) is possible later.  The unused ones can be removed for now and
      they can be added back later.
      
      This patch:
      1. adds a hdr_len to the btf_header.  It will allow adding
      sections (and other info like parent_label and parent_name)
      later.  The check is similar to the existing bpf_attr.
      If a user passes in a longer hdr_len, the kernel
      ensures the extra trailing bytes are 0.
      
      2. allows the section order in the BTF object to be
      different from its sec_off order in btf_header.
      
      3. requires each sec_off to be followed by a sec_len.  There must be no
      gaps or overlaps among sections.
      
      The string section is ensured to be at the end due to the 4 bytes
      alignment requirement of the type section.
      
      The above changes will allow enough flexibility to
      add new sections (and other info) to the btf_header later.
      
      This patch also removes an unnecessary !err check
      at the end of btf_parse().
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      f80442a4
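The section rules from the btf_header commit above (any order in the header, but no gap and no overlap) can be checked by sorting the sections by offset and walking them once. The sketch below is a userspace illustration. struct sketch_sec and secs_are_contiguous() are hypothetical names, and it additionally assumes the sections must exactly cover a region of a given total size.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical (offset, length) pair describing one section. */
struct sketch_sec {
    uint32_t off;
    uint32_t len;
};

/* qsort comparator: order sections by ascending offset. */
static int sec_cmp(const void *a, const void *b)
{
    const struct sketch_sec *x = a, *y = b;

    return (x->off > y->off) - (x->off < y->off);
}

/*
 * Return true if the sections tile [0, total) with no gap and no
 * overlap. The array may be given in any order; it is sorted in place.
 */
static bool secs_are_contiguous(struct sketch_sec *secs, size_t nr,
                                uint32_t total)
{
    uint32_t expected = 0;

    qsort(secs, nr, sizeof(*secs), sec_cmp);
    for (size_t i = 0; i < nr; i++) {
        /* A larger offset is a gap, a smaller one is an overlap. */
        if (secs[i].off != expected)
            return false;
        if (secs[i].len > total - expected)
            return false;
        expected += secs[i].len;
    }
    return expected == total;
}
```

Sorting first is what lets the header list sections in a different order from their layout in the object.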
  3. 22 May, 2018 2 commits
  4. 19 May, 2018 1 commit
  5. 18 May, 2018 1 commit
  6. 16 May, 2018 1 commit
  7. 15 May, 2018 2 commits
    • M
      sched: cls: enable verbose logging · 81c7288b
      Marcelo Ricardo Leitner committed
      Currently, when the rule is not to be exclusively executed by the
      hardware, extack is not passed along and offloading failures don't
      get logged. The idea was that hardware failures are okay because the
      rule will get executed in software then and this way it doesn't confuse
      unaware users.
      
      But this is not helpful in case one needs to understand why a certain
      rule failed to get offloaded. Considering it may have been a temporary
      failure, like resources exceeded or so, reproducing it later and confirming
      that it fails for the same reason may be challenging.
      
      The ultimate goal is to improve Open vSwitch debuggability when using
      flower offloading.
      
      This patch adds a new flag to enable verbose logging. With the flag set,
      extack will be passed to the driver, which will be able to log the
      error. As the operation itself probably won't fail (not because of this,
      at least), current iproute will already log it as a Warning.
      
      The flag is generic, so it can be reused later. No need to restrict it
      just for HW offloading. The command line will follow the syntax that
      tc-ebpf already uses, tc ... [ verbose ] ... , and extend its meaning.
      
      For example:
      # ./tc qdisc add dev p7p1 ingress
      # ./tc filter add dev p7p1 parent ffff: protocol ip prio 1 \
      	flower verbose \
      	src_mac ed:13:db:00:00:00 dst_mac 01:80:c2:00:00:d0 \
      	src_ip 56.0.0.0 dst_ip 55.0.0.0 action drop
      Warning: TC offload is disabled on net device.
      # echo $?
      0
      # ./tc filter add dev p7p1 parent ffff: protocol ip prio 1 \
      	flower \
      	src_mac ff:13:db:00:00:00 dst_mac 01:80:c2:00:00:d0 \
      	src_ip 56.0.0.0 dst_ip 55.0.0.0 action drop
      # echo $?
      0
      Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      81c7288b
    • R
      vmcore: add API to collect hardware dump in second kernel · 2724273e
      Rahul Lakkireddy committed
      The sequence of actions done by device drivers to append their device
      specific hardware/firmware logs to /proc/vmcore is as follows:
      
      1. During probe (before hardware is initialized), device drivers
      register to the vmcore module (via vmcore_add_device_dump()), with a
      callback function, along with the buffer size and log name needed for
      firmware/hardware log collection.
      
      2. vmcore module allocates the buffer with the requested size. It adds
      an ELF note and invokes the device driver's registered callback
      function.
      
      3. Device driver collects all hardware/firmware logs into the buffer
      and returns control back to vmcore module.
      
      Ensure that the device dump buffer size is always aligned to page size
      so that it can be mmaped.
      
      Also, rename alloc_elfnotes_buf() to vmcore_alloc_buf() to make it more
      generic and reserve NT_VMCOREDD note type to indicate vmcore device
      dump.
      
      Suggested-by: Eric Biederman <ebiederm@xmission.com>
      Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
      Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2724273e
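The register/allocate/collect sequence in the vmcore commit above can be mimicked in userspace to show the flow and the page-size rounding. Everything named here (struct sketch_dump, sketch_add_device_dump(), fill_pattern()) is hypothetical and is not the kernel API. The ELF note step is omitted, and calloc() stands in for the kernel allocation.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical registration record: log name, size, collect callback. */
struct sketch_dump {
    const char *name;
    size_t size;
    int (*collect)(void *buf, size_t size);
};

/* Round size up to a multiple of page_size (which must be a power of two). */
static size_t align_to_page(size_t size, size_t page_size)
{
    return (size + page_size - 1) & ~(page_size - 1);
}

/* Example callback: a driver would copy its hardware/firmware logs here. */
static int fill_pattern(void *buf, size_t size)
{
    memset(buf, 0xab, size);
    return 0;
}

/*
 * Allocate a zeroed, page-size-rounded buffer and let the driver
 * callback fill it. Returns the buffer (caller frees it) or NULL,
 * and stores the rounded size in *out_size.
 */
static void *sketch_add_device_dump(const struct sketch_dump *d,
                                    size_t page_size, size_t *out_size)
{
    size_t sz = align_to_page(d->size, page_size);
    void *buf = calloc(1, sz);

    if (!buf)
        return NULL;
    if (d->collect(buf, d->size) != 0) {
        free(buf);
        return NULL;
    }
    *out_size = sz;
    return buf;
}
```

A 5000-byte request with 4096-byte pages ends up in an 8192-byte buffer, so the whole dump occupies a whole number of pages, which the commit requires so that the buffer can be mmaped.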
  8. 11 May, 2018 1 commit
    • D
      bpf: Provide helper to do forwarding lookups in kernel FIB table · 87f5fc7e
      David Ahern committed
      Provide a helper for doing a FIB and neighbor lookup in the kernel
      tables from an XDP program. The helper provides a fastpath for forwarding
      packets. If the packet is a local delivery or for any reason is not a
      simple lookup and forward, the packet continues up the stack.
      
      If it is to be forwarded, the forwarding can be done directly if the
      neighbor is already known. If the neighbor does not exist, the first
      few packets go up the stack for neighbor resolution. Once resolved, the
      xdp program provides the fast path.
      
      On successful lookup the nexthop dmac, current device smac and egress
      device index are returned.
      
      The API supports IPv4, IPv6 and MPLS protocols, but only IPv4 and IPv6
      are implemented in this patch. The API includes layer 4 parameters if
      the XDP program chooses to do deep packet inspection to allow comparison
      against ACLs implemented as FIB rules.
      
      Header rewrite is left to the XDP program.
      
      The lookup takes 2 flags:
      - BPF_FIB_LOOKUP_DIRECT to do a lookup that bypasses FIB rules and goes
        straight to the table associated with the device (expert setting for
        those looking to maximize throughput)
      
      - BPF_FIB_LOOKUP_OUTPUT to do a lookup from the egress perspective.
        Default is an ingress lookup.
      
      Initial performance numbers collected by Jesper, forwarded packets/sec:
      
             Full stack    XDP FIB lookup    XDP Direct lookup
      IPv4   1,947,969       7,074,156          7,415,333
      IPv6   1,728,000       6,165,504          7,262,720
      
      These numbers are single CPU core forwarding on a Broadwell
      E5-1650 v4 @ 3.60GHz.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      87f5fc7e
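The forwarding decision described in the FIB lookup commit above can be modelled in userspace: take the fast path only when the lookup says the packet is a plain forward and the neighbor is already resolved, otherwise pass the packet up the stack. The types and names below are hypothetical stand-ins, not the helper's real parameter structure. The MAC rewrite is shown because the commit leaves header rewrite to the XDP program.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the XDP verdicts used in this sketch. */
enum sketch_xdp { SKETCH_XDP_PASS, SKETCH_XDP_REDIRECT };

/* Hypothetical summary of what a successful lookup reports. */
struct sketch_fib_result {
    bool is_forward;        /* not local delivery, simple lookup-and-forward */
    bool neigh_known;       /* neighbor entry already resolved */
    int egress_ifindex;     /* egress device index */
    unsigned char dmac[6];  /* nexthop destination MAC */
    unsigned char smac[6];  /* egress device source MAC */
};

/*
 * Decide between the fast path and the regular stack. On the fast
 * path the first 12 bytes of the Ethernet header (destination MAC,
 * then source MAC) are rewritten by the program itself.
 */
static enum sketch_xdp xdp_fwd_decision(const struct sketch_fib_result *r,
                                        unsigned char eth_hdr[12])
{
    if (!r->is_forward || !r->neigh_known)
        return SKETCH_XDP_PASS;   /* let the stack handle it */

    memcpy(eth_hdr, r->dmac, 6);
    memcpy(eth_hdr + 6, r->smac, 6);
    return SKETCH_XDP_REDIRECT;   /* send out of egress_ifindex */
}
```

Packets that take the pass branch are the ones the commit says go up the stack, including the first few packets of a flow while the neighbor is still unresolved.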
  9. 09 May, 2018 2 commits
    • M
      bpf: btf: Add struct bpf_btf_info · 62dab84c
      Martin KaFai Lau committed
      During BPF_OBJ_GET_INFO_BY_FD on a btf_fd, the current bpf_attr's
      info.info is directly filled with the BTF binary data.  It is
      not extensible.  In this case, we want to add BTF ID.
      
      This patch adds "struct bpf_btf_info" which has the BTF ID as
      one of its members.  The BTF binary data itself is exposed through
      the "btf" and "btf_size" members.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      62dab84c
    • M
      bpf: btf: Introduce BTF ID · 78958fca
      Martin KaFai Lau committed
      This patch gives an ID to each loaded BTF.  The ID is allocated by
      the idr like the existing prog-id and map-id.
      
      The btf_put(map->btf) is moved to __bpf_map_put() so that the
      userspace can stop seeing the BTF ID ASAP when the last BTF
      refcnt is gone.
      
      It also makes BTF accessible from userspace through the
      1. new BPF_BTF_GET_FD_BY_ID command.  It is limited to CAP_SYS_ADMIN
         which is in line with the BPF_BTF_LOAD cmd and the existing
         BPF_[MAP|PROG]_GET_FD_BY_ID cmd.
      2. new btf_id (and btf_key_id + btf_value_id) in "struct bpf_map_info"
      
      Once the BTF ID handler is accessible from userspace, freeing a BTF
      object has to go through a rcu period.  The BPF_BTF_GET_FD_BY_ID cmd
      can then be done under a rcu_read_lock() instead of taking
      spin_lock.
      [Note: A similar rcu usage can be done to the existing
             bpf_prog_get_fd_by_id() in a follow up patch]
      
      When processing the BPF_BTF_GET_FD_BY_ID cmd,
      refcount_inc_not_zero() is needed because the BTF object
      could already be in the RCU dead row.  btf_get() is
      removed since its usage is currently limited to btf.c
      alone.  refcount_inc() is used directly instead.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      78958fca
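The reason the BTF ID commit above needs refcount_inc_not_zero() rather than a plain increment is that a lookup by ID may find an object whose count has already dropped to zero and which is only waiting to be freed. The following userspace sketch shows the same idea with C11 atomics. sketch_inc_not_zero() is a hypothetical name, and it does not reproduce the saturation behaviour of the kernel's refcount_t.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference only if the object is still live (count > 0).
 * Returns false when the count is already zero, in which case the
 * caller must treat the object as gone.
 */
static bool sketch_inc_not_zero(atomic_uint *ref)
{
    unsigned int old = atomic_load(ref);

    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(ref, &old, old + 1))
            return true;
    }
    return false;
}
```

A plain increment would revive an object whose last reference is already gone, whereas this variant makes the lookup fail cleanly.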
  10. 07 May, 2018 5 commits
  11. 04 May, 2018 9 commits
  12. 02 May, 2018 4 commits
    • M
      Revert "vhost: make msg padding explicit" · c818aa88
      Michael S. Tsirkin committed
      This reverts commit 93c0d549c4c5a7382ad70de6b86610b7aae57406.
      
      Unfortunately the padding will break 32 bit userspace.
      Ouch. Need to add some compat code, revert for now.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c818aa88
    • S
      tcp: send in-queue bytes in cmsg upon read · b75eba76
      Soheil Hassas Yeganeh committed
      Applications with many concurrent connections, high variance
      in receive queue length and tight memory bounds cannot
      allocate worst-case buffer size to drain sockets. Knowing
      the size of receive queue length, applications can optimize
      how they allocate buffers to read from the socket.
      
      The number of bytes pending on the socket is directly
      available through ioctl(FIONREAD/SIOCINQ) and can be
      approximated using getsockopt(MEMINFO) (rmem_alloc includes
      skb overheads in addition to application data). But, both of
      these options add an extra syscall per recvmsg. Moreover,
      ioctl(FIONREAD/SIOCINQ) takes the socket lock.
      
      Add the TCP_INQ socket option to TCP. When this socket
      option is set, recvmsg() relays the number of bytes available
      on the socket for reading to the application via the
      TCP_CM_INQ control message.
      
      Calculate the number of bytes after releasing the socket lock
      to include the processed backlog, if any. To avoid an extra
      branch in the hot path of recvmsg() for this new control
      message, move all cmsg processing inside an existing branch for
      processing receive timestamps. Since the socket lock is not held
      when calculating the size of receive queue, TCP_INQ is a hint.
      For example, it can overestimate the queue size by one byte,
      if FIN is received.
      
      With this method, applications can start reading from the socket
      using a small buffer, and then use larger buffers based on the
      remaining data when needed.
      
      V3 change-log:
      	As suggested by David Miller, added loads with barrier
      	to check whether we have multiple threads calling recvmsg
      	in parallel. When that happens we lock the socket to
      	calculate inq.
      V4 change-log:
      	Removed inline from a static function.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Neal Cardwell <ncardwell@google.com>
      Suggested-by: David Miller <davem@davemloft.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b75eba76
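From the application side, the TCP_INQ mechanism described in the commit above involves two steps: enabling the option once, then reading the TCP_CM_INQ control message after each recvmsg(). The sketch below is a hedged Linux userspace example. The two function names are hypothetical, the fallback defines assume the value 36 from the kernel's uapi header in case the libc headers predate the option, and the control message is assumed to carry an int at level IPPROTO_TCP.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Fallbacks for older libc headers (assumed values, see lead-in). */
#ifndef TCP_INQ
#define TCP_INQ 36
#endif
#ifndef TCP_CM_INQ
#define TCP_CM_INQ TCP_INQ
#endif

/* Ask the kernel to attach the in-queue byte count to each recvmsg(). */
static int enable_tcp_inq(int fd)
{
    int one = 1;

    return setsockopt(fd, IPPROTO_TCP, TCP_INQ, &one, sizeof(one));
}

/*
 * Walk the control messages of a msghdr already filled by recvmsg()
 * and return the reported in-queue byte count, or -1 if the message
 * carries no TCP_CM_INQ entry.
 */
static int inq_from_cmsg(struct msghdr *msg)
{
    struct cmsghdr *c;

    for (c = CMSG_FIRSTHDR(msg); c != NULL; c = CMSG_NXTHDR(msg, c)) {
        if (c->cmsg_level == IPPROTO_TCP &&
            c->cmsg_type == TCP_CM_INQ &&
            c->cmsg_len >= CMSG_LEN(sizeof(int))) {
            int inq;

            memcpy(&inq, CMSG_DATA(c), sizeof(inq));
            return inq;
        }
    }
    return -1;
}
```

An application could read with a small buffer first and size the next buffer from the returned value, keeping in mind that the commit calls the value a hint.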
    • S
      connector: add parent pid and tgid to coredump and exit events · b086ff87
      Stefan Strogin committed
      The intention is to get notified of process failures as soon as
      possible (e.g. in a process manager), before a core dump that could
      take a long time. Coredump and exit process events
      are perfect for such use cases (see 2b5faa4c "connector: Added
      coredumping event to the process connector").
      
      The problem is that for now the process manager cannot know the parent
      of a dying process using connectors. This could be useful if the
      process manager should monitor only the children of certain parents
      for failures, so we could filter the coredump and exit events by parent
      process and/or thread ID.
      
      Add parent pid and tgid to coredump and exit process connectors event
      data.
      Signed-off-by: Stefan Strogin <sstrogin@cisco.com>
      Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b086ff87
    • M
      vhost: make msg padding explicit · de08481a
      Michael S. Tsirkin committed
      There's a 32 bit hole just after type. It's best to
      give it a name, this way the compiler is forced to initialize
      it with the rest of the structure.
      Reported-by: Kevin Easton <kevin@guarana.org>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de08481a
  13. 30 April, 2018 3 commits
    • Q
      bpf: fix formatting for bpf_get_stack() helper doc · 79552fbc
      Quentin Monnet committed
      Fix formatting (indent) for bpf_get_stack() helper documentation, so
      that the doc is rendered correctly with the Python script.
      
      Fixes: c195651e ("bpf: add bpf_get_stack helper")
      Cc: Yonghong Song <yhs@fb.com>
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      79552fbc
    • Q
      bpf: fix formatting for bpf_perf_event_read() helper doc · 3bd5a09b
      Quentin Monnet committed
      Some edits brought to the last iteration of BPF helper functions
      documentation introduced an error with RST formatting. As a result, most
      of one paragraph is rendered in bold text when only the name of a helper
      should be. Fix it, and fix formatting of another function name in the
      same paragraph.
      
      Fixes: c6b5fb86 ("bpf: add documentation for eBPF helpers (42-50)")
      Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      3bd5a09b
    • E
      tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive · 05255b82
      Eric Dumazet committed
      When adding tcp mmap() implementation, I forgot that socket lock
      had to be taken before current->mm->mmap_sem. syzbot eventually caught
      the bug.
      
      Since we cannot lock the socket in the tcp mmap() handler, we have to
      split the operation into two phases.
      
      1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
        This operation does not involve any TCP locking.
      
      2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
        the transfer of pages from skbs to one VMA.
        This operation only uses down_read(&current->mm->mmap_sem) after
        holding TCP lock, thus solving the lockdep issue.
      
      This new implementation was suggested by Andy Lutomirski in great detail.
      
      Benefits are :
      
      - Better scalability, in case multiple threads reuse VMAs
         (without mmap()/munmap() calls) since mmap_sem won't be write locked.
      
      - Better error recovery.
         The previous mmap() model had to provide the expected size of the
         mapping. If for some reason one part could not be mapped (partial MSS),
         the whole operation had to be aborted.
         With the tcp_zerocopy_receive struct, kernel can report how
         many bytes were successfully mapped, and how many bytes should
         be read to skip the problematic sequence.
      
      - No more memory allocation to hold an array of page pointers.
        16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/
      
      - skbs are freed while mmap_sem has been released
      
      Following patch makes the change in tcp_mmap tool to demonstrate
      one possible use of mmap() and getsockopt(... TCP_ZEROCOPY_RECEIVE ...)
      
      Note that memcg might require additional changes.
      
      Fixes: 93ab6cc6 ("tcp: implement mmap() for zero copy receive")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Suggested-by: Andy Lutomirski <luto@kernel.org>
      Cc: linux-mm@kvack.org
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      05255b82
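The second phase described in the commit above, the getsockopt() call that maps received pages into a VMA previously reserved with mmap() on the socket, might look roughly like this from userspace. This is a sketch under assumptions: struct sketch_zc mirrors the three fields of the tcp_zerocopy_receive struct as introduced by this commit (address, length, recv_skip_hint), the option value 35 is taken from the kernel's uapi header, and zerocopy_receive() is a hypothetical wrapper name.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Assumed option value of TCP_ZEROCOPY_RECEIVE (see lead-in). */
#define SKETCH_TCP_ZEROCOPY_RECEIVE 35

/* Local mirror of the argument struct, assumed layout (see lead-in). */
struct sketch_zc {
    uint64_t address;        /* in: start of the reserved VMA */
    uint32_t length;         /* in: bytes to map, out: bytes mapped */
    uint32_t recv_skip_hint; /* out: bytes to read the regular way */
};

/*
 * Ask the kernel to map up to vma_len bytes of received data at
 * vma_addr. Returns 0 and fills *mapped and *skip on success,
 * -1 on error (errno is set by getsockopt()).
 */
static int zerocopy_receive(int fd, void *vma_addr, uint32_t vma_len,
                            uint32_t *mapped, uint32_t *skip)
{
    struct sketch_zc zc;
    socklen_t zc_len = sizeof(zc);

    memset(&zc, 0, sizeof(zc));
    zc.address = (uint64_t)(uintptr_t)vma_addr;
    zc.length = vma_len;

    if (getsockopt(fd, IPPROTO_TCP, SKETCH_TCP_ZEROCOPY_RECEIVE,
                   &zc, &zc_len) != 0)
        return -1;

    *mapped = zc.length;
    *skip = zc.recv_skip_hint;
    return 0;
}
```

When the skip value is non-zero, the caller would read that many bytes with an ordinary read before trying the zerocopy path again, which is the error-recovery benefit the commit lists.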
  14. 29 April, 2018 2 commits
    • A
      bpf: Fix helpers ctx struct types in uapi doc · a3ef8e9a
      Andrey Ignatov committed
      Helpers may operate on two types of ctx structures: user visible ones
      (e.g. `struct bpf_sock_ops`) when used in user programs, and kernel ones
      (e.g. `struct bpf_sock_ops_kern`) in kernel implementation.
      
      UAPI documentation must refer to only user visible structures.
      
      The patch replaces references to `_kern` structures in BPF helpers
      description by corresponding user visible structures.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      a3ef8e9a
    • Y
      bpf: add bpf_get_stack helper · c195651e
      Yonghong Song committed
      Currently, stackmap and bpf_get_stackid helper are provided
      for bpf program to get the stack trace. This approach has
      a limitation though. If two stack traces have the same hash,
      only one will get stored in the stackmap table,
      so some stack traces are missing from user perspective.
      
      This patch implements a new helper, bpf_get_stack, which
      sends stack traces directly to the bpf program. The bpf program
      is able to see all stack traces, and then can do in-kernel
      processing or send stack traces to user space through
      a shared map or bpf_perf_event_output.
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c195651e