1. 20 Apr, 2018 7 commits
    • M
      bpf: btf: Add pretty print support to the basic arraymap · a26ca7c9
      Martin KaFai Lau committed
      This patch adds pretty print support to the basic arraymap.
      Support for other bpf maps can be added later.
      
      This patch adds new attrs to the BPF_MAP_CREATE command to allow
      specifying the btf_fd, btf_key_id and btf_value_id.  The
      BPF_MAP_CREATE can then associate the btf to the map if
      the creating map supports BTF.
      
      A BTF supported map needs to implement two new map ops,
      map_seq_show_elem() and map_check_btf().  This patch has
      implemented these new map ops for the basic arraymap.
      
      It also adds file_operations, bpffs_map_fops, to the pinned
      map such that the pinned map can be opened and read.
      After that, the user has an intuitive way to do
      "cat bpffs/pathto/a-pinned-map" instead of getting
      an error.
      
      bpffs_map_fops should not be extended further to support
      other operations.  Other operations (e.g. write/key-lookup...)
      should be realized by the userspace tools (e.g. bpftool) through
      the BPF_OBJ_GET_INFO_BY_FD, map's lookup/update interface...etc.
      Follow up patches will allow the userspace to obtain
      the BTF from a map-fd.
      
      Here is a sample output when reading a pinned arraymap
      with the following map's value:
      
      struct map_value {
      	int count_a;
      	int count_b;
      };
      
      cat /sys/fs/bpf/pinned_array_map:
      
      0: {1,2}
      1: {3,4}
      2: {5,6}
      ...
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      a26ca7c9
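The sample output above can be mimicked in userspace. The following is a minimal sketch, assuming the output format is exactly `<index>: {<count_a>,<count_b>}` as shown in the commit message; `format_elem()` is a hypothetical helper and not the kernel's map_seq_show_elem() implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Value layout taken from the commit message. */
struct map_value {
	int count_a;
	int count_b;
};

/*
 * Hypothetical userspace helper: format one array element the way the
 * sample output shows it ("<index>: {<count_a>,<count_b>}").
 * Returns the number of characters written, as snprintf() does.
 */
static int format_elem(char *buf, size_t len, unsigned int idx,
		       const struct map_value *v)
{
	return snprintf(buf, len, "%u: {%d,%d}\n", idx, v->count_a, v->count_b);
}
```

Calling `format_elem()` once per index and writing the buffer out reproduces the listing shape shown for the pinned arraymap.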
    • M
      bpf: btf: Add BPF_OBJ_GET_INFO_BY_FD support to BTF fd · 60197cfb
      Martin KaFai Lau committed
      This patch adds BPF_OBJ_GET_INFO_BY_FD support to BTF fd.
      The original BTF data, which was used to create the BTF fd during
      the earlier BPF_BTF_LOAD call, will be returned.
      
      Userspace is expected to allocate a buffer for info.info and
      to set info.info_len to the buffer size before
      calling BPF_OBJ_GET_INFO_BY_FD.
      
      The original BTF data is copied to the userspace buffer (info.info).
      Only up to the user-specified info.info_len bytes will be copied.
      
      On return, info.info_len is set to the original BTF data size.  Userspace
      needs to check if it is bigger than its allocated buffer size.
      If it is, userspace should realloc with the kernel-returned
      info.info_len and call BPF_OBJ_GET_INFO_BY_FD again.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      60197cfb
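The call/check/realloc/retry pattern described in this commit can be sketched without a real BTF fd. Everything below is a simulation under stated assumptions: `struct fake_btf_info`, `fake_get_info()` and `fetch_btf()` are hypothetical names, and `fake_get_info()` only imitates the described behavior (copy at most info_len bytes, then report the full size in info_len). It does not issue the bpf() syscall.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the info struct; only the two fields the
 * commit message mentions are modelled. */
struct fake_btf_info {
	void *info;        /* userspace buffer */
	uint32_t info_len; /* in: buffer size, out: full BTF data size */
};

/* Simulated kernel side: copy at most info_len bytes, then report the
 * full size of the original BTF data back in info_len. */
static int fake_get_info(const void *btf, uint32_t btf_size,
			 struct fake_btf_info *i)
{
	uint32_t n = i->info_len < btf_size ? i->info_len : btf_size;

	if (n)
		memcpy(i->info, btf, n);
	i->info_len = btf_size;
	return 0;
}

/* Userspace side: call once, grow the buffer if it was too small, retry.
 * Returns a malloc()ed copy (caller frees) or NULL on failure. */
static void *fetch_btf(const void *btf, uint32_t btf_size, uint32_t guess,
		       uint32_t *out_len)
{
	struct fake_btf_info i = { .info = malloc(guess), .info_len = guess };
	void *bigger;

	if (!i.info)
		return NULL;
	if (fake_get_info(btf, btf_size, &i))
		goto err;
	if (i.info_len > guess) {
		/* Buffer was too small: resize to the reported size. */
		bigger = realloc(i.info, i.info_len);
		if (!bigger)
			goto err;
		i.info = bigger;
		if (fake_get_info(btf, btf_size, &i))
			goto err;
	}
	*out_len = i.info_len;
	return i.info;
err:
	free(i.info);
	return NULL;
}
```

With a deliberately small first guess, `fetch_btf()` makes two calls and ends up with a complete copy; with a large enough guess, one call suffices.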
    • M
      bpf: btf: Add BPF_BTF_LOAD command · f56a653c
      Martin KaFai Lau committed
      This patch adds a BPF_BTF_LOAD command which
      1) loads and verifies the BTF (implemented in earlier patches)
      2) returns a BTF fd to userspace.  In the next patch, the
         BTF fd can be specified during BPF_MAP_CREATE.
      
      It is currently limited to CAP_SYS_ADMIN.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      f56a653c
    • M
      bpf: btf: Add pretty print capability for data with BTF type info · b00b8dae
      Martin KaFai Lau committed
      This patch adds pretty print capability for data with BTF type info.
      The current usage is to allow pretty print for a BPF map.
      
      The next few patches will allow a read() on a pinned map with BTF
      type info for its key and value.
      
      This patch uses the seq_printf() infra.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      b00b8dae
    • M
      bpf: btf: Check members of struct/union · 179cde8c
      Martin KaFai Lau committed
      This patch checks a few things about a struct's members:
      
      1) It has a valid size (e.g. a "const void" is invalid)
      2) A member's size (+ its member's offset) does not exceed
         the containing struct's size.
      3) The member's offset satisfies the alignment requirement
      
      The above can only be done after the needs_resolve member's type
      is resolved.  Hence, the above is done together in
      btf_struct_resolve().
      
      Each possible member's type (e.g. int, enum, modifier...) implements
      the check_member() ops which will be called from btf_struct_resolve().
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      179cde8c
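The three member checks listed in this commit can be illustrated with a small predicate. This is a simplified sketch and not the kernel's check_member() code: `struct member_desc` and `member_ok()` are hypothetical, sizes and offsets are assumed to be in bytes, and the alignment rule is assumed to be "offset is a multiple of a per-member alignment".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified member description (byte units, not BTF's
 * on-disk encoding). */
struct member_desc {
	uint32_t offset; /* offset inside the containing struct */
	uint32_t size;   /* resolved size of the member's type, 0 if invalid */
	uint32_t align;  /* assumed alignment requirement */
};

static bool member_ok(const struct member_desc *m, uint32_t struct_size)
{
	/* 1) the member must have a valid size (e.g. "const void" has none) */
	if (m->size == 0)
		return false;
	/* 2) offset + size must not exceed the struct size; written so the
	 *    addition cannot overflow */
	if (m->offset > struct_size || m->size > struct_size - m->offset)
		return false;
	/* 3) the offset must satisfy the (assumed) alignment requirement */
	if (m->align == 0 || m->offset % m->align != 0)
		return false;
	return true;
}
```

As the commit message notes, a check like this can only run after the member's type has been resolved, because the size is not known before that.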
    • M
      bpf: btf: Validate type reference · eb3f595d
      Martin KaFai Lau committed
      After collecting all btf_type in the first pass in an earlier patch,
      the second pass (in this patch) can validate the reference types
      (e.g. the referred-to type exists and a type does not refer to itself).
      
      While checking the reference type, it also gathers other information (e.g.
      the size of an array).  This info will be useful in checking the
      struct's members in a later patch.  They will also be useful in doing
      pretty print later.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      eb3f595d
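The reference validation described in this commit (the referenced type exists, and a type does not end up referring to itself) can be sketched as a bounded walk along the reference chain. The data model is an assumption made for illustration: `struct type_desc` and `ref_chain_ok()` are hypothetical, ids start at 1, and a `ref` of 0 means "refers to nothing".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified type record: ref == 0 means "refers to nothing". */
struct type_desc {
	uint32_t ref; /* id of the referenced type, ids start at 1 */
};

/*
 * Follow the reference chain starting at type `id`.  The chain is valid
 * when every referenced id exists and the chain ends.  A chain longer
 * than nr_types must contain a loop.  types[0] is unused so that ids
 * index the array directly.
 */
static bool ref_chain_ok(const struct type_desc *types, uint32_t nr_types,
			 uint32_t id)
{
	uint32_t steps = 0;

	if (id == 0 || id > nr_types)
		return false;
	while (types[id].ref != 0) {
		uint32_t next = types[id].ref;

		if (next > nr_types)      /* referenced type does not exist */
			return false;
		if (++steps > nr_types)   /* loop, including self-reference */
			return false;
		id = next;
	}
	return true;
}
```

The step counter is a deliberately simple loop detector; it trades a few extra iterations for not needing a visited set.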
    • M
      bpf: btf: Introduce BPF Type Format (BTF) · 69b693f0
      Martin KaFai Lau committed
      This patch introduces BPF Type Format (BTF).
      
      BTF (BPF Type Format) is the metadata format which describes
      the data types of a BPF program/map.  Hence, it basically focuses
      on the C programming language, which modern BPF primarily
      uses.  The first use case is to provide a generic pretty print
      capability for a BPF map.
      
      BTF has its roots in CTF (Compact C-Type Format).  To simplify
      the handling of BTF data, BTF removes the differences between
      small and big type/struct-member.  Hence, BTF consistently uses u32
      instead of supporting both "one u16" and "two u32 (+padding)" in
      describing type and struct-member.
      
      It also raises the number of types (and functions) limit
      from 0x7fff to 0x7fffffff.
      
      Due to the above changes, the format is not compatible with CTF.
      Hence, BTF starts with a new BTF_MAGIC and version number.
      
      This patch does the first verification pass to the BTF.  The first
      pass checks:
      1. meta-data size (e.g. it does not go beyond the total BTF size)
      2. name_offset is valid
      3. Each BTF_KIND (e.g. int, enum, struct....) does its
         own check of its meta-data.
      
      Some other checks, like checking that a struct's member refers
      to a valid type, can only be done in the second pass.  The second
      verification pass will be implemented in the next patch.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      69b693f0
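One of the first-pass checks listed in this commit is that name_offset is valid. A minimal sketch of such a check follows, assuming a name is valid when its offset lies inside the string section and a NUL terminator is found before the section ends. `name_offset_ok()` is a hypothetical helper, not the kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: accept a name offset when it points inside the
 * string section and the string is NUL-terminated within the section.
 */
static bool name_offset_ok(const char *strs, uint32_t str_len, uint32_t off)
{
	if (off >= str_len)
		return false;
	return memchr(strs + off, '\0', str_len - off) != NULL;
}
```

Bounding the search with `memchr()` rather than calling `strlen()` keeps the check from reading past the end of an unterminated section.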
  2. 19 Apr, 2018 13 commits
  3. 18 Apr, 2018 15 commits
  4. 17 Apr, 2018 5 commits
    • D
      Merge branch 'XDP-redirect-memory-return-API' · 684009d4
      David S. Miller committed
      Jesper Dangaard Brouer says:
      
      ====================
      XDP redirect memory return API
      
      Submitted against net-next, as it contains NIC driver changes.
      
      This patchset works towards supporting different XDP RX-ring memory
      allocators, as this will be needed by the AF_XDP zero-copy mode.
      
      The patchset uses mlx5 as the sample driver, which gets an
      XDP_REDIRECT RX-mode implementation, but not ndo_xdp_xmit (as this
      API is subject to change throughout the patchset).
      
      A new struct xdp_frame is introduced (modeled after cpumap xdp_pkt).
      Both ndo_xdp_xmit and the new xdp_return_frame end up using it.
      
      Support for a driver-supplied allocator is implemented, and a
      refurbished version of page_pool is the first return allocator type
      introduced.  This will be an integration point for AF_XDP zero-copy.
      
      The mlx5 driver evolves into using the page_pool, and sees a performance
      increase (with ndo_xdp_xmit out the ixgbe driver) from 6Mpps to 12Mpps.
      
      The patchset stops at 16 patches (one over the limit), but more API
      changes are planned, specifically extending the ndo_xdp_xmit and
      xdp_return_frame APIs to support bulking, as this will address some
      known limits.
      
      V2: Updated according to Tariq's feedback
      V3: Updated based on feedback from Jason Wang and Alex Duyck
      V4: Updated based on feedback from Tariq and Jason
      V5: Fix SPDX license, add Tariq's reviews, improve patch desc for perf test
      V6: Updated based on feedback from Eric Dumazet and Alex Duyck
      V7: Adapt to i40e that got XDP_REDIRECT support in-between
      V8:
       Updated based on feedback from kbuild test robot, and adjust for mlx5 changes
       page_pool only compiled into kernel when drivers Kconfig 'select' feature
      V9:
       Remove some inline statements, let compiler decide what to inline
       Fix return value in virtio_net driver
       Adjust for mlx5 changes in-between submissions
      V10:
       Minor adjust for mlx5 requested by Tariq
       Resubmit against net-next
      V11: avoid leaking info stored in frame data on page reuse
      ====================
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      684009d4
    • J
      xdp: avoid leaking info stored in frame data on page reuse · 6dfb970d
      Jesper Dangaard Brouer committed
      The bpf infrastructure and verifier go to great lengths to avoid
      bpf progs leaking kernel (pointer) info.
      
      For queueing an xdp_buff via XDP_REDIRECT, xdp_frame info stores
      kernel info (incl. pointers) in the top part of frame data
      (xdp->data_hard_start).  Checks are in place to ensure enough
      headroom is available for this.
      
      This info is not cleared, and if the frame is reused, then a
      malicious user could use the bpf_xdp_adjust_head helper to move
      xdp->data into this area, thus making this area readable.
      
      This is not super critical as XDP progs require root or
      CAP_SYS_ADMIN, which are privileged enough for such info.  An
      effort is underway towards moving networking bpf hooks to the
      lesser privileged mode CAP_NET_ADMIN, where leaking such info
      should be avoided.  Thus, this patch clears the info when
      needed.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6dfb970d
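The leak scenario in this commit can be modelled in userspace: bookkeeping is stored at the start of the buffer's headroom, and unless it is wiped, a later user who moves the data pointer back into that area could read it. The sketch below is illustrative only and is not the kernel's actual fix. `struct frame_info`, `struct buf_view`, `store_frame_info()` and `clear_frame_info()` are hypothetical names loosely modelled on xdp_frame and xdp_buff.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the bookkeeping kept at the buffer start. */
struct frame_info {
	void *data;
	unsigned int len;
};

/* Hypothetical, simplified buffer view (names modelled on xdp_buff). */
struct buf_view {
	unsigned char *hard_start; /* start of the underlying memory */
	unsigned char *data;       /* start of packet data */
};

/* Store the info at hard_start; fails when the headroom is too small. */
static int store_frame_info(struct buf_view *b, unsigned int len)
{
	struct frame_info fi = { .data = b->data, .len = len };

	if ((size_t)(b->data - b->hard_start) < sizeof(fi))
		return -1;
	memcpy(b->hard_start, &fi, sizeof(fi));
	return 0;
}

/* Wipe the stored info before the buffer is handed out again. */
static void clear_frame_info(struct buf_view *b)
{
	memset(b->hard_start, 0, sizeof(struct frame_info));
}
```

The headroom test in `store_frame_info()` corresponds to the "checks are in place" remark, and `clear_frame_info()` to the clearing step that the commit message describes only in general terms.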
    • J
      xdp: transition into using xdp_frame for ndo_xdp_xmit · 44fa2dbd
      Jesper Dangaard Brouer committed
      Change the ndo_xdp_xmit API to take a struct xdp_frame instead of a
      struct xdp_buff.  This brings xdp_return_frame and ndo_xdp_xmit in sync.
      
      This builds towards changing the API further to become a bulk API,
      because xdp_buff is not a queue-able object while xdp_frame is.
      
      V4: Adjust for commit 59655a5b ("tuntap: XDP_TX can use native XDP")
      V7: Adjust for commit d9314c47 ("i40e: add support for XDP_REDIRECT")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      44fa2dbd
    • J
      xdp: transition into using xdp_frame for return API · 03993094
      Jesper Dangaard Brouer committed
      Changing the xdp_return_frame() API to take struct xdp_frame as its
      argument seems like a natural choice.  But there are some subtle
      performance details here that need extra care, so this is a
      deliberate choice.
      
      De-referencing xdp_frame on a remote CPU during DMA-TX completion
      results in the cache-line changing to the "Shared" state.  Later,
      when the page is reused for RX, this xdp_frame cache-line is
      written, which changes the state to "Modified".
      
      This situation already happens (naturally) for virtio_net, tun and
      cpumap, as the xdp_frame pointer is the queued object.  In tun and
      cpumap, the ptr_ring is used for efficiently transferring cache-lines
      (with pointers) between CPUs.  Thus, the only option is to
      de-reference xdp_frame.
      
      It is only the ixgbe driver that had an optimization in which it could
      avoid de-referencing xdp_frame.  The driver already has a
      TX-ring queue, which (in case of remote DMA-TX completion) has to be
      transferred between CPUs anyhow.  In this data area, we stored a
      struct xdp_mem_info and a data pointer, which allowed us to avoid
      de-referencing xdp_frame.
      
      To compensate for this, a prefetchw is used for telling the cache
      coherency protocol about our access pattern.  My benchmarks show that
      this prefetchw is enough to compensate the ixgbe driver.
      
      V7: Adjust for commit d9314c47 ("i40e: add support for XDP_REDIRECT")
      V8: Adjust for commit bd658dda ("net/mlx5e: Separate dma base address
      and offset in dma_sync call")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      03993094
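The prefetchw idea in this commit is to tell the cache hierarchy that a line which is about to be read will later be written, so it is worth fetching with write intent. In userspace the closest analogue is the GCC/Clang builtin `__builtin_prefetch(addr, 1)`, where the second argument 1 requests a prefetch for write. The sketch below is hypothetical (`struct frame_rec` and `frame_len_on_completion()` are made-up names) and only shows where such a hint would sit relative to the de-reference.

```c
#include <assert.h>

/* Hypothetical frame record; only the fields we touch are modelled. */
struct frame_rec {
	void *data;
	unsigned int len;
};

/*
 * Read a field of the frame on the completion path.  The prefetch is a
 * pure hint: it asks for the cache line with write intent before the
 * read, and it never changes the program's result.
 */
static unsigned int frame_len_on_completion(const struct frame_rec *f)
{
	__builtin_prefetch(f, 1); /* 1 = prepare for write */
	return f->len;
}
```

Whether the hint helps depends on the hardware and access pattern; the commit message reports it from the author's own benchmarks, which this sketch does not reproduce.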
    • J
      mlx5: use page_pool for xdp_return_frame call · 60bbf7ee
      Jesper Dangaard Brouer committed
      This patch shows how it is possible to have both the driver-local page
      cache, which uses an elevated refcnt for "catching"/avoiding SKB
      put_page returning the page through the page allocator, and at the
      same time have pages returned to the page_pool from
      ndo_xdp_xmit DMA completion.
      
      The performance improvement for XDP_REDIRECT in this patch is really
      good, especially considering that (currently) the xdp_return_frame
      API and page_pool_put_page() do per-frame operations of both
      rhashtable ID-lookup and locked return into the (page_pool) ptr_ring.
      (The plan is to remove these per-frame operations in a followup
      patchset.)
      
      The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe,
      with xdp_redirect_map (using devmap).  The target/maximum
      capability of ixgbe is 13Mpps (on this HW setup).
      
      Before this patch for mlx5, XDP redirected frames were returned via
      the page allocator.  The single-flow performance was 6Mpps, and if I
      started two flows the collective performance dropped to 4Mpps, because
      we hit the page allocator lock (further negative scaling occurs).
      
      Two test scenarios need to be covered for the xdp_return_frame API:
      DMA-TX completion running on the same CPU, or cross-CPU free/return.
      Results were same-CPU=10Mpps, and cross-CPU=12Mpps.  This is very
      close to our 13Mpps max target.
      
      The reason the max target isn't reached in the cross-CPU test is likely
      RX-ring DMA unmap/map overhead (which doesn't occur in ixgbe to
      ixgbe testing).  It is also planned to remove this unnecessary DMA
      unmap in a later patchset.
      
      V2: Adjustments requested by Tariq
       - Changed page_pool_create return codes to never return NULL, only
         ERR_PTR, as this simplifies err handling in drivers.
       - Save a branch in mlx5e_page_release
       - Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
      
      V5: Updated patch desc
      
      V8: Adjust for b0cedc84 ("net/mlx5e: Remove rq_headroom field from params")
      V9:
       - Adjust for 121e8927 ("net/mlx5e: Refactor RQ XDP_TX indication")
       - Adjust for 73281b78 ("net/mlx5e: Derive Striding RQ size from MTU")
       - Correct handling if page_pool_create fails for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
      
      V10: Req from Tariq
       - Change pool_size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      60bbf7ee
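The V2 note in this commit says page_pool_create was changed to return only ERR_PTR values and never NULL, which leaves callers with a single failure check. The following userspace sketch re-creates that convention for illustration: `err_ptr()`, `ptr_err()` and `is_err()` imitate the kernel's ERR_PTR, PTR_ERR and IS_ERR helpers, while `struct pool` and `pool_create()` are hypothetical and unrelated to the real page_pool code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Userspace re-creations of the kernel's ERR_PTR helpers, for illustration. */
static inline void *err_ptr(long err) { return (void *)err; }
static inline long ptr_err(const void *p) { return (long)p; }
static inline int is_err(const void *p)
{
	/* Error codes occupy the top MAX_ERRNO values of the address range. */
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct pool {
	unsigned int size;
};

/* Hypothetical constructor: never returns NULL, only a pool or an error. */
static struct pool *pool_create(unsigned int size)
{
	struct pool *p;

	if (size == 0)
		return err_ptr(-EINVAL);
	p = malloc(sizeof(*p));
	if (!p)
		return err_ptr(-ENOMEM);
	p->size = size;
	return p;
}
```

A caller then needs only `if (is_err(p)) return ptr_err(p);`, with no separate NULL branch, which is the simplification the V2 note refers to.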