  1. 28 Jun 2019 (3 commits)
  2. 09 Mar 2019 (1 commit)
    • xsk: fix to reject invalid flags in xsk_bind · f54ba391
      Committed by Björn Töpel
      Passing a non-existent flag in the sxdp_flags member of struct
      sockaddr_xdp was incorrectly ignored without any error. This patch
      addresses that behavior and rejects any non-existent flags.
      
      We have examined existing user space code and, to the best of our
      knowledge, no one is relying on the current incorrect behavior.
      AF_XDP is still in its infancy, so from our perspective, the risk
      of breakage is very low, and addressing this problem now is
      important. A sketch of the new check follows this entry.
      
      Fixes: 965a9909 ("xsk: add support for bind for Rx")
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
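      A minimal sketch of the check, assuming the three bind flags the
      AF_XDP uapi defined at the time (XDP_SHARED_UMEM, XDP_COPY,
      XDP_ZEROCOPY); the mask macro here is illustrative, not
      necessarily the patch's exact shape.

          /* Bind flags from include/uapi/linux/if_xdp.h. */
          #define XSK_KNOWN_BIND_FLAGS \
                  (XDP_SHARED_UMEM | XDP_COPY | XDP_ZEROCOPY)

          /* In xsk_bind(): unknown bits in sxdp_flags now fail the
           * bind instead of being silently ignored. */
          if (sxdp->sxdp_flags & ~XSK_KNOWN_BIND_FLAGS)
                  return -EINVAL;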
  3. 21 Feb 2019 (1 commit)
  4. 11 Feb 2019 (1 commit)
    • xsk: add missing smp_rmb() in xsk_mmap · e6762c8b
      Committed by Magnus Karlsson
      All the setup code in AF_XDP is protected by a mutex, with the
      exception of the mmap code, which cannot use it. To make sure that
      a process banging on the mmap call at the same time as another
      process is setting up the socket still sees consistent state,
      smp_wmb() calls were added in the umem registration code and the
      queue creation code, so that the structures published for xsk_mmap
      would be consistent. However, the corresponding smp_rmb() calls
      were never added to the xsk_mmap code. This patch adds those
      calls; the pairing is sketched below.
      
      Fixes: 37b07693 ("xsk: add missing write- and data-dependency barrier")
      Fixes: c0c77d8f ("xsk: add user memory registration support sockopt")
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
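      The publish/consume pattern this completes, as a generic sketch;
      the field and helper names are illustrative, not the actual xsk
      structures.

          /* Setup side (umem registration / queue creation): */
          q = alloc_and_init_queue();     /* illustrative helper */
          smp_wmb();                      /* publish contents ...  */
          WRITE_ONCE(xs->fq, q);          /* ... before the pointer */

          /* mmap side (xsk_mmap), the half that was missing: */
          q = READ_ONCE(xs->fq);
          if (!q)
                  return -EINVAL;
          smp_rmb();                      /* pairs with smp_wmb() above */
          /* q's published fields are now safe to read */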
  5. 25 Jan 2019 (2 commits)
  6. 20 Dec 2018 (1 commit)
    • xsk: simplify AF_XDP socket teardown · e2ce3674
      Committed by Björn Töpel
      Prior to this commit, when the struct socket object was being
      released, the UMEM did not have its reference count decreased.
      Instead, this was done in the struct sock sk_destruct function.
      
      There is no reason to keep the UMEM reference around when the
      socket is being orphaned, so in this patch xdp_put_umem() is
      called in the xsk_release function. As a result, the xsk_destruct
      function can be removed entirely.
      
      Note that a struct xdp_sock reference might still linger in the
      XSKMAP after the UMEM is released, e.g. if a user does not clear
      the XSKMAP prior to closing the process. This sock will be in a
      "released", zombie-like state until the XSKMAP is removed.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
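      An abbreviated sketch of the release path after this change; the
      function names are the real ones, but the body is heavily trimmed
      and not the literal patch.

          static int xsk_release(struct socket *sock)
          {
                  struct sock *sk = sock->sk;
                  struct xdp_sock *xs = xdp_sk(sk);

                  /* ... unbind from the device, free the rings ... */

                  /* Drop the UMEM reference here rather than in a
                   * sk_destruct callback, so xsk_destruct() goes away. */
                  xdp_put_umem(xs->umem);

                  sock_orphan(sk);
                  sock->sk = NULL;
                  sock_put(sk);
                  return 0;
          }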
  7. 11 Oct 2018 (1 commit)
  8. 08 Oct 2018 (1 commit)
    • xsk: proper AF_XDP socket teardown ordering · 541d7fdd
      Committed by Björn Töpel
      The AF_XDP socket struct can exist in three different, implicit
      states: setup, bound and released. Setup is before the socket has
      been bound to a device. Bound is when the socket is active for
      receive and send. Released is when the process/userspace side of
      the socket is released, but the sock object is still lingering,
      e.g. when there is a reference to the socket in an XSKMAP after
      process termination.
      
      The Rx fast-path code uses the "dev" member of struct xdp_sock to
      check whether a socket is bound or released, and the Tx code uses
      the struct xdp_umem "xsk_list" member in conjunction with "dev" to
      determine the state of a socket.
      
      However, the transition from bound to released did not tear the
      socket down in the correct order.
      
      On the Rx side, "dev" was cleared after synchronize_net(), making
      the synchronization useless. On the Tx side, the internal queues
      were destroyed prior to removing them from the "xsk_list".
      
      This commit corrects the cleanup order (sketched below); by doing
      so, xdp_del_sk_umem() can be simplified and one synchronize_net()
      can be removed.
      
      Fixes: 965a9909 ("xsk: add support for bind for Rx")
      Fixes: ac98d8aa ("xsk: wire upp Tx zero-copy functions")
      Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
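      The corrected order, sketched; the exact statements in the patch
      differ, but the point is that the datapath-visible state is
      cleared before synchronize_net(), and the rings are freed only
      after both.

          xs->dev = NULL;                /* Rx fast path sees "released" */
          xdp_del_sk_umem(xs->umem, xs); /* Tx no longer finds the sock */

          synchronize_net();             /* drain in-flight datapath users */

          xskq_destroy(xs->rx);          /* only now free the rings */
          xskq_destroy(xs->tx);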
  9. 05 Oct 2018 (1 commit)
    • xsk: fix bug when trying to use both copy and zero-copy on one queue id · c9b47cc1
      Committed by Magnus Karlsson
      Previously, the xsk code did not record which umem was bound to a
      specific queue id. This was not required if all drivers were
      zero-copy enabled, as the binding had to be recorded in the driver
      anyway. So if a user tried to bind two umems to the same queue,
      the driver would say no. But if copy mode was enabled first and
      then zero-copy mode (or in the reverse order), we mistakenly
      enabled both of them on the same umem, leading to buggy behavior.
      The main culprit is that we did not store the umem-to-queue-id
      association in the copy case and relied only on the driver
      reporting it. As this relation was not stored in the driver for
      copy mode (it does not use the AF_XDP NDOs), this obviously could
      not work.
      
      This patch fixes the problem by always recording the
      umem-to-queue-id relationship in the netdev_queue and
      netdev_rx_queue structs. This way we always know what kind of umem
      has been bound to a queue id and can act appropriately at bind
      time. A sketch follows this entry.
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
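      A sketch of the added bookkeeping on the Rx side; the helper name
      and the exact field are hypothetical, but netdev_rx_queue
      (dev->_rx[]) is where the message says the umem is now recorded.

          static int xsk_reg_umem_at_qid(struct net_device *dev,
                                         struct xdp_umem *umem, u16 qid)
          {
                  if (qid >= dev->real_num_rx_queues)
                          return -EINVAL;
                  /* A umem is already bound to this queue id, in copy
                   * OR zero-copy mode; refuse a second binding. */
                  if (dev->_rx[qid].umem)
                          return -EBUSY;
                  dev->_rx[qid].umem = umem;
                  return 0;
          }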
  10. 01 Sep 2018 (1 commit)
  11. 30 Aug 2018 (1 commit)
  12. 31 Jul 2018 (1 commit)
  13. 13 Jul 2018 (4 commits)
  14. 03 Jul 2018 (2 commits)
    • xsk: fix potential race in SKB TX completion code · a9744f7c
      Committed by Magnus Karlsson
      There is a potential race in the TX completion code for the SKB
      case. One process enters the sendmsg code of an AF_XDP socket in order
      to send a frame. The execution eventually trickles down to the driver
      that is told to send the packet. However, it decides to drop the
      packet due to some error condition (e.g., rings full) and frees the
      SKB. This will trigger the SKB destructor and a completion will be
      sent to the AF_XDP user space through its
      single-producer/single-consumer queues.
      
      At the same time, a TX interrupt has fired on another core, and it
      dispatches the TX completion code in the driver. It does its
      HW-specific things and ends up freeing the SKB associated with the
      transmitted packet. This will trigger the SKB destructor, and a
      completion will be sent to the AF_XDP user space through its
      single-producer/single-consumer queues. As a pseudo call stack, it
      looks like this:
      
      Core 1:
      sendmsg() being called in the application
        netdev_start_xmit()
          Driver entered through ndo_start_xmit
            Driver decides to free the SKB for some reason (e.g., rings full)
              Destructor of SKB called
                xskq_produce_addr() is called to signal completion to user space
      
      Core 2:
      TX completion irq
        NAPI loop
          Driver irq handler for TX completions
            Frees the SKB
              Destructor of SKB called
                xskq_produce_addr() is called to signal completion to user space
      
      We now have a violation of the single-producer/single-consumer
      principle for our queues, as there are two threads trying to
      produce at the same time on the same queue. Fixed by introducing
      a spin lock in the destructor (sketched below).
      
      Regarding performance, I get around 1.74 Mpps for txonly both
      before and after the introduction of the spin lock. There is of
      course some impact, but it is in the less significant digits,
      which are too noisy for me to measure. Say the version without the
      spin lock got 1.745 Mpps in the best case and the version with it
      1.735 Mpps in the worst case; that would mean a maximum drop in
      performance of 0.5%.
      
      Fixes: 35fcde7f ("xsk: support for Tx")
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
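      The shape of the fix, abbreviated: both destructor invocations
      funnel through one spin lock, restoring the single-producer
      guarantee. The lock field name is illustrative.

          static void xsk_destruct_skb(struct sk_buff *skb)
          {
                  u64 addr = (u64)(long)skb_shinfo(skb)->destructor_arg;
                  struct xdp_sock *xs = xdp_sk(skb->sk);
                  unsigned long flags;

                  /* The destructor can run concurrently on two cores
                   * (error path vs. TX-completion path), so serialize
                   * the completion-queue producer. */
                  spin_lock_irqsave(&xs->tx_completion_lock, flags);
                  WARN_ON_ONCE(xskq_produce_addr(xs->umem->cq, addr));
                  spin_unlock_irqrestore(&xs->tx_completion_lock, flags);

                  sock_wfree(skb);
          }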
    • xsk: frame could be completed more than once in SKB path · fe588685
      Committed by Magnus Karlsson
      Fixed a bug in which a frame could be completed more than once
      when an error was returned from dev_direct_xmit(). The code
      erroneously retried sending the message, leading to multiple calls
      to the SKB destructor and therefore multiple completions of the
      same buffer to user space.
      
      The error code in this case has been changed from EAGAIN to EBUSY
      in order to tell user space that the sending of the packet failed
      and that the buffer has been returned to user space through the
      completion queue (see the sketch below).
      
      Fixes: 35fcde7f ("xsk: support for Tx")
      Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Reported-by: Pavel Odintsov <pavel@fastnetmon.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
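      The change in the sendmsg path, per the description above; the
      patch's actual error handling is more nuanced than this fragment.

          err = dev_direct_xmit(skb, xs->queue_id);
          if (err) {
                  /* The SKB destructor has already pushed this buffer
                   * to the completion queue; retrying would complete
                   * the same buffer twice. Report EBUSY and stop. */
                  err = -EBUSY;
                  goto out;
          }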
  15. 29 Jun 2018 (1 commit)
    • Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      Committed by Linus Torvalds
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      calls.
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 12 Jun 2018 (1 commit)
  17. 08 Jun 2018 (1 commit)
    • xsk: Fix umem fill/completion queue mmap on 32-bit · a5a16e43
      Committed by Geert Uytterhoeven
      With gcc-4.1.2 on 32-bit:
      
          net/xdp/xsk.c:663: warning: integer constant is too large for ‘long’ type
          net/xdp/xsk.c:665: warning: integer constant is too large for ‘long’ type
      
      Add the missing "ULL" suffixes to the large XDP_UMEM_PGOFF_*_RING values
      to fix this.
      
          net/xdp/xsk.c:663: warning: comparison is always false due to limited range of data type
          net/xdp/xsk.c:665: warning: comparison is always false due to limited range of data type
      
      "unsigned long" is 32-bit on 32-bit systems, hence the offset is
      truncated, and can never be equal to any of the XDP_UMEM_PGOFF_*_RING
      values.  Use loff_t (and the required cast) to fix this.
      
      Fixes: 423f3832 ("xsk: add umem fill queue support and mmap")
      Fixes: fe230832 ("xsk: add umem completion queue support and mmap")
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
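      Both fixes side by side. The constants below are the uapi values
      the warnings point at; the mmap fragment is condensed.

          /* The uapi constants now carry a ULL suffix so they survive
           * on 32-bit: */
          #define XDP_UMEM_PGOFF_FILL_RING        0x100000000ULL
          #define XDP_UMEM_PGOFF_COMPLETION_RING  0x180000000ULL

          /* In xsk_mmap(): a 32-bit unsigned long would truncate the
           * offset, so widen it to loff_t before comparing. */
          loff_t offset = (loff_t)vma->vm_pgoff << PAGE_SHIFT;

          if (offset == XDP_UMEM_PGOFF_FILL_RING)
                  q = READ_ONCE(umem->fq);
          else if (offset == XDP_UMEM_PGOFF_COMPLETION_RING)
                  q = READ_ONCE(umem->cq);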
  18. 05 Jun 2018 (2 commits)
  19. 04 Jun 2018 (3 commits)
    • xsk: new descriptor addressing scheme · bbff2f32
      Committed by Björn Töpel
      Currently, AF_XDP only supports a fixed frame-size memory scheme where
      each frame is referenced via an index (idx). A user passes the frame
      index to the kernel, and the kernel acts upon the data.  Some NICs,
      however, do not have a fixed frame-size model, instead they have a
      model where a memory window is passed to the hardware and multiple
      frames are filled into that window (referred to as the "type-writer"
      model).
      
      By changing the descriptor format from the current frame index
      addressing scheme, AF_XDP can in the future be extended to support
      these kinds of NICs.
      
      In the index-based model, an idx refers to a frame of size
      frame_size. Addressing a frame in the UMEM is done by offsetting
      the UMEM starting address by a global offset, idx * frame_size +
      offset. Communicating via the fill- and completion-rings is done
      by means of idx.
      
      In this commit, the idx is removed in favor of an address (addr),
      which is a relative address ranging over the UMEM. Converting an
      idx-based address to the new addr is simply: addr = idx *
      frame_size + offset.
      
      We also stop referring to the UMEM "frame" as a frame. Instead it is
      simply called a chunk.
      
      To transfer ownership of a chunk to the kernel, the addr of the
      chunk is passed in the fill-ring. Note that the kernel will mask
      addr to make it chunk-aligned, so there is no need for userspace
      to do that. E.g., for a chunk size of 2k, passing an addr of 2048,
      2050 or 3000 to the fill-ring will refer to the same chunk (see
      the worked example below).
      
      On the completion-ring, the addr will match that of the Tx
      descriptor passed to the kernel.
      
      Changing the descriptor format to use chunks/addr will allow
      future changes to move to a type-writer based model, where
      multiple frames can reside in one chunk. In this model, passing
      one single chunk into the fill-ring could potentially result in
      multiple Rx descriptors.
      
      This commit changes the uapi of AF_XDP sockets, and updates the
      documentation.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
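      The conversion and the masking as a worked example, assuming the
      2k chunk size used in the text (a power of two, so a mask works):

          /* Old scheme -> new scheme: */
          u64 addr = idx * frame_size + offset;

          /* The kernel chunk-aligns fill-ring addresses, so with a 2k
           * chunk size, 2048, 2050 and 3000 all name the same chunk: */
          #define CHUNK_SIZE 2048ULL
          u64 chunk = addr & ~(CHUNK_SIZE - 1);
          /* 2048 -> 2048, 2050 -> 2048, 3000 -> 2048 */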
    • xsk: proper Rx drop statistics update · a509a955
      Committed by Björn Töpel
      Previously, rx_dropped could be updated incorrectly, e.g. if the XDP
      program redirected the frame to a socket bound to a different queue
      than where the XDP program was executing.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • xsk: proper fill queue descriptor validation · 4e64c835
      Committed by Björn Töpel
      Previously, the fill queue descriptor was not copied to kernel
      space prior to validating it, making it possible for userland to
      change the descriptor after the kernel had validated it. The
      general shape of this kind of TOCTOU fix is sketched below.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
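      A sketch, with illustrative names rather than the patch's own:

          /* Read the descriptor out of the shared ring exactly once ... */
          u64 desc = READ_ONCE(ring->desc[idx & q->ring_mask]);

          /* ... then validate and use only that private copy. Userland
           * can keep scribbling on ring->desc[]; it no longer matters. */
          if (!xskq_is_valid_desc(q, desc))
                  return -EINVAL;
          use_descriptor(desc);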
  20. 22 May 2018 (5 commits)
  21. 18 May 2018 (2 commits)
  22. 04 May 2018 (4 commits)