- 24 October 2019, 1 commit
-
-
By Magnus Karlsson
Having Rx-only AF_XDP sockets can potentially lead to a crash in the system by a NULL pointer dereference in xsk_umem_consume_tx(). This function iterates through a list of all sockets tied to a umem and checks if there are any packets to send on the Tx ring. Rx-only sockets do not have a Tx ring, so this will cause a NULL pointer dereference. This will happen if you have registered one or more Rx-only sockets to a umem and the driver is checking the Tx ring even on Rx, or if the XDP_SHARED_UMEM mode is used and there is a mix of Rx-only and other sockets tied to the same umem. Fixed by only putting sockets with a Tx component on the list that xsk_umem_consume_tx() iterates over.

Fixes: ac98d8aa ("xsk: wire upp Tx zero-copy functions")
Reported-by: Kal Cutter Conley <kal.conley@dectris.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Link: https://lore.kernel.org/bpf/1571645818-16244-1-git-send-email-magnus.karlsson@intel.com
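A minimal sketch of the fix idea, assuming the field names of this era (xs->tx, umem->xsk_list): an Rx-only socket is simply never added to the list that xsk_umem_consume_tx() walks.

```c
/* Sketch, not the verbatim patch: skip Rx-only sockets so the Tx
 * consumer can never dereference a NULL xs->tx ring. */
static void xdp_add_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs)
{
	unsigned long flags;

	if (!xs->tx)	/* Rx-only socket: nothing to transmit */
		return;

	spin_lock_irqsave(&umem->xsk_list_lock, flags);
	list_add_rcu(&xs->list, &umem->xsk_list);
	spin_unlock_irqrestore(&umem->xsk_list_lock, flags);
}
```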
-
- 25 September 2019, 1 commit
-
-
By John Hubbard
For pages that were retained via get_user_pages*(), release those pages via the new put_user_page*() routines, instead of via put_page() or release_pages(). This is part of a tree-wide conversion, as described in fc1d8e7c ("mm: introduce put_user_page*(), placeholder versions").

Link: http://lkml.kernel.org/r/20190724044537.10458-4-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
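As a rough illustration, the conversion in the umem unpin path looks like this (a sketch, assuming the xdp_umem field names of this era):

```c
/* Pages pinned with get_user_pages*() are now released with the
 * matching put_user_page*() helper instead of bare put_page(). */
static void xdp_umem_unpin_pages(struct xdp_umem *umem)
{
	/* before:
	 *   for (i = 0; i < umem->npgs; i++)
	 *           put_page(umem->pgs[i]);
	 */
	put_user_pages_dirty_lock(umem->pgs, umem->npgs, true);

	kfree(umem->pgs);
	umem->pgs = NULL;
}
```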
-
- 19 September 2019, 1 commit
-
-
By Björn Töpel
This patch removes the 64B alignment of the UMEM headroom. There is really no reason for it, and having a headroom less than 64B should be valid.

Fixes: c0c77d8f ("xsk: add user memory registration support sockopt")
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 31 August 2019, 1 commit
-
-
By Kevin Laatz
Currently, addresses are chunk size aligned. This means we are very restricted in terms of where we can place a chunk within the umem. For example, if we have a chunk size of 2k, then our chunks can only be placed at 0, 2k, 4k, 6k, 8k... and so on (i.e. every 2k starting from 0).

This patch introduces the ability to use unaligned chunks. With these changes, we are no longer bound to placing chunks at a 2k (or whatever your chunk size is) interval. Since we are no longer dealing with aligned chunks, they can now cross page boundaries. Checks for page contiguity have been added in order to keep track of which pages are followed by a physically contiguous page.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
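A hedged user-space sketch of opting into unaligned chunks through the flags field this series adds to struct xdp_umem_reg (buffer and chunk sizes are example values):

```c
#include <linux/if_xdp.h>
#include <sys/socket.h>

/* Register a umem whose chunks may start at any address within the
 * buffer rather than on chunk_size boundaries. Error handling trimmed. */
int register_unaligned_umem(int xsk_fd, void *buf, unsigned long long size)
{
	struct xdp_umem_reg mr = {
		.addr = (unsigned long long)buf,
		.len = size,
		.chunk_size = 2048,
		.headroom = 0,
		.flags = XDP_UMEM_UNALIGNED_CHUNK_FLAG,
	};

	return setsockopt(xsk_fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));
}
```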
-
- 21 August 2019, 1 commit
-
-
By Ivan Khoronzhuk
On 64-bit there is no reason to use vmap()/vunmap(), so use page_address() as the code did initially. On 32-bit, in some applications, such as the xdpsock_user.c sample, the number of pages in use can be quite large, and the kmap address space can run out. Beyond that, kmap appears deprecated for such cases anyway, since it can block and is better suited to short-lived dynamic mappings.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
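A minimal sketch of the mapping choice described above (illustrative only; the real code maps the umem pages up front rather than on demand):

```c
#include <linux/mm.h>
#include <linux/vmalloc.h>

/* 64-bit kernels keep all RAM in the direct map, so page_address() is
 * enough; 32-bit kernels may hit highmem pages, so vmap() is used
 * instead of kmap(), which can block and may run out of slots. */
static void *umem_page_addr(struct page *page)
{
#ifdef CONFIG_64BIT
	return page_address(page);
#else
	return vmap(&page, 1, VM_MAP, PAGE_KERNEL);
#endif
}
```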
-
- 20 August 2019, 1 commit
-
-
By Ivan Khoronzhuk
Fix a memory leak caused by a missed unpin routine for umem pages.

Fixes: 8aef7340 ("xsk: introduce xdp_umem_page")
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 18 August 2019, 2 commits
-
-
By Magnus Karlsson
This commit adds support for a new flag called need_wakeup in the AF_XDP Tx and fill rings. When this flag is set, it means that the application has to explicitly wake up the kernel Rx (for the bit in the fill ring) or kernel Tx (for the bit in the Tx ring) processing by issuing a syscall. Poll() can wake up both, depending on the flags submitted, and sendto() will wake up Tx processing only.

The main reason for introducing this new flag is to be able to efficiently support the case when application and driver are executing on the same core. Previously, the driver was just busy-spinning on the fill ring if it ran out of buffers in the HW and there were none on the fill ring. This approach works when the application is running on another core, as it can replenish the fill ring while the driver is busy-spinning. However, it is a lousy approach if both of them are running on the same core, as the probability of the fill ring getting more entries while the driver is busy-spinning is zero. With this new feature the driver now sets the need_wakeup flag and returns to the application. The application can then replenish the fill queue and explicitly wake up the Rx processing in the kernel using the syscall poll(). For Tx, the flag is only set to one if the driver has no outstanding Tx completion interrupts. If it has some, the flag is zero as it will be woken up by a completion interrupt anyway.

As a nice side effect, this new flag also improves the performance of the case where application and driver are running on two different cores, as it reduces the number of syscalls to the kernel. The kernel tells user space if it needs to be woken up by a syscall, and this eliminates many of the syscalls.

This flag needs some simple driver support. If the driver does not support it, the Rx flag is always zero and the Tx flag is always one. This makes any application relying on this feature default to the old behaviour of not requiring any syscalls in the Rx path and always having to call sendto() in the Tx path.

For backwards compatibility reasons, this feature has to be explicitly turned on using a new bind flag (XDP_USE_NEED_WAKEUP). I recommend that you always turn it on, as it has so far always had a positive performance impact. The name and inspiration of the flag has been taken from io_uring by Jens Axboe. Details about this feature in io_uring can be found in http://kernel.dk/io_uring.pdf, section 8.3.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
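On the user-space side this becomes a cheap flag check before each syscall. A sketch using the libbpf AF_XDP helpers of that era (assuming a socket created with the XDP_USE_NEED_WAKEUP bind flag):

```c
#include <sys/socket.h>
#include <bpf/xsk.h>

/* Kick Tx only when the kernel has asked to be woken up; otherwise the
 * driver is already processing and the syscall would be wasted. */
static void kick_tx_if_needed(int xsk_fd, struct xsk_ring_prod *tx)
{
	if (xsk_ring_prod__needs_wakeup(tx))
		sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
	/* For the fill ring, a poll() on xsk_fd wakes Rx processing. */
}
```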
-
By Magnus Karlsson
This commit replaces ndo_xsk_async_xmit with ndo_xsk_wakeup. The new ndo provides the same functionality as before, but with the addition of a new flags field that is used to specify whether Rx, Tx or both should be woken up. The previous ndo only woke up Tx, as implied by the name. The i40e and ixgbe drivers (which are all the supported ones) are updated to this new interface.

This new ndo will be used by the new need_wakeup functionality of XDP sockets, which needs to be able to wake up both Rx and Tx driver processing.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
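The shape of the new hook, as a hedged driver-side sketch (the actions inside the branches are placeholders, not a real driver's logic):

```c
#include <linux/netdevice.h>
#include <net/xdp_sock.h>

/* ndo_xsk_wakeup: flags select which direction(s) to wake. */
static int mydrv_xsk_wakeup(struct net_device *dev, u32 queue_id, u32 flags)
{
	if (flags & XDP_WAKEUP_RX) {
		/* e.g. schedule NAPI so Rx can refill from the fill ring */
	}
	if (flags & XDP_WAKEUP_TX) {
		/* e.g. trigger Tx processing for this AF_XDP queue */
	}
	return 0;
}
```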
-
- 10 August 2019, 1 commit
-
-
By Ivan Khoronzhuk
Use kmap instead of page_address, as the page is not always in low memory.

Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 12 July 2019, 1 commit
-
-
By Ilya Maximets
There are two call chains:

  a) xsk_bind --> xdp_umem_assign_dev
  b) unregister_netdevice_queue --> xsk_notifier

with the following locking order:

  a) xs->mutex --> rtnl_lock
  b) rtnl_lock --> xdp.lock --> xs->mutex

Different orders of taking 'xs->mutex' and 'rtnl_lock' could produce a deadlock here. Fix that by moving the 'rtnl_lock' before 'xs->mutex' in the bind call chain (a).

Reported-by: syzbot+bf64ec93de836d7f4c2c@syzkaller.appspotmail.com
Fixes: 455302d1 ("xdp: fix hang while unregistering device bound to xdp socket")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
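Conceptually, the fix makes chain (a) take the locks in the same order as chain (b). A sketch with the umem-assignment work elided:

```c
/* Bind path after the fix: rtnl_lock is taken first, matching the
 * notifier path's rtnl_lock --> xdp.lock --> xs->mutex ordering. */
static int xsk_bind_locked(struct xdp_sock *xs)
{
	int err;

	rtnl_lock();
	mutex_lock(&xs->mutex);
	err = 0;	/* ... xdp_umem_assign_dev() work happens here ... */
	mutex_unlock(&xs->mutex);
	rtnl_unlock();
	return err;
}
```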
-
- 03 July 2019, 2 commits
-
-
By Ilya Maximets
A device bound to an XDP socket does not reach a zero refcount until the userspace application closes it. This leads to a hang inside 'netdev_wait_allrefs()' if device unregistering is requested:

  # ip link del p1
  < hang on recvmsg on netlink socket >

  # ps -x | grep ip
  5126 pts/0 D+ 0:00 ip link del p1

  # journalctl -b
  Jun 05 07:19:16 kernel:
  unregister_netdevice: waiting for p1 to become free. Usage count = 1
  Jun 05 07:19:27 kernel:
  unregister_netdevice: waiting for p1 to become free. Usage count = 1
  ...

Fix that by implementing a NETDEV_UNREGISTER event notification handler to properly clean up all the resources and unref the device. This should also allow socket killing via the ss(8) utility.

Fixes: 965a9909 ("xsk: add support for bind for Rx")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
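A hedged sketch of the notifier shape (xsk_unbind_all_from_dev() is a hypothetical helper standing in for the real per-socket cleanup):

```c
#include <linux/netdevice.h>
#include <linux/notifier.h>

static int xsk_netdev_event(struct notifier_block *nb,
			    unsigned long event, void *ptr)
{
	struct net_device *dev = netdev_notifier_info_to_dev(ptr);

	if (event == NETDEV_UNREGISTER) {
		/* detach sockets bound to 'dev', free rings, dev_put() */
		xsk_unbind_all_from_dev(dev);	/* hypothetical helper */
	}
	return NOTIFY_DONE;
}

static struct notifier_block xsk_netdev_notifier = {
	.notifier_call = xsk_netdev_event,
};
/* registered once with register_netdevice_notifier(&xsk_netdev_notifier) */
```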
-
By Ilya Maximets
The device pointer is stored in the umem regardless of zero-copy mode, so we need to hold the device in all cases.

Fixes: c9b47cc1 ("xsk: fix bug when trying to use both copy and zero-copy on one queue id")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 12 June 2019, 1 commit
-
-
By Ilya Maximets
We should not call 'ndo_bpf()' or 'dev_put()' with a NULL argument.

Fixes: c9b47cc1 ("xsk: fix bug when trying to use both copy and zero-copy on one queue id")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 15 May 2019, 1 commit
-
-
By Ira Weiny
Patch series "Add FOLL_LONGTERM to GUP fast and use it".

HFI1, qib, and mthca use get_user_pages_fast() due to its performance advantages. These pages can be held for a significant time. But get_user_pages_fast() does not protect against mapping FS DAX pages.

Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast(), which retains the performance while also adding the FS DAX checks. XDP has also shown interest in using this functionality.[1]

In addition, we change get_user_pages() to use the new FOLL_LONGTERM flag and remove the specialized get_user_pages_longterm call.

[1] https://lkml.org/lkml/2019/3/19/939

"longterm" is a relative thing and at this point is probably a misnomer. This is really flagging a pin which is going to be given to hardware and can't move. I've thought of a couple of alternative names, but I think we have to settle on whether we are going to use FL_LAYOUT or something else to solve the "longterm" problem. Then I think we can change the flag to a better name.

Secondly, it depends on how often you are registering memory. I have spoken with some RDMA users who consider MR in the performance path... for the overall application performance. I don't have the numbers, as the tests for HFI1 were done a long time ago. But there was a significant advantage. Some of which is probably due to the fact that you don't have to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use *_fast. There are patches submitted to the RDMA list which would allow the use of *_fast (they rework the use of mmap_sem), and as soon as they are accepted I'll submit a patch to convert the RDMA core as well. Also to this point, others are looking to use *_fast.

As an aside, Jason pointed out in my previous submission that *_fast and *_unlocked look very much the same. I agree, and I think further cleanup will be coming. But I'm focused on getting the final solution for DAX at the moment.

This patch (of 7):

This patch starts a series which aims to support FOLL_LONGTERM in get_user_pages_fast(). Some callers would like to do a longterm (user controlled) pin of pages with the fast variant of GUP for performance purposes. Rather than have a separate get_user_pages_longterm() call, introduce FOLL_LONGTERM and change the longterm callers to use it.

This patch does not change any functionality. In the short term, "longterm" or user controlled pins are unsafe for filesystems, and FS DAX in particular has been blocked. However, callers of get_user_pages_fast() were not "protected".

FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it requires vmas to determine if DAX is in use.

NOTE: In merging with the CMA changes we opt to change the get_user_pages() call in check_and_migrate_cma_pages() to a call of __get_user_pages_locked() on the newly migrated pages. This makes the code read better in that we are calling __get_user_pages_locked() on the pages before and after a potential migration.

As a side effect, some of the interfaces are cleaned up, but this is not the primary purpose of the series.

In review[1] it was asked:

<quote>
> This I don't get - if you do lock down long term mappings performance
> of the actual get_user_pages call shouldn't matter to start with.
>
> What do I miss?

A couple of points. First, "longterm" is a relative thing and at this point is probably a misnomer. This is really flagging a pin which is going to be given to hardware and can't move. I've thought of a couple of alternative names, but I think we have to settle on whether we are going to use FL_LAYOUT or something else to solve the "longterm" problem. Then I think we can change the flag to a better name.

Second, it depends on how often you are registering memory. I have spoken with some RDMA users who consider MR in the performance path... for the overall application performance. I don't have the numbers, as the tests for HFI1 were done a long time ago. But there was a significant advantage. Some of which is probably due to the fact that you don't have to hold mmap_sem.

Finally, architecturally I think it would be good for everyone to use *_fast. There are patches submitted to the RDMA list which would allow the use of *_fast (they rework the use of mmap_sem), and as soon as they are accepted I'll submit a patch to convert the RDMA core as well. Also to this point, others are looking to use *_fast.

As an aside, Jason pointed out in my previous submission that *_fast and *_unlocked look very much the same. I agree, and I think further cleanup will be coming. But I'm focused on getting the final solution for DAX at the moment.
</quote>

[1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965

[ira.weiny@intel.com: v3]
Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
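A sketch of what a longterm pin looks like after this series (the commented-out line shows the dedicated call it replaces; the wrapper name is illustrative):

```c
#include <linux/mm.h>

static long pin_user_buffer(unsigned long addr, int npgs, struct page **pgs)
{
	/* before: get_user_pages_longterm(addr, npgs, FOLL_WRITE, pgs, NULL); */
	return get_user_pages_fast(addr, npgs,
				   FOLL_WRITE | FOLL_LONGTERM, pgs);
}
```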
-
- 16 March 2019, 1 commit
-
-
By Björn Töpel
When the umem is cleaned up, the task that created it might already be gone. If the task was gone, the xdp_umem_release function did not free the pages member of struct xdp_umem. It turned out that the task lookup was not needed at all; the code was a left-over from when we moved from task accounting to user accounting [1]. This patch fixes the memory leak by removing the task lookup logic completely.

[1] https://lore.kernel.org/netdev/20180131135356.19134-3-bjorn.topel@gmail.com/

Link: https://lore.kernel.org/netdev/c1cb2ca8-6a14-3980-8672-f3de0bb38dfd@suse.cz/
Fixes: c0c77d8f ("xsk: add user memory registration support sockopt")
Reported-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 13 February 2019, 1 commit
-
-
By Björn Töpel
Commit c9b47cc1 ("xsk: fix bug when trying to use both copy and zero-copy on one queue id") stores the umem into the netdev._rx struct. However, the patch incorrectly removed the umem from the netdev._rx struct when user-space passed "best-effort" mode (i.e. select the fastest possible option available), and zero-copy mode was not available. This commit fixes that.

Fixes: c9b47cc1 ("xsk: fix bug when trying to use both copy and zero-copy on one queue id")
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 12 February 2019, 1 commit
-
-
By Davidlohr Bueso
Holding mmap_sem exclusively for a gup() is overkill. Let's share the lock and replace the gup call with gup_longterm(), as it is better suited for the lifetime of the pinning.

Fixes: c0c77d8f ("xsk: add user memory registration support sockopt")
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Bjorn Topel <bjorn.topel@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Cc: netdev@vger.kernel.org
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
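A sketch of the resulting pinning path, assuming the xdp_umem field names of this era (address, npgs, pgs); note this predates the FOLL_LONGTERM flag introduced later in this log:

```c
#include <linux/mm.h>
#include <linux/sched/mm.h>

static long xdp_umem_pin(struct xdp_umem *umem, unsigned int gup_flags)
{
	long npgs;

	down_read(&current->mm->mmap_sem);	/* shared, was down_write() */
	npgs = get_user_pages_longterm(umem->address, umem->npgs,
				       gup_flags, umem->pgs, NULL);
	up_read(&current->mm->mmap_sem);
	return npgs;
}
```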
-
- 25 January 2019, 1 commit
-
-
By Björn Töpel
This commit adds an id to the umem structure. The id uniquely identifies a umem instance, and will be exposed to user-space via the socket monitoring interface.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 22 January 2019, 1 commit
-
-
By Jan Sokolowski
Export xdp_get_umem_from_qid for other modules to use.

Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
-
- 16 January 2019, 1 commit
-
-
By Krzysztof Kazimierczak
In the xdp_umem_assign_dev() path, the xsk code does not check whether the queue for which a umem is to be created exists. This leads to a situation where a umem is not assigned to any Tx/Rx queue of a netdevice, without notifying the stack about an error. This affects both XDP_SKB and XDP_DRV modes; in the XDP_DRV_ZC case, the queue index is checked by the driver.

This patch fixes the xsk code so that in both XDP_SKB and XDP_DRV mode of AF_XDP, an error is returned when the requested queue index exceeds an existing maximum.

Fixes: c9b47cc1 ("xsk: fix bug when trying to use both copy and zero-copy on one queue id")
Reported-by: Jakub Spizewski <jakub.spizewski@intel.com>
Signed-off-by: Krzysztof Kazimierczak <krzysztof.kazimierczak@intel.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
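The added validation boils down to a bounds check against the netdevice's real queue counts; a minimal sketch:

```c
#include <linux/netdevice.h>

/* Reject queue ids beyond what the device actually exposes before
 * assigning a umem to the queue. */
static int xsk_check_queue_id(struct net_device *dev, u16 queue_id)
{
	if (queue_id >= max_t(unsigned int,
			      dev->real_num_rx_queues,
			      dev->real_num_tx_queues))
		return -EINVAL;
	return 0;
}
```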
-
- 08 October 2018, 1 commit
-
-
By Björn Töpel
The AF_XDP socket struct can exist in three different, implicit states: setup, bound and released. Setup is prior to the socket being bound to a device. Bound is when the socket is active for receive and send. Released is when the process/userspace side of the socket is released, but the sock object is still lingering, e.g. when there is a reference to the socket in an XSKMAP after process termination.

The Rx fast-path code uses the "dev" member of struct xdp_sock to check whether a socket is bound or released, and the Tx code uses the struct xdp_umem "xsk_list" member in conjunction with "dev" to determine the state of a socket. However, the transition from bound to released did not tear the socket down in the correct order. On the Rx side, "dev" was cleared after synchronize_net(), making the synchronization useless. On the Tx side, the internal queues were destroyed prior to removing them from the "xsk_list".

This commit corrects the cleanup order, and by doing so xdp_del_sk_umem() can be simplified and one synchronize_net() can be removed.

Fixes: 965a9909 ("xsk: add support for bind for Rx")
Fixes: ac98d8aa ("xsk: wire upp Tx zero-copy functions")
Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 05 October 2018, 3 commits
-
-
By Magnus Karlsson
As we now do not allow ethtool to deactivate the queue id we are running an AF_XDP socket on, we can simplify the implementation of xdp_clear_umem_at_qid().

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
By Jakub Kicinski
We already check that the RSS indirection table does not use queues which would be disabled by channel reconfiguration. Make sure the user does not try to disable queues which have a UMEM and zero-copy AF_XDP socket installed.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
By Magnus Karlsson
Previously, the xsk code did not record which umem was bound to a specific queue id. This was not required if all drivers were zero-copy enabled, as this had to be recorded in the driver anyway. So if a user tried to bind two umems to the same queue, the driver would say no. But if copy-mode was first enabled and then zero-copy mode (or the reverse order), we mistakenly enabled both of them on the same umem, leading to buggy behavior.

The main culprit for this is that we did not store the association of umem to queue id in the copy case and only relied on the driver reporting this. As this relation was not stored in the driver for copy mode (it does not rely on the AF_XDP NDOs), this obviously could not work.

This patch fixes the problem by always recording the umem to queue id relationship in the netdev_queue and netdev_rx_queue structs. This way we always know what kind of umem has been bound to a queue id and can act appropriately at bind time.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
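A hedged sketch of the bookkeeping, assuming the umem pointer this series adds to netdev_rx_queue and netdev_queue:

```c
#include <linux/netdevice.h>

/* Remember which umem owns a queue id, for copy and zero-copy alike,
 * so a second bind to the same queue can be rejected at bind time. */
static void xsk_record_umem(struct net_device *dev, struct xdp_umem *umem,
			    u16 queue_id)
{
	if (queue_id < dev->real_num_rx_queues)
		dev->_rx[queue_id].umem = umem;
	if (queue_id < dev->real_num_tx_queues)
		netdev_get_tx_queue(dev, queue_id)->umem = umem;
}
```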
-
- 26 September 2018, 1 commit
-
-
By Jakub Kicinski
XSK UMEM is strongly single producer single consumer, so reuse of frames is challenging. Add a simple "stash" of FILL packets to reuse for drivers to optionally make use of. This is useful when a driver has to free (ndo_stop) or resize a ring with an active AF_XDP ZC socket.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
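A hedged driver-side sketch of using the stash around a ring teardown or resize, following the xsk_reuseq_prepare()/xsk_reuseq_swap()/xsk_reuseq_free() API this commit introduces:

```c
#include <net/xdp_sock.h>

/* Install a reuse queue sized to the Rx ring; fill addresses parked in
 * it survive an ndo_stop or ring resize and are handed back later. */
static int mydrv_setup_reuseq(struct xdp_umem *umem, u32 ring_size)
{
	struct xdp_umem_fq_reuse *reuseq;

	reuseq = xsk_reuseq_prepare(ring_size);
	if (!reuseq)
		return -ENOMEM;

	/* swap in the new stash, free whatever was there before */
	xsk_reuseq_free(xsk_reuseq_swap(umem, reuseq));
	return 0;
}
```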
-
- 01 September 2018, 2 commits
-
-
By Magnus Karlsson
This commit gets rid of the structure xdp_umem_props. It was there to be able to break a dependency at one point, but this is no longer needed. The values in the struct are instead stored directly in the xdp_umem structure. This simplifies the xsk code as well as the af_xdp zero-copy drivers, and as a bonus gets rid of one internal header file. The i40e driver is also adapted to the new interface in this commit.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
By Prashant Bhole
Since xdp_umem_query() was added, one assignment of bpf.command was missed in the cleanup. Remove the assignment statement.

Fixes: 84c6b868 ("xsk: don't allow umem replace at stack level")
Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 22 August 2018, 1 commit
-
-
By Prashant Bhole
s/ENOTSUPP/EOPNOTSUPP/ in function umem_assign_dev(). This function's return value is directly returned by xsk_bind(), and EOPNOTSUPP is one of bind()'s possible return values.

Fixes: f734607e ("xsk: refactor xdp_umem_assign_dev()")
Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 01 August 2018, 2 commits
-
-
By Jakub Kicinski
Currently drivers have to check if they already have a umem installed for a given queue and return an error if so. Make better use of XDP_QUERY_XSK_UMEM and move this functionality to the core. We need to keep rtnl across the calls now.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Björn Töpel <bjorn.topel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Jakub Kicinski
Return early and only take the ref on dev once there is no possibility of failing.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Acked-by: Björn Töpel <bjorn.topel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 12 June 2018, 1 commit
-
-
By Björn Töpel
syzkaller reported a warning from xdp_umem_pin_pages():

  WARNING: CPU: 1 PID: 4537 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 mm/slab_common.c:996
  ...
  __do_kmalloc mm/slab.c:3713 [inline]
  __kmalloc+0x25/0x760 mm/slab.c:3727
  kmalloc_array include/linux/slab.h:634 [inline]
  kcalloc include/linux/slab.h:645 [inline]
  xdp_umem_pin_pages net/xdp/xdp_umem.c:205 [inline]
  xdp_umem_reg net/xdp/xdp_umem.c:318 [inline]
  xdp_umem_create+0x5c9/0x10f0 net/xdp/xdp_umem.c:349
  xsk_setsockopt+0x443/0x550 net/xdp/xsk.c:531
  __sys_setsockopt+0x1bd/0x390 net/socket.c:1935
  __do_sys_setsockopt net/socket.c:1946 [inline]
  __se_sys_setsockopt net/socket.c:1943 [inline]
  __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1943
  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

This is a warning about attempting to allocate more than KMALLOC_MAX_SIZE memory. The request originates from userspace, and if the request is too big, the kernel is free to deny its allocation. In this patch, the failed allocation attempt is silenced with __GFP_NOWARN.

Fixes: c0c77d8f ("xsk: add user memory registration support sockopt")
Reported-by: syzbot+4abadc5d69117b346506@syzkaller.appspotmail.com
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
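The fix itself is a single flag on the user-sized allocation; a minimal sketch:

```c
#include <linux/slab.h>

/* npgs comes straight from userspace; an oversized request should just
 * fail, not splat a kmalloc warning into the log. */
static struct page **alloc_page_array(u32 npgs)
{
	return kcalloc(npgs, sizeof(struct page *),
		       GFP_KERNEL | __GFP_NOWARN);
}
```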
-
- 08 June 2018, 1 commit
-
-
By Daniel Borkmann
syzkaller was able to trigger the following panic for AF_XDP:

  BUG: KASAN: null-ptr-deref in atomic64_sub include/asm-generic/atomic-instrumented.h:144 [inline]
  BUG: KASAN: null-ptr-deref in atomic_long_sub include/asm-generic/atomic-long.h:199 [inline]
  BUG: KASAN: null-ptr-deref in xdp_umem_unaccount_pages.isra.4+0x3d/0x80 net/xdp/xdp_umem.c:135
  Write of size 8 at addr 0000000000000060 by task syz-executor246/4527

  CPU: 1 PID: 4527 Comm: syz-executor246 Not tainted 4.17.0+ #89
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x1b9/0x294 lib/dump_stack.c:113
   kasan_report_error mm/kasan/report.c:352 [inline]
   kasan_report.cold.7+0x6d/0x2fe mm/kasan/report.c:412
   check_memory_region_inline mm/kasan/kasan.c:260 [inline]
   check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
   kasan_check_write+0x14/0x20 mm/kasan/kasan.c:278
   atomic64_sub include/asm-generic/atomic-instrumented.h:144 [inline]
   atomic_long_sub include/asm-generic/atomic-long.h:199 [inline]
   xdp_umem_unaccount_pages.isra.4+0x3d/0x80 net/xdp/xdp_umem.c:135
   xdp_umem_reg net/xdp/xdp_umem.c:334 [inline]
   xdp_umem_create+0xd6c/0x10f0 net/xdp/xdp_umem.c:349
   xsk_setsockopt+0x443/0x550 net/xdp/xsk.c:531
   __sys_setsockopt+0x1bd/0x390 net/socket.c:1935
   __do_sys_setsockopt net/socket.c:1946 [inline]
   __se_sys_setsockopt net/socket.c:1943 [inline]
   __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1943
   do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

In xdp_umem_reg(), the call to xdp_umem_account_pages() passed with CAP_IPC_LOCK, where we didn't need to end up charging rlimit on memlock for the current user, and therefore umem->user continues to be NULL. Later on, through fault injection, syzkaller triggered a failure in either the umem->pgs or umem->pages allocation, such that we bail out and undo accounting in xdp_umem_unaccount_pages(), where we eventually hit the panic since it tries to deref the umem->user.

The code is pretty close to the mm_account_pinned_pages() and mm_unaccount_pinned_pages() pair and could potentially reuse it in a later cleanup. It appears that the initial commit c0c77d8f ("xsk: add user memory registration support sockopt") got this right, while the later follow-up introduced the bug via a49049ea ("xsk: simplified umem setup").

Fixes: a49049ea ("xsk: simplified umem setup")
Reported-by: syzbot+979217770b09ebf5c407@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
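A sketch of the guarded unaccounting described above, in the spirit of mm_unaccount_pinned_pages():

```c
#include <linux/sched/user.h>

/* Only undo memlock accounting when a user was actually charged;
 * with CAP_IPC_LOCK, umem->user stays NULL and must not be deref'd. */
static void xdp_umem_unaccount_pages(struct xdp_umem *umem)
{
	if (umem->user) {
		atomic_long_sub(umem->npgs, &umem->user->locked_vm);
		free_uid(umem->user);
	}
}
```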
-
- 05 June 2018, 4 commits
-
-
By Magnus Karlsson
Here we add the functionality required to support zero-copy Tx, and also expose various zero-copy related functions to the netdevs.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
By Björn Töpel
Extend xsk_rcv to support the new MEM_TYPE_ZERO_COPY memory, and wire up the ndo_bpf call in bind.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
By Björn Töpel
The xdp_umem_page holds the address for a page. Trade memory for faster lookup. Later, we'll add the DMA address here as well.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
By Björn Töpel
Move struct xdp_umem to xdp_sock.h, in order to prepare for zero-copy support.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 04 June 2018, 1 commit
-
-
By Björn Töpel
Currently, AF_XDP only supports a fixed frame-size memory scheme where each frame is referenced via an index (idx). A user passes the frame index to the kernel, and the kernel acts upon the data. Some NICs, however, do not have a fixed frame-size model; instead they have a model where a memory window is passed to the hardware and multiple frames are filled into that window (referred to as the "type-writer" model). By changing the descriptor format from the current frame index addressing scheme, AF_XDP can in the future be extended to support these kinds of NICs.

In the index-based model, an idx refers to a frame of size frame_size. Addressing a frame in the UMEM is done by offsetting the UMEM starting address by a global offset, idx * frame_size + offset. Communicating via the fill- and completion-rings is done by means of idx.

In this commit, the idx is removed in favor of an address (addr), which is a relative address ranging over the UMEM. To convert an idx-based address to the new addr is simply: addr = idx * frame_size + offset.

We also stop referring to the UMEM "frame" as a frame. Instead it is simply called a chunk. To transfer ownership of a chunk to the kernel, the addr of the chunk is passed in the fill-ring. Note that the kernel will mask addr to make it chunk aligned, so there is no need for userspace to do that. E.g., for a chunk size of 2k, passing an addr of 2048, 2050 or 3000 to the fill-ring will refer to the same chunk. On the completion-ring, the addr will match that of the Tx descriptor, passed to the kernel.

Changing the descriptor format to use chunks/addr will allow for future changes to move to a type-writer based model, where multiple frames can reside in one chunk. In this model, passing one single chunk into the fill-ring would potentially result in multiple Rx descriptors.

This commit changes the uapi of AF_XDP sockets, and updates the documentation.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
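The idx-to-addr conversion and the kernel's chunk masking, written out as a small user-space sketch (parameter values are examples; chunk sizes here are powers of two):

```c
#include <stdint.h>

/* Convert an old-style frame index to the new relative address. */
static uint64_t idx_to_addr(uint64_t idx, uint64_t frame_size,
			    uint64_t offset)
{
	return idx * frame_size + offset;
}

/* Mimic the kernel's masking: for a 2k chunk size, addresses 2048,
 * 2050 and 3000 all refer to the same chunk. */
static uint64_t chunk_align(uint64_t addr, uint64_t chunk_size)
{
	return addr & ~(chunk_size - 1);
}
```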
-
- 22 May 2018, 2 commits
-
-
By Björn Töpel
Introduce refcount_t, in favor of atomic_t.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
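For illustration, the swap looks like this (a sketch; the struct and function names are examples, and refcount_t traps over- and underflow that a bare atomic_t would let through):

```c
#include <linux/refcount.h>

struct umem_demo {
	refcount_t users;
};

static void umem_demo_init(struct umem_demo *umem)
{
	refcount_set(&umem->users, 1);	/* was atomic_set() */
}

static void umem_demo_get(struct umem_demo *umem)
{
	refcount_inc(&umem->users);	/* was atomic_inc() */
}

static bool umem_demo_put(struct umem_demo *umem)
{
	/* true when the last reference is dropped: free the umem */
	return refcount_dec_and_test(&umem->users);
}
```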
-
By Björn Töpel
As suggested by Daniel Borkmann, the umem setup code was too defensive and complex. Here, we reduce the number of checks. Also, the memory pinning is now folded into the umem creation, and we do correct locking.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
- 18 May 2018, 1 commit
-
-
By Björn Töpel
Removed some cases of unnecessary parentheses.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-