1. 08 11月, 2018 2 次提交
    • P
      udp: cope with UDP GRO packet misdirection · cf329aa4
      Paolo Abeni 提交于
      In some scenarios, the GRO engine can assemble an UDP GRO packet
      that ultimately lands on a non GRO-enabled socket.
      This patch tries to address the issue explicitly checking for the UDP
      socket features before enqueuing the packet, and eventually segmenting
      the unexpected GRO packet, as needed.
      
      We must also cope with re-insertion requests: after segmentation the
      UDP code calls the helper introduced by the previous patches, as needed.
      
      Segmentation is performed by a common helper, which takes care of
      updating socket and protocol stats is case of failure.
      
      rfc v3 -> v1
       - fix compile issues with rxrpc
       - when gso_segment returns NULL, treat is as an error
       - added 'ipv4' argument to udp_rcv_segment()
      
      rfc v2 -> rfc v3
       - moved udp_rcv_segment() into net/udp.h, account errors to socket
         and ns, always return NULL or segs list
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cf329aa4
    • M
      net: ensure unbound datagram socket to be chosen when not in a VRF · 6da5b0f0
      Mike Manning 提交于
      Ensure an unbound datagram skt is chosen when not in a VRF. The check
      for a device match in compute_score() for UDP must be performed when
      there is no device match. For this, a failure is returned when there is
      no device match. This ensures that bound sockets are never selected,
      even if there is no unbound socket.
      
      Allow IPv6 packets to be sent over a datagram skt bound to a VRF. These
      packets are currently blocked, as flowi6_oif was set to that of the
      master vrf device, and the ipi6_ifindex is that of the slave device.
      Allow these packets to be sent by checking the device with ipi6_ifindex
      has the same L3 scope as that of the bound device of the skt, which is
      the master vrf device. Note that this check always succeeds if the skt
      is unbound.
      
      Even though the right datagram skt is now selected by compute_score(),
      a different skt is being returned that is bound to the wrong vrf. The
      difference between these and stream sockets is the handling of the skt
      option for SO_REUSEPORT. While the handling when adding a skt for reuse
      correctly checks that the bound device of the skt is a match, the skts
      in the hashslot are already incorrect. So for the same hash, a skt for
      the wrong vrf may be selected for the required port. The root cause is
      that the skt is immediately placed into a slot when it is created,
      but when the skt is then bound using SO_BINDTODEVICE, it remains in the
      same slot. The solution is to move the skt to the correct slot by
      forcing a rehash.
      Signed-off-by: NMike Manning <mmanning@vyatta.att-mail.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Tested-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6da5b0f0
  2. 06 10月, 2018 1 次提交
  3. 29 6月, 2018 1 次提交
    • L
      Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL · a11e1d43
      Linus Torvalds 提交于
      The poll() changes were not well thought out, and completely
      unexplained.  They also caused a huge performance regression, because
      "->poll()" was no longer a trivial file operation that just called down
      to the underlying file operations, but instead did at least two indirect
      calls.
      
      Indirect calls are sadly slow now with the Spectre mitigation, but the
      performance problem could at least be largely mitigated by changing the
      "->get_poll_head()" operation to just have a per-file-descriptor pointer
      to the poll head instead.  That gets rid of one of the new indirections.
      
      But that doesn't fix the new complexity that is completely unwarranted
      for the regular case.  The (undocumented) reason for the poll() changes
      was some alleged AIO poll race fixing, but we don't make the common case
      slower and more complex for some uncommon special case, so this all
      really needs way more explanations and most likely a fundamental
      redesign.
      
      [ This revert is a revert of about 30 different commits, not reverted
        individually because that would just be unnecessarily messy  - Linus ]
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a11e1d43
  4. 26 6月, 2018 1 次提交
    • D
      net: Convert GRO SKB handling to list_head. · d4546c25
      David Miller 提交于
      Manage pending per-NAPI GRO packets via list_head.
      
      Return an SKB pointer from the GRO receive handlers.  When GRO receive
      handlers return non-NULL, it means that this SKB needs to be completed
      at this time and removed from the NAPI queue.
      
      Several operations are greatly simplified by this transformation,
      especially timing out the oldest SKB in the list when gro_count
      exceeds MAX_GRO_SKBS, and napi_gro_flush() which walks the queue
      in reverse order.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d4546c25
  5. 09 6月, 2018 1 次提交
    • P
      udp: fix rx queue len reported by diag and proc interface · 6c206b20
      Paolo Abeni 提交于
      After commit 6b229cf7 ("udp: add batching to udp_rmem_release()")
      the sk_rmem_alloc field does not measure exactly anymore the
      receive queue length, because we batch the rmem release. The issue
      is really apparent only after commit 0d4a6608 ("udp: do rmem bulk
      free even if the rx sk queue is empty"): the user space can easily
      check for an empty socket with not-0 queue length reported by the 'ss'
      tool or the procfs interface.
      
      We need to use a custom UDP helper to report the correct queue length,
      taking into account the forward allocation deficit.
      
      Reported-by: trevor.francis@46labs.com
      Fixes: 6b229cf7 ("UDP: add batching to udp_rmem_release()")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6c206b20
  6. 26 5月, 2018 1 次提交
  7. 16 5月, 2018 2 次提交
  8. 09 5月, 2018 2 次提交
  9. 27 4月, 2018 2 次提交
    • W
      udp: add gso segment cmsg · 2e8de857
      Willem de Bruijn 提交于
      Allow specifying segment size in the send call.
      
      The new control message performs the same function as socket option
      UDP_SEGMENT while avoiding the extra system call.
      
      [ Export udp_cmsg_send for ipv6. -DaveM ]
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2e8de857
    • W
      udp: add udp gso · ee80d1eb
      Willem de Bruijn 提交于
      Implement generic segmentation offload support for udp datagrams. A
      follow-up patch adds support to the protocol stack to generate such
      packets.
      
      UDP GSO is not UFO. UFO fragments a single large datagram. GSO splits
      a large payload into a number of discrete UDP datagrams.
      
      The implementation adds a GSO type SKB_UDP_GSO_L4 to differentiate it
      from UFO (SKB_UDP_GSO).
      
      IPPROTO_UDPLITE is excluded, as that protocol has no gso handler
      registered.
      
      [ Export __udp_gso_segment for ipv6. -DaveM ]
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee80d1eb
  10. 31 3月, 2018 1 次提交
    • A
      bpf: Hooks for sys_connect · d74bad4e
      Andrey Ignatov 提交于
      == The problem ==
      
      See description of the problem in the initial patch of this patch set.
      
      == The solution ==
      
      The patch provides much more reliable in-kernel solution for the 2nd
      part of the problem: making outgoing connecttion from desired IP.
      
      It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
      `BPF_CGROUP_INET6_CONNECT` for program type
      `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
      source and destination of a connection at connect(2) time.
      
      Local end of connection can be bound to desired IP using newly
      introduced BPF-helper `bpf_bind()`. It allows to bind to only IP though,
      and doesn't support binding to port, i.e. leverages
      `IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
      * looking for a free port is expensive and can affect performance
        significantly;
      * there is no use-case for port.
      
      As for remote end (`struct sockaddr *` passed by user), both parts of it
      can be overridden, remote IP and remote port. It's useful if an
      application inside cgroup wants to connect to another application inside
      same cgroup or to itself, but knows nothing about IP assigned to the
      cgroup.
      
      Support is added for IPv4 and IPv6, for TCP and UDP.
      
      IPv4 and IPv6 have separate attach types for same reason as sys_bind
      hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields
      when user passes sockaddr_in since it'd be out-of-bound.
      
      == Implementation notes ==
      
      The patch introduces new field in `struct proto`: `pre_connect` that is
      a pointer to a function with same signature as `connect` but is called
      before it. The reason is in some cases BPF hooks should be called way
      before control is passed to `sk->sk_prot->connect`. Specifically
      `inet_dgram_connect` autobinds socket before calling
      `sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
      hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
      it'd cause double-bind. On the other hand `proto.pre_connect` provides a
      flexible way to add BPF hooks for connect only for necessary `proto` and
      call them at desired time before `connect`. Since `bpf_bind()` is
      allowed to bind only to IP and autobind in `inet_dgram_connect` binds
      only port there is no chance of double-bind.
      
      bpf_bind() sets `force_bind_address_no_port` to bind to only IP despite
      of value of `bind_address_no_port` socket field.
      
      bpf_bind() sets `with_lock` to `false` when calling to __inet_bind()
      and __inet6_bind() since all call-sites, where bpf_bind() is called,
      already hold socket lock.
      Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      d74bad4e
  11. 28 11月, 2017 1 次提交
  12. 01 10月, 2017 1 次提交
  13. 26 8月, 2017 1 次提交
  14. 15 8月, 2017 1 次提交
  15. 12 8月, 2017 1 次提交
    • E
      udp: harden copy_linear_skb() · fd851ba9
      Eric Dumazet 提交于
      syzkaller got crashes with CONFIG_HARDENED_USERCOPY=y configs.
      
      Issue here is that recvfrom() can be used with user buffer of Z bytes,
      and SO_PEEK_OFF of X bytes, from a skb with Y bytes, and following
      condition :
      
      Z < X < Y
      
      kernel BUG at mm/usercopy.c:72!
      invalid opcode: 0000 [#1] SMP KASAN
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Modules linked in:
      CPU: 0 PID: 2917 Comm: syzkaller842281 Not tainted 4.13.0-rc3+ #16
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      task: ffff8801d2fa40c0 task.stack: ffff8801d1fe8000
      RIP: 0010:report_usercopy mm/usercopy.c:64 [inline]
      RIP: 0010:__check_object_size+0x3ad/0x500 mm/usercopy.c:264
      RSP: 0018:ffff8801d1fef8a8 EFLAGS: 00010286
      RAX: 0000000000000078 RBX: ffffffff847102c0 RCX: 0000000000000000
      RDX: 0000000000000078 RSI: 1ffff1003a3fded5 RDI: ffffed003a3fdf09
      RBP: ffff8801d1fef998 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801d1ea480e
      R13: fffffffffffffffa R14: ffffffff84710280 R15: dffffc0000000000
      FS:  0000000001360880(0000) GS:ffff8801dc000000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000202ecfe4 CR3: 00000001d1ff8000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       check_object_size include/linux/thread_info.h:108 [inline]
       check_copy_size include/linux/thread_info.h:139 [inline]
       copy_to_iter include/linux/uio.h:105 [inline]
       copy_linear_skb include/net/udp.h:371 [inline]
       udpv6_recvmsg+0x1040/0x1af0 net/ipv6/udp.c:395
       inet_recvmsg+0x14c/0x5f0 net/ipv4/af_inet.c:793
       sock_recvmsg_nosec net/socket.c:792 [inline]
       sock_recvmsg+0xc9/0x110 net/socket.c:799
       SYSC_recvfrom+0x2d6/0x570 net/socket.c:1788
       SyS_recvfrom+0x40/0x50 net/socket.c:1760
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      Fixes: b65ac446 ("udp: try to avoid 2 cache miss on dequeue")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fd851ba9
  16. 08 8月, 2017 2 次提交
  17. 30 7月, 2017 1 次提交
    • P
      udp6: fix socket leak on early demux · c9f2c1ae
      Paolo Abeni 提交于
      When an early demuxed packet reaches __udp6_lib_lookup_skb(), the
      sk reference is retrieved and used, but the relevant reference
      count is leaked and the socket destructor is never called.
      Beyond leaking the sk memory, if there are pending UDP packets
      in the receive queue, even the related accounted memory is leaked.
      
      In the long run, this will cause persistent forward allocation errors
      and no UDP skbs (both ipv4 and ipv6) will be able to reach the
      user-space.
      
      Fix this by explicitly accessing the early demux reference before
      the lookup, and properly decreasing the socket reference count
      after usage.
      
      Also drop the skb_steal_sock() in __udp6_lib_lookup_skb(), and
      the now obsoleted comment about "socket cache".
      
      The newly added code is derived from the current ipv4 code for the
      similar path.
      
      v1 -> v2:
        fixed the __udp6_lib_rcv() return code for resubmission,
        as suggested by Eric
      Reported-by: NSam Edwards <CFSworks@gmail.com>
      Reported-by: NMarc Haber <mh+netdev@zugschlus.de>
      Fixes: 5425077d ("net: ipv6: Add early demux handler for UDP unicast")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9f2c1ae
  18. 26 7月, 2017 1 次提交
    • P
      udp: preserve head state for IP_CMSG_PASSSEC · dce4551c
      Paolo Abeni 提交于
      Paul Moore reported a SELinux/IP_PASSSEC regression
      caused by missing skb->sp at recvmsg() time. We need to
      preserve the skb head state to process the IP_CMSG_PASSSEC
      cmsg.
      
      With this commit we avoid releasing the skb head state in the
      BH even if a secpath is attached to the current skb, and stores
      the skb status (with/without head states) in the scratch area,
      so that we can access it at skb deallocation time, without
      incurring in cache-miss penalties.
      
      This also avoids misusing the skb CB for ipv6 packets,
      as introduced by the commit 0ddf3fb2 ("udp: preserve
      skb->dst if required for IP options processing").
      
      Clean a bit the scratch area helpers implementation, to
      reduce the code differences between 32 and 64 bits build.
      Reported-by: NPaul Moore <paul@paul-moore.com>
      Fixes: 0a463c78 ("udp: avoid a cache miss on dequeue")
      Fixes: 0ddf3fb2 ("udp: preserve skb->dst if required for IP options processing")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Tested-by: NPaul Moore <paul@paul-moore.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dce4551c
  19. 28 6月, 2017 1 次提交
  20. 17 5月, 2017 1 次提交
    • P
      udp: use a separate rx queue for packet reception · 2276f58a
      Paolo Abeni 提交于
      under udp flood the sk_receive_queue spinlock is heavily contended.
      This patch try to reduce the contention on such lock adding a
      second receive queue to the udp sockets; recvmsg() looks first
      in such queue and, only if empty, tries to fetch the data from
      sk_receive_queue. The latter is spliced into the newly added
      queue every time the receive path has to acquire the
      sk_receive_queue lock.
      
      The accounting of forward allocated memory is still protected with
      the sk_receive_queue lock, so udp_rmem_release() needs to acquire
      both locks when the forward deficit is flushed.
      
      On specific scenarios we can end up acquiring and releasing the
      sk_receive_queue lock multiple times; that will be covered by
      the next patch
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2276f58a
  21. 25 3月, 2017 1 次提交
    • S
      net: Add sysctl to toggle early demux for tcp and udp · dddb64bc
      subashab@codeaurora.org 提交于
      Certain system process significant unconnected UDP workload.
      It would be preferrable to disable UDP early demux for those systems
      and enable it for TCP only.
      
      By disabling UDP demux, we see these slight gains on an ARM64 system-
      782 -> 788Mbps unconnected single stream UDPv4
      633 -> 654Mbps unconnected UDPv4 different sources
      
      The performance impact can change based on CPU architecure and cache
      sizes. There will not much difference seen if entire UDP hash table
      is in cache.
      
      Both sysctls are enabled by default to preserve existing behavior.
      
      v1->v2: Change function pointer instead of adding conditional as
      suggested by Stephen.
      
      v2->v3: Read once in callers to avoid issues due to compiler
      optimizations. Also update commit message with the tests.
      
      v3->v4: Store and use read once result instead of querying pointer
      again incorrectly.
      
      v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n}
      Signed-off-by: NSubash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dddb64bc
  22. 19 1月, 2017 1 次提交
  23. 16 11月, 2016 1 次提交
  24. 08 11月, 2016 1 次提交
    • P
      udp: do fwd memory scheduling on dequeue · 7c13f97f
      Paolo Abeni 提交于
      A new argument is added to __skb_recv_datagram to provide
      an explicit skb destructor, invoked under the receive queue
      lock.
      The UDP protocol uses such argument to perform memory
      reclaiming on dequeue, so that the UDP protocol does not
      set anymore skb->desctructor.
      Instead explicit memory reclaiming is performed at close() time and
      when skbs are removed from the receive queue.
      The in kernel UDP protocol users now need to call a
      skb_recv_udp() variant instead of skb_recv_datagram() to
      properly perform memory accounting on dequeue.
      
      Overall, this allows acquiring only once the receive queue
      lock on dequeue.
      
      Tested using pktgen with random src port, 64 bytes packet,
      wire-speed on a 10G link as sender and udp_sink as the receiver,
      using an l4 tuple rxhash to stress the contention, and one or more
      udp_sink instances with reuseport.
      
      nr sinks	vanilla		patched
      1		440		560
      3		2150		2300
      6		3650		3800
      9		4450		4600
      12		6250		6450
      
      v1 -> v2:
       - do rmem and allocated memory scheduling under the receive lock
       - do bulk scheduling in first_packet_length() and in udp_destruct_sock()
       - avoid the typdef for the dequeue callback
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7c13f97f
  25. 23 10月, 2016 1 次提交
    • P
      udp: implement memory accounting helpers · f970bd9e
      Paolo Abeni 提交于
      Avoid using the generic helpers.
      Use the receive queue spin lock to protect the memory
      accounting operation, both on enqueue and on dequeue.
      
      On dequeue perform partial memory reclaiming, trying to
      leave a quantum of forward allocated memory.
      
      On enqueue use a custom helper, to allow some optimizations:
      - use a plain spin_lock() variant instead of the slightly
        costly spin_lock_irqsave(),
      - avoid dst_force check, since the calling code has already
        dropped the skb dst
      - avoid orphaning the skb, since skb_steal_sock() already did
        the work for us
      
      The above needs custom memory reclaiming on shutdown, provided
      by the udp_destruct_sock().
      
      v5 -> v6:
        - don't orphan the skb on enqueue
      
      v4 -> v5:
        - replace the mem_lock with the receive queue spin lock
        - ensure that the bh is always allowed to enqueue at least
          a skb, even if sk_rcvbuf is exceeded
      
      v3 -> v4:
        - reworked memory accunting, simplifying the schema
        - provide an helper for both memory scheduling and enqueuing
      
      v1 -> v2:
        - use a udp specific destrctor to perform memory reclaiming
        - remove a couple of helpers, unneeded after the above cleanup
        - do not reclaim memory on dequeue if not under memory
          pressure
        - reworked the fwd accounting schema to avoid potential
          integer overflow
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f970bd9e
  26. 21 10月, 2016 1 次提交
    • E
      udp: must lock the socket in udp_disconnect() · 286c72de
      Eric Dumazet 提交于
      Baozeng Ding reported KASAN traces showing uses after free in
      udp_lib_get_port() and other related UDP functions.
      
      A CONFIG_DEBUG_PAGEALLOC=y kernel would eventually crash.
      
      I could write a reproducer with two threads doing :
      
      static int sock_fd;
      static void *thr1(void *arg)
      {
      	for (;;) {
      		connect(sock_fd, (const struct sockaddr *)arg,
      			sizeof(struct sockaddr_in));
      	}
      }
      
      static void *thr2(void *arg)
      {
      	struct sockaddr_in unspec;
      
      	for (;;) {
      		memset(&unspec, 0, sizeof(unspec));
      	        connect(sock_fd, (const struct sockaddr *)&unspec,
      			sizeof(unspec));
              }
      }
      
      Problem is that udp_disconnect() could run without holding socket lock,
      and this was causing list corruptions.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NBaozeng Ding <sploving1@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      286c72de
  27. 24 8月, 2016 1 次提交
  28. 02 6月, 2016 1 次提交
  29. 28 4月, 2016 3 次提交
  30. 08 4月, 2016 2 次提交
    • T
      udp: Add GRO functions to UDP socket · a6024562
      Tom Herbert 提交于
      This patch adds GRO functions (gro_receive and gro_complete) to UDP
      sockets. udp_gro_receive is changed to perform socket lookup on a
      packet. If a socket is found the related GRO functions are called.
      
      This features obsoletes using UDP offload infrastructure for GRO
      (udp_offload). This has the advantage of not being limited to provide
      offload on a per port basis, GRO is now applied to whatever individual
      UDP sockets are bound to.  This also allows the possbility of
      "application defined GRO"-- that is we can attach something like
      a BPF program to a UDP socket to perfrom GRO on an application
      layer protocol.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a6024562
    • T
      udp: Add udp6_lib_lookup_skb and udp4_lib_lookup_skb · 63058308
      Tom Herbert 提交于
      Add externally visible functions to lookup a UDP socket by skb. This
      will be used for GRO in UDP sockets. These functions also check
      if skb->dst is set, and if it is not skb->dev is used to get dev_net.
      This allows calling lookup functions before dst has been set on the
      skbuff.
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      63058308
  31. 06 4月, 2016 1 次提交
  32. 05 4月, 2016 1 次提交
    • E
      udp: no longer use SLAB_DESTROY_BY_RCU · ca065d0c
      Eric Dumazet 提交于
      Tom Herbert would like not touching UDP socket refcnt for encapsulated
      traffic. For this to happen, we need to use normal RCU rules, with a grace
      period before freeing a socket. UDP sockets are not short lived in the
      high usage case, so the added cost of call_rcu() should not be a concern.
      
      This actually removes a lot of complexity in UDP stack.
      
      Multicast receives no longer need to hold a bucket spinlock.
      
      Note that ip early demux still needs to take a reference on the socket.
      
      Same remark for functions used by xt_socket and xt_PROXY netfilter modules,
      but this might be changed later.
      
      Performance for a single UDP socket receiving flood traffic from
      many RX queues/cpus.
      
      Simple udp_rx using simple recvfrom() loop :
      438 kpps instead of 374 kpps : 17 % increase of the peak rate.
      
      v2: Addressed Willem de Bruijn feedback in multicast handling
       - keep early demux break in __udp4_lib_demux_lookup()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Tested-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ca065d0c