1. 24 8月, 2016 2 次提交
  2. 20 8月, 2016 1 次提交
  3. 26 7月, 2016 1 次提交
  4. 12 7月, 2016 1 次提交
  5. 15 6月, 2016 2 次提交
    • S
      udp reuseport: fix packet of same flow hashed to different socket · d1e37288
      Su, Xuemin 提交于
      There is a corner case in which udp packets belonging to a same
      flow are hashed to different socket when hslot->count changes from 10
      to 11:
      
      1) When hslot->count <= 10, __udp_lib_lookup() searches udp_table->hash,
      and always passes 'daddr' to udp_ehashfn().
      
      2) When hslot->count > 10, __udp_lib_lookup() searches udp_table->hash2,
      but may pass 'INADDR_ANY' to udp_ehashfn() if the sockets are bound to
      INADDR_ANY instead of some specific addr.
      
      That means when hslot->count changes from 10 to 11, the hash calculated by
      udp_ehashfn() is also changed, and the udp packets belonging to a same
      flow will be hashed to different socket.
      
      This is easily reproduced:
      1) Create 10 udp sockets and bind all of them to 0.0.0.0:40000.
      2) From the same host send udp packets to 127.0.0.1:40000, record the
      socket index which receives the packets.
      3) Create 1 more udp socket and bind it to 0.0.0.0:44096. The number 44096
      is 40000 + UDP_HASH_SIZE(4096), this makes the new socket put into the
      same hslot as the aformentioned 10 sockets, and makes the hslot->count
      change from 10 to 11.
      4) From the same host send udp packets to 127.0.0.1:40000, and the socket
      index which receives the packets will be different from the one received
      in step 2.
      This should not happen as the socket bound to 0.0.0.0:44096 should not
      change the behavior of the sockets bound to 0.0.0.0:40000.
      
      It's the same case for IPv6, and this patch also fixes that.
      Signed-off-by: NSu, Xuemin <suxm@chinanetcenter.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d1e37288
    • H
      ipv4: fix checksum annotation in udp4_csum_init · b46d9f62
      Hannes Frederic Sowa 提交于
      Reported-by: NCong Wang <xiyou.wangcong@gmail.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Fixes: 4068579e ("net: Implmement RFC 6936 (zero RX csums for UDP/IPv6")
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b46d9f62
  6. 03 6月, 2016 1 次提交
  7. 21 5月, 2016 1 次提交
  8. 13 5月, 2016 1 次提交
    • A
      udp: Resolve NULL pointer dereference over flow-based vxlan device · ed7cbbce
      Alexander Duyck 提交于
      While testing an OpenStack configuration using VXLANs I saw the following
      call trace:
      
       RIP: 0010:[<ffffffff815fad49>] udp4_lib_lookup_skb+0x49/0x80
       RSP: 0018:ffff88103867bc50  EFLAGS: 00010286
       RAX: ffff88103269bf00 RBX: ffff88103269bf00 RCX: 00000000ffffffff
       RDX: 0000000000004300 RSI: 0000000000000000 RDI: ffff880f2932e780
       RBP: ffff88103867bc60 R08: 0000000000000000 R09: 000000009001a8c0
       R10: 0000000000004400 R11: ffffffff81333a58 R12: ffff880f2932e794
       R13: 0000000000000014 R14: 0000000000000014 R15: ffffe8efbfd89ca0
       FS:  0000000000000000(0000) GS:ffff88103fd80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000488 CR3: 0000000001c06000 CR4: 00000000001426e0
       Stack:
        ffffffff81576515 ffffffff815733c0 ffff88103867bc98 ffffffff815fcc17
        ffff88103269bf00 ffffe8efbfd89ca0 0000000000000014 0000000000000080
        ffffe8efbfd89ca0 ffff88103867bcc8 ffffffff815fcf8b ffff880f2932e794
       Call Trace:
        [<ffffffff81576515>] ? skb_checksum+0x35/0x50
        [<ffffffff815733c0>] ? skb_push+0x40/0x40
        [<ffffffff815fcc17>] udp_gro_receive+0x57/0x130
        [<ffffffff815fcf8b>] udp4_gro_receive+0x10b/0x2c0
        [<ffffffff81605863>] inet_gro_receive+0x1d3/0x270
        [<ffffffff81589e59>] dev_gro_receive+0x269/0x3b0
        [<ffffffff8158a1b8>] napi_gro_receive+0x38/0x120
        [<ffffffffa0871297>] gro_cell_poll+0x57/0x80 [vxlan]
        [<ffffffff815899d0>] net_rx_action+0x160/0x380
        [<ffffffff816965c7>] __do_softirq+0xd7/0x2c5
        [<ffffffff8107d969>] run_ksoftirqd+0x29/0x50
        [<ffffffff8109a50f>] smpboot_thread_fn+0x10f/0x160
        [<ffffffff8109a400>] ? sort_range+0x30/0x30
        [<ffffffff81096da8>] kthread+0xd8/0xf0
        [<ffffffff81693c82>] ret_from_fork+0x22/0x40
        [<ffffffff81096cd0>] ? kthread_park+0x60/0x60
      
      The following trace is seen when receiving a DHCP request over a flow-based
      VXLAN tunnel.  I believe this is caused by the metadata dst having a NULL
      dev value and as a result dev_net(dev) is causing a NULL pointer dereference.
      
      To resolve this I am replacing the check for skb_dst(skb)->dev with just
      skb->dev.  This makes sense as the callers of this function are usually in
      the receive path and as such skb->dev should always be populated.  In
      addition other functions in the area where these are called are already
      using dev_net(skb->dev) to determine the namespace the UDP packet belongs
      in.
      
      Fixes: 63058308 ("udp: Add udp6_lib_lookup_skb and udp4_lib_lookup_skb")
      Signed-off-by: NAlexander Duyck <aduyck@mirantis.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ed7cbbce
  9. 03 5月, 2016 1 次提交
  10. 28 4月, 2016 3 次提交
  11. 19 4月, 2016 1 次提交
  12. 15 4月, 2016 1 次提交
    • C
      soreuseport: fix ordering for mixed v4/v6 sockets · d894ba18
      Craig Gallek 提交于
      With the SO_REUSEPORT socket option, it is possible to create sockets
      in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address.
      This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on
      the AF_INET6 sockets.
      
      Prior to the commits referenced below, an incoming IPv4 packet would
      always be routed to a socket of type AF_INET when this mixed-mode was used.
      After those changes, the same packet would be routed to the most recently
      bound socket (if this happened to be an AF_INET6 socket, it would
      have an IPv4 mapped IPv6 address).
      
      The change in behavior occurred because the recent SO_REUSEPORT optimizations
      short-circuit the socket scoring logic as soon as they find a match.  They
      did not take into account the scoring logic that favors AF_INET sockets
      over AF_INET6 sockets in the event of a tie.
      
      To fix this problem, this patch changes the insertion order of AF_INET
      and AF_INET6 addresses in the TCP and UDP socket lists when the sockets
      have SO_REUSEPORT set.  AF_INET sockets will be inserted at the head of the
      list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at
      the tail of the list.  This will force AF_INET sockets to always be
      considered first.
      
      Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
      Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection")
      Reported-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d894ba18
  13. 14 4月, 2016 2 次提交
  14. 08 4月, 2016 1 次提交
  15. 06 4月, 2016 2 次提交
  16. 05 4月, 2016 3 次提交
    • E
      udp: no longer use SLAB_DESTROY_BY_RCU · ca065d0c
      Eric Dumazet 提交于
      Tom Herbert would like not touching UDP socket refcnt for encapsulated
      traffic. For this to happen, we need to use normal RCU rules, with a grace
      period before freeing a socket. UDP sockets are not short lived in the
      high usage case, so the added cost of call_rcu() should not be a concern.
      
      This actually removes a lot of complexity in UDP stack.
      
      Multicast receives no longer need to hold a bucket spinlock.
      
      Note that ip early demux still needs to take a reference on the socket.
      
      Same remark for functions used by xt_socket and xt_PROXY netfilter modules,
      but this might be changed later.
      
      Performance for a single UDP socket receiving flood traffic from
      many RX queues/cpus.
      
      Simple udp_rx using simple recvfrom() loop :
      438 kpps instead of 374 kpps : 17 % increase of the peak rate.
      
      v2: Addressed Willem de Bruijn feedback in multicast handling
       - keep early demux break in __udp4_lib_demux_lookup()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Tested-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ca065d0c
    • S
      sock: enable timestamping using control messages · c14ac945
      Soheil Hassas Yeganeh 提交于
      Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
      This is very costly when users want to sample writes to gather
      tx timestamps.
      
      Add support for enabling SO_TIMESTAMPING via control messages by
      using tsflags added in `struct sockcm_cookie` (added in the previous
      patches in this series) to set the tx_flags of the last skb created in
      a sendmsg. With this patch, the timestamp recording bits in tx_flags
      of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.
      
      Please note that this is only effective for overriding the recording
      timestamps flags. Users should enable timestamp reporting (e.g.,
      SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
      socket options and then should ask for SOF_TIMESTAMPING_TX_*
      using control messages per sendmsg to sample timestamps for each
      write.
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c14ac945
    • S
      ipv4: process socket-level control messages in IPv4 · 24025c46
      Soheil Hassas Yeganeh 提交于
      Process socket-level control messages by invoking
      __sock_cmsg_send in ip_cmsg_send for control messages on
      the SOL_SOCKET layer.
      
      This makes sure whenever ip_cmsg_send is called in udp, icmp,
      and raw, we also process socket-level control messages.
      
      Note that this commit interprets new control messages that
      were ignored before. As such, this commit does not change
      the behavior of IPv4 control messages.
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      24025c46
  17. 23 3月, 2016 1 次提交
  18. 13 2月, 2016 1 次提交
  19. 12 2月, 2016 2 次提交
  20. 11 2月, 2016 1 次提交
    • C
      soreuseport: fast reuseport TCP socket selection · c125e80b
      Craig Gallek 提交于
      This change extends the fast SO_REUSEPORT socket lookup implemented
      for UDP to TCP.  Listener sockets with SO_REUSEPORT and the same
      receive address are additionally added to an array for faster
      random access.  This means that only a single socket from the group
      must be found in the listener list before any socket in the group can
      be used to receive a packet.  Previously, every socket in the group
      needed to be considered before handing off the incoming packet.
      
      This feature also exposes the ability to use a BPF program when
      selecting a socket from a reuseport group.
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c125e80b
  21. 20 1月, 2016 1 次提交
    • E
      udp: fix potential infinite loop in SO_REUSEPORT logic · ed0dfffd
      Eric Dumazet 提交于
      Using a combination of connected and un-connected sockets, Dmitry
      was able to trigger soft lockups with his fuzzer.
      
      The problem is that sockets in the SO_REUSEPORT array might have
      different scores.
      
      Right after sk2=socket(), setsockopt(sk2,...,SO_REUSEPORT, on) and
      bind(sk2, ...), but _before_ the connect(sk2) is done, sk2 is added into
      the soreuseport array, with a score which is smaller than the score of
      first socket sk1 found in hash table (I am speaking of the regular UDP
      hash table), if sk1 had the connect() done, giving a +8 to its score.
      
      hash bucket [X] -> sk1 -> sk2 -> NULL
      
      sk1 score = 14  (because it did a connect())
      sk2 score = 6
      
      SO_REUSEPORT fast selection is an optimization. If it turns out the
      score of the selected socket does not match score of first socket, just
      fallback to old SO_REUSEPORT logic instead of trying to be too smart.
      
      Normal SO_REUSEPORT users do not mix different kind of sockets, as this
      mechanism is used for load balance traffic.
      
      Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Craig Gallek <kraigatgoog@gmail.com>
      Acked-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ed0dfffd
  22. 06 1月, 2016 1 次提交
  23. 05 1月, 2016 4 次提交
    • D
      net: Propagate lookup failure in l3mdev_get_saddr to caller · b5bdacf3
      David Ahern 提交于
      Commands run in a vrf context are not failing as expected on a route lookup:
          root@kenny:~# ip ro ls table vrf-red
          unreachable default
      
          root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
          ping: Warning: source address might be selected on device other than vrf-red.
          PING 10.100.1.254 (10.100.1.254) from 0.0.0.0 vrf-red: 56(84) bytes of data.
      
          --- 10.100.1.254 ping statistics ---
          2 packets transmitted, 0 received, 100% packet loss, time 999ms
      
      Since the vrf table does not have a route for 10.100.1.254 the ping
      should have failed. The saddr lookup causes a full VRF table lookup.
      Propogating a lookup failure to the user allows the command to fail as
      expected:
      
          root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
          connect: No route to host
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b5bdacf3
    • C
      soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF · 538950a1
      Craig Gallek 提交于
      Expose socket options for setting a classic or extended BPF program
      for use when selecting sockets in an SO_REUSEPORT group.  These options
      can be used on the first socket to belong to a group before bind or
      on any socket in the group after bind.
      
      This change includes refactoring of the existing sk_filter code to
      allow reuse of the existing BPF filter validation checks.
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      538950a1
    • C
      soreuseport: fast reuseport UDP socket selection · e32ea7e7
      Craig Gallek 提交于
      Include a struct sock_reuseport instance when a UDP socket binds to
      a specific address for the first time with the reuseport flag set.
      When selecting a socket for an incoming UDP packet, use the information
      available in sock_reuseport if present.
      
      This required adding an additional field to the UDP source address
      equality function to differentiate between exact and wildcard matches.
      The original use case allowed wildcard matches when checking for
      existing port uses during bind.  The new use case of adding a socket
      to a reuseport group requires exact address matching.
      
      Performance test (using a machine with 2 CPU sockets and a total of
      48 cores):  Create reuseport groups of varying size.  Use one socket
      from this group per user thread (pinning each thread to a different
      core) calling recvmmsg in a tight loop.  Record number of messages
      received per second while saturating a 10G link.
        10 sockets: 18% increase (~2.8M -> 3.3M pkts/s)
        20 sockets: 14% increase (~2.9M -> 3.3M pkts/s)
        40 sockets: 13% increase (~3.0M -> 3.4M pkts/s)
      
      This work is based off a similar implementation written by
      Ying Cai <ycai@google.com> for implementing policy-based reuseport
      selection.
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e32ea7e7
    • E
      udp: properly support MSG_PEEK with truncated buffers · 197c949e
      Eric Dumazet 提交于
      Backport of this upstream commit into stable kernels :
      89c22d8c ("net: Fix skb csum races when peeking")
      exposed a bug in udp stack vs MSG_PEEK support, when user provides
      a buffer smaller than skb payload.
      
      In this case,
      skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr),
                                       msg->msg_iov);
      returns -EFAULT.
      
      This bug does not happen in upstream kernels since Al Viro did a great
      job to replace this into :
      skb_copy_and_csum_datagram_msg(skb, sizeof(struct udphdr), msg);
      This variant is safe vs short buffers.
      
      For the time being, instead reverting Herbert Xu patch and add back
      skb->ip_summed invalid changes, simply store the result of
      udp_lib_checksum_complete() so that we avoid computing the checksum a
      second time, and avoid the problematic
      skb_copy_and_csum_datagram_iovec() call.
      
      This patch can be applied on recent kernels as it avoids a double
      checksumming, then backported to stable kernels as a bug fix.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      197c949e
  24. 16 12月, 2015 1 次提交
  25. 19 11月, 2015 1 次提交
  26. 13 10月, 2015 1 次提交
    • E
      net: SO_INCOMING_CPU setsockopt() support · 70da268b
      Eric Dumazet 提交于
      SO_INCOMING_CPU as added in commit 2c8c56e1 was a getsockopt() command
      to fetch incoming cpu handling a particular TCP flow after accept()
      
      This commits adds setsockopt() support and extends SO_REUSEPORT selection
      logic : If a TCP listener or UDP socket has this option set, a packet is
      delivered to this socket only if CPU handling the packet matches the specified
      one.
      
      This allows to build very efficient TCP servers, using one listener per
      RX queue, as the associated TCP listener should only accept flows handled
      in softirq by the same cpu.
      This provides optimal NUMA behavior and keep cpu caches hot.
      
      Note that __inet_lookup_listener() still has to iterate over the list of
      all listeners. Following patch puts sk_refcnt in a different cache line
      to let this iteration hit only shared and read mostly cache lines.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      70da268b
  27. 07 10月, 2015 2 次提交