1. 05 11月, 2016 1 次提交
    • L
      net: inet: Support UID-based routing in IP protocols. · e2d118a1
      Lorenzo Colitti 提交于
      - Use the UID in routing lookups made by protocol connect() and
        sendmsg() functions.
      - Make sure that routing lookups triggered by incoming packets
        (e.g., Path MTU discovery) take the UID of the socket into
        account.
      - For packets not associated with a userspace socket, (e.g., ping
        replies) use UID 0 inside the user namespace corresponding to
        the network namespace the socket belongs to. This allows
        all namespaces to apply routing and iptables rules to
        kernel-originated traffic in that namespaces by matching UID 0.
        This is better than using the UID of the kernel socket that is
        sending the traffic, because the UID of kernel sockets created
        at namespace creation time (e.g., the per-processor ICMP and
        TCP sockets) is the UID of the user that created the socket,
        which might not be mapped in the namespace.
      
      Tested: compiles allnoconfig, allyesconfig, allmodconfig
      Tested: https://android-review.googlesource.com/253302Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2d118a1
  2. 27 10月, 2016 1 次提交
    • E
      udp: fix IP_CHECKSUM handling · 10df8e61
      Eric Dumazet 提交于
      First bug was added in commit ad6f939a ("ip: Add offset parameter to
      ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on
      AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by
      ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr));
      
      Then commit e6afc8ac ("udp: remove headers from UDP packets before
      queueing") forgot to adjust the offsets now UDP headers are pulled
      before skb are put in receive queue.
      
      Fixes: ad6f939a ("ip: Add offset parameter to ip_cmsg_recv")
      Fixes: e6afc8ac ("udp: remove headers from UDP packets before queueing")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Sam Kumar <samanthakumar@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Tested-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      10df8e61
  3. 23 10月, 2016 2 次提交
    • P
      udp: use it's own memory accounting schema · 850cbadd
      Paolo Abeni 提交于
      Completely avoid default sock memory accounting and replace it
      with udp-specific accounting.
      
      Since the new memory accounting model encapsulates completely
      the required locking, remove the socket lock on both enqueue and
      dequeue, and avoid using the backlog on enqueue.
      
      Be sure to clean-up rx queue memory on socket destruction, using
      udp its own sk_destruct.
      
      Tested using pktgen with random src port, 64 bytes packet,
      wire-speed on a 10G link as sender and udp_sink as the receiver,
      using an l4 tuple rxhash to stress the contention, and one or more
      udp_sink instances with reuseport.
      
      nr readers      Kpps (vanilla)  Kpps (patched)
      1               170             440
      3               1250            2150
      6               3000            3650
      9               4200            4450
      12              5700            6250
      
      v4 -> v5:
        - avoid unneeded test in first_packet_length
      
      v3 -> v4:
        - remove useless sk_rcvqueues_full() call
      
      v2 -> v3:
        - do not set the now unsed backlog_rcv callback
      
      v1 -> v2:
        - add memory pressure support
        - fixed dropwatch accounting for ipv6
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      850cbadd
    • P
      udp: implement memory accounting helpers · f970bd9e
      Paolo Abeni 提交于
      Avoid using the generic helpers.
      Use the receive queue spin lock to protect the memory
      accounting operation, both on enqueue and on dequeue.
      
      On dequeue perform partial memory reclaiming, trying to
      leave a quantum of forward allocated memory.
      
      On enqueue use a custom helper, to allow some optimizations:
      - use a plain spin_lock() variant instead of the slightly
        costly spin_lock_irqsave(),
      - avoid dst_force check, since the calling code has already
        dropped the skb dst
      - avoid orphaning the skb, since skb_steal_sock() already did
        the work for us
      
      The above needs custom memory reclaiming on shutdown, provided
      by the udp_destruct_sock().
      
      v5 -> v6:
        - don't orphan the skb on enqueue
      
      v4 -> v5:
        - replace the mem_lock with the receive queue spin lock
        - ensure that the bh is always allowed to enqueue at least
          a skb, even if sk_rcvbuf is exceeded
      
      v3 -> v4:
        - reworked memory accunting, simplifying the schema
        - provide an helper for both memory scheduling and enqueuing
      
      v1 -> v2:
        - use a udp specific destrctor to perform memory reclaiming
        - remove a couple of helpers, unneeded after the above cleanup
        - do not reclaim memory on dequeue if not under memory
          pressure
        - reworked the fwd accounting schema to avoid potential
          integer overflow
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f970bd9e
  4. 21 10月, 2016 1 次提交
    • E
      udp: must lock the socket in udp_disconnect() · 286c72de
      Eric Dumazet 提交于
      Baozeng Ding reported KASAN traces showing uses after free in
      udp_lib_get_port() and other related UDP functions.
      
      A CONFIG_DEBUG_PAGEALLOC=y kernel would eventually crash.
      
      I could write a reproducer with two threads doing :
      
      static int sock_fd;
      static void *thr1(void *arg)
      {
      	for (;;) {
      		connect(sock_fd, (const struct sockaddr *)arg,
      			sizeof(struct sockaddr_in));
      	}
      }
      
      static void *thr2(void *arg)
      {
      	struct sockaddr_in unspec;
      
      	for (;;) {
      		memset(&unspec, 0, sizeof(unspec));
      	        connect(sock_fd, (const struct sockaddr *)&unspec,
      			sizeof(unspec));
              }
      }
      
      Problem is that udp_disconnect() could run without holding socket lock,
      and this was causing list corruptions.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NBaozeng Ding <sploving1@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      286c72de
  5. 11 9月, 2016 1 次提交
  6. 24 8月, 2016 4 次提交
  7. 20 8月, 2016 1 次提交
  8. 26 7月, 2016 1 次提交
  9. 12 7月, 2016 1 次提交
  10. 15 6月, 2016 2 次提交
    • S
      udp reuseport: fix packet of same flow hashed to different socket · d1e37288
      Su, Xuemin 提交于
      There is a corner case in which udp packets belonging to a same
      flow are hashed to different socket when hslot->count changes from 10
      to 11:
      
      1) When hslot->count <= 10, __udp_lib_lookup() searches udp_table->hash,
      and always passes 'daddr' to udp_ehashfn().
      
      2) When hslot->count > 10, __udp_lib_lookup() searches udp_table->hash2,
      but may pass 'INADDR_ANY' to udp_ehashfn() if the sockets are bound to
      INADDR_ANY instead of some specific addr.
      
      That means when hslot->count changes from 10 to 11, the hash calculated by
      udp_ehashfn() is also changed, and the udp packets belonging to a same
      flow will be hashed to different socket.
      
      This is easily reproduced:
      1) Create 10 udp sockets and bind all of them to 0.0.0.0:40000.
      2) From the same host send udp packets to 127.0.0.1:40000, record the
      socket index which receives the packets.
      3) Create 1 more udp socket and bind it to 0.0.0.0:44096. The number 44096
      is 40000 + UDP_HASH_SIZE(4096), this makes the new socket put into the
      same hslot as the aformentioned 10 sockets, and makes the hslot->count
      change from 10 to 11.
      4) From the same host send udp packets to 127.0.0.1:40000, and the socket
      index which receives the packets will be different from the one received
      in step 2.
      This should not happen as the socket bound to 0.0.0.0:44096 should not
      change the behavior of the sockets bound to 0.0.0.0:40000.
      
      It's the same case for IPv6, and this patch also fixes that.
      Signed-off-by: NSu, Xuemin <suxm@chinanetcenter.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d1e37288
    • H
      ipv4: fix checksum annotation in udp4_csum_init · b46d9f62
      Hannes Frederic Sowa 提交于
      Reported-by: NCong Wang <xiyou.wangcong@gmail.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Fixes: 4068579e ("net: Implmement RFC 6936 (zero RX csums for UDP/IPv6")
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b46d9f62
  11. 03 6月, 2016 1 次提交
  12. 21 5月, 2016 1 次提交
  13. 13 5月, 2016 1 次提交
    • A
      udp: Resolve NULL pointer dereference over flow-based vxlan device · ed7cbbce
      Alexander Duyck 提交于
      While testing an OpenStack configuration using VXLANs I saw the following
      call trace:
      
       RIP: 0010:[<ffffffff815fad49>] udp4_lib_lookup_skb+0x49/0x80
       RSP: 0018:ffff88103867bc50  EFLAGS: 00010286
       RAX: ffff88103269bf00 RBX: ffff88103269bf00 RCX: 00000000ffffffff
       RDX: 0000000000004300 RSI: 0000000000000000 RDI: ffff880f2932e780
       RBP: ffff88103867bc60 R08: 0000000000000000 R09: 000000009001a8c0
       R10: 0000000000004400 R11: ffffffff81333a58 R12: ffff880f2932e794
       R13: 0000000000000014 R14: 0000000000000014 R15: ffffe8efbfd89ca0
       FS:  0000000000000000(0000) GS:ffff88103fd80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000488 CR3: 0000000001c06000 CR4: 00000000001426e0
       Stack:
        ffffffff81576515 ffffffff815733c0 ffff88103867bc98 ffffffff815fcc17
        ffff88103269bf00 ffffe8efbfd89ca0 0000000000000014 0000000000000080
        ffffe8efbfd89ca0 ffff88103867bcc8 ffffffff815fcf8b ffff880f2932e794
       Call Trace:
        [<ffffffff81576515>] ? skb_checksum+0x35/0x50
        [<ffffffff815733c0>] ? skb_push+0x40/0x40
        [<ffffffff815fcc17>] udp_gro_receive+0x57/0x130
        [<ffffffff815fcf8b>] udp4_gro_receive+0x10b/0x2c0
        [<ffffffff81605863>] inet_gro_receive+0x1d3/0x270
        [<ffffffff81589e59>] dev_gro_receive+0x269/0x3b0
        [<ffffffff8158a1b8>] napi_gro_receive+0x38/0x120
        [<ffffffffa0871297>] gro_cell_poll+0x57/0x80 [vxlan]
        [<ffffffff815899d0>] net_rx_action+0x160/0x380
        [<ffffffff816965c7>] __do_softirq+0xd7/0x2c5
        [<ffffffff8107d969>] run_ksoftirqd+0x29/0x50
        [<ffffffff8109a50f>] smpboot_thread_fn+0x10f/0x160
        [<ffffffff8109a400>] ? sort_range+0x30/0x30
        [<ffffffff81096da8>] kthread+0xd8/0xf0
        [<ffffffff81693c82>] ret_from_fork+0x22/0x40
        [<ffffffff81096cd0>] ? kthread_park+0x60/0x60
      
      The following trace is seen when receiving a DHCP request over a flow-based
      VXLAN tunnel.  I believe this is caused by the metadata dst having a NULL
      dev value and as a result dev_net(dev) is causing a NULL pointer dereference.
      
      To resolve this I am replacing the check for skb_dst(skb)->dev with just
      skb->dev.  This makes sense as the callers of this function are usually in
      the receive path and as such skb->dev should always be populated.  In
      addition other functions in the area where these are called are already
      using dev_net(skb->dev) to determine the namespace the UDP packet belongs
      in.
      
      Fixes: 63058308 ("udp: Add udp6_lib_lookup_skb and udp4_lib_lookup_skb")
      Signed-off-by: NAlexander Duyck <aduyck@mirantis.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ed7cbbce
  14. 03 5月, 2016 1 次提交
  15. 28 4月, 2016 3 次提交
  16. 19 4月, 2016 1 次提交
  17. 15 4月, 2016 1 次提交
    • C
      soreuseport: fix ordering for mixed v4/v6 sockets · d894ba18
      Craig Gallek 提交于
      With the SO_REUSEPORT socket option, it is possible to create sockets
      in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address.
      This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on
      the AF_INET6 sockets.
      
      Prior to the commits referenced below, an incoming IPv4 packet would
      always be routed to a socket of type AF_INET when this mixed-mode was used.
      After those changes, the same packet would be routed to the most recently
      bound socket (if this happened to be an AF_INET6 socket, it would
      have an IPv4 mapped IPv6 address).
      
      The change in behavior occurred because the recent SO_REUSEPORT optimizations
      short-circuit the socket scoring logic as soon as they find a match.  They
      did not take into account the scoring logic that favors AF_INET sockets
      over AF_INET6 sockets in the event of a tie.
      
      To fix this problem, this patch changes the insertion order of AF_INET
      and AF_INET6 addresses in the TCP and UDP socket lists when the sockets
      have SO_REUSEPORT set.  AF_INET sockets will be inserted at the head of the
      list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at
      the tail of the list.  This will force AF_INET sockets to always be
      considered first.
      
      Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
      Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection")
      Reported-by: NMaciej Żenczykowski <maze@google.com>
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d894ba18
  18. 14 4月, 2016 2 次提交
  19. 08 4月, 2016 1 次提交
  20. 06 4月, 2016 2 次提交
  21. 05 4月, 2016 3 次提交
    • E
      udp: no longer use SLAB_DESTROY_BY_RCU · ca065d0c
      Eric Dumazet 提交于
      Tom Herbert would like not touching UDP socket refcnt for encapsulated
      traffic. For this to happen, we need to use normal RCU rules, with a grace
      period before freeing a socket. UDP sockets are not short lived in the
      high usage case, so the added cost of call_rcu() should not be a concern.
      
      This actually removes a lot of complexity in UDP stack.
      
      Multicast receives no longer need to hold a bucket spinlock.
      
      Note that ip early demux still needs to take a reference on the socket.
      
      Same remark for functions used by xt_socket and xt_PROXY netfilter modules,
      but this might be changed later.
      
      Performance for a single UDP socket receiving flood traffic from
      many RX queues/cpus.
      
      Simple udp_rx using simple recvfrom() loop :
      438 kpps instead of 374 kpps : 17 % increase of the peak rate.
      
      v2: Addressed Willem de Bruijn feedback in multicast handling
       - keep early demux break in __udp4_lib_demux_lookup()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Tested-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ca065d0c
    • S
      sock: enable timestamping using control messages · c14ac945
      Soheil Hassas Yeganeh 提交于
      Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
      This is very costly when users want to sample writes to gather
      tx timestamps.
      
      Add support for enabling SO_TIMESTAMPING via control messages by
      using tsflags added in `struct sockcm_cookie` (added in the previous
      patches in this series) to set the tx_flags of the last skb created in
      a sendmsg. With this patch, the timestamp recording bits in tx_flags
      of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.
      
      Please note that this is only effective for overriding the recording
      timestamps flags. Users should enable timestamp reporting (e.g.,
      SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
      socket options and then should ask for SOF_TIMESTAMPING_TX_*
      using control messages per sendmsg to sample timestamps for each
      write.
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c14ac945
    • S
      ipv4: process socket-level control messages in IPv4 · 24025c46
      Soheil Hassas Yeganeh 提交于
      Process socket-level control messages by invoking
      __sock_cmsg_send in ip_cmsg_send for control messages on
      the SOL_SOCKET layer.
      
      This makes sure whenever ip_cmsg_send is called in udp, icmp,
      and raw, we also process socket-level control messages.
      
      Note that this commit interprets new control messages that
      were ignored before. As such, this commit does not change
      the behavior of IPv4 control messages.
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      24025c46
  22. 23 3月, 2016 1 次提交
  23. 13 2月, 2016 1 次提交
  24. 12 2月, 2016 2 次提交
  25. 11 2月, 2016 1 次提交
    • C
      soreuseport: fast reuseport TCP socket selection · c125e80b
      Craig Gallek 提交于
      This change extends the fast SO_REUSEPORT socket lookup implemented
      for UDP to TCP.  Listener sockets with SO_REUSEPORT and the same
      receive address are additionally added to an array for faster
      random access.  This means that only a single socket from the group
      must be found in the listener list before any socket in the group can
      be used to receive a packet.  Previously, every socket in the group
      needed to be considered before handing off the incoming packet.
      
      This feature also exposes the ability to use a BPF program when
      selecting a socket from a reuseport group.
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c125e80b
  26. 20 1月, 2016 1 次提交
    • E
      udp: fix potential infinite loop in SO_REUSEPORT logic · ed0dfffd
      Eric Dumazet 提交于
      Using a combination of connected and un-connected sockets, Dmitry
      was able to trigger soft lockups with his fuzzer.
      
      The problem is that sockets in the SO_REUSEPORT array might have
      different scores.
      
      Right after sk2=socket(), setsockopt(sk2,...,SO_REUSEPORT, on) and
      bind(sk2, ...), but _before_ the connect(sk2) is done, sk2 is added into
      the soreuseport array, with a score which is smaller than the score of
      first socket sk1 found in hash table (I am speaking of the regular UDP
      hash table), if sk1 had the connect() done, giving a +8 to its score.
      
      hash bucket [X] -> sk1 -> sk2 -> NULL
      
      sk1 score = 14  (because it did a connect())
      sk2 score = 6
      
      SO_REUSEPORT fast selection is an optimization. If it turns out the
      score of the selected socket does not match score of first socket, just
      fallback to old SO_REUSEPORT logic instead of trying to be too smart.
      
      Normal SO_REUSEPORT users do not mix different kind of sockets, as this
      mechanism is used for load balance traffic.
      
      Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Craig Gallek <kraigatgoog@gmail.com>
      Acked-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ed0dfffd
  27. 06 1月, 2016 1 次提交
  28. 05 1月, 2016 1 次提交
    • D
      net: Propagate lookup failure in l3mdev_get_saddr to caller · b5bdacf3
      David Ahern 提交于
      Commands run in a vrf context are not failing as expected on a route lookup:
          root@kenny:~# ip ro ls table vrf-red
          unreachable default
      
          root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
          ping: Warning: source address might be selected on device other than vrf-red.
          PING 10.100.1.254 (10.100.1.254) from 0.0.0.0 vrf-red: 56(84) bytes of data.
      
          --- 10.100.1.254 ping statistics ---
          2 packets transmitted, 0 received, 100% packet loss, time 999ms
      
      Since the vrf table does not have a route for 10.100.1.254 the ping
      should have failed. The saddr lookup causes a full VRF table lookup.
      Propogating a lookup failure to the user allows the command to fail as
      expected:
      
          root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
          connect: No route to host
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b5bdacf3