1. 06 12月, 2016 1 次提交
  2. 03 12月, 2016 1 次提交
    • F
      tcp: randomize tcp timestamp offsets for each connection · 95a22cae
      Florian Westphal 提交于
      jiffies based timestamps allow for easy inference of number of devices
      behind NAT translators and also makes tracking of hosts simpler.
      
      commit ceaa1fef ("tcp: adding a per-socket timestamp offset")
      added the main infrastructure that is needed for per-connection ts
      randomization, in particular writing/reading the on-wire tcp header
      format takes the offset into account so rest of stack can use normal
      tcp_time_stamp (jiffies).
      
      So only two items are left:
       - add a tsoffset for request sockets
       - extend the tcp isn generator to also return another 32bit number
         in addition to the ISN.
      
      Re-use of ISN generator also means timestamps are still monotonically
      increasing for same connection quadruple, i.e. PAWS will still work.
      
      Includes fixes from Eric Dumazet.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      95a22cae
  3. 14 11月, 2016 1 次提交
  4. 10 11月, 2016 1 次提交
  5. 05 11月, 2016 1 次提交
    • L
      net: inet: Support UID-based routing in IP protocols. · e2d118a1
      Lorenzo Colitti 提交于
      - Use the UID in routing lookups made by protocol connect() and
        sendmsg() functions.
      - Make sure that routing lookups triggered by incoming packets
        (e.g., Path MTU discovery) take the UID of the socket into
        account.
      - For packets not associated with a userspace socket, (e.g., ping
        replies) use UID 0 inside the user namespace corresponding to
        the network namespace the socket belongs to. This allows
        all namespaces to apply routing and iptables rules to
        kernel-originated traffic in that namespaces by matching UID 0.
        This is better than using the UID of the kernel socket that is
        sending the traffic, because the UID of kernel sockets created
        at namespace creation time (e.g., the per-processor ICMP and
        TCP sockets) is the UID of the user that created the socket,
        which might not be mapped in the namespace.
      
      Tested: compiles allnoconfig, allyesconfig, allmodconfig
      Tested: https://android-review.googlesource.com/253302Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2d118a1
  6. 13 10月, 2016 1 次提交
    • E
      ipv6: tcp: restore IP6CB for pktoptions skbs · 8ce48623
      Eric Dumazet 提交于
      Baozeng Ding reported following KASAN splat :
      
      BUG: KASAN: use-after-free in ip6_datagram_recv_specific_ctl+0x13f1/0x15c0 at addr ffff880029c84ec8
      Read of size 1 by task poc/25548
      Call Trace:
       [<ffffffff82cf43c9>] dump_stack+0x12e/0x185 /lib/dump_stack.c:15
       [<     inline     >] print_address_description /mm/kasan/report.c:204
       [<ffffffff817ced3b>] kasan_report_error+0x48b/0x4b0 /mm/kasan/report.c:283
       [<     inline     >] kasan_report /mm/kasan/report.c:303
       [<ffffffff817ced9e>] __asan_report_load1_noabort+0x3e/0x40 /mm/kasan/report.c:321
       [<ffffffff85c71da1>] ip6_datagram_recv_specific_ctl+0x13f1/0x15c0 /net/ipv6/datagram.c:687
       [<ffffffff85c734c3>] ip6_datagram_recv_ctl+0x33/0x40
       [<ffffffff85c0b07c>] do_ipv6_getsockopt.isra.4+0xaec/0x2150
       [<ffffffff85c0c7f6>] ipv6_getsockopt+0x116/0x230
       [<ffffffff859b5a12>] tcp_getsockopt+0x82/0xd0 /net/ipv4/tcp.c:3035
       [<ffffffff855fb385>] sock_common_getsockopt+0x95/0xd0 /net/core/sock.c:2647
       [<     inline     >] SYSC_getsockopt /net/socket.c:1776
       [<ffffffff855f8ba2>] SyS_getsockopt+0x142/0x230 /net/socket.c:1758
       [<ffffffff8685cdc5>] entry_SYSCALL_64_fastpath+0x23/0xc6
      Memory state around the buggy address:
       ffff880029c84d80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
       ffff880029c84e00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      > ffff880029c84e80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                                                    ^
       ffff880029c84f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
       ffff880029c84f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      
      He also provided a syzkaller reproducer.
      
      Issue is that ip6_datagram_recv_specific_ctl() expects to find IP6CB
      data that was moved at a different place in tcp_v6_rcv()
      
      This patch moves tcp_v6_restore_cb() up and calls it from
      tcp_v6_do_rcv() when np->pktoptions is set.
      
      Fixes: 971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NBaozeng Ding <sploving1@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8ce48623
  7. 11 9月, 2016 1 次提交
  8. 29 8月, 2016 1 次提交
    • E
      tcp: add tcp_add_backlog() · c9c33212
      Eric Dumazet 提交于
      When TCP operates in lossy environments (between 1 and 10 % packet
      losses), many SACK blocks can be exchanged, and I noticed we could
      drop them on busy senders, if these SACK blocks have to be queued
      into the socket backlog.
      
      While the main cause is the poor performance of RACK/SACK processing,
      we can try to avoid these drops of valuable information that can lead to
      spurious timeouts and retransmits.
      
      Cause of the drops is the skb->truesize overestimation caused by :
      
      - drivers allocating ~2048 (or more) bytes as a fragment to hold an
        Ethernet frame.
      
      - various pskb_may_pull() calls bringing the headers into skb->head
        might have pulled all the frame content, but skb->truesize could
        not be lowered, as the stack has no idea of each fragment truesize.
      
      The backlog drops are also more visible on bidirectional flows, since
      their sk_rmem_alloc can be quite big.
      
      Let's add some room for the backlog, as only the socket owner
      can selectively take action to lower memory needs, like collapsing
      receive queues or partial ofo pruning.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9c33212
  9. 26 8月, 2016 2 次提交
  10. 24 8月, 2016 2 次提交
  11. 01 7月, 2016 1 次提交
  12. 28 6月, 2016 1 次提交
  13. 15 6月, 2016 1 次提交
  14. 08 6月, 2016 1 次提交
  15. 17 5月, 2016 1 次提交
    • E
      tcp: minor optimizations around tcp_hdr() usage · ea1627c2
      Eric Dumazet 提交于
      tcp_hdr() is slightly more expensive than using skb->data in contexts
      where we know they point to the same byte.
      
      In receive path, tcp_v4_rcv() and tcp_v6_rcv() are in this situation,
      as tcp header has not been pulled yet.
      
      In output path, the same can be said when we just pushed the tcp header
      in the skb, in tcp_transmit_skb() and tcp_make_synack()
      
      Also factorize the two checks for tcb->tcp_flags & TCPHDR_SYN in
      tcp_transmit_skb() and pass tcp header pointer to tcp_ecn_send(),
      so that compiler can further optimize and avoid a reload.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ea1627c2
  16. 07 5月, 2016 1 次提交
  17. 03 5月, 2016 1 次提交
  18. 28 4月, 2016 3 次提交
  19. 16 4月, 2016 1 次提交
    • E
      tcp: do not mess with listener sk_wmem_alloc · b3d05147
      Eric Dumazet 提交于
      When removing sk_refcnt manipulation on synflood, I missed that
      using skb_set_owner_w() was racy, if sk->sk_wmem_alloc had already
      transitioned to 0.
      
      We should hold sk_refcnt instead, but this is a big deal under attack.
      (Doing so increase performance from 3.2 Mpps to 3.8 Mpps only)
      
      In this patch, I chose to not attach a socket to syncookies skb.
      
      Performance is now 5 Mpps instead of 3.2 Mpps.
      
      Following patch will remove last known false sharing in
      tcp_rcv_state_process()
      
      Fixes: 3b24d854 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b3d05147
  20. 08 4月, 2016 1 次提交
  21. 05 4月, 2016 3 次提交
  22. 15 3月, 2016 1 次提交
    • M
      tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/In · a44d6eac
      Martin KaFai Lau 提交于
      Per RFC4898, they count segments sent/received
      containing a positive length data segment (that includes
      retransmission segments carrying data).  Unlike
      tcpi_segs_out/in, tcpi_data_segs_out/in excludes segments
      carrying no data (e.g. pure ack).
      
      The patch also updates the segs_in in tcp_fastopen_add_skb()
      so that segs_in >= data_segs_in property is kept.
      
      Together with retransmission data, tcpi_data_segs_out
      gives a better signal on the rxmit rate.
      
      v6: Rebase on the latest net-next
      
      v5: Eric pointed out that checking skb->len is still needed in
      tcp_fastopen_add_skb() because skb can carry a FIN without data.
      Hence, instead of open coding segs_in and data_segs_in, tcp_segs_in()
      helper is used.  Comment is added to the fastopen case to explain why
      segs_in has to be reset and tcp_segs_in() has to be called before
      __skb_pull().
      
      v4: Add comment to the changes in tcp_fastopen_add_skb()
      and also add remark on this case in the commit message.
      
      v3: Add const modifier to the skb parameter in tcp_segs_in()
      
      v2: Rework based on recent fix by Eric:
      commit a9d99ce2 ("tcp: fix tcpi_segs_in after connection establishment")
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Chris Rapier <rapier@psc.edu>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a44d6eac
  23. 19 2月, 2016 1 次提交
    • E
      tcp/dccp: fix another race at listener dismantle · 7716682c
      Eric Dumazet 提交于
      Ilya reported following lockdep splat:
      
      kernel: =========================
      kernel: [ BUG: held lock freed! ]
      kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
      kernel: -------------------------
      kernel: swapper/5/0 is freeing memory
      ffff880035c9d200-ffff880035c9dbff, with a lock still held there!
      kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
      [<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
      kernel: 4 locks held by swapper/5/0:
      kernel: #0:  (rcu_read_lock){......}, at: [<ffffffff8169ef6b>]
      netif_receive_skb_internal+0x4b/0x1f0
      kernel: #1:  (rcu_read_lock){......}, at: [<ffffffff816e977f>]
      ip_local_deliver_finish+0x3f/0x380
      kernel: #2:  (slock-AF_INET){+.-...}, at: [<ffffffff81685ffb>]
      sk_clone_lock+0x19b/0x440
      kernel: #3:  (&(&queue->rskq_lock)->rlock){+.-...}, at:
      [<ffffffff816f6a88>] inet_csk_reqsk_queue_add+0x28/0xa0
      
      To properly fix this issue, inet_csk_reqsk_queue_add() needs
      to return to its callers if the child as been queued
      into accept queue.
      
      We also need to make sure listener is still there before
      calling sk->sk_data_ready(), by holding a reference on it,
      since the reference carried by the child can disappear as
      soon as the child is put on accept queue.
      Reported-by: NIlya Dryomov <idryomov@gmail.com>
      Fixes: ebb516af ("tcp/dccp: fix race at listener dismantle phase")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7716682c
  24. 11 2月, 2016 2 次提交
    • C
      inet: refactor inet[6]_lookup functions to take skb · a583636a
      Craig Gallek 提交于
      This is a preliminary step to allow fast socket lookup of SO_REUSEPORT
      groups.  Doing so with a BPF filter will require access to the
      skb in question.  This change plumbs the skb (and offset to payload
      data) through the call stack to the listening socket lookup
      implementations where it will be used in a following patch.
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a583636a
    • C
      inet: create IPv6-equivalent inet_hash function · 496611d7
      Craig Gallek 提交于
      In order to support fast lookups for TCP sockets with SO_REUSEPORT,
      the function that adds sockets to the listening hash set needs
      to be able to check receive address equality.  Since this equality
      check is different for IPv4 and IPv6, we will need two different
      socket hashing functions.
      
      This patch adds inet6_hash identical to the existing inet_hash function
      and updates the appropriate references.  A following patch will
      differentiate the two by passing different comparison functions to
      __inet_hash.
      
      Additionally, in order to use the IPv6 address equality function from
      inet6_hashtables (which is compiled as a built-in object when IPv6 is
      enabled) it also needs to be in a built-in object file as well.  This
      moves ipv6_rcv_saddr_equal into inet_hashtables to accomplish this.
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      496611d7
  25. 09 2月, 2016 1 次提交
  26. 27 1月, 2016 1 次提交
  27. 21 1月, 2016 1 次提交
  28. 15 1月, 2016 1 次提交
  29. 11 1月, 2016 1 次提交
  30. 23 12月, 2015 2 次提交
  31. 16 12月, 2015 1 次提交
  32. 15 12月, 2015 1 次提交
    • E
      net: fix IP early demux races · 5037e9ef
      Eric Dumazet 提交于
      David Wilder reported crashes caused by dst reuse.
      
      <quote David>
        I am seeing a crash on a distro V4.2.3 kernel caused by a double
        release of a dst_entry.  In ipv4_dst_destroy() the call to
        list_empty() finds a poisoned next pointer, indicating the dst_entry
        has already been removed from the list and freed. The crash occurs
        18 to 24 hours into a run of a network stress exerciser.
      </quote>
      
      Thanks to his detailed report and analysis, we were able to understand
      the core issue.
      
      IP early demux can associate a dst to skb, after a lookup in TCP/UDP
      sockets.
      
      When socket cache is not properly set, we want to store into
      sk->sk_dst_cache the dst for future IP early demux lookups,
      by acquiring a stable refcount on the dst.
      
      Problem is this acquisition is simply using an atomic_inc(),
      which works well, unless the dst was queued for destruction from
      dst_release() noticing dst refcount went to zero, if DST_NOCACHE
      was set on dst.
      
      We need to make sure current refcount is not zero before incrementing
      it, or risk double free as David reported.
      
      This patch, being a stable candidate, adds two new helpers, and use
      them only from IP early demux problematic paths.
      
      It might be possible to merge in net-next skb_dst_force() and
      skb_dst_force_safe(), but I prefer having the smallest patch for stable
      kernels : Maybe some skb_dst_force() callers do not expect skb->dst
      can suddenly be cleared.
      
      Can probably be backported back to linux-3.6 kernels
      Reported-by: NDavid J. Wilder <dwilder@us.ibm.com>
      Tested-by: NDavid J. Wilder <dwilder@us.ibm.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5037e9ef