1. 14 3月, 2017 1 次提交
    • J
      dccp/tcp: fix routing redirect race · 45caeaa5
      Jon Maxwell 提交于
      As Eric Dumazet pointed out this also needs to be fixed in IPv6.
      v2: Contains the IPv6 tcp/Ipv6 dccp patches as well.
      
      We have seen a few incidents lately where a dst_enty has been freed
      with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
      dst_entry. If the conditions/timings are right a crash then ensues when the
      freed dst_entry is referenced later on. A Common crashing back trace is:
      
       #8 [] page_fault at ffffffff8163e648
          [exception RIP: __tcp_ack_snd_check+74]
      .
      .
       #9 [] tcp_rcv_established at ffffffff81580b64
      #10 [] tcp_v4_do_rcv at ffffffff8158b54a
      #11 [] tcp_v4_rcv at ffffffff8158cd02
      #12 [] ip_local_deliver_finish at ffffffff815668f4
      #13 [] ip_local_deliver at ffffffff81566bd9
      #14 [] ip_rcv_finish at ffffffff8156656d
      #15 [] ip_rcv at ffffffff81566f06
      #16 [] __netif_receive_skb_core at ffffffff8152b3a2
      #17 [] __netif_receive_skb at ffffffff8152b608
      #18 [] netif_receive_skb at ffffffff8152b690
      #19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
      #20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
      #21 [] net_rx_action at ffffffff8152bac2
      #22 [] __do_softirq at ffffffff81084b4f
      #23 [] call_softirq at ffffffff8164845c
      #24 [] do_softirq at ffffffff81016fc5
      #25 [] irq_exit at ffffffff81084ee5
      #26 [] do_IRQ at ffffffff81648ff8
      
      Of course it may happen with other NIC drivers as well.
      
      It's found the freed dst_entry here:
      
       224 static bool tcp_in_quickack_mode(struct sock *sk)
       225 {
       226 ▹       const struct inet_connection_sock *icsk = inet_csk(sk);
       227 ▹       const struct dst_entry *dst = __sk_dst_get(sk);
       228 
       229 ▹       return (dst && dst_metric(dst, RTAX_QUICKACK)) ||
       230 ▹       ▹       (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);
       231 }
      
      But there are other backtraces attributed to the same freed dst_entry in
      netfilter code as well.
      
      All the vmcores showed 2 significant clues:
      
      - Remote hosts behind the default gateway had always been redirected to a
      different gateway. A rtable/dst_entry will be added for that host. Making
      more dst_entrys with lower reference counts. Making this more probable.
      
      - All vmcores showed a postitive LockDroppedIcmps value, e.g:
      
      LockDroppedIcmps                  267
      
      A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
      regardless of whether user space has the socket locked. This can result in a
      race condition where the same dst_entry cached in sk->sk_dst_entry can be
      decremented twice for the same socket via:
      
      do_redirect()->__sk_dst_check()-> dst_release().
      
      Which leads to the dst_entry being prematurely freed with another socket
      pointing to it via sk->sk_dst_cache and a subsequent crash.
      
      To fix this skip do_redirect() if usespace has the socket locked. Instead let
      the redirect take place later when user space does not have the socket
      locked.
      
      The dccp/IPv6 code is very similar in this respect, so fixing it there too.
      
      As Eric Garver pointed out the following commit now invalidates routes. Which
      can set the dst->obsolete flag so that ipv4_dst_check() returns null and
      triggers the dst_release().
      
      Fixes: ceb33206 ("ipv4: Kill routes during PMTU/redirect updates.")
      Cc: Eric Garver <egarver@redhat.com>
      Cc: Hannes Sowa <hsowa@redhat.com>
      Signed-off-by: NJon Maxwell <jmaxwell37@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      45caeaa5
  2. 23 2月, 2017 1 次提交
    • A
      tcp: setup timestamp offset when write_seq already set · 00355fa5
      Alexey Kodanev 提交于
      Found that when randomized tcp offsets are enabled (by default)
      TCP client can still start new connections without them. Later,
      if server does active close and re-uses sockets in TIME-WAIT
      state, new SYN from client can be rejected on PAWS check inside
      tcp_timewait_state_process(), because either tw_ts_recent or
      rcv_tsval doesn't really have an offset set.
      
      Here is how to reproduce it with LTP netstress tool:
          netstress -R 1 &
          netstress -H 127.0.0.1 -lr 1000000 -a1
      
          [...]
          < S  seq 1956977072 win 43690 TS val 295618 ecr 459956970
          > .  ack 1956911535 win 342 TS val 459967184 ecr 1547117608
          < R  seq 1956911535 win 0 length 0
      +1. < S  seq 1956977072 win 43690 TS val 296640 ecr 459956970
          > S. seq 657450664 ack 1956977073 win 43690 TS val 459968205 ecr 296640
      
      Fixes: 95a22cae ("tcp: randomize tcp timestamp offsets for each connection")
      Signed-off-by: NAlexey Kodanev <alexey.kodanev@oracle.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      00355fa5
  3. 15 2月, 2017 1 次提交
    • J
      ipv6: Handle IPv4-mapped src to in6addr_any dst. · 052d2369
      Jonathan T. Leighton 提交于
      This patch adds a check on the type of the source address for the case
      where the destination address is in6addr_any. If the source is an
      IPv4-mapped IPv6 source address, the destination is changed to
      ::ffff:127.0.0.1, and otherwise the destination is changed to ::1. This
      is done in three locations to handle UDP calls to either connect() or
      sendmsg() and TCP calls to connect(). Note that udpv6_sendmsg() delays
      handling an in6addr_any destination until very late, so the patch only
      needs to handle the case where the source is an IPv4-mapped IPv6
      address.
      Signed-off-by: NJonathan T. Leighton <jtleight@udel.edu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      052d2369
  4. 06 2月, 2017 1 次提交
  5. 04 2月, 2017 1 次提交
  6. 27 1月, 2017 1 次提交
  7. 26 1月, 2017 2 次提交
    • W
      net/tcp-fastopen: Add new API support · 19f6d3f3
      Wei Wang 提交于
      This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
      alternative way to perform Fast Open on the active side (client). Prior
      to this patch, a client needs to replace the connect() call with
      sendto(MSG_FASTOPEN). This can be cumbersome for applications who want
      to use Fast Open: these socket operations are often done in lower layer
      libraries used by many other applications. Changing these libraries
      and/or the socket call sequences are not trivial. A more convenient
      approach is to perform Fast Open by simply enabling a socket option when
      the socket is created w/o changing other socket calls sequence:
        s = socket()
          create a new socket
        setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
          newly introduced sockopt
          If set, new functionality described below will be used.
          Return ENOTSUPP if TFO is not supported or not enabled in the
          kernel.
      
        connect()
          With cookie present, return 0 immediately.
          With no cookie, initiate 3WHS with TFO cookie-request option and
          return -1 with errno = EINPROGRESS.
      
        write()/sendmsg()
          With cookie present, send out SYN with data and return the number of
          bytes buffered.
          With no cookie, and 3WHS not yet completed, return -1 with errno =
          EINPROGRESS.
          No MSG_FASTOPEN flag is needed.
      
        read()
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
          write() is not called yet.
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
          established but no msg is received yet.
          Return number of bytes read if socket is established and there is
          msg received.
      
      The new API simplifies life for applications that always perform a write()
      immediately after a successful connect(). Such applications can now take
      advantage of Fast Open by merely making one new setsockopt() call at the time
      of creating the socket. Nothing else about the application's socket call
      sequence needs to change.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19f6d3f3
    • W
      net: Remove __sk_dst_reset() in tcp_v6_connect() · 25776aa9
      Wei Wang 提交于
      Remove __sk_dst_reset() in the failure handling because __sk_dst_reset()
      will eventually get called when sk is released. No need to handle it in
      the protocol specific connect call.
      This is also to make the code path consistent with ipv4.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      25776aa9
  8. 19 1月, 2017 1 次提交
  9. 14 1月, 2017 2 次提交
    • Y
      tcp: remove early retransmit · bec41a11
      Yuchung Cheng 提交于
      This patch removes the support of RFC5827 early retransmit (i.e.,
      fast recovery on small inflight with <3 dupacks) because it is
      subsumed by the new RACK loss detection. More specifically when
      RACK receives DUPACKs, it'll arm a reordering timer to start fast
      recovery after a quarter of (min)RTT, hence it covers the early
      retransmit except RACK does not limit itself to specific inflight
      or dupack numbers.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bec41a11
    • Y
      tcp: add reordering timer in RACK loss detection · 57dde7f7
      Yuchung Cheng 提交于
      This patch makes RACK install a reordering timer when it suspects
      some packets might be lost, but wants to delay the decision
      a little bit to accomodate reordering.
      
      It does not create a new timer but instead repurposes the existing
      RTO timer, because both are meant to retransmit packets.
      Specifically it arms a timer ICSK_TIME_REO_TIMEOUT when
      the RACK timing check fails. The wait time is set to
      
        RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge
      
      This translates to expecting a packet (Packet) should take
      (RACK.RTT + RACK.reo_wnd + fudge) to deliver after it was sent.
      
      When there are multiple packets that need a timer, we use one timer
      with the maximum timeout. Therefore the timer conservatively uses
      the maximum window to expire N packets by one timeout, instead of
      N timeouts to expire N packets sent at different times.
      
      The fudge factor is 2 jiffies to ensure when the timer fires, all
      the suspected packets would exceed the deadline and be marked lost
      by tcp_rack_detect_loss(). It has to be at least 1 jiffy because the
      clock may tick between calling icsk_reset_xmit_timer(timeout) and
      actually hang the timer. The next jiffy is to lower-bound the timeout
      to 2 jiffies when reo_wnd is < 1ms.
      
      When the reordering timer fires (tcp_rack_reo_timeout): If we aren't
      in Recovery we'll enter fast recovery and force fast retransmit.
      This is very similar to the early retransmit (RFC5827) except RACK
      is not constrained to only enter recovery for small outstanding
      flights.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      57dde7f7
  10. 10 1月, 2017 2 次提交
  11. 30 12月, 2016 1 次提交
  12. 06 12月, 2016 1 次提交
  13. 03 12月, 2016 1 次提交
    • F
      tcp: randomize tcp timestamp offsets for each connection · 95a22cae
      Florian Westphal 提交于
      jiffies based timestamps allow for easy inference of number of devices
      behind NAT translators and also makes tracking of hosts simpler.
      
      commit ceaa1fef ("tcp: adding a per-socket timestamp offset")
      added the main infrastructure that is needed for per-connection ts
      randomization, in particular writing/reading the on-wire tcp header
      format takes the offset into account so rest of stack can use normal
      tcp_time_stamp (jiffies).
      
      So only two items are left:
       - add a tsoffset for request sockets
       - extend the tcp isn generator to also return another 32bit number
         in addition to the ISN.
      
      Re-use of ISN generator also means timestamps are still monotonically
      increasing for same connection quadruple, i.e. PAWS will still work.
      
      Includes fixes from Eric Dumazet.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      95a22cae
  14. 14 11月, 2016 1 次提交
  15. 10 11月, 2016 1 次提交
  16. 05 11月, 2016 1 次提交
    • L
      net: inet: Support UID-based routing in IP protocols. · e2d118a1
      Lorenzo Colitti 提交于
      - Use the UID in routing lookups made by protocol connect() and
        sendmsg() functions.
      - Make sure that routing lookups triggered by incoming packets
        (e.g., Path MTU discovery) take the UID of the socket into
        account.
      - For packets not associated with a userspace socket, (e.g., ping
        replies) use UID 0 inside the user namespace corresponding to
        the network namespace the socket belongs to. This allows
        all namespaces to apply routing and iptables rules to
        kernel-originated traffic in that namespaces by matching UID 0.
        This is better than using the UID of the kernel socket that is
        sending the traffic, because the UID of kernel sockets created
        at namespace creation time (e.g., the per-processor ICMP and
        TCP sockets) is the UID of the user that created the socket,
        which might not be mapped in the namespace.
      
      Tested: compiles allnoconfig, allyesconfig, allmodconfig
      Tested: https://android-review.googlesource.com/253302Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2d118a1
  17. 13 10月, 2016 1 次提交
    • E
      ipv6: tcp: restore IP6CB for pktoptions skbs · 8ce48623
      Eric Dumazet 提交于
      Baozeng Ding reported following KASAN splat :
      
      BUG: KASAN: use-after-free in ip6_datagram_recv_specific_ctl+0x13f1/0x15c0 at addr ffff880029c84ec8
      Read of size 1 by task poc/25548
      Call Trace:
       [<ffffffff82cf43c9>] dump_stack+0x12e/0x185 /lib/dump_stack.c:15
       [<     inline     >] print_address_description /mm/kasan/report.c:204
       [<ffffffff817ced3b>] kasan_report_error+0x48b/0x4b0 /mm/kasan/report.c:283
       [<     inline     >] kasan_report /mm/kasan/report.c:303
       [<ffffffff817ced9e>] __asan_report_load1_noabort+0x3e/0x40 /mm/kasan/report.c:321
       [<ffffffff85c71da1>] ip6_datagram_recv_specific_ctl+0x13f1/0x15c0 /net/ipv6/datagram.c:687
       [<ffffffff85c734c3>] ip6_datagram_recv_ctl+0x33/0x40
       [<ffffffff85c0b07c>] do_ipv6_getsockopt.isra.4+0xaec/0x2150
       [<ffffffff85c0c7f6>] ipv6_getsockopt+0x116/0x230
       [<ffffffff859b5a12>] tcp_getsockopt+0x82/0xd0 /net/ipv4/tcp.c:3035
       [<ffffffff855fb385>] sock_common_getsockopt+0x95/0xd0 /net/core/sock.c:2647
       [<     inline     >] SYSC_getsockopt /net/socket.c:1776
       [<ffffffff855f8ba2>] SyS_getsockopt+0x142/0x230 /net/socket.c:1758
       [<ffffffff8685cdc5>] entry_SYSCALL_64_fastpath+0x23/0xc6
      Memory state around the buggy address:
       ffff880029c84d80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
       ffff880029c84e00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      > ffff880029c84e80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                                                    ^
       ffff880029c84f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
       ffff880029c84f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      
      He also provided a syzkaller reproducer.
      
      Issue is that ip6_datagram_recv_specific_ctl() expects to find IP6CB
      data that was moved at a different place in tcp_v6_rcv()
      
      This patch moves tcp_v6_restore_cb() up and calls it from
      tcp_v6_do_rcv() when np->pktoptions is set.
      
      Fixes: 971f10ec ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NBaozeng Ding <sploving1@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8ce48623
  18. 11 9月, 2016 1 次提交
  19. 29 8月, 2016 1 次提交
    • E
      tcp: add tcp_add_backlog() · c9c33212
      Eric Dumazet 提交于
      When TCP operates in lossy environments (between 1 and 10 % packet
      losses), many SACK blocks can be exchanged, and I noticed we could
      drop them on busy senders, if these SACK blocks have to be queued
      into the socket backlog.
      
      While the main cause is the poor performance of RACK/SACK processing,
      we can try to avoid these drops of valuable information that can lead to
      spurious timeouts and retransmits.
      
      Cause of the drops is the skb->truesize overestimation caused by :
      
      - drivers allocating ~2048 (or more) bytes as a fragment to hold an
        Ethernet frame.
      
      - various pskb_may_pull() calls bringing the headers into skb->head
        might have pulled all the frame content, but skb->truesize could
        not be lowered, as the stack has no idea of each fragment truesize.
      
      The backlog drops are also more visible on bidirectional flows, since
      their sk_rmem_alloc can be quite big.
      
      Let's add some room for the backlog, as only the socket owner
      can selectively take action to lower memory needs, like collapsing
      receive queues or partial ofo pruning.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9c33212
  20. 26 8月, 2016 2 次提交
  21. 24 8月, 2016 2 次提交
  22. 01 7月, 2016 1 次提交
  23. 28 6月, 2016 1 次提交
  24. 15 6月, 2016 1 次提交
  25. 08 6月, 2016 1 次提交
  26. 17 5月, 2016 1 次提交
    • E
      tcp: minor optimizations around tcp_hdr() usage · ea1627c2
      Eric Dumazet 提交于
      tcp_hdr() is slightly more expensive than using skb->data in contexts
      where we know they point to the same byte.
      
      In receive path, tcp_v4_rcv() and tcp_v6_rcv() are in this situation,
      as tcp header has not been pulled yet.
      
      In output path, the same can be said when we just pushed the tcp header
      in the skb, in tcp_transmit_skb() and tcp_make_synack()
      
      Also factorize the two checks for tcb->tcp_flags & TCPHDR_SYN in
      tcp_transmit_skb() and pass tcp header pointer to tcp_ecn_send(),
      so that compiler can further optimize and avoid a reload.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ea1627c2
  27. 07 5月, 2016 1 次提交
  28. 03 5月, 2016 1 次提交
  29. 28 4月, 2016 3 次提交
  30. 16 4月, 2016 1 次提交
    • E
      tcp: do not mess with listener sk_wmem_alloc · b3d05147
      Eric Dumazet 提交于
      When removing sk_refcnt manipulation on synflood, I missed that
      using skb_set_owner_w() was racy, if sk->sk_wmem_alloc had already
      transitioned to 0.
      
      We should hold sk_refcnt instead, but this is a big deal under attack.
      (Doing so increase performance from 3.2 Mpps to 3.8 Mpps only)
      
      In this patch, I chose to not attach a socket to syncookies skb.
      
      Performance is now 5 Mpps instead of 3.2 Mpps.
      
      Following patch will remove last known false sharing in
      tcp_rcv_state_process()
      
      Fixes: 3b24d854 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b3d05147
  31. 08 4月, 2016 1 次提交
  32. 05 4月, 2016 2 次提交