1. 19 Aug, 2017 2 commits
  2. 17 Aug, 2017 1 commit
  3. 16 Aug, 2017 2 commits
  4. 15 Aug, 2017 3 commits
    • tcp: fix possible deadlock in TCP stack vs BPF filter · d624d276
      Committed by Eric Dumazet
      Filtering the ACK packet was not put at the right place.
      
      At this place, we already allocated a child and put it
      into accept queue.
      
      We absolutely need to call tcp_child_process() to release
      its spinlock, or we will deadlock at accept() or close() time.
      
      Found by syzkaller team (Thanks a lot !)
      
      Fixes: 8fac365f ("tcp: Add a tcp_filter hook before handle ack packet")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Chenbo Feng <fengc@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: ulp: avoid module refcnt leak in tcp_set_ulp · 539a06ba
      Committed by Sabrina Dubroca
      __tcp_ulp_find_autoload returns tcp_ulp_ops after taking a reference on
      the module. Then, if ->init fails, tcp_set_ulp propagates the error but
      nothing releases that reference.
      
      Fixes: 734942cc ("tcp: ULP infrastructure")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
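The reference leak described in the tcp_set_ulp commit above can be illustrated with a small userspace sketch. Every name here (toy_module, toy_ulp_ops, toy_ulp_find_get, toy_set_ulp) is hypothetical and only models the pattern; this is not the kernel code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a module and its ULP ops; not kernel types. */
struct toy_module {
    int refcnt;
};

struct toy_ulp_ops {
    struct toy_module *owner;
    int (*init)(void);
};

/* Look up the ops and take a reference on the owning module. */
static struct toy_ulp_ops *toy_ulp_find_get(struct toy_ulp_ops *ops)
{
    ops->owner->refcnt++;
    return ops;
}

/* Drop one reference on the module. */
static void toy_module_put(struct toy_module *m)
{
    assert(m->refcnt > 0);
    m->refcnt--;
}

/* Attach the ULP; if init fails, release the reference before returning. */
static int toy_set_ulp(struct toy_ulp_ops *candidate)
{
    struct toy_ulp_ops *ops = toy_ulp_find_get(candidate);
    int err = ops->init();

    if (err)
        toy_module_put(ops->owner); /* without this, the count leaks */
    return err;
}

/* Two init callbacks used to exercise both paths. */
static int toy_init_ok(void)   { return 0; }
static int toy_init_fail(void) { return -1; }
```

With a failing init callback the module's count returns to its previous value; with a succeeding one the reference is kept for as long as the ULP stays attached.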
    • ipv4: route: fix inet_rtm_getroute induced crash · 2c87d63a
      Committed by Florian Westphal
      "ip route get $daddr iif eth0 from $saddr" causes:
       BUG: KASAN: use-after-free in ip_route_input_rcu+0x1535/0x1b50
       Call Trace:
        ip_route_input_rcu+0x1535/0x1b50
        ip_route_input_noref+0xf9/0x190
        tcp_v4_early_demux+0x1a4/0x2b0
        ip_rcv+0xbcb/0xc05
        __netif_receive_skb+0x9c/0xd0
        netif_receive_skb_internal+0x5a8/0x890
      
      Problem is that inet_rtm_getroute calls either ip_route_input_rcu (if an
      iif was provided) or ip_route_output_key_hash_rcu.
      
      But ip_route_input_rcu, unlike ip_route_output_key_hash_rcu, already
      associates the dst_entry with the skb.  This clears the SKB_DST_NOREF
      bit (i.e. skb_dst_drop will release/free the entry while it should not).
      
      Thus only set the dst if we called ip_route_output_key_hash_rcu().
      
      I tested this patch by running:
       while true;do ip r get 10.0.1.2;done > /dev/null &
       while true;do ip r get 10.0.1.2 iif eth0  from 10.0.1.1;done > /dev/null &
      ... and saw no crash or memory leak.
      
      Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
      Cc: David Ahern <dsahern@gmail.com>
      Fixes: ba52d61e ("ipv4: route: restore skb_dst_set in inet_rtm_getroute")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 14 Aug, 2017 2 commits
  6. 12 Aug, 2017 1 commit
    • net: ipv4: set orig_oif based on fib result for local traffic · 839da4d9
      Committed by David Ahern
      Attempts to connect to a local address with a socket bound
      to a device with the local address hang if there is no listener:
      
        $ ip addr sh dev eth1
        3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
          link/ether 02:e0:f9:1c:00:37 brd ff:ff:ff:ff:ff:ff
          inet 10.100.1.4/24 scope global eth1
             valid_lft forever preferred_lft forever
          inet6 2001:db8:1::4/120 scope global
             valid_lft forever preferred_lft forever
          inet6 fe80::e0:f9ff:fe1c:37/64 scope link
             valid_lft forever preferred_lft forever
      
        $ vrf-test -I eth1 -r 10.100.1.4
        <hangs when there is no server>
      
      (don't let the command name fool you; vrf-test works without vrfs.)
      
      The problem is that the original intended device, eth1 in this case, is
      lost when the tcp reset is sent, so the socket lookup does not find a
      match for the reset and the connect attempt hangs. Fix by adjusting
      orig_oif for local traffic to the device from the fib lookup result.
      
      With this patch you get the more user friendly:
        $ vrf-test -I eth1 -r 10.100.1.4
        connect failed: 111: Connection refused
      
      orig_oif is saved to the newly created rtable as rt_iif and when set
      it is used as the dif for socket lookups. It is set based on flowi4_oif
      passed in to ip_route_output_key_hash_rcu and will be set to either
      the loopback device, an l3mdev device, nothing (flowi4_oif = 0 which
      is the case in the example above) or a netdev index depending on the
      lookup path. In each case, resetting orig_oif to the device in the fib
      result for the RTN_LOCAL case allows the actual device to be preserved
      as the skb tx and rx is done over the loopback or VRF device.

      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 11 Aug, 2017 1 commit
  8. 10 Aug, 2017 2 commits
  9. 09 Aug, 2017 2 commits
  10. 08 Aug, 2017 6 commits
  11. 07 Aug, 2017 6 commits
  12. 04 Aug, 2017 9 commits
    • tcp: enable MSG_ZEROCOPY · f214f915
      Committed by Willem de Bruijn
      Enable support for MSG_ZEROCOPY in the TCP stack. TSO and GSO are
      both supported. Only data sent to remote destinations is sent without
      copying. Packets looped onto a local destination have their payload
      copied to avoid unbounded latency.
      
      Tested:
        A 10x TCP_STREAM between two hosts showed a reduction in netserver
        process cycles by up to 70%, depending on packet size. Systemwide,
        savings are of course much less pronounced, at up to 20% best case.
      
        msg_zerocopy.sh 4 tcp:
      
        without zerocopy
          tx=121792 (7600 MB) txc=0 zc=n
          rx=60458 (7600 MB)
      
        with zerocopy
          tx=286257 (17863 MB) txc=286257 zc=y
          rx=140022 (17863 MB)
      
        This test opens a pair of sockets over veth; one calls send with
        64KB and optionally MSG_ZEROCOPY, and the other reads the initial
        bytes. The receiver truncates, so this is strictly an upper bound on
        what is achievable. It is more representative of sending data out of
        a physical NIC (when payload is not touched, either).

      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
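For the MSG_ZEROCOPY commit above, a minimal userspace sketch of how a sender might request zerocopy on a TCP socket. The SO_ZEROCOPY opt-in and the requirement to leave the buffer untouched until a completion arrives on the socket error queue come from the kernel's msg_zerocopy documentation, not from this commit message. The fallback constant values are an assumption that holds for the generic Linux UAPI headers; some architectures use different numbers. The helper names enable_zerocopy and send_zerocopy are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Fallbacks for older libc headers (assumed generic Linux UAPI values). */
#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Opt the socket in to zerocopy; returns 0 on success, -1 with errno set. */
static int enable_zerocopy(int fd)
{
    int one = 1;

    return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}

/*
 * Ask the kernel to transmit buf without copying it.
 * The caller must not modify buf until the kernel reports completion
 * on the socket error queue (completion handling is omitted here).
 */
static ssize_t send_zerocopy(int fd, const void *buf, size_t len)
{
    assert(buf != NULL);
    return send(fd, buf, len, MSG_ZEROCOPY);
}
```

A caller would typically invoke enable_zerocopy() once after connect() and then use send_zerocopy() for large writes, since the commit message reports the largest savings with 64KB sends.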
    • tcp: fix xmit timer to only be reset if data ACKed/SACKed · df92c839
      Committed by Neal Cardwell
      Fix a TCP loss recovery performance bug raised recently on the netdev
      list, in two threads:
      
      (i)  July 26, 2017: netdev thread "TCP fast retransmit issues"
      (ii) July 26, 2017: netdev thread:
           "[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
           outstanding TLP retransmission"
      
      The basic problem is that incoming TCP packets that did not indicate
      forward progress could cause the xmit timer (TLP or RTO) to be rearmed
      and pushed back in time. In certain corner cases this could result in
      the following problems noted in these threads:
      
       - Repeated ACKs coming in with bogus SACKs corrupted by middleboxes
         could cause TCP to repeatedly schedule TLPs forever. We kept
         sending TLPs after every ~200ms, which elicited bogus SACKs, which
         caused more TLPs, ad infinitum; we never fired an RTO to fill in
         the holes.
      
       - Incoming data segments could, in some cases, cause us to reschedule
         our RTO or TLP timer further out in time, for no good reason. This
         could cause repeated inbound data to result in stalls in outbound
         data, in the presence of packet loss.
      
      This commit fixes these bugs by changing the TLP and RTO ACK
      processing to:
      
       (a) Only reschedule the xmit timer once per ACK.
      
       (b) Only reschedule the xmit timer if tcp_clean_rtx_queue() deems the
           ACK indicates sufficient forward progress (a packet was
           cumulatively ACKed, or we got a SACK for a packet that was sent
           before the most recent retransmit of the write queue head).
      
      This brings us back into closer compliance with the RFCs, since, as
      the comment for tcp_rearm_rto() notes, we should only restart the RTO
      timer after forward progress on the connection. Previously we were
      restarting the xmit timer even in these cases where there was no
      forward progress.
      
      As a side benefit, this commit simplifies and speeds up the TCP timer
      arming logic. We had been calling inet_csk_reset_xmit_timer() three
      times on normal ACKs that cumulatively acknowledged some data:
      
      1) Once near the top of tcp_ack() to switch from TLP timer to RTO:
              if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
                     tcp_rearm_rto(sk);
      
      2) Once in tcp_clean_rtx_queue(), to update the RTO:
              if (flag & FLAG_ACKED) {
                     tcp_rearm_rto(sk);
      
      3) Once in tcp_ack() after tcp_fastretrans_alert() to switch from RTO
         to TLP:
              if (icsk->icsk_pending == ICSK_TIME_RETRANS)
                     tcp_schedule_loss_probe(sk);
      
      This commit, by only rescheduling the xmit timer once per ACK,
      simplifies the code and reduces CPU overhead.
      
      This commit was tested in an A/B test with Google web server
      traffic. SNMP stats and request latency metrics were within noise
      levels, substantiating that for normal web traffic patterns this is a
      rare issue. This commit was also tested with packetdrill tests to
      verify that it fixes the timer behavior in the corner cases discussed
      in the netdev threads mentioned above.
      
      This patch is a bug fix patch intended to be queued for -stable
      releases.
      
      Fixes: 6ba8a3b1 ("tcp: Tail loss probe (TLP)")
      Reported-by: Klavs Klavsen <kl@vsen.dk>
      Reported-by: Mao Wenan <maowenan@huawei.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
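Rules (a) and (b) of the xmit timer fix above can be modelled in a few lines. This is a toy sketch, not the kernel's tcp_ack() logic: the flag names, the toy_conn struct and toy_process_ack() are hypothetical, and the counter merely stands in for a call that arms the timer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-ACK flags for this sketch; the kernel's flags differ. */
enum {
    ACK_CUM_ACKED      = 1 << 0, /* a packet was cumulatively ACKed */
    ACK_SACKED_OLD_PKT = 1 << 1, /* SACK for a packet sent before the most
                                    recent retransmit of the queue head */
    ACK_BOGUS_SACK     = 1 << 2, /* SACK that indicates no progress */
    ACK_DATA_ONLY      = 1 << 3, /* inbound data, nothing newly ACKed */
};

struct toy_conn {
    int timer_arms; /* how many times the xmit timer was (re)armed */
};

/* Rule (b): only these two conditions count as forward progress. */
static bool ack_shows_forward_progress(unsigned int flags)
{
    return (flags & (ACK_CUM_ACKED | ACK_SACKED_OLD_PKT)) != 0;
}

/* Rule (a): a single rearm point, reached at most once per ACK. */
static void toy_process_ack(struct toy_conn *c, unsigned int flags)
{
    assert(c != NULL);
    if (ack_shows_forward_progress(flags))
        c->timer_arms++;
}
```

In this model, bogus SACKs and pure inbound data leave the timer alone, so a pending RTO can still fire, while an ACK that satisfies both progress conditions still arms the timer only once.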
    • tcp: enable xmit timer fix by having TLP use time when RTO should fire · a2815817
      Committed by Neal Cardwell
      Have tcp_schedule_loss_probe() base the TLP scheduling decision
      on when the RTO *should* fire. This is to enable the upcoming xmit
      timer fix in this series, where tcp_schedule_loss_probe() cannot
      assume that the last timer installed was an RTO timer (because we are
      no longer doing the "rearm RTO, rearm RTO, rearm TLP" dance on every
      ACK). So tcp_schedule_loss_probe() must independently figure out when
      an RTO would want to fire.
      
      In the new TLP implementation following in this series, we cannot
      assume that icsk_timeout was set based on an RTO; after processing a
      cumulative ACK the icsk_timeout we see can be from a previous TLP or
      RTO. So we need to independently recalculate the RTO time (instead of
      reading it out of icsk_timeout). Removing this dependency on the
      nature of icsk_timeout makes things a little easier to reason about
      anyway.
      
      Note that the old and new code should be equivalent, since they are
      both saying: "if the RTO is in the future, but at an earlier time than
      the normal TLP time, then set the TLP timer to fire when the RTO would
      have fired".
      
      Fixes: 6ba8a3b1 ("tcp: Tail loss probe (TLP)")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
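The quoted rule in the commit above ("if the RTO is in the future, but at an earlier time than the normal TLP time, then set the TLP timer to fire when the RTO would have fired") amounts to a clamp. A minimal sketch, assuming microsecond units and a hypothetical helper name toy_tlp_timeout_us(); it is not the body of tcp_schedule_loss_probe().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Choose the delay for the TLP timer.
 *   tlp_us       - the normal TLP delay
 *   rto_delta_us - time until the RTO should fire; zero or negative
 *                  means the RTO deadline is not in the future
 */
static uint32_t toy_tlp_timeout_us(uint32_t tlp_us, int64_t rto_delta_us)
{
    assert(tlp_us > 0);

    /* RTO is in the future but earlier than the TLP: fire at the RTO time. */
    if (rto_delta_us > 0 && rto_delta_us < (int64_t)tlp_us)
        return (uint32_t)rto_delta_us;

    return tlp_us;
}
```

Because the RTO deadline is passed in rather than read from a shared timeout field, the result does not depend on which timer was installed last, which is the independence the commit message asks for.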
    • tcp: introduce tcp_rto_delta_us() helper for xmit timer fix · e1a10ef7
      Committed by Neal Cardwell
      Pure refactor. This helper will be required in the xmit timer fix
      later in the patch series. (Because the TLP logic will want to make
      this calculation.)
      
      Fixes: 6ba8a3b1 ("tcp: Tail loss probe (TLP)")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
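The commit message above does not spell out the calculation. Judging from the helper's name and the companion TLP commit, it plausibly returns the time remaining until the RTO should fire. The sketch below works under that assumption; toy_rto_delta_us() and its parameters are hypothetical and do not reproduce the kernel helper's signature.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Microseconds until the RTO should fire, assuming the deadline is
 * "send time of the oldest unacknowledged packet plus the current RTO".
 * A negative result means the deadline has already passed.
 */
static int64_t toy_rto_delta_us(uint64_t head_sent_us, uint32_t rto_us,
                                uint64_t now_us)
{
    assert(rto_us > 0);

    int64_t deadline_us = (int64_t)head_sent_us + (int64_t)rto_us;

    return deadline_us - (int64_t)now_us;
}
```

Returning a signed value lets the caller distinguish "RTO still pending" from "RTO overdue" with a single comparison against zero.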
    • net: fib_rules: Implement notification logic in core · 1b2a4440
      Committed by Ido Schimmel
      Unlike the routing tables, the FIB rules share a common core, so instead
      of replicating the same logic for each address family we can simply dump
      the rules and send notifications from the core itself.
      
      To protect the integrity of the dump, a rules-specific sequence counter
      is added for each address family and incremented whenever a rule is
      added or deleted (under RTNL).

      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
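The sequence counter mentioned in the fib_rules commit above follows a common pattern: read the counter, dump, then check that the counter did not move. A single-threaded sketch with hypothetical names (toy_rules, toy_rule_add, toy_dump_rules); the visit callback only simulates a change arriving in the middle of a dump and has no counterpart in the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-family rule table with a change counter. */
struct toy_rules {
    unsigned int seq; /* incremented on every add or delete */
    int nrules;
};

static void toy_rule_add(struct toy_rules *r)
{
    r->nrules++;
    r->seq++;
}

static void toy_rule_del(struct toy_rules *r)
{
    assert(r->nrules > 0);
    r->nrules--;
    r->seq++;
}

/*
 * Dump the rule count into *out_count. Returns true only if no rule
 * was added or deleted while the dump was in progress; on false the
 * caller should discard the result and retry.
 */
static bool toy_dump_rules(struct toy_rules *r,
                           void (*visit)(struct toy_rules *),
                           int *out_count)
{
    unsigned int before = r->seq;

    *out_count = r->nrules;
    if (visit != NULL)
        visit(r); /* simulated concurrent modification */

    return r->seq == before;
}
```

A listener that gets false back would simply repeat the dump until the counter is stable, which is how the counter protects the integrity of the snapshot.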
    • net: core: Make the FIB notification chain generic · 04b1d4e5
      Committed by Ido Schimmel
      The FIB notification chain is currently solely used by IPv4 code.
      However, we're going to introduce IPv6 FIB offload support, which
      requires these notifications as well.
      
      As explained in commit c3852ef7 ("ipv4: fib: Replay events when
      registering FIB notifier"), upon registration to the chain, the callee
      receives a full dump of the FIB tables and rules by traversing all the
      net namespaces. The integrity of the dump is ensured by a per-namespace
      sequence counter that is incremented whenever a change to the tables or
      rules occurs.
      
      In order to allow more address families to use the chain, each family is
      expected to register its fib_notifier_ops in its pernet init. These
      operations allow the common code to read the family's sequence counter
      as well as dump its tables and rules in the given net namespace.
      
      Additionally, a 'family' parameter is added to sent notifications, so
      that listeners could distinguish between the different families.
      
      Implement the common code that allows listeners to register to the chain
      and for address families to register their fib_notifier_ops. Subsequent
      patches will implement these operations in IPv6.
      
      In the future, ipmr and ip6mr will be extended to provide these
      notifications as well.

      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fix keepalive code vs TCP_FASTOPEN_CONNECT · 2dda6400
      Committed by Eric Dumazet
      syzkaller was able to trigger a divide by 0 in TCP stack [1]
      
      Issue here is that keepalive timer needs to be updated to not attempt
      to send a probe if the connection setup was deferred using
      TCP_FASTOPEN_CONNECT socket option added in linux-4.11
      
      [1]
       divide error: 0000 [#1] SMP
       CPU: 18 PID: 0 Comm: swapper/18 Not tainted
       task: ffff986f62f4b040 ti: ffff986f62fa2000 task.ti: ffff986f62fa2000
       RIP: 0010:[<ffffffff8409cc0d>]  [<ffffffff8409cc0d>] __tcp_select_window+0x8d/0x160
       Call Trace:
        <IRQ>
        [<ffffffff8409d951>] tcp_transmit_skb+0x11/0x20
        [<ffffffff8409da21>] tcp_xmit_probe_skb+0xc1/0xe0
        [<ffffffff840a0ee8>] tcp_write_wakeup+0x68/0x160
        [<ffffffff840a151b>] tcp_keepalive_timer+0x17b/0x230
        [<ffffffff83b3f799>] call_timer_fn+0x39/0xf0
        [<ffffffff83b40797>] run_timer_softirq+0x1d7/0x280
        [<ffffffff83a04ddb>] __do_softirq+0xcb/0x257
        [<ffffffff83ae03ac>] irq_exit+0x9c/0xb0
        [<ffffffff83a04c1a>] smp_apic_timer_interrupt+0x6a/0x80
        [<ffffffff83a03eaf>] apic_timer_interrupt+0x7f/0x90
        <EOI>
        [<ffffffff83fed2ea>] ? cpuidle_enter_state+0x13a/0x3b0
        [<ffffffff83fed2cd>] ? cpuidle_enter_state+0x11d/0x3b0
      
      Tested:
      
      Following packetdrill no longer crashes the kernel
      
      `echo 0 >/proc/sys/net/ipv4/tcp_timestamps`
      
      // Cache warmup: send a Fast Open cookie request
          0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
         +0 setsockopt(3, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
         +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation is now in progress)
         +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
       +.01 < S. 123:123(0) ack 1 win 14600 <mss 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
         +0 > . 1:1(0) ack 1
         +0 close(3) = 0
         +0 > F. 1:1(0) ack 1
         +0 < F. 1:1(0) ack 2 win 92
         +0 > .  2:2(0) ack 2
      
         +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
         +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
         +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
         +0 setsockopt(4, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
       +.01 connect(4, ..., ...) = 0
         +0 setsockopt(4, SOL_TCP, TCP_KEEPIDLE, [5], 4) = 0
         +10 close(4) = 0
      
      `echo 1 >/proc/sys/net/ipv4/tcp_timestamps`
      
      Fixes: 19f6d3f3 ("net/tcp-fastopen: Add new API support")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: remove extra POLL_OUT added for finished active connect() · d06c3583
      Committed by Neal Cardwell
      Commit 45f119bf ("tcp: remove header prediction") introduced a
      minor bug: the sk_state_change() and sk_wake_async() notifications for
      a completed active connection happen twice: once in this new spot
      inside tcp_finish_connect() and once in the existing code in
      tcp_rcv_synsent_state_process() immediately after it calls
      tcp_finish_connect(). This commit removes the duplicate POLL_OUT
      notifications.
      
      Fixes: 45f119bf ("tcp: remove header prediction")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Introduce ipip_offload_init helper function. · 93b1b31f
      Committed by Tonghao Zhang
      Add a helper to initialize ipip offload. It checks the return
      value and prints a KERN_CRIT message on failure.

      Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 03 Aug, 2017 3 commits