1. 04 9月, 2017 1 次提交
    • J
      Revert "net: use lib/percpu_counter API for fragmentation mem accounting" · fb452a1a
      Jesper Dangaard Brouer 提交于
      This reverts commit 6d7b857d.
      
      There is a bug in fragmentation codes use of the percpu_counter API,
      that can cause issues on systems with many CPUs.
      
      The frag_mem_limit() just reads the global counter (fbc->count),
      without considering other CPUs can have upto batch size (130K) that
      haven't been subtracted yet.  Due to the 3MBytes lower thresh limit,
      this become dangerous at >=24 CPUs (3*1024*1024/130000=24).
      
      The correct API usage would be to use __percpu_counter_compare() which
      does the right thing, and takes into account the number of (online)
      CPUs and batch size, to account for this and call __percpu_counter_sum()
      when needed.
      
      We choose to revert the use of the lib/percpu_counter API for frag
      memory accounting for several reasons:
      
      1) On systems with CPUs > 24, the heavier fully locked
         __percpu_counter_sum() is always invoked, which will be more
         expensive than the atomic_t that is reverted to.
      
      Given systems with more than 24 CPUs are becoming common this doesn't
      seem like a good option.  To mitigate this, the batch size could be
      decreased and thresh be increased.
      
      2) The add_frag_mem_limit+sub_frag_mem_limit pairs happen on the RX
         CPU, before SKBs are pushed into sockets on remote CPUs.  Given
         NICs can only hash on L2 part of the IP-header, the NIC-RXq's will
         likely be limited.  Thus, a fair chance that atomic add+dec happen
         on the same CPU.
      
      Revert note that commit 1d6119ba ("net: fix percpu memory leaks")
      removed init_frag_mem_limit() and instead use inet_frags_init_net().
      After this revert, inet_frags_uninit_net() becomes empty.
      
      Fixes: 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting")
      Fixes: 1d6119ba ("net: fix percpu memory leaks")
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fb452a1a
  2. 02 9月, 2017 1 次提交
  3. 26 8月, 2017 2 次提交
    • P
      udp6: set rx_dst_cookie on rx_dst updates · 64f0f5d1
      Paolo Abeni 提交于
      Currently, in the udp6 code, the dst cookie is not initialized/updated
      concurrently with the RX dst used by early demux.
      
      As a result, the dst_check() in the early_demux path always fails,
      the rx dst cache is always invalidated, and we can't really
      leverage significant gain from the demux lookup.
      
      Fix it adding udp6 specific variant of sk_rx_dst_set() and use it
      to set the dst cookie when the dst entry is really changed.
      
      The issue is there since the introduction of early demux for ipv6.
      
      Fixes: 5425077d ("net: ipv6: Add early demux handler for UDP unicast")
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      64f0f5d1
    • S
      tcp: fix refcnt leak with ebpf congestion control · ebfa00c5
      Sabrina Dubroca 提交于
      There are a few bugs around refcnt handling in the new BPF congestion
      control setsockopt:
      
       - The new ca is assigned to icsk->icsk_ca_ops even in the case where we
         cannot get a reference on it. This would lead to a use after free,
         since that ca is going away soon.
      
       - Changing the congestion control case doesn't release the refcnt on
         the previous ca.
      
       - In the reinit case, we first leak a reference on the old ca, then we
         call tcp_reinit_congestion_control on the ca that we have just
         assigned, leading to deinitializing the wrong ca (->release of the
         new ca on the old ca's data) and releasing the refcount on the ca
         that we actually want to use.
      
      This is visible by building (for example) BIC as a module and setting
      net.ipv4.tcp_congestion_control=bic, and using tcp_cong_kern.c from
      samples/bpf.
      
      This patch fixes the refcount issues, and moves reinit back into tcp
      core to avoid passing a ca pointer back to BPF.
      
      Fixes: 91b5b21c ("bpf: Add support for changing congestion control")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Acked-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ebfa00c5
  4. 25 8月, 2017 2 次提交
  5. 19 8月, 2017 3 次提交
    • N
      tcp: when rearming RTO, if RTO time is in past then fire RTO ASAP · cdbeb633
      Neal Cardwell 提交于
      In some situations tcp_send_loss_probe() can realize that it's unable
      to send a loss probe (TLP), and falls back to calling tcp_rearm_rto()
      to schedule an RTO timer. In such cases, sometimes tcp_rearm_rto()
      realizes that the RTO was eligible to fire immediately or at some
      point in the past (delta_us <= 0). Previously in such cases
      tcp_rearm_rto() was scheduling such "overdue" RTOs to happen at now +
      icsk_rto, which caused needless delays of hundreds of milliseconds
      (and non-linear behavior that made reproducible testing
      difficult). This commit changes the logic to schedule "overdue" RTOs
      ASAP, rather than at now + icsk_rto.
      
      Fixes: 6ba8a3b1 ("tcp: Tail loss probe (TLP)")
      Suggested-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cdbeb633
    • R
      net: check and errout if res->fi is NULL when RTM_F_FIB_MATCH is set · bc3aae2b
      Roopa Prabhu 提交于
      Syzkaller hit 'general protection fault in fib_dump_info' bug on
      commit 4.13-rc5..
      
      Guilty file: net/ipv4/fib_semantics.c
      
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Modules linked in:
      CPU: 0 PID: 2808 Comm: syz-executor0 Not tainted 4.13.0-rc5 #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      Ubuntu-1.8.2-1ubuntu1 04/01/2014
      task: ffff880078562700 task.stack: ffff880078110000
      RIP: 0010:fib_dump_info+0x388/0x1170 net/ipv4/fib_semantics.c:1314
      RSP: 0018:ffff880078117010 EFLAGS: 00010206
      RAX: dffffc0000000000 RBX: 00000000000000fe RCX: 0000000000000002
      RDX: 0000000000000006 RSI: ffff880078117084 RDI: 0000000000000030
      RBP: ffff880078117268 R08: 000000000000000c R09: ffff8800780d80c8
      R10: 0000000058d629b4 R11: 0000000067fce681 R12: 0000000000000000
      R13: ffff8800784bd540 R14: ffff8800780d80b5 R15: ffff8800780d80a4
      FS:  00000000022fa940(0000) GS:ffff88007fc00000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000004387d0 CR3: 0000000079135000 CR4: 00000000000006f0
      Call Trace:
        inet_rtm_getroute+0xc89/0x1f50 net/ipv4/route.c:2766
        rtnetlink_rcv_msg+0x288/0x680 net/core/rtnetlink.c:4217
        netlink_rcv_skb+0x340/0x470 net/netlink/af_netlink.c:2397
        rtnetlink_rcv+0x28/0x30 net/core/rtnetlink.c:4223
        netlink_unicast_kernel net/netlink/af_netlink.c:1265 [inline]
        netlink_unicast+0x4c4/0x6e0 net/netlink/af_netlink.c:1291
        netlink_sendmsg+0x8c4/0xca0 net/netlink/af_netlink.c:1854
        sock_sendmsg_nosec net/socket.c:633 [inline]
        sock_sendmsg+0xca/0x110 net/socket.c:643
        ___sys_sendmsg+0x779/0x8d0 net/socket.c:2035
        __sys_sendmsg+0xd1/0x170 net/socket.c:2069
        SYSC_sendmsg net/socket.c:2080 [inline]
        SyS_sendmsg+0x2d/0x50 net/socket.c:2076
        entry_SYSCALL_64_fastpath+0x1a/0xa5
        RIP: 0033:0x4512e9
        RSP: 002b:00007ffc75584cc8 EFLAGS: 00000216 ORIG_RAX:
        000000000000002e
        RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00000000004512e9
        RDX: 0000000000000000 RSI: 0000000020f2cfc8 RDI: 0000000000000003
        RBP: 000000000000000e R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000216 R12: fffffffffffffffe
        R13: 0000000000718000 R14: 0000000020c44ff0 R15: 0000000000000000
        Code: 00 0f b6 8d ec fd ff ff 48 8b 85 f0 fd ff ff 88 48 17 48 8b 45
        28 48 8d 78 30 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03
        <0f>
        b6 04 02 84 c0 74 08 3c 03 0f 8e cb 0c 00 00 48 8b 45 28 44
        RIP: fib_dump_info+0x388/0x1170 net/ipv4/fib_semantics.c:1314 RSP:
        ffff880078117010
      ---[ end trace 254a7af28348f88b ]---
      
      This patch adds a res->fi NULL check.
      
      example run:
      $ip route get 0.0.0.0 iif virt1-0
      broadcast 0.0.0.0 dev lo
          cache <local,brd> iif virt1-0
      
      $ip route get 0.0.0.0 iif virt1-0 fibmatch
      RTNETLINK answers: No route to host
      Reported-by: Nidaifish <idaifish@gmail.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Fixes: b6179813 ("net: ipv4: RTM_GETROUTE: return matched fib result when requested")
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bc3aae2b
    • M
      datagram: When peeking datagrams with offset < 0 don't skip empty skbs · a0917e0b
      Matthew Dawson 提交于
      Due to commit e6afc8ac ("udp: remove
      headers from UDP packets before queueing"), when udp packets are being
      peeked the requested extra offset is always 0 as there is no need to skip
      the udp header.  However, when the offset is 0 and the next skb is
      of length 0, it is only returned once.  The behaviour can be seen with
      the following python script:
      
      from socket import *;
      f=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
      g=socket(AF_INET6, SOCK_DGRAM | SOCK_NONBLOCK, 0);
      f.bind(('::', 0));
      addr=('::1', f.getsockname()[1]);
      g.sendto(b'', addr)
      g.sendto(b'b', addr)
      print(f.recvfrom(10, MSG_PEEK));
      print(f.recvfrom(10, MSG_PEEK));
      
      Where the expected output should be the empty string twice.
      
      Instead, make sk_peek_offset return negative values, and pass those values
      to __skb_try_recv_datagram/__skb_try_recv_from_queue.  If the passed offset
      to __skb_try_recv_from_queue is negative, the checked skb is never skipped.
      __skb_try_recv_from_queue will then ensure the offset is reset back to 0
      if a peek is requested without an offset, unless no packets are found.
      
      Also simplify the if condition in __skb_try_recv_from_queue.  If _off is
      greater then 0, and off is greater then or equal to skb->len, then
      (_off || skb->len) must always be true assuming skb->len >= 0 is always
      true.
      
      Also remove a redundant check around a call to sk_peek_offset in af_unix.c,
      as it double checked if MSG_PEEK was set in the flags.
      
      V2:
       - Moved the negative fixup into __skb_try_recv_from_queue, and remove now
      redundant checks
       - Fix peeking in udp{,v6}_recvmsg to report the right value when the
      offset is 0
      
      V3:
       - Marked new branch in __skb_try_recv_from_queue as unlikely.
      Signed-off-by: NMatthew Dawson <matthew@mjdsystems.ca>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a0917e0b
  6. 17 8月, 2017 2 次提交
  7. 16 8月, 2017 1 次提交
  8. 15 8月, 2017 3 次提交
    • E
      tcp: fix possible deadlock in TCP stack vs BPF filter · d624d276
      Eric Dumazet 提交于
      Filtering the ACK packet was not put at the right place.
      
      At this place, we already allocated a child and put it
      into accept queue.
      
      We absolutely need to call tcp_child_process() to release
      its spinlock, or we will deadlock at accept() or close() time.
      
      Found by syzkaller team (Thanks a lot !)
      
      Fixes: 8fac365f ("tcp: Add a tcp_filter hook before handle ack packet")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Chenbo Feng <fengc@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d624d276
    • S
      tcp: ulp: avoid module refcnt leak in tcp_set_ulp · 539a06ba
      Sabrina Dubroca 提交于
      __tcp_ulp_find_autoload returns tcp_ulp_ops after taking a reference on
      the module. Then, if ->init fails, tcp_set_ulp propagates the error but
      nothing releases that reference.
      
      Fixes: 734942cc ("tcp: ULP infrastructure")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      539a06ba
    • F
      ipv4: route: fix inet_rtm_getroute induced crash · 2c87d63a
      Florian Westphal 提交于
      "ip route get $daddr iif eth0 from $saddr" causes:
       BUG: KASAN: use-after-free in ip_route_input_rcu+0x1535/0x1b50
       Call Trace:
        ip_route_input_rcu+0x1535/0x1b50
        ip_route_input_noref+0xf9/0x190
        tcp_v4_early_demux+0x1a4/0x2b0
        ip_rcv+0xbcb/0xc05
        __netif_receive_skb+0x9c/0xd0
        netif_receive_skb_internal+0x5a8/0x890
      
      Problem is that inet_rtm_getroute calls either ip_route_input_rcu (if an
      iif was provided) or ip_route_output_key_hash_rcu.
      
      But ip_route_input_rcu, unlike ip_route_output_key_hash_rcu, already
      associates the dst_entry with the skb.  This clears the SKB_DST_NOREF
      bit (i.e. skb_dst_drop will release/free the entry while it should not).
      
      Thus only set the dst if we called ip_route_output_key_hash_rcu().
      
      I tested this patch by running:
       while true;do ip r get 10.0.1.2;done > /dev/null &
       while true;do ip r get 10.0.1.2 iif eth0  from 10.0.1.1;done > /dev/null &
      ... and saw no crash or memory leak.
      
      Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
      Cc: David Ahern <dsahern@gmail.com>
      Fixes: ba52d61e ("ipv4: route: restore skb_dst_set in inet_rtm_getroute")
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c87d63a
  9. 11 8月, 2017 1 次提交
  10. 10 8月, 2017 1 次提交
  11. 09 8月, 2017 2 次提交
  12. 07 8月, 2017 1 次提交
  13. 04 8月, 2017 4 次提交
    • N
      tcp: fix xmit timer to only be reset if data ACKed/SACKed · df92c839
      Neal Cardwell 提交于
      Fix a TCP loss recovery performance bug raised recently on the netdev
      list, in two threads:
      
      (i)  July 26, 2017: netdev thread "TCP fast retransmit issues"
      (ii) July 26, 2017: netdev thread:
           "[PATCH V2 net-next] TLP: Don't reschedule PTO when there's one
           outstanding TLP retransmission"
      
      The basic problem is that incoming TCP packets that did not indicate
      forward progress could cause the xmit timer (TLP or RTO) to be rearmed
      and pushed back in time. In certain corner cases this could result in
      the following problems noted in these threads:
      
       - Repeated ACKs coming in with bogus SACKs corrupted by middleboxes
         could cause TCP to repeatedly schedule TLPs forever. We kept
         sending TLPs after every ~200ms, which elicited bogus SACKs, which
         caused more TLPs, ad infinitum; we never fired an RTO to fill in
         the holes.
      
       - Incoming data segments could, in some cases, cause us to reschedule
         our RTO or TLP timer further out in time, for no good reason. This
         could cause repeated inbound data to result in stalls in outbound
         data, in the presence of packet loss.
      
      This commit fixes these bugs by changing the TLP and RTO ACK
      processing to:
      
       (a) Only reschedule the xmit timer once per ACK.
      
       (b) Only reschedule the xmit timer if tcp_clean_rtx_queue() deems the
           ACK indicates sufficient forward progress (a packet was
           cumulatively ACKed, or we got a SACK for a packet that was sent
           before the most recent retransmit of the write queue head).
      
      This brings us back into closer compliance with the RFCs, since, as
      the comment for tcp_rearm_rto() notes, we should only restart the RTO
      timer after forward progress on the connection. Previously we were
      restarting the xmit timer even in these cases where there was no
      forward progress.
      
      As a side benefit, this commit simplifies and speeds up the TCP timer
      arming logic. We had been calling inet_csk_reset_xmit_timer() three
      times on normal ACKs that cumulatively acknowledged some data:
      
      1) Once near the top of tcp_ack() to switch from TLP timer to RTO:
              if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
                     tcp_rearm_rto(sk);
      
      2) Once in tcp_clean_rtx_queue(), to update the RTO:
              if (flag & FLAG_ACKED) {
                     tcp_rearm_rto(sk);
      
      3) Once in tcp_ack() after tcp_fastretrans_alert() to switch from RTO
         to TLP:
              if (icsk->icsk_pending == ICSK_TIME_RETRANS)
                     tcp_schedule_loss_probe(sk);
      
      This commit, by only rescheduling the xmit timer once per ACK,
      simplifies the code and reduces CPU overhead.
      
      This commit was tested in an A/B test with Google web server
      traffic. SNMP stats and request latency metrics were within noise
      levels, substantiating that for normal web traffic patterns this is a
      rare issue. This commit was also tested with packetdrill tests to
      verify that it fixes the timer behavior in the corner cases discussed
      in the netdev threads mentioned above.
      
      This patch is a bug fix patch intended to be queued for -stable
      relases.
      
      Fixes: 6ba8a3b1 ("tcp: Tail loss probe (TLP)")
      Reported-by: NKlavs Klavsen <kl@vsen.dk>
      Reported-by: NMao Wenan <maowenan@huawei.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      df92c839
    • N
      tcp: enable xmit timer fix by having TLP use time when RTO should fire · a2815817
      Neal Cardwell 提交于
      Have tcp_schedule_loss_probe() base the TLP scheduling decision based
      on when the RTO *should* fire. This is to enable the upcoming xmit
      timer fix in this series, where tcp_schedule_loss_probe() cannot
      assume that the last timer installed was an RTO timer (because we are
      no longer doing the "rearm RTO, rearm RTO, rearm TLP" dance on every
      ACK). So tcp_schedule_loss_probe() must independently figure out when
      an RTO would want to fire.
      
      In the new TLP implementation following in this series, we cannot
      assume that icsk_timeout was set based on an RTO; after processing a
      cumulative ACK the icsk_timeout we see can be from a previous TLP or
      RTO. So we need to independently recalculate the RTO time (instead of
      reading it out of icsk_timeout). Removing this dependency on the
      nature of icsk_timeout makes things a little easier to reason about
      anyway.
      
      Note that the old and new code should be equivalent, since they are
      both saying: "if the RTO is in the future, but at an earlier time than
      the normal TLP time, then set the TLP timer to fire when the RTO would
      have fired".
      
      Fixes: 6ba8a3b1 ("tcp: Tail loss probe (TLP)")
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a2815817
    • N
      tcp: introduce tcp_rto_delta_us() helper for xmit timer fix · e1a10ef7
      Neal Cardwell 提交于
      Pure refactor. This helper will be required in the xmit timer fix
      later in the patch series. (Because the TLP logic will want to make
      this calculation.)
      
      Fixes: 6ba8a3b1 ("tcp: Tail loss probe (TLP)")
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e1a10ef7
    • E
      net: fix keepalive code vs TCP_FASTOPEN_CONNECT · 2dda6400
      Eric Dumazet 提交于
      syzkaller was able to trigger a divide by 0 in TCP stack [1]
      
      Issue here is that keepalive timer needs to be updated to not attempt
      to send a probe if the connection setup was deferred using
      TCP_FASTOPEN_CONNECT socket option added in linux-4.11
      
      [1]
       divide error: 0000 [#1] SMP
       CPU: 18 PID: 0 Comm: swapper/18 Not tainted
       task: ffff986f62f4b040 ti: ffff986f62fa2000 task.ti: ffff986f62fa2000
       RIP: 0010:[<ffffffff8409cc0d>]  [<ffffffff8409cc0d>] __tcp_select_window+0x8d/0x160
       Call Trace:
        <IRQ>
        [<ffffffff8409d951>] tcp_transmit_skb+0x11/0x20
        [<ffffffff8409da21>] tcp_xmit_probe_skb+0xc1/0xe0
        [<ffffffff840a0ee8>] tcp_write_wakeup+0x68/0x160
        [<ffffffff840a151b>] tcp_keepalive_timer+0x17b/0x230
        [<ffffffff83b3f799>] call_timer_fn+0x39/0xf0
        [<ffffffff83b40797>] run_timer_softirq+0x1d7/0x280
        [<ffffffff83a04ddb>] __do_softirq+0xcb/0x257
        [<ffffffff83ae03ac>] irq_exit+0x9c/0xb0
        [<ffffffff83a04c1a>] smp_apic_timer_interrupt+0x6a/0x80
        [<ffffffff83a03eaf>] apic_timer_interrupt+0x7f/0x90
        <EOI>
        [<ffffffff83fed2ea>] ? cpuidle_enter_state+0x13a/0x3b0
        [<ffffffff83fed2cd>] ? cpuidle_enter_state+0x11d/0x3b0
      
      Tested:
      
      Following packetdrill no longer crashes the kernel
      
      `echo 0 >/proc/sys/net/ipv4/tcp_timestamps`
      
      // Cache warmup: send a Fast Open cookie request
          0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
         +0 setsockopt(3, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
         +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation is now in progress)
         +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
       +.01 < S. 123:123(0) ack 1 win 14600 <mss 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
         +0 > . 1:1(0) ack 1
         +0 close(3) = 0
         +0 > F. 1:1(0) ack 1
         +0 < F. 1:1(0) ack 2 win 92
         +0 > .  2:2(0) ack 2
      
         +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
         +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
         +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
         +0 setsockopt(4, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
       +.01 connect(4, ..., ...) = 0
         +0 setsockopt(4, SOL_TCP, TCP_KEEPIDLE, [5], 4) = 0
         +10 close(4) = 0
      
      `echo 1 >/proc/sys/net/ipv4/tcp_timestamps`
      
      Fixes: 19f6d3f3 ("net/tcp-fastopen: Add new API support")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2dda6400
  14. 03 8月, 2017 1 次提交
  15. 02 8月, 2017 2 次提交
  16. 01 8月, 2017 2 次提交
    • I
      ipv4: fib: Fix NULL pointer deref during fib_sync_down_dev() · 71ed7ee3
      Ido Schimmel 提交于
      Michał reported a NULL pointer deref during fib_sync_down_dev() when
      unregistering a netdevice. The problem is that we don't check for
      'in_dev' being NULL, which can happen in very specific cases.
      
      Usually routes are flushed upon NETDEV_DOWN sent in either the netdev or
      the inetaddr notification chains. However, if an interface isn't
      configured with any IP address, then it's possible for host routes to be
      flushed following NETDEV_UNREGISTER, after NULLing dev->ip_ptr in
      inetdev_destroy().
      
      To reproduce:
      $ ip link add type dummy
      $ ip route add local 1.1.1.0/24 dev dummy0
      $ ip link del dev dummy0
      
      Fix this by checking for the presence of 'in_dev' before referencing it.
      
      Fixes: 982acb97 ("ipv4: fib: Notify about nexthop status changes")
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Reported-by: NMichał Mirosław <mirq-linux@rere.qmqm.pl>
      Tested-by: NMichał Mirosław <mirq-linux@rere.qmqm.pl>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      71ed7ee3
    • T
      netfilter: x_tables: Fix use-after-free in ipt_do_table. · 9beceb54
      Taehee Yoo 提交于
      If verdict is NF_STOLEN in the SYNPROXY target,
      the skb is consumed.
      However, ipt_do_table() always tries to get ip header from the skb.
      So that, KASAN triggers the use-after-free message.
      
      We can reproduce this message using below command.
        # iptables -I INPUT -p tcp -j SYNPROXY --mss 1460
      
      [ 193.542265] BUG: KASAN: use-after-free in ipt_do_table+0x1405/0x1c10
      [ ... ]
      [ 193.578603] Call Trace:
      [ 193.581590] <IRQ>
      [ 193.584107] dump_stack+0x68/0xa0
      [ 193.588168] print_address_description+0x78/0x290
      [ 193.593828] ? ipt_do_table+0x1405/0x1c10
      [ 193.598690] kasan_report+0x230/0x340
      [ 193.603194] __asan_report_load2_noabort+0x19/0x20
      [ 193.608950] ipt_do_table+0x1405/0x1c10
      [ 193.613591] ? rcu_read_lock_held+0xae/0xd0
      [ 193.618631] ? ip_route_input_rcu+0x27d7/0x4270
      [ 193.624348] ? ipt_do_table+0xb68/0x1c10
      [ 193.629124] ? do_add_counters+0x620/0x620
      [ 193.634234] ? iptable_filter_net_init+0x60/0x60
      [ ... ]
      
      After this patch, only when verdict is XT_CONTINUE,
      ipt_do_table() tries to get ip header.
      Also arpt_do_table() is modified because it has same bug.
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      9beceb54
  17. 30 7月, 2017 2 次提交
    • A
      tcp: avoid bogus gcc-7 array-bounds warning · efe967cd
      Arnd Bergmann 提交于
      When using CONFIG_UBSAN_SANITIZE_ALL, the TCP code produces a
      false-positive warning:
      
      net/ipv4/tcp_output.c: In function 'tcp_connect':
      net/ipv4/tcp_output.c:2207:40: error: array subscript is below array bounds [-Werror=array-bounds]
         tp->chrono_stat[tp->chrono_type - 1] += now - tp->chrono_start;
                                              ^~
      net/ipv4/tcp_output.c:2207:40: error: array subscript is below array bounds [-Werror=array-bounds]
         tp->chrono_stat[tp->chrono_type - 1] += now - tp->chrono_start;
         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
      
      I have opened a gcc bug for this, but distros have already shipped
      compilers with this problem, and it's not clear yet whether there is
      a way for gcc to avoid the warning. As the problem is related to the
      bitfield access, this introduces a temporary variable to store the old
      enum value.
      
      I did not notice this warning earlier, since UBSAN is disabled when
      building with COMPILE_TEST, and that was always turned on in both
      allmodconfig and randconfig tests.
      
      Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81601Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      efe967cd
    • P
      udp6: fix socket leak on early demux · c9f2c1ae
      Paolo Abeni 提交于
      When an early demuxed packet reaches __udp6_lib_lookup_skb(), the
      sk reference is retrieved and used, but the relevant reference
      count is leaked and the socket destructor is never called.
      Beyond leaking the sk memory, if there are pending UDP packets
      in the receive queue, even the related accounted memory is leaked.
      
      In the long run, this will cause persistent forward allocation errors
      and no UDP skbs (both ipv4 and ipv6) will be able to reach the
      user-space.
      
      Fix this by explicitly accessing the early demux reference before
      the lookup, and properly decreasing the socket reference count
      after usage.
      
      Also drop the skb_steal_sock() in __udp6_lib_lookup_skb(), and
      the now obsoleted comment about "socket cache".
      
      The newly added code is derived from the current ipv4 code for the
      similar path.
      
      v1 -> v2:
        fixed the __udp6_lib_rcv() return code for resubmission,
        as suggested by Eric
      Reported-by: NSam Edwards <CFSworks@gmail.com>
      Reported-by: NMarc Haber <mh+netdev@zugschlus.de>
      Fixes: 5425077d ("net: ipv6: Add early demux handler for UDP unicast")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9f2c1ae
  18. 27 7月, 2017 1 次提交
  19. 26 7月, 2017 1 次提交
    • P
      udp: preserve head state for IP_CMSG_PASSSEC · dce4551c
      Paolo Abeni 提交于
      Paul Moore reported a SELinux/IP_PASSSEC regression
      caused by missing skb->sp at recvmsg() time. We need to
      preserve the skb head state to process the IP_CMSG_PASSSEC
      cmsg.
      
      With this commit we avoid releasing the skb head state in the
      BH even if a secpath is attached to the current skb, and stores
      the skb status (with/without head states) in the scratch area,
      so that we can access it at skb deallocation time, without
      incurring in cache-miss penalties.
      
      This also avoids misusing the skb CB for ipv6 packets,
      as introduced by the commit 0ddf3fb2 ("udp: preserve
      skb->dst if required for IP options processing").
      
      Clean a bit the scratch area helpers implementation, to
      reduce the code differences between 32 and 64 bits build.
      Reported-by: NPaul Moore <paul@paul-moore.com>
      Fixes: 0a463c78 ("udp: avoid a cache miss on dequeue")
      Fixes: 0ddf3fb2 ("udp: preserve skb->dst if required for IP options processing")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Tested-by: NPaul Moore <paul@paul-moore.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dce4551c
  20. 21 7月, 2017 1 次提交
    • M
      ipv4: initialize fib_trie prior to register_netdev_notifier call. · 8799a221
      Mahesh Bandewar 提交于
      Net stack initialization currently initializes fib-trie after the
      first call to netdevice_notifier() call. In fact fib_trie initialization
      needs to happen before first rtnl_register(). It does not cause any problem
      since there are no devices UP at this moment, but trying to bring 'lo'
      UP at initialization would make this assumption wrong and exposes the issue.
      
      Fixes following crash
      
       Call Trace:
        ? alternate_node_alloc+0x76/0xa0
        fib_table_insert+0x1b7/0x4b0
        fib_magic.isra.17+0xea/0x120
        fib_add_ifaddr+0x7b/0x190
        fib_netdev_event+0xc0/0x130
        register_netdevice_notifier+0x1c1/0x1d0
        ip_fib_init+0x72/0x85
        ip_rt_init+0x187/0x1e9
        ip_init+0xe/0x1a
        inet_init+0x171/0x26c
        ? ipv4_offload_init+0x66/0x66
        do_one_initcall+0x43/0x160
        kernel_init_freeable+0x191/0x219
        ? rest_init+0x80/0x80
        kernel_init+0xe/0x150
        ret_from_fork+0x22/0x30
       Code: f6 46 23 04 74 86 4c 89 f7 e8 ae 45 01 00 49 89 c7 4d 85 ff 0f 85 7b ff ff ff 31 db eb 08 4c 89 ff e8 16 47 01 00 48 8b 44 24 38 <45> 8b 6e 14 4d 63 76 74 48 89 04 24 0f 1f 44 00 00 48 83 c4 08
       RIP: kmem_cache_alloc+0xcf/0x1c0 RSP: ffff9b1500017c28
       CR2: 0000000000000014
      
      Fixes: 7b1a74fd ("[NETNS]: Refactor fib initialization so it can handle multiple namespaces.")
      Fixes: 7f9b8052 ("[IPV4]: fib hash|trie initialization")
      Signed-off-by: NMahesh Bandewar <maheshb@google.com>
      Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8799a221
  21. 19 7月, 2017 3 次提交
    • S
      netfilter: ipt_CLUSTERIP: fix use-after-free of proc entry · 3840538a
      Sabrina Dubroca 提交于
      When we delete a netns with a CLUSTERIP rule, clusterip_net_exit() is
      called first, removing /proc/net/ipt_CLUSTERIP.
      Then clusterip_config_entry_put() is called from clusterip_tg_destroy(),
      and tries to remove its entry under /proc/net/ipt_CLUSTERIP/.
      
      Fix this by checking that the parent directory of the entry to remove
      hasn't already been deleted.
      
      The following triggers a KASAN splat (stealing the reproducer from
      202f59af, thanks to Jianlin Shi and Xin Long):
      
          ip netns add test
          ip link add veth0_in type veth peer name veth0_out
          ip link set veth0_in netns test
          ip netns exec test ip link set lo up
          ip netns exec test ip link set veth0_in up
          ip netns exec test iptables -I INPUT -d 1.2.3.4 -i veth0_in -j     \
              CLUSTERIP --new --clustermac 89:d4:47:eb:9a:fa --total-nodes 3 \
              --local-node 1 --hashmode sourceip-sourceport
          ip netns del test
      
      Fixes: ce4ff76c ("netfilter: ipt_CLUSTERIP: make proc directory per net namespace")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      3840538a
    • P
      udp: preserve skb->dst if required for IP options processing · 0ddf3fb2
      Paolo Abeni 提交于
      Eric noticed that in udp_recvmsg() we still need to access
      skb->dst while processing the IP options.
      Since commit 0a463c78 ("udp: avoid a cache miss on dequeue")
      skb->dst is no more available at recvmsg() time and bad things
      will happen if we enter the relevant code path.
      
      This commit address the issue, avoid clearing skb->dst if
      any IP options are present into the relevant skb.
      Since the IP CB is contained in the first skb cacheline, we can
      test it to decide to leverage the consume_stateless_skb()
      optimization, without measurable additional cost in the faster
      path.
      
      v1 -> v2: updated commit message tags
      
      Fixes: 0a463c78 ("udp: avoid a cache miss on dequeue")
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Reported-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0ddf3fb2
    • A
      ipv4: ipv6: initialize treq->txhash in cookie_v[46]_check() · 18bcf290
      Alexander Potapenko 提交于
      KMSAN reported use of uninitialized memory in skb_set_hash_from_sk(),
      which originated from the TCP request socket created in
      cookie_v6_check():
      
       ==================================================================
       BUG: KMSAN: use of uninitialized memory in tcp_transmit_skb+0xf77/0x3ec0
       CPU: 1 PID: 2949 Comm: syz-execprog Not tainted 4.11.0-rc5+ #2931
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
       TCP: request_sock_TCPv6: Possible SYN flooding on port 20028. Sending cookies.  Check SNMP counters.
       Call Trace:
        <IRQ>
        __dump_stack lib/dump_stack.c:16
        dump_stack+0x172/0x1c0 lib/dump_stack.c:52
        kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:927
        __msan_warning_32+0x61/0xb0 mm/kmsan/kmsan_instr.c:469
        skb_set_hash_from_sk ./include/net/sock.h:2011
        tcp_transmit_skb+0xf77/0x3ec0 net/ipv4/tcp_output.c:983
        tcp_send_ack+0x75b/0x830 net/ipv4/tcp_output.c:3493
        tcp_delack_timer_handler+0x9a6/0xb90 net/ipv4/tcp_timer.c:284
        tcp_delack_timer+0x1b0/0x310 net/ipv4/tcp_timer.c:309
        call_timer_fn+0x240/0x520 kernel/time/timer.c:1268
        expire_timers kernel/time/timer.c:1307
        __run_timers+0xc13/0xf10 kernel/time/timer.c:1601
        run_timer_softirq+0x36/0xa0 kernel/time/timer.c:1614
        __do_softirq+0x485/0x942 kernel/softirq.c:284
        invoke_softirq kernel/softirq.c:364
        irq_exit+0x1fa/0x230 kernel/softirq.c:405
        exiting_irq+0xe/0x10 ./arch/x86/include/asm/apic.h:657
        smp_apic_timer_interrupt+0x5a/0x80 arch/x86/kernel/apic/apic.c:966
        apic_timer_interrupt+0x86/0x90 arch/x86/entry/entry_64.S:489
       RIP: 0010:native_restore_fl ./arch/x86/include/asm/irqflags.h:36
       RIP: 0010:arch_local_irq_restore ./arch/x86/include/asm/irqflags.h:77
       RIP: 0010:__msan_poison_alloca+0xed/0x120 mm/kmsan/kmsan_instr.c:440
       RSP: 0018:ffff880024917cd8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
       RAX: 0000000000000246 RBX: ffff8800224c0000 RCX: 0000000000000005
       RDX: 0000000000000004 RSI: ffff880000000000 RDI: ffffea0000b6d770
       RBP: ffff880024917d58 R08: 0000000000000dd8 R09: 0000000000000004
       R10: 0000160000000000 R11: 0000000000000000 R12: ffffffff85abf810
       R13: ffff880024917dd8 R14: 0000000000000010 R15: ffffffff81cabde4
        </IRQ>
        poll_select_copy_remaining+0xac/0x6b0 fs/select.c:293
        SYSC_select+0x4b4/0x4e0 fs/select.c:653
        SyS_select+0x76/0xa0 fs/select.c:634
        entry_SYSCALL_64_fastpath+0x13/0x94 arch/x86/entry/entry_64.S:204
       RIP: 0033:0x4597e7
       RSP: 002b:000000c420037ee0 EFLAGS: 00000246 ORIG_RAX: 0000000000000017
       RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00000000004597e7
       RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
       RBP: 000000c420037ef0 R08: 000000c420037ee0 R09: 0000000000000059
       R10: 0000000000000000 R11: 0000000000000246 R12: 000000000042dc20
       R13: 00000000000000f3 R14: 0000000000000030 R15: 0000000000000003
       chained origin:
        save_stack_trace+0x37/0x40 arch/x86/kernel/stacktrace.c:59
        kmsan_save_stack_with_flags mm/kmsan/kmsan.c:302
        kmsan_save_stack mm/kmsan/kmsan.c:317
        kmsan_internal_chain_origin+0x12a/0x1f0 mm/kmsan/kmsan.c:547
        __msan_store_shadow_origin_4+0xac/0x110 mm/kmsan/kmsan_instr.c:259
        tcp_create_openreq_child+0x709/0x1ae0 net/ipv4/tcp_minisocks.c:472
        tcp_v6_syn_recv_sock+0x7eb/0x2a30 net/ipv6/tcp_ipv6.c:1103
        tcp_get_cookie_sock+0x136/0x5f0 net/ipv4/syncookies.c:212
        cookie_v6_check+0x17a9/0x1b50 net/ipv6/syncookies.c:245
        tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:989
        tcp_v6_do_rcv+0xdd8/0x1c60 net/ipv6/tcp_ipv6.c:1298
        tcp_v6_rcv+0x41a3/0x4f00 net/ipv6/tcp_ipv6.c:1487
        ip6_input_finish+0x82f/0x1ee0 net/ipv6/ip6_input.c:279
        NF_HOOK ./include/linux/netfilter.h:257
        ip6_input+0x239/0x290 net/ipv6/ip6_input.c:322
        dst_input ./include/net/dst.h:492
        ip6_rcv_finish net/ipv6/ip6_input.c:69
        NF_HOOK ./include/linux/netfilter.h:257
        ipv6_rcv+0x1dbd/0x22e0 net/ipv6/ip6_input.c:203
        __netif_receive_skb_core+0x2f6f/0x3a20 net/core/dev.c:4208
        __netif_receive_skb net/core/dev.c:4246
        process_backlog+0x667/0xba0 net/core/dev.c:4866
        napi_poll net/core/dev.c:5268
        net_rx_action+0xc95/0x1590 net/core/dev.c:5333
        __do_softirq+0x485/0x942 kernel/softirq.c:284
       origin:
        save_stack_trace+0x37/0x40 arch/x86/kernel/stacktrace.c:59
        kmsan_save_stack_with_flags mm/kmsan/kmsan.c:302
        kmsan_internal_poison_shadow+0xb1/0x1a0 mm/kmsan/kmsan.c:198
        kmsan_kmalloc+0x7f/0xe0 mm/kmsan/kmsan.c:337
        kmem_cache_alloc+0x1c2/0x1e0 mm/slub.c:2766
        reqsk_alloc ./include/net/request_sock.h:87
        inet_reqsk_alloc+0xa4/0x5b0 net/ipv4/tcp_input.c:6200
        cookie_v6_check+0x4f4/0x1b50 net/ipv6/syncookies.c:169
        tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:989
        tcp_v6_do_rcv+0xdd8/0x1c60 net/ipv6/tcp_ipv6.c:1298
        tcp_v6_rcv+0x41a3/0x4f00 net/ipv6/tcp_ipv6.c:1487
        ip6_input_finish+0x82f/0x1ee0 net/ipv6/ip6_input.c:279
        NF_HOOK ./include/linux/netfilter.h:257
        ip6_input+0x239/0x290 net/ipv6/ip6_input.c:322
        dst_input ./include/net/dst.h:492
        ip6_rcv_finish net/ipv6/ip6_input.c:69
        NF_HOOK ./include/linux/netfilter.h:257
        ipv6_rcv+0x1dbd/0x22e0 net/ipv6/ip6_input.c:203
        __netif_receive_skb_core+0x2f6f/0x3a20 net/core/dev.c:4208
        __netif_receive_skb net/core/dev.c:4246
        process_backlog+0x667/0xba0 net/core/dev.c:4866
        napi_poll net/core/dev.c:5268
        net_rx_action+0xc95/0x1590 net/core/dev.c:5333
        __do_softirq+0x485/0x942 kernel/softirq.c:284
       ==================================================================
      
      Similar error is reported for cookie_v4_check().
      
      Fixes: 58d607d3 ("tcp: provide skb->hash to synack packets")
      Signed-off-by: NAlexander Potapenko <glider@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      18bcf290
  22. 17 7月, 2017 1 次提交
  23. 16 7月, 2017 2 次提交