1. 27 Apr 2018, 1 commit
  2. 16 Apr 2018, 1 commit
  3. 13 Apr 2018, 1 commit
    • tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets · 72123032
      Authored by Eric Dumazet
      syzbot/KMSAN reported an uninit-value in tcp_parse_options() [1]

      I believe this was caused by a TCP_MD5SIG being set on a live
      flow.

      This is highly unexpected, since TCP option space is limited.

      For instance, the presence of the TCP MD5 option automatically disables
      the TCP timestamp option at SYN/SYNACK time, which we cannot do
      once the flow has been established.

      Really, adding/deleting an MD5 key only makes sense on sockets
      in CLOSE or LISTEN state.
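
      A minimal sketch of the guard this implies in the TCP_MD5SIG /
      TCP_MD5SIG_EXT setsockopt path (the exact placement inside
      do_tcp_setsockopt() is an assumption here, not a quote of the patch):

          case TCP_MD5SIG:
          case TCP_MD5SIG_EXT:
                  /* Sketch: only accept MD5 key changes while the socket
                   * is still in CLOSE or LISTEN state. */
                  if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
                          err = tp->af_specific->md5_parse(sk, optname,
                                                           optval, optlen);
                  else
                          err = -EINVAL;
                  break;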
      
      [1]
      BUG: KMSAN: uninit-value in tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720
      CPU: 1 PID: 6177 Comm: syzkaller192004 Not tainted 4.16.0+ #83
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x185/0x1d0 lib/dump_stack.c:53
       kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
       __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
       tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720
       tcp_fast_parse_options net/ipv4/tcp_input.c:3858 [inline]
       tcp_validate_incoming+0x4f1/0x2790 net/ipv4/tcp_input.c:5184
       tcp_rcv_established+0xf60/0x2bb0 net/ipv4/tcp_input.c:5453
       tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469
       sk_backlog_rcv include/net/sock.h:908 [inline]
       __release_sock+0x2d6/0x680 net/core/sock.c:2271
       release_sock+0x97/0x2a0 net/core/sock.c:2786
       tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464
       inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
       sock_sendmsg_nosec net/socket.c:630 [inline]
       sock_sendmsg net/socket.c:640 [inline]
       SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
       SyS_sendto+0x8a/0xb0 net/socket.c:1715
       do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      RIP: 0033:0x448fe9
      RSP: 002b:00007fd472c64d38 EFLAGS: 00000216 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 00000000006e5a30 RCX: 0000000000448fe9
      RDX: 000000000000029f RSI: 0000000020a88f88 RDI: 0000000000000004
      RBP: 00000000006e5a34 R08: 0000000020e68000 R09: 0000000000000010
      R10: 00000000200007fd R11: 0000000000000216 R12: 0000000000000000
      R13: 00007fff074899ef R14: 00007fd472c659c0 R15: 0000000000000009
      
      Uninit was created at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
       kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
       kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
       kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
       slab_post_alloc_hook mm/slab.h:445 [inline]
       slab_alloc_node mm/slub.c:2737 [inline]
       __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
       __kmalloc_reserve net/core/skbuff.c:138 [inline]
       __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
       alloc_skb include/linux/skbuff.h:984 [inline]
       tcp_send_ack+0x18c/0x910 net/ipv4/tcp_output.c:3624
       __tcp_ack_snd_check net/ipv4/tcp_input.c:5040 [inline]
       tcp_ack_snd_check net/ipv4/tcp_input.c:5053 [inline]
       tcp_rcv_established+0x2103/0x2bb0 net/ipv4/tcp_input.c:5469
       tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469
       sk_backlog_rcv include/net/sock.h:908 [inline]
       __release_sock+0x2d6/0x680 net/core/sock.c:2271
       release_sock+0x97/0x2a0 net/core/sock.c:2786
       tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464
       inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
       sock_sendmsg_nosec net/socket.c:630 [inline]
       sock_sendmsg net/socket.c:640 [inline]
       SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
       SyS_sendto+0x8a/0xb0 net/socket.c:1715
       do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      Fixes: cfb6eeb4 ("[TCP]: MD5 Signature Option (RFC2385) support.")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 30 Mar 2018, 1 commit
    • bpf: sockmap redirect ingress support · 8934ce2f
      Authored by John Fastabend
      Add support for the BPF_F_INGRESS flag in the sk_msg redirect helper.
      To do this, add a scatterlist ring that receiving socks check
      before calling into the regular recvmsg call path. Additionally, because
      the poll wakeup logic only checked the skb receive queue, we need to
      add a hook in the TCP stack (similar to the write side) so that we have
      a way to wake up polling socks when a scatterlist is redirected
      to that sock.

      After this, all that is needed is for the redirect helper to
      push the scatterlist into the psock receive queue.
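
      A hedged sketch of a sk_msg program using the flag (the map name, key,
      and program body are illustrative; bpf_helpers.h is assumed to come
      from tools/testing/selftests/bpf of that era):

          #include <linux/bpf.h>
          #include "bpf_helpers.h"

          struct bpf_map_def SEC("maps") sock_map = {
                  .type        = BPF_MAP_TYPE_SOCKMAP,
                  .key_size    = sizeof(int),
                  .value_size  = sizeof(int),
                  .max_entries = 1,
          };

          SEC("sk_msg")
          int msg_redirect_ingress(struct sk_msg_md *msg)
          {
                  /* BPF_F_INGRESS queues the data on the target socket's
                   * psock ingress (receive) queue instead of its send path. */
                  return bpf_msg_redirect_map(msg, &sock_map, 0, BPF_F_INGRESS);
          }

          char _license[] SEC("license") = "GPL";
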
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  5. 20 Mar 2018, 1 commit
  6. 17 Mar 2018, 1 commit
  7. 08 Mar 2018, 1 commit
  8. 05 Mar 2018, 2 commits
  9. 22 Feb 2018, 4 commits
  10. 12 Feb 2018, 1 commit
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Authored by Linus Torvalds
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 30 Jan 2018, 1 commit
  12. 26 Jan 2018, 2 commits
    • bpf: Add BPF_SOCK_OPS_STATE_CB · d4487491
      Authored by Lawrence Brakmo
      Adds support for calling a sock_ops BPF program when there is a TCP state
      change. Two arguments are used; one for the old state and another for
      the new state.

      There is a new enum in include/uapi/linux/bpf.h that exports the TCP
      states, prepending BPF_ to the current TCP state names. If it is ever
      necessary to change the internal TCP state values (other than adding
      more to the end), then it will become necessary to convert from the
      internal TCP state value to the BPF value before calling the BPF
      sock_ops function. A set of compile-time checks is added in tcp.c
      to detect whether the internal and BPF values differ, so we can make the
      necessary fixes.
      
      New op: BPF_SOCK_OPS_STATE_CB.
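
      A hedged sketch of a sock_ops program consuming the new callback (the
      program body is illustrative; note the callback must first be enabled
      via bpf_sock_ops_cb_flags_set() with BPF_SOCK_OPS_STATE_CB_FLAG from an
      earlier op):

          #include <linux/bpf.h>
          #include "bpf_helpers.h"

          SEC("sockops")
          int log_state_change(struct bpf_sock_ops *skops)
          {
                  char fmt[] = "tcp state %d -> %d\n";

                  /* args[0] carries the old TCP state, args[1] the new
                   * state, both as BPF_TCP_* enum values. */
                  if (skops->op == BPF_SOCK_OPS_STATE_CB)
                          bpf_trace_printk(fmt, sizeof(fmt),
                                           skops->args[0], skops->args[1]);
                  return 1;
          }

          char _license[] SEC("license") = "GPL";
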
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Support passing args to sock_ops bpf function · de525be2
      Authored by Lawrence Brakmo
      Adds support for passing up to 4 arguments to sock_ops bpf functions. It
      reuses the reply union, so the bpf_sock_ops structures are not
      increased in size.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  13. 25 Jan 2018, 1 commit
    • net: tcp: close sock if net namespace is exiting · 4ee806d5
      Authored by Dan Streetman
      When a tcp socket is closed, if it detects that its net namespace is
      exiting, close immediately and do not wait for the FIN sequence.

      For normal sockets, a reference is taken to their net namespace, so it will
      never exit while the socket is open.  However, kernel sockets do not take a
      reference to their net namespace, so it may begin exiting while the kernel
      socket is still open.  In this case, if the kernel socket is a tcp socket,
      it will stay open trying to complete its close sequence.  The sock's dst(s)
      hold a reference to their interface, and these are all transferred to the
      namespace's loopback interface when the real interfaces are taken down.
      When the namespace tries to take down its loopback interface, it hangs
      waiting for all references to the loopback interface to release, which
      results in messages like:
      
      unregister_netdevice: waiting for lo to become free. Usage count = 1
      
      These messages continue until the socket finally times out and closes.
      Since the net namespace cleanup holds the net_mutex while calling its
      registered pernet callbacks, any new net namespace initialization is
      blocked until the current net namespace finishes exiting.
      
      After this change, the tcp socket notices the exiting net namespace, and
      closes immediately, releasing its dst(s) and their reference to the
      loopback interface, which lets the net namespace continue exiting.
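
      A minimal sketch of the idea, assuming the check uses check_net() (which
      tests whether the netns refcount has already dropped to zero) somewhere
      in the TCP timeout/close path; the function name is illustrative:

          /* Sketch: give up on the graceful close sequence when the
           * socket's net namespace is already exiting. */
          static bool tcp_close_if_netns_exiting(struct sock *sk)
          {
                  if (!check_net(sock_net(sk))) {
                          /* Namespace is going away: skip the FIN
                           * handshake, close now, release dst refs. */
                          tcp_set_state(sk, TCP_CLOSE);
                          return true;
                  }
                  return false;
          }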
      
      Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
      Signed-off-by: Dan Streetman <ddstreet@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 11 Jan 2018, 1 commit
  15. 06 Jan 2018, 1 commit
  16. 28 Dec 2017, 3 commits
  17. 21 Dec 2017, 2 commits
  18. 08 Dec 2017, 1 commit
  19. 03 Dec 2017, 1 commit
  20. 28 Nov 2017, 1 commit
  21. 11 Nov 2017, 2 commits
  22. 10 Nov 2017, 1 commit
  23. 05 Nov 2017, 1 commit
    • tcp: higher throughput under reordering with adaptive RACK reordering wnd · 1f255691
      Authored by Priyaranjan Jha
      Currently, TCP RACK loss detection does not work well if packets are
      being reordered beyond its static reordering window (min_rtt/4). Under
      such reordering it may falsely trigger loss recoveries and reduce TCP
      throughput significantly.

      This patch improves that by increasing and reducing the reordering
      window based on DSACK, which is now supported in major TCP implementations.
      It makes RACK's reo_wnd adaptive, based on DSACK and the number of
      recoveries (a code sketch of the resulting computation follows the list):
      
      - If DSACK is received, increment reo_wnd by min_rtt/4 (upper bounded
        by srtt), since there is a possibility that the spurious retransmission
        was due to a reordering delay longer than reo_wnd.

      - Persist the current reo_wnd value for TCP_RACK_RECOVERY_THRESH (16)
        successful recoveries (this accounts for a full DSACK-based loss
        recovery undo). After that, reset it to the default (min_rtt/4).

      - reo_wnd is incremented at most once per RTT, so that the new DSACK
        we are reacting to is (approximately) due to a spurious retx that
        occurred after the last reo_wnd update.

      - reo_wnd is tracked in steps (of min_rtt/4) rather than as an
        absolute value, to account for changes in rtt.
      
      In our internal testing, we observed a significant increase in throughput
      in scenarios where reordering exceeds min_rtt/4 (the previous static value).
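
      A hedged sketch of the resulting window computation (names are
      illustrative, not the exact kernel symbols):

          /* Adaptive RACK reordering window, in microseconds.
           * reo_wnd_steps starts at 1, is bumped at most once per RTT
           * when a DSACK arrives, and resets after 16 clean recoveries. */
          static u32 rack_reo_wnd_sketch(u32 min_rtt_us, u32 srtt_us,
                                         u32 reo_wnd_steps)
          {
                  u32 wnd = (min_rtt_us / 4) * reo_wnd_steps;

                  /* Upper bound the window by the smoothed RTT. */
                  return wnd < srtt_us ? wnd : srtt_us;
          }
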
      Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 28 Oct 2017, 2 commits
  25. 27 Oct 2017, 1 commit
  26. 26 Oct 2017, 1 commit
  27. 24 Oct 2017, 2 commits
  28. 20 Oct 2017, 1 commit
  29. 07 Oct 2017, 1 commit
    • tcp: implement rb-tree based retransmit queue · 75c119af
      Authored by Eric Dumazet
      Using a linear list to store all skbs in the write queue has been okay
      for quite a while: O(N) is not too bad when N < 500.

      Things get messy when N is on the order of 100,000: modern TCP stacks
      want 10Gbit+ of throughput even on flows with 200 ms RTT.

      At 40 ns per cache line miss, a full scan can take 4 ms,
      blowing away CPU caches.

      SACK processing can often use various hints to avoid parsing the
      whole retransmit queue. But with high packet losses and/or high
      reordering, hints no longer work.

      The sender has to process thousands of unfriendly SACKs, accumulating
      a huge socket backlog, burning a cpu and massively dropping packets.

      Using an rb-tree for the retransmit queue has been avoided for years
      because it added complexity and overhead, but now is the time
      to be more resistant and say no to quadratic behavior.
      
      1) The RTX queue is no longer part of the write queue: already-sent skbs
      are stored in their own rb-tree, ordered by start sequence number.

      2) Since reaching the head of the write queue no longer needs
      sk->sk_send_head, we added a union of sk_send_head and tcp_rtx_queue.
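
      A hedged sketch of the rb-tree insertion this implies, keyed by the
      skb's starting sequence number (close in spirit to, but not a quote of,
      the patch):

          static void tcp_rtx_queue_insert_sketch(struct rb_root *root,
                                                  struct sk_buff *skb)
          {
                  struct rb_node **p = &root->rb_node;
                  struct rb_node *parent = NULL;
                  struct sk_buff *skb1;

                  /* Walk to the leaf position ordered by start seq, so
                   * SACK processing can locate a range in O(log N). */
                  while (*p) {
                          parent = *p;
                          skb1 = rb_entry(parent, struct sk_buff, rbnode);
                          if (before(TCP_SKB_CB(skb)->seq,
                                     TCP_SKB_CB(skb1)->seq))
                                  p = &parent->rb_left;
                          else
                                  p = &parent->rb_right;
                  }
                  rb_link_node(&skb->rbnode, parent, p);
                  rb_insert_color(&skb->rbnode, root);
          }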
      
      Tested:
      
       On receiver:
       netem on ingress: delay 150ms 200us loss 1
       GRO disabled to force stress and SACK storms.
      
      for f in `seq 1 10`
      do
       ./netperf -H lpaa6 -l30 -- -K bbr -o THROUGHPUT|tail -1
      done | awk '{print $0} {sum += $0} END {printf "%7u\n",sum}'
      
      Before patch :
      
      323.87
      351.48
      339.59
      338.62
      306.72
      204.07
      304.93
      291.88
      202.47
      176.88
         2840
      
      After patch:
      
      1700.83
      2207.98
      2070.17
      1544.26
      2114.76
      2124.89
      1693.14
      1080.91
      2216.82
      1299.94
        18053
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>