1. 24 10月, 2017 1 次提交
  2. 22 10月, 2017 2 次提交
    • L
      bpf: Add BPF_SOCKET_OPS_BASE_RTT support to tcp_nv · 85cce215
      Lawrence Brakmo 提交于
      TCP_NV will try to get the base RTT from a socket_ops BPF program if one
      is loaded. NV will then use the base RTT to bound its min RTT (its
      notion of the base RTT). It uses the base RTT as an upper bound and 80%
      of the base RTT as its lower bound.
      
      In other words, NV will consider filtered RTTs larger than base RTT as a
      sign of congestion. As a result, there is no minRTT inflation when there
      is a lot of congestion. For example, in a DC where the RTTs are less
      than 40us when there is no congestion, a base RTT value of 80us improves
      the performance of NV. The difference between the uncongested RTT and
      the base RTT provided represents how much queueing we are willing to
      have (in practice it can be higher).
      
      NV has been tunned to reduce congestion when there are many flows at the
      cost of one flow not achieving full bandwith utilization. When a
      reasonable base RTT is provided, one NV flow can now fully utilize the
      full bandwidth. In addition, the performance is also improved when there
      are many flows.
      
      In the following examples the NV results are using a kernel with this
      patch set (i.e. both NV results are using the new nv_loss_dec_factor).
      
      With one host sending to another host and only one flow the
      goodputs are:
        Cubic: 9.3 Gbps, NV: 5.5 Gbps, NV (baseRTT=80us): 9.2 Gbps
      
      With 2 hosts sending to one host (1 flow per host, the goodput per flow
      is:
        Cubic: 4.6 Gbps, NV: 4.5 Gbps, NV (baseRTT=80us)L 4.6 Gbps
      
      But the RTTs seen by a ping process in the sender is:
        Cubic: 3.3ms  NV: 97us,  NV (baseRTT=80us): 146us
      
      With a lot of flows things look even better for NV with baseRTT. Here we
      have 3 hosts sending to one host. Each sending host has 6 flows: 1
      stream, 4x1MB RPC, 1x10KB RPC. Cubic, NV and NV with baseRTT all fully
      utilize the full available bandwidth. However, the distribution of
      bandwidth among the flows is very different. For the 10KB RPC flow:
        Cubic: 27Mbps, NV: 111Mbps, NV (baseRTT=80us): 222Mbps
      
      The 99% latencies for the 10KB flows are:
        Cubic: 26ms,  NV: 1ms,  NV (baseRTT=80us): 500us
      
      The RTT seen by a ping process at the senders:
        Cubic: 3.2ms  NV: 720us,  NV (baseRTT=80us): 330us
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      85cce215
    • C
      soreuseport: fix initialization race · 1b5f962e
      Craig Gallek 提交于
      Syzkaller stumbled upon a way to trigger
      WARNING: CPU: 1 PID: 13881 at net/core/sock_reuseport.c:41
      reuseport_alloc+0x306/0x3b0 net/core/sock_reuseport.c:39
      
      There are two initialization paths for the sock_reuseport structure in a
      socket: Through the udp/tcp bind paths of SO_REUSEPORT sockets or through
      SO_ATTACH_REUSEPORT_[CE]BPF before bind.  The existing implementation
      assumedthat the socket lock protected both of these paths when it actually
      only protects the SO_ATTACH_REUSEPORT path.  Syzkaller triggered this
      double allocation by running these paths concurrently.
      
      This patch moves the check for double allocation into the reuseport_alloc
      function which is protected by a global spin lock.
      
      Fixes: e32ea7e7 ("soreuseport: fast reuseport UDP socket selection")
      Fixes: c125e80b ("soreuseport: fast reuseport TCP socket selection")
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1b5f962e
  3. 21 10月, 2017 4 次提交
    • M
      udp: make some messages more descriptive · 197df02c
      Matteo Croce 提交于
      In the UDP code there are two leftover error messages with very few meaning.
      Replace them with a more descriptive error message as some users
      reported them as "strange network error".
      Signed-off-by: NMatteo Croce <mcroce@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      197df02c
    • E
      ipv4: ipv4_default_advmss() should use route mtu · 164a5e7a
      Eric Dumazet 提交于
      ipv4_default_advmss() incorrectly uses the device MTU instead
      of the route provided one. IPv6 has the proper behavior,
      lets harmonize the two protocols.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      164a5e7a
    • E
      tcp: fix tcp_send_syn_data() · ba233b34
      Eric Dumazet 提交于
      syn_data was allocated by sk_stream_alloc_skb(), meaning
      its destructor and _skb_refdst fields are mangled.
      
      We need to call tcp_skb_tsorted_anchor_cleanup() before
      calling kfree_skb() or kernel crashes.
      
      Bug was reported by syzkaller bot.
      
      Fixes: e2080072 ("tcp: new list for sent but unacked skbs for RACK recovery")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ba233b34
    • E
      tcp/dccp: fix ireq->opt races · c92e8c02
      Eric Dumazet 提交于
      syzkaller found another bug in DCCP/TCP stacks [1]
      
      For the reasons explained in commit ce105008 ("tcp/dccp: fix
      ireq->pktopts race"), we need to make sure we do not access
      ireq->opt unless we own the request sock.
      
      Note the opt field is renamed to ireq_opt to ease grep games.
      
      [1]
      BUG: KASAN: use-after-free in ip_queue_xmit+0x1687/0x18e0 net/ipv4/ip_output.c:474
      Read of size 1 at addr ffff8801c951039c by task syz-executor5/3295
      
      CPU: 1 PID: 3295 Comm: syz-executor5 Not tainted 4.14.0-rc4+ #80
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:16 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:52
       print_address_description+0x73/0x250 mm/kasan/report.c:252
       kasan_report_error mm/kasan/report.c:351 [inline]
       kasan_report+0x25b/0x340 mm/kasan/report.c:409
       __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:427
       ip_queue_xmit+0x1687/0x18e0 net/ipv4/ip_output.c:474
       tcp_transmit_skb+0x1ab7/0x3840 net/ipv4/tcp_output.c:1135
       tcp_send_ack.part.37+0x3bb/0x650 net/ipv4/tcp_output.c:3587
       tcp_send_ack+0x49/0x60 net/ipv4/tcp_output.c:3557
       __tcp_ack_snd_check+0x2c6/0x4b0 net/ipv4/tcp_input.c:5072
       tcp_ack_snd_check net/ipv4/tcp_input.c:5085 [inline]
       tcp_rcv_state_process+0x2eff/0x4850 net/ipv4/tcp_input.c:6071
       tcp_child_process+0x342/0x990 net/ipv4/tcp_minisocks.c:816
       tcp_v4_rcv+0x1827/0x2f80 net/ipv4/tcp_ipv4.c:1682
       ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
       NF_HOOK include/linux/netfilter.h:249 [inline]
       ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
       dst_input include/net/dst.h:464 [inline]
       ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
       NF_HOOK include/linux/netfilter.h:249 [inline]
       ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
       __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
       __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
       netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
       netif_receive_skb+0xae/0x390 net/core/dev.c:4611
       tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
       tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
       tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
       call_write_iter include/linux/fs.h:1770 [inline]
       new_sync_write fs/read_write.c:468 [inline]
       __vfs_write+0x68a/0x970 fs/read_write.c:481
       vfs_write+0x18f/0x510 fs/read_write.c:543
       SYSC_write fs/read_write.c:588 [inline]
       SyS_write+0xef/0x220 fs/read_write.c:580
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      RIP: 0033:0x40c341
      RSP: 002b:00007f469523ec10 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 000000000040c341
      RDX: 0000000000000037 RSI: 0000000020004000 RDI: 0000000000000015
      RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
      R10: 00000000000f4240 R11: 0000000000000293 R12: 00000000004b7fd1
      R13: 00000000ffffffff R14: 0000000020000000 R15: 0000000000025000
      
      Allocated by task 3295:
       save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
       save_stack+0x43/0xd0 mm/kasan/kasan.c:447
       set_track mm/kasan/kasan.c:459 [inline]
       kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
       __do_kmalloc mm/slab.c:3725 [inline]
       __kmalloc+0x162/0x760 mm/slab.c:3734
       kmalloc include/linux/slab.h:498 [inline]
       tcp_v4_save_options include/net/tcp.h:1962 [inline]
       tcp_v4_init_req+0x2d3/0x3e0 net/ipv4/tcp_ipv4.c:1271
       tcp_conn_request+0xf6d/0x3410 net/ipv4/tcp_input.c:6283
       tcp_v4_conn_request+0x157/0x210 net/ipv4/tcp_ipv4.c:1313
       tcp_rcv_state_process+0x8ea/0x4850 net/ipv4/tcp_input.c:5857
       tcp_v4_do_rcv+0x55c/0x7d0 net/ipv4/tcp_ipv4.c:1482
       tcp_v4_rcv+0x2d10/0x2f80 net/ipv4/tcp_ipv4.c:1711
       ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
       NF_HOOK include/linux/netfilter.h:249 [inline]
       ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
       dst_input include/net/dst.h:464 [inline]
       ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
       NF_HOOK include/linux/netfilter.h:249 [inline]
       ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
       __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
       __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
       netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
       netif_receive_skb+0xae/0x390 net/core/dev.c:4611
       tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
       tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
       tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
       call_write_iter include/linux/fs.h:1770 [inline]
       new_sync_write fs/read_write.c:468 [inline]
       __vfs_write+0x68a/0x970 fs/read_write.c:481
       vfs_write+0x18f/0x510 fs/read_write.c:543
       SYSC_write fs/read_write.c:588 [inline]
       SyS_write+0xef/0x220 fs/read_write.c:580
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      Freed by task 3306:
       save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
       save_stack+0x43/0xd0 mm/kasan/kasan.c:447
       set_track mm/kasan/kasan.c:459 [inline]
       kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
       __cache_free mm/slab.c:3503 [inline]
       kfree+0xca/0x250 mm/slab.c:3820
       inet_sock_destruct+0x59d/0x950 net/ipv4/af_inet.c:157
       __sk_destruct+0xfd/0x910 net/core/sock.c:1560
       sk_destruct+0x47/0x80 net/core/sock.c:1595
       __sk_free+0x57/0x230 net/core/sock.c:1603
       sk_free+0x2a/0x40 net/core/sock.c:1614
       sock_put include/net/sock.h:1652 [inline]
       inet_csk_complete_hashdance+0xd5/0xf0 net/ipv4/inet_connection_sock.c:959
       tcp_check_req+0xf4d/0x1620 net/ipv4/tcp_minisocks.c:765
       tcp_v4_rcv+0x17f6/0x2f80 net/ipv4/tcp_ipv4.c:1675
       ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
       NF_HOOK include/linux/netfilter.h:249 [inline]
       ip_local_deliver+0x1ce/0x6e0 net/ipv4/ip_input.c:257
       dst_input include/net/dst.h:464 [inline]
       ip_rcv_finish+0x887/0x19a0 net/ipv4/ip_input.c:397
       NF_HOOK include/linux/netfilter.h:249 [inline]
       ip_rcv+0xc3f/0x1820 net/ipv4/ip_input.c:493
       __netif_receive_skb_core+0x1a3e/0x34b0 net/core/dev.c:4476
       __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4514
       netif_receive_skb_internal+0x10b/0x670 net/core/dev.c:4587
       netif_receive_skb+0xae/0x390 net/core/dev.c:4611
       tun_rx_batched.isra.50+0x5ed/0x860 drivers/net/tun.c:1372
       tun_get_user+0x249c/0x36d0 drivers/net/tun.c:1766
       tun_chr_write_iter+0xbf/0x160 drivers/net/tun.c:1792
       call_write_iter include/linux/fs.h:1770 [inline]
       new_sync_write fs/read_write.c:468 [inline]
       __vfs_write+0x68a/0x970 fs/read_write.c:481
       vfs_write+0x18f/0x510 fs/read_write.c:543
       SYSC_write fs/read_write.c:588 [inline]
       SyS_write+0xef/0x220 fs/read_write.c:580
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets")
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c92e8c02
  4. 20 10月, 2017 3 次提交
  5. 18 10月, 2017 5 次提交
  6. 17 10月, 2017 2 次提交
  7. 15 10月, 2017 2 次提交
    • C
      tcp: add a tracepoint for tcp retransmission · e086101b
      Cong Wang 提交于
      We need a real-time notification for tcp retransmission
      for monitoring.
      
      Of course we could use ftrace to dynamically instrument this
      kernel function too, however we can't retrieve the connection
      information at the same time, for example perf-tools [1] reads
      /proc/net/tcp for socket details, which is slow when we have
      a lots of connections.
      
      Therefore, this patch adds a tracepoint for __tcp_retransmit_skb()
      and exposes src/dst IP addresses and ports of the connection.
      This also makes it easier to integrate into perf.
      
      Note, I expose both IPv4 and IPv6 addresses at the same time:
      for a IPv4 socket, v4 mapped address is used as IPv6 addresses,
      for a IPv6 socket, LOOPBACK4_IPV6 is already filled by kernel.
      Also, add sk and skb pointers as they are useful for BPF.
      
      1. https://github.com/brendangregg/perf-tools/blob/master/net/tcpretrans
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NBrendan Gregg <bgregg@netflix.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e086101b
    • M
      icmp: don't fail on fragment reassembly time exceeded · 258bbb1b
      Matteo Croce 提交于
      The ICMP implementation currently replies to an ICMP time exceeded message
      (type 11) with an ICMP host unreachable message (type 3, code 1).
      
      However, time exceeded messages can either represent "time to live exceeded
      in transit" (code 0) or "fragment reassembly time exceeded" (code 1).
      
      Unconditionally replying to "fragment reassembly time exceeded" with
      host unreachable messages might cause unjustified connection resets
      which are now easily triggered as UFO has been removed, because, in turn,
      sending large buffers triggers IP fragmentation.
      
      The issue can be easily reproduced by running a lot of UDP streams
      which is likely to trigger IP fragmentation:
      
        # start netserver in the test namespace
        ip netns add test
        ip netns exec test netserver
      
        # create a VETH pair
        ip link add name veth0 type veth peer name veth0 netns test
        ip link set veth0 up
        ip -n test link set veth0 up
      
        for i in $(seq 20 29); do
            # assign addresses to both ends
            ip addr add dev veth0 192.168.$i.1/24
            ip -n test addr add dev veth0 192.168.$i.2/24
      
            # start the traffic
            netperf -L 192.168.$i.1 -H 192.168.$i.2 -t UDP_STREAM -l 0 &
        done
      
        # wait
        send_data: data send error: No route to host (errno 113)
        netperf: send_omni: send_data failed: No route to host
      
      We need to differentiate instead: if fragment reassembly time exceeded
      is reported, we need to silently drop the packet,
      if time to live exceeded is reported, maintain the current behaviour.
      In both cases increment the related error count "icmpInTimeExcds".
      
      While at it, fix a typo in a comment, and convert the if statement
      into a switch to mate it more readable.
      Signed-off-by: NMatteo Croce <mcroce@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      258bbb1b
  8. 13 10月, 2017 1 次提交
  9. 11 10月, 2017 1 次提交
  10. 10 10月, 2017 4 次提交
  11. 09 10月, 2017 2 次提交
  12. 08 10月, 2017 1 次提交
  13. 07 10月, 2017 7 次提交
  14. 06 10月, 2017 5 次提交