1. 21 9月, 2017 1 次提交
    • P
      udp: do rmem bulk free even if the rx sk queue is empty · 0d4a6608
      Paolo Abeni 提交于
      The commit 6b229cf7 ("udp: add batching to udp_rmem_release()")
      reduced greatly the cacheline contention between the BH and the US
      reader batching the rmem updates in most scenarios.
      
      Such optimization is explicitly avoided if the US reader is faster
      then BH processing.
      
      My fault, I initially suggested this kind of behavior due to concerns
      of possible regressions with small sk_rcvbuf values. Tests showed
      such concerns are misplaced, so this commit relaxes the condition
      for rmem bulk updates, obtaining small but measurable performance
      gain in the scenario described above.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0d4a6608
  2. 20 9月, 2017 3 次提交
    • E
      ipv4: speedup ipv6 tunnels dismantle · 64bc1781
      Eric Dumazet 提交于
      Implement exit_batch() method to dismantle more devices
      per round.
      
      (rtnl_lock() ...
       unregister_netdevice_many() ...
       rtnl_unlock())
      
      Tested:
      $ cat add_del_unshare.sh
      for i in `seq 1 40`
      do
       (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) &
      done
      wait ; grep net_namespace /proc/slabinfo
      
      Before patch :
      $ time ./add_del_unshare.sh
      net_namespace        126    282   5504    1    2 : tunables    8    4    0 : slabdata    126    282      0
      
      real    1m38.965s
      user    0m0.688s
      sys     0m37.017s
      
      After patch:
      $ time ./add_del_unshare.sh
      net_namespace        135    291   5504    1    2 : tunables    8    4    0 : slabdata    135    291      0
      
      real	0m22.117s
      user	0m0.728s
      sys	0m35.328s
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      64bc1781
    • E
      tcp: batch tcp_net_metrics_exit · 789e6ddb
      Eric Dumazet 提交于
      When dealing with a list of dismantling netns, we can scan
      tcp_metrics once, saving cpu cycles.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      789e6ddb
    • E
      net: sk_buff rbnode reorg · bffa72cf
      Eric Dumazet 提交于
      skb->rbnode shares space with skb->next, skb->prev and skb->tstamp
      
      Current uses (TCP receive ofo queue and netem) need to save/restore
      tstamp, while skb->dev is either NULL (TCP) or a constant for a given
      queue (netem).
      
      Since we plan using an RB tree for TCP retransmit queue to speedup SACK
      processing with large BDP, this patch exchanges skb->dev and
      skb->tstamp.
      
      This saves some overhead in both TCP and netem.
      
      v2: removes the swtstamp field from struct tcp_skb_cb
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bffa72cf
  3. 19 9月, 2017 2 次提交
  4. 17 9月, 2017 1 次提交
  5. 16 9月, 2017 2 次提交
    • E
      tcp: update skb->skb_mstamp more carefully · 8c72c65b
      Eric Dumazet 提交于
      liujian reported a problem in TCP_USER_TIMEOUT processing with a patch
      in tcp_probe_timer() :
            https://www.spinics.net/lists/netdev/msg454496.html
      
      After investigations, the root cause of the problem is that we update
      skb->skb_mstamp of skbs in write queue, even if the attempt to send a
      clone or copy of it failed. One reason being a routing problem.
      
      This patch prevents this, solving liujian issue.
      
      It also removes a potential RTT miscalculation, since
      __tcp_retransmit_skb() is not OR-ing TCP_SKB_CB(skb)->sacked with
      TCPCB_EVER_RETRANS if a failure happens, but skb->skb_mstamp has
      been changed.
      
      A future ACK would then lead to a very small RTT sample and min_rtt
      would then be lowered to this too small value.
      
      Tested:
      
      # cat user_timeout.pkt
      --local_ip=192.168.102.64
      
          0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 `ifconfig tun0 192.168.102.64/16; ip ro add 192.0.2.1 dev tun0`
      
         +0 < S 0:0(0) win 0 <mss 1460>
         +0 > S. 0:0(0) ack 1 <mss 1460>
      
        +.1 < . 1:1(0) ack 1 win 65530
         +0 accept(3, ..., ...) = 4
      
         +0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0
         +0 write(4, ..., 24) = 24
         +0 > P. 1:25(24) ack 1 win 29200
         +.1 < . 1:1(0) ack 25 win 65530
      
      //change the ipaddress
         +1 `ifconfig tun0 192.168.0.10/16`
      
         +1 write(4, ..., 24) = 24
         +1 write(4, ..., 24) = 24
         +1 write(4, ..., 24) = 24
         +1 write(4, ..., 24) = 24
      
         +0 `ifconfig tun0 192.168.102.64/16`
         +0 < . 1:2(1) ack 25 win 65530
         +0 `ifconfig tun0 192.168.0.10/16`
      
         +3 write(4, ..., 24) = -1
      
      # ./packetdrill user_timeout.pkt
      Signed-off-by: NEric Dumazet <edumazet@googl.com>
      Reported-by: Nliujian <liujian56@huawei.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8c72c65b
    • D
      net: ipv4: fix l3slave check for index returned in IP_PKTINFO · cbea8f02
      David Ahern 提交于
      rt_iif is only set to the actual egress device for the output path. The
      recent change to consider the l3slave flag when returning IP_PKTINFO
      works for local traffic (the correct device index is returned), but it
      broke the more typical use case of packets received from a remote host
      always returning the VRF index rather than the original ingress device.
      Update the fixup to consider l3slave and rt_iif actually getting set.
      
      Fixes: 1dfa7639 ("net: ipv4: add check for l3slave for index returned in IP_PKTINFO")
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cbea8f02
  6. 13 9月, 2017 2 次提交
    • H
      ip_tunnel: fix ip tunnel lookup in collect_md mode · 833a8b40
      Haishuang Yan 提交于
      In collect_md mode, if the tun dev is down, it still can call
      ip_tunnel_rcv to receive on packets, and the rx statistics increase
      improperly.
      
      When the md tunnel is down, it's not neccessary to increase RX drops
      for the tunnel device, packets would be recieved on fallback tunnel,
      and the RX drops on fallback device will be increased as expected.
      
      Fixes: 2e15ea39 ("ip_gre: Add support to collect tunnel metadata.")
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NHaishuang Yan <yanhaishuang@cmss.chinamobile.com>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      833a8b40
    • E
      tcp/dccp: remove reqsk_put() from inet_child_forget() · da8ab578
      Eric Dumazet 提交于
      Back in linux-4.4, I inadvertently put a call to reqsk_put() in
      inet_child_forget(), forgetting it could be called from two different
      points.
      
      In the case it is called from inet_csk_reqsk_queue_add(), we want to
      keep the reference on the request socket, since it is released later by
      the caller (tcp_v{4|6}_rcv())
      
      This bug never showed up because atomic_dec_and_test() was not signaling
      the underflow, and SLAB_DESTROY_BY RCU semantic for request sockets
      prevented the request to be put in quarantine.
      
      Recent conversion of socket refcount from atomic_t to refcount_t finally
      exposed the bug.
      
      So move the reqsk_put() to inet_csk_listen_stop() to fix this.
      
      Thanks to Shankara Pailoor for using syzkaller and providing
      a nice set of .config and C repro.
      
      WARNING: CPU: 2 PID: 4277 at lib/refcount.c:186
      refcount_sub_and_test+0x167/0x1b0 lib/refcount.c:186
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 2 PID: 4277 Comm: syz-executor0 Not tainted 4.13.0-rc7 #3
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      Ubuntu-1.8.2-1ubuntu1 04/01/2014
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:16 [inline]
       dump_stack+0xf7/0x1aa lib/dump_stack.c:52
       panic+0x1ae/0x3a7 kernel/panic.c:180
       __warn+0x1c4/0x1d9 kernel/panic.c:541
       report_bug+0x211/0x2d0 lib/bug.c:183
       fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:190
       do_trap_no_signal arch/x86/kernel/traps.c:224 [inline]
       do_trap+0x260/0x390 arch/x86/kernel/traps.c:273
       do_error_trap+0x118/0x340 arch/x86/kernel/traps.c:310
       do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:323
       invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:846
      RIP: 0010:refcount_sub_and_test+0x167/0x1b0 lib/refcount.c:186
      RSP: 0018:ffff88006e006b60 EFLAGS: 00010286
      RAX: 0000000000000026 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000026 RSI: 1ffff1000dc00d2c RDI: ffffed000dc00d60
      RBP: ffff88006e006bf0 R08: 0000000000000001 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff1000dc00d6d
      R13: 00000000ffffffff R14: 0000000000000001 R15: ffff88006ce9d340
       refcount_dec_and_test+0x1a/0x20 lib/refcount.c:211
       reqsk_put+0x71/0x2b0 include/net/request_sock.h:123
       tcp_v4_rcv+0x259e/0x2e20 net/ipv4/tcp_ipv4.c:1729
       ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216
       NF_HOOK include/linux/netfilter.h:248 [inline]
       ip_local_deliver+0x1ce/0x6d0 net/ipv4/ip_input.c:257
       dst_input include/net/dst.h:477 [inline]
       ip_rcv_finish+0x8db/0x19c0 net/ipv4/ip_input.c:397
       NF_HOOK include/linux/netfilter.h:248 [inline]
       ip_rcv+0xc3f/0x17d0 net/ipv4/ip_input.c:488
       __netif_receive_skb_core+0x1fb7/0x31f0 net/core/dev.c:4298
       __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4336
       process_backlog+0x1c5/0x6d0 net/core/dev.c:5102
       napi_poll net/core/dev.c:5499 [inline]
       net_rx_action+0x6d3/0x14a0 net/core/dev.c:5565
       __do_softirq+0x2cb/0xb2d kernel/softirq.c:284
       do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:898
       </IRQ>
       do_softirq.part.16+0x63/0x80 kernel/softirq.c:328
       do_softirq kernel/softirq.c:176 [inline]
       __local_bh_enable_ip+0x84/0x90 kernel/softirq.c:181
       local_bh_enable include/linux/bottom_half.h:31 [inline]
       rcu_read_unlock_bh include/linux/rcupdate.h:705 [inline]
       ip_finish_output2+0x8ad/0x1360 net/ipv4/ip_output.c:231
       ip_finish_output+0x74e/0xb80 net/ipv4/ip_output.c:317
       NF_HOOK_COND include/linux/netfilter.h:237 [inline]
       ip_output+0x1cc/0x850 net/ipv4/ip_output.c:405
       dst_output include/net/dst.h:471 [inline]
       ip_local_out+0x95/0x160 net/ipv4/ip_output.c:124
       ip_queue_xmit+0x8c6/0x1810 net/ipv4/ip_output.c:504
       tcp_transmit_skb+0x1963/0x3320 net/ipv4/tcp_output.c:1123
       tcp_send_ack.part.35+0x38c/0x620 net/ipv4/tcp_output.c:3575
       tcp_send_ack+0x49/0x60 net/ipv4/tcp_output.c:3545
       tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:5795 [inline]
       tcp_rcv_state_process+0x4876/0x4b60 net/ipv4/tcp_input.c:5930
       tcp_v4_do_rcv+0x58a/0x820 net/ipv4/tcp_ipv4.c:1483
       sk_backlog_rcv include/net/sock.h:907 [inline]
       __release_sock+0x124/0x360 net/core/sock.c:2223
       release_sock+0xa4/0x2a0 net/core/sock.c:2715
       inet_wait_for_connect net/ipv4/af_inet.c:557 [inline]
       __inet_stream_connect+0x671/0xf00 net/ipv4/af_inet.c:643
       inet_stream_connect+0x58/0xa0 net/ipv4/af_inet.c:682
       SYSC_connect+0x204/0x470 net/socket.c:1628
       SyS_connect+0x24/0x30 net/socket.c:1609
       entry_SYSCALL_64_fastpath+0x18/0xad
      RIP: 0033:0x451e59
      RSP: 002b:00007f474843fc08 EFLAGS: 00000216 ORIG_RAX: 000000000000002a
      RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 0000000000451e59
      RDX: 0000000000000010 RSI: 0000000020002000 RDI: 0000000000000007
      RBP: 0000000000000046 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000216 R12: 0000000000000000
      R13: 00007ffc040a0f8f R14: 00007f47484409c0 R15: 0000000000000000
      
      Fixes: ebb516af ("tcp/dccp: fix race at listener dismantle phase")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NShankara Pailoor <sp3485@columbia.edu>
      Tested-by: NShankara Pailoor <sp3485@columbia.edu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      da8ab578
  7. 09 9月, 2017 3 次提交
  8. 08 9月, 2017 1 次提交
  9. 04 9月, 2017 5 次提交
  10. 02 9月, 2017 6 次提交
  11. 31 8月, 2017 3 次提交
  12. 30 8月, 2017 1 次提交
  13. 29 8月, 2017 5 次提交
    • D
      net: Add comment that early_demux can change via sysctl · a8e3bb34
      David Ahern 提交于
      Twice patches trying to constify inet{6}_protocol have been reverted:
      39294c3d ("Revert "ipv6: constify inet6_protocol structures"") to
      revert 3a3a4e30 and then 03157937 ("Revert "ipv4: make
      net_protocol const"") to revert aa8db499.
      
      Add a comment that the structures can not be const because the
      early_demux field can change based on a sysctl.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a8e3bb34
    • W
      gre: add collect_md mode to ERSPAN tunnel · 1a66a836
      William Tu 提交于
      Similar to gre, vxlan, geneve, ipip tunnels, allow ERSPAN tunnels to
      operate in 'collect metadata' mode.  bpf_skb_[gs]et_tunnel_key() helpers
      can make use of it right away.  OVS can use it as well in the future.
      Signed-off-by: NWilliam Tu <u9012063@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1a66a836
    • W
      gre: refactor the gre_fb_xmit · 862a03c3
      William Tu 提交于
      The patch refactors the gre_fb_xmit function, by creating
      prepare_fb_xmit function for later ERSPAN collect_md mode patch.
      Signed-off-by: NWilliam Tu <u9012063@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      862a03c3
    • D
      Revert "ipv4: make net_protocol const" · 03157937
      David Ahern 提交于
      This reverts commit aa8db499.
      
      Early demux structs can not be made const. Doing so results in:
      [   84.967355] BUG: unable to handle kernel paging request at ffffffff81684b10
      [   84.969272] IP: proc_configure_early_demux+0x1e/0x3d
      [   84.970544] PGD 1a0a067
      [   84.970546] P4D 1a0a067
      [   84.971212] PUD 1a0b063
      [   84.971733] PMD 80000000016001e1
      
      [   84.972669] Oops: 0003 [#1] SMP
      [   84.973065] Modules linked in: ip6table_filter ip6_tables veth vrf
      [   84.973833] CPU: 0 PID: 955 Comm: sysctl Not tainted 4.13.0-rc6+ #22
      [   84.974612] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
      [   84.975855] task: ffff88003854ce00 task.stack: ffffc900005a4000
      [   84.976580] RIP: 0010:proc_configure_early_demux+0x1e/0x3d
      [   84.977253] RSP: 0018:ffffc900005a7dd0 EFLAGS: 00010246
      [   84.977891] RAX: ffffffff81684b10 RBX: 0000000000000001 RCX: 0000000000000000
      [   84.978759] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000000
      [   84.979628] RBP: ffffc900005a7dd0 R08: 0000000000000000 R09: 0000000000000000
      [   84.980501] R10: 0000000000000001 R11: 0000000000000008 R12: 0000000000000001
      [   84.981373] R13: ffffffffffffffea R14: ffffffff81a9b4c0 R15: 0000000000000002
      [   84.982249] FS:  00007feb237b7700(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
      [   84.983231] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   84.983941] CR2: ffffffff81684b10 CR3: 0000000038492000 CR4: 00000000000406f0
      [   84.984817] Call Trace:
      [   84.985133]  proc_tcp_early_demux+0x29/0x30
      
      I think this is the second time such a patch has been reverted.
      
      Cc: Bhumika Goyal <bhumirks@gmail.com>
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      03157937
    • B
      ipv4: make net_protocol const · aa8db499
      Bhumika Goyal 提交于
      Make these const as they are only passed to a const argument of the
      function inet_add_protocol.
      Signed-off-by: NBhumika Goyal <bhumirks@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aa8db499
  14. 26 8月, 2017 3 次提交
    • P
      udp6: set rx_dst_cookie on rx_dst updates · 64f0f5d1
      Paolo Abeni 提交于
      Currently, in the udp6 code, the dst cookie is not initialized/updated
      concurrently with the RX dst used by early demux.
      
      As a result, the dst_check() in the early_demux path always fails,
      the rx dst cache is always invalidated, and we can't really
      leverage significant gain from the demux lookup.
      
      Fix it adding udp6 specific variant of sk_rx_dst_set() and use it
      to set the dst cookie when the dst entry is really changed.
      
      The issue is there since the introduction of early demux for ipv6.
      
      Fixes: 5425077d ("net: ipv6: Add early demux handler for UDP unicast")
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      64f0f5d1
    • E
      tcp: fix hang in tcp_sendpage_locked() · bd9dfc54
      Eric Dumazet 提交于
      syszkaller got a hang in tcp stack, related to a bug in
      tcp_sendpage_locked()
      
      root@syzkaller:~# cat /proc/3059/stack
      [<ffffffff83de926c>] __lock_sock+0x1dc/0x2f0
      [<ffffffff83de9473>] lock_sock_nested+0xf3/0x110
      [<ffffffff8408ce01>] tcp_sendmsg+0x21/0x50
      [<ffffffff84163b6f>] inet_sendmsg+0x11f/0x5e0
      [<ffffffff83dd8eea>] sock_sendmsg+0xca/0x110
      [<ffffffff83dd9547>] kernel_sendmsg+0x47/0x60
      [<ffffffff83de35dc>] sock_no_sendpage+0x1cc/0x280
      [<ffffffff8408916b>] tcp_sendpage_locked+0x10b/0x160
      [<ffffffff84089203>] tcp_sendpage+0x43/0x60
      [<ffffffff841641da>] inet_sendpage+0x1aa/0x660
      [<ffffffff83dd4fcd>] kernel_sendpage+0x8d/0xe0
      [<ffffffff83dd50ac>] sock_sendpage+0x8c/0xc0
      [<ffffffff81b63300>] pipe_to_sendpage+0x290/0x3b0
      [<ffffffff81b67243>] __splice_from_pipe+0x343/0x750
      [<ffffffff81b6a459>] splice_from_pipe+0x1e9/0x330
      [<ffffffff81b6a5e0>] generic_splice_sendpage+0x40/0x50
      [<ffffffff81b6b1d7>] SyS_splice+0x7b7/0x1610
      [<ffffffff84d77a01>] entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      Fixes: 306b13eb ("proto_ops: Add locked held versions of sendmsg and sendpage")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Tom Herbert <tom@quantonium.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bd9dfc54
    • S
      tcp: fix refcnt leak with ebpf congestion control · ebfa00c5
      Sabrina Dubroca 提交于
      There are a few bugs around refcnt handling in the new BPF congestion
      control setsockopt:
      
       - The new ca is assigned to icsk->icsk_ca_ops even in the case where we
         cannot get a reference on it. This would lead to a use after free,
         since that ca is going away soon.
      
       - Changing the congestion control case doesn't release the refcnt on
         the previous ca.
      
       - In the reinit case, we first leak a reference on the old ca, then we
         call tcp_reinit_congestion_control on the ca that we have just
         assigned, leading to deinitializing the wrong ca (->release of the
         new ca on the old ca's data) and releasing the refcount on the ca
         that we actually want to use.
      
      This is visible by building (for example) BIC as a module and setting
      net.ipv4.tcp_congestion_control=bic, and using tcp_cong_kern.c from
      samples/bpf.
      
      This patch fixes the refcount issues, and moves reinit back into tcp
      core to avoid passing a ca pointer back to BPF.
      
      Fixes: 91b5b21c ("bpf: Add support for changing congestion control")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Acked-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ebfa00c5
  15. 25 8月, 2017 2 次提交