1. 14 Oct, 2019 3 commits
    • tcp: annotate tp->copied_seq lockless reads · 7db48e98
      Authored by Eric Dumazet
      There are a few places where we fetch tp->copied_seq while
      this field can change from IRQ or another cpu.
      
      We need to add READ_ONCE() annotations, and also make
      sure write sides use corresponding WRITE_ONCE() to avoid
      store-tearing.
      
      Note that tcp_inq_hint() was already using READ_ONCE(tp->copied_seq).
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
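The annotation pattern described in this commit can be sketched in user space as follows. This is a minimal illustration, not the kernel code: the macros are simplified volatile-cast stand-ins for the kernel's READ_ONCE()/WRITE_ONCE(), and `demo_tcp_sock`, `demo_advance_copied_seq()` and `demo_unread_bytes()` are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel macros (assumption: a volatile cast
 * is enough for naturally aligned scalar fields). */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical cut-down socket state; only the fields used here. */
struct demo_tcp_sock {
	uint32_t rcv_nxt;    /* next sequence number expected */
	uint32_t copied_seq; /* first byte not yet copied to user space */
};

/* Write side: publish the new value with one store so that a concurrent
 * lockless reader never observes a torn value. */
static void demo_advance_copied_seq(struct demo_tcp_sock *tp, uint32_t len)
{
	assert(tp != NULL);
	WRITE_ONCE(tp->copied_seq, tp->copied_seq + len);
}

/* Lockless read side: load each field exactly once, then compute. */
static uint32_t demo_unread_bytes(struct demo_tcp_sock *tp)
{
	uint32_t copied = READ_ONCE(tp->copied_seq);
	uint32_t nxt = READ_ONCE(tp->rcv_nxt);

	return nxt - copied; /* unsigned arithmetic handles wraparound */
}
```

Both sides are annotated on purpose: the read marks the access as intentionally lockless, and the matching write keeps the compiler from splitting the store.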
    • tcp: annotate tp->rcv_nxt lockless reads · dba7d9b8
      Authored by Eric Dumazet
      There are a few places where we fetch tp->rcv_nxt while
      this field can change from IRQ or another cpu.
      
      We need to add READ_ONCE() annotations, and also make
      sure write sides use corresponding WRITE_ONCE() to avoid
      store-tearing.
      
      Note that tcp_inq_hint() was already using READ_ONCE(tp->rcv_nxt).
      
      syzbot reported :
      
      BUG: KCSAN: data-race in tcp_poll / tcp_queue_rcv
      
      write to 0xffff888120425770 of 4 bytes by interrupt on cpu 0:
       tcp_rcv_nxt_update net/ipv4/tcp_input.c:3365 [inline]
       tcp_queue_rcv+0x180/0x380 net/ipv4/tcp_input.c:4638
       tcp_rcv_established+0xbf1/0xf50 net/ipv4/tcp_input.c:5616
       tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1542
       tcp_v4_rcv+0x1a03/0x1bf0 net/ipv4/tcp_ipv4.c:1923
       ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
       netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
       napi_skb_finish net/core/dev.c:5671 [inline]
       napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
       receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
      
      read to 0xffff888120425770 of 4 bytes by task 7254 on cpu 1:
       tcp_stream_is_readable net/ipv4/tcp.c:480 [inline]
       tcp_poll+0x204/0x6b0 net/ipv4/tcp.c:554
       sock_poll+0xed/0x250 net/socket.c:1256
       vfs_poll include/linux/poll.h:90 [inline]
       ep_item_poll.isra.0+0x90/0x190 fs/eventpoll.c:892
       ep_send_events_proc+0x113/0x5c0 fs/eventpoll.c:1749
       ep_scan_ready_list.constprop.0+0x189/0x500 fs/eventpoll.c:704
       ep_send_events fs/eventpoll.c:1793 [inline]
       ep_poll+0xe3/0x900 fs/eventpoll.c:1930
       do_epoll_wait+0x162/0x180 fs/eventpoll.c:2294
       __do_sys_epoll_pwait fs/eventpoll.c:2325 [inline]
       __se_sys_epoll_pwait fs/eventpoll.c:2311 [inline]
       __x64_sys_epoll_pwait+0xcd/0x170 fs/eventpoll.c:2311
       do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 7254 Comm: syz-fuzzer Not tainted 5.3.0+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: add rcu protection around tp->fastopen_rsk · d983ea6f
      Authored by Eric Dumazet
      Both tcp_v4_err() and tcp_v6_err() do the following operations
      while they do not own the socket lock :
      
      	fastopen = tp->fastopen_rsk;
      	snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;
      
      The problem is that without an appropriate barrier, the compiler
      might reload tp->fastopen_rsk after the NULL check and trigger a
      NULL deref.
      
      Request sockets are protected by RCU, so we can simply add
      the missing annotations and barriers to solve the issue.
      
      Fixes: 168a8f58 ("tcp: TCP Fast Open Server - main code path")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
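The reload hazard described in this commit can be illustrated with a small user-space sketch. It shows only the "load the pointer once, then use the local copy" half of the fix. The kernel patch relies on RCU accessors, which are not modeled here. `demo_sock`, `demo_req` and `demo_err_snd_una()` are hypothetical names, and READ_ONCE() is a simplified stand-in.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the kernel macro (assumption). */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical stand-ins for struct request_sock and struct tcp_sock. */
struct demo_req {
	uint32_t snt_isn;
};

struct demo_sock {
	struct demo_req *fastopen_rsk; /* may be cleared by another cpu */
	uint32_t snd_una;
};

/* Load the pointer once into a local, then test and dereference only that
 * local. The compiler is not allowed to re-read sk->fastopen_rsk between
 * the NULL check and the dereference. */
static uint32_t demo_err_snd_una(struct demo_sock *sk)
{
	struct demo_req *fastopen;

	assert(sk != NULL);
	fastopen = READ_ONCE(sk->fastopen_rsk);
	return fastopen ? fastopen->snt_isn : READ_ONCE(sk->snd_una);
}
```

In the kernel, the object behind the pointer must also stay alive while it is used, which is what the RCU protection mentioned in the commit message provides.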
  2. 10 Oct, 2019 4 commits
    • net: annotate sk->sk_rcvlowat lockless reads · eac66402
      Authored by Eric Dumazet
      sock_rcvlowat() or int_sk_rcvlowat() might be called without the socket
      lock, for example from tcp_poll().
      
      Use READ_ONCE() to document the fact that other cpus might change
      sk->sk_rcvlowat under us and avoid KCSAN splats.
      
      Use WRITE_ONCE() on write sides too.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
    • net: silence KCSAN warnings around sk_add_backlog() calls · 8265792b
      Authored by Eric Dumazet
      sk_add_backlog() callers usually read sk->sk_rcvbuf without
      owning the socket lock. This means sk_rcvbuf value can
      be changed by other cpus, and KCSAN complains.
      
      Add READ_ONCE() annotations to document the lockless nature
      of these reads.
      
      Note that writes over sk_rcvbuf should also use WRITE_ONCE(),
      but this will be done in separate patches to ease stable
      backports (if we decide this is relevant for stable trees).
      
      BUG: KCSAN: data-race in tcp_add_backlog / tcp_recvmsg
      
      write to 0xffff88812ab369f8 of 8 bytes by interrupt on cpu 1:
       __sk_add_backlog include/net/sock.h:902 [inline]
       sk_add_backlog include/net/sock.h:933 [inline]
       tcp_add_backlog+0x45a/0xcc0 net/ipv4/tcp_ipv4.c:1737
       tcp_v4_rcv+0x1aba/0x1bf0 net/ipv4/tcp_ipv4.c:1925
       ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
       netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
       napi_skb_finish net/core/dev.c:5671 [inline]
       napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
       receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
       virtnet_receive drivers/net/virtio_net.c:1323 [inline]
       virtnet_poll+0x436/0x7d0 drivers/net/virtio_net.c:1428
       napi_poll net/core/dev.c:6352 [inline]
       net_rx_action+0x3ae/0xa50 net/core/dev.c:6418
      
      read to 0xffff88812ab369f8 of 8 bytes by task 7271 on cpu 0:
       tcp_recvmsg+0x470/0x1a30 net/ipv4/tcp.c:2047
       inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
       sock_recvmsg_nosec net/socket.c:871 [inline]
       sock_recvmsg net/socket.c:889 [inline]
       sock_recvmsg+0x92/0xb0 net/socket.c:885
       sock_read_iter+0x15f/0x1e0 net/socket.c:967
       call_read_iter include/linux/fs.h:1864 [inline]
       new_sync_read+0x389/0x4f0 fs/read_write.c:414
       __vfs_read+0xb1/0xc0 fs/read_write.c:427
       vfs_read fs/read_write.c:461 [inline]
       vfs_read+0x143/0x2c0 fs/read_write.c:446
       ksys_read+0xd5/0x1b0 fs/read_write.c:587
       __do_sys_read fs/read_write.c:597 [inline]
       __se_sys_read fs/read_write.c:595 [inline]
       __x64_sys_read+0x4c/0x60 fs/read_write.c:595
       do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 7271 Comm: syz-fuzzer Not tainted 5.3.0+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
    • tcp: annotate lockless access to tcp_memory_pressure · 1f142c17
      Authored by Eric Dumazet
      tcp_memory_pressure is read without holding any lock,
      and its value could be changed on other cpus.
      
      Use READ_ONCE() to annotate these lockless reads.
      
      The write side is already using atomic ops.
      
      Fixes: b8da51eb ("tcp: introduce tcp_under_memory_pressure()")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
    • net: add {READ|WRITE}_ONCE() annotations on ->rskq_accept_head · 60b173ca
      Authored by Eric Dumazet
      reqsk_queue_empty() is called from inet_csk_listen_poll() while
      other cpus might write ->rskq_accept_head value.
      
      Use {READ|WRITE}_ONCE() to avoid compiler tricks
      and potential KCSAN splats.
      
      Fixes: fff1f300 ("tcp: add a spinlock to protect struct request_sock_queue")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
  3. 05 Oct, 2019 1 commit
    • net: ipv4: avoid mixed n_redirects and rate_tokens usage · b406472b
      Authored by Paolo Abeni
      Since commit c09551c6 ("net: ipv4: use a dedicated counter
      for icmp_v4 redirect packets") we use 'n_redirects' to account
      for redirect packets, but we still use 'rate_tokens' to compute
      the redirect packets exponential backoff.
      
      If the device sent any ICMP error packet to the relevant peer
      after sending a redirect, it will also update 'rate_tokens' according
      to the leaky bucket scheme; typically 'rate_tokens' will rise
      above BITS_PER_LONG and the redirect packets backoff algorithm
      will produce undefined behavior.
      
      Fix the issue using 'n_redirects' to compute the exponential backoff
      in ip_rt_send_redirect().
      
      Note that we still clear rate_tokens after a redirect silence period,
      to avoid changing an established behaviour.
      
      The root cause predates git history; before the mentioned commit, in
      the critical scenario, the kernel stopped sending redirects; after
      the mentioned commit, the behavior is more random.
      Reported-by: Xiumei Mu <xmu@redhat.com>
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Fixes: c09551c6 ("net: ipv4: use a dedicated counter for icmp_v4 redirect packets")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
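The undefined behavior mentioned in this commit comes from using a counter that can exceed the width of a long as a shift count. A minimal sketch of a backoff driven by a bounded redirect counter is shown below. The constants and the `demo_redirect_backoff()` helper are assumptions made for this example and are not taken from the patch.

```c
#include <assert.h>
#include <limits.h>

#define DEMO_BITS_PER_LONG   ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))
#define DEMO_REDIRECT_NUMBER 9u   /* assumed cap on consecutive redirects */
#define DEMO_REDIRECT_LOAD   20ul /* assumed base delay, arbitrary ticks */

/* Returns 1 and stores the backoff delay when another redirect may be
 * scheduled, or 0 when the peer has reached the cap and is ignored. */
static int demo_redirect_backoff(unsigned int n_redirects, unsigned long *delay)
{
	if (n_redirects >= DEMO_REDIRECT_NUMBER)
		return 0;

	/* The counter is below the cap here, so the shift count is valid.
	 * Shifting by BITS_PER_LONG or more would be undefined behavior. */
	assert(n_redirects < DEMO_BITS_PER_LONG);
	*delay = DEMO_REDIRECT_LOAD << n_redirects;
	return 1;
}
```

Because the shift count is a counter that is checked against a small cap first, it cannot grow the way a token bucket value can.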
  4. 04 Oct, 2019 1 commit
    • tcp: fix slab-out-of-bounds in tcp_zerocopy_receive() · 3afb0961
      Authored by Eric Dumazet
      Apparently a refactoring patch brought a bug, that was caught
      by syzbot [1]
      
      Original code was correct, do not try to be smarter than the
      compiler :/
      
      [1]
      BUG: KASAN: slab-out-of-bounds in tcp_zerocopy_receive net/ipv4/tcp.c:1807 [inline]
      BUG: KASAN: slab-out-of-bounds in do_tcp_getsockopt.isra.0+0x2c6c/0x3120 net/ipv4/tcp.c:3654
      Read of size 4 at addr ffff8880943cf188 by task syz-executor.2/17508
      
      CPU: 0 PID: 17508 Comm: syz-executor.2 Not tainted 5.3.0-rc7+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0xd4/0x306 mm/kasan/report.c:351
       __kasan_report.cold+0x1b/0x36 mm/kasan/report.c:482
       kasan_report+0x12/0x17 mm/kasan/common.c:618
       __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:131
       tcp_zerocopy_receive net/ipv4/tcp.c:1807 [inline]
       do_tcp_getsockopt.isra.0+0x2c6c/0x3120 net/ipv4/tcp.c:3654
       tcp_getsockopt+0xbf/0xe0 net/ipv4/tcp.c:3680
       sock_common_getsockopt+0x94/0xd0 net/core/sock.c:3098
       __sys_getsockopt+0x16d/0x310 net/socket.c:2129
       __do_sys_getsockopt net/socket.c:2144 [inline]
       __se_sys_getsockopt net/socket.c:2141 [inline]
       __x64_sys_getsockopt+0xbe/0x150 net/socket.c:2141
       do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
      
      Fixes: d8e18a51 ("net: Use skb accessors in network core")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 03 Oct, 2019 2 commits
  6. 02 Oct, 2019 2 commits
    • tcp: adjust rto_base in retransmits_timed_out() · 3256a2d6
      Authored by Eric Dumazet
      The cited commit exposed an old retransmits_timed_out() bug
      which assumed it could call tcp_model_timeout() with
      TCP_RTO_MIN as rto_base for all states.
      
      But flows in SYN_SENT or SYN_RECV state use a different
      RTO base (1 sec instead of 200 ms, unless BPF chooses
      another value).
      
      This caused a reduction of SYN retransmits from 6 to 4 with
      the default /proc/sys/net/ipv4/tcp_syn_retries value.
      
      Fixes: a41e8a88 ("tcp: better handle TCP_USER_TIMEOUT in SYN_SENT state")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Marek Majkowski <marek@cloudflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
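To see why the RTO base matters, the sketch below models a timeout budget that doubles up to a cap and then grows linearly. It is an approximation written for this example of what tcp_model_timeout() is understood to compute, not a copy of the kernel function. The function names, the millisecond units and the constants are assumptions.

```c
#include <assert.h>

#define DEMO_RTO_MAX_MS      120000u /* assumed cap: 120 s */
#define DEMO_RTO_MIN_MS      200u    /* assumed base for established flows */
#define DEMO_TIMEOUT_INIT_MS 1000u   /* assumed base in SYN_SENT / SYN_RECV */

/* floor(log2(v)) for v > 0 */
static unsigned int demo_ilog2(unsigned int v)
{
	unsigned int r = 0;

	assert(v > 0);
	while (v >>= 1)
		r++;
	return r;
}

/* Total time budget for 'boundary' retransmits when the timer starts at
 * rto_base_ms and doubles until it reaches DEMO_RTO_MAX_MS. */
static unsigned int demo_model_timeout_ms(unsigned int boundary,
					  unsigned int rto_base_ms)
{
	unsigned int thresh;

	assert(rto_base_ms > 0 && rto_base_ms <= DEMO_RTO_MAX_MS);
	thresh = demo_ilog2(DEMO_RTO_MAX_MS / rto_base_ms);

	if (boundary <= thresh)
		return ((2u << boundary) - 1) * rto_base_ms;
	return ((2u << thresh) - 1) * rto_base_ms +
	       (boundary - thresh) * DEMO_RTO_MAX_MS;
}
```

With a boundary of 6, this model yields 127000 ms for a 1 s base but only 25400 ms for a 200 ms base, so a SYN flow measured against the smaller budget gives up much earlier than intended.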
    • netfilter: drop bridge nf reset from nf_reset · 895b5c9f
      Authored by Florian Westphal
      commit 174e2381
      ("sk_buff: drop all skb extensions on free and skb scrubbing") made napi
      recycle always drop skb extensions.  The additional skb_ext_del() that is
      performed via nf_reset on napi skb recycle is not needed anymore.
      
      Most nf_reset() calls in the stack are there so queued skb won't block
      'rmmod nf_conntrack' indefinitely.
      
      This removes the skb_ext_del from nf_reset, and renames it to a more
      fitting nf_reset_ct().
      
      In a few selected places, add a call to skb_ext_reset to make sure that
      no active extensions remain.
      
      I am submitting this for "net", because we're still early in the release
      cycle.  The patch applies to net-next too, but I think the rename causes
      needless divergence between those trees.
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  7. 01 Oct, 2019 1 commit
    • erspan: remove the incorrect mtu limit for erspan · 0e141f75
      Authored by Haishuang Yan
      erspan driver calls ether_setup(), after commit 61e84623
      ("net: centralize net_device min/max MTU checking"), the range
      of mtu is [min_mtu, max_mtu], which is [68, 1500] by default.
      
      As a result, the MTU of an erspan device cannot be set above
      1500; this limit is not correct for an ipgre tap device.
      
      Tested:
      Before patch:
      # ip link set erspan0 mtu 1600
      Error: mtu greater than device maximum.
      After patch:
      # ip link set erspan0 mtu 1600
      # ip -d link show erspan0
      21: erspan0@NONE: <BROADCAST,MULTICAST> mtu 1600 qdisc noop state DOWN
      mode DEFAULT group default qlen 1000
          link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 0
      
      Fixes: 61e84623 ("net: centralize net_device min/max MTU checking")
      Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 28 Sep, 2019 2 commits
  9. 27 Sep, 2019 1 commit
    • tcp: honor SO_PRIORITY in TIME_WAIT state · f6c0f5d2
      Authored by Eric Dumazet
      ctl packets sent on behalf of TIME_WAIT sockets currently
      have a zero skb->priority, which can cause various problems.
      
      In this patch we :
      
      - add a tw_priority field in struct inet_timewait_sock.
      
      - populate it from sk->sk_priority when a TIME_WAIT is created.
      
      - For IPv4, change ip_send_unicast_reply() and its two
        callers to propagate tw_priority correctly.
        ip_send_unicast_reply() no longer changes sk->sk_priority.
      
      - For IPv6, make sure TIME_WAIT sockets pass their tw_priority
        field to tcp_v6_send_response() and tcp_v6_send_ack().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 26 Sep, 2019 1 commit
  11. 21 Sep, 2019 1 commit
  12. 16 Sep, 2019 3 commits
  13. 14 Sep, 2019 1 commit
    • ip: support SO_MARK cmsg · c6af0c22
      Authored by Willem de Bruijn
      Enable setting skb->mark for UDP and RAW sockets using cmsg.
      
      This is analogous to existing support for TOS, TTL, txtime, etc.
      
      Packet sockets already support this as of commit c7d39e32
      ("packet: support per-packet fwmark for af_packet sendmsg").
      
      Similar to other fields, implement by
      1. initialize the sockcm_cookie.mark from socket option sk_mark
      2. optionally overwrite this in ip_cmsg_send/ip6_datagram_send_ctl
      3. initialize inet_cork.mark from sockcm_cookie.mark
      4. initialize each (usually just one) skb->mark from inet_cork.mark
      
      Step 1 is handled in one location for most protocols by ipcm_init_sk
      as of commit 35178206 ("ipv4: ipcm_cookie initializers").
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
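From user space, the per-packet mark described in this commit would be passed as ancillary data on sendmsg(). The sketch below only builds the message header, assuming the control message uses level SOL_SOCKET and type SO_MARK with a 32-bit value. `demo_prepare_marked_msg()` is a hypothetical helper, and actually sending with a mark is expected to need elevated privileges.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Fills *msg so that one datagram carries a per-packet SO_MARK.
 * ctrl must be suitably aligned and at least CMSG_SPACE(sizeof(uint32_t))
 * bytes long. The destination address is left unset (connected socket). */
static void demo_prepare_marked_msg(struct msghdr *msg, struct iovec *iov,
				    void *ctrl, size_t ctrl_len, uint32_t mark)
{
	struct cmsghdr *cm;

	assert(ctrl_len >= CMSG_SPACE(sizeof(mark)));
	memset(msg, 0, sizeof(*msg));
	memset(ctrl, 0, ctrl_len);

	msg->msg_iov = iov;
	msg->msg_iovlen = 1;
	msg->msg_control = ctrl;
	msg->msg_controllen = CMSG_SPACE(sizeof(mark));

	cm = CMSG_FIRSTHDR(msg);
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SO_MARK;
	cm->cmsg_len = CMSG_LEN(sizeof(mark));
	memcpy(CMSG_DATA(cm), &mark, sizeof(mark));
}
```

A caller would pass the prepared header to sendmsg() on a UDP or RAW socket. Without the control message, the mark falls back to the socket-level value set with setsockopt(SO_MARK), matching step 1 of the list above.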
  14. 13 Sep, 2019 1 commit
    • netfilter: fix coding-style errors. · b0edba2a
      Authored by Jeremy Sowden
      Several header-files, Kconfig files and Makefiles have trailing
      white-space.  Remove it.
      
      In netfilter/Kconfig, indent the type of CONFIG_NETFILTER_NETLINK_ACCT
      correctly.
      
      There are semicolons at the end of two function definitions in
      include/net/netfilter/nf_conntrack_acct.h and
      include/net/netfilter/nf_conntrack_ecache.h. Remove them.
      
      Fix indentation in nf_conntrack_l4proto.h.
      Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  15. 12 Sep, 2019 2 commits
    • tcp: force a PSH flag on TSO packets · 051ba674
      Authored by Eric Dumazet
      When tcp sends a TSO packet, adding a PSH flag on it
      reduces the sojourn time of GRO packet in GRO receivers.
      
      This is particularly the case under pressure, since RX queues
      receive packets for many concurrent flows.
      
      A sender can give a hint to GRO engines when it is
      appropriate to flush a super-packet, especially when pacing
      is in the picture, since next packet is probably delayed by
      one ms.
      
      Having fewer packets in the GRO engine reduces the chance
      of LRU eviction or inflated RTT, and reduces GRO cost.
      
      We found recently that we must not set the PSH flag on
      individual full-size MSS segments [1] :
      
       Under pressure (CWR state), we better let the packet sit
       for a small delay (depending on NAPI logic) so that the
       ACK packet is delayed, and thus next packet we send is
       also delayed a bit. Eventually the bottleneck queue can
       be drained. DCTCP flows with CWND=1 have demonstrated
       the issue.
      
      This patch makes it possible to slow down the aggregate traffic
      without involving high resolution timers on senders and/or
      receivers.
      
      It has been used at Google for about four years,
      and has been discussed at various networking conferences.
      
      [1] segments smaller than MSS already have PSH flag set
          by tcp_sendmsg() / tcp_mark_push(), unless MSG_MORE
          has been requested by the user.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix tcp_ecn_withdraw_cwr() to clear TCP_ECN_QUEUE_CWR · af38d07e
      Authored by Neal Cardwell
      Fix tcp_ecn_withdraw_cwr() to clear the correct bit:
      TCP_ECN_QUEUE_CWR.
      
      Rationale: basically, TCP_ECN_DEMAND_CWR is a bit that is purely about
      the behavior of data receivers, and deciding whether to reflect
      incoming IP ECN CE marks as outgoing TCP th->ece marks. The
      TCP_ECN_QUEUE_CWR bit is purely about the behavior of data senders,
      and deciding whether to send CWR. The tcp_ecn_withdraw_cwr() function
      is only called from tcp_undo_cwnd_reduction() by data senders during
      an undo, so it should zero the sender-side state,
      TCP_ECN_QUEUE_CWR. It does not make sense to stop the reflection of
      incoming CE bits on incoming data packets just because outgoing
      packets were spuriously retransmitted.
      
      The bug has been reproduced with packetdrill to manifest in a scenario
      with RFC3168 ECN, with an incoming data packet with CE bit set and
      carrying a TCP timestamp value that causes cwnd undo. Before this fix,
      the IP CE bit was ignored and not reflected in the TCP ECE header bit,
      and sender sent a TCP CWR ('W') bit on the next outgoing data packet,
      even though the cwnd reduction had been undone.  After this fix, the
      sender properly reflects the CE bit and does not set the W bit.
      
      Note: the bug actually predates 2005 git history; this Fixes footer is
      chosen to be the oldest SHA1 I have tested (from Sep 2007) for which
      the patch applies cleanly (since before this commit the code was in a
      .h file).
      
      Fixes: bdf1ee5d ("[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
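The distinction between the two bits can be shown with a small sketch. The flag values below are recalled from include/net/tcp.h and should be treated as assumptions. `demo_tp` and `demo_ecn_withdraw_cwr()` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed flag values, for illustration only. */
#define TCP_ECN_OK         1
#define TCP_ECN_QUEUE_CWR  2 /* sender side: a CWR is queued to be sent */
#define TCP_ECN_DEMAND_CWR 4 /* receiver side: keep echoing ECE */

struct demo_tp {
	uint8_t ecn_flags;
};

/* Sender-side undo: stop queueing a CWR and leave the receiver-side
 * TCP_ECN_DEMAND_CWR state untouched. */
static void demo_ecn_withdraw_cwr(struct demo_tp *tp)
{
	assert(tp != NULL);
	tp->ecn_flags &= (uint8_t)~TCP_ECN_QUEUE_CWR;
}
```

Clearing TCP_ECN_DEMAND_CWR here instead would stop the reflection of incoming CE marks, which is the bug the commit describes.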
  16. 07 Sep, 2019 2 commits
  17. 05 Sep, 2019 1 commit
    • net: Properly update v4 routes with v6 nexthop · 7bdf4de1
      Authored by Donald Sharp
      When creating a v4 route that uses a v6 nexthop from a nexthop group,
      allow the kernel to properly send the nexthop as v6 via the RTA_VIA
      attribute.
      
      Broken behavior:
      
      $ ip nexthop add via fe80::9 dev eth0
      $ ip nexthop show
      id 1 via fe80::9 dev eth0 scope link
      $ ip route add 4.5.6.7/32 nhid 1
      $ ip route show
      default via 10.0.2.2 dev eth0
      4.5.6.7 nhid 1 via 254.128.0.0 dev eth0
      10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
      $
      
      Fixed behavior:
      
      $ ip nexthop add via fe80::9 dev eth0
      $ ip nexthop show
      id 1 via fe80::9 dev eth0 scope link
      $ ip route add 4.5.6.7/32 nhid 1
      $ ip route show
      default via 10.0.2.2 dev eth0
      4.5.6.7 nhid 1 via inet6 fe80::9 dev eth0
      10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
      $
      
      v2, v3: Addresses code review comments from David Ahern
      
      Fixes: dcb1ecb5 ("ipv4: Prepare for fib6_nh from a nexthop object")
      Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 01 Sep, 2019 1 commit
  19. 31 Aug, 2019 1 commit
  20. 29 Aug, 2019 1 commit
  21. 28 Aug, 2019 1 commit
  22. 25 Aug, 2019 2 commits
    • net: route dump netlink NLM_F_MULTI flag missing · e93fb3e9
      Authored by John Fastabend
      An excerpt from netlink(7) man page,
      
        In multipart messages (multiple nlmsghdr headers with associated payload
        in one byte stream) the first and all following headers have the
        NLM_F_MULTI flag set, except for the last  header  which  has the type
        NLMSG_DONE.
      
      but, after (ee28906f) there is a missing NLM_F_MULTI flag in the middle of a
      FIB dump. The result is user space applications following above man page
      excerpt may get confused and may stop parsing msg believing something went
      wrong.
      
      In the golang netlink lib [0] the library logic stops parsing believing the
      message is not a multipart message. Found this running Cilium[1] against
      net-next while adding a feature to auto-detect routes. I noticed with
      multiple route tables we no longer could detect the default routes on net
      tree kernels because the library logic was not returning them.
      
      Fix this by handling the fib_dump_info_fnhe() case the same way the
      fib_dump_info() handles it by passing the flags argument through the
      call chain and adding a flags argument to rt_fill_info().
      
      Tested with Cilium stack and auto-detection of routes works again. Also
      annotated libs to dump netlink msgs and inspected NLM_F_MULTI and
      NLMSG_DONE flags look correct after this.
      
      Note: In inet_rtm_getroute() pass rt_fill_info() '0' for flags the same
      as is done for fib_dump_info() so this looks correct to me.
      
      [0] https://github.com/vishvananda/netlink/
      [1] https://github.com/cilium/
      
      Fixes: ee28906f ("ipv4: Dump route exceptions if requested")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
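The failure mode in user space can be sketched with a strict parser that follows the man page excerpt literally. This is an illustration only: `demo_count_multipart()` and `demo_put_msg()` are hypothetical helpers, the messages are header-only, and real libraries may apply different rules.

```c
#include <assert.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <stddef.h>
#include <string.h>

/* Counts the messages that precede NLMSG_DONE. Returns -1 when a message
 * without NLM_F_MULTI shows up mid-dump or the buffer ends early. */
static int demo_count_multipart(void *buf, size_t len)
{
	struct nlmsghdr *nlh = buf;
	int remaining = (int)len;
	int count = 0;

	assert(buf != NULL);
	for (; NLMSG_OK(nlh, remaining); nlh = NLMSG_NEXT(nlh, remaining)) {
		if (nlh->nlmsg_type == NLMSG_DONE)
			return count;
		if (!(nlh->nlmsg_flags & NLM_F_MULTI))
			return -1; /* a strict parser gives up here */
		count++;
	}
	return -1; /* ran out of data before NLMSG_DONE */
}

/* Appends one header-only message at offset off; returns the new offset. */
static size_t demo_put_msg(char *buf, size_t off, unsigned short type,
			   unsigned short flags)
{
	struct nlmsghdr nlh;

	memset(&nlh, 0, sizeof(nlh));
	nlh.nlmsg_len = NLMSG_HDRLEN;
	nlh.nlmsg_type = type;
	nlh.nlmsg_flags = flags;
	memcpy(buf + off, &nlh, sizeof(nlh));
	return off + NLMSG_ALIGN(nlh.nlmsg_len);
}
```

Feeding this parser a dump in which one middle message lacks NLM_F_MULTI makes it return -1 before reaching the remaining routes, which is the kind of behavior the commit message reports for the Go library.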
    • ipv4/icmp: fix rt dst dev null pointer dereference · e2c69393
      Authored by Hangbin Liu
      In __icmp_send() there is a possibility that rt->dst.dev is NULL,
      e.g. with tunnel collect_md mode, which will cause a kernel crash.
      Here is what the code path looks like, for GRE:
      
      - ip6gre_tunnel_xmit
        - ip6gre_xmit_ipv4
          - __gre6_xmit
            - ip6_tnl_xmit
              - if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
          - icmp_send
            - net = dev_net(rt->dst.dev); <-- here
      
      The reason is that __metadata_dst_init() initializes dst->dev to NULL
      by default. We could not fix it in __metadata_dst_init() as there is no
      dev supplied. On the other hand, the reason we need rt->dst.dev is to
      get the net. So we can just try to get it from skb->dev when
      rt->dst.dev is NULL.
      
      v4: Julian Anastasov pointed out that skb->dev could also be NULL. We'd
      better still use dst.dev and do a check to avoid a crash.
      
      v3: No changes.
      
      v2: fix the issue in __icmp_send() instead of updating shared dst dev
      in {ip_md, ip6}_tunnel_xmit.
      
      Fixes: c8b34e68 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit")
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Reviewed-by: Julian Anastasov <ja@ssi.bg>
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
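The approach settled on in v4, checking dst.dev before using it, can be sketched as follows. The structures and `demo_icmp_pick_net()` are hypothetical stand-ins written for this example. They are not the kernel's types or the actual patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct net, struct net_device, struct dst_entry. */
struct demo_net {
	int id;
};

struct demo_dev {
	struct demo_net *net;
};

struct demo_dst {
	struct demo_dev *dev; /* NULL for a metadata dst (collect_md mode) */
};

/* Returns the namespace to use for the ICMP reply, or NULL to tell the
 * caller to bail out instead of dereferencing a missing device. */
static struct demo_net *demo_icmp_pick_net(const struct demo_dst *dst)
{
	assert(dst != NULL);
	if (!dst->dev)
		return NULL;
	return dst->dev->net;
}
```

The caller treats a NULL result as "do not send", which avoids the crash without guessing a namespace from another source.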
  23. 23 Aug, 2019 1 commit
  24. 22 Aug, 2019 1 commit
  25. 21 Aug, 2019 1 commit
  26. 20 Aug, 2019 1 commit
  27. 10 Aug, 2019 1 commit
    • tcp: add new tcp_mtu_probe_floor sysctl · c04b79b6
      Authored by Josh Hunt
      The current implementation of TCP MTU probing can considerably
      underestimate the MTU on lossy connections allowing the MSS to get down to
      48. We have found that in almost all of these cases on our networks these
      paths can handle much larger MTUs meaning the connections are being
      artificially limited. Even though TCP MTU probing can raise the MSS back up
      we have seen this not to be the case causing connections to be "stuck" with
      an MSS of 48 when heavy loss is present.
      
      Prior to pushing out this change we could not keep TCP MTU probing enabled
      because of the above reasons. Now with a reasonable floor set we've had it
      enabled for the past 6 months.
      
      The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
      administrators the ability to control the floor of MSS probing.
      Signed-off-by: Josh Hunt <johunt@akamai.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
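The effect of a configurable floor can be sketched as a clamp applied each time probing lowers the MSS after loss. The halving step and the order of the clamps are assumptions about how the probing code behaves, and `demo_probe_mss_after_loss()` is a hypothetical helper. Only the default of 48 comes from the commit message.

```c
#include <assert.h>

#define DEMO_TCP_MIN_SND_MSS 48 /* default floor named in the commit message */

static int demo_min(int a, int b) { return a < b ? a : b; }
static int demo_max(int a, int b) { return a > b ? a : b; }

/* Lowers the probe MSS after a loss event (assumed: halve, cap at base_mss)
 * but never lets it drop below the administrator-chosen floor. */
static int demo_probe_mss_after_loss(int cur_mss, int base_mss, int probe_floor)
{
	int mss;

	assert(cur_mss > 0 && base_mss > 0 && probe_floor > 0);
	mss = cur_mss >> 1;
	mss = demo_min(base_mss, mss);
	mss = demo_max(mss, probe_floor);
	return mss;
}
```

With the default floor, repeated loss can walk the MSS down to 48. A floor of, for example, 512 would stop the descent there, which is the control the new sysctl gives administrators.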