1. 16 4月, 2020 5 次提交
    • E
      tcp: refine rule to allow EPOLLOUT generation under mem pressure · 4cf54ea1
      Eric Dumazet 提交于
      commit 216808c6ba6d00169fd2aa928ec3c0e63bef254f upstream.
      
      At the time commit ce5ec440 ("tcp: ensure epoll edge trigger
      wakeup when write queue is empty") was added to the kernel,
      we still had a single write queue, combining rtx and write queues.
      
      Once we moved the rtx queue into a separate rb-tree, testing
      if sk_write_queue is empty has been suboptimal.
      
      Indeed, if we have packets in the rtx queue, we probably want
      to delay the EPOLLOUT generation at the time incoming packets
      will free them, making room, but more importantly avoiding
      flooding application with EPOLLOUT events.
      
      Solution is to use tcp_rtx_and_write_queues_empty() helper.
      
      Fixes: 75c119af ("tcp: implement rb-tree based retransmit queue")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      4cf54ea1
    • P
      tcp: fix marked lost packets not being retransmitted · 5984aa9d
      Pengcheng Yang 提交于
      [ Upstream commit e176b1ba476cf36f723cfcc7a9e57f3cb47dec70 ]
      
      When the packet pointed to by retransmit_skb_hint is unlinked by ACK,
      retransmit_skb_hint will be set to NULL in tcp_clean_rtx_queue().
      If packet loss is detected at this time, retransmit_skb_hint will be set
      to point to the current packet loss in tcp_verify_retransmit_hint(),
      then the packets that were previously marked lost but not retransmitted
      due to the restriction of cwnd will be skipped and cannot be
      retransmitted.
      
      To fix this, when retransmit_skb_hint is NULL, retransmit_skb_hint can
      be reset only after all marked lost packets are retransmitted
      (retrans_out >= lost_out), otherwise we need to traverse from
      tcp_rtx_queue_head in tcp_xmit_retransmit_queue().
      
      Packetdrill to demonstrate:
      
      // Disable RACK and set max_reordering to keep things simple
          0 `sysctl -q net.ipv4.tcp_recovery=0`
         +0 `sysctl -q net.ipv4.tcp_max_reordering=3`
      
      // Establish a connection
         +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
        +.1 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
         +0 > S. 0:0(0) ack 1 <...>
       +.01 < . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
      // Send 8 data segments
         +0 write(4, ..., 8000) = 8000
         +0 > P. 1:8001(8000) ack 1
      
      // Enter recovery and 1:3001 is marked lost
       +.01 < . 1:1(0) ack 1 win 257 <sack 3001:4001,nop,nop>
         +0 < . 1:1(0) ack 1 win 257 <sack 5001:6001 3001:4001,nop,nop>
         +0 < . 1:1(0) ack 1 win 257 <sack 5001:7001 3001:4001,nop,nop>
      
      // Retransmit 1:1001, now retransmit_skb_hint points to 1001:2001
         +0 > . 1:1001(1000) ack 1
      
      // 1001:2001 was ACKed causing retransmit_skb_hint to be set to NULL
       +.01 < . 1:1(0) ack 2001 win 257 <sack 5001:8001 3001:4001,nop,nop>
      // Now retransmit_skb_hint points to 4001:5001 which is now marked lost
      
      // BUG: 2001:3001 was not retransmitted
         +0 > . 2001:3001(1000) ack 1
      Signed-off-by: NPengcheng Yang <yangpc@wangsu.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Tested-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      5984aa9d
    • F
      netfilter: arp_tables: init netns pointer in xt_tgdtor_param struct · 65c05700
      Florian Westphal 提交于
      commit 212e7f56605ef9688d0846db60c6c6ec06544095 upstream.
      
      An earlier commit (1b789577f655060d98d20e,
      "netfilter: arp_tables: init netns pointer in xt_tgchk_param struct")
      fixed missing net initialization for arptables, but turns out it was
      incomplete.  We can get a very similar struct net NULL deref during
      error unwinding:
      
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      RIP: 0010:xt_rateest_put+0xa1/0x440 net/netfilter/xt_RATEEST.c:77
       xt_rateest_tg_destroy+0x72/0xa0 net/netfilter/xt_RATEEST.c:175
       cleanup_entry net/ipv4/netfilter/arp_tables.c:509 [inline]
       translate_table+0x11f4/0x1d80 net/ipv4/netfilter/arp_tables.c:587
       do_replace net/ipv4/netfilter/arp_tables.c:981 [inline]
       do_arpt_set_ctl+0x317/0x650 net/ipv4/netfilter/arp_tables.c:1461
      
      Also init the netns pointer in xt_tgdtor_param struct.
      
      Fixes: add67461 ("netfilter: add struct net * to target parameters")
      Reported-by: syzbot+91bdd8eece0f6629ec8b@syzkaller.appspotmail.com
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      65c05700
    • F
      netfilter: arp_tables: init netns pointer in xt_tgchk_param struct · 0c8d6bc6
      Florian Westphal 提交于
      commit 1b789577f655060d98d20ed0c6f9fbd469d6ba63 upstream.
      
      We get crash when the targets checkentry function tries to make
      use of the network namespace pointer for arptables.
      
      When the net pointer got added back in 2010, only ip/ip6/ebtables were
      changed to initialize it, so arptables has this set to NULL.
      
      This isn't a problem for normal arptables because no existing
      arptables target has a checkentry function that makes use of par->net.
      
      However, direct users of the setsockopt interface can provide any
      target they want as long as its registered for ARP or UNPSEC protocols.
      
      syzkaller managed to send a semi-valid arptables rule for RATEEST target
      which is enough to trigger NULL deref:
      
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      RIP: xt_rateest_tg_checkentry+0x11d/0xb40 net/netfilter/xt_RATEEST.c:109
      [..]
       xt_check_target+0x283/0x690 net/netfilter/x_tables.c:1019
       check_target net/ipv4/netfilter/arp_tables.c:399 [inline]
       find_check_entry net/ipv4/netfilter/arp_tables.c:422 [inline]
       translate_table+0x1005/0x1d70 net/ipv4/netfilter/arp_tables.c:572
       do_replace net/ipv4/netfilter/arp_tables.c:977 [inline]
       do_arpt_set_ctl+0x310/0x640 net/ipv4/netfilter/arp_tables.c:1456
      
      Fixes: add67461 ("netfilter: add struct net * to target parameters")
      Reported-by: syzbot+d7358a458d8a81aee898@syzkaller.appspotmail.com
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      0c8d6bc6
    • P
      tcp: fix "old stuff" D-SACK causing SACK to be treated as D-SACK · c969e232
      Pengcheng Yang 提交于
      [ Upstream commit c9655008e7845bcfdaac10a1ed8554ec167aea88 ]
      
      When we receive a D-SACK, where the sequence number satisfies:
      	undo_marker <= start_seq < end_seq <= prior_snd_una
      we consider this is a valid D-SACK and tcp_is_sackblock_valid()
      returns true, then this D-SACK is discarded as "old stuff",
      but the variable first_sack_index is not marked as negative
      in tcp_sacktag_write_queue().
      
      If this D-SACK also carries a SACK that needs to be processed
      (for example, the previous SACK segment was lost), this SACK
      will be treated as a D-SACK in the following processing of
      tcp_sacktag_write_queue(), which will eventually lead to
      incorrect updates of undo_retrans and reordering.
      
      Fixes: fd6dad61 ("[TCP]: Earlier SACK block verification & simplify access to them")
      Signed-off-by: NPengcheng Yang <yangpc@wangsu.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      c969e232
  2. 15 4月, 2020 16 次提交
    • E
      tcp: do not send empty skb from tcp_write_xmit() · aa486176
      Eric Dumazet 提交于
      [ Upstream commit 1f85e6267caca44b30c54711652b0726fadbb131 ]
      
      Backport of commit fdfc5c8594c2 ("tcp: remove empty skb from
      write queue in error cases") in linux-4.14 stable triggered
      various bugs. One of them has been fixed in commit ba2ddb43f270
      ("tcp: Don't dequeue SYN/FIN-segments from write-queue"), but
      we still have crashes in some occasions.
      
      Root-cause is that when tcp_sendmsg() has allocated a fresh
      skb and could not append a fragment before being blocked
      in sk_stream_wait_memory(), tcp_write_xmit() might be called
      and decide to send this fresh and empty skb.
      
      Sending an empty packet is not only silly, it might have caused
      many issues we had in the past with tp->packets_out being
      out of sync.
      
      Fixes: c65f7f00 ("[TCP]: Simplify SKB data portion allocation with NETIF_F_SG.")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Christoph Paasch <cpaasch@apple.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      aa486176
    • E
      tcp/dccp: fix possible race __inet_lookup_established() · 7c74c073
      Eric Dumazet 提交于
      commit 8dbd76e79a16b45b2ccb01d2f2e08dbf64e71e40 upstream.
      
      Michal Kubecek and Firo Yang did a very nice analysis of crashes
      happening in __inet_lookup_established().
      
      Since a TCP socket can go from TCP_ESTABLISH to TCP_LISTEN
      (via a close()/socket()/listen() cycle) without a RCU grace period,
      I should not have changed listeners linkage in their hash table.
      
      They must use the nulls protocol (Documentation/RCU/rculist_nulls.txt),
      so that a lookup can detect a socket in a hash list was moved in
      another one.
      
      Since we added code in commit d296ba60 ("soreuseport: Resolve
      merge conflict for v4/v6 ordering fix"), we have to add
      hlist_nulls_add_tail_rcu() helper.
      
      Fixes: 3b24d854 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NMichal Kubecek <mkubecek@suse.cz>
      Reported-by: NFiro Yang <firo.yang@suse.com>
      Reviewed-by: NMichal Kubecek <mkubecek@suse.cz>
      Link: https://lore.kernel.org/netdev/20191120083919.GH27852@unicorn.suse.cz/Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      [stable-4.19: we also need to update code in __inet_lookup_listener() and
       inet6_lookup_listener() which has been removed in 5.0-rc1.]
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      7c74c073
    • H
      vti: do not confirm neighbor when do pmtu update · f576cf5d
      Hangbin Liu 提交于
      [ Upstream commit 8247a79efa2f28b44329f363272550c1738377de ]
      
      When do IPv6 tunnel PMTU update and calls __ip6_rt_update_pmtu() in the end,
      we should not call dst_confirm_neigh() as there is no two-way communication.
      
      Although vti and vti6 are immune to this problem because they are IFF_NOARP
      interfaces, as Guillaume pointed. There is still no sense to confirm neighbour
      here.
      
      v5: Update commit description.
      v4: No change.
      v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
          dst_ops.update_pmtu to control whether we should do neighbor confirm.
          Also split the big patch to small ones for each area.
      v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.
      Reviewed-by: NGuillaume Nault <gnault@redhat.com>
      Acked-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NHangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      f576cf5d
    • H
      tunnel: do not confirm neighbor when do pmtu update · f1b6f3ee
      Hangbin Liu 提交于
      [ Upstream commit 7a1592bcb15d71400a98632727791d1e68ea0ee8 ]
      
      When do tunnel PMTU update and calls __ip6_rt_update_pmtu() in the end,
      we should not call dst_confirm_neigh() as there is no two-way communication.
      
      v5: No Change.
      v4: Update commit description
      v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
          dst_ops.update_pmtu to control whether we should do neighbor confirm.
          Also split the big patch to small ones for each area.
      v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.
      
      Fixes: 0dec879f ("net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TP")
      Reviewed-by: NGuillaume Nault <gnault@redhat.com>
      Tested-by: NGuillaume Nault <gnault@redhat.com>
      Acked-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NHangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      f1b6f3ee
    • H
      net: add bool confirm_neigh parameter for dst_ops.update_pmtu · d9da81f2
      Hangbin Liu 提交于
      [ Upstream commit bd085ef678b2cc8c38c105673dfe8ff8f5ec0c57 ]
      
      The MTU update code is supposed to be invoked in response to real
      networking events that update the PMTU. In IPv6 PMTU update function
      __ip6_rt_update_pmtu() we called dst_confirm_neigh() to update neighbor
      confirmed time.
      
      But for tunnel code, it will call pmtu before xmit, like:
        - tnl_update_pmtu()
          - skb_dst_update_pmtu()
            - ip6_rt_update_pmtu()
              - __ip6_rt_update_pmtu()
                - dst_confirm_neigh()
      
      If the tunnel remote dst mac address changed and we still do the neigh
      confirm, we will not be able to update neigh cache and ping6 remote
      will failed.
      
      So for this ip_tunnel_xmit() case, _EVEN_ if the MTU is changed, we
      should not be invoking dst_confirm_neigh() as we have no evidence
      of successful two-way communication at this point.
      
      On the other hand it is also important to keep the neigh reachability fresh
      for TCP flows, so we cannot remove this dst_confirm_neigh() call.
      
      To fix the issue, we have to add a new bool parameter for dst_ops.update_pmtu
      to choose whether we should do neigh update or not. I will add the parameter
      in this patch and set all the callers to true to comply with the previous
      way, and fix the tunnel code one by one on later patches.
      
      v5: No change.
      v4: No change.
      v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
          dst_ops.update_pmtu to control whether we should do neighbor confirm.
          Also split the big patch to small ones for each area.
      v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.
      Suggested-by: NDavid Miller <davem@davemloft.net>
      Reviewed-by: NGuillaume Nault <gnault@redhat.com>
      Acked-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NHangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      d9da81f2
    • A
      udp: fix integer overflow while computing available space in sk_rcvbuf · 7501ac7a
      Antonio Messina 提交于
      [ Upstream commit feed8a4fc9d46c3126fb9fcae0e9248270c6321a ]
      
      When the size of the receive buffer for a socket is close to 2^31 when
      computing if we have enough space in the buffer to copy a packet from
      the queue to the buffer we might hit an integer overflow.
      
      When an user set net.core.rmem_default to a value close to 2^31 UDP
      packets are dropped because of this overflow. This can be visible, for
      instance, with failure to resolve hostnames.
      
      This can be fixed by casting sk_rcvbuf (which is an int) to unsigned
      int, similarly to how it is done in TCP.
      Signed-off-by: NAntonio Messina <amessina@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      7501ac7a
    • C
      tcp: Fix highest_sack and highest_sack_seq · 6ef6a350
      Cambda Zhu 提交于
      [ Upstream commit 853697504de043ff0bfd815bd3a64de1dce73dc7 ]
      
      >From commit 50895b9d ("tcp: highest_sack fix"), the logic about
      setting tp->highest_sack to the head of the send queue was removed.
      Of course the logic is error prone, but it is logical. Before we
      remove the pointer to the highest sack skb and use the seq instead,
      we need to set tp->highest_sack to NULL when there is no skb after
      the last sack, and then replace NULL with the real skb when new skb
      inserted into the rtx queue, because the NULL means the highest sack
      seq is tp->snd_nxt. If tp->highest_sack is NULL and new data sent,
      the next ACK with sack option will increase tp->reordering unexpectedly.
      
      This patch sets tp->highest_sack to the tail of the rtx queue if
      it's NULL and new data is sent. The patch keeps the rule that the
      highest_sack can only be maintained by sack processing, except for
      this only case.
      
      Fixes: 50895b9d ("tcp: highest_sack fix")
      Signed-off-by: NCambda Zhu <cambda@linux.alibaba.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      6ef6a350
    • E
      net: icmp: fix data-race in cmp_global_allow() · 78090826
      Eric Dumazet 提交于
      commit bbab7ef235031f6733b5429ae7877bfa22339712 upstream.
      
      This code reads two global variables without protection
      of a lock. We need READ_ONCE()/WRITE_ONCE() pairs to
      avoid load/store-tearing and better document the intent.
      
      KCSAN reported :
      BUG: KCSAN: data-race in icmp_global_allow / icmp_global_allow
      
      read to 0xffffffff861a8014 of 4 bytes by task 11201 on cpu 0:
       icmp_global_allow+0x36/0x1b0 net/ipv4/icmp.c:254
       icmpv6_global_allow net/ipv6/icmp.c:184 [inline]
       icmpv6_global_allow net/ipv6/icmp.c:179 [inline]
       icmp6_send+0x493/0x1140 net/ipv6/icmp.c:514
       icmpv6_send+0x71/0xb0 net/ipv6/ip6_icmp.c:43
       ip6_link_failure+0x43/0x180 net/ipv6/route.c:2640
       dst_link_failure include/net/dst.h:419 [inline]
       vti_xmit net/ipv4/ip_vti.c:243 [inline]
       vti_tunnel_xmit+0x27f/0xa50 net/ipv4/ip_vti.c:279
       __netdev_start_xmit include/linux/netdevice.h:4420 [inline]
       netdev_start_xmit include/linux/netdevice.h:4434 [inline]
       xmit_one net/core/dev.c:3280 [inline]
       dev_hard_start_xmit+0xef/0x430 net/core/dev.c:3296
       __dev_queue_xmit+0x14c9/0x1b60 net/core/dev.c:3873
       dev_queue_xmit+0x21/0x30 net/core/dev.c:3906
       neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
       neigh_output include/net/neighbour.h:511 [inline]
       ip6_finish_output2+0x7a6/0xec0 net/ipv6/ip6_output.c:116
       __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
       __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
       ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
       dst_output include/net/dst.h:436 [inline]
       ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
      
      write to 0xffffffff861a8014 of 4 bytes by task 11183 on cpu 1:
       icmp_global_allow+0x174/0x1b0 net/ipv4/icmp.c:272
       icmpv6_global_allow net/ipv6/icmp.c:184 [inline]
       icmpv6_global_allow net/ipv6/icmp.c:179 [inline]
       icmp6_send+0x493/0x1140 net/ipv6/icmp.c:514
       icmpv6_send+0x71/0xb0 net/ipv6/ip6_icmp.c:43
       ip6_link_failure+0x43/0x180 net/ipv6/route.c:2640
       dst_link_failure include/net/dst.h:419 [inline]
       vti_xmit net/ipv4/ip_vti.c:243 [inline]
       vti_tunnel_xmit+0x27f/0xa50 net/ipv4/ip_vti.c:279
       __netdev_start_xmit include/linux/netdevice.h:4420 [inline]
       netdev_start_xmit include/linux/netdevice.h:4434 [inline]
       xmit_one net/core/dev.c:3280 [inline]
       dev_hard_start_xmit+0xef/0x430 net/core/dev.c:3296
       __dev_queue_xmit+0x14c9/0x1b60 net/core/dev.c:3873
       dev_queue_xmit+0x21/0x30 net/core/dev.c:3906
       neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
       neigh_output include/net/neighbour.h:511 [inline]
       ip6_finish_output2+0x7a6/0xec0 net/ipv6/ip6_output.c:116
       __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
       __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
       ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 11183 Comm: syz-executor.2 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 4cdf507d ("icmp: add a global rate limitation")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      78090826
    • E
      inetpeer: fix data-race in inet_putpeer / inet_putpeer · b243ea6d
      Eric Dumazet 提交于
      commit 71685eb4ce80ae9c49eff82ca4dd15acab215de9 upstream.
      
      We need to explicitely forbid read/store tearing in inet_peer_gc()
      and inet_putpeer().
      
      The following syzbot report reminds us about inet_putpeer()
      running without a lock held.
      
      BUG: KCSAN: data-race in inet_putpeer / inet_putpeer
      
      write to 0xffff888121fb2ed0 of 4 bytes by interrupt on cpu 0:
       inet_putpeer+0x37/0xa0 net/ipv4/inetpeer.c:240
       ip4_frag_free+0x3d/0x50 net/ipv4/ip_fragment.c:102
       inet_frag_destroy_rcu+0x58/0x80 net/ipv4/inet_fragment.c:228
       __rcu_reclaim kernel/rcu/rcu.h:222 [inline]
       rcu_do_batch+0x256/0x5b0 kernel/rcu/tree.c:2157
       rcu_core+0x369/0x4d0 kernel/rcu/tree.c:2377
       rcu_core_si+0x12/0x20 kernel/rcu/tree.c:2386
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       invoke_softirq kernel/softirq.c:373 [inline]
       irq_exit+0xbb/0xe0 kernel/softirq.c:413
       exiting_irq arch/x86/include/asm/apic.h:536 [inline]
       smp_apic_timer_interrupt+0xe6/0x280 arch/x86/kernel/apic/apic.c:1137
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
       native_safe_halt+0xe/0x10 arch/x86/kernel/paravirt.c:71
       arch_cpu_idle+0x1f/0x30 arch/x86/kernel/process.c:571
       default_idle_call+0x1e/0x40 kernel/sched/idle.c:94
       cpuidle_idle_call kernel/sched/idle.c:154 [inline]
       do_idle+0x1af/0x280 kernel/sched/idle.c:263
      
      write to 0xffff888121fb2ed0 of 4 bytes by interrupt on cpu 1:
       inet_putpeer+0x37/0xa0 net/ipv4/inetpeer.c:240
       ip4_frag_free+0x3d/0x50 net/ipv4/ip_fragment.c:102
       inet_frag_destroy_rcu+0x58/0x80 net/ipv4/inet_fragment.c:228
       __rcu_reclaim kernel/rcu/rcu.h:222 [inline]
       rcu_do_batch+0x256/0x5b0 kernel/rcu/tree.c:2157
       rcu_core+0x369/0x4d0 kernel/rcu/tree.c:2377
       rcu_core_si+0x12/0x20 kernel/rcu/tree.c:2386
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       run_ksoftirqd+0x46/0x60 kernel/softirq.c:603
       smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165
       kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 4b9d9be8 ("inetpeer: remove unused list")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      b243ea6d
    • E
      tcp: md5: fix potential overestimation of TCP option space · 5408fb9c
      Eric Dumazet 提交于
      [ Upstream commit 9424e2e7 ]
      
      Back in 2008, Adam Langley fixed the corner case of packets for flows
      having all of the following options : MD5 TS SACK
      
      Since MD5 needs 20 bytes, and TS needs 12 bytes, no sack block
      can be cooked from the remaining 8 bytes.
      
      tcp_established_options() correctly sets opts->num_sack_blocks
      to zero, but returns 36 instead of 32.
      
      This means TCP cooks packets with 4 extra bytes at the end
      of options, containing unitialized bytes.
      
      Fixes: 33ad798c ("tcp: options clean up")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      5408fb9c
    • E
      inet: protect against too small mtu values. · 51b0ec7b
      Eric Dumazet 提交于
      [ Upstream commit 501a90c945103e8627406763dac418f20f3837b2 ]
      
      syzbot was once again able to crash a host by setting a very small mtu
      on loopback device.
      
      Let's make inetdev_valid_mtu() available in include/net/ip.h,
      and use it in ip_setup_cork(), so that we protect both ip_append_page()
      and __ip_append_data()
      
      Also add a READ_ONCE() when the device mtu is read.
      
      Pairs this lockless read with one WRITE_ONCE() in __dev_set_mtu(),
      even if other code paths might write over this field.
      
      Add a big comment in include/linux/netdevice.h about dev->mtu
      needing READ_ONCE()/WRITE_ONCE() annotations.
      
      Hopefully we will add the missing ones in followup patches.
      
      [1]
      
      refcount_t: saturated; leaking memory.
      WARNING: CPU: 0 PID: 9464 at lib/refcount.c:22 refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 0 PID: 9464 Comm: syz-executor850 Not tainted 5.4.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x197/0x210 lib/dump_stack.c:118
       panic+0x2e3/0x75c kernel/panic.c:221
       __warn.cold+0x2f/0x3e kernel/panic.c:582
       report_bug+0x289/0x300 lib/bug.c:195
       fixup_bug arch/x86/kernel/traps.c:174 [inline]
       fixup_bug arch/x86/kernel/traps.c:169 [inline]
       do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:267
       do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:286
       invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
      RIP: 0010:refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
      Code: 06 31 ff 89 de e8 c8 f5 e6 fd 84 db 0f 85 6f ff ff ff e8 7b f4 e6 fd 48 c7 c7 e0 71 4f 88 c6 05 56 a6 a4 06 01 e8 c7 a8 b7 fd <0f> 0b e9 50 ff ff ff e8 5c f4 e6 fd 0f b6 1d 3d a6 a4 06 31 ff 89
      RSP: 0018:ffff88809689f550 EFLAGS: 00010286
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffffffff815e4336 RDI: ffffed1012d13e9c
      RBP: ffff88809689f560 R08: ffff88809c50a3c0 R09: fffffbfff15d31b1
      R10: fffffbfff15d31b0 R11: ffffffff8ae98d87 R12: 0000000000000001
      R13: 0000000000040100 R14: ffff888099041104 R15: ffff888218d96e40
       refcount_add include/linux/refcount.h:193 [inline]
       skb_set_owner_w+0x2b6/0x410 net/core/sock.c:1999
       sock_wmalloc+0xf1/0x120 net/core/sock.c:2096
       ip_append_page+0x7ef/0x1190 net/ipv4/ip_output.c:1383
       udp_sendpage+0x1c7/0x480 net/ipv4/udp.c:1276
       inet_sendpage+0xdb/0x150 net/ipv4/af_inet.c:821
       kernel_sendpage+0x92/0xf0 net/socket.c:3794
       sock_sendpage+0x8b/0xc0 net/socket.c:936
       pipe_to_sendpage+0x2da/0x3c0 fs/splice.c:458
       splice_from_pipe_feed fs/splice.c:512 [inline]
       __splice_from_pipe+0x3ee/0x7c0 fs/splice.c:636
       splice_from_pipe+0x108/0x170 fs/splice.c:671
       generic_splice_sendpage+0x3c/0x50 fs/splice.c:842
       do_splice_from fs/splice.c:861 [inline]
       direct_splice_actor+0x123/0x190 fs/splice.c:1035
       splice_direct_to_actor+0x3b4/0xa30 fs/splice.c:990
       do_splice_direct+0x1da/0x2a0 fs/splice.c:1078
       do_sendfile+0x597/0xd00 fs/read_write.c:1464
       __do_sys_sendfile64 fs/read_write.c:1525 [inline]
       __se_sys_sendfile64 fs/read_write.c:1511 [inline]
       __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x441409
      Code: e8 ac e8 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fffb64c4f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441409
      RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000005
      RBP: 0000000000073b8a R08: 0000000000000010 R09: 0000000000000010
      R10: 0000000000010001 R11: 0000000000000246 R12: 0000000000402180
      R13: 0000000000402210 R14: 0000000000000000 R15: 0000000000000000
      Kernel Offset: disabled
      Rebooting in 86400 seconds..
      
      Fixes: 1470ddf7 ("inet: Remove explicit write references to sk/inet in ip_append_data")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      51b0ec7b
    • C
      gre: refetch erspan header from skb->data after pskb_may_pull() · 22f907b9
      Cong Wang 提交于
      [ Upstream commit 0e4940928c26527ce8f97237fef4c8a91cd34207 ]
      
      After pskb_may_pull() we should always refetch the header
      pointers from the skb->data in case it got reallocated.
      
      In gre_parse_header(), the erspan header is still fetched
      from the 'options' pointer which is fetched before
      pskb_may_pull().
      
      Found this during code review of a KMSAN bug report.
      
      Fixes: cb73ee40b1b3 ("net: ip_gre: use erspan key field for tunnel lookup")
      Cc: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NLorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Acked-by: NWilliam Tu <u9012063@gmail.com>
      Reviewed-by: NSimon Horman <simon.horman@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      22f907b9
    • E
      net: silence KCSAN warnings around sk_add_backlog() calls · b25fcaeb
      Eric Dumazet 提交于
      mainline inclusion
      from mainline-5.4-rc4
      commit 8265792bf887
      category: bugfix
      bugzilla: 24071
      CVE: NA
      
      -------------------------------------------------
      
      sk_add_backlog() callers usually read sk->sk_rcvbuf without
      owning the socket lock. This means sk_rcvbuf value can
      be changed by other cpus, and KCSAN complains.
      
      Add READ_ONCE() annotations to document the lockless nature
      of these reads.
      
      Note that writes over sk_rcvbuf should also use WRITE_ONCE(),
      but this will be done in separate patches to ease stable
      backports (if we decide this is relevant for stable trees).
      
      BUG: KCSAN: data-race in tcp_add_backlog / tcp_recvmsg
      
      write to 0xffff88812ab369f8 of 8 bytes by interrupt on cpu 1:
       __sk_add_backlog include/net/sock.h:902 [inline]
       sk_add_backlog include/net/sock.h:933 [inline]
       tcp_add_backlog+0x45a/0xcc0 net/ipv4/tcp_ipv4.c:1737
       tcp_v4_rcv+0x1aba/0x1bf0 net/ipv4/tcp_ipv4.c:1925
       ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
       netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
       napi_skb_finish net/core/dev.c:5671 [inline]
       napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
       receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
       virtnet_receive drivers/net/virtio_net.c:1323 [inline]
       virtnet_poll+0x436/0x7d0 drivers/net/virtio_net.c:1428
       napi_poll net/core/dev.c:6352 [inline]
       net_rx_action+0x3ae/0xa50 net/core/dev.c:6418
      
      read to 0xffff88812ab369f8 of 8 bytes by task 7271 on cpu 0:
       tcp_recvmsg+0x470/0x1a30 net/ipv4/tcp.c:2047
       inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
       sock_recvmsg_nosec net/socket.c:871 [inline]
       sock_recvmsg net/socket.c:889 [inline]
       sock_recvmsg+0x92/0xb0 net/socket.c:885
       sock_read_iter+0x15f/0x1e0 net/socket.c:967
       call_read_iter include/linux/fs.h:1864 [inline]
       new_sync_read+0x389/0x4f0 fs/read_write.c:414
       __vfs_read+0xb1/0xc0 fs/read_write.c:427
       vfs_read fs/read_write.c:461 [inline]
       vfs_read+0x143/0x2c0 fs/read_write.c:446
       ksys_read+0xd5/0x1b0 fs/read_write.c:587
       __do_sys_read fs/read_write.c:597 [inline]
       __se_sys_read fs/read_write.c:595 [inline]
       __x64_sys_read+0x4c/0x60 fs/read_write.c:595
       do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 7271 Comm: syz-fuzzer Not tainted 5.3.0+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: NHuang Guobin <huangguobin4@huawei.com>
      Reviewed-by: NWenan Mao <maowenan@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NXie XiuQi <xiexiuqi@huawei.com>
      b25fcaeb
    • Y
      tcp: fix SNMP TCP timeout under-estimation · 5dfc8284
      Yuchung Cheng 提交于
      [ Upstream commit e1561fe2dd69dc5dddd69bd73aa65355bdfb048b ]
      
      Previously the SNMP TCPTIMEOUTS counter has inconsistent accounting:
      1. It counts all SYN and SYN-ACK timeouts
      2. It counts timeouts in other states except recurring timeouts and
         timeouts after fast recovery or disorder state.
      
      Such selective accounting makes analysis difficult and complicated. For
      example the monitoring system needs to collect many other SNMP counters
      to infer the total amount of timeout events. This patch makes TCPTIMEOUTS
      counter simply counts all the retransmit timeout (SYN or data or FIN).
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      5dfc8284
    • Y
      tcp: fix SNMP under-estimation on failed retransmission · bcd8b0ce
      Yuchung Cheng 提交于
      [ Upstream commit ec641b39457e17774313b66697a8a1dc070257bd ]
      
      Previously the SNMP counter LINUX_MIB_TCPRETRANSFAIL is not counting
      the TSO/GSO properly on failed retransmission. This patch fixes that.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      bcd8b0ce
    • Y
      tcp: fix off-by-one bug on aborting window-probing socket · c7fb5d94
      Yuchung Cheng 提交于
      [ Upstream commit 3976535af0cb9fe34a55f2ffb8d7e6b39a2f8188 ]
      
      Previously there is an off-by-one bug on determining when to abort
      a stalled window-probing socket. This patch fixes that so it is
      consistent with tcp_write_timeout().
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      c7fb5d94
  3. 14 4月, 2020 7 次提交
    • J
      net: don't clear sock->sk early to avoid trouble in strparser · d69702fd
      Jakub Kicinski 提交于
      mainline inclusion
      from mainline-v5.2-rc3
      commit 2b81f8161dfeda4017cef4f2498ccb64b13f0d61
      category: bugfix
      bugzilla: 16972
      CVE: NA
      
      -------------------------------------
      
      af_inet sets sock->sk to NULL which trips strparser over:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000012
      PGD 0 P4D 0
      Oops: 0000 [#1] SMP PTI
      CPU: 7 PID: 0 Comm: swapper/7 Not tainted 5.2.0-rc1-00139-g14629453a6d3 #21
      RIP: 0010:tcp_peek_len+0x10/0x60
      RSP: 0018:ffffc02e41c54b98 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: ffff9cf924c4e030 RCX: 0000000000000051
      RDX: 0000000000000000 RSI: 000000000000000c RDI: ffff9cf97128f480
      RBP: ffff9cf9365e0300 R08: ffff9cf94fe7d2c0 R09: 0000000000000000
      R10: 000000000000036b R11: ffff9cf939735e00 R12: ffff9cf91ad9ae40
      R13: ffff9cf924c4e000 R14: ffff9cf9a8fcbaae R15: 0000000000000020
      FS: 0000000000000000(0000) GS:ffff9cf9af7c0000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000012 CR3: 000000013920a003 CR4: 00000000003606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
       <IRQ>
       strp_data_ready+0x48/0x90
       tls_data_ready+0x22/0xd0 [tls]
       tcp_rcv_established+0x569/0x620
       tcp_v4_do_rcv+0x127/0x1e0
       tcp_v4_rcv+0xad7/0xbf0
       ip_protocol_deliver_rcu+0x2c/0x1c0
       ip_local_deliver_finish+0x41/0x50
       ip_local_deliver+0x6b/0xe0
       ? ip_protocol_deliver_rcu+0x1c0/0x1c0
       ip_rcv+0x52/0xd0
       ? ip_rcv_finish_core.isra.20+0x380/0x380
       __netif_receive_skb_one_core+0x7e/0x90
       netif_receive_skb_internal+0x42/0xf0
       napi_gro_receive+0xed/0x150
       nfp_net_poll+0x7a2/0xd30 [nfp]
       ? kmem_cache_free_bulk+0x286/0x310
       net_rx_action+0x149/0x3b0
       __do_softirq+0xe3/0x30a
       ? handle_irq_event_percpu+0x6a/0x80
       irq_exit+0xe8/0xf0
       do_IRQ+0x85/0xd0
       common_interrupt+0xf/0xf
       </IRQ>
      RIP: 0010:cpuidle_enter_state+0xbc/0x450
      
      To avoid this issue set sock->sk after sk_prot->close.
      My grepping and testing did not discover any code which
      would depend on the current behaviour.
      
      Fixes: c46234eb ("tls: RX path for ktls")
      Reported-by: NDavid Beckett <david.beckett@netronome.com>
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: NDirk van der Merwe <dirk.vandermerwe@netronome.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: Nguodeqing <geffrey.guo@huawei.com>
      Reviewed-by: NWenan Mao <maowenan@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      d69702fd
    • E
      tcp: annotate tp->rcv_nxt lockless reads · a0adc066
      Eric Dumazet 提交于
      mainline inclusion
      from mainline-v5.4-rc4
      commit dba7d9b8c739df27ff3a234c81d6c6b23e3986fa
      category: bugfix
      bugzilla: 24157
      CVE: NA
      
      -------------------------------------
      
      There are few places where we fetch tp->rcv_nxt while
      this field can change from IRQ or other cpu.
      
      We need to add READ_ONCE() annotations, and also make
      sure write sides use corresponding WRITE_ONCE() to avoid
      store-tearing.
      
      Note that tcp_inq_hint() was already using READ_ONCE(tp->rcv_nxt)
      
      syzbot reported :
      
      BUG: KCSAN: data-race in tcp_poll / tcp_queue_rcv
      
      write to 0xffff888120425770 of 4 bytes by interrupt on cpu 0:
       tcp_rcv_nxt_update net/ipv4/tcp_input.c:3365 [inline]
       tcp_queue_rcv+0x180/0x380 net/ipv4/tcp_input.c:4638
       tcp_rcv_established+0xbf1/0xf50 net/ipv4/tcp_input.c:5616
       tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1542
       tcp_v4_rcv+0x1a03/0x1bf0 net/ipv4/tcp_ipv4.c:1923
       ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
       netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
       napi_skb_finish net/core/dev.c:5671 [inline]
       napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
       receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
      
      read to 0xffff888120425770 of 4 bytes by task 7254 on cpu 1:
       tcp_stream_is_readable net/ipv4/tcp.c:480 [inline]
       tcp_poll+0x204/0x6b0 net/ipv4/tcp.c:554
       sock_poll+0xed/0x250 net/socket.c:1256
       vfs_poll include/linux/poll.h:90 [inline]
       ep_item_poll.isra.0+0x90/0x190 fs/eventpoll.c:892
       ep_send_events_proc+0x113/0x5c0 fs/eventpoll.c:1749
       ep_scan_ready_list.constprop.0+0x189/0x500 fs/eventpoll.c:704
       ep_send_events fs/eventpoll.c:1793 [inline]
       ep_poll+0xe3/0x900 fs/eventpoll.c:1930
       do_epoll_wait+0x162/0x180 fs/eventpoll.c:2294
       __do_sys_epoll_pwait fs/eventpoll.c:2325 [inline]
       __se_sys_epoll_pwait fs/eventpoll.c:2311 [inline]
       __x64_sys_epoll_pwait+0xcd/0x170 fs/eventpoll.c:2311
       do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 7254 Comm: syz-fuzzer Not tainted 5.3.0+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: Nguodeqing <geffrey.guo@huawei.com>
      Reviewed-by: NWenan Mao <maowenan@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      a0adc066
    • E
      tcp: annotate lockless access to tcp_memory_pressure · b59127ac
      Eric Dumazet 提交于
      mainline inclusion
      from mainline-v5.4-rc4
      commit 1f142c17
      category: bugfix
      bugzilla: 24138
      CVE: NA
      
      -------------------------------------
      
      tcp_memory_pressure is read without holding any lock,
      and its value could be changed on other cpus.
      
      Use READ_ONCE() to annotate these lockless reads.
      
      The write side is already using atomic ops.
      
      Fixes: b8da51eb ("tcp: introduce tcp_under_memory_pressure()")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Nguodeqing <geffrey.guo@huawei.com>
      Reviewed-by: NWenan Mao <maowenan@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      b59127ac
    • Y
      tcp: exit if nothing to retransmit on RTO timeout · c23a2ee8
      Yuchung Cheng 提交于
      commit 88f8598d0a302a08380eadefd09b9f5cb1c4c428 upstream.
      
      Previously TCP only warns if its RTO timer fires and the
      retransmission queue is empty, but it'll cause null pointer
      reference later on. It's better to avoid such catastrophic failure
      and simply exit with a warning.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NNeal Cardwell <ncardwell@google.com>
      Reviewed-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      c23a2ee8
    • W
      ip_tunnel: Make none-tunnel-dst tunnel port work with lwtunnel · 3c76bb25
      wenxu 提交于
      [ Upstream commit d71b57532d70c03f4671dd04e84157ac6bf021b0 ]
      
      ip l add dev tun type gretap key 1000
      ip a a dev tun 10.0.0.1/24
      
      Packets with tun-id 1000 can be recived by tun dev. But packet can't
      be sent through dev tun for non-tunnel-dst
      
      With this patch: tunnel-dst can be get through lwtunnel like beflow:
      ip r a 10.0.0.7 encap ip dst 172.168.0.11 dev tun
      Signed-off-by: Nwenxu <wenxu@ucloud.cn>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      3c76bb25
    • T
      net: bpfilter: fix iptables failure if bpfilter_umh is disabled · 841578a1
      Taehee Yoo 提交于
      [ Upstream commit 97adaddaa6db7a8af81b9b11e30cbe3628cd6700 ]
      
      When iptables command is executed, ip_{set/get}sockopt() try to upload
      bpfilter.ko if bpfilter is enabled. if it couldn't find bpfilter.ko,
      command is failed.
      bpfilter.ko is generated if CONFIG_BPFILTER_UMH is enabled.
      ip_{set/get}sockopt() only checks CONFIG_BPFILTER.
      So that if CONFIG_BPFILTER is enabled and CONFIG_BPFILTER_UMH is disabled,
      iptables command is always failed.
      
      test config:
         CONFIG_BPFILTER=y
         # CONFIG_BPFILTER_UMH is not set
      
      test command:
         %iptables -L
         iptables: No chain/target/match by that name.
      
      Fixes: d2ba09c1 ("net: add skeleton of bpfilter kernel module")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      841578a1
    • H
      ipv4/igmp: fix v1/v2 switchback timeout based on rfc3376, 8.12 · c4930a2f
      Hangbin Liu 提交于
      [ Upstream commit 966c37f2d77eb44d47af8e919267b1ba675b2eca ]
      
      Similiar with ipv6 mcast commit 89225d1c ("net: ipv6: mld: fix v1/v2
      switchback timeout to rfc3810, 9.12.")
      
      i) RFC3376 8.12. Older Version Querier Present Timeout says:
      
         The Older Version Querier Interval is the time-out for transitioning
         a host back to IGMPv3 mode once an older version query is heard.
         When an older version query is received, hosts set their Older
         Version Querier Present Timer to Older Version Querier Interval.
      
         This value MUST be ((the Robustness Variable) times (the Query
         Interval in the last Query received)) plus (one Query Response
         Interval).
      
      Currently we only use a hardcode value IGMP_V1/v2_ROUTER_PRESENT_TIMEOUT.
      Fix it by adding two new items mr_qi(Query Interval) and mr_qri(Query Response
      Interval) in struct in_device.
      
      Now we can calculate the switchback time via (mr_qrv * mr_qi) + mr_qri.
      We need update these values when receive IGMPv3 queries.
      Reported-by: NYing Xu <yinxu@redhat.com>
      Signed-off-by: NHangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      c4930a2f
  4. 27 12月, 2019 12 次提交
    • Y
      tcp: start receiver buffer autotuning sooner · ec0bd6ee
      Yuchung Cheng 提交于
      [ Upstream commit 041a14d2671573611ffd6412bc16e2f64469f7fb ]
      
      Previously receiver buffer auto-tuning starts after receiving
      one advertised window amount of data. After the initial receiver
      buffer was raised by patch a337531b942b ("tcp: up initial rmem to
      128KB and SYN rwin to around 64KB"), the reciver buffer may take
      too long to start raising. To address this issue, this patch lowers
      the initial bytes expected to receive roughly the expected sender's
      initial window.
      
      Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      ec0bd6ee
    • Y
      tcp: up initial rmem to 128KB and SYN rwin to around 64KB · 3c975d25
      Yuchung Cheng 提交于
      [ Upstream commit a337531b942bd8a03e7052444d7e36972aac2d92 ]
      
      Previously TCP initial receive buffer is ~87KB by default and
      the initial receive window is ~29KB (20 MSS). This patch changes
      the two numbers to 128KB and ~64KB (rounding down to the multiples
      of MSS) respectively. The patch also simplifies the calculations s.t.
      the two numbers are directly controlled by sysctl tcp_rmem[1]:
      
        1) Initial receiver buffer budget (sk_rcvbuf): while this should
           be configured via sysctl tcp_rmem[1], previously tcp_fixup_rcvbuf()
           always override and set a larger size when a new connection
           establishes.
      
        2) Initial receive window in SYN: previously it is set to 20
           packets if MSS <= 1460. The number 20 was based on the initial
           congestion window of 10: the receiver needs twice amount to
           avoid being limited by the receive window upon out-of-order
           delivery in the first window burst. But since this only
           applies if the receiving MSS <= 1460, connection using large MTU
           (e.g. to utilize receiver zero-copy) may be limited by the
           receive window.
      
      With this patch TCP memory configuration is more straight-forward and
      more properly sized to modern high-speed networks by default. Several
      popular stacks have been announcing 64KB rwin in SYNs as well.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      3c975d25
    • T
      netfilter: masquerade: don't flush all conntracks if only one address deleted on device · fae74587
      Tan Hu 提交于
      [ Upstream commit 097f95d319f817e651bd51f8846aced92a55a6a1 ]
      
      We configured iptables as below, which only allowed incoming data on
      established connections:
      
      iptables -t mangle -A PREROUTING -m state --state ESTABLISHED -j ACCEPT
      iptables -t mangle -P PREROUTING DROP
      
      When deleting a secondary address, current masquerade implements would
      flush all conntracks on this device. All the established connections on
      primary address also be deleted, then subsequent incoming data on the
      connections would be dropped wrongly because it was identified as NEW
      connection.
      
      So when an address was delete, it should only flush connections related
      with the address.
      Signed-off-by: NTan Hu <tan.hu@zte.com.cn>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      fae74587
    • G
      ipmr: Fix skb headroom in ipmr_get_route(). · 33931289
      Guillaume Nault 提交于
      [ Upstream commit 7901cd97963d6cbde88fa25a4a446db3554c16c6 ]
      
      In route.c, inet_rtm_getroute_build_skb() creates an skb with no
      headroom. This skb is then used by inet_rtm_getroute() which may pass
      it to rt_fill_info() and, from there, to ipmr_get_route(). The later
      might try to reuse this skb by cloning it and prepending an IPv4
      header. But since the original skb has no headroom, skb_push() triggers
      skb_under_panic():
      
      skbuff: skb_under_panic: text:00000000ca46ad8a len:80 put:20 head:00000000cd28494e data:000000009366fd6b tail:0x3c end:0xec0 dev:veth0
      ------------[ cut here ]------------
      kernel BUG at net/core/skbuff.c:108!
      invalid opcode: 0000 [#1] SMP KASAN PTI
      CPU: 6 PID: 587 Comm: ip Not tainted 5.4.0-rc6+ #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
      RIP: 0010:skb_panic+0xbf/0xd0
      Code: 41 a2 ff 8b 4b 70 4c 8b 4d d0 48 c7 c7 20 76 f5 8b 44 8b 45 bc 48 8b 55 c0 48 8b 75 c8 41 54 41 57 41 56 41 55 e8 75 dc 7a ff <0f> 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
      RSP: 0018:ffff888059ddf0b0 EFLAGS: 00010286
      RAX: 0000000000000086 RBX: ffff888060a315c0 RCX: ffffffff8abe4822
      RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff88806c9a79cc
      RBP: ffff888059ddf118 R08: ffffed100d9361b1 R09: ffffed100d9361b0
      R10: ffff88805c68aee3 R11: ffffed100d9361b1 R12: ffff88805d218000
      R13: ffff88805c689fec R14: 000000000000003c R15: 0000000000000ec0
      FS:  00007f6af184b700(0000) GS:ffff88806c980000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007ffc8204a000 CR3: 0000000057b40006 CR4: 0000000000360ee0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       skb_push+0x7e/0x80
       ipmr_get_route+0x459/0x6fa
       rt_fill_info+0x692/0x9f0
       inet_rtm_getroute+0xd26/0xf20
       rtnetlink_rcv_msg+0x45d/0x630
       netlink_rcv_skb+0x1a5/0x220
       rtnetlink_rcv+0x15/0x20
       netlink_unicast+0x305/0x3a0
       netlink_sendmsg+0x575/0x730
       sock_sendmsg+0xb5/0xc0
       ___sys_sendmsg+0x497/0x4f0
       __sys_sendmsg+0xcb/0x150
       __x64_sys_sendmsg+0x48/0x50
       do_syscall_64+0xd2/0xac0
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Actually the original skb used to have enough headroom, but the
      reserve_skb() call was lost with the introduction of
      inet_rtm_getroute_build_skb() by commit 404eb77e ("ipv4: support
      sport, dport and ip_proto in RTM_GETROUTE").
      
      We could reserve some headroom again in inet_rtm_getroute_build_skb(),
      but this function shouldn't be responsible for handling the special
      case of ipmr_get_route(). Let's handle that directly in
      ipmr_get_route() by calling skb_realloc_headroom() instead of
      skb_clone().
      
      Fixes: 404eb77e ("ipv4: support sport, dport and ip_proto in RTM_GETROUTE")
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      33931289
    • L
      configs: Delete CONFIG_ARCH_ASCEND for versatility code · 798f2efd
      Lijun Fang 提交于
      ascend inclusion
      category: feature
      bugzilla: NA
      CVE: NA
      
      --------
      
      Delete CONFIG_ARCH_ASCEND for versatility code
      Signed-off-by: NLijun Fang <fanglijun3@huawei.com>
      Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: NWenan Mao <maowenan@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      798f2efd
    • D
      ipv4: Fix table id reference in fib_sync_down_addr · 087e1330
      David Ahern 提交于
      [ Upstream commit e0a312629fefa943534fc46f7bfbe6de3fdaf463 ]
      
      Hendrik reported routes in the main table using source address are not
      removed when the address is removed. The problem is that fib_sync_down_addr
      does not account for devices in the default VRF which are associated
      with the main table. Fix by updating the table id reference.
      
      Fixes: 5a56a0b3 ("net: Don't delete routes in different VRFs")
      Reported-by: NHendrik Donner <hd@os-cillation.de>
      Signed-off-by: NDavid Ahern <dsahern@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      087e1330
    • P
      ipv4: fix route update on metric change. · 944216dd
      Paolo Abeni 提交于
      [ Upstream commit 0b834ba00ab5337e938c727e216e1f5249794717 ]
      
      Since commit af4d768a ("net/ipv4: Add support for specifying metric
      of connected routes"), when updating an IP address with a different metric,
      the associated connected route is updated, too.
      
      Still, the mentioned commit doesn't handle properly some corner cases:
      
      $ ip addr add dev eth0 192.168.1.0/24
      $ ip addr add dev eth0 192.168.2.1/32 peer 192.168.2.2
      $ ip addr add dev eth0 192.168.3.1/24
      $ ip addr change dev eth0 192.168.1.0/24 metric 10
      $ ip addr change dev eth0 192.168.2.1/32 peer 192.168.2.2 metric 10
      $ ip addr change dev eth0 192.168.3.1/24 metric 10
      $ ip -4 route
      192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.0
      192.168.2.2 dev eth0 proto kernel scope link src 192.168.2.1
      192.168.3.0/24 dev eth0 proto kernel scope link src 192.168.2.1 metric 10
      
      Only the last route is correctly updated.
      
      The problem is the current test in fib_modify_prefix_metric():
      
      	if (!(dev->flags & IFF_UP) ||
      	    ifa->ifa_flags & (IFA_F_SECONDARY | IFA_F_NOPREFIXROUTE) ||
      	    ipv4_is_zeronet(prefix) ||
      	    prefix == ifa->ifa_local || ifa->ifa_prefixlen == 32)
      
      Which should be the logical 'not' of the pre-existing test in
      fib_add_ifaddr():
      
      	if (!ipv4_is_zeronet(prefix) && !(ifa->ifa_flags & IFA_F_SECONDARY) &&
      	    (prefix != addr || ifa->ifa_prefixlen < 32))
      
      To properly negate the original expression, we need to change the last
      logical 'or' to a logical 'and'.
      
      Fixes: af4d768a ("net/ipv4: Add support for specifying metric of connected routes")
      Reported-and-suggested-by: NBeniamino Galvani <bgalvani@redhat.com>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      944216dd
    • E
      net: use skb_queue_empty_lockless() in busy poll contexts · 027c3d07
      Eric Dumazet 提交于
      [ Upstream commit 3f926af3f4d688e2e11e7f8ed04e277a14d4d4a4 ]
      
      Busy polling usually runs without locks.
      Let's use skb_queue_empty_lockless() instead of skb_queue_empty()
      
      Also uses READ_ONCE() in __skb_try_recv_datagram() to address
      a similar potential problem.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      027c3d07
    • E
      net: use skb_queue_empty_lockless() in poll() handlers · a08bf33a
      Eric Dumazet 提交于
      [ Upstream commit 3ef7cf57 ]
      
      Many poll() handlers are lockless. Using skb_queue_empty_lockless()
      instead of skb_queue_empty() is more appropriate.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      a08bf33a
    • E
      udp: use skb_queue_empty_lockless() · 41955f50
      Eric Dumazet 提交于
      [ Upstream commit 137a0dbe3426fd7bcfe3f8117b36a87b3590e4eb ]
      
      syzbot reported a data-race [1].
      
      We should use skb_queue_empty_lockless() to document that we are
      not ensuring a mutual exclusion and silence KCSAN.
      
      [1]
      BUG: KCSAN: data-race in __skb_recv_udp / __udp_enqueue_schedule_skb
      
      write to 0xffff888122474b50 of 8 bytes by interrupt on cpu 0:
       __skb_insert include/linux/skbuff.h:1852 [inline]
       __skb_queue_before include/linux/skbuff.h:1958 [inline]
       __skb_queue_tail include/linux/skbuff.h:1991 [inline]
       __udp_enqueue_schedule_skb+0x2c1/0x410 net/ipv4/udp.c:1470
       __udp_queue_rcv_skb net/ipv4/udp.c:1940 [inline]
       udp_queue_rcv_one_skb+0x7bd/0xc70 net/ipv4/udp.c:2057
       udp_queue_rcv_skb+0xb5/0x400 net/ipv4/udp.c:2074
       udp_unicast_rcv_skb.isra.0+0x7e/0x1c0 net/ipv4/udp.c:2233
       __udp4_lib_rcv+0xa44/0x17c0 net/ipv4/udp.c:2300
       udp_rcv+0x2b/0x40 net/ipv4/udp.c:2470
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
      
      read to 0xffff888122474b50 of 8 bytes by task 8921 on cpu 1:
       skb_queue_empty include/linux/skbuff.h:1494 [inline]
       __skb_recv_udp+0x18d/0x500 net/ipv4/udp.c:1653
       udp_recvmsg+0xe1/0xb10 net/ipv4/udp.c:1712
       inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
       sock_recvmsg_nosec+0x5c/0x70 net/socket.c:871
       ___sys_recvmsg+0x1a0/0x3e0 net/socket.c:2480
       do_recvmmsg+0x19a/0x5c0 net/socket.c:2601
       __sys_recvmmsg+0x1ef/0x200 net/socket.c:2680
       __do_sys_recvmmsg net/socket.c:2703 [inline]
       __se_sys_recvmmsg net/socket.c:2696 [inline]
       __x64_sys_recvmmsg+0x89/0xb0 net/socket.c:2696
       do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 8921 Comm: syz-executor.4 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      41955f50
    • E
      udp: fix data-race in udp_set_dev_scratch() · e4635d3d
      Eric Dumazet 提交于
      [ Upstream commit a793183caa9afae907a0d7ddd2ffd57329369bf5 ]
      
      KCSAN reported a data-race in udp_set_dev_scratch() [1]
      
      The issue here is that we must not write over skb fields
      if skb is shared. A similar issue has been fixed in commit
      89c22d8c ("net: Fix skb csum races when peeking")
      
      While we are at it, use a helper only dealing with
      udp_skb_scratch(skb)->csum_unnecessary, as this allows
      udp_set_dev_scratch() to be called once and thus inlined.
      
      [1]
      BUG: KCSAN: data-race in udp_set_dev_scratch / udpv6_recvmsg
      
      write to 0xffff888120278317 of 1 bytes by task 10411 on cpu 1:
       udp_set_dev_scratch+0xea/0x200 net/ipv4/udp.c:1308
       __first_packet_length+0x147/0x420 net/ipv4/udp.c:1556
       first_packet_length+0x68/0x2a0 net/ipv4/udp.c:1579
       udp_poll+0xea/0x110 net/ipv4/udp.c:2720
       sock_poll+0xed/0x250 net/socket.c:1256
       vfs_poll include/linux/poll.h:90 [inline]
       do_select+0x7d0/0x1020 fs/select.c:534
       core_sys_select+0x381/0x550 fs/select.c:677
       do_pselect.constprop.0+0x11d/0x160 fs/select.c:759
       __do_sys_pselect6 fs/select.c:784 [inline]
       __se_sys_pselect6 fs/select.c:769 [inline]
       __x64_sys_pselect6+0x12e/0x170 fs/select.c:769
       do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      read to 0xffff888120278317 of 1 bytes by task 10413 on cpu 0:
       udp_skb_csum_unnecessary include/net/udp.h:358 [inline]
       udpv6_recvmsg+0x43e/0xe90 net/ipv6/udp.c:310
       inet6_recvmsg+0xbb/0x240 net/ipv6/af_inet6.c:592
       sock_recvmsg_nosec+0x5c/0x70 net/socket.c:871
       ___sys_recvmsg+0x1a0/0x3e0 net/socket.c:2480
       do_recvmmsg+0x19a/0x5c0 net/socket.c:2601
       __sys_recvmmsg+0x1ef/0x200 net/socket.c:2680
       __do_sys_recvmmsg net/socket.c:2703 [inline]
       __se_sys_recvmmsg net/socket.c:2696 [inline]
       __x64_sys_recvmmsg+0x89/0xb0 net/socket.c:2696
       do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 10413 Comm: syz-executor.0 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 2276f58a ("udp: use a separate rx queue for packet reception")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      e4635d3d
    • E
      net: annotate accesses to sk->sk_incoming_cpu · 46b32fd2
      Eric Dumazet 提交于
      [ Upstream commit 7170a977743b72cf3eb46ef6ef89885dc7ad3621 ]
      
      This socket field can be read and written by concurrent cpus.
      
      Use READ_ONCE() and WRITE_ONCE() annotations to document this,
      and avoid some compiler 'optimizations'.
      
      KCSAN reported :
      
      BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv
      
      write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0:
       sk_incoming_cpu_update include/net/sock.h:953 [inline]
       tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
       do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337
       do_softirq kernel/softirq.c:329 [inline]
       __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189
      
      read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1:
       sk_incoming_cpu_update include/net/sock.h:952 [inline]
       tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
       napi_poll net/core/dev.c:6392 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       run_ksoftirqd+0x46/0x60 kernel/softirq.c:603
       smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Conflicts:
        net/ipv4/udp.c
        net/ipv4/inet_hashtables.c
        net/ipv6/inet6_hashtables.c
        net/ipv6/udp.c
      [yyl: adjust context]
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      46b32fd2