1. 22 May, 2020 6 commits
  2. 21 May, 2020 2 commits
    • net: nlmsg_cancel() if put fails for nhmsg · d69100b8
      Stephen Worley committed
      Fixes data remnant seen when we fail to reserve space for a
      nexthop group during a larger dump.
      
      If we fail the reservation, we goto nla_put_failure and
      cancel the message.
      
      Reproduce with the following iproute2 commands:
      =====================
      ip link add dummy1 type dummy
      ip link add dummy2 type dummy
      ip link add dummy3 type dummy
      ip link add dummy4 type dummy
      ip link add dummy5 type dummy
      ip link add dummy6 type dummy
      ip link add dummy7 type dummy
      ip link add dummy8 type dummy
      ip link add dummy9 type dummy
      ip link add dummy10 type dummy
      ip link add dummy11 type dummy
      ip link add dummy12 type dummy
      ip link add dummy13 type dummy
      ip link add dummy14 type dummy
      ip link add dummy15 type dummy
      ip link add dummy16 type dummy
      ip link add dummy17 type dummy
      ip link add dummy18 type dummy
      ip link add dummy19 type dummy
      ip link add dummy20 type dummy
      ip link add dummy21 type dummy
      ip link add dummy22 type dummy
      ip link add dummy23 type dummy
      ip link add dummy24 type dummy
      ip link add dummy25 type dummy
      ip link add dummy26 type dummy
      ip link add dummy27 type dummy
      ip link add dummy28 type dummy
      ip link add dummy29 type dummy
      ip link add dummy30 type dummy
      ip link add dummy31 type dummy
      ip link add dummy32 type dummy
      
      ip link set dummy1 up
      ip link set dummy2 up
      ip link set dummy3 up
      ip link set dummy4 up
      ip link set dummy5 up
      ip link set dummy6 up
      ip link set dummy7 up
      ip link set dummy8 up
      ip link set dummy9 up
      ip link set dummy10 up
      ip link set dummy11 up
      ip link set dummy12 up
      ip link set dummy13 up
      ip link set dummy14 up
      ip link set dummy15 up
      ip link set dummy16 up
      ip link set dummy17 up
      ip link set dummy18 up
      ip link set dummy19 up
      ip link set dummy20 up
      ip link set dummy21 up
      ip link set dummy22 up
      ip link set dummy23 up
      ip link set dummy24 up
      ip link set dummy25 up
      ip link set dummy26 up
      ip link set dummy27 up
      ip link set dummy28 up
      ip link set dummy29 up
      ip link set dummy30 up
      ip link set dummy31 up
      ip link set dummy32 up
      
      ip link set dummy33 up
      ip link set dummy34 up
      
      ip link set vrf-red up
      ip link set vrf-blue up
      
      ip link set dummyVRFred up
      ip link set dummyVRFblue up
      
      ip ro add 1.1.1.1/32 dev dummy1
      ip ro add 1.1.1.2/32 dev dummy2
      ip ro add 1.1.1.3/32 dev dummy3
      ip ro add 1.1.1.4/32 dev dummy4
      ip ro add 1.1.1.5/32 dev dummy5
      ip ro add 1.1.1.6/32 dev dummy6
      ip ro add 1.1.1.7/32 dev dummy7
      ip ro add 1.1.1.8/32 dev dummy8
      ip ro add 1.1.1.9/32 dev dummy9
      ip ro add 1.1.1.10/32 dev dummy10
      ip ro add 1.1.1.11/32 dev dummy11
      ip ro add 1.1.1.12/32 dev dummy12
      ip ro add 1.1.1.13/32 dev dummy13
      ip ro add 1.1.1.14/32 dev dummy14
      ip ro add 1.1.1.15/32 dev dummy15
      ip ro add 1.1.1.16/32 dev dummy16
      ip ro add 1.1.1.17/32 dev dummy17
      ip ro add 1.1.1.18/32 dev dummy18
      ip ro add 1.1.1.19/32 dev dummy19
      ip ro add 1.1.1.20/32 dev dummy20
      ip ro add 1.1.1.21/32 dev dummy21
      ip ro add 1.1.1.22/32 dev dummy22
      ip ro add 1.1.1.23/32 dev dummy23
      ip ro add 1.1.1.24/32 dev dummy24
      ip ro add 1.1.1.25/32 dev dummy25
      ip ro add 1.1.1.26/32 dev dummy26
      ip ro add 1.1.1.27/32 dev dummy27
      ip ro add 1.1.1.28/32 dev dummy28
      ip ro add 1.1.1.29/32 dev dummy29
      ip ro add 1.1.1.30/32 dev dummy30
      ip ro add 1.1.1.31/32 dev dummy31
      ip ro add 1.1.1.32/32 dev dummy32
      
      ip next add id 1 via 1.1.1.1 dev dummy1
      ip next add id 2 via 1.1.1.2 dev dummy2
      ip next add id 3 via 1.1.1.3 dev dummy3
      ip next add id 4 via 1.1.1.4 dev dummy4
      ip next add id 5 via 1.1.1.5 dev dummy5
      ip next add id 6 via 1.1.1.6 dev dummy6
      ip next add id 7 via 1.1.1.7 dev dummy7
      ip next add id 8 via 1.1.1.8 dev dummy8
      ip next add id 9 via 1.1.1.9 dev dummy9
      ip next add id 10 via 1.1.1.10 dev dummy10
      ip next add id 11 via 1.1.1.11 dev dummy11
      ip next add id 12 via 1.1.1.12 dev dummy12
      ip next add id 13 via 1.1.1.13 dev dummy13
      ip next add id 14 via 1.1.1.14 dev dummy14
      ip next add id 15 via 1.1.1.15 dev dummy15
      ip next add id 16 via 1.1.1.16 dev dummy16
      ip next add id 17 via 1.1.1.17 dev dummy17
      ip next add id 18 via 1.1.1.18 dev dummy18
      ip next add id 19 via 1.1.1.19 dev dummy19
      ip next add id 20 via 1.1.1.20 dev dummy20
      ip next add id 21 via 1.1.1.21 dev dummy21
      ip next add id 22 via 1.1.1.22 dev dummy22
      ip next add id 23 via 1.1.1.23 dev dummy23
      ip next add id 24 via 1.1.1.24 dev dummy24
      ip next add id 25 via 1.1.1.25 dev dummy25
      ip next add id 26 via 1.1.1.26 dev dummy26
      ip next add id 27 via 1.1.1.27 dev dummy27
      ip next add id 28 via 1.1.1.28 dev dummy28
      ip next add id 29 via 1.1.1.29 dev dummy29
      ip next add id 30 via 1.1.1.30 dev dummy30
      ip next add id 31 via 1.1.1.31 dev dummy31
      ip next add id 32 via 1.1.1.32 dev dummy32
      
      i=100
      
      while [ $i -le 200 ]
      do
      ip next add id $i group 1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19
      
      	echo $i
      
      	((i++))
      
      done
      
      ip next add id 999 group 1/2/3/4/5/6
      
      ip next ls
      
      ========================
      
      Fixes: ab84be7e ("net: Initial nexthop code")
      Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
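The nexthop fix above relies on a common pattern: remember where a message starts, and if any later "put" fails, trim the buffer back to that point so no partial record is left behind. The following is a minimal userspace sketch of that idea, not the kernel patch itself; `struct msgbuf`, `buf_put()` and `fill_record()` are hypothetical names standing in for the skb and the netlink helpers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size message buffer; stands in for an skb. */
struct msgbuf {
    unsigned char data[64];
    size_t len;
};

/* Append n bytes, or fail without writing if they do not fit. */
int buf_put(struct msgbuf *b, const void *src, size_t n)
{
    if (n > sizeof(b->data) - b->len)
        return -1;
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}

/* Write one record (header + payload). On any failure, roll the
 * buffer back to where the record started, which is the role that
 * nlmsg_cancel() plays in the commit described above. */
int fill_record(struct msgbuf *b, const void *hdr, size_t hdr_len,
                const void *payload, size_t payload_len)
{
    size_t start = b->len;

    if (buf_put(b, hdr, hdr_len) < 0)
        goto put_failure;
    if (buf_put(b, payload, payload_len) < 0)
        goto put_failure;
    return 0;

put_failure:
    b->len = start; /* discard the half-written record */
    return -1;
}
```

Without the rollback, the header of the failed record would stay in the buffer and be sent as a data remnant with the rest of the dump.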
    • ax25: fix setsockopt(SO_BINDTODEVICE) · 687775ce
      Eric Dumazet committed
      syzbot was able to trigger this trace [1], probably by using
      a zero optlen.
      
      While we are at it, cap optlen to IFNAMSIZ - 1 instead of IFNAMSIZ.
      
      [1]
      BUG: KMSAN: uninit-value in strnlen+0xf9/0x170 lib/string.c:569
      CPU: 0 PID: 8807 Comm: syz-executor483 Not tainted 5.7.0-rc4-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c9/0x220 lib/dump_stack.c:118
       kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:121
       __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
       strnlen+0xf9/0x170 lib/string.c:569
       dev_name_hash net/core/dev.c:207 [inline]
       netdev_name_node_lookup net/core/dev.c:277 [inline]
       __dev_get_by_name+0x75/0x2b0 net/core/dev.c:778
       ax25_setsockopt+0xfa3/0x1170 net/ax25/af_ax25.c:654
       __compat_sys_setsockopt+0x4ed/0x910 net/compat.c:403
       __do_compat_sys_setsockopt net/compat.c:413 [inline]
       __se_compat_sys_setsockopt+0xdd/0x100 net/compat.c:410
       __ia32_compat_sys_setsockopt+0x62/0x80 net/compat.c:410
       do_syscall_32_irqs_on arch/x86/entry/common.c:339 [inline]
       do_fast_syscall_32+0x3bf/0x6d0 arch/x86/entry/common.c:398
       entry_SYSENTER_compat+0x68/0x77 arch/x86/entry/entry_64_compat.S:139
      RIP: 0023:0xf7f57dd9
      Code: 90 e8 0b 00 00 00 f3 90 0f ae e8 eb f9 8d 74 26 00 89 3c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
      RSP: 002b:00000000ffae8c1c EFLAGS: 00000217 ORIG_RAX: 000000000000016e
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000000101
      RDX: 0000000000000019 RSI: 0000000020000000 RDI: 0000000000000004
      RBP: 0000000000000012 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      
      Local variable ----devname@ax25_setsockopt created at:
       ax25_setsockopt+0xe6/0x1170 net/ax25/af_ax25.c:536
       ax25_setsockopt+0xe6/0x1170 net/ax25/af_ax25.c:536
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
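The ax25 report above comes down to a name buffer that could be read before anything was copied into it. A minimal userspace sketch of the defensive copy follows, assuming a 16-byte buffer (the value of IFNAMSIZ on Linux); `copy_devname()` and `NAME_BUF_LEN` are hypothetical names, and `memcpy()` stands in for `copy_from_user()`. It is an illustration of the idea, not the actual patch.

```c
#include <stddef.h>
#include <string.h>

#define NAME_BUF_LEN 16 /* stands in for IFNAMSIZ */

/* Copy at most NAME_BUF_LEN - 1 bytes of a caller-supplied name.
 * The destination is zeroed first, so it is always NUL-terminated
 * and fully initialized, even when optlen is 0. */
size_t copy_devname(char dst[NAME_BUF_LEN], const char *src, size_t optlen)
{
    if (optlen > NAME_BUF_LEN - 1)
        optlen = NAME_BUF_LEN - 1;
    memset(dst, 0, NAME_BUF_LEN);
    memcpy(dst, src, optlen);
    return optlen;
}
```

Capping at one byte less than the buffer size leaves room for the terminator, so a later `strnlen()` or name lookup never walks into uninitialized bytes.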
  3. 20 May, 2020 4 commits
    • sctp: Don't add the shutdown timer if its already been added · 20a785aa
      Neil Horman committed
      This BUG halt was reported a while back, but the patch somehow got
      missed:
      
      PID: 2879   TASK: c16adaa0  CPU: 1   COMMAND: "sctpn"
       #0 [f418dd28] crash_kexec at c04a7d8c
       #1 [f418dd7c] oops_end at c0863e02
       #2 [f418dd90] do_invalid_op at c040aaca
       #3 [f418de28] error_code (via invalid_op) at c08631a5
          EAX: f34baac0  EBX: 00000090  ECX: f418deb0  EDX: f5542950  EBP: 00000000
          DS:  007b      ESI: f34ba800  ES:  007b      EDI: f418dea0  GS:  00e0
          CS:  0060      EIP: c046fa5e  ERR: ffffffff  EFLAGS: 00010286
       #4 [f418de5c] add_timer at c046fa5e
       #5 [f418de68] sctp_do_sm at f8db8c77 [sctp]
       #6 [f418df30] sctp_primitive_SHUTDOWN at f8dcc1b5 [sctp]
       #7 [f418df48] inet_shutdown at c080baf9
       #8 [f418df5c] sys_shutdown at c079eedf
       #9 [f418df70] sys_socketcall at c079fe88
          EAX: ffffffda  EBX: 0000000d  ECX: bfceea90  EDX: 0937af98
          DS:  007b      ESI: 0000000c  ES:  007b      EDI: b7150ae4
          SS:  007b      ESP: bfceea7c  EBP: bfceeaa8  GS:  0033
          CS:  0073      EIP: b775c424  ERR: 00000066  EFLAGS: 00000282
      
      It appears that the side effect that starts the shutdown timer was processed
      multiple times, which can happen as multiple paths can trigger it.  This of
      course leads to the BUG halt in add_timer getting called.
      
      The fix seems pretty straightforward: check before the timer is added whether
      it has already been started.  If it has, mod the timer instead to
      min(current expiration, new expiration).
      
      It has been tested but not confirmed to fix the problem, as the issue has only
      occurred in production environments where test kernels are enjoined from being
      installed.  It appears to be a sane fix to me though.  Also, recently,
      Jere found a reproducer posted on list to confirm that this resolves the
      issues.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      CC: Vlad Yasevich <vyasevich@gmail.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: jere.leppanen@nokia.com
      CC: marcelo.leitner@gmail.com
      CC: netdev@vger.kernel.org
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
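The rule described in the sctp commit above (arm the timer if it is idle, otherwise move it to the earlier of the two expirations) can be sketched in a few lines. This is a simplified userspace model under stated assumptions: `struct simple_timer` and `timer_start_or_shorten()` are hypothetical, and counter wraparound, which kernel code handles with dedicated time comparison helpers, is ignored.

```c
#include <stdbool.h>

/* Hypothetical one-shot timer: 'pending' mirrors "already added". */
struct simple_timer {
    bool pending;
    unsigned long expires;
};

/* Arm the timer if it is idle; if it is already pending, keep the
 * earlier of the current and the requested expiration instead of
 * adding it a second time. */
void timer_start_or_shorten(struct simple_timer *t, unsigned long new_expires)
{
    if (!t->pending) {
        t->pending = true;
        t->expires = new_expires;
        return;
    }
    if (new_expires < t->expires)
        t->expires = new_expires;
}
```

With this shape, processing the same side effect twice is harmless: the second call can only shorten the deadline, never trip an "already added" assertion.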
    • __netif_receive_skb_core: pass skb by reference · c0bbbdc3
      Boris Sukholitko committed
      __netif_receive_skb_core may change the skb pointer passed into it (e.g.
      in rx_handler). The original skb may be freed as a result of this
      operation.
      
      The callers of __netif_receive_skb_core may further process original skb
      by using pt_prev pointer returned by __netif_receive_skb_core thus
      leading to unpleasant effects.
      
      The solution is to pass skb by reference into __netif_receive_skb_core.
      
      v2: Added Fixes tag and comment regarding ppt_prev and skb invariant.
      
      Fixes: 88eb1944 ("net: core: propagate SKB lists through packet_type lookup")
      Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
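Passing a buffer "by reference" in C means passing a pointer to the caller's pointer, so that a callee which replaces and frees the buffer can also update what the caller holds. The sketch below illustrates the idea described in the commit above in plain userspace C; `struct pkt` and `replace_pkt()` are hypothetical and do not correspond to kernel functions.

```c
#include <stdlib.h>

/* Hypothetical packet type; stands in for struct sk_buff. */
struct pkt {
    int id;
};

/* A handler that swaps the packet for a new one and frees the old
 * one. Because it receives a pointer to the caller's pointer, the
 * caller ends up holding the replacement rather than freed memory. */
int replace_pkt(struct pkt **pp)
{
    struct pkt *old = *pp;
    struct pkt *fresh = malloc(sizeof(*fresh));

    if (!fresh)
        return -1; /* *pp is left untouched on failure */
    fresh->id = old->id + 100;
    free(old);
    *pp = fresh;
    return 0;
}
```

Had the function taken `struct pkt *` instead, the caller's pointer would still refer to the freed packet after the call, which is the kind of unpleasant effect the commit message warns about.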
    • net: inet_csk: Fix so_reuseport bind-address cache in tb->fast* · 88d7fcfa
      Martin KaFai Lau committed
      The commit 637bc8bb ("inet: reset tb->fastreuseport when adding a reuseport sk")
      added a bind-address cache in tb->fast*.  The tb->fast* caches the address
      of a sk which has successfully been bound with SO_REUSEPORT ON.  The idea
      is to avoid the expensive conflict search in inet_csk_bind_conflict().
      
      There is an issue with wildcard matching where sk_reuseport_match() should
      have returned false but it is currently returning true.  It ends up
      hiding bind conflict.  For example,
      
      bind("[::1]:443"); /* without SO_REUSEPORT. Succeed. */
      bind("[::2]:443"); /* with    SO_REUSEPORT. Succeed. */
      bind("[::]:443");  /* with    SO_REUSEPORT. Still Succeed where it shouldn't */
      
      The last bind("[::]:443") with SO_REUSEPORT on should have failed because
      it should have a conflict with the very first bind("[::1]:443") which
      has SO_REUSEPORT off.  However, the address "[::2]" is cached in
      tb->fast* in the second bind. In the last bind, the sk_reuseport_match()
      returns true because the binding sk's wildcard addr "[::]" matches with
      the "[::2]" cached in tb->fast*.
      
      The correct bind conflict is reported by removing the second
      bind such that tb->fast* cache is not involved and forces the
      bind("[::]:443") to go through the inet_csk_bind_conflict():
      
      bind("[::1]:443"); /* without SO_REUSEPORT. Succeed. */
      bind("[::]:443");  /* with    SO_REUSEPORT. -EADDRINUSE */
      
      The expected behavior for sk_reuseport_match() is, it should only allow
      the "cached" tb->fast* address to be used as a wildcard match but not
      the address of the binding sk.  To do that, the current
      "bool match_wildcard" arg is split into
      "bool match_sk1_wildcard" and "bool match_sk2_wildcard".
      
      This change only affects the sk_reuseport_match() which is only
      used by inet_csk (e.g. TCP).
      The other use cases are calling inet_rcv_saddr_equal() and
      this patch makes it pass the same "match_wildcard" arg twice to
      the "ipv[46]_rcv_saddr_equal(..., match_wildcard, match_wildcard)".
      
      Cc: Josef Bacik <jbacik@fb.com>
      Fixes: 637bc8bb ("inet: reset tb->fastreuseport when adding a reuseport sk")
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
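The split of one `match_wildcard` flag into two per-side flags, as described in the commit above, can be modeled with a toy comparison. In this sketch addresses are plain integers and 0 plays the role of the wildcard "[::]"; `addr_match()` is a hypothetical helper, not the kernel's `ipv[46]_rcv_saddr_equal()`.

```c
#include <stdbool.h>

/* Toy address comparison: 0 is the wildcard address. Each side has
 * its own flag saying whether a wildcard on that side counts as a
 * match. */
bool addr_match(unsigned int a1, unsigned int a2,
                bool match_a1_wildcard, bool match_a2_wildcard)
{
    if (a1 == a2)
        return true;
    if (match_a1_wildcard && a1 == 0)
        return true;
    if (match_a2_wildcard && a2 == 0)
        return true;
    return false;
}
```

Taking the cached address as the first argument and the binding socket's address as the second, `addr_match(2, 0, true, false)` returns false, so a wildcard bind no longer matches a cached "[::2]" and the full conflict search runs. With both flags set, as with the old single flag, the same call returns true and the conflict is hidden.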
    • mptcp: use rightmost 64 bits in ADD_ADDR HMAC · 12555a2d
      Todd Malsbary committed
      This changes the HMAC used in the ADD_ADDR option from the leftmost 64
      bits to the rightmost 64 bits as described in RFC 8684, section 3.4.1.
      
      This issue was discovered while adding support to packetdrill for the
      ADD_ADDR v1 option.
      
      Fixes: 3df523ab ("mptcp: Add ADD_ADDR handling")
      Signed-off-by: Todd Malsbary <todd.malsbary@linux.intel.com>
      Acked-by: Christoph Paasch <cpaasch@apple.com>
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
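To make "leftmost" versus "rightmost" 64 bits concrete, the sketch below truncates a 32-byte digest (the size of an HMAC-SHA256 output) both ways. The helper names are hypothetical, and reading the bytes in big-endian order is an assumption made for illustration; the digest itself is taken as given and no HMAC is computed here.

```c
#include <stddef.h>
#include <stdint.h>

#define DIGEST_LEN 32 /* size of a SHA-256 based HMAC, in bytes */

/* Read 8 bytes starting at 'off' as a big-endian 64-bit value. */
uint64_t digest_be64(const uint8_t d[DIGEST_LEN], size_t off)
{
    uint64_t v = 0;

    for (size_t i = 0; i < 8; i++)
        v = (v << 8) | d[off + i];
    return v;
}

/* Leftmost 64 bits: the first 8 bytes of the digest. */
uint64_t hmac_leftmost64(const uint8_t d[DIGEST_LEN])
{
    return digest_be64(d, 0);
}

/* Rightmost 64 bits: the last 8 bytes of the digest. */
uint64_t hmac_rightmost64(const uint8_t d[DIGEST_LEN])
{
    return digest_be64(d, DIGEST_LEN - 8);
}
```

Two peers that truncate the same digest from different ends will compute different values, which is why the choice has to follow the RFC for interoperability.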
  4. 18 May, 2020 2 commits
  5. 17 May, 2020 2 commits
    • net: dsa: mt7530: fix roaming from DSA user ports · 5e5502e0
      DENG Qingfang committed
      When a client moves from a DSA user port to a software port in a bridge,
      it cannot reach any other clients that connected to the DSA user ports.
      That is because SA learning on the CPU port is disabled, so the switch
      ignores the client's frames from the CPU port and still thinks it is at
      the user port.
      
      Fix it by enabling SA learning on the CPU port.
      
      To prevent the switch from learning from flooding frames from the CPU
      port, set skb->offload_fwd_mark to 1 for unicast and broadcast frames,
      and let the switch flood them instead of trapping to the CPU port.
      Multicast frames still need to be trapped to the CPU port for snooping,
      so set the SA_DIS bit of the MTK tag to 1 when transmitting those frames
      to disable SA learning.
      
      Fixes: b8f126a8 ("net-next: dsa: add dsa support for Mediatek MT7530 switch")
      Signed-off-by: DENG Qingfang <dqfext@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Fix suspicious RCU usage warning in ip6mr · b6dd5acd
      Madhuparna Bhowmik committed
      This patch fixes the following warning:
      
      =============================
      WARNING: suspicious RCU usage
      5.7.0-rc4-next-20200507-syzkaller #0 Not tainted
      -----------------------------
      net/ipv6/ip6mr.c:124 RCU-list traversed in non-reader section!!
      
      ipmr_new_table() returns an existing table, but there is no table at
      init. Therefore the condition: either holding rtnl or the list is empty
      is used.
      
      Fixes: d1db275d ("ipv6: ip6mr: support multiple tables")
      Reported-by: kernel test robot <lkp@intel.com>
      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 15 May, 2020 3 commits
    • ipmr: Add lockdep expression to ipmr_for_each_table macro · 7013908c
      Amol Grover committed
      During the initialization process, ipmr_new_table() is called
      to create new tables which in turn calls ipmr_get_table() which
      traverses net->ipv4.mr_tables without holding the writer lock.
      However, this is safe, as no tables exist at this time.
      Hence add a suitable lockdep expression to silence the following
      false-positive warning:
      
      =============================
      WARNING: suspicious RCU usage
      5.7.0-rc3-next-20200428-syzkaller #0 Not tainted
      -----------------------------
      net/ipv4/ipmr.c:136 RCU-list traversed in non-reader section!!
      
      ipmr_get_table+0x130/0x160 net/ipv4/ipmr.c:136
      ipmr_new_table net/ipv4/ipmr.c:403 [inline]
      ipmr_rules_init net/ipv4/ipmr.c:248 [inline]
      ipmr_net_init+0x133/0x430 net/ipv4/ipmr.c:3089
      
      Fixes: f0ad0860 ("ipv4: ipmr: support multiple tables")
      Reported-by: syzbot+1519f497f2f9f08183c6@syzkaller.appspotmail.com
      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Amol Grover <frextrite@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipmr: Fix RCU list debugging warning · a14fbcd4
      Amol Grover committed
      ipmr_for_each_table() macro uses list_for_each_entry_rcu()
      for traversing outside of an RCU read side critical section
      but under the protection of rtnl_mutex. Hence, add the
      corresponding lockdep expression to silence the following
      false-positive warning at boot:
      
      [    4.319347] =============================
      [    4.319349] WARNING: suspicious RCU usage
      [    4.319351] 5.5.4-stable #17 Tainted: G            E
      [    4.319352] -----------------------------
      [    4.319354] net/ipv4/ipmr.c:1757 RCU-list traversed in non-reader section!!
      
      Fixes: f0ad0860 ("ipv4: ipmr: support multiple tables")
      Signed-off-by: Amol Grover <frextrite@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix error recovery in tcp_zerocopy_receive() · e776af60
      Eric Dumazet committed
      If user provides wrong virtual address in TCP_ZEROCOPY_RECEIVE
      operation we want to return -EINVAL error.
      
      But depending on zc->recv_skip_hint content, we might return
      -EIO error if the socket has SOCK_DONE set.
      
      Make sure to return -EINVAL in this case.
      
      BUG: KMSAN: uninit-value in tcp_zerocopy_receive net/ipv4/tcp.c:1833 [inline]
      BUG: KMSAN: uninit-value in do_tcp_getsockopt+0x4494/0x6320 net/ipv4/tcp.c:3685
      CPU: 1 PID: 625 Comm: syz-executor.0 Not tainted 5.7.0-rc4-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c9/0x220 lib/dump_stack.c:118
       kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:121
       __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
       tcp_zerocopy_receive net/ipv4/tcp.c:1833 [inline]
       do_tcp_getsockopt+0x4494/0x6320 net/ipv4/tcp.c:3685
       tcp_getsockopt+0xf8/0x1f0 net/ipv4/tcp.c:3728
       sock_common_getsockopt+0x13f/0x180 net/core/sock.c:3131
       __sys_getsockopt+0x533/0x7b0 net/socket.c:2177
       __do_sys_getsockopt net/socket.c:2192 [inline]
       __se_sys_getsockopt+0xe1/0x100 net/socket.c:2189
       __x64_sys_getsockopt+0x62/0x80 net/socket.c:2189
       do_syscall_64+0xb8/0x160 arch/x86/entry/common.c:297
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x45c829
      Code: 0d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f1deeb72c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000037
      RAX: ffffffffffffffda RBX: 00000000004e01e0 RCX: 000000000045c829
      RDX: 0000000000000023 RSI: 0000000000000006 RDI: 0000000000000009
      RBP: 000000000078bf00 R08: 0000000020000200 R09: 0000000000000000
      R10: 00000000200001c0 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 00000000000001d8 R14: 00000000004d3038 R15: 00007f1deeb736d4
      
      Local variable ----zc@do_tcp_getsockopt created at:
       do_tcp_getsockopt+0x1a74/0x6320 net/ipv4/tcp.c:3670
       do_tcp_getsockopt+0x1a74/0x6320 net/ipv4/tcp.c:3670
      
      Fixes: 05255b82 ("tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 14 May, 2020 3 commits
    • tipc: fix failed service subscription deletion · 88690b10
      Tuong Lien committed
      When a service subscription is expired or canceled by user, it needs to
      be deleted from the subscription list, so that new subscriptions can be
      registered (max = 65535 per net). However, there are two issues in code
      that can cause such an unused subscription to persist:
      
      1) The 'tipc_conn_delete_sub()' has a loop on the subscription list but
      it breaks out early when the 1st subscription differs from the one
      specified, so the subscription will not be deleted.
      
      2) In case a subscription is canceled, the code to remove the
      'TIPC_SUB_CANCEL' flag from the subscription filter does not work if it
      is a local subscription (i.e. the little endian isn't involved). So
      there will be no match when looking for the subscription to delete later.
      
      The subscription(s) will be removed eventually when the user terminates
      its topology connection but that could be a long time later. Meanwhile,
      the number of available subscriptions may be exhausted.
      
      This commit fixes the two issues above, so as needed a subscription can
      be deleted correctly.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
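Issue 1) above is a classic search-loop mistake: leaving the loop on the first non-matching entry instead of skipping it. The sketch below shows the corrected shape on a plain array; `struct sub_entry` and `delete_sub()` are hypothetical and much simpler than the kernel's list-based code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subscription slot. */
struct sub_entry {
    int id;
    bool active;
};

/* Deactivate the subscription with the given id. Non-matching
 * entries are skipped with 'continue'; using 'break' there would
 * end the search at the first mismatch and leave the target in
 * place unless it happened to be first. */
bool delete_sub(struct sub_entry *subs, size_t n, int id)
{
    for (size_t i = 0; i < n; i++) {
        if (!subs[i].active || subs[i].id != id)
            continue;
        subs[i].active = false;
        return true; /* found and deleted: now it is safe to stop */
    }
    return false;
}
```

The only safe place to stop early is after the matching entry has been handled.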
    • tipc: fix memory leak in service subscripting · 0771d7df
      Tuong Lien committed
      Upon receipt of a service subscription request from user via a topology
      connection, one 'sub' object will be allocated in kernel, so it will be
      able to send an event of the service if any to the user correspondingly
      then. Also, in case of any failure, the connection will be shutdown and
      all the pertaining 'sub' objects will be freed.
      
      However, there is a race condition as follows resulting in memory leak:
      
             receive-work       connection        send-work
                    |                |                |
              sub-1 |<------//-------|                |
              sub-2 |<------//-------|                |
                    |                |<---------------| evt for sub-x
              sub-3 |<------//-------|                |
                    :                :                :
                    :                :                :
                    |       /--------|                |
                    |       |        * peer closed    |
                    |       |        |                |
                    |       |        |<-------X-------| evt for sub-y
                    |       |        |<===============|
              sub-n |<------/        X    shutdown    |
          -> orphan |                                 |
      
      That is, the 'receive-work' may get the last subscription request while
      the 'send-work' is shutting down the connection due to peer close.
      
      We had a 'lock' on the connection, so the two actions cannot be carried
      out simultaneously. If the last subscription is allocated e.g. 'sub-n',
      before the 'send-work' closes the connection, there will be no issue at
      all, the 'sub' objects will be freed. In contrast the last subscription
      will become orphan since the connection was closed, and we released all
      references.
      
      This commit fixes the issue by simply adding one test if the connection
      remains in 'connected' state right after we obtain the connection lock,
      then a subscription object can be created as usual, otherwise we ignore
      it.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Reported-by: Thang Ngo <thang.h.ngo@dektech.com.au>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: fix large latency in smart Nagle streaming · c7268589
      Tuong Lien committed
      Currently when a connection is in Nagle mode, we set the 'ack_required'
      bit in the last sending buffer and wait for the corresponding ACK prior
      to pushing more data. However, on the receiving side, the ACK is issued
      only when the application really reads the whole data. Even if part of the
      last buffer is received, we will not do the ACK as required. This might
      cause an unnecessary delay since the receiver does not always fetch the
      message as fast as the sender, resulting in a large latency in the user
      message sending, which is: [one RTT + the receiver processing time].
      
      The commit makes Nagle ACK as soon as possible i.e. when a message with
      the 'ack_required' arrives in the receiving side's stack even before it
      is processed or put in the socket receive queue...
      This way, we can limit the streaming latency to one RTT as committed in
      Nagle mode.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Acked-by: Jon Maloy <jmaloy@redhat.com>
      Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 13 May, 2020 3 commits
  9. 12 May, 2020 2 commits
  10. 11 May, 2020 4 commits
    • netfilter: conntrack: fix infinite loop on rmmod · 54ab49fd
      Florian Westphal committed
      'rmmod nf_conntrack' can hang forever, because the netns exit
      gets stuck in nf_conntrack_cleanup_net_list():
      
      i_see_dead_people:
       busy = 0;
       list_for_each_entry(net, net_exit_list, exit_list) {
        nf_ct_iterate_cleanup(kill_all, net, 0, 0);
        if (atomic_read(&net->ct.count) != 0)
         busy = 1;
       }
       if (busy) {
        schedule();
        goto i_see_dead_people;
       }
      
      When nf_ct_iterate_cleanup iterates the conntrack table, all nf_conn
      structures can be found twice:
      once for the original tuple and once for the conntrack's reply tuple.
      
      get_next_corpse() only calls the iterator when the entry is
      in original direction -- the idea was to avoid unneeded invocations
      of the iterator callback.
      
      When support for clashing entries was added, the assumption that
      all nf_conn objects are added twice, once in original, once for reply
      tuple no longer holds -- NF_CLASH_BIT entries are only added in
      the non-clashing reply direction.
      
      Thus, if at least one NF_CLASH entry is in the list then
      nf_conntrack_cleanup_net_list() always skips it completely.
      
      During normal netns destruction, this causes a hang of several
      seconds, until the gc worker removes the entry (NF_CLASH entries
      always have a 1 second timeout).
      
      But in the rmmod case, the gc worker has already been stopped, so
      ct.count never becomes 0.
      
      We can fix this in two ways:
      
      1. Add a second test for CLASH_BIT and call iterator for those
         entries as well, or:
      2. Skip the original tuple direction and use the reply tuple.
      
      2) is simpler, so do that.
      
      Fixes: 6a757c07 ("netfilter: conntrack: allow insertion of clashing entries")
      Reported-by: Chen Yi <yiche@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
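The reasoning in the conntrack commit above can be checked with a small model: most connections appear in the table twice (original and reply direction), but a clash entry appears only in the reply direction. Filtering on the original direction therefore misses it, while filtering on the reply direction visits every connection exactly once. The types and the `count_visited()` helper below are hypothetical.

```c
#include <stddef.h>

enum tuple_dir { DIR_ORIGINAL, DIR_REPLY };

/* Hypothetical hash table node: one per (connection, direction). */
struct hash_node {
    int conn_id;
    enum tuple_dir dir;
};

/* Count how many connections an iterator would see if it only
 * acted on nodes of the chosen direction. */
size_t count_visited(const struct hash_node *nodes, size_t n,
                     enum tuple_dir pick)
{
    size_t visited = 0;

    for (size_t i = 0; i < n; i++) {
        if (nodes[i].dir == pick)
            visited++;
    }
    return visited;
}
```

For a table holding connection 1 in both directions and a clash entry for connection 2 in the reply direction only, picking `DIR_ORIGINAL` visits one connection and picking `DIR_REPLY` visits two, which matches option 2 in the commit message.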
    • netfilter: flowtable: Remove WQ_MEM_RECLAIM from workqueue · 1d10da0e
      Roi Dayan committed
      This workqueue is in charge of handling offloaded flow tasks like
      add/del/stats, so we should not use the WQ_MEM_RECLAIM flag.
      The flag can result in the following warning.
      
      [  485.557189] ------------[ cut here ]------------
      [  485.562976] workqueue: WQ_MEM_RECLAIM nf_flow_table_offload:flow_offload_worr
      [  485.562985] WARNING: CPU: 7 PID: 3731 at kernel/workqueue.c:2610 check_flush0
      [  485.590191] Kernel panic - not syncing: panic_on_warn set ...
      [  485.597100] CPU: 7 PID: 3731 Comm: kworker/u112:8 Not tainted 5.7.0-rc1.21802
      [  485.606629] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/177
      [  485.615487] Workqueue: nf_flow_table_offload flow_offload_work_handler [nf_f]
      [  485.624834] Call Trace:
      [  485.628077]  dump_stack+0x50/0x70
      [  485.632280]  panic+0xfb/0x2d7
      [  485.636083]  ? check_flush_dependency+0x110/0x130
      [  485.641830]  __warn.cold.12+0x20/0x2a
      [  485.646405]  ? check_flush_dependency+0x110/0x130
      [  485.652154]  ? check_flush_dependency+0x110/0x130
      [  485.657900]  report_bug+0xb8/0x100
      [  485.662187]  ? sched_clock_cpu+0xc/0xb0
      [  485.666974]  do_error_trap+0x9f/0xc0
      [  485.671464]  do_invalid_op+0x36/0x40
      [  485.675950]  ? check_flush_dependency+0x110/0x130
      [  485.681699]  invalid_op+0x28/0x30
      
      Fixes: 7da182a9 ("netfilter: flowtable: Use work entry per offload command")
      Reported-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: Roi Dayan <roid@mellanox.com>
      Reviewed-by: Paul Blakey <paulb@mellanox.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: flowtable: Add pending bit for offload work · 2c889795
      Paul Blakey committed
      The gc step can queue offloaded flow del work or stats work.
      Those work items can race with each other, and a flow could be freed
      before the stats work that queries it has executed.
      To avoid that, add a pending bit so that if a work item already exists
      for a flow, another one is not queued for it.
      This also avoids adding multiple stats works in case the stats work
      didn't complete before the gc step started again.
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Reviewed-by: Roi Dayan <roid@mellanox.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      2c889795
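
The pending-bit idea described in the commit above can be illustrated with a minimal userspace sketch. This is not the kernel patch: `struct flow`, `flow_queue_work()` and `flow_work_done()` are hypothetical names, and C11 `atomic_flag` stands in for the kernel's own atomic bit helpers. It only shows the pattern of refusing to queue a second work item while one is outstanding.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a flow entry; only the pending flag matters. */
struct flow {
    atomic_flag pending;  /* set while a work item for this flow is outstanding */
    int queued;           /* counts accepted work items, for illustration only */
};

/* Try to queue work for a flow. Returns false if work is already pending. */
static bool flow_queue_work(struct flow *f)
{
    /* test-and-set is atomic: only one caller observes the flag as clear */
    if (atomic_flag_test_and_set(&f->pending))
        return false;
    f->queued++;          /* a real implementation would enqueue the work here */
    return true;
}

/* Called by the worker once the work item has finished. */
static void flow_work_done(struct flow *f)
{
    atomic_flag_clear(&f->pending);
}
```

With this shape, a del request and a stats request for the same flow cannot both be in flight, and a second gc pass cannot pile up stats work behind one that has not finished.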
    • A
      netfilter: conntrack: avoid gcc-10 zero-length-bounds warning · 2c407aca
      Arnd Bergmann committed
      gcc-10 warns about a suspicious access to an empty struct member:
      
      net/netfilter/nf_conntrack_core.c: In function '__nf_conntrack_alloc':
      net/netfilter/nf_conntrack_core.c:1522:9: warning: array subscript 0 is outside the bounds of an interior zero-length array 'u8[0]' {aka 'unsigned char[0]'} [-Wzero-length-bounds]
       1522 |  memset(&ct->__nfct_init_offset[0], 0,
            |         ^~~~~~~~~~~~~~~~~~~~~~~~~~
      In file included from net/netfilter/nf_conntrack_core.c:37:
      include/net/netfilter/nf_conntrack.h:90:5: note: while referencing '__nfct_init_offset'
         90 |  u8 __nfct_init_offset[0];
            |     ^~~~~~~~~~~~~~~~~~
      
      The code is correct but a bit unusual. Rework it slightly in a way that
      does not trigger the warning, using an empty struct instead of an empty
      array. There are probably more elegant ways to do this, but this is the
      smallest change.
      
      Fixes: c41884ce ("netfilter: conntrack: avoid zeroing timer")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      2c407aca
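
As a rough illustration of the technique named in the commit above (an empty struct used as a position marker instead of a zero-length array), here is a small userspace sketch. `struct conn`, its fields and `conn_reinit()` are hypothetical and unrelated to the real `struct nf_conn` layout. An empty struct is a GNU C extension, so this assumes gcc or clang.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical object: everything after the marker is reset on reuse. */
struct conn {
    int refcnt;               /* must survive re-initialisation */
    struct { } init_offset;   /* zero-size marker (GNU C extension) */
    int status;
    long timeout;
};

/* Zero every field that follows the marker, leaving refcnt untouched. */
static void conn_reinit(struct conn *c)
{
    size_t start = offsetof(struct conn, init_offset);

    /* Address the region by offset rather than by indexing the marker */
    memset((char *)c + start, 0, sizeof(*c) - start);
}
```

Because the marker is never subscripted, there is no `marker[0]` expression for the compiler to flag as out of bounds.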
  11. 10 May, 2020 1 commit
    • Z
      netprio_cgroup: Fix unlimited memory leak of v2 cgroups · 090e28b2
      Zefan Li committed
      If systemd is configured to use hybrid mode, which enables the use of
      both cgroup v1 and v2, systemd will create a new cgroup on both the default
      root (v2) and the netprio_cgroup hierarchy (v1) for a new session and attach
      the task to the two cgroups. If the task does any networking, the v2
      cgroup can never be freed after the session exits.
      
      One of our machines ran into OOM due to this memory leak.
      
      In the scenario described above, when sk_alloc() is called,
      cgroup_sk_alloc() assumes it is in v2 mode, so it stores
      the cgroup pointer in sk->sk_cgrp_data and increments
      the cgroup refcnt. sock_update_netprioidx() then assumes
      it is in v1 mode and stores the netprioidx value
      in sk->sk_cgrp_data, so the cgroup refcnt is never dropped.
      
      Currently we do the mode switch when someone writes to the ifpriomap
      cgroup control file. The easiest fix is to also do the switch when
      a task is attached to a new cgroup.
      
      Fixes: bd1060a1 ("sock, cgroup: add sock->sk_cgroup")
      Reported-by: Yang Yingliang <yangyingliang@huawei.com>
      Tested-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Zefan Li <lizefan@huawei.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      090e28b2
  12. 09 May, 2020 2 commits
  13. 08 May, 2020 3 commits
    • C
      net: fix a potential recursive NETDEV_FEAT_CHANGE · dd912306
      Cong Wang committed
      syzbot managed to trigger a recursive NETDEV_FEAT_CHANGE event
      between bonding master and slave. I managed to find a reproducer
      for this:
      
        ip li set bond0 up
        ifenslave bond0 eth0
        brctl addbr br0
        ethtool -K eth0 lro off
        brctl addif br0 bond0
        ip li set br0 up
      
      When a NETDEV_FEAT_CHANGE event is triggered on a bonding slave,
      it captures this and calls bond_compute_features() to fixup its
      master's and other slaves' features. However, when syncing with
      its lower devices by netdev_sync_lower_features() this event is
      triggered again on slaves when the LRO feature fails to change,
      so it goes back and forth recursively until the kernel stack is
      exhausted.
      
      Commit 17b85d29 intentionally lets __netdev_update_features()
      return -1 for such a failure case, so we have to just rely on
      the existing check inside netdev_sync_lower_features() and skip
      NETDEV_FEAT_CHANGE event only for this specific failure case.
      
      Fixes: fd867d51 ("net/core: generic support for disabling netdev features down stack")
      Reported-by: syzbot+e73ceacfd8560cc8a3ca@syzkaller.appspotmail.com
      Reported-by: syzbot+c2fb6f9ddcea95ba49b5@syzkaller.appspotmail.com
      Cc: Jarod Wilson <jarod@redhat.com>
      Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dd912306
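
The guard described in the commit above (skip the change notification when disabling a feature on a lower device fails) can be modelled with a toy userspace sketch. Everything here is hypothetical: `struct dev`, `FEAT_LRO`, `lro_stuck` and the helper names are illustrative stand-ins and not the kernel's data structures or functions.

```c
#include <stdbool.h>

#define FEAT_LRO 0x1u          /* hypothetical feature bit */

struct dev {
    unsigned features;         /* currently active features */
    unsigned wanted;           /* requested features */
    bool lro_stuck;            /* assumption: driver refuses to turn LRO off */
    int events;                /* change notifications delivered so far */
};

/* Apply wanted features: 0 = no change, 1 = changed, -1 = failed. */
static int dev_update_features(struct dev *d)
{
    if (d->features == d->wanted)
        return 0;
    if (d->lro_stuck && (d->features & FEAT_LRO) && !(d->wanted & FEAT_LRO)) {
        d->features = d->wanted | FEAT_LRO;   /* LRO bit could not be cleared */
        return -1;
    }
    d->features = d->wanted;
    return 1;
}

/* Stand-in for delivering a feature-change event to listeners. */
static void dev_notify_change(struct dev *d)
{
    d->events++;
}

/* Disable one feature on a lower device; notify only if it was cleared. */
static void sync_lower_feature_off(struct dev *lower, unsigned feature)
{
    lower->wanted &= ~feature;
    dev_update_features(lower);
    if (lower->features & feature)
        return;                /* failure case: no event, so no re-entry */
    dev_notify_change(lower);
}
```

In this model a device whose LRO bit cannot be cleared never emits an event, so an upper device reacting to events has nothing to recurse on.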
    • P
      mptcp: set correct vfs info for subflows · 7d14b0d2
      Paolo Abeni committed
      When a subflow is created via mptcp_subflow_create_socket(),
      a new 'struct socket' is allocated, with a new i_ino value.
      
      When inspecting TCP sockets via procfs and/or the diag
      interface, these subflow sockets are not associated with the process
      owning the MPTCP master socket, even though they are logically part
      of it ('ss -p' shows an empty process field).
      
      Additionally, subflows created by the path manager get
      the uid/gid from the running workqueue.
      
      Subflows are part of the owning MPTCP master socket, so
      adjust the vfs info to reflect this.
      
      After this patch, 'ss' correctly displays subflows as belonging
      to the msk socket creator.
      
      Fixes: 2303f994 ("mptcp: Associate MPTCP context with TCP socket")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7d14b0d2
    • M
      Revert "ipv6: add mtu lock check in __ip6_rt_update_pmtu" · 09454fd0
      Maciej Żenczykowski committed
      This reverts commit 19bda36c:
      
      | ipv6: add mtu lock check in __ip6_rt_update_pmtu
      |
      | Prior to this patch, ipv6 didn't do mtu lock check in ip6_update_pmtu.
      | It leaded to that mtu lock doesn't really work when receiving the pkt
      | of ICMPV6_PKT_TOOBIG.
      |
      | This patch is to add mtu lock check in __ip6_rt_update_pmtu just as ipv4
      | did in __ip_rt_update_pmtu.
      
      The above reasoning is incorrect.  IPv6 *requires* icmp based pmtu to work.
      There's already a comment to this effect elsewhere in the kernel:
      
        $ git grep -p -B1 -A3 'RTAX_MTU lock'
        net/ipv6/route.c=4813=
      
        static int rt6_mtu_change_route(struct fib6_info *f6i, void *p_arg)
        ...
          /* In IPv6 pmtu discovery is not optional,
             so that RTAX_MTU lock cannot disable it.
             We still use this lock to block changes
             caused by addrconf/ndisc.
          */
      
      This reverts to the pre-4.9 behaviour.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Xin Long <lucien.xin@gmail.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Fixes: 19bda36c ("ipv6: add mtu lock check in __ip6_rt_update_pmtu")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      09454fd0
  14. 07 May, 2020 3 commits