1. 22 9月, 2020 11 次提交
    • A
      net: core: device_rename: Use rwsem instead of a seqcount · 6fb670cf
      Ahmed S. Darwish 提交于
      stable inclusion
      from linux-4.19.130
      commit 29d1d0c7246462f2138ede9d5e5a1a7606366c17
      
      --------------------------------
      
      [ Upstream commit 11d6011c ]
      
      Sequence counters write paths are critical sections that must never be
      preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
      
      Commit 5dbe7c17 ("net: fix kernel deadlock with interface rename and
      netdev name retrieval.") handled a deadlock, observed with
      CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
      infinitely spinning: it got scheduled after the seqcount write side
      blocked inside its own critical section.
      
      To fix that deadlock, among other issues, the commit added a
      cond_resched() inside the read side section. While this will get the
      non-preemptible kernel eventually unstuck, the seqcount reader is fully
      exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
      
      The fix is also still broken: if the seqcount reader belongs to a
      real-time scheduling policy, it can spin forever and the kernel will
      livelock.
      
      Disabling preemption over the seqcount write side critical section will
      not work: inside it are a number of GFP_KERNEL allocations and mutex
      locking through the drivers/base/ :: device_rename() call chain.
      
      >From all the above, replace the seqcount with a rwsem.
      
      Fixes: 5dbe7c17 (net: fix kernel deadlock with interface rename and netdev name retrieval.)
      Fixes: 30e6c9fa (net: devnet_rename_seq should be a seqcount)
      Fixes: c91f6df2 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
      Cc: <stable@vger.kernel.org>
      Reported-by: kbuild test robot <lkp@intel.com> [ v1 missing up_read() on error exit ]
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com> [ v1 missing up_read() on error exit ]
      Signed-off-by: NAhmed S. Darwish <a.darwish@linutronix.de>
      Reviewed-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NLi Aichun <liaichun@huawei.com>
      Reviewed-by: Nguodeqing <geffrey.guo@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      6fb670cf
    • T
      sched/rt, net: Use CONFIG_PREEMPTION.patch · 7a9a65d9
      Thomas Gleixner 提交于
      stable inclusion
      from linux-4.19.130
      commit b855db2a128ae5d003e03e1b7e7faf8d88b02134
      
      --------------------------------
      
      [ Upstream commit 2da2b32f ]
      
      CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
      Both PREEMPT and PREEMPT_RT require the same functionality which today
      depends on CONFIG_PREEMPT.
      
      Update the comment to use CONFIG_PREEMPTION.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: netdev@vger.kernel.org
      Link: https://lore.kernel.org/r/20191015191821.11479-22-bigeasy@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NLi Aichun <liaichun@huawei.com>
      Reviewed-by: Nguodeqing <geffrey.guo@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      7a9a65d9
    • B
      __netif_receive_skb_core: pass skb by reference · 4e4c0fe6
      Boris Sukholitko 提交于
      stable inclusion
      from linux-4.19.126
      commit 96b2f1c0b073734d42b2cb38bdd59b945c78d51f
      
      --------------------------------
      
      [ Upstream commit c0bbbdc3 ]
      
      __netif_receive_skb_core may change the skb pointer passed into it (e.g.
      in rx_handler). The original skb may be freed as a result of this
      operation.
      
      The callers of __netif_receive_skb_core may further process original skb
      by using pt_prev pointer returned by __netif_receive_skb_core thus
      leading to unpleasant effects.
      
      The solution is to pass skb by reference into __netif_receive_skb_core.
      
      v2: Added Fixes tag and comment regarding ppt_prev and skb invariant.
      
      Fixes: 88eb1944 ("net: core: propagate SKB lists through packet_type lookup")
      Signed-off-by: NBoris Sukholitko <boris.sukholitko@broadcom.com>
      Acked-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NLi Aichun <liaichun@huawei.com>
      Reviewed-by: Nguodeqing <geffrey.guo@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      4e4c0fe6
    • C
      net: fix a potential recursive NETDEV_FEAT_CHANGE · 43df3c21
      Cong Wang 提交于
      stable inclusion
      from linux-4.19.124
      commit 2ef834fec2adc51a8d5b5f3de494989e0aff872e
      
      --------------------------------
      
      [ Upstream commit dd912306 ]
      
      syzbot managed to trigger a recursive NETDEV_FEAT_CHANGE event
      between bonding master and slave. I managed to find a reproducer
      for this:
      
        ip li set bond0 up
        ifenslave bond0 eth0
        brctl addbr br0
        ethtool -K eth0 lro off
        brctl addif br0 bond0
        ip li set br0 up
      
      When a NETDEV_FEAT_CHANGE event is triggered on a bonding slave,
      it captures this and calls bond_compute_features() to fixup its
      master's and other slaves' features. However, when syncing with
      its lower devices by netdev_sync_lower_features() this event is
      triggered again on slaves when the LRO feature fails to change,
      so it goes back and forth recursively until the kernel stack is
      exhausted.
      
      Commit 17b85d29 intentionally lets __netdev_update_features()
      return -1 for such a failure case, so we have to just rely on
      the existing check inside netdev_sync_lower_features() and skip
      NETDEV_FEAT_CHANGE event only for this specific failure case.
      
      Fixes: fd867d51 ("net/core: generic support for disabling netdev features down stack")
      Reported-by: syzbot+e73ceacfd8560cc8a3ca@syzkaller.appspotmail.com
      Reported-by: syzbot+c2fb6f9ddcea95ba49b5@syzkaller.appspotmail.com
      Cc: Jarod Wilson <jarod@redhat.com>
      Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Reviewed-by: NJay Vosburgh <jay.vosburgh@canonical.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NLi Aichun <liaichun@huawei.com>
      Reviewed-by: Nguodeqing <geffrey.guo@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      43df3c21
    • K
      net: revert default NAPI poll timeout to 2 jiffies · 190cfd07
      Konstantin Khlebnikov 提交于
      stable inclusion
      from linux-4.19.117
      commit f7379c0050d2bfb65e44b340f1d667254dcc3058
      
      --------------------------------
      
      [ Upstream commit a4837980 ]
      
      For HZ < 1000 timeout 2000us rounds up to 1 jiffy but expires randomly
      because next timer interrupt could come shortly after starting softirq.
      
      For commonly used CONFIG_HZ=1000 nothing changes.
      
      Fixes: 7acf8a1e ("Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning")
      Reported-by: NDmitry Yakunin <zeil@yandex-team.ru>
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NLi Aichun <liaichun@huawei.com>
      Reviewed-by: Nguodeqing <geffrey.guo@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      190cfd07
    • A
      net: Fix Tx hash bound checking · 4553d0bb
      Amritha Nambiar 提交于
      stable inclusion
      from linux-4.19.115
      commit b1cb7f2bc9b4f776ae1ab9583802b1bca34d215a
      
      --------------------------------
      
      commit 6e11d157 upstream.
      
      Fixes the lower and upper bounds when there are multiple TCs and
      traffic is on the the same TC on the same device.
      
      The lower bound is represented by 'qoffset' and the upper limit for
      hash value is 'qcount + qoffset'. This gives a clean Rx to Tx queue
      mapping when there are multiple TCs, as the queue indices for upper TCs
      will be offset by 'qoffset'.
      
      v2: Fixed commit description based on comments.
      
      Fixes: 1b837d48 ("net: Revoke export for __skb_tx_hash, update it to just be static skb_tx_hash")
      Fixes: eadec877 ("net: Add support for subordinate traffic classes to netdev_pick_tx")
      Signed-off-by: NAmritha Nambiar <amritha.nambiar@intel.com>
      Reviewed-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: NSridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NLi Aichun <liaichun@huawei.com>
      Reviewed-by: Nguodeqing <geffrey.guo@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      4553d0bb
    • T
      core: Don't skip generic XDP program execution for cloned SKBs · 9a23d8c8
      Toke Høiland-Jørgensen 提交于
      stable inclusion
      from linux-4.19.106
      commit ce754a314994c902d1a481288ff249401fdffe21
      
      --------------------------------
      
      [ Upstream commit ad1e03b2 ]
      
      The current generic XDP handler skips execution of XDP programs entirely if
      an SKB is marked as cloned. This leads to some surprising behaviour, as
      packets can end up being cloned in various ways, which will make an XDP
      program not see all the traffic on an interface.
      
      This was discovered by a simple test case where an XDP program that always
      returns XDP_DROP is installed on a veth device. When combining this with
      the Scapy packet sniffer (which uses an AF_PACKET) socket on the sending
      side, SKBs reliably end up in the cloned state, causing them to be passed
      through to the receiving interface instead of being dropped. A minimal
      reproducer script for this is included below.
      
      This patch fixed the issue by simply triggering the existing linearisation
      code for cloned SKBs instead of skipping the XDP program execution. This
      behaviour is in line with the behaviour of the native XDP implementation
      for the veth driver, which will reallocate and copy the SKB data if the SKB
      is marked as shared.
      
      Reproducer Python script (requires BCC and Scapy):
      
      from scapy.all import TCP, IP, Ether, sendp, sniff, AsyncSniffer, Raw, UDP
      from bcc import BPF
      import time, sys, subprocess, shlex
      
      SKB_MODE = (1 << 1)
      DRV_MODE = (1 << 2)
      PYTHON=sys.executable
      
      def client():
          time.sleep(2)
          # Sniffing on the sender causes skb_cloned() to be set
          s = AsyncSniffer()
          s.start()
      
          for p in range(10):
              sendp(Ether(dst="aa:aa:aa:aa:aa:aa", src="cc:cc:cc:cc:cc:cc")/IP()/UDP()/Raw("Test"),
                    verbose=False)
              time.sleep(0.1)
      
          s.stop()
          return 0
      
      def server(mode):
          prog = BPF(text="int dummy_drop(struct xdp_md *ctx) {return XDP_DROP;}")
          func = prog.load_func("dummy_drop", BPF.XDP)
          prog.attach_xdp("a_to_b", func, mode)
      
          time.sleep(1)
      
          s = sniff(iface="a_to_b", count=10, timeout=15)
          if len(s):
              print(f"Got {len(s)} packets - should have gotten 0")
              return 1
          else:
              print("Got no packets - as expected")
              return 0
      
      if len(sys.argv) < 2:
          print(f"Usage: {sys.argv[0]} <skb|drv>")
          sys.exit(1)
      
      if sys.argv[1] == "client":
          sys.exit(client())
      elif sys.argv[1] == "server":
          mode = SKB_MODE if sys.argv[2] == 'skb' else DRV_MODE
          sys.exit(server(mode))
      else:
          try:
              mode = sys.argv[1]
              if mode not in ('skb', 'drv'):
                  print(f"Usage: {sys.argv[0]} <skb|drv>")
                  sys.exit(1)
              print(f"Running in {mode} mode")
      
              for cmd in [
                      'ip netns add netns_a',
                      'ip netns add netns_b',
                      'ip -n netns_a link add a_to_b type veth peer name b_to_a netns netns_b',
                      # Disable ipv6 to make sure there's no address autoconf traffic
                      'ip netns exec netns_a sysctl -qw net.ipv6.conf.a_to_b.disable_ipv6=1',
                      'ip netns exec netns_b sysctl -qw net.ipv6.conf.b_to_a.disable_ipv6=1',
                      'ip -n netns_a link set dev a_to_b address aa:aa:aa:aa:aa:aa',
                      'ip -n netns_b link set dev b_to_a address cc:cc:cc:cc:cc:cc',
                      'ip -n netns_a link set dev a_to_b up',
                      'ip -n netns_b link set dev b_to_a up']:
                  subprocess.check_call(shlex.split(cmd))
      
              server = subprocess.Popen(shlex.split(f"ip netns exec netns_a {PYTHON} {sys.argv[0]} server {mode}"))
              client = subprocess.Popen(shlex.split(f"ip netns exec netns_b {PYTHON} {sys.argv[0]} client"))
      
              client.wait()
              server.wait()
              sys.exit(server.returncode)
      
          finally:
              subprocess.run(shlex.split("ip netns delete netns_a"))
              subprocess.run(shlex.split("ip netns delete netns_b"))
      
      Fixes: d4455169 ("net: xdp: support xdp generic on virtual devices")
      Reported-by: NStepan Horacek <shoracek@redhat.com>
      Suggested-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NLi Aichun <liaichun@huawei.com>
      Reviewed-by: Nguodeqing <geffrey.guo@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      9a23d8c8
    • J
      net-sysfs: Fix reference count leak · ce29e8b1
      Jouni Hogander 提交于
      stable inclusion
      from linux-4.19.100
      commit b4b0f1fc194614859486b4fd19bd5885a3c8818f
      
      --------------------------------
      
      [ Upstream commit cb626bf5 ]
      
      Netdev_register_kobject is calling device_initialize. In case of error
      reference taken by device_initialize is not given up.
      
      Drivers are supposed to call free_netdev in case of error. In non-error
      case the last reference is given up there and device release sequence
      is triggered. In error case this reference is kept and the release
      sequence is never started.
      
      Fix this by setting reg_state as NETREG_UNREGISTERED if registering
      fails.
      
      This is the rootcause for couple of memory leaks reported by Syzkaller:
      
      BUG: memory leak unreferenced object 0xffff8880675ca008 (size 256):
        comm "netdev_register", pid 281, jiffies 4294696663 (age 6.808s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
        backtrace:
          [<0000000058ca4711>] kmem_cache_alloc_trace+0x167/0x280
          [<000000002340019b>] device_add+0x882/0x1750
          [<000000001d588c3a>] netdev_register_kobject+0x128/0x380
          [<0000000011ef5535>] register_netdevice+0xa1b/0xf00
          [<000000007fcf1c99>] __tun_chr_ioctl+0x20d5/0x3dd0
          [<000000006a5b7b2b>] tun_chr_ioctl+0x2f/0x40
          [<00000000f30f834a>] do_vfs_ioctl+0x1c7/0x1510
          [<00000000fba062ea>] ksys_ioctl+0x99/0xb0
          [<00000000b1c1b8d2>] __x64_sys_ioctl+0x78/0xb0
          [<00000000984cabb9>] do_syscall_64+0x16f/0x580
          [<000000000bde033d>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
          [<00000000e6ca2d9f>] 0xffffffffffffffff
      
      BUG: memory leak
      unreferenced object 0xffff8880668ba588 (size 8):
        comm "kobject_set_nam", pid 286, jiffies 4294725297 (age 9.871s)
        hex dump (first 8 bytes):
          6e 72 30 00 cc be df 2b                          nr0....+
        backtrace:
          [<00000000a322332a>] __kmalloc_track_caller+0x16e/0x290
          [<00000000236fd26b>] kstrdup+0x3e/0x70
          [<00000000dd4a2815>] kstrdup_const+0x3e/0x50
          [<0000000049a377fc>] kvasprintf_const+0x10e/0x160
          [<00000000627fc711>] kobject_set_name_vargs+0x5b/0x140
          [<0000000019eeab06>] dev_set_name+0xc0/0xf0
          [<0000000069cb12bc>] netdev_register_kobject+0xc8/0x320
          [<00000000f2e83732>] register_netdevice+0xa1b/0xf00
          [<000000009e1f57cc>] __tun_chr_ioctl+0x20d5/0x3dd0
          [<000000009c560784>] tun_chr_ioctl+0x2f/0x40
          [<000000000d759e02>] do_vfs_ioctl+0x1c7/0x1510
          [<00000000351d7c31>] ksys_ioctl+0x99/0xb0
          [<000000008390040a>] __x64_sys_ioctl+0x78/0xb0
          [<0000000052d196b7>] do_syscall_64+0x16f/0x580
          [<0000000019af9236>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
          [<00000000bc384531>] 0xffffffffffffffff
      
      v3 -> v4:
        Set reg_state to NETREG_UNREGISTERED if registering fails
      
      v2 -> v3:
      * Replaced BUG_ON with WARN_ON in free_netdev and netdev_release
      
      v1 -> v2:
      * Relying on driver calling free_netdev rather than calling
        put_device directly in error path
      
      Reported-by: syzbot+ad8ca40ecd77896d51e2@syzkaller.appspotmail.com
      Cc: David Miller <davem@davemloft.net>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Signed-off-by: NJouni Hogander <jouni.hogander@unikie.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NLi Aichun <liaichun@huawei.com>
      Reviewed-by: Nguodeqing <geffrey.guo@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      ce29e8b1
    • E
      net: rtnetlink: validate IFLA_MTU attribute in rtnl_create_link() · 47039ada
      Eric Dumazet 提交于
      stable inclusion
      from linux-4.19.100
      commit be1a2be7a7b0ed5a758fd8decc39386ba3b5d556
      
      --------------------------------
      
      [ Upstream commit d836f5c6 ]
      
      rtnl_create_link() needs to apply dev->min_mtu and dev->max_mtu
      checks that we apply in do_setlink()
      
      Otherwise malicious users can crash the kernel, for example after
      an integer overflow :
      
      BUG: KASAN: use-after-free in memset include/linux/string.h:365 [inline]
      BUG: KASAN: use-after-free in __alloc_skb+0x37b/0x5e0 net/core/skbuff.c:238
      Write of size 32 at addr ffff88819f20b9c0 by task swapper/0/0
      
      CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc1-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x197/0x210 lib/dump_stack.c:118
       print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
       __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
       kasan_report+0x12/0x20 mm/kasan/common.c:639
       check_memory_region_inline mm/kasan/generic.c:185 [inline]
       check_memory_region+0x134/0x1a0 mm/kasan/generic.c:192
       memset+0x24/0x40 mm/kasan/common.c:108
       memset include/linux/string.h:365 [inline]
       __alloc_skb+0x37b/0x5e0 net/core/skbuff.c:238
       alloc_skb include/linux/skbuff.h:1049 [inline]
       alloc_skb_with_frags+0x93/0x590 net/core/skbuff.c:5664
       sock_alloc_send_pskb+0x7ad/0x920 net/core/sock.c:2242
       sock_alloc_send_skb+0x32/0x40 net/core/sock.c:2259
       mld_newpack+0x1d7/0x7f0 net/ipv6/mcast.c:1609
       add_grhead.isra.0+0x299/0x370 net/ipv6/mcast.c:1713
       add_grec+0x7db/0x10b0 net/ipv6/mcast.c:1844
       mld_send_cr net/ipv6/mcast.c:1970 [inline]
       mld_ifc_timer_expire+0x3d3/0x950 net/ipv6/mcast.c:2477
       call_timer_fn+0x1ac/0x780 kernel/time/timer.c:1404
       expire_timers kernel/time/timer.c:1449 [inline]
       __run_timers kernel/time/timer.c:1773 [inline]
       __run_timers kernel/time/timer.c:1740 [inline]
       run_timer_softirq+0x6c3/0x1790 kernel/time/timer.c:1786
       __do_softirq+0x262/0x98c kernel/softirq.c:292
       invoke_softirq kernel/softirq.c:373 [inline]
       irq_exit+0x19b/0x1e0 kernel/softirq.c:413
       exiting_irq arch/x86/include/asm/apic.h:536 [inline]
       smp_apic_timer_interrupt+0x1a3/0x610 arch/x86/kernel/apic/apic.c:1137
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:829
       </IRQ>
      RIP: 0010:native_safe_halt+0xe/0x10 arch/x86/include/asm/irqflags.h:61
      Code: 98 6b ea f9 eb 8a cc cc cc cc cc cc e9 07 00 00 00 0f 00 2d 44 1c 60 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d 34 1c 60 00 fb f4 <c3> cc 55 48 89 e5 41 57 41 56 41 55 41 54 53 e8 4e 5d 9a f9 e8 79
      RSP: 0018:ffffffff89807ce8 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
      RAX: 1ffffffff13266ae RBX: ffffffff8987a1c0 RCX: 0000000000000000
      RDX: dffffc0000000000 RSI: 0000000000000006 RDI: ffffffff8987aa54
      RBP: ffffffff89807d18 R08: ffffffff8987a1c0 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000
      R13: ffffffff8a799980 R14: 0000000000000000 R15: 0000000000000000
       arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:690
       default_idle_call+0x84/0xb0 kernel/sched/idle.c:94
       cpuidle_idle_call kernel/sched/idle.c:154 [inline]
       do_idle+0x3c8/0x6e0 kernel/sched/idle.c:269
       cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:361
       rest_init+0x23b/0x371 init/main.c:451
       arch_call_rest_init+0xe/0x1b
       start_kernel+0x904/0x943 init/main.c:784
       x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:490
       x86_64_start_kernel+0x77/0x7b arch/x86/kernel/head64.c:471
       secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:242
      
      The buggy address belongs to the page:
      page:ffffea00067c82c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0
      raw: 057ffe0000000000 ffffea00067c82c8 ffffea00067c82c8 0000000000000000
      raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88819f20b880: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
       ffff88819f20b900: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      >ffff88819f20b980: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                                                 ^
       ffff88819f20ba00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
       ffff88819f20ba80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      
      Fixes: 61e84623 ("net: centralize net_device min/max MTU checking")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NLi Aichun <liaichun@huawei.com>
      Reviewed-by: Nguodeqing <geffrey.guo@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      47039ada
    • J
      net: fix bpf_xdp_adjust_head regression for generic-XDP · 4516f7fc
      Jesper Dangaard Brouer 提交于
      stable inclusion
      from linux-4.19.99
      commit 50176c0d22ea2347867c6196c99b0f778f81f7be
      
      --------------------------------
      
      [ Upstream commit 065af355 ]
      
      When generic-XDP was moved to a later processing step by commit
      458bf2f2 ("net: core: support XDP generic on stacked devices.")
      a regression was introduced when using bpf_xdp_adjust_head.
      
      The issue is that after this commit the skb->network_header is now
      changed prior to calling generic XDP and not after. Thus, if the header
      is changed by XDP (via bpf_xdp_adjust_head), then skb->network_header
      also need to be updated again.  Fix by calling skb_reset_network_header().
      
      Fixes: 458bf2f2 ("net: core: support XDP generic on stacked devices.")
      Reported-by: NBrandon Cazander <brandon.cazander@multapplied.net>
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NLi Aichun <liaichun@huawei.com>
      Reviewed-by: Nguodeqing <geffrey.guo@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      4516f7fc
    • S
      net: core: support XDP generic on stacked devices. · 4c5c676a
      Stephen Hemminger 提交于
      stable inclusion
      from linux-4.19.99
      commit 6c350e974c953d9a806c73b629eb46f40504743f
      
      --------------------------------
      
      [ Upstream commit 458bf2f2 ]
      
      When a device is stacked like (team, bonding, failsafe or netvsc) the
      XDP generic program for the parent device was not called.
      
      Move the call to XDP generic inside __netif_receive_skb_core where
      it can be done multiple times for stacked case.
      
      Fixes: d4455169 ("net: xdp: support xdp generic on virtual devices")
      Signed-off-by: NStephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NLi Aichun <liaichun@huawei.com>
      Reviewed-by: Nguodeqing <geffrey.guo@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      4c5c676a
  2. 05 3月, 2020 2 次提交
    • T
      net: core: limit nested device depth · cdb5d42b
      Taehee Yoo 提交于
      [ Upstream commit 5343da4c17429efaa5fb1594ea96aee1a283e694 ]
      
      Current code doesn't limit the number of nested devices.
      Nested devices would be handled recursively and this needs huge stack
      memory. So, unlimited nested devices could make stack overflow.
      
      This patch adds upper_level and lower_level, they are common variables
      and represent maximum lower/upper depth.
      When upper/lower device is attached or dettached,
      {lower/upper}_level are updated. and if maximum depth is bigger than 8,
      attach routine fails and returns -EMLINK.
      
      In addition, this patch converts recursive routine of
      netdev_walk_all_{lower/upper} to iterator routine.
      
      Test commands:
          ip link add dummy0 type dummy
          ip link add link dummy0 name vlan1 type vlan id 1
          ip link set vlan1 up
      
          for i in {2..55}
          do
      	    let A=$i-1
      
      	    ip link add vlan$i link vlan$A type vlan id $i
          done
          ip link del dummy0
      
      Splat looks like:
      [  155.513226][  T908] BUG: KASAN: use-after-free in __unwind_start+0x71/0x850
      [  155.514162][  T908] Write of size 88 at addr ffff8880608a6cc0 by task ip/908
      [  155.515048][  T908]
      [  155.515333][  T908] CPU: 0 PID: 908 Comm: ip Not tainted 5.4.0-rc3+ #96
      [  155.516147][  T908] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [  155.517233][  T908] Call Trace:
      [  155.517627][  T908]
      [  155.517918][  T908] Allocated by task 0:
      [  155.518412][  T908] (stack is not available)
      [  155.518955][  T908]
      [  155.519228][  T908] Freed by task 0:
      [  155.519885][  T908] (stack is not available)
      [  155.520452][  T908]
      [  155.520729][  T908] The buggy address belongs to the object at ffff8880608a6ac0
      [  155.520729][  T908]  which belongs to the cache names_cache of size 4096
      [  155.522387][  T908] The buggy address is located 512 bytes inside of
      [  155.522387][  T908]  4096-byte region [ffff8880608a6ac0, ffff8880608a7ac0)
      [  155.523920][  T908] The buggy address belongs to the page:
      [  155.524552][  T908] page:ffffea0001822800 refcount:1 mapcount:0 mapping:ffff88806c657cc0 index:0x0 compound_mapcount:0
      [  155.525836][  T908] flags: 0x100000000010200(slab|head)
      [  155.526445][  T908] raw: 0100000000010200 ffffea0001813808 ffffea0001a26c08 ffff88806c657cc0
      [  155.527424][  T908] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
      [  155.528429][  T908] page dumped because: kasan: bad access detected
      [  155.529158][  T908]
      [  155.529410][  T908] Memory state around the buggy address:
      [  155.530060][  T908]  ffff8880608a6b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  155.530971][  T908]  ffff8880608a6c00: fb fb fb fb fb f1 f1 f1 f1 00 f2 f2 f2 f3 f3 f3
      [  155.531889][  T908] >ffff8880608a6c80: f3 fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  155.532806][  T908]                                            ^
      [  155.533509][  T908]  ffff8880608a6d00: fb fb fb fb fb fb fb fb fb f1 f1 f1 f1 00 00 00
      [  155.534436][  T908]  ffff8880608a6d80: f2 f3 f3 f3 f3 fb fb fb 00 00 00 00 00 00 00 00
      [ ... ]
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      cdb5d42b
    • E
      inet: protect against too small mtu values. · 58338396
      Eric Dumazet 提交于
      [ Upstream commit 501a90c9 ]
      
      syzbot was once again able to crash a host by setting a very small mtu
      on loopback device.
      
      Let's make inetdev_valid_mtu() available in include/net/ip.h,
      and use it in ip_setup_cork(), so that we protect both ip_append_page()
      and __ip_append_data()
      
      Also add a READ_ONCE() when the device mtu is read.
      
      Pairs this lockless read with one WRITE_ONCE() in __dev_set_mtu(),
      even if other code paths might write over this field.
      
      Add a big comment in include/linux/netdevice.h about dev->mtu
      needing READ_ONCE()/WRITE_ONCE() annotations.
      
      Hopefully we will add the missing ones in followup patches.
      
      [1]
      
      refcount_t: saturated; leaking memory.
      WARNING: CPU: 0 PID: 9464 at lib/refcount.c:22 refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 0 PID: 9464 Comm: syz-executor850 Not tainted 5.4.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x197/0x210 lib/dump_stack.c:118
       panic+0x2e3/0x75c kernel/panic.c:221
       __warn.cold+0x2f/0x3e kernel/panic.c:582
       report_bug+0x289/0x300 lib/bug.c:195
       fixup_bug arch/x86/kernel/traps.c:174 [inline]
       fixup_bug arch/x86/kernel/traps.c:169 [inline]
       do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:267
       do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:286
       invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
      RIP: 0010:refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
      Code: 06 31 ff 89 de e8 c8 f5 e6 fd 84 db 0f 85 6f ff ff ff e8 7b f4 e6 fd 48 c7 c7 e0 71 4f 88 c6 05 56 a6 a4 06 01 e8 c7 a8 b7 fd <0f> 0b e9 50 ff ff ff e8 5c f4 e6 fd 0f b6 1d 3d a6 a4 06 31 ff 89
      RSP: 0018:ffff88809689f550 EFLAGS: 00010286
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffffffff815e4336 RDI: ffffed1012d13e9c
      RBP: ffff88809689f560 R08: ffff88809c50a3c0 R09: fffffbfff15d31b1
      R10: fffffbfff15d31b0 R11: ffffffff8ae98d87 R12: 0000000000000001
      R13: 0000000000040100 R14: ffff888099041104 R15: ffff888218d96e40
       refcount_add include/linux/refcount.h:193 [inline]
       skb_set_owner_w+0x2b6/0x410 net/core/sock.c:1999
       sock_wmalloc+0xf1/0x120 net/core/sock.c:2096
       ip_append_page+0x7ef/0x1190 net/ipv4/ip_output.c:1383
       udp_sendpage+0x1c7/0x480 net/ipv4/udp.c:1276
       inet_sendpage+0xdb/0x150 net/ipv4/af_inet.c:821
       kernel_sendpage+0x92/0xf0 net/socket.c:3794
       sock_sendpage+0x8b/0xc0 net/socket.c:936
       pipe_to_sendpage+0x2da/0x3c0 fs/splice.c:458
       splice_from_pipe_feed fs/splice.c:512 [inline]
       __splice_from_pipe+0x3ee/0x7c0 fs/splice.c:636
       splice_from_pipe+0x108/0x170 fs/splice.c:671
       generic_splice_sendpage+0x3c/0x50 fs/splice.c:842
       do_splice_from fs/splice.c:861 [inline]
       direct_splice_actor+0x123/0x190 fs/splice.c:1035
       splice_direct_to_actor+0x3b4/0xa30 fs/splice.c:990
       do_splice_direct+0x1da/0x2a0 fs/splice.c:1078
       do_sendfile+0x597/0xd00 fs/read_write.c:1464
       __do_sys_sendfile64 fs/read_write.c:1525 [inline]
       __se_sys_sendfile64 fs/read_write.c:1511 [inline]
       __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x441409
      Code: e8 ac e8 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fffb64c4f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441409
      RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000005
      RBP: 0000000000073b8a R08: 0000000000000010 R09: 0000000000000010
      R10: 0000000000010001 R11: 0000000000000246 R12: 0000000000402180
      R13: 0000000000402210 R14: 0000000000000000 R15: 0000000000000000
      Kernel Offset: disabled
      Rebooting in 86400 seconds..
      
      Fixes: 1470ddf7 ("inet: Remove explicit write references to sk/inet in ip_append_data")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      58338396
  3. 27 12月, 2019 12 次提交
    • E
      net: do not abort bulk send on BQL status · a02c33df
      Eric Dumazet 提交于
      [ Upstream commit fe60faa5 ]
      
      Before calling dev_hard_start_xmit(), upper layers tried
      to cook optimal skb list based on BQL budget.
      
      Problem is that GSO packets can end up comsuming more than
      the BQL budget.
      
      Breaking the loop is not useful, since requeued packets
      are ahead of any packets still in the qdisc.
      
      It is also more expensive, since next TX completion will
      push these packets later, while skbs are not in cpu caches.
      
      It is also a behavior difference with TSO packets, that can
      break the BQL limit by a large amount.
      
      Note that drivers should use __netdev_tx_sent_queue()
      in order to have optimal xmit_more support, and avoid
      useless atomic operations as shown in the following patch.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      a02c33df
    • J
      net: fix generic XDP to handle if eth header was mangled · 00af91bc
      Jesper Dangaard Brouer 提交于
      [ Upstream commit 29724956 ]
      
      XDP can modify (and resize) the Ethernet header in the packet.
      
      There is a bug in generic-XDP, because skb->protocol and skb->pkt_type
      are setup before reaching (netif_receive_)generic_xdp.
      
      This bug was hit when XDP were popping VLAN headers (changing
      eth->h_proto), as skb->protocol still contains VLAN-indication
      (ETH_P_8021Q) causing invocation of skb_vlan_untag(skb), which corrupt
      the packet (basically popping the VLAN again).
      
      This patch catch if XDP changed eth header in such a way, that SKB
      fields needs to be updated.
      
      V2: on request from Song Liu, use ETH_HLEN instead of mac_len,
      in __skb_push() as eth_type_trans() use ETH_HLEN in paired skb_pull_inline().
      
      Fixes: d4455169 ("net: xdp: support xdp generic on virtual devices")
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      00af91bc
    • G
      netns: fix GFP flags in rtnl_net_notifyid() · 80157c88
      Guillaume Nault 提交于
      [ Upstream commit d4e4fdf9 ]
      
      In rtnl_net_notifyid(), we certainly can't pass a null GFP flag to
      rtnl_notify(). A GFP_KERNEL flag would be fine in most circumstances,
      but there are a few paths calling rtnl_net_notifyid() from atomic
      context or from RCU critical sections. The later also precludes the use
      of gfp_any() as it wouldn't detect the RCU case. Also, the nlmsg_new()
      call is wrong too, as it uses GFP_KERNEL unconditionally.
      
      Therefore, we need to pass the GFP flags as parameter and propagate it
      through function calls until the proper flags can be determined.
      
      In most cases, GFP_KERNEL is fine. The exceptions are:
        * openvswitch: ovs_vport_cmd_get() and ovs_vport_cmd_dump()
          indirectly call rtnl_net_notifyid() from RCU critical section,
      
        * rtnetlink: rtmsg_ifinfo_build_skb() already receives GFP flags as
          parameter.
      
      Also, in ovs_vport_cmd_build_info(), let's change the GFP flags used
      by nlmsg_new(). The function is allowed to sleep, so better make the
      flags consistent with the ones used in the following
      ovs_vport_cmd_fill_info() call.
      
      Found by code inspection.
      
      Fixes: 9a963454 ("netns: notify netns id events")
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      80157c88
    • S
      net: Fix null de-reference of device refcount · e5cd66c7
      Subash Abhinov Kasiviswanathan 提交于
      [ Upstream commit 10cc514f ]
      
      In event of failure during register_netdevice, free_netdev is
      invoked immediately. free_netdev assumes that all the netdevice
      refcounts have been dropped prior to it being called and as a
      result frees and clears out the refcount pointer.
      
      However, this is not necessarily true as some of the operations
      in the NETDEV_UNREGISTER notifier handlers queue RCU callbacks for
      invocation after a grace period. The IPv4 callback in_dev_rcu_put
      tries to access the refcount after free_netdev is called which
      leads to a null de-reference-
      
      44837.761523:   <6> Unable to handle kernel paging request at
                          virtual address 0000004a88287000
      44837.761651:   <2> pc : in_dev_finish_destroy+0x4c/0xc8
      44837.761654:   <2> lr : in_dev_finish_destroy+0x2c/0xc8
      44837.762393:   <2> Call trace:
      44837.762398:   <2>  in_dev_finish_destroy+0x4c/0xc8
      44837.762404:   <2>  in_dev_rcu_put+0x24/0x30
      44837.762412:   <2>  rcu_nocb_kthread+0x43c/0x468
      44837.762418:   <2>  kthread+0x118/0x128
      44837.762424:   <2>  ret_from_fork+0x10/0x1c
      
      Fix this by waiting for the completion of the call_rcu() in
      case of register_netdevice errors.
      
      Fixes: 93ee31f1 ("[NET]: Fix free_netdev on register_netdev failure.")
      Cc: Sean Tranchetti <stranche@codeaurora.org>
      Signed-off-by: NSubash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      e5cd66c7
    • J
      net: fix ifindex collision during namespace removal · d60b9685
      Jiri Pirko 提交于
      [ Upstream commit 55b40dbf ]
      
      Commit aca51397 ("netns: Fix arbitrary net_device-s corruptions
      on net_ns stop.") introduced a possibility to hit a BUG in case device
      is returning back to init_net and two following conditions are met:
      1) dev->ifindex value is used in a name of another "dev%d"
         device in init_net.
      2) dev->name is used by another device in init_net.
      
      Under real life circumstances this is hard to get. Therefore this has
      been present happily for over 10 years. To reproduce:
      
      $ ip a
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
          inet 127.0.0.1/8 scope host lo
             valid_lft forever preferred_lft forever
          inet6 ::1/128 scope host
             valid_lft forever preferred_lft forever
      2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
          link/ether 86:89:3f:86:61:29 brd ff:ff:ff:ff:ff:ff
      3: enp0s2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
          link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
      $ ip netns add ns1
      $ ip -n ns1 link add dummy1ns1 type dummy
      $ ip -n ns1 link add dummy2ns1 type dummy
      $ ip link set enp0s2 netns ns1
      $ ip -n ns1 link set enp0s2 name dummy0
      [  100.858894] virtio_net virtio0 dummy0: renamed from enp0s2
      $ ip link add dev4 type dummy
      $ ip -n ns1 a
      1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
      2: dummy1ns1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
          link/ether 16:63:4c:38:3e:ff brd ff:ff:ff:ff:ff:ff
      3: dummy2ns1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
          link/ether aa:9e:86:dd:6b:5d brd ff:ff:ff:ff:ff:ff
      4: dummy0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
          link/ether 52:54:00:12:34:56 brd ff:ff:ff:ff:ff:ff
      $ ip a
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
          inet 127.0.0.1/8 scope host lo
             valid_lft forever preferred_lft forever
          inet6 ::1/128 scope host
             valid_lft forever preferred_lft forever
      2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
          link/ether 86:89:3f:86:61:29 brd ff:ff:ff:ff:ff:ff
      4: dev4: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default qlen 1000
          link/ether 5a:e1:4a:b6:ec:f8 brd ff:ff:ff:ff:ff:ff
      $ ip netns del ns1
      [  158.717795] default_device_exit: failed to move dummy0 to init_net: -17
      [  158.719316] ------------[ cut here ]------------
      [  158.720591] kernel BUG at net/core/dev.c:9824!
      [  158.722260] invalid opcode: 0000 [#1] SMP KASAN PTI
      [  158.723728] CPU: 0 PID: 56 Comm: kworker/u2:1 Not tainted 5.3.0-rc1+ #18
      [  158.725422] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
      [  158.727508] Workqueue: netns cleanup_net
      [  158.728915] RIP: 0010:default_device_exit.cold+0x1d/0x1f
      [  158.730683] Code: 84 e8 18 c9 3e fe 0f 0b e9 70 90 ff ff e8 36 e4 52 fe 89 d9 4c 89 e2 48 c7 c6 80 d6 25 84 48 c7 c7 20 c0 25 84 e8 f4 c8 3e
      [  158.736854] RSP: 0018:ffff8880347e7b90 EFLAGS: 00010282
      [  158.738752] RAX: 000000000000003b RBX: 00000000ffffffef RCX: 0000000000000000
      [  158.741369] RDX: 0000000000000000 RSI: ffffffff8128013d RDI: ffffed10068fcf64
      [  158.743418] RBP: ffff888033550170 R08: 000000000000003b R09: fffffbfff0b94b9c
      [  158.745626] R10: fffffbfff0b94b9b R11: ffffffff85ca5cdf R12: ffff888032f28000
      [  158.748405] R13: dffffc0000000000 R14: ffff8880335501b8 R15: 1ffff110068fcf72
      [  158.750638] FS:  0000000000000000(0000) GS:ffff888036000000(0000) knlGS:0000000000000000
      [  158.752944] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  158.755245] CR2: 00007fe8b45d21d0 CR3: 00000000340b4005 CR4: 0000000000360ef0
      [  158.757654] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  158.760012] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  158.762758] Call Trace:
      [  158.763882]  ? dev_change_net_namespace+0xbb0/0xbb0
      [  158.766148]  ? devlink_nl_cmd_set_doit+0x520/0x520
      [  158.768034]  ? dev_change_net_namespace+0xbb0/0xbb0
      [  158.769870]  ops_exit_list.isra.0+0xa8/0x150
      [  158.771544]  cleanup_net+0x446/0x8f0
      [  158.772945]  ? unregister_pernet_operations+0x4a0/0x4a0
      [  158.775294]  process_one_work+0xa1a/0x1740
      [  158.776896]  ? pwq_dec_nr_in_flight+0x310/0x310
      [  158.779143]  ? do_raw_spin_lock+0x11b/0x280
      [  158.780848]  worker_thread+0x9e/0x1060
      [  158.782500]  ? process_one_work+0x1740/0x1740
      [  158.784454]  kthread+0x31b/0x420
      [  158.786082]  ? __kthread_create_on_node+0x3f0/0x3f0
      [  158.788286]  ret_from_fork+0x3a/0x50
      [  158.789871] ---[ end trace defd6c657c71f936 ]---
      [  158.792273] RIP: 0010:default_device_exit.cold+0x1d/0x1f
      [  158.795478] Code: 84 e8 18 c9 3e fe 0f 0b e9 70 90 ff ff e8 36 e4 52 fe 89 d9 4c 89 e2 48 c7 c6 80 d6 25 84 48 c7 c7 20 c0 25 84 e8 f4 c8 3e
      [  158.804854] RSP: 0018:ffff8880347e7b90 EFLAGS: 00010282
      [  158.807865] RAX: 000000000000003b RBX: 00000000ffffffef RCX: 0000000000000000
      [  158.811794] RDX: 0000000000000000 RSI: ffffffff8128013d RDI: ffffed10068fcf64
      [  158.816652] RBP: ffff888033550170 R08: 000000000000003b R09: fffffbfff0b94b9c
      [  158.820930] R10: fffffbfff0b94b9b R11: ffffffff85ca5cdf R12: ffff888032f28000
      [  158.825113] R13: dffffc0000000000 R14: ffff8880335501b8 R15: 1ffff110068fcf72
      [  158.829899] FS:  0000000000000000(0000) GS:ffff888036000000(0000) knlGS:0000000000000000
      [  158.834923] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  158.838164] CR2: 00007fe8b45d21d0 CR3: 00000000340b4005 CR4: 0000000000360ef0
      [  158.841917] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  158.845149] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      Fix this by checking if a device with the same name exists in init_net
      and fallback to original code - dev%d to allocate name - in case it does.
      
      This was found using syzkaller.
      
      Fixes: aca51397 ("netns: Fix arbitrary net_device-s corruptions on net_ns stop.")
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      d60b9685
    • M
      jump_label: move 'asm goto' support test to Kconfig · a7928808
      Masahiro Yamada 提交于
      commit e9666d10a5677a494260d60d1fa0b73cc7646eb3 upstream.
      
      Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label".
      
      The jump label is controlled by HAVE_JUMP_LABEL, which is defined
      like this:
      
        #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
        # define HAVE_JUMP_LABEL
        #endif
      
      We can improve this by testing 'asm goto' support in Kconfig, then
      make JUMP_LABEL depend on CC_HAS_ASM_GOTO.
      
      Ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will
      match to the real kernel capability.
      Signed-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: NSedat Dilek <sedat.dilek@gmail.com>
      [nc: Fix trivial conflicts in 4.19
           arch/xtensa/kernel/jump_label.c doesn't exist yet
           Ensured CC_HAVE_ASM_GOTO and HAVE_JUMP_LABEL were sufficiently
           eliminated]
      Signed-off-by: NNathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Conflicts:
        kernel/module.c
        include/linux/module.h
      [yyl: adjust context]
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      a7928808
    • E
      net-gro: fix use-after-free read in napi_gro_frags() · 5fb5d4b9
      Eric Dumazet 提交于
      [ Upstream commit a4270d67 ]
      
      If a network driver provides to napi_gro_frags() an
      skb with a page fragment of exactly 14 bytes, the call
      to gro_pull_from_frag0() will 'consume' the fragment
      by calling skb_frag_unref(skb, 0), and the page might
      be freed and reused.
      
      Reading eth->h_proto at the end of napi_frags_skb() might
      read mangled data, or crash under specific debugging features.
      
      BUG: KASAN: use-after-free in napi_frags_skb net/core/dev.c:5833 [inline]
      BUG: KASAN: use-after-free in napi_gro_frags+0xc6f/0xd10 net/core/dev.c:5841
      Read of size 2 at addr ffff88809366840c by task syz-executor599/8957
      
      CPU: 1 PID: 8957 Comm: syz-executor599 Not tainted 5.2.0-rc1+ #32
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
       __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       kasan_report+0x12/0x20 mm/kasan/common.c:614
       __asan_report_load_n_noabort+0xf/0x20 mm/kasan/generic_report.c:142
       napi_frags_skb net/core/dev.c:5833 [inline]
       napi_gro_frags+0xc6f/0xd10 net/core/dev.c:5841
       tun_get_user+0x2f3c/0x3ff0 drivers/net/tun.c:1991
       tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2037
       call_write_iter include/linux/fs.h:1872 [inline]
       do_iter_readv_writev+0x5f8/0x8f0 fs/read_write.c:693
       do_iter_write fs/read_write.c:970 [inline]
       do_iter_write+0x184/0x610 fs/read_write.c:951
       vfs_writev+0x1b3/0x2f0 fs/read_write.c:1015
       do_writev+0x15b/0x330 fs/read_write.c:1058
      
      Fixes: a50e233c ("net-gro: restore frag0 optimization")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      5fb5d4b9
    • E
      net: avoid weird emergency message · 3b4c15ed
      Eric Dumazet 提交于
      [ Upstream commit d7c04b05 ]
      
      When host is under high stress, it is very possible thread
      running netdev_wait_allrefs() returns from msleep(250)
      10 seconds late.
      
      This leads to these messages in the syslog :
      
      [...] unregister_netdevice: waiting for syz_tun to become free. Usage count = 0
      
      If the device refcount is zero, the wait is over.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      3b4c15ed
    • S
      failover: allow name change on IFF_UP slave interfaces · 29b6215a
      Si-Wei Liu 提交于
      [ Upstream commit 8065a779 ]
      
      When a netdev appears through hot plug then gets enslaved by a failover
      master that is already up and running, the slave will be opened
      right away after getting enslaved. Today there's a race that userspace
      (udev) may fail to rename the slave if the kernel (net_failover)
      opens the slave earlier than when the userspace rename happens.
      Unlike bond or team, the primary slave of failover can't be renamed by
      userspace ahead of time, since the kernel initiated auto-enslavement is
      unable to, or rather, is never meant to be synchronized with the rename
      request from userspace.
      
      As the failover slave interfaces are not designed to be operated
      directly by userspace apps: IP configuration, filter rules with
      regard to network traffic passing and etc., should all be done on master
      interface. In general, userspace apps only care about the
      name of master interface, while slave names are less important as long
      as admin users can see reliable names that may carry
      other information describing the netdev. For e.g., they can infer that
      "ens3nsby" is a standby slave of "ens3", while for a
      name like "eth0" they can't tell which master it belongs to.
      
      Historically the name of IFF_UP interface can't be changed because
      there might be admin script or management software that is already
      relying on such behavior and assumes that the slave name can't be
      changed once UP. But failover is special: with the in-kernel
      auto-enslavement mechanism, the userspace expectation for device
      enumeration and bring-up order is already broken. Previously initramfs
      and various userspace config tools were modified to bypass failover
      slaves because of auto-enslavement and duplicate MAC address. Similarly,
      in case that users care about seeing reliable slave name, the new type
      of failover slaves needs to be taken care of specifically in userspace
      anyway.
      
      It's less risky to lift up the rename restriction on failover slave
      which is already UP. Although it's possible this change may potentially
      break userspace component (most likely configuration scripts or
      management software) that assumes slave name can't be changed while
      UP, it's relatively a limited and controllable set among all userspace
      components, which can be fixed specifically to listen for the rename
      events on failover slaves. Userspace component interacting with slaves
      is expected to be changed to operate on failover master interface
      instead, as the failover slave is dynamic in nature which may come and
      go at any point.  The goal is to make the role of failover slaves less
      relevant, and userspace components should only deal with failover master
      in the long run.
      
      Fixes: 30c8bd5a ("net: Introduce generic failover module")
      Signed-off-by: NSi-Wei Liu <si-wei.liu@oracle.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Acked-by: NSridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      29b6215a
    • A
      net: core: netif_receive_skb_list: unlist skb before passing to pt->func · 23dda6fa
      Alexander Lobakin 提交于
      mainline inclusion
      from mainline-5.1
      commit 9a5a90d1
      category: bugfix
      bugzilla: 13575
      CVE: NA
      
      -------------------------------------------------
      
      __netif_receive_skb_list_ptype() leaves skb->next poisoned before passing
      it to pt_prev->func handler, what may produce (in certain cases, e.g. DSA
      setup) crashes like:
      
      [ 88.606777] CPU 0 Unable to handle kernel paging request at virtual address 0000000e, epc == 80687078, ra == 8052cc7c
      [ 88.618666] Oops[#1]:
      [ 88.621196] CPU: 0 PID: 0 Comm: swapper Not tainted 5.1.0-rc2-dlink-00206-g4192a172-dirty #1473
      [ 88.630885] $ 0 : 00000000 10000400 00000002 864d7850
      [ 88.636709] $ 4 : 87c0ddf0 864d7800 87c0ddf0 00000000
      [ 88.642526] $ 8 : 00000000 49600000 00000001 00000001
      [ 88.648342] $12 : 00000000 c288617b dadbee27 25d17c41
      [ 88.654159] $16 : 87c0ddf0 85cff080 80790000 fffffffd
      [ 88.659975] $20 : 80797b20 ffffffff 00000001 864d7800
      [ 88.665793] $24 : 00000000 8011e658
      [ 88.671609] $28 : 80790000 87c0dbc0 87cabf00 8052cc7c
      [ 88.677427] Hi : 00000003
      [ 88.680622] Lo : 7b5b4220
      [ 88.683840] epc : 80687078 vlan_dev_hard_start_xmit+0x1c/0x1a0
      [ 88.690532] ra : 8052cc7c dev_hard_start_xmit+0xac/0x188
      [ 88.696734] Status: 10000404	IEp
      [ 88.700422] Cause : 50000008 (ExcCode 02)
      [ 88.704874] BadVA : 0000000e
      [ 88.708069] PrId : 0001a120 (MIPS interAptiv (multi))
      [ 88.713005] Modules linked in:
      [ 88.716407] Process swapper (pid: 0, threadinfo=(ptrval), task=(ptrval), tls=00000000)
      [ 88.725219] Stack : 85f61c28 00000000 0000000e 80780000 87c0ddf0 85cff080 80790000 8052cc7c
      [ 88.734529] 87cabf00 00000000 00000001 85f5fb40 807b0000 864d7850 87cabf00 807d0000
      [ 88.743839] 864d7800 8655f600 00000000 85cff080 87c1c000 0000006a 00000000 8052d96c
      [ 88.753149] 807a0000 8057adb8 87c0dcc8 87c0dc50 85cfff08 00000558 87cabf00 85f58c50
      [ 88.762460] 00000002 85f58c00 864d7800 80543308 fffffff4 00000001 85f58c00 864d7800
      [ 88.771770] ...
      [ 88.774483] Call Trace:
      [ 88.777199] [<80687078>] vlan_dev_hard_start_xmit+0x1c/0x1a0
      [ 88.783504] [<8052cc7c>] dev_hard_start_xmit+0xac/0x188
      [ 88.789326] [<8052d96c>] __dev_queue_xmit+0x6e8/0x7d4
      [ 88.794955] [<805a8640>] ip_finish_output2+0x238/0x4d0
      [ 88.800677] [<805ab6a0>] ip_output+0xc8/0x140
      [ 88.805526] [<805a68f4>] ip_forward+0x364/0x560
      [ 88.810567] [<805a4ff8>] ip_rcv+0x48/0xe4
      [ 88.815030] [<80528d44>] __netif_receive_skb_one_core+0x44/0x58
      [ 88.821635] [<8067f220>] dsa_switch_rcv+0x108/0x1ac
      [ 88.827067] [<80528f80>] __netif_receive_skb_list_core+0x228/0x26c
      [ 88.833951] [<8052ed84>] netif_receive_skb_list+0x1d4/0x394
      [ 88.840160] [<80355a88>] lunar_rx_poll+0x38c/0x828
      [ 88.845496] [<8052fa78>] net_rx_action+0x14c/0x3cc
      [ 88.850835] [<806ad300>] __do_softirq+0x178/0x338
      [ 88.856077] [<8012a2d4>] irq_exit+0xbc/0x100
      [ 88.860846] [<802f8b70>] plat_irq_dispatch+0xc0/0x144
      [ 88.866477] [<80105974>] handle_int+0x14c/0x158
      [ 88.871516] [<806acfb0>] r4k_wait+0x30/0x40
      [ 88.876462] Code: afb10014 8c8200a0 00803025 <9443000c> 94a20468 00000000 10620042 00a08025 9605046a
      [ 88.887332]
      [ 88.888982] ---[ end trace eb863d007da11cf1 ]---
      [ 88.894122] Kernel panic - not syncing: Fatal exception in interrupt
      [ 88.901202] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
      
      Fix this by pulling skb off the sublist and zeroing skb->next pointer
      before calling ptype callback.
      
      Fixes: 88eb1944 ("net: core: propagate SKB lists through packet_type lookup")
      Reviewed-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NAlexander Lobakin <alobakin@dlink.ru>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NZhiqiang Liu <liuzhiqiang26@huawei.com>
      Reviewed-by: NWenan Mao <maowenan@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      23dda6fa
    • H
      net: Fix for_each_netdev_feature on Big endian · b2eb91a6
      Hauke Mehrtens 提交于
      [ Upstream commit 3b89ea9c ]
      
      The features attribute is of type u64 and stored in the native endianes on
      the system. The for_each_set_bit() macro takes a pointer to a 32 bit array
      and goes over the bits in this area. On little Endian systems this also
      works with an u64 as the most significant bit is on the highest address,
      but on big endian the words are swapped. When we expect bit 15 here we get
      bit 47 (15 + 32).
      
      This patch converts it more or less to its own for_each_set_bit()
      implementation which works on 64 bit integers directly. This is then
      completely in host endianness and should work like expected.
      
      Fixes: fd867d51 ("net/core: generic support for disabling netdev features down stack")
      Signed-off-by: NHauke Mehrtens <hauke.mehrtens@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      b2eb91a6
    • J
      net: set default network namespace in init_dummy_netdev() · 1a5e1367
      Josh Elsasser 提交于
      [ Upstream commit 35edfdc7 ]
      
      Assign a default net namespace to netdevs created by init_dummy_netdev().
      Fixes a NULL pointer dereference caused by busy-polling a socket bound to
      an iwlwifi wireless device, which bumps the per-net BUSYPOLLRXPACKETS stat
      if napi_poll() received packets:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000190
        IP: napi_busy_loop+0xd6/0x200
        Call Trace:
          sock_poll+0x5e/0x80
          do_sys_poll+0x324/0x5a0
          SyS_poll+0x6c/0xf0
          do_syscall_64+0x6b/0x1f0
          entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      Fixes: 7db6b048 ("net: Commonize busy polling code to focus on napi_id instead of socket")
      Signed-off-by: NJosh Elsasser <jelsasser@appneta.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      1a5e1367
  4. 17 12月, 2018 3 次提交
    • S
      net: fix XPS static_key accounting · 9ac60749
      Sabrina Dubroca 提交于
      [ Upstream commit 867d0ad476db89a1e8af3f297af402399a54eea5 ]
      
      Commit 04157469 ("net: Use static_key for XPS maps") introduced a
      static key for XPS, but the increments/decrements don't match.
      
      First, the static key's counter is incremented once for each queue, but
      only decremented once for a whole batch of queues, leading to large
      unbalances.
      
      Second, the xps_rxqs_needed key is decremented whenever we reset a batch
      of queues, whether they had any rxqs mapping or not, so that if we setup
      cpu-XPS on em1 and RXQS-XPS on em2, resetting the queues on em1 would
      decrement the xps_rxqs_needed key.
      
      This reworks the accounting scheme so that the xps_needed key is
      incremented only once for each type of XPS for all the queues on a
      device, and the xps_rxqs_needed key is incremented only once for all
      queues. This is sufficient to let us retrieve queues via
      get_xps_queue().
      
      This patch introduces a new reset_xps_maps(), which reinitializes and
      frees the appropriate map (xps_rxqs_map or xps_cpus_map), and drops a
      reference to the needed keys:
       - both xps_needed and xps_rxqs_needed, in case of rxqs maps,
       - only xps_needed, in case of CPU maps.
      
      Now, we also need to call reset_xps_maps() at the end of
      __netif_set_xps_queue() when there's no active map left, for example
      when writing '00000000,00000000' to all queues' xps_rxqs setting.
      
      Fixes: 04157469 ("net: Use static_key for XPS maps")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9ac60749
    • S
      net: restore call to netdev_queue_numa_node_write when resetting XPS · b4b8a71c
      Sabrina Dubroca 提交于
      [ Upstream commit f28c020fb488e1a8b87469812017044bef88aa2b ]
      
      Before commit 80d19669 ("net: Refactor XPS for CPUs and Rx queues"),
      netif_reset_xps_queues() did netdev_queue_numa_node_write() for all the
      queues being reset. Now, this is only done when the "active" variable in
      clean_xps_maps() is false, ie when on all the CPUs, there's no active
      XPS mapping left.
      
      Fixes: 80d19669 ("net: Refactor XPS for CPUs and Rx queues")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b4b8a71c
    • E
      net: use skb_list_del_init() to remove from RX sublists · 7fafda16
      Edward Cree 提交于
      [ Upstream commit 22f6bbb7bcfcef0b373b0502a7ff390275c575dd ]
      
      list_del() leaves the skb->next pointer poisoned, which can then lead to
       a crash in e.g. OVS forwarding.  For example, setting up an OVS VXLAN
       forwarding bridge on sfc as per:
      
      ========
      $ ovs-vsctl show
      5dfd9c47-f04b-4aaa-aa96-4fbb0a522a30
          Bridge "br0"
              Port "br0"
                  Interface "br0"
                      type: internal
              Port "enp6s0f0"
                  Interface "enp6s0f0"
              Port "vxlan0"
                  Interface "vxlan0"
                      type: vxlan
                      options: {key="1", local_ip="10.0.0.5", remote_ip="10.0.0.4"}
          ovs_version: "2.5.0"
      ========
      (where 10.0.0.5 is an address on enp6s0f1)
      and sending traffic across it will lead to the following panic:
      ========
      general protection fault: 0000 [#1] SMP PTI
      CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.20.0-rc3-ehc+ #701
      Hardware name: Dell Inc. PowerEdge R710/0M233H, BIOS 6.4.0 07/23/2013
      RIP: 0010:dev_hard_start_xmit+0x38/0x200
      Code: 53 48 89 fb 48 83 ec 20 48 85 ff 48 89 54 24 08 48 89 4c 24 18 0f 84 ab 01 00 00 48 8d 86 90 00 00 00 48 89 f5 48 89 44 24 10 <4c> 8b 33 48 c7 03 00 00 00 00 48 8b 05 c7 d1 b3 00 4d 85 f6 0f 95
      RSP: 0018:ffff888627b437e0 EFLAGS: 00010202
      RAX: 0000000000000000 RBX: dead000000000100 RCX: ffff88862279c000
      RDX: ffff888614a342c0 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff888618a88000 R08: 0000000000000001 R09: 00000000000003e8
      R10: 0000000000000000 R11: ffff888614a34140 R12: 0000000000000000
      R13: 0000000000000062 R14: dead000000000100 R15: ffff888616430000
      FS:  0000000000000000(0000) GS:ffff888627b40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f6d2bc6d000 CR3: 000000000200a000 CR4: 00000000000006e0
      Call Trace:
       <IRQ>
       __dev_queue_xmit+0x623/0x870
       ? masked_flow_lookup+0xf7/0x220 [openvswitch]
       ? ep_poll_callback+0x101/0x310
       do_execute_actions+0xaba/0xaf0 [openvswitch]
       ? __wake_up_common+0x8a/0x150
       ? __wake_up_common_lock+0x87/0xc0
       ? queue_userspace_packet+0x31c/0x5b0 [openvswitch]
       ovs_execute_actions+0x47/0x120 [openvswitch]
       ovs_dp_process_packet+0x7d/0x110 [openvswitch]
       ovs_vport_receive+0x6e/0xd0 [openvswitch]
       ? dst_alloc+0x64/0x90
       ? rt_dst_alloc+0x50/0xd0
       ? ip_route_input_slow+0x19a/0x9a0
       ? __udp_enqueue_schedule_skb+0x198/0x1b0
       ? __udp4_lib_rcv+0x856/0xa30
       ? __udp4_lib_rcv+0x856/0xa30
       ? cpumask_next_and+0x19/0x20
       ? find_busiest_group+0x12d/0xcd0
       netdev_frame_hook+0xce/0x150 [openvswitch]
       __netif_receive_skb_core+0x205/0xae0
       __netif_receive_skb_list_core+0x11e/0x220
       netif_receive_skb_list+0x203/0x460
       ? __efx_rx_packet+0x335/0x5e0 [sfc]
       efx_poll+0x182/0x320 [sfc]
       net_rx_action+0x294/0x3c0
       __do_softirq+0xca/0x297
       irq_exit+0xa6/0xb0
       do_IRQ+0x54/0xd0
       common_interrupt+0xf/0xf
       </IRQ>
      ========
      So, in all listified-receive handling, instead pull skbs off the lists with
       skb_list_del_init().
      
      Fixes: 9af86f93 ("net: core: fix use-after-free in __netif_receive_skb_list_core")
      Fixes: 7da517a3 ("net: core: Another step of skb receive list processing")
      Fixes: a4ca8b7d ("net: ipv4: fix drop handling in ip_list_rcv() and ip_list_rcv_finish()")
      Fixes: d8269e2c ("net: ipv6: listify ipv6_rcv() and ip6_rcv_finish()")
      Signed-off-by: NEdward Cree <ecree@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7fafda16
  5. 06 12月, 2018 1 次提交
  6. 23 11月, 2018 1 次提交
  7. 04 11月, 2018 1 次提交
  8. 11 10月, 2018 1 次提交
    • S
      net: ipv4: update fnhe_pmtu when first hop's MTU changes · af7d6cce
      Sabrina Dubroca 提交于
      Since commit 5aad1de5 ("ipv4: use separate genid for next hop
      exceptions"), exceptions get deprecated separately from cached
      routes. In particular, administrative changes don't clear PMTU anymore.
      
      As Stefano described in commit e9fa1495 ("ipv6: Reflect MTU changes
      on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
      the local MTU change can become stale:
       - if the local MTU is now lower than the PMTU, that PMTU is now
         incorrect
       - if the local MTU was the lowest value in the path, and is increased,
         we might discover a higher PMTU
      
      Similarly to what commit e9fa1495 did for IPv6, update PMTU in those
      cases.
      
      If the exception was locked, the discovered PMTU was smaller than the
      minimal accepted PMTU. In that case, if the new local MTU is smaller
      than the current PMTU, let PMTU discovery figure out if locking of the
      exception is still needed.
      
      To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
      notifier. By the time the notifier is called, dev->mtu has been
      changed. This patch adds the old MTU as additional information in the
      notifier structure, and a new call_netdevice_notifiers_u32() function.
      
      Fixes: 5aad1de5 ("ipv4: use separate genid for next hop exceptions")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: NStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      af7d6cce
  9. 30 8月, 2018 1 次提交
  10. 10 8月, 2018 1 次提交
    • A
      net: allow to call netif_reset_xps_queues() under cpus_read_lock · 4d99f660
      Andrei Vagin 提交于
      The definition of static_key_slow_inc() has cpus_read_lock in place. In the
      virtio_net driver, XPS queues are initialized after setting the queue:cpu
      affinity in virtnet_set_affinity() which is already protected within
      cpus_read_lock. Lockdep prints a warning when we are trying to acquire
      cpus_read_lock when it is already held.
      
      This patch adds an ability to call __netif_set_xps_queue under
      cpus_read_lock().
      Acked-by: NJason Wang <jasowang@redhat.com>
      
      ============================================
      WARNING: possible recursive locking detected
      4.18.0-rc3-next-20180703+ #1 Not tainted
      --------------------------------------------
      swapper/0/1 is trying to acquire lock:
      00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: static_key_slow_inc+0xe/0x20
      
      but task is already holding lock:
      00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(cpu_hotplug_lock.rw_sem);
        lock(cpu_hotplug_lock.rw_sem);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      3 locks held by swapper/0/1:
       #0: 00000000244bc7da (&dev->mutex){....}, at: __driver_attach+0x5a/0x110
       #1: 00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0
       #2: 000000005cd8463f (xps_map_mutex){+.+.}, at: __netif_set_xps_queue+0x8d/0xc60
      
      v2: move cpus_read_lock() out of __netif_set_xps_queue()
      
      Cc: "Nambiar, Amritha" <amritha.nambiar@intel.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Fixes: 8af2c06f ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
      Signed-off-by: NAndrei Vagin <avagin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4d99f660
  11. 06 8月, 2018 1 次提交
  12. 31 7月, 2018 1 次提交
    • P
      net/tc: introduce TC_ACT_REINSERT. · cd11b164
      Paolo Abeni 提交于
      This is similar TC_ACT_REDIRECT, but with a slightly different
      semantic:
      - on ingress the mirred skbs are passed to the target device
      network stack without any additional check not scrubbing.
      - the rcu-protected stats provided via the tcf_result struct
        are updated on error conditions.
      
      This new tcfa_action value is not exposed to the user-space
      and can be used only internally by clsact.
      
      v1 -> v2: do not touch TC_ACT_REDIRECT code path, introduce
       a new action type instead
      v2 -> v3:
       - rename the new action value TC_ACT_REINJECT, update the
         helper accordingly
       - take care of uncloned reinjected packets in XDP generic
         hook
      v3 -> v4:
       - renamed again the new action value (JiriP)
      v4 -> v5:
       - fix build error with !NET_CLS_ACT (kbuild bot)
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cd11b164
  13. 30 7月, 2018 1 次提交
  14. 27 7月, 2018 1 次提交
  15. 21 7月, 2018 1 次提交
  16. 17 7月, 2018 1 次提交