1. 29 10月, 2019 3 次提交
    • E
      net: use skb_queue_empty_lockless() in poll() handlers · 3ef7cf57
      Eric Dumazet 提交于
      Many poll() handlers are lockless. Using skb_queue_empty_lockless()
      instead of skb_queue_empty() is more appropriate.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3ef7cf57
    • E
      udp: use skb_queue_empty_lockless() · 137a0dbe
      Eric Dumazet 提交于
      syzbot reported a data-race [1].
      
      We should use skb_queue_empty_lockless() to document that we are
      not ensuring a mutual exclusion and silence KCSAN.
      
      [1]
      BUG: KCSAN: data-race in __skb_recv_udp / __udp_enqueue_schedule_skb
      
      write to 0xffff888122474b50 of 8 bytes by interrupt on cpu 0:
       __skb_insert include/linux/skbuff.h:1852 [inline]
       __skb_queue_before include/linux/skbuff.h:1958 [inline]
       __skb_queue_tail include/linux/skbuff.h:1991 [inline]
       __udp_enqueue_schedule_skb+0x2c1/0x410 net/ipv4/udp.c:1470
       __udp_queue_rcv_skb net/ipv4/udp.c:1940 [inline]
       udp_queue_rcv_one_skb+0x7bd/0xc70 net/ipv4/udp.c:2057
       udp_queue_rcv_skb+0xb5/0x400 net/ipv4/udp.c:2074
       udp_unicast_rcv_skb.isra.0+0x7e/0x1c0 net/ipv4/udp.c:2233
       __udp4_lib_rcv+0xa44/0x17c0 net/ipv4/udp.c:2300
       udp_rcv+0x2b/0x40 net/ipv4/udp.c:2470
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       process_backlog+0x1d3/0x420 net/core/dev.c:5955
      
      read to 0xffff888122474b50 of 8 bytes by task 8921 on cpu 1:
       skb_queue_empty include/linux/skbuff.h:1494 [inline]
       __skb_recv_udp+0x18d/0x500 net/ipv4/udp.c:1653
       udp_recvmsg+0xe1/0xb10 net/ipv4/udp.c:1712
       inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
       sock_recvmsg_nosec+0x5c/0x70 net/socket.c:871
       ___sys_recvmsg+0x1a0/0x3e0 net/socket.c:2480
       do_recvmmsg+0x19a/0x5c0 net/socket.c:2601
       __sys_recvmmsg+0x1ef/0x200 net/socket.c:2680
       __do_sys_recvmmsg net/socket.c:2703 [inline]
       __se_sys_recvmmsg net/socket.c:2696 [inline]
       __x64_sys_recvmmsg+0x89/0xb0 net/socket.c:2696
       do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 8921 Comm: syz-executor.4 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      137a0dbe
    • E
      net: add skb_queue_empty_lockless() · d7d16a89
      Eric Dumazet 提交于
      Some paths call skb_queue_empty() without holding
      the queue lock. We must use a barrier in order
      to not let the compiler do strange things, and avoid
      KCSAN splats.
      
      Adding a barrier in skb_queue_empty() might be overkill,
      I prefer adding a new helper to clearly identify
      points where the callers might be lockless. This might
      help us finding real bugs.
      
      The corresponding WRITE_ONCE() should add zero cost
      for current compilers.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d7d16a89
  2. 28 10月, 2019 1 次提交
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf · fc11078d
      David S. Miller 提交于
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter/IPVS fixes for net
      
      The following patchset contains Netfilter/IPVS fixes for net:
      
      1) Fix crash on flowtable due to race between garbage collection
         and insertion.
      
      2) Restore callback unbinding in netfilter offloads.
      
      3) Fix races on IPVS module removal, from Davide Caratti.
      
      4) Make old_secure_tcp per-netns to fix sysbot report,
         from Eric Dumazet.
      
      5) Validate matching length in netfilter offloads, from wenxu.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fc11078d
  3. 27 10月, 2019 5 次提交
    • D
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 1a51a474
      David S. Miller 提交于
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2019-10-27
      
      The following pull-request contains BPF updates for your *net* tree.
      
      We've added 7 non-merge commits during the last 11 day(s) which contain
      a total of 7 files changed, 66 insertions(+), 16 deletions(-).
      
      The main changes are:
      
      1) Fix two use-after-free bugs in relation to RCU in jited symbol exposure to
         kallsyms, from Daniel Borkmann.
      
      2) Fix NULL pointer dereference in AF_XDP rx-only sockets, from Magnus Karlsson.
      
      3) Fix hang in netdev unregister for hash based devmap as well as another overflow
         bug on 32 bit archs in memlock cost calculation, from Toke Høiland-Jørgensen.
      
      4) Fix wrong memory access in LWT BPF programs on reroute due to invalid dst.
         Also fix BPF selftests to use more compatible nc options, from Jiri Benc.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1a51a474
    • D
      Merge branch 'ipv4-fix-route-update-on-metric-change' · 45f33806
      David S. Miller 提交于
      Paolo Abeni says:
      
      ====================
      ipv4: fix route update on metric change.
      
      This fixes connected route update on some edge cases for ip addr metric
      change.
      It additionally includes self tests for the covered scenarios. The new tests
      fail on unpatched kernels and pass on the patched one.
      
      v1 -> v2:
       - add selftests
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      45f33806
    • P
      selftests: fib_tests: add more tests for metric update · 37de3b35
      Paolo Abeni 提交于
      This patch adds two more tests to ipv4_addr_metric_test() to
      explicitly cover the scenarios fixed by the previous patch.
      Suggested-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      37de3b35
    • P
      ipv4: fix route update on metric change. · 0b834ba0
      Paolo Abeni 提交于
      Since commit af4d768a ("net/ipv4: Add support for specifying metric
      of connected routes"), when updating an IP address with a different metric,
      the associated connected route is updated, too.
      
      Still, the mentioned commit doesn't handle properly some corner cases:
      
      $ ip addr add dev eth0 192.168.1.0/24
      $ ip addr add dev eth0 192.168.2.1/32 peer 192.168.2.2
      $ ip addr add dev eth0 192.168.3.1/24
      $ ip addr change dev eth0 192.168.1.0/24 metric 10
      $ ip addr change dev eth0 192.168.2.1/32 peer 192.168.2.2 metric 10
      $ ip addr change dev eth0 192.168.3.1/24 metric 10
      $ ip -4 route
      192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.0
      192.168.2.2 dev eth0 proto kernel scope link src 192.168.2.1
      192.168.3.0/24 dev eth0 proto kernel scope link src 192.168.2.1 metric 10
      
      Only the last route is correctly updated.
      
      The problem is the current test in fib_modify_prefix_metric():
      
      	if (!(dev->flags & IFF_UP) ||
      	    ifa->ifa_flags & (IFA_F_SECONDARY | IFA_F_NOPREFIXROUTE) ||
      	    ipv4_is_zeronet(prefix) ||
      	    prefix == ifa->ifa_local || ifa->ifa_prefixlen == 32)
      
      Which should be the logical 'not' of the pre-existing test in
      fib_add_ifaddr():
      
      	if (!ipv4_is_zeronet(prefix) && !(ifa->ifa_flags & IFA_F_SECONDARY) &&
      	    (prefix != addr || ifa->ifa_prefixlen < 32))
      
      To properly negate the original expression, we need to change the last
      logical 'or' to a logical 'and'.
      
      Fixes: af4d768a ("net/ipv4: Add support for specifying metric of connected routes")
      Reported-and-suggested-by: NBeniamino Galvani <bgalvani@redhat.com>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0b834ba0
    • Z
      net: Zeroing the structure ethtool_wolinfo in ethtool_get_wol() · 5ff223e8
      zhanglin 提交于
      memset() the structure ethtool_wolinfo that has padded bytes
      but the padded bytes have not been zeroed out.
      Signed-off-by: Nzhanglin <zhang.lin16@zte.com.cn>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5ff223e8
  4. 26 10月, 2019 10 次提交
    • P
      Merge tag 'ipvs-fixes-for-v5.4' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs · 52b33b4f
      Pablo Neira Ayuso 提交于
      Simon Horman says:
      
      ====================
      IPVS fixes for v5.4
      
      * Eric Dumazet resolves a race condition in switching the defense level
      * Davide Caratti resolves a race condition in module removal
      ====================
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      52b33b4f
    • R
      cxgb4: request the TX CIDX updates to status page · 7c3bebc3
      Raju Rangoju 提交于
      For adapters which support the SGE Doorbell Queue Timer facility,
      we configured the Ethernet TX Queues to send CIDX Updates to the
      Associated Ethernet RX Response Queue with CPL_SGE_EGR_UPDATE
      messages to allow us to respond more quickly to the CIDX Updates.
      But, this was adding load to PCIe Link RX bandwidth and,
      potentially, resulting in higher CPU Interrupt load.
      
      This patch requests the HW to deliver the CIDX updates to the TX
      queue status page rather than generating an ingress queue message
      (as an interrupt). With this patch, the load on RX bandwidth is
      reduced and a substantial improvement in BW is noticed at lower
      IO sizes.
      
      Fixes: d429005f ("cxgb4/cxgb4vf: Add support for SGE doorbell queue timer")
      Signed-off-by: NRaju Rangoju <rajur@chelsio.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7c3bebc3
    • G
      netns: fix GFP flags in rtnl_net_notifyid() · d4e4fdf9
      Guillaume Nault 提交于
      In rtnl_net_notifyid(), we certainly can't pass a null GFP flag to
      rtnl_notify(). A GFP_KERNEL flag would be fine in most circumstances,
      but there are a few paths calling rtnl_net_notifyid() from atomic
      context or from RCU critical sections. The later also precludes the use
      of gfp_any() as it wouldn't detect the RCU case. Also, the nlmsg_new()
      call is wrong too, as it uses GFP_KERNEL unconditionally.
      
      Therefore, we need to pass the GFP flags as parameter and propagate it
      through function calls until the proper flags can be determined.
      
      In most cases, GFP_KERNEL is fine. The exceptions are:
        * openvswitch: ovs_vport_cmd_get() and ovs_vport_cmd_dump()
          indirectly call rtnl_net_notifyid() from RCU critical section,
      
        * rtnetlink: rtmsg_ifinfo_build_skb() already receives GFP flags as
          parameter.
      
      Also, in ovs_vport_cmd_build_info(), let's change the GFP flags used
      by nlmsg_new(). The function is allowed to sleep, so better make the
      flags consistent with the ones used in the following
      ovs_vport_cmd_fill_info() call.
      
      Found by code inspection.
      
      Fixes: 9a963454 ("netns: notify netns id events")
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: NPravin B Shelar <pshelar@ovn.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d4e4fdf9
    • N
      net: ethernet: Use the correct style for SPDX License Identifier · 16d65287
      Nishad Kamdar 提交于
      This patch corrects the SPDX License Identifier style in
      header file related to ethernet driver for Cortina Gemini
      devices. For C header files Documentation/process/license-rules.rst
      mandates C-like comments (opposed to C source files where
      C++ style should be used)
      
      Changes made by using a script provided by Joe Perches here:
      https://lkml.org/lkml/2019/2/7/46.
      Suggested-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NNishad Kamdar <nishadkamdar@gmail.com>
      Acked-by: NLinus Walleij <linus.walleij@linaro.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      16d65287
    • D
      Merge branch 'smc-fixes' · 31af5057
      David S. Miller 提交于
      Karsten Graul says:
      
      ====================
      net/smc: fixes for -net
      
      Fixes for the net tree, covering a memleak when closing
      SMC fallback sockets and fix SMC-R connection establishment
      when vlan-ids are used.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      31af5057
    • U
      net/smc: keep vlan_id for SMC-R in smc_listen_work() · ca5f8d2d
      Ursula Braun 提交于
      Creating of an SMC-R connection with vlan-id fails, because
      smc_listen_work() determines the vlan_id of the connection,
      saves it in struct smc_init_info ini, but clears the ini area
      again if SMC-D is not applicable.
      This patch just resets the ISM device before investigating
      SMC-R availability.
      
      Fixes: bc36d2fc ("net/smc: consolidate function parameters")
      Signed-off-by: NUrsula Braun <ubraun@linux.ibm.com>
      Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ca5f8d2d
    • U
      net/smc: fix closing of fallback SMC sockets · f536dffc
      Ursula Braun 提交于
      For SMC sockets forced to fallback to TCP, the file is propagated
      from the outer SMC to the internal TCP socket. When closing the SMC
      socket, the internal TCP socket file pointer must be restored to the
      original NULL value, otherwise memory leaks may show up (found with
      CONFIG_DEBUG_KMEMLEAK).
      
      The internal TCP socket is released in smc_clcsock_release(), which
      calls __sock_release() function in net/socket.c. This calls the
      needed iput(SOCK_INODE(sock)) only, if the file pointer has been reset
      to the original NULL-value.
      
      Fixes: 07603b23 ("net/smc: propagate file from SMC to TCP socket")
      Signed-off-by: NUrsula Braun <ubraun@linux.ibm.com>
      Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f536dffc
    • B
      net: hwbm: if CONFIG_NET_HWBM unset, make stub functions static · 91e2e576
      Ben Dooks (Codethink) 提交于
      If CONFIG_NET_HWBM is not set, then these stub functions in
      <net/hwbm.h> should be declared static to avoid trying to
      export them from any driver that includes this.
      
      Fixes the following sparse warnings:
      
      ./include/net/hwbm.h:24:6: warning: symbol 'hwbm_buf_free' was not declared. Should it be static?
      ./include/net/hwbm.h:25:5: warning: symbol 'hwbm_pool_refill' was not declared. Should it be static?
      ./include/net/hwbm.h:26:5: warning: symbol 'hwbm_pool_add' was not declared. Should it be static?
      Signed-off-by: NBen Dooks (Codethink) <ben.dooks@codethink.co.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      91e2e576
    • B
      net: mvneta: make stub functions static inline · 3f6b2c44
      Ben Dooks (Codethink) 提交于
      If the CONFIG_MVNET_BA is not set, then make the stub functions
      static inline to avoid trying to export them, and remove hte
      following sparse warnings:
      
      drivers/net/ethernet/marvell/mvneta_bm.h:163:6: warning: symbol 'mvneta_bm_pool_destroy' was not declared. Should it be static?
      drivers/net/ethernet/marvell/mvneta_bm.h:165:6: warning: symbol 'mvneta_bm_bufs_free' was not declared. Should it be static?
      drivers/net/ethernet/marvell/mvneta_bm.h:167:5: warning: symbol 'mvneta_bm_construct' was not declared. Should it be static?
      drivers/net/ethernet/marvell/mvneta_bm.h:168:5: warning: symbol 'mvneta_bm_pool_refill' was not declared. Should it be static?
      drivers/net/ethernet/marvell/mvneta_bm.h:170:23: warning: symbol 'mvneta_bm_pool_use' was not declared. Should it be static?
      drivers/net/ethernet/marvell/mvneta_bm.h:181:18: warning: symbol 'mvneta_bm_get' was not declared. Should it be static?
      drivers/net/ethernet/marvell/mvneta_bm.h:182:6: warning: symbol 'mvneta_bm_put' was not declared. Should it be static?
      Signed-off-by: NBen Dooks (Codethink) <ben.dooks@codethink.co.uk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3f6b2c44
    • V
      net: sch_generic: Use pfifo_fast as fallback scheduler for CAN hardware · fa784f2a
      Vincent Prince 提交于
      There is networking hardware that isn't based on Ethernet for layers 1 and 2.
      
      For example CAN.
      
      CAN is a multi-master serial bus standard for connecting Electronic Control
      Units [ECUs] also known as nodes. A frame on the CAN bus carries up to 8 bytes
      of payload. Frame corruption is detected by a CRC. However frame loss due to
      corruption is possible, but a quite unusual phenomenon.
      
      While fq_codel works great for TCP/IP, it doesn't for CAN. There are a lot of
      legacy protocols on top of CAN, which are not build with flow control or high
      CAN frame drop rates in mind.
      
      When using fq_codel, as soon as the queue reaches a certain delay based length,
      skbs from the head of the queue are silently dropped. Silently meaning that the
      user space using a send() or similar syscall doesn't get an error. However
      TCP's flow control algorithm will detect dropped packages and adjust the
      bandwidth accordingly.
      
      When using fq_codel and sending raw frames over CAN, which is the common use
      case, the user space thinks the package has been sent without problems, because
      send() returned without an error. pfifo_fast will drop skbs, if the queue
      length exceeds the maximum. But with this scheduler the skbs at the tail are
      dropped, an error (-ENOBUFS) is propagated to user space. So that the user
      space can slow down the package generation.
      
      On distributions, where fq_codel is made default via CONFIG_DEFAULT_NET_SCH
      during compile time, or set default during runtime with sysctl
      net.core.default_qdisc (see [1]), we get a bad user experience. In my test case
      with pfifo_fast, I can transfer thousands of million CAN frames without a frame
      drop. On the other hand with fq_codel there is more then one lost CAN frame per
      thousand frames.
      
      As pointed out fq_codel is not suited for CAN hardware, so this patch changes
      attach_one_default_qdisc() to use pfifo_fast for "ARPHRD_CAN" network devices.
      
      During transition of a netdev from down to up state the default queuing
      discipline is attached by attach_default_qdiscs() with the help of
      attach_one_default_qdisc(). This patch modifies attach_one_default_qdisc() to
      attach the pfifo_fast (pfifo_fast_ops) if the network device type is
      "ARPHRD_CAN".
      
      [1] https://github.com/systemd/systemd/issues/9194Signed-off-by: NVincent Prince <vincent.prince.fr@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fa784f2a
  5. 25 10月, 2019 12 次提交
    • D
      Merge branch 'net-fix-nested-device-bugs' · 65921376
      David S. Miller 提交于
      Taehee Yoo says:
      
      ====================
      net: fix nested device bugs
      
      This patchset fixes several bugs that are related to nesting
      device infrastructure.
      Current nesting infrastructure code doesn't limit the depth level of
      devices. nested devices could be handled recursively. at that moment,
      it needs huge memory and stack overflow could occur.
      Below devices type have same bug.
      VLAN, BONDING, TEAM, MACSEC, MACVLAN, IPVLAN, and VXLAN.
      But I couldn't test all interface types so there could be more device
      types, which have similar problems.
      Maybe qmi_wwan.c code could have same problem.
      So, I would appreciate if someone test qmi_wwan.c and other modules.
      
      Test commands:
          ip link add dummy0 type dummy
          ip link add vlan1 link dummy0 type vlan id 1
      
          for i in {2..100}
          do
      	    let A=$i-1
      	    ip link add name vlan$i link vlan$A type vlan id $i
          done
          ip link del dummy0
      
      1st patch actually fixes the root cause.
      It adds new common variables {upper/lower}_level that represent
      depth level. upper_level variable is depth of upper devices.
      lower_level variable is depth of lower devices.
      
            [U][L]       [U][L]
      vlan1  1  5  vlan4  1  4
      vlan2  2  4  vlan5  2  3
      vlan3  3  3    |
        |            |
        +------------+
        |
      vlan6  4  2
      dummy0 5  1
      
      After this patch, the nesting infrastructure code uses this variable to
      check the depth level.
      
      2nd patch fixes Qdisc lockdep related problem.
      Before this patch, devices use static lockdep map.
      So, if devices that are same types are nested, lockdep will warn about
      recursive situation.
      These patches make these devices use dynamic lockdep key instead of
      static lock or subclass.
      
      3rd patch fixes unexpected IFF_BONDING bit unset.
      When nested bonding interface scenario, bonding interface could lost it's
      IFF_BONDING flag. This should not happen.
      This patch adds a condition before unsetting IFF_BONDING.
      
      4th patch fixes nested locking problem in bonding interface
      Bonding interface has own lock and this uses static lock.
      Bonding interface could be nested and it uses same lockdep key.
      So that unexisting lockdep warning occurs.
      
      5th patch fixes nested locking problem in team interface
      Team interface has own lock and this uses static lock.
      Team interface could be nested and it uses same lockdep key.
      So that unexisting lockdep warning occurs.
      
      6th patch fixes a refcnt leak in the macsec module.
      When the macsec module is unloaded, refcnt leaks occur.
      But actually, that holding refcnt is unnecessary.
      So this patch just removes these code.
      
      7th patch adds ignore flag to an adjacent structure.
      In order to exchange an adjacent node safely, ignore flag is needed.
      
      8th patch makes vxlan add an adjacent link to limit depth level.
      Vxlan interface could set it's lower interface and these lower interfaces
      are handled recursively.
      So, if the depth of lower interfaces is too deep, stack overflow could
      happen.
      
      9th patch removes unnecessary variables and callback.
      After 1st patch, subclass callback and variables are unnecessary.
      This patch just removes these variables and callback.
      
      10th patch fix refcnt leaks in the virt_wifi module
      Like every nested interface, the upper interface should be deleted
      before the lower interface is deleted.
      In order to fix this, the notifier routine is added in this patch.
      
      v4 -> v5 :
       - Update log messages
       - Move variables position, 1st patch
       - Fix iterator routine, 1st patch
       - Add generic lockdep key code, which replaces 2, 4, 5, 6, 7 patches.
       - Log message update, 10th patch
       - Fix wrong error value in error path of __init routine, 10th patch
       - hold module refcnt when interface is created, 10th patch
      v3 -> v4 :
       - Add new 12th patch to fix refcnt leaks in the virt_wifi module
       - Fix wrong usage netdev_upper_dev_link() in the vxlan.c
       - Preserve reverse christmas tree variable ordering in the vxlan.c
       - Add missing static keyword in the dev.c
       - Expose netdev_adjacent_change_{prepare/commit/abort} instead of
         netdev_adjacent_dev_{enable/disable}
      v2 -> v3 :
       - Modify nesting infrastructure code to use iterator instead of recursive.
      v1 -> v2 :
       - Make the 3rd patch do not add a new priv_flag.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      65921376
    • T
      virt_wifi: fix refcnt leak in module exit routine · 1962f86b
      Taehee Yoo 提交于
      virt_wifi_newlink() calls netdev_upper_dev_link() and it internally
      holds reference count of lower interface.
      
      Current code does not release a reference count of the lower interface
      when the lower interface is being deleted.
      So, reference count leaks occur.
      
      Test commands:
          ip link add dummy0 type dummy
          ip link add vw1 link dummy0 type virt_wifi
          ip link del dummy0
      
      Splat looks like:
      [  133.787526][  T788] WARNING: CPU: 1 PID: 788 at net/core/dev.c:8274 rollback_registered_many+0x835/0xc80
      [  133.788355][  T788] Modules linked in: virt_wifi cfg80211 dummy team af_packet sch_fq_codel ip_tables x_tables unix
      [  133.789377][  T788] CPU: 1 PID: 788 Comm: ip Not tainted 5.4.0-rc3+ #96
      [  133.790069][  T788] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [  133.791167][  T788] RIP: 0010:rollback_registered_many+0x835/0xc80
      [  133.791906][  T788] Code: 00 4d 85 ff 0f 84 b5 fd ff ff ba c0 0c 00 00 48 89 de 4c 89 ff e8 9b 58 04 00 48 89 df e8 30
      [  133.794317][  T788] RSP: 0018:ffff88805ba3f338 EFLAGS: 00010202
      [  133.795080][  T788] RAX: ffff88805e57e801 RBX: ffff88805ba34000 RCX: ffffffffa9294723
      [  133.796045][  T788] RDX: 1ffff1100b746816 RSI: 0000000000000008 RDI: ffffffffabcc4240
      [  133.797006][  T788] RBP: ffff88805ba3f4c0 R08: fffffbfff5798849 R09: fffffbfff5798849
      [  133.797993][  T788] R10: 0000000000000001 R11: fffffbfff5798848 R12: dffffc0000000000
      [  133.802514][  T788] R13: ffff88805ba3f440 R14: ffff88805ba3f400 R15: ffff88805ed622c0
      [  133.803237][  T788] FS:  00007f2e9608c0c0(0000) GS:ffff88806cc00000(0000) knlGS:0000000000000000
      [  133.804002][  T788] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  133.804664][  T788] CR2: 00007f2e95610603 CR3: 000000005f68c004 CR4: 00000000000606e0
      [  133.805363][  T788] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  133.806073][  T788] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  133.806787][  T788] Call Trace:
      [  133.807069][  T788]  ? generic_xdp_install+0x310/0x310
      [  133.807612][  T788]  ? lock_acquire+0x164/0x3b0
      [  133.808077][  T788]  ? is_bpf_text_address+0x5/0xf0
      [  133.808640][  T788]  ? deref_stack_reg+0x9c/0xd0
      [  133.809138][  T788]  ? __nla_validate_parse+0x98/0x1ab0
      [  133.809944][  T788]  unregister_netdevice_many.part.122+0x13/0x1b0
      [  133.810599][  T788]  rtnl_delete_link+0xbc/0x100
      [  133.811073][  T788]  ? rtnl_af_register+0xc0/0xc0
      [  133.811672][  T788]  rtnl_dellink+0x30e/0x8a0
      [  133.812205][  T788]  ? is_bpf_text_address+0x5/0xf0
      [ ... ]
      
      [  144.110530][  T788] unregister_netdevice: waiting for dummy0 to become free. Usage count = 1
      
      This patch adds notifier routine to delete upper interface before deleting
      lower interface.
      
      Fixes: c7cdba31 ("mac80211-next: rtnetlink wifi simulation device")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1962f86b
    • T
      net: remove unnecessary variables and callback · f3b0a18b
      Taehee Yoo 提交于
      This patch removes variables and callback these are related to the nested
      device structure.
      devices that can be nested have their own nest_level variable that
      represents the depth of nested devices.
      In the previous patch, new {lower/upper}_level variables are added and
      they replace old private nest_level variable.
      So, this patch removes all 'nest_level' variables.
      
      In order to avoid lockdep warning, ->ndo_get_lock_subclass() was added
      to get lockdep subclass value, which is actually lower nested depth value.
      But now, they use the dynamic lockdep key to avoid lockdep warning instead
      of the subclass.
      So, this patch removes ->ndo_get_lock_subclass() callback.
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f3b0a18b
    • T
      vxlan: add adjacent link to limit depth level · 0ce1822c
      Taehee Yoo 提交于
      Current vxlan code doesn't limit the number of nested devices.
      Nested devices would be handled recursively and this routine needs
      huge stack memory. So, unlimited nested devices could make
      stack overflow.
      
      In order to fix this issue, this patch adds adjacent links.
      The adjacent link APIs internally check the depth level.
      
      Test commands:
          ip link add dummy0 type dummy
          ip link add vxlan0 type vxlan id 0 group 239.1.1.1 dev dummy0 \
      	    dstport 4789
          for i in {1..100}
          do
      	    let A=$i-1
      	    ip link add vxlan$i type vxlan id $i group 239.1.1.1 \
      		    dev vxlan$A dstport 4789
          done
          ip link del dummy0
      
      The top upper link is vxlan100 and the lowest link is vxlan0.
      When vxlan0 is deleting, the upper devices will be deleted recursively.
      It needs huge stack memory so it makes stack overflow.
      
      Splat looks like:
      [  229.628477] =============================================================================
      [  229.629785] BUG page->ptl (Not tainted): Padding overwritten. 0x0000000026abf214-0x0000000091f6abb2
      [  229.629785] -----------------------------------------------------------------------------
      [  229.629785]
      [  229.655439] ==================================================================
      [  229.629785] INFO: Slab 0x00000000ff7cfda8 objects=19 used=19 fp=0x00000000fe33776c flags=0x200000000010200
      [  229.655688] BUG: KASAN: stack-out-of-bounds in unmap_single_vma+0x25a/0x2e0
      [  229.655688] Read of size 8 at addr ffff888113076928 by task vlan-network-in/2334
      [  229.655688]
      [  229.629785] Padding 0000000026abf214: 00 80 14 0d 81 88 ff ff 68 91 81 14 81 88 ff ff  ........h.......
      [  229.629785] Padding 0000000001e24790: 38 91 81 14 81 88 ff ff 68 91 81 14 81 88 ff ff  8.......h.......
      [  229.629785] Padding 00000000b39397c8: 33 30 62 a7 ff ff ff ff ff eb 60 22 10 f1 ff 1f  30b.......`"....
      [  229.629785] Padding 00000000bc98f53a: 80 60 07 13 81 88 ff ff 00 80 14 0d 81 88 ff ff  .`..............
      [  229.629785] Padding 000000002aa8123d: 68 91 81 14 81 88 ff ff f7 21 17 a7 ff ff ff ff  h........!......
      [  229.629785] Padding 000000001c8c2369: 08 81 14 0d 81 88 ff ff 03 02 00 00 00 00 00 00  ................
      [  229.629785] Padding 000000004e290c5d: 21 90 a2 21 10 ed ff ff 00 00 00 00 00 fc ff df  !..!............
      [  229.629785] Padding 000000000e25d731: 18 60 07 13 81 88 ff ff c0 8b 13 05 81 88 ff ff  .`..............
      [  229.629785] Padding 000000007adc7ab3: b3 8a b5 41 00 00 00 00                          ...A....
      [  229.629785] FIX page->ptl: Restoring 0x0000000026abf214-0x0000000091f6abb2=0x5a
      [  ... ]
      
      Fixes: acaf4e70 ("net: vxlan: when lower dev unregisters remove vxlan dev as well")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0ce1822c
    • T
      net: core: add ignore flag to netdev_adjacent structure · 32b6d34f
      Taehee Yoo 提交于
      In order to link an adjacent node, netdev_upper_dev_link() is used
      and in order to unlink an adjacent node, netdev_upper_dev_unlink() is used.
      unlink operation does not fail, but link operation can fail.
      
      In order to exchange adjacent nodes, we should unlink an old adjacent
      node first. then, link a new adjacent node.
      If link operation is failed, we should link an old adjacent node again.
      But this link operation can fail too.
      It eventually breaks the adjacent link relationship.
      
      This patch adds an ignore flag into the netdev_adjacent structure.
      If this flag is set, netdev_upper_dev_link() ignores an old adjacent
      node for a moment.
      
      This patch also adds new functions for other modules.
      netdev_adjacent_change_prepare()
      netdev_adjacent_change_commit()
      netdev_adjacent_change_abort()
      
      netdev_adjacent_change_prepare() inserts new device into adjacent list
      but new device is not allowed to use immediately.
      If netdev_adjacent_change_prepare() fails, it internally rollbacks
      adjacent list so that we don't need any other action.
      netdev_adjacent_change_commit() deletes old device in the adjacent list
      and allows new device to use.
      netdev_adjacent_change_abort() rollbacks adjacent list.
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      32b6d34f
    • T
      macsec: fix refcnt leak in module exit routine · 2bce1ebe
      Taehee Yoo 提交于
      When a macsec interface is created, it increases a refcnt to a lower
      device(real device). when macsec interface is deleted, the refcnt is
      decreased in macsec_free_netdev(), which is ->priv_destructor() of
      macsec interface.
      
      The problem scenario is this.
      When nested macsec interfaces are exiting, the exit routine of the
      macsec module makes refcnt leaks.
      
      Test commands:
          ip link add dummy0 type dummy
          ip link add macsec0 link dummy0 type macsec
          ip link add macsec1 link macsec0 type macsec
          modprobe -rv macsec
      
      [  208.629433] unregister_netdevice: waiting for macsec0 to become free. Usage count = 1
      
      Steps of exit routine of macsec module are below.
      1. Calls ->dellink() in __rtnl_link_unregister().
      2. Checks refcnt and wait refcnt to be 0 if refcnt is not 0 in
      netdev_run_todo().
      3. Calls ->priv_destruvtor() in netdev_run_todo().
      
      Step2 checks refcnt, but step3 decreases refcnt.
      So, step2 waits forever.
      
      This patch makes the macsec module do not hold a refcnt of the lower
      device because it already holds a refcnt of the lower device with
      netdev_upper_dev_link().
      
      Fixes: c09440f7 ("macsec: introduce IEEE 802.1AE driver")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2bce1ebe
    • T
      team: fix nested locking lockdep warning · 369f61be
      Taehee Yoo 提交于
      team interface could be nested and it's lock variable could be nested too.
      But this lock uses static lockdep key and there is no nested locking
      handling code such as mutex_lock_nested() and so on.
      so the Lockdep would warn about the circular locking scenario that
      couldn't happen.
      In order to fix, this patch makes the team module to use dynamic lock key
      instead of static key.
      
      Test commands:
          ip link add team0 type team
          ip link add team1 type team
          ip link set team0 master team1
          ip link set team0 nomaster
          ip link set team1 master team0
          ip link set team1 nomaster
      
      Splat that looks like:
      [   40.364352] WARNING: possible recursive locking detected
      [   40.364964] 5.4.0-rc3+ #96 Not tainted
      [   40.365405] --------------------------------------------
      [   40.365973] ip/750 is trying to acquire lock:
      [   40.366542] ffff888060b34c40 (&team->lock){+.+.}, at: team_set_mac_address+0x151/0x290 [team]
      [   40.367689]
      	       but task is already holding lock:
      [   40.368729] ffff888051201c40 (&team->lock){+.+.}, at: team_del_slave+0x29/0x60 [team]
      [   40.370280]
      	       other info that might help us debug this:
      [   40.371159]  Possible unsafe locking scenario:
      
      [   40.371942]        CPU0
      [   40.372338]        ----
      [   40.372673]   lock(&team->lock);
      [   40.373115]   lock(&team->lock);
      [   40.373549]
      	       *** DEADLOCK ***
      
      [   40.374432]  May be due to missing lock nesting notation
      
      [   40.375338] 2 locks held by ip/750:
      [   40.375851]  #0: ffffffffabcc42b0 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x466/0x8a0
      [   40.376927]  #1: ffff888051201c40 (&team->lock){+.+.}, at: team_del_slave+0x29/0x60 [team]
      [   40.377989]
      	       stack backtrace:
      [   40.378650] CPU: 0 PID: 750 Comm: ip Not tainted 5.4.0-rc3+ #96
      [   40.379368] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [   40.380574] Call Trace:
      [   40.381208]  dump_stack+0x7c/0xbb
      [   40.381959]  __lock_acquire+0x269d/0x3de0
      [   40.382817]  ? register_lock_class+0x14d0/0x14d0
      [   40.383784]  ? check_chain_key+0x236/0x5d0
      [   40.384518]  lock_acquire+0x164/0x3b0
      [   40.385074]  ? team_set_mac_address+0x151/0x290 [team]
      [   40.385805]  __mutex_lock+0x14d/0x14c0
      [   40.386371]  ? team_set_mac_address+0x151/0x290 [team]
      [   40.387038]  ? team_set_mac_address+0x151/0x290 [team]
      [   40.387632]  ? mutex_lock_io_nested+0x1380/0x1380
      [   40.388245]  ? team_del_slave+0x60/0x60 [team]
      [   40.388752]  ? rcu_read_lock_sched_held+0x90/0xc0
      [   40.389304]  ? rcu_read_lock_bh_held+0xa0/0xa0
      [   40.389819]  ? lock_acquire+0x164/0x3b0
      [   40.390285]  ? lockdep_rtnl_is_held+0x16/0x20
      [   40.390797]  ? team_port_get_rtnl+0x90/0xe0 [team]
      [   40.391353]  ? __module_text_address+0x13/0x140
      [   40.391886]  ? team_set_mac_address+0x151/0x290 [team]
      [   40.392547]  team_set_mac_address+0x151/0x290 [team]
      [   40.393111]  dev_set_mac_address+0x1f0/0x3f0
      [ ... ]
      
      Fixes: 3d249d4c ("net: introduce ethernet teaming device")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      369f61be
    • T
      bonding: use dynamic lockdep key instead of subclass · 089bca2c
      Taehee Yoo 提交于
      All bonding device has same lockdep key and subclass is initialized with
      nest_level.
      But actual nest_level value can be changed when a lower device is attached.
      And at this moment, the subclass should be updated but it seems to be
      unsafe.
      So this patch makes bonding use dynamic lockdep key instead of the
      subclass.
      
      Test commands:
          ip link add bond0 type bond
      
          for i in {1..5}
          do
      	    let A=$i-1
      	    ip link add bond$i type bond
      	    ip link set bond$i master bond$A
          done
          ip link set bond5 master bond0
      
      Splat looks like:
      [  307.992912] WARNING: possible recursive locking detected
      [  307.993656] 5.4.0-rc3+ #96 Tainted: G        W
      [  307.994367] --------------------------------------------
      [  307.995092] ip/761 is trying to acquire lock:
      [  307.995710] ffff8880513aac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
      [  307.997045]
      	       but task is already holding lock:
      [  307.997923] ffff88805fcbac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
      [  307.999215]
      	       other info that might help us debug this:
      [  308.000251]  Possible unsafe locking scenario:
      
      [  308.001137]        CPU0
      [  308.001533]        ----
      [  308.001915]   lock(&(&bond->stats_lock)->rlock#2/2);
      [  308.002609]   lock(&(&bond->stats_lock)->rlock#2/2);
      [  308.003302]
      		*** DEADLOCK ***
      
      [  308.004310]  May be due to missing lock nesting notation
      
      [  308.005319] 3 locks held by ip/761:
      [  308.005830]  #0: ffffffff9fcc42b0 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x466/0x8a0
      [  308.006894]  #1: ffff88805fcbac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
      [  308.008243]  #2: ffffffff9f9219c0 (rcu_read_lock){....}, at: bond_get_stats+0x9f/0x500 [bonding]
      [  308.009422]
      	       stack backtrace:
      [  308.010124] CPU: 0 PID: 761 Comm: ip Tainted: G        W         5.4.0-rc3+ #96
      [  308.011097] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [  308.012179] Call Trace:
      [  308.012601]  dump_stack+0x7c/0xbb
      [  308.013089]  __lock_acquire+0x269d/0x3de0
      [  308.013669]  ? register_lock_class+0x14d0/0x14d0
      [  308.014318]  lock_acquire+0x164/0x3b0
      [  308.014858]  ? bond_get_stats+0xb8/0x500 [bonding]
      [  308.015520]  _raw_spin_lock_nested+0x2e/0x60
      [  308.016129]  ? bond_get_stats+0xb8/0x500 [bonding]
      [  308.017215]  bond_get_stats+0xb8/0x500 [bonding]
      [  308.018454]  ? bond_arp_rcv+0xf10/0xf10 [bonding]
      [  308.019710]  ? rcu_read_lock_held+0x90/0xa0
      [  308.020605]  ? rcu_read_lock_sched_held+0xc0/0xc0
      [  308.021286]  ? bond_get_stats+0x9f/0x500 [bonding]
      [  308.021953]  dev_get_stats+0x1ec/0x270
      [  308.022508]  bond_get_stats+0x1d1/0x500 [bonding]
      
      Fixes: d3fff6c4 ("net: add netdev_lockdep_set_classes() helper")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      089bca2c
    • T
      bonding: fix unexpected IFF_BONDING bit unset · 65de65d9
      Taehee Yoo 提交于
      The IFF_BONDING means bonding master or bonding slave device.
      ->ndo_add_slave() sets IFF_BONDING flag and ->ndo_del_slave() unsets
      IFF_BONDING flag.
      
      bond0<--bond1
      
      Both bond0 and bond1 are bonding device and these should keep having
      IFF_BONDING flag until they are removed.
      But bond1 would lose IFF_BONDING at ->ndo_del_slave() because that routine
      do not check whether the slave device is the bonding type or not.
      This patch adds the interface type check routine before removing
      IFF_BONDING flag.
      
      Test commands:
          ip link add bond0 type bond
          ip link add bond1 type bond
          ip link set bond1 master bond0
          ip link set bond1 nomaster
          ip link del bond1 type bond
          ip link add bond1 type bond
      
      Splat looks like:
      [  226.665555] proc_dir_entry 'bonding/bond1' already registered
      [  226.666440] WARNING: CPU: 0 PID: 737 at fs/proc/generic.c:361 proc_register+0x2a9/0x3e0
      [  226.667571] Modules linked in: bonding af_packet sch_fq_codel ip_tables x_tables unix
      [  226.668662] CPU: 0 PID: 737 Comm: ip Not tainted 5.4.0-rc3+ #96
      [  226.669508] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [  226.670652] RIP: 0010:proc_register+0x2a9/0x3e0
      [  226.671612] Code: 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 39 01 00 00 48 8b 04 24 48 89 ea 48 c7 c7 a0 0b 14 9f 48 8b b0 e
      0 00 00 00 e8 07 e7 88 ff <0f> 0b 48 c7 c7 40 2d a5 9f e8 59 d6 23 01 48 8b 4c 24 10 48 b8 00
      [  226.675007] RSP: 0018:ffff888050e17078 EFLAGS: 00010282
      [  226.675761] RAX: dffffc0000000008 RBX: ffff88805fdd0f10 RCX: ffffffff9dd344e2
      [  226.676757] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff88806c9f6b8c
      [  226.677751] RBP: ffff8880507160f3 R08: ffffed100d940019 R09: ffffed100d940019
      [  226.678761] R10: 0000000000000001 R11: ffffed100d940018 R12: ffff888050716008
      [  226.679757] R13: ffff8880507160f2 R14: dffffc0000000000 R15: ffffed100a0e2c1e
      [  226.680758] FS:  00007fdc217cc0c0(0000) GS:ffff88806c800000(0000) knlGS:0000000000000000
      [  226.681886] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  226.682719] CR2: 00007f49313424d0 CR3: 0000000050e46001 CR4: 00000000000606f0
      [  226.683727] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  226.684725] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  226.685681] Call Trace:
      [  226.687089]  proc_create_seq_private+0xb3/0xf0
      [  226.687778]  bond_create_proc_entry+0x1b3/0x3f0 [bonding]
      [  226.691458]  bond_netdev_event+0x433/0x970 [bonding]
      [  226.692139]  ? __module_text_address+0x13/0x140
      [  226.692779]  notifier_call_chain+0x90/0x160
      [  226.693401]  register_netdevice+0x9b3/0xd80
      [  226.694010]  ? alloc_netdev_mqs+0x854/0xc10
      [  226.694629]  ? netdev_change_features+0xa0/0xa0
      [  226.695278]  ? rtnl_create_link+0x2ed/0xad0
      [  226.695849]  bond_newlink+0x2a/0x60 [bonding]
      [  226.696422]  __rtnl_newlink+0xb9f/0x11b0
      [  226.696968]  ? rtnl_link_unregister+0x220/0x220
      [ ... ]
      
      Fixes: 0b680e75 ("[PATCH] bonding: Add priv_flag to avoid event mishandling")
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      65de65d9
    • T
      net: core: add generic lockdep keys · ab92d68f
      Taehee Yoo 提交于
      Some interface types could be nested.
      (VLAN, BONDING, TEAM, MACSEC, MACVLAN, IPVLAN, VIRT_WIFI, VXLAN, etc..)
      These interface types should set lockdep class because, without lockdep
      class key, lockdep always warn about unexisting circular locking.
      
      In the current code, these interfaces have their own lockdep class keys and
      these manage itself. So that there are so many duplicate code around the
      /driver/net and /net/.
      This patch adds new generic lockdep keys and some helper functions for it.
      
      This patch does below changes.
      a) Add lockdep class keys in struct net_device
         - qdisc_running, xmit, addr_list, qdisc_busylock
         - these keys are used as dynamic lockdep key.
      b) When net_device is being allocated, lockdep keys are registered.
         - alloc_netdev_mqs()
      c) When net_device is being free'd llockdep keys are unregistered.
         - free_netdev()
      d) Add generic lockdep key helper function
         - netdev_register_lockdep_key()
         - netdev_unregister_lockdep_key()
         - netdev_update_lockdep_key()
      e) Remove unnecessary generic lockdep macro and functions
      f) Remove unnecessary lockdep code of each interfaces.
      
      After this patch, each interface modules don't need to maintain
      their lockdep keys.
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ab92d68f
    • T
      net: core: limit nested device depth · 5343da4c
      Taehee Yoo 提交于
      Current code doesn't limit the number of nested devices.
      Nested devices would be handled recursively and this needs huge stack
      memory. So, unlimited nested devices could make stack overflow.
      
      This patch adds upper_level and lower_level, they are common variables
      and represent maximum lower/upper depth.
      When upper/lower device is attached or dettached,
      {lower/upper}_level are updated. and if maximum depth is bigger than 8,
      attach routine fails and returns -EMLINK.
      
      In addition, this patch converts recursive routine of
      netdev_walk_all_{lower/upper} to iterator routine.
      
      Test commands:
          ip link add dummy0 type dummy
          ip link add link dummy0 name vlan1 type vlan id 1
          ip link set vlan1 up
      
          for i in {2..55}
          do
      	    let A=$i-1
      
      	    ip link add vlan$i link vlan$A type vlan id $i
          done
          ip link del dummy0
      
      Splat looks like:
      [  155.513226][  T908] BUG: KASAN: use-after-free in __unwind_start+0x71/0x850
      [  155.514162][  T908] Write of size 88 at addr ffff8880608a6cc0 by task ip/908
      [  155.515048][  T908]
      [  155.515333][  T908] CPU: 0 PID: 908 Comm: ip Not tainted 5.4.0-rc3+ #96
      [  155.516147][  T908] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [  155.517233][  T908] Call Trace:
      [  155.517627][  T908]
      [  155.517918][  T908] Allocated by task 0:
      [  155.518412][  T908] (stack is not available)
      [  155.518955][  T908]
      [  155.519228][  T908] Freed by task 0:
      [  155.519885][  T908] (stack is not available)
      [  155.520452][  T908]
      [  155.520729][  T908] The buggy address belongs to the object at ffff8880608a6ac0
      [  155.520729][  T908]  which belongs to the cache names_cache of size 4096
      [  155.522387][  T908] The buggy address is located 512 bytes inside of
      [  155.522387][  T908]  4096-byte region [ffff8880608a6ac0, ffff8880608a7ac0)
      [  155.523920][  T908] The buggy address belongs to the page:
      [  155.524552][  T908] page:ffffea0001822800 refcount:1 mapcount:0 mapping:ffff88806c657cc0 index:0x0 compound_mapcount:0
      [  155.525836][  T908] flags: 0x100000000010200(slab|head)
      [  155.526445][  T908] raw: 0100000000010200 ffffea0001813808 ffffea0001a26c08 ffff88806c657cc0
      [  155.527424][  T908] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
      [  155.528429][  T908] page dumped because: kasan: bad access detected
      [  155.529158][  T908]
      [  155.529410][  T908] Memory state around the buggy address:
      [  155.530060][  T908]  ffff8880608a6b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  155.530971][  T908]  ffff8880608a6c00: fb fb fb fb fb f1 f1 f1 f1 00 f2 f2 f2 f3 f3 f3
      [  155.531889][  T908] >ffff8880608a6c80: f3 fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  155.532806][  T908]                                            ^
      [  155.533509][  T908]  ffff8880608a6d00: fb fb fb fb fb fb fb fb fb f1 f1 f1 f1 00 00 00
      [  155.534436][  T908]  ffff8880608a6d80: f2 f3 f3 f3 f3 fb fb fb 00 00 00 00 00 00 00 00
      [ ... ]
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5343da4c
    • T
      keys: Fix memory leak in copy_net_ns · 82ecff65
      Takeshi Misawa 提交于
      If copy_net_ns() failed after net_alloc(), net->key_domain is leaked.
      Fix this, by freeing key_domain in error path.
      
      syzbot report:
      BUG: memory leak
      unreferenced object 0xffff8881175007e0 (size 32):
        comm "syz-executor902", pid 7069, jiffies 4294944350 (age 28.400s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<00000000a83ed741>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
          [<00000000a83ed741>] slab_post_alloc_hook mm/slab.h:439 [inline]
          [<00000000a83ed741>] slab_alloc mm/slab.c:3326 [inline]
          [<00000000a83ed741>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
          [<0000000059fc92b9>] kmalloc include/linux/slab.h:547 [inline]
          [<0000000059fc92b9>] kzalloc include/linux/slab.h:742 [inline]
          [<0000000059fc92b9>] net_alloc net/core/net_namespace.c:398 [inline]
          [<0000000059fc92b9>] copy_net_ns+0xb2/0x220 net/core/net_namespace.c:445
          [<00000000a9d74bbc>] create_new_namespaces+0x141/0x2a0 kernel/nsproxy.c:103
          [<000000008047d645>] unshare_nsproxy_namespaces+0x7f/0x100 kernel/nsproxy.c:202
          [<000000005993ea6e>] ksys_unshare+0x236/0x490 kernel/fork.c:2674
          [<0000000019417e75>] __do_sys_unshare kernel/fork.c:2742 [inline]
          [<0000000019417e75>] __se_sys_unshare kernel/fork.c:2740 [inline]
          [<0000000019417e75>] __x64_sys_unshare+0x16/0x20 kernel/fork.c:2740
          [<00000000f4c5f2c8>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:296
          [<0000000038550184>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      syzbot also reported other leak in copy_net_ns -> setup_net.
      This problem is already fixed by cf47a0b8.
      
      Fixes: 9b242610 ("keys: Network namespace domain tag")
      Reported-and-tested-by: syzbot+3b3296d032353c33184b@syzkaller.appspotmail.com
      Signed-off-by: NTakeshi Misawa <jeliantsurux@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      82ecff65
  6. 24 10月, 2019 6 次提交
    • W
      netfilter: nft_payload: fix missing check for matching length in offloads · a69a85da
      wenxu 提交于
      Payload offload rule should also check the length of the match.
      Moreover, check for unsupported link-layer fields:
      
       nft --debug=netlink add rule firewall zones vlan id 100
       ...
       [ payload load 2b @ link header + 0 => reg 1 ]
      
      this loads 2byte base on ll header and offset 0.
      
      This also fixes unsupported raw payload match.
      
      Fixes: 92ad6325 ("netfilter: nf_tables: add hardware offload support")
      Signed-off-by: Nwenxu <wenxu@ucloud.cn>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      a69a85da
    • E
      ipvs: move old_secure_tcp into struct netns_ipvs · c24b75e0
      Eric Dumazet 提交于
      syzbot reported the following issue :
      
      BUG: KCSAN: data-race in update_defense_level / update_defense_level
      
      read to 0xffffffff861a6260 of 4 bytes by task 3006 on cpu 1:
       update_defense_level+0x621/0xb30 net/netfilter/ipvs/ip_vs_ctl.c:177
       defense_work_handler+0x3d/0xd0 net/netfilter/ipvs/ip_vs_ctl.c:225
       process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
       worker_thread+0xa0/0x800 kernel/workqueue.c:2415
       kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
      
      write to 0xffffffff861a6260 of 4 bytes by task 7333 on cpu 0:
       update_defense_level+0xa62/0xb30 net/netfilter/ipvs/ip_vs_ctl.c:205
       defense_work_handler+0x3d/0xd0 net/netfilter/ipvs/ip_vs_ctl.c:225
       process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
       worker_thread+0xa0/0x800 kernel/workqueue.c:2415
       kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 7333 Comm: kworker/0:5 Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: events defense_work_handler
      
      Indeed, old_secure_tcp is currently a static variable, while it
      needs to be a per netns variable.
      
      Fixes: a0840e2e ("IPVS: netns, ip_vs_ctl local vars moved to ipvs struct.")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      c24b75e0
    • D
      ipvs: don't ignore errors in case refcounting ip_vs module fails · 62931f59
      Davide Caratti 提交于
      if the IPVS module is removed while the sync daemon is starting, there is
      a small gap where try_module_get() might fail getting the refcount inside
      ip_vs_use_count_inc(). Then, the refcounts of IPVS module are unbalanced,
      and the subsequent call to stop_sync_thread() causes the following splat:
      
       WARNING: CPU: 0 PID: 4013 at kernel/module.c:1146 module_put.part.44+0x15b/0x290
        Modules linked in: ip_vs(-) nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 veth ip6table_filter ip6_tables iptable_filter binfmt_misc intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul ext4 mbcache jbd2 ghash_clmulni_intel snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_nhlt snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm aesni_intel crypto_simd cryptd glue_helper joydev pcspkr snd_timer virtio_balloon snd soundcore i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi virtio_net net_failover virtio_blk failover virtio_console qxl drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ata_piix ttm crc32c_intel serio_raw drm virtio_pci libata virtio_ring virtio floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_defrag_ipv6]
        CPU: 0 PID: 4013 Comm: modprobe Tainted: G        W         5.4.0-rc1.upstream+ #741
        Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
        RIP: 0010:module_put.part.44+0x15b/0x290
        Code: 04 25 28 00 00 00 0f 85 18 01 00 00 48 83 c4 68 5b 5d 41 5c 41 5d 41 5e 41 5f c3 89 44 24 28 83 e8 01 89 c5 0f 89 57 ff ff ff <0f> 0b e9 78 ff ff ff 65 8b 1d 67 83 26 4a 89 db be 08 00 00 00 48
        RSP: 0018:ffff888050607c78 EFLAGS: 00010297
        RAX: 0000000000000003 RBX: ffffffffc1420590 RCX: ffffffffb5db0ef9
        RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffffffffc1420590
        RBP: 00000000ffffffff R08: fffffbfff82840b3 R09: fffffbfff82840b3
        R10: 0000000000000001 R11: fffffbfff82840b2 R12: 1ffff1100a0c0f90
        R13: ffffffffc1420200 R14: ffff88804f533300 R15: ffff88804f533ca0
        FS:  00007f8ea9720740(0000) GS:ffff888053800000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f3245abe000 CR3: 000000004c28a006 CR4: 00000000001606f0
        Call Trace:
         stop_sync_thread+0x3a3/0x7c0 [ip_vs]
         ip_vs_sync_net_cleanup+0x13/0x50 [ip_vs]
         ops_exit_list.isra.5+0x94/0x140
         unregister_pernet_operations+0x29d/0x460
         unregister_pernet_device+0x26/0x60
         ip_vs_cleanup+0x11/0x38 [ip_vs]
         __x64_sys_delete_module+0x2d5/0x400
         do_syscall_64+0xa5/0x4e0
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x7f8ea8bf0db7
        Code: 73 01 c3 48 8b 0d b9 80 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 89 80 2c 00 f7 d8 64 89 01 48
        RSP: 002b:00007ffcd38d2fe8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000000002436240 RCX: 00007f8ea8bf0db7
        RDX: 0000000000000000 RSI: 0000000000000800 RDI: 00000000024362a8
        RBP: 0000000000000000 R08: 00007f8ea8eba060 R09: 00007f8ea8c658a0
        R10: 00007ffcd38d2a60 R11: 0000000000000206 R12: 0000000000000000
        R13: 0000000000000001 R14: 00000000024362a8 R15: 0000000000000000
        irq event stamp: 4538
        hardirqs last  enabled at (4537): [<ffffffffb6193dde>] quarantine_put+0x9e/0x170
        hardirqs last disabled at (4538): [<ffffffffb5a0556a>] trace_hardirqs_off_thunk+0x1a/0x20
        softirqs last  enabled at (4522): [<ffffffffb6f8ebe9>] sk_common_release+0x169/0x2d0
        softirqs last disabled at (4520): [<ffffffffb6f8eb3e>] sk_common_release+0xbe/0x2d0
      
      Check the return value of ip_vs_use_count_inc() and let its caller return
      proper error. Inside do_ip_vs_set_ctl() the module is already refcounted,
      we don't need refcount/derefcount there. Finally, in register_ip_vs_app()
      and start_sync_thread(), take the module refcount earlier and ensure it's
      released in the error path.
      
      Change since v1:
       - better return values in case of failure of ip_vs_use_count_inc(),
         thanks to Julian Anastasov
       - no need to increase/decrease the module refcount in ip_vs_set_ctl(),
         thanks to Julian Anastasov
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NSimon Horman <horms@verge.net.au>
      62931f59
    • M
      net: phy: smsc: LAN8740: add PHY_RST_AFTER_CLK_EN flag · 76db2d46
      Martin Fuzzey 提交于
      The LAN8740, like the 8720, also requires a reset after enabling clock.
      The datasheet [1] 3.8.5.1 says:
      	"During a Hardware reset, an external clock must be supplied
      	to the XTAL1/CLKIN signal."
      
      I have observed this issue on a custom i.MX6 based board with
      the LAN8740A.
      
      [1] http://ww1.microchip.com/downloads/en/DeviceDoc/8740a.pdfSigned-off-by: NMartin Fuzzey <martin.fuzzey@flowbird.group>
      Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      76db2d46
    • M
      xsk: Fix registration of Rx-only sockets · 2afd23f7
      Magnus Karlsson 提交于
      Having Rx-only AF_XDP sockets can potentially lead to a crash in the
      system by a NULL pointer dereference in xsk_umem_consume_tx(). This
      function iterates through a list of all sockets tied to a umem and
      checks if there are any packets to send on the Tx ring. Rx-only
      sockets do not have a Tx ring, so this will cause a NULL pointer
      dereference. This will happen if you have registered one or more
      Rx-only sockets to a umem and the driver is checking the Tx ring even
      on Rx, or if the XDP_SHARED_UMEM mode is used and there is a mix of
      Rx-only and other sockets tied to the same umem.
      
      Fixed by only putting sockets with a Tx component on the list that
      xsk_umem_consume_tx() iterates over.
      
      Fixes: ac98d8aa ("xsk: wire upp Tx zero-copy functions")
      Reported-by: NKal Cutter Conley <kal.conley@dectris.com>
      Signed-off-by: NMagnus Karlsson <magnus.karlsson@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NJonathan Lemon <jonathan.lemon@gmail.com>
      Link: https://lore.kernel.org/bpf/1571645818-16244-1-git-send-email-magnus.karlsson@intel.com
      2afd23f7
    • E
      net/flow_dissector: switch to siphash · 55667441
      Eric Dumazet 提交于
      UDP IPv6 packets auto flowlabels are using a 32bit secret
      (static u32 hashrnd in net/core/flow_dissector.c) and
      apply jhash() over fields known by the receivers.
      
      Attackers can easily infer the 32bit secret and use this information
      to identify a device and/or user, since this 32bit secret is only
      set at boot time.
      
      Really, using jhash() to generate cookies sent on the wire
      is a serious security concern.
      
      Trying to change the rol32(hash, 16) in ip6_make_flowlabel() would be
      a dead end. Trying to periodically change the secret (like in sch_sfq.c)
      could change paths taken in the network for long lived flows.
      
      Let's switch to siphash, as we did in commit df453700
      ("inet: switch IP ID generator to siphash")
      
      Using a cryptographically strong pseudo random function will solve this
      privacy issue and more generally remove other weak points in the stack.
      
      Packet schedulers using skb_get_hash_perturb() benefit from this change.
      
      Fixes: b5677416 ("ipv6: Enable auto flow labels by default")
      Fixes: 42240901 ("ipv6: Implement different admin modes for automatic flow labels")
      Fixes: 67800f9b ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel")
      Fixes: cb1ce2ef ("ipv6: Implement automatic flow label generation on transmit")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NJonathan Berger <jonathann1@walla.com>
      Reported-by: NAmit Klein <aksecurity@gmail.com>
      Reported-by: NBenny Pinkas <benny@pinkas.net>
      Cc: Tom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      55667441
  7. 23 10月, 2019 3 次提交
    • P
      netfilter: nf_tables_offload: restore basechain deletion · 085461c8
      Pablo Neira Ayuso 提交于
      Unbind callbacks on chain deletion.
      
      Fixes: 8fc618c5 ("netfilter: nf_tables_offload: refactor the nft_flow_offload_chain function")
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      085461c8
    • P
      netfilter: nf_flow_table: set timeout before insertion into hashes · daf61b02
      Pablo Neira Ayuso 提交于
      Other garbage collector might remove an entry not fully set up yet.
      
      [570953.958293] RIP: 0010:memcmp+0x9/0x50
      [...]
      [570953.958567]  flow_offload_hash_cmp+0x1e/0x30 [nf_flow_table]
      [570953.958585]  flow_offload_lookup+0x8c/0x110 [nf_flow_table]
      [570953.958606]  nf_flow_offload_ip_hook+0x135/0xb30 [nf_flow_table]
      [570953.958624]  nf_flow_offload_inet_hook+0x35/0x37 [nf_flow_table_inet]
      [570953.958646]  nf_hook_slow+0x3c/0xb0
      [570953.958664]  __netif_receive_skb_core+0x90f/0xb10
      [570953.958678]  ? ip_rcv_finish+0x82/0xa0
      [570953.958692]  __netif_receive_skb_one_core+0x3b/0x80
      [570953.958711]  __netif_receive_skb+0x18/0x60
      [570953.958727]  netif_receive_skb_internal+0x45/0xf0
      [570953.958741]  napi_gro_receive+0xcd/0xf0
      [570953.958764]  ixgbe_clean_rx_irq+0x432/0xe00 [ixgbe]
      [570953.958782]  ixgbe_poll+0x27b/0x700 [ixgbe]
      [570953.958796]  net_rx_action+0x284/0x3c0
      [570953.958817]  __do_softirq+0xcc/0x27c
      [570953.959464]  irq_exit+0xe8/0x100
      [570953.960097]  do_IRQ+0x59/0xe0
      [570953.960734]  common_interrupt+0xf/0xf
      
      Fixes: 43c8f131 ("netfilter: nf_flow_table: fix missing error check for rhashtable_insert_fast")
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      daf61b02
    • D
      bpf: Fix use after free in bpf_get_prog_name · 3b4d9eb2
      Daniel Borkmann 提交于
      There is one more problematic case I noticed while recently fixing BPF kallsyms
      handling in cd7455f1 ("bpf: Fix use after free in subprog's jited symbol
      removal") and that is bpf_get_prog_name().
      
      If BTF has been attached to the prog, then we may be able to fetch the function
      signature type id in kallsyms through prog->aux->func_info[prog->aux->func_idx].type_id.
      However, while the BTF object itself is torn down via RCU callback, the prog's
      aux->func_info is immediately freed via kvfree(prog->aux->func_info) once the
      prog's refcount either hit zero or when subprograms were already exposed via
      kallsyms and we hit the error path added in 5482e9a9 ("bpf: Fix memleak in
      aux->func_info and aux->btf").
      
      This violates RCU as well since kallsyms could be walked in parallel where we
      could access aux->func_info. Hence, defer kvfree() to after RCU grace period.
      Looking at ba64e7d8 ("bpf: btf: support proper non-jit func info") there
      is no reason/dependency where we couldn't defer the kvfree(aux->func_info) into
      the RCU callback.
      
      Fixes: 5482e9a9 ("bpf: Fix memleak in aux->func_info and aux->btf")
      Fixes: ba64e7d8 ("bpf: btf: support proper non-jit func info")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NYonghong Song <yhs@fb.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/875f2906a7c1a0691f2d567b4d8e4ea2739b1e88.1571779205.git.daniel@iogearbox.net
      3b4d9eb2