1. 13 Feb, 2018 1 commit
    • K
      net: Convert pernet_subsys, registered from inet_init() · f84c6821
      Kirill Tkhai committed
      arp_net_ops just adds/removes a /proc entry.
      
      devinet_ops allocates and frees duplicate of init_net tables
      and (un)registers sysctl entries.
      
      fib_net_ops allocates and frees pernet tables, creates/destroys
      netlink socket and (un)initializes /proc entries. Foreign
      pernet_operations do not touch them.
      
      ip_rt_proc_ops only modifies pernet /proc entries.
      
      xfrm_net_ops creates/destroys /proc entries, allocates/frees
      pernet statistics, hashes and tables, and (un)initializes
      sysctl files. These are not touched by foreign pernet_operations.
      
      xfrm4_net_ops allocates/frees private pernet memory, and
      configures sysctls.
      
      sysctl_route_ops creates/destroys sysctls.
      
      rt_genid_ops only initializes fields of just allocated net.
      
      ipv4_inetpeer_ops allocates/frees net private memory.
      
      igmp_net_ops just creates/destroys /proc files and a socket,
      which no one else is interested in.
      
      tcp_sk_ops seems to be safe, because tcp_sk_init() does not
      depend on any other pernet_operations modifications. Iteration
      over hash table in inet_twsk_purge() is made under RCU lock,
      and it's safe to iterate the table this way. Removing from
      the table happens from inet_twsk_deschedule_put(), but this
      function is safe without any external locks, as it's synchronized
      inside itself. There are many examples, it's used in different
      context. So, it's safe to leave tcp_sk_exit_batch() unlocked.
      
      tcp_net_metrics_ops is synchronized on tcp_metrics_lock and safe.
      
      udplite4_net_ops only creates/destroys pernet /proc file.
      
      icmp_sk_ops creates percpu sockets, not touched by foreign
      pernet_operations.
      
      ipmr_net_ops creates/destroys pernet fib tables, (un)registers
      fib rules and /proc files. This seems to be safe to execute
      in parallel with foreign pernet_operations.
      
      af_inet_ops just sets up default parameters of newly created net.
      
      ipv4_mib_ops creates and destroys pernet percpu statistics.
      
      raw_net_ops, tcp4_net_ops, udp4_net_ops, ping_v4_net_ops
      and ip_proc_ops only create/destroy pernet /proc files.
      
      ip4_frags_ops creates and destroys sysctl file.
      
      So, it's safe to make the pernet_operations async.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f84c6821
  2. 17 Jan, 2018 1 commit
    • A
      net: delete /proc THIS_MODULE references · 96890d62
      Alexey Dobriyan committed
      /proc has been ignoring struct file_operations::owner field for 10 years.
      Specifically, it started with commit 786d7e16
      ("Fix rmmod/read/write races in /proc entries"). Notice the chunk where
      inode->i_fop is initialized with proxy struct file_operations for
      regular files:
      
      	-               if (de->proc_fops)
      	-                       inode->i_fop = de->proc_fops;
      	+               if (de->proc_fops) {
      	+                       if (S_ISREG(inode->i_mode))
      	+                               inode->i_fop = &proc_reg_file_ops;
      	+                       else
      	+                               inode->i_fop = de->proc_fops;
      	+               }
      
      VFS stopped pinning module at this point.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      96890d62
  3. 16 Jan, 2018 1 commit
  4. 30 Nov, 2017 1 commit
    • D
      xfrm: Move dst->path into struct xfrm_dst · 0f6c480f
      David Miller committed
      The first member of an IPSEC route bundle chain sets its dst->path to
      the underlying ipv4/ipv6 route that carries the bundle.
      
      Stated another way, if one were to follow the xfrm_dst->child chain of
      the bundle, the final non-NULL pointer would be the path and point to
      either an ipv4 or an ipv6 route.
      
      This is largely used to make sure that PMTU events propagate down to
      the correct ipv4 or ipv6 route.
      
      When we don't have the top of an IPSEC bundle 'dst->path == dst'.
      
      Move it down into xfrm_dst and key off of dst->xfrm.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      0f6c480f
  5. 18 Nov, 2017 2 commits
  6. 25 Oct, 2017 1 commit
    • M
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns... · 6aa7de05
      Mark Rutland committed
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
      
      Please do not apply this to mainline directly, instead please re-run the
      coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6aa7de05
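
For illustration, a minimal userspace sketch of what the two semantic-patch rules do at a call site. The macros here are simplified stand-ins built on volatile casts and the GCC/Clang `__typeof__` extension, not the kernel's definitions, and `shared_flag`, `publish()` and `consume()` are hypothetical names.

```c
#include <assert.h>

/* Simplified userspace stand-ins, NOT the kernel's definitions. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

static int shared_flag; /* hypothetical shared variable */

/* Before the conversion: ACCESS_ONCE(shared_flag) = v; */
static void publish(int v)
{
	WRITE_ONCE(shared_flag, v);
}

/* Before the conversion: return ACCESS_ONCE(shared_flag); */
static int consume(void)
{
	return READ_ONCE(shared_flag);
}
```

The first rule has to match assignments before the second rule runs, otherwise the left-hand side of a store would be rewritten into a READ_ONCE().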
  7. 21 Oct, 2017 1 commit
  8. 10 Oct, 2017 1 commit
  9. 06 Oct, 2017 1 commit
  10. 01 Oct, 2017 1 commit
    • P
      udp: perform source validation for mcast early demux · bc044e8d
      Paolo Abeni committed
      The UDP early demux can leverage the rx dst cache even for
      multicast unconnected sockets.
      
      In such scenario the ipv4 source address is validated only on
      the first packet in the given flow. After that, when we fetch
      the dst entry  from the socket rx cache, we stop enforcing
      the rp_filter and we even start accepting any kind of martian
      addresses.
      
      Disabling the dst cache for unconnected multicast sockets would
      cause a large performance regression, nearly halving the
      max ingress throughput.
      
      Instead we factor out a route helper to completely validate an
      skb source address for multicast packets and we call it from
      the UDP early demux for mcast packets landing on unconnected
      sockets, after successfully fetching the related cached dst entry.
      
      This still gives a measurable, but limited performance
      regression:
      
      		rp_filter = 0		rp_filter = 1
      edmux disabled:	1182 Kpps		1127 Kpps
      edmux before:	2238 Kpps		2238 Kpps
      edmux after:	2037 Kpps		2019 Kpps
      
      The above figures are on top of current net tree.
      Applying the net-next commit 6e617de8 ("net: avoid a full
      fib lookup when rp_filter is disabled.") the delta with
      rp_filter == 0 will decrease even more.
      
      Fixes: 421b3885 ("udp: ipv4: Add udp early demux")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bc044e8d
  11. 19 Aug, 2017 2 commits
    • R
      net: check and errout if res->fi is NULL when RTM_F_FIB_MATCH is set · bc3aae2b
      Roopa Prabhu committed
      Syzkaller hit a 'general protection fault in fib_dump_info' bug on
      4.13-rc5.
      
      Guilty file: net/ipv4/fib_semantics.c
      
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Modules linked in:
      CPU: 0 PID: 2808 Comm: syz-executor0 Not tainted 4.13.0-rc5 #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      Ubuntu-1.8.2-1ubuntu1 04/01/2014
      task: ffff880078562700 task.stack: ffff880078110000
      RIP: 0010:fib_dump_info+0x388/0x1170 net/ipv4/fib_semantics.c:1314
      RSP: 0018:ffff880078117010 EFLAGS: 00010206
      RAX: dffffc0000000000 RBX: 00000000000000fe RCX: 0000000000000002
      RDX: 0000000000000006 RSI: ffff880078117084 RDI: 0000000000000030
      RBP: ffff880078117268 R08: 000000000000000c R09: ffff8800780d80c8
      R10: 0000000058d629b4 R11: 0000000067fce681 R12: 0000000000000000
      R13: ffff8800784bd540 R14: ffff8800780d80b5 R15: ffff8800780d80a4
      FS:  00000000022fa940(0000) GS:ffff88007fc00000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000004387d0 CR3: 0000000079135000 CR4: 00000000000006f0
      Call Trace:
        inet_rtm_getroute+0xc89/0x1f50 net/ipv4/route.c:2766
        rtnetlink_rcv_msg+0x288/0x680 net/core/rtnetlink.c:4217
        netlink_rcv_skb+0x340/0x470 net/netlink/af_netlink.c:2397
        rtnetlink_rcv+0x28/0x30 net/core/rtnetlink.c:4223
        netlink_unicast_kernel net/netlink/af_netlink.c:1265 [inline]
        netlink_unicast+0x4c4/0x6e0 net/netlink/af_netlink.c:1291
        netlink_sendmsg+0x8c4/0xca0 net/netlink/af_netlink.c:1854
        sock_sendmsg_nosec net/socket.c:633 [inline]
        sock_sendmsg+0xca/0x110 net/socket.c:643
        ___sys_sendmsg+0x779/0x8d0 net/socket.c:2035
        __sys_sendmsg+0xd1/0x170 net/socket.c:2069
        SYSC_sendmsg net/socket.c:2080 [inline]
        SyS_sendmsg+0x2d/0x50 net/socket.c:2076
        entry_SYSCALL_64_fastpath+0x1a/0xa5
        RIP: 0033:0x4512e9
        RSP: 002b:00007ffc75584cc8 EFLAGS: 00000216 ORIG_RAX:
        000000000000002e
        RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00000000004512e9
        RDX: 0000000000000000 RSI: 0000000020f2cfc8 RDI: 0000000000000003
        RBP: 000000000000000e R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000216 R12: fffffffffffffffe
        R13: 0000000000718000 R14: 0000000020c44ff0 R15: 0000000000000000
        Code: 00 0f b6 8d ec fd ff ff 48 8b 85 f0 fd ff ff 88 48 17 48 8b 45
        28 48 8d 78 30 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03
        <0f>
        b6 04 02 84 c0 74 08 3c 03 0f 8e cb 0c 00 00 48 8b 45 28 44
        RIP: fib_dump_info+0x388/0x1170 net/ipv4/fib_semantics.c:1314 RSP:
        ffff880078117010
      ---[ end trace 254a7af28348f88b ]---
      
      This patch adds a res->fi NULL check.
      
      example run:
      $ip route get 0.0.0.0 iif virt1-0
      broadcast 0.0.0.0 dev lo
          cache <local,brd> iif virt1-0
      
      $ip route get 0.0.0.0 iif virt1-0 fibmatch
      RTNETLINK answers: No route to host
      Reported-by: idaifish <idaifish@gmail.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Fixes: b6179813 ("net: ipv4: RTM_GETROUTE: return matched fib result when requested")
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bc3aae2b
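
A minimal sketch of the kind of guard this message describes, assuming a reduced result structure. The `_sketch` types and `dump_fib_match()` are hypothetical, and the choice of -EHOSTUNREACH is only inferred from the "No route to host" answer in the example run, not taken from the patch itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, reduced stand-ins for the kernel structures. */
struct fib_info_sketch { int placeholder; };
struct fib_result_sketch { struct fib_info_sketch *fi; };

/*
 * Refuse to dump a fib match when the lookup left no fib_info behind,
 * instead of dereferencing a NULL pointer later on.
 */
static int dump_fib_match(const struct fib_result_sketch *res)
{
	if (!res->fi)
		return -EHOSTUNREACH; /* assumed error code, see lead-in */

	/* ... a real implementation would fill the netlink reply here ... */
	return 0;
}
```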
    • E
      ipv4: convert dst_metrics.refcnt from atomic_t to refcount_t · 9620fef2
      Eric Dumazet committed
      refcount_t type and corresponding API should be
      used instead of atomic_t when the variable is used as
      a reference counter. This helps avoid accidental
      refcounter overflows that might lead to use-after-free
      situations.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9620fef2
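
To illustrate why a dedicated reference-counter type helps, here is a minimal userspace sketch of a counter that refuses to resurrect a zero count and pins itself at the maximum instead of wrapping. It uses C11 atomics and hypothetical names (`ref_sketch`, `ref_inc_not_zero()`, `ref_dec_and_test()`); it is not the kernel's refcount_t implementation.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace counter, NOT the kernel's refcount_t. */
struct ref_sketch { atomic_uint refs; };

/* Take a reference unless the object is already dead (count == 0). */
static bool ref_inc_not_zero(struct ref_sketch *r)
{
	unsigned int old = atomic_load(&r->refs);

	do {
		if (old == 0)
			return false;	/* too late, object is being freed */
		if (old == UINT_MAX)
			return true;	/* saturated: stay pinned, never wrap */
	} while (!atomic_compare_exchange_weak(&r->refs, &old, old + 1));

	return true;
}

/* Drop a reference; returns true only for the final put. */
static bool ref_dec_and_test(struct ref_sketch *r)
{
	unsigned int old = atomic_load(&r->refs);

	do {
		if (old == 0 || old == UINT_MAX)
			return false;	/* underflow or saturated: leave as is */
	} while (!atomic_compare_exchange_weak(&r->refs, &old, old - 1));

	return old == 1;
}
```

A plain atomic_t increment would wrap past the maximum back to small values, so a later decrement could free an object that still has users; pinning the counter trades a possible leak for avoiding that use-after-free.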
  12. 17 Aug, 2017 1 commit
    • E
      ipv4: better IP_MAX_MTU enforcement · c780a049
      Eric Dumazet committed
      While working on yet another syzkaller report, I found
      that our IP_MAX_MTU enforcements were not properly done.
      
      gcc seems to reload dev->mtu for min(dev->mtu, IP_MAX_MTU), and
      final result can be bigger than IP_MAX_MTU :/
      
      This is a problem because device mtu can be changed on other cpus or
      threads.
      
      While this patch does not fix the issue I am working on, it is
      probably worth addressing it.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c780a049
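
The reload problem described above can be sketched in userspace as follows: read the MTU exactly once into a local, then clamp the local. `READ_ONCE()` here is a simplified stand-in using a volatile cast and the GCC/Clang `__typeof__` extension, and `netdev_sketch` and `clamped_mtu()` are hypothetical names; this is an illustration of the idea, not the patch itself.

```c
#include <assert.h>

#define IP_MAX_MTU 0xFFFFU

/* Simplified userspace stand-in, NOT the kernel's definition. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical, reduced stand-in for struct net_device. */
struct netdev_sketch { unsigned int mtu; };

static unsigned int clamped_mtu(struct netdev_sketch *dev)
{
	/*
	 * Snapshot dev->mtu once. With a plain min(dev->mtu, IP_MAX_MTU)
	 * the compiler may load dev->mtu twice, and a concurrent writer
	 * could raise it between the comparison and the final read.
	 */
	unsigned int mtu = READ_ONCE(dev->mtu);

	return mtu < IP_MAX_MTU ? mtu : IP_MAX_MTU;
}
```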
  13. 16 Aug, 2017 1 commit
  14. 15 Aug, 2017 1 commit
    • F
      ipv4: route: fix inet_rtm_getroute induced crash · 2c87d63a
      Florian Westphal committed
      "ip route get $daddr iif eth0 from $saddr" causes:
       BUG: KASAN: use-after-free in ip_route_input_rcu+0x1535/0x1b50
       Call Trace:
        ip_route_input_rcu+0x1535/0x1b50
        ip_route_input_noref+0xf9/0x190
        tcp_v4_early_demux+0x1a4/0x2b0
        ip_rcv+0xbcb/0xc05
        __netif_receive_skb+0x9c/0xd0
        netif_receive_skb_internal+0x5a8/0x890
      
      Problem is that inet_rtm_getroute calls either ip_route_input_rcu (if an
      iif was provided) or ip_route_output_key_hash_rcu.
      
      But ip_route_input_rcu, unlike ip_route_output_key_hash_rcu, already
      associates the dst_entry with the skb.  This clears the SKB_DST_NOREF
      bit (i.e. skb_dst_drop will release/free the entry while it should not).
      
      Thus only set the dst if we called ip_route_output_key_hash_rcu().
      
      I tested this patch by running:
       while true;do ip r get 10.0.1.2;done > /dev/null &
       while true;do ip r get 10.0.1.2 iif eth0  from 10.0.1.1;done > /dev/null &
      ... and saw no crash or memory leak.
      
      Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
      Cc: David Ahern <dsahern@gmail.com>
      Fixes: ba52d61e ("ipv4: route: restore skb_dst_set in inet_rtm_getroute")
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2c87d63a
  15. 14 Aug, 2017 1 commit
  16. 12 Aug, 2017 1 commit
    • D
      net: ipv4: set orig_oif based on fib result for local traffic · 839da4d9
      David Ahern committed
      Attempts to connect to a local address with a socket bound
      to a device with the local address hang if there is no listener:
      
        $ ip addr sh dev eth1
        3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
          link/ether 02:e0:f9:1c:00:37 brd ff:ff:ff:ff:ff:ff
          inet 10.100.1.4/24 scope global eth1
             valid_lft forever preferred_lft forever
          inet6 2001:db8:1::4/120 scope global
             valid_lft forever preferred_lft forever
          inet6 fe80::e0:f9ff:fe1c:37/64 scope link
             valid_lft forever preferred_lft forever
      
        $ vrf-test -I eth1 -r 10.100.1.4
        <hangs when there is no server>
      
      (don't let the command name fool you; vrf-test works without vrfs.)
      
      The problem is that the original intended device, eth1 in this case, is
      lost when the tcp reset is sent, so the socket lookup does not find a
      match for the reset and the connect attempt hangs. Fix by adjusting
      orig_oif for local traffic to the device from the fib lookup result.
      
      With this patch you get the more user friendly:
        $ vrf-test -I eth1 -r 10.100.1.4
        connect failed: 111: Connection refused
      
      orig_oif is saved to the newly created rtable as rt_iif and when set
      it is used as the dif for socket lookups. It is set based on flowi4_oif
      passed in to ip_route_output_key_hash_rcu and will be set to either
      the loopback device, an l3mdev device, nothing (flowi4_oif = 0 which
      is the case in the example above) or a netdev index depending on the
      lookup path. In each case, resetting orig_oif to the device in the fib
      result for the RTN_LOCAL case allows the actual device to be preserved
      as the skb tx and rx is done over the loopback or VRF device.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      839da4d9
  17. 10 Aug, 2017 1 commit
  18. 20 Jun, 2017 1 commit
  19. 18 Jun, 2017 7 commits
    • W
      net: remove DST_NOCACHE flag · a4c2fd7f
      Wei Wang committed
      DST_NOCACHE flag check has been removed from dst_release() and
      dst_hold_safe() in a previous patch because all the dst are now ref
      counted properly and can be released based on refcnt only.
      Looking at the rest of the DST_NOCACHE use, all of them can now be
      removed or replaced with other checks.
      So this patch gets rid of all the DST_NOCACHE usage and removes this flag
      completely.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a4c2fd7f
    • W
      net: remove DST_NOGC flag · b2a9c0ed
      Wei Wang committed
      Now that all the components have been changed to release dst based on
      refcnt only and not depend on dst gc anymore, we can remove the
      temporary flag DST_NOGC.
      
      Note that we also need to remove the DST_NOCACHE check in dst_release()
      and dst_hold_safe() because now all the dst are released based on refcnt
      and behave as DST_NOCACHE.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b2a9c0ed
    • W
      ipv4: mark DST_NOGC and remove the operation of dst_free() · b838d5e1
      Wei Wang committed
      With the previous preparation patches, we are ready to get rid of the
      dst gc operation in ipv4 code and release dst based on refcnt only.
      So this patch adds the DST_NOGC flag for all IPv4 dst and removes the calls
      to dst_free().
      At this point, all dst created in ipv4 code do not use the dst gc
      anymore and will be destroyed at the point when refcnt drops to 0.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b838d5e1
    • W
      ipv4: call dst_hold_safe() properly · 9df16efa
      Wei Wang committed
      This patch checks all the calls to
      dst_hold()/skb_dst_force()/dst_clone()/dst_use() to see if
      dst_hold_safe() is needed to avoid double free issue if dst
      gc is removed and dst_release() directly destroys dst when
      dst->__refcnt drops to 0.
      
      In tx path, TCP holds sk->sk_rx_dst ref count and also holds sock_lock().
      UDP and other similar protocols always hold refcount for
      skb->_skb_refdst. So both paths seem to be safe.
      
      In rx path, as it is lockless and skb_dst_set_noref() is likely to be
      used, dst_hold_safe() should always be used when trying to hold dst.
      
      In the routing code, if dst is held during an rcu protected session, it
      is necessary to call dst_hold_safe() as the current dst might be in its
      rcu grace period.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9df16efa
    • W
      ipv4: call dst_dev_put() properly · 95c47f9c
      Wei Wang committed
      As the intent of this patch series is to completely remove dst gc,
      we need to call dst_dev_put() to release the reference to dst->dev
      when removing routes from fib because we won't keep the gc list anymore
      and will lose the dst pointer right after removing the routes.
      Without the gc list, there is no way to find all the dst's that have
      dst->dev pointing to the going-down dev.
      Hence, we are doing dst_dev_put() immediately before we lose the last
      reference of the dst from the routing code. The next dst_check() will
      trigger a route re-lookup to find another route (if there is any).
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      95c47f9c
    • W
      ipv4: take dst->__refcnt when caching dst in fib · 0830106c
      Wei Wang committed
      In IPv4 routing code, fib_nh and fib_nh_exception can hold pointers
      to struct rtable but they never increment dst->__refcnt.
      This leads to the need of the dst garbage collector because when user
      is done with this dst and calls dst_release(), it can only decrement
      dst->__refcnt and can not free the dst even if it sees dst->__refcnt
      drops from 1 to 0 (unless DST_NOCACHE flag is set) because the routing
      code might still hold reference to it.
      And when the routing code tries to delete a route, it has to put the
      dst to the gc_list if dst->__refcnt is not yet 0 and have a gc thread
      running periodically to check on dst->__refcnt and finally to free dst
      when refcnt becomes 0.
      
      This patch increments dst->__refcnt when
      fib_nh/fib_nh_exception holds reference to this dst and properly release
      the dst when fib_nh/fib_nh_exception has been updated with a new dst.
      
      This patch is a preparation in order to fully get rid of dst gc later.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0830106c
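
The ownership rule this message introduces can be sketched in userspace as below: the cache slot takes its own reference when it stores a dst and drops it when the slot is overwritten, so the object can be destroyed on the last release with no garbage collector. All `_sketch` names are hypothetical, and a `destroyed` flag stands in for actually freeing memory so the behaviour can be observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for struct dst_entry. */
struct dst_sketch {
	atomic_int refcnt;
	int destroyed;	/* set instead of freeing, so callers can inspect it */
};

static void dst_sketch_hold(struct dst_sketch *d)
{
	atomic_fetch_add(&d->refcnt, 1);
}

static void dst_sketch_release(struct dst_sketch *d)
{
	/* atomic_fetch_sub() returns the old value: 1 means last reference. */
	if (atomic_fetch_sub(&d->refcnt, 1) == 1)
		d->destroyed = 1;
}

/* Hypothetical stand-in for the nexthop's cached route pointer. */
struct nh_sketch { struct dst_sketch *cached; };

/* Store a new dst in the slot; the slot owns one reference to it. */
static void nh_sketch_cache(struct nh_sketch *nh, struct dst_sketch *d)
{
	struct dst_sketch *old = nh->cached;

	if (d)
		dst_sketch_hold(d);
	nh->cached = d;
	if (old)
		dst_sketch_release(old);
}
```

Because the slot counts as a holder, a user dropping its reference can never free a dst that the routing code still points at, which is the situation the gc list existed to handle.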
    • W
      net: use loopback dev when generating blackhole route · 1dbe3252
      Wei Wang committed
      Existing ipv4/6_blackhole_route() code generates a blackhole route
      with dst->dev pointing to the passed in dst->dev.
      It is not necessary to hold reference to the passed in dst->dev
      because the packets going through this route are dropped anyway.
      A loopback interface is good enough so that we don't need to worry about
      releasing this dst->dev when this dev is going down.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1dbe3252
  20. 01 Jun, 2017 1 commit
  21. 27 May, 2017 6 commits
  22. 09 May, 2017 2 commits
  23. 25 Apr, 2017 1 commit
    • R
      ipv4: Avoid caching l3mdev dst on mismatched local route · b7c8487c
      Robert Shearman committed
      David reported that doing the following:
      
          ip li add red type vrf table 10
          ip link set dev eth1 vrf red
          ip addr add 127.0.0.1/8 dev red
          ip link set dev eth1 up
          ip li set red up
          ping -c1 -w1 -I red 127.0.0.1
          ip li del red
      
      when either policy routing IP rules are present or the local table
      lookup ip rule is before the l3mdev lookup results in a hang with
      these messages:
      
          unregister_netdevice: waiting for red to become free. Usage count = 1
      
      The problem is caused by caching the dst used for sending the packet
      out of the specified interface on a local route with a different
      nexthop interface. Thus the dst could stay around until the route in
      the table where the lookup was done is deleted, which may be never.
      
      Address the problem by not forcing output device to be the l3mdev in
      the flow's output interface if the lookup didn't use the l3mdev. This
      then results in the dst using the right device according to the route.
      
      Changes in v2:
       - make the dev_out passed in by __ip_route_output_key_hash correct
         instead of checking the nh dev if FLOWI_FLAG_SKIP_NH_OIF is set as
         suggested by David.
      
      Fixes: 5f02ce24 ("net: l3mdev: Allow the l3mdev to be a loopback")
      Reported-by: David Ahern <dsa@cumulusnetworks.com>
      Suggested-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Tested-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b7c8487c
  24. 18 Apr, 2017 1 commit
  25. 14 Apr, 2017 2 commits