1. 09 7月, 2019 1 次提交
  2. 06 7月, 2019 1 次提交
    • I
      ipv4: Fix NULL pointer dereference in ipv4_neigh_lookup() · 537de0c8
      Ido Schimmel 提交于
      Both ip_neigh_gw4() and ip_neigh_gw6() can return either a valid pointer
      or an error pointer, but the code currently checks that the pointer is
      not NULL.
      
      Fix this by checking that the pointer is not an error pointer, as this
      can result in a NULL pointer dereference [1]. Specifically, I believe
      that what happened is that ip_neigh_gw4() returned '-EINVAL'
      (0xffffffffffffffea) to which the offset of 'refcnt' (0x70) was added,
      which resulted in the address 0x000000000000005a.
      
      [1]
       BUG: KASAN: null-ptr-deref in refcount_inc_not_zero_checked+0x6e/0x180
       Read of size 4 at addr 000000000000005a by task swapper/2/0
      
       CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.2.0-rc6-custom-reg-179657-gaa32d89 #396
       Hardware name: Mellanox Technologies Ltd. MSN2010/SA002610, BIOS 5.6.5 08/24/2017
       Call Trace:
       <IRQ>
       dump_stack+0x73/0xbb
       __kasan_report+0x188/0x1ea
       kasan_report+0xe/0x20
       refcount_inc_not_zero_checked+0x6e/0x180
       ipv4_neigh_lookup+0x365/0x12c0
       __neigh_update+0x1467/0x22f0
       arp_process.constprop.6+0x82e/0x1f00
       __netif_receive_skb_one_core+0xee/0x170
       process_backlog+0xe3/0x640
       net_rx_action+0x755/0xd90
       __do_softirq+0x29b/0xae7
       irq_exit+0x177/0x1c0
       smp_apic_timer_interrupt+0x164/0x5e0
       apic_timer_interrupt+0xf/0x20
       </IRQ>
      
      Fixes: 5c9f7c1d ("ipv4: Add helpers for neigh lookup for nexthop")
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Reported-by: NShalom Toledo <shalomt@mellanox.com>
      Reviewed-by: NJiri Pirko <jiri@mellanox.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      537de0c8
  3. 02 7月, 2019 1 次提交
  4. 29 6月, 2019 1 次提交
  5. 27 6月, 2019 2 次提交
    • S
      ipv4: reset rt_iif for recirculated mcast/bcast out pkts · 5b18f128
      Stephen Suryaputra 提交于
      Multicast or broadcast egress packets have rt_iif set to the oif. These
      packets might be recirculated back as input and lookup to the raw
      sockets may fail because they are bound to the incoming interface
      (skb_iif). If rt_iif is not zero, during the lookup, inet_iif() function
      returns rt_iif instead of skb_iif. Hence, the lookup fails.
      
      v2: Make it non vrf specific (David Ahern). Reword the changelog to
          reflect it.
      Signed-off-by: NStephen Suryaputra <ssuryaextr@gmail.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5b18f128
    • E
      ipv4: fix suspicious RCU usage in fib_dump_info_fnhe() · 93ed54b1
      Eric Dumazet 提交于
      sysbot reported that we lack appropriate rcu_read_lock()
      protection in fib_dump_info_fnhe()
      
      net/ipv4/route.c:2875 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      1 lock held by syz-executor609/8966:
       #0: 00000000b7dbe288 (rtnl_mutex){+.+.}, at: netlink_dump+0xe7/0xfb0 net/netlink/af_netlink.c:2199
      
      stack backtrace:
      CPU: 0 PID: 8966 Comm: syz-executor609 Not tainted 5.2.0-rc5+ #43
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5250
       fib_dump_info_fnhe+0x9d9/0x1080 net/ipv4/route.c:2875
       fn_trie_dump_leaf net/ipv4/fib_trie.c:2141 [inline]
       fib_table_dump+0x64a/0xd00 net/ipv4/fib_trie.c:2175
       inet_dump_fib+0x83c/0xa90 net/ipv4/fib_frontend.c:1004
       rtnl_dump_all+0x295/0x490 net/core/rtnetlink.c:3445
       netlink_dump+0x558/0xfb0 net/netlink/af_netlink.c:2244
       __netlink_dump_start+0x5b1/0x7d0 net/netlink/af_netlink.c:2352
       netlink_dump_start include/linux/netlink.h:226 [inline]
       rtnetlink_rcv_msg+0x73d/0xb00 net/core/rtnetlink.c:5182
       netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
       rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5237
       netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
       netlink_unicast+0x531/0x710 net/netlink/af_netlink.c:1328
       netlink_sendmsg+0x8ae/0xd70 net/netlink/af_netlink.c:1917
       sock_sendmsg_nosec net/socket.c:646 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:665
       sock_write_iter+0x27c/0x3e0 net/socket.c:994
       call_write_iter include/linux/fs.h:1872 [inline]
       new_sync_write+0x4d3/0x770 fs/read_write.c:483
       __vfs_write+0xe1/0x110 fs/read_write.c:496
       vfs_write+0x20c/0x580 fs/read_write.c:558
       ksys_write+0x14f/0x290 fs/read_write.c:611
       __do_sys_write fs/read_write.c:623 [inline]
       __se_sys_write fs/read_write.c:620 [inline]
       __x64_sys_write+0x73/0xb0 fs/read_write.c:620
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4401b9
      Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007ffc8e134978 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401b9
      RDX: 000000000000001c RSI: 0000000020000000 RDI: 0000000000000003
      RBP: 00000000006ca018 R08: 00000000004002c8 R09: 00000000004002c8
      R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000401a40
      R13: 0000000000401ad0 R14: 0000000000000000 R15: 0000000000000000
      
      Fixes: ee28906f ("ipv4: Dump route exceptions if requested")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Stefano Brivio <sbrivio@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Reviewed-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93ed54b1
  6. 25 6月, 2019 2 次提交
    • S
      ipv4: Dump route exceptions if requested · ee28906f
      Stefano Brivio 提交于
      Since commit 4895c771 ("ipv4: Add FIB nexthop exceptions."), cached
      exception routes are stored as a separate entity, so they are not dumped
      on a FIB dump, even if the RTM_F_CLONED flag is passed.
      
      This implies that the command 'ip route list cache' doesn't return any
      result anymore.
      
      If the RTM_F_CLONED is passed, and strict checking requested, retrieve
      nexthop exception routes and dump them. If no strict checking is
      requested, filtering can't be performed consistently: dump everything in
      that case.
      
      With this, we need to add an argument to the netlink callback in order to
      track how many entries were already dumped for the last leaf included in
      a partial netlink dump.
      
      A single additional argument is sufficient, even if we traverse logically
      nested structures (nexthop objects, hash table buckets, bucket chains): it
      doesn't matter if we stop in the middle of any of those, because they are
      always traversed the same way. As an example, s_i values in [], s_fa
      values in ():
      
        node (fa) #1 [1]
          nexthop #1
          bucket #1 -> #0 in chain (1)
          bucket #2 -> #0 in chain (2) -> #1 in chain (3) -> #2 in chain (4)
          bucket #3 -> #0 in chain (5) -> #1 in chain (6)
      
          nexthop #2
          bucket #1 -> #0 in chain (7) -> #1 in chain (8)
          bucket #2 -> #0 in chain (9)
        --
        node (fa) #2 [2]
          nexthop #1
          bucket #1 -> #0 in chain (1) -> #1 in chain (2)
          bucket #2 -> #0 in chain (3)
      
      it doesn't matter if we stop at (3), (4), (7) for "node #1", or at (2)
      for "node #2": walking flattens all that.
      
      It would even be possible to drop the distinction between the in-tree
      (s_i) and in-node (s_fa) counter, but a further improvement might
      advise against this. This is only as accurate as the existing tracking
      mechanism for leaves: if a partial dump is restarted after exceptions
      are removed or expired, we might skip some non-dumped entries.
      
      To improve this, we could attach a 'sernum' attribute (similar to the
      one used for IPv6) to nexthop entities, and bump this counter whenever
      exceptions change: having a distinction between the two counters would
      make this more convenient.
      
      Listing of exception routes (modified routes pre-3.5) was tested against
      these versions of kernel and iproute2:
      
                          iproute2
      kernel         4.14.0   4.15.0   4.19.0   5.0.0   5.1.0
       3.5-rc4         +        +        +        +       +
       4.4
       4.9
       4.14
       4.15
       4.19
       5.0
       5.1
       fixed           +        +        +        +       +
      
      v7:
         - Move loop over nexthop objects to route.c, and pass struct fib_info
           and table ID to it, not a struct fib_alias (suggested by David Ahern)
         - While at it, note that the NULL check on fa->fa_info is redundant,
           and the check on RTNH_F_DEAD is also not consistent with what's done
           with regular route listing: just keep it for nhc_flags
         - Rename entry point function for dumping exceptions to
           fib_dump_info_fnhe(), and rearrange arguments for consistency with
           fib_dump_info()
         - Rename fnhe_dump_buckets() to fnhe_dump_bucket() and make it handle
           one bucket at a time
         - Expand commit message to describe why we can have a single "skip"
           counter for all exceptions stored in bucket chains in nexthop objects
           (suggested by David Ahern)
      
      v6:
         - Rebased onto net-next
         - Loop over nexthop paths too. Move loop over fnhe buckets to route.c,
           avoids need to export rt_fill_info() and to touch exceptions from
           fib_trie.c. Pass NULL as flow to rt_fill_info(), it now allows that
           (suggested by David Ahern)
      
      Fixes: 4895c771 ("ipv4: Add FIB nexthop exceptions.")
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee28906f
    • S
      ipv4/route: Allow NULL flowinfo in rt_fill_info() · d948974c
      Stefano Brivio 提交于
      In the next patch, we're going to use rt_fill_info() to dump exception
      routes upon RTM_GETROUTE with NLM_F_ROOT, meaning userspace is requesting
      a dump and not a specific route selection, which in turn implies the input
      interface is not relevant. Update rt_fill_info() to handle a NULL
      flowinfo.
      
      v7: If fl4 is NULL, explicitly set r->rtm_tos to 0: it's not initialised
          otherwise (spotted by David Ahern)
      
      v6: New patch
      Suggested-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d948974c
  7. 15 6月, 2019 1 次提交
  8. 06 6月, 2019 1 次提交
    • X
      ipv4: not do cache for local delivery if bc_forwarding is enabled · 0a90478b
      Xin Long 提交于
      With the topo:
      
          h1 ---| rp1            |
                |     route  rp3 |--- h3 (192.168.200.1)
          h2 ---| rp2            |
      
      If rp1 bc_forwarding is set while rp2 bc_forwarding is not, after
      doing "ping 192.168.200.255" on h1, then ping 192.168.200.255 on
      h2, and the packets can still be forwared.
      
      This issue was caused by the input route cache. It should only do
      the cache for either bc forwarding or local delivery. Otherwise,
      local delivery can use the route cache for bc forwarding of other
      interfaces.
      
      This patch is to fix it by not doing cache for local delivery if
      all.bc_forwarding is enabled.
      
      Note that we don't fix it by checking route cache local flag after
      rt_cache_valid() in "local_input:" and "ip_mkroute_input", as the
      common route code shouldn't be touched for bc_forwarding.
      
      Fixes: 5cbf777c ("route: add support for directed broadcast forwarding")
      Reported-by: NJianlin Shi <jishi@redhat.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0a90478b
  9. 05 6月, 2019 2 次提交
    • D
      ipv4: Prepare for fib6_nh from a nexthop object · dcb1ecb5
      David Ahern 提交于
      Convert more IPv4 code to use fib_nh_common over fib_nh to enable routes
      to use a fib6_nh based nexthop. In the end, only code not using a
      nexthop object in a fib_info should directly access fib_nh in a fib_info
      without checking the famiy and going through fib_nh_common. Those
      functions will be marked when it is not directly evident.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dcb1ecb5
    • D
      ipv4: Use accessors for fib_info nexthop data · 5481d73f
      David Ahern 提交于
      Use helpers to access fib_nh and fib_nhs fields of a fib_info. Drop the
      fib_dev macro which is an alias for the first nexthop. Replacements:
      
        fi->fib_dev    --> fib_info_nh(fi, 0)->fib_nh_dev
        fi->fib_nh     --> fib_info_nh(fi, 0)
        fi->fib_nh[i]  --> fib_info_nh(fi, i)
        fi->fib_nhs    --> fib_info_num_path(fi)
      
      where fib_info_nh(fi, i) returns fi->fib_nh[nhsel] and fib_info_num_path
      returns fi->fib_nhs.
      
      Move the existing fib_info_nhc to nexthop.h and define the new ones
      there. A later patch adds a check if a fib_info uses a nexthop object,
      and defining the helpers in nexthop.h avoid circular header
      dependencies.
      
      After this all remaining open coded references to fi->fib_nhs and
      fi->fib_nh are in:
      - fib_create_info and helpers used to lookup an existing fib_info
        entry, and
      - the netdev event functions fib_sync_down_dev and fib_sync_up.
      
      The latter two will not be reused for nexthops, and the fib_create_info
      will be updated to handle a nexthop in a fib_info.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5481d73f
  10. 31 5月, 2019 1 次提交
  11. 05 5月, 2019 3 次提交
  12. 28 4月, 2019 1 次提交
    • J
      netlink: make validation more configurable for future strictness · 8cb08174
      Johannes Berg 提交于
      We currently have two levels of strict validation:
      
       1) liberal (default)
           - undefined (type >= max) & NLA_UNSPEC attributes accepted
           - attribute length >= expected accepted
           - garbage at end of message accepted
       2) strict (opt-in)
           - NLA_UNSPEC attributes accepted
           - attribute length >= expected accepted
      
      Split out parsing strictness into four different options:
       * TRAILING     - check that there's no trailing data after parsing
                        attributes (in message or nested)
       * MAXTYPE      - reject attrs > max known type
       * UNSPEC       - reject attributes with NLA_UNSPEC policy entries
       * STRICT_ATTRS - strictly validate attribute size
      
      The default for future things should be *everything*.
      The current *_strict() is a combination of TRAILING and MAXTYPE,
      and is renamed to _deprecated_strict().
      The current regular parsing has none of this, and is renamed to
      *_parse_deprecated().
      
      Additionally it allows us to selectively set one of the new flags
      even on old policies. Notably, the UNSPEC flag could be useful in
      this case, since it can be arranged (by filling in the policy) to
      not be an incompatible userspace ABI change, but would then going
      forward prevent forgetting attribute entries. Similar can apply
      to the POLICY flag.
      
      We end up with the following renames:
       * nla_parse           -> nla_parse_deprecated
       * nla_parse_strict    -> nla_parse_deprecated_strict
       * nlmsg_parse         -> nlmsg_parse_deprecated
       * nlmsg_parse_strict  -> nlmsg_parse_deprecated_strict
       * nla_parse_nested    -> nla_parse_nested_deprecated
       * nla_validate_nested -> nla_validate_nested_deprecated
      
      Using spatch, of course:
          @@
          expression TB, MAX, HEAD, LEN, POL, EXT;
          @@
          -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
          +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)
      
          @@
          expression NLH, HDRLEN, TB, MAX, POL, EXT;
          @@
          -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
          +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)
      
          @@
          expression NLH, HDRLEN, TB, MAX, POL, EXT;
          @@
          -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
          +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
      
          @@
          expression TB, MAX, NLA, POL, EXT;
          @@
          -nla_parse_nested(TB, MAX, NLA, POL, EXT)
          +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)
      
          @@
          expression START, MAX, POL, EXT;
          @@
          -nla_validate_nested(START, MAX, POL, EXT)
          +nla_validate_nested_deprecated(START, MAX, POL, EXT)
      
          @@
          expression NLH, HDRLEN, MAX, POL, EXT;
          @@
          -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
          +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)
      
      For this patch, don't actually add the strict, non-renamed versions
      yet so that it breaks compile if I get it wrong.
      
      Also, while at it, make nla_validate and nla_parse go down to a
      common __nla_validate_parse() function to avoid code duplication.
      
      Ultimately, this allows us to have very strict validation for every
      new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
      next patch, while existing things will continue to work as is.
      
      In effect then, this adds fully strict validation for any new command.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8cb08174
  13. 25 4月, 2019 1 次提交
    • E
      ipv4: add sanity checks in ipv4_link_failure() · 20ff83f1
      Eric Dumazet 提交于
      Before calling __ip_options_compile(), we need to ensure the network
      header is a an IPv4 one, and that it is already pulled in skb->head.
      
      RAW sockets going through a tunnel can end up calling ipv4_link_failure()
      with total garbage in the skb, or arbitrary lengthes.
      
      syzbot report :
      
      BUG: KASAN: stack-out-of-bounds in memcpy include/linux/string.h:355 [inline]
      BUG: KASAN: stack-out-of-bounds in __ip_options_echo+0x294/0x1120 net/ipv4/ip_options.c:123
      Write of size 69 at addr ffff888096abf068 by task syz-executor.4/9204
      
      CPU: 0 PID: 9204 Comm: syz-executor.4 Not tainted 5.1.0-rc5+ #77
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
       kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
       check_memory_region_inline mm/kasan/generic.c:185 [inline]
       check_memory_region+0x123/0x190 mm/kasan/generic.c:191
       memcpy+0x38/0x50 mm/kasan/common.c:133
       memcpy include/linux/string.h:355 [inline]
       __ip_options_echo+0x294/0x1120 net/ipv4/ip_options.c:123
       __icmp_send+0x725/0x1400 net/ipv4/icmp.c:695
       ipv4_link_failure+0x29f/0x550 net/ipv4/route.c:1204
       dst_link_failure include/net/dst.h:427 [inline]
       vti6_xmit net/ipv6/ip6_vti.c:514 [inline]
       vti6_tnl_xmit+0x10d4/0x1c0c net/ipv6/ip6_vti.c:553
       __netdev_start_xmit include/linux/netdevice.h:4414 [inline]
       netdev_start_xmit include/linux/netdevice.h:4423 [inline]
       xmit_one net/core/dev.c:3292 [inline]
       dev_hard_start_xmit+0x1b2/0x980 net/core/dev.c:3308
       __dev_queue_xmit+0x271d/0x3060 net/core/dev.c:3878
       dev_queue_xmit+0x18/0x20 net/core/dev.c:3911
       neigh_direct_output+0x16/0x20 net/core/neighbour.c:1527
       neigh_output include/net/neighbour.h:508 [inline]
       ip_finish_output2+0x949/0x1740 net/ipv4/ip_output.c:229
       ip_finish_output+0x73c/0xd50 net/ipv4/ip_output.c:317
       NF_HOOK_COND include/linux/netfilter.h:278 [inline]
       ip_output+0x21f/0x670 net/ipv4/ip_output.c:405
       dst_output include/net/dst.h:444 [inline]
       NF_HOOK include/linux/netfilter.h:289 [inline]
       raw_send_hdrinc net/ipv4/raw.c:432 [inline]
       raw_sendmsg+0x1d2b/0x2f20 net/ipv4/raw.c:663
       inet_sendmsg+0x147/0x5d0 net/ipv4/af_inet.c:798
       sock_sendmsg_nosec net/socket.c:651 [inline]
       sock_sendmsg+0xdd/0x130 net/socket.c:661
       sock_write_iter+0x27c/0x3e0 net/socket.c:988
       call_write_iter include/linux/fs.h:1866 [inline]
       new_sync_write+0x4c7/0x760 fs/read_write.c:474
       __vfs_write+0xe4/0x110 fs/read_write.c:487
       vfs_write+0x20c/0x580 fs/read_write.c:549
       ksys_write+0x14f/0x2d0 fs/read_write.c:599
       __do_sys_write fs/read_write.c:611 [inline]
       __se_sys_write fs/read_write.c:608 [inline]
       __x64_sys_write+0x73/0xb0 fs/read_write.c:608
       do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x458c29
      Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f293b44bc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000458c29
      RDX: 0000000000000014 RSI: 00000000200002c0 RDI: 0000000000000003
      RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f293b44c6d4
      R13: 00000000004c8623 R14: 00000000004ded68 R15: 00000000ffffffff
      
      The buggy address belongs to the page:
      page:ffffea00025aafc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0
      flags: 0x1fffc0000000000()
      raw: 01fffc0000000000 0000000000000000 ffffffff025a0101 0000000000000000
      raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff888096abef80: 00 00 00 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 f2
       ffff888096abf000: f2 f2 f2 f2 00 00 00 00 00 00 00 00 00 00 00 00
      >ffff888096abf080: 00 00 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00
                               ^
       ffff888096abf100: 00 00 00 00 f1 f1 f1 f1 00 00 f3 f3 00 00 00 00
       ffff888096abf180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      
      Fixes: ed0de45a ("ipv4: recompile ip options in ipv4_link_failure")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Stephen Suryaputra <ssuryaextr@gmail.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      20ff83f1
  14. 15 4月, 2019 1 次提交
    • E
      ipv4: ensure rcu_read_lock() in ipv4_link_failure() · c543cb4a
      Eric Dumazet 提交于
      fib_compute_spec_dst() needs to be called under rcu protection.
      
      syzbot reported :
      
      WARNING: suspicious RCU usage
      5.1.0-rc4+ #165 Not tainted
      include/linux/inetdevice.h:220 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      1 lock held by swapper/0/0:
       #0: 0000000051b67925 ((&n->timer)){+.-.}, at: lockdep_copy_map include/linux/lockdep.h:170 [inline]
       #0: 0000000051b67925 ((&n->timer)){+.-.}, at: call_timer_fn+0xda/0x720 kernel/time/timer.c:1315
      
      stack backtrace:
      CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.1.0-rc4+ #165
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5162
       __in_dev_get_rcu include/linux/inetdevice.h:220 [inline]
       fib_compute_spec_dst+0xbbd/0x1030 net/ipv4/fib_frontend.c:294
       spec_dst_fill net/ipv4/ip_options.c:245 [inline]
       __ip_options_compile+0x15a7/0x1a10 net/ipv4/ip_options.c:343
       ipv4_link_failure+0x172/0x400 net/ipv4/route.c:1195
       dst_link_failure include/net/dst.h:427 [inline]
       arp_error_report+0xd1/0x1c0 net/ipv4/arp.c:297
       neigh_invalidate+0x24b/0x570 net/core/neighbour.c:995
       neigh_timer_handler+0xc35/0xf30 net/core/neighbour.c:1081
       call_timer_fn+0x190/0x720 kernel/time/timer.c:1325
       expire_timers kernel/time/timer.c:1362 [inline]
       __run_timers kernel/time/timer.c:1681 [inline]
       __run_timers kernel/time/timer.c:1649 [inline]
       run_timer_softirq+0x652/0x1700 kernel/time/timer.c:1694
       __do_softirq+0x266/0x95a kernel/softirq.c:293
       invoke_softirq kernel/softirq.c:374 [inline]
       irq_exit+0x180/0x1d0 kernel/softirq.c:414
       exiting_irq arch/x86/include/asm/apic.h:536 [inline]
       smp_apic_timer_interrupt+0x14a/0x570 arch/x86/kernel/apic/apic.c:1062
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:807
      
      Fixes: ed0de45a ("ipv4: recompile ip options in ipv4_link_failure")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Cc: Stephen Suryaputra <ssuryaextr@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c543cb4a
  15. 13 4月, 2019 1 次提交
  16. 09 4月, 2019 5 次提交
    • D
      ipv4: Handle ipv6 gateway in ipv4_confirm_neigh · 6de9c055
      David Ahern 提交于
      Update ipv4_confirm_neigh to handle an ipv6 gateway.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6de9c055
    • D
      ipv4: Add helpers for neigh lookup for nexthop · 5c9f7c1d
      David Ahern 提交于
      A common theme in the output path is looking up a neigh entry for a
      nexthop, either the gateway in an rtable or a fallback to the daddr
      in the skb:
      
              nexthop = (__force u32)rt_nexthop(rt, ip_hdr(skb)->daddr);
              neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
              if (unlikely(!neigh))
                      neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
      
      To allow the nexthop to be an IPv6 address we need to consider the
      family of the nexthop and then call __ipv{4,6}_neigh_lookup_noref based
      on it.
      
      To make this simpler, add a ip_neigh_gw4 helper similar to ip_neigh_gw6
      added in an earlier patch which handles:
      
              neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
              if (unlikely(!neigh))
                      neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
      
      And then add a second one, ip_neigh_for_gw, that calls either
      ip_neigh_gw4 or ip_neigh_gw6 based on the address family of the gateway.
      
      Update the output paths in the VRF driver and core v4 code to use
      ip_neigh_for_gw simplifying the family based lookup and making both
      ready for a v6 nexthop.
      
      ipv4_neigh_lookup has a different need - the potential to resolve a
      passed in address in addition to any gateway in the rtable or skb. Since
      this is a one-off, add ip_neigh_gw4 and ip_neigh_gw6 diectly. The
      difference between __neigh_create used by the helpers and neigh_create
      called by ipv4_neigh_lookup is taking a refcount, so add rcu_read_lock_bh
      and bump the refcnt on the neigh entry.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5c9f7c1d
    • D
      ipv4: Add support to rtable for ipv6 gateway · 0f5f7d7b
      David Ahern 提交于
      Add support for an IPv6 gateway to rtable. Since a gateway is either
      IPv4 or IPv6, make it a union with rt_gw4 where rt_gw_family decides
      which address is in use.
      
      When dumping the route data, encode an ipv6 nexthop using RTA_VIA.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0f5f7d7b
    • D
      ipv4: Prepare rtable for IPv6 gateway · 1550c171
      David Ahern 提交于
      To allow the gateway to be either an IPv4 or IPv6 address, remove
      rt_uses_gateway from rtable and replace with rt_gw_family. If
      rt_gw_family is set it implies rt_uses_gateway. Rename rt_gateway
      to rt_gw4 to represent the IPv4 version.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1550c171
    • D
      net: Replace nhc_has_gw with nhc_gw_family · bdf00467
      David Ahern 提交于
      Allow the gateway in a fib_nh_common to be from a different address
      family than the outer fib{6}_nh. To that end, replace nhc_has_gw with
      nhc_gw_family and update users of nhc_has_gw to check nhc_gw_family.
      Now nhc_family is used to know if the nh_common is part of a fib_nh
      or fib6_nh (used for container_of to get to route family specific data),
      and nhc_gw_family represents the address family for the gateway.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bdf00467
  17. 04 4月, 2019 1 次提交
    • D
      ipv4: Add fib_nh_common to fib_result · eba618ab
      David Ahern 提交于
      Most of the ipv4 code only needs data from fib_nh_common. Add
      fib_nh_common selection to fib_result and update users to use it.
      
      Right now, fib_nh_common in fib_result will point to a fib_nh struct
      that is embedded within a fib_info:
      
              fib_info  --> fib_nh
                            fib_nh
                            ...
                            fib_nh
                              ^
          fib_result->nhc ----+
      
      Later, nhc can point to a fib_nh within a nexthop struct:
      
              fib_info --> nexthop --> fib_nh
                                         ^
          fib_result->nhc ---------------+
      
      or for a nexthop group:
      
              fib_info --> nexthop --> nexthop --> fib_nh
                                       nexthop --> fib_nh
                                       ...
                                       nexthop --> fib_nh
                                                     ^
          fib_result->nhc ---------------------------+
      
      In all cases nhsel within fib_result will point to which leg in the
      multipath route is used.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eba618ab
  18. 30 3月, 2019 1 次提交
    • D
      ipv4: Rename fib_nh entries · b75ed8b1
      David Ahern 提交于
      Rename fib_nh entries that will be moved to a fib_nh_common struct.
      Specifically, the device, oif, gateway, flags, scope, lwtstate,
      nh_weight and nh_upper_bound are common with all nexthop definitions.
      In the process shorten fib_nh_lwtstate to fib_nh_lws to avoid really
      long lines.
      
      Rename only; no functional change intended.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b75ed8b1
  19. 28 3月, 2019 1 次提交
  20. 22 3月, 2019 1 次提交
  21. 09 3月, 2019 1 次提交
    • X
      route: set the deleted fnhe fnhe_daddr to 0 in ip_del_fnhe to fix a race · ee60ad21
      Xin Long 提交于
      The race occurs in __mkroute_output() when 2 threads lookup a dst:
      
        CPU A                 CPU B
        find_exception()
                              find_exception() [fnhe expires]
                              ip_del_fnhe() [fnhe is deleted]
        rt_bind_exception()
      
      In rt_bind_exception() it will bind a deleted fnhe with the new dst, and
      this dst will get no chance to be freed. It causes a dev defcnt leak and
      consecutive dmesg warnings:
      
        unregister_netdevice: waiting for ethX to become free. Usage count = 1
      
      Especially thanks Jon to identify the issue.
      
      This patch fixes it by setting fnhe_daddr to 0 in ip_del_fnhe() to stop
      binding the deleted fnhe with a new dst when checking fnhe's fnhe_daddr
      and daddr in rt_bind_exception().
      
      It works as both ip_del_fnhe() and rt_bind_exception() are protected by
      fnhe_lock and the fhne is freed by kfree_rcu().
      
      Fixes: deed49df ("route: check and remove route cache when we get route")
      Signed-off-by: NJon Maxwell <jmaxwell37@gmail.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee60ad21
  22. 07 3月, 2019 1 次提交
  23. 02 3月, 2019 2 次提交
  24. 28 2月, 2019 1 次提交
  25. 09 2月, 2019 1 次提交
    • L
      net: ipv4: use a dedicated counter for icmp_v4 redirect packets · c09551c6
      Lorenzo Bianconi 提交于
      According to the algorithm described in the comment block at the
      beginning of ip_rt_send_redirect, the host should try to send
      'ip_rt_redirect_number' ICMP redirect packets with an exponential
      backoff and then stop sending them at all assuming that the destination
      ignores redirects.
      If the device has previously sent some ICMP error packets that are
      rate-limited (e.g TTL expired) and continues to receive traffic,
      the redirect packets will never be transmitted. This happens since
      peer->rate_tokens will be typically greater than 'ip_rt_redirect_number'
      and so it will never be reset even if the redirect silence timeout
      (ip_rt_redirect_silence) has elapsed without receiving any packet
      requiring redirects.
      
      Fix it by using a dedicated counter for the number of ICMP redirect
      packets that has been sent by the host
      
      I have not been able to identify a given commit that introduced the
      issue since ip_rt_send_redirect implements the same rate-limiting
      algorithm from commit 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NLorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c09551c6
  26. 04 2月, 2019 1 次提交
  27. 20 1月, 2019 1 次提交
  28. 21 12月, 2018 1 次提交
    • I
      net: ipv4: Set skb->dev for output route resolution · 21f94775
      Ido Schimmel 提交于
      When user requests to resolve an output route, the kernel synthesizes
      an skb where the relevant parameters (e.g., source address) are set. The
      skb is then passed to ip_route_output_key_hash_rcu() which might call
      into the flow dissector in case a multipath route was hit and a nexthop
      needs to be selected based on the multipath hash.
      
      Since both 'skb->dev' and 'skb->sk' are not set, a warning is triggered
      in the flow dissector [1]. The warning is there to prevent codepaths
      from silently falling back to the standard flow dissector instead of the
      BPF one.
      
      Therefore, instead of removing the warning, set 'skb->dev' to the
      loopback device, as its not used for anything but resolving the correct
      namespace.
      
      [1]
      WARNING: CPU: 1 PID: 24819 at net/core/flow_dissector.c:764 __skb_flow_dissect+0x314/0x16b0
      ...
      RSP: 0018:ffffa0df41fdf650 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: ffff8bcded232000 RCX: 0000000000000000
      RDX: ffffa0df41fdf7e0 RSI: ffffffff98e415a0 RDI: ffff8bcded232000
      RBP: ffffa0df41fdf760 R08: 0000000000000000 R09: 0000000000000000
      R10: ffffa0df41fdf7e8 R11: ffff8bcdf27a3000 R12: ffffffff98e415a0
      R13: ffffa0df41fdf7e0 R14: ffffffff98dd2980 R15: ffffa0df41fdf7e0
      FS:  00007f46f6897680(0000) GS:ffff8bcdf7a80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000055933e95f9a0 CR3: 000000021e636000 CR4: 00000000001006e0
      Call Trace:
       fib_multipath_hash+0x28c/0x2d0
       ? fib_multipath_hash+0x28c/0x2d0
       fib_select_path+0x241/0x32f
       ? __fib_lookup+0x6a/0xb0
       ip_route_output_key_hash_rcu+0x650/0xa30
       ? __alloc_skb+0x9b/0x1d0
       inet_rtm_getroute+0x3f7/0xb80
       ? __alloc_pages_nodemask+0x11c/0x2c0
       rtnetlink_rcv_msg+0x1d9/0x2f0
       ? rtnl_calcit.isra.24+0x120/0x120
       netlink_rcv_skb+0x54/0x130
       rtnetlink_rcv+0x15/0x20
       netlink_unicast+0x20a/0x2c0
       netlink_sendmsg+0x2d1/0x3d0
       sock_sendmsg+0x39/0x50
       ___sys_sendmsg+0x2a0/0x2f0
       ? filemap_map_pages+0x16b/0x360
       ? __handle_mm_fault+0x108e/0x13d0
       __sys_sendmsg+0x63/0xa0
       ? __sys_sendmsg+0x63/0xa0
       __x64_sys_sendmsg+0x1f/0x30
       do_syscall_64+0x5a/0x120
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: d0e13a14 ("flow_dissector: lookup netns by skb->sk if skb->dev is NULL")
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      21f94775
  29. 21 11月, 2018 1 次提交
  30. 11 10月, 2018 1 次提交
    • S
      net: ipv4: don't let PMTU updates increase route MTU · 28d35bcd
      Sabrina Dubroca 提交于
      When an MTU update with PMTU smaller than net.ipv4.route.min_pmtu is
      received, we must clamp its value. However, we can receive a PMTU
      exception with PMTU < old_mtu < ip_rt_min_pmtu, which would lead to an
      increase in PMTU.
      
      To fix this, take the smallest of the old MTU and ip_rt_min_pmtu.
      
      Before this patch, in case of an update, the exception's MTU would
      always change. Now, an exception can have only its lock flag updated,
      but not the MTU, so we need to add a check on locking to the following
      "is this exception getting updated, or close to expiring?" test.
      
      Fixes: d52e5a7e ("ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      28d35bcd