1. 05 May, 2022 1 commit
    • V
      memcg: accounting for objects allocated for new netdevice · 425b9c7f
      Vasily Averin authored
      Creating a new netdevice allocates at least ~50Kb of memory for various
      kernel objects, but only ~5Kb of it is accounted to memcg. As a result,
      creating an unlimited number of netdevices inside a memcg-limited container
      escapes memcg restrictions, consumes a significant part of the host's
      memory, and can cause a global OOM that leads to random kills of host
      processes.
      
      The main consumers of non-accounted memory are:
       ~10Kb   80+ kernfs nodes
       ~6Kb    ipv6_add_dev() allocations
        6Kb    __register_sysctl_table() allocations
        4Kb    neigh_sysctl_register() allocations
        4Kb    __devinet_sysctl_register() allocations
        4Kb    __addrconf_sysctl_register() allocations
      
      Accounting for these objects raises the share of memcg-accounted memory
      to 60-70% (~38Kb accounted vs ~54Kb total for a dummy netdevice on a
      typical VM with the default Fedora 35 kernel), which should be enough
      to give the host some protection from misuse inside a container.
      
      Other related objects are quite small and can be left unaccounted
      to minimize the expected performance degradation.
      
      The ~300 bytes of percpu allocation of struct ipstats_mib in
      snmp6_alloc_dev() deserve a separate mention: on huge multi-CPU nodes
      it can become the main consumer of memory.
      
      This patch does not enable kernfs accounting, as that affects
      other parts of the kernel and should be discussed separately.
      However, even without kernfs, this patch significantly improves the
      current situation and accounts for more than half of all netdevice
      allocations.
      Signed-off-by: Vasily Averin <vvs@openvz.org>
      Acked-by: Luis Chamberlain <mcgrof@kernel.org>
      Link: https://lore.kernel.org/r/354a0a5f-9ec3-a25c-3215-304eab2157bc@openvz.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      425b9c7f
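As a rough worked example of the percpu remark in the message above: the ~300 bytes quoted for struct ipstats_mib are allocated per CPU, so the total grows linearly with the CPU count. The helper names and the CPU counts used here are illustrative assumptions; only the ~300 byte and ~54Kb figures come from the commit message.

```c
#include <stddef.h>

/* Approximate figures quoted in the commit message. */
#define IPSTATS_MIB_PERCPU_BYTES 300u          /* ~300 bytes per CPU */
#define NETDEV_TOTAL_BYTES       (54u * 1024u) /* ~54Kb per dummy netdevice */

/* Hypothetical helper: total percpu footprint for one netdevice. */
static size_t ipstats_percpu_total(unsigned int nr_cpus)
{
	return (size_t)IPSTATS_MIB_PERCPU_BYTES * nr_cpus;
}

/* Returns 1 when the percpu part alone exceeds the ~54Kb figure. */
static int percpu_dominates(unsigned int nr_cpus)
{
	return ipstats_percpu_total(nr_cpus) > NETDEV_TOTAL_BYTES;
}
```

Under these assumptions an 8-CPU machine spends about 2.4Kb on this allocation, while a 256-CPU machine spends about 75Kb, which is more than the ~54Kb quoted for everything else.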
  2. 26 Feb, 2022 1 commit
  3. 03 Feb, 2022 1 commit
    • D
      net, neigh: Do not trigger immediate probes on NUD_FAILED from neigh_managed_work · 4a81f6da
      Daniel Borkmann authored
      syzkaller was able to trigger a deadlock for NTF_MANAGED entries [0]:
      
        kworker/0:16/14617 is trying to acquire lock:
        ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: ___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652
        [...]
        but task is already holding lock:
        ffffffff8d4dd370 (&tbl->lock){++-.}-{2:2}, at: neigh_managed_work+0x35/0x250 net/core/neighbour.c:1572
      
      The neighbor entry transitioned to the NUD_FAILED state, where
      __neigh_event_send() triggered an immediate probe via neigh_probe(), as
      per commit cd28ca0a ("neigh: reduce arp latency"), while the table lock
      was held.
      
      One option to fix this situation is to defer the neigh_probe() back to
      the neigh_timer_handler(), as was done before cd28ca0a. For the case
      of NTF_MANAGED, this deferral is acceptable given that it only happens in
      the actual failure state, while the regular / expected state is NUD_VALID
      with the entry already present.
      
      The fix adds a parameter to __neigh_event_send() in order to communicate
      whether immediate probe is allowed or disallowed. Existing call-sites
      of neigh_event_send() default as-is to immediate probe. However, the
      neigh_managed_work() disables it via use of neigh_event_send_probe().
      
      [0] <TASK>
        __dump_stack lib/dump_stack.c:88 [inline]
        dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
        print_deadlock_bug kernel/locking/lockdep.c:2956 [inline]
        check_deadlock kernel/locking/lockdep.c:2999 [inline]
        validate_chain kernel/locking/lockdep.c:3788 [inline]
        __lock_acquire.cold+0x149/0x3ab kernel/locking/lockdep.c:5027
        lock_acquire kernel/locking/lockdep.c:5639 [inline]
        lock_acquire+0x1ab/0x510 kernel/locking/lockdep.c:5604
        __raw_write_lock_bh include/linux/rwlock_api_smp.h:202 [inline]
        _raw_write_lock_bh+0x2f/0x40 kernel/locking/spinlock.c:334
        ___neigh_create+0x9e1/0x2990 net/core/neighbour.c:652
        ip6_finish_output2+0x1070/0x14f0 net/ipv6/ip6_output.c:123
        __ip6_finish_output net/ipv6/ip6_output.c:191 [inline]
        __ip6_finish_output+0x61e/0xe90 net/ipv6/ip6_output.c:170
        ip6_finish_output+0x32/0x200 net/ipv6/ip6_output.c:201
        NF_HOOK_COND include/linux/netfilter.h:296 [inline]
        ip6_output+0x1e4/0x530 net/ipv6/ip6_output.c:224
        dst_output include/net/dst.h:451 [inline]
        NF_HOOK include/linux/netfilter.h:307 [inline]
        ndisc_send_skb+0xa99/0x17f0 net/ipv6/ndisc.c:508
        ndisc_send_ns+0x3a9/0x840 net/ipv6/ndisc.c:650
        ndisc_solicit+0x2cd/0x4f0 net/ipv6/ndisc.c:742
        neigh_probe+0xc2/0x110 net/core/neighbour.c:1040
        __neigh_event_send+0x37d/0x1570 net/core/neighbour.c:1201
        neigh_event_send include/net/neighbour.h:470 [inline]
        neigh_managed_work+0x162/0x250 net/core/neighbour.c:1574
        process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
        worker_thread+0x657/0x1110 kernel/workqueue.c:2454
        kthread+0x2e9/0x3a0 kernel/kthread.c:377
        ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
        </TASK>
      
      Fixes: 7482e384 ("net, neigh: Add NTF_MANAGED flag for managed neighbor entries")
      Reported-by: syzbot+5239d0e1778a500d477a@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Roopa Prabhu <roopa@nvidia.com>
      Tested-by: syzbot+5239d0e1778a500d477a@syzkaller.appspotmail.com
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20220201193942.5055-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      4a81f6da
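The idea of passing an "immediate probe allowed" parameter, as described in the message above, can be sketched with a toy userspace model. All types and function names below are hypothetical stand-ins and not the kernel's code; the model only records whether a probe would have tried to re-take a lock the caller already holds.

```c
#include <stdbool.h>

/* Toy model only: these types and names are illustrative, not kernel code. */
struct toy_neigh {
	bool tbl_lock_held;   /* caller already holds the table lock */
	bool probe_deferred;  /* probe left for the timer handler */
	bool would_deadlock;  /* probe path tried to re-take the held lock */
};

/* Stand-in for the probe path, which may need the table lock again. */
static void toy_probe(struct toy_neigh *n)
{
	if (n->tbl_lock_held)
		n->would_deadlock = true;
}

/* Stand-in for the event-send helper with the extra parameter. */
static void toy_event_send_probe(struct toy_neigh *n, bool immediate_ok)
{
	if (immediate_ok)
		toy_probe(n);
	else
		n->probe_deferred = true;
}

/* Existing callers keep the immediate-probe behaviour by default. */
static void toy_event_send(struct toy_neigh *n)
{
	toy_event_send_probe(n, true);
}
```

In this sketch a caller that holds the lock, as neigh_managed_work() does in the report, would pass `false` and leave the probe to a later timer run, while every other caller goes through the unchanged wrapper.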
  4. 22 Jan, 2022 1 commit
  5. 12 Dec, 2021 1 commit
    • X
      net: Enable neighbor sysctls that is save for userns root · 8c8b7aa7
      xu xin authored
      Inside a netns owned by a non-init userns, the ARP/neighbor sysctls are
      currently neither visible nor configurable.

      Modifying the attributes these sysctls correspond to affects networking
      performance (ARP especially) only within the scope of that netns, and
      does not affect other netns.

      Moreover, some tools can already modify these attributes via netlink;
      iproute2 is one example:
      
      $ unshare -ur -n
      $ cat /proc/sys/net/ipv4/neigh/lo/retrans_time
      cat: can't open '/proc/sys/net/ipv4/neigh/lo/retrans_time': No such file
      or directory
      $ ip ntable show dev lo
      inet arp_cache
          dev lo
          refcnt 1 reachable 19494 base_reachable 30000 retrans 1000
          gc_stale 60000 delay_probe 5000 queue 101
          app_probes 0 ucast_probes 3 mcast_probes 3
          anycast_delay 1000 proxy_delay 800 proxy_queue 64 locktime 1000
      
      inet6 ndisc_cache
          dev lo
          refcnt 1 reachable 42394 base_reachable 30000 retrans 1000
          gc_stale 60000 delay_probe 5000 queue 101
          app_probes 0 ucast_probes 3 mcast_probes 3
          anycast_delay 1000 proxy_delay 800 proxy_queue 64 locktime 0
      $ ip ntable change name arp_cache dev <if> retrans 2000
      inet arp_cache
          dev lo
          refcnt 1 reachable 22917 base_reachable 30000 retrans 2000
          gc_stale 60000 delay_probe 5000 queue 101
          app_probes 0 ucast_probes 3 mcast_probes 3
          anycast_delay 1000 proxy_delay 800 proxy_queue 64 locktime 1000
      
      inet6 ndisc_cache
          dev lo
          refcnt 1 reachable 35524 base_reachable 30000 retrans 1000
          gc_stale 60000 delay_probe 5000 queue 101
          app_probes 0 ucast_probes 3 mcast_probes 3
          anycast_delay 1000 proxy_delay 800 proxy_queue 64 locktime 0
      Reported-by: Zeal Robot <zealci@zte.com.cn>
      Signed-off-by: xu xin <xu.xin16@zte.com.cn>
      Acked-by: Joanne Koong <joannekoong@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8c8b7aa7
  6. 09 Dec, 2021 1 commit
    • E
      net, neigh: clear whole pneigh_entry at alloc time · e195e9b5
      Eric Dumazet authored
      Commit 2c611ad9 ("net, neigh: Extend neigh->flags to 32 bit
      to allow for extensions") enables a new KMSAN warning [1].

      I think the bug is actually older, because the following instruction
      only occurred if ndm->ndm_flags had NTF_PROXY set.
      
      	pn->flags = ndm->ndm_flags;
      
      Let's clear all pneigh_entry fields at alloc time.
      
      [1]
      BUG: KMSAN: uninit-value in pneigh_fill_info+0x986/0xb30 net/core/neighbour.c:2593
       pneigh_fill_info+0x986/0xb30 net/core/neighbour.c:2593
       pneigh_dump_table net/core/neighbour.c:2715 [inline]
       neigh_dump_info+0x1e3f/0x2c60 net/core/neighbour.c:2832
       netlink_dump+0xaca/0x16a0 net/netlink/af_netlink.c:2265
       __netlink_dump_start+0xd1c/0xee0 net/netlink/af_netlink.c:2370
       netlink_dump_start include/linux/netlink.h:254 [inline]
       rtnetlink_rcv_msg+0x181b/0x18c0 net/core/rtnetlink.c:5534
       netlink_rcv_skb+0x447/0x800 net/netlink/af_netlink.c:2491
       rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5589
       netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
       netlink_unicast+0x1095/0x1360 net/netlink/af_netlink.c:1345
       netlink_sendmsg+0x16f3/0x1870 net/netlink/af_netlink.c:1916
       sock_sendmsg_nosec net/socket.c:704 [inline]
       sock_sendmsg net/socket.c:724 [inline]
       sock_write_iter+0x594/0x690 net/socket.c:1057
       call_write_iter include/linux/fs.h:2162 [inline]
       new_sync_write fs/read_write.c:503 [inline]
       vfs_write+0x1318/0x2030 fs/read_write.c:590
       ksys_write+0x28c/0x520 fs/read_write.c:643
       __do_sys_write fs/read_write.c:655 [inline]
       __se_sys_write fs/read_write.c:652 [inline]
       __x64_sys_write+0xdb/0x120 fs/read_write.c:652
       do_syscall_x64 arch/x86/entry/common.c:51 [inline]
       do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Uninit was created at:
       slab_post_alloc_hook mm/slab.h:524 [inline]
       slab_alloc_node mm/slub.c:3251 [inline]
       slab_alloc mm/slub.c:3259 [inline]
       __kmalloc+0xc3c/0x12d0 mm/slub.c:4437
       kmalloc include/linux/slab.h:595 [inline]
       pneigh_lookup+0x60f/0xd70 net/core/neighbour.c:766
       arp_req_set_public net/ipv4/arp.c:1016 [inline]
       arp_req_set+0x430/0x10a0 net/ipv4/arp.c:1032
       arp_ioctl+0x8d4/0xb60 net/ipv4/arp.c:1232
       inet_ioctl+0x4ef/0x820 net/ipv4/af_inet.c:947
       sock_do_ioctl net/socket.c:1118 [inline]
       sock_ioctl+0xa3f/0x13e0 net/socket.c:1235
       vfs_ioctl fs/ioctl.c:51 [inline]
       __do_sys_ioctl fs/ioctl.c:874 [inline]
       __se_sys_ioctl+0x2df/0x4a0 fs/ioctl.c:860
       __x64_sys_ioctl+0xd8/0x110 fs/ioctl.c:860
       do_syscall_x64 arch/x86/entry/common.c:51 [inline]
       do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      CPU: 1 PID: 20001 Comm: syz-executor.0 Not tainted 5.16.0-rc3-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 62dd9318 ("[IPV6] NDISC: Set per-entry is_router flag in Proxy NA.")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Roopa Prabhu <roopa@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20211206165329.1049835-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e195e9b5
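Clearing every field at allocation time, as the message above proposes, can be illustrated in userspace with a zero-filling allocator. This is a minimal sketch under assumptions: the struct is a simplified, hypothetical stand-in for pneigh_entry, and calloc() plays the role that a zeroing allocator would play in the kernel.

```c
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for struct pneigh_entry; the field set is illustrative. */
struct toy_pneigh {
	unsigned int  flags;
	unsigned char protocol;
	unsigned char key[];   /* flexible array member holding the address */
};

/* Allocate an entry with every field zeroed, then copy the key in. */
static struct toy_pneigh *toy_pneigh_alloc(const void *key, size_t key_len)
{
	/* calloc() zero-fills the whole object, including fields set later. */
	struct toy_pneigh *n = calloc(1, sizeof(*n) + key_len);

	if (!n)
		return NULL;
	memcpy(n->key, key, key_len);
	return n;
}
```

With a zeroing allocation, a field such as `flags` has a defined value even on code paths that never assign it, which is the kind of read the KMSAN report flags.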
  7. 07 Dec, 2021 3 commits
  8. 22 Nov, 2021 1 commit
    • D
      net, neigh: Fix crash in v6 module initialization error path · 4177d5b0
      Daniel Borkmann authored
      When the IPv6 module is initialized but hits an error in inet6_init()
      and then needs to undo all the prior initialization work, it may also
      call ndisc_cleanup(), which in turn calls neigh_table_clear(). The latter
      is missing a timer cancellation of the table's managed_work item.
      
      The kernel test robot explicitly triggered this error path and caused a UAF
      crash similar to the below:
      
        [...]
        [   28.833183][    C0] BUG: unable to handle page fault for address: f7a43288
        [   28.833973][    C0] #PF: supervisor write access in kernel mode
        [   28.834660][    C0] #PF: error_code(0x0002) - not-present page
        [   28.835319][    C0] *pde = 06b2c067 *pte = 00000000
        [   28.835853][    C0] Oops: 0002 [#1] PREEMPT
        [   28.836367][    C0] CPU: 0 PID: 303 Comm: sed Not tainted 5.16.0-rc1-00233-g83ff5faa0d3b #7
        [   28.837293][    C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014
        [   28.838338][    C0] EIP: __run_timers.constprop.0+0x82/0x440
        [...]
        [   28.845607][    C0] Call Trace:
        [   28.845942][    C0]  <SOFTIRQ>
        [   28.846333][    C0]  ? check_preemption_disabled.isra.0+0x2a/0x80
        [   28.846975][    C0]  ? __this_cpu_preempt_check+0x8/0xa
        [   28.847570][    C0]  run_timer_softirq+0xd/0x40
        [   28.848050][    C0]  __do_softirq+0xf5/0x576
        [   28.848547][    C0]  ? __softirqentry_text_start+0x10/0x10
        [   28.849127][    C0]  do_softirq_own_stack+0x2b/0x40
        [   28.849749][    C0]  </SOFTIRQ>
        [   28.850087][    C0]  irq_exit_rcu+0x7d/0xc0
        [   28.850587][    C0]  common_interrupt+0x2a/0x40
        [   28.851068][    C0]  asm_common_interrupt+0x119/0x120
        [...]
      
      Note that IPv6 module cannot be unloaded as per 8ce44061 ("ipv6: do not
      allow ipv6 module to be removed") hence this can only be seen during module
      initialization error. Tested with kernel test robot's reproducer.
      
      Fixes: 7482e384 ("net, neigh: Add NTF_MANAGED flag for managed neighbor entries")
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Li Zhijian <zhijianx.li@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4177d5b0
  9. 15 Oct, 2021 3 commits
  10. 12 Oct, 2021 4 commits
    • D
      net, neigh: Add NTF_MANAGED flag for managed neighbor entries · 7482e384
      Daniel Borkmann authored
      Allow a user space control plane to insert entries with a new NTF_EXT_MANAGED
      flag. The flag then indicates to the kernel that the neighbor entry should be
      periodically probed for keeping the entry in NUD_REACHABLE state iff possible.
      
      The use case for this is targeting XDP or tc BPF load-balancers which use
      the bpf_fib_lookup() BPF helper in order to piggyback on neighbor resolution
      for their backends. Given they cannot be resolved in fast-path, a control
      plane inserts the L3 (without L2) entries manually into the neighbor table
      and lets the kernel do the neighbor resolution either on the gateway or on
      the backend directly in case the latter resides in the same L2. This avoids
      having to deal with L2 in the control plane and rebuilding what the kernel
      already does best anyway.
      
      NTF_EXT_MANAGED can be combined with NTF_EXT_LEARNED in order to avoid GC
      eviction. The kernel then adds NTF_MANAGED flagged entries to a per-neighbor
      table which gets triggered by the system work queue to periodically call
      neigh_event_send() for performing the resolution. The implementation allows
      migration from/to NTF_MANAGED neighbor entries, so that already existing
      entries can be converted by the control plane if needed. Potentially, we could
      make the interval for periodically calling neigh_event_send() configurable;
      right now it's set to DELAY_PROBE_TIME which is also in line with mlxsw which
      has similar driver-internal infrastructure c723c735 ("mlxsw: spectrum_router:
      Periodically update the kernel's neigh table"). In future, the latter could
      possibly reuse the NTF_MANAGED neighbors as well.
      
      Example:
      
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 managed extern_learn
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a managed extern_learn REACHABLE
        [...]
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Roopa Prabhu <roopa@nvidia.com>
      Link: https://linuxplumbersconf.org/event/11/contributions/953/
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7482e384
    • R
      net, neigh: Extend neigh->flags to 32 bit to allow for extensions · 2c611ad9
      Roopa Prabhu authored
      Currently, all bits in struct ndmsg's ndm_flags are used up with the most
      recent addition of 435f2e7c ("net: bridge: add support for sticky fdb
      entries"). This makes it impossible to extend the neighboring subsystem
      with new NTF_* flags:
      
        struct ndmsg {
          __u8   ndm_family;
          __u8   ndm_pad1;
          __u16  ndm_pad2;
          __s32  ndm_ifindex;
          __u16  ndm_state;
          __u8   ndm_flags;
          __u8   ndm_type;
        };
      
      There are ndm_pad{1,2} attributes which are not used. However, due to
      careless design, the kernel does not enforce that they are zero when a new
      neighbor entry is added, and given they've been around forever, it is
      not possible to reuse them today due to the risk of breakage. One option to
      overcome this limitation is to add a new NDA_FLAGS_EXT attribute for
      extended flags.
      
      In struct neighbour, there is a 3 byte hole between protocol and ha_lock,
      which allows neigh->flags to be extended from 8 to 32 bits while still
      being on the same cacheline as before. This also allows for all future
      NTF_* flags being in neigh->flags rather than yet another flags field.
      Unknown flags in NDA_FLAGS_EXT will be rejected by the kernel.
      Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2c611ad9
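One way to picture how an 8-bit ndm_flags and a separate extended-flags attribute could end up in a single 32-bit field is the sketch below. The shift value, the set of accepted extended bits and the function name are assumptions made for illustration; the kernel headers are the authority on the real constants.

```c
#include <stdint.h>

/* Assumed values for illustration; check the kernel headers for the real ones. */
#define TOY_NTF_EXT_SHIFT 8     /* extended flags sit above the 8 legacy bits */
#define TOY_NTF_EXT_KNOWN 0x1u  /* set of extended bits this sketch accepts */

/*
 * Merge the legacy 8-bit flags with a 32-bit extended attribute.
 * Returns 0 on success, -1 if unknown extended bits are set.
 * *out is left untouched on failure.
 */
static int toy_merge_flags(uint8_t ndm_flags, uint32_t ext, uint32_t *out)
{
	if (ext & ~TOY_NTF_EXT_KNOWN)
		return -1;
	*out = (uint32_t)ndm_flags | (ext << TOY_NTF_EXT_SHIFT);
	return 0;
}
```

Rejecting unknown bits up front mirrors the statement that unknown flags in NDA_FLAGS_EXT are refused, which keeps the remaining bits free for future use.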
    • D
      net, neigh: Enable state migration between NUD_PERMANENT and NTF_USE · 3dc20f47
      Daniel Borkmann authored
      Currently, it is not possible to migrate a neighbor entry between NUD_PERMANENT
      state and NTF_USE flag with a dynamic NUD state from a user space control plane.
      Similarly, it is not possible to add/remove NTF_EXT_LEARNED flag from an existing
      neighbor entry in combination with NTF_USE flag.
      
      This is due to the latter directly calling into neigh_event_send() without any
      of the metadata updates that happen in __neigh_update(). Thus, to enable this
      use case, extend __neigh_update() with a NEIGH_UPDATE_F_USE flag where we break
      the NUD_PERMANENT state in particular, so that a later neigh_event_send() is
      able to re-resolve a neighbor entry.
      
      Before fix, NUD_PERMANENT -> NUD_* & NTF_USE:
      
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT
        [...]
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT
        [...]
      
      As can be seen, despite the admin-triggered replace, the entry remains in the
      NUD_PERMANENT state.
      
      After fix, NUD_PERMANENT -> NUD_* & NTF_USE:
      
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT
        [...]
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn REACHABLE
        [...]
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn STALE
        [...]
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a PERMANENT
        [...]
      
      After the fix, the admin-triggered replace switches to a dynamic state from
      the NTF_USE flag which triggered a new neighbor resolution. Likewise, we can
      transition back from there, if needed, into NUD_PERMANENT.
      
      Similar before/after behavior can be observed for below transitions:
      
      Before fix, NTF_USE -> NTF_USE | NTF_EXT_LEARNED -> NTF_USE:
      
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 use
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE
        [...]
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE
        [...]
      
      After fix, NTF_USE -> NTF_USE | NTF_EXT_LEARNED -> NTF_USE:
      
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 use
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE
        [...]
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn REACHABLE
        [...]
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 use
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE
        [...]
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3dc20f47
    • D
      net, neigh: Fix NTF_EXT_LEARNED in combination with NTF_USE · e4400bbf
      Daniel Borkmann authored
      The NTF_EXT_LEARNED neigh flag is usually propagated back to user space
      upon dump of the neighbor table. However, when used in combination with
      NTF_USE flag this is not the case despite exempting the entry from the
      garbage collector. This results in inconsistent state since entries are
      typically marked in neigh->flags with NTF_EXT_LEARNED, but here they are
      not. Fix it by propagating the creation flag to ___neigh_create().
      
      Before fix:
      
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a REACHABLE
        [...]
      
      After fix:
      
        # ./ip/ip n replace 192.168.178.30 dev enp5s0 use extern_learn
        # ./ip/ip n
        192.168.178.30 dev enp5s0 lladdr f4:8c:50:5e:71:9a extern_learn REACHABLE
        [...]
      
      Fixes: 9ce33e46 ("neighbour: support for NTF_EXT_LEARNED flag")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e4400bbf
  11. 11 Aug, 2021 1 commit
  12. 05 Aug, 2021 1 commit
  13. 03 Aug, 2021 1 commit
  14. 08 Jun, 2021 1 commit
  15. 11 May, 2021 1 commit
  16. 22 Apr, 2021 1 commit
    • C
      neighbour: Prevent Race condition in neighbour subsytem · eefb45ee
      Chinmay Agarwal authored
      The following race condition was detected:

      <CPU A, t0> - Executing: __netif_receive_skb() -> __netif_receive_skb_core()
      -> arp_rcv() -> arp_process(). arp_process() calls __neigh_lookup(), which
      takes a reference on neighbour entry 'n'.
      Moving further along, arp_process() calls neigh_update() ->
      __neigh_update(). The neighbour entry is unlocked just before the call to
      neigh_update_gc_list().

      This unlocking paves the way for another thread to take a reference on
      the same entry, mark it dead and remove it from gc_list.

      <CPU B, t1> - neigh_flush_dev() is executing and calls
      neigh_mark_dead(n), marking the neighbour entry 'n' as dead. 'n' is also
      removed from gc_list.
      Moving further along, neigh_flush_dev() calls
      neigh_cleanup_and_release(n), but since the reference count was increased
      at t0, 'n' cannot be destroyed.

      <CPU A, t3> - The code reaches neigh_update_gc_list() with the neighbour
      entry marked as dead.

      <CPU A, t4> - arp_process() finally calls neigh_release(n), destroying
      the neighbour entry, and we are left with a destroyed entry that is still
      part of gc_list.
      
      Fixes: eb4e8fac ("neighbour: Prevent a dead entry from updating gc_list")
      Signed-off-by: Chinmay Agarwal <chinagar@codeaurora.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eefb45ee
  17. 01 Apr, 2021 1 commit
    • T
      neighbour: Disregard DEAD dst in neigh_update · d47ec7a0
      Tong Zhu authored
      After a short network outage, the dst_entry is timed out and put
      in DST_OBSOLETE_DEAD. We are in this code because an ARP reply comes
      from this neighbour after the network recovers. There is a potential
      race condition in which the dst_entry is still in DST_OBSOLETE_DEAD.
      In that case, another neighbour lookup causes more harm than good.

      In the best case, all packets in arp_queue are lost. This is
      counterproductive to the original goal of finding a better path
      for those packets.

      I observed a worst case with a 4.x kernel where a dst_entry in
      DST_OBSOLETE_DEAD state is associated with the loopback net_device.
      It leads to an ethernet header with all zero addresses.
      A packet with all zero source MAC address is quite deadly with
      mac80211, ath9k and 802.11 block ack.  It fails
      ieee80211_find_sta_by_ifaddr in ath9k (xmit.c). Ath9k flushes tx
      queue (ath_tx_complete_aggr). BAW (block ack window) is not
      updated. BAW logic is damaged and ath9k transmission is disabled.
      Signed-off-by: Tong Zhu <zhutong@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d47ec7a0
  18. 31 Jan, 2021 1 commit
    • C
      neighbour: Prevent a dead entry from updating gc_list · eb4e8fac
      Chinmay Agarwal authored
      The following race condition was detected:

      <CPU A, t0> - neigh_flush_dev() is executing and calls
      neigh_mark_dead(n), marking the neighbour entry 'n' as dead.

      <CPU B, t1> - Executing: __netif_receive_skb() ->
      __netif_receive_skb_core() -> arp_rcv() -> arp_process(). arp_process()
      calls __neigh_lookup(), which takes a reference on neighbour entry 'n'.

      <CPU A, t2> - Moves further along neigh_flush_dev() and calls
      neigh_cleanup_and_release(n), but since the reference count was increased
      at t1, 'n' cannot be destroyed.

      <CPU B, t3> - Moves further along arp_process() and calls
      neigh_update() -> __neigh_update() -> neigh_update_gc_list(), which adds
      the neighbour entry back to gc_list (neigh_mark_dead() removed it
      from gc_list earlier, at t0).
      
      <CPU B, t4> - arp_process() finally calls neigh_release(n), destroying
      the neighbour entry.
      
      This leads to 'n' still being part of gc_list, but the actual
      neighbour structure has been freed.
      
      The situation can be prevented if a dead entry is never allowed to
      update gc_list. This is what the patch intends to achieve.
      
      Fixes: 9c29a2f5 ("neighbor: Fix locking order for gc_list changes")
      Signed-off-by: Chinmay Agarwal <chinagar@codeaurora.org>
      Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20210127165453.GA20514@chinagar-linux.qualcomm.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      eb4e8fac
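The rule stated in the message above, that a dead entry must never touch gc_list again, can be sketched as a guard at the top of the update path. This is a toy, single-threaded model with hypothetical names; the locking that the real code needs around these steps is deliberately left out.

```c
#include <stdbool.h>

/* Toy model of a neighbour entry; not the kernel's struct neighbour. */
struct toy_neigh_gc {
	bool dead;        /* set by the flush path when the entry is torn down */
	bool on_gc_list;  /* whether the entry is linked into gc_list */
};

/* Flush path: mark the entry dead and unlink it. */
static void toy_mark_dead(struct toy_neigh_gc *n)
{
	n->dead = true;
	n->on_gc_list = false;
}

/* Update path: a dead entry must never be linked or relinked. */
static void toy_update_gc_list(struct toy_neigh_gc *n, bool want_on_list)
{
	if (n->dead)
		return;
	n->on_gc_list = want_on_list;
}
```

Replaying the t0 to t3 sequence against this model, the update at t3 becomes a no-op, so the entry that is freed at t4 is no longer reachable through the list.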
  19. 16 Jan, 2021 1 commit
  20. 29 Dec, 2020 1 commit
  21. 14 Nov, 2020 1 commit
  22. 25 Jun, 2020 1 commit
  23. 30 May, 2020 1 commit
  24. 23 May, 2020 1 commit
    • R
      vxlan: ecmp support for mac fdb entries · 1274e1cc
      Roopa Prabhu authored
      Today's vxlan mac fdb entries can point to multiple remote
      ips (rdsts) with the sole purpose of replicating
      broadcast-multicast and unknown unicast packets to those remote ips.
      
      E-VPN multihoming [1,2,3] requires bridged vxlan traffic to be
      load balanced to remote switches (vteps) belonging to the
      same multi-homed ethernet segment (E-VPN multihoming is analogous
      to multi-homed LAG implementations, but with the inter-switch
      peerlink replaced with a vxlan tunnel). In other words it needs
      support for mac ecmp. Furthermore, for faster convergence, E-VPN
      multihoming needs the ability to update fdb ecmp nexthops independent
      of the fdb entries.
      
      The new route nexthop API is perfect for this use case.
      This patch extends the vxlan fdb code to take a nexthop id
      pointing to an ecmp nexthop group.
      
      Changes include:
      - New NDA_NH_ID attribute for fdbs
      - Use the newly added fdb nexthop groups
      - makes vxlan rdsts and nexthop handling code mutually
        exclusive
      - since this is a new use-case and the requirement is for ecmp
        nexthop groups, the fdb add and update path checks that the
        nexthop is really an ecmp nexthop group. This check can be relaxed
        in the future, if we want to introduce replication fdb nexthop groups
        and allow its use in lieu of current rdst lists.
      - fdb update requests with nexthop id's only allowed for existing
        fdb's that have nexthop id's
      - learning will not override an existing fdb entry with nexthop
        group
      - I have wrapped the switchdev offload code around the presence of
        rdst
      
      [1] E-VPN RFC https://tools.ietf.org/html/rfc7432
      [2] E-VPN with vxlan https://tools.ietf.org/html/rfc8365
      [3] http://vger.kernel.org/lpc_net2018_talks/scaling_bridge_fdb_database_slidesV3.pdf
      
      Includes a null check fix in vxlan_xmit from Nikolay
      
      v2 - Fixed build issue:
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1274e1cc
  25. 06 May, 2020 1 commit
  26. 27 Apr, 2020 1 commit
  27. 03 Apr, 2020 1 commit
    • H
      neigh: support smaller retrans_time settting · 19e16d22
      Hangbin Liu authored
      Currently, retrans_time is limited to at least HZ/2, i.e.
      setting retrans_time to less than 500ms does not work. This prevents
      users from getting more accurate control over bonding ARP fast failover.

      Update the sanity check to HZ/100, which is 10ms, to give users finer
      control over retrans_time.
      
      v3: sync the behavior with IPv6 and update all the timer handlers
      v2: use HZ instead of a hard-coded number
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      19e16d22
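The arithmetic behind the HZ/2 and HZ/100 bounds mentioned above can be checked with a small sketch. HZ is a kernel build-time constant, so the value 1000 and the helper names here are assumptions for illustration only.

```c
/* HZ is a kernel build-time constant; 1000 is assumed here for illustration. */
#define TOY_HZ 1000UL

/* Clamp a retransmit interval (in jiffies) to a lower bound. */
static unsigned long toy_clamp_retrans(unsigned long retrans, unsigned long floor)
{
	return retrans > floor ? retrans : floor;
}

/* Convert jiffies to milliseconds for the assumed TOY_HZ. */
static unsigned long toy_jiffies_to_ms(unsigned long j)
{
	return j * 1000UL / TOY_HZ;
}
```

With these assumptions, a requested interval of 100 jiffies (100ms) is raised to 500ms under the HZ/2 floor but passes through unchanged under the HZ/100 floor, which only lifts values below 10ms.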
  28. 21 Feb, 2020 1 commit
  29. 24 Jan, 2020 1 commit
  30. 10 Dec, 2019 1 commit
  31. 08 Nov, 2019 1 commit
    • E
      net: add annotations on hh->hh_len lockless accesses · c305c6ae
      Eric Dumazet authored
      KCSAN reported a data-race [1]
      
      While we can use READ_ONCE() on the read sides,
      we need to make sure hh->hh_len is written last.
      
      [1]
      
      BUG: KCSAN: data-race in eth_header_cache / neigh_resolve_output
      
      write to 0xffff8880b9dedcb8 of 4 bytes by task 29760 on cpu 0:
       eth_header_cache+0xa9/0xd0 net/ethernet/eth.c:247
       neigh_hh_init net/core/neighbour.c:1463 [inline]
       neigh_resolve_output net/core/neighbour.c:1480 [inline]
       neigh_resolve_output+0x415/0x470 net/core/neighbour.c:1470
       neigh_output include/net/neighbour.h:511 [inline]
       ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116
       __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
       __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
       ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
       dst_output include/net/dst.h:436 [inline]
       NF_HOOK include/linux/netfilter.h:305 [inline]
       ndisc_send_skb+0x459/0x5f0 net/ipv6/ndisc.c:505
       ndisc_send_ns+0x207/0x430 net/ipv6/ndisc.c:647
       rt6_probe_deferred+0x98/0xf0 net/ipv6/route.c:615
       process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
       worker_thread+0xa0/0x800 kernel/workqueue.c:2415
       kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
      
      read to 0xffff8880b9dedcb8 of 4 bytes by task 29572 on cpu 1:
       neigh_resolve_output net/core/neighbour.c:1479 [inline]
       neigh_resolve_output+0x113/0x470 net/core/neighbour.c:1470
       neigh_output include/net/neighbour.h:511 [inline]
       ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116
       __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
       __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
       ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
       dst_output include/net/dst.h:436 [inline]
       NF_HOOK include/linux/netfilter.h:305 [inline]
       ndisc_send_skb+0x459/0x5f0 net/ipv6/ndisc.c:505
       ndisc_send_ns+0x207/0x430 net/ipv6/ndisc.c:647
       rt6_probe_deferred+0x98/0xf0 net/ipv6/route.c:615
       process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
       worker_thread+0xa0/0x800 kernel/workqueue.c:2415
       kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 29572 Comm: kworker/1:4 Not tainted 5.4.0-rc6+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: events rt6_probe_deferred
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c305c6ae
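The "write the length last" requirement from the message above can be sketched in userspace with C11 atomics, used here as a stand-in for the kernel's own annotation macros. The struct layout and function names are hypothetical; the point is only the ordering between the data write and the length publish.

```c
#include <stdatomic.h>
#include <string.h>

#define TOY_HH_DATA_MAX 32

/* Toy header cache; the layout is illustrative, not struct hh_cache. */
struct toy_hh_cache {
	atomic_uint   hh_len;                   /* 0 means "not initialised yet" */
	unsigned char hh_data[TOY_HH_DATA_MAX];
};

/* Writer: fill the data first, publish the length last. */
static void toy_hh_publish(struct toy_hh_cache *hh, const void *hdr,
			   unsigned int len)
{
	if (len > TOY_HH_DATA_MAX)
		return;
	memcpy(hh->hh_data, hdr, len);
	/* Release ordering: the data above is visible before the length. */
	atomic_store_explicit(&hh->hh_len, len, memory_order_release);
}

/* Reader: a non-zero length implies the data written before it is visible. */
static unsigned int toy_hh_len(struct toy_hh_cache *hh)
{
	return atomic_load_explicit(&hh->hh_len, memory_order_acquire);
}
```

A reader that sees a length of zero falls back to the slow path, and a reader that sees a non-zero length can use the cached header without taking a lock.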
  32. 07 Nov, 2019 1 commit
  33. 28 Jul, 2019 1 commit