1. 26 Feb, 2022 2 commits
  2. 25 Feb, 2022 2 commits
  3. 23 Feb, 2022 1 commit
  4. 21 Feb, 2022 2 commits
    • ipv4: Invalidate neighbour for broadcast address upon address addition · 0c51e12e
      Committed by Ido Schimmel
      In case user space sends a packet destined to a broadcast address when a
      matching broadcast route is not configured, the kernel will create a
      unicast neighbour entry that will never be resolved [1].
      
      When the broadcast route is configured, the unicast neighbour entry will
      not be invalidated and continue to linger, resulting in packets being
      dropped.
      
      Solve this by invalidating unresolved neighbour entries for broadcast
      addresses after routes for these addresses are internally configured by
      the kernel. This allows the kernel to create a broadcast neighbour entry
      following the next route lookup.
      
      Another possible solution that is more generic but also more complex is
      to have the ARP code register a listener to the FIB notification chain
      and invalidate matching neighbour entries upon the addition of broadcast
      routes.
      
      It is also possible to wave off the issue as a user space problem, but
      it seems a bit excessive to expect user space to be that intimately
      familiar with the inner workings of the FIB/neighbour kernel code.
      
      [1] https://lore.kernel.org/netdev/55a04a8f-56f3-f73c-2aea-2195923f09d1@huawei.com/

      Reported-by: Wang Hai <wanghai38@huawei.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Tested-by: Wang Hai <wanghai38@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gso: do not skip outer ip header in case of ipip and net_failover · cc20cced
      Committed by Tao Liu
      We encountered a TCP drop issue in our cloud environment. A packet GROed
      in the host is forwarded to a VM virtio_net NIC with net_failover enabled.
      The VM acts as an IPVS LB with ipip encapsulation. The full path looks like:
      host gro -> vm virtio_net rx -> net_failover rx -> ipvs fullnat
       -> ipip encap -> net_failover tx -> virtio_net tx
      
      When net_failover transmits an ipip pkt (gso_type = 0x0103, which means
      SKB_GSO_TCPV4, SKB_GSO_DODGY and SKB_GSO_IPXIP4), no GSO is done
      because it supports TSO and GSO_IPXIP4. But network_header then points
      to the inner ip header.
      
      Call Trace:
       tcp4_gso_segment        ------> return NULL
       inet_gso_segment        ------> inner iph, network_header points to
       ipip_gso_segment
       inet_gso_segment        ------> outer iph
       skb_mac_gso_segment
      
      Afterwards, when virtio_net transmits the pkt, only the inner ip header is
      modified and the outer one is left unchanged. The pkt will be dropped by
      the remote host.
      
      Call Trace:
       inet_gso_segment        ------> inner iph, outer iph is skipped
       skb_mac_gso_segment
       __skb_gso_segment
       validate_xmit_skb
       validate_xmit_skb_list
       sch_direct_xmit
       __qdisc_run
       __dev_queue_xmit        ------> virtio_net
       dev_hard_start_xmit
       __dev_queue_xmit        ------> net_failover
       ip_finish_output2
       ip_output
       iptunnel_xmit
       ip_tunnel_xmit
       ipip_tunnel_xmit        ------> ipip
       dev_hard_start_xmit
       __dev_queue_xmit
       ip_finish_output2
       ip_output
       ip_forward
       ip_rcv
       __netif_receive_skb_one_core
       netif_receive_skb_internal
       napi_gro_receive
       receive_buf
       virtnet_poll
       net_rx_action
      
      The root cause of this issue is specific to the rare combination of
      SKB_GSO_DODGY and a tunnel device that adds an SKB_GSO_ tunnel option.
      SKB_GSO_DODGY is set from external virtio_net. We need to reset the
      network header when callbacks.gso_segment() returns NULL.
      
      This patch also includes ipv6_gso_segment(), considering SIT, etc.
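As a rough illustration of the gso_type value quoted above, the sketch below decodes 0x0103 into flag names. The three bit positions are written from memory of include/linux/skbuff.h around v5.17 and are an assumption to check against your own tree. gso_type_decode() and struct gso_flag are hypothetical helpers for this example, not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed bit values (from memory of include/linux/skbuff.h, ~v5.17). */
enum {
	SKB_GSO_TCPV4  = 1 << 0,
	SKB_GSO_DODGY  = 1 << 1,
	SKB_GSO_IPXIP4 = 1 << 8,
};

/* Hypothetical table entry pairing a flag bit with its printable name. */
struct gso_flag {
	unsigned int bit;
	const char *name;
};

static const struct gso_flag gso_flags[] = {
	{ SKB_GSO_TCPV4,  "SKB_GSO_TCPV4"  },
	{ SKB_GSO_DODGY,  "SKB_GSO_DODGY"  },
	{ SKB_GSO_IPXIP4, "SKB_GSO_IPXIP4" },
};

/*
 * Hypothetical helper: write the names of the known bits set in gso_type
 * into buf, separated by '|'. Returns the bits that were not recognised.
 */
static unsigned int gso_type_decode(unsigned int gso_type, char *buf, size_t len)
{
	size_t i;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(gso_flags) / sizeof(gso_flags[0]); i++) {
		if (!(gso_type & gso_flags[i].bit))
			continue;
		if (buf[0] != '\0')
			strncat(buf, "|", len - strlen(buf) - 1);
		strncat(buf, gso_flags[i].name, len - strlen(buf) - 1);
		gso_type &= ~gso_flags[i].bit;	/* mark this bit as explained */
	}
	return gso_type;
}
```

Under those assumed values, decoding 0x0103 leaves no unexplained bits, which matches the three flags named in the commit message.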
      
      Fixes: cb32f511 ("ipip: add GSO/TSO support")
      Signed-off-by: Tao Liu <thomas.liu@ucloud.cn>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 20 Feb, 2022 8 commits
  6. 19 Feb, 2022 1 commit
  7. 18 Feb, 2022 2 commits
    • net-timestamp: convert sk->sk_tskey to atomic_t · a1cdec57
      Committed by Eric Dumazet
      UDP sendmsg() can be lockless, this is causing all kinds
      of data races.
      
      This patch converts sk->sk_tskey to remove one of these races.
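The idea of the conversion can be sketched in user space with C11 atomics. This is only an analogy: the kernel uses its own atomic_t helpers, and struct fake_sock and tskey_next() below are hypothetical names invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the one struct sock field of interest. */
struct fake_sock {
	atomic_uint sk_tskey;
};

/*
 * Return the key for this datagram and advance the counter in one
 * indivisible step, so two lockless senders never observe the same key.
 * A plain "key = sk->sk_tskey++" would be a separate read and write,
 * which is the pattern a data-race detector flags.
 */
static unsigned int tskey_next(struct fake_sock *sk)
{
	assert(sk != NULL);
	return atomic_fetch_add_explicit(&sk->sk_tskey, 1, memory_order_relaxed);
}
```

Relaxed ordering is used here on the assumption that only the uniqueness of the key matters, not its ordering against other memory accesses.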
      
      BUG: KCSAN: data-race in __ip_append_data / __ip_append_data
      
      read to 0xffff8881035d4b6c of 4 bytes by task 8877 on cpu 1:
       __ip_append_data+0x1c1/0x1de0 net/ipv4/ip_output.c:994
       ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
       udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
       inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg net/socket.c:725 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
       ___sys_sendmsg net/socket.c:2467 [inline]
       __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
       __do_sys_sendmmsg net/socket.c:2582 [inline]
       __se_sys_sendmmsg net/socket.c:2579 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      write to 0xffff8881035d4b6c of 4 bytes by task 8880 on cpu 0:
       __ip_append_data+0x1d8/0x1de0 net/ipv4/ip_output.c:994
       ip_make_skb+0x13f/0x2d0 net/ipv4/ip_output.c:1636
       udp_sendmsg+0x12bd/0x14c0 net/ipv4/udp.c:1249
       inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg net/socket.c:725 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
       ___sys_sendmsg net/socket.c:2467 [inline]
       __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
       __do_sys_sendmmsg net/socket.c:2582 [inline]
       __se_sys_sendmmsg net/socket.c:2579 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x0000054d -> 0x0000054e
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 8880 Comm: syz-executor.5 Not tainted 5.17.0-rc2-syzkaller-00167-gdcb85f85-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 09c2d251 ("net-timestamp: add key to disambiguate concurrent datagrams")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: fix data races in fib_alias_hw_flags_set · 9fcf986c
      Committed by Eric Dumazet
      fib_alias_hw_flags_set() can be used by concurrent threads,
      and is only RCU protected.
      
      We need to annotate accesses to the following fields of struct fib_alias:

          offload, trap, offload_failed

      Because of READ_ONCE()/WRITE_ONCE() limitations, make these
      fields u8.
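A minimal user-space sketch of why whole-byte fields help: a bitfield member has no address, so it cannot be accessed through a volatile pointer the way READ_ONCE()/WRITE_ONCE() style macros do. The macros and the struct below are simplified, hypothetical approximations, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space approximations of the kernel macros, for illustration only. */
#define READ_ONCE_U8(x)     (*(const volatile uint8_t *)&(x))
#define WRITE_ONCE_U8(x, v) (*(volatile uint8_t *)&(x) = (v))

/* Hypothetical cut-down struct: each flag occupies its own byte. */
struct fake_fib_alias {
	uint8_t offload;
	uint8_t trap;
	uint8_t offload_failed;
};

/* Returns 1 if any flag changed, 0 if the stored flags already matched. */
static int fake_hw_flags_set(struct fake_fib_alias *fa, uint8_t offload,
			     uint8_t trap, uint8_t offload_failed)
{
	assert(fa != NULL);
	if (READ_ONCE_U8(fa->offload) == offload &&
	    READ_ONCE_U8(fa->trap) == trap &&
	    READ_ONCE_U8(fa->offload_failed) == offload_failed)
		return 0;

	WRITE_ONCE_U8(fa->offload, offload);
	WRITE_ONCE_U8(fa->trap, trap);
	WRITE_ONCE_U8(fa->offload_failed, offload_failed);
	return 1;
}
```

Taking the address of a bitfield member in either macro would fail to compile, which is the limitation the commit message alludes to.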
      
      BUG: KCSAN: data-race in fib_alias_hw_flags_set / fib_alias_hw_flags_set
      
      read to 0xffff888134224a6a of 1 bytes by task 2013 on cpu 1:
       fib_alias_hw_flags_set+0x28a/0x470 net/ipv4/fib_trie.c:1050
       nsim_fib4_rt_hw_flags_set drivers/net/netdevsim/fib.c:350 [inline]
       nsim_fib4_rt_add drivers/net/netdevsim/fib.c:367 [inline]
       nsim_fib4_rt_insert drivers/net/netdevsim/fib.c:429 [inline]
       nsim_fib4_event drivers/net/netdevsim/fib.c:461 [inline]
       nsim_fib_event drivers/net/netdevsim/fib.c:881 [inline]
       nsim_fib_event_work+0x1852/0x2cf0 drivers/net/netdevsim/fib.c:1477
       process_one_work+0x3f6/0x960 kernel/workqueue.c:2307
       process_scheduled_works kernel/workqueue.c:2370 [inline]
       worker_thread+0x7df/0xa70 kernel/workqueue.c:2456
       kthread+0x1bf/0x1e0 kernel/kthread.c:377
       ret_from_fork+0x1f/0x30
      
      write to 0xffff888134224a6a of 1 bytes by task 4872 on cpu 0:
       fib_alias_hw_flags_set+0x2d5/0x470 net/ipv4/fib_trie.c:1054
       nsim_fib4_rt_hw_flags_set drivers/net/netdevsim/fib.c:350 [inline]
       nsim_fib4_rt_add drivers/net/netdevsim/fib.c:367 [inline]
       nsim_fib4_rt_insert drivers/net/netdevsim/fib.c:429 [inline]
       nsim_fib4_event drivers/net/netdevsim/fib.c:461 [inline]
       nsim_fib_event drivers/net/netdevsim/fib.c:881 [inline]
       nsim_fib_event_work+0x1852/0x2cf0 drivers/net/netdevsim/fib.c:1477
       process_one_work+0x3f6/0x960 kernel/workqueue.c:2307
       process_scheduled_works kernel/workqueue.c:2370 [inline]
       worker_thread+0x7df/0xa70 kernel/workqueue.c:2456
       kthread+0x1bf/0x1e0 kernel/kthread.c:377
       ret_from_fork+0x1f/0x30
      
      value changed: 0x00 -> 0x02
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 4872 Comm: kworker/0:0 Not tainted 5.17.0-rc3-syzkaller-00188-g1d41d2e8-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: events nsim_fib_event_work
      
      Fixes: 90b93f1b ("ipv4: Add "offload" and "trap" indications to routes")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Link: https://lore.kernel.org/r/20220216173217.3792411-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  8. 17 Feb, 2022 1 commit
    • ping: fix the dif and sdif check in ping_lookup · 35a79e64
      Committed by Xin Long
      When 'ping' is switched to use a PING socket instead of a RAW socket by:

         # sysctl -w net.ipv4.ping_group_range="0 100"

      another regression appears when matching sk_bound_dev_if against dif:
      the RAW socket lookup uses inet_iif() while the PING socket lookup
      uses skb->dev->ifindex, so the commands below fail:
      
        # ip link add dummy0 type dummy
        # ip link set dummy0 up
        # ip addr add 192.168.111.1/24 dev dummy0
        # ping -I dummy0 192.168.111.1 -c1
      
      The issue was also reported on:
      
        https://github.com/iputils/iputils/issues/104
      
      It was fixed in iputils in the wrong way, by not binding to the device
      when the destination IP is on that device, which causes some kselftests
      to fail, as Jianlin noticed.

      This patch uses inet(6)_iif and inet(6)_sdif to get dif and sdif for
      the PING socket, keeping it consistent with the RAW socket.
      
      Fixes: c319b4d7 ("net: ipv4: add IPPROTO_ICMP socket kind")
      Reported-by: Jianlin Shi <jishi@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 15 Feb, 2022 1 commit
    • ipv4: add description about martian source · 9d2d38c3
      Committed by Zhang Yunkai
      When multiple containers are running in the environment and multiple
      macvlan network ports are configured in each container, a lot of martian
      source prints will appear after martian_log is enabled. They are almost
      the same, and are printed by net_warn_ratelimited. Each ARP message will
      trigger this print on each network port.
      
      Such as:
      IPv4: martian source 173.254.95.16 from 173.254.100.109,
      on dev eth0
      ll header: 00000000: ff ff ff ff ff ff 40 00 ad fe 64 6d
      08 06        ......@...dm..
      IPv4: martian source 173.254.95.16 from 173.254.100.109,
      on dev eth1
      ll header: 00000000: ff ff ff ff ff ff 40 00 ad fe 64 6d
      08 06        ......@...dm..
      
      There is no description of this kind of source in the RFC1812.
      Signed-off-by: Zhang Yunkai <zhang.yunkai@zte.com.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 11 Feb, 2022 2 commits
    • ipv4: add (struct uncached_list)->quarantine list · 29e5375d
      Committed by Eric Dumazet
      This is an optimization to keep the per-cpu lists as short as possible:
      
      Whenever rt_flush_dev() changes one rtable dst.dev
      matching the disappearing device, it can transfer the object
      to a quarantine list, waiting for a final rt_del_uncached_list().
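The optimization can be pictured with a toy model: once a route has been re-pointed away from the disappearing device, it is moved to a second list so that later flushes no longer walk over it. Everything below (struct fake_rt, struct fake_uncached_list, FAKE_BLACKHOLE_DEV, the singly linked lists, the missing locking) is a hypothetical simplification, not the kernel data structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel types. */
struct fake_rt {
	int dev;		/* models rt->dst.dev as an ifindex */
	struct fake_rt *next;
};

struct fake_uncached_list {
	struct fake_rt *head;		/* routes still on live devices */
	struct fake_rt *quarantine;	/* re-pointed routes awaiting deletion */
};

#define FAKE_BLACKHOLE_DEV (-1)	/* assumed placeholder device */

/*
 * Model of the flush step: re-point every route that uses dev and move it
 * to the quarantine list. Returns the number of routes moved.
 */
static int fake_flush_dev(struct fake_uncached_list *ul, int dev)
{
	struct fake_rt **pp;
	int moved = 0;

	assert(ul != NULL);
	pp = &ul->head;
	while (*pp) {
		struct fake_rt *rt = *pp;

		if (rt->dev != dev) {
			pp = &rt->next;
			continue;
		}
		*pp = rt->next;			/* unlink from the main list */
		rt->dev = FAKE_BLACKHOLE_DEV;
		rt->next = ul->quarantine;	/* push onto quarantine */
		ul->quarantine = rt;
		moved++;
	}
	return moved;
}

/* Count the entries on a list. */
static int fake_list_len(const struct fake_rt *rt)
{
	int n = 0;

	for (; rt; rt = rt->next)
		n++;
	return n;
}
```

A second flush for the same device then finds nothing left to scan on the main list, which is the point of keeping that list short.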
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: Limit SMC visits when handshake workqueue congested · 48b6190a
      Committed by D. Wythe
      This patch provides a mechanism to constrain SMC connection visits
      according to the pressure on the SMC handshake process. At present,
      frequent visits cause incoming connections to be backlogged in the SMC
      handshake queue, which raises the connection establishment time. That
      is quite unacceptable for applications based on short-lived
      connections.
      
      There are two ways to implement this mechanism:
      
      1. Put limitation after TCP established.
      2. Put limitation before TCP established.
      
      In the first way, we need to wait for and receive the CLC messages that
      the client will potentially send, and then actively reply with a decline
      message. In a sense, this is also a sort of SMC handshake, and it
      affects the connection establishment time along the way.

      In the second way, the only problem is that we need to inject SMC logic
      into TCP when it is about to reply to the incoming SYN. Since we already
      do that, it no longer seems to be a problem. The advantage is obvious:
      few additional processes are required to complete the constraint.

      This patch uses the second way. After this patch, connections beyond
      the constraint will not be given any SMC indication, and SMC will not
      be involved in any of their subsequent processing.
      
      Link: https://lore.kernel.org/all/1641301961-59331-1-git-send-email-alibuda@linux.alibaba.com/
      Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 10 Feb, 2022 2 commits
  12. 09 Feb, 2022 4 commits
    • ipmr,ip6mr: acquire RTNL before calling ip[6]mr_free_table() on failure path · 5611a006
      Committed by Eric Dumazet
      ip[6]mr_free_table() can only be called under RTNL lock.
      
      RTNL: assertion failed at net/core/dev.c (10367)
      WARNING: CPU: 1 PID: 5890 at net/core/dev.c:10367 unregister_netdevice_many+0x1246/0x1850 net/core/dev.c:10367
      Modules linked in:
      CPU: 1 PID: 5890 Comm: syz-executor.2 Not tainted 5.16.0-syzkaller-11627-g422ee58dc0ef #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:unregister_netdevice_many+0x1246/0x1850 net/core/dev.c:10367
      Code: 0f 85 9b ee ff ff e8 69 07 4b fa ba 7f 28 00 00 48 c7 c6 00 90 ae 8a 48 c7 c7 40 90 ae 8a c6 05 6d b1 51 06 01 e8 8c 90 d8 01 <0f> 0b e9 70 ee ff ff e8 3e 07 4b fa 4c 89 e7 e8 86 2a 59 fa e9 ee
      RSP: 0018:ffffc900046ff6e0 EFLAGS: 00010286
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: ffff888050f51d00 RSI: ffffffff815fa008 RDI: fffff520008dfece
      RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
      R10: ffffffff815f3d6e R11: 0000000000000000 R12: 00000000fffffff4
      R13: dffffc0000000000 R14: ffffc900046ff750 R15: ffff88807b7dc000
      FS:  00007f4ab736e700(0000) GS:ffff8880b9d00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fee0b4f8990 CR3: 000000001e7d2000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       mroute_clean_tables+0x244/0xb40 net/ipv6/ip6mr.c:1509
       ip6mr_free_table net/ipv6/ip6mr.c:389 [inline]
       ip6mr_rules_init net/ipv6/ip6mr.c:246 [inline]
       ip6mr_net_init net/ipv6/ip6mr.c:1306 [inline]
       ip6mr_net_init+0x3f0/0x4e0 net/ipv6/ip6mr.c:1298
       ops_init+0xaf/0x470 net/core/net_namespace.c:140
       setup_net+0x54f/0xbb0 net/core/net_namespace.c:331
       copy_net_ns+0x318/0x760 net/core/net_namespace.c:475
       create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
       copy_namespaces+0x391/0x450 kernel/nsproxy.c:178
       copy_process+0x2e0c/0x7300 kernel/fork.c:2167
       kernel_clone+0xe7/0xab0 kernel/fork.c:2555
       __do_sys_clone+0xc8/0x110 kernel/fork.c:2672
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f4ab89f9059
      Code: Unable to access opcode bytes at RIP 0x7f4ab89f902f.
      RSP: 002b:00007f4ab736e118 EFLAGS: 00000206 ORIG_RAX: 0000000000000038
      RAX: ffffffffffffffda RBX: 00007f4ab8b0bf60 RCX: 00007f4ab89f9059
      RDX: 0000000020000280 RSI: 0000000020000270 RDI: 0000000040200000
      RBP: 00007f4ab8a5308d R08: 0000000020000300 R09: 0000000020000300
      R10: 00000000200002c0 R11: 0000000000000206 R12: 0000000000000000
      R13: 00007ffc3977cc1f R14: 00007f4ab736e300 R15: 0000000000022000
       </TASK>
      
      Fixes: f243e5a7 ("ipmr,ip6mr: call ip6mr_free_table() on failure path")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Cong Wang <cong.wang@bytedance.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Link: https://lore.kernel.org/r/20220208053451.2885398-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipmr: introduce ipmr_net_exit_batch() · 696e595f
      Committed by Eric Dumazet
      cleanup_net() is competing with other rtnl users.
      
      Not acquiring rtnl for each netns before calling
      ipmr_rules_exit() gives cleanup_net() a chance
      to progress much faster, holding rtnl a bit longer.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipv4: add fib_net_exit_batch() · 1c695764
      Committed by Eric Dumazet
      cleanup_net() is competing with other rtnl users.
      
      Instead of acquiring rtnl at each fib_net_exit() invocation,
      add fib_net_exit_batch() so that rtnl is acquired once.
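The difference between a per-namespace exit and a batched exit can be sketched with a counter standing in for the lock. All names here (fake_rtnl_lock(), struct fake_net, fake_exit_each(), fake_exit_batch()) are hypothetical; the real pernet_operations callbacks take different arguments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counter standing in for rtnl_lock()/rtnl_unlock() pairs. */
static int fake_rtnl_acquisitions;

static void fake_rtnl_lock(void)   { fake_rtnl_acquisitions++; }
static void fake_rtnl_unlock(void) { }

/* Hypothetical namespace object with a single "cleaned up" marker. */
struct fake_net {
	int cleaned;
};

static void fake_cleanup_one(struct fake_net *net)
{
	net->cleaned = 1;
}

/* Per-namespace exit: one lock round trip for every namespace. */
static void fake_exit_each(struct fake_net *nets, size_t n)
{
	size_t i;

	assert(nets != NULL);
	for (i = 0; i < n; i++) {
		fake_rtnl_lock();
		fake_cleanup_one(&nets[i]);
		fake_rtnl_unlock();
	}
}

/* Batched exit: a single lock round trip covers the whole list. */
static void fake_exit_batch(struct fake_net *nets, size_t n)
{
	size_t i;

	assert(nets != NULL);
	fake_rtnl_lock();
	for (i = 0; i < n; i++)
		fake_cleanup_one(&nets[i]);
	fake_rtnl_unlock();
}
```

With a contended lock, each extra acquisition is a chance to wait behind other users, so the batched form trades a slightly longer hold time for far fewer waits.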
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • nexthop: change nexthop_net_exit() to nexthop_net_exit_batch() · fea7b201
      Committed by Eric Dumazet
      cleanup_net() is competing with other rtnl users.
      
      nexthop_net_exit() seems a good candidate for exit_batch(),
      as this gives chance for cleanup_net() to progress much faster,
      holding rtnl a bit longer.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  13. 08 Feb, 2022 3 commits
  14. 07 Feb, 2022 5 commits
  15. 05 Feb, 2022 1 commit
  16. 04 Feb, 2022 1 commit
  17. 03 Feb, 2022 1 commit
  18. 02 Feb, 2022 1 commit
    • tcp: Use BPF timeout setting for SYN ACK RTO · 5903123f
      Committed by Akhmat Karakotov
      When setting RTO through BPF program, some SYN ACK packets were unaffected
      and continued to use TCP_TIMEOUT_INIT constant. This patch adds timeout
      option to struct request_sock. Option is initialized with TCP_TIMEOUT_INIT
      and is reassigned through BPF using tcp_timeout_init call. SYN ACK
      retransmits now use newly added timeout option.
      Signed-off-by: Akhmat Karakotov <hmukos@yandex-team.ru>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      
      v2:
      	- Add timeout option to struct request_sock. Do not call
      	  tcp_timeout_init on every syn ack retransmit.
      
      v3:
      	- Use unsigned long for min. Bound tcp_timeout_init to TCP_RTO_MAX.
      
      v4:
      	- Refactor duplicate code by adding reqsk_timeout function.
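The changelog mentions a per-request timeout that is bounded by a maximum. A minimal user-space sketch of that kind of helper follows, assuming the usual exponential backoff (the stored timeout doubled once per earlier timeout, then clamped). The struct layout, field widths and the name fake_reqsk_timeout() are assumptions for illustration and may not match the kernel helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cut-down request sock: only the fields the sketch needs. */
struct fake_request_sock {
	uint32_t timeout;	/* initial SYN ACK RTO, in ticks */
	uint8_t num_timeout;	/* retransmit timeouts seen so far */
};

/* Double the initial timeout once per earlier timeout, then clamp it. */
static unsigned long fake_reqsk_timeout(const struct fake_request_sock *req,
					unsigned long max_timeout)
{
	uint64_t timeout;

	assert(req != NULL && req->num_timeout < 32);
	/* Widen before shifting so the backoff cannot overflow 32 bits. */
	timeout = (uint64_t)req->timeout << req->num_timeout;
	return timeout < max_timeout ? (unsigned long)timeout : max_timeout;
}
```

Storing the initial value in the request means a BPF-chosen timeout only has to be computed once, and every retransmit derives its delay from it.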
      Signed-off-by: David S. Miller <davem@davemloft.net>