1. 25 4月, 2022 4 次提交
    • E
      tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT · 4bfe744f
      Eric Dumazet 提交于
      I had this bug sitting for too long in my pile, it is time to fix it.
      
      Thanks to Doug Porter for reminding me of it!
      
      We had various attempts in the past, including commit
      0cbe6a8f ("tcp: remove SOCK_QUEUE_SHRUNK"),
      but the issue is that TCP stack currently only generates
      EPOLLOUT from input path, when tp->snd_una has advanced
      and skb(s) cleaned from rtx queue.
      
      If a flow has a big RTT, and/or receives SACKs, it is possible
      that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0
      and no more data can be sent until tp->snd_una finally advances.
      
      What is needed is to also check if POLLOUT needs to be generated
      whenever tp->snd_nxt is advanced, from output path.
      
      This bug triggers more often after an idle period, as
      we do not receive ACK for at least one RTT. tcp_notsent_lowat
      could be a fraction of what CWND and pacing rate would allow to
      send during this RTT.
      
      In a followup patch, I will remove the bogus call
      to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED)
      from tcp_check_space(). Fact that we have decided to generate
      an EPOLLOUT does not mean the application has immediately
      refilled the transmit queue. This optimistic call
      might have been the reason the bug seemed not too serious.
      
      Tested:
      
      200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2]
      
      $ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat
      $ cat bench_rr.sh
      SUM=0
      for i in {1..10}
      do
       V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"`
       echo $V
       SUM=$(($SUM + $V))
      done
      echo SUM=$SUM
      
      Before patch:
      $ bench_rr.sh
      130000000
      80000000
      140000000
      140000000
      140000000
      140000000
      130000000
      40000000
      90000000
      110000000
      SUM=1140000000
      
      After patch:
      $ bench_rr.sh
      430000000
      590000000
      530000000
      450000000
      450000000
      350000000
      450000000
      490000000
      480000000
      460000000
      SUM=4680000000  # This is 410 % of the value before patch.
      
      Fixes: c9bee3b7 ("tcp: TCP_NOTSENT_LOWAT socket option")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NDoug Porter <dsp@fb.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4bfe744f
    • P
      ip_gre, ip6_gre: Fix race condition on o_seqno in collect_md mode · 31c417c9
      Peilin Ye 提交于
      As pointed out by Jakub Kicinski, currently using TUNNEL_SEQ in
      collect_md mode is racy for [IP6]GRE[TAP] devices.  Consider the
      following sequence of events:
      
      1. An [IP6]GRE[TAP] device is created in collect_md mode using "ip link
         add ... external".  "ip" ignores "[o]seq" if "external" is specified,
         so TUNNEL_SEQ is off, and the device is marked as NETIF_F_LLTX (i.e.
         it uses lockless TX);
      2. Someone sets TUNNEL_SEQ on outgoing skb's, using e.g.
         bpf_skb_set_tunnel_key() in an eBPF program attached to this device;
      3. gre_fb_xmit() or __gre6_xmit() processes these skb's:
      
      	gre_build_header(skb, tun_hlen,
      			 flags, protocol,
      			 tunnel_id_to_key32(tun_info->key.tun_id),
      			 (flags & TUNNEL_SEQ) ? htonl(tunnel->o_seqno++)
      					      : 0);   ^^^^^^^^^^^^^^^^^
      
      Since we are not using the TX lock (&txq->_xmit_lock), multiple CPUs may
      try to do this tunnel->o_seqno++ in parallel, which is racy.  Fix it by
      making o_seqno atomic_t.
      
      As mentioned by Eric Dumazet in commit b790e01a ("ip_gre: lockless
      xmit"), making o_seqno atomic_t increases "chance for packets being out
      of order at receiver" when NETIF_F_LLTX is on.
      
      Maybe a better fix would be:
      
      1. Do not ignore "oseq" in external mode.  Users MUST specify "oseq" if
         they want the kernel to allow sequencing of outgoing packets;
      2. Reject all outgoing TUNNEL_SEQ packets if the device was not created
         with "oseq".
      
      Unfortunately, that would break userspace.
      
      We could now make [IP6]GRE[TAP] devices always NETIF_F_LLTX, but let us
      do it in separate patches to keep this fix minimal.
      Suggested-by: NJakub Kicinski <kuba@kernel.org>
      Fixes: 77a5196a ("gre: add sequence number for collect md mode.")
      Signed-off-by: NPeilin Ye <peilin.ye@bytedance.com>
      Acked-by: NWilliam Tu <u9012063@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      31c417c9
    • P
      ip_gre: Make o_seqno start from 0 in native mode · ff827beb
      Peilin Ye 提交于
      For GRE and GRETAP devices, currently o_seqno starts from 1 in native
      mode.  According to RFC 2890 2.2., "The first datagram is sent with a
      sequence number of 0."  Fix it.
      
      It is worth mentioning that o_seqno already starts from 0 in collect_md
      mode, see gre_fb_xmit(), where tunnel->o_seqno is passed to
      gre_build_header() before getting incremented.
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NPeilin Ye <peilin.ye@bytedance.com>
      Acked-by: NWilliam Tu <u9012063@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ff827beb
    • R
      netfilter: flowtable: Remove the empty file · b9b1e0da
      Rongguang Wei 提交于
      CONFIG_NF_FLOW_TABLE_IPV4 is already removed and the real user is also
      removed(nf_flow_table_ipv4.c is empty).
      
      Fixes: c42ba429 ("netfilter: flowtable: remove ipv4/ipv6 modules")
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      b9b1e0da
  2. 23 4月, 2022 2 次提交
  3. 16 4月, 2022 1 次提交
  4. 13 4月, 2022 1 次提交
    • S
      esp: limit skb_page_frag_refill use to a single page · 5bd8baab
      Sabrina Dubroca 提交于
      Commit ebe48d36 ("esp: Fix possible buffer overflow in ESP
      transformation") tried to fix skb_page_frag_refill usage in ESP by
      capping allocsize to 32k, but that doesn't completely solve the issue,
      as skb_page_frag_refill may return a single page. If that happens, we
      will write out of bounds, despite the check introduced in the previous
      patch.
      
      This patch forces COW in cases where we would end up calling
      skb_page_frag_refill with a size larger than a page (first in
      esp_output_head with tailen, then in esp_output_tail with
      skb->data_len).
      
      Fixes: cac2661c ("esp4: Avoid skb_cow_data whenever possible")
      Fixes: 03e2a30f ("esp6: Avoid skb_cow_data whenever possible")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      5bd8baab
  5. 01 4月, 2022 1 次提交
    • N
      net: ipv4: fix route with nexthop object delete warning · 6bf92d70
      Nikolay Aleksandrov 提交于
      FRR folks have hit a kernel warning[1] while deleting routes[2] which is
      caused by trying to delete a route pointing to a nexthop id without
      specifying nhid but matching on an interface. That is, a route is found
      but we hit a warning while matching it. The warning is from
      fib_info_nh() in include/net/nexthop.h because we run it on a fib_info
      with nexthop object. The call chain is:
       inet_rtm_delroute -> fib_table_delete -> fib_nh_match (called with a
      nexthop fib_info and also with fc_oif set thus calling fib_info_nh on
      the fib_info and triggering the warning). The fix is to not do any
      matching in that branch if the fi has a nexthop object because those are
      managed separately. I.e. we should match when deleting without nh spec and
      should fail when deleting a nexthop route with old-style nh spec because
      nexthop objects are managed separately, e.g.:
       $ ip r show 1.2.3.4/32
       1.2.3.4 nhid 12 via 192.168.11.2 dev dummy0
      
       $ ip r del 1.2.3.4/32
       $ ip r del 1.2.3.4/32 nhid 12
       <both should work>
      
       $ ip r del 1.2.3.4/32 dev dummy0
       <should fail with ESRCH>
      
      [1]
       [  523.462226] ------------[ cut here ]------------
       [  523.462230] WARNING: CPU: 14 PID: 22893 at include/net/nexthop.h:468 fib_nh_match+0x210/0x460
       [  523.462236] Modules linked in: dummy rpcsec_gss_krb5 xt_socket nf_socket_ipv4 nf_socket_ipv6 ip6table_raw iptable_raw bpf_preload xt_statistic ip_set ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs xt_mark nf_tables xt_nat veth nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter overlay dm_crypt nfsv3 nfs fscache netfs vhost_net vhost vhost_iotlb tap tun xt_CHECKSUM xt_MASQUERADE xt_conntrack 8021q garp mrp ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bridge stp llc rfcomm snd_seq_dummy snd_hrtimer rpcrdma rdma_cm iw_cm ib_cm ib_core ip6table_filter xt_comment ip6_tables vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) qrtr bnep binfmt_misc xfs vfat fat squashfs loop nvidia_drm(POE) nvidia_modeset(POE) nvidia_uvm(POE) nvidia(POE) intel_rapl_msr intel_rapl_common snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi btusb btrtl iwlmvm uvcvideo btbcm snd_hda_intel edac_mce_amd
       [  523.462274]  videobuf2_vmalloc videobuf2_memops btintel snd_intel_dspcfg videobuf2_v4l2 snd_intel_sdw_acpi bluetooth snd_usb_audio snd_hda_codec mac80211 snd_usbmidi_lib joydev snd_hda_core videobuf2_common kvm_amd snd_rawmidi snd_hwdep snd_seq videodev ccp snd_seq_device libarc4 ecdh_generic mc snd_pcm kvm iwlwifi snd_timer drm_kms_helper snd cfg80211 cec soundcore irqbypass rapl wmi_bmof i2c_piix4 rfkill k10temp pcspkr acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc drm zram ip_tables crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel nvme sp5100_tco r8169 nvme_core wmi ipmi_devintf ipmi_msghandler fuse
       [  523.462300] CPU: 14 PID: 22893 Comm: ip Tainted: P           OE     5.16.18-200.fc35.x86_64 #1
       [  523.462302] Hardware name: Micro-Star International Co., Ltd. MS-7C37/MPG X570 GAMING EDGE WIFI (MS-7C37), BIOS 1.C0 10/29/2020
       [  523.462303] RIP: 0010:fib_nh_match+0x210/0x460
       [  523.462304] Code: 7c 24 20 48 8b b5 90 00 00 00 e8 bb ee f4 ff 48 8b 7c 24 20 41 89 c4 e8 ee eb f4 ff 45 85 e4 0f 85 2e fe ff ff e9 4c ff ff ff <0f> 0b e9 17 ff ff ff 3c 0a 0f 85 61 fe ff ff 48 8b b5 98 00 00 00
       [  523.462306] RSP: 0018:ffffaa53d4d87928 EFLAGS: 00010286
       [  523.462307] RAX: 0000000000000000 RBX: ffffaa53d4d87a90 RCX: ffffaa53d4d87bb0
       [  523.462308] RDX: ffff9e3d2ee6be80 RSI: ffffaa53d4d87a90 RDI: ffffffff920ed380
       [  523.462309] RBP: ffff9e3d2ee6be80 R08: 0000000000000064 R09: 0000000000000000
       [  523.462310] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000031
       [  523.462310] R13: 0000000000000020 R14: 0000000000000000 R15: ffff9e3d331054e0
       [  523.462311] FS:  00007f245517c1c0(0000) GS:ffff9e492ed80000(0000) knlGS:0000000000000000
       [  523.462313] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [  523.462313] CR2: 000055e5dfdd8268 CR3: 00000003ef488000 CR4: 0000000000350ee0
       [  523.462315] Call Trace:
       [  523.462316]  <TASK>
       [  523.462320]  fib_table_delete+0x1a9/0x310
       [  523.462323]  inet_rtm_delroute+0x93/0x110
       [  523.462325]  rtnetlink_rcv_msg+0x133/0x370
       [  523.462327]  ? _copy_to_iter+0xb5/0x6f0
       [  523.462330]  ? rtnl_calcit.isra.0+0x110/0x110
       [  523.462331]  netlink_rcv_skb+0x50/0xf0
       [  523.462334]  netlink_unicast+0x211/0x330
       [  523.462336]  netlink_sendmsg+0x23f/0x480
       [  523.462338]  sock_sendmsg+0x5e/0x60
       [  523.462340]  ____sys_sendmsg+0x22c/0x270
       [  523.462341]  ? import_iovec+0x17/0x20
       [  523.462343]  ? sendmsg_copy_msghdr+0x59/0x90
       [  523.462344]  ? __mod_lruvec_page_state+0x85/0x110
       [  523.462348]  ___sys_sendmsg+0x81/0xc0
       [  523.462350]  ? netlink_seq_start+0x70/0x70
       [  523.462352]  ? __dentry_kill+0x13a/0x180
       [  523.462354]  ? __fput+0xff/0x250
       [  523.462356]  __sys_sendmsg+0x49/0x80
       [  523.462358]  do_syscall_64+0x3b/0x90
       [  523.462361]  entry_SYSCALL_64_after_hwframe+0x44/0xae
       [  523.462364] RIP: 0033:0x7f24552aa337
       [  523.462365] Code: 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
       [  523.462366] RSP: 002b:00007fff7f05a838 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
       [  523.462368] RAX: ffffffffffffffda RBX: 000000006245bf91 RCX: 00007f24552aa337
       [  523.462368] RDX: 0000000000000000 RSI: 00007fff7f05a8a0 RDI: 0000000000000003
       [  523.462369] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
       [  523.462370] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000001
       [  523.462370] R13: 00007fff7f05ce08 R14: 0000000000000000 R15: 000055e5dfdd1040
       [  523.462373]  </TASK>
       [  523.462374] ---[ end trace ba537bc16f6bf4ed ]---
      
      [2] https://github.com/FRRouting/frr/issues/6412
      
      Fixes: 4c7e8084 ("ipv4: Plumb support for nexthop object in a fib_info")
      Signed-off-by: NNikolay Aleksandrov <razor@blackwall.org>
      Reviewed-by: NDavid Ahern <dsahern@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6bf92d70
  6. 22 3月, 2022 1 次提交
  7. 20 3月, 2022 3 次提交
  8. 19 3月, 2022 1 次提交
    • G
      ipv4: Fix route lookups when handling ICMP redirects and PMTU updates · 544b4dd5
      Guillaume Nault 提交于
      The PMTU update and ICMP redirect helper functions initialise their fl4
      variable with either __build_flow_key() or build_sk_flow_key(). These
      initialisation functions always set ->flowi4_scope with
      RT_SCOPE_UNIVERSE and might set the ECN bits of ->flowi4_tos. This is
      not a problem when the route lookup is later done via
      ip_route_output_key_hash(), which properly clears the ECN bits from
      ->flowi4_tos and initialises ->flowi4_scope based on the RTO_ONLINK
      flag. However, some helpers call fib_lookup() directly, without
      sanitising the tos and scope fields, so the route lookup can fail and,
      as a result, the ICMP redirect or PMTU update aren't taken into
      account.
      
      Fix this by extracting the ->flowi4_tos and ->flowi4_scope sanitisation
      code into ip_rt_fix_tos(), then use this function in handlers that call
      fib_lookup() directly.
      
      Note 1: We can't sanitise ->flowi4_tos and ->flowi4_scope in a central
      place (like __build_flow_key() or flowi4_init_output()), because
      ip_route_output_key_hash() expects non-sanitised values. When called
      with sanitised values, it can erroneously overwrite RT_SCOPE_LINK with
      RT_SCOPE_UNIVERSE in ->flowi4_scope. Therefore we have to be careful to
      sanitise the values only for those paths that don't call
      ip_route_output_key_hash().
      
      Note 2: The problem is mostly about sanitising ->flowi4_tos. Having
      ->flowi4_scope initialised with RT_SCOPE_UNIVERSE instead of
      RT_SCOPE_LINK probably wasn't really a problem: sockets with the
      SOCK_LOCALROUTE flag set (those that'd result in RTO_ONLINK being set)
      normally shouldn't receive ICMP redirects or PMTU updates.
      
      Fixes: 4895c771 ("ipv4: Add FIB nexthop exceptions.")
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@kernel.org>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      544b4dd5
  9. 16 3月, 2022 1 次提交
    • D
      net: Add l3mdev index to flow struct and avoid oif reset for port devices · 40867d74
      David Ahern 提交于
      The fundamental premise of VRF and l3mdev core code is binding a socket
      to a device (l3mdev or netdev with an L3 domain) to indicate L3 scope.
      Legacy code resets flowi_oif to the l3mdev losing any original port
      device binding. Ben (among others) has demonstrated use cases where the
      original port device binding is important and needs to be retained.
      This patch handles that by adding a new entry to the common flow struct
      that can indicate the l3mdev index for later rule and table matching
      avoiding the need to reset flowi_oif.
      
      In addition to allowing more use cases that require port device binds,
      this patch brings a few datapath simplications:
      
      1. l3mdev_fib_rule_match is only called when walking fib rules and
         always after l3mdev_update_flow. That allows an optimization to bail
         early for non-VRF type uses cases when flowi_l3mdev is not set. Also,
         only that index needs to be checked for the FIB table id.
      
      2. l3mdev_update_flow can be called with flowi_oif set to a l3mdev
         (e.g., VRF) device. By resetting flowi_oif only for this case the
         FLOWI_FLAG_SKIP_NH_OIF flag is not longer needed and can be removed,
         removing several checks in the datapath. The flowi_iif path can be
         simplified to only be called if the it is not loopback (loopback can
         not be assigned to an L3 domain) and the l3mdev index is not already
         set.
      
      3. Avoid another device lookup in the output path when the fib lookup
         returns a reject failure.
      
      Note: 2 functional tests for local traffic with reject fib rules are
      updated to reflect the new direct failure at FIB lookup time for ping
      rather than the failure on packet path. The current code fails like this:
      
          HINT: Fails since address on vrf device is out of device scope
          COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1
          ping: Warning: source address might be selected on device other than: eth1
          PING 172.16.3.1 (172.16.3.1) from 172.16.3.1 eth1: 56(84) bytes of data.
      
          --- 172.16.3.1 ping statistics ---
          1 packets transmitted, 0 received, 100% packet loss, time 0ms
      
      where the test now directly fails:
      
          HINT: Fails since address on vrf device is out of device scope
          COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1
          ping: connect: No route to host
      Signed-off-by: NDavid Ahern <dsahern@kernel.org>
      Tested-by: NBen Greear <greearb@candelatech.com>
      Link: https://lore.kernel.org/r/20220314204551.16369-1-dsahern@kernel.orgSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      40867d74
  10. 15 3月, 2022 2 次提交
    • W
      bpf, sockmap: Fix double uncharge the mem of sk_msg · 2486ab43
      Wang Yufen 提交于
      If tcp_bpf_sendmsg is running during a tear down operation, psock may be
      freed.
      
      tcp_bpf_sendmsg()
       tcp_bpf_send_verdict()
        sk_msg_return()
        tcp_bpf_sendmsg_redir()
         unlikely(!psock))
           sk_msg_free()
      
      The mem of msg has been uncharged in tcp_bpf_send_verdict() by
      sk_msg_return(), and would be uncharged by sk_msg_free() again. When psock
      is null, we can simply returning an error code, this would then trigger
      the sk_msg_free_nocharge in the error path of __SK_REDIRECT and would have
      the side effect of throwing an error up to user space. This would be a
      slight change in behavior from user side but would look the same as an
      error if the redirect on the socket threw an error.
      
      This issue can cause the following info:
      WARNING: CPU: 0 PID: 2136 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x13c/0x260
      Call Trace:
       <TASK>
       __sk_destruct+0x24/0x1f0
       sk_psock_destroy+0x19b/0x1c0
       process_one_work+0x1b3/0x3c0
       worker_thread+0x30/0x350
       ? process_one_work+0x3c0/0x3c0
       kthread+0xe6/0x110
       ? kthread_complete_and_exit+0x20/0x20
       ret_from_fork+0x22/0x30
       </TASK>
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: NWang Yufen <wangyufen@huawei.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20220304081145.2037182-5-wangyufen@huawei.com
      2486ab43
    • W
      bpf, sockmap: Fix more uncharged while msg has more_data · 84472b43
      Wang Yufen 提交于
      In tcp_bpf_send_verdict(), if msg has more data after
      tcp_bpf_sendmsg_redir():
      
      tcp_bpf_send_verdict()
       tosend = msg->sg.size  //msg->sg.size = 22220
       case __SK_REDIRECT:
        sk_msg_return()  //uncharged msg->sg.size(22220) sk->sk_forward_alloc
        tcp_bpf_sendmsg_redir() //after tcp_bpf_sendmsg_redir, msg->sg.size=11000
       goto more_data;
       tosend = msg->sg.size  //msg->sg.size = 11000
       case __SK_REDIRECT:
        sk_msg_return()  //uncharged msg->sg.size(11000) to sk->sk_forward_alloc
      
      The msg->sg.size(11000) has been uncharged twice, to fix we can charge the
      remaining msg->sg.size before goto more data.
      
      This issue can cause the following info:
      WARNING: CPU: 0 PID: 9860 at net/core/stream.c:208 sk_stream_kill_queues+0xd4/0x1a0
      Call Trace:
       <TASK>
       inet_csk_destroy_sock+0x55/0x110
       __tcp_close+0x279/0x470
       tcp_close+0x1f/0x60
       inet_release+0x3f/0x80
       __sock_release+0x3d/0xb0
       sock_close+0x11/0x20
       __fput+0x92/0x250
       task_work_run+0x6a/0xa0
       do_exit+0x33b/0xb60
       do_group_exit+0x2f/0xa0
       get_signal+0xb6/0x950
       arch_do_signal_or_restart+0xac/0x2a0
       ? vfs_write+0x237/0x290
       exit_to_user_mode_prepare+0xa9/0x200
       syscall_exit_to_user_mode+0x12/0x30
       do_syscall_64+0x46/0x80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
       </TASK>
      
      WARNING: CPU: 0 PID: 2136 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x13c/0x260
      Call Trace:
       <TASK>
       __sk_destruct+0x24/0x1f0
       sk_psock_destroy+0x19b/0x1c0
       process_one_work+0x1b3/0x3c0
       worker_thread+0x30/0x350
       ? process_one_work+0x3c0/0x3c0
       kthread+0xe6/0x110
       ? kthread_complete_and_exit+0x20/0x20
       ret_from_fork+0x22/0x30
       </TASK>
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: NWang Yufen <wangyufen@huawei.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20220304081145.2037182-4-wangyufen@huawei.com
      84472b43
  11. 12 3月, 2022 1 次提交
  12. 10 3月, 2022 2 次提交
    • E
      tcp: adjust TSO packet sizes based on min_rtt · 65466904
      Eric Dumazet 提交于
      Back when tcp_tso_autosize() and TCP pacing were introduced,
      our focus was really to reduce burst sizes for long distance
      flows.
      
      The simple heuristic of using sk_pacing_rate/1024 has worked
      well, but can lead to too small packets for hosts in the same
      rack/cluster, when thousands of flows compete for the bottleneck.
      
      Neal Cardwell had the idea of making the TSO burst size
      a function of both sk_pacing_rate and tcp_min_rtt()
      
      Indeed, for local flows, sending bigger bursts is better
      to reduce cpu costs, as occasional losses can be repaired
      quite fast.
      
      This patch is based on Neal Cardwell implementation
      done more than two years ago.
      bbr is adjusting max_pacing_rate based on measured bandwidth,
      while cubic would over estimate max_pacing_rate.
      
      /proc/sys/net/ipv4/tcp_tso_rtt_log can be used to tune or disable
      this new feature, in logarithmic steps.
      
      Tested:
      
      100Gbit NIC, two hosts in the same rack, 4K MTU.
      600 flows rate-limited to 20000000 bytes per second.
      
      Before patch: (TSO sizes would be limited to 20000000/1024/4096 -> 4 segments per TSO)
      
      ~# echo 0 >/proc/sys/net/ipv4/tcp_tso_rtt_log
      ~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
        96005
      
       Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':
      
               65,945.29 msec task-clock                #    2.845 CPUs utilized
               1,314,632      context-switches          # 19935.279 M/sec
                   5,292      cpu-migrations            #   80.249 M/sec
                 940,641      page-faults               # 14264.023 M/sec
         201,117,030,926      cycles                    # 3049769.216 GHz                   (83.45%)
          17,699,435,405      stalled-cycles-frontend   #    8.80% frontend cycles idle     (83.48%)
         136,584,015,071      stalled-cycles-backend    #   67.91% backend cycles idle      (83.44%)
          53,809,530,436      instructions              #    0.27  insn per cycle
                                                        #    2.54  stalled cycles per insn  (83.36%)
           9,062,315,523      branches                  # 137422329.563 M/sec               (83.22%)
             153,008,621      branch-misses             #    1.69% of all branches          (83.32%)
      
            23.182970846 seconds time elapsed
      
      TcpInSegs                       15648792           0.0
      TcpOutSegs                      58659110           0.0  # Average of 3.7 4K segments per TSO packet
      TcpExtTCPDelivered              58654791           0.0
      TcpExtTCPDeliveredCE            19                 0.0
      
      After patch:
      
      ~# echo 9 >/proc/sys/net/ipv4/tcp_tso_rtt_log
      ~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
        96046
      
       Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':
      
               48,982.58 msec task-clock                #    2.104 CPUs utilized
                 186,014      context-switches          # 3797.599 M/sec
                   3,109      cpu-migrations            #   63.472 M/sec
                 941,180      page-faults               # 19214.814 M/sec
         153,459,763,868      cycles                    # 3132982.807 GHz                   (83.56%)
          12,069,861,356      stalled-cycles-frontend   #    7.87% frontend cycles idle     (83.32%)
         120,485,917,953      stalled-cycles-backend    #   78.51% backend cycles idle      (83.24%)
          36,803,672,106      instructions              #    0.24  insn per cycle
                                                        #    3.27  stalled cycles per insn  (83.18%)
           5,947,266,275      branches                  # 121417383.427 M/sec               (83.64%)
              87,984,616      branch-misses             #    1.48% of all branches          (83.43%)
      
            23.281200256 seconds time elapsed
      
      TcpInSegs                       1434706            0.0
      TcpOutSegs                      58883378           0.0  # Average of 41 4K segments per TSO packet
      TcpExtTCPDelivered              58878971           0.0
      TcpExtTCPDeliveredCE            9664               0.0
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NNeal Cardwell <ncardwell@google.com>
      Link: https://lore.kernel.org/r/20220309015757.2532973-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      65466904
    • E
      tcp: autocork: take MSG_EOR hint into consideration · b0de0cf4
      Eric Dumazet 提交于
      tcp_should_autocork() is evaluating if it makes senses
      to not immediately send current skb, hoping that
      user space will add more payload on it by the
      time TCP stack reacts to upcoming TX completions.
      
      If current skb got MSG_EOR mark, then we know
      that no further data will be added, it is therefore
      futile to wait.
      
      SOF_TIMESTAMPING_TX_ACK will become a bit more accurate,
      if prior packets are still in qdisc/device queues.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Link: https://lore.kernel.org/r/20220309054706.2857266-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      b0de0cf4
  13. 09 3月, 2022 1 次提交
  14. 07 3月, 2022 3 次提交
  15. 03 3月, 2022 6 次提交
    • M
      net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally · cd14e9b7
      Martin KaFai Lau 提交于
      The previous patches handled the delivery_time in the ingress path
      before the routing decision is made.  This patch can postpone clearing
      delivery_time in a skb until knowing it is delivered locally and also
      set the (rcv) timestamp if needed.  This patch moves the
      skb_clear_delivery_time() from dev.c to ip_local_deliver_finish()
      and ip6_input_finish().
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cd14e9b7
    • M
      net: ip: Handle delivery_time in ip defrag · 8672406e
      Martin KaFai Lau 提交于
      A latter patch will postpone the delivery_time clearing until the stack
      knows the skb is being delivered locally.  That will allow other kernel
      forwarding path (e.g. ip[6]_forward) to keep the delivery_time also.
      
      An earlier attempt was to do skb_clear_delivery_time() in
      ip_local_deliver() and ip6_input().  The discussion [0] requested
      to move it one step later into ip_local_deliver_finish()
      and ip6_input_finish() so that the delivery_time can be kept
      for the ip_vs forwarding path also.
      
      To do that, this patch also needs to take care of the (rcv) timestamp
      usecase in ip_is_fragment().  It needs to expect delivery_time in
      the skb->tstamp, so it needs to save the mono_delivery_time bit in
      inet_frag_queue such that the delivery_time (if any) can be restored
      in the final defragmented skb.
      
      [Note that it will only happen when the locally generated skb is looping
       from egress to ingress over a virtual interface (e.g. veth, loopback...),
       skb->tstamp may have the delivery time before it is known that it will
       be delivered locally and received by another sk.]
      
      [0]: https://lore.kernel.org/netdev/ca728d81-80e8-3767-d5e-d44f6ad96e43@ssi.bg/Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8672406e
    • M
      net: Set skb->mono_delivery_time and clear it after sch_handle_ingress() · d98d58a0
      Martin KaFai Lau 提交于
      The previous patches handled the delivery_time before sch_handle_ingress().
      
      This patch can now set the skb->mono_delivery_time to flag the skb->tstamp
      is used as the mono delivery_time (EDT) instead of the (rcv) timestamp
      and also clear it with skb_clear_delivery_time() after
      sch_handle_ingress().  This will make the bpf_redirect_*()
      to keep the mono delivery_time and used by a qdisc (fq) of
      the egress-ing interface.
      
      A latter patch will postpone the skb_clear_delivery_time() until the
      stack learns that the skb is being delivered locally and that will
      make other kernel forwarding paths (ip[6]_forward) able to keep
      the delivery_time also.  Thus, like the previous patches on using
      the skb->mono_delivery_time bit, calling skb_clear_delivery_time()
      is not limited within the CONFIG_NET_INGRESS to avoid too many code
      churns among this set.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d98d58a0
    • M
      net: Add skb_clear_tstamp() to keep the mono delivery_time · de799101
      Martin KaFai Lau 提交于
      Right now, skb->tstamp is reset to 0 whenever the skb is forwarded.
      
      If skb->tstamp has the mono delivery_time, clearing it can hurt
      the performance when it finally transmits out to fq@phy-dev.
      
      The earlier patch added a skb->mono_delivery_time bit to
      flag the skb->tstamp carrying the mono delivery_time.
      
      This patch adds skb_clear_tstamp() helper which keeps
      the mono delivery_time and clears everything else.
      
      The delivery_time clearing will be postponed until the stack knows the
      skb will be delivered locally.  It will be done in a latter patch.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      de799101
    • M
      net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp · a1ac9c8a
      Martin KaFai Lau 提交于
      skb->tstamp was first used as the (rcv) timestamp.
      The major usage is to report it to the user (e.g. SO_TIMESTAMP).
      
      Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
      during egress and used by the qdisc (e.g. sch_fq) to make decision on when
      the skb can be passed to the dev.
      
      Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
      or the delivery_time, so it is always reset to 0 whenever forwarded
      between egress and ingress.
      
      While it makes sense to always clear the (rcv) timestamp in skb->tstamp
      to avoid confusing sch_fq that expects the delivery_time, it is a
      performance issue [0] to clear the delivery_time if the skb finally
      egress to a fq@phy-dev.  For example, when forwarding from egress to
      ingress and then finally back to egress:
      
                  tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
                                           ^              ^
                                           reset          rest
      
      This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp
      is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.
      
      The current use case is to keep the TCP mono delivery_time (EDT) and
      to be used with sch_fq.  A latter patch will also allow tc-bpf@ingress
      to read and change the mono delivery_time.
      
      In the future, another bit (e.g. skb->user_delivery_time) can be added
      for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid.
      
      [ This patch is a prep work.  The following patches will
        get the other parts of the stack ready first.  Then another patch
        after that will finally set the skb->mono_delivery_time. ]
      
      skb_set_delivery_time() function is added.  It is used by the tcp_output.c
      and during ip[6] fragmentation to assign the delivery_time to
      the skb->tstamp and also set the skb->mono_delivery_time.
      
      A note on the change in ip_send_unicast_reply() in ip_output.c.
      It is only used by TCP to send reset/ack out of a ctl_sk.
      Like the new skb_set_delivery_time(), this patch sets
      the skb->mono_delivery_time to 0 for now as a place
      holder.  It will be enabled in a latter patch.
      A similar case in tcp_ipv6 can be done with
      skb_set_delivery_time() in tcp_v6_send_response().
      
      [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdfSigned-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a1ac9c8a
    • E
      tcp: make tcp_read_sock() more robust · e3d5ea2c
      Eric Dumazet 提交于
      If recv_actor() returns an incorrect value, tcp_read_sock()
      might loop forever.
      
      Instead, issue a one time warning and make sure to make progress.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NJakub Sitnicki <jakub@cloudflare.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/r/20220302161723.3910001-2-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      e3d5ea2c
  16. 26 2月, 2022 2 次提交
  17. 25 2月, 2022 2 次提交
  18. 23 2月, 2022 1 次提交
  19. 21 2月, 2022 2 次提交
    • I
      ipv4: Invalidate neighbour for broadcast address upon address addition · 0c51e12e
      Ido Schimmel 提交于
      In case user space sends a packet destined to a broadcast address when a
      matching broadcast route is not configured, the kernel will create a
      unicast neighbour entry that will never be resolved [1].
      
      When the broadcast route is configured, the unicast neighbour entry will
      not be invalidated and continue to linger, resulting in packets being
      dropped.
      
      Solve this by invalidating unresolved neighbour entries for broadcast
      addresses after routes for these addresses are internally configured by
      the kernel. This allows the kernel to create a broadcast neighbour entry
      following the next route lookup.
      
      Another possible solution that is more generic but also more complex is
      to have the ARP code register a listener to the FIB notification chain
      and invalidate matching neighbour entries upon the addition of broadcast
      routes.
      
      It is also possible to wave off the issue as a user space problem, but
      it seems a bit excessive to expect user space to be that intimately
      familiar with the inner workings of the FIB/neighbour kernel code.
      
      [1] https://lore.kernel.org/netdev/55a04a8f-56f3-f73c-2aea-2195923f09d1@huawei.com/Reported-by: NWang Hai <wanghai38@huawei.com>
      Signed-off-by: NIdo Schimmel <idosch@nvidia.com>
      Tested-by: NWang Hai <wanghai38@huawei.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c51e12e
    • T
      gso: do not skip outer ip header in case of ipip and net_failover · cc20cced
      Tao Liu 提交于
      We encounter a tcp drop issue in our cloud environment. Packet GROed in
      host forwards to a VM virtio_net nic with net_failover enabled. VM acts
      as a IPVS LB with ipip encapsulation. The full path like:
      host gro -> vm virtio_net rx -> net_failover rx -> ipvs fullnat
       -> ipip encap -> net_failover tx -> virtio_net tx
      
      When net_failover transmits a ipip pkt (gso_type = 0x0103, which means
      SKB_GSO_TCPV4, SKB_GSO_DODGY and SKB_GSO_IPXIP4), there is no gso
      did because it supports TSO and GSO_IPXIP4. But network_header points to
      inner ip header.
      
      Call Trace:
       tcp4_gso_segment        ------> return NULL
       inet_gso_segment        ------> inner iph, network_header points to
       ipip_gso_segment
       inet_gso_segment        ------> outer iph
       skb_mac_gso_segment
      
      Afterwards virtio_net transmits the pkt, only inner ip header is modified.
      And the outer one just keeps unchanged. The pkt will be dropped in remote
      host.
      
      Call Trace:
       inet_gso_segment        ------> inner iph, outer iph is skipped
       skb_mac_gso_segment
       __skb_gso_segment
       validate_xmit_skb
       validate_xmit_skb_list
       sch_direct_xmit
       __qdisc_run
       __dev_queue_xmit        ------> virtio_net
       dev_hard_start_xmit
       __dev_queue_xmit        ------> net_failover
       ip_finish_output2
       ip_output
       iptunnel_xmit
       ip_tunnel_xmit
       ipip_tunnel_xmit        ------> ipip
       dev_hard_start_xmit
       __dev_queue_xmit
       ip_finish_output2
       ip_output
       ip_forward
       ip_rcv
       __netif_receive_skb_one_core
       netif_receive_skb_internal
       napi_gro_receive
       receive_buf
       virtnet_poll
       net_rx_action
      
      The root cause of this issue is specific with the rare combination of
      SKB_GSO_DODGY and a tunnel device that adds an SKB_GSO_ tunnel option.
      SKB_GSO_DODGY is set from external virtio_net. We need to reset network
      header when callbacks.gso_segment() returns NULL.
      
      This patch also includes ipv6_gso_segment(), considering SIT, etc.
      
      Fixes: cb32f511 ("ipip: add GSO/TSO support")
      Signed-off-by: NTao Liu <thomas.liu@ucloud.cn>
      Reviewed-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cc20cced
  20. 20 2月, 2022 3 次提交