提交 · 42f0c1934c7cb3e94c2fe8f5771245fa5631d0e7 · openeuler / Kernel

03 3月, 2022 1 次提交

由 Tao Chen 提交于 3月 01, 2022

Last tcp_write_queue_head() use was removed in commit
114f39fe ("tcp: restore autocorking"), so remove it.
Signed-off-by: NTao Chen <chentao3@hotmail.com>
Link: https://lore.kernel.org/r/SYZP282MB33317DEE1253B37C0F57231E86029@SYZP282MB3331.AUSP282.PROD.OUTLOOK.COMSigned-off-by: NJakub Kicinski <kuba@kernel.org>

42f0c193

25 2月, 2022 1 次提交

net/tcp: Merge TCP-MD5 inbound callbacks · 7bbb765b

由 Dmitry Safonov 提交于 2月 23, 2022

The functions do essentially the same work to verify TCP-MD5 sign.
Code can be merged into one family-independent function in order to
reduce copy'n'paste and generated code.
Later with TCP-AO option added, this will allow to create one function
that's responsible for segment verification, that will have all the
different checks for MD5/AO/non-signed packets, which in turn will help
to see checks for all corner-cases in one function, rather than spread
around different families and functions.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: NDmitry Safonov <dima@arista.com>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220223175740.452397-1-dima@arista.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

7bbb765b

20 2月, 2022 1 次提交

net: tcp: add skb drop reasons to tcp_add_backlog() · 7a26dc9e

由 Menglong Dong 提交于 2月 20, 2022

Pass the address of drop_reason to tcp_add_backlog() to store the
reasons for skb drops when fails. Following drop reasons are
introduced:

SKB_DROP_REASON_SOCKET_BACKLOG
Reviewed-by: NMengen Sun <mengensun@tencent.com>
Reviewed-by: NHao Peng <flyingpeng@tencent.com>
Signed-off-by: NMenglong Dong <imagedong@tencent.com>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7a26dc9e

02 2月, 2022 1 次提交

tcp: Use BPF timeout setting for SYN ACK RTO · 5903123f

由 Akhmat Karakotov 提交于 1月 28, 2022

When setting RTO through BPF program, some SYN ACK packets were unaffected
and continued to use TCP_TIMEOUT_INIT constant. This patch adds timeout
option to struct request_sock. Option is initialized with TCP_TIMEOUT_INIT
and is reassigned through BPF using tcp_timeout_init call. SYN ACK
retransmits now use newly added timeout option.
Signed-off-by: NAkhmat Karakotov <hmukos@yandex-team.ru>
Acked-by: NMartin KaFai Lau <kafai@fb.com>

v2:
	- Add timeout option to struct request_sock. Do not call
	  tcp_timeout_init on every syn ack retransmit.

v3:
	- Use unsigned long for min. Bound tcp_timeout_init to TCP_RTO_MAX.

v4:
	- Refactor duplicate code by adding reqsk_timeout function.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5903123f

21 1月, 2022 1 次提交

tcp: Add a stub for sk_defer_free_flush() · 48cec899

由 Gal Pressman 提交于 1月 20, 2022

When compiling the kernel with CONFIG_INET disabled, the
sk_defer_free_flush() should be defined as a nop.

This resolves the following compilation error:
  ld: net/core/sock.o: in function `sk_defer_free_flush':
  ./include/net/tcp.h:1378: undefined reference to `__sk_defer_free_flush'

Fixes: 79074a72 ("net: Flush deferred skb free on socket destroy")
Reported-by: Nkernel test robot <lkp@intel.com>
Reviewed-by: NTariq Toukan <tariqt@nvidia.com>
Signed-off-by: NGal Pressman <gal@nvidia.com>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220120123440.9088-1-gal@nvidia.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

48cec899

16 11月, 2021 2 次提交

tcp: defer skb freeing after socket lock is released · f35f8219

由 Eric Dumazet 提交于 11月 15, 2021

tcp recvmsg() (or rx zerocopy) spends a fair amount of time
freeing skbs after their payload has been consumed.

A typical ~64KB GRO packet has to release ~45 page
references, eventually going to page allocator
for each of them.

Currently, this freeing is performed while socket lock
is held, meaning that there is a high chance that
BH handler has to queue incoming packets to tcp socket backlog.

This can cause additional latencies, because the user
thread has to process the backlog at release_sock() time,
and while doing so, additional frames can be added
by BH handler.

This patch adds logic to defer these frees after socket
lock is released, or directly from BH handler if possible.

Being able to free these skbs from BH handler helps a lot,
because this avoids the usual alloc/free assymetry,
when BH handler and user thread do not run on same cpu or
NUMA node.

One cpu can now be fully utilized for the kernel->user copy,
and another cpu is handling BH processing and skb/page
allocs/frees (assuming RFS is not forcing use of a single CPU)

Tested:
 100Gbit NIC
 Max throughput for one TCP_STREAM flow, over 10 runs

MTU : 1500
Before: 55 Gbit
After:  66 Gbit

MTU : 4096+(headers)
Before: 82 Gbit
After:  95 Gbit
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f35f8219

tcp: annotate data-races on tp->segs_in and tp->data_segs_in · 0307a0b7

由 Eric Dumazet 提交于 11月 15, 2021

tcp_segs_in() can be called from BH, while socket spinlock
is held but socket owned by user, eventually reading these
fields from tcp_get_info()

Found by code inspection, no need to backport this patch
to older kernels.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0307a0b7

03 11月, 2021 1 次提交

net: avoid double accounting for pure zerocopy skbs · 9b65b17d

由 Talal Ahmad 提交于 11月 02, 2021

Track skbs containing only zerocopy data and avoid charging them to
kernel memory to correctly account the memory utilization for
msg_zerocopy. All of the data in such skbs is held in user pages which
are already accounted to user. Before this change, they are charged
again in kernel in __zerocopy_sg_from_iter. The charging in kernel is
excessive because data is not being copied into skb frags. This
excessive charging can lead to kernel going into memory pressure
state which impacts all sockets in the system adversely. Mark pure
zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
charge/uncharge for data in such skbs.

Initially, an skb is marked pure zerocopy when it is empty and in
zerocopy path. skb can then change from a pure zerocopy skb to mixed
data skb (zerocopy and copy data) if it is at tail of write queue and
there is room available in it and non-zerocopy data is being sent in
the next sendmsg call. At this time sk_mem_charge is done for the pure
zerocopied data and the pure zerocopy flag is unmarked. We found that
this happens very rarely on workloads that pass MSG_ZEROCOPY.

A pure zerocopy skb can later be coalesced into normal skb if they are
next to each other in queue but this patch prevents coalescing from
happening. This avoids complexity of charging when skb downgrades from
pure zerocopy to mixed. This is also rare.

In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
for SKB_TRUESIZE(skb_end_offset(skb)) is done for sk_mem_charge in
tcp_skb_entail for an skb without data.

Testing with the msg_zerocopy.c benchmark between two hosts(100G nics)
with zerocopy showed that before this patch the 'sock' variable in
memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
change it is 0. This is due to no charge to sk_forward_alloc for
zerocopy data and shows memory utilization for kernel is lowered.

With this commit we don't see the warning we saw in previous commit
which resulted in commit 84882cf7.
Signed-off-by: NTalal Ahmad <talalahmad@google.com>
Acked-by: NArjun Roy <arjunroy@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9b65b17d

02 11月, 2021 3 次提交

Revert "net: avoid double accounting for pure zerocopy skbs" · 84882cf7

由 Jakub Kicinski 提交于 11月 01, 2021

This reverts commit f1a456f8.

  WARNING: CPU: 1 PID: 6819 at net/core/skbuff.c:5429 skb_try_coalesce+0x78b/0x7e0
  CPU: 1 PID: 6819 Comm: xxxxxxx Kdump: loaded Tainted: G S                5.15.0-04194-gd852503f7711 #16
  RIP: 0010:skb_try_coalesce+0x78b/0x7e0
  Code: e8 2a bf 41 ff 44 8b b3 bc 00 00 00 48 8b 7c 24 30 e8 19 c0 41 ff 44 89 f0 48 03 83 c0 00 00 00 48 89 44 24 40 e9 47 fb ff ff <0f> 0b e9 ca fc ff ff 4c 8d 70 ff 48 83 c0 07 48 89 44 24 38 e9 61
  RSP: 0018:ffff88881f449688 EFLAGS: 00010282
  RAX: 00000000fffffe96 RBX: ffff8881566e4460 RCX: ffffffff82079f7e
  RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffff8881566e47b0
  RBP: ffff8881566e46e0 R08: ffffed102619235d R09: ffffed102619235d
  R10: ffff888130c91ae3 R11: ffffed102619235c R12: ffff88881f4498a0
  R13: 0000000000000056 R14: 0000000000000009 R15: ffff888130c91ac0
  FS:  00007fec2cbb9700(0000) GS:ffff88881f440000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fec1b060d80 CR3: 00000003acf94005 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <IRQ>
   tcp_try_coalesce+0xeb/0x290
   ? tcp_parse_options+0x610/0x610
   ? mark_held_locks+0x79/0xa0
   tcp_queue_rcv+0x69/0x2f0
   tcp_rcv_established+0xa49/0xd40
   ? tcp_data_queue+0x18a0/0x18a0
   tcp_v6_do_rcv+0x1c9/0x880
   ? rt6_mtu_change_route+0x100/0x100
   tcp_v6_rcv+0x1624/0x1830
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

84882cf7

net: avoid double accounting for pure zerocopy skbs · f1a456f8

由 Talal Ahmad 提交于 10月 29, 2021

Track skbs with only zerocopy data and avoid charging them to kernel
memory to correctly account the memory utilization for msg_zerocopy.
All of the data in such skbs is held in user pages which are already
accounted to user. Before this change, they are charged again in
kernel in __zerocopy_sg_from_iter. The charging in kernel is
excessive because data is not being copied into skb frags. This
excessive charging can lead to kernel going into memory pressure
state which impacts all sockets in the system adversely. Mark pure
zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
charge/uncharge for data in such skbs.

Initially, an skb is marked pure zerocopy when it is empty and in
zerocopy path. skb can then change from a pure zerocopy skb to mixed
data skb (zerocopy and copy data) if it is at tail of write queue and
there is room available in it and non-zerocopy data is being sent in
the next sendmsg call. At this time sk_mem_charge is done for the pure
zerocopied data and the pure zerocopy flag is unmarked. We found that
this happens very rarely on workloads that pass MSG_ZEROCOPY.

A pure zerocopy skb can later be coalesced into normal skb if they are
next to each other in queue but this patch prevents coalescing from
happening. This avoids complexity of charging when skb downgrades from
pure zerocopy to mixed. This is also rare.

In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
for SKB_TRUESIZE(MAX_TCP_HEADER) is done for sk_mem_charge in
tcp_skb_entail for an skb without data.

Testing with the msg_zerocopy.c benchmark between two hosts(100G nics)
with zerocopy showed that before this patch the 'sock' variable in
memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
change it is 0. This is due to no charge to sk_forward_alloc for
zerocopy data and shows memory utilization for kernel is lowered.
Signed-off-by: NTalal Ahmad <talalahmad@google.com>
Acked-by: NArjun Roy <arjunroy@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: NWillem de Bruijn <willemb@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

f1a456f8

tcp: rename sk_wmem_free_skb · 03271f3a

由 Talal Ahmad 提交于 10月 29, 2021

sk_wmem_free_skb() is only used by TCP.

Rename it to make this clear, and move its declaration to
include/net/tcp.h
Signed-off-by: NTalal Ahmad <talalahmad@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Acked-by: NArjun Roy <arjunroy@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

03271f3a

28 10月, 2021 1 次提交

tcp: cleanup tcp_remove_empty_skb() use · 27728ba8

由 Eric Dumazet 提交于 10月 27, 2021

All tcp_remove_empty_skb() callers now use tcp_write_queue_tail()
for the skb argument, we can therefore factorize code.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

27728ba8

26 10月, 2021 1 次提交

tcp: rename sk_stream_alloc_skb · f8dd3b8d

由 Eric Dumazet 提交于 10月 25, 2021

sk_stream_alloc_skb() is only used by TCP.

Rename it to make this clear, and move its declaration
to include/net/tcp.h
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f8dd3b8d

15 10月, 2021 2 次提交

tcp: md5: Allow MD5SIG_FLAG_IFINDEX with ifindex=0 · a76c2315

由 Leonard Crestez 提交于 10月 15, 2021

Multiple VRFs are generally meant to be "separate" but right now md5
keys for the default VRF also affect connections inside VRFs if the IP
addresses happen to overlap.

So far the combination of TCP_MD5SIG_FLAG_IFINDEX with tcpm_ifindex == 0
was an error, accept this to mean "key only applies to default VRF".
This is what applications using VRFs for traffic separation want.
Signed-off-by: NLeonard Crestez <cdleonard@gmail.com>
Reviewed-by: NDavid Ahern <dsahern@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a76c2315

tcp: switch orphan_count to bare per-cpu counters · 19757ceb

由 Eric Dumazet 提交于 10月 14, 2021

Use of percpu_counter structure to track count of orphaned
sockets is causing problems on modern hosts with 256 cpus
or more.

Stefan Bach reported a serious spinlock contention in real workloads,
that I was able to reproduce with a netfilter rule dropping
incoming FIN packets.

    53.56%  server  [kernel.kallsyms]      [k] queued_spin_lock_slowpath
            |
            ---queued_spin_lock_slowpath
               |
                --53.51%--_raw_spin_lock_irqsave
                          |
                           --53.51%--__percpu_counter_sum
                                     tcp_check_oom
                                     |
                                     |--39.03%--__tcp_close
                                     |          tcp_close
                                     |          inet_release
                                     |          inet6_release
                                     |          sock_close
                                     |          __fput
                                     |          ____fput
                                     |          task_work_run
                                     |          exit_to_usermode_loop
                                     |          do_syscall_64
                                     |          entry_SYSCALL_64_after_hwframe
                                     |          __GI___libc_close
                                     |
                                      --14.48%--tcp_out_of_resources
                                                tcp_write_timeout
                                                tcp_retransmit_timer
                                                tcp_write_timer_handler
                                                tcp_write_timer
                                                call_timer_fn
                                                expire_timers
                                                __run_timers
                                                run_timer_softirq
                                                __softirqentry_text_start

As explained in commit cf86a086 ("net/dst: use a smaller percpu_counter
batch for dst entries accounting"), default batch size is too big
for the default value of tcp_max_orphans (262144).

But even if we reduce batch sizes, there would still be cases
where the estimated count of orphans is beyond the limit,
and where tcp_too_many_orphans() has to call the expensive
percpu_counter_sum_positive().

One solution is to use plain per-cpu counters, and have
a timer to periodically refresh this cache.

Updating this cache every 100ms seems about right, tcp pressure
state is not radically changing over shorter periods.

percpu_counter was nice 15 years ago while hosts had less
than 16 cpus, not anymore by current standards.

v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m,
    reported by kernel test robot <lkp@intel.com>
    Remove unused socket argument from tcp_too_many_orphans()

Fixes: dd24c001 ("net: Use a percpu_counter for orphan_count")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NStefan Bach <sfb@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

19757ceb

30 9月, 2021 1 次提交

tcp: adjust rcv_ssthresh according to sk_reserved_mem · 053f3684

由 Wei Wang 提交于 9月 29, 2021

When user sets SO_RESERVE_MEM socket option, in order to utilize the
reserved memory when in memory pressure state, we adjust rcv_ssthresh
according to the available reserved memory for the socket, instead of
using 4 * advmss always.
Signed-off-by: NWei Wang <weiwan@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

053f3684

24 9月, 2021 1 次提交

tcp: tracking packets with CE marks in BW rate sample · 40bc6063

由 Yuchung Cheng 提交于 9月 23, 2021

In order to track CE marks per rate sample (one round trip), TCP needs a
per-skb header field to record the tp->delivered_ce count when the skb
was sent. To make space, we replace the "last_in_flight" field which is
used exclusively for NV congestion control. The stat needed by NV can be
alternatively approximated by existing stats tcp_sock delivered and
mss_cache.

This patch counts the number of packets delivered which have CE marks in
the rate sample, using similar approach of delivery accounting.

Cc: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NLuke Hsiao <lukehsiao@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

40bc6063

23 9月, 2021 2 次提交

tcp: make tcp_build_frag() static · ff6fb083

由 Paolo Abeni 提交于 9月 22, 2021

After the previous patch the mentioned helper is
used only inside its compilation unit: let's make
it static.

RFC -> v1:
 - preserve the tcp_build_frag() helper (Eric)
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ff6fb083

tcp: expose the tcp_mark_push() and tcp_skb_entail() helpers · 04d8825c

由 Paolo Abeni 提交于 9月 22, 2021

the tcp_skb_entail() helper is actually skb_entail(), renamed
to provide proper scope.

    The two helper will be used by the next patch.

RFC -> v1:
 - rename skb_entail to tcp_skb_entail (Eric)
Acked-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

04d8825c

24 7月, 2021 1 次提交

bpf: tcp: seq_file: Remove bpf_seq_afinfo from tcp_iter_state · 62001372

由 Martin KaFai Lau 提交于 7月 01, 2021

A following patch will create a separate struct to store extra
bpf_iter state and it will embed the existing tcp_iter_state like this:
struct bpf_tcp_iter_state {
	struct tcp_iter_state state;
	/* More bpf_iter specific states here ... */
}

As a prep work, this patch removes the
"struct tcp_seq_afinfo *bpf_seq_afinfo" where its purpose is
to tell if it is iterating from bpf_iter instead of proc fs.
Currently, if "*bpf_seq_afinfo" is not NULL, it is iterating from
bpf_iter.  The kernel should not filter by the addr family and
leave this filtering decision to the bpf prog.

Instead of adding a "*bpf_seq_afinfo" pointer, this patch uses the
"seq->op == &bpf_iter_tcp_seq_ops" test to tell if it is iterating
from the bpf iter.

The bpf_iter_(init|fini)_tcp() is left here to prepare for
the change of a following patch.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NAndrii Nakryiko <andrii@kernel.org>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Acked-by: NKuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: NYonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210701200554.1034982-1-kafai@fb.com

62001372

20 7月, 2021 1 次提交

net/tcp_fastopen: remove obsolete extern · 74946876

由 Eric Dumazet 提交于 7月 19, 2021

After cited commit, sysctl_tcp_fastopen_blackhole_timeout is no longer
a global variable.

Fixes: 3733be14 ("ipv4: Namespaceify tcp_fastopen_blackhole_timeout knob")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: NWei Wang <weiwan@google.com>
Link: https://lore.kernel.org/r/20210719092028.3016745-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

74946876

02 7月, 2021 1 次提交

tcp: consistently disable header prediction for mptcp · 71158bb1

由 Paolo Abeni 提交于 6月 30, 2021

The MPTCP receive path is hooked only into the TCP slow-path.
The DSS presence allows plain MPTCP traffic to hit that
consistently.

Since commit e1ff9e82 ("net: mptcp: improve fallback to TCP"),
when an MPTCP socket falls back to TCP, it can hit the TCP receive
fast-path, and delay or stop triggering the event notification.

Address the issue explicitly disabling the header prediction
for MPTCP sockets.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/200
Fixes: e1ff9e82 ("net: mptcp: improve fallback to TCP")
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

71158bb1

05 6月, 2021 1 次提交

tcp: export timestamp helpers for mptcp · 892bfd3d

由 Florian Westphal 提交于 6月 03, 2021

MPTCP is builtin, so no need to add EXPORT_SYMBOL()s.

It will be used to support SO_TIMESTAMP(NS) ancillary
messages in the mptcp receive path.
Acked-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

892bfd3d

12 4月, 2021 1 次提交

skmsg: Pass psock pointer to ->psock_update_sk_prot() · 51e0158a

由 Cong Wang 提交于 4月 06, 2021

Using sk_psock() to retrieve psock pointer from sock requires
RCU read lock, but we already get psock pointer before calling
->psock_update_sk_prot() in both cases, so we can just pass it
without bothering sk_psock().

Fixes: 8a59f9d1 ("sock: Introduce sk->sk_prot->psock_update_sk_prot()")
Reported-by: syzbot+320a3bc8d80f478c37e4@syzkaller.appspotmail.com
Signed-off-by: NCong Wang <cong.wang@bytedance.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Tested-by: syzbot+320a3bc8d80f478c37e4@syzkaller.appspotmail.com
Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20210407032111.33398-1-xiyou.wangcong@gmail.com

51e0158a

03 4月, 2021 1 次提交

tcp: reorder tcp_congestion_ops for better cache locality · 82506665

由 Eric Dumazet 提交于 4月 02, 2021

Group all the often used fields in the first cache line,
to reduce cache line misses.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NStephen Hemminger <stephen@networkplumber.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

82506665

02 4月, 2021 2 次提交

skmsg: Extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data() · 2bc793e3

由 Cong Wang 提交于 3月 30, 2021

Although these two functions are only used by TCP, they are not
specific to TCP at all, both operate on skmsg and ingress_msg,
so fit in net/core/skmsg.c very well.

And we will need them for non-TCP, so rename and move them to
skmsg.c and export them to modules.
Signed-off-by: NCong Wang <cong.wang@bytedance.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-13-xiyou.wangcong@gmail.com

2bc793e3

sock: Introduce sk->sk_prot->psock_update_sk_prot() · 8a59f9d1

由 Cong Wang 提交于 3月 30, 2021

Currently sockmap calls into each protocol to update the struct
proto and replace it. This certainly won't work when the protocol
is implemented as a module, for example, AF_UNIX.

Introduce a new ops sk->sk_prot->psock_update_sk_prot(), so each
protocol can implement its own way to replace the struct proto.
This also helps get rid of symbol dependencies on CONFIG_INET.
Signed-off-by: NCong Wang <cong.wang@bytedance.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210331023237.41094-11-xiyou.wangcong@gmail.com

8a59f9d1

27 2月, 2021 3 次提交

skmsg: Move sk_redir from TCP_SKB_CB to skb · e3526bb9

由 Cong Wang 提交于 2月 23, 2021

Currently TCP_SKB_CB() is hard-coded in skmsg code, it certainly
does not work for any other non-TCP protocols. We can move them to
skb ext, but it introduces a memory allocation on fast path.

Fortunately, we only need to a word-size to store all the information,
because the flags actually only contains 1 bit so can be just packed
into the lowest bit of the "pointer", which is stored as unsigned
long.

Inside struct sk_buff, '_skb_refdst' can be reused because skb dst is
no longer needed after ->sk_data_ready() so we can just drop it.
Signed-off-by: NCong Wang <cong.wang@bytedance.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NJakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210223184934.6054-5-xiyou.wangcong@gmail.com

e3526bb9

bpf: Compute data_end dynamically with JIT code · 16137b09

由 Cong Wang 提交于 2月 23, 2021

Currently, we compute ->data_end with a compile-time constant
offset of skb. But as Jakub pointed out, we can actually compute
it in eBPF JIT code at run-time, so that we can competely get
rid of ->data_end. This is similar to skb_shinfo(skb) computation
in bpf_convert_shinfo_access().
Suggested-by: NJakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: NCong Wang <cong.wang@bytedance.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NJakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210223184934.6054-4-xiyou.wangcong@gmail.com

16137b09

bpf: Clean up sockmap related Kconfigs · 88759609

由 Cong Wang 提交于 2月 23, 2021

As suggested by John, clean up sockmap related Kconfigs:

Reduce the scope of CONFIG_BPF_STREAM_PARSER down to TCP stream
parser, to reflect its name.

Make the rest sockmap code simply depend on CONFIG_BPF_SYSCALL
and CONFIG_INET, the latter is still needed at this point because
of TCP/UDP proto update. And leave CONFIG_NET_SOCK_MSG untouched,
as it is used by non-sockmap cases.
Signed-off-by: NCong Wang <cong.wang@bytedance.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Reviewed-by: NLorenz Bauer <lmb@cloudflare.com>
Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
Acked-by: NJakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20210223184934.6054-2-xiyou.wangcong@gmail.com

88759609

13 2月, 2021 2 次提交

tcp: factorize logic into tcp_epollin_ready() · 05dc72ab

由 Eric Dumazet 提交于 2月 12, 2021

Both tcp_data_ready() and tcp_stream_is_readable() share the same logic.

Add tcp_epollin_ready() helper to avoid duplication.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Arjun Roy <arjunroy@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

05dc72ab

tcp: fix SO_RCVLOWAT related hangs under mem pressure · f969dc5a

由 Eric Dumazet 提交于 2月 12, 2021

While commit 24adbc16 ("tcp: fix SO_RCVLOWAT hangs with fat skbs")
fixed an issue vs too small sk_rcvbuf for given sk_rcvlowat constraint,
it missed to address issue caused by memory pressure.

1) If we are under memory pressure and socket receive queue is empty.
First incoming packet is allowed to be queued, after commit
76dfa608 ("tcp: allow one skb to be received per socket under memory pressure")

But we do not send EPOLLIN yet, in case tcp_data_ready() sees sk_rcvlowat
is bigger than skb length.

2) Then, when next packet comes, it is dropped, and we directly
call sk->sk_data_ready().

3) If application is using poll(), tcp_poll() will then use
tcp_stream_is_readable() and decide the socket receive queue is
not yet filled, so nothing will happen.

Even when sender retransmits packets, phases 2) & 3) repeat
and flow is effectively frozen, until memory pressure is off.

Fix is to consider tcp_under_memory_pressure() to take care
of global memory pressure or memcg pressure.

Fixes: 24adbc16 ("tcp: fix SO_RCVLOWAT hangs with fat skbs")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NArjun Roy <arjunroy@google.com>
Suggested-by: NWei Wang <weiwan@google.com>
Reviewed-by: NWei Wang <weiwan@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f969dc5a

24 1月, 2021 2 次提交

tcp: fix TLP timer not set when CA_STATE changes from DISORDER to OPEN · 62d9f1a6

由 Pengcheng Yang 提交于 1月 24, 2021

Upon receiving a cumulative ACK that changes the congestion state from
Disorder to Open, the TLP timer is not set. If the sender is app-limited,
it can only wait for the RTO timer to expire and retransmit.

The reason for this is that the TLP timer is set before the congestion
state changes in tcp_ack(), so we delay the time point of calling
tcp_set_xmit_timer() until after tcp_fastretrans_alert() returns and
remove the FLAG_SET_XMIT_TIMER from ack_flag when the RACK reorder timer
is set.

This commit has two additional benefits:
1) Make sure to reset RTO according to RFC6298 when receiving ACK, to
avoid spurious RTO caused by RTO timer early expires.
2) Reduce the xmit timer reschedule once per ACK when the RACK reorder
timer is set.

Fixes: df92c839 ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
Link: https://lore.kernel.org/netdev/1611311242-6675-1-git-send-email-yangpc@wangsu.comSigned-off-by: NPengcheng Yang <yangpc@wangsu.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Acked-by: NYuchung Cheng <ycheng@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/1611464834-23030-1-git-send-email-yangpc@wangsu.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

62d9f1a6

tcp: make TCP_USER_TIMEOUT accurate for zero window probes · 344db93a

由 Enke Chen 提交于 1月 22, 2021

The TCP_USER_TIMEOUT is checked by the 0-window probe timer. As the
timer has backoff with a max interval of about two minutes, the
actual timeout for TCP_USER_TIMEOUT can be off by up to two minutes.

In this patch the TCP_USER_TIMEOUT is made more accurate by taking it
into account when computing the timer value for the 0-window probes.

This patch is similar to and builds on top of the one that made
TCP_USER_TIMEOUT accurate for RTOs in commit b701a99e ("tcp: Add
tcp_clamp_rto_to_user_timeout() helper to improve accuracy").

Fixes: 9721e709 ("tcp: simplify window probe aborting on USER_TIMEOUT")
Signed-off-by: NEnke Chen <enchen@paloaltonetworks.com>
Reviewed-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20210122191306.GA99540@localhost.localdomainSigned-off-by: NJakub Kicinski <kuba@kernel.org>

344db93a

21 1月, 2021 1 次提交

bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE · 9cacf81f

由 Stanislav Fomichev 提交于 1月 15, 2021

Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
call in do_tcp_getsockopt using the on-stack data. This removes
3% overhead for locking/unlocking the socket.

Without this patch:
     3.38%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
            |
             --3.30%--__cgroup_bpf_run_filter_getsockopt
                       |
                        --0.81%--__kmalloc

With the patch applied:
     0.52%     0.12%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt_kern

Note, exporting uapi/tcp.h requires removing netinet/tcp.h
from test_progs.h because those headers have confliciting
definitions.
Signed-off-by: NStanislav Fomichev <sdf@google.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210115163501.805133-2-sdf@google.com

9cacf81f

15 12月, 2020 2 次提交

net: Limit logical shift left of TCP probe0 timeout · 6d4634d1

由 Cambda Zhu 提交于 12月 08, 2020

For each TCP zero window probe, the icsk_backoff is increased by one and
its max value is tcp_retries2. If tcp_retries2 is greater than 63, the
probe0 timeout shift may exceed its max bits. On x86_64/ARMv8/MIPS, the
shift count would be masked to range 0 to 63. And on ARMv7 the result is
zero. If the shift count is masked, only several probes will be sent
with timeout shorter than TCP_RTO_MAX. But if the timeout is zero, it
needs tcp_retries2 times probes to end this false timeout. Besides,
bitwise shift greater than or equal to the width is an undefined
behavior.

This patch adds a limit to the backoff. The max value of max_when is
TCP_RTO_MAX and the min value of timeout base is TCP_RTO_MIN. The limit
is the backoff from TCP_RTO_MIN to TCP_RTO_MAX.
Signed-off-by: NCambda Zhu <cambda@linux.alibaba.com>
Link: https://lore.kernel.org/r/20201208091910.37618-1-cambda@linux.alibaba.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

6d4634d1

tcp: parse mptcp options contained in reset packets · 049fe386

由 Florian Westphal 提交于 12月 10, 2020

Because TCP-level resets only affect the subflow, there is a MPTCP
option to indicate that the MPTCP-level connection should be closed
immediately without a mptcp-level fin exchange.

This is the 'MPTCP fast close option'.  It can be carried on ack
segments or TCP resets.  In the latter case, its needed to parse mptcp
options also for reset packets so that MPTCP can act accordingly.

Next patch will add receive side fastclose support in MPTCP.
Acked-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

049fe386

04 12月, 2020 2 次提交

bpf: Adds support for setting window clamp · cb811109

由 Prankur gupta 提交于 12月 02, 2020

Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_WINDOW_CLAMP,
which sets the maximum receiver window size. It will be useful for
limiting receiver window based on RTT.
Signed-off-by: NPrankur gupta <prankgup@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201202213152.435886-2-prankgup@fb.com

cb811109

tcp: merge 'init_req' and 'route_req' functions · 7ea851d1

由 Florian Westphal 提交于 11月 30, 2020

The Multipath-TCP standard (RFC 8684) says that an MPTCP host should send
a TCP reset if the token in a MP_JOIN request is unknown.

At this time we don't do this, the 3whs completes and the 'new subflow'
is reset afterwards.  There are two ways to allow MPTCP to send the
reset.

1. override 'send_synack' callback and emit the rst from there.
   The drawback is that the request socket gets inserted into the
   listeners queue just to get removed again right away.

2. Send the reset from the 'route_req' function instead.
   This avoids the 'add&remove request socket', but route_req lacks the
   skb that is required to send the TCP reset.

Instead of just adding the skb to that function for MPTCP sake alone,
Paolo suggested to merge init_req and route_req functions.

This saves one indirection from syn processing path and provides the skb
to the merged function at the same time.

'send reset on unknown mptcp join token' is added in next patch.
Suggested-by: NPaolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

7ea851d1

17 11月, 2020 1 次提交

tcp: factor out __tcp_close() helper · 77c3c956

由 Paolo Abeni 提交于 11月 16, 2020

unlocked version of protocol level close, will be used by
MPTCP to allow decouple orphaning and subflow level close.
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>

77c3c956

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功