1. 11 Jun 2022 (2 commits)
  2. 31 May 2022 (1 commit)
  3. 20 Apr 2022 (1 commit)
  4. 17 Apr 2022 (1 commit)
  5. 10 Mar 2022 (1 commit)
    • tcp: adjust TSO packet sizes based on min_rtt · 65466904
      Authored by Eric Dumazet
      Back when tcp_tso_autosize() and TCP pacing were introduced,
      our focus was really to reduce burst sizes for long distance
      flows.
      
      The simple heuristic of using sk_pacing_rate/1024 has worked
      well, but can lead to too small packets for hosts in the same
      rack/cluster, when thousands of flows compete for the bottleneck.
      
      Neal Cardwell had the idea of making the TSO burst size
      a function of both sk_pacing_rate and tcp_min_rtt().
      
      Indeed, for local flows, sending bigger bursts is better
      to reduce cpu costs, as occasional losses can be repaired
      quite fast.
      
      This patch is based on Neal Cardwell's implementation
      done more than two years ago.
      bbr adjusts max_pacing_rate based on measured bandwidth,
      while cubic would overestimate max_pacing_rate.
      
      /proc/sys/net/ipv4/tcp_tso_rtt_log can be used to tune or disable
      this new feature, in logarithmic steps.
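      
      For illustration, the knob can be tuned like any other TCP sysctl;
      this sketch assumes the default of 9 used in the test below:
      
      ~# cat /proc/sys/net/ipv4/tcp_tso_rtt_log
      9
      ~# echo 0 >/proc/sys/net/ipv4/tcp_tso_rtt_log   # disable the min_rtt term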
      
      Tested:
      
      100Gbit NIC, two hosts in the same rack, 4K MTU.
      600 flows rate-limited to 20000000 bytes per second.
      
      Before patch: (TSO sizes would be limited to 20000000/1024/4096 -> 4 segments per TSO)
      
      ~# echo 0 >/proc/sys/net/ipv4/tcp_tso_rtt_log
      ~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
        96005
      
       Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':
      
               65,945.29 msec task-clock                #    2.845 CPUs utilized
               1,314,632      context-switches          # 19935.279 M/sec
                   5,292      cpu-migrations            #   80.249 M/sec
                 940,641      page-faults               # 14264.023 M/sec
         201,117,030,926      cycles                    # 3049769.216 GHz                   (83.45%)
          17,699,435,405      stalled-cycles-frontend   #    8.80% frontend cycles idle     (83.48%)
         136,584,015,071      stalled-cycles-backend    #   67.91% backend cycles idle      (83.44%)
          53,809,530,436      instructions              #    0.27  insn per cycle
                                                        #    2.54  stalled cycles per insn  (83.36%)
           9,062,315,523      branches                  # 137422329.563 M/sec               (83.22%)
             153,008,621      branch-misses             #    1.69% of all branches          (83.32%)
      
            23.182970846 seconds time elapsed
      
      TcpInSegs                       15648792           0.0
      TcpOutSegs                      58659110           0.0  # Average of 3.7 4K segments per TSO packet
      TcpExtTCPDelivered              58654791           0.0
      TcpExtTCPDeliveredCE            19                 0.0
      
      After patch:
      
      ~# echo 9 >/proc/sys/net/ipv4/tcp_tso_rtt_log
      ~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
        96046
      
       Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':
      
               48,982.58 msec task-clock                #    2.104 CPUs utilized
                 186,014      context-switches          # 3797.599 M/sec
                   3,109      cpu-migrations            #   63.472 M/sec
                 941,180      page-faults               # 19214.814 M/sec
         153,459,763,868      cycles                    # 3132982.807 GHz                   (83.56%)
          12,069,861,356      stalled-cycles-frontend   #    7.87% frontend cycles idle     (83.32%)
         120,485,917,953      stalled-cycles-backend    #   78.51% backend cycles idle      (83.24%)
          36,803,672,106      instructions              #    0.24  insn per cycle
                                                        #    3.27  stalled cycles per insn  (83.18%)
           5,947,266,275      branches                  # 121417383.427 M/sec               (83.64%)
              87,984,616      branch-misses             #    1.48% of all branches          (83.43%)
      
            23.281200256 seconds time elapsed
      
      TcpInSegs                       1434706            0.0
      TcpOutSegs                      58883378           0.0  # Average of 41 4K segments per TSO packet
      TcpExtTCPDelivered              58878971           0.0
      TcpExtTCPDeliveredCE            9664               0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Neal Cardwell <ncardwell@google.com>
      Link: https://lore.kernel.org/r/20220309015757.2532973-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  6. 30 Dec 2021 (1 commit)
  7. 05 Nov 2021 (1 commit)
    • net: udp: correct the document for udp_mem · 69dfccbc
      Authored by Menglong Dong
      udp_mem is a vector of 3 INTEGERs, which is used to limit the number of
      pages allowed for queueing by all UDP sockets.
      
      However, sk_has_memory_pressure() in __sk_mem_raise_allocated() always
      returns false for UDP, as memory pressure is not supported by UDP. This
      means that __sk_mem_raise_allocated() will fail once the pages allocated
      for a UDP socket exceed udp_mem[0].
      
      Therefore, udp_mem[0] is the only value that limits the number of pages.
      However, the documentation for udp_mem states that udp_mem[2] is the
      limit. So, just fix it.
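      
      For illustration, the vector can be inspected with sysctl (the values
      below are hypothetical; they are auto-scaled at boot). Per the fix
      above, the first value is the one that effectively caps allocations:
      
      $ sysctl net.ipv4.udp_mem
      net.ipv4.udp_mem = 764304 1019074 1528608   # min, pressure, max (pages)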
      Signed-off-by: Menglong Dong <imagedong@tencent.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 02 Nov 2021 (2 commits)
    • net: ndisc: introduce ndisc_evict_nocarrier sysctl parameter · 18ac597a
      Authored by James Prestwood
      In most situations the neighbor discovery cache should be cleared on a
      NOCARRIER event, which is currently done unconditionally. But for wireless
      roams the neighbor discovery cache can and should remain intact since
      the underlying network has not changed.
      
      This patch introduces a sysctl option, ndisc_evict_nocarrier, which can
      be disabled by a wireless supplicant during a roam. This allows packets
      to be sent immediately after a roam, without having to wait for
      neighbor discovery.
      
      A user reported roughly a 1 second delay after a roam before packets
      could be sent out (note, on IPv4). This delay was due to the ARP
      cache being cleared. During testing of this same scenario using IPv6
      no delay was noticed, but regardless there is no reason to clear
      the ndisc cache for wireless roams.
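      
      A minimal usage sketch, assuming a supplicant managing a hypothetical
      wlan0 interface:
      
      # keep the IPv6 neighbor cache across a roam; restore with 1 afterwards
      sysctl -w net.ipv6.conf.wlan0.ndisc_evict_nocarrier=0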
      Signed-off-by: James Prestwood <prestwoj@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: arp: introduce arp_evict_nocarrier sysctl parameter · fcdb44d0
      Authored by James Prestwood
      This change introduces a new sysctl parameter, arp_evict_nocarrier.
      When set (default) the ARP cache will be cleared on a NOCARRIER event.
      This new option defaults to '1', which maintains the existing
      behavior.
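      
      A minimal usage sketch, assuming a hypothetical wlan0 interface (see
      the reproduction steps below for background):
      
      # keep the ARP cache across a wireless roam; '1' restores the default
      sysctl -w net.ipv4.conf.wlan0.arp_evict_nocarrier=0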
      
      Clearing the ARP cache on NOCARRIER is relatively new, introduced by:
      
      commit 859bd2ef
      Author: David Ahern <dsahern@gmail.com>
      Date:   Thu Oct 11 20:33:49 2018 -0700
      
          net: Evict neighbor entries on carrier down
      
      The reason for this change is to prevent the ARP cache from being
      cleared when a wireless device roams. Specifically, for wireless roams
      the ARP cache should not be cleared because the underlying network has not
      changed. Clearing the ARP cache in this case can introduce significant
      delays in sending packets after a roam.
      
      A user reported such a situation here:
      
      https://lore.kernel.org/linux-wireless/CACsRnHWa47zpx3D1oDq9JYnZWniS8yBwW1h0WAVZ6vrbwL_S0w@mail.gmail.com/
      
      After some investigation it was found that the kernel was holding onto
      packets until ARP finished, which resulted in this 1 second delay. It
      was also found that the first ARP who-has was never responded to,
      which is actually what causes the delay. This change is more or less
      working around this behavior, but again, there is no reason to clear
      the cache on a roam anyway.
      
      As for the unanswered who-has, we know the packet made it OTA since
      it was seen while monitoring. Why it never received a response is
      unknown. In any case, since this is a problem on the AP side of things,
      all that can be done is to work around it until it is solved.
      
      Some background on testing/reproducing the packet delay:
      
      Hardware:
       - 2 access points configured for Fast BSS Transition (Though I don't
         see why regular reassociation wouldn't have the same behavior)
       - Wireless station running IWD as supplicant
       - A device on network able to respond to pings (I used one of the APs)
      
      Procedure:
       - Connect to first AP
       - Ping once to establish an ARP entry
       - Start a tcpdump
       - Roam to second AP
       - Wait for operstate UP event, and note the timestamp
       - Start pinging
      
      Results:
      
      Below is the tcpdump after UP. It was recorded that the interface went
      UP at 10:42:01.432875.
      
      10:42:01.461871 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28
      10:42:02.497976 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28
      10:42:02.507162 ARP, Reply 192.168.254.1 is-at ac:86:74:55:b0:20, length 46
      10:42:02.507185 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 1, length 64
      10:42:02.507205 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 2, length 64
      10:42:02.507212 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 3, length 64
      10:42:02.507219 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 4, length 64
      10:42:02.507225 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 5, length 64
      10:42:02.507232 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 6, length 64
      10:42:02.515373 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 1, length 64
      10:42:02.521399 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 2, length 64
      10:42:02.521612 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 3, length 64
      10:42:02.521941 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 4, length 64
      10:42:02.522419 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 5, length 64
      10:42:02.523085 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 6, length 64
      
      You can see the first ARP who-has went out very quickly after UP, but
      was never responded to. Nearly a second later the kernel retries and
      gets a response. Only then do the ping packets go out. If an ARP entry
      is manually added prior to UP (after the cache is cleared), the first
      ping is never responded to, so it's not only an issue with ARP but
      with data packets in general.
      
      As mentioned before, the wireless interface was also monitored to verify
      that the ping/ARP packets made it OTA, which was observed to be true.
      Signed-off-by: James Prestwood <prestwoj@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  9. 23 Sep 2021 (1 commit)
  10. 22 Jul 2021 (1 commit)
  11. 21 Jul 2021 (1 commit)
    • ipv6: ioam: Documentation for new IOAM sysctls · de8e80a5
      Authored by Justin Iurman
      Add documentation for new IOAM sysctls:
       - ioam6_id and ioam6_id_wide: two per-namespace sysctls
       - ioam6_enabled, ioam6_id and ioam6_id_wide: three per-interface sysctls
      
      Example of IOAM configuration based on the following simple topology:
      
       _____              _____              _____
      |     | eth0  eth0 |     | eth1  eth0 |     |
      |  A  |.----------.|  B  |.----------.|  C  |
      |_____|            |_____|            |_____|
      
      1) Node and interface IDs can be configured for IOAM:
      
        # IOAM ID of A = 1, IOAM ID of A.eth0 = 11
        (A) sysctl -w net.ipv6.ioam6_id=1
        (A) sysctl -w net.ipv6.conf.eth0.ioam6_id=11
      
        # IOAM ID of B = 2, IOAM ID of B.eth0 = 21, IOAM ID of B.eth1 = 22
        (B) sysctl -w net.ipv6.ioam6_id=2
        (B) sysctl -w net.ipv6.conf.eth0.ioam6_id=21
        (B) sysctl -w net.ipv6.conf.eth1.ioam6_id=22
      
        # IOAM ID of C = 3, IOAM ID of C.eth0 = 31
        (C) sysctl -w net.ipv6.ioam6_id=3
        (C) sysctl -w net.ipv6.conf.eth0.ioam6_id=31
      
        Note that the "_wide" ID equivalents can be configured the same way.
      
      2) Each node can be configured to form an IOAM domain. For instance,
         we allow IOAM from A to C only (not the reverse path), i.e. enable
         IOAM on ingress for B.eth0 and C.eth0:
      
        (B) sysctl -w net.ipv6.conf.eth0.ioam6_enabled=1
        (C) sysctl -w net.ipv6.conf.eth0.ioam6_enabled=1
      
      3) An IOAM domain (e.g. ID=123) is defined and made known to each node:
      
        (A) ip ioam namespace add 123
        (B) ip ioam namespace add 123
        (C) ip ioam namespace add 123
      
      4) Finally, an IOAM Pre-allocated Trace can be inserted in traffic sent
         by A when C (e.g. db02::2) is the destination:
      
        (A) ip -6 route add db02::2/128 encap ioam6 trace type 0x800000 ns 123
            size 12 dev eth0
      Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 25 Jun 2021 (1 commit)
  13. 23 Jun 2021 (1 commit)
  14. 16 Jun 2021 (1 commit)
  15. 19 May 2021 (4 commits)
    • ipv6: Add custom multipath hash policy · 73c2c5cb
      Authored by Ido Schimmel
      Add a new multipath hash policy where the packet fields used for hash
      calculation are determined by user space via the
      fib_multipath_hash_fields sysctl that was introduced in the previous
      patch.
      
      The current set of available packet fields includes both outer and inner
      fields, which requires two invocations of the flow dissector. Avoid
      unnecessary dissection of the outer or inner flows by skipping
      dissection if none of the outer or inner fields are required.
      
      In accordance with the existing policies, when an skb is not available,
      packet fields are extracted from the provided flow key, in which case
      only outer fields are considered.
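      
      A configuration sketch, assuming the custom policy is selected with
      value 3 of fib_multipath_hash_policy, as introduced by this series:
      
      # choose the fields first, then switch to the custom policy
      sysctl -w net.ipv6.fib_multipath_hash_fields=0x0037
      sysctl -w net.ipv6.fib_multipath_hash_policy=3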
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Add a sysctl to control multipath hash fields · ed13923f
      Authored by Ido Schimmel
      A subsequent patch will add a new multipath hash policy where the packet
      fields used for multipath hash calculation are determined by user space.
      This patch adds a sysctl that allows user space to set these fields.
      
      The packet fields are represented using a bitmask and are common between
      IPv4 and IPv6 to allow user space to use the same numbering across both
      protocols. For example, to hash based on standard 5-tuple:
      
       # sysctl -w net.ipv6.fib_multipath_hash_fields=0x0037
       net.ipv6.fib_multipath_hash_fields = 0x0037
      
      To avoid introducing holes in 'struct netns_sysctl_ipv6', move the
      'bindv6only' field after the multipath hash fields.
      
      The kernel rejects unknown fields, for example:
      
       # sysctl -w net.ipv6.fib_multipath_hash_fields=0x1000
       sysctl: setting key "net.ipv6.fib_multipath_hash_fields": Invalid argument
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Add custom multipath hash policy · 4253b498
      Authored by Ido Schimmel
      Add a new multipath hash policy where the packet fields used for hash
      calculation are determined by user space via the
      fib_multipath_hash_fields sysctl that was introduced in the previous
      patch.
      
      The current set of available packet fields includes both outer and inner
      fields, which requires two invocations of the flow dissector. Avoid
      unnecessary dissection of the outer or inner flows by skipping
      dissection if none of the outer or inner fields are required.
      
      In accordance with the existing policies, when an skb is not available,
      packet fields are extracted from the provided flow key, in which case
      only outer fields are considered.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Add a sysctl to control multipath hash fields · ce5c9c20
      Authored by Ido Schimmel
      A subsequent patch will add a new multipath hash policy where the packet
      fields used for multipath hash calculation are determined by user space.
      This patch adds a sysctl that allows user space to set these fields.
      
      The packet fields are represented using a bitmask and are common between
      IPv4 and IPv6 to allow user space to use the same numbering across both
      protocols. For example, to hash based on standard 5-tuple:
      
       # sysctl -w net.ipv4.fib_multipath_hash_fields=0x0037
       net.ipv4.fib_multipath_hash_fields = 0x0037
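      
       For reference, a breakdown of the 0x0037 example, assuming the field
       bit assignments defined by this series:
      
       # 0x0037 = 0x0001 (src IP) | 0x0002 (dst IP) | 0x0004 (IP proto)
       #        | 0x0010 (src port) | 0x0020 (dst port)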
      
      The kernel rejects unknown fields, for example:
      
       # sysctl -w net.ipv4.fib_multipath_hash_fields=0x1000
       sysctl: setting key "net.ipv4.fib_multipath_hash_fields": Invalid argument
      
      More fields can be added in the future, if needed.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 15 Apr 2021 (1 commit)
  17. 02 Apr 2021 (1 commit)
    • net: document a side effect of ip_local_reserved_ports · a7a80b17
      Authored by Otto Hollmann
      If ip_local_port_range overlaps ip_local_reserved_ports with a huge reserved block, it skews the probability of selecting ephemeral ports; see net/ipv4/inet_hashtables.c:723:
      
          int __inet_hash_connect(
          ...
                  for (i = 0; i < remaining; i += 2, port += 2) {
                          if (unlikely(port >= high))
                                  port -= remaining;
                          if (inet_is_local_reserved_port(net, port))
                                  continue;
      
          E.g., if there is a reserved block of 10000 ports, the two ports right after this block will be 5000 times more likely to be selected than others.
          If this was intended, we can/should add a note to the documentation, as proposed in this commit; otherwise we should think about a different solution. One option could be a mapping table of contiguous port ranges. Another option could be letting the user modify the step (port += 2) in the loop above, e.g. via a new sysctl parameter.
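      
      A quick way to observe the described bias, using a hypothetical
      reserved block inside the ephemeral range:
      
      sysctl -w net.ipv4.ip_local_port_range="32000 60999"
      sysctl -w net.ipv4.ip_local_reserved_ports="40000-50000"
      # the first ports above 50000 will now be picked far more often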
      Signed-off-by: Otto Hollmann <otto.hollmann@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 31 Mar 2021 (1 commit)
  19. 12 Feb 2021 (1 commit)
  20. 10 Feb 2021 (1 commit)
  21. 09 Feb 2021 (2 commits)
  22. 03 Feb 2021 (2 commits)
    • net: ipv6: Emit notification when fib hardware flags are changed · 907eea48
      Authored by Amit Cohen
      After installing a route to the kernel, user space receives an
      acknowledgment, which means the route was installed in the kernel,
      but not necessarily in hardware.
      
      The asynchronous nature of route installation in hardware can lead
      to a routing daemon advertising a route before it was actually installed in
      hardware. This can result in packet loss or mis-routed packets until the
      route is installed in hardware.
      
      It is also possible for a route already installed in hardware to change
      its action and therefore its flags. For example, a host route that is
      trapping packets can be "promoted" to perform decapsulation following
      the installation of an IPinIP/VXLAN tunnel.
      
      Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
      are changed. The aim is to provide an indication to user-space
      (e.g., routing daemons) about the state of the route in hardware.
      
      Introduce a sysctl that controls this behavior.
      
      Keep the default value at 0 (i.e., do not emit notifications) for several
      reasons:
      - Multiple RTM_NEWROUTE notifications per route might confuse existing
        routing daemons.
      - Convergence reasons in routing daemons.
      - The extra notifications will negatively impact the insertion rate.
      - Not all users are interested in these notifications.
      
      Move fib6_info_hw_flags_set() to a C file because it is no longer a
      short function.
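      
      The knob is not named above; assuming it is the per-family
      fib_notify_on_flag_change sysctl added by this pair of patches
      (see also the IPv4 entry below), enabling it would look like:
      
      # 0 (default) = no notifications, 1 = notify on flag changes
      sysctl -w net.ipv6.fib_notify_on_flag_change=1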
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ipv4: Emit notification when fib hardware flags are changed · 680aea08
      Authored by Amit Cohen
      After installing a route to the kernel, user space receives an
      acknowledgment, which means the route was installed in the kernel,
      but not necessarily in hardware.
      
      The asynchronous nature of route installation in hardware can lead to a
      routing daemon advertising a route before it was actually installed in
      hardware. This can result in packet loss or mis-routed packets until the
      route is installed in hardware.
      
      It is also possible for a route already installed in hardware to change
      its action and therefore its flags. For example, a host route that is
      trapping packets can be "promoted" to perform decapsulation following
      the installation of an IPinIP/VXLAN tunnel.
      
      Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
      are changed. The aim is to provide an indication to user-space
      (e.g., routing daemons) about the state of the route in hardware.
      
      Introduce a sysctl that controls this behavior.
      
      Keep the default value at 0 (i.e., do not emit notifications) for several
      reasons:
      - Multiple RTM_NEWROUTE notifications per route might confuse existing
        routing daemons.
      - Convergence reasons in routing daemons.
      - The extra notifications will negatively impact the insertion rate.
      - Not all users are interested in these notifications.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Acked-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  23. 02 Feb 2021 (1 commit)
  24. 27 Jan 2021 (1 commit)
    • net: allow user to set metric on default route learned via Router Advertisement · 6b2e04bc
      Authored by Praveen Chaudhary
      For IPv4, the default route is learned via DHCPv4, and the user is allowed
      to change its metric in configuration, e.g. /etc/network/interfaces. But for
      IPv6, the default route can be learned via RA, for which a fixed metric
      value of 1024 is currently used.
      
      Ideally, the user should be able to configure a metric on the IPv6 default
      route just as for IPv4. This patch adds a sysctl for that.
      
      Logs:
      
      For IPv4:
      
      Config in /etc/network/interfaces:
      auto eth0
      iface eth0 inet dhcp
          metric 4261413864
      
      IPv4 Kernel Route Table:
      $ ip route list
      default via 172.21.47.1 dev eth0 metric 4261413864
      
      FRR Table, if a static route is configured:
      [In a real scenario, it is useful to prefer a BGP-learned default route over the DHCPv4 default route.]
      Codes: K - kernel route, C - connected, S - static, R - RIP,
             O - OSPF, I - IS-IS, B - BGP, P - PIM, E - EIGRP, N - NHRP,
             T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
             > - selected route, * - FIB route
      
      S>* 0.0.0.0/0 [20/0] is directly connected, eth0, 00:00:03
      K   0.0.0.0/0 [254/1000] via 172.21.47.1, eth0, 6d08h51m
      
      i.e., the user can prefer a default route learned via a routing protocol
      in IPv4. Similar behavior is not possible for IPv6 without this fix.
      
      After fix [for IPv6]:
      sudo sysctl -w net.ipv6.conf.eth0.ra_defrtr_metric=1996489705
      
      IP monitor: [When IPv6 RA is received]
      default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489705  pref high
      
      Kernel IPv6 routing table
      $ ip -6 route list
      default via fe80::be16:65ff:feb3:ce8e dev eth0 proto ra metric 1996489705 expires 21sec hoplimit 64 pref high
      
      FRR Table, if a static route is configured:
      [In a real scenario, it is useful to prefer a BGP-learned default route over the IPv6 RA default route.]
      Codes: K - kernel route, C - connected, S - static, R - RIPng,
             O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table,
             v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
             > - selected route, * - FIB route
      
      S>* ::/0 [20/0] is directly connected, eth0, 00:00:06
      K   ::/0 [119/1001] via fe80::xx16:xxxx:feb3:ce8e, eth0, 6d07h43m
      
      If the metric is changed later, the effect will be seen only when the next
      IPv6 RA is received, because the default route must be fully controlled by
      the RA message. Below, the metric is changed from 1996489705 to 1996489704.
      
      $ sudo sysctl -w net.ipv6.conf.eth0.ra_defrtr_metric=1996489704
      net.ipv6.conf.eth0.ra_defrtr_metric = 1996489704
      
      IP monitor:
      [On next IPv6 RA msg, Kernel deletes prev route and installs new route with updated metric]
      
      Deleted default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489705 expires 3sec hoplimit 64 pref high
      default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489704 pref high
      Signed-off-by: Praveen Chaudhary <pchaudhary@linkedin.com>
      Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20210125214430.24079-1-pchaudhary@linkedin.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  25. 24 Jan 2021 (1 commit)
  26. 12 Nov 2020 (1 commit)
    • net: evaluate net.ipvX.conf.all.ignore_routes_with_linkdown · c0c5a60f
      Authored by Vincent Bernat
      Introduced in 0eeb075f, the "ignore_routes_with_linkdown" sysctl
      ignores a route whose interface is down. It is provided as a
      per-interface sysctl. However, while an "all" variant is exposed, it
      was a no-op since it was never evaluated. We use the usual "or" logic
      for this kind of sysctl.
      
      Tested with:
      
          ip link add type veth # veth0 + veth1
          ip link add type veth # veth2 + veth3
          ip link set up dev veth0
          ip link set up dev veth1 # link-status paired with veth0
          ip link set up dev veth2
          ip link set up dev veth3 # link-status paired with veth2
      
          # First available path
          ip -4 addr add 203.0.113.${uts#H}/24 dev veth0
          ip -6 addr add 2001:db8:1::${uts#H}/64 dev veth0
      
          # Second available path
          ip -4 addr add 192.0.2.${uts#H}/24 dev veth2
          ip -6 addr add 2001:db8:2::${uts#H}/64 dev veth2
      
          # More specific route through first path
          ip -4 route add 198.51.100.0/25 via 203.0.113.254 # via veth0
          ip -6 route add 2001:db8:3::/56 via 2001:db8:1::ff # via veth0
      
          # Less specific route through second path
          ip -4 route add 198.51.100.0/24 via 192.0.2.254 # via veth2
          ip -6 route add 2001:db8:3::/48 via 2001:db8:2::ff # via veth2
      
          # H1: enable on "all"
          # H2: enable on "veth0"
          # H3: control, leave defaults
          for v in ipv4 ipv6; do
            case $uts in
              H1)
                sysctl -qw net.${v}.conf.all.ignore_routes_with_linkdown=1
                ;;
              H2)
                sysctl -qw net.${v}.conf.veth0.ignore_routes_with_linkdown=1
                ;;
            esac
          done
      
          set -xe
          # When veth0 is up, best route is through veth0
          ip -o route get 198.51.100.1 | grep -Fw veth0
          ip -o route get 2001:db8:3::1 | grep -Fw veth0
      
          # When veth0 is down, best route should be through veth2 on H1/H2,
          # but through veth0 on H3
          ip link set down dev veth1 # down veth0
          ip route show
          [ $uts != H3 ] || ip -o route get 198.51.100.1 | grep -Fw veth0
          [ $uts != H3 ] || ip -o route get 2001:db8:3::1 | grep -Fw veth0
          [ $uts = H3 ] || ip -o route get 198.51.100.1 | grep -Fw veth2
          [ $uts = H3 ] || ip -o route get 2001:db8:3::1 | grep -Fw veth2
      
      Without this patch, the last two lines would fail on H1 (the one using
      the "all" sysctl). With the patch, everything succeeds as expected.
      
      Also document the sysctl in `ip-sysctl.rst`.
      
      Fixes: 0eeb075f ("net: ipv4 sysctl option to ignore routes when nexthop link is down")
      Signed-off-by: Vincent Bernat <vincent@bernat.ch>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  27. 31 Oct 2020 (2 commits)
    • sctp: enable udp tunneling socks · 046c052b
      Authored by Xin Long
      This patch is to enable udp tunneling socks by calling
      sctp_udp_sock_start() in sctp_ctrlsock_init(), and
      sctp_udp_sock_stop() in sctp_ctrlsock_exit().
      
      Also add the sysctl udp_port to allow users to change the listening
      sock's port.
      
      With this patch, the whole SCTP over UDP feature can be enabled
      and used.
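      
      An enablement sketch, assuming the IANA-assigned SCTP-over-UDP port
      9899 from RFC 6951 (0 keeps the feature off):
      
      sysctl -w net.sctp.udp_port=9899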
      
      v1->v2:
        - Also update ctl_sock udp_port in proc_sctp_do_udp_port()
          where netns udp_port gets changed.
      v2->v3:
        - Call htons() when setting sk udp_port from netns udp_port.
      v3->v4:
        - Not call sctp_udp_sock_start() when new_value is 0.
        - Add udp_port entry in ip-sysctl.rst.
      v4->v5:
        - Not call sctp_udp_sock_start/stop() in sctp_ctrlsock_init/exit().
        - Improve the description of udp_port in ip-sysctl.rst.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • sctp: add encap_port for netns sock asoc and transport · e8a3001c
      Authored by Xin Long
      encap_port is added per netns/sock/assoc/transport, and each latter
      one's encap_port inherits the former one's by default. The transport's
      encap_port value mostly decides whether a packet goes out
      UDP-encapsulated or not.
      
      This patch also allows users to set netns' encap_port by sysctl.
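      
      For example, to make outgoing packets default to UDP encapsulation on
      the RFC 6951 port (a sketch; per-assoc/transport values still
      override the netns default):
      
      sysctl -w net.sctp.encap_port=9899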
      
      v1->v2:
        - Change to define encap_port as __be16 for sctp_sock, asoc and
          transport.
      v2->v3:
        - No change.
      v3->v4:
        - Add 'encap_port' entry in ip-sysctl.rst.
      v4->v5:
        - Improve the description of encap_port in ip-sysctl.rst.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  28. 17 Oct 2020 (1 commit)
  29. 05 Jul 2020 (1 commit)
  30. 07 May 2020 (1 commit)
    • ipv6: Implement draft-ietf-6man-rfc4941bis · 969c5464
      Authored by Fernando Gont
      Implement the upcoming rev of RFC4941 (IPv6 temporary addresses):
      https://tools.ietf.org/html/draft-ietf-6man-rfc4941bis-09
      
      * Reduces the default Valid Lifetime to 2 days
        The number of extra addresses employed when Valid Lifetime was
        7 days exacerbated the stress caused on network
        elements/devices. Additionally, the motivation for temporary
        addresses is indeed privacy and reduced exposure. With a
        default Valid Lifetime of 7 days, an address that becomes
        revealed by active communication is reachable and exposed for
        one whole week. The only use case for a Valid Lifetime of 7
        days could be some application that expects to have long-lived
        connections. But if you want long-lived connections, you
        shouldn't be using a temporary address in the first place.
        Additionally, in the era of mobile devices, general applications
        should nevertheless be prepared for and robust to address
        changes (e.g. nodes swapping between wifi and 4G, etc.)
      
      * Employs different IIDs for different prefixes
        To avoid network activity correlation among addresses configured
        for different prefixes
      
      * Uses a simpler algorithm for IID generation
        No need to store "history" anywhere
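      
      A sketch of the related per-interface knobs (eth0 is a placeholder;
      172800 s is the 2-day Valid Lifetime described above):
      
      sysctl -w net.ipv6.conf.eth0.use_tempaddr=2       # prefer temporary addresses
      sysctl -w net.ipv6.conf.eth0.temp_valid_lft=172800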
      Signed-off-by: Fernando Gont <fgont@si6networks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  31. 01 May 2020 (2 commits)