1. 02 Nov, 2021 1 commit
    • J
      net: arp: introduce arp_evict_nocarrier sysctl parameter · fcdb44d0
      James Prestwood committed
      This change introduces a new sysctl parameter, arp_evict_nocarrier.
      When set (default) the ARP cache will be cleared on a NOCARRIER event.
      This new option has been defaulted to '1' which maintains existing
      behavior.
      
      Clearing the ARP cache on NOCARRIER is relatively new, introduced by:
      
      commit 859bd2ef
      Author: David Ahern <dsahern@gmail.com>
      Date:   Thu Oct 11 20:33:49 2018 -0700
      
          net: Evict neighbor entries on carrier down
      
      The reason for this change is to prevent the ARP cache from being
      cleared when a wireless device roams. Specifically for wireless roams
      the ARP cache should not be cleared because the underlying network has not
      changed. Clearing the ARP cache in this case can introduce significant
      delays sending out packets after a roam.
      
      A user reported such a situation here:
      
      https://lore.kernel.org/linux-wireless/CACsRnHWa47zpx3D1oDq9JYnZWniS8yBwW1h0WAVZ6vrbwL_S0w@mail.gmail.com/
      
      After some investigation it was found that the kernel was holding onto
      packets until ARP finished which resulted in this 1 second delay. It
      was also found that the first ARP who-has was never responded to,
      which is actually what causes the delay. This change is more or less
      working around this behavior, but again, there is no reason to clear
      the cache on a roam anyway.
      
      As for the unanswered who-has, we know the packet made it OTA since
      it was seen while monitoring. Why it never received a response is
      unknown. In any case, since this is a problem on the AP side of things
      all that can be done is to work around it until it is solved.
      
      Some background on testing/reproducing the packet delay:
      
      Hardware:
       - 2 access points configured for Fast BSS Transition (Though I don't
         see why regular reassociation wouldn't have the same behavior)
       - Wireless station running IWD as supplicant
       - A device on network able to respond to pings (I used one of the APs)
      
      Procedure:
       - Connect to first AP
       - Ping once to establish an ARP entry
       - Start a tcpdump
       - Roam to second AP
       - Wait for operstate UP event, and note the timestamp
       - Start pinging
      
      Results:
      
      Below is the tcpdump after UP. The interface was recorded going UP at
      10:42:01.432875.
      
      10:42:01.461871 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28
      10:42:02.497976 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28
      10:42:02.507162 ARP, Reply 192.168.254.1 is-at ac:86:74:55:b0:20, length 46
      10:42:02.507185 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 1, length 64
      10:42:02.507205 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 2, length 64
      10:42:02.507212 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 3, length 64
      10:42:02.507219 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 4, length 64
      10:42:02.507225 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 5, length 64
      10:42:02.507232 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 6, length 64
      10:42:02.515373 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 1, length 64
      10:42:02.521399 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 2, length 64
      10:42:02.521612 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 3, length 64
      10:42:02.521941 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 4, length 64
      10:42:02.522419 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 5, length 64
      10:42:02.523085 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 6, length 64
      
      You can see the first ARP who-has went out very quickly after UP, but
      was never responded to. Nearly a second later the kernel retries and
      gets a response. Only then do the ping packets go out. If an ARP entry
      is manually added prior to UP (after the cache is cleared) it is seen
      that the first ping is never responded to, so it's not only an issue with
      ARP but with data packets in general.
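
The delay visible in the capture above can be checked with a little arithmetic. The following is only a sketch: the timestamps are copied from the tcpdump output, while the helper name `seconds_between` and the variable names are made up for illustration.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"  # tcpdump timestamp layout used in the capture above


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed time in seconds between two capture timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Timestamps copied from the capture above.
link_up = "10:42:01.432875"
first_who_has = "10:42:01.461871"
second_who_has = "10:42:02.497976"
first_echo_request = "10:42:02.507185"

retry_gap = seconds_between(first_who_has, second_who_has)
up_to_first_ping = seconds_between(link_up, first_echo_request)

print(f"ARP retry gap:    {retry_gap:.6f} s")
print(f"UP to first ping: {up_to_first_ping:.6f} s")
```

With these values the gap between the two who-has requests is about 1.04 s, and about 1.07 s passes between the UP event and the first echo request leaving the station.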
      
      As mentioned prior, the wireless interface was also monitored to verify
      the ping/ARP packet made it OTA which was observed to be true.
      Signed-off-by: James Prestwood <prestwoj@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      fcdb44d0
  2. 23 Sep, 2021 1 commit
  3. 22 Jul, 2021 1 commit
  4. 21 Jul, 2021 1 commit
    • J
      ipv6: ioam: Documentation for new IOAM sysctls · de8e80a5
      Justin Iurman committed
      Add documentation for new IOAM sysctls:
       - ioam6_id and ioam6_id_wide: two per-namespace sysctls
       - ioam6_enabled, ioam6_id and ioam6_id_wide: three per-interface sysctls
      
      Example of IOAM configuration based on the following simple topology:
      
       _____              _____              _____
      |     | eth0  eth0 |     | eth1  eth0 |     |
      |  A  |.----------.|  B  |.----------.|  C  |
      |_____|            |_____|            |_____|
      
      1) Node and interface IDs can be configured for IOAM:
      
        # IOAM ID of A = 1, IOAM ID of A.eth0 = 11
        (A) sysctl -w net.ipv6.ioam6_id=1
        (A) sysctl -w net.ipv6.conf.eth0.ioam6_id=11
      
        # IOAM ID of B = 2, IOAM ID of B.eth0 = 21, IOAM ID of B.eth1 = 22
        (B) sysctl -w net.ipv6.ioam6_id=2
        (B) sysctl -w net.ipv6.conf.eth0.ioam6_id=21
        (B) sysctl -w net.ipv6.conf.eth1.ioam6_id=22
      
        # IOAM ID of C = 3, IOAM ID of C.eth0 = 31
        (C) sysctl -w net.ipv6.ioam6_id=3
        (C) sysctl -w net.ipv6.conf.eth0.ioam6_id=31
      
        Note that the "_wide" ID equivalents can be configured the same way.
      
      2) Each node can be configured to form an IOAM domain. For instance,
         we allow IOAM from A to C only (not the reverse path), i.e. enable
         IOAM on ingress for B.eth0 and C.eth0:
      
        (B) sysctl -w net.ipv6.conf.eth0.ioam6_enabled=1
        (C) sysctl -w net.ipv6.conf.eth0.ioam6_enabled=1
      
      3) An IOAM domain (e.g. ID=123) is defined and made known to each node:
      
        (A) ip ioam namespace add 123
        (B) ip ioam namespace add 123
        (C) ip ioam namespace add 123
      
      4) Finally, an IOAM Pre-allocated Trace can be inserted in traffic sent
         by A when C (e.g. db02::2) is the destination:
      
        (A) ip -6 route add db02::2/128 encap ioam6 trace type 0x800000 ns 123
            size 12 dev eth0
      Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de8e80a5
  5. 25 Jun, 2021 1 commit
  6. 23 Jun, 2021 1 commit
  7. 16 Jun, 2021 1 commit
  8. 19 May, 2021 4 commits
    • I
      ipv6: Add custom multipath hash policy · 73c2c5cb
      Ido Schimmel committed
      Add a new multipath hash policy where the packet fields used for hash
      calculation are determined by user space via the
      fib_multipath_hash_fields sysctl that was introduced in the previous
      patch.
      
      The current set of available packet fields includes both outer and inner
      fields, which requires two invocations of the flow dissector. Avoid
      unnecessary dissection of the outer or inner flows by skipping
      dissection if none of the outer or inner fields are required.
      
      In accordance with the existing policies, when an skb is not available,
      packet fields are extracted from the provided flow key, in which case
      only outer fields are considered.
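
The dissection-skipping idea can be sketched as follows. This is an illustration rather than the kernel code: the two mask values assume that the low six bits of the bitmask select outer fields and the next six bits select inner fields, and the function name is hypothetical.

```python
from typing import Tuple

# Assumed layout of the fib_multipath_hash_fields bitmask (not taken from
# this commit message): bits 0-5 are outer fields, bits 6-11 inner fields.
OUTER_MASK = 0x003F
INNER_MASK = 0x0FC0


def dissections_needed(hash_fields: int) -> Tuple[bool, bool]:
    """Return (need_outer, need_inner) for a given field bitmask.

    A flow dissector pass is only worth running when at least one field
    from the corresponding group is selected.
    """
    return bool(hash_fields & OUTER_MASK), bool(hash_fields & INNER_MASK)


print(dissections_needed(0x0037))  # outer fields only
print(dissections_needed(0x0040))  # inner fields only
```

Under these assumptions, a mask that selects only outer fields never triggers the second (inner) dissection, and vice versa.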
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      73c2c5cb
    • I
      ipv6: Add a sysctl to control multipath hash fields · ed13923f
      Ido Schimmel committed
      A subsequent patch will add a new multipath hash policy where the packet
      fields used for multipath hash calculation are determined by user space.
      This patch adds a sysctl that allows user space to set these fields.
      
      The packet fields are represented using a bitmask and are common between
      IPv4 and IPv6 to allow user space to use the same numbering across both
      protocols. For example, to hash based on standard 5-tuple:
      
       # sysctl -w net.ipv6.fib_multipath_hash_fields=0x0037
       net.ipv6.fib_multipath_hash_fields = 0x0037
      
      To avoid introducing holes in 'struct netns_sysctl_ipv6', move the
      'bindv6only' field after the multipath hash fields.
      
      The kernel rejects unknown fields, for example:
      
       # sysctl -w net.ipv6.fib_multipath_hash_fields=0x1000
       sysctl: setting key "net.ipv6.fib_multipath_hash_fields": Invalid argument
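
To make the bitmask concrete, here is a small decoding sketch. The bit-to-field table reflects my reading of the kernel's ip-sysctl documentation and should be treated as an assumption; `decode_hash_fields` is a hypothetical helper, and raising `ValueError` merely stands in for the kernel returning "Invalid argument".

```python
# Assumed bit assignments for fib_multipath_hash_fields.
FIELDS = {
    0x0001: "source IP address",
    0x0002: "destination IP address",
    0x0004: "IP protocol",
    0x0008: "flow label",
    0x0010: "source port",
    0x0020: "destination port",
    0x0040: "inner source IP address",
    0x0080: "inner destination IP address",
    0x0100: "inner IP protocol",
    0x0200: "inner flow label",
    0x0400: "inner source port",
    0x0800: "inner destination port",
}
KNOWN_MASK = sum(FIELDS)  # all known bits OR-ed together


def decode_hash_fields(value: int) -> list:
    """Return the names of the selected fields, rejecting unknown bits."""
    if value & ~KNOWN_MASK:
        raise ValueError(f"unknown field bits in {value:#06x}")
    return [name for bit, name in FIELDS.items() if value & bit]


print(decode_hash_fields(0x0037))
```

Decoding 0x0037 yields the five fields of the standard 5-tuple, while 0x1000 lies outside the known bits and is rejected, mirroring the two sysctl invocations shown above.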
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ed13923f
    • I
      ipv4: Add custom multipath hash policy · 4253b498
      Ido Schimmel committed
      Add a new multipath hash policy where the packet fields used for hash
      calculation are determined by user space via the
      fib_multipath_hash_fields sysctl that was introduced in the previous
      patch.
      
      The current set of available packet fields includes both outer and inner
      fields, which requires two invocations of the flow dissector. Avoid
      unnecessary dissection of the outer or inner flows by skipping
      dissection if none of the outer or inner fields are required.
      
      In accordance with the existing policies, when an skb is not available,
      packet fields are extracted from the provided flow key, in which case
      only outer fields are considered.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4253b498
    • I
      ipv4: Add a sysctl to control multipath hash fields · ce5c9c20
      Ido Schimmel committed
      A subsequent patch will add a new multipath hash policy where the packet
      fields used for multipath hash calculation are determined by user space.
      This patch adds a sysctl that allows user space to set these fields.
      
      The packet fields are represented using a bitmask and are common between
      IPv4 and IPv6 to allow user space to use the same numbering across both
      protocols. For example, to hash based on standard 5-tuple:
      
       # sysctl -w net.ipv4.fib_multipath_hash_fields=0x0037
       net.ipv4.fib_multipath_hash_fields = 0x0037
      
      The kernel rejects unknown fields, for example:
      
       # sysctl -w net.ipv4.fib_multipath_hash_fields=0x1000
       sysctl: setting key "net.ipv4.fib_multipath_hash_fields": Invalid argument
      
      More fields can be added in the future, if needed.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ce5c9c20
  9. 15 Apr, 2021 1 commit
  10. 02 Apr, 2021 1 commit
    • O
      net: document a side effect of ip_local_reserved_ports · a7a80b17
      Otto Hollmann committed
      If there is overlap between ip_local_port_range and ip_local_reserved_ports with a huge reserved block, it will affect the probability of selecting ephemeral ports; see net/ipv4/inet_hashtables.c:723
      
          int __inet_hash_connect(
          ...
                  for (i = 0; i < remaining; i += 2, port += 2) {
                          if (unlikely(port >= high))
                                  port -= remaining;
                          if (inet_is_local_reserved_port(net, port))
                                  continue;
      
          E.g. if there is a reserved block of 10000 ports, the two ports right after this block will be 5000 times more likely to be selected than others.
          If this was intended, we can/should add a note to the documentation as proposed in this commit; otherwise we should think about a different solution. One option could be a mapping table of contiguous port ranges. A second option could be letting the user modify the step (port += 2) in the above loop, e.g. using a new sysctl parameter.
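
The skew described above can be reproduced with a small model of the quoted loop. This is a simplified sketch, not the kernel function: it assumes every non-reserved port is free, only walks starting offsets of one parity, and uses a made-up range (1000 to 1100) with a 20-port reserved block so the numbers stay readable.

```python
from collections import Counter


def first_free_port(start, low, high, reserved):
    """Mimic the quoted loop: step by 2 from `start`, wrap at `high`,
    skip reserved ports, and return the first candidate that remains."""
    remaining = high - low
    port = start
    for _ in range(0, remaining, 2):
        if port >= high:
            port -= remaining
        if port not in reserved:
            return port
        port += 2
    return None


# Hypothetical numbers chosen for illustration only.
LOW, HIGH = 1000, 1100
RESERVED = set(range(1020, 1040))

# Count which port wins for every even starting offset in the range.
counts = Counter(
    first_free_port(LOW + offset, LOW, HIGH, RESERVED)
    for offset in range(0, HIGH - LOW, 2)
)

print(counts.most_common(3))
```

In this model port 1040, the first even port after the reserved block, wins for 11 of the 50 starting offsets, while every other free even port wins exactly once. Scaling the reserved block up to 10000 ports gives the factor of roughly 5000 mentioned above.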
      Signed-off-by: Otto Hollmann <otto.hollmann@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a7a80b17
  11. 31 Mar, 2021 1 commit
  12. 12 Feb, 2021 1 commit
  13. 10 Feb, 2021 1 commit
  14. 09 Feb, 2021 2 commits
  15. 03 Feb, 2021 2 commits
    • A
      net: ipv6: Emit notification when fib hardware flags are changed · 907eea48
      Amit Cohen committed
      After installing a route to the kernel, user space receives an
      acknowledgment, which means the route was installed in the kernel,
      but not necessarily in hardware.
      
      The asynchronous nature of route installation in hardware can lead
      to a routing daemon advertising a route before it was actually installed in
      hardware. This can result in packet loss or mis-routed packets until the
      route is installed in hardware.
      
      It is also possible for a route already installed in hardware to change
      its action and therefore its flags. For example, a host route that is
      trapping packets can be "promoted" to perform decapsulation following
      the installation of an IPinIP/VXLAN tunnel.
      
      Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
      are changed. The aim is to provide an indication to user-space
      (e.g., routing daemons) about the state of the route in hardware.
      
      Introduce a sysctl that controls this behavior.
      
      Keep the default value at 0 (i.e., do not emit notifications) for several
      reasons:
      - Multiple RTM_NEWROUTE notifications per route might confuse existing
        routing daemons.
      - Convergence reasons in routing daemons.
      - The extra notifications will negatively impact the insertion rate.
      - Not all users are interested in these notifications.
      
      Move fib6_info_hw_flags_set() to C file because it is no longer a short
      function.
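
The notification rule described in this message can be modelled in a few lines. This is a sketch only: the flag values are what I recall from the rtnetlink UAPI header and should be treated as assumptions, and `should_notify` with its `notify_enabled` parameter is a hypothetical stand-in for the new sysctl, which the message does not name.

```python
# Assumed flag values (from memory of include/uapi/linux/rtnetlink.h).
RTM_F_OFFLOAD = 0x4000
RTM_F_TRAP = 0x8000
HW_FLAGS = RTM_F_OFFLOAD | RTM_F_TRAP


def should_notify(old_flags: int, new_flags: int, notify_enabled: bool) -> bool:
    """Emit a notification only when the sysctl is enabled and one of the
    hardware flags actually changed."""
    changed = (old_flags ^ new_flags) & HW_FLAGS
    return bool(notify_enabled and changed)


# A trapping host route that is "promoted" to an offloaded route.
print(should_notify(RTM_F_TRAP, RTM_F_OFFLOAD, notify_enabled=True))
# Same change with the default sysctl value of 0.
print(should_notify(RTM_F_TRAP, RTM_F_OFFLOAD, notify_enabled=False))
```

Keeping the default at "disabled" means existing routing daemons see exactly one RTM_NEWROUTE per installed route unless the administrator opts in.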
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      907eea48
    • A
      net: ipv4: Emit notification when fib hardware flags are changed · 680aea08
      Amit Cohen committed
      After installing a route to the kernel, user space receives an
      acknowledgment, which means the route was installed in the kernel,
      but not necessarily in hardware.
      
      The asynchronous nature of route installation in hardware can lead to a
      routing daemon advertising a route before it was actually installed in
      hardware. This can result in packet loss or mis-routed packets until the
      route is installed in hardware.
      
      It is also possible for a route already installed in hardware to change
      its action and therefore its flags. For example, a host route that is
      trapping packets can be "promoted" to perform decapsulation following
      the installation of an IPinIP/VXLAN tunnel.
      
      Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
      are changed. The aim is to provide an indication to user-space
      (e.g., routing daemons) about the state of the route in hardware.
      
      Introduce a sysctl that controls this behavior.
      
      Keep the default value at 0 (i.e., do not emit notifications) for several
      reasons:
      - Multiple RTM_NEWROUTE notifications per route might confuse existing
        routing daemons.
      - Convergence reasons in routing daemons.
      - The extra notifications will negatively impact the insertion rate.
      - Not all users are interested in these notifications.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Acked-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      680aea08
  16. 02 Feb, 2021 1 commit
  17. 27 Jan, 2021 1 commit
    • P
      net: allow user to set metric on default route learned via Router Advertisement · 6b2e04bc
      Praveen Chaudhary committed
      For IPv4, the default route is learned via DHCPv4 and the user is allowed to
      change the metric using the config file etc/network/interfaces. But for IPv6,
      the default route can be learned via RA, for which a fixed metric value of
      1024 is currently used.

      Ideally, the user should be able to configure the metric on the default route
      for IPv6 similar to IPv4. This patch adds a sysctl for the same.
      
      Logs:
      
      For IPv4:
      
      Config in etc/network/interfaces:
      auto eth0
      iface eth0 inet dhcp
          metric 4261413864
      
      IPv4 Kernel Route Table:
      $ ip route list
      default via 172.21.47.1 dev eth0 metric 4261413864
      
      FRR Table, if a static route is configured:
      [In real scenario, it is useful to prefer BGP learned default route over DHCPv4 default route.]
      Codes: K - kernel route, C - connected, S - static, R - RIP,
             O - OSPF, I - IS-IS, B - BGP, P - PIM, E - EIGRP, N - NHRP,
             T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
             > - selected route, * - FIB route
      
      S>* 0.0.0.0/0 [20/0] is directly connected, eth0, 00:00:03
      K   0.0.0.0/0 [254/1000] via 172.21.47.1, eth0, 6d08h51m
      
      i.e. the user can prefer a default route learned via a routing protocol in IPv4.
      Similar behavior is not possible for IPv6 without this fix.
      
      After fix [for IPv6]:
      sudo sysctl -w net.ipv6.conf.eth0.ra_defrtr_metric=1996489705
      
      IP monitor: [When IPv6 RA is received]
      default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489705  pref high
      
      Kernel IPv6 routing table
      $ ip -6 route list
      default via fe80::be16:65ff:feb3:ce8e dev eth0 proto ra metric 1996489705 expires 21sec hoplimit 64 pref high
      
      FRR Table, if a static route is configured:
      [In real scenario, it is useful to prefer BGP learned default route over IPv6 RA default route.]
      Codes: K - kernel route, C - connected, S - static, R - RIPng,
             O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table,
             v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
             > - selected route, * - FIB route
      
      S>* ::/0 [20/0] is directly connected, eth0, 00:00:06
      K   ::/0 [119/1001] via fe80::xx16:xxxx:feb3:ce8e, eth0, 6d07h43m
      
      If the metric is changed later, the effect will be seen only when the next
      IPv6 RA is received, because the default route must be fully controlled by
      RA messages. Below, the metric is changed from 1996489705 to 1996489704.
      
      $ sudo sysctl -w net.ipv6.conf.eth0.ra_defrtr_metric=1996489704
      net.ipv6.conf.eth0.ra_defrtr_metric = 1996489704
      
      IP monitor:
      [On next IPv6 RA msg, Kernel deletes prev route and installs new route with updated metric]
      
      Deleted default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489705 expires 3sec hoplimit 64 pref high
      default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489704 pref high
      Signed-off-by: Praveen Chaudhary <pchaudhary@linkedin.com>
      Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20210125214430.24079-1-pchaudhary@linkedin.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      6b2e04bc
  18. 24 Jan, 2021 1 commit
  19. 12 Nov, 2020 1 commit
    • V
      net: evaluate net.ipvX.conf.all.ignore_routes_with_linkdown · c0c5a60f
      Vincent Bernat committed
      Introduced in 0eeb075f, the "ignore_routes_with_linkdown" sysctl
      ignores a route whose interface is down. It is provided as a
      per-interface sysctl. However, while an "all" variant is exposed, it
      was a no-op since it was never evaluated. We use the usual "or" logic
      for this kind of sysctl.
      
      Tested with:
      
          ip link add type veth # veth0 + veth1
          ip link add type veth # veth2 + veth3
          ip link set up dev veth0
          ip link set up dev veth1 # link-status paired with veth0
          ip link set up dev veth2
          ip link set up dev veth3 # link-status paired with veth2
      
          # First available path
          ip -4 addr add 203.0.113.${uts#H}/24 dev veth0
          ip -6 addr add 2001:db8:1::${uts#H}/64 dev veth0
      
          # Second available path
          ip -4 addr add 192.0.2.${uts#H}/24 dev veth2
          ip -6 addr add 2001:db8:2::${uts#H}/64 dev veth2
      
          # More specific route through first path
          ip -4 route add 198.51.100.0/25 via 203.0.113.254 # via veth0
          ip -6 route add 2001:db8:3::/56 via 2001:db8:1::ff # via veth0
      
          # Less specific route through second path
          ip -4 route add 198.51.100.0/24 via 192.0.2.254 # via veth2
          ip -6 route add 2001:db8:3::/48 via 2001:db8:2::ff # via veth2
      
          # H1: enable on "all"
          # H2: enable on "veth0"
          for v in ipv4 ipv6; do
            case $uts in
              H1)
                sysctl -qw net.${v}.conf.all.ignore_routes_with_linkdown=1
                ;;
              H2)
                sysctl -qw net.${v}.conf.veth0.ignore_routes_with_linkdown=1
                ;;
            esac
          done
      
          set -xe
          # When veth0 is up, best route is through veth0
          ip -o route get 198.51.100.1 | grep -Fw veth0
          ip -o route get 2001:db8:3::1 | grep -Fw veth0
      
          # When veth0 is down, best route should be through veth2 on H1/H2,
          # but on veth0 on H3
          ip link set down dev veth1 # down veth0
          ip route show
          [ $uts != H3 ] || ip -o route get 198.51.100.1 | grep -Fw veth0
          [ $uts != H3 ] || ip -o route get 2001:db8:3::1 | grep -Fw veth0
          [ $uts = H3 ] || ip -o route get 198.51.100.1 | grep -Fw veth2
          [ $uts = H3 ] || ip -o route get 2001:db8:3::1 | grep -Fw veth2
      
      Without this patch, the last two lines would fail on H1 (the one using
      the "all" sysctl). With the patch, everything succeeds as expected.
      
      Also document the sysctl in `ip-sysctl.rst`.
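
The "or" logic and its effect on route selection can be sketched as below. This is a simplified model, not the kernel's FIB lookup: the prefixes and device names are taken from the test script above, while `pick_route` and its parameters are hypothetical.

```python
import ipaddress


def pick_route(dest, routes, link_up, conf_all, conf_dev):
    """Longest-prefix match that skips routes whose device has no link when
    either the "all" knob or the per-device knob is set ("or" logic)."""
    addr = ipaddress.ip_address(dest)
    best = None
    for prefix, dev in routes:
        net = ipaddress.ip_network(prefix)
        if addr not in net:
            continue
        ignore = conf_all or conf_dev.get(dev, False)
        if ignore and not link_up[dev]:
            continue
        if best is None or net.prefixlen > best[0].prefixlen:
            best = (net, dev)
    return best[1] if best else None


ROUTES = [("198.51.100.0/25", "veth0"), ("198.51.100.0/24", "veth2")]
LINK_UP = {"veth0": False, "veth2": True}  # veth0 has lost its link

print(pick_route("198.51.100.1", ROUTES, LINK_UP, True, {}))                # H1
print(pick_route("198.51.100.1", ROUTES, LINK_UP, False, {"veth0": True}))  # H2
print(pick_route("198.51.100.1", ROUTES, LINK_UP, False, {}))               # H3
```

In this model H1 and H2 both fall back to veth2, while H3, with neither knob set, keeps using the more specific route through veth0.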
      
      Fixes: 0eeb075f ("net: ipv4 sysctl option to ignore routes when nexthop link is down")
      Signed-off-by: Vincent Bernat <vincent@bernat.ch>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c0c5a60f
  20. 31 Oct, 2020 2 commits
    • X
      sctp: enable udp tunneling socks · 046c052b
      Xin Long committed
      This patch is to enable udp tunneling socks by calling
      sctp_udp_sock_start() in sctp_ctrlsock_init(), and
      sctp_udp_sock_stop() in sctp_ctrlsock_exit().
      
      Also add sysctl udp_port to allow changing the listening
      sock's port by users.
      
      With this patch, the whole sctp over udp feature can be
      enabled and used.
      
      v1->v2:
        - Also update ctl_sock udp_port in proc_sctp_do_udp_port()
          where netns udp_port gets changed.
      v2->v3:
        - Call htons() when setting sk udp_port from netns udp_port.
      v3->v4:
        - Not call sctp_udp_sock_start() when new_value is 0.
        - Add udp_port entry in ip-sysctl.rst.
      v4->v5:
        - Not call sctp_udp_sock_start/stop() in sctp_ctrlsock_init/exit().
        - Improve the description of udp_port in ip-sysctl.rst.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      046c052b
    • X
      sctp: add encap_port for netns sock asoc and transport · e8a3001c
      Xin Long committed
      encap_port is added as per netns/sock/assoc/transport, and the
      latter one's encap_port inherits the former one's by default.
      The transport's encap_port value would mostly decide if one
      packet should go out with udp encapsulated or not.
      
      This patch also allows users to set netns' encap_port by sysctl.
      
      v1->v2:
        - Change to define encap_port as __be16 for sctp_sock, asoc and
          transport.
      v2->v3:
        - No change.
      v3->v4:
        - Add 'encap_port' entry in ip-sysctl.rst.
      v4->v5:
        - Improve the description of encap_port in ip-sysctl.rst.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e8a3001c
  21. 17 Oct, 2020 1 commit
  22. 05 Jul, 2020 1 commit
  23. 07 May, 2020 1 commit
    • F
      ipv6: Implement draft-ietf-6man-rfc4941bis · 969c5464
      Fernando Gont committed
      Implement the upcoming rev of RFC4941 (IPv6 temporary addresses):
      https://tools.ietf.org/html/draft-ietf-6man-rfc4941bis-09
      
      * Reduces the default Valid Lifetime to 2 days
        The number of extra addresses employed when Valid Lifetime was
        7 days exacerbated the stress caused on network
        elements/devices. Additionally, the motivation for temporary
        addresses is indeed privacy and reduced exposure. With a
        default Valid Lifetime of 7 days, an address that becomes
        revealed by active communication is reachable and exposed for
        one whole week. The only use case for a Valid Lifetime of 7
        days could be some application that is expecting to have
        long-lived connections. But if you want to have long-lived
        connections, you shouldn't be using a temporary address in the
        first place. Additionally, in the era of mobile devices, general
        applications should nevertheless be prepared and robust to
        address changes (e.g. nodes swap wifi <-> 4G, etc.)
      
      * Employs different IIDs for different prefixes
        To avoid network activity correlation among addresses configured
        for different prefixes
      
      * Uses a simpler algorithm for IID generation
        No need to store "history" anywhere
      Signed-off-by: Fernando Gont <fgont@si6networks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      969c5464
  24. 01 May, 2020 2 commits
  25. 29 Apr, 2020 2 commits
  26. 23 Apr, 2020 1 commit
  27. 16 Apr, 2020 1 commit
  28. 13 Mar, 2020 1 commit
    • K
      tcp: bind(0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted. · 4b01a967
      Kuniyuki Iwashima committed
      Commit aacd9289 ("tcp: bind() use stronger
      condition for bind_conflict") introduced a restriction that forbids binding
      SO_REUSEADDR-enabled sockets to the same (addr, port) tuple, in order to
      spread port assignments so that we can connect to the same remote host.

      The change accelerates port depletion, so we fail to bind sockets to the
      same local port even when we want to connect to different remote
      hosts.
      
      You can reproduce this issue by following the instructions below.
      
        1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
        2. set SO_REUSEADDR to two sockets.
        3. bind two sockets to (localhost, 0) and the latter fails.
      
      Therefore, when ephemeral ports are exhausted, bind(0) should fall back to
      the legacy behaviour to enable the SO_REUSEADDR option and make it possible
      to connect to different remote (addr, port) tuples.

      This patch allows us to bind SO_REUSEADDR-enabled sockets to the same
      (addr, port) only when net.ipv4.ip_autobind_reuse is set to 1 and all
      ephemeral ports are exhausted. This also allows connect() and listen() to
      share ports in the following way and may break some applications. So
      ip_autobind_reuse is 0 by default, which disables the feature.
      
        1. setsockopt(sk1, SO_REUSEADDR)
        2. setsockopt(sk2, SO_REUSEADDR)
        3. bind(sk1, saddr, 0)
        4. bind(sk2, saddr, 0)
        5. connect(sk1, daddr)
        6. listen(sk2)
      
      If it is set to 1, we can fully utilize the 4-tuples, but we should use
      IP_BIND_ADDRESS_NO_PORT for bind()+connect() where possible.
      
      The notable thing is that if all sockets bound to the same port have
      both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
      ephemeral port and also do listen().
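
Steps 2 and 3 of the reproduction listed earlier in this message can be sketched with ordinary socket calls. This is a minimal illustration, not the author's test program: `bind_two_reuseaddr` is a hypothetical helper, and it does not change `ip_local_port_range` itself, which has to be narrowed separately (as root) for the failure to show up.

```python
import socket


def bind_two_reuseaddr(host="127.0.0.1"):
    """Bind two SO_REUSEADDR sockets to (host, 0).

    Returns one entry per socket: the port number that was assigned, or the
    OSError raised by bind() if no port could be assigned.
    """
    socks, results = [], []
    try:
        for _ in range(2):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            socks.append(s)
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            try:
                s.bind((host, 0))  # port 0 asks the kernel to pick one
                results.append(s.getsockname()[1])
            except OSError as exc:
                results.append(exc)
    finally:
        for s in socks:
            s.close()
    return results


if __name__ == "__main__":
    print(bind_two_reuseaddr())
```

With the default port range both binds normally succeed on different ports. Going by the description above, once the range is narrowed to a single port the second entry would be an error while ip_autobind_reuse is 0, and both entries would report the same port once it is set to 1.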
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4b01a967
  29. 03 Jan, 2020 1 commit
  30. 10 Dec, 2019 1 commit
    • K
      net-tcp: Disable TCP ssthresh metrics cache by default · 65e6d901
      Kevin(Yudong) Yang committed
      This patch introduces a sysctl knob "net.ipv4.tcp_no_ssthresh_metrics_save"
      that disables TCP ssthresh metrics cache by default. Other parts of TCP
      metrics cache, e.g. rtt, cwnd, remain unchanged.
      
      As modern networks become more and more dynamic, TCP metrics cache
      today often causes more harm than benefits. For example, the same IP
      address is often shared by different subscribers behind NAT in residential
      networks. Even if the IP address is not shared by different users,
      caching the slow-start threshold of a previous short flow using loss-based
      congestion control (e.g. cubic) often causes the future longer flows of
      the same network path to exit slow-start prematurely with abysmal
      throughput.
      
      Caching ssthresh is very risky and can lead to terrible performance.
      Therefore it makes sense to disable ssthresh caching by
      default and let administrators opt in for specific networks.
      This practice also has worked well for several years of deployment with
      CUBIC congestion control at Google.
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Kevin(Yudong) Yang <yyd@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      65e6d901
  31. 27 Nov, 2019 1 commit
  32. 09 Nov, 2019 1 commit
    • X
      sctp: add support for Primary Path Switchover · 34515e94
      Xin Long committed
      This is a new feature defined in section 5 of rfc7829: "Primary Path
      Switchover". It introduces a new tunable parameter:
      
        Primary.Switchover.Max.Retrans (PSMR)
      
      The primary path will be changed to another active path when the path
      error counter on the old primary path exceeds PSMR, so that "the SCTP
      sender is allowed to continue data transmission on a new working path
      even when the old primary destination address becomes active again".
      
      This patch is to add this tunable parameter, 'ps_retrans' per netns,
      sock, asoc and transport. It also allows a user to change ps_retrans
      per netns by sysctl, and ps_retrans per sock/asoc/transport will be
      initialized with it.
      
      The check will be done in sctp_do_8_2_transport_strike() when this
      feature is enabled.
      
      Note this feature is disabled by initializing 'ps_retrans' per netns
      as 0xffff by default, and its value can't be less than 'pf_retrans'
      when changing by sysctl.
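
The rules stated above (switch when the error counter exceeds PSMR, disabled at 0xffff, never below 'pf_retrans') can be summarised in a short sketch. It is a model of the description only, not of sctp_do_8_2_transport_strike(): both function names are hypothetical, and treating the maximum value as "disabled" follows the wording of this message.

```python
SCTP_PS_RETRANS_MAX = 0xFFFF  # default; the feature is disabled at this value


def validate_ps_retrans(new_value: int, pf_retrans: int) -> int:
    """Accept a new ps_retrans only if pf_retrans <= value <= 0xffff."""
    if not pf_retrans <= new_value <= SCTP_PS_RETRANS_MAX:
        raise ValueError("ps_retrans must be in [pf_retrans, 0xffff]")
    return new_value


def should_switch_primary(error_count: int, ps_retrans: int) -> bool:
    """Switch away from the primary path once its error counter exceeds
    PSMR, unless the feature is left at its disabled default."""
    if ps_retrans == SCTP_PS_RETRANS_MAX:
        return False
    return error_count > ps_retrans


ps = validate_ps_retrans(5, pf_retrans=3)
print(should_switch_primary(6, ps))                    # counter exceeds PSMR
print(should_switch_primary(6, SCTP_PS_RETRANS_MAX))   # feature disabled
```
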
      
      v3->v4:
        - add define SCTP_PS_RETRANS_MAX 0xffff, and use it on extra2 of
          sysctl 'ps_retrans'.
        - add a new entry for ps_retrans on ip-sysctl.txt.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      34515e94