1. 11 Jun 2022 (2 commits)
  2. 31 May 2022 (1 commit)
  3. 20 Apr 2022 (1 commit)
  4. 17 Apr 2022 (1 commit)
  5. 10 Mar 2022 (1 commit)
    • tcp: adjust TSO packet sizes based on min_rtt · 65466904
      Authored by Eric Dumazet
      Back when tcp_tso_autosize() and TCP pacing were introduced,
      our focus was really to reduce burst sizes for long distance
      flows.
      
      The simple heuristic of using sk_pacing_rate/1024 has worked
      well, but can lead to too small packets for hosts in the same
      rack/cluster, when thousands of flows compete for the bottleneck.
      
      Neal Cardwell had the idea of making the TSO burst size
      a function of both sk_pacing_rate and tcp_min_rtt().
      
      Indeed, for local flows, sending bigger bursts is better
      to reduce cpu costs, as occasional losses can be repaired
      quite fast.
      
      This patch is based on Neal Cardwell's implementation
      done more than two years ago.
      bbr adjusts max_pacing_rate based on measured bandwidth,
      while cubic would overestimate max_pacing_rate.
      
      /proc/sys/net/ipv4/tcp_tso_rtt_log can be used to tune or disable
      this new feature, in logarithmic steps.
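      
      For illustration, the knob can be tuned like any other TCP sysctl;
      this sketch assumes the default of 9 used in the test below:
      
      ~# cat /proc/sys/net/ipv4/tcp_tso_rtt_log
      9
      ~# echo 0 >/proc/sys/net/ipv4/tcp_tso_rtt_log   # disable the min_rtt term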
      
      Tested:
      
      100Gbit NIC, two hosts in the same rack, 4K MTU.
      600 flows rate-limited to 20000000 bytes per second.
      
      Before patch: (TSO sizes would be limited to 20000000/1024/4096 -> 4 segments per TSO)
      
      ~# echo 0 >/proc/sys/net/ipv4/tcp_tso_rtt_log
      ~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
        96005
      
       Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':
      
               65,945.29 msec task-clock                #    2.845 CPUs utilized
               1,314,632      context-switches          # 19935.279 M/sec
                   5,292      cpu-migrations            #   80.249 M/sec
                 940,641      page-faults               # 14264.023 M/sec
         201,117,030,926      cycles                    # 3049769.216 GHz                   (83.45%)
          17,699,435,405      stalled-cycles-frontend   #    8.80% frontend cycles idle     (83.48%)
         136,584,015,071      stalled-cycles-backend    #   67.91% backend cycles idle      (83.44%)
          53,809,530,436      instructions              #    0.27  insn per cycle
                                                        #    2.54  stalled cycles per insn  (83.36%)
           9,062,315,523      branches                  # 137422329.563 M/sec               (83.22%)
             153,008,621      branch-misses             #    1.69% of all branches          (83.32%)
      
            23.182970846 seconds time elapsed
      
      TcpInSegs                       15648792           0.0
      TcpOutSegs                      58659110           0.0  # Average of 3.7 4K segments per TSO packet
      TcpExtTCPDelivered              58654791           0.0
      TcpExtTCPDeliveredCE            19                 0.0
      
      After patch:
      
      ~# echo 9 >/proc/sys/net/ipv4/tcp_tso_rtt_log
      ~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
        96046
      
       Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':
      
               48,982.58 msec task-clock                #    2.104 CPUs utilized
                 186,014      context-switches          # 3797.599 M/sec
                   3,109      cpu-migrations            #   63.472 M/sec
                 941,180      page-faults               # 19214.814 M/sec
         153,459,763,868      cycles                    # 3132982.807 GHz                   (83.56%)
          12,069,861,356      stalled-cycles-frontend   #    7.87% frontend cycles idle     (83.32%)
         120,485,917,953      stalled-cycles-backend    #   78.51% backend cycles idle      (83.24%)
          36,803,672,106      instructions              #    0.24  insn per cycle
                                                        #    3.27  stalled cycles per insn  (83.18%)
           5,947,266,275      branches                  # 121417383.427 M/sec               (83.64%)
              87,984,616      branch-misses             #    1.48% of all branches          (83.43%)
      
            23.281200256 seconds time elapsed
      
      TcpInSegs                       1434706            0.0
      TcpOutSegs                      58883378           0.0  # Average of 41 4K segments per TSO packet
      TcpExtTCPDelivered              58878971           0.0
      TcpExtTCPDeliveredCE            9664               0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Neal Cardwell <ncardwell@google.com>
      Link: https://lore.kernel.org/r/20220309015757.2532973-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  6. 30 Dec 2021 (1 commit)
  7. 05 Nov 2021 (1 commit)
    • net: udp: correct the document for udp_mem · 69dfccbc
      Authored by Menglong Dong
      udp_mem is a vector of 3 INTEGERs, which is used to limit the number of
      pages allowed for queueing by all UDP sockets.
      
      However, sk_has_memory_pressure() in __sk_mem_raise_allocated() always
      returns false for UDP, as memory pressure is not supported by UDP. This
      means that __sk_mem_raise_allocated() will fail once the pages allocated
      for a UDP socket exceed udp_mem[0].
      
      Therefore, udp_mem[0] is the only value that limits the number of pages.
      However, the documentation for udp_mem states that udp_mem[2] is the
      limit. So, just fix it.
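      
      For illustration, the vector can be inspected with sysctl (the values
      below are hypothetical; they are auto-scaled at boot). Per the fix
      above, the first value is the one that effectively caps allocations:
      
      $ sysctl net.ipv4.udp_mem
      net.ipv4.udp_mem = 764304 1019074 1528608   # min, pressure, max (pages)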
      Signed-off-by: Menglong Dong <imagedong@tencent.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 02 Nov 2021 (2 commits)
    • net: ndisc: introduce ndisc_evict_nocarrier sysctl parameter · 18ac597a
      Authored by James Prestwood
      In most situations the neighbor discovery cache should be cleared on a
      NOCARRIER event, which is currently done unconditionally. But for wireless
      roams the neighbor discovery cache can and should remain intact since
      the underlying network has not changed.
      
      This patch introduces a sysctl option, ndisc_evict_nocarrier, which can
      be disabled by a wireless supplicant during a roam. This allows packets
      to be sent immediately after a roam, without having to wait for
      neighbor discovery.
      
      A user reported roughly a 1 second delay after a roam before packets
      could be sent out (note, on IPv4). This delay was due to the ARP
      cache being cleared. During testing of this same scenario using IPv6
      no delay was noticed, but regardless there is no reason to clear
      the ndisc cache for wireless roams.
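      
      A minimal usage sketch, assuming a supplicant managing a hypothetical
      wlan0 interface:
      
      # keep the IPv6 neighbor cache across a roam; restore with 1 afterwards
      sysctl -w net.ipv6.conf.wlan0.ndisc_evict_nocarrier=0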
      Signed-off-by: James Prestwood <prestwoj@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: arp: introduce arp_evict_nocarrier sysctl parameter · fcdb44d0
      Authored by James Prestwood
      This change introduces a new sysctl parameter, arp_evict_nocarrier.
      When set (default) the ARP cache will be cleared on a NOCARRIER event.
      This new option defaults to '1', which maintains the existing
      behavior.
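      
      A minimal usage sketch, assuming a hypothetical wlan0 interface (see
      the reproduction steps below for background):
      
      # keep the ARP cache across a wireless roam; '1' restores the default
      sysctl -w net.ipv4.conf.wlan0.arp_evict_nocarrier=0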
      
      Clearing the ARP cache on NOCARRIER is relatively new, introduced by:
      
      commit 859bd2ef
      Author: David Ahern <dsahern@gmail.com>
      Date:   Thu Oct 11 20:33:49 2018 -0700
      
          net: Evict neighbor entries on carrier down
      
      The reason for this change is to prevent the ARP cache from being
      cleared when a wireless device roams. Specifically, for wireless roams
      the ARP cache should not be cleared because the underlying network has not
      changed. Clearing the ARP cache in this case can introduce significant
      delays in sending packets after a roam.
      
      A user reported such a situation here:
      
      https://lore.kernel.org/linux-wireless/CACsRnHWa47zpx3D1oDq9JYnZWniS8yBwW1h0WAVZ6vrbwL_S0w@mail.gmail.com/
      
      After some investigation it was found that the kernel was holding onto
      packets until ARP finished, which resulted in this 1 second delay. It
      was also found that the first ARP who-has was never responded to,
      which is actually what causes the delay. This change is more or less
      working around this behavior, but again, there is no reason to clear
      the cache on a roam anyway.
      
      As for the unanswered who-has, we know the packet made it OTA since
      it was seen while monitoring. Why it never received a response is
      unknown. In any case, since this is a problem on the AP side of things,
      all that can be done is to work around it until it is solved.
      
      Some background on testing/reproducing the packet delay:
      
      Hardware:
       - 2 access points configured for Fast BSS Transition (Though I don't
         see why regular reassociation wouldn't have the same behavior)
       - Wireless station running IWD as supplicant
       - A device on network able to respond to pings (I used one of the APs)
      
      Procedure:
       - Connect to first AP
       - Ping once to establish an ARP entry
       - Start a tcpdump
       - Roam to second AP
       - Wait for operstate UP event, and note the timestamp
       - Start pinging
      
      Results:
      
      Below is the tcpdump after UP. It was recorded that the interface went
      UP at 10:42:01.432875.
      
      10:42:01.461871 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28
      10:42:02.497976 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28
      10:42:02.507162 ARP, Reply 192.168.254.1 is-at ac:86:74:55:b0:20, length 46
      10:42:02.507185 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 1, length 64
      10:42:02.507205 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 2, length 64
      10:42:02.507212 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 3, length 64
      10:42:02.507219 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 4, length 64
      10:42:02.507225 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 5, length 64
      10:42:02.507232 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 6, length 64
      10:42:02.515373 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 1, length 64
      10:42:02.521399 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 2, length 64
      10:42:02.521612 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 3, length 64
      10:42:02.521941 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 4, length 64
      10:42:02.522419 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 5, length 64
      10:42:02.523085 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 6, length 64
      
      You can see the first ARP who-has went out very quickly after UP, but
      was never responded to. Nearly a second later the kernel retries and
      gets a response. Only then do the ping packets go out. If an ARP entry
      is manually added prior to UP (after the cache is cleared), the first
      ping is never responded to, so it's not only an issue with ARP but
      with data packets in general.
      
      As mentioned before, the wireless interface was also monitored to verify
      that the ping/ARP packets made it OTA, which was observed to be true.
      Signed-off-by: James Prestwood <prestwoj@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  9. 23 Sep 2021 (1 commit)
  10. 22 Jul 2021 (1 commit)
  11. 21 Jul 2021 (1 commit)
    • ipv6: ioam: Documentation for new IOAM sysctls · de8e80a5
      Authored by Justin Iurman
      Add documentation for new IOAM sysctls:
       - ioam6_id and ioam6_id_wide: two per-namespace sysctls
       - ioam6_enabled, ioam6_id and ioam6_id_wide: three per-interface sysctls
      
      Example of IOAM configuration based on the following simple topology:
      
       _____              _____              _____
      |     | eth0  eth0 |     | eth1  eth0 |     |
      |  A  |.----------.|  B  |.----------.|  C  |
      |_____|            |_____|            |_____|
      
      1) Node and interface IDs can be configured for IOAM:
      
        # IOAM ID of A = 1, IOAM ID of A.eth0 = 11
        (A) sysctl -w net.ipv6.ioam6_id=1
        (A) sysctl -w net.ipv6.conf.eth0.ioam6_id=11
      
        # IOAM ID of B = 2, IOAM ID of B.eth0 = 21, IOAM ID of B.eth1 = 22
        (B) sysctl -w net.ipv6.ioam6_id=2
        (B) sysctl -w net.ipv6.conf.eth0.ioam6_id=21
        (B) sysctl -w net.ipv6.conf.eth1.ioam6_id=22
      
        # IOAM ID of C = 3, IOAM ID of C.eth0 = 31
        (C) sysctl -w net.ipv6.ioam6_id=3
        (C) sysctl -w net.ipv6.conf.eth0.ioam6_id=31
      
        Note that the "_wide" ID equivalents can be configured the same way.
      
      2) Each node can be configured to form an IOAM domain. For instance,
         we allow IOAM from A to C only (not the reverse path), i.e. enable
         IOAM on ingress for B.eth0 and C.eth0:
      
        (B) sysctl -w net.ipv6.conf.eth0.ioam6_enabled=1
        (C) sysctl -w net.ipv6.conf.eth0.ioam6_enabled=1
      
      3) An IOAM domain (e.g. ID=123) is defined and made known to each node:
      
        (A) ip ioam namespace add 123
        (B) ip ioam namespace add 123
        (C) ip ioam namespace add 123
      
      4) Finally, an IOAM Pre-allocated Trace can be inserted in traffic sent
         by A when C (e.g. db02::2) is the destination:
      
        (A) ip -6 route add db02::2/128 encap ioam6 trace type 0x800000 ns 123
            size 12 dev eth0
      Signed-off-by: Justin Iurman <justin.iurman@uliege.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 25 Jun 2021 (1 commit)
  13. 23 Jun 2021 (1 commit)
  14. 16 Jun 2021 (1 commit)
  15. 19 May 2021 (4 commits)
    • ipv6: Add custom multipath hash policy · 73c2c5cb
      Authored by Ido Schimmel
      Add a new multipath hash policy where the packet fields used for hash
      calculation are determined by user space via the
      fib_multipath_hash_fields sysctl that was introduced in the previous
      patch.
      
      The current set of available packet fields includes both outer and inner
      fields, which requires two invocations of the flow dissector. Avoid
      unnecessary dissection of the outer or inner flows by skipping
      dissection if none of the outer or inner fields are required.
      
      In accordance with the existing policies, when an skb is not available,
      packet fields are extracted from the provided flow key, in which case
      only outer fields are considered.
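      
      A configuration sketch, assuming the custom policy is selected with
      value 3 of fib_multipath_hash_policy, as introduced by this series:
      
      # choose the fields first, then switch to the custom policy
      sysctl -w net.ipv6.fib_multipath_hash_fields=0x0037
      sysctl -w net.ipv6.fib_multipath_hash_policy=3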
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Add a sysctl to control multipath hash fields · ed13923f
      Authored by Ido Schimmel
      A subsequent patch will add a new multipath hash policy where the packet
      fields used for multipath hash calculation are determined by user space.
      This patch adds a sysctl that allows user space to set these fields.
      
      The packet fields are represented using a bitmask and are common between
      IPv4 and IPv6 to allow user space to use the same numbering across both
      protocols. For example, to hash based on standard 5-tuple:
      
       # sysctl -w net.ipv6.fib_multipath_hash_fields=0x0037
       net.ipv6.fib_multipath_hash_fields = 0x0037
      
      To avoid introducing holes in 'struct netns_sysctl_ipv6', move the
      'bindv6only' field after the multipath hash fields.
      
      The kernel rejects unknown fields, for example:
      
       # sysctl -w net.ipv6.fib_multipath_hash_fields=0x1000
       sysctl: setting key "net.ipv6.fib_multipath_hash_fields": Invalid argument
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Add custom multipath hash policy · 4253b498
      Authored by Ido Schimmel
      Add a new multipath hash policy where the packet fields used for hash
      calculation are determined by user space via the
      fib_multipath_hash_fields sysctl that was introduced in the previous
      patch.
      
      The current set of available packet fields includes both outer and inner
      fields, which requires two invocations of the flow dissector. Avoid
      unnecessary dissection of the outer or inner flows by skipping
      dissection if none of the outer or inner fields are required.
      
      In accordance with the existing policies, when an skb is not available,
      packet fields are extracted from the provided flow key, in which case
      only outer fields are considered.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Add a sysctl to control multipath hash fields · ce5c9c20
      Authored by Ido Schimmel
      A subsequent patch will add a new multipath hash policy where the packet
      fields used for multipath hash calculation are determined by user space.
      This patch adds a sysctl that allows user space to set these fields.
      
      The packet fields are represented using a bitmask and are common between
      IPv4 and IPv6 to allow user space to use the same numbering across both
      protocols. For example, to hash based on standard 5-tuple:
      
       # sysctl -w net.ipv4.fib_multipath_hash_fields=0x0037
       net.ipv4.fib_multipath_hash_fields = 0x0037
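      
       For reference, a breakdown of the 0x0037 example, assuming the field
       bit assignments defined by this series:
      
       # 0x0037 = 0x0001 (src IP) | 0x0002 (dst IP) | 0x0004 (IP proto)
       #        | 0x0010 (src port) | 0x0020 (dst port)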
      
      The kernel rejects unknown fields, for example:
      
       # sysctl -w net.ipv4.fib_multipath_hash_fields=0x1000
       sysctl: setting key "net.ipv4.fib_multipath_hash_fields": Invalid argument
      
      More fields can be added in the future, if needed.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 15 Apr 2021 (1 commit)
  17. 02 Apr 2021 (1 commit)
    • net: document a side effect of ip_local_reserved_ports · a7a80b17
      Authored by Otto Hollmann
      If ip_local_port_range overlaps ip_local_reserved_ports with a huge reserved block, it skews the probability of selecting ephemeral ports; see net/ipv4/inet_hashtables.c:723:
      
          int __inet_hash_connect(
          ...
                  for (i = 0; i < remaining; i += 2, port += 2) {
                          if (unlikely(port >= high))
                                  port -= remaining;
                          if (inet_is_local_reserved_port(net, port))
                                  continue;
      
          E.g., if there is a reserved block of 10000 ports, the two ports right after this block will be 5000 times more likely to be selected than others.
          If this was intended, we can/should add a note to the documentation, as proposed in this commit; otherwise we should think about a different solution. One option could be a mapping table of contiguous port ranges. Another option could be letting the user modify the step (port += 2) in the loop above, e.g. via a new sysctl parameter.
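      
      A quick way to observe the described bias, using a hypothetical
      reserved block inside the ephemeral range:
      
      sysctl -w net.ipv4.ip_local_port_range="32000 60999"
      sysctl -w net.ipv4.ip_local_reserved_ports="40000-50000"
      # the first ports above 50000 will now be picked far more often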
      Signed-off-by: Otto Hollmann <otto.hollmann@suse.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 31 Mar 2021 (1 commit)
  19. 12 Feb 2021 (1 commit)
  20. 10 Feb 2021 (1 commit)
  21. 09 Feb 2021 (2 commits)
  22. 03 Feb 2021 (2 commits)
    • net: ipv6: Emit notification when fib hardware flags are changed · 907eea48
      Authored by Amit Cohen
      After installing a route to the kernel, user space receives an
      acknowledgment, which means the route was installed in the kernel,
      but not necessarily in hardware.
      
      The asynchronous nature of route installation in hardware can lead
      to a routing daemon advertising a route before it was actually installed in
      hardware. This can result in packet loss or mis-routed packets until the
      route is installed in hardware.
      
      It is also possible for a route already installed in hardware to change
      its action and therefore its flags. For example, a host route that is
      trapping packets can be "promoted" to perform decapsulation following
      the installation of an IPinIP/VXLAN tunnel.
      
      Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
      are changed. The aim is to provide an indication to user-space
      (e.g., routing daemons) about the state of the route in hardware.
      
      Introduce a sysctl that controls this behavior.
      
      Keep the default value at 0 (i.e., do not emit notifications) for several
      reasons:
      - Multiple RTM_NEWROUTE notifications per route might confuse existing
        routing daemons.
      - Convergence reasons in routing daemons.
      - The extra notifications will negatively impact the insertion rate.
      - Not all users are interested in these notifications.
      
      Move fib6_info_hw_flags_set() to a C file because it is no longer a
      short function.
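      
      The knob is not named above; assuming it is the per-family
      fib_notify_on_flag_change sysctl added by this pair of patches
      (see also the IPv4 entry below), enabling it would look like:
      
      # 0 (default) = no notifications, 1 = notify on flag changes
      sysctl -w net.ipv6.fib_notify_on_flag_change=1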
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ipv4: Emit notification when fib hardware flags are changed · 680aea08
      Authored by Amit Cohen
      After installing a route to the kernel, user space receives an
      acknowledgment, which means the route was installed in the kernel,
      but not necessarily in hardware.
      
      The asynchronous nature of route installation in hardware can lead to a
      routing daemon advertising a route before it was actually installed in
      hardware. This can result in packet loss or mis-routed packets until the
      route is installed in hardware.
      
      It is also possible for a route already installed in hardware to change
      its action and therefore its flags. For example, a host route that is
      trapping packets can be "promoted" to perform decapsulation following
      the installation of an IPinIP/VXLAN tunnel.
      
      Emit RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
      are changed. The aim is to provide an indication to user-space
      (e.g., routing daemons) about the state of the route in hardware.
      
      Introduce a sysctl that controls this behavior.
      
      Keep the default value at 0 (i.e., do not emit notifications) for several
      reasons:
      - Multiple RTM_NEWROUTE notifications per route might confuse existing
        routing daemons.
      - Convergence reasons in routing daemons.
      - The extra notifications will negatively impact the insertion rate.
      - Not all users are interested in these notifications.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Acked-by: Roopa Prabhu <roopa@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  23. 02 Feb 2021 (1 commit)
  24. 27 Jan 2021 (1 commit)
    • net: allow user to set metric on default route learned via Router Advertisement · 6b2e04bc
      Authored by Praveen Chaudhary
      For IPv4, the default route is learned via DHCPv4, and the user is allowed
      to change its metric in configuration, e.g. /etc/network/interfaces. But for
      IPv6, the default route can be learned via RA, for which a fixed metric
      value of 1024 is currently used.
      
      Ideally, the user should be able to configure a metric on the IPv6 default
      route just as for IPv4. This patch adds a sysctl for that.
      
      Logs:
      
      For IPv4:
      
      Config in /etc/network/interfaces:
      auto eth0
      iface eth0 inet dhcp
          metric 4261413864
      
      IPv4 Kernel Route Table:
      $ ip route list
      default via 172.21.47.1 dev eth0 metric 4261413864
      
      FRR Table, if a static route is configured:
      [In a real scenario, it is useful to prefer a BGP-learned default route over the DHCPv4 default route.]
      Codes: K - kernel route, C - connected, S - static, R - RIP,
             O - OSPF, I - IS-IS, B - BGP, P - PIM, E - EIGRP, N - NHRP,
             T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
             > - selected route, * - FIB route
      
      S>* 0.0.0.0/0 [20/0] is directly connected, eth0, 00:00:03
      K   0.0.0.0/0 [254/1000] via 172.21.47.1, eth0, 6d08h51m
      
      i.e., the user can prefer a default route learned via a routing protocol
      in IPv4. Similar behavior is not possible for IPv6 without this fix.
      
      After fix [for IPv6]:
      sudo sysctl -w net.ipv6.conf.eth0.ra_defrtr_metric=1996489705
      
      IP monitor: [When IPv6 RA is received]
      default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489705  pref high
      
      Kernel IPv6 routing table
      $ ip -6 route list
      default via fe80::be16:65ff:feb3:ce8e dev eth0 proto ra metric 1996489705 expires 21sec hoplimit 64 pref high
      
      FRR Table, if a static route is configured:
      [In a real scenario, it is useful to prefer a BGP-learned default route over the IPv6 RA default route.]
      Codes: K - kernel route, C - connected, S - static, R - RIPng,
             O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table,
             v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
             > - selected route, * - FIB route
      
      S>* ::/0 [20/0] is directly connected, eth0, 00:00:06
      K   ::/0 [119/1001] via fe80::xx16:xxxx:feb3:ce8e, eth0, 6d07h43m
      
      If the metric is changed later, the effect will be seen only when the next
      IPv6 RA is received, because the default route must be fully controlled by
      the RA message. Below, the metric is changed from 1996489705 to 1996489704.
      
      $ sudo sysctl -w net.ipv6.conf.eth0.ra_defrtr_metric=1996489704
      net.ipv6.conf.eth0.ra_defrtr_metric = 1996489704
      
      IP monitor:
      [On next IPv6 RA msg, Kernel deletes prev route and installs new route with updated metric]
      
      Deleted default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489705 expires 3sec hoplimit 64 pref high
      default via fe80::xx16:xxxx:feb3:ce8e dev eth0 proto ra metric 1996489704 pref high
      Signed-off-by: Praveen Chaudhary <pchaudhary@linkedin.com>
      Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20210125214430.24079-1-pchaudhary@linkedin.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  25. 24 Jan 2021 (1 commit)
  26. 12 Nov 2020 (1 commit)
    • net: evaluate net.ipvX.conf.all.ignore_routes_with_linkdown · c0c5a60f
      Authored by Vincent Bernat
      Introduced in 0eeb075f, the "ignore_routes_with_linkdown" sysctl
      ignores a route whose interface is down. It is provided as a
      per-interface sysctl. However, while an "all" variant is exposed, it
      was a no-op since it was never evaluated. We use the usual "or" logic
      for this kind of sysctl.
      
      Tested with:
      
          ip link add type veth # veth0 + veth1
          ip link add type veth # veth2 + veth3
          ip link set up dev veth0
          ip link set up dev veth1 # link-status paired with veth0
          ip link set up dev veth2
          ip link set up dev veth3 # link-status paired with veth2
      
          # First available path
          ip -4 addr add 203.0.113.${uts#H}/24 dev veth0
          ip -6 addr add 2001:db8:1::${uts#H}/64 dev veth0
      
          # Second available path
          ip -4 addr add 192.0.2.${uts#H}/24 dev veth2
          ip -6 addr add 2001:db8:2::${uts#H}/64 dev veth2
      
          # More specific route through first path
          ip -4 route add 198.51.100.0/25 via 203.0.113.254 # via veth0
          ip -6 route add 2001:db8:3::/56 via 2001:db8:1::ff # via veth0
      
          # Less specific route through second path
          ip -4 route add 198.51.100.0/24 via 192.0.2.254 # via veth2
          ip -6 route add 2001:db8:3::/48 via 2001:db8:2::ff # via veth2
      
          # H1: enable on "all"
          # H2: enable on "veth0"
          # H3: control, leave defaults
          for v in ipv4 ipv6; do
            case $uts in
              H1)
                sysctl -qw net.${v}.conf.all.ignore_routes_with_linkdown=1
                ;;
              H2)
                sysctl -qw net.${v}.conf.veth0.ignore_routes_with_linkdown=1
                ;;
            esac
          done
      
          set -xe
          # When veth0 is up, best route is through veth0
          ip -o route get 198.51.100.1 | grep -Fw veth0
          ip -o route get 2001:db8:3::1 | grep -Fw veth0
      
          # When veth0 is down, best route should be through veth2 on H1/H2,
          # but through veth0 on H3
          ip link set down dev veth1 # down veth0
          ip route show
          [ $uts != H3 ] || ip -o route get 198.51.100.1 | grep -Fw veth0
          [ $uts != H3 ] || ip -o route get 2001:db8:3::1 | grep -Fw veth0
          [ $uts = H3 ] || ip -o route get 198.51.100.1 | grep -Fw veth2
          [ $uts = H3 ] || ip -o route get 2001:db8:3::1 | grep -Fw veth2
      
      Without this patch, the last two lines would fail on H1 (the one using
      the "all" sysctl). With the patch, everything succeeds as expected.
      
      Also document the sysctl in `ip-sysctl.rst`.
      
      Fixes: 0eeb075f ("net: ipv4 sysctl option to ignore routes when nexthop link is down")
      Signed-off-by: Vincent Bernat <vincent@bernat.ch>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  27. 31 Oct 2020 (2 commits)
    • sctp: enable udp tunneling socks · 046c052b
      Authored by Xin Long
      This patch is to enable udp tunneling socks by calling
      sctp_udp_sock_start() in sctp_ctrlsock_init(), and
      sctp_udp_sock_stop() in sctp_ctrlsock_exit().
      
      Also add the sysctl udp_port to allow users to change the listening
      sock's port.
      
      With this patch, the whole SCTP over UDP feature can be enabled
      and used.
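      
      An enablement sketch, assuming the IANA-assigned SCTP-over-UDP port
      9899 from RFC 6951 (0 keeps the feature off):
      
      sysctl -w net.sctp.udp_port=9899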
      
      v1->v2:
        - Also update ctl_sock udp_port in proc_sctp_do_udp_port()
          where netns udp_port gets changed.
      v2->v3:
        - Call htons() when setting sk udp_port from netns udp_port.
      v3->v4:
        - Not call sctp_udp_sock_start() when new_value is 0.
        - Add udp_port entry in ip-sysctl.rst.
      v4->v5:
        - Not call sctp_udp_sock_start/stop() in sctp_ctrlsock_init/exit().
        - Improve the description of udp_port in ip-sysctl.rst.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • sctp: add encap_port for netns sock asoc and transport · e8a3001c
      Authored by Xin Long
      encap_port is added per netns/sock/assoc/transport, and each latter
      one's encap_port inherits the former one's by default. The transport's
      encap_port value mostly decides whether a packet goes out
      UDP-encapsulated or not.
      
      This patch also allows users to set netns' encap_port by sysctl.
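      
      For example, to make outgoing packets default to UDP encapsulation on
      the RFC 6951 port (a sketch; per-assoc/transport values still
      override the netns default):
      
      sysctl -w net.sctp.encap_port=9899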
      
      v1->v2:
        - Change to define encap_port as __be16 for sctp_sock, asoc and
          transport.
      v2->v3:
        - No change.
      v3->v4:
        - Add 'encap_port' entry in ip-sysctl.rst.
      v4->v5:
        - Improve the description of encap_port in ip-sysctl.rst.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  28. 17 Oct 2020 (1 commit)
  29. 05 Jul 2020 (1 commit)
  30. 07 May 2020 (1 commit)
    • ipv6: Implement draft-ietf-6man-rfc4941bis · 969c5464
      Authored by Fernando Gont
      Implement the upcoming rev of RFC4941 (IPv6 temporary addresses):
      https://tools.ietf.org/html/draft-ietf-6man-rfc4941bis-09
      
      * Reduces the default Valid Lifetime to 2 days
        The number of extra addresses employed when Valid Lifetime was
        7 days exacerbated the stress caused on network
        elements/devices. Additionally, the motivation for temporary
        addresses is indeed privacy and reduced exposure. With a
        default Valid Lifetime of 7 days, an address that becomes
        revealed by active communication is reachable and exposed for
        one whole week. The only use case for a Valid Lifetime of 7
        days could be some application that expects to have long-lived
        connections. But if you want long-lived connections, you
        shouldn't be using a temporary address in the first place.
        Additionally, in the era of mobile devices, general applications
        should nevertheless be prepared for and robust to address
        changes (e.g. nodes swapping between wifi and 4G, etc.)
      
      * Employs different IIDs for different prefixes
        To avoid network activity correlation among addresses configured
        for different prefixes
      
      * Uses a simpler algorithm for IID generation
        No need to store "history" anywhere
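      
      A sketch of the related per-interface knobs (eth0 is a placeholder;
      172800 s is the 2-day Valid Lifetime described above):
      
      sysctl -w net.ipv6.conf.eth0.use_tempaddr=2       # prefer temporary addresses
      sysctl -w net.ipv6.conf.eth0.temp_valid_lft=172800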
      Signed-off-by: Fernando Gont <fgont@si6networks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  31. 01 May 2020 (2 commits)