1. 02 Feb 2021, 1 commit
  2. 24 Jan 2021, 1 commit
  3. 12 Nov 2020, 1 commit
    • net: evaluate net.ipvX.conf.all.ignore_routes_with_linkdown · c0c5a60f
      Authored by Vincent Bernat
      Introduced in 0eeb075f, the "ignore_routes_with_linkdown" sysctl
      ignores a route whose interface is down. It is provided as a
      per-interface sysctl. However, while an "all" variant is exposed, it
      was a no-op since it was never evaluated. Use the usual "or" logic
      for this kind of sysctl.
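
      As a sketch of that "or" logic (interface name assumed), the setting
      takes effect when either knob is 1:

          sysctl -w net.ipv4.conf.all.ignore_routes_with_linkdown=1   # covers every interface
          sysctl -w net.ipv4.conf.eth0.ignore_routes_with_linkdown=0  # still in effect via "all"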
      
      Tested with:
      
          ip link add type veth # veth0 + veth1
          ip link add type veth # veth2 + veth3
          ip link set up dev veth0
          ip link set up dev veth1 # link-status paired with veth0
          ip link set up dev veth2
          ip link set up dev veth3 # link-status paired with veth2
      
          # First available path
          ip -4 addr add 203.0.113.${uts#H}/24 dev veth0
          ip -6 addr add 2001:db8:1::${uts#H}/64 dev veth0
      
          # Second available path
          ip -4 addr add 192.0.2.${uts#H}/24 dev veth2
          ip -6 addr add 2001:db8:2::${uts#H}/64 dev veth2
      
          # More specific route through first path
          ip -4 route add 198.51.100.0/25 via 203.0.113.254 # via veth0
          ip -6 route add 2001:db8:3::/56 via 2001:db8:1::ff # via veth0
      
          # Less specific route through second path
          ip -4 route add 198.51.100.0/24 via 192.0.2.254 # via veth2
          ip -6 route add 2001:db8:3::/48 via 2001:db8:2::ff # via veth2
      
          # H1: enable on "all"
          # H2: enable on "veth0"
          for v in ipv4 ipv6; do
            case $uts in
              H1)
                sysctl -qw net.${v}.conf.all.ignore_routes_with_linkdown=1
                ;;
              H2)
                sysctl -qw net.${v}.conf.veth0.ignore_routes_with_linkdown=1
                ;;
            esac
          done
      
          set -xe
          # When veth0 is up, best route is through veth0
          ip -o route get 198.51.100.1 | grep -Fw veth0
          ip -o route get 2001:db8:3::1 | grep -Fw veth0
      
          # When veth0 is down, best route should be through veth2 on H1/H2,
          # but stay through veth0 on H3
          ip link set down dev veth1 # down veth0
          ip route show
          [ $uts != H3 ] || ip -o route get 198.51.100.1 | grep -Fw veth0
          [ $uts != H3 ] || ip -o route get 2001:db8:3::1 | grep -Fw veth0
          [ $uts = H3 ] || ip -o route get 198.51.100.1 | grep -Fw veth2
          [ $uts = H3 ] || ip -o route get 2001:db8:3::1 | grep -Fw veth2
      
      Without this patch, the last two lines would fail on H1 (the host
      using the "all" sysctl). With the patch, everything succeeds as
      expected.
      
      Also document the sysctl in `ip-sysctl.rst`.
      
      Fixes: 0eeb075f ("net: ipv4 sysctl option to ignore routes when nexthop link is down")
      Signed-off-by: Vincent Bernat <vincent@bernat.ch>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c0c5a60f
  4. 31 Oct 2020, 2 commits
    • sctp: enable udp tunneling socks · 046c052b
      Authored by Xin Long
      This patch is to enable udp tunneling socks by calling
      sctp_udp_sock_start() in sctp_ctrlsock_init(), and
      sctp_udp_sock_stop() in sctp_ctrlsock_exit().
      
      Also add a 'udp_port' sysctl to allow users to change the
      listening sock's port.

      With this patch, the whole sctp over udp feature can be
      enabled and used.
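
      A usage sketch (sysctl path assumed under net.sctp; 9899 is the
      RFC 6951 registered SCTP-over-UDP port):

          sysctl -w net.sctp.udp_port=9899   # 0 means no listening sock is started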
      
      v1->v2:
        - Also update ctl_sock udp_port in proc_sctp_do_udp_port()
          where netns udp_port gets changed.
      v2->v3:
        - Call htons() when setting sk udp_port from netns udp_port.
      v3->v4:
        - Not call sctp_udp_sock_start() when new_value is 0.
        - Add udp_port entry in ip-sysctl.rst.
      v4->v5:
        - Not call sctp_udp_sock_start/stop() in sctp_ctrlsock_init/exit().
        - Improve the description of udp_port in ip-sysctl.rst.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      046c052b
    • sctp: add encap_port for netns sock asoc and transport · e8a3001c
      Authored by Xin Long
      encap_port is added per netns/sock/asoc/transport, and each later
      level's encap_port inherits from the former one by default.
      The transport's encap_port value mostly decides whether a
      packet goes out udp-encapsulated or not.
      
      This patch also allows users to set netns' encap_port by sysctl.
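
      A minimal sketch (sysctl path assumed under net.sctp; port value
      assumed, RFC 6951's registered port shown):

          sysctl -w net.sctp.encap_port=9899   # 0 means no udp encapsulation by default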
      
      v1->v2:
        - Change to define encap_port as __be16 for sctp_sock, asoc and
          transport.
      v2->v3:
        - No change.
      v3->v4:
        - Add 'encap_port' entry in ip-sysctl.rst.
      v4->v5:
        - Improve the description of encap_port in ip-sysctl.rst.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e8a3001c
  5. 17 Oct 2020, 1 commit
  6. 05 Jul 2020, 1 commit
  7. 07 May 2020, 1 commit
    • ipv6: Implement draft-ietf-6man-rfc4941bis · 969c5464
      Authored by Fernando Gont
      Implement the upcoming rev of RFC4941 (IPv6 temporary addresses):
      https://tools.ietf.org/html/draft-ietf-6man-rfc4941bis-09
      
      * Reduces the default Valid Lifetime to 2 days
        The number of extra addresses employed when Valid Lifetime was
        7 days exacerbated the stress caused on network
        elements/devices. Additionally, the motivation for temporary
        addresses is indeed privacy and reduced exposure. With a
        default Valid Lifetime of 7 days, an address that becomes
        revealed by active communication is reachable and exposed for
        one whole week. The only use case for a Valid Lifetime of 7
        days could be some application that expects to have long-lived
        connections. But if you want long-lived connections, you
        shouldn't be using a temporary address in the first place.
        Additionally, in the era of mobile devices, general
        applications should nevertheless be prepared for and robust to
        address changes (e.g. nodes swap wifi <-> 4G, etc.)
      
      * Employs different IIDs for different prefixes
        To avoid network activity correlation among addresses configured
        for different prefixes
      
      * Uses a simpler algorithm for IID generation
        No need to store "history" anywhere
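
      A usage sketch (interface name assumed; 172800 s is the 2-day
      default described above):

          sysctl -w net.ipv6.conf.eth0.use_tempaddr=2       # enable and prefer temporary addresses
          sysctl -w net.ipv6.conf.eth0.temp_valid_lft=172800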
      Signed-off-by: Fernando Gont <fgont@si6networks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      969c5464
  8. 01 May 2020, 2 commits
  9. 29 Apr 2020, 2 commits
  10. 23 Apr 2020, 1 commit
  11. 16 Apr 2020, 1 commit
  12. 13 Mar 2020, 1 commit
    • tcp: bind(0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted. · 4b01a967
      Authored by Kuniyuki Iwashima
      Commit aacd9289 ("tcp: bind() use stronger
      condition for bind_conflict") introduced a restriction that forbids
      binding SO_REUSEADDR-enabled sockets to the same (addr, port) tuple,
      in order to assign ports dispersedly so that we can connect to the
      same remote host.

      The change results in accelerated port depletion, so that we fail to
      bind sockets to the same local port even if we want to connect to
      different remote hosts.
      
      You can reproduce this issue by following the instructions below.
      
        1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
        2. set SO_REUSEADDR to two sockets.
        3. bind two sockets to (localhost, 0) and the latter fails.
      
      Therefore, when ephemeral ports are exhausted, bind(0) should fall
      back to the legacy behaviour, enabling the SO_REUSEADDR option to
      make it possible to connect to different remote (addr, port) tuples.

      This patch allows us to bind SO_REUSEADDR enabled sockets to the same
      (addr, port) only when net.ipv4.ip_autobind_reuse is set to 1 and all
      ephemeral ports are exhausted. This also allows connect() and listen()
      to share ports in the following way and may break some applications,
      so ip_autobind_reuse is 0 by default, disabling the feature.
      
        1. setsockopt(sk1, SO_REUSEADDR)
        2. setsockopt(sk2, SO_REUSEADDR)
        3. bind(sk1, saddr, 0)
        4. bind(sk2, saddr, 0)
        5. connect(sk1, daddr)
        6. listen(sk2)
      
      If it is set to 1, we can fully utilize the 4-tuples, but we should
      use IP_BIND_ADDRESS_NO_PORT for bind()+connect() where possible.
      
      The notable thing is that if all sockets bound to the same port have
      both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
      ephemeral port and also do listen().
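
      A minimal opt-in sketch:

          sysctl -w net.ipv4.ip_autobind_reuse=1   # only consulted once ephemeral ports are exhausted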
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4b01a967
  13. 03 Jan 2020, 1 commit
  14. 10 Dec 2019, 1 commit
    • net-tcp: Disable TCP ssthresh metrics cache by default · 65e6d901
      Authored by Kevin(Yudong) Yang
      This patch introduces a sysctl knob "net.ipv4.tcp_no_ssthresh_metrics_save"
      that disables TCP ssthresh metrics cache by default. Other parts of TCP
      metrics cache, e.g. rtt, cwnd, remain unchanged.
      
      As modern networks become more and more dynamic, the TCP metrics cache
      today often causes more harm than benefits. For example, the same IP
      address is often shared by different subscribers behind NAT in residential
      networks. Even if the IP address is not shared by different users,
      caching the slow-start threshold of a previous short flow using loss-based
      congestion control (e.g. cubic) often causes the future longer flows of
      the same network path to exit slow-start prematurely with abysmal
      throughput.
      
      Caching ssthresh is very risky and can lead to terrible performance.
      Therefore it makes sense to disable ssthresh caching by default and
      let administrators opt in for specific networks.
      This practice has also worked well for several years of deployment
      with CUBIC congestion control at Google.
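
      To opt back in to ssthresh caching on a network where it helps
      (the knob defaults to 1, i.e. no save):

          sysctl -w net.ipv4.tcp_no_ssthresh_metrics_save=0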
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Kevin(Yudong) Yang <yyd@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      65e6d901
  15. 27 Nov 2019, 1 commit
  16. 09 Nov 2019, 2 commits
    • sctp: add support for Primary Path Switchover · 34515e94
      Authored by Xin Long
      This is a new feature defined in section 5 of rfc7829: "Primary Path
      Switchover". By introducing a new tunable parameter:
      
        Primary.Switchover.Max.Retrans (PSMR)
      
      The primary path will be changed to another active path when the path
      error counter on the old primary path exceeds PSMR, so that "the SCTP
      sender is allowed to continue data transmission on a new working path
      even when the old primary destination address becomes active again".
      
      This patch is to add this tunable parameter, 'ps_retrans' per netns,
      sock, asoc and transport. It also allows a user to change ps_retrans
      per netns by sysctl, and ps_retrans per sock/asoc/transport will be
      initialized with it.
      
      The check will be done in sctp_do_8_2_transport_strike() when this
      feature is enabled.
      
      Note this feature is disabled by default, by initializing the
      per-netns 'ps_retrans' to 0xffff, and its value can't be set to
      less than 'pf_retrans' when changed by sysctl.
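
      A usage sketch (sysctl path assumed under net.sctp; value assumed,
      and it must not be less than pf_retrans):

          sysctl -w net.sctp.ps_retrans=5   # 65535 (0xffff) keeps the feature disabled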
      
      v3->v4:
        - add define SCTP_PS_RETRANS_MAX 0xffff, and use it on extra2 of
          sysctl 'ps_retrans'.
        - add a new entry for ps_retrans on ip-sysctl.txt.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      34515e94
    • sctp: add pf_expose per netns and sock and asoc · aef587be
      Authored by Xin Long
      As said in rfc7829, section 3, point 12:
      
        The SCTP stack SHOULD expose the PF state of its destination
        addresses to the ULP as well as provide the means to notify the
        ULP of state transitions of its destination addresses from
        active to PF, and vice versa.  However, it is recommended that
        an SCTP stack implementing SCTP-PF also allows for the ULP to be
        kept ignorant of the PF state of its destinations and the
        associated state transitions, thus allowing for retention of the
        simpler state transition model of [RFC4960] in the ULP.
      
      Not only does it allow exposing the PF state to the ULP, it also
      allows keeping the ULP ignorant of sctp-pf.

      So this patch is to add pf_expose per netns, sock and asoc. And in
      sctp_assoc_control_transport(), ulp_notify will be set to false if
      asoc->expose is not 'enabled', in the next patch.
      
      It also allows a user to change pf_expose per netns by sysctl, and
      pf_expose per sock and asoc will be initialized with it.
      
      Note that pf_expose also covers the SCTP_GET_PEER_ADDR_INFO sockopt,
      so that a user cannot query the state of a sctp-pf peer address
      when pf_expose is 'disabled', as said in section 7.3.
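
      A usage sketch (sysctl path assumed under net.sctp; numeric values
      assumed from the enum described in the changelog below, with 0 being
      the unset default):

          sysctl -w net.sctp.pf_expose=1   # expose the PF state to the ULP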
      
      v1->v2:
        - Fix a build warning noticed by Nathan Chancellor.
      v2->v3:
        - set pf_expose to UNUSED by default to keep compatible with old
          applications.
      v3->v4:
        - add a new entry for pf_expose on ip-sysctl.txt, as Marcelo suggested.
        - change this patch to 1/5, and move sctp_assoc_control_transport
          change into 2/5, as Marcelo suggested.
        - use SCTP_PF_EXPOSE_UNSET instead of SCTP_PF_EXPOSE_UNUSED, and
          set SCTP_PF_EXPOSE_UNSET to 0 in enum, as Marcelo suggested.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aef587be
  17. 01 Nov 2019, 2 commits
    • tcp: increase tcp_max_syn_backlog max value · 623d0c2d
      Authored by Eric Dumazet
      tcp_max_syn_backlog default value depends on memory size
      and TCP ehash size. Before this patch, the max value
      was 2048 [1], which is considered too small nowadays.
      
      Increase it to 4096 to match the recent SOMAXCONN change.
      
      [1] This is with TCP ehash size being capped to 524288 buckets.
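
      For example, to apply the new default explicitly:

          sysctl -w net.ipv4.tcp_max_syn_backlog=4096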
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Yue Cao <ycao009@ucr.edu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      623d0c2d
    • net: increase SOMAXCONN to 4096 · 19f92a03
      Authored by Eric Dumazet
      SOMAXCONN is the default value of /proc/sys/net/core/somaxconn.
      
      It has been defined as 128 more than 20 years ago.
      
      Since it caps the listen() backlog values, the very small value has
      caused numerous problems over the years, and many people had
      to raise it on their hosts after being hit by problems.

      Google has been using 1024 for at least 15 years, and we increased
      this to 4096 after the TCP listener rework was completed, more than
      4 years ago. We got no complaints about this change breaking any
      legacy application.

      Many applications indeed set up a TCP listener with listen(fd, -1);
      meaning they let the system select the backlog.

      Raising SOMAXCONN lowers the chance of the port being unavailable
      under even a small SYN flood attack, and reduces the possibility of
      side-channel vulnerabilities.
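
      The same value can also be set at runtime, without a rebuilt kernel:

          sysctl -w net.core.somaxconn=4096   # caps listen() backlogs, e.g. listen(fd, -1)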
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Yue Cao <ycao009@ucr.edu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19f92a03
  18. 10 Aug 2019, 1 commit
    • tcp: add new tcp_mtu_probe_floor sysctl · c04b79b6
      Authored by Josh Hunt
      The current implementation of TCP MTU probing can considerably
      underestimate the MTU on lossy connections, allowing the MSS to get
      down to 48. We have found that in almost all of these cases on our
      networks these paths can handle much larger MTUs, meaning the
      connections are being artificially limited. Even though TCP MTU
      probing can raise the MSS back up, we have seen this not to be the
      case, causing connections to be "stuck" with an MSS of 48 when heavy
      loss is present.

      Prior to pushing out this change we could not keep TCP MTU probing
      enabled because of the above reasons. Now, with a reasonable floor
      set, we've had it enabled for the past 6 months.
      
      The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
      administrators the ability to control the floor of MSS probing.
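
      For example (floor value assumed):

          sysctl -w net.ipv4.tcp_mtu_probe_floor=512   # never probe an MSS below 512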
      Signed-off-by: Josh Hunt <johunt@akamai.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c04b79b6
  19. 15 Jul 2019, 2 commits
  20. 09 Jul 2019, 1 commit
  21. 02 Jul 2019, 1 commit
  22. 16 Jun 2019, 1 commit
    • tcp: add tcp_min_snd_mss sysctl · 5f3e2bf0
      Authored by Eric Dumazet
      Some TCP peers announce a very small MSS option in their SYN and/or
      SYN/ACK messages.
      
      This forces the stack to send packets with a very high network/cpu
      overhead.
      
      Linux has enforced a minimal value of 48. Since this value includes
      the size of TCP options, and the options can consume up to 40
      bytes, each segment can include only 8 bytes of payload.
      
      In some cases, it can be useful to increase this minimum
      to a saner value.
      
      We still leave the default at 48 (TCP_MIN_SND_MSS) for compatibility
      reasons.
      
      Note that the TCP_MAXSEG socket option enforces a minimal value
      of TCP_MIN_MSS. David Miller increased this minimal value
      in commit c39508d6 ("tcp: Make TCP_MAXSEG minimum more correct.")
      from 64 to 88.
      
      We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS.
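
      For example (value assumed), to refuse any peer-announced MSS
      below 536:

          sysctl -w net.ipv4.tcp_min_snd_mss=536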
      
      CVE-2019-11479 -- tcp mss hardcoded to 48
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Suggested-by: Jonathan Looney <jtl@netflix.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: Bruce Curtis <brucec@netflix.com>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5f3e2bf0
  23. 15 Jun 2019, 2 commits
  24. 06 Jun 2019, 1 commit
  25. 31 May 2019, 1 commit
  26. 22 May 2019, 1 commit
  27. 19 Apr 2019, 1 commit
    • ipv6: Add rate limit mask for ICMPv6 messages · 0bc19985
      Authored by Stephen Suryaputra
      To make ICMPv6 closer to ICMPv4, add a ratemask parameter. Since the
      ICMPv6 message types use larger numeric values, a simple bitmask
      doesn't fit, so a large bitmap is used. The input and output are in
      the form of a list of ranges. The default rate limits all error
      messages except Packet Too Big; for Packet Too Big, the ratemask is
      used instead of a hard-coded exception.
      
      There are functions where icmpv6_xrlim_allow() and icmpv6_global_allow()
      aren't called. This patch only adds them to icmpv6_echo_reply().
      
      Rate limiting error messages is mandated by RFC 4443 but RFC 4890 says
      that it is also acceptable to rate limit informational messages. Thus,
      I removed the current hard-coded behavior of icmpv6_mask_allow() that
      doesn't rate limit informational messages.
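
      A usage sketch (sysctl path assumed; the list-of-ranges format
      mirrors the description above, with type 2 being Packet Too Big):

          sysctl -w net.ipv6.icmp.ratemask=0-1,3-127   # rate limit everything but type 2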
      
      v2: Add dummy function proc_do_large_bitmap() if CONFIG_PROC_SYSCTL
          isn't defined, expand the description in ip-sysctl.txt and remove
          unnecessary conditional before kfree().
      v3: Inline the bitmap instead of dynamically allocating it. A pointer
          to it is still needed because of the way proc_do_large_bitmap works.
      Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0bc19985
  28. 18 Apr 2019, 1 commit
    • ipv4: set the tcp_min_rtt_wlen range from 0 to one day · 19fad20d
      Authored by ZhangXiaoxu
      There is a UBSAN report as below:
      UBSAN: Undefined behaviour in net/ipv4/tcp_input.c:2877:56
      signed integer overflow:
      2147483647 * 1000 cannot be represented in type 'int'
      CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.1.0-rc4-00058-g582549e3 #1
      Call Trace:
       <IRQ>
       dump_stack+0x8c/0xba
       ubsan_epilogue+0x11/0x60
       handle_overflow+0x12d/0x170
       ? ttwu_do_wakeup+0x21/0x320
       __ubsan_handle_mul_overflow+0x12/0x20
       tcp_ack_update_rtt+0x76c/0x780
       tcp_clean_rtx_queue+0x499/0x14d0
       tcp_ack+0x69e/0x1240
       ? __wake_up_sync_key+0x2c/0x50
       ? update_group_capacity+0x50/0x680
       tcp_rcv_established+0x4e2/0xe10
       tcp_v4_do_rcv+0x22b/0x420
       tcp_v4_rcv+0xfe8/0x1190
       ip_protocol_deliver_rcu+0x36/0x180
       ip_local_deliver+0x15b/0x1a0
       ip_rcv+0xac/0xd0
       __netif_receive_skb_one_core+0x7f/0xb0
       __netif_receive_skb+0x33/0xc0
       netif_receive_skb_internal+0x84/0x1c0
       napi_gro_receive+0x2a0/0x300
       receive_buf+0x3d4/0x2350
       ? detach_buf_split+0x159/0x390
       virtnet_poll+0x198/0x840
       ? reweight_entity+0x243/0x4b0
       net_rx_action+0x25c/0x770
       __do_softirq+0x19b/0x66d
       irq_exit+0x1eb/0x230
       do_IRQ+0x7a/0x150
       common_interrupt+0xf/0xf
       </IRQ>
      
      It can be reproduced by:
        echo 2147483647 > /proc/sys/net/ipv4/tcp_min_rtt_wlen
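
      With the range restricted to [0, 86400] (one day), the reproducer's
      write is now refused:

          sysctl -w net.ipv4.tcp_min_rtt_wlen=86400       # accepted: the new maximum
          sysctl -w net.ipv4.tcp_min_rtt_wlen=2147483647  # now rejected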
      
      Fixes: f6722583 ("tcp: track min RTT using windowed min-filter")
      Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      19fad20d
  29. 12 Apr 2019, 1 commit
  30. 22 Mar 2019, 1 commit
    • ipv4: Allow amount of dirty memory from fib resizing to be controllable · 9ab948a9
      Authored by David Ahern
      The fib_trie implementation calls synchronize_rcu when a certain
      number of pages are dirty from freed entries. The number of pages was
      determined experimentally in 2009 (commit c3059477).
      
      At the current setting, synchronize_rcu is called often -- 51 times in
      a second in one test, with an average 8 msec delay per added fib entry.
      The total impact is a significant slowdown when modifying the fib. This
      is seen in the output of 'time' - the difference between real time and
      sys+user.
      For example, using 720,022 single path routes and 'ip -batch'[1]:
      
          $ time ./ip -batch ipv4/routes-1-hops
          real    0m14.214s
          user    0m2.513s
          sys     0m6.783s
      
      So roughly 35% of the actual time to install the routes is from the ip
      command getting scheduled out, most notably due to synchronize_rcu (this
      is observed using 'perf sched timehist').
      
      This patch makes the amount of dirty memory configurable, from 64k,
      where synchronize_rcu is called often (small, low-end systems that are
      memory sensitive), to 64M, where synchronize_rcu is called rarely
      during a large FIB change (for high-end systems with lots of memory).
      The default is 512kB, which corresponds to the current setting of 128
      pages with a 4kB page size.
      
      As an example, at 16MB the worst interval shows 4 calls to synchronize_rcu
      in a second blocking for up to 30 msec in a single instance, and a total
      of almost 100 msec across the 4 calls in the second. The trade-off is
      allowing FIB entries to consume more memory in a given time window,
      but with much better fib insertion rates (~30% increase in prefixes/sec).
      With this patch and net.ipv4.fib_sync_mem set to 16MB, the same batch
      file runs in:
      
          $ time ./ip -batch ipv4/routes-1-hops
          real    0m9.692s
          user    0m2.491s
          sys     0m6.769s
      
      So the dead time is reduced to about 1/2 second or <5% of the real time.
      
      [1] 'ip' modified to not request ACK messages which improves route
          insertion times by about 20%
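
      The setting used in the example above (value in bytes):

          sysctl -w net.ipv4.fib_sync_mem=16777216   # 16MB; default is 524288 (512kB)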
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9ab948a9
  31. 21 Mar 2019, 1 commit
  32. 20 Mar 2019, 1 commit
  33. 08 Dec 2018, 1 commit
    • neighbor: Improve garbage collection · 58956317
      Authored by David Ahern
      The existing garbage collection algorithm has a number of problems:
      
      1. The gc algorithm will not evict PERMANENT entries as those entries
         are managed by userspace, yet the existing algorithm walks the entire
         hash table which means it always considers PERMANENT entries when
         looking for entries to evict. In some use cases (e.g., EVPN) there
         can be tens of thousands of PERMANENT entries leading to wasted
         CPU cycles when gc kicks in. As an example, with 32k permanent
         entries, neigh_alloc has been observed taking more than 4 msec per
         invocation.
      
      2. Currently, when the number of neighbor entries hits gc_thresh2 and
         the last flush for the table was more than 5 seconds ago, gc kicks
         in and walks the entire hash table, evicting *all* entries not in
         PERMANENT or REACHABLE state and not marked as externally learned.
         There is no discriminator on when the neigh entry was created or if
         it just moved from REACHABLE to another NUD_VALID state (e.g.,
         NUD_STALE).
      
         It is possible for entries to be created or for established neighbor
         entries to be moved to STALE (e.g., an external node sends an ARP
         request) right before the 5 second window lapses:
      
              -----|---------x|----------|-----
                  t-5         t         t+5
      
         If that happens, those entries are evicted during gc, causing
         unnecessary thrashing on neighbor entries and on userspace caches
         trying to track them.
      
         Further, this contradicts the description of gc_thresh2 which says
         "Entries older than 5 seconds will be cleared".
      
         One workaround is to make gc_thresh2 == gc_thresh3 but that negates the
         whole point of having separate thresholds.
      
      3. Clearing *all* neigh non-PERMANENT/REACHABLE/externally learned
         entries when gc_thresh2 is exceeded is overkill and contributes to
         thrashing, especially during startup.
      
      This patch addresses these problems as follows:
      
      1. Use of a separate list_head to track entries that can be garbage
         collected along with a separate counter. PERMANENT entries are not
         added to this list.
      
         The gc_thresh parameters are only compared to the new counter, not
         the total entries in the table (see the sketch after this list). The
         forced_gc function is updated to only walk this new gc_list looking
         for entries to evict.
      
      2. Entries are added to the list head at the tail and removed from the
         front.
      
      3. Entries are only evicted if they were last updated more than 5 seconds
         ago, adhering to the original intent of gc_thresh2.
      
      4. Forced gc is stopped once the number of gc_entries drops below
         gc_thresh2.
      
      5. Since gc checks do not apply to PERMANENT entries, gc levels are skipped
         when allocating a new neighbor for a PERMANENT entry. By extension this
         means there are no explicit limits on the number of PERMANENT entries
         that can be created, but this is no different than FIB entries or FDB
         entries.
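
      The familiar knobs keep their names but now compare against
      gc-eligible entries only, e.g. (values assumed):

          sysctl -w net.ipv4.neigh.default.gc_thresh2=512   # forced gc runs above this count
          sysctl -w net.ipv4.neigh.default.gc_thresh3=1024  # upper bound for gc-eligible entries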
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      58956317