1. 01 May 2020, 1 commit
  2. 29 Apr 2020, 2 commits
  3. 23 Apr 2020, 1 commit
  4. 16 Apr 2020, 1 commit
  5. 13 Mar 2020, 1 commit
    • tcp: bind(0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted. · 4b01a967
      Committed by Kuniyuki Iwashima
      Commit aacd9289 ("tcp: bind() use stronger
      condition for bind_conflict") introduced a restriction that forbids binding
      SO_REUSEADDR-enabled sockets to the same (addr, port) tuple, so that ports
      are assigned dispersedly and we can connect to the same remote host.
      
      The change accelerates port depletion, so we fail to bind sockets to the
      same local port even when we want to connect to different remote hosts.
      
      You can reproduce this issue by following the instructions below.
      
        1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
        2. set SO_REUSEADDR on two sockets.
        3. bind both sockets to (localhost, 0); the latter fails.
      
      Therefore, when ephemeral ports are exhausted, bind(0) should fall back to
      the legacy behaviour so that the SO_REUSEADDR option takes effect and it is
      still possible to connect to different remote (addr, port) tuples.
      
      This patch allows us to bind SO_REUSEADDR-enabled sockets to the same
      (addr, port) only when net.ipv4.ip_autobind_reuse is set to 1 and all
      ephemeral ports are exhausted. This also allows connect() and listen() to
      share ports in the following way and may break some applications, so
      ip_autobind_reuse is 0 by default, disabling the feature.
      
        1. setsockopt(sk1, SO_REUSEADDR)
        2. setsockopt(sk2, SO_REUSEADDR)
        3. bind(sk1, saddr, 0)
        4. bind(sk2, saddr, 0)
        5. connect(sk1, daddr)
        6. listen(sk2)
      
      If it is set to 1, we can fully utilize the 4-tuples, but we should use
      IP_BIND_ADDRESS_NO_PORT for bind()+connect() where possible.
      
      The notable thing is that if all sockets bound to the same port have
      both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
      ephemeral port and also call listen().
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4b01a967
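      A minimal sketch of the fallback described above, reusing the reproduction
      steps from the commit message (the knob name net.ipv4.ip_autobind_reuse is
      taken from the text; the exact effect shown is assumed):
      
        # Shrink the ephemeral range to a single port, then opt in to the fallback.
        sysctl -w net.ipv4.ip_local_port_range="32768 32768"
        sysctl -w net.ipv4.ip_autobind_reuse=1
        # With the knob set to 1, the second SO_REUSEADDR socket from step 3 of the
        # reproduction can bind (localhost, 0) to the already-used port instead of failing.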
  6. 03 Jan 2020, 1 commit
  7. 10 Dec 2019, 1 commit
    • net-tcp: Disable TCP ssthresh metrics cache by default · 65e6d901
      Committed by Kevin(Yudong) Yang
      This patch introduces a sysctl knob "net.ipv4.tcp_no_ssthresh_metrics_save"
      that disables the TCP ssthresh metrics cache by default. Other parts of the
      TCP metrics cache, e.g. rtt and cwnd, remain unchanged.
      
      As modern networks become more and more dynamic, the TCP metrics cache
      today often causes more harm than benefit. For example, the same IP
      address is often shared by different subscribers behind NAT in residential
      networks. Even if the IP address is not shared by different users,
      caching the slow-start threshold of a previous short flow that used loss-based
      congestion control (e.g. cubic) often causes future longer flows on
      the same network path to exit slow-start prematurely with abysmal
      throughput.
      
      Caching ssthresh is very risky and can lead to terrible performance.
      Therefore it makes sense to disable ssthresh caching by default and let
      administrators opt in for specific networks.
      This practice has also worked well for several years of deployment with
      CUBIC congestion control at Google.
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Kevin(Yudong) Yang <yyd@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      65e6d901
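      A hedged usage sketch, assuming the knob encodes "do not save" as 1 (the
      stated default) so that 0 re-enables the old caching behaviour:
      
        # 1 (default): do not save ssthresh into the TCP metrics cache.
        # 0: restore the previous ssthresh caching behaviour for this netns.
        sysctl -w net.ipv4.tcp_no_ssthresh_metrics_save=0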
  8. 27 Nov 2019, 1 commit
  9. 09 Nov 2019, 2 commits
    • sctp: add support for Primary Path Switchover · 34515e94
      Committed by Xin Long
      This is a new feature defined in section 5 of rfc7829: "Primary Path
      Switchover". It introduces a new tunable parameter:
      
        Primary.Switchover.Max.Retrans (PSMR)
      
      The primary path is changed to another active path when the path
      error counter on the old primary path exceeds PSMR, so that "the SCTP
      sender is allowed to continue data transmission on a new working path
      even when the old primary destination address becomes active again".
      
      This patch adds this tunable parameter, 'ps_retrans', per netns, sock,
      asoc and transport. It also allows a user to change ps_retrans per netns
      by sysctl, and ps_retrans per sock/asoc/transport will be initialized
      from it.
      
      The check is done in sctp_do_8_2_transport_strike() when this feature
      is enabled.
      
      Note this feature is disabled by default by initializing 'ps_retrans'
      per netns to 0xffff, and its value can't be set lower than 'pf_retrans'
      when changed by sysctl.
      
      v3->v4:
        - add define SCTP_PS_RETRANS_MAX 0xffff, and use it on extra2 of
          sysctl 'ps_retrans'.
        - add a new entry for ps_retrans on ip-sysctl.txt.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      34515e94
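      A per-netns configuration sketch, assuming the usual net.sctp sysctl prefix;
      the value shown is only an example and must not be lower than pf_retrans:
      
        sysctl net.sctp.pf_retrans              # ps_retrans may not be set below this
        sysctl -w net.sctp.ps_retrans=8         # example value; 0xffff keeps PSMR disabled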
    • sctp: add pf_expose per netns and sock and asoc · aef587be
      Committed by Xin Long
      As said in rfc7829, section 3, point 12:
      
        The SCTP stack SHOULD expose the PF state of its destination
        addresses to the ULP as well as provide the means to notify the
        ULP of state transitions of its destination addresses from
        active to PF, and vice versa.  However, it is recommended that
        an SCTP stack implementing SCTP-PF also allows for the ULP to be
        kept ignorant of the PF state of its destinations and the
        associated state transitions, thus allowing for retention of the
        simpler state transition model of [RFC4960] in the ULP.
      
      That is, the stack should not only allow exposing the PF state to the
      ULP, but also allow keeping the ULP ignorant of sctp-pf.
      
      So this patch adds pf_expose per netns, sock and asoc. In the next
      patch, ulp_notify will be set to false in sctp_assoc_control_transport()
      if asoc->expose is not 'enabled'.
      
      It also allows a user to change pf_expose per netns by sysctl, and
      pf_expose per sock and asoc will be initialized from it.
      
      Note that pf_expose also applies to the SCTP_GET_PEER_ADDR_INFO sockopt:
      a user is not allowed to query the state of a sctp-pf peer address
      when pf_expose is 'disabled', as said in section 7.3.
      
      v1->v2:
        - Fix a build warning noticed by Nathan Chancellor.
      v2->v3:
        - set pf_expose to UNUSED by default to stay compatible with old
          applications.
      v3->v4:
        - add a new entry for pf_expose on ip-sysctl.txt, as Marcelo suggested.
        - change this patch to 1/5, and move sctp_assoc_control_transport
          change into 2/5, as Marcelo suggested.
        - use SCTP_PF_EXPOSE_UNSET instead of SCTP_PF_EXPOSE_UNUSED, and
          set SCTP_PF_EXPOSE_UNSET to 0 in enum, as Marcelo suggested.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aef587be
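      A sketch of enabling the exposure for new sockets in a netns, assuming the
      usual net.sctp prefix; only UNSET=0 is stated above, so the numeric value
      used for "enabled" here is an assumption:
      
        # 0 = unset (compatible default); a non-zero value is assumed to expose
        # PF state transitions to the ULP.
        sysctl -w net.sctp.pf_expose=1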
  10. 01 Nov 2019, 2 commits
    • tcp: increase tcp_max_syn_backlog max value · 623d0c2d
      Committed by Eric Dumazet
      tcp_max_syn_backlog default value depends on memory size
      and TCP ehash size. Before this patch, the max value
      was 2048 [1], which is considered too small nowadays.
      
      Increase it to 4096 to match the recent SOMAXCONN change.
      
      [1] This is with TCP ehash size being capped to 524288 buckets.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Yue Cao <ycao009@ucr.edu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      623d0c2d
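      For reference, the autotuned value can be inspected and raised explicitly;
      4096 matches the new cap mentioned above:
      
        sysctl net.ipv4.tcp_max_syn_backlog
        sysctl -w net.ipv4.tcp_max_syn_backlog=4096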
    • net: increase SOMAXCONN to 4096 · 19f92a03
      Committed by Eric Dumazet
      SOMAXCONN is the default value of /proc/sys/net/core/somaxconn.
      
      It was defined as 128 more than 20 years ago.
      
      Since it caps the listen() backlog values, this very small value has
      caused numerous problems over the years, and many people had
      to raise it on their hosts after being hit by them.
      
      Google has been using 1024 for at least 15 years, and we increased
      this to 4096 after the TCP listener rework was completed, more than
      4 years ago. We got no complaints about this change breaking any
      legacy application.
      
      Many applications indeed set up a TCP listener with listen(fd, -1),
      meaning they let the system select the backlog.
      
      Raising SOMAXCONN lowers the chance of the port being unavailable even
      under a small SYN flood attack, and reduces the possibility of side-channel
      vulnerabilities.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Yue Cao <ycao009@ucr.edu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      19f92a03
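      The runtime knob behind SOMAXCONN is the somaxconn sysctl mentioned above;
      a quick check and explicit override look like:
      
        sysctl net.core.somaxconn               # compiled-in default is now 4096
        sysctl -w net.core.somaxconn=4096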
  11. 10 Aug 2019, 1 commit
    • tcp: add new tcp_mtu_probe_floor sysctl · c04b79b6
      Committed by Josh Hunt
      The current implementation of TCP MTU probing can considerably
      underestimate the MTU on lossy connections, allowing the MSS to get down
      to 48. We have found that in almost all of these cases on our networks
      these paths can handle much larger MTUs, meaning the connections are being
      artificially limited. Even though TCP MTU probing can raise the MSS back
      up, we have seen this not happen in practice, causing connections to be
      "stuck" with an MSS of 48 when heavy loss is present.
      
      Prior to pushing out this change we could not keep TCP MTU probing enabled
      because of the above reasons. Now, with a reasonable floor set, we've had
      it enabled for the past 6 months.
      
      The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
      administrators the ability to control the floor of MSS probing.
      Signed-off-by: Josh Hunt <johunt@akamai.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c04b79b6
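      A usage sketch: keep MTU probing enabled while preventing the MSS from
      collapsing to 48; the floor value shown is only an example, and the default
      remains TCP_MIN_SND_MSS (48):
      
        sysctl -w net.ipv4.tcp_mtu_probing=1        # existing knob, not added here
        sysctl -w net.ipv4.tcp_mtu_probe_floor=1024 # example floor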
  12. 15 Jul 2019, 2 commits
  13. 09 Jul 2019, 1 commit
  14. 02 Jul 2019, 1 commit
  15. 16 Jun 2019, 1 commit
    • tcp: add tcp_min_snd_mss sysctl · 5f3e2bf0
      Committed by Eric Dumazet
      Some TCP peers announce a very small MSS option in their SYN and/or
      SYN/ACK messages.
      
      This forces the stack to send packets with a very high network/cpu
      overhead.
      
      Linux has enforced a minimal value of 48. Since this value includes
      the size of TCP options, and the options can consume up to 40 bytes,
      each segment can carry only 8 bytes of payload.
      
      In some cases, it can be useful to increase the minimal value
      to a saner one.
      
      We still leave the default at 48 (TCP_MIN_SND_MSS) for compatibility
      reasons.
      
      Note that the TCP_MAXSEG socket option enforces a minimal value of
      TCP_MIN_MSS. David Miller increased this minimal value from 64 to 88
      in commit c39508d6 ("tcp: Make TCP_MAXSEG minimum more correct.").
      
      We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS.
      
      CVE-2019-11479 -- tcp mss hardcoded to 48
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Suggested-by: Jonathan Looney <jtl@netflix.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: Bruce Curtis <brucec@netflix.com>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5f3e2bf0
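      A sketch of raising the floor above the compatibility default; 536 is just
      an illustrative value:
      
        sysctl net.ipv4.tcp_min_snd_mss             # defaults to TCP_MIN_SND_MSS (48)
        sysctl -w net.ipv4.tcp_min_snd_mss=536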
  16. 15 Jun 2019, 2 commits
  17. 06 Jun 2019, 1 commit
  18. 31 May 2019, 1 commit
  19. 22 May 2019, 1 commit
  20. 19 Apr 2019, 1 commit
    • ipv6: Add rate limit mask for ICMPv6 messages · 0bc19985
      Committed by Stephen Suryaputra
      To make ICMPv6 closer to ICMPv4, add a ratemask parameter. Since the
      ICMPv6 message types use larger numeric values, a simple bitmask doesn't
      fit, so a large bitmap is used instead. The input and output are in the
      form of a list of ranges. The default rate limits all error messages
      except Packet Too Big; for Packet Too Big, the ratemask is consulted
      instead of the previous hard-coded exception.
      
      There are functions where icmpv6_xrlim_allow() and icmpv6_global_allow()
      aren't called. This patch only adds them to icmpv6_echo_reply().
      
      Rate limiting error messages is mandated by RFC 4443, but RFC 4890 says
      that it is also acceptable to rate limit informational messages. Thus,
      I removed the current hard-coded behavior of icmpv6_mask_allow() that
      exempts informational messages from rate limiting.
      
      v2: Add a dummy proc_do_large_bitmap() function if CONFIG_PROC_SYSCTL
          isn't defined, expand the description in ip-sysctl.txt and remove
          an unnecessary conditional before kfree().
      v3: Inline the bitmap instead of allocating it dynamically. A pointer
          to it is still needed because of the way proc_do_large_bitmap works.
      Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0bc19985
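      A sketch of the range-list input described above; the knob path and the
      exact default string are assumptions (Packet Too Big is ICMPv6 type 2 and
      the error types occupy 0-127):
      
        # Rate limit every ICMPv6 error type except Packet Too Big (type 2).
        echo "0-1,3-127" > /proc/sys/net/ipv6/icmp/ratemask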
  21. 18 Apr 2019, 1 commit
    • ipv4: set the tcp_min_rtt_wlen range from 0 to one day · 19fad20d
      Committed by ZhangXiaoxu
      There is a UBSAN report as below:
      UBSAN: Undefined behaviour in net/ipv4/tcp_input.c:2877:56
      signed integer overflow:
      2147483647 * 1000 cannot be represented in type 'int'
      CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.1.0-rc4-00058-g582549e3 #1
      Call Trace:
       <IRQ>
       dump_stack+0x8c/0xba
       ubsan_epilogue+0x11/0x60
       handle_overflow+0x12d/0x170
       ? ttwu_do_wakeup+0x21/0x320
       __ubsan_handle_mul_overflow+0x12/0x20
       tcp_ack_update_rtt+0x76c/0x780
       tcp_clean_rtx_queue+0x499/0x14d0
       tcp_ack+0x69e/0x1240
       ? __wake_up_sync_key+0x2c/0x50
       ? update_group_capacity+0x50/0x680
       tcp_rcv_established+0x4e2/0xe10
       tcp_v4_do_rcv+0x22b/0x420
       tcp_v4_rcv+0xfe8/0x1190
       ip_protocol_deliver_rcu+0x36/0x180
       ip_local_deliver+0x15b/0x1a0
       ip_rcv+0xac/0xd0
       __netif_receive_skb_one_core+0x7f/0xb0
       __netif_receive_skb+0x33/0xc0
       netif_receive_skb_internal+0x84/0x1c0
       napi_gro_receive+0x2a0/0x300
       receive_buf+0x3d4/0x2350
       ? detach_buf_split+0x159/0x390
       virtnet_poll+0x198/0x840
       ? reweight_entity+0x243/0x4b0
       net_rx_action+0x25c/0x770
       __do_softirq+0x19b/0x66d
       irq_exit+0x1eb/0x230
       do_IRQ+0x7a/0x150
       common_interrupt+0xf/0xf
       </IRQ>
      
      It can be reproduced by:
        echo 2147483647 > /proc/sys/net/ipv4/tcp_min_rtt_wlen
      
      Fixes: f6722583 ("tcp: track min RTT using windowed min-filter")
      Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      19fad20d
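      With the fix, the knob is clamped to the stated range of 0 to one day
      (86400 seconds), so the overflowing write above is rejected; for example:
      
        sysctl -w net.ipv4.tcp_min_rtt_wlen=86400   # maximum accepted after the fix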
  22. 12 Apr 2019, 1 commit
  23. 22 Mar 2019, 1 commit
    • ipv4: Allow amount of dirty memory from fib resizing to be controllable · 9ab948a9
      Committed by David Ahern
      The fib_trie implementation calls synchronize_rcu when a certain number
      of pages are dirty from freed entries. The number of pages was determined
      experimentally in 2009 (commit c3059477).
      
      At the current setting, synchronize_rcu is called often -- 51 times in a
      second in one test, with an average 8 msec delay when adding a fib entry.
      The total impact is a lot of slowdown when modifying the fib. This is seen
      in the output of 'time' - the difference between real time and sys+user.
      For example, using 720,022 single-path routes and 'ip -batch'[1]:
      
          $ time ./ip -batch ipv4/routes-1-hops
          real    0m14.214s
          user    0m2.513s
          sys     0m6.783s
      
      So roughly 35% of the actual time to install the routes is from the ip
      command getting scheduled out, most notably due to synchronize_rcu (this
      is observed using 'perf sched timehist').
      
      This patch makes the amount of dirty memory configurable, from 64k where
      synchronize_rcu is called often (small, low-end systems that are memory
      sensitive) to 64M where synchronize_rcu is called rarely during a large
      FIB change (for high-end systems with lots of memory). The default is
      512kB, which corresponds to the current setting of 128 pages with a 4kB
      page size.
      
      As an example, at 16MB the worst interval shows 4 calls to synchronize_rcu
      in a second, blocking for up to 30 msec in a single instance and a total
      of almost 100 msec across the 4 calls in that second. The trade-off is
      allowing FIB entries to consume more memory in a given time window, but
      with much better fib insertion rates (~30% increase in prefixes/sec).
      With this patch and net.ipv4.fib_sync_mem set to 16MB, the same batch
      file runs in:
      
          $ time ./ip -batch ipv4/routes-1-hops
          real    0m9.692s
          user    0m2.491s
          sys     0m6.769s
      
      So the dead time is reduced to about 1/2 second or <5% of the real time.
      
      [1] 'ip' modified to not request ACK messages which improves route
          insertion times by about 20%
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9ab948a9
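      A sketch matching the 16MB example above; the value is assumed to be in
      bytes, within the stated 64k to 64M range (default 512kB):
      
        sysctl -w net.ipv4.fib_sync_mem=16777216    # 16MB: more dirty memory, faster inserts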
  24. 21 Mar 2019, 1 commit
  25. 20 Mar 2019, 1 commit
  26. 08 Dec 2018, 1 commit
    • neighbor: Improve garbage collection · 58956317
      Committed by David Ahern
      The existing garbage collection algorithm has a number of problems:
      
      1. The gc algorithm will not evict PERMANENT entries as those entries
         are managed by userspace, yet the existing algorithm walks the entire
         hash table which means it always considers PERMANENT entries when
         looking for entries to evict. In some use cases (e.g., EVPN) there
         can be tens of thousands of PERMANENT entries leading to wasted
         CPU cycles when gc kicks in. As an example, with 32k permanent
         entries, neigh_alloc has been observed taking more than 4 msec per
         invocation.
      
      2. Currently, when the number of neighbor entries hits gc_thresh2 and
         the last flush for the table was more than 5 seconds ago, gc kicks in
         and walks the entire hash table, evicting *all* entries not in PERMANENT
         or REACHABLE state and not marked as externally learned. There is no
         discriminator on when the neigh entry was created or whether it just
         moved from REACHABLE to another NUD_VALID state (e.g., NUD_STALE).
      
         It is possible for entries to be created or for established neighbor
         entries to be moved to STALE (e.g., an external node sends an ARP
         request) right before the 5 second window lapses:
      
              -----|---------x|----------|-----
                  t-5         t         t+5
      
         If that happens, those entries are evicted during gc, causing unnecessary
         thrashing of neighbor entries and of the userspace caches trying to track them.
      
         Further, this contradicts the description of gc_thresh2 which says
         "Entries older than 5 seconds will be cleared".
      
         One workaround is to make gc_thresh2 == gc_thresh3 but that negates the
         whole point of having separate thresholds.
      
      3. Clearing *all* non-PERMANENT/REACHABLE/externally-learned neigh entries
         when gc_thresh2 is exceeded is overkill and contributes to thrashing,
         especially during startup.
      
      This patch addresses these problems as follows:
      
      1. Use of a separate list_head to track entries that can be garbage
         collected along with a separate counter. PERMANENT entries are not
         added to this list.
      
         The gc_thresh parameters are only compared to the new counter, not the
         total entries in the table. The forced_gc function is updated to only
         walk this new gc_list looking for entries to evict.
      
      2. Entries are added to the list head at the tail and removed from the
         front.
      
      3. Entries are only evicted if they were last updated more than 5 seconds
         ago, adhering to the original intent of gc_thresh2.
      
      4. Forced gc is stopped once the number of gc_entries drops below
         gc_thresh2.
      
      5. Since gc checks do not apply to PERMANENT entries, gc levels are skipped
         when allocating a new neighbor for a PERMANENT entry. By extension this
         means there are no explicit limits on the number of PERMANENT entries
         that can be created, but this is no different than FIB entries or FDB
         entries.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      58956317
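      For reference, the pre-patch workaround mentioned above amounts to making
      the two thresholds equal (shown here for IPv4 ARP with illustrative values),
      at the cost of losing the two-level behaviour:
      
        sysctl -w net.ipv4.neigh.default.gc_thresh2=1024
        sysctl -w net.ipv4.neigh.default.gc_thresh3=1024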
  27. 12 Nov 2018, 1 commit
    • tcp: tsq: no longer use limit_output_bytes for paced flows · c73e5807
      Committed by Eric Dumazet
      FQ pacing guarantees that paced packets queued by one flow do not
      add head-of-line blocking for other flows.
      
      After the TCP GSO conversion, increasing limit_output_bytes to 1 MB is safe,
      since this maps to at most 16 skbs in qdisc or device queues
      (or slightly more if some drivers lower {gso_max_segs|size}).
      
      We can still queue at most 1 ms worth of traffic (this can be scaled
      by wifi drivers if they need to).
      
      Tested:
      
      # ethtool -c eth0 | egrep "tx-usecs:|tx-frames:" # 40 Gbit mlx4 NIC
      tx-usecs: 16
      tx-frames: 16
      # tc qdisc replace dev eth0 root fq
      # for f in {1..10};do netperf -P0 -H lpaa24,6 -o THROUGHPUT;done
      
      Before patch:
      27711
      26118
      27107
      27377
      27712
      27388
      27340
      27117
      27278
      27509
      
      After patch:
      37434
      36949
      36658
      36998
      37711
      37291
      37605
      36659
      36544
      37349
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c73e5807
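      The knob referenced above is net.ipv4.tcp_limit_output_bytes; with fq pacing
      installed as in the test, raising it to the 1 MB value mentioned looks like:
      
        tc qdisc replace dev eth0 root fq
        sysctl -w net.ipv4.tcp_limit_output_bytes=1048576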
  28. 08 Nov 2018, 1 commit
  29. 30 Oct 2018, 1 commit
  30. 13 Oct 2018, 1 commit
    • net/ipv6: Add knob to skip DELROUTE message on device down · 7c6bb7d2
      Committed by David Ahern
      Another difference between IPv4 and IPv6 is the generation of RTM_DELROUTE
      notifications when a device is taken down (admin down) or deleted. IPv4
      does not generate a message for routes evicted by the down or delete;
      IPv6 does. A NOS at scale really needs to avoid these messages and have
      IPv4 and IPv6 behave similarly, relying on userspace to handle link
      notifications and evict the routes.
      
      At this point, existing user behavior needs to be preserved. Since
      notifications are a global action (not per app), the only way to preserve
      existing behavior and still allow the messages to be skipped is to add a new
      sysctl (net/ipv6/route/skip_notify_on_dev_down) which can be set to
      disable the notifications.
      
      IPv6 route code already supports the option to skip the message (it is
      used for multipath routes for example). Besides the new sysctl we need
      to pass the skip_notify setting through the generic fib6_clean and
      fib6_walk functions to fib6_clean_node and to set skip_notify on calls
      to __ip_del_rt for the addrconf_ifdown path.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c6bb7d2
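      A usage sketch of the new knob: 1 suppresses the RTM_DELROUTE messages on
      device down, matching IPv4, while the default 0 preserves today's behaviour:
      
        sysctl -w net.ipv6.route.skip_notify_on_dev_down=1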
  31. 27 Sep 2018, 1 commit
  32. 13 Aug 2018, 1 commit
    • ipv6: Add icmp_echo_ignore_all support for ICMPv6 · e6f86b0f
      Committed by Virgile Jarry
      Preventing the kernel from responding to ICMP Echo Request messages
      can be useful in several ways. The sysctl parameter
      'icmp_echo_ignore_all' can be used to prevent the kernel from
      responding to IPv4 ICMP echo requests. For IPv6 pings, such
      a sysctl kernel parameter did not exist.
      
      Add the ability to prevent the kernel from responding to IPv6
      ICMP echo requests through the use of the following sysctl
      parameter: /proc/sys/net/ipv6/icmp/echo_ignore_all.
      Update the documentation to reflect this change.
      Signed-off-by: Virgile Jarry <virgile@acceis.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e6f86b0f
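      A quick sketch using the proc path given above, alongside its IPv4 counterpart:
      
        echo 1 > /proc/sys/net/ipv6/icmp/echo_ignore_all
        sysctl -w net.ipv4.icmp_echo_ignore_all=1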
  33. 02 Aug 2018, 1 commit
    • net: ipv4: Control SKB reprioritization after forwarding · 432e05d3
      Committed by Petr Machata
      After IPv4 packets are forwarded, the priority of the corresponding SKB
      is updated according to the TOS field of the IPv4 header. This overrides any
      prioritization done earlier by e.g. an skbedit action or an ingress-qos-map
      defined at a vlan device.
      
      Such overriding may not always be desirable. Even if the packet ends up
      being routed, which implies this is an L3 network node, an administrator
      may wish to preserve whatever prioritization was done earlier on in the
      pipeline.
      
      Therefore introduce a sysctl that controls this behavior. Keep the
      default value at 1 to maintain backward-compatible behavior.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      432e05d3
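      A sketch of turning the reprioritization off; the knob name is an assumption
      (it is not spelled out above), and the default of 1 keeps the backward-compatible
      behaviour:
      
        sysctl -w net.ipv4.ip_forward_update_priority=0   # assumed knob name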
  34. 12 Jul 2018, 1 commit
  35. 28 Jun 2018, 1 commit
    • skbuff: preserve sock reference when scrubbing the skb. · 9c4c3252
      Committed by Flavio Leitner
      The sock reference is lost when scrubbing the packet, and that breaks
      TSQ (TCP Small Queues) and XPS (Transmit Packet Steering), causing
      a performance impact of about 50% in a single TCP stream when crossing
      network namespaces.
      
      XPS breaks because the queue mapping stored in the socket is not
      available, so another random queue might be selected when the stack
      needs to transmit something like a TCP ACK or TCP retransmissions.
      That causes packet re-ordering and/or performance issues.
      
      TSQ breaks because it orphans the packet while it is still in the
      host, so packets are queued, contributing to the bufferbloat problem.
      
      Preserving the sock reference fixes both issues. The socket is
      orphaned anyway in the receiving path before any relevant action,
      and on the TX side netfilter checks whether the reference is local
      before using it.
      Signed-off-by: Flavio Leitner <fbl@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9c4c3252