1. 16 Oct, 2018 2 commits
    • E
      net: extend sk_pacing_rate to unsigned long · 76a9ebe8
      Eric Dumazet committed
      sk_pacing_rate was introduced as a u32 field in 2013,
      effectively limiting per-flow pacing to 34Gbit.
      
      We believe it is time to allow TCP to pace high speed flows
      on 64bit hosts, as we now can reach 100Gbit on one TCP flow.
      
      This patch adds no cost for 32bit kernels.
      
      The tcpi_pacing_rate and tcpi_max_pacing_rate fields were already
      exported as 64bit, so the iproute2 ss command requires no changes.
      
      Unfortunately the SO_MAX_PACING_RATE socket option will stay
      32bit and we will need to add a new option to let applications
      control high pacing rates.
      
      State      Recv-Q Send-Q Local Address:Port             Peer Address:Port
      ESTAB      0      1787144  10.246.9.76:49992             10.246.9.77:36741
                       timer:(on,003ms,0) ino:91863 sk:2 <->
       skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
       ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
       rcvmss:536 advmss:1448
       cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177
       segs_in:3916318 data_segs_out:177279175
       bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
       send 28045.5Mbps lastrcv:73333
       pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
       busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
       notsent:2085120 minrtt:0.013
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
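
As a rough check of the 34Gbit figure quoted in the commit above, the sketch below computes the largest rate an unsigned field can hold, assuming sk_pacing_rate is stored in bytes per second. The function name is illustrative and not taken from the patch.

```python
def max_pacing_gbit(field_bits: int) -> float:
    """Largest pacing rate, in Gbit/s, that an unsigned field of the given
    width can represent when the rate is stored in bytes per second."""
    max_bytes_per_sec = (1 << field_bits) - 1
    return max_bytes_per_sec * 8 / 1e9


# A 32-bit field tops out at roughly 34.36 Gbit/s.
print(f"u32 limit: {max_pacing_gbit(32):.2f} Gbit/s")
```

The ss sample in the commit message reports pacing_rate 38705.0Mbps, which is above this ceiling and would not fit in the old 32-bit field.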
    • E
      tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh · 5f6188a8
      Eric Dumazet committed
      In the EDT design, I made the mistake of using tcp_wstamp_ns
      to store the last tcp_clock_ns() sample and to store the
      pacing virtual timer.
      
      This causes major regressions for high-speed flows.
      
      Introduce tcp_clock_cache to store the last tcp_clock_ns() sample.
      This is needed because some architectures have a slow high-resolution
      kernel time service.
      
      tcp_wstamp_ns is only updated when a packet is sent.
      
      Note that we can remove tcp_mstamp in the future since
      tcp_mstamp is essentially tcp_clock_cache/1000, so the
      apparent socket size increase is temporary.
      
      Fixes: 9799ccb0 ("tcp: add tcp_wstamp_ns socket field")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
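
The separation described in the commit above can be pictured with a toy model. This is a simplified sketch, not the kernel code: the field names mirror the commit message, while the class, its methods, and the update rule in packet_sent() are assumptions made for illustration.

```python
import time


class PacingClock:
    """Toy model keeping the cached clock sample and the pacing timer
    in separate fields (hypothetical class, simplified logic)."""

    def __init__(self) -> None:
        self.tcp_clock_cache = 0  # last clock sample, in ns
        self.tcp_wstamp_ns = 0    # pacing virtual timer, in ns

    def mstamp_refresh(self, now_ns=None) -> None:
        # Refreshing the clock only touches the cache, never the pacing timer.
        self.tcp_clock_cache = time.monotonic_ns() if now_ns is None else now_ns

    @property
    def tcp_mstamp(self) -> int:
        # Microsecond view derived from the cached nanosecond sample.
        return self.tcp_clock_cache // 1000

    def packet_sent(self, pacing_delay_ns: int) -> None:
        # Assumed rule: the pacing timer moves only when a packet is sent.
        base = max(self.tcp_wstamp_ns, self.tcp_clock_cache)
        self.tcp_wstamp_ns = base + pacing_delay_ns
```

In this model, calling mstamp_refresh() any number of times leaves tcp_wstamp_ns unchanged, which is the property the commit restores.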
  2. 13 Oct, 2018 1 commit
    • D
      net: Evict neighbor entries on carrier down · 859bd2ef
      David Ahern committed
      When a link's carrier goes down it could be a sign of the port changing
      networks. If the new network has overlapping addresses with the old one,
      then the kernel will continue trying to use neighbor entries established
      based on the old network until the entries finally age out - meaning a
      potentially long delay with communications not working.
      
      This patch evicts neighbor entries on carrier down with the exception of
      those marked permanent. Permanent entries are managed by userspace (either
      an admin or a routing daemon such as FRR).
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
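
The eviction rule in the commit above amounts to a filter over the neighbor table. The sketch below is a minimal illustration with a hypothetical table layout (a list of dicts with "dev" and "state" keys); it does not reflect the kernel's data structures.

```python
def evict_on_carrier_down(neighbors, ifname):
    """Return the neighbor entries that survive a carrier-down event on ifname.

    Entries on other devices and entries marked PERMANENT are kept."""
    return [
        n for n in neighbors
        if n["dev"] != ifname or n["state"] == "PERMANENT"
    ]


table = [
    {"ip": "10.0.0.1", "dev": "eth0", "state": "REACHABLE"},
    {"ip": "10.0.0.2", "dev": "eth0", "state": "PERMANENT"},
    {"ip": "10.1.0.1", "dev": "eth1", "state": "STALE"},
]
remaining = evict_on_carrier_down(table, "eth0")
print([n["ip"] for n in remaining])
```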
  3. 11 Oct, 2018 3 commits
  4. 09 Oct, 2018 5 commits
  5. 08 Oct, 2018 1 commit
    • J
      udp: Unbreak modules that rely on external __skb_recv_udp() availability · 7e823644
      Jiri Kosina committed
      Commit 2276f58a ("udp: use a separate rx queue for packet reception")
      turned static inline __skb_recv_udp() from being a trivial helper around
      __skb_recv_datagram() into a UDP-specific implementation, making it
      EXPORT_SYMBOL_GPL() at the same time.
      
      There are external modules that got broken by __skb_recv_udp() not being
      visible to them. Let's unbreak them by making __skb_recv_udp EXPORT_SYMBOL().
      
      One rationale for why this is actually the "technically correct" thing
      to do: __skb_recv_udp() used to be an inline wrapper around
      __skb_recv_datagram(), which itself (still, and correctly so, I believe)
      is EXPORT_SYMBOL().
      
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Fixes: 2276f58a ("udp: use a separate rx queue for packet reception")
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 06 Oct, 2018 1 commit
  7. 05 Oct, 2018 4 commits
  8. 03 Oct, 2018 6 commits
  9. 02 Oct, 2018 5 commits
  10. 30 Sep, 2018 1 commit
    • Y
      tcp: up initial rmem to 128KB and SYN rwin to around 64KB · a337531b
      Yuchung Cheng committed
      Previously the TCP initial receive buffer was ~87KB by default and
      the initial receive window was ~29KB (20 MSS). This patch changes
      the two numbers to 128KB and ~64KB (rounded down to a multiple
      of the MSS) respectively. The patch also simplifies the calculations so
      that the two numbers are directly controlled by sysctl tcp_rmem[1]:
      
        1) Initial receive buffer budget (sk_rcvbuf): while this should
           be configured via sysctl tcp_rmem[1], previously tcp_fixup_rcvbuf()
           always overrode it and set a larger size when a new connection
           was established.
      
        2) Initial receive window in SYN: previously it was set to 20
           packets if MSS <= 1460. The number 20 was based on the initial
           congestion window of 10: the receiver needs twice that amount to
           avoid being limited by the receive window upon out-of-order
           delivery in the first window burst. But since this only
           applied if the receiving MSS <= 1460, connections using a large MTU
           (e.g. to utilize receiver zero-copy) could be limited by the
           receive window.
      
      With this patch, TCP memory configuration is more straightforward and
      better sized for modern high-speed networks by default. Several
      popular stacks have been announcing a 64KB rwin in SYNs as well.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
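
The numbers in the commit above can be reproduced with a little arithmetic. The sketch below is an approximation under stated assumptions, not the kernel's calculation: it assumes half of the receive buffer is advertised as window, that the window carried in a SYN cannot exceed 65535 bytes, and that the result is rounded down to a multiple of the MSS. The function name is hypothetical.

```python
def initial_windows(tcp_rmem_default: int, mss: int):
    """Return (initial sk_rcvbuf, initial SYN receive window) in bytes.

    Assumptions: half the buffer is advertised, the SYN window is capped
    at 65535, and the window is rounded down to a multiple of the MSS."""
    rcvbuf = tcp_rmem_default
    rwin = min(rcvbuf // 2, 65535)
    return rcvbuf, (rwin // mss) * mss


print(initial_windows(131072, 1460))  # (131072, 64240)
print(20 * 1460)                      # old 20-MSS window: 29200
```

With an MSS of 1460 the new window is 44 segments (64240 bytes), compared with 20 segments (29200 bytes) before.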
  11. 28 Sep, 2018 1 commit
    • T
      netfilter: masquerade: don't flush all conntracks if only one address deleted on device · 097f95d3
      Tan Hu committed
      We configured iptables as below, which only allowed incoming data on
      established connections:
      
      iptables -t mangle -A PREROUTING -m state --state ESTABLISHED -j ACCEPT
      iptables -t mangle -P PREROUTING DROP
      
      When deleting a secondary address, the current masquerade implementation
      flushes all conntracks on this device. All the established connections on
      the primary address are also deleted, and subsequent incoming data on those
      connections is then wrongly dropped because it is identified as a NEW
      connection.
      
      So when an address is deleted, only the connections related to that
      address should be flushed.
      Signed-off-by: Tan Hu <tan.hu@zte.com.cn>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
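
The behavior change in the commit above can be sketched as a narrower filter. The entry layout here (dicts with "ifindex" and "masq_addr" keys) and the function name are hypothetical and chosen only to illustrate the idea; they are not the netfilter data structures.

```python
def flush_on_addr_delete(conntracks, ifindex, deleted_addr):
    """Keep every conntrack except those masqueraded to deleted_addr
    on the device identified by ifindex."""
    return [
        ct for ct in conntracks
        if not (ct["ifindex"] == ifindex and ct["masq_addr"] == deleted_addr)
    ]


entries = [
    {"id": 1, "ifindex": 2, "masq_addr": "192.0.2.1"},   # primary address
    {"id": 2, "ifindex": 2, "masq_addr": "192.0.2.99"},  # secondary address
    {"id": 3, "ifindex": 3, "masq_addr": "192.0.2.99"},  # other device
]
kept = flush_on_addr_delete(entries, 2, "192.0.2.99")
print([ct["id"] for ct in kept])
```

Connections on the primary address survive, so later packets still match ESTABLISHED state under the iptables rules shown in the commit message.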
  12. 27 Sep, 2018 3 commits
  13. 25 Sep, 2018 1 commit
  14. 22 Sep, 2018 6 commits
    • P
      net/ipfrag: let ip[6]frag_high_thresh in ns be higher than in init_net · 83619623
      Peter Oskolkov committed
      Currently, ip[6]frag_high_thresh sysctl values in new namespaces are
      hard-limited to those of the root/init ns.
      
      There are at least two use cases when it would be desirable to
      set the high_thresh values higher in a child namespace vs the global hard
      limit:
      
      - a security/ddos protection policy may lower the thresholds in the
        root/init ns but allow for a special exception in a child namespace
      - testing: a test running in a namespace may want to set these
        thresholds higher in its namespace than what is in the root/init ns
      
      The new behavior:
      
       # ip netns add testns
       # ip netns exec testns bash
      
       # sysctl -w net.ipv4.ipfrag_high_thresh=9000000
       net.ipv4.ipfrag_high_thresh = 9000000
      
       # sysctl net.ipv4.ipfrag_high_thresh
       net.ipv4.ipfrag_high_thresh = 9000000
      
       # sysctl -w net.ipv6.ip6frag_high_thresh=9000000
       net.ipv6.ip6frag_high_thresh = 9000000
      
       # sysctl net.ipv6.ip6frag_high_thresh
       net.ipv6.ip6frag_high_thresh = 9000000
      
      The old behavior:
      
       # ip netns add testns
       # ip netns exec testns bash
      
       # sysctl -w net.ipv4.ipfrag_high_thresh=9000000
       net.ipv4.ipfrag_high_thresh = 9000000
      
       # sysctl net.ipv4.ipfrag_high_thresh
       net.ipv4.ipfrag_high_thresh = 4194304
      
       # sysctl -w net.ipv6.ip6frag_high_thresh=9000000
       net.ipv6.ip6frag_high_thresh = 9000000
      
       # sysctl net.ipv6.ip6frag_high_thresh
       net.ipv6.ip6frag_high_thresh = 4194304
      Signed-off-by: Peter Oskolkov <posk@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net/ipv4: avoid compile error in fib_info_nh_uses_dev · 075e264f
      Eric Dumazet committed
      net/ipv4/fib_frontend.c: In function 'fib_info_nh_uses_dev':
      net/ipv4/fib_frontend.c:322:6: error: unused variable 'ret' [-Werror=unused-variable]
      cc1: all warnings being treated as errors
      
      Fixes: 78f2756c ("net/ipv4: Move device validation to helper")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: David Ahern <dsahern@gmail.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      tcp: switch tcp_internal_pacing() to tcp_wstamp_ns · c092dd5f
      Eric Dumazet committed
      Now that TCP keeps track of tcp_wstamp_ns, recording the earliest
      departure time of the next packet, we can remove duplicate code
      from tcp_internal_pacing().
      
      This removes one ktime_get_tai_ns() call, and a divide.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      tcp: switch tcp and sch_fq to new earliest departure time model · ab408b6d
      Eric Dumazet committed
      TCP keeps track of tcp_wstamp_ns by itself, meaning sch_fq
      no longer has to do it.
      
      Thanks to this model, TCP can get more accurate RTT samples,
      since pacing no longer inflates them.
      
      This has the nice effect of removing some delays caused by the FQ
      quantum mechanism, which inflated max/P99 latencies.
      
      Also we might relax the tight TCP Small Queue limits in the future,
      since this new model allows TCP to build bigger batches, because
      sch_fq (or a device with earliest departure time offload) ensures
      these packets will be delivered on time.
      
      Note that other protocols are not converted (they will probably
      never be), so sch_fq still has support for SO_MAX_PACING_RATE.
      
      Tested:
      
      Test showing FQ pacing quantum artifact for low-rate flows,
      adding unexpected throttles for RPC flows, inflating max and P99 latencies.
      
      The parameters chosen here show what typically happens when
      a TCP flow has a reduced pacing rate (this can be caused by a reduced
      cwin after a few losses, and/or an rtt above a few ms).
      
      MIBS="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY"
      Before :
      $ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
      MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
       Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
      19,82.78,5279,3825,482.02
      
      After :
      $ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
      MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
      Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
      20,49.94,128,63,3.18
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
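
The earliest departure time idea in the commit above can be illustrated with a small scheduler. This is a sketch under simplifying assumptions (a fixed pacing rate in bytes per second and a constant "now"), and the function and parameter names are hypothetical; it is not the code in tcp_output.c or sch_fq.

```python
def schedule_departures(packet_lens, pacing_rate_bytes_per_sec, now_ns=0):
    """Assign each packet an earliest departure time in nanoseconds.

    Each packet may leave no earlier than the previous packet's departure
    time plus that packet's serialization time at the pacing rate."""
    wstamp_ns = 0
    departures = []
    for length in packet_lens:
        # Never schedule in the past relative to the current clock sample.
        wstamp_ns = max(wstamp_ns, now_ns)
        departures.append(wstamp_ns)
        # Advance the virtual timer by this packet's transmit time.
        wstamp_ns += length * 1_000_000_000 // pacing_rate_bytes_per_sec
    return departures


# Three 1500-byte packets at 1.5 MB/s are spaced 1 ms apart.
print(schedule_departures([1500, 1500, 1500], 1_500_000))
```

Because the sender stamps each packet itself, a queueing layer only has to hold a packet until its timestamp, rather than recompute the rate per flow.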
    • E
      tcp: switch internal pacing timer to CLOCK_TAI · fd2bca2a
      Eric Dumazet committed
      The next patch will use tcp_wstamp_ns to feed the internal
      TCP pacing timer, so switch to CLOCK_TAI to share the same time base.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      tcp: provide earliest departure time in skb->tstamp · d3edd06e
      Eric Dumazet committed
      Switch internal TCP skb->skb_mstamp to skb->skb_mstamp_ns,
      from usec units to nsec units.
      
      Do not clear skb->tstamp before entering IP stacks in TX,
      so that qdisc or devices can implement pacing based on the
      earliest departure time instead of socket sk->sk_pacing_rate
      
      Packets are fed with tcp_wstamp_ns, and a following patch
      will update tcp_wstamp_ns when both TCP and sch_fq switch to
      the earliest departure time mechanism.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>