1. 15 8月, 2013 3 次提交
  2. 14 8月, 2013 1 次提交
    • Y
      tcp: reset reordering est. selectively on timeout · 74c181d5
      Yuchung Cheng 提交于
      On timeout the TCP sender unconditionally resets the estimated degree
      of network reordering (tp->reordering). The idea behind this is that
      the estimate is too large to trigger fast recovery (e.g., due to a IP
      path change).
      
      But for example if the sender only had 2 packets outstanding, then a
      timeout doesn't tell much about reordering. A sender that learns about
      reordering on big writes and loses packets on small writes will end up
      falsely retransmitting again and again, especially when reordering is
      more likely on big writes.
      
      Therefore the sender should only suspect that tp->reordering is too
      high if it could have gone into fast recovery with the (lower) default
      estimate.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      74c181d5
  3. 10 8月, 2013 4 次提交
  4. 09 8月, 2013 1 次提交
    • E
      net: add SNMP counters tracking incoming ECN bits · 1f07d03e
      Eric Dumazet 提交于
      With GRO/LRO processing, there is a problem because Ip[6]InReceives SNMP
      counters do not count the number of frames, but number of aggregated
      segments.
      
      Its probably too late to change this now.
      
      This patch adds four new counters, tracking number of frames, regardless
      of LRO/GRO, and on a per ECN status basis, for IPv4 and IPv6.
      
      Ip[6]NoECTPkts : Number of packets received with NOECT
      Ip[6]ECT1Pkts  : Number of packets received with ECT(1)
      Ip[6]ECT0Pkts  : Number of packets received with ECT(0)
      Ip[6]CEPkts    : Number of packets received with Congestion Experienced
      
      lph37:~# nstat | egrep "Pkts|InReceive"
      IpInReceives                    1634137            0.0
      Ip6InReceives                   3714107            0.0
      Ip6InNoECTPkts                  19205              0.0
      Ip6InECT0Pkts                   52651828           0.0
      IpExtInNoECTPkts                33630              0.0
      IpExtInECT0Pkts                 15581379           0.0
      IpExtInCEPkts                   6                  0.0
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1f07d03e
  5. 08 8月, 2013 1 次提交
  6. 04 8月, 2013 1 次提交
    • S
      fib_rules: fix suppressor names and default values · 73f5698e
      Stefan Tomanek 提交于
      This change brings the suppressor attribute names into line; it also changes
      the data types to provide a more consistent interface.
      
      While -1 indicates that the suppressor is not enabled, values >= 0 for
      suppress_prefixlen or suppress_ifgroup  reject routing decisions violating the
      constraint.
      
      This changes the previously presented behaviour of suppress_prefixlen, where a
      prefix length _less_ than the attribute value was rejected. After this change,
      a prefix length less than *or* equal to the value is considered a violation of
      the rule constraint.
      
      It also changes the default values for default and newly added rules (disabling
      any suppression for those).
      Signed-off-by: NStefan Tomanek <stefan.tomanek@wertarbyte.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      73f5698e
  7. 03 8月, 2013 2 次提交
  8. 01 8月, 2013 3 次提交
    • S
      fib_rules: add .suppress operation · 7764a45a
      Stefan Tomanek 提交于
      This change adds a new operation to the fib_rules_ops struct; it allows the
      suppression of routing decisions if certain criteria are not met by its
      results.
      
      The first implemented constraint is a minimum prefix length added to the
      structures of routing rules. If a rule is added with a minimum prefix length
      >0, only routes meeting this threshold will be considered. Any other (more
      general) routing table entries will be ignored.
      
      When configuring a system with multiple network uplinks and default routes, it
      is often convinient to reference the main routing table multiple times - but
      omitting the default route. Using this patch and a modified "ip" utility, this
      can be achieved by using the following command sequence:
      
        $ ip route add table secuplink default via 10.42.23.1
      
        $ ip rule add pref 100            table main prefixlength 1
        $ ip rule add pref 150 fwmark 0xA table secuplink
      
      With this setup, packets marked 0xA will be processed by the additional routing
      table "secuplink", but only if no suitable route in the main routing table can
      be found. By using a minimal prefixlength of 1, the default route (/0) of the
      table "main" is hidden to packets processed by rule 100; packets traveling to
      destinations with more specific routing entries are processed as usual.
      Signed-off-by: NStefan Tomanek <stefan.tomanek@wertarbyte.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7764a45a
    • F
      net: split rt_genid for ipv4 and ipv6 · ca4c3fc2
      fan.du 提交于
      Current net name space has only one genid for both IPv4 and IPv6, it has below
      drawbacks:
      
      - Add/delete an IPv4 address will invalidate all IPv6 routing table entries.
      - Insert/remove XFRM policy will also invalidate both IPv4/IPv6 routing table
        entries even when the policy is only applied for one address family.
      
      Thus, this patch attempt to split one genid for two to cater for IPv4 and IPv6
      separately in a fine granularity.
      Signed-off-by: NFan Du <fan.du@windriver.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ca4c3fc2
    • D
      tcp: Remove unused tcpct declarations and comments · c0155b2d
      Dmitry Popov 提交于
      Remove declaration, 4 defines and confusing comment that are no longer used
      since 1a2c6181 ("tcp: Remove TCPCT").
      Signed-off-by: NDmitry Popov <dp@highloadlab.com>
      Acked-by: NChristoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c0155b2d
  9. 31 7月, 2013 2 次提交
  10. 29 7月, 2013 1 次提交
  11. 28 7月, 2013 1 次提交
  12. 25 7月, 2013 3 次提交
    • E
      tcp: TCP_NOTSENT_LOWAT socket option · c9bee3b7
      Eric Dumazet 提交于
      Idea of this patch is to add optional limitation of number of
      unsent bytes in TCP sockets, to reduce usage of kernel memory.
      
      TCP receiver might announce a big window, and TCP sender autotuning
      might allow a large amount of bytes in write queue, but this has little
      performance impact if a large part of this buffering is wasted :
      
      Write queue needs to be large only to deal with large BDP, not
      necessarily to cope with scheduling delays (incoming ACKS make room
      for the application to queue more bytes)
      
      For most workloads, using a value of 128 KB or less is OK to give
      applications enough time to react to POLLOUT events in time
      (or being awaken in a blocking sendmsg())
      
      This patch adds two ways to set the limit :
      
      1) Per socket option TCP_NOTSENT_LOWAT
      
      2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
      not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
      Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.
      
      This changes poll()/select()/epoll() to report POLLOUT
      only if number of unsent bytes is below tp->nosent_lowat
      
      Note this might increase number of sendmsg()/sendfile() calls
      when using non blocking sockets,
      and increase number of context switches for blocking sockets.
      
      Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
      defined as :
       Specify the minimum number of bytes in the buffer until
       the socket layer will pass the data to the protocol)
      
      Tested:
      
      netperf sessions, and watching /proc/net/protocols "memory" column for TCP
      
      With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
      used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)
      
      lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
      TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      
      lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
      TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      
      Using 128KB has no bad effect on the throughput or cpu usage
      of a single flow, although there is an increase of context switches.
      
      A bonus is that we hold socket lock for a shorter amount
      of time and should improve latencies of ACK processing.
      
      lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
      OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
      Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
      Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
      Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
      Final       Final                                             %     Method %      Method
      1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB
      
       Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
      
                 412,514 context-switches
      
           200.034645535 seconds time elapsed
      
      lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
      OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
      Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
      Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
      Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
      Final       Final                                             %     Method %      Method
      1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB
      
       Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
      
               2,675,818 context-switches
      
           200.029651391 seconds time elapsed
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-By: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9bee3b7
    • E
      net: add sk_stream_is_writeable() helper · 64dc6130
      Eric Dumazet 提交于
      Several call sites use the hardcoded following condition :
      
      sk_stream_wspace(sk) >= sk_stream_min_wspace(sk)
      
      Lets use a helper because TCP_NOTSENT_LOWAT support will change this
      condition for TCP sockets.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      64dc6130
    • J
      fib_trie: potential out of bounds access in trie_show_stats() · f585a991
      Jerry Snitselaar 提交于
      With the <= max condition in the for loop, it will be always go 1
      element further than needed. If the condition for the while loop is
      never met, then max is MAX_STAT_DEPTH, and for loop will walk off the
      end of nodesizes[].
      Signed-off-by: NJerry Snitselaar <jerry.snitselaar@oracle.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f585a991
  13. 24 7月, 2013 3 次提交
  14. 23 7月, 2013 4 次提交
    • Y
      tcp: use RTT from SACK for RTO · ed08495c
      Yuchung Cheng 提交于
      If RTT is not available because Karn's check has failed or no
      new packet is acked, use the RTT measured from SACK to estimate
      the RTO. The sender can continue to estimate the RTO during loss
      recovery or reordering event upon receiving non-partial ACKs.
      
      This also changes when the RTO is re-armed. Previously it is
      only re-armed when some data is cummulatively acknowledged (i.e.,
      SND.UNA advances), but now it is re-armed whenever RTT estimator
      is updated. This feature is particularly useful to reduce spurious
      timeout for buffer bloat including cellular carriers [1], and
      RTT estimation on reordering events.
      
      [1] "An In-depth Study of LTE: Effect of Network Protocol and
       Application Behavior on Performance", In Proc. of SIGCOMM 2013
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ed08495c
    • Y
      tcp: measure RTT from new SACK · 59c9af42
      Yuchung Cheng 提交于
      Take RTT sample if an ACK selectively acks some sequences that
      have never been retransmitted. The Karn's algorithm does not apply
      even if that ACK (s)acks other retransmitted sequences, because it
      must been generated by an original but perhaps out-of-order packet.
      There is no ambiguity. In case when multiple blocks are newly
      sacked because of ACK losses the earliest block is used to
      measure RTT, similar to cummulative ACKs.
      
      Such RTT samples allow the sender to estimate the RTO during loss
      recovery and packet reordering events. It is still useful even with
      TCP timestamps. That's because during these events the SND.UNA may
      not advance preventing RTT samples from TS ECR (thus the FLAG_ACKED
      check before calling tcp_ack_update_rtt()).  Therefore this new
      RTT source is complementary to existing ACK and TS RTT mechanisms.
      
      This patch does not update the RTO. It is done in the next patch.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      59c9af42
    • Y
      tcp: prefer packet timing to TS-ECR for RTT · 5b08e47c
      Yuchung Cheng 提交于
      Prefer packet timings to TS-ecr for RTT measurements when both
      sources are available. That's because broken middle-boxes and remote
      peer can return packets with corrupted TS ECR fields. Similarly most
      congestion controls that require RTT signals favor timing-based
      sources as well. Also check for bad TS ECR values to avoid RTT
      blow-ups. It has happened on production Web servers.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5b08e47c
    • Y
      tcp: consolidate SYNACK RTT sampling · 375fe02c
      Yuchung Cheng 提交于
      The first patch consolidates SYNACK and other RTT measurement to use a
      central function tcp_ack_update_rtt(). A (small) bonus is now SYNACK
      RTT measurement happens after PAWS check, potentially reducing the
      impact of RTO seeding on bad TCP timestamps values.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      375fe02c
  15. 20 7月, 2013 1 次提交
  16. 17 7月, 2013 1 次提交
    • E
      ipv4: set transport header earlier · 21d1196a
      Eric Dumazet 提交于
      commit 45f00f99 ("ipv4: tcp: clean up tcp_v4_early_demux()") added a
      performance regression for non GRO traffic, basically disabling
      IP early demux.
      
      IPv6 stack resets transport header in ip6_rcv() before calling
      IP early demux in ip6_rcv_finish(), while IPv4 does this only in
      ip_local_deliver_finish(), _after_ IP early demux.
      
      GRO traffic happened to enable IP early demux because transport header
      is also set in inet_gro_receive()
      
      Instead of reverting the faulty commit, we can make IPv4/IPv6 behave the
      same : transport_header should be set in ip_rcv() instead of
      ip_local_deliver_finish()
      
      ip_local_deliver_finish() can also use skb_network_header_len() which is
      faster than ip_hdrlen()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      21d1196a
  17. 13 7月, 2013 1 次提交
  18. 12 7月, 2013 3 次提交
    • A
      gre: Fix MTU sizing check for gretap tunnels · 8c91e162
      Alexander Duyck 提交于
      This change fixes an MTU sizing issue seen with gretap tunnels when non-gso
      packets are sent from the interface.
      
      In my case I was able to reproduce the issue by simply sending a ping of
      1421 bytes with the gretap interface created on a device with a standard
      1500 mtu.
      
      This fix is based on the fact that the tunnel mtu is already adjusted by
      dev->hard_header_len so it would make sense that any packets being compared
      against that mtu should also be adjusted by hard_header_len and the tunnel
      header instead of just the tunnel header.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Reported-by: NCong Wang <amwang@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8c91e162
    • A
      gso: Update tunnel segmentation to support Tx checksum offload · cdbaa0bb
      Alexander Duyck 提交于
      This change makes it so that the GRE and VXLAN tunnels can make use of Tx
      checksum offload support provided by some drivers via the hw_enc_features.
      Without this fix enabling GSO means sacrificing Tx checksum offload and
      this actually leads to a performance regression as shown below:
      
                  Utilization
                  Send
      Throughput  local         GSO
      10^6bits/s  % S           state
        6276.51   8.39          enabled
        7123.52   8.42          disabled
      
      To resolve this it was necessary to address two items.  First
      netif_skb_features needed to be updated so that it would correctly handle
      the Trans Ether Bridging protocol without impacting the need to check for
      Q-in-Q tagging.  To do this it was necessary to update harmonize_features
      so that it used skb_network_protocol instead of just using the outer
      protocol.
      
      Second it was necessary to update the GRE and UDP tunnel segmentation
      offloads so that they would reset the encapsulation bit and inner header
      offsets after the offload was complete.
      
      As a result of this change I have seen the following results on a interface
      with Tx checksum enabled for encapsulated frames:
      
                  Utilization
                  Send
      Throughput  local         GSO
      10^6bits/s  % S           state
        7123.52   8.42          disabled
        8321.75   5.43          enabled
      
      v2: Instead of replacing refrence to skb->protocol with
          skb_network_protocol just replace the protocol reference in
          harmonize_features to allow for double VLAN tag checks.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cdbaa0bb
    • C
      inet: fix spacing in assignment · 3b8ccd44
      Camelia Groza 提交于
      Found using checkpatch.pl
      Signed-off-by: NCamelia Groza <camelia.groza@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b8ccd44
  19. 11 7月, 2013 2 次提交
  20. 09 7月, 2013 1 次提交
  21. 04 7月, 2013 1 次提交