1. 04 Sep 2013, 4 commits
  2. 03 Sep 2013, 1 commit
  3. 01 Sep 2013, 1 commit
  4. 31 Aug 2013, 1 commit
    • Y
      tcp: do not use cached RTT for RTT estimation · 1b7fdd2a
      Committed by Yuchung Cheng
      RTT cached in the TCP metrics is valuable for the initial timeout
      because the SYN RTT usually does not account for serialization delays
      on low BW paths.
      
      However, using it to seed the RTT estimator may be disruptive because
      other components (e.g., pacing) require the smoothed RTT to be obtained
      from the actual connection.
      
      The solution is to use the higher cached RTT to set the first RTO
      conservatively, like tcp_rtt_estimator(), but avoid seeding the other
      RTT estimator variables such as srtt.  It is also a good idea to
      keep the RTO conservative to obtain the first RTT sample, and the
      performance is ensured by TCP loss probe if SYN RTT is available.
      
      To keep the seeding formula consistent across SYN RTT and cached RTT,
      the rttvar is twice the cached RTT instead of cached RTTVAR value. The
      reason is because cached variation may be too small (near min RTO)
      which defeats the purpose of being conservative on first RTO. However
      the metrics still keep the RTT variations as they might be useful for
      user applications (through ip).
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Tested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1b7fdd2a
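A minimal sketch of the seeding idea described above. The function name, the millisecond units, and the exact clamping floor are illustrative assumptions, not the kernel's actual code; only the principle (leave srtt unseeded, derive a conservative first RTO with a variance term of twice the cached RTT) comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

#define ILLUSTRATIVE_RTO_MIN_MS 200u  /* hypothetical stand-in for tcp_rto_min() */

/* Seed only the first RTO from the cached RTT; srtt stays zero so the
 * real estimator starts from the first live RTT sample.  The variance
 * term is twice the cached RTT rather than the cached RTTVAR, which
 * may be too small (near min RTO) to be conservative. */
static uint32_t seed_first_rto_ms(uint32_t cached_rtt_ms)
{
    uint32_t rttvar_ms = 2u * cached_rtt_ms;
    uint32_t rto_ms = cached_rtt_ms + rttvar_ms;

    return rto_ms < ILLUSTRATIVE_RTO_MIN_MS ? ILLUSTRATIVE_RTO_MIN_MS : rto_ms;
}
```

With a 100 ms cached RTT this yields a 300 ms first RTO, while a very small cached RTT falls back to the minimum floor.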
  5. 30 Aug 2013, 1 commit
    • E
      tcp: TSO packets automatic sizing · 95bd09eb
      Committed by Eric Dumazet
      After hearing many people over past years complaining against TSO being
      bursty or even buggy, we are proud to present automatic sizing of TSO
      packets.
      
      One part of the problem is that tcp_tso_should_defer() uses a heuristic
      relying on upcoming ACKs instead of a timer, but more generally, having
      big TSO packets makes little sense at low rates, as it tends to create
      micro bursts on the network, and the general consensus is to reduce the
      buffering amount.
      
      This patch introduces a per socket sk_pacing_rate, that approximates
      the current sending rate, and allows us to size the TSO packets so
      that we try to send one packet every ms.
      
      This field could be set by other transports.
      
      The patch has no impact on high-speed flows, where having large TSO
      packets makes sense to reach line rate.
      
      For other flows, this helps better packet scheduling and ACK clocking.
      
      This patch increases performance of TCP flows in lossy environments.
      
      A new sysctl (tcp_min_tso_segs) is added, to specify the
      minimal size of a TSO packet (default being 2).
      
      A follow-up patch will provide a new packet scheduler (FQ), using
      sk_pacing_rate as an input to perform optional per flow pacing.
      
      This explains why we chose to set sk_pacing_rate to twice the current
      rate, allowing 'slow start' ramp up.
      
      sk_pacing_rate = 2 * cwnd * mss / srtt
      
      v2: Neal Cardwell reported a suspicious deferring of the last two segments
      on an initial write of 10 MSS; I had to change tcp_tso_should_defer() to
      take tp->xmit_size_goal_segs into account.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      95bd09eb
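The pacing formula and the "one packet per ms" sizing rule above can be sketched as follows. This is an illustrative condensation under assumed units (srtt in microseconds, rate in bytes per second), not the kernel's fixed-point implementation.

```c
#include <assert.h>
#include <stdint.h>

/* sk_pacing_rate = 2 * cwnd * mss / srtt, here in bytes per second,
 * with srtt given in microseconds.  The factor of two allows a
 * 'slow start' style ramp up of the rate. */
static uint64_t pacing_rate_Bps(uint32_t cwnd, uint32_t mss, uint32_t srtt_us)
{
    return (2ULL * cwnd * mss * 1000000ULL) / srtt_us;
}

/* Size TSO packets so that roughly one packet is sent every 1 ms, but
 * never below the tcp_min_tso_segs floor (default 2). */
static uint32_t tso_autosize_segs(uint64_t rate_Bps, uint32_t mss,
                                  uint32_t min_tso_segs)
{
    uint32_t segs = (uint32_t)(rate_Bps / 1000u / mss); /* bytes sent in 1 ms */

    return segs < min_tso_segs ? min_tso_segs : segs;
}
```

For a fast flow (cwnd 1000, mss 1448, srtt 10 ms) this keeps ~200-segment TSO packets; for a slow flow the per-ms budget collapses and the min_tso_segs floor of 2 applies, avoiding micro bursts.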
  6. 28 Aug 2013, 5 commits
    • P
      netfilter: add SYNPROXY core/target · 48b1de4c
      Committed by Patrick McHardy
      Add a SYNPROXY for netfilter. The code is split into two parts: the synproxy
      core with common functions, and an address-family-specific target.
      
      The SYNPROXY receives the connection request from the client, responds with
      a SYN/ACK containing a SYN cookie and announcing a zero window and checks
      whether the final ACK from the client contains a valid cookie.
      
      It then establishes a connection to the original destination and, if
      successful, sends a window update to the client with the window size
      announced by the server.
      
      Support for timestamps, SACK, window scaling and MSS options can be
      statically configured as target parameters if the features of the server
      are known. If timestamps are used, the timestamp value sent back to
      the client in the SYN/ACK will be different from the real timestamp of
      the server. In order not to break PAWS, the timestamps are translated in
      the direction server->client.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Tested-by: Martin Topholm <mph@one.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      48b1de4c
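The proxied handshake described above can be sketched as a tiny state machine. The state and function names here are hypothetical for illustration; the real logic lives in the SYNPROXY core/target with different structure and many more details (option translation, sequence adjustment, timeouts).

```c
#include <assert.h>
#include <stdbool.h>

enum sp_state {
    SP_IDLE,        /* waiting for the client SYN */
    SP_COOKIE_SENT, /* SYN/ACK with SYN cookie and zero window sent */
    SP_VALIDATED,   /* client ACK carried a valid cookie */
    SP_ESTABLISHED, /* backend connected; window update sent to client */
    SP_DROPPED,
};

/* Client SYN: answer with a cookie-bearing SYN/ACK, announce zero window. */
static enum sp_state sp_client_syn(enum sp_state s)
{
    return s == SP_IDLE ? SP_COOKIE_SENT : SP_DROPPED;
}

/* Final client ACK: only proceed if the embedded cookie validates. */
static enum sp_state sp_client_ack(enum sp_state s, bool cookie_valid)
{
    if (s != SP_COOKIE_SENT || !cookie_valid)
        return SP_DROPPED;
    return SP_VALIDATED; /* now open a connection to the real server */
}

/* Server handshake completed: relay its window to the client. */
static enum sp_state sp_server_synack(enum sp_state s)
{
    return s == SP_VALIDATED ? SP_ESTABLISHED : SP_DROPPED;
}
```

The key property the zero window buys is that the client sends no data until the backend connection exists and a window update re-opens the flow.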
    • P
      net: syncookies: export cookie_v4_init_sequence/cookie_v4_check · 0198230b
      Committed by Patrick McHardy
      Extract the local TCP stack independent parts of tcp_v4_init_sequence()
      and cookie_v4_check() and export them for use by the upcoming SYNPROXY
      target.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Acked-by: David S. Miller <davem@davemloft.net>
      Tested-by: Martin Topholm <mph@one.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      0198230b
    • P
      netfilter: nf_conntrack: make sequence number adjustments usable without NAT · 41d73ec0
      Committed by Patrick McHardy
      Split out sequence number adjustments from NAT and move them to the conntrack
      core to make them usable for SYN proxying. The sequence number adjustment
      information is moved to a separate extension. The extension is added to new
      conntracks when a NAT mapping is set up for a connection using a helper.
      
      As a side effect, this saves 24 bytes per connection with NAT in the common
      case that a connection does not have a helper assigned.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Tested-by: Martin Topholm <mph@one.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      41d73ec0
    • P
      netfilter: ip[6]t_REJECT: tcp-reset using wrong MAC source if bridged · affe759d
      Committed by Phil Oester
      As reported by Casper Gripenberg, in a bridged setup, using ip[6]t_REJECT
      with the tcp-reset option sends out reset packets with the src MAC address
      of the local bridge interface, instead of the MAC address of the intended
      destination.  This causes some routers/firewalls to drop the reset packet
      as it appears to be spoofed.  Fix this by bypassing ip[6]_local_out and
      setting the MAC of the sender in the tcp reset packet.
      
      This closes netfilter bugzilla #531.
      Signed-off-by: Phil Oester <kernel@linuxace.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      affe759d
    • D
      net: tcp_probe: allow more advanced ingress filtering by mark · b1dcdc68
      Committed by Daniel Borkmann
      Currently, the tcp_probe snooper can either filter packets by a given
      port (handed to the module via module parameter e.g. port=80) or lets
      all TCP traffic pass (port=0, default). When a port is specified, the
      port number is tested against the sk's source/destination port. Thus,
      if one of them matches, the information will be further processed for
      the log.
      
      As this is quite limited, allow for more advanced filtering possibilities
      which can facilitate debugging/analysis with the help of the tcp_probe
      snooper. Therefore, similarly as added to BPF machine in commit 7e75f93e
      ("pkt_sched: ingress socket filter by mark"), add the possibility to
      use skb->mark as a filter.
      
      If the mark is not being used otherwise, this allows ingress filtering
      by flow (e.g. in order to track updates from only a single flow, or a
      subset of all flows for a given port) and other things such as dynamic
      logging and reconfiguration without removing/re-inserting the tcp_probe
      module, etc. Simple example:
      
        insmod net/ipv4/tcp_probe.ko fwmark=8888 full=1
        ...
        iptables -A INPUT -i eth4 -t mangle -p tcp --dport 22 \
                 --sport 60952 -j MARK --set-mark 8888
        [... sampling interval ...]
        iptables -D INPUT -i eth4 -t mangle -p tcp --dport 22 \
                 --sport 60952 -j MARK --set-mark 8888
      
      The current option to filter by a given port is still being preserved. A
      similar approach could be done for the sctp_probe module as a follow-up.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b1dcdc68
  7. 26 Aug 2013, 1 commit
  8. 23 Aug 2013, 4 commits
    • D
      net: tcp_probe: add IPv6 support · f925d0a6
      Committed by Daniel Borkmann
      The tcp_probe currently only supports analysis of IPv4 connections.
      Therefore, it would be nice to have IPv6 supported as well. Since we
      have the recently added %pISpc specifier that is IPv4/IPv6 generic,
      build the related sockaddr structures from the flow information and
      pass them to our format string. Tested with SSH and HTTP sessions
      on IPv4 and IPv6.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f925d0a6
    • D
      net: tcp_probe: kprobes: adapt jtcp_rcv_established signature · d8cdeda6
      Committed by Daniel Borkmann
      This patch fixes a rather unproblematic function signature mismatch,
      as the const specifier was missing for the th variable; next to
      that, it adds a build-time assertion so that future function signature
      mismatches for kprobes will not end badly, similarly to what commit 22222997
      ("net: sctp: add build check for sctp_sf_eat_sack_6_2/jsctp_sf_eat_sack")
      did for SCTP.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d8cdeda6
    • D
      net: tcp_probe: also include rcv_wnd next to snd_wnd · b4c1c1d0
      Committed by Daniel Borkmann
      It is sometimes helpful to know the TCP window sizes of an established
      socket, e.g. to confirm that window scaling is working or to tweak the
      window size to improve high-latency connections. Currently the
      TCP snooper only exports the send window size, but not the receive window
      size. Therefore, also add the receive window size to the end of the
      output line.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b4c1c1d0
    • Y
      tcp: increase throughput when reordering is high · 0f7cc9a3
      Committed by Yuchung Cheng
      The stack currently detects reordering and avoids spurious
      retransmission very well. However, the throughput is sub-optimal under
      high reordering because cwnd is increased only if the data is delivered
      in order, i.e., the FLAG_DATA_ACKED check in tcp_ack().  The more packets
      are reordered, the worse the throughput is.
      
      Therefore when reordering is proven high, cwnd should advance whenever
      the data is delivered regardless of its ordering. If reordering is low,
      conservatively advance cwnd only on ordered deliveries in Open state,
      and retain cwnd in Disordered state (RFC5681).
      
      In netperf tests on a qdisc setup of 20Mbps BW and random RTT from 45ms
      to 55ms (to induce reordering), this change increases TCP throughput
      by 20 - 25% to near the bottleneck BW.
      
      A special case is the stretched ACK with new SACK and/or ECE mark.
      For example, a receiver may receive an out-of-order or ECN packet with
      unacked data buffered because of LRO or delayed ACK. The principle on
      such an ACK is to advance cwnd on the cumulatively acked part first,
      then reduce cwnd in tcp_fastretrans_alert().
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0f7cc9a3
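The cwnd-advance rule above can be condensed into a predicate. The flag values and the function name are illustrative (the kernel's actual decision involves more state than this), but the branch structure mirrors the described policy.

```c
#include <assert.h>
#include <stdbool.h>

#define FLAG_DATA_ACKED  0x04  /* cumulative ACK advanced snd_una */
#define FLAG_DATA_SACKED 0x20  /* this ACK reported newly SACKed data */

/* With a high reordering estimate, any delivered data (cumulative or
 * SACKed) may grow cwnd; with a low estimate, only in-order deliveries
 * in the Open state do, and Disordered state retains cwnd (RFC 5681). */
static bool may_raise_cwnd(int ack_flags, int reordering, int reord_thresh,
                           bool state_open)
{
    if (reordering > reord_thresh)
        return ack_flags & (FLAG_DATA_ACKED | FLAG_DATA_SACKED);

    return state_open && (ack_flags & FLAG_DATA_ACKED);
}
```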
  9. 21 Aug 2013, 4 commits
  10. 16 Aug 2013, 1 commit
  11. 15 Aug 2013, 3 commits
  12. 14 Aug 2013, 2 commits
    • P
      ip_tunnel: Do not use inner ip-header-id for tunnel ip-header-id. · 4221f405
      Committed by Pravin B Shelar
      Using the inner id for the tunnel id is not safe in some rare cases.
      E.g. packets coming from multiple sources entering the same tunnel
      can have the same id. Therefore, on tunnel packet receive we
      could have packets from two different streams, but with the same
      source and dst IP and the same ip-id, which could confuse IP packet
      reassembly.
      
      The following patch reverts the optimization from commit
      490ab081 ("IP_GRE: Fix IP-Identification.").
      
      CC: Jarno Rajahalme <jrajahalme@nicira.com>
      CC: Ansis Atteka <aatteka@nicira.com>
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4221f405
    • Y
      tcp: reset reordering est. selectively on timeout · 74c181d5
      Committed by Yuchung Cheng
      On timeout the TCP sender unconditionally resets the estimated degree
      of network reordering (tp->reordering). The idea behind this is that
      the estimate is too large to trigger fast recovery (e.g., due to an IP
      path change).
      
      But for example if the sender only had 2 packets outstanding, then a
      timeout doesn't tell much about reordering. A sender that learns about
      reordering on big writes and loses packets on small writes will end up
      falsely retransmitting again and again, especially when reordering is
      more likely on big writes.
      
      Therefore the sender should only suspect that tp->reordering is too
      high if it could have gone into fast recovery with the (lower) default
      estimate.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      74c181d5
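A sketch of the guard this commit describes. The function name, parameters, and the exact condition for "could have gone into fast recovery" are illustrative assumptions; only the selective-reset principle comes from the message above.

```c
#include <assert.h>

/* Keep the learned reordering estimate on timeout unless the flight was
 * large enough that the (lower) default estimate could have let the
 * sender enter fast recovery instead of timing out. */
static int reordering_after_timeout(int cur_reordering, int packets_out,
                                    int default_reordering /* typically 3 */)
{
    if (packets_out > default_reordering)
        return default_reordering;  /* estimate was plausibly too high */

    return cur_reordering;          /* too few packets to judge anything */
}
```

With only 2 packets outstanding a timeout leaves the estimate alone, so a sender that learned reordering on big writes no longer forgets it after losses on small writes.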
  13. 10 Aug 2013, 6 commits
  14. 09 Aug 2013, 2 commits
  15. 08 Aug 2013, 3 commits
    • S
      ip_tunnel: embed hash list head · 6261d983
      Committed by stephen hemminger
      The IP tunnel hash heads can be embedded in the per-net structure
      since they are a fixed size. Reduce the size so that the total structure
      fits in a page. The original size was overly large; even NETDEV_HASHBITS
      is only 8 bits!
      
      Also, add some white space for readability.
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Acked-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6261d983
    • E
      tcp: cubic: fix bug in bictcp_acked() · cd6b423a
      Committed by Eric Dumazet
      While investigating a strange increase of retransmit rates
      on hosts ~24 days after boot, Van found hystart was disabled
      if ca->epoch_start was 0, as the following condition is true
      when the tcp_time_stamp high-order bit is set.
      
      (s32)(tcp_time_stamp - ca->epoch_start) < HZ
      
      Quoting Van :
      
       At initialization & after every loss ca->epoch_start is set to zero so
       I believe that the above line will turn off hystart as soon as the 2^31
       bit is set in tcp_time_stamp & hystart will stay off for 24 days.
       I think we've observed that cubic's restart is too aggressive without
       hystart so this might account for the higher drop rate we observe.
      Diagnosed-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cd6b423a
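The wraparound can be demonstrated in isolation. The helper names below are illustrative; the condition and the shape of the fix (test epoch_start before the signed comparison) follow the commit.

```c
#include <assert.h>
#include <stdint.h>

#define HZ 1000

/* With epoch_start == 0, the (s32) difference goes negative once
 * tcp_time_stamp crosses 2^31 (~24 days after boot at HZ=1000), so
 * "< HZ" stays true and hystart's delay sampling is skipped forever. */
static int delay_sample_skipped_buggy(uint32_t now, uint32_t epoch_start)
{
    return (int32_t)(now - epoch_start) < HZ;
}

/* The fix: a zero epoch_start means "no epoch", so never skip on it. */
static int delay_sample_skipped_fixed(uint32_t now, uint32_t epoch_start)
{
    return epoch_start && (int32_t)(now - epoch_start) < HZ;
}
```

Once the timestamp's high-order bit is set, the buggy form skips sampling even though no epoch is active, while the fixed form only skips within a real epoch's first HZ jiffies.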
    • E
      tcp: cubic: fix overflow error in bictcp_update() · 2ed0edf9
      Committed by Eric Dumazet
      commit 17a6e9f1 ("tcp_cubic: fix clock dependency") added an
      overflow error in bictcp_update() in the following code:
      
      /* change the unit from HZ to bictcp_HZ */
      t = ((tcp_time_stamp + msecs_to_jiffies(ca->delay_min>>3) -
            ca->epoch_start) << BICTCP_HZ) / HZ;
      
      Because msecs_to_jiffies() returns an unsigned long, the compiler does
      implicit type promotion.
      
      We really want to constrain (tcp_time_stamp - ca->epoch_start)
      to a signed 32bit value, or else 't' has unexpected high values.
      
      This bug triggers an increase of retransmit rates ~24 days after
      boot [1], as the high-order bit of tcp_time_stamp flips.
      
      [1] for hosts with HZ=1000
      
      Big thanks to Van Jacobson for spotting this problem.
      Diagnosed-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2ed0edf9
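The promotion bug can be reproduced standalone. The helper names are illustrative; the buggy expression mirrors the quoted code shape (u32 timestamp plus an unsigned long jiffies term), and the fixed form mirrors the commit's approach of constraining the timestamp difference to a signed 32-bit value first. The huge-value behavior assumes a 64-bit unsigned long (LP64).

```c
#include <assert.h>
#include <stdint.h>

/* msecs_to_jiffies() returns unsigned long, so on LP64 the whole
 * expression is evaluated in 64 bits: a wrapped u32 timestamp
 * difference becomes a huge value instead of a small positive one. */
static unsigned long delta_buggy(uint32_t now, unsigned long delay_j,
                                 uint32_t epoch_start)
{
    return now + delay_j - epoch_start;  /* promoted to unsigned long */
}

/* The fix: take the u32 difference first, constrained through (s32),
 * then add the jiffies delay. */
static long delta_fixed(uint32_t now, unsigned long delay_j,
                        uint32_t epoch_start)
{
    long t = (int32_t)(now - epoch_start);  /* wraps correctly in 32 bits */

    return t + (long)delay_j;
}
```

With a timestamp that has wrapped 5 jiffies past epoch_start = 0xfffffffb and a 3-jiffy delay term, the fixed form yields 13, while the buggy form on LP64 yields a value far beyond 2^32.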
  16. 06 Aug 2013, 1 commit