1. 01 Oct, 2013 (2 commits)
    • tcp: TSQ can use a dynamic limit · c9eeec26
      Committed by Eric Dumazet
      When TCP Small Queues was added, we used a sysctl to limit the amount of
      packets queued on Qdisc/device queues for a given TCP flow.
      
      The problem is that this limit is either too big for low rates, or too
      small for high rates.
      
      Now that the TCP stack has rate estimation in sk->sk_pacing_rate, and TSO
      auto sizing, it can better control the number of packets in Qdisc/device
      queues.
      
      New limit is two packets or at least 1 to 2 ms worth of packets.
      
      Low rate flows benefit from this patch by having an even smaller
      number of packets in queues, allowing for faster recovery and
      better RTT estimation.
      
      High rate flows benefit from this patch by being allowed more than 2
      packets in flight, as we had reports that this was a limiting factor in
      reaching line rate. [ In particular if TX completion is delayed because
      of coalescing parameters ]
      
      Example for a single flow on a 10Gbps link controlled by FQ/pacing:
      
      14 packets in flight instead of 2
      
      $ tc -s -d qd
      qdisc fq 8001: dev eth0 root refcnt 32 limit 10000p flow_limit 100p
      buckets 1024 quantum 3028 initial_quantum 15140
       Sent 1168459366606 bytes 771822841 pkt (dropped 0, overlimits 0
      requeues 6822476)
       rate 9346Mbit 771713pps backlog 953820b 14p requeues 6822476
        2047 flow, 2046 inactive, 1 throttled, delay 15673 ns
        2372 gc, 0 highprio, 0 retrans, 9739249 throttled, 0 flows_plimit
      
      Note that sk_pacing_rate is currently set to twice the actual rate, but
      this might be refined in the future when a flow is in congestion
      avoidance.
      
      Additional change: skb->destructor should be set to tcp_wfree().
      
      A future patch (for Linux 3.13+) might remove tcp_limit_output_bytes.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Wei Liu <wei.liu2@citrix.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c9eeec26
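
      A rough, standalone sketch of the dynamic limit described in the commit
      above (illustrative only, not the kernel code): the TSQ cap becomes the
      larger of two packets' worth of bytes and roughly 1 ms of data at
      sk_pacing_rate. The rates and truesize values below are assumed example
      numbers.

        /* Illustrative sketch of the dynamic TSQ limit; not the kernel
         * implementation. Rates and packet sizes are example assumptions. */
        #include <stdio.h>
        #include <stdint.h>

        static uint64_t tsq_limit(uint64_t pacing_rate_bytes, uint32_t skb_truesize)
        {
            uint64_t two_packets  = 2ULL * skb_truesize;
            uint64_t one_ms_worth = pacing_rate_bytes >> 10; /* ~rate/1024, ~1 ms of data */
            return two_packets > one_ms_worth ? two_packets : one_ms_worth;
        }

        int main(void)
        {
            /* Low rate flow: ~1 Mbit/s pacing (125000 B/s), 2 KB truesize packets */
            printf("low  rate limit: %llu bytes\n",
                   (unsigned long long)tsq_limit(125000, 2048));
            /* High rate flow: ~10 Gbit/s pacing (1.25e9 B/s), 64 KB TSO packets */
            printf("high rate limit: %llu bytes\n",
                   (unsigned long long)tsq_limit(1250000000ULL, 65536));
            return 0;
        }

      With these assumed numbers the low rate flow stays capped at two packets,
      while the high rate flow is allowed roughly a megabyte in the queues,
      which matches the "14 packets in flight instead of 2" observation above.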
    • ip_tunnel: Do not use stale inner_iph pointer. · d4a71b15
      Committed by Pravin B Shelar
      While sending a packet, skb_cow_head() can reallocate the skb header,
      which invalidates the inner_iph pointer into it. The following patch
      avoids using it. Found by code inspection.
      
      This bug was introduced by commit 0e6fbc5b (ip_tunnels: extend
      iptunnel_xmit()).
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d4a71b15
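
      The bug class here is generic: a pointer computed into a buffer becomes
      stale once the buffer may be reallocated. A minimal userspace analogy
      (plain C, nothing kernel-specific, all names made up): copy out what you
      need before the call that may move the data, or recompute the pointer
      afterwards.

        /* Userspace analogy of the stale-pointer bug: 'hdr' points into 'buf',
         * but grow_headroom() may realloc the buffer, invalidating 'hdr'. */
        #include <stdlib.h>
        #include <string.h>
        #include <stdio.h>

        struct inner_hdr { unsigned char ttl; unsigned char proto; };

        static char *grow_headroom(char *buf, size_t *len, size_t extra)
        {
            char *nbuf = realloc(buf, *len + extra);   /* may move the data */
            if (!nbuf)
                exit(1);
            memmove(nbuf + extra, nbuf, *len);         /* open up headroom */
            *len += extra;
            return nbuf;
        }

        int main(void)
        {
            size_t len = sizeof(struct inner_hdr);
            char *buf = calloc(1, len);
            struct inner_hdr *hdr = (struct inner_hdr *)buf;
            hdr->ttl = 64;

            /* Copy the fields we need *before* the buffer may be moved... */
            unsigned char ttl = hdr->ttl;

            buf = grow_headroom(buf, &len, 16);
            /* ...because the old 'hdr' may now point to freed memory.  Use the
             * saved copy, or recompute the pointer from the new 'buf'. */
            hdr = (struct inner_hdr *)(buf + 16);
            printf("saved ttl=%u, recomputed ttl=%u\n", ttl, hdr->ttl);
            free(buf);
            return 0;
        }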
  2. 29 Sep, 2013 (1 commit)
  3. 24 Sep, 2013 (2 commits)
  4. 20 Sep, 2013 (2 commits)
  5. 18 Sep, 2013 (1 commit)
  6. 13 Sep, 2013 (1 commit)
  7. 07 Sep, 2013 (2 commits)
    • tcp: properly increase rcv_ssthresh for ofo packets · 4e4f1fc2
      Committed by Eric Dumazet
      TCP receive window handling is multi-staged.
      
      A socket has a memory budget, static or dynamic, in sk_rcvbuf.
      
      Because we do not really know how this memory budget translates to
      a TCP window (payload), TCP announces a small initial window
      (about 20 MSS).
      
      When a packet is received, we increase TCP rcv_win depending
      on the payload/truesize ratio of this packet. Good citizen
      packets give a hint that it's reasonable to have rcv_win = sk_rcvbuf/2.
      
      This heuristic takes place in tcp_grow_window().
      
      The problem is that we currently call tcp_grow_window() only for
      in-order packets.
      
      This means that reordering or packet loss stops proper growth of
      rcv_win, and senders are unable to benefit from fast recovery
      or proper reordering level detection.
      
      Really, a packet stored in the OFO queue is not a bad citizen.
      It should be treated the same way as in-order packets.
      
      In our traces, we very often see the sender limited by small Linux
      receive windows, even though Linux hosts use autotuning (DRS) and
      should allow rcv_win to grow to ~3MB.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4e4f1fc2
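
      To make the "good citizen" notion concrete, here is a tiny standalone
      sketch (illustrative only, not tcp_grow_window() itself) of judging a
      packet by its payload/truesize ratio; the 50% threshold is an assumption
      for demonstration, not the kernel's exact rule.

        /* Illustrative payload/truesize check, loosely modeled on the idea
         * behind tcp_grow_window(); the threshold is an assumption. */
        #include <stdio.h>

        static int good_citizen(unsigned int payload, unsigned int truesize)
        {
            /* A packet whose payload is at least half of its memory footprint
             * suggests the receive buffer translates efficiently into window. */
            return 2 * payload >= truesize;
        }

        int main(void)
        {
            printf("1448B payload / 2304B truesize -> %s\n",
                   good_citizen(1448, 2304) ? "grow window" : "hold");
            printf("  64B payload / 2304B truesize -> %s\n",
                   good_citizen(64, 2304) ? "grow window" : "hold");
            return 0;
        }

      The point of the commit is that this judgement should apply to packets
      landing in the OFO queue just as it does to in-order packets.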
    • tcp: fix no cwnd growth after timeout · 16edfe7e
      Committed by Yuchung Cheng
      Commit 0f7cc9a3 ("tcp: increase throughput when reordering is high")
      only allows cwnd to increase in the Open state. This mistakenly
      disables slow start after a timeout (CA_Loss). Moreover, cwnd won't
      grow if the state moves from Disorder to Open later in
      tcp_fastretrans_alert().
      
      Therefore the correct logic should be to allow cwnd to grow as long
      as the data is received in order in Open, Loss, or even Disorder state.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      16edfe7e
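
      A hedged sketch of the corrected decision described above (not the actual
      tcp_fastretrans_alert() code): cwnd may grow when data arrives in order
      and the congestion state is Open, Loss, or Disorder.

        /* Sketch of the decision described in the commit message; state names
         * mirror the kernel's CA states, but the function is illustrative. */
        #include <stdbool.h>
        #include <stdio.h>

        enum ca_state { CA_OPEN, CA_DISORDER, CA_CWR, CA_RECOVERY, CA_LOSS };

        static bool may_raise_cwnd(enum ca_state state, bool data_in_order)
        {
            if (!data_in_order)
                return false;
            /* Old behaviour: only CA_OPEN.  Corrected: also allow Loss (so slow
             * start resumes after a timeout) and Disorder. */
            return state == CA_OPEN || state == CA_LOSS || state == CA_DISORDER;
        }

        int main(void)
        {
            printf("Loss, in-order data     -> grow? %d\n", may_raise_cwnd(CA_LOSS, true));
            printf("Recovery, in-order data -> grow? %d\n", may_raise_cwnd(CA_RECOVERY, true));
            return 0;
        }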
  8. 06 Sep, 2013 (1 commit)
  9. 05 Sep, 2013 (1 commit)
  10. 04 Sep, 2013 (9 commits)
  11. 03 Sep, 2013 (1 commit)
  12. 01 Sep, 2013 (1 commit)
  13. 31 Aug, 2013 (3 commits)
  14. 30 Aug, 2013 (4 commits)
    • ipv4: sendto/hdrincl: don't use destination address found in header · c27c9322
      Committed by Chris Clark
      ipv4: raw_sendmsg: don't use header's destination address
      
      A sendto() regression was bisected and found to start with commit
      f8126f1d (ipv4: Adjust semantics of rt->rt_gateway.)
      
      The problem is that it tries to ARP-lookup the constructed packet's
      destination address rather than the explicitly provided address.
      
      Fix this by using FLOWI_FLAG_KNOWN_NH so that the given nexthop is used.
      
      cf. commit 2ad5b9e4
      Reported-by: Chris Clark <chris.clark@alcatel-lucent.com>
      Bisected-by: Chris Clark <chris.clark@alcatel-lucent.com>
      Tested-by: Chris Clark <chris.clark@alcatel-lucent.com>
      Suggested-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Chris Clark <chris.clark@alcatel-lucent.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c27c9322
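
      For context, the regression scenario involves a raw socket with
      IP_HDRINCL, where the address passed to sendto() (the intended
      destination/nexthop) can differ from the destination written into the
      hand-built IP header. A minimal userspace sketch of that setup; the
      addresses are documentation placeholders and it needs root to run.

        /* Minimal IP_HDRINCL sender: the header's daddr and the sendto()
         * address are set independently to show the two addresses the commit
         * message talks about.  Addresses are placeholders; run as root. */
        #include <arpa/inet.h>
        #include <netinet/ip.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            int fd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW); /* implies IP_HDRINCL */
            if (fd < 0) { perror("socket"); return 1; }

            struct iphdr iph = {0};
            iph.version  = 4;
            iph.ihl      = 5;
            iph.ttl      = 64;
            iph.protocol = IPPROTO_RAW;
            iph.tot_len  = htons(sizeof(iph));           /* id/checksum filled by kernel */
            iph.saddr    = inet_addr("192.0.2.1");       /* placeholder source */
            iph.daddr    = inet_addr("198.51.100.7");    /* destination in the header */

            struct sockaddr_in to = { .sin_family = AF_INET };
            to.sin_addr.s_addr = inet_addr("203.0.113.9"); /* address given to sendto() */

            /* The fix ensures routing/ARP use the sendto() address (the known
             * nexthop), not the daddr found inside the constructed header. */
            if (sendto(fd, &iph, sizeof(iph), 0,
                       (struct sockaddr *)&to, sizeof(to)) < 0)
                perror("sendto");
            close(fd);
            return 0;
        }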
    • tcp: TSO packets automatic sizing · 95bd09eb
      Committed by Eric Dumazet
      After hearing many people over the past years complain about TSO being
      bursty or even buggy, we are proud to present automatic sizing of TSO
      packets.
      
      One part of the problem is that tcp_tso_should_defer() uses a heuristic
      relying on upcoming ACKs instead of a timer, but more generally, having
      big TSO packets makes little sense at low rates, as it tends to create
      micro bursts on the network, and the general consensus is to reduce the
      amount of buffering.
      
      This patch introduces a per-socket sk_pacing_rate that approximates
      the current sending rate and allows us to size TSO packets so that
      we try to send one packet every millisecond.
      
      This field could be set by other transports.
      
      The patch has no impact on high speed flows, where having large TSO
      packets makes sense to reach line rate.
      
      For other flows, this helps better packet scheduling and ACK clocking.
      
      This patch increases performance of TCP flows in lossy environments.
      
      A new sysctl (tcp_min_tso_segs) is added, to specify the
      minimal size of a TSO packet (default being 2).
      
      A follow-up patch will provide a new packet scheduler (FQ), using
      sk_pacing_rate as an input to perform optional per flow pacing.
      
      This explains why we chose to set sk_pacing_rate to twice the current
      rate, allowing 'slow start' ramp up.
      
      sk_pacing_rate = 2 * cwnd * mss / srtt
      
      v2: Neal Cardwell reported a suspect deferring of the last two segments
      on an initial write of 10 MSS; I had to change tcp_tso_should_defer() to
      take into account tp->xmit_size_goal_segs.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      95bd09eb
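
      A worked example of the formula above, using illustrative numbers: with
      cwnd = 10, mss = 1448 bytes and srtt = 100 ms, sk_pacing_rate comes out
      at about 290 KB/s, so one millisecond of data is less than one MSS and
      the TSO goal falls back to the tcp_min_tso_segs floor of 2; a faster
      flow ends up with ~20 segments per TSO packet.

        /* Worked example of the sizing described above: sk_pacing_rate from
         * the commit's formula, then a TSO goal of ~1 ms worth of data,
         * floored at tcp_min_tso_segs (default 2).  Numbers are illustrative. */
        #include <stdio.h>
        #include <stdint.h>

        #define MIN_TSO_SEGS 2   /* default of the new tcp_min_tso_segs sysctl */

        static uint64_t pacing_rate(uint32_t cwnd, uint32_t mss, double srtt_sec)
        {
            return (uint64_t)(2.0 * cwnd * mss / srtt_sec);   /* bytes per second */
        }

        static uint32_t tso_segs_goal(uint64_t rate_bps, uint32_t mss)
        {
            uint32_t segs = (uint32_t)(rate_bps / 1000 / mss); /* ~1 ms of data */
            return segs < MIN_TSO_SEGS ? MIN_TSO_SEGS : segs;
        }

        int main(void)
        {
            uint32_t mss = 1448;
            uint64_t slow = pacing_rate(10, mss, 0.100);   /* cwnd 10, srtt 100 ms */
            uint64_t fast = pacing_rate(100, mss, 0.010);  /* cwnd 100, srtt 10 ms */

            printf("slow flow: %llu B/s -> %u segs per TSO packet\n",
                   (unsigned long long)slow, tso_segs_goal(slow, mss));
            printf("fast flow: %llu B/s -> %u segs per TSO packet\n",
                   (unsigned long long)fast, tso_segs_goal(fast, mss));
            return 0;
        }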
    • tcp: don't apply tsoffset if rcv_tsecr is zero · e3e12028
      Committed by Andrew Vagin
      The zero value means that tsecr is not valid, so it's a special case.
      
      tsoffset is used to customize tcp_time_stamp for one socket.
      tsoffset is usually zero; it's used when a socket has been moved from
      one host to another.
      
      Currently this issue affects the logic of tcp_rcv_rtt_measure_ts. Due to
      an incorrect value of rcv_tsecr, tcp_rcv_rtt_measure_ts sets rto to
      TCP_RTO_MAX.
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e3e12028
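
      The fix described here amounts to treating rcv_tsecr == 0 as "no echoed
      timestamp" and skipping the offset/RTT math in that case. A minimal
      hedged sketch of that idea; the field and function names are simplified
      and the offset direction is an assumption for illustration.

        /* Sketch of the special case: a zero echoed timestamp means "not
         * valid", so no tsoffset is applied and no RTT sample is taken.
         * Simplified, illustrative names; not the kernel functions. */
        #include <stdio.h>
        #include <stdint.h>

        static int rtt_sample_from_tsecr(uint32_t now, uint32_t rcv_tsecr,
                                         uint32_t tsoffset, uint32_t *rtt_out)
        {
            if (rcv_tsecr == 0)          /* special case from the commit message */
                return 0;                /* no valid sample, leave RTT/RTO alone */
            *rtt_out = now - (rcv_tsecr - tsoffset);
            return 1;
        }

        int main(void)
        {
            uint32_t rtt;
            printf("tsecr=0   -> sample? %d\n",
                   rtt_sample_from_tsecr(1000, 0, 50, &rtt));
            if (rtt_sample_from_tsecr(1000, 960, 50, &rtt))
                printf("tsecr=960 -> rtt=%u ticks\n", rtt);
            return 0;
        }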
    • tcp: initialize rcv_tstamp for restored sockets · c7781a6e
      Committed by Andrew Vagin
      u32 rcv_tstamp;     /* timestamp of last received ACK */
      
      Its value is used in tcp_retransmit_timer, which closes the socket
      if the last ACK was received more than TCP_RTO_MAX ago.
      
      Currently rcv_tstamp is initialized to zero, and if tcp_retransmit_timer
      is called before receiving the first ACK, the connection is closed.
      
      This patch initializes rcv_tstamp to a current timestamp when a socket
      is restored.
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c7781a6e
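
      In effect, a socket restored via repair mode should start with rcv_tstamp
      set to "now" rather than 0, so the retransmit timer does not see an
      ancient last-ACK time. A tiny hedged sketch of that idea (illustrative
      only, not the actual repair-mode code):

        /* Sketch: why a zero rcv_tstamp trips the retransmit timer, and the
         * fix of stamping "now" at restore time.  Illustrative only. */
        #include <stdio.h>
        #include <stdint.h>

        #define TCP_RTO_MAX_TICKS (120 * 1000)   /* 120 s, in ms-like ticks */

        static int should_kill_connection(uint32_t now, uint32_t rcv_tstamp)
        {
            return now - rcv_tstamp > TCP_RTO_MAX_TICKS;
        }

        int main(void)
        {
            uint32_t now = 5000000;   /* arbitrary current tick */

            /* Before the fix: a restored socket keeps rcv_tstamp == 0. */
            printf("rcv_tstamp=0   -> kill? %d\n", should_kill_connection(now, 0));

            /* After the fix: rcv_tstamp is initialized to "now" on restore. */
            printf("rcv_tstamp=now -> kill? %d\n", should_kill_connection(now, now));
            return 0;
        }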
  15. 28 Aug, 2013 (5 commits)
    • netfilter: add SYNPROXY core/target · 48b1de4c
      Committed by Patrick McHardy
      Add a SYNPROXY for netfilter. The code is split into two parts: the
      synproxy core with common functions, and an address-family-specific
      target.
      
      The SYNPROXY receives the connection request from the client, responds
      with a SYN/ACK containing a SYN cookie and announcing a zero window,
      and checks whether the final ACK from the client contains a valid
      cookie.
      
      It then establishes a connection to the original destination and, if
      successful, sends a window update to the client with the window size
      announced by the server.
      
      Support for timestamps, SACK, window scaling and MSS options can be
      statically configured as target parameters if the features of the server
      are known. If timestamps are used, the timestamp value sent back to
      the client in the SYN/ACK will be different from the real timestamp of
      the server. In order to not break PAWS, the timestamps are translated in
      the direction server->client.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Tested-by: Martin Topholm <mph@one.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      48b1de4c
    • net: syncookies: export cookie_v4_init_sequence/cookie_v4_check · 0198230b
      Committed by Patrick McHardy
      Extract the local TCP stack independent parts of tcp_v4_init_sequence()
      and cookie_v4_check() and export them for use by the upcoming SYNPROXY
      target.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Acked-by: David S. Miller <davem@davemloft.net>
      Tested-by: Martin Topholm <mph@one.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      0198230b
    • netfilter: nf_conntrack: make sequence number adjustments usable without NAT · 41d73ec0
      Committed by Patrick McHardy
      Split out sequence number adjustments from NAT and move them to the
      conntrack core to make them usable for SYN proxying. The sequence number
      adjustment information is moved to a separate extension. The extension is
      added to new conntracks when a NAT mapping is set up for a connection
      using a helper.
      
      As a side effect, this saves 24 bytes per connection with NAT in the common
      case that a connection does not have a helper assigned.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Tested-by: Martin Topholm <mph@one.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      41d73ec0
    • netfilter: ip[6]t_REJECT: tcp-reset using wrong MAC source if bridged · affe759d
      Committed by Phil Oester
      As reported by Casper Gripenberg, in a bridged setup, using ip[6]t_REJECT
      with the tcp-reset option sends out reset packets with the src MAC address
      of the local bridge interface, instead of the MAC address of the intended
      destination.  This causes some routers/firewalls to drop the reset packet
      as it appears to be spoofed.  Fix this by bypassing ip[6]_local_out and
      setting the MAC of the sender in the tcp reset packet.
      
      This closes netfilter bugzilla #531.
      Signed-off-by: Phil Oester <kernel@linuxace.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      affe759d
    • net: tcp_probe: allow more advanced ingress filtering by mark · b1dcdc68
      Committed by Daniel Borkmann
      Currently, the tcp_probe snooper can either filter packets by a given
      port (handed to the module via a module parameter, e.g. port=80) or let
      all TCP traffic pass (port=0, default). When a port is specified, the
      port number is tested against the sk's source/destination port. Thus,
      if one of them matches, the information will be further processed for
      the log.
      
      As this is quite limited, allow for more advanced filtering possibilities
      which can facilitate debugging/analysis with the help of the tcp_probe
      snooper. Therefore, similarly to what was added to the BPF machine in
      commit 7e75f93e ("pkt_sched: ingress socket filter by mark"), add the
      possibility to use skb->mark as a filter.
      
      If the mark is not being used otherwise, this allows ingress filtering
      by flow (e.g. in order to track updates from only a single flow, or a
      subset of all flows for a given port) and other things such as dynamic
      logging and reconfiguration without removing/re-inserting the tcp_probe
      module, etc. Simple example:
      
        insmod net/ipv4/tcp_probe.ko fwmark=8888 full=1
        ...
        iptables -A INPUT -i eth4 -t mangle -p tcp --dport 22 \
                 --sport 60952 -j MARK --set-mark 8888
        [... sampling interval ...]
        iptables -D INPUT -i eth4 -t mangle -p tcp --dport 22 \
                 --sport 60952 -j MARK --set-mark 8888
      
      The current option to filter by a given port is still being preserved. A
      similar approach could be done for the sctp_probe module as a follow-up.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b1dcdc68
  16. 26 Aug, 2013 (2 commits)
  17. 23 Aug, 2013 (2 commits)