1. 24 Sep, 2013 2 commits
  2. 20 Sep, 2013 2 commits
  3. 18 Sep, 2013 1 commit
  4. 13 Sep, 2013 1 commit
  5. 07 Sep, 2013 2 commits
    • tcp: properly increase rcv_ssthresh for ofo packets · 4e4f1fc2
      Committed by Eric Dumazet
      TCP receive window handling is multi-staged.
      
      A socket has a memory budget, static or dynamic, in sk_rcvbuf.
      
      Because we do not really know how this memory budget translates to
      a TCP window (payload), TCP announces a small initial window
      (about 20 MSS).
      
      When a packet is received, we increase TCP rcv_win depending
      on the payload/truesize ratio of this packet. Good citizen
      packets give a hint that it's reasonable to have rcv_win = sk_rcvbuf/2.
      
      This heuristic takes place in tcp_grow_window().
      
      The problem is that we currently call tcp_grow_window() only for
      in-order packets.
      
      This means that reorders or packet losses stop rcv_win from growing
      properly, and senders are unable to benefit from fast recovery
      or proper reordering level detection.
      
      Really, a packet being stored in the OFO queue is not a bad citizen.
      It should be part of the game, just like in-order packets.
      
      In our traces, we very often see that the sender is limited by small
      Linux receive windows, even if Linux hosts use autotuning (DRS) and
      should allow rcv_win to grow to ~3MB.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
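The effect described in this commit can be illustrated with a small model. This is a hedged sketch, not the kernel's tcp_grow_window(): the function names, the 2 * MSS increment, and the "payload is at least half of truesize" threshold are assumptions made for illustration. Only the general idea comes from the commit message: packets with a good payload/truesize ratio let the window grow toward sk_rcvbuf/2, and out-of-order packets should count as well.

```python
def grow_rcv_window(rcv_win, payload, truesize, sk_rcvbuf, mss):
    """Return the window after one packet (illustrative model only)."""
    limit = sk_rcvbuf // 2              # target mentioned in the commit text
    if rcv_win >= limit:
        return rcv_win
    # Assumed threshold: at least half of the buffer cost is payload.
    if payload * 2 >= truesize:
        return min(rcv_win + 2 * mss, limit)
    return rcv_win


def window_after(packets, sk_rcvbuf, mss, count_ofo):
    """Feed (payload, truesize, in_order) tuples through the model.

    count_ofo=False models the old behaviour (only in-order packets grow
    the window); count_ofo=True models the behaviour after this commit.
    """
    rcv_win = 20 * mss                  # small initial window, about 20 MSS
    for payload, truesize, in_order in packets:
        if in_order or count_ofo:
            rcv_win = grow_rcv_window(rcv_win, payload, truesize,
                                      sk_rcvbuf, mss)
    return rcv_win
```

With a burst of out-of-order packets, the model with count_ofo=False leaves the window at its initial value, while count_ofo=True lets it keep growing.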
    • tcp: fix no cwnd growth after timeout · 16edfe7e
      Committed by Yuchung Cheng
      Commit 0f7cc9a3 "tcp: increase throughput when reordering is high"
      only allows cwnd to increase in Open state. This mistakenly disables
      slow start after timeout (CA_Loss). Moreover, cwnd won't grow if the
      state moves from Disorder to Open later in tcp_fastretrans_alert().
      
      Therefore the correct logic should be to allow cwnd to grow as long
      as the data is received in order in Open, Loss, or even Disorder state.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
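The difference between the two rules can be written down as a pair of predicates. This is a minimal sketch of the logic as the commit message describes it, not the kernel function itself: the names `CaState`, `may_raise_cwnd_before_fix`, and `may_raise_cwnd_after_fix` are hypothetical, the high-reordering case of commit 0f7cc9a3 is left out, and CWR and Recovery are assumed to never grow cwnd because they are cwnd-reduction states.

```python
from enum import Enum


class CaState(Enum):
    OPEN = "Open"
    DISORDER = "Disorder"
    CWR = "CWR"
    RECOVERY = "Recovery"
    LOSS = "Loss"


def may_raise_cwnd_before_fix(state, in_order_data_acked):
    # Rule described for commit 0f7cc9a3 (low reordering): Open state only.
    return state is CaState.OPEN and in_order_data_acked


def may_raise_cwnd_after_fix(state, in_order_data_acked):
    # Rule described by this fix: in-order data in Open, Loss, or Disorder.
    growable = (CaState.OPEN, CaState.LOSS, CaState.DISORDER)
    return in_order_data_acked and state in growable
```

Under the old rule a connection in Loss state never passes the check, which is why slow start after a timeout was disabled.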
  6. 06 Sep, 2013 1 commit
  7. 05 Sep, 2013 1 commit
  8. 04 Sep, 2013 9 commits
  9. 03 Sep, 2013 1 commit
  10. 01 Sep, 2013 1 commit
  11. 31 Aug, 2013 3 commits
  12. 30 Aug, 2013 4 commits
    • ipv4: sendto/hdrincl: don't use destination address found in header · c27c9322
      Committed by Chris Clark
      ipv4: raw_sendmsg: don't use header's destination address
      
      A sendto() regression was bisected and found to start with commit
      f8126f1d (ipv4: Adjust semantics of rt->rt_gateway.)
      
      The problem is that it tries to ARP-lookup the constructed packet's
      destination address rather than the explicitly provided address.
      
      Fix this using FLOWI_FLAG_KNOWN_NH so that given nexthop is used.
      
      cf. commit 2ad5b9e4
      
      Reported-by: Chris Clark <chris.clark@alcatel-lucent.com>
      Bisected-by: Chris Clark <chris.clark@alcatel-lucent.com>
      Tested-by: Chris Clark <chris.clark@alcatel-lucent.com>
      Suggested-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Chris Clark <chris.clark@alcatel-lucent.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: TSO packets automatic sizing · 95bd09eb
      Committed by Eric Dumazet
      After hearing many people over past years complaining about TSO being
      bursty or even buggy, we are proud to present automatic sizing of TSO
      packets.
      
      One part of the problem is that tcp_tso_should_defer() uses a heuristic
      relying on upcoming ACKs instead of a timer, but more generally, having
      big TSO packets makes little sense for low rates, as it tends to create
      micro bursts on the network, and general consensus is to reduce the
      buffering amount.
      
      This patch introduces a per socket sk_pacing_rate, that approximates
      the current sending rate, and allows us to size the TSO packets so
      that we try to send one packet every ms.
      
      This field could be set by other transports.
      
      Patch has no impact for high speed flows, where having large TSO packets
      makes sense to reach line rate.
      
      For other flows, this helps better packet scheduling and ACK clocking.
      
      This patch increases performance of TCP flows in lossy environments.
      
      A new sysctl (tcp_min_tso_segs) is added, to specify the
      minimal size of a TSO packet (default being 2).
      
      A follow-up patch will provide a new packet scheduler (FQ), using
      sk_pacing_rate as an input to perform optional per flow pacing.
      
      This explains why we chose to set sk_pacing_rate to twice the current
      rate, allowing 'slow start' ramp up.
      
      sk_pacing_rate = 2 * cwnd * mss / srtt
      
      v2: Neal Cardwell reported a suspect deferring of last two segments on
      initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
      into account tp->xmit_size_goal_segs
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
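The formula in this commit message, sk_pacing_rate = 2 * cwnd * mss / srtt, and the goal of roughly one TSO packet per millisecond can be worked through numerically. The sketch below is an arithmetic illustration under stated assumptions, not the kernel code: the function names are hypothetical, srtt is taken in whole milliseconds, and upper limits on TSO packet size that a real stack applies are ignored. The floor of 2 segments mirrors the default of tcp_min_tso_segs given in the message.

```python
def pacing_rate(cwnd, mss, srtt_ms):
    """sk_pacing_rate = 2 * cwnd * mss / srtt, returned in bytes per second."""
    return 2 * cwnd * mss * 1000 // srtt_ms


def tso_segs_per_ms(rate_bytes_per_s, mss, min_tso_segs=2):
    """Segments that fit into one millisecond at the given rate.

    The result never drops below min_tso_segs (default 2, as for the
    tcp_min_tso_segs sysctl described in the commit message).
    """
    bytes_per_ms = rate_bytes_per_s // 1000
    return max(bytes_per_ms // mss, min_tso_segs)
```

For a slow flow (cwnd of 10 segments, 1460-byte MSS, 100 ms srtt) the rate is 292000 bytes per second, far less than one segment per millisecond, so the floor of 2 segments applies. For a fast flow (cwnd of 1000 segments, 10 ms srtt) the same arithmetic yields 200 segments per millisecond before any upper limit.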
    • tcp: don't apply tsoffset if rcv_tsecr is zero · e3e12028
      Committed by Andrew Vagin
      The zero value means that tsecr is not valid, so it's a special case.
      
      tsoffset is used to customize tcp_time_stamp for one socket.
      tsoffset is usually zero; it is used when a socket was moved from one
      host to another host.
      
      Currently this issue affects the logic of tcp_rcv_rtt_measure_ts. Due to
      an incorrect value of rcv_tsecr, tcp_rcv_rtt_measure_ts sets rto to
      TCP_RTO_MAX.
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: initialize rcv_tstamp for restored sockets · c7781a6e
      Committed by Andrew Vagin
      u32 rcv_tstamp;     /* timestamp of last received ACK */
      
      Its value is used in tcp_retransmit_timer, which closes the socket
      if the last ack was received more than TCP_RTO_MAX ago.
      
      Currently rcv_tstamp is initialized to zero, and if tcp_retransmit_timer
      is called before receiving a first ack, the connection is closed.
      
      This patch initializes rcv_tstamp to a timestamp when a socket is
      restored.
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
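Why a zero rcv_tstamp closes a restored connection can be seen from the age check alone. The following is a simplified model, not the kernel's tcp_retransmit_timer(): the function name `timer_gives_up` is hypothetical, HZ = 1000 is an assumed tick rate, plain integers replace the kernel's wrapping 32-bit tick counter, and the other conditions the real timer checks are ignored. TCP_RTO_MAX is 120 seconds in the kernel.

```python
HZ = 1000                  # assumed tick rate for this example
TCP_RTO_MAX = 120 * HZ     # 120 seconds expressed in ticks


def timer_gives_up(now, rcv_tstamp):
    """True if the last ACK looks older than TCP_RTO_MAX."""
    return now - rcv_tstamp > TCP_RTO_MAX
```

On a host that has been up for a while, `timer_gives_up(now, 0)` is true even though no ACK was ever missed, whereas setting rcv_tstamp to the restore time makes the check start from a sane baseline.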
  13. 28 Aug, 2013 5 commits
    • netfilter: add SYNPROXY core/target · 48b1de4c
      Committed by Patrick McHardy
      Add a SYNPROXY for netfilter. The code is split into two parts, the synproxy
      core with common functions and an address family specific target.
      
      The SYNPROXY receives the connection request from the client, responds with
      a SYN/ACK containing a SYN cookie and announcing a zero window and checks
      whether the final ACK from the client contains a valid cookie.
      
      It then establishes a connection to the original destination and, if
      successful, sends a window update to the client with the window size
      announced by the server.
      
      Support for timestamps, SACK, window scaling and MSS options can be
      statically configured as target parameters if the features of the server
      are known. If timestamps are used, the timestamp value sent back to
      the client in the SYN/ACK will be different from the real timestamp of
      the server. In order not to break PAWS, the timestamps are translated in
      the direction server->client.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Tested-by: Martin Topholm <mph@one.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • net: syncookies: export cookie_v4_init_sequence/cookie_v4_check · 0198230b
      Committed by Patrick McHardy
      Extract the local TCP stack independent parts of tcp_v4_init_sequence()
      and cookie_v4_check() and export them for use by the upcoming SYNPROXY
      target.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Acked-by: David S. Miller <davem@davemloft.net>
      Tested-by: Martin Topholm <mph@one.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_conntrack: make sequence number adjustments usuable without NAT · 41d73ec0
      Committed by Patrick McHardy
      Split out sequence number adjustments from NAT and move them to the conntrack
      core to make them usable for SYN proxying. The sequence number adjustment
      information is moved to a separate extend. The extend is added to new
      conntracks when a NAT mapping is set up for a connection using a helper.
      
      As a side effect, this saves 24 bytes per connection with NAT in the common
      case that a connection does not have a helper assigned.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Tested-by: Martin Topholm <mph@one.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: ip[6]t_REJECT: tcp-reset using wrong MAC source if bridged · affe759d
      Committed by Phil Oester
      As reported by Casper Gripenberg, in a bridged setup, using ip[6]t_REJECT
      with the tcp-reset option sends out reset packets with the src MAC address
      of the local bridge interface, instead of the MAC address of the intended
      destination.  This causes some routers/firewalls to drop the reset packet
      as it appears to be spoofed.  Fix this by bypassing ip[6]_local_out and
      setting the MAC of the sender in the tcp reset packet.
      
      This closes netfilter bugzilla #531.
      Signed-off-by: Phil Oester <kernel@linuxace.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • net: tcp_probe: allow more advanced ingress filtering by mark · b1dcdc68
      Committed by Daniel Borkmann
      Currently, the tcp_probe snooper can either filter packets by a given
      port (handed to the module via module parameter e.g. port=80) or lets
      all TCP traffic pass (port=0, default). When a port is specified, the
      port number is tested against the sk's source/destination port. Thus,
      if one of them matches, the information will be further processed for
      the log.
      
      As this is quite limited, allow for more advanced filtering possibilities
      which can facilitate debugging/analysis with the help of the tcp_probe
      snooper. Therefore, similarly as added to BPF machine in commit 7e75f93e
      ("pkt_sched: ingress socket filter by mark"), add the possibility to
      use skb->mark as a filter.
      
      If the mark is not being used otherwise, this allows ingress filtering
      by flow (e.g. in order to track updates from only a single flow, or a
      subset of all flows for a given port) and other things such as dynamic
      logging and reconfiguration without removing/re-inserting the tcp_probe
      module, etc. Simple example:
      
        insmod net/ipv4/tcp_probe.ko fwmark=8888 full=1
        ...
        iptables -A INPUT -i eth4 -t mangle -p tcp --dport 22 \
                 --sport 60952 -j MARK --set-mark 8888
        [... sampling interval ...]
        iptables -D INPUT -i eth4 -t mangle -p tcp --dport 22 \
                 --sport 60952 -j MARK --set-mark 8888
      
      The current option to filter by a given port is still being preserved. A
      similar approach could be done for the sctp_probe module as a follow-up.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
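The filtering behaviour described in this commit can be modelled as a single predicate. This is a hedged sketch and not the module's code: the function name `probe_matches` is hypothetical, and how the port and fwmark filters combine when both are set (here: either one matching is enough) is an assumption. The parts taken from the commit message are that port=0 with no mark lets all traffic pass, that a port is compared against both source and destination port, and that skb->mark can be used as a filter.

```python
def probe_matches(sport, dport, skb_mark, port=0, fwmark=0):
    """Decide whether a packet would be logged (illustrative model only)."""
    if port == 0 and fwmark == 0:
        return True                      # no filter configured
    if port != 0 and port in (sport, dport):
        return True                      # port filter matched
    # Assumed combination rule: a matching mark is sufficient on its own.
    return fwmark != 0 and skb_mark == fwmark
```

With fwmark=8888, as in the iptables example of the commit message, only packets that the mangle rule has marked with 8888 pass the check.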
  14. 26 Aug, 2013 2 commits
  15. 23 Aug, 2013 4 commits
    • net: tcp_probe: add IPv6 support · f925d0a6
      Committed by Daniel Borkmann
      The tcp_probe currently only supports analysis of IPv4 connections.
      Therefore, it would be nice to have IPv6 supported as well. Since we
      have the recently added %pISpc specifier that is IPv4/IPv6 generic,
      build related sockaddress structures from the flow information and
      pass this to our format string. Tested with SSH and HTTP sessions
      on IPv4 and IPv6.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: tcp_probe: kprobes: adapt jtcp_rcv_established signature · d8cdeda6
      Committed by Daniel Borkmann
      This patch fixes a rather unproblematic function signature mismatch
      as the const specifier was missing for the th variable; and next to
      that it adds a build-time assertion so that future function signature
      mismatches for kprobes will not end badly, similarly as commit 22222997
      ("net: sctp: add build check for sctp_sf_eat_sack_6_2/jsctp_sf_eat_sack")
      did it for SCTP.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: tcp_probe: also include rcv_wnd next to snd_wnd · b4c1c1d0
      Committed by Daniel Borkmann
      It is sometimes helpful to know the TCP window sizes of an established
      socket, e.g. to confirm that window scaling is working or to tweak the
      window size to improve high-latency connections, etc. Currently the
      TCP snooper only exports the send window size, but not the receive window
      size. Therefore, also add the receive window size to the end of the
      output line.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: increase throughput when reordering is high · 0f7cc9a3
      Committed by Yuchung Cheng
      The stack currently detects reordering and avoids spurious
      retransmission very well. However, the throughput is sub-optimal under
      high reordering because cwnd is increased only if the data is delivered
      in order, i.e., the FLAG_DATA_ACKED check in tcp_ack(). The more packets
      are reordered, the worse the throughput is.
      
      Therefore when reordering is proven high, cwnd should advance whenever
      the data is delivered regardless of its ordering. If reordering is low,
      conservatively advance cwnd only on ordered deliveries in Open state,
      and retain cwnd in Disordered state (RFC5681).
      
      Tested using netperf on a qdisc setup of 20Mbps BW and random RTT from
      45ms to 55ms (for reordering effect). This change increases TCP
      throughput by 20 - 25% to near bottleneck BW.
      
      A special case is the stretched ACK with new SACK and/or ECE mark.
      For example, a receiver may receive an out of order or ECN packet with
      unacked data buffered because of LRO or delayed ACK. The principle on
      such an ACK is to advance cwnd on the cumulative acked part first,
      then reduce cwnd in tcp_fastretrans_alert().
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 21 Aug, 2013 1 commit