1. 29 Sep 2013, 3 commits
    • net: introduce SO_MAX_PACING_RATE · 62748f32
      Committed by Eric Dumazet
      As mentioned in commit afe4fd06 ("pkt_sched: fq: Fair Queue packet
      scheduler"), this patch adds a new socket option.
      
      SO_MAX_PACING_RATE offers the application the ability to cap the
      rate computed by the transport layer. The value is in bytes per second.
      
      u32 val = 1000000;
      setsockopt(sockfd, SOL_SOCKET, SO_MAX_PACING_RATE, &val, sizeof(val));
      
      To be effectively paced, a flow must use the FQ packet scheduler.
      
      Note that a packet scheduler takes into account the headers for its
      computations. The effective payload rate depends on MSS and retransmits
      if any.
      
      I chose to make this pacing rate a SOL_SOCKET option instead of a
      TCP one because this can be used by other protocols.
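      
      A minimal userspace sketch of using the option (hedged: SO_MAX_PACING_RATE
      may be missing from older headers; the fallback value 47 matches
      asm-generic/socket.h but differs on a few architectures):
      
      #include <stdio.h>
      #include <sys/socket.h>
      
      #ifndef SO_MAX_PACING_RATE
      #define SO_MAX_PACING_RATE 47   /* asm-generic value; arch headers may differ */
      #endif
      
      /* Cap the kernel-computed pacing rate of sockfd to 1 MB/s. */
      static int cap_pacing_rate(int sockfd)
      {
          unsigned int val = 1000000;   /* bytes per second */
      
          if (setsockopt(sockfd, SOL_SOCKET, SO_MAX_PACING_RATE, &val, sizeof(val)) < 0) {
              perror("setsockopt(SO_MAX_PACING_RATE)");   /* e.g. pre-3.13 kernel */
              return -1;
          }
          return 0;
      }
      
      Without the FQ qdisc attached to the egress device, the cap is recorded on
      the socket but the flow is not actually paced.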
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Steinar H. Gunderson <sesse@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      62748f32
    • ipv4: processing ancillary IP_TOS or IP_TTL · aa661581
      Committed by Francesco Fusco
      If IP_TOS or IP_TTL are specified as ancillary data, then sendmsg() sends out
      packets with the specified TTL or TOS overriding the socket values specified
      with the traditional setsockopt().
      
      The struct inet_cork stores the values of TOS, TTL and priority that are
      passed through the struct ipcm_cookie. If there are user-specified TOS
      (tos != -1) or TTL (ttl != 0) in the struct ipcm_cookie, these values are
      used to override the per-socket values. In the case of TOS, the priority
      is also changed accordingly.
      
      Two helper functions get_rttos and get_rtconn_flags are defined to take
      into account the presence of a user specified TOS value when computing
      RT_TOS and RT_CONN_FLAGS.
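      
      A hedged userspace sketch of the new capability (the helper name is made up,
      and it assumes int-sized cmsg payloads for both IP_TTL and IP_TOS):
      
      #include <string.h>
      #include <netinet/in.h>
      #include <sys/socket.h>
      
      /* Send one datagram with a per-call TTL and TOS, overriding the socket
       * defaults for this message only. */
      static int send_with_ttl_tos(int sockfd, const struct sockaddr_in *dst,
                                   const void *buf, size_t len, int ttl, int tos)
      {
          union {
              char raw[CMSG_SPACE(sizeof(int)) * 2];
              struct cmsghdr align;
          } cbuf;
          struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
          struct msghdr msg = {
              .msg_name = (void *)dst, .msg_namelen = sizeof(*dst),
              .msg_iov = &iov, .msg_iovlen = 1,
              .msg_control = cbuf.raw, .msg_controllen = sizeof(cbuf.raw),
          };
          struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
      
          cmsg->cmsg_level = IPPROTO_IP;      /* per-call TTL */
          cmsg->cmsg_type  = IP_TTL;
          cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
          memcpy(CMSG_DATA(cmsg), &ttl, sizeof(int));
      
          cmsg = CMSG_NXTHDR(&msg, cmsg);
          cmsg->cmsg_level = IPPROTO_IP;      /* per-call TOS */
          cmsg->cmsg_type  = IP_TOS;
          cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
          memcpy(CMSG_DATA(cmsg), &tos, sizeof(int));
      
          return sendmsg(sockfd, &msg, 0) < 0 ? -1 : 0;
      }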
      Signed-off-by: Francesco Fusco <ffusco@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aa661581
    • ipv4: IP_TOS and IP_TTL can be specified as ancillary data · f02db315
      Committed by Francesco Fusco
      This patch enables the IP_TTL and IP_TOS values passed from userspace to
      be stored in the ipcm_cookie struct. Three fields are added to the struct:
      
      - the TTL, expressed as __u8.
        The allowed values are in the range [1, 255].
        A value of 0 means that the TTL is not specified.
      
      - the TOS, expressed as __s16.
        The allowed values are in the range [0,255].
        A value of -1 means that the TOS is not specified.
      
      - the priority, expressed as a char and computed when
        handling the ancillary data.
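      
      Paraphrasing the additions as a standalone fragment (illustrative only;
      the rest of the real struct ipcm_cookie is omitted):
      
      /* The three fields described above, with the sentinel values spelled out. */
      struct ipcm_cookie_new_fields {
          unsigned char ttl;      /* __u8:  0 means "not specified", else 1..255  */
          short         tos;      /* __s16: -1 means "not specified", else 0..255 */
          char          priority; /* computed from tos while parsing cmsg data    */
      };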
      Signed-off-by: Francesco Fusco <ffusco@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f02db315
  2. 24 Sep 2013, 3 commits
    • tcp: fix dynamic right sizing · b0983d3c
      Committed by Eric Dumazet
      Dynamic Right Sizing (DRS) is supposed to open the TCP receive window
      automatically, but suffers from two bugs, presented in order
      of importance.
      
      1) tcp_rcv_space_adjust() fix :
      
      Using twice the last received amount is very pessimistic,
      because it doesn't allow fast recovery or a proper slow start
      ramp up if the sender wants to increase cwnd by 100% every RTT.
      
      copied = bytes received in previous RTT
      
      2*copied = bytes we expect to receive in next RTT
      
      4*copied = bytes we need to advertise in rwin at end of next RTT
      
      DRS is one RTT late, so it needs a 4x factor.
      
      If the sender is not using ABC and increases cwnd by 50% every RTT,
      then a 1.5*1.5 = 2.25 factor was enough.
      This is probably why this bug was not really noticed.
      
      2) There is no window adjustment after the first RTT. DRS triggers only
        after the second RTT.
        DRS needs two RTTs to initialize, so tcp_fixup_rcvbuf() should set up
        sk_rcvbuf to allow proper window growth during the first two RTTs.
      
      This patch increases TCP efficiency particularly for large-RTT flows
      when autotuning is used at the receiver, and more particularly
      in the presence of packet losses.
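      
      A tiny standalone illustration of the 4x factor (plain arithmetic, not the
      kernel code; the byte count is made up):
      
      #include <stdio.h>
      
      int main(void)
      {
          unsigned int copied   = 100000;         /* bytes received in previous RTT   */
          unsigned int next_rtt = 2 * copied;     /* sender may double cwnd every RTT */
          unsigned int rcvwin   = 2 * next_rtt;   /* DRS reacts one RTT late -> 4x    */
      
          printf("advertise at least %u bytes (4 * %u)\n", rcvwin, copied);
          return 0;
      }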
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b0983d3c
    • tcp: syncookies: reduce mss table to four values · 08629354
      Committed by Florian Westphal
      Halve mss table size to make blind cookie guessing more difficult.
      This is sad since the tables were already small, but there
      is little alternative except perhaps adding more precise mss information
      in the tcp timestamp.  Timestamps are unfortunately not ubiquitous.
      
      Guessing all possible cookie values still gives only an 8-in-2**32 chance.
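      
      For illustration, a four-entry table of plausible MSS values and the
      resulting odds (the concrete values are placeholders, not copied from the
      kernel source):
      
      /* Illustrative only: four common MSS values. */
      static const unsigned short msstab_example[4] = { 536, 1300, 1440, 1460 };
      
      /* With 4 MSS slots and 2 acceptable counter values (see the 128 s lifetime
       * patch below), a blind guess matches one of 4 * 2 = 8 cookies out of 2**32. */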
      Reported-by: Jakob Lell <jakob@jakoblell.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      08629354
    • tcp: syncookies: reduce cookie lifetime to 128 seconds · 8c27bd75
      Committed by Florian Westphal
      We currently accept cookies that were created less than 4 minutes ago
      (ie, cookies with counter delta 0-3).  Combined with the 8 mss table
      values, this yields 32 possible values (out of 2**32) that will be valid.
      
      Reducing the lifetime to < 2 minutes halves the guessing chance while
      still providing a large enough period.
      
      While at it, get rid of the jiffies value -- it overflows too quickly on
      32-bit platforms.
      
      getnstimeofday is used to create a counter that increments every 64s.
      perf shows the getnstimeofday cost is negligible compared to sha_transform;
      normal tcp initial sequence number generation uses getnstimeofday, too.
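      
      A userspace approximation of that counter (illustrative; the kernel helper
      itself is not reproduced here):
      
      #include <stdint.h>
      #include <time.h>
      
      /* A counter that increments every 64 seconds, derived from wall-clock
       * seconds rather than jiffies, so it does not wrap quickly on 32-bit. */
      static uint32_t cookie_time_bucket(void)
      {
          struct timespec now;
      
          clock_gettime(CLOCK_REALTIME, &now);
          return (uint32_t)(now.tv_sec >> 6);   /* 64-second granularity */
      }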
      Reported-by: Jakob Lell <jakob@jakoblell.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8c27bd75
  3. 20 Sep 2013, 2 commits
  4. 18 Sep 2013, 1 commit
  5. 13 Sep 2013, 1 commit
  6. 07 Sep 2013, 2 commits
    • tcp: properly increase rcv_ssthresh for ofo packets · 4e4f1fc2
      Committed by Eric Dumazet
      TCP receive window handling is multi-staged.
      
      A socket has a memory budget, static or dynamic, in sk_rcvbuf.
      
      Because we do not really know how this memory budget translates to
      a TCP window (payload), TCP announces a small initial window
      (about 20 MSS).
      
      When a packet is received, we increase TCP rcv_win depending
      on the payload/truesize ratio of this packet. Good citizen
      packets give a hint that it's reasonable to have rcv_win = sk_rcvbuf/2
      
      This heuristic takes place in tcp_grow_window()
      
      The problem is that we currently call tcp_grow_window() only for in-order
      packets.
      
      This means that reordering or packet losses stop proper growth of
      rcv_win, and senders are unable to benefit from fast recovery
      or proper reordering level detection.
      
      Really, a packet stored in the OFO queue is not a bad citizen.
      It should be part of the game just like in-order packets.
      
      In our traces, we very often see the sender limited by small Linux
      receive windows, even though the Linux hosts use autotuning (DRS) and
      should allow rcv_win to grow to ~3MB.
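      
      A rough illustration of the "good citizen" idea (not the kernel heuristic;
      the 1/2 threshold is an assumption of this sketch):
      
      /* Treat a segment as a good citizen when its payload accounts for at
       * least half of its total memory footprint (truesize); the patch applies
       * the same kind of credit to segments queued out of order. */
      static int segment_is_good_citizen(unsigned int payload_len, unsigned int truesize)
      {
          return 2 * payload_len >= truesize;
      }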
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4e4f1fc2
    • tcp: fix no cwnd growth after timeout · 16edfe7e
      Committed by Yuchung Cheng
      Commit 0f7cc9a3 ("tcp: increase throughput when reordering is high")
      only allows cwnd to increase in the Open state. This mistakenly disables
      slow start after a timeout (CA_Loss). Moreover, cwnd won't grow if the
      state moves from Disorder to Open later in tcp_fastretrans_alert().
      
      Therefore the correct logic should be to allow cwnd to grow as long
      as the data is received in order in Open, Loss, or even Disorder state.
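      
      A condensed sketch of that rule (state names mirror the description above;
      this is not the kernel function):
      
      enum ca_state { CA_OPEN, CA_DISORDER, CA_CWR, CA_RECOVERY, CA_LOSS };
      
      /* Grow cwnd whenever data is ACKed in order, unless the connection is in
       * a window-reduction state (CWR or Recovery). */
      static int may_raise_cwnd(enum ca_state state, int data_acked_in_order)
      {
          if (!data_acked_in_order)
              return 0;
          return state != CA_CWR && state != CA_RECOVERY;
      }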
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      16edfe7e
  7. 06 Sep 2013, 1 commit
  8. 05 Sep 2013, 1 commit
  9. 04 Sep 2013, 9 commits
  10. 03 Sep 2013, 1 commit
  11. 01 Sep 2013, 1 commit
  12. 31 Aug 2013, 3 commits
  13. 30 Aug 2013, 4 commits
    • ipv4: sendto/hdrincl: don't use destination address found in header · c27c9322
      Committed by Chris Clark
      ipv4: raw_sendmsg: don't use header's destination address
      
      A sendto() regression was bisected and found to start with commit
      f8126f1d (ipv4: Adjust semantics of rt->rt_gateway.)
      
      The problem is that it tries to ARP-lookup the constructed packet's
      destination address rather than the explicitly provided address.
      
      Fix this using FLOWI_FLAG_KNOWN_NH so that the given nexthop is used.
      
      cf. commit 2ad5b9e4
      Reported-by: Chris Clark <chris.clark@alcatel-lucent.com>
      Bisected-by: Chris Clark <chris.clark@alcatel-lucent.com>
      Tested-by: Chris Clark <chris.clark@alcatel-lucent.com>
      Suggested-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: Chris Clark <chris.clark@alcatel-lucent.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c27c9322
    • tcp: TSO packets automatic sizing · 95bd09eb
      Committed by Eric Dumazet
      After hearing many people over past years complaining against TSO being
      bursty or even buggy, we are proud to present automatic sizing of TSO
      packets.
      
      One part of the problem is that tcp_tso_should_defer() uses a heuristic
      relying on upcoming ACKs instead of a timer, but more generally, having
      big TSO packets makes little sense for low rates, as it tends to create
      micro bursts on the network, and the general consensus is to reduce the
      buffering amount.
      
      This patch introduces a per-socket sk_pacing_rate, which approximates
      the current sending rate and allows us to size the TSO packets so
      that we try to send one packet every ms.
      
      This field could be set by other transports.
      
      Patch has no impact for high speed flows, where having large TSO packets
      makes sense to reach line rate.
      
      For other flows, this helps better packet scheduling and ACK clocking.
      
      This patch increases performance of TCP flows in lossy environments.
      
      A new sysctl (tcp_min_tso_segs) is added, to specify the
      minimal size of a TSO packet (default being 2).
      
      A follow-up patch will provide a new packet scheduler (FQ), using
      sk_pacing_rate as an input to perform optional per flow pacing.
      
      This explains why we chose to set sk_pacing_rate to twice the current
      rate, allowing 'slow start' ramp up.
      
      sk_pacing_rate = 2 * cwnd * mss / srtt
      
      v2: Neal Cardwell reported a suspect deferring of the last two segments on
      an initial write of 10 MSS; I had to change tcp_tso_should_defer() to take
      into account tp->xmit_size_goal_segs.
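      
      A worked example of the sizing rule (plain arithmetic, not kernel code;
      the sample cwnd, MSS and RTT are made up):
      
      #include <stdio.h>
      #include <stdint.h>
      
      int main(void)
      {
          uint64_t cwnd = 40, mss = 1448;     /* example values                 */
          double   srtt = 0.050;              /* 50 ms smoothed RTT, in seconds */
      
          double pacing_rate = 2.0 * cwnd * mss / srtt;   /* bytes per second   */
          double budget_1ms  = pacing_rate / 1000.0;      /* aim for 1 pkt/ms   */
          unsigned int tso_segs = (unsigned int)(budget_1ms / mss);
      
          if (tso_segs < 2)                   /* tcp_min_tso_segs default       */
              tso_segs = 2;
      
          printf("pacing %.0f B/s -> about %u MSS per TSO packet\n",
                 pacing_rate, tso_segs);
          return 0;
      }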
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      95bd09eb
    • tcp: don't apply tsoffset if rcv_tsecr is zero · e3e12028
      Committed by Andrew Vagin
      The zero value means that tsecr is not valid, so it's a special case.
      
      tsoffset is used to customize tcp_time_stamp for one socket.
      tsoffset is usually zero; it's used when a socket was moved from one
      host to another.
      
      Currently this issue affects the logic of tcp_rcv_rtt_measure_ts. Due to
      the incorrect value of rcv_tsecr, tcp_rcv_rtt_measure_ts sets rto to
      TCP_RTO_MAX.
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e3e12028
    • tcp: initialize rcv_tstamp for restored sockets · c7781a6e
      Committed by Andrew Vagin
      u32 rcv_tstamp;     /* timestamp of last received ACK */
      
      Its value is used in tcp_retransmit_timer, which closes the socket
      if the last ACK was received more than TCP_RTO_MAX ago.
      
      Currently rcv_tstamp is initialized to zero, and if tcp_retransmit_timer
      is called before the first ACK is received, the connection is closed.
      
      This patch initializes rcv_tstamp to the current timestamp when a socket
      is restored.
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c7781a6e
  14. 28 Aug 2013, 5 commits
    • netfilter: add SYNPROXY core/target · 48b1de4c
      Committed by Patrick McHardy
      Add a SYNPROXY for netfilter. The code is split into two parts: the synproxy
      core with common functions and an address-family-specific target.
      
      The SYNPROXY receives the connection request from the client, responds with
      a SYN/ACK containing a SYN cookie and announcing a zero window and checks
      whether the final ACK from the client contains a valid cookie.
      
      It then establishes a connection to the original destination and, if
      successful, sends a window update to the client with the window size
      announced by the server.
      
      Support for timestamps, SACK, window scaling and MSS options can be
      statically configured as target parameters if the features of the server
      are known. If timestamps are used, the timestamp value sent back to
      the client in the SYN/ACK will be different from the real timestamp of
      the server. In order not to break PAWS, the timestamps are translated in
      the server->client direction.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Tested-by: Martin Topholm <mph@one.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      48b1de4c
    • net: syncookies: export cookie_v4_init_sequence/cookie_v4_check · 0198230b
      Committed by Patrick McHardy
      Extract the parts of tcp_v4_init_sequence() and cookie_v4_check() that are
      independent of the local TCP stack and export them for use by the upcoming
      SYNPROXY target.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Acked-by: David S. Miller <davem@davemloft.net>
      Tested-by: Martin Topholm <mph@one.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      0198230b
    • netfilter: nf_conntrack: make sequence number adjustments usable without NAT · 41d73ec0
      Committed by Patrick McHardy
      Split out sequence number adjustments from NAT and move them to the conntrack
      core to make them usable for SYN proxying. The sequence number adjustment
      information is moved to a separate extension. The extension is added to new
      conntracks when a NAT mapping is set up for a connection using a helper.
      
      As a side effect, this saves 24 bytes per connection with NAT in the common
      case that a connection does not have a helper assigned.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Tested-by: Martin Topholm <mph@one.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      41d73ec0
    • netfilter: ip[6]t_REJECT: tcp-reset using wrong MAC source if bridged · affe759d
      Committed by Phil Oester
      As reported by Casper Gripenberg, in a bridged setup, using ip[6]t_REJECT
      with the tcp-reset option sends out reset packets with the src MAC address
      of the local bridge interface, instead of the MAC address of the intended
      destination.  This causes some routers/firewalls to drop the reset packet
      as it appears to be spoofed.  Fix this by bypassing ip[6]_local_out and
      setting the MAC of the sender in the tcp reset packet.
      
      This closes netfilter bugzilla #531.
      Signed-off-by: Phil Oester <kernel@linuxace.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      affe759d
    • net: tcp_probe: allow more advanced ingress filtering by mark · b1dcdc68
      Committed by Daniel Borkmann
      Currently, the tcp_probe snooper can either filter packets by a given
      port (handed to the module via module parameter e.g. port=80) or lets
      all TCP traffic pass (port=0, default). When a port is specified, the
      port number is tested against the sk's source/destination port. Thus,
      if one of them matches, the information will be further processed for
      the log.
      
      As this is quite limited, allow for more advanced filtering possibilities
      which can facilitate debugging/analysis with the help of the tcp_probe
      snooper. Therefore, similarly to what was added to the BPF machine in commit
      7e75f93e ("pkt_sched: ingress socket filter by mark"), add the possibility to
      use skb->mark as a filter.
      
      If the mark is not being used otherwise, this allows ingress filtering
      by flow (e.g. in order to track updates from only a single flow, or a
      subset of all flows for a given port) and other things such as dynamic
      logging and reconfiguration without removing/re-inserting the tcp_probe
      module, etc. Simple example:
      
        insmod net/ipv4/tcp_probe.ko fwmark=8888 full=1
        ...
        iptables -A INPUT -i eth4 -t mangle -p tcp --dport 22 \
                 --sport 60952 -j MARK --set-mark 8888
        [... sampling interval ...]
        iptables -D INPUT -i eth4 -t mangle -p tcp --dport 22 \
                 --sport 60952 -j MARK --set-mark 8888
      
      The current option to filter by a given port is still being preserved. A
      similar approach could be done for the sctp_probe module as a follow-up.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b1dcdc68
  15. 26 Aug 2013, 2 commits
  16. 23 Aug 2013, 1 commit