1. 16 Feb, 2013 1 commit
    • v4 GRE: Add TCP segmentation offload for GRE · 68c33163
      Pravin B Shelar committed
      The following patch adds a GRE protocol offload handler so that
      skb_gso_segment() can segment GRE packets.
      An SKB GSO CB is added to keep track of the total header length so that
      skb_segment() can push the entire header. For example, in the case of GRE,
      skb_segment() needs to push the inner and outer headers to every segment.
      A new NETIF_F_GRE_GSO feature is added for devices that support HW
      GRE TSO offload. Currently no devices support it, so GRE GSO
      always falls back to software GSO.
      
      [ Compute pkt_len before ip_local_out() invocation. -DaveM ]
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 14 Feb, 2013 3 commits
    • net: Fix possible wrong checksum generation. · c9af6db4
      Pravin B Shelar committed
      Patch cef401de (net: fix possible wrong checksum
      generation) fixed the wrong checksum calculation, but it broke TSO by
      defining a new GSO type without a netdev feature for that type.
      net_gso_ok() would not allow hardware checksum/segmentation
      offload of such packets without the feature.
      
      The following patch fixes both TSO and the wrong checksum. It uses
      the same logic that Eric Dumazet used: it introduces a new flag,
      SKBTX_SHARED_FRAG, set if at least one frag can be modified by
      the user. However, the SKBTX_SHARED_FRAG flag is kept in the skb shared
      info tx_flags rather than in gso_type.
      
      tx_flags is a better place than gso_type, since an skb can have a
      shared frag without being a GSO packet. This does not link SHARED_FRAG to
      GSO, so there is no need to define a netdev feature for it.
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: set and get per-socket timestamp · 93be6ce0
      Andrey Vagin committed
      A timestamp can be set only if a socket is in repair mode.
      
      This patch adds a new socket option, TCP_TIMESTAMP, which allows
      getting and setting the current TCP timestamp.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: adding a per-socket timestamp offset · ceaa1fef
      Andrey Vagin committed
      This functionality is used for restoring TCP sockets. A TCP timestamp
      depends on how long a system has been running, so it differs for each
      host. The solution is to set a per-socket offset.
      
      A per-socket offset for a TIME_WAIT socket is inherited from the proper
      TCP socket.
      
      tcp_request_sock doesn't have a timestamp offset, because repair
      mode is not implemented for request sockets.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
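The per-socket offset described in the commit above can be pictured as a wrapping 32-bit addition. The following is only a minimal sketch under that assumption; the function names `sketch_tsval` and `sketch_tsoffset_for_restore` are hypothetical and are not taken from the patch.

```c
#include <stdint.h>

/* Timestamp value a socket would place in its TCP timestamp option:
 * the host-wide clock plus the socket's own offset. Unsigned 32-bit
 * arithmetic wraps modulo 2^32. */
uint32_t sketch_tsval(uint32_t host_clock, uint32_t tsoffset)
{
    return host_clock + tsoffset;
}

/* Offset to install on a restored socket so that its timestamps continue
 * from saved_tsval even though the new host's clock reads differently. */
uint32_t sketch_tsoffset_for_restore(uint32_t saved_tsval,
                                     uint32_t new_host_clock)
{
    return saved_tsval - new_host_clock;
}
```

Because the subtraction and the addition both wrap, the sketch gives a continuous timestamp sequence even when the new host has been up longer than the old one.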
  3. 06 Feb, 2013 1 commit
  4. 28 Jan, 2013 1 commit
    • net: fix possible wrong checksum generation · cef401de
      Eric Dumazet committed
      Pravin Shelar mentioned that GSO could potentially generate
      wrong TX checksum if skb has fragments that are overwritten
      by the user between the checksum computation and transmit.
      
      He suggested linearizing skbs, but this extra copy can be
      avoided for normal TCP skbs cooked by tcp_sendmsg().
      
      This patch introduces a new SKB_GSO_SHARED_FRAG flag, set
      in skb_shinfo(skb)->gso_type if at least one frag can be
      modified by the user.
      
      Typical sources of such possible overwrites are {vm}splice(),
      sendfile(), and macvtap/tun/virtio_net drivers.
      
      Tested:
      
      $ netperf -H 7.7.8.84
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
      7.7.8.84 () port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    3959.52
      
      $ netperf -H 7.7.8.84 -t TCP_SENDFILE
      TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 ()
      port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    3216.80
      
      Performance of the SENDFILE is impacted by the extra allocation and
      copy, and because we use order-0 pages, while the TCP_STREAM uses
      bigger pages.
      Reported-by: Pravin Shelar <pshelar@nicira.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 23 Jan, 2013 1 commit
  6. 11 Jan, 2013 2 commits
    • tcp: fix splice() and tcp collapsing interaction · f26845b4
      Eric Dumazet committed
      Under unusual circumstances, TCP collapse can split a big GRO TCP packet
      while it's being used in a splice(socket->pipe) operation.
      
      skb_splice_bits() releases the socket lock before calling
      splice_to_pipe().
      
      [ 1081.353685] WARNING: at net/ipv4/tcp.c:1330 tcp_cleanup_rbuf+0x4d/0xfc()
      [ 1081.371956] Hardware name: System x3690 X5 -[7148Z68]-
      [ 1081.391820] cleanup rbuf bug: copied AD3BCF1 seq AD370AF rcvnxt AD3CF13
      
      To fix this problem, we must eat skbs in tcp_recv_skb().
      
      Remove the inline keyword from tcp_recv_skb() definition since
      it has three call sites.
      Reported-by: Christian Becker <c.becker@traviangames.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Tested-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: splice: fix an infinite loop in tcp_read_sock() · ff905b1e
      Eric Dumazet committed
      Commit 02275a2e (tcp: don't abort splice() after small transfers)
      introduced a regression.
      
      [   83.843570] INFO: rcu_sched self-detected stall on CPU
      [   83.844575] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 0, t=21002 jiffies, g=4457, c=4456, q=13132)
      [   83.844582] Task dump for CPU 6:
      [   83.844584] netperf         R  running task        0  8966   8952 0x0000000c
      [   83.844587]  0000000000000000 0000000000000006 0000000000006c6c 0000000000000000
      [   83.844589]  000000000000006c 0000000000000096 ffffffff819ce2bc ffffffffffffff10
      [   83.844592]  ffffffff81088679 0000000000000010 0000000000000246 ffff880c4b9ddcd8
      [   83.844594] Call Trace:
      [   83.844596]  [<ffffffff81088679>] ? vprintk_emit+0x1c9/0x4c0
      [   83.844601]  [<ffffffff815ad449>] ? schedule+0x29/0x70
      [   83.844606]  [<ffffffff81537bd2>] ? tcp_splice_data_recv+0x42/0x50
      [   83.844610]  [<ffffffff8153beaa>] ? tcp_read_sock+0xda/0x260
      [   83.844613]  [<ffffffff81537b90>] ? tcp_prequeue_process+0xb0/0xb0
      [   83.844615]  [<ffffffff8153c0f0>] ? tcp_splice_read+0xc0/0x250
      [   83.844618]  [<ffffffff814dc0c2>] ? sock_splice_read+0x22/0x30
      [   83.844622]  [<ffffffff811b820b>] ? do_splice_to+0x7b/0xa0
      [   83.844627]  [<ffffffff811ba4bc>] ? sys_splice+0x59c/0x5d0
      [   83.844630]  [<ffffffff8119745b>] ? putname+0x2b/0x40
      [   83.844633]  [<ffffffff8118bcb4>] ? do_sys_open+0x174/0x1e0
      [   83.844636]  [<ffffffff815b6202>] ? system_call_fastpath+0x16/0x1b
      
      If recv_actor() returns 0, we should stop immediately,
      because looping won't give a chance to drain the pipe.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
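The rule stated in the commit above (stop as soon as the actor consumes nothing) can be illustrated with a generic read loop. This is a simplified sketch and not the kernel's tcp_read_sock(); the callback type, the `sketch_pipe` structure, and every function name are hypothetical.

```c
#include <stddef.h>

/* Hypothetical callback type: consumes up to len bytes and returns how many
 * it took, 0 if it cannot take any right now, or a negative error. */
typedef int (*sketch_actor_t)(void *ctx, const char *data, size_t len);

/* Feed a buffer to the actor until it is drained. Returns the number of
 * bytes consumed, or the actor's negative error if nothing was consumed. */
long sketch_read_loop(const char *buf, size_t len,
                      sketch_actor_t actor, void *ctx)
{
    size_t copied = 0;

    while (copied < len) {
        int used = actor(ctx, buf + copied, len - copied);

        if (used <= 0) {
            /* 0 means the consumer is full: retrying here would spin
             * forever, because only the caller can drain it. */
            if (used < 0 && copied == 0)
                return used;
            break;
        }
        copied += (size_t)used;
    }
    return (long)copied;
}

/* Example consumer with limited room, standing in for a pipe. */
struct sketch_pipe { size_t room; };

int sketch_pipe_actor(void *ctx, const char *data, size_t len)
{
    struct sketch_pipe *p = ctx;
    size_t n = len < p->room ? len : p->room;

    (void)data;
    p->room -= n;
    return (int)n;
}
```

With a full `sketch_pipe`, the loop returns to its caller right away instead of calling the actor again, which is the behaviour the fix asks for.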
  7. 08 Jan, 2013 2 commits
  8. 03 Dec, 2012 1 commit
    • tcp: don't abort splice() after small transfers · 02275a2e
      Willy Tarreau committed
      TCP coalescing added a regression in splice(socket->pipe) performance
      for some workloads, because of the way tcp_read_sock() is implemented.
      
      The reason for this is the break when (offset + 1 != skb->len).
      
      As we released the socket lock, this condition is possible if TCP stack
      added a fragment to the skb, which can happen with TCP coalescing.
      
      So let's go back to the beginning of the loop when this happens,
      to give a chance to splice more frags per system call.
      
      Doing so fixes the issue and makes GRO 10% faster than LRO
      on CPU-bound splice() workloads instead of the opposite.
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 02 Dec, 2012 2 commits
  10. 19 Nov, 2012 1 commit
    • net: Allow userns root to control ipv4 · 52e804c6
      Eric W. Biederman committed
      Allow an unprivileged user who has created a user namespace, and then
      created a network namespace, to effectively use the new network
      namespace, by reducing capable(CAP_NET_ADMIN) and
      capable(CAP_NET_RAW) calls to ns_capable(net->user_ns,
      CAP_NET_ADMIN) or ns_capable(net->user_ns, CAP_NET_RAW) calls.
      
      Settings that merely control a single network device are allowed.
      Either the network device is a logical network device where
      restrictions make no difference, or the network device is a hardware NIC
      that has been explicitly moved from the initial network namespace.
      
      In general, policy and network stack state changes are allowed
      while resource control is left unchanged.
      
      Allow creating raw sockets.
      Allow the SIOCSARP ioctl to control the arp cache.
      Allow the SIOCSIFFLAGS ioctl to set network device flags.
      Allow the SIOCSIFADDR ioctl to set a netdevice ipv4 address.
      Allow the SIOCSIFBRDADDR ioctl to set a netdevice ipv4 broadcast address.
      Allow the SIOCSIFDSTADDR ioctl to set a netdevice ipv4 destination address.
      Allow the SIOCSIFNETMASK ioctl to set a netdevice ipv4 netmask.
      Allow the SIOCADDRT and SIOCDELRT ioctls to add and delete ipv4 routes.
      
      Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
      adding, changing and deleting gre tunnels.
      
      Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
      adding, changing and deleting ipip tunnels.
      
      Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
      adding, changing and deleting ipsec virtual tunnel interfaces.
      
      Allow setting the MRT_INIT, MRT_DONE, MRT_ADD_VIF, MRT_DEL_VIF, MRT_ADD_MFC,
      MRT_DEL_MFC, MRT_ASSERT, MRT_PIM, MRT_TABLE socket options on multicast routing
      sockets.
      
      Allow setting and receiving IPOPT_CIPSO, IPOPT_SEC, IPOPT_SID and
      arbitrary ip options.
      
      Allow setting IP_SEC_POLICY/IP_XFRM_POLICY ipv4 socket option.
      Allow setting the IP_TRANSPARENT ipv4 socket option.
      Allow setting the TCP_REPAIR socket option.
      Allow setting the TCP_CONGESTION socket option.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 16 Nov, 2012 1 commit
    • tcp: fix retransmission in repair mode · ec342325
      Andrew Vagin committed
      Currently, if a socket was repaired with a few packets in its write queue,
      a kernel bug may be triggered:
      
      kernel BUG at net/ipv4/tcp_output.c:2330!
      RIP: 0010:[<ffffffff8155784f>] tcp_retransmit_skb+0x5ff/0x610
      
      According to the initial implementation, v3.4-rc2-963-gc0e88ff0,
      all skbs should look as if they had already been posted. This patch fixes
      the code to match that statement.
      
      Here are three points which were not handled in the initial patch:
      1. The TCP send head should not be changed
      2. Initialize the TSO state of an skb
      3. Reset the retransmission time
      
      This patch moves the logic from tcp_sendmsg to tcp_write_xmit. A packet
      passes along the usual path, but isn't sent to the network. This patch
      solves all the described problems and also handles tcp_sendpages.
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 23 Oct, 2012 2 commits
  13. 19 Oct, 2012 1 commit
  14. 25 Sep, 2012 1 commit
    • net: use a per task frag allocator · 5640f768
      Eric Dumazet committed
      We currently use a per socket order-0 page cache for tcp_sendmsg()
      operations.
      
      This page is used to build fragments for skbs.
      
      It's done to increase the probability of coalescing small write() calls
      into single segments in skbs still in the write queue (not yet sent).
      
      But it wastes a lot of memory for applications handling many mostly
      idle sockets, since each socket holds one page in sk->sk_sndmsg_page.
      
      It's also quite inefficient for building TSO 64KB packets, because we need
      about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit the
      page allocator more than wanted.
      
      This patch adds a per task frag allocator and uses bigger pages,
      if available. An automatic fallback is done in case of memory pressure.
      
      (up to 32768 bytes per frag, that's order-3 pages on x86)
      
      This increases TCP stream performance by 20% on the loopback device,
      and also benefits other network devices, since 8x fewer frags are
      mapped on transmit and unmapped on tx completion. Alexander Duyck
      mentioned a probable performance win on systems with IOMMU enabled.
      
      It's possible some SG enabled hardware can't cope with bigger fragments,
      but their ndo_start_xmit() should already handle this, splitting a
      fragment into sub fragments, since some arches have PAGE_SIZE=65536.
      
      Successfully tested on various ethernet devices.
      (ixgbe, igb, bnx2x, tg3, mellanox mlx4)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
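The fragment counts quoted in the commit above follow from a ceiling division. The helper below is an illustrative sketch (the name `sketch_frags_needed` is hypothetical) that reproduces the arithmetic for a 64KB TSO packet.

```c
#include <stddef.h>

/* Number of fragments needed to hold payload bytes when each fragment
 * can carry at most frag_size bytes (ceiling division). */
size_t sketch_frags_needed(size_t payload, size_t frag_size)
{
    return (payload + frag_size - 1) / frag_size;
}
```

For a 65536-byte payload this gives 16 fragments with 4096-byte pages and 2 fragments with 32768-byte frags, which is the 8x reduction the message mentions.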
  15. 21 Sep, 2012 2 commits
  16. 20 Sep, 2012 1 commit
    • tcp: flush DMA queue before sk_wait_data if rcv_wnd is zero · 15c04175
      Michal Kubeček committed
      If the recv() syscall is called for a TCP socket so that
        - IOAT DMA is used
        - MSG_WAITALL flag is used
        - requested length is bigger than sk_rcvbuf
        - enough data has already arrived to bring rcv_wnd to zero
      then when tcp_recvmsg() gets to calling sk_wait_data(), the receive
      window can still be zero while sk_async_wait_queue exhausts
      enough space to keep it zero. As this queue isn't cleaned until
      the tcp_service_net_dma() call, sk_wait_data() cannot receive
      any data and blocks forever.
      
      If a zero receive window and a non-empty sk_async_wait_queue are
      detected before calling sk_wait_data(), process the queue first.
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
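The check described in the commit above amounts to a two-part condition evaluated before blocking. The following sketch uses a hypothetical, simplified state structure rather than the kernel's socket types.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of the receive-side state. */
struct sketch_rx_state {
    unsigned int rcv_wnd;     /* currently advertised receive window */
    size_t async_queue_len;   /* buffers parked until DMA completes */
};

/* True when waiting for more data could block forever: the window is
 * closed and only draining the parked queue can reopen it. */
bool sketch_must_flush_before_wait(const struct sketch_rx_state *s)
{
    return s->rcv_wnd == 0 && s->async_queue_len > 0;
}
```

A caller would test this just before going to sleep and process the parked queue first whenever it returns true.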
  17. 01 Sep, 2012 1 commit
    • tcp: TCP Fast Open Server - support TFO listeners · 8336886f
      Jerry Chu committed
      This patch builds on top of the previous patch to add support
      for TFO listeners. This includes:
      
      1. allocating, properly initializing, and managing the per listener
      fastopen_queue structure when TFO is enabled
      
      2. changes to the inet_csk_accept code to support TFO. E.g., the
      request_sock can no longer be freed upon accept(), not until 3WHS
      finishes
      
      3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
      if it's a TFO socket
      
      4. properly closing a TFO listener, and a TFO socket before 3WHS
      finishes
      
      5. supporting TCP_FASTOPEN socket option
      
      6. modifying tcp_check_req() so that it can check a TFO socket as well
      as a request_sock
      
      7. supporting TCP's TFO cookie option
      
      8. adding a new SYN-ACK retransmit handler to use the timer directly
      off the TFO socket rather than the listener socket. Note that TFO
      server side will not retransmit anything other than SYN-ACK until
      the 3WHS is completed.
      
      The patch also contains an important function
      "reqsk_fastopen_remove()" to manage the somewhat complex relation
      between a listener, its request_sock, and the corresponding child
      socket. See the comment above the function for the detail.
      Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
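From user space, the TCP_FASTOPEN socket option mentioned in item 5 of the commit above is set on a listening socket. The sketch below assumes a Linux system; the helper name `sketch_enable_tfo_listener` is hypothetical, and the fallback definition of the option number is only there for older libc headers.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_FASTOPEN
#define TCP_FASTOPEN 23  /* Linux option number, for older libc headers */
#endif

/* Enable Fast Open on a listening TCP socket. max_pending bounds the
 * queue of Fast Open requests that have not finished the handshake.
 * Returns 0 on success, or -1 with errno set. */
int sketch_enable_tfo_listener(int listen_fd, int max_pending)
{
    return setsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN,
                      &max_pending, sizeof(max_pending));
}
```

Whether the option takes effect also depends on system-wide Fast Open settings, which this sketch does not touch.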
  18. 02 Aug, 2012 1 commit
  19. 28 Jul, 2012 1 commit
  20. 20 Jul, 2012 1 commit
    • net-tcp: Fast Open client - sendmsg(MSG_FASTOPEN) · cf60af03
      Yuchung Cheng committed
      sendmsg() (or sendto()) with MSG_FASTOPEN is a combo of connect(2)
      and write(2). The application should replace connect() with it to
      send data in the opening SYN packet.
      
      For a blocking socket, sendmsg() blocks until all the data are buffered
      locally and the handshake is completed, like a connect() call. It
      returns an errno similar to connect() if the TCP handshake fails.
      
      For a non-blocking socket, it returns the number of bytes queued (and
      transmitted in the SYN-data packet) if a cookie is available. If a cookie
      is not available, it transmits a data-less SYN packet with a Fast Open
      cookie request option and returns -EINPROGRESS like connect().
      
      Using MSG_FASTOPEN on a connecting or connected socket will result in
      an errno similar to that of repeated connect() calls. Therefore the
      application should only use this flag on new sockets.
      
      The buffer size of sendmsg() is independent of the MSS of the connection.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
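The client-side usage described in the commit above, where one call replaces connect() plus write(), might look like the following from an application. This is a sketch assuming Linux and IPv4; the helper name `sketch_fastopen_send` is hypothetical, and the fallback flag definition is only for older libc headers.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef MSG_FASTOPEN
#define MSG_FASTOPEN 0x20000000  /* Linux value, for older libc headers */
#endif

/* Connect and send the first bytes in one call. fd must be a fresh,
 * unconnected TCP socket. Returns the number of bytes queued, or -1
 * with errno set (EINPROGRESS on a non-blocking socket when no cookie
 * is cached yet, EINVAL if ip is not a valid IPv4 address). */
ssize_t sketch_fastopen_send(int fd, const char *ip, uint16_t port,
                             const void *data, size_t len)
{
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1) {
        errno = EINVAL;
        return -1;
    }

    return sendto(fd, data, len, MSG_FASTOPEN,
                  (const struct sockaddr *)&addr, sizeof(addr));
}
```

As the message notes, the flag is only meaningful on a new socket; on one that is already connecting or connected, the call fails much like a repeated connect().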
  21. 12 Jul, 2012 1 commit
    • tcp: TCP Small Queues · 46d3ceab
      Eric Dumazet committed
      This introduces TSQ (TCP Small Queues).
      
      The goal of TSQ is to reduce the number of TCP packets in xmit queues
      (qdisc & device queues), to reduce RTT and cwnd bias, part of the
      bufferbloat problem.
      
      sk->sk_wmem_alloc is not allowed to grow above a given limit,
      allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
      given time.
      
      TSO packets are sized/capped to half the limit, so that we have two
      TSO packets in flight, allowing better bandwidth use.
      
      As a side effect, setting the limit to 40000 automatically reduces the
      standard gso max limit (65536) to 40000/2 : It can help to reduce
      latencies of high prio packets, having smaller TSO packets.
      
      This means we divert sock_wfree() to a tcp_wfree() handler, to
      queue/send following frames when skb_orphan() [2] is called for the
      already queued skbs.
      
      Results on my dev machines (tg3/ixgbe nics) are really impressive,
      using standard pfifo_fast, and with or without TSO/GSO.
      
      Without reduction of nominal bandwidth, we have reduction of buffering
      per bulk sender :
      < 1ms on Gbit (instead of 50ms with TSO)
      < 8ms on 100Mbit (instead of 132 ms)
      
      I no longer have 4 MBytes backlogged in qdisc by a single netperf
      session, and socket autotuning on both sides no longer uses 4 MBytes.
      
      As the skb destructor cannot restart xmit itself (as the qdisc lock might
      be taken at this point), we delegate the work to a tasklet. We use one
      tasklet per cpu for performance reasons.
      
      If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
      This flag is tested in a new protocol method called from release_sock(),
      to eventually send new segments.
      
      [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
      [2] skb_orphan() is usually called at TX completion time,
        but some drivers call it in their start_xmit() handler.
        These drivers should at least use BQL, or else a single TCP
        session can still fill the whole NIC TX ring, since TSQ will
        have no effect.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Dave Taht <dave.taht@bufferbloat.net>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
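The sizing rule in the commit above (TSO packets capped to half the per-socket limit) can be checked with a few lines of arithmetic. This is an illustrative sketch only; the function name `sketch_tso_cap` is hypothetical and the kernel's actual computation may differ in detail.

```c
/* Largest TSO packet size when each packet is capped to half the
 * per-socket output limit, and never above the device GSO maximum. */
unsigned int sketch_tso_cap(unsigned int limit_output_bytes,
                            unsigned int gso_max_size)
{
    unsigned int half = limit_output_bytes / 2;

    return half < gso_max_size ? half : gso_max_size;
}
```

With the ~128KB default the cap stays at the standard 65536-byte GSO maximum, while a limit of 40000 lowers it to 20000, matching the side effect the message describes.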
  22. 11 Jul, 2012 2 commits
  23. 24 May, 2012 1 commit
  24. 20 May, 2012 1 commit
  25. 18 May, 2012 2 commits
    • tcp: do_tcp_sendpages() must try to push data out on oom conditions · bad115cf
      Willy Tarreau committed
      Since recent changes on TCP splicing (starting with commits 2f533844
      "tcp: allow splice() to build full TSO packets" and 35f9c09f "tcp:
      tcp_sendpages() should call tcp_push() once"), I started seeing
      massive stalls when forwarding traffic between two sockets using
      splice() when pipe buffers were larger than socket buffers.
      
      Latest changes (net: netdev_alloc_skb() use build_skb()) made the
      problem even more apparent.
      
      The reason seems to be that if do_tcp_sendpages() fails on out of memory
      condition without being able to send at least one byte, tcp_push() is not
      called and the buffers cannot be flushed.
      
      After applying the attached patch, I cannot reproduce the stalls at all
      and the data rate is perfectly stable and steady under any condition
      which previously caused the problem to be permanent.
      
      The issue seems to have been there since before the kernel migrated to
      git, which makes me think that the stalls I occasionally experienced
      with tux during stress-tests years ago were probably related to the
      same issue.
      
      This issue was first encountered on 3.0.31 and 3.2.17, so please backport
      to -stable.
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Cc: <stable@vger.kernel.org>
    • tcp: bool conversions · a2a385d6
      Eric Dumazet committed
      bool conversions where possible.
      
      __inline__ -> inline
      
      space cleanups
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 17 May, 2012 1 commit
  27. 16 May, 2012 1 commit
  28. 11 May, 2012 1 commit
  29. 03 May, 2012 3 commits
    • net: implement tcp coalescing in tcp_queue_rcv() · b081f85c
      Eric Dumazet committed
      Extend tcp coalescing by implementing it in tcp_queue_rcv(), the main
      receive function when the application is not blocked in recvmsg().
      
      Function tcp_queue_rcv() is moved a bit to allow calling it from
      tcp_data_queue().
      
      This gives good results, especially if GRO could not kick in and if the
      skb head is a fragment.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: change tcp_adv_win_scale and tcp_rmem[2] · b49960a0
      Eric Dumazet committed
      tcp_adv_win_scale default value is 2, meaning we expect a good citizen
      skb to have skb->len / skb->truesize ratio of 75% (3/4)
      
      In 2.6 kernels we (mis)accounted for typical MSS=1460 frame :
      1536 + 64 + 256 = 1856 'estimated truesize', and 1856 * 3/4 = 1392.
      So these skbs were considered as not bloated.
      
      With recent truesize fixes, a typical MSS=1460 frame truesize is now the
      more precise :
      2048 + 256 = 2304. But 2304 * 3/4 = 1728.
      So these skbs are no longer good citizens, because 1460 < 1728
      
      (GRO can escape this problem because it builds skbs with a too low
      truesize.)
      
      This also means tcp advertises a too optimistic window for a given
      allocated rcvspace : When receiving frames, sk_rmem_alloc can hit
      sk_rcvbuf limit and we call tcp_prune_queue()/tcp_collapse() too often,
      especially when application is slow to drain its receive queue or in
      case of losses (netperf is fast, scp is slow). This is a major latency
      source.
      
      We should adjust the len/truesize ratio to 50% instead of 75%
      
      This patch :
      
      1) changes tcp_adv_win_scale default to 1 instead of 2
      
      2) increases tcp_rmem[2] limit from 4MB to 6MB to take into account
      better truesize tracking and to allow autotuning the tcp receive window to
      reach the same value as before. Note that the same amount of kernel memory
      is consumed compared to 2.6 kernels.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
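The 3/4 and 1/2 ratios in the commit above come from how tcp_adv_win_scale splits a buffer between advertised window and overhead. The sketch below shows that split; the function name is hypothetical, and treating the result as "window from truesize" is a simplification for illustration.

```c
/* Share of a buffer that may be advertised as receive window.
 * A positive scale reserves space / 2^scale for overhead; a zero or
 * negative scale advertises space / 2^(-scale). */
int sketch_win_from_space(int space, int adv_win_scale)
{
    return adv_win_scale <= 0 ? space >> (-adv_win_scale)
                              : space - (space >> adv_win_scale);
}
```

With scale 2, a 1856-byte estimate yields 1392 and a 2304-byte truesize yields 1728, the two figures quoted in the message. With scale 1, the 2304-byte truesize yields 1152, so a 1460-byte payload fits the expected ratio again.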
    • tcp: early retransmit · eed530b6
      Yuchung Cheng committed
      This patch implements RFC 5827 early retransmit (ER) for TCP.
      It reduces DUPACK threshold (dupthresh) if outstanding packets are
      less than 4 to recover losses by fast recovery instead of timeout.
      
      While the algorithm is simple, small but frequent network reordering
      makes this feature dangerous: the connection repeatedly enters
      false recovery and performance degrades. Therefore we implement
      a mitigation suggested in the appendix of the RFC that delays
      entering fast recovery by a small interval, i.e., RTT/4. Currently
      ER is conservative and is disabled for the rest of the connection
      after the first reordering event. A large scale web server
      experiment on the performance impact of ER is summarized in
      section 6 of the paper "Proportional Rate Reduction for TCP",
      IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf
      
      Note that Linux has a similar feature called THIN_DUPACK. The
      differences are that THIN_DUPACK does not mitigate reorderings and is only
      used after slow start. Currently ER is disabled if THIN_DUPACK is
      enabled. I would be happy to merge the THIN_DUPACK feature with ER if
      people think it's a good idea.
      
      ER is enabled by sysctl_tcp_early_retrans:
        0: Disables ER
      
        1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4.
      
        2: (Default) reduce dupthresh like mode 1. In addition, delay
           entering fast recovery by RTT/4.
      
      Note: mode 2 is implemented in the third part of this patch series.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
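The threshold rule listed for modes 1 and 2 in the commit above can be written as a small function. This is a sketch of the rule as the message states it, not the kernel code: the names are hypothetical, the RTT/4 delay of mode 2 is not modeled, and leaving the threshold unchanged when fewer than 2 packets are outstanding is an assumption made here.

```c
enum { SKETCH_STD_DUPTHRESH = 3 };  /* classic fast-retransmit threshold */

/* DUPACK threshold under early retransmit. mode mirrors the sysctl
 * values listed in the message (0 disables ER). With fewer than 4
 * packets outstanding the threshold drops to packets_out - 1.
 * Assumption: with fewer than 2 packets outstanding no duplicate ACK
 * can arrive, so the standard threshold is kept. */
unsigned int sketch_er_dupthresh(int mode, unsigned int packets_out)
{
    if (mode == 0 || packets_out >= 4 || packets_out < 2)
        return SKETCH_STD_DUPTHRESH;
    return packets_out - 1;
}
```

For example, with three packets outstanding and ER enabled, two duplicate ACKs are enough to trigger fast recovery instead of waiting for a timeout.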