  1. 03 Nov, 2012 (1 commit)
  2. 23 Oct, 2012 (2 commits)
  3. 23 Sep, 2012 (3 commits)
    • tcp: TCP Fast Open Server - record retransmits after 3WHS · 30099b2e
      Committed by Neal Cardwell
      When recording the number of SYNACK retransmits for servers using TCP
      Fast Open, fix the code to ensure that we copy over the retransmit
      count from the request_sock after we receive the ACK that completes
      the 3-way handshake.
      
      The story here is similar to that of SYNACK RTT
      measurements. Previously we were always doing this in
      tcp_v4_syn_recv_sock(). However, for TCP Fast Open connections
      tcp_v4_conn_req_fastopen() calls tcp_v4_syn_recv_sock() at the time we
      receive the SYN. So for TFO we must copy the final SYNACK retransmit
      count in tcp_rcv_state_process().
      
      Note that copying over the SYNACK retransmit count will give us the
      correct count since, as is mentioned in a comment in
      tcp_retransmit_timer(), before we receive an ACK for our SYN-ACK a TFO
      passive connection does not retransmit anything else (e.g., data or
      FIN segments).
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      30099b2e
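
      A minimal, self-contained sketch of the ordering this fix establishes;
      the types below are simplified stand-ins, not the kernel's real
      structs:

          /* For TFO the child socket exists before the 3WHS completes, so
           * the SYNACK retransmit count lives on the request_sock until
           * the final ACK arrives and must be copied over at that point. */
          #include <stddef.h>

          struct request_sock_model { int num_retrans; };

          struct tcp_sock_model {
              int total_retrans;
              struct request_sock_model *fastopen_rsk; /* non-NULL until 3WHS done */
          };

          void on_handshake_completing_ack(struct tcp_sock_model *tp)
          {
              struct request_sock_model *req = tp->fastopen_rsk;

              if (req) {
                  tp->total_retrans = req->num_retrans; /* copy recorded SYNACK retransmits */
                  tp->fastopen_rsk = NULL;              /* handshake is now complete */
              }
          }
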
    • tcp: TCP Fast Open Server - call tcp_validate_incoming() for all packets · e69bebde
      Committed by Neal Cardwell
      A TCP Fast Open (TFO) passive connection must call both
      tcp_check_req() and tcp_validate_incoming() for all incoming ACKs that
      are attempting to complete the 3WHS.
      
      This is needed to parallel all the action that happens for a non-TFO
      connection, where for an ACK that is attempting to complete the 3WHS
      we call both tcp_check_req() and tcp_validate_incoming().
      
      For example, upon receiving the ACK that completes the 3WHS, we need
      to call tcp_fast_parse_options() and update ts_recent based on the
      incoming timestamp value in the ACK.
      
      One symptom of the problem with the previous code was that for passive
      TFO connections using TCP timestamps, the outgoing TS ecr values
      ignored the incoming TS val value on the ACK that completed the 3WHS.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e69bebde
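
      A compact model of the control flow this commit restores; the types
      and helpers are stand-ins, not the kernel functions themselves:

          /* Both validation steps run on the ACK completing the 3WHS, so
           * e.g. ts_recent is updated from the ACK's timestamp option just
           * as it would be on a non-TFO connection. */
          #include <stdbool.h>
          #include <stdint.h>

          struct tfo_sock { uint32_t ts_recent; };
          struct segment  { bool has_ts; uint32_t ts_val; };

          static bool check_req(struct tfo_sock *sk, const struct segment *seg)
          {
              (void)sk; (void)seg;
              return true;                     /* handshake bookkeeping succeeded */
          }

          static bool validate_incoming(struct tfo_sock *sk, const struct segment *seg)
          {
              if (seg->has_ts)
                  sk->ts_recent = seg->ts_val; /* previously skipped for TFO ACKs,
                                                  leaving outgoing TS ecr stale */
              return true;
          }

          bool handle_handshake_ack(struct tfo_sock *sk, const struct segment *seg)
          {
              return check_req(sk, seg) && validate_incoming(sk, seg);
          }
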
    • tcp: TCP Fast Open Server - take SYNACK RTT after completing 3WHS · 016818d0
      Committed by Neal Cardwell
      When taking SYNACK RTT samples for servers using TCP Fast Open, fix
      the code to ensure that we only call tcp_valid_rtt_meas() after we
      receive the ACK that completes the 3-way handshake.
      
      Previously we were always taking an RTT sample in
      tcp_v4_syn_recv_sock(). However, for TCP Fast Open connections
      tcp_v4_conn_req_fastopen() calls tcp_v4_syn_recv_sock() at the time we
      receive the SYN. So for TFO we must wait until tcp_rcv_state_process()
      to take the RTT sample.
      
      To fix this, we wait until after TFO calls tcp_v4_syn_recv_sock()
      before we set the snt_synack timestamp, since tcp_synack_rtt_meas()
      already ensures that we only take a SYNACK RTT sample if snt_synack is
      non-zero. To be careful, we only take a snt_synack timestamp when
      a SYNACK transmit or retransmit succeeds.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      016818d0
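
      A small sketch of the timestamp discipline described above, with
      simplified fields; as in the kernel, snt_synack == 0 means "no valid
      sample":

          #include <stdint.h>

          struct synack_state { uint32_t snt_synack; /* 0 => no timestamp */ };

          /* Stamp only when a SYNACK transmit or retransmit succeeds. */
          void synack_transmitted(struct synack_state *s, uint32_t now, int xmit_ok)
          {
              if (xmit_ok)
                  s->snt_synack = now;
          }

          /* Sample only once the 3WHS-completing ACK has arrived. */
          int take_synack_rtt(const struct synack_state *s, uint32_t now, uint32_t *rtt)
          {
              if (!s->snt_synack)
                  return 0;                /* mirrors tcp_synack_rtt_meas()'s guard */
              *rtt = now - s->snt_synack;
              return 1;
          }
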
  4. 19 Sep, 2012 (1 commit)
  5. 04 Sep, 2012 (3 commits)
  6. 01 Sep, 2012 (2 commits)
    • tcp: TCP Fast Open Server - main code path · 168a8f58
      Committed by Jerry Chu
      This patch adds the main processing path to complete the TFO server
      patches.
      
      A TFO request (i.e., SYN+data packet with a TFO cookie option) first
      gets processed in tcp_v4_conn_request(). If it passes the various TFO
      checks by tcp_fastopen_check(), a child socket will be created right
      away to be accepted by applications, rather than waiting for the 3WHS
      to finish.
      
      In addition to the use of the TFO cookie, a simple max_qlen based
      scheme is put in place to fend off spoofed TFO attacks.
      
      When a valid ACK comes back to tcp_rcv_state_process(), it will cause
      the state of the child socket to switch from either TCP_SYN_RECV to
      TCP_ESTABLISHED, or TCP_FIN_WAIT1 to TCP_FIN_WAIT2. At this time
      retransmission will resume for any unack'ed (data, FIN,...) segments.
      Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      168a8f58
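
      The flow above, condensed into a model with hypothetical names (the
      real path runs through tcp_v4_conn_request(), tcp_fastopen_check()
      and tcp_rcv_state_process()):

          #include <stdbool.h>

          enum tcp_state { TCP_SYN_RECV, TCP_ESTABLISHED, TCP_FIN_WAIT1, TCP_FIN_WAIT2 };

          struct listener_model { int qlen, max_qlen; };
          struct child_model    { enum tcp_state state; };

          /* SYN+data with a valid cookie: create the child right away,
           * with max_qlen capping outstanding requests against spoofing. */
          bool tfo_admit_syn(struct listener_model *l, bool cookie_valid)
          {
              if (!cookie_valid || l->qlen >= l->max_qlen)
                  return false;   /* fall back to a regular 3WHS */
              l->qlen++;
              return true;        /* child is accept()able immediately */
          }

          /* The valid ACK completing the 3WHS only advances the child's
           * state; retransmission of unacked data/FIN segments resumes. */
          void tfo_handshake_ack(struct listener_model *l, struct child_model *c)
          {
              l->qlen--;
              if (c->state == TCP_SYN_RECV)
                  c->state = TCP_ESTABLISHED;
              else if (c->state == TCP_FIN_WAIT1)
                  c->state = TCP_FIN_WAIT2;
          }
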
    • tcp: TCP Fast Open Server - header & support functions · 10467163
      Committed by Jerry Chu
      This patch adds all the necessary data structures and support
      functions to implement the TFO server side. It also documents a
      number of flags for the sysctl_tcp_fastopen knob, and adds a few
      Linux extension MIBs.
      
      In addition, it includes the following:
      
      1. A new TCP_FASTOPEN socket option that an application must use to
      supply the maximum backlog allowed, in order to enable TFO on its
      listener.

      2. A number of key data structures:
      "fastopen_rsk" in tcp_sock - lets a full socket access its
      request_sock for retransmission and ACK processing purposes. It is
      non-NULL iff the 3WHS has not completed.

      "fastopenq" in request_sock_queue - points to a per-listener Fast
      Open data structure "fastopen_queue" that keeps track of qlen (the
      number of outstanding Fast Open requests) and max_qlen, among other
      things.

      "listener" in tcp_request_sock - points to the original listener for
      book-keeping purposes, i.e., to maintain qlen against max_qlen as
      part of the defense against IP spoofing attacks.

      3. Various data structures and functions, many in tcp_fastopen.c, to
      support server-side Fast Open cookie operations, including
      /proc/sys/net/ipv4/tcp_fastopen_key to allow manual rekeying.
      Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      10467163
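
      An abridged sketch of how the three structures point at each other;
      the field layout here is illustrative only:

          struct sock_model;

          struct fastopen_queue_model {
              int qlen;       /* outstanding Fast Open requests */
              int max_qlen;   /* backlog supplied via TCP_FASTOPEN */
          };

          struct tcp_request_sock_model {
              struct sock_model *listener;  /* back-pointer for the
                                               qlen/max_qlen book-keeping */
          };

          struct tcp_sock_model {
              /* Non-NULL iff the 3WHS has not completed: the child uses it
               * for retransmission and ACK processing of the handshake. */
              struct tcp_request_sock_model *fastopen_rsk;
          };
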
  7. 25 Aug, 2012 (1 commit)
  8. 16 Aug, 2012 (1 commit)
  9. 07 Aug, 2012 (2 commits)
    • tcp: ecn: dont delay ACKS after CE · aae06bf5
      Committed by Eric Dumazet
      While playing with CoDel and ECN marking, I discovered a
      non-optimal behavior in the receiver of CE (Congestion Experienced)
      segments.

      In pathological cases, the sender has reduced its cwnd to low values,
      and the receiver delays its ACK (by 40 ms).

      While RFC 3168 6.1.3 (The TCP Receiver) doesn't explicitly recommend
      sending immediate ACKs, we believe it's better not to delay ACKs,
      because a CE segment should give the same signal as a dropped segment,
      and it's quite important to deliver the ECE/CWR signals as fast as
      possible.
      
      Note we already call tcp_enter_quickack_mode() from TCP_ECN_check_ce()
      if we receive a retransmit, for the same reason.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aae06bf5
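
      A tiny model of the receiver-side change (stand-in fields; the kernel
      does this in TCP_ECN_check_ce() via tcp_enter_quickack_mode()):

          #include <stdbool.h>

          struct ecn_rx_state {
              bool demand_cwr;  /* keep echoing ECE until the sender's CWR */
              bool quickack;    /* ACK immediately instead of delaying 40 ms */
          };

          void on_ce_segment(struct ecn_rx_state *s)
          {
              s->demand_cwr = true;
              /* The commit's point: a CE mark is the same congestion signal
               * as a drop, so do not sit on the ACK that carries ECE. */
              s->quickack = true;
          }
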
    • net: ipv6: fix TCP early demux · 5d299f3d
      Committed by Eric Dumazet
      IPv6 needs a cookie in the dst_check() call.

      We need to add rx_dst_cookie and provide a family-independent
      sk_rx_dst_set(sk, skb) method to properly support IPv6 TCP early demux.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5d299f3d
  10. 01 Aug, 2012 (1 commit)
    • netvm: prevent a stream-specific deadlock · c76562b6
      Committed by Mel Gorman
      This patch series is based on top of "Swap-over-NBD without deadlocking
      v15" as it depends on the same reservation of PF_MEMALLOC reserves logic.
      
      When a user or administrator requires swap for their application, they
      create a swap partition or file, format it with mkswap and activate it
      with swapon.  In diskless systems this is not an option, so if swap is
      required then swapping over the network is considered.  The two likely
      scenarios are blade servers used as part of a cluster, where the form
      factor or maintenance costs do not allow the use of disks, and thin
      clients.
      
      The Linux Terminal Server Project recommends the use of the Network Block
      Device (NBD) for swap but this is not always an option.  There is no
      guarantee that the network attached storage (NAS) device is running Linux
      or supports NBD.  However, it is likely that it supports NFS so there are
      users that want support for swapping over NFS despite any performance
      concern.  Some distributions currently carry patches that support swapping
      over NFS but it would be preferable to support it in the mainline kernel.
      
      Patch 1 avoids a stream-specific deadlock that potentially affects TCP.
      
      Patch 2 is a small modification to SELinux to avoid using PF_MEMALLOC
      	reserves.
      
      Patch 3 adds three helpers for filesystems to handle swap cache pages.
      	For example, page_file_mapping() returns page->mapping for
      	file-backed pages and the address_space of the underlying
      	swap file for swap cache pages.
      
      Patch 4 adds two address_space_operations to allow a filesystem
      	to pin all metadata relevant to a swapfile in memory. Upon
      	successful activation, the swapfile is marked SWP_FILE and
      	the address space operation ->direct_IO is used for writing
      	and ->readpage for reading in swap pages.
      
      Patch 5 notes that patch 3 is bolting
      	filesystem-specific-swapfile-support onto the side and that
      	the default handlers have different information from what
      	is available to the filesystem. This patch refactors the
      	code so that there are generic handlers for each of the new
      	address_space operations.
      
      Patch 6 adds an API to allow a vector of kernel addresses to be
      	translated to struct pages and pinned for IO.
      
      Patch 7 adds support for using highmem pages for swap by kmapping
      	the pages before calling the direct_IO handler.
      
      Patch 8 updates NFS to use the helpers from patch 3 where necessary.
      
      Patch 9 avoids setting PG_private on PG_swapcache pages within NFS.
      
      Patch 10 implements the new swapfile-related address_space operations
      	for NFS and teaches the direct IO handler how to manage
      	kernel addresses.
      
      Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO
      	where appropriate.
      
      Patch 12 fixes a NULL pointer dereference that occurs when using
      	swap-over-NFS.
      
      With the patches applied, it is possible to mount a swapfile that is on an
      NFS filesystem.  Swap performance is not great with a swap stress test
      taking roughly twice as long to complete as when the swap device was
      backed by NBD.
      
      This patch: netvm: prevent a stream-specific deadlock
      
      It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
      that we're over the global rmem limit.  This will prevent SOCK_MEMALLOC
      buffers from receiving data, which will prevent userspace from running,
      which is needed to reduce the buffered data.
      
      Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit.
      Once this change is applied, it is important that sockets that set
      SOCK_MEMALLOC do not clear the flag until the socket is being torn
      down.  If this happens, a warning is generated and the tokens are
      reclaimed to avoid accounting errors until the bug is fixed.
      
      [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c76562b6
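
      A sketch of the exemption at the heart of this patch, with model
      types (the real check involves sk_rmem_alloc and the global TCP
      memory accounting):

          #include <stdbool.h>
          #include <stddef.h>

          struct rx_sock {
              bool   sock_memalloc;  /* socket services memory reclaim (swap) */
              size_t rmem_alloc;     /* bytes currently buffered for receive */
          };

          bool may_queue(const struct rx_sock *sk, size_t truesize, size_t rmem_limit)
          {
              /* SOCK_MEMALLOC sockets bypass the limit: blocking them would
               * stall the very I/O needed to free memory, deadlocking reclaim. */
              if (sk->sock_memalloc)
                  return true;
              return sk->rmem_alloc + truesize <= rmem_limit;
          }
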
  11. 31 Jul, 2012 (1 commit)
    • net: ipv4: fix RCU races on dst refcounts · 404e0a8b
      Committed by Eric Dumazet
      commit c6cffba4 (ipv4: Fix input route performance regression.)
      added various fatal races with dst refcounts.
      
      Crashes happen on TCP workloads if routes are added/deleted at the
      same time.
      
      The dst_free() calls from free_fib_info_rcu() are clearly racy.
      
      Instead we need regular dst refcounting (dst_release()) and must make
      sure dst_release() is aware of RCU grace periods:

      Add a DST_RCU_FREE flag so that dst_release() respects an RCU grace
      period before dst destruction for cached dsts.

      Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero()
      to make sure we don't increase a zero refcount (on a dst currently
      waiting out an RCU grace period before destruction).

      rt_cache_route() must take a reference on the new cached route, and
      release it if it was not able to install it.
      
      With this patch, my machines survive various benchmarks.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      404e0a8b
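
      The take-a-reference-only-if-nonzero pattern that inet_sk_rx_dst_set()
      relies on, rendered in portable C11 atomics purely as an illustration:

          #include <stdatomic.h>
          #include <stdbool.h>

          struct dst_model { atomic_int refcnt; };

          /* Equivalent in spirit to atomic_inc_not_zero(): never resurrect
           * a dst whose refcount already hit zero and is waiting out its
           * RCU grace period before destruction. */
          bool dst_hold_if_alive(struct dst_model *d)
          {
              int old = atomic_load(&d->refcnt);

              while (old != 0) {
                  if (atomic_compare_exchange_weak(&d->refcnt, &old, old + 1))
                      return true;  /* reference taken on a live dst */
              }
              return false;         /* dying dst: caller must look up afresh */
          }
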
  12. 28 Jul, 2012 (2 commits)
    • tcp: perform DMA to userspace only if there is a task waiting for it · 59ea33a6
      Committed by Jiri Kosina
      Back in 2006, commit 1a2449a8 ("[I/OAT]: TCP recv offload to I/OAT")
      added support for offloading receives to the IOAT DMA engine when
      available.
      
      The code in tcp_rcv_established() tries to perform early DMA copy if
      applicable. It however does so without checking whether the userspace
      task is actually expecting the data in the buffer.
      
      This is not a problem under normal circumstances, but there is a corner
      case where this doesn't work -- and that's when the MSG_TRUNC flag to
      recvmsg() is used.
      
      If the IOAT dma engine is not used, the code properly checks whether
      there is a valid ucopy.task and the socket is owned by userspace, but
      misses the check in the dmaengine case.
      
      This problem can be observed trivially in practice -- for example,
      'tbench' is a good reproducer, as it makes heavy use of MSG_TRUNC. On
      systems utilizing IOAT, you will soon find tbench waiting indefinitely
      in sk_wait_data(), as the data has already been early-copied in
      tcp_rcv_established() using the DMA engine.
      
      This patch introduces the same check we are performing in the simple
      iovec copy case to the IOAT case as well. It fixes the indefinite
      recvmsg(MSG_TRUNC) hangs.
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      59ea33a6
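
      A sketch of the guard being added to the dmaengine path (model
      fields; the kernel checks tp->ucopy.task and sock_owned_by_user()):

          #include <stdbool.h>
          #include <stddef.h>

          struct ucopy_model {
              void  *task;  /* userspace task waiting in recvmsg(), or NULL */
              size_t len;   /* room left in the user buffer */
          };

          bool may_early_dma_copy(const struct ucopy_model *u, bool owned_by_user)
          {
              /* MSG_TRUNC readers have no waiting buffer; early-copying for
               * them strands the data and leaves recvmsg() stuck in
               * sk_wait_data(). */
              return owned_by_user && u->task != NULL && u->len > 0;
          }
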
    • ipv4: fix TCP early demux · 505fbcf0
      Committed by Eric Dumazet
      commit 92101b3b (ipv4: Prepare for change of rt->rt_iif encoding.)
      invalidated TCP early demux, because rx_dst_ifindex is not properly
      initialized and checked.
      
      Also remove the use of inet_iif(skb) in favor of skb->skb_iif.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      505fbcf0
  13. 24 Jul, 2012 (1 commit)
    • ipv4: Prepare for change of rt->rt_iif encoding. · 92101b3b
      Committed by David S. Miller
      Use inet_iif() consistently, and for TCP record the input interface of
      cached RX dst in inet sock.
      
      rt->rt_iif is going to be encoded differently, so that we can
      legitimately cache input routes in the FIB info more aggressively.
      
      When the input interface is "use SKB device index" the rt->rt_iif will
      be set to zero.
      
      This forces us to move the TCP RX dst cache installation into the ipv4
      specific code, and as well it should, since route caching for ipv6 is
      pointless at the moment: it is not inspected in the ipv6 input paths
      yet.
      
      Also, remove the unlikely on dst->obsolete; all ipv4 dsts have
      obsolete set to a non-zero value to force invocation of the check
      callback.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      92101b3b
  14. 21 Jul, 2012 (1 commit)
  15. 20 Jul, 2012 (4 commits)
  16. 19 Jul, 2012 (1 commit)
  17. 17 Jul, 2012 (3 commits)
    • tcp: implement RFC 5961 4.2 · 0c24604b
      Committed by Eric Dumazet
      Implement the RFC 5961 mitigation against Blind
      Reset attacks using the SYN bit.

      Section 4.2 of RFC 5961 advises sending a Challenge ACK and dropping
      the incoming packet, instead of resetting the session.

      Add a new SNMP counter to count the number of challenge ACKs sent
      in response to SYN packets.
      (netstat -s | grep TCPSYNChallenge)

      Remove the obsolete TCPAbortOnSyn counter, since we no longer abort a
      TCP session because of a SYN flag.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Kiran Kumar Kella <kkiran@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0c24604b
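
      In outline, as model code: an in-window SYN seen on an established
      connection is now answered with a challenge ACK and dropped, where it
      used to reset the session:

          enum syn_verdict { SYN_SEND_CHALLENGE_ACK_AND_DROP, SYN_RESET_SESSION };

          struct mib_model { unsigned long tcp_syn_challenge; };

          enum syn_verdict on_established_syn(struct mib_model *mib)
          {
              /* RFC 5961 4.2: do not let a blind attacker kill the session
               * with a forged SYN; the challenge ACK lets the legitimate
               * peer learn our current state instead. */
              mib->tcp_syn_challenge++;  /* surfaces as TCPSYNChallenge */
              return SYN_SEND_CHALLENGE_ACK_AND_DROP;
          }
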
    • tcp: implement RFC 5961 3.2 · 282f23c6
      Committed by Eric Dumazet
      Implement the RFC 5961 mitigation against Blind
      Reset attacks using the RST bit.

      The idea is to validate the incoming RST sequence number as an exact
      match against RCV.NXT, instead of the previously accepted
      window: (RCV.NXT <= SEG.SEQ < RCV.NXT+RCV.WND)

      If the sequence is in window but not an exact match, send
      a "challenge ACK" so that the peer can resend an
      RST with the appropriate sequence number.

      Add a new sysctl, tcp_challenge_ack_limit, to limit the
      number of challenge ACKs sent per second.

      Add a new SNMP counter to count the number of challenge ACKs sent.
      (netstat -s | grep TCPChallengeACK)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Kiran Kumar Kella <kkiran@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      282f23c6
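
      A self-contained model of the new RST acceptance test, with
      simplified sequence arithmetic (not the kernel's implementation):

          #include <stdint.h>

          enum rst_verdict { RST_DROP, RST_CHALLENGE_ACK, RST_ACCEPT };

          /* RFC 5961 3.2: accept an RST only when SEG.SEQ == RCV.NXT; an
           * in-window but inexact sequence gets a challenge ACK so a
           * legitimate peer can resend its RST with the right number. */
          enum rst_verdict check_rst(uint32_t rcv_nxt, uint32_t rcv_wnd,
                                     uint32_t seg_seq)
          {
              if (seg_seq == rcv_nxt)
                  return RST_ACCEPT;
              if (seg_seq - rcv_nxt < rcv_wnd)  /* wrap-safe unsigned compare */
                  return RST_CHALLENGE_ACK;     /* counted as TCPChallengeACK */
              return RST_DROP;
          }
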
    • tcp: add OFO snmp counters · a6df1ae9
      Committed by Eric Dumazet
      Add three SNMP TCP counters, to better track TCP behavior
      at the global level (netstat -s) when packets are received
      Out Of Order (OFO):

      TCPOFOQueue : Number of packets queued in the OFO queue.

      TCPOFODrop  : Number of packets meant to be queued in the OFO
                    queue but dropped because the socket rcvbuf limit
                    was hit.
      
      TCPOFOMerge : Number of packets in OFO that were merged with
                    other packets.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a6df1ae9
  18. 11 Jul, 2012 (1 commit)
  19. 05 Jul, 2012 (1 commit)
    • net: Do delayed neigh confirmation. · 5110effe
      Committed by David S. Miller
      When a dst_confirm() happens, mark the confirmation as pending in the
      dst.  Then on the next packet out, when we have the neigh in-hand, do
      the update.
      
      This removes dst_confirm()'s dependency on the dst having an
      attached neigh.
      
      While we're here, remove the explicit 'dst' NULL check; all except 2
      or 3 call sites ensure it's not NULL.  So just fix those cases up.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5110effe
  20. 20 Jun, 2012 (1 commit)
    • ipv4: Early TCP socket demux. · 41063e9d
      Committed by David S. Miller
      Input packet processing for local sockets involves two major demuxes.
      One for the route and one for the socket.
      
      But we can optimize this down to one demux for certain kinds of local
      sockets.
      
      Currently we only do this for established TCP sockets, but it could
      at least in theory be expanded to other kinds of connections.
      
      If a TCP socket is established then its identity is fully specified.
      
      This means that whatever input route was used during the three-way
      handshake must work equally well for the rest of the connection since
      the keys will not change.
      
      Once we move to established state, we cache the receive packet's input
      route to use later.
      
      Like the existing cached route in sk->sk_dst_cache used for output
      packets, we have to check for route invalidations using dst->obsolete
      and dst->ops->check().
      
      Early demux occurs outside of a socket locked section, so when a route
      invalidation occurs we defer the fixup of sk->sk_rx_dst until we are
      actually inside of established state packet processing and thus have
      the socket locked.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      41063e9d
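
      A sketch of the validation step described above (model types; the
      deferred fixup under the socket lock is elided):

          #include <stddef.h>

          struct dst_entry_model {
              int obsolete;  /* non-zero forces the check() callback */
              struct dst_entry_model *(*check)(struct dst_entry_model *d);
          };

          struct esock_model { struct dst_entry_model *rx_dst; };

          /* Like the cached output route in sk->sk_dst_cache: reuse the
           * cached input route only if it survives obsolete/check(). */
          struct dst_entry_model *cached_rx_dst(struct esock_model *sk)
          {
              struct dst_entry_model *d = sk->rx_dst;

              if (d && d->obsolete && d->check(d) == NULL) {
                  sk->rx_dst = NULL;  /* invalidated: take the slow lookup */
                  return NULL;
              }
              return d;
          }
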
  21. 24 May, 2012 (1 commit)
    • tcp: take care of overlaps in tcp_try_coalesce() · 1ca7ee30
      Committed by Eric Dumazet
      Sergio Correia reported the following warning:
      
      WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()
      
      WARN(skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq),
           "cleanup rbuf bug: copied %X seq %X rcvnxt %X\n",
           tp->copied_seq, TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt);
      
      It appears TCP coalescing, and more specifically commit b081f85c
      (net: implement tcp coalescing in tcp_queue_rcv()), should take care
      of possible segment overlaps in the receive queue. This was properly
      done in the case of the out_of_order_queue by the caller.

      For example, the segment at the tail of the queue has sequence
      1000-2000, and we add a segment with sequence 1500-2500.
      This can happen in the case of retransmits.

      In this case, just don't do the coalescing.
      Reported-by: Sergio Correia <lists@uece.net>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Tested-by: Sergio Correia <lists@uece.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1ca7ee30
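
      The essence of the fix as a model predicate (simplified; the real
      check operates on TCP_SKB_CB sequence numbers inside
      tcp_try_coalesce()):

          #include <stdbool.h>
          #include <stdint.h>

          /* Tail segment covers [tail_seq, tail_end); the candidate covers
           * [seq, end). Coalesce only when the candidate starts exactly at
           * tail_end: a retransmit such as 1500-2500 against a 1000-2000
           * tail overlaps and must be queued as its own skb instead. */
          bool may_coalesce(uint32_t tail_end, uint32_t seq)
          {
              return seq == tail_end;
          }
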
  22. 20 May, 2012 (1 commit)
    • net: introduce skb_try_coalesce() · bad43ca8
      Committed by Eric Dumazet
      Move the protocol-independent part of tcp_try_coalesce() to
      skb_try_coalesce().
      
      skb_try_coalesce() can be used in IPv4 defrag and IPv6 reassembly,
      to build optimized skbs (fewer sk_buffs, and possibly fewer 'headers').

      skb_try_coalesce() is zero-copy, unless the copy can fit in the
      destination header (it's a rare case).
      
      kfree_skb_partial() is also moved to net/core/skbuff.c and exported,
      because IPv6 will need it in a following patch (ipv6: use skb
      coalescing in reassembly).
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bad43ca8
  23. 18 May, 2012 (1 commit)
  24. 16 May, 2012 (2 commits)
  25. 11 May, 2012 (2 commits)