1. 19 Jul 2012, 1 commit
    • cipso: don't follow a NULL pointer when setsockopt() is called · 89d7ae34
      Paul Moore committed
      As reported by Alan Cox, and verified by Lin Ming, when a user
      attempts to add a CIPSO option to a socket using the CIPSO_V4_TAG_LOCAL
      tag the kernel dies a terrible death when it attempts to follow a NULL
      pointer (the skb argument to cipso_v4_validate() is NULL when called via
      the setsockopt() syscall).
      
      This patch fixes this by first checking to ensure that the skb is
      non-NULL before using it to find the incoming network interface.  In
      the unlikely case where the skb is NULL and the user attempts to add
      a CIPSO option with the _TAG_LOCAL tag, we return an error, as this is
      not something we want to allow.
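      A minimal user-space sketch of the guard described above; struct fake_skb
      and validate_tag() are illustrative stand-ins, not the actual kernel code:

        #include <errno.h>
        #include <stddef.h>
        #include <stdio.h>

        #define CIPSO_V4_TAG_LOCAL 128          /* tag value used by the reproducer below */

        struct fake_skb { int ifindex; };       /* stand-in for struct sk_buff */

        /* The setsockopt() path reaches validation with no skb at all. */
        static int validate_tag(const struct fake_skb *skb, unsigned char tag)
        {
                if (tag == CIPSO_V4_TAG_LOCAL) {
                        if (skb == NULL)
                                return -EINVAL; /* reject instead of dereferencing NULL */
                        return skb->ifindex >= 0 ? 0 : -EINVAL;
                }
                return 0;
        }

        int main(void)
        {
                printf("setsockopt path (no skb): %d\n",
                       validate_tag(NULL, CIPSO_V4_TAG_LOCAL));
                return 0;
        }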
      
      A simple reproducer, kindly supplied by Lin Ming, although you must
      have the CIPSO DOI #3 configured on the system first or you will be
      caught early in cipso_v4_validate():
      
      	#include <sys/types.h>
      	#include <sys/socket.h>
      	#include <linux/ip.h>
      	#include <linux/in.h>
      	#include <string.h>
      
      	struct local_tag {
      		char type;
      		char length;
      		char info[4];
      	};
      
      	struct cipso {
      		char type;
      		char length;
      		char doi[4];
      		struct local_tag local;
      	};
      
      	int main(int argc, char **argv)
      	{
      		int sockfd;
      		struct cipso cipso = {
      			.type = IPOPT_CIPSO,
      			.length = sizeof(struct cipso),
      			.local = {
      				.type = 128,
      				.length = sizeof(struct local_tag),
      			},
      		};
      
      		memset(cipso.doi, 0, 4);
      		cipso.doi[3] = 3;
      
      		sockfd = socket(AF_INET, SOCK_DGRAM, 0);
      		#define SOL_IP 0
      		setsockopt(sockfd, SOL_IP, IP_OPTIONS,
      			&cipso, sizeof(struct cipso));
      
      		return 0;
      	}
      
      CC: Lin Ming <mlin@ss.pku.edu.cn>
      Reported-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Signed-off-by: Paul Moore <pmoore@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      89d7ae34
  2. 08 Jun 2012, 1 commit
    • snmp: fix OutOctets counter to include forwarded datagrams · 2d8dbb04
      Vincent Bernat committed
      RFC 4293 defines ipIfStatsOutOctets (similar definition for
      ipSystemStatsOutOctets):
      
         The total number of octets in IP datagrams delivered to the lower
         layers for transmission.  Octets from datagrams counted in
         ipIfStatsOutTransmits MUST be counted here.
      
      And ipIfStatsOutTransmits:
      
         The total number of IP datagrams that this entity supplied to the
         lower layers for transmission.  This includes datagrams generated
         locally and those forwarded by this entity.
      
      Therefore, IPSTATS_MIB_OUTOCTETS must be incremented when incrementing
      IPSTATS_MIB_OUTFORWDATAGRAMS.
      
      IP_UPD_PO_STATS is not used since ipIfStatsOutRequests must not
      include forwarded datagrams:
      
         The total number of IP datagrams that local IP user-protocols
         (including ICMP) supplied to IP in requests for transmission.  Note
         that this counter does not include any datagrams counted in
         ipIfStatsOutForwDatagrams.
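      A small stand-alone sketch of the counter relationship described above
      (plain C with made-up field names, not the kernel SNMP/MIB code):

        #include <stdio.h>

        struct ip_mib_sketch {
                unsigned long out_requests;        /* locally generated datagrams only */
                unsigned long out_forw_datagrams;  /* forwarded datagrams */
                unsigned long out_octets;          /* every octet handed to lower layers */
        };

        static void count_forwarded(struct ip_mib_sketch *mib, unsigned int len)
        {
                mib->out_forw_datagrams++;  /* OutForwDatagrams */
                mib->out_octets += len;     /* OutOctets must follow, per the RFC text above */
                /* out_requests is deliberately untouched on the forwarding path */
        }

        int main(void)
        {
                struct ip_mib_sketch mib = { 0 };

                count_forwarded(&mib, 1500);
                printf("OutRequests=%lu OutForwDatagrams=%lu OutOctets=%lu\n",
                       mib.out_requests, mib.out_forw_datagrams, mib.out_octets);
                return 0;
        }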
      Signed-off-by: Vincent Bernat <bernat@luffy.cx>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2d8dbb04
  3. 07 Jun 2012, 1 commit
  4. 02 Jun 2012, 2 commits
    • tcp: reflect SYN queue_mapping into SYNACK packets · fff32699
      Eric Dumazet committed
      While testing how Linux behaves under a SYN flood attack on a multiqueue
      device (ixgbe), I found that SYNACK messages were dropped at the Qdisc
      level because we send them all on a single queue.
      
      The obvious choice is to reflect the incoming SYN packet's @queue_mapping
      into the SYNACK packet.
      
      Under stress, my machine could only send 25,000 SYNACK per second (for
      200,000 incoming SYN per second). NIC: ixgbe with 16 RX/TX queues.
      
      After patch, not a single SYNACK is dropped.
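      A rough sketch of the reflection described above, using a stand-in
      structure rather than the kernel's sk_buff API (illustrative only):

        #include <stdio.h>

        struct fake_skb { unsigned short queue_mapping; }; /* stand-in for sk_buff */

        /* Build the SYNACK on the same TX queue the SYN arrived on, so the
         * replies are spread across queues instead of funneled into one. */
        static void build_synack(const struct fake_skb *syn, struct fake_skb *synack)
        {
                synack->queue_mapping = syn->queue_mapping;
        }

        int main(void)
        {
                struct fake_skb syn = { .queue_mapping = 7 };
                struct fake_skb synack = { 0 };

                build_synack(&syn, &synack);
                printf("SYNACK queue: %hu\n", synack.queue_mapping);
                return 0;
        }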
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fff32699
    • tcp: do not create inetpeer on SYNACK message · 7433819a
      Eric Dumazet committed
      Another problem during a SYN flood/DDoS attack is that the inetpeer cache
      grows larger and larger, using lots of memory and CPU time.
      
      tcp_v4_send_synack()
      ->inet_csk_route_req()
       ->ip_route_output_flow()
        ->rt_set_nexthop()
         ->rt_init_metrics()
          ->inet_getpeer( create = true)
      
      This is a side effect of commit a4daad6b (net: Pre-COW metrics for
      TCP) added in 2.6.39
      
      Possible solution:
      
      Instruct inet_csk_route_req() to remove FLOWI_FLAG_PRECOW_METRICS.
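      A tiny sketch of the flag manipulation involved; the flag values below
      are made up for illustration, not copied from the kernel headers:

        #include <stdio.h>

        #define FLOWI_FLAG_ANYSRC         0x01 /* hypothetical values for the sketch */
        #define FLOWI_FLAG_PRECOW_METRICS 0x02

        int main(void)
        {
                unsigned int flow_flags = FLOWI_FLAG_ANYSRC | FLOWI_FLAG_PRECOW_METRICS;

                /* SYNACK route lookup: drop the pre-COW request so no inetpeer
                 * entry is created for a connection that may never complete. */
                flow_flags &= ~FLOWI_FLAG_PRECOW_METRICS;

                printf("flags=%#x\n", flow_flags);
                return 0;
        }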
      
      Before patch:
      
      # grep peer /proc/slabinfo
      inet_peer_cache   4175430 4175430    192   42    2 : tunables    0    0    0 : slabdata  99415  99415      0
      
      Samples: 41K of event 'cycles', Event count (approx.): 30716565122
      +  20,24%      ksoftirqd/0  [kernel.kallsyms]           [k] inet_getpeer
      +   8,19%      ksoftirqd/0  [kernel.kallsyms]           [k] peer_avl_rebalance.isra.1
      +   4,81%      ksoftirqd/0  [kernel.kallsyms]           [k] sha_transform
      +   3,64%      ksoftirqd/0  [kernel.kallsyms]           [k] fib_table_lookup
      +   2,36%      ksoftirqd/0  [ixgbe]                     [k] ixgbe_poll
      +   2,16%      ksoftirqd/0  [kernel.kallsyms]           [k] __ip_route_output_key
      +   2,11%      ksoftirqd/0  [kernel.kallsyms]           [k] kernel_map_pages
      +   2,11%      ksoftirqd/0  [kernel.kallsyms]           [k] ip_route_input_common
      +   2,01%      ksoftirqd/0  [kernel.kallsyms]           [k] __inet_lookup_established
      +   1,83%      ksoftirqd/0  [kernel.kallsyms]           [k] md5_transform
      +   1,75%      ksoftirqd/0  [kernel.kallsyms]           [k] check_leaf.isra.9
      +   1,49%      ksoftirqd/0  [kernel.kallsyms]           [k] ipt_do_table
      +   1,46%      ksoftirqd/0  [kernel.kallsyms]           [k] hrtimer_interrupt
      +   1,45%      ksoftirqd/0  [kernel.kallsyms]           [k] kmem_cache_alloc
      +   1,29%      ksoftirqd/0  [kernel.kallsyms]           [k] inet_csk_search_req
      +   1,29%      ksoftirqd/0  [kernel.kallsyms]           [k] __netif_receive_skb
      +   1,16%      ksoftirqd/0  [kernel.kallsyms]           [k] copy_user_generic_string
      +   1,15%      ksoftirqd/0  [kernel.kallsyms]           [k] kmem_cache_free
      +   1,02%      ksoftirqd/0  [kernel.kallsyms]           [k] tcp_make_synack
      +   0,93%      ksoftirqd/0  [kernel.kallsyms]           [k] _raw_spin_lock_bh
      +   0,87%      ksoftirqd/0  [kernel.kallsyms]           [k] __call_rcu
      +   0,84%      ksoftirqd/0  [kernel.kallsyms]           [k] rt_garbage_collect
      +   0,84%      ksoftirqd/0  [kernel.kallsyms]           [k] fib_rules_lookup
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Hans Schillstrom <hans.schillstrom@ericsson.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7433819a
  5. 30 May 2012, 1 commit
    • memcg: decrement static keys at real destroy time · 3f134619
      Glauber Costa committed
      We call the destroy function when a cgroup starts to be removed, such as
      by a rmdir event.
      
      However, because of our reference counters, some objects are still
      inflight.  Right now, we are decrementing the static_keys at destroy()
      time, meaning that if we get rid of the last static_key reference, some
      objects will still have charges, but the code to properly uncharge them
      won't be run.
      
      This becomes a problem especially if it is ever enabled again, because now
      new charges will be added on top of the stale charges, making bookkeeping
      pretty much impossible.
      
      We just need to be careful with the static branch activation: since there
      is no particular preferred order of their activation, we need to make sure
      that we only start using it after all call sites are active.  This is
      achieved by having a per-memcg flag that is only updated after
      static_key_slow_inc() returns.  At this time, we are sure all sites are
      active.
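      A user-space sketch of that ordering, with stand-ins for the static key
      machinery (the names below are illustrative, not the kernel API):

        #include <stdbool.h>
        #include <stdio.h>

        struct memcg_sketch {
                bool tcp_active;        /* stand-in for the per-memcg activation bit */
        };

        static int static_key_refcount; /* stand-in for the static key itself */

        static void static_key_slow_inc_stub(void)
        {
                static_key_refcount++;  /* in the kernel this patches every call site */
        }

        static void activate_tcp_accounting(struct memcg_sketch *cg)
        {
                static_key_slow_inc_stub(); /* 1) enable all call sites first */
                cg->tcp_active = true;      /* 2) only then advertise it per-memcg */
        }

        int main(void)
        {
                struct memcg_sketch cg = { .tcp_active = false };

                activate_tcp_accounting(&cg);
                printf("active=%d refcount=%d\n", cg.tcp_active, static_key_refcount);
                return 0;
        }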
      
      This is made per-memcg, not global, for a reason: it also has the effect
      of making socket accounting more consistent.  The first memcg to be
      limited will trigger static_key() activation, and therefore accounting,
      but all the others will then be accounted no matter what.  After this
      patch, only limited memcgs will have their sockets accounted.
      
      [akpm@linux-foundation.org: move enum sock_flag_bits into sock.h,
                                  document enum sock_flag_bits,
                                  convert memcg_proto_active() and memcg_proto_activated() to test_bit(),
                                  redo tcp_update_limit() comment to 80 cols]
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f134619
  6. 27 May 2012, 1 commit
  7. 24 May 2012, 3 commits
    • tcp: take care of overlaps in tcp_try_coalesce() · 1ca7ee30
      Eric Dumazet committed
      Sergio Correia reported the following warning:
      
      WARNING: at net/ipv4/tcp.c:1301 tcp_cleanup_rbuf+0x4f/0x110()
      
      WARN(skb && !before(tp->copied_seq, TCP_SKB_CB(skb)->end_seq),
           "cleanup rbuf bug: copied %X seq %X rcvnxt %X\n",
           tp->copied_seq, TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt);
      
      It appears TCP coalescing, and more specifically commit b081f85c
      (net: implement tcp coalescing in tcp_queue_rcv()), should take care of
      possible segment overlaps in the receive queue. This was properly done
      for the out_of_order_queue by the caller.
      
      For example, the segment at the tail of the queue has sequence 1000-2000,
      and we add a segment with sequence 1500-2500.
      This can happen in case of retransmits.
      
      In this case, just don't do the coalescing.
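      A toy version of the check, just to make the sequence-number condition
      concrete (illustrative only, not the kernel function):

        #include <stdbool.h>
        #include <stdio.h>

        /* Coalesce only when the new segment starts exactly where the tail
         * segment ends; any other start means a gap or an overlap. */
        static bool can_coalesce(unsigned int tail_end_seq, unsigned int new_seq)
        {
                return new_seq == tail_end_seq;
        }

        int main(void)
        {
                printf("tail ends 2000, new starts 2000: %d\n", can_coalesce(2000, 2000));
                printf("tail ends 2000, new starts 1500: %d\n", can_coalesce(2000, 1500));
                return 0;
        }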
      Reported-by: Sergio Correia <lists@uece.net>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Tested-by: Sergio Correia <lists@uece.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1ca7ee30
    • ipv4: fix the rcu race between free_fib_info and ip_route_output_slow · e49cc0da
      Yanmin Zhang committed
      We hit a kernel OOPS.
      
      <3>[23898.789643] BUG: sleeping function called from invalid context at
      /data/buildbot/workdir/ics/hardware/intel/linux-2.6/arch/x86/mm/fault.c:1103
      <3>[23898.862215] in_atomic(): 0, irqs_disabled(): 0, pid: 10526, name:
      Thread-6683
      <4>[23898.967805] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
      to suspend...
      <4>[23899.258526] Pid: 10526, comm: Thread-6683 Tainted: G        W
      3.0.8-137685-ge7742f9 #1
      <4>[23899.357404] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
      to suspend...
      <4>[23899.904225] Call Trace:
      <4>[23899.989209]  [<c1227f50>] ? pgtable_bad+0x130/0x130
      <4>[23900.000416]  [<c1238c2a>] __might_sleep+0x10a/0x110
      <4>[23900.007357]  [<c1228021>] do_page_fault+0xd1/0x3c0
      <4>[23900.013764]  [<c18e9ba9>] ? restore_all+0xf/0xf
      <4>[23900.024024]  [<c17c007b>] ? napi_complete+0x8b/0x690
      <4>[23900.029297]  [<c1227f50>] ? pgtable_bad+0x130/0x130
      <4>[23900.123739]  [<c1227f50>] ? pgtable_bad+0x130/0x130
      <4>[23900.128955]  [<c18ea0c3>] error_code+0x5f/0x64
      <4>[23900.133466]  [<c1227f50>] ? pgtable_bad+0x130/0x130
      <4>[23900.138450]  [<c17f6298>] ? __ip_route_output_key+0x698/0x7c0
      <4>[23900.144312]  [<c17f5f8d>] ? __ip_route_output_key+0x38d/0x7c0
      <4>[23900.150730]  [<c17f63df>] ip_route_output_flow+0x1f/0x60
      <4>[23900.156261]  [<c181de58>] ip4_datagram_connect+0x188/0x2b0
      <4>[23900.161960]  [<c18e981f>] ? _raw_spin_unlock_bh+0x1f/0x30
      <4>[23900.167834]  [<c18298d6>] inet_dgram_connect+0x36/0x80
      <4>[23900.173224]  [<c14f9e88>] ? _copy_from_user+0x48/0x140
      <4>[23900.178817]  [<c17ab9da>] sys_connect+0x9a/0xd0
      <4>[23900.183538]  [<c132e93c>] ? alloc_file+0xdc/0x240
      <4>[23900.189111]  [<c123925d>] ? sub_preempt_count+0x3d/0x50
      
      free_fib_info() resets nexthop_nh->nh_dev to NULL before releasing fi,
      but another cpu might still be accessing fi. Fix it by delaying the release.
      
      With the patch, we ran MTBF testing on Android mobile devices for 12 hours
      and didn't trigger the issue.
      
      Thanks to Eric for the very detailed review and for checking the issue.
      Signed-off-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
      Signed-off-by: Kun Jiang <kunx.jiang@intel.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e49cc0da
    • mm: add a low limit to alloc_large_system_hash · 31fe62b9
      Tim Bird committed
      UDP stack needs a minimum hash size value for proper operation and also
      uses alloc_large_system_hash() for proper NUMA distribution of its hash
      tables and automatic sizing depending on available system memory.
      
      In some low-memory situations, udp_table_init() must ignore the
      alloc_large_system_hash() result and realloc a bigger memory area.
      
      As we cannot easily free the old hash table, we leak it and kmemleak can
      issue a warning.
      
      This patch adds a low limit parameter to alloc_large_system_hash() to
      solve this problem.
      
      We then specify UDP_HTABLE_SIZE_MIN for UDP/UDPLite hash table
      allocation.
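      A sketch of the clamp that the new low-limit parameter provides; the
      UDP_HTABLE_SIZE_MIN value below is assumed for illustration:

        #include <stdio.h>

        #define UDP_HTABLE_SIZE_MIN 256 /* assumed value for the sketch */

        static unsigned long clamp_hash_entries(unsigned long computed,
                                                unsigned long low_limit)
        {
                return computed < low_limit ? low_limit : computed;
        }

        int main(void)
        {
                /* On a low-memory system the sizing heuristic may suggest far
                 * too few slots; the low limit keeps the table usable. */
                printf("%lu\n", clamp_hash_entries(64, UDP_HTABLE_SIZE_MIN));
                return 0;
        }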
      Reported-by: Mark Asselstine <mark.asselstine@windriver.com>
      Reported-by: Tim Bird <tim.bird@am.sony.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      31fe62b9
  8. 20 May 2012, 4 commits
  9. 18 May 2012, 3 commits
    • ip_frag: struct inet_frags match() method returns a bool · cbc264ca
      Eric Dumazet committed
      - match() method returns a boolean
      - return (A && B && C && D) -> return A && B && C && D
      - fix indentation
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      cbc264ca
    • tcp: do_tcp_sendpages() must try to push data out on oom conditions · bad115cf
      Willy Tarreau committed
      Since recent changes on TCP splicing (starting with commits 2f533844
      "tcp: allow splice() to build full TSO packets" and 35f9c09f "tcp:
      tcp_sendpages() should call tcp_push() once"), I started seeing
      massive stalls when forwarding traffic between two sockets using
      splice() when pipe buffers were larger than socket buffers.
      
      Latest changes (net: netdev_alloc_skb() use build_skb()) made the
      problem even more apparent.
      
      The reason seems to be that if do_tcp_sendpages() fails on an out-of-memory
      condition without being able to send at least one byte, tcp_push() is not
      called and the buffers cannot be flushed.
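      A schematic sketch of the control flow being fixed, not the real
      do_tcp_sendpages(); all names and numbers below are illustrative:

        #include <stdio.h>

        static int sendpages_stub(int bytes_available, int oom_after)
        {
                int copied = 0;

                while (copied < bytes_available) {
                        if (copied >= oom_after)
                                break;          /* allocation failed mid-way */
                        copied++;
                }

                if (copied)
                        printf("push %d queued bytes despite the error\n", copied);
                else
                        printf("nothing queued, return the error\n");
                return copied;
        }

        int main(void)
        {
                sendpages_stub(10, 4);  /* partial success must still push */
                sendpages_stub(10, 0);  /* pure failure path */
                return 0;
        }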
      
      After applying the attached patch, I cannot reproduce the stalls at all
      and the data rate is perfectly stable and steady under any condition
      which previously caused the problem to be permanent.
      
      The issue seems to have been there since before the kernel migrated to
      git, which makes me think that the stalls I occasionally experienced
      with tux during stress-tests years ago were probably related to the
      same issue.
      
      This issue was first encountered on 3.0.31 and 3.2.17, so please backport
      to -stable.
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Cc: <stable@vger.kernel.org>
      bad115cf
    • tcp: bool conversions · a2a385d6
      Eric Dumazet committed
      bool conversions where possible.
      
      __inline__ -> inline
      
      space cleanups
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a2a385d6
  10. 17 May 2012, 1 commit
  11. 16 May 2012, 4 commits
  12. 11 May 2012, 4 commits
  13. 09 May 2012, 1 commit
    • netfilter: remove ip_queue support · d16cf20e
      Pablo Neira Ayuso committed
      This patch removes ip_queue support, which was marked as obsolete
      years ago. The nfnetlink_queue module provides a more advanced
      user-space packet queueing mechanism.
      
      This patch also removes capability code included in SELinux that
      refers to ip_queue. Otherwise, we break compilation.
      
      Several warnings have been sent regarding this to the mailing list
      over the past months without anyone raising a hand to stop this
      with a strong argument.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      d16cf20e
  14. 08 May 2012, 1 commit
  15. 05 May 2012, 1 commit
    • tcp: be more strict before accepting ECN negotiation · bd14b1b2
      Eric Dumazet committed
      It appears some networks play bad games with the two bits reserved for
      ECN. This can trigger false congestion notifications and very slow
      transfers.
      
      Since RFC 3168 (6.1.1) forbids SYN packets from carrying ECT/CE bits, we
      can disable TCP ECN negotiation if we happen to receive mangled ECN bits
      in the SYN packet.
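      A minimal sketch of the kind of check described; INET_ECN_MASK here just
      selects the two low TOS bits, and this is not the kernel function:

        #include <stdbool.h>
        #include <stdio.h>

        #define INET_ECN_MASK 0x03 /* low two bits of the TOS/traffic-class byte */

        static bool accept_ecn(unsigned char tos, bool syn_requested_ecn)
        {
                if ((tos & INET_ECN_MASK) != 0)
                        return false;   /* mangled SYN: refuse to negotiate ECN */
                return syn_requested_ecn;
        }

        int main(void)
        {
                printf("clean SYN  : %d\n", accept_ecn(0x00, true));
                printf("ECT(1) SYN : %d\n", accept_ecn(0x01, true));
                return 0;
        }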
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Perry Lorier <perryl@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Wilmer van der Gaast <wilmer@google.com>
      Cc: Ankur Jain <jankur@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Dave Täht <dave.taht@bufferbloat.net>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bd14b1b2
  16. 04 May 2012, 1 commit
  17. 03 May 2012, 10 commits
    • userns: Convert group_info values from gid_t to kgid_t. · ae2975bc
      Eric W. Biederman committed
      As a first step toward converting struct cred to all kuid_t and kgid_t
      values, convert the group values stored in group_info to always be
      kgid_t values.  Unless user namespaces are used, this change should
      have no effect.
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      ae2975bc
    • tcp: move stats merge to the end of tcp_try_coalesce · 34a802a5
      Alexander Duyck committed
      This change cleans up the last bits of tcp_try_coalesce so that we only
      need one goto which jumps to the end of the function.  The idea is to make
      the code more readable by putting things in a linear order so that we start
      execution at the top of the function, and end it at the bottom.
      
      I also made a slight tweak to the code for handling frags when we are a
      clone.  Instead of "if (clone) loop, else nr_frags = 0", I changed the
      logic so that if (!clone) we just set the number of frags to 0, which
      disables the for loop anyway.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      34a802a5
    • tcp: Move code related to head frag in tcp_try_coalesce · 57b55a7e
      Alexander Duyck committed
      This change reorders the code related to the use of an skb->head_frag so it
      is placed before we check the rest of the frags.  This allows the code to
      read more linearly instead of like some sort of loop.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      57b55a7e
    • tcp: Fix truesize accounting in tcp_try_coalesce · c73c3d9c
      Alexander Duyck committed
      This patch addresses several issues in the way we were tracking the
      truesize in tcp_try_coalesce.
      
      First, it was using ksize(), which prevents us from having a 0-sized head
      frag and getting a usable result.  To resolve that, this patch uses the end
      pointer, which is set based off either ksize() or the frag_size supplied to
      build_skb().  This allows us to compute the original truesize of the entire
      buffer and remove that value, leaving us with just what was added as pages.
      
      The second issue was the use of skb->len if there is a mergeable head frag.
      We should only need to remove the size of a data-aligned sk_buff from our
      current skb->truesize to compute the delta for a buffer with a reused head.
      By using skb->len the value of truesize was being artificially reduced
      which means that head frags could use more memory than buffers using
      standard allocations.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c73c3d9c
    • net: Stop decapitating clones that have a head_frag · 2996d31f
      Alexander Duyck committed
      This change is meant to prevent stealing the skb->head to use as a page in
      the event that the skb->head was cloned.  This allows the other clones to
      track each other via shinfo->dataref.
      
      Without this we break down to two methods for tracking the reference count,
      one being dataref, the other being the page count.  As a result it becomes
      difficult to track how many references there are to skb->head.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2996d31f
    • net: implement tcp coalescing in tcp_queue_rcv() · b081f85c
      Eric Dumazet committed
      Extend TCP coalescing by implementing it from tcp_queue_rcv(), the main
      receive function used when the application is not blocked in recvmsg().
      
      tcp_queue_rcv() is moved a bit to allow it to be called from
      tcp_data_queue().
      
      This gives good results, especially if GRO could not kick in and the skb
      head is a fragment.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b081f85c
    • net: take care of cloned skbs in tcp_try_coalesce() · 923dd347
      Eric Dumazet committed
      Before stealing fragments or skb head, we must make sure skbs are not
      cloned.
      
      Alexander was worried about the destination skb being cloned: in bridge
      setups, a driver could be fooled if skb->data_len did not match the skb's
      nr_frags.
      
      If source skb is cloned, we must take references on pages instead.
      
      Bug happened using tcpdump (if not using mmap())
      
      Introduce kfree_skb_partial() helper to cleanup code.
      Reported-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      923dd347
    • tcp: change tcp_adv_win_scale and tcp_rmem[2] · b49960a0
      Eric Dumazet committed
      The tcp_adv_win_scale default value is 2, meaning we expect a good-citizen
      skb to have a skb->len / skb->truesize ratio of 75% (3/4).
      
      In 2.6 kernels we (mis)accounted a typical MSS=1460 frame as:
      1536 + 64 + 256 = 1856 'estimated truesize', and 1856 * 3/4 = 1392.
      So these skbs were not considered bloated.
      
      With recent truesize fixes, a typical MSS=1460 frame truesize is now the
      more precise:
      2048 + 256 = 2304. But 2304 * 3/4 = 1728.
      So these skbs are no longer good citizens, because 1460 < 1728.
      
      (GRO can escape this problem because it builds skbs with too low a
      truesize.)
      
      This also means TCP advertises too optimistic a window for a given
      allocated rcvspace: when receiving frames, sk_rmem_alloc can hit the
      sk_rcvbuf limit and we call tcp_prune_queue()/tcp_collapse() too often,
      especially when the application is slow to drain its receive queue or in
      case of losses (netperf is fast, scp is slow). This is a major latency
      source.
      
      We should adjust the len/truesize ratio to 50% instead of 75%.
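      A worked version of the arithmetic above, as a small stand-alone program;
      the formula mirrors the len/truesize assumption, not the exact kernel code:

        #include <stdio.h>

        /* With tcp_adv_win_scale = n, the stack assumes payload makes up
         * truesize - (truesize >> n) of each skb. */
        static long assumed_payload(long truesize, int adv_win_scale)
        {
                return truesize - (truesize >> adv_win_scale);
        }

        int main(void)
        {
                long truesize = 2048 + 256;     /* typical MSS=1460 skb, per the message */

                /* scale=2: 1728 assumed, but only 1460 real -> flagged as bloated */
                printf("scale=2: %ld\n", assumed_payload(truesize, 2));
                /* scale=1: 1152 assumed, 1460 real -> a good citizen again */
                printf("scale=1: %ld\n", assumed_payload(truesize, 1));
                return 0;
        }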
      
      This patch:
      
      1) changes the tcp_adv_win_scale default to 1 instead of 2
      
      2) increases the tcp_rmem[2] limit from 4MB to 6MB to take into account
      better truesize tracking and to allow the autotuned TCP receive window to
      reach the same value as before. Note that the same amount of kernel memory
      is consumed compared to 2.6 kernels.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b49960a0
    • tcp: early retransmit: delayed fast retransmit · 750ea2ba
      Yuchung Cheng committed
      Implement the advanced early retransmit mode (sysctl_tcp_early_retrans == 2),
      which delays the fast retransmit by an interval of RTT/4. We borrow the
      RTO timer to implement the delay. If we receive another ACK or send
      a new packet, the timer is cancelled and restored to the original RTO
      value offset by the time elapsed.  When the delayed-ER timer fires,
      we enter fast recovery and perform fast retransmit.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      750ea2ba
    • tcp: early retransmit · eed530b6
      Yuchung Cheng committed
      This patch implements RFC 5827 early retransmit (ER) for TCP.
      It reduces the DUPACK threshold (dupthresh) when fewer than 4 packets are
      outstanding, so losses are recovered by fast recovery instead of a timeout.
      
      While the algorithm is simple, small but frequent network reordering
      makes this feature dangerous: the connection repeatedly enters
      false recovery and degrades performance. Therefore we implement
      a mitigation suggested in the appendix of the RFC that delays
      entering fast recovery by a small interval, i.e., RTT/4. Currently
      ER is conservative and is disabled for the rest of the connection
      after the first reordering event. A large-scale web server
      experiment on the performance impact of ER is summarized in
      section 6 of the paper "Proportional Rate Reduction for TCP",
      IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf
      
      Note that Linux has a similar feature called THIN_DUPACK. The
      differences are that THIN_DUPACK does not mitigate reorderings and is only
      used after slow start. Currently ER is disabled if THIN_DUPACK is
      enabled. I would be happy to merge the THIN_DUPACK feature with ER if
      people think it's a good idea.
      
      ER is enabled by sysctl_tcp_early_retrans:
        0: Disables ER
      
        1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4.
      
        2: (Default) reduce dupthresh like mode 1. In addition, delay
           entering fast recovery by RTT/4.
      
      Note: mode 2 is implemented in the third part of this patch series.
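      A small sketch of the mode-1 threshold reduction described above; the
      values and the guard are illustrative, not the kernel implementation:

        #include <stdio.h>

        static int dupthresh(int packets_out, int early_retrans)
        {
                int thresh = 3; /* classic fast-retransmit threshold */

                if (early_retrans >= 1 && packets_out > 1 && packets_out < 4)
                        thresh = packets_out - 1;
                return thresh;
        }

        int main(void)
        {
                printf("2 packets out: dupthresh=%d\n", dupthresh(2, 1)); /* 1 */
                printf("8 packets out: dupthresh=%d\n", dupthresh(8, 1)); /* 3 */
                return 0;
        }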
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eed530b6