1. 30 January 2009 (1 commit)
    • gro: Avoid copying headers of unmerged packets · 86911732
      Committed by Herbert Xu
      Unfortunately simplicity isn't always the best.  The fraginfo
      interface turned out to be suboptimal.  The problem was quite
      obvious.  For every packet, we have to copy the headers from
      the frags structure into skb->head, even though for 99% of the
      packets this part is immediately thrown away after the merge.
      
      LRO didn't have this problem because it directly read the headers
      from the frags structure.
      
      This patch attempts to address this by creating an interface
      that allows GRO to access the headers in the first frag without
      having to copy them.  Because all drivers that use frags place the
      headers in the first frag, this optimisation should be enough.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
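
      A minimal sketch of the idea (illustrative only; the struct and
      function names below are made up and are not the GRO interface added
      by this commit): return a pointer straight into the first fragment
      when the requested header fits there, and only fall back to a
      linearising copy otherwise.

      /* Hypothetical stand-ins for skb_frag_t / struct sk_buff. */
      #include <stddef.h>

      struct frag_desc {              /* a page fragment descriptor */
      	void   *data;           /* mapped address of the fragment */
      	size_t  len;            /* bytes available in the fragment */
      };

      struct pkt {                    /* stand-in for struct sk_buff */
      	struct frag_desc first_frag;
      };

      /*
       * Return a pointer to hlen bytes of packet header.  In the common case
       * (all drivers that use frags put the headers in the first frag) this
       * is just a pointer into that fragment, so nothing is copied and
       * nothing has to be thrown away after the merge.
       */
      static void *gro_header(struct pkt *p, size_t hlen)
      {
      	if (hlen <= p->first_frag.len)
      		return p->first_frag.data;      /* fast path: zero copy */

      	/* Rare slow path: the header spans beyond the first fragment; a
      	 * real implementation would linearise it into skb->head here.
      	 */
      	return NULL;
      }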
  2. 27 January 2009 (1 commit)
  3. 23 January 2009 (9 commits)
  4. 22 January 2009 (2 commits)
    • gre: strict physical device binding · 749c10f9
      Committed by Timo Teras
      Check the device on the receive path and allow otherwise identical
      tunnels as long as the physical device differs.
      
      This is useful for NBMA tunnels, where you want to use a different GRE
      IP for each public IP available via different physical devices.
      Signed-off-by: Timo Teras <timo.teras@iki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
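
      Roughly, the change amounts to also comparing the underlying physical
      device during tunnel lookup on the receive path.  A hedged sketch
      under made-up names (this is not the actual ipgre lookup code):

      #include <stddef.h>
      #include <stdint.h>

      /* Illustrative tunnel descriptor: endpoint addresses plus the ifindex
       * of the physical device it is bound to (0 means "any device").
       */
      struct tunnel {
      	uint32_t       remote;     /* remote tunnel endpoint */
      	uint32_t       local;      /* local tunnel endpoint */
      	int            link;       /* ifindex of bound physical device */
      	struct tunnel *next;
      };

      /* Find a tunnel matching a received packet, skipping tunnels that are
       * bound to a different physical device than the one the packet arrived
       * on.  Otherwise identical tunnels can therefore coexist as long as
       * their physical devices differ.
       */
      static struct tunnel *tunnel_lookup(struct tunnel *head,
      				    uint32_t remote, uint32_t local,
      				    int in_ifindex)
      {
      	struct tunnel *t;

      	for (t = head; t != NULL; t = t->next) {
      		if (t->remote != remote || t->local != local)
      			continue;
      		if (t->link && t->link != in_ifindex)
      			continue;       /* bound to another device */
      		return t;
      	}
      	return NULL;
      }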
    • inet: Allowing more than 64k connections and heavily optimize bind(0) time. · a9d8f911
      Committed by Evgeniy Polyakov
      With a simple extension to the binding mechanism that allows binding
      more than 64k sockets (or fewer, depending on sysctl parameters), we
      have to traverse the whole bind hash table to find an empty bucket.
      While that is not a problem for, say, 32k connections, bind()
      completion time grows exponentially (since after each successful
      binding we have to traverse one more bucket to find an empty one),
      even if we start each time from a random offset inside the hash table.
      
      So, when the hash table is full and we want to add another socket, we
      have to traverse the whole table no matter what; effectively this is
      the worst-case performance, and it is constant.
      
      The attached picture shows bind() time depending on the number of
      already bound sockets.
      
      The green area corresponds to the usual bind-to-port-zero process,
      which turns on kernel port selection as described above.  The red area
      is the bind process when the number of reuse-bound sockets is not
      limited to 64k (or by the sysctl parameters).  The same exponential
      growth (hidden by the green area) occurs before the number of ports
      reaches the sysctl limit.
      
      At that point the bind hash table has exactly one reuse-enabled socket
      per bucket, but those sockets may have different addresses.  The
      kernel actually selects the first port to try at random, so at the
      beginning bind will take roughly constant time, but over time the
      number of ports to check after the random start increases.  That
      growth is exponential, but because of the random selection not every
      port selection necessarily takes longer than the previous one, so we
      have to consider the area below the curve in the graph (zooming in
      would show many different times there); one area can hide another.
      
      The blue area corresponds to the port selection optimization.
      
      The design approach is rather simple: the hash table now maintains an
      (imprecise and racily updated) count of currently bound sockets, and
      when the number of such sockets becomes greater than a predefined
      value (I use the maximum port range defined by the sysctls), we stop
      traversing the whole bind hash table and just stop at the first
      matching bucket after the random start.  The limit above roughly
      corresponds to the case when the bind hash table is full and we have
      turned on the mechanism that allows binding more reuse-enabled
      sockets, so it does not change the behaviour of other sockets.
      Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net>
      Tested-by: Denys Fedoryschenko <denys@visp.net.lb>
      Signed-off-by: David S. Miller <davem@davemloft.net>
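
      A hedged sketch of the strategy just described (the structures and
      names are invented for illustration; this is not the kernel's
      inet_csk_get_port()): keep an approximate count of bound sockets and,
      once it exceeds the configured port range, stop scanning for an empty
      bucket and settle for the first candidate after a random start.

      #include <stdlib.h>

      struct bind_bucket {
      	int owners;                 /* sockets bound in this bucket */
      };

      struct bind_table {
      	struct bind_bucket *buckets;
      	int nbuckets;
      	int bound_sockets;          /* imprecise, racily updated counter */
      };

      /* Pick a bucket for bind(0).  Below the threshold, behave as before
       * and look for an empty bucket starting at a random offset; above it
       * the table is effectively full, so take the first bucket after the
       * random start instead of walking the whole table.
       */
      static int pick_bucket(struct bind_table *tb, int port_range)
      {
      	int start = rand() % tb->nbuckets;
      	int i, fallback = -1;

      	if (tb->bound_sockets > port_range)
      		return start;       /* table full: first candidate wins */

      	for (i = 0; i < tb->nbuckets; i++) {
      		int idx = (start + i) % tb->nbuckets;

      		if (tb->buckets[idx].owners == 0)
      			return idx;         /* found an empty bucket */
      		if (fallback < 0)
      			fallback = idx;     /* remember something usable */
      	}
      	return fallback;            /* no empty bucket left */
      }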
  5. 15 January 2009 (1 commit)
  6. 14 January 2009 (1 commit)
    • tcp: splice as many packets as possible at once · 33966dd0
      Committed by Willy Tarreau
      As spotted by Willy Tarreau, the current splice() from a TCP socket to
      a pipe is not optimal: it processes at most one segment per call.
      This results in low performance and very high overhead due to the
      syscall rate when splicing from interfaces which do not support LRO.
      
      Willy provided a patch inside tcp_splice_read(), but a better fix
      is to let tcp_read_sock() process as many segments as possible, so
      that tcp_rcv_space_adjust() and tcp_cleanup_rbuf() are called less
      often.
      
      With this change, splice() behaves like tcp_recvmsg(), being able
      to consume many skbs in one system call. With typical 1460 bytes
      of payload per frame, that means splice(SPLICE_F_NONBLOCK) can return
      16*1460 = 23360 bytes.
      Signed-off-by: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
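
      For context, a userspace sketch of the splice() path this change
      speeds up (error handling trimmed; sock_fd is assumed to be a
      connected TCP socket).  With the patch, each non-blocking splice()
      from the socket into the pipe can move many queued segments at once
      instead of just one.

      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <unistd.h>

      /* Move data from a connected TCP socket to out_fd via a pipe. */
      static long tcp_splice_to(int sock_fd, int out_fd)
      {
      	int pipefd[2];
      	long total = 0;

      	if (pipe(pipefd) < 0)
      		return -1;

      	for (;;) {
      		ssize_t n = splice(sock_fd, NULL, pipefd[1], NULL, 65536,
      				   SPLICE_F_NONBLOCK | SPLICE_F_MOVE);
      		if (n <= 0)
      			break;          /* EOF, EAGAIN or error */
      		/* push onward; a real program would handle short writes */
      		splice(pipefd[0], NULL, out_fd, NULL, n, SPLICE_F_MOVE);
      		total += n;
      	}

      	close(pipefd[0]);
      	close(pipefd[1]);
      	return total;
      }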
  7. 13 January 2009 (2 commits)
  8. 09 January 2009 (1 commit)
  9. 07 January 2009 (2 commits)
  10. 05 January 2009 (3 commits)
  11. 01 January 2009 (1 commit)
  12. 30 December 2008 (2 commits)
  13. 26 December 2008 (4 commits)
  14. 19 December 2008 (1 commit)
  15. 16 December 2008 (3 commits)
    • ipmr: merge common code · b1879204
      Committed by Ilpo Järvinen
      This also removes a redundant skb->len < x check, which cannot
      be true once pskb_may_pull(skb, x) has succeeded.
      
      $ diff-funcs pim_rcv ipmr.c ipmr.c pim_rcv_v1
        --- ipmr.c:pim_rcv()
        +++ ipmr.c:pim_rcv_v1()
      @@ -1,22 +1,27 @@
      -static int pim_rcv(struct sk_buff * skb)
      +int pim_rcv_v1(struct sk_buff * skb)
       {
      -	struct pimreghdr *pim;
      +	struct igmphdr *pim;
       	struct iphdr   *encap;
       	struct net_device  *reg_dev = NULL;
      
       	if (!pskb_may_pull(skb, sizeof(*pim) + sizeof(*encap)))
       		goto drop;
      
      -	pim = (struct pimreghdr *)skb_transport_header(skb);
      -	if (pim->type != ((PIM_VERSION<<4)|(PIM_REGISTER)) ||
      -	    (pim->flags&PIM_NULL_REGISTER) ||
      -	    (ip_compute_csum((void *)pim, sizeof(*pim)) != 0 &&
      -	     csum_fold(skb_checksum(skb, 0, skb->len, 0))))
      +	pim = igmp_hdr(skb);
      +
      +	if (!mroute_do_pim ||
      +	    skb->len < sizeof(*pim) + sizeof(*encap) ||
      +	    pim->group != PIM_V1_VERSION || pim->code != PIM_V1_REGISTER)
       		goto drop;
      
      -	/* check if the inner packet is destined to mcast group */
       	encap = (struct iphdr *)(skb_transport_header(skb) +
      -				 sizeof(struct pimreghdr));
      +				 sizeof(struct igmphdr));
      +	/*
      +	   Check that:
      +	   a. packet is really destinted to a multicast group
      +	   b. packet is not a NULL-REGISTER
      +	   c. packet is not truncated
      +	 */
       	if (!ipv4_is_multicast(encap->daddr) ||
       	    encap->tot_len == 0 ||
       	    ntohs(encap->tot_len) + sizeof(*pim) > skb->len)
      @@ -40,9 +45,9 @@
       	skb->ip_summed = 0;
       	skb->pkt_type = PACKET_HOST;
       	dst_release(skb->dst);
      +	skb->dst = NULL;
       	reg_dev->stats.rx_bytes += skb->len;
       	reg_dev->stats.rx_packets++;
      -	skb->dst = NULL;
       	nf_reset(skb);
       	netif_rx(skb);
       	dev_put(reg_dev);
      
      $ codiff net/ipv4/ipmr.o.old net/ipv4/ipmr.o.new
      
      net/ipv4/ipmr.c:
        pim_rcv_v1 | -283
        pim_rcv    | -284
       2 functions changed, 567 bytes removed
      
      net/ipv4/ipmr.c:
        __pim_rcv | +307
       1 function changed, 307 bytes added
      
      net/ipv4/ipmr.o.new:
       3 functions changed, 307 bytes added, 567 bytes removed, diff: -260
      
      (Tested on x86_64).
      
      It seems that the pimlen arg could be left out as well, with the
      equal size of the structs trapped by a BUILD_BUG_ON, but I don't
      think that's more than a cosmetic flaw since there aren't that
      many args anyway.
      
      Compile tested.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
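
      For readability, a hedged reconstruction of what the shared helper
      could look like, pieced together only from the common lines visible in
      the diff above (the actual __pim_rcv() in the patch may differ, and
      the steps elided in the diff, such as finding the register device and
      stripping the outer headers, are omitted here as well):

      /* Kernel-context sketch; pimlen is the size of the outer PIM/IGMP
       * header, reg_dev the pimreg device the inner packet is handed to.
       */
      static int pim_rcv_common_sketch(struct sk_buff *skb,
      				 unsigned int pimlen,
      				 struct net_device *reg_dev)
      {
      	struct iphdr *encap;

      	/* the inner packet must be multicast, non-empty, not truncated */
      	encap = (struct iphdr *)(skb_transport_header(skb) + pimlen);
      	if (!ipv4_is_multicast(encap->daddr) ||
      	    encap->tot_len == 0 ||
      	    ntohs(encap->tot_len) + pimlen > skb->len)
      		return -1;                      /* caller drops the skb */

      	/* re-inject the decapsulated packet via the register device */
      	skb->ip_summed = 0;
      	skb->pkt_type = PACKET_HOST;
      	dst_release(skb->dst);
      	skb->dst = NULL;
      	reg_dev->stats.rx_bytes += skb->len;
      	reg_dev->stats.rx_packets++;
      	nf_reset(skb);
      	netif_rx(skb);
      	dev_put(reg_dev);
      	return 0;
      }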
    • tcp: Add GRO support · bf296b12
      Committed by Herbert Xu
      This patch adds the TCP-specific portion of GRO.  The criterion for
      merging is extremely strict (the TCP header must match exactly apart
      from the checksum) so as to allow refragmentation.  Otherwise this
      is pretty much identical to LRO, except that we support the merging
      of ECN packets.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
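
      A hedged illustration of that merge criterion (not the actual
      tcp_gro_receive() code; TCP options handling is omitted): two fixed
      TCP headers match if everything but the checksum is identical and the
      new segment starts exactly where the previous one ended.

      #include <netinet/tcp.h>    /* struct tcphdr (Linux field names) */
      #include <arpa/inet.h>      /* ntohl */
      #include <string.h>

      /* Can the segment with header th2 be merged onto the flow whose
       * previous segment carried header th1 and prev_payload_len bytes of
       * payload?  Everything except the checksum (and, naturally, the
       * sequence number) must match exactly so the aggregate can later be
       * re-segmented without losing information.
       */
      static int tcp_headers_mergeable(const struct tcphdr *th1,
      				 const struct tcphdr *th2,
      				 unsigned int prev_payload_len)
      {
      	struct tcphdr a = *th1, b = *th2;

      	a.check = b.check = 0;          /* checksum may differ */
      	a.seq = b.seq = 0;              /* sequence checked below */
      	if (memcmp(&a, &b, sizeof(a)) != 0)
      		return 0;               /* ports, flags, window, ... differ */

      	/* the new segment must continue where the previous one ended */
      	return ntohl(th2->seq) == ntohl(th1->seq) + prev_payload_len;
      }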
    • ipv4: Add GRO infrastructure · 73cc19f1
      Committed by Herbert Xu
      This patch adds GRO support for IPv4.
      
      The criteria for merging are more stringent than with LRO; in
      particular, we require all fields in the IP header to be identical
      except for the length, ID and checksum.  In addition, the ID must
      form an arithmetic sequence with a difference of one.
      
      The ID requirement might seem overly strict; however, most hardware
      TSO solutions already obey this rule.  Linux itself also obeys it
      whether GSO is in use or not.
      
      In future we could relax this rule by storing the IDs (or rather
      making sure that we don't drop them when pulling the aggregate
      skb's tail).
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
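
      A hedged sketch of those criteria (illustration only, not
      inet_gro_receive(); IP options are not handled): every header field
      must match except tot_len, id and check, and the IDs must increase by
      exactly one.

      #include <netinet/ip.h>     /* struct iphdr */
      #include <arpa/inet.h>      /* ntohs */
      #include <stdint.h>
      #include <string.h>

      /* May the packet with header h2 be merged onto the aggregate whose
       * most recent header is h1?
       */
      static int ipv4_headers_mergeable(const struct iphdr *h1,
      				  const struct iphdr *h2)
      {
      	struct iphdr a = *h1, b = *h2;

      	if (h1->ihl != 5 || h2->ihl != 5)
      		return 0;               /* keep the sketch option-less */

      	/* mask out the fields that are allowed to differ */
      	a.tot_len = b.tot_len = 0;
      	a.id      = b.id      = 0;
      	a.check   = b.check   = 0;
      	if (memcmp(&a, &b, sizeof(a)) != 0)
      		return 0;

      	/* IDs must form an arithmetic sequence with a difference of one */
      	return (uint16_t)(ntohs(h2->id) - ntohs(h1->id)) == 1;
      }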
  16. 15 December 2008 (2 commits)
  17. 09 December 2008 (1 commit)
    • tcp: tcp_vegas cong avoid fix · 8d3a564d
      Committed by Doug Leith
      This patch addresses a book-keeping issue in tcp_vegas.c.  At present
      tcp_vegas does separate book-keeping of cwnd based on packet sequence
      numbers.  A mismatch can develop between this book-keeping and
      tp->snd_cwnd due, for example, to delayed acks acking multiple
      packets.  When vegas transitions to reno operation (e.g. following
      loss), then this mismatch leads to incorrect behaviour (akin to a cwnd
      backoff).  This seems mostly to affect operation at low cwnds where
      delayed acking can lead to a significant fraction of cwnd being
      covered by a single ack, leading to the book-keeping mismatch.  This
      patch modifies the congestion avoidance update to avoid the need for
      separate book-keeping while leaving vegas congestion avoidance
      functionally unchanged.  A secondary advantage of this modification is
      that the use of fixed-point (via V_PARAM_SHIFT) and 64 bit arithmetic
      is no longer necessary, simplifying the code.
      
      Some example test measurements with the patched code (confirming no functional
      change in the congestion avoidance algorithm) can be seen at:
      
      http://www.hamilton.ie/doug/vegaspatch/
      Signed-off-by: Doug Leith <doug.leith@nuim.ie>
      Signed-off-by: David S. Miller <davem@davemloft.net>
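
      For context, the textbook Vegas congestion-avoidance rule that this
      book-keeping feeds into (a plain-integer sketch, not the patched
      tcp_vegas.c; the alpha and beta values are illustrative):

      /* Estimate how many extra packets the connection keeps queued in the
       * network and steer cwnd so that this stays between alpha and beta:
       *   expected = cwnd / base_rtt, actual = cwnd / rtt,
       *   diff_in_packets = (expected - actual) * base_rtt
       *                   = cwnd * (rtt - base_rtt) / rtt
       */
      static unsigned int vegas_next_cwnd(unsigned int cwnd,
      				    unsigned int base_rtt_us, /* min RTT */
      				    unsigned int rtt_us)      /* last RTT */
      {
      	const unsigned int alpha = 2, beta = 4;  /* illustrative */
      	unsigned int diff;

      	if (rtt_us <= base_rtt_us)
      		return cwnd + 1;        /* no queueing seen: probe upwards */

      	diff = cwnd * (rtt_us - base_rtt_us) / rtt_us;

      	if (diff > beta)
      		return cwnd > 2 ? cwnd - 1 : 2;  /* queue building: back off */
      	if (diff < alpha)
      		return cwnd + 1;                 /* spare capacity: grow */
      	return cwnd;                             /* in band: hold */
      }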
  18. 06 December 2008 (3 commits)