1. 26 Dec 2008, 1 commit
2. 19 Dec 2008, 1 commit
3. 16 Dec 2008, 3 commits
• ipmr: merge common code · b1879204
  Authored by Ilpo Järvinen
Also removes the redundant skb->len < x check, which can't
be true once pskb_may_pull(skb, x) has succeeded.
      
      $ diff-funcs pim_rcv ipmr.c ipmr.c pim_rcv_v1
        --- ipmr.c:pim_rcv()
        +++ ipmr.c:pim_rcv_v1()
      @@ -1,22 +1,27 @@
      -static int pim_rcv(struct sk_buff * skb)
      +int pim_rcv_v1(struct sk_buff * skb)
       {
      -	struct pimreghdr *pim;
      +	struct igmphdr *pim;
       	struct iphdr   *encap;
       	struct net_device  *reg_dev = NULL;
      
       	if (!pskb_may_pull(skb, sizeof(*pim) + sizeof(*encap)))
       		goto drop;
      
      -	pim = (struct pimreghdr *)skb_transport_header(skb);
      -	if (pim->type != ((PIM_VERSION<<4)|(PIM_REGISTER)) ||
      -	    (pim->flags&PIM_NULL_REGISTER) ||
      -	    (ip_compute_csum((void *)pim, sizeof(*pim)) != 0 &&
      -	     csum_fold(skb_checksum(skb, 0, skb->len, 0))))
      +	pim = igmp_hdr(skb);
      +
      +	if (!mroute_do_pim ||
      +	    skb->len < sizeof(*pim) + sizeof(*encap) ||
      +	    pim->group != PIM_V1_VERSION || pim->code != PIM_V1_REGISTER)
       		goto drop;
      
      -	/* check if the inner packet is destined to mcast group */
       	encap = (struct iphdr *)(skb_transport_header(skb) +
      -				 sizeof(struct pimreghdr));
      +				 sizeof(struct igmphdr));
      +	/*
      +	   Check that:
      +	   a. packet is really destinted to a multicast group
      +	   b. packet is not a NULL-REGISTER
      +	   c. packet is not truncated
      +	 */
       	if (!ipv4_is_multicast(encap->daddr) ||
       	    encap->tot_len == 0 ||
       	    ntohs(encap->tot_len) + sizeof(*pim) > skb->len)
      @@ -40,9 +45,9 @@
       	skb->ip_summed = 0;
       	skb->pkt_type = PACKET_HOST;
       	dst_release(skb->dst);
      +	skb->dst = NULL;
       	reg_dev->stats.rx_bytes += skb->len;
       	reg_dev->stats.rx_packets++;
      -	skb->dst = NULL;
       	nf_reset(skb);
       	netif_rx(skb);
       	dev_put(reg_dev);
      
      $ codiff net/ipv4/ipmr.o.old net/ipv4/ipmr.o.new
      
      net/ipv4/ipmr.c:
        pim_rcv_v1 | -283
        pim_rcv    | -284
       2 functions changed, 567 bytes removed
      
      net/ipv4/ipmr.c:
        __pim_rcv | +307
       1 function changed, 307 bytes added
      
      net/ipv4/ipmr.o.new:
       3 functions changed, 307 bytes added, 567 bytes removed, diff: -260
      
      (Tested on x86_64).
      
It seems that the pimlen arg could be left out as well, with the
equal-sizedness of the structs trapped with BUILD_BUG_ON, but I
don't think that's more than a cosmetic flaw since there aren't
that many args anyway.
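
For orientation, here is a hedged sketch of the shared helper's plausible shape (the name __pim_rcv comes from the codiff output above; the body is illustrative, not the literal patch):

  /* Hypothetical reconstruction: each caller validates its own PIM
   * header and passes its size as pimlen, then the common tail runs. */
  static int __pim_rcv(struct sk_buff *skb, unsigned int pimlen)
  {
  	struct net_device *reg_dev = NULL;
  	struct iphdr *encap;
  
  	encap = (struct iphdr *)(skb_transport_header(skb) + pimlen);
  	/*
  	   Check that:
  	   a. packet is really destined to a multicast group
  	   b. packet is not truncated
  	 */
  	if (!ipv4_is_multicast(encap->daddr) ||
  	    encap->tot_len == 0 ||
  	    ntohs(encap->tot_len) + pimlen > skb->len)
  		return 1;
  
  	/* ... look up reg_dev, reset skb state, bump rx stats and
  	 * netif_rx() the decapsulated packet, as in the diff above ... */
  	return 0;
  }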
      
      Compile tested.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
• tcp: Add GRO support · bf296b12
  Authored by Herbert Xu
      This patch adds the TCP-specific portion of GRO.  The criterion for
      merging is extremely strict (the TCP header must match exactly apart
      from the checksum) so as to allow refragmentation.  Otherwise this
      is pretty much identical to LRO, except that we support the merging
      of ECN packets.
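
To illustrate how strict that criterion is, a minimal sketch of the header-match test (a hypothetical helper; the real tcp_gro_receive tracks flags in a more refined way):

  /* Headers must be bytewise identical once the fields that may
   * legitimately differ -- checksum and sequence number -- are masked;
   * th is the held packet's header, th2 the newly arrived one. */
  static bool tcp_gro_hdrs_match(const struct tcphdr *th,
  				 const struct tcphdr *th2)
  {
  	struct tcphdr a = *th, b = *th2;
  
  	if (th->doff != th2->doff)
  		return false;		/* differing option length */
  	a.check = b.check = 0;		/* recomputed after merging */
  	a.seq = b.seq = 0;		/* consecutiveness checked separately */
  	/* Option bytes beyond the fixed header would need the same
  	 * bytewise comparison when doff > 5; omitted for brevity. */
  	return memcmp(&a, &b, sizeof(a)) == 0;
  }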
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
• ipv4: Add GRO infrastructure · 73cc19f1
  Authored by Herbert Xu
      This patch adds GRO support for IPv4.
      
The criteria for merging are more stringent than LRO's; in particular,
      we require all fields in the IP header to be identical except for
      the length, ID and checksum.  In addition, the ID must form an
      arithmetic sequence with a difference of one.
      
The ID requirement might seem overly strict; however, most hardware
TSO solutions already obey this rule.  Linux itself also obeys it
whether GSO is in use or not.
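
A hedged sketch of these criteria (an illustrative helper, not the literal inet_gro_receive code):

  /* iph: header of the held aggregate, iph2: the new packet,
   * count: segments merged so far. */
  static bool inet_gro_mergeable(const struct iphdr *iph,
  				 const struct iphdr *iph2, u16 count)
  {
  	/* All fields except tot_len, id and check must be identical. */
  	if (iph->ihl != iph2->ihl || iph->tos != iph2->tos ||
  	    iph->ttl != iph2->ttl || iph->protocol != iph2->protocol ||
  	    iph->frag_off != iph2->frag_off ||
  	    iph->saddr != iph2->saddr || iph->daddr != iph2->daddr)
  		return false;
  
  	/* IDs must form an arithmetic sequence with difference one. */
  	return ntohs(iph2->id) == (u16)(ntohs(iph->id) + count);
  }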
      
      In future we could relax this rule by storing the IDs (or rather
      making sure that we don't drop them when pulling the aggregate
      skb's tail).
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
4. 15 Dec 2008, 2 commits
5. 09 Dec 2008, 1 commit
• tcp: tcp_vegas cong avoid fix · 8d3a564d
  Authored by Doug Leith
      This patch addresses a book-keeping issue in tcp_vegas.c.  At present
      tcp_vegas does separate book-keeping of cwnd based on packet sequence
      numbers.  A mismatch can develop between this book-keeping and
      tp->snd_cwnd due, for example, to delayed acks acking multiple
      packets.  When vegas transitions to reno operation (e.g. following
loss), this mismatch leads to incorrect behaviour (akin to a cwnd
      backoff).  This seems mostly to affect operation at low cwnds where
      delayed acking can lead to a significant fraction of cwnd being
      covered by a single ack, leading to the book-keeping mismatch.  This
      patch modifies the congestion avoidance update to avoid the need for
      separate book-keeping while leaving vegas congestion avoidance
      functionally unchanged.  A secondary advantage of this modification is
      that the use of fixed-point (via V_PARAM_SHIFT) and 64 bit arithmetic
      is no longer necessary, simplifying the code.
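
A simplified sketch of the resulting congestion-avoidance step (hedged: names follow tcp_vegas.c and alpha/beta are the usual Vegas thresholds in packets, but this is an illustration, not the verbatim patch):

  /* The backlog estimate is computed directly from tp->snd_cwnd, so no
   * parallel sequence-number-based cwnd copy, V_PARAM_SHIFT fixed-point
   * or 64-bit arithmetic is needed. */
  static void vegas_cong_avoid_sketch(struct tcp_sock *tp,
  				      struct vegas *vegas)
  {
  	const u32 alpha = 2, beta = 4;	/* target queue occupancy range */
  	u32 rtt = vegas->minRTT;	/* smallest RTT seen this round */
  	/* diff ~ packets sitting in the bottleneck queue */
  	u32 diff = tp->snd_cwnd * (rtt - vegas->baseRTT) / vegas->baseRTT;
  
  	if (diff > beta)
  		tp->snd_cwnd--;		/* queue building up: back off */
  	else if (diff < alpha)
  		tp->snd_cwnd++;		/* spare capacity: probe upward */
  	/* between alpha and beta: sending just as fast as we should */
  }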
      
      Some example test measurements with the patched code (confirming no functional
      change in the congestion avoidance algorithm) can be seen at:
      
http://www.hamilton.ie/doug/vegaspatch/

Signed-off-by: Doug Leith <doug.leith@nuim.ie>
Signed-off-by: David S. Miller <davem@davemloft.net>
6. 06 Dec 2008, 10 commits
7. 05 Dec 2008, 1 commit
• tcp: tcp_vegas ssthresh bug fix · a6af2d6b
  Authored by Doug Leith
      This patch fixes a bug in tcp_vegas.c.  At the moment this code leaves
      ssthresh untouched.  However, this means that the vegas congestion
      control algorithm is effectively unable to reduce cwnd below the
      ssthresh value (if the vegas update lowers the cwnd below ssthresh,
      then slow start is activated to raise it back up).  One example where
      this matters is when during slow start cwnd overshoots the link
      capacity and a flow then exits slow start with ssthresh set to a value
      above where congestion avoidance would like to adjust it.
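
The essence of the fix, as a hedged sketch (the actual patch's placement and naming may differ): whenever Vegas reduces cwnd, pull ssthresh down with it so slow start cannot immediately undo the reduction:

  /* Keep ssthresh strictly below the reduced cwnd; otherwise the
   * "cwnd < ssthresh => slow start" rule raises cwnd right back up. */
  static inline u32 tcp_vegas_ssthresh(struct tcp_sock *tp)
  {
  	return min(tp->snd_ssthresh, tp->snd_cwnd - 1);
  }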
Signed-off-by: Doug Leith <doug.leith@nuim.ie>
Signed-off-by: David S. Miller <davem@davemloft.net>
8. 04 Dec 2008, 3 commits
• net: /proc/net/ip_mr_cache, display Iif as a signed short · 999890b2
  Authored by Benjamin Thery
      Today, iproute2 fails to show multicast forwarding unresolved cache
      entries while scanning /proc/net/ip_mr_cache.
      
Indeed, it expects to see -1 in the 'Iif' column to identify unresolved
entries, but the kernel outputs 65535. It's a signed/unsigned issue:
      
      'Iif', the source interface, is retrieved from member mfc_parent in
      struct mfc_cache. mfc_parent is a vifi_t: unsigned short, but is
      displayed in ipmr_mfc_seq_show() as "%-3d", signed integer.
      
In unresolved entries, the 65535 value (0xFFFF) comes from this define:
      #define ALL_VIFS    ((vifi_t)(-1))
      
That may explain why the guy who added support for this in iproute2
thought a -1 should be expected.
      
      I don't know if this must be fixed in kernel or in iproute2. Who is
      right? What is the correct API? How was it designed originally?
      
I let you decide if it should go in the kernel or be fixed in iproute2.
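
For reference, the kernel-side option would be roughly a one-line format change in ipmr_mfc_seq_show(); a hedged sketch (the surrounding columns are abbreviated):

  /* Print mfc_parent via a signed 16-bit conversion so the ALL_VIFS
   * value 0xFFFF is rendered as the -1 that iproute2 expects. */
  seq_printf(seq, "%08lX %08lX %-3hd",
  	   (unsigned long)mfc->mfc_mcastgrp,
  	   (unsigned long)mfc->mfc_origin,
  	   mfc->mfc_parent);	/* %hd: 65535 prints as -1 */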
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
• net: fix /proc/net/ip_mr_cache display - V2 · 1ea472e2
  Authored by Benjamin Thery
      /proc/net/ip_mr_cache and /proc/net/ip6_mr_cache displays garbage when
      showing unresolved mfc_cache entries.
      
      [root@qemu tests]# cat /proc/net/ip_mr_cache
      Group    Origin   Iif     Pkts    Bytes    Wrong Oifs
      014C00EF 010014AC 1         10    10050        0  2:1    3:1
      024C00EF 010014AC 65535      514        2 -559067475
      
      The first line is correct. It is a resolved cache entry, 10 packets used it...
      The second line represents an unresolved entry, and the columns Pkts(4th),
      Bytes(5th) and Wrong(6th) just show garbage.
      
In struct mfc_cache, there's a union to store data for the resolved and
unresolved cases. And what ipmr_mfc_seq_show() prints in these
columns for the unresolved entries is some bytes from mfc_cache.mfc_un.res.
Bad.
(e.g. in our case, -559067475 is in fact 0xdead4ead, which is the spinlock
magic from mfc_cache.mfc_un.unres.unresolved.lock.magic).
      
This patch replaces the garbage data written in these columns for the
unresolved entries with zeros, which is more correct.
This change doesn't break the ABI.
      
      Also, mfc->mfc_un.res.pkt, mfc->mfc_un.res.bytes, mfc->mfc_un.res.wrong_if
      are unsigned long.
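
A hedged sketch of the resulting logic in ipmr_mfc_seq_show() (it follows the description above rather than quoting the patch):

  if (it->cache != &mfc_unres_queue) {
  	/* Resolved entry: the union really holds the counters. */
  	seq_printf(seq, " %8lu %8lu %8lu",
  		   mfc->mfc_un.res.pkt,
  		   mfc->mfc_un.res.bytes,
  		   mfc->mfc_un.res.wrong_if);
  	/* ... Oifs columns follow for resolved entries ... */
  } else {
  	/* Unresolved entry: the union holds queue/lock state, so print
  	 * zeros to keep the column layout (and thus the ABI) intact. */
  	seq_printf(seq, " %8lu %8lu %8lu", 0ul, 0ul, 0ul);
  }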
      
      It applies on top of net-next-2.6.
      
      The patch for net-2.6 is slightly different because of the NIP6_FMT to
      %pI6 conversion that was made in the seq_printf.
      
      Changelog:
      ==========
      V2:
      * Instead of breaking the ABI by suppressing the columns that have no
        meaning for unresolved entries, fill them with 0 values.
Signed-off-by: Benjamin Thery <benjamin.thery@bull.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
• tcp: make urg+gso work for real this time · f8269a49
  Authored by Ilpo Järvinen
I should have noticed this earlier... :-) The previous solution
to URG+GSO/TSO will cause SACK-block tcp_fragment to do zig-zag
patterns, or even worse, a steep downward slope into packet
counting, because each skb's pcount would be truncated to a pcount
of 2 and then the following fragments of the later portion would
restore the window again.
      
Basically this reverts "tcp: Do not use TSO/GSO when there is
urgent data" (33cf71ce). It also removes some unnecessary code
from tcp_current_mss that didn't work as intended either (could
be that something was changed down the road, or it might have
been broken since the dawn of time) because it only works once
urg is already written, while this bug shows up starting from
~64k before the urg point.
      
The retransmissions are already split into mss-sized chunks, so
only the new-data sending paths need splitting in case they have
a segment otherwise suitable for gso/tso. The actual check
could be made narrower, but since this is late -rc already,
I'll postpone the more fine-grained thinking.
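
A hedged sketch of what that split looks like in the new-data path (shaped after tcp_write_xmit(); the committed check may differ in detail):

  	/* While urgent mode is active, keep the send limit at one MSS so
  	 * an otherwise TSO-suitable skb is split by tso_fragment();
  	 * outside urgent mode the usual TSO split point applies. */
  	limit = mss_now;
  	if (tso_segs > 1 && !tcp_urg_mode(tp))
  		limit = tcp_mss_split_point(sk, skb, mss_now, cwnd_quota);
  
  	if (skb->len > limit &&
  	    unlikely(tso_fragment(sk, skb, limit, mss_now)))
  		break;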
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
9. 02 Dec 2008, 1 commit
10. 26 Nov 2008, 11 commits
11. 25 Nov 2008, 6 commits
• netfilter: nfmark routing in OUTPUT, mangle, NFQUEUE · 5f145e44
  Authored by Eric Leblond
This patch lets nfmark be evaluated in the routing decision for OUTPUT
packets, in the mangle table, when a packet is processed via NFQUEUE.
Until now, only changes (during NFQUEUE processing) to the fields
src_addr, dest_addr and tos could make netfilter re-evaluate the routing.
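
A hedged sketch of the reroute test after a queue verdict (shaped after the ip_rt_info bookkeeping in net/ipv4/netfilter.c; the saved-mark comparison is the essence of the change, the rest is illustrative):

  struct ip_rt_info {
  	__be32 daddr;
  	__be32 saddr;
  	u_int8_t tos;
  	u_int32_t mark;		/* saved nfmark -- new with this patch */
  };
  
  static int ip_reroute_sketch(struct sk_buff *skb,
  			     const struct ip_rt_info *rt_info)
  {
  	const struct iphdr *iph = ip_hdr(skb);
  
  	/* Re-route if any routing-relevant field changed while the
  	 * packet sat in the queue -- now including the nfmark. */
  	if (iph->daddr != rt_info->daddr ||
  	    iph->saddr != rt_info->saddr ||
  	    iph->tos != rt_info->tos ||
  	    skb->mark != rt_info->mark)
  		return ip_route_me_harder(skb, RTN_UNSPEC);
  	return 0;
  }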
      
      From: Laurent Licour <laurent@licour.com>
Signed-off-by: Eric Leblond <eric@inl.fr>
Signed-off-by: Patrick McHardy <kaber@trash.net>
• fb7e0674
• ah4/ah6: remove useless NULL assignments · 6daad372
  Authored by Alexey Dobriyan
      struct will be kfreed in a moment, so...
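
The pattern in question, as a hedged illustration (member names follow struct ah_data; the exact hunks may differ):

  	crypto_free_hash(ahp->tfm);
  	ahp->tfm = NULL;	/* dead store: ahp is freed just below */
  	kfree(ahp->work_icv);
  	ahp->work_icv = NULL;	/* likewise a dead store */
  	kfree(ahp);		/* the whole struct goes away here */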
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
• 111cc8b9
• tcp: Make shifting not clear the hints · 92ee76b6
  Authored by Ilpo Järvinen
The earlier version was just a very basic one which "played
safe" by always clearing the hints. However, clearing a hint
is an extremely costly operation with large windows, so it must
be avoided at all cost whenever possible; with shifting there
is a way to achieve not-clearing.
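
A hedged sketch of the idea (close in spirit to what tcp_shifted_skb() does, not quoted verbatim): when skb's data has been shifted into prev and skb is about to be freed, repoint the hints that referenced it instead of clearing them all:

  	/* The SACKed data now lives in prev, so any hint pointing at skb
  	 * can follow it; clearing would force an O(window) rescan later. */
  	if (skb == tp->scoreboard_skb_hint)
  		tp->scoreboard_skb_hint = prev;
  	if (skb == tp->lost_skb_hint) {
  		tp->lost_skb_hint = prev;
  		tp->lost_cnt_hint -= tcp_skb_pcount(prev);
  	}
  	if (skb == tp->retransmit_skb_hint)
  		tp->retransmit_skb_hint = prev;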
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
• tcp: Try to restore large SKBs while SACK processing · 832d11c5
  Authored by Ilpo Järvinen
During SACK processing, most of the benefits of TSO are eaten by
the SACK blocks that one-by-one fragment SKBs to MSS-sized chunks.
Then we're in trouble when the cleanup work for them has to be done
when a large cumulative ACK arrives. Try to return to the pre-split
state already while more and more SACK info gets discovered, by
combining a newly discovered SACK area with the previous skb if
that one is SACKed as well (see the sketch after the list below).
      
      This approach has a number of benefits:
      
1) The processing overhead is spread more equally over the RTT
2) Write queue has fewer skbs to process (affects everything
   which has to walk the queue past the sacked areas)
3) Write queue is consistent the whole time, so no other part
   of TCP has to be aware of this (this was not the case with
   some other approach that was, well, quite intrusive all
   around).
4) Clean_rtx_queue can release most of the pages using a single
   put_page instead of the previous PAGE_SIZE/mss+1 calls
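
Here is the promised sketch of the central merge step (hedged: simplified from what tcp_shifted_skb()/skb_shift() do; the real code also handles partial shifts plus hint and counter updates):

  /* Move skb's paged data into the previous, already-SACKed skb and
   * retire skb from the write queue; prev and skb are adjacent. */
  static bool try_coalesce_sacked(struct sock *sk, struct sk_buff *prev,
  				  struct sk_buff *skb)
  {
  	int pcount = tcp_skb_pcount(skb);
  
  	if (!skb_shift(prev, skb, skb->len))
  		return false;		/* frags could not be moved */
  
  	/* prev now covers skb's sequence range as well. */
  	TCP_SKB_CB(prev)->end_seq = TCP_SKB_CB(skb)->end_seq;
  	skb_shinfo(prev)->gso_segs += pcount;
  
  	/* skb is now empty: unlink and free it. */
  	tcp_unlink_write_queue(skb, sk);
  	sk_wmem_free_skb(sk, skb);
  	return true;
  }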
      
In case a hole is fully filled by the new SACK block, we attempt
to combine the next skb too, which allows construction of skbs
that are even larger than what TSO split them to, and it handles
the hole-on-every-nth-skb patterns that often occur during slow
start overshoot pretty nicely. For this to be really useful, though,
a retransmission would have to get lost, since cumulative ACKs
advance one hole at a time in the most typical case.
      
TODO: handle upwards-only merging. That should be rather easy
when a segment is fully SACKed, but I'm leaving that as a future
work item (it won't make a very large difference anyway since
the current approach already covers quite a lot of normal
cases).
      
I was earlier thinking of some sophisticated way of tracking
timestamps of the first and the last segment but later on
realized that it won't be necessary at all to store the
timestamp of the last segment. The cases that can occur are
basically either:
  1) ambiguous => no sensible measurement can be taken anyway
  2) non-ambiguous is due to reordering => having the timestamp
     of the last segment there just skews things further off
     rather than doing any good, since the ack got triggered by
     one of the holes (besides some subtle issues that would make
     determining the right hole/skb an even harder problem). Anyway,
     it has nothing to do with this change then.
      
I chose to route some abnormal-looking cases through goto noop;
some could be handled differently (e.g., by stopping the
walk at that skb, but again). In general, they either
shouldn't happen at all or are rare enough to make no difference
in practice.
      
In theory this change (as a whole) could cause some macroscale
(global) regression because of cache misses that are taken over
the round-trip time, but it very likely comes out ahead because of
far fewer (local) cache misses for the other write queue walkers and
for the big recovery-clearing cumulative ack.
      
Worth noting that these benefits would be very easy to get also
without TSO/GSO being on, as long as the data is in pages so that
we can merge them. Currently I won't let that happen, because
DSACK splitting at a fragment would mess up pcounts due to
sk_can_gso in tcp_set_skb_tso_segs. Once DSACK fragments get
avoided, some conditions can be made less strict.
      
      TODO: I will probably have to convert the excessive pointer
      passing to struct sacktag_state... :-)
      
My testing revealed that a considerable number of skbs couldn't
be shifted because they were cloned (most likely still awaiting
tx reclaim)...
      
[The rest is left as future work instead, since I got
repeatable EFAULTs from tcpdump's recvfrom when I added
pskb_expand_head to deal with clones, so I separated that
into another, later patch]
      
...To counter that, I gave up on the fifth advantage:

5) When growing the previous SACK block, fewer allocs for new skbs
   are done; basically a new alloc is needed only when a new hole
   is detected and when the previous skb runs out of frags space

...which now only happens if reclaim is fast enough to dispose of
the clone before the SACK block comes in (the window is RTT long),
otherwise we'll have to alloc some.
      
      With clones being handled I got these numbers (will be somewhat
      worse without that), taken with fine-grained mibs:
      
                        TCPSackShifted 398
                         TCPSackMerged 877
                  TCPSackShiftFallback 320
            TCPSACKCOLLAPSEFALLBACKGSO 0
        TCPSACKCOLLAPSEFALLBACKSKBBITS 0
        TCPSACKCOLLAPSEFALLBACKSKBDATA 0
          TCPSACKCOLLAPSEFALLBACKBELOW 0
          TCPSACKCOLLAPSEFALLBACKFIRST 1
       TCPSACKCOLLAPSEFALLBACKPREVBITS 318
            TCPSACKCOLLAPSEFALLBACKMSS 1
         TCPSACKCOLLAPSEFALLBACKNOHEAD 0
          TCPSACKCOLLAPSEFALLBACKSHIFT 0
                TCPSACKCOLLAPSENOOPSEQ 0
        TCPSACKCOLLAPSENOOPSMALLPCOUNT 0
           TCPSACKCOLLAPSENOOPSMALLLEN 0
                   TCPSACKCOLLAPSEHOLE 12
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>