1. 26 11月, 2008 32 次提交
  2. 25 11月, 2008 6 次提交
    • J
      DCB: fix kconfig option · 7a6b6f51
      Jeff Kirsher 提交于
      Since the netlink option for DCB is necessary to actually be useful,
      simplified the Kconfig option.  In addition, added useful help text for the
      Kconfig option.
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7a6b6f51
    • S
      netdev: add HAVE_NET_DEVICE_OPS · 47fd5b83
      Stephen Hemminger 提交于
      As a concession to vendors who have to deal with one source for different
      kernel versions, add a HAVE_NET_DEVICE_OPS so they don't end up hard
      coding ifdef against kernel version.
      Signed-off-by: NStephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      47fd5b83
    • I
      111cc8b9
    • I
      tcp: Try to restore large SKBs while SACK processing · 832d11c5
      Ilpo Järvinen 提交于
      During SACK processing, most of the benefits of TSO are eaten by
      the SACK blocks that one-by-one fragment SKBs to MSS sized chunks.
      Then we're in problems when cleanup work for them has to be done
      when a large cumulative ACK comes. Try to return back to pre-split
      state already while more and more SACK info gets discovered by
      combining newly discovered SACK areas with the previous skb if
      that's SACKed as well.
      
      This approach has a number of benefits:
      
      1) The processing overhead is spread more equally over the RTT
      2) Write queue has less skbs to process (affect everything
         which has to walk in the queue past the sacked areas)
      3) Write queue is consistent whole the time, so no other parts
         of TCP has to be aware of this (this was not the case with
         some other approach that was, well, quite intrusive all
         around).
      4) Clean_rtx_queue can release most of the pages using single
         put_page instead of previous PAGE_SIZE/mss+1 calls
      
      In case a hole is fully filled by the new SACK block, we attempt
      to combine the next skb too which allows construction of skbs
      that are even larger than what tso split them to and it handles
      hole per on every nth patterns that often occur during slow start
      overshoot pretty nicely. Though this to be really useful also
      a retransmission would have to get lost since cumulative ACKs
      advance one hole at a time in the most typical case.
      
      TODO: handle upwards only merging. That should be rather easy
      when segment is fully sacked but I'm leaving that as future
      work item (it won't make very large difference anyway since
      this current approach already covers quite a lot of normal
      cases).
      
      I was earlier thinking of some sophisticated way of tracking
      timestamps of the first and the last segment but later on
      realized that it won't be that necessary at all to store the
      timestamp of the last segment. The cases that can occur are
      basically either:
        1) ambiguous => no sensible measurement can be taken anyway
        2) non-ambiguous is due to reordering => having the timestamp
           of the last segment there is just skewing things more off
           than does some good since the ack got triggered by one of
           the holes (besides some substle issues that would make
           determining right hole/skb even harder problem). Anyway,
           it has nothing to do with this change then.
      
      I choose to route some abnormal looking cases with goto noop,
      some could be handled differently (eg., by stopping the
      walking at that skb but again). In general, they either
      shouldn't happen at all or are rare enough to make no difference
      in practice.
      
      In theory this change (as whole) could cause some macroscale
      regression (global) because of cache misses that are taken over
      the round-trip time but it gets very likely better because of much
      less (local) cache misses per other write queue walkers and the
      big recovery clearing cumulative ack.
      
      Worth to note that these benefits would be very easy to get also
      without TSO/GSO being on as long as the data is in pages so that
      we can merge them. Currently I won't let that happen because
      DSACK splitting at fragment that would mess up pcounts due to
      sk_can_gso in tcp_set_skb_tso_segs. Once DSACKs fragments gets
      avoided, we have some conditions that can be made less strict.
      
      TODO: I will probably have to convert the excessive pointer
      passing to struct sacktag_state... :-)
      
      My testing revealed that considerable amount of skbs couldn't
      be shifted because they were cloned (most likely still awaiting
      tx reclaim)...
      
      [The rest is considering future work instead since I got
      repeatably EFAULT to tcpdump's recvfrom when I added
      pskb_expand_head to deal with clones, so I separated that
      into another, later patch]
      
      ...To counter that, I gave up on the fifth advantage:
      
      5) When growing previous SACK block, less allocs for new skbs
         are done, basically a new alloc is needed only when new hole
         is detected and when the previous skb runs out of frags space
      
      ...which now only happens of if reclaim is fast enough to dispose
      the clone before the SACK block comes in (the window is RTT long),
      otherwise we'll have to alloc some.
      
      With clones being handled I got these numbers (will be somewhat
      worse without that), taken with fine-grained mibs:
      
                        TCPSackShifted 398
                         TCPSackMerged 877
                  TCPSackShiftFallback 320
            TCPSACKCOLLAPSEFALLBACKGSO 0
        TCPSACKCOLLAPSEFALLBACKSKBBITS 0
        TCPSACKCOLLAPSEFALLBACKSKBDATA 0
          TCPSACKCOLLAPSEFALLBACKBELOW 0
          TCPSACKCOLLAPSEFALLBACKFIRST 1
       TCPSACKCOLLAPSEFALLBACKPREVBITS 318
            TCPSACKCOLLAPSEFALLBACKMSS 1
         TCPSACKCOLLAPSEFALLBACKNOHEAD 0
          TCPSACKCOLLAPSEFALLBACKSHIFT 0
                TCPSACKCOLLAPSENOOPSEQ 0
        TCPSACKCOLLAPSENOOPSMALLPCOUNT 0
           TCPSACKCOLLAPSENOOPSMALLLEN 0
                   TCPSACKCOLLAPSEHOLE 12
      Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      832d11c5
    • I
      e1aa680f
    • E
      net: avoid a pair of dst_hold()/dst_release() in ip_append_data() · 2e77d89b
      Eric Dumazet 提交于
      We can reduce pressure on dst entry refcount that slowdown UDP transmit
      path on SMP machines. This pressure is visible on RTP servers when
      delivering content to mediagateways, especially big ones, handling
      thousand of streams. Several cpus send UDP frames to the same
      destination, hence use the same dst entry.
      
      This patch makes ip_append_data() eventually steal the refcount its
      callers had to take on the dst entry.
      
      This doesnt avoid all refcounting, but still gives speedups on SMP,
      on UDP/RAW transmit path
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2e77d89b
  3. 24 11月, 2008 2 次提交
    • E
      eth: Declare an optimized compare_ether_addr_64bits() function · 1f87e235
      Eric Dumazet 提交于
      Linus mentioned we could try to perform long word operations, even
      on potentially unaligned addresses, on x86 at least. David mentioned
      the HAVE_EFFICIENT_UNALIGNED_ACCESS test to handle this on all
      arches that have efficient unailgned accesses.
      
      I tried this idea and got nice assembly on 32 bits:
      
      158:   33 82 38 01 00 00       xor    0x138(%edx),%eax
      15e:   33 8a 34 01 00 00       xor    0x134(%edx),%ecx
      164:   c1 e0 10                shl    $0x10,%eax
      167:   09 c1                   or     %eax,%ecx
      169:   74 0b                   je     176 <eth_type_trans+0x87>
      
      And very nice assembly on 64 bits of course (one xor, one shl)
      
      Nice oprofile improvement in eth_type_trans(), 0.17 % instead of 0.41 %,
      expected since we remove 8 instructions on a fast path.
      
      This patch implements a compare_ether_addr_64bits() function, that
      uses the CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS ifdef to efficiently
      perform the 6 bytes comparison on all capable arches.
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1f87e235
    • E
      net: Convert TCP/DCCP listening hash tables to use RCU · c25eb3bf
      Eric Dumazet 提交于
      This is the last step to be able to perform full RCU lookups
      in __inet_lookup() : After established/timewait tables, we
      add RCU lookups to listening hash table.
      
      The only trick here is that a socket of a given type (TCP ipv4,
      TCP ipv6, ...) can now flight between two different tables
      (established and listening) during a RCU grace period, so we
      must use different 'nulls' end-of-chain values for two tables.
      
      We define a large value :
      
      #define LISTENING_NULLS_BASE (1U << 29)
      
      So that slots in listening table are guaranteed to have different
      end-of-chain values than slots in established table. A reader can
      still detect it finished its lookup in the right chain.
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c25eb3bf