1. 30 Jan, 2013 8 commits
    • Merge branch 'ipfrags' · 5a1dc317
      David S. Miller committed
      Jesper Dangaard Brouer says:
      
      ====================
      This patchset is V2, with some trivial code fixes, which were noticed
      by DaveM. It is still a partly respin of my fragmentation optimization
      patches: http://thread.gmane.org/gmane.linux.network/250914
      
      This is not the complete patchset, from the gmane link above. In this
      patchset, I primarily focus on adjusting cacheline for better SMP/NUMA
      performance.
      
      Once this patchset has been agreed upon, I will continue and respin
      the rest of my patches.
      
      This time around, I have created a frag DoS generator, via the tool
      trafgen (http://netsniff-ng.org/), to create a stable DoS scenario
      (no longer relying on frame dropping due to disabled flow-control).
      
      Two 10G interfaces are under test, and use Ethernet flow-control.  A
      third interface is used for generating the DoS attack (this interface
      is also 10G, but it does not need to be, as 500Kpps DoS is enough).
      
      Test types summary (netperf):
       Test-20G64K     == 2x10G with 65K fragments
       Test-20G3F      == 2x10G with 3x fragments (3*1472 bytes)
       Test-20G64K+DoS == Same as 20G64K with frag DoS
       Test-20G3F+DoS  == Same as 20G3F  with frag DoS
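
To make the two payload sizes concrete, here is a minimal sketch of how many IPv4 fragments each test datagram produces. It assumes a 1500-byte Ethernet MTU and a 20-byte IPv4 header without options, neither of which is stated in the cover letter; the helper name `udp_frag_count` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define IPV4_HDR ((size_t)20) /* assumption: IPv4 header without options */
#define UDP_HDR  ((size_t)8)

/* Number of IPv4 fragments needed to carry one UDP datagram. */
static size_t udp_frag_count(size_t udp_payload, size_t mtu)
{
    assert(mtu >= IPV4_HDR + 8);

    /* Every fragment except the last carries a multiple of 8 bytes. */
    size_t per_frag = (mtu - IPV4_HDR) & ~(size_t)7;

    /* The UDP header travels inside the fragmented IP payload. */
    size_t ip_payload = udp_payload + UDP_HDR;

    return (ip_payload + per_frag - 1) / per_frag;
}
```

Under those assumptions, `udp_frag_count(65507, 1500)` gives 45 fragments for the 64K tests and `udp_frag_count(3 * 1472, 1500)` gives 3 for the 3F tests.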
      
      Patch list:
       Patch-01 - net: cacheline adjust struct netns_frags for better frag performance
       Patch-02 - net: cacheline adjust struct inet_frags for better frag performance
       Patch-03 - net: cacheline adjust struct inet_frag_queue
       Patch-04 - net: frag helper functions for mem limit tracking
       Patch-05 - net: use lib/percpu_counter API for fragmentation mem accounting
       Patch-06 - net: frag, move LRU list maintenance outside of rwlock
      
      Performance table summary (all values in Mbit/s):
      
       Test-type:  Test-20G64K  Test-20G3F  20G64K+DoS  20G3F+DoS
       ----------  -----------  ----------  ----------  ---------
        net-next:      15114.5     8954.21     2444.28    3918.01
        Patch-01:      16075.8     8976.18     2621.49    4072.79
        Patch-02:      17806.9     9280.32     2478.62    4274.59
        Patch-03:      17317.4     9308.62     2546.05    4336.59
        Patch-04:      17635.9     9256.16     2535.25    4327.63
        Patch-05:      18027.0     9918.99     2492.62    3621.68
        Patch-06:      18486.7    10723.20     3657.85    4560.64
      
       I cannot explain the under-DoS regression that patch-05/percpu_counter
       introduces.  But patch-06/LRU-lock corrects the situation again.
      
      Below is a testlab setup description, with links to the trafgen DoS
      packet config used.
      
      Testlab
      =======
      
      Server setup
      ------------
      The machine acting as a server:
       - 2x CPU (E5-2630)
       - Thus a NUMA arch/machine
       - 4x 10Gbit/s ports
       - NICs 2x Intel Dual port 82599 based (driver ixgbe)
      
      Setup:
       - Interfaces use Ethernet flow control
       - Flush all iptables
       - Remove all iptables related modules.
       - Kill irqbalance
       - Pin each 10G NIC port to a *single* CPU each
      
      Pinning can easily be done by command hacks::
      
       for x in /proc/irq/*/eth8*/../smp_affinity_list ; do echo 1 > $x; done
       for x in /proc/irq/*/eth9*/../smp_affinity_list ; do echo 3 > $x; done
       for x in /proc/irq/*/eth31*/../smp_affinity_list; do echo 6 > $x; done
       for x in /proc/irq/*/eth32*/../smp_affinity_list; do echo 8 > $x; done
      
      Notice NUMA setting: The CPU to NIC tying is carefully chosen
      according to the NUMA node setup.  Thus, each NIC is tied to a CPU
      on the physical socket that its PCI-e slot is connected to.
      
      Choosing only a single CPU per NIC (port) is just to ease provoking
      and debugging this performance issue. (In real setups, you can choose
      more CPUs, just remember the NUMA node in the equation).
      
      Tools
      -----
      
      Netperf is used, with option -T to ensure CPU binding.
      The netserver processes are NAPI pinned::
      
       numactl -m0 -c0 netserver
       numactl -m1 -c 1 netserver -p 1337
      
      I now have a frag DoS generator, created via the tool:
        trafgen (see: http://netsniff-ng.org/)
      
      Trafgen packet config file:
       http://people.netfilter.org/hawk/frag_work/trafgen/frag_packet03_small_frag.txf
      
      Notice, I'm using features of trafgen, recently developed by Daniel
      Borkmann, thus you need the latest git tree to use my trafgen packet
      config.
      
       git://github.com/borkmann/netsniff-ng.git
      
      Command line:
       trafgen --dev eth51 --conf frag_packet03_small_frag.txf -V -k 100 --cpus 2
      
      Test types
      ----------
      
      Test(20G64K) UDP-64K 2x 10Gbit/s with no DoS traffic:
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
       export SIZE=$((65507)); export TIME=$((20)); export LOG=/tmp/netperf.log ;\
       netperf -p 1337 -H 192.168.31.2 -T7,7 -t UDP_STREAM -l $TIME -- -m $SIZE >> ${LOG}.31 &\
       netperf         -H 192.168.81.2 -T2,2 -t UDP_STREAM -l $TIME -- -m $SIZE >> ${LOG}.81 && \
       wait $! && tail -n3 ${LOG}.* && \
       tail -n3 ${LOG}.{31,81} | awk 'BEGIN{sum=0;} /212992        / {sum+=$4; print " +"$4} /==/ {print " file:"$2} END{print "sum:"sum" Mbit/s"}'
      
      Test(20G3F) UDP-3xfrags 2x 10Gbit/s with no DoS traffic:
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
       export SIZE=$((3*1472)); export TIME=$((20)); export LOG=/tmp/netperf.log ;\
       netperf -p 1337 -H 192.168.31.2 -T7,7 -t UDP_STREAM -l $TIME -- -m $SIZE >> ${LOG}.31 &\
       netperf         -H 192.168.81.2 -T2,2 -t UDP_STREAM -l $TIME -- -m $SIZE >> ${LOG}.81 && \
       wait $! && tail -n3 ${LOG}.* && \
       tail -n3 ${LOG}.{31,81} | awk 'BEGIN{sum=0;} /212992        / {sum+=$4; print " +"$4} /==/ {print " file:"$2} END{print "sum:"sum" Mbit/s"}'
      
      Awk script for summing results:
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      tail -n3 ${LOG}.{31,81} | awk 'BEGIN{sum=0;} /212992        / {sum+=$4; print " +"$4} /==/ {print " file:"$2} END{print "sum:"sum" Mbit/s"}'
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: frag, move LRU list maintenance outside of rwlock · 3ef0eb0d
      Jesper Dangaard Brouer committed
      Updating the fragmentation queues' LRU (Least-Recently-Used) list
      required taking the hash writer lock.  However, the LRU list isn't
      tied to the hash at all, so we can use a separate lock for it.
      Original-idea-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
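
As a rough illustration of the idea in this commit, the sketch below gives an LRU list its own lock, so that moving an entry to the most-recently-used end never touches whatever lock protects the hash table. This is a user-space sketch with hypothetical names (`lru_demo`, `lru_touch`) and a simple C11 spinlock; it is not the kernel's code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct list_node { struct list_node *prev, *next; };

struct lru_demo {
    atomic_flag lock;      /* dedicated LRU lock, independent of any hash lock */
    struct list_node head; /* circular list; head.next is least recently used */
};

static void lru_init(struct lru_demo *l)
{
    atomic_flag_clear(&l->lock);
    l->head.prev = l->head.next = &l->head;
}

static void lru_lock(struct lru_demo *l)
{
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set_explicit(&l->lock, memory_order_acquire))
        ;
}

static void lru_unlock(struct lru_demo *l)
{
    atomic_flag_clear_explicit(&l->lock, memory_order_release);
}

static void node_unlink(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

static void node_add_tail(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Insert a new entry as most recently used. */
static void lru_add(struct lru_demo *l, struct list_node *n)
{
    lru_lock(l);
    node_add_tail(n, &l->head);
    lru_unlock(l);
}

/* Mark an existing entry as most recently used; only the LRU lock is taken. */
static void lru_touch(struct lru_demo *l, struct list_node *n)
{
    lru_lock(l);
    node_unlink(n);
    node_add_tail(n, &l->head);
    lru_unlock(l);
}
```

A lookup path could then hold only a reader lock on the hash while calling `lru_touch()`, which is the contention this commit message describes removing.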
    • net: use lib/percpu_counter API for fragmentation mem accounting · 6d7b857d
      Jesper Dangaard Brouer committed
      Replace the per network namespace shared atomic "mem" accounting
      variable, in the fragmentation code, with a lib/percpu_counter.
      
      Getting percpu_counter to scale to the fragmentation code usage
      requires some tweaks.
      
      At first glance, percpu_counter looks superfast, but it does not
      scale on multi-CPU/NUMA machines, because the default batch size
      is too small for frag code usage.  Thus, I have adjusted the
      batch size by using __percpu_counter_add() directly, instead of
      percpu_counter_sub() and percpu_counter_add().
      
      The batch size is increased to 130,000, based on the largest 64K
      fragment memory usage.  This does introduce some imprecise
      memory accounting, but it does not need to be strict for this
      use-case.
      
      It is also essential that the percpu_counter does not share a
      cacheline with other writers, to make this scale.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
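
The batching trade-off described in this commit can be sketched in user space as follows. This is a single-threaded simulation with hypothetical names (`batched_counter`, `counter_add`, `DEMO_NCPUS`), not the kernel's lib/percpu_counter implementation: each simulated CPU accumulates a private delta and folds it into the shared total only once the delta reaches the batch size.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NCPUS 4 /* hypothetical CPU count for the simulation */

struct batched_counter {
    int64_t global;            /* shared total; costly to write on real SMP */
    int32_t local[DEMO_NCPUS]; /* per-CPU deltas not yet folded into global */
};

/* Add (or subtract) on one CPU, touching the shared total only per batch. */
static void counter_add(struct batched_counter *c, int cpu,
                        int32_t amount, int32_t batch)
{
    assert(cpu >= 0 && cpu < DEMO_NCPUS);

    int64_t delta = (int64_t)c->local[cpu] + amount;

    if (delta >= batch || delta <= -batch) {
        c->global += delta;   /* fold the accumulated delta */
        c->local[cpu] = 0;
    } else {
        c->local[cpu] = (int32_t)delta;
    }
}

/* Cheap read: may lag the true value by up to DEMO_NCPUS * batch. */
static int64_t counter_read_fast(const struct batched_counter *c)
{
    return c->global;
}

/* Precise read: walks every per-CPU slot. */
static int64_t counter_sum(const struct batched_counter *c)
{
    int64_t sum = c->global;

    for (int i = 0; i < DEMO_NCPUS; i++)
        sum += c->local[i];
    return sum;
}
```

With a batch of 130,000, adding 100,000 on one CPU leaves the cheap read at 0 while the precise sum is 100,000. A larger batch means fewer writes to the shared total, at the price of a less exact cheap read, which is the imprecision the commit message accepts.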
    • net: frag helper functions for mem limit tracking · d433673e
      Jesper Dangaard Brouer committed
      This change is primarily a preparation to ease the extension of memory
      limit tracking.
      
      The change does reduce the number of atomic operations during freeing
      of a frag queue.  This introduces some performance improvement, as
      these atomic operations are at the core of the performance problems
      seen on NUMA systems.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: cacheline adjust struct inet_frag_queue · 6e34a8b3
      Jesper Dangaard Brouer committed
      Fragmentation code cacheline adjusting of struct inet_frag_queue.
      
      Take advantage of the size of struct timer_list, and move everything
      except 'spinlock_t lock' below the timer struct.  On 64-bit,
      'lru_list', 'list' and 'refcnt' fit exactly into the next cacheline,
      and a new cacheline starts at 'fragments'.
      
      The netns_frags *net pointer is moved to the end of the struct,
      because it is used in a compare, together with "next/close-by"
      elements of the struct this one is embedded into.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: cacheline adjust struct inet_frags for better frag performance · 5f8e1e8b
      Jesper Dangaard Brouer committed
      The globally shared rwlock, of struct inet_frags, shares
      cacheline with the 'rnd' number, which is used by the hash
      calculations.  Fix this, as this obviously is a bad idea, as
      unnecessary cache-misses will occur when accessing the 'rnd'
      number.
      
      Also note that the function pointer (*match) is moved up in the
      struct to avoid it landing on the next cacheline (on 64-bit).
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: cacheline adjust struct netns_frags for better frag performance · cd39a789
      Jesper Dangaard Brouer committed
      This small cacheline adjustment of struct netns_frags improves
      performance significantly for the fragmentation code.
      
      Struct members 'lru_list' and 'mem' are both hot elements, and
      sharing a cacheline hurts performance, due to cacheline bouncing
      at every call point.  Also notice how 'mem' is placed together
      with 'high_thresh' and 'low_thresh', as they are used in the
      compare operations together.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
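
The layout principle in this commit can be sketched with standard C11 alignment. The struct below is a hypothetical stand-in (`frags_layout_demo`), not the kernel's definition of struct netns_frags, and it assumes 64-byte cache lines: the LRU list head sits on one line, while the memory counter shares a second line with the thresholds it is compared against.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHELINE 64 /* assumption: 64-byte cache lines */

struct list_node { struct list_node *next, *prev; };

struct frags_layout_demo {
    /* Line 0: LRU list head, written whenever a queue is touched. */
    alignas(CACHELINE) struct list_node lru_list;
    int nqueues;

    /* Line 1: memory accounting plus the limits read alongside it. */
    alignas(CACHELINE) long mem;
    int high_thresh;
    int low_thresh;
};

/* Index of the cache line that a given struct offset falls into. */
static size_t cacheline_of(size_t offset)
{
    return offset / CACHELINE;
}

/* Compile-time check that the two hot members do not share a line. */
static_assert(offsetof(struct frags_layout_demo, lru_list) / CACHELINE !=
              offsetof(struct frags_layout_demo, mem) / CACHELINE,
              "lru_list and mem must not share a cache line");
```

Checking `cacheline_of(offsetof(struct frags_layout_demo, mem))` against the same expression for `high_thresh` and `low_thresh` confirms that the compare operands stay on one line, so a threshold check costs a single cache-line fetch.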
    • net: ks8851: convert to threaded IRQ · 656a05c8
      Felipe Balbi committed
      Convert to a threaded IRQ, just as it should have been. It also
      helps remove the now-unnecessary workqueue.
      Signed-off-by: Felipe Balbi <balbi@ti.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 29 Jan, 2013 13 commits
  3. 28 Jan, 2013 19 commits