1. 29 Oct, 2008 1 commit
    • udp: RCU handling for Unicast packets. · 271b72c7
      Committed by Eric Dumazet
      Goals are :
      
      1) Optimizing handling of incoming Unicast UDP frames, so that no memory
       writes should happen in the fast path.
      
       Note: Multicasts and broadcasts will still need to take a lock,
       because a fully lockless lookup is difficult in this case.
      
      2) No expensive operations in the socket bind/unhash phases :
        - No expensive synchronize_rcu() calls.
      
        - No added rcu_head in the socket structure, which would increase
        memory needs and, more importantly, force us to use call_rcu() calls,
        which have the bad property of making socket structures cold
        (the RCU grace period between freeing a socket and its potential reuse
         leaves the socket cold in the CPU cache).
        David did a previous patch using call_rcu() and noticed a 20%
        impact on TCP connection rates.
        Quoting Christoph Lameter:
         "Right. That results in cacheline cooldown. You'd want to recycle
          the object as they are cache hot on a per cpu basis. That is screwed
          up by the delayed regular rcu processing. We have seen multiple
          regressions due to cacheline cooldown.
          The only choice in cacheline hot sensitive areas is to deal with the
          complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."
      
        - Because UDP sockets are allocated from a dedicated kmem_cache,
        use of SLAB_DESTROY_BY_RCU can help here.
      
      Theory of operation :
      ---------------------
      
      As the lookup is lock-free (using rcu_read_lock()/rcu_read_unlock()),
      special care must be taken by readers and writers.
      
      Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed,
      reused, and inserted in a different chain (or, in the worst case, in the
      same chain) while readers are doing lookups at the same time.
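The reuse hazard described above is usually handled by revalidating an object after taking a reference on it. The following is a minimal user-space sketch of that pattern, assuming C11 atomics in place of the kernel's refcount helpers; `obj_stub`, `get_unless_zero()`, `put_obj()` and `try_claim()` are hypothetical names for illustration and are not taken from the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a socket living in a type-stable cache. */
struct obj_stub {
    atomic_int refcnt;      /* 0 means "freed, may be recycled at any time" */
    unsigned short port;    /* lookup key; changes when the object is reused */
};

/* Take a reference only if the object is still live (refcnt != 0). */
static bool get_unless_zero(struct obj_stub *o)
{
    int old = atomic_load(&o->refcnt);

    while (old != 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
            return true;
    }
    return false;
}

static void put_obj(struct obj_stub *o)
{
    atomic_fetch_sub(&o->refcnt, 1);
}

/*
 * A reader found 'o' while walking a chain without locks. Because the
 * memory may have been freed and reused for another socket, the key is
 * checked again after the reference is held.
 */
static struct obj_stub *try_claim(struct obj_stub *o, unsigned short port)
{
    if (!get_unless_zero(o))
        return NULL;        /* object was freed under us */
    if (o->port != port) {
        put_obj(o);         /* recycled for a different key */
        return NULL;
    }
    return o;               /* caller now owns one reference */
}
```

A caller that gets NULL back would typically retry the lookup rather than report "not found".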
      
      In order to avoid loops, a reader must check that each socket found in a
      chain really belongs to the chain the reader was traversing. If it finds
      a mismatch, the lookup must start again at the beginning. This *restart*
      loop is the reason we had to use rdlock for the multicast case, because
      we don't want to send the same message several times to the same socket.
      
      We use RCU only for fast path.
      Thus, /proc/net/udp still takes spinlocks.
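The chain-membership check and restart described in the theory of operation can be sketched as a single-threaded simulation. This is an illustration under stated assumptions, not the code from the patch: `sock_stub`, `table_stub`, `lookup_stub()` and the `writer_hook` callback (which stands in for a concurrent writer) are hypothetical, and real kernel code would additionally need RCU read-side protection, memory barriers, and reference counting.

```c
#include <stddef.h>

#define NSLOTS 4

/* Hypothetical, simplified socket: key, owning chain, chain link. */
struct sock_stub {
    unsigned short port;
    unsigned int slot;          /* chain this object currently belongs to */
    struct sock_stub *next;
};

struct table_stub {
    struct sock_stub *chain[NSLOTS];
};

/* Called by the reader at each node to simulate a concurrent writer. */
typedef void (*writer_hook)(struct table_stub *t, struct sock_stub *sk);

static unsigned int slot_of(unsigned short port)
{
    return port % NSLOTS;
}

static struct sock_stub *lookup_stub(struct table_stub *t, unsigned short port,
                                     writer_hook hook, int *restarts)
{
    unsigned int slot = slot_of(port);
    struct sock_stub *sk;

begin:
    for (sk = t->chain[slot]; sk != NULL; sk = sk->next) {
        if (hook != NULL)
            hook(t, sk);        /* the writer may recycle sk here */
        if (sk->slot != slot) { /* sk moved to another chain: restart */
            (*restarts)++;
            goto begin;
        }
        if (sk->port == port)
            return sk;
    }
    return NULL;
}

/* Simulated writer: recycle the socket bound to port 4 into chain 1. */
static void recycle_port4(struct table_stub *t, struct sock_stub *sk)
{
    if (sk->port != 4 || t->chain[0] != sk)
        return;
    t->chain[0] = sk->next;     /* unhash from chain 0 */
    sk->port = 5;               /* reuse for a different port */
    sk->slot = slot_of(5);
    sk->next = t->chain[1];     /* insert at head of chain 1 */
    t->chain[1] = sk;
}

/* Returns the number of restarts needed to find port 8, or -1 on a miss. */
static int demo_restarts(void)
{
    struct sock_stub b = { 8, 0, NULL };
    struct sock_stub a = { 4, 0, &b };
    struct table_stub t = { { &a, NULL, NULL, NULL } };
    int restarts = 0;
    struct sock_stub *hit = lookup_stub(&t, 8, recycle_port4, &restarts);

    return hit == &b ? restarts : -1;
}
```

Without the `sk->slot != slot` check, the reader in this scenario would follow the recycled socket's `next` pointer into chain 1 and miss the socket bound to port 8. The restart also shows why the same approach is awkward for multicast delivery: a restarted traversal may visit sockets it has already seen.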
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 08 Oct, 2008 1 commit
  3. 23 Sep, 2008 1 commit
  4. 19 Sep, 2008 1 commit
  5. 24 Jul, 2008 1 commit
  6. 17 Jul, 2008 1 commit
  7. 18 Jun, 2008 1 commit
  8. 12 Jun, 2008 1 commit
  9. 14 May, 2008 1 commit
  10. 03 May, 2008 1 commit
  11. 22 Apr, 2008 2 commits
  12. 16 Apr, 2008 1 commit
  13. 14 Apr, 2008 1 commit
  14. 01 Apr, 2008 2 commits
  15. 29 Mar, 2008 3 commits
  16. 26 Mar, 2008 1 commit
  17. 21 Mar, 2008 1 commit
    • [NET]: Add per-connection option to set max TSO frame size · 82cc1a7a
      Committed by Peter P Waskiewicz Jr
      Update: My mailer ate one of Jarek's feedback mails...  Fixed the
      parameter in netif_set_gso_max_size() to be u32, not u16.  Fixed the
      whitespace issue due to a patch import botch.  Changed the types from
      u32 to unsigned int to be more consistent with other variables in the
      area.  Also brought the patch up to the latest net-2.6.26 tree.
      
      Update: Made gso_max_size container 32 bits, not 16.  Moved the
      location of gso_max_size within netdev to be less hotpath.  Made more
      consistent names between the sock and netdev layers, and added a
      define for the max GSO size.
      
      Update: Respun for net-2.6.26 tree.
      
      Update: changed max_gso_frame_size and sk_gso_max_size from signed to
      unsigned - thanks Stephen!
      
      This patch adds the ability for device drivers to control the size of
      the TSO frames being sent to them, per TCP connection.  By setting the
      netdevice's gso_max_size value, the socket layer will set the GSO
      frame size based on that value.  This will propagate into the TCP
      layer, and send TSOs of that size to the hardware.
      
      This can be desirable to help tune the bursty nature of TSO on a
      per-adapter basis, where one may have 1 GbE and 10 GbE devices
      coexisting in a system, one running multiqueue and the other not, etc.
      
      This can also be desirable for devices that cannot support full 64 KB
      TSOs, but still want to benefit from some level of segmentation
      offloading.
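The flow described above (driver sets a per-device cap, the socket copies it, TCP builds frames no larger than it) can be sketched in user space as follows. The field names `gso_max_size` and `sk_gso_max_size` come from the commit message; the structs, the `stub_*` helpers, and the clamping function are hypothetical simplifications, not the kernel's actual definitions.

```c
/* Hypothetical, cut-down stand-ins for struct net_device and struct sock. */
struct netdev_stub {
    unsigned int gso_max_size;      /* largest TSO frame the device accepts */
};

struct sk_stub {
    unsigned int sk_gso_max_size;   /* per-connection copy of the device cap */
};

/* What a driver would do at setup time to advertise its limit. */
static void stub_set_gso_max_size(struct netdev_stub *dev, unsigned int size)
{
    dev->gso_max_size = size;
}

/* What the socket layer would do once the output device is known. */
static void stub_sk_setup_caps(struct sk_stub *sk, const struct netdev_stub *dev)
{
    sk->sk_gso_max_size = dev->gso_max_size;
}

/* Size of the next TSO frame given 'queued' bytes waiting to be sent. */
static unsigned int stub_tso_frame_len(const struct sk_stub *sk,
                                       unsigned int queued)
{
    return queued < sk->sk_gso_max_size ? queued : sk->sk_gso_max_size;
}
```

Because the cap is copied into the socket, two connections routed over different adapters can end up with different limits, which is the per-adapter tuning the message refers to.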
      Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 19 Mar, 2008 1 commit
  19. 06 Mar, 2008 1 commit
  20. 01 Mar, 2008 2 commits
  21. 14 Feb, 2008 1 commit
  22. 01 Feb, 2008 1 commit
  23. 29 Jan, 2008 9 commits
  24. 13 Nov, 2007 1 commit
  25. 07 Nov, 2007 2 commits
    • [NET]: Clean proto_(un)register from in-code ifdefs · b733c007
      Committed by Pavel Emelyanov
      The struct proto has the per-cpu "inuse" counter, which is handled
      with special care. All the handling code hides under an ifdef
      CONFIG_SMP, which introduces some code duplication and makes the
      code look worse than it could.
      
      Clean this.
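One common way to do this kind of cleanup is to move the conditional code into small helpers that compile to no-ops when the option is off, so the callers need no ifdefs. The sketch below illustrates that general pattern in user space; `CONFIG_SMP_STUB`, `proto_stub`, and the helper names are hypothetical and are not claimed to match the patch.

```c
#include <stdlib.h>

#define NCPU_STUB 4

/* Hypothetical, cut-down stand-in for struct proto. */
struct proto_stub {
    int *inuse;     /* per-CPU counters, only allocated in the SMP variant */
};

#ifdef CONFIG_SMP_STUB
/* SMP variant: allocate and release one counter per CPU. */
static int inuse_init(struct proto_stub *p)
{
    p->inuse = calloc(NCPU_STUB, sizeof(*p->inuse));
    return p->inuse != NULL ? 0 : -1;
}

static void inuse_fini(struct proto_stub *p)
{
    free(p->inuse);
    p->inuse = NULL;
}
#else
/* Uniprocessor variant: nothing to allocate, helpers are no-ops. */
static int inuse_init(struct proto_stub *p)
{
    (void)p;
    return 0;
}

static void inuse_fini(struct proto_stub *p)
{
    (void)p;
}
#endif

/* The register/unregister paths are now free of in-code ifdefs. */
static int proto_register_stub(struct proto_stub *p)
{
    if (inuse_init(p) != 0)
        return -1;
    return 0;
}

static void proto_unregister_stub(struct proto_stub *p)
{
    inuse_fini(p);
}
```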
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [NET]: Define infrastructure to keep 'inuse' changes in an efficient SMP/NUMA way. · 286ab3d4
      Committed by Eric Dumazet
      "struct proto" currently uses an array stats[NR_CPUS] to track changes
      to the number of 'inuse' sockets per protocol.
      
      If NR_CPUS is big, this means we use a big memory area for this.
      Moreover, all this memory area is located on a single node on NUMA
      machines, increasing memory pressure on the boot node.
      
      In this patch, I tried to :
      
      - Keep a fast !CONFIG_SMP implementation
      - Keep a fast CONFIG_SMP implementation for often used protocols
      (tcp,udp,raw,...)
      - Introduce a NUMA efficient implementation
      
      Some helper macros are defined in include/net/sock.h
      These macros take into account CONFIG_SMP
      
      If a "struct proto" is declared without using the DEFINE_PROTO_INUSE /
      REF_PROTO_INUSE macros, it will automatically use a default
      implementation, using a dynamically allocated percpu zone.
      This default implementation will be NUMA efficient, but might use 32/64
      bytes per possible cpu because of the current alloc_percpu()
      implementation. However, it should still be better than the previous
      implementation based on the stats[NR_CPUS] field.
      
      When a "struct proto" is changed to use the new macros, we use a single
      static "int" percpu variable, lowering the memory and cpu costs while
      still preserving NUMA efficiency.
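The counting scheme behind 'inuse' can be illustrated with a small user-space model: each CPU only ever updates its own delta, and a reader sums all deltas to get the total. This is a sketch under stated assumptions; a plain array indexed by CPU id stands in for a real percpu variable, and `inuse_stub`, `inuse_add()` and `inuse_read()` are hypothetical names, not the macros defined by the patch.

```c
#define NCPU_STUB 4

/* One signed delta per CPU; an individual entry may go negative. */
struct inuse_stub {
    int delta[NCPU_STUB];
};

/* Fast path: the CPU creating or destroying a socket touches only its slot. */
static void inuse_add(struct inuse_stub *c, int cpu, int val)
{
    c->delta[cpu] += val;
}

/* Slow path (for example, statistics output): sum the deltas of all CPUs. */
static int inuse_read(const struct inuse_stub *c)
{
    int sum = 0;
    int cpu;

    for (cpu = 0; cpu < NCPU_STUB; cpu++)
        sum += c->delta[cpu];
    return sum;
}
```

A socket may be created on one CPU and destroyed on another, so only the sum is meaningful; the per-CPU entries are just deltas.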
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 01 Nov, 2007 1 commit