1. 17 7月, 2009 1 次提交
    • E
      net: sock_copy() fixes · 4dc6dc71
      Eric Dumazet 提交于
      Commit e912b114
      (net: sk_prot_alloc() should not blindly overwrite memory)
      took care of not zeroing whole new socket at allocation time.
      
      sock_copy() is another spot where we should be very careful.
      We should not set refcnt to a non null value, until
      we are sure other fields are correctly setup, or
      a lockless reader could catch this socket by mistake,
      while not fully (re)initialized.
      
      This patch puts sk_node & sk_refcnt to the very beginning
      of struct sock to ease sock_copy() & sk_prot_alloc() job.
      
      We add appropriate smp_wmb() before sk_refcnt initializations
      to match our RCU requirements (changes to sock keys should
      be committed to memory before sk_refcnt setting)
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4dc6dc71
  2. 12 7月, 2009 1 次提交
  3. 10 7月, 2009 1 次提交
    • J
      net: adding memory barrier to the poll and receive callbacks · a57de0b4
      Jiri Olsa 提交于
      Adding memory barrier after the poll_wait function, paired with
      receive callbacks. Adding fuctions sock_poll_wait and sk_has_sleeper
      to wrap the memory barrier.
      
      Without the memory barrier, following race can happen.
      The race fires, when following code paths meet, and the tp->rcv_nxt
      and __add_wait_queue updates stay in CPU caches.
      
      CPU1                         CPU2
      
      sys_select                   receive packet
        ...                        ...
        __add_wait_queue           update tp->rcv_nxt
        ...                        ...
        tp->rcv_nxt check          sock_def_readable
        ...                        {
        schedule                      ...
                                      if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
                                              wake_up_interruptible(sk->sk_sleep)
                                      ...
                                   }
      
      If there was no cache the code would work ok, since the wait_queue and
      rcv_nxt are opposit to each other.
      
      Meaning that once tp->rcv_nxt is updated by CPU2, the CPU1 either already
      passed the tp->rcv_nxt check and sleeps, or will get the new value for
      tp->rcv_nxt and will return with new data mask.
      In both cases the process (CPU1) is being added to the wait queue, so the
      waitqueue_active (CPU2) call cannot miss and will wake up CPU1.
      
      The bad case is when the __add_wait_queue changes done by CPU1 stay in its
      cache, and so does the tp->rcv_nxt update on CPU2 side.  The CPU1 will then
      endup calling schedule and sleep forever if there are no more data on the
      socket.
      
      Calls to poll_wait in following modules were ommited:
      	net/bluetooth/af_bluetooth.c
      	net/irda/af_irda.c
      	net/irda/irnet/irnet_ppp.c
      	net/mac80211/rc80211_pid_debugfs.c
      	net/phonet/socket.c
      	net/rds/af_rds.c
      	net/rfkill/core.c
      	net/sunrpc/cache.c
      	net/sunrpc/rpc_pipe.c
      	net/tipc/socket.c
      Signed-off-by: NJiri Olsa <jolsa@redhat.com>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a57de0b4
  4. 15 6月, 2009 1 次提交
    • V
      net: annotate struct sock bitfield · a98b65a3
      Vegard Nossum 提交于
      2009/2/24 Ingo Molnar <mingo@elte.hu>:
      > ok, this is the last warning i have from today's overnight -tip
      > testruns - a 32-bit system warning in sock_init_data():
      >
      > [    2.610389] NET: Registered protocol family 16
      > [    2.616138] initcall netlink_proto_init+0x0/0x170 returned 0 after 7812 usecs
      > [    2.620010] WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (f642c184)
      > [    2.624002] 010000000200000000000000604990c000000000000000000000000000000000
      > [    2.634076]  i i i i i i u u i i i i i i i i i i i i i i i i i i i i i i i i
      > [    2.641038]          ^
      > [    2.643376]
      > [    2.644004] Pid: 1, comm: swapper Not tainted (2.6.29-rc6-tip-01751-g4d1c22c-dirty #885)
      > [    2.648003] EIP: 0060:[<c07141a1>] EFLAGS: 00010282 CPU: 0
      > [    2.652008] EIP is at sock_init_data+0xa1/0x190
      > [    2.656003] EAX: 0001a800 EBX: f6836c00 ECX: 00463000 EDX: c0e46fe0
      > [    2.660003] ESI: f642c180 EDI: c0b83088 EBP: f6863ed8 ESP: c0c412ec
      > [    2.664003]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
      > [    2.668003] CR0: 8005003b CR2: f682c400 CR3: 00b91000 CR4: 000006f0
      > [    2.672003] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
      > [    2.676003] DR6: ffff4ff0 DR7: 00000400
      > [    2.680002]  [<c07423e5>] __netlink_create+0x35/0xa0
      > [    2.684002]  [<c07443cc>] netlink_kernel_create+0x4c/0x140
      > [    2.688002]  [<c072755e>] rtnetlink_net_init+0x1e/0x40
      > [    2.696002]  [<c071b601>] register_pernet_operations+0x11/0x30
      > [    2.700002]  [<c071b72c>] register_pernet_subsys+0x1c/0x30
      > [    2.704002]  [<c0bf3c8c>] rtnetlink_init+0x4c/0x100
      > [    2.708002]  [<c0bf4669>] netlink_proto_init+0x159/0x170
      > [    2.712002]  [<c0101124>] do_one_initcall+0x24/0x150
      > [    2.716002]  [<c0bbf3c7>] do_initcalls+0x27/0x40
      > [    2.723201]  [<c0bbf3fc>] do_basic_setup+0x1c/0x20
      > [    2.728002]  [<c0bbfb8a>] kernel_init+0x5a/0xa0
      > [    2.732002]  [<c0103e47>] kernel_thread_helper+0x7/0x10
      > [    2.736002]  [<ffffffff>] 0xffffffff
      
      We fix this false positive by annotating the bitfield in struct
      sock.
      Reported-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NVegard Nossum <vegard.nossum@gmail.com>
      a98b65a3
  5. 11 6月, 2009 1 次提交
    • E
      net: No more expensive sock_hold()/sock_put() on each tx · 2b85a34e
      Eric Dumazet 提交于
      One of the problem with sock memory accounting is it uses
      a pair of sock_hold()/sock_put() for each transmitted packet.
      
      This slows down bidirectional flows because the receive path
      also needs to take a refcount on socket and might use a different
      cpu than transmit path or transmit completion path. So these
      two atomic operations also trigger cache line bounces.
      
      We can see this in tx or tx/rx workloads (media gateways for example),
      where sock_wfree() can be in top five functions in profiles.
      
      We use this sock_hold()/sock_put() so that sock freeing
      is delayed until all tx packets are completed.
      
      As we also update sk_wmem_alloc, we could offset sk_wmem_alloc
      by one unit at init time, until sk_free() is called.
      Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
      to decrement initial offset and atomicaly check if any packets
      are in flight.
      
      skb_set_owner_w() doesnt call sock_hold() anymore
      
      sock_wfree() doesnt call sock_put() anymore, but check if sk_wmem_alloc
      reached 0 to perform the final freeing.
      
      Drawback is that a skb->truesize error could lead to unfreeable sockets, or
      even worse, prematurely calling __sk_free() on a live socket.
      
      Nice speedups on SMP. tbench for example, going from 2691 MB/s to 2711 MB/s
      on my 8 cpu dev machine, even if tbench was not really hitting sk_refcnt
      contention point. 5 % speedup on a UDP transmit workload (depends
      on number of flows), lowering TX completion cpu usage.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2b85a34e
  6. 09 6月, 2009 1 次提交
  7. 28 5月, 2009 1 次提交
  8. 01 4月, 2009 1 次提交
  9. 27 2月, 2009 1 次提交
  10. 24 2月, 2009 1 次提交
  11. 18 2月, 2009 1 次提交
    • D
      net: Kill skb_truesize_check(), it only catches false-positives. · 92a0acce
      David S. Miller 提交于
      A long time ago we had bugs, primarily in TCP, where we would modify
      skb->truesize (for TSO queue collapsing) in ways which would corrupt
      the socket memory accounting.
      
      skb_truesize_check() was added in order to try and catch this error
      more systematically.
      
      However this debugging check has morphed into a Frankenstein of sorts
      and these days it does nothing other than catch false-positives.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      92a0acce
  12. 16 2月, 2009 1 次提交
  13. 13 2月, 2009 1 次提交
  14. 05 2月, 2009 1 次提交
  15. 18 12月, 2008 1 次提交
  16. 26 11月, 2008 2 次提交
  17. 22 11月, 2008 1 次提交
  18. 20 11月, 2008 1 次提交
  19. 17 11月, 2008 1 次提交
    • E
      net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls · 3ab5aee7
      Eric Dumazet 提交于
      RCU was added to UDP lookups, using a fast infrastructure :
      - sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the
        price of call_rcu() at freeing time.
      - hlist_nulls permits to use few memory barriers.
      
      This patch uses same infrastructure for TCP/DCCP established
      and timewait sockets.
      
      Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications
      using short lived TCP connections. A followup patch, converting
      rwlocks to spinlocks will even speedup this case.
      
      __inet_lookup_established() is pretty fast now we dont have to
      dirty a contended cache line (read_lock/read_unlock)
      
      Only established and timewait hashtable are converted to RCU
      (bind table and listen table are still using traditional locking)
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3ab5aee7
  20. 14 11月, 2008 1 次提交
    • I
      lockdep: include/linux/lockdep.h - fix warning in net/bluetooth/af_bluetooth.c · e8f6fbf6
      Ingo Molnar 提交于
      fix this warning:
      
        net/bluetooth/af_bluetooth.c:60: warning: ‘bt_key_strings’ defined but not used
        net/bluetooth/af_bluetooth.c:71: warning: ‘bt_slock_key_strings’ defined but not used
      
      this is a lockdep macro problem in the !LOCKDEP case.
      
      We cannot convert it to an inline because the macro works on multiple types,
      but we can mark the parameter used.
      
      [ also clean up a misaligned tab in sock_lock_init_class_and_name() ]
      
      [ also remove #ifdefs from around af_family_clock_key strings - which
        were certainly added to get rid of the ugly build warnings. ]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e8f6fbf6
  21. 12 11月, 2008 1 次提交
    • I
      lockdep: include/linux/lockdep.h - fix warning in net/bluetooth/af_bluetooth.c · e25cf3db
      Ingo Molnar 提交于
      fix this warning:
      
        net/bluetooth/af_bluetooth.c:60: warning: ‘bt_key_strings’ defined but not used
        net/bluetooth/af_bluetooth.c:71: warning: ‘bt_slock_key_strings’ defined but not used
      
      this is a lockdep macro problem in the !LOCKDEP case.
      
      We cannot convert it to an inline because the macro works on multiple types,
      but we can mark the parameter used.
      
      [ also clean up a misaligned tab in sock_lock_init_class_and_name() ]
      
      [ also remove #ifdefs from around af_family_clock_key strings - which
        were certainly added to get rid of the ugly build warnings. ]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e25cf3db
  22. 29 10月, 2008 1 次提交
    • E
      udp: RCU handling for Unicast packets. · 271b72c7
      Eric Dumazet 提交于
      Goals are :
      
      1) Optimizing handling of incoming Unicast UDP frames, so that no memory
       writes should happen in the fast path.
      
       Note: Multicasts and broadcasts still will need to take a lock,
       because doing a full lockless lookup in this case is difficult.
      
      2) No expensive operations in the socket bind/unhash phases :
        - No expensive synchronize_rcu() calls.
      
        - No added rcu_head in socket structure, increasing memory needs,
        but more important, forcing us to use call_rcu() calls,
        that have the bad property of making sockets structure cold.
        (rcu grace period between socket freeing and its potential reuse
         make this socket being cold in CPU cache).
        David did a previous patch using call_rcu() and noticed a 20%
        impact on TCP connection rates.
        Quoting Cristopher Lameter :
         "Right. That results in cacheline cooldown. You'd want to recycle
          the object as they are cache hot on a per cpu basis. That is screwed
          up by the delayed regular rcu processing. We have seen multiple
          regressions due to cacheline cooldown.
          The only choice in cacheline hot sensitive areas is to deal with the
          complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."
      
        - Because udp sockets are allocated from dedicated kmem_cache,
        use of SLAB_DESTROY_BY_RCU can help here.
      
      Theory of operation :
      ---------------------
      
      As the lookup is lockfree (using rcu_read_lock()/rcu_read_unlock()),
      special attention must be taken by readers and writers.
      
      Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed,
      reused, inserted in a different chain or in worst case in the same chain
      while readers could do lookups in the same time.
      
      In order to avoid loops, a reader must check each socket found in a chain
      really belongs to the chain the reader was traversing. If it finds a
      mismatch, lookup must start again at the begining. This *restart* loop
      is the reason we had to use rdlock for the multicast case, because
      we dont want to send same message several times to the same socket.
      
      We use RCU only for fast path.
      Thus, /proc/net/udp still takes spinlocks.
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      271b72c7
  23. 08 10月, 2008 1 次提交
  24. 23 9月, 2008 1 次提交
  25. 19 9月, 2008 1 次提交
  26. 24 7月, 2008 1 次提交
  27. 17 7月, 2008 1 次提交
  28. 18 6月, 2008 1 次提交
  29. 12 6月, 2008 1 次提交
  30. 14 5月, 2008 1 次提交
  31. 03 5月, 2008 1 次提交
  32. 22 4月, 2008 2 次提交
  33. 16 4月, 2008 1 次提交
  34. 14 4月, 2008 1 次提交
  35. 01 4月, 2008 2 次提交
  36. 29 3月, 2008 2 次提交