1. 13 10月, 2009 1 次提交
  2. 18 6月, 2009 1 次提交
  3. 28 4月, 2009 1 次提交
  4. 24 11月, 2008 1 次提交
    • E
      net: Convert TCP/DCCP listening hash tables to use RCU · c25eb3bf
      Eric Dumazet 提交于
      This is the last step to be able to perform full RCU lookups
      in __inet_lookup() : After established/timewait tables, we
      add RCU lookups to listening hash table.
      
      The only trick here is that a socket of a given type (TCP ipv4,
      TCP ipv6, ...) can now flight between two different tables
      (established and listening) during a RCU grace period, so we
      must use different 'nulls' end-of-chain values for two tables.
      
      We define a large value :
      
      #define LISTENING_NULLS_BASE (1U << 29)
      
      So that slots in listening table are guaranteed to have different
      end-of-chain values than slots in established table. A reader can
      still detect it finished its lookup in the right chain.
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c25eb3bf
  5. 22 11月, 2008 1 次提交
  6. 20 11月, 2008 1 次提交
  7. 17 11月, 2008 1 次提交
    • E
      net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls · 3ab5aee7
      Eric Dumazet 提交于
      RCU was added to UDP lookups, using a fast infrastructure :
      - sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the
        price of call_rcu() at freeing time.
      - hlist_nulls permits to use few memory barriers.
      
      This patch uses same infrastructure for TCP/DCCP established
      and timewait sockets.
      
      Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications
      using short lived TCP connections. A followup patch, converting
      rwlocks to spinlocks will even speedup this case.
      
      __inet_lookup_established() is pretty fast now we dont have to
      dirty a contended cache line (read_lock/read_unlock)
      
      Only established and timewait hashtable are converted to RCU
      (bind table and listen table are still using traditional locking)
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3ab5aee7
  8. 17 10月, 2008 1 次提交
  9. 28 8月, 2008 1 次提交
  10. 12 6月, 2008 1 次提交
  11. 01 2月, 2008 3 次提交
  12. 29 1月, 2008 1 次提交
  13. 03 12月, 2007 1 次提交
  14. 29 11月, 2007 1 次提交
    • P
      [INET]: Fix inet_diag register vs rcv race · 07693198
      Pavel Emelyanov 提交于
      The following race is possible when one cpu unregisters the handler
      while other one is trying to receive a message and call this one:
      
      CPU1:                                                 CPU2:
      inet_diag_rcv()                                       inet_diag_unregister()
        mutex_lock(&inet_diag_mutex);
        netlink_rcv_skb(skb, &inet_diag_rcv_msg);
          if (inet_diag_table[nlh->nlmsg_type] == 
                                     NULL) /* false handler is still registered */
          ...
          netlink_dump_start(idiagnl, skb, nlh,
                                 inet_diag_dump, NULL);
                 cb = kzalloc(sizeof(*cb), GFP_KERNEL);
                         /* sleep here freeing memory 
                          * or preempt
                          * or sleep later on nlk->cb_mutex
                          */
                                                               spin_lock(&inet_diag_register_lock);
                                                               inet_diag_table[type] = NULL;
          ...                                                  spin_unlock(&inet_diag_register_lock);
                                                               synchronize_rcu();
                                                               /* CPU1 is sleeping - RCU quiescent
                                                                * state is passed
                                                                */
                                                               return;
          /* inet_diag_dump is finally called: */
          inet_diag_dump()
            handler = inet_diag_table[cb->nlh->nlmsg_type];
            BUG_ON(handler == NULL); 
            /* OOPS! While we slept the unregister has set
             * handler to NULL :(
             */
      
      Grep showed, that the register/unregister functions are called
      from init/fini module callbacks for tcp_/dccp_diag, so it's OK
      to use the inet_diag_mutex to synchronize manipulations with the
      inet_diag_table and the access to it.
      
      Besides, as Herbert pointed out, asynchronous dumps should hold 
      this mutex as well, and thus, we provide the mutex as cb_mutex one.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      07693198
  15. 07 11月, 2007 1 次提交
    • E
      [INET]: Remove per bucket rwlock in tcp/dccp ehash table. · 230140cf
      Eric Dumazet 提交于
      As done two years ago on IP route cache table (commit
      22c047cc) , we can avoid using one
      lock per hash bucket for the huge TCP/DCCP hash tables.
      
      On a typical x86_64 platform, this saves about 2MB or 4MB of ram, for
      litle performance differences. (we hit a different cache line for the
      rwlock, but then the bucket cache line have a better sharing factor
      among cpus, since we dirty it less often). For netstat or ss commands
      that want a full scan of hash table, we perform fewer memory accesses.
      
      Using a 'small' table of hashed rwlocks should be more than enough to
      provide correct SMP concurrency between different buckets, without
      using too much memory. Sizing of this table depends on
      num_possible_cpus() and various CONFIG settings.
      
      This patch provides some locking abstraction that may ease a future
      work using a different model for TCP/DCCP table.
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Acked-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      230140cf
  16. 22 10月, 2007 1 次提交
  17. 11 10月, 2007 4 次提交
    • D
      [NET]: make netlink user -> kernel interface synchronious · cd40b7d3
      Denis V. Lunev 提交于
      This patch make processing netlink user -> kernel messages synchronious.
      This change was inspired by the talk with Alexey Kuznetsov about current
      netlink messages processing. He says that he was badly wrong when introduced 
      asynchronious user -> kernel communication.
      
      The call netlink_unicast is the only path to send message to the kernel
      netlink socket. But, unfortunately, it is also used to send data to the
      user.
      
      Before this change the user message has been attached to the socket queue
      and sk->sk_data_ready was called. The process has been blocked until all
      pending messages were processed. The bad thing is that this processing
      may occur in the arbitrary process context.
      
      This patch changes nlk->data_ready callback to get 1 skb and force packet
      processing right in the netlink_unicast.
      
      Kernel -> user path in netlink_unicast remains untouched.
      
      EINTR processing for in netlink_run_queue was changed. It forces rtnl_lock
      drop, but the process remains in the cycle until the message will be fully
      processed. So, there is no need to use this kludges now.
      Signed-off-by: NDenis V. Lunev <den@openvz.org>
      Acked-by: NAlexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cd40b7d3
    • H
      [NETLINK]: Avoid pointer in netlink_run_queue · 0cfad075
      Herbert Xu 提交于
      I was looking at Patrick's fix to inet_diag and it occured
      to me that we're using a pointer argument to return values
      unnecessarily in netlink_run_queue.  Changing it to return
      the value will allow the compiler to generate better code
      since the value won't have to be memory-backed.
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0cfad075
    • E
      [NET]: Support multiple network namespaces with netlink · b4b51029
      Eric W. Biederman 提交于
      Each netlink socket will live in exactly one network namespace,
      this includes the controlling kernel sockets.
      
      This patch updates all of the existing netlink protocols
      to only support the initial network namespace.  Request
      by clients in other namespaces will get -ECONREFUSED.
      As they would if the kernel did not have the support for
      that netlink protocol compiled in.
      
      As each netlink protocol is updated to be multiple network
      namespace safe it can register multiple kernel sockets
      to acquire a presence in the rest of the network namespaces.
      
      The implementation in af_netlink is a simple filter implementation
      at hash table insertion and hash table look up time.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b4b51029
    • I
      [NET]: DIV_ROUND_UP cleanup (part two) · 172589cc
      Ilpo Järvinen 提交于
      Hopefully captured all single statement cases under net/. I'm
      not too sure if there is some policy about #includes that are
      "guaranteed" (ie., in the current tree) to be available through
      some other #included header, so I just added linux/kernel.h to
      each changed file that didn't #include it previously.
      Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      172589cc
  18. 11 9月, 2007 1 次提交
    • P
      [INET_DIAG]: Fix oops in netlink_rcv_skb · 0a9c7301
      Patrick McHardy 提交于
      netlink_run_queue() doesn't handle multiple processes processing the
      queue concurrently. Serialize queue processing in inet_diag to fix
      a oops in netlink_rcv_skb caused by netlink_run_queue passing a
      NULL for the skb.
      
      BUG: unable to handle kernel NULL pointer dereference at virtual address 00000054
      [349587.500454]  printing eip:
      [349587.500457] c03318ae
      [349587.500459] *pde = 00000000
      [349587.500464] Oops: 0000 [#1]
      [349587.500466] PREEMPT SMP
      [349587.500474] Modules linked in: w83627hf hwmon_vid i2c_isa
      [349587.500483] CPU:    0
      [349587.500485] EIP:    0060:[<c03318ae>]    Not tainted VLI
      [349587.500487] EFLAGS: 00010246   (2.6.22.3 #1)
      [349587.500499] EIP is at netlink_rcv_skb+0xa/0x7e
      [349587.500506] eax: 00000000   ebx: 00000000   ecx: c148d2a0   edx: c0398819
      [349587.500510] esi: 00000000   edi: c0398819   ebp: c7a21c8c   esp: c7a21c80
      [349587.500517] ds: 007b   es: 007b   fs: 00d8  gs: 0033  ss: 0068
      [349587.500521] Process oidentd (pid: 17943, ti=c7a20000 task=cee231c0 task.ti=c7a20000)
      [349587.500527] Stack: 00000000 c7a21cac f7c8ba78 c7a21ca4 c0331962 c0398819 f7c8ba00 0000004c
      [349587.500542]        f736f000 c7a21cb4 c03988e3 00000001 f7c8ba00 c7a21cc4 c03312a5 0000004c
      [349587.500558]        f7c8ba00 c7a21cd4 c0330681 f7c8ba00 e4695280 c7a21d00 c03307c6 7fffffff
      [349587.500578] Call Trace:
      [349587.500581]  [<c010361a>] show_trace_log_lvl+0x1c/0x33
      [349587.500591]  [<c01036d4>] show_stack_log_lvl+0x8d/0xaa
      [349587.500595]  [<c010390e>] show_registers+0x1cb/0x321
      [349587.500604]  [<c0103bff>] die+0x112/0x1e1
      [349587.500607]  [<c01132d2>] do_page_fault+0x229/0x565
      [349587.500618]  [<c03c8d3a>] error_code+0x72/0x78
      [349587.500625]  [<c0331962>] netlink_run_queue+0x40/0x76
      [349587.500632]  [<c03988e3>] inet_diag_rcv+0x1f/0x2c
      [349587.500639]  [<c03312a5>] netlink_data_ready+0x57/0x59
      [349587.500643]  [<c0330681>] netlink_sendskb+0x24/0x45
      [349587.500651]  [<c03307c6>] netlink_unicast+0x100/0x116
      [349587.500656]  [<c0330f83>] netlink_sendmsg+0x1c2/0x280
      [349587.500664]  [<c02fcce9>] sock_sendmsg+0xba/0xd5
      [349587.500671]  [<c02fe4d1>] sys_sendmsg+0x17b/0x1e8
      [349587.500676]  [<c02fe92d>] sys_socketcall+0x230/0x24d
      [349587.500684]  [<c01028d2>] syscall_call+0x7/0xb
      [349587.500691]  =======================
      [349587.500693] Code: f0 ff 4e 18 0f 94 c0 84 c0 0f 84 66 ff ff ff 89 f0 e8 86 e2 fc ff e9 5a ff ff ff f0 ff 40 10 eb be 55 89 e5 57 89 d7 56 89 c6 53 <8b> 50 54 83 fa 10 72 55 8b 9e 9c 00 00 00 31 c9 8b 03 83 f8 0f
      
      Reported by Athanasius <link@miggy.org>
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0a9c7301
  19. 26 4月, 2007 6 次提交
  20. 11 2月, 2007 1 次提交
  21. 09 2月, 2007 2 次提交
    • E
      [NET]: change layout of ehash table · dbca9b27
      Eric Dumazet 提交于
      ehash table layout is currently this one :
      
      First half of this table is used by sockets not in TIME_WAIT state
      Second half of it is used by sockets in TIME_WAIT state.
      
      This is non optimal because of for a given hash or socket, the two chain heads 
      are located in separate cache lines.
      Moreover the locks of the second half are never used.
      
      If instead of this halving, we use two list heads in inet_ehash_bucket instead 
      of only one, we probably can avoid one cache miss, and reduce ram usage, 
      particularly if sizeof(rwlock_t) is big (various CONFIG_DEBUG_SPINLOCK, 
      CONFIG_DEBUG_LOCK_ALLOC settings). So we still halves the table but we keep 
      together related chains to speedup lookups and socket state change.
      
      In this patch I did not try to align struct inet_ehash_bucket, but a future 
      patch could try to make this structure have a convenient size (a power of two 
      or a multiple of L1_CACHE_SIZE).
      I guess rwlock will just vanish as soon as RCU is plugged into ehash :) , so 
      maybe we dont need to scratch our heads to align the bucket...
      
      Note : In case struct inet_ehash_bucket is not a power of two, we could 
      probably change alloc_large_system_hash() (in case it use __get_free_pages()) 
      to free the unused space. It currently allocates a big zone, but the last 
      quarter of it could be freed. Again, this should be a temporary 'problem'.
      
      Patch tested on ipv4 tcp only, but should be OK for IPV6 and DCCP.
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dbca9b27
    • P
      [NETLINK]: Don't BUG on undersized allocations · 26932566
      Patrick McHardy 提交于
      Currently netlink users BUG when the allocated skb for an event
      notification is undersized. While this is certainly a kernel bug,
      its not critical and crashing the kernel is too drastic, especially
      when considering that these errors have appeared multiple times in
      the past and it BUGs even if no listeners are present.
      
      This patch replaces BUG by WARN_ON and changes the notification
      functions to inform potential listeners of undersized allocations
      using a unique error code (EMSGSIZE).
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      26932566
  22. 29 9月, 2006 1 次提交
  23. 22 7月, 2006 1 次提交
  24. 01 7月, 2006 1 次提交
  25. 10 1月, 2006 4 次提交
  26. 04 1月, 2006 1 次提交