1. 05 10月, 2010 2 次提交
  2. 04 10月, 2010 2 次提交
    • E
      net: introduce DST_NOCACHE flag · c7d4426a
      Eric Dumazet 提交于
      While doing stress tests with IP route cache disabled, and multi queue
      devices, I noticed a very high contention on one rwlock used in
      neighbour code.
      
      When many cpus are trying to send frames (possibly using a high
      performance multiqueue device) to the same neighbour, they fight for the
      neigh->lock rwlock in order to call neigh_hh_init(), and fight on
      hh->hh_refcnt (a pair of atomic_inc/atomic_dec_and_test())
      
      But we dont need to call neigh_hh_init() for dst that are used only
      once. It costs four atomic operations at least, on two contended cache
      lines, plus the high contention on neigh->lock rwlock.
      
      Introduce a new dst flag, DST_NOCACHE, that is set when dst was not
      inserted in route cache.
      
      With the stress test bench, sending 160000000 frames on one neighbour,
      results are :
      
      Before patch:
      
      real	2m28.406s
      user	0m11.781s
      sys	36m17.964s
      
      
      After patch:
      
      real	1m26.532s
      user	0m12.185s
      sys	20m3.903s
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c7d4426a
    • N
      net: Fix the condition passed to sk_wait_event() · 482964e5
      Nagendra Tomar 提交于
      This patch fixes the condition (3rd arg) passed to sk_wait_event() in
      sk_stream_wait_memory(). The incorrect check in sk_stream_wait_memory()
      causes the following soft lockup in tcp_sendmsg() when the global tcp
      memory pool has exhausted.
      
      >>> snip <<<
      
      localhost kernel: BUG: soft lockup - CPU#3 stuck for 11s! [sshd:6429]
      localhost kernel: CPU 3:
      localhost kernel: RIP: 0010:[sk_stream_wait_memory+0xcd/0x200]  [sk_stream_wait_memory+0xcd/0x200] sk_stream_wait_memory+0xcd/0x200
      localhost kernel:
      localhost kernel: Call Trace:
      localhost kernel:  [sk_stream_wait_memory+0x1b1/0x200] sk_stream_wait_memory+0x1b1/0x200
      localhost kernel:  [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40
      localhost kernel:  [ipv6:tcp_sendmsg+0x6e6/0xe90] tcp_sendmsg+0x6e6/0xce0
      localhost kernel:  [sock_aio_write+0x126/0x140] sock_aio_write+0x126/0x140
      localhost kernel:  [xfs:do_sync_write+0xf1/0x130] do_sync_write+0xf1/0x130
      localhost kernel:  [<ffffffff802557c0>] autoremove_wake_function+0x0/0x40
      localhost kernel:  [hrtimer_start+0xe3/0x170] hrtimer_start+0xe3/0x170
      localhost kernel:  [vfs_write+0x185/0x190] vfs_write+0x185/0x190
      localhost kernel:  [sys_write+0x50/0x90] sys_write+0x50/0x90
      localhost kernel:  [system_call+0x7e/0x83] system_call+0x7e/0x83
      
      >>> snip <<<
      
      What is happening is, that the sk_wait_event() condition passed from
      sk_stream_wait_memory() evaluates to true for the case of tcp global memory
      exhaustion. This is because both sk_stream_memory_free() and vm_wait are true
      which causes sk_wait_event() to *not* call schedule_timeout().
      Hence sk_stream_wait_memory() returns immediately to the caller w/o sleeping.
      This causes the caller to again try allocation, which again fails and again
      calls sk_stream_wait_memory(), and so on.
      
      [ Bug introduced by commit c1cbe4b7
        ("[NET]: Avoid atomic xchg() for non-error case") -DaveM ]
      Signed-off-by: NNagendra Singh Tomar <tomer_iisc@yahoo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      482964e5
  3. 30 9月, 2010 2 次提交
  4. 29 9月, 2010 1 次提交
    • T
      ipv4: Allow configuring subnets as local addresses · 4465b469
      Tom Herbert 提交于
      This patch allows a host to be configured to respond to any address in
      a specified range as if it were local, without actually needing to
      configure the address on an interface.  This is done through routing
      table configuration.  For instance, to configure a host to respond
      to any address in 10.1/16 received on eth0 as a local address we can do:
      
      ip rule add from all iif eth0 lookup 200
      ip route add local 10.1/16 dev lo proto kernel scope host src 127.0.0.1 table 200
      
      This host is now reachable by any 10.1/16 address (route lookup on
      input for packets received on eth0 can find the route).  On output, the
      rule will not be matched so that this host can still send packets to
      10.1/16 (not sent on loopback).  Presumably, external routing can be
      configured to make sense out of this.
      
      To make this work, we needed to modify the logic in finding the
      interface which is assigned a given source address for output
      (dev_ip_find).  We perform a normal fib_lookup instead of just a
      lookup on the local table, and in the lookup we ignore the input
      interface for matching.
      
      This patch is useful to implement IP-anycast for subnets of virtual
      addresses.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4465b469
  5. 28 9月, 2010 4 次提交
  6. 27 9月, 2010 2 次提交
  7. 25 9月, 2010 1 次提交
    • E
      net: fix a lockdep splat · f064af1e
      Eric Dumazet 提交于
      We have for each socket :
      
      One spinlock (sk_slock.slock)
      One rwlock (sk_callback_lock)
      
      Possible scenarios are :
      
      (A) (this is used in net/sunrpc/xprtsock.c)
      read_lock(&sk->sk_callback_lock) (without blocking BH)
      <BH>
      spin_lock(&sk->sk_slock.slock);
      ...
      read_lock(&sk->sk_callback_lock);
      ...
      
      (B)
      write_lock_bh(&sk->sk_callback_lock)
      stuff
      write_unlock_bh(&sk->sk_callback_lock)
      
      (C)
      spin_lock_bh(&sk->sk_slock)
      ...
      write_lock_bh(&sk->sk_callback_lock)
      stuff
      write_unlock_bh(&sk->sk_callback_lock)
      spin_unlock_bh(&sk->sk_slock)
      
      This (C) case conflicts with (A) :
      
      CPU1 [A]                         CPU2 [C]
      read_lock(callback_lock)
      <BH>                             spin_lock_bh(slock)
      <wait to spin_lock(slock)>
                                       <wait to write_lock_bh(callback_lock)>
      
      We have one problematic (C) use case in inet_csk_listen_stop() :
      
      local_bh_disable();
      bh_lock_sock(child); // spin_lock_bh(&sk->sk_slock)
      WARN_ON(sock_owned_by_user(child));
      ...
      sock_orphan(child); // write_lock_bh(&sk->sk_callback_lock)
      
      lockdep is not happy with this, as reported by Tetsuo Handa
      
      It seems only way to deal with this is to use read_lock_bh(callbacklock)
      everywhere.
      
      Thanks to Jarek for pointing a bug in my first attempt and suggesting
      this solution.
      Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Tested-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      CC: Jarek Poplawski <jarkao2@gmail.com>
      Tested-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f064af1e
  8. 24 9月, 2010 1 次提交
  9. 22 9月, 2010 3 次提交
  10. 18 9月, 2010 3 次提交
  11. 17 9月, 2010 1 次提交
  12. 16 9月, 2010 3 次提交
  13. 15 9月, 2010 1 次提交
  14. 14 9月, 2010 1 次提交
  15. 11 9月, 2010 1 次提交
  16. 10 9月, 2010 2 次提交
  17. 09 9月, 2010 2 次提交
  18. 08 9月, 2010 1 次提交
    • H
      net: fix tx queue selection for bridged devices implementing select_queue · deabc772
      Helmut Schaa 提交于
      When a net device is implementing the select_queue callback and is part of
      a bridge, frames coming from the bridge already have a tx queue associated
      to the socket (introduced in commit a4ee3ce3,
      "net: Use sk_tx_queue_mapping for connected sockets"). The call to
      sk_tx_queue_get will then return the tx queue used by the bridge instead
      of calling the select_queue callback.
      
      In case of mac80211 this broke QoS which is implemented by using the
      select_queue callback. Furthermore it introduced problems with rt2x00
      because frames with the same TID and RA sometimes appeared on different
      tx queues which the hw cannot handle correctly.
      
      Fix this by always calling select_queue first if it is available and only
      afterwards use the socket tx queue mapping.
      Signed-off-by: NHelmut Schaa <helmut.schaa@googlemail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      deabc772
  19. 07 9月, 2010 2 次提交
  20. 04 9月, 2010 1 次提交
  21. 03 9月, 2010 2 次提交
    • J
      pkt_sched: Fix lockdep warning on est_tree_lock in gen_estimator · 0b5d404e
      Jarek Poplawski 提交于
      This patch fixes a lockdep warning:
      
      [  516.287584] =========================================================
      [  516.288386] [ INFO: possible irq lock inversion dependency detected ]
      [  516.288386] 2.6.35b #7
      [  516.288386] ---------------------------------------------------------
      [  516.288386] swapper/0 just changed the state of lock:
      [  516.288386]  (&qdisc_tx_lock){+.-...}, at: [<c12eacda>] est_timer+0x62/0x1b4
      [  516.288386] but this lock took another, SOFTIRQ-unsafe lock in the past:
      [  516.288386]  (est_tree_lock){+.+...}
      [  516.288386] 
      [  516.288386] and interrupts could create inverse lock ordering between them.
      ...
      
      So, est_tree_lock needs BH protection because it's taken by
      qdisc_tx_lock, which is used both in BH and process contexts.
      (Full warning with this patch at netdev, 02 Sep 2010.)
      
      Fixes commit: ae638c47
      ("pkt_sched: gen_estimator: add a new lock")
      Signed-off-by: NJarek Poplawski <jarkao2@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0b5d404e
    • E
      net: dev_add_pack() & __dev_remove_pack() changes · c07b68e8
      Eric Dumazet 提交于
      Add a small helper ptype_head() to get the head to manipulate
      
      dev_add_pack() & __dev_remove_pack() can use a spinlock without
      blocking BH, since softirq use RCU, and these functions are run from
      process context only.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c07b68e8
  22. 02 9月, 2010 2 次提交