1. 12 Nov 2008 (2 commits)
  2. 11 Nov 2008 (3 commits)
  3. 08 Nov 2008 (3 commits)
  4. 07 Nov 2008 (3 commits)
  5. 06 Nov 2008 (3 commits)
    • net: Don't leak packets when a netns is going down · 0a36b345
      Eric W. Biederman committed
      I have been tracking for a while a case where, when the
      network namespace exits, the cleanup gets stuck in an
      endless procession of:
      
      unregister_netdevice: waiting for lo to become free. Usage count = 3
      unregister_netdevice: waiting for lo to become free. Usage count = 3
      unregister_netdevice: waiting for lo to become free. Usage count = 3
      unregister_netdevice: waiting for lo to become free. Usage count = 3
      unregister_netdevice: waiting for lo to become free. Usage count = 3
      unregister_netdevice: waiting for lo to become free. Usage count = 3
      unregister_netdevice: waiting for lo to become free. Usage count = 3
      
      It turns out that if you listen on a multicast address, an unsubscribe
      packet is sent when the network device goes down.  If you shut down
      the network namespace without carefully cleaning up, this can trigger
      the unsubscribe packet to be sent over the loopback interface while
      the network namespace is going down.
      
      All of which is fine, except that we drop the packet and forget to
      free it, leaking the skb and the dst entry attached to it.  As it
      turns out, the dst entry holds a reference to the idev, which holds
      the dev and keeps everything from being cleaned up.  Yuck!
      
      Fixing my earlier thinko and adding the needed kfree_skb makes
      everything clean up beautifully.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a36b345
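      A minimal sketch of the pattern this fix describes, with illustrative
      names (netns_alive() is a hypothetical stand-in; the actual patch
      touches the drop path hit during namespace shutdown): when the packet
      is dropped instead of transmitted, kfree_skb() releases the skb and the
      dst attached to it, which in turn drops the idev and dev references.

          /* Sketch only; netns_alive() is hypothetical. */
          static int xmit_or_drop(struct sk_buff *skb)
          {
                  if (!netns_alive(dev_net(skb->dev))) {
                          /* was: dropped without freeing, leaking skb + dst */
                          kfree_skb(skb);
                          return NET_XMIT_DROP;
                  }
                  return dev_queue_xmit(skb);
          }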
    • net: Guarantee the proper ordering of the loopback device. · ae33bc40
      Eric W. Biederman committed
      I was recently hunting a bug that occurred in network namespace
      cleanup.  In looking at the code it became apparent that we have,
      and will continue to have, cases where if anything is going
      on in a network namespace there will be assumptions that the
      loopback device is present.  Things like sending igmp unsubscribe
      messages when we bring down network devices invoke the routing
      code, which assumes that at least the loopback driver is present.
      
      Therefore, to avoid magic initcall ordering hackery that is hard
      to follow and hard to get right, insert a call to register the
      loopback device directly from net_dev_init().  This guarantees
      that the loopback device is the first device registered and
      the last network device to go away.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ae33bc40
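      A sketch of the shape of the change, assuming loopback's pernet
      operations are exposed under the symbol loopback_net_ops (an
      assumption, not confirmed by the message):

          /* In net_dev_init(), before any other pernet device registration,
           * so loopback is the first netdev in every namespace and the
           * last to be unregistered on cleanup. */
          if (register_pernet_device(&loopback_net_ops))
                  goto out;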
    • netns: Delete virtual interfaces during namespace cleanup · d0c082ce
      Eric W. Biederman committed
      When physical devices are inside of a network namespace and that
      network namespace terminates, we cannot make them go away.  We
      have to keep them, and moving them to the initial network namespace
      is the best we can do.
      
      For virtual devices left in a network namespace that is exiting
      we have no need to preserve them, and we now have the infrastructure
      that allows us to delete them.  So delete virtual devices when we
      exit a network namespace.  This makes the necessary user space cleanup
      after a network namespace exits much more tractable.
      Acked-by: Daniel Lezcano <dlezcano@fr.ibm.com>
      Acked-by: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d0c082ce
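      A sketch of the cleanup rule, assuming rtnl_link_ops->dellink is what
      marks (and removes) a virtual device; the handler name follows the
      era's conventions but is not guaranteed to match the patch:

          static void __net_exit default_device_exit(struct net *net)
          {
                  struct net_device *dev, *next;

                  rtnl_lock();
                  for_each_netdev_safe(net, dev, next) {
                          if (dev->rtnl_link_ops && dev->rtnl_link_ops->dellink)
                                  /* virtual device: just delete it */
                                  dev->rtnl_link_ops->dellink(dev);
                          else
                                  /* physical device: move back to init_net */
                                  dev_change_net_namespace(dev, &init_net,
                                                           "dev%d");
                  }
                  rtnl_unlock();
          }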
  6. 05 Nov 2008 (2 commits)
    • net: sk_free_datagram() should use sk_mem_reclaim_partial() · 270acefa
      Eric Dumazet committed
      I noticed contention on udp_memory_allocated in regular UDP applications.
      
      While tcp_memory_allocated is seldom used, it appears each incoming UDP
      frame currently touches udp_memory_allocated when queued, and again when
      received by the application.
      
      One possible solution is to use sk_mem_reclaim_partial() instead of
      sk_mem_reclaim(), so that we keep a small reserve (less than one page)
      of memory for each UDP socket.
      
      We did something very similar on the TCP side in commit
      9993e7d3
      ([TCP]: Do not purge sk_forward_alloc entirely in tcp_delack_timer())
      
      A more complex solution would need to convert prot->memory_allocated to
      use a percpu_counter with batches of 64 or 128 pages.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      270acefa
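      A minimal sketch of the change as described; the body approximates the
      era's datagram-free helper rather than reproducing the patch:

          void skb_free_datagram(struct sock *sk, struct sk_buff *skb)
          {
                  kfree_skb(skb);
                  /* was sk_mem_reclaim(sk): now keep a sub-page reserve per
                   * socket instead of touching udp_memory_allocated on
                   * every datagram */
                  sk_mem_reclaim_partial(sk);
          }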
    • net: fix packet socket delivery in rx irq handler · 9b22ea56
      Patrick McHardy committed
      The changes to deliver hardware accelerated VLAN packets to packet
      sockets (commit bc1d0411) caused a warning for non-NAPI drivers.
      The __vlan_hwaccel_rx() function is called directly from the driver's
      RX function; for non-NAPI drivers that means it is still in RX IRQ
      context:
      
      [   27.779463] ------------[ cut here ]------------
      [   27.779509] WARNING: at kernel/softirq.c:136 local_bh_enable+0x37/0x81()
      ...
      [   27.782520]  [<c0264755>] netif_nit_deliver+0x5b/0x75
      [   27.782590]  [<c02bba83>] __vlan_hwaccel_rx+0x79/0x162
      [   27.782664]  [<f8851c1d>] atl1_intr+0x9a9/0xa7c [atl1]
      [   27.782738]  [<c0155b17>] handle_IRQ_event+0x23/0x51
      [   27.782808]  [<c015692e>] handle_edge_irq+0xc2/0x102
      [   27.782878]  [<c0105fd5>] do_IRQ+0x4d/0x64
      
      Split hardware accelerated VLAN reception into two parts to fix this:
      
      - __vlan_hwaccel_rx just stores the VLAN TCI and performs the VLAN
        device lookup, then calls netif_receive_skb()/netif_rx()
      
      - vlan_hwaccel_do_receive(), which is invoked by netif_receive_skb()
        in softirq context, performs the real reception and delivery to
        packet sockets.
      Reported-and-tested-by: Ramon Casellas <ramon.casellas@cttc.es>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9b22ea56
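      A sketch of the two halves, assuming the skb->vlan_tci field and the
      vlan_group_get_device() helper of the era (details are illustrative):

          /* IRQ-context half: store the TCI, look up the vlan device,
           * defer everything else to softirq context */
          int __vlan_hwaccel_rx(struct sk_buff *skb, struct vlan_group *grp,
                                u16 vlan_tci, int polling)
          {
                  skb->vlan_tci = vlan_tci;
                  skb->dev = vlan_group_get_device(grp,
                                                   vlan_tci & VLAN_VID_MASK);
                  return polling ? netif_receive_skb(skb) : netif_rx(skb);
          }

          /* Softirq half, invoked from netif_receive_skb(), where calling
           * netif_nit_deliver() (packet socket delivery) is safe */
          void vlan_hwaccel_do_receive(struct sk_buff *skb);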
  7. 04 Nov 2008 (2 commits)
  8. 02 Nov 2008 (1 commit)
  9. 01 Nov 2008 (1 commit)
  10. 31 Oct 2008 (1 commit)
  11. 29 Oct 2008 (4 commits)
    • udp: RCU handling for Unicast packets. · 271b72c7
      Eric Dumazet committed
      The goals are:
      
      1) Optimize handling of incoming unicast UDP frames, so that no memory
       writes happen in the fast path.
      
       Note: multicasts and broadcasts will still need to take a lock,
       because doing a full lockless lookup in this case is difficult.
      
      2) No expensive operations in the socket bind/unhash phases:
        - No expensive synchronize_rcu() calls.
      
        - No added rcu_head in the socket structure, which would increase
        memory needs but, more importantly, would force us to use call_rcu()
        calls, which have the bad property of making the socket structure cold.
        (The rcu grace period between a socket's freeing and its potential
         reuse makes the socket cold in the CPU cache.)
        David did a previous patch using call_rcu() and noticed a 20%
        impact on TCP connection rates.
        Quoting Christopher Lameter:
         "Right. That results in cacheline cooldown. You'd want to recycle
          the object as they are cache hot on a per cpu basis. That is screwed
          up by the delayed regular rcu processing. We have seen multiple
          regressions due to cacheline cooldown.
          The only choice in cacheline hot sensitive areas is to deal with the
          complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."
      
        - Because udp sockets are allocated from a dedicated kmem_cache,
        use of SLAB_DESTROY_BY_RCU can help here.
      
      Theory of operation:
      --------------------
      
      As the lookup is lockless (using rcu_read_lock()/rcu_read_unlock()),
      special care must be taken by readers and writers.
      
      Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed
      and reused, and inserted in a different chain, or in the worst case the
      same chain, while readers are doing lookups at the same time.
      
      In order to avoid loops, a reader must check that each socket found in
      a chain really belongs to the chain the reader was traversing. If it
      finds a mismatch, the lookup must start again at the beginning. This
      *restart* loop is the reason we had to use a read lock for the multicast
      case, because we don't want to send the same message several times to
      the same socket.
      
      We use RCU only for the fast path.
      Thus, /proc/net/udp still takes spinlocks.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      271b72c7
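      A sketch of the restart rule the last paragraphs describe; match() and
      the iteration macro are illustrative stand-ins, not the patch's exact
      names:

          static struct sock *udp_lookup_rcu(struct hlist_head *chain,
                                             unsigned int hash)
          {
                  struct sock *sk;
                  struct hlist_node *node;

          begin:
                  sk_for_each_rcu(sk, node, chain) {
                          if (!match(sk)) /* stand-in for addr/port checks */
                                  continue;
                          /* SLAB_DESTROY_BY_RCU: sk may have been freed and
                           * reused while we walked; if it no longer hashes
                           * to this chain we may have wandered into another
                           * chain (or a loop), so restart from the head */
                          if (unlikely(sk->sk_hash != hash))
                                  goto begin;
                          if (unlikely(!atomic_inc_not_zero(&sk->sk_refcnt)))
                                  goto begin; /* mid-free: retry */
                          return sk;
                  }
                  return NULL;
          }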
    • net: don't use INIT_RCU_HEAD · 93adcc80
      Alexey Dobriyan committed
      call_rcu() will unconditionally rewrite the RCU head anyway.
      This applies to:
      	struct neigh_parms
      	struct neigh_table
      	struct net
      	struct cipso_v4_doi
      	struct in_ifaddr
      	struct in_device
      	rt->u.dst
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93adcc80
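      The point in miniature, with names borrowed from the neighbour code
      (the exact call sites differ per structure, and the callback name is
      an assumption):

          /* No INIT_RCU_HEAD(&parms->rcu_head) is needed beforehand;
           * call_rcu() rewrites the rcu_head unconditionally. */
          call_rcu(&parms->rcu_head, neigh_rcu_free_parms);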
    • net: reduce structures when XFRM=n · def8b4fa
      Alexey Dobriyan committed
      ifdef out
      * struct sk_buff::sp		(pointer)
      * struct dst_entry::xfrm	(pointer)
      * struct sock::sk_policy	(2 pointers)
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      def8b4fa
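      A sketch of the pattern for one of the three structures; surrounding
      fields abbreviated:

          struct sk_buff {
                  /* ... other members unchanged ... */
          #ifdef CONFIG_XFRM
                  struct sec_path         *sp;    /* absent when XFRM=n */
          #endif
                  /* ... */
          };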
    • pktgen: fix multiple queue warning · 88271660
      Jesse Brandeburg committed
      When testing the new pktgen module with multiple queues and ixgbe with:
      	pgset "flag QUEUE_MAP_CPU"
      
      I found that I was getting errors in dmesg like:
      pktgen: WARNING: QUEUE_MAP_CPU disabled because CPU count (8) exceeds number
      <4>pktgen: WARNING: of tx queues (8) on eth15
      
      You'll note, 8 really doesn't exceed 8.
      
      This patch seemed to fix the logic errors, and also the attempts at
      limiting line length in printk (which didn't work anyway, as the stray
      "<4>" log level marker above shows).
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88271660
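      An illustrative shape of the fix, assuming the device's queue count
      lives in real_num_tx_queues (the pktgen variable names here are
      assumptions): with 8 CPUs and 8 tx queues the warning must not fire,
      so the test has to be strict, and the message goes out as one printk:

          if (num_online_cpus() > odev->real_num_tx_queues)
                  printk(KERN_WARNING "pktgen: WARNING: QUEUE_MAP_CPU "
                         "disabled because CPU count (%d) exceeds number of "
                         "tx queues (%d) on %s\n",
                         num_online_cpus(), odev->real_num_tx_queues,
                         odev->name);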
  12. 28 Oct 2008 (2 commits)
  13. 23 Oct 2008 (1 commit)
    • net: Fix disjunct computation of netdev features · b63365a2
      Herbert Xu committed
      My change
      
          commit e2a6b852
          net: Enable TSO if supported by at least one device
      
      didn't do what was intended because the netdev_compute_features
      function was designed for conjunctions.  So what happened was that
      it would simply take the TSO status of the last constituent device.
      
      This patch extends it to support both conjunctions and disjunctions
      under the new name of netdev_increment_features.
      
      It also adds a new function netdev_fix_features which does the
      sanity checking that usually occurs upon registration.  This ensures
      that the computation doesn't result in an illegal combination
      since this checking is absent when the change is initiated via
      ethtool.
      
      The two users of netdev_compute_features have been converted.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b63365a2
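      A sketch of the difference between the two operations on feature
      flags; this fragment is illustrative and does not reproduce the
      patch's exact masking logic:

          unsigned long all_features, one_dev_features;

          /* conjunction: the combined device can only do what every
           * constituent can do (this is what took the last device's
           * TSO status) */
          all_features &= one_dev_features;

          /* disjunction ("increment"): software-assisted features such as
           * TSO can be enabled if at least one constituent supports them */
          all_features |= one_dev_features & NETIF_F_ALL_TSO;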
  14. 20 Oct 2008 (1 commit)
  15. 17 Oct 2008 (1 commit)
  16. 15 Oct 2008 (1 commit)
  17. 14 Oct 2008 (2 commits)
  18. 13 Oct 2008 (1 commit)
  19. 08 Oct 2008 (5 commits)
    • netns: export netns list · b76a461f
      Alexey Dobriyan committed
      Conntrack code will use it for
      a) removing expectations and helpers when the corresponding module is
         removed, and
      b) removing conntracks when an L3 protocol conntrack module is removed.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      b76a461f
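      A sketch of the consumer pattern the export enables, using the
      existing for_each_net() helper; the cleanup callback is hypothetical:

          struct net *net;

          /* on module unload, walk every namespace under the lock that
           * protects the namespace list */
          rtnl_lock();
          for_each_net(net)
                  remove_conntracks_for(net); /* hypothetical per-netns hook */
          rtnl_unlock();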
    • net: Fix netdev_run_todo dead-lock · 58ec3b4d
      Herbert Xu committed
      Benjamin Thery tracked down a bug that explains many instances
      of the error
      
      unregister_netdevice: waiting for %s to become free. Usage count = %d
      
      It turns out that netdev_run_todo can dead-lock with itself if
      a second instance of it is run in a thread that will then free
      a reference to the device waited on by the first instance.
      
      The problem is really quite silly.  We were trying to create
      parallelism where none was required.  As netdev_run_todo always
      follows an RTNL section, and todo tasks can only be added
      with the RTNL held, by definition you should only need to wait
      for the very ones you've added and be done with it.
      
      There is no need for a second mutex or spinlock.
      
      This is exactly what the following patch does.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      58ec3b4d
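      A sketch of the simplified flow; net_todo_list and the helpers follow
      the era's names, but the body is an approximation:

          void netdev_run_todo(void)
          {
                  struct list_head list;

                  /* Still under the RTNL here: snapshot our own todo
                   * entries and let later RTNL sections build their own
                   * list.  No extra mutex or spinlock required. */
                  list_replace_init(&net_todo_list, &list);
                  __rtnl_unlock();

                  while (!list_empty(&list)) {
                          struct net_device *dev =
                                  list_first_entry(&list, struct net_device,
                                                   todo_list);
                          list_del(&dev->todo_list);
                          /* only waits on the entries we ourselves added */
                          netdev_wait_allrefs(dev);
                          /* final per-device teardown elided */
                  }
          }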
    • net: only invoke dev->change_rx_flags when device is UP · b6c40d68
      Patrick McHardy committed
      Jesper Dangaard Brouer <hawk@comx.dk> reported a bug when setting a VLAN
      device down that is in promiscuous mode:
      
      When the VLAN device is set down, the promiscuous count on the real
      device is decremented by one by vlan_dev_stop(). When removing the
      promiscuous flag from the VLAN device afterwards, the promiscuous
      count on the real device is decremented a second time by the
      vlan_change_rx_flags() callback.
      
      The root cause is that the ->change_rx_flags() callback is invoked
      while the device is down. The intended semantics mirror those of the
      ->set_rx_mode callbacks: the ->open function is responsible for doing
      a full sync on open, the ->close() function is responsible for doing
      full cleanup on ->stop(), and ->change_rx_flags() is meant to do
      incremental changes while the device is UP.
      
      Only invoke ->change_rx_flags() while the device is UP to provide the
      intended behaviour.
      Tested-by: Jesper Dangaard Brouer <jdb@comx.dk>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b6c40d68
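      A sketch of the gating helper the description implies; the helper name
      is an assumption, but the check is exactly what the message states:

          static void dev_change_rx_flags(struct net_device *dev, int flags)
          {
                  /* mirror ->set_rx_mode semantics: ->open does the full
                   * sync, ->stop the full cleanup, so only report
                   * incremental changes while the device is UP */
                  if ((dev->flags & IFF_UP) && dev->change_rx_flags)
                          dev->change_rx_flags(dev, flags);
          }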
    • net: packet split receive api · 654bed16
      Peter Zijlstra committed
      Add some packet-split receive hooks.
      
      For one, this allows doing NUMA-node-affine page allocations. Later on,
      these hooks will be extended to do emergency reserve allocations for
      fragments.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      654bed16
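      A sketch of a hook pair of the kind described, assuming the names
      netdev_alloc_page() and skb_add_rx_frag() (an assumption based on the
      description, not confirmed by it):

          /* page allocation goes through a hook so it can later be made
           * NUMA-node affine or served from an emergency reserve */
          struct page *netdev_alloc_page(struct net_device *dev)
          {
                  return alloc_page(GFP_ATOMIC); /* placeholder policy */
          }

          /* attach an rx fragment to the skb and fix up the accounting */
          void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page,
                               int off, int size)
          {
                  skb_fill_page_desc(skb, i, page, off, size);
                  skb->len += size;
                  skb->data_len += size;
                  skb->truesize += size;
          }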
    • net: wrap sk->sk_backlog_rcv() · c57943a1
      Peter Zijlstra committed
      Wrap calls to sk->sk_backlog_rcv() in a function. This will allow
      extending the generic sk_backlog_rcv behaviour.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c57943a1
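      The wrapper in its simplest form, following the description directly
      (a static inline in the socket header):

          static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
          {
                  /* single place to hook extensions into backlog receive */
                  return sk->sk_backlog_rcv(sk, skb);
          }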
  20. 01 Oct 2008 (1 commit)