1. 19 Jul, 2008 1 commit
    • pkt_sched: Manage qdisc list inside of root qdisc. · 30723673
      Committed by David S. Miller
      Idea is from Patrick McHardy.
      
      Instead of managing the list of qdiscs on the device level, manage it
      in the root qdisc of a netdev_queue.  This solves all kinds of
      visibility issues during qdisc destruction.
      
      The way to iterate over all qdiscs of a netdev_queue is to visit
      the netdev_queue->qdisc, and then traverse its list.
      
      The only special case is to ignore builtin qdiscs at the root when
      dumping or doing a qdisc_lookup().  That was not needed previously
      because builtin qdiscs were not added to the device's qdisc_list.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 18 Jul, 2008 7 commits
    • pkt_sched: Kill netdev_queue lock. · 83874000
      Committed by David S. Miller
      We can simply use the qdisc->q.lock for all of the
      qdisc tree synchronization.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netdevice: Move qdisc_list back into net_device proper. · ead81cc5
      Committed by David S. Miller
      And give it its own lock.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • pkt_sched: Schedule qdiscs instead of netdev_queue. · 37437bb2
      Committed by David S. Miller
      When we have shared qdiscs, packets come out of the qdiscs
      for multiple transmit queues.
      
      Therefore it doesn't make any sense to schedule the transmit
      queue when logically we cannot know ahead of time the TX
      queue of the SKB that the qdisc->dequeue() will give us.
      
      Just for sanity I added a BUG check to make sure we never
      get into a state where the noop_qdisc is scheduled.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Implement simple sw TX hashing. · 8f0f2223
      Committed by David S. Miller
      It simply XOR-hashes the IPv4/IPv6 addresses and the transport-layer ports.
      
      The only assumption it makes is that skb_network_header() is set
      correctly.
      
      With bug fixes from Eric Dumazet.
      Signed-off-by: David S. Miller <davem@davemloft.net>
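The XOR hashing described in the commit above can be illustrated with a small user-space sketch. The function name `simple_tx_hash_sketch`, its flat argument list, and the final modulo reduction are assumptions made for illustration; the kernel code reads these fields from the skb headers and is not reproduced here.

```c
#include <stdint.h>

/*
 * Hypothetical sketch: XOR the two IPv4 addresses and the two transport
 * ports together, then map the result onto a TX queue index.
 */
static uint16_t simple_tx_hash_sketch(uint32_t saddr, uint32_t daddr,
                                      uint16_t sport, uint16_t dport,
                                      uint16_t num_tx_queues)
{
    uint32_t hash;

    /* Guard against a device that reports no active queues. */
    if (num_tx_queues == 0)
        return 0;

    hash = saddr ^ daddr;                 /* fold the addresses */
    hash ^= (uint32_t)(sport ^ dport);    /* fold the ports */

    /* Assumption: reduce to a queue index with a plain modulo. */
    return (uint16_t)(hash % num_tx_queues);
}
```

Because XOR is commutative, both directions of a flow map to the same queue index in this sketch.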
    • netdev: Add netdev->select_queue() method. · eae792b7
      Committed by David S. Miller
      Devices or device layers can set this to control the queue selection
      performed by dev_pick_tx().
      
      This function runs under RCU protection, which allows overriding
      functions to have some way of synchronizing with things like dynamic
      ->real_num_tx_queues adjustments.
      
      This makes the spinlock prefetch in dev_queue_xmit() a little bit
      less effective, but that's the price right now for correctness.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Use queue aware tests throughout. · fd2ea0a7
      Committed by David S. Miller
      This effectively "flips the switch" by making the core networking
      and multiqueue-aware drivers use the new TX multiqueue structures.
      
      Non-multiqueue drivers need no changes.  The interfaces they use such
      as netif_stop_queue() degenerate into an operation on TX queue zero.
      So everything "just works" for them.
      
      Code that really wants to do "X" to all TX queues now invokes a
      routine that does so, such as netif_tx_wake_all_queues(),
      netif_tx_stop_all_queues(), etc.
      
      pktgen and netpoll required a little bit more surgery than the others.
      
      In particular the pktgen changes, whilst functional, could be largely
      improved.  The initial check in pktgen_xmit() will sometimes check the
      wrong queue, which is mostly harmless.  The thing to do is probably to
      invoke fill_packet() earlier.
      
      The bulk of the netpoll changes is to make the code operate solely on
      the TX queue indicated by the SKB queue mapping.
      
      Setting of the SKB queue mapping is entirely confined inside of
      net/core/dev.c:dev_pick_tx().  If we end up needing any kind of
      special semantics (drops, for example) it will be implemented here.
      
      Finally, we now have a "real_num_tx_queues" which is where the driver
      indicates how many TX queues are actually active.
      
      With IGB changes from Jeff Kirsher.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netdev: Allocate multiple queues for TX. · e8a0464c
      Committed by David S. Miller
      alloc_netdev_mq() now allocates an array of netdev_queue
      structures for TX, based upon the queue_count argument.
      
      Furthermore, all accesses to the TX queues are now vectored
      through the netdev_get_tx_queue() and netdev_for_each_tx_queue()
      interfaces.  This makes it easy to grep the tree for all
      things that want to get to a TX queue of a net device.
      
      Problem spots which are not really multiqueue aware yet, and
      only work with one queue, can easily be spotted by grepping
      for all netdev_get_tx_queue() calls that pass in a zero index.
      Signed-off-by: David S. Miller <davem@davemloft.net>
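The idea of allocating an array of TX queues and routing every access through accessor helpers can be sketched in user space as follows. All names ending in `_sketch` are hypothetical stand-ins, not the kernel's `alloc_netdev_mq()`, `netdev_get_tx_queue()`, or `netdev_for_each_tx_queue()`; the struct layouts are assumptions for illustration only.

```c
#include <stdlib.h>

/* Hypothetical per-queue state. */
struct txq_sketch {
    int stopped;
};

/* Hypothetical device holding an array of TX queues. */
struct netdev_sketch {
    unsigned int num_tx_queues;
    struct txq_sketch *tx;      /* array of num_tx_queues entries */
};

/* Allocate the device and one queue structure per requested TX queue. */
static struct netdev_sketch *alloc_netdev_mq_sketch(unsigned int queue_count)
{
    struct netdev_sketch *dev;

    if (queue_count == 0)
        return NULL;
    dev = calloc(1, sizeof(*dev));
    if (!dev)
        return NULL;
    dev->tx = calloc(queue_count, sizeof(*dev->tx));
    if (!dev->tx) {
        free(dev);
        return NULL;
    }
    dev->num_tx_queues = queue_count;
    return dev;
}

/* Single access point for a TX queue; callers must pass a valid index. */
static struct txq_sketch *get_tx_queue_sketch(const struct netdev_sketch *dev,
                                              unsigned int index)
{
    return &dev->tx[index];
}

/* Apply fn to every TX queue of the device. */
static void for_each_tx_queue_sketch(struct netdev_sketch *dev,
                                     void (*fn)(struct txq_sketch *, void *),
                                     void *arg)
{
    unsigned int i;

    for (i = 0; i < dev->num_tx_queues; i++)
        fn(&dev->tx[i], arg);
}

/* Example callback: mark a queue stopped and count it if arg is non-NULL. */
static void stop_queue_sketch(struct txq_sketch *txq, void *arg)
{
    unsigned int *count = arg;

    txq->stopped = 1;
    if (count)
        (*count)++;
}

static void free_netdev_sketch(struct netdev_sketch *dev)
{
    if (!dev)
        return;
    free(dev->tx);
    free(dev);
}
```

With accessors like these, code that is still single-queue stands out as calls to `get_tx_queue_sketch(dev, 0)`, which mirrors the grep-based audit the commit message describes.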
  3. 15 Jul, 2008 4 commits
  4. 09 Jul, 2008 9 commits
  5. 07 Jul, 2008 1 commit
  6. 02 Jul, 2008 1 commit
  7. 28 Jun, 2008 1 commit
  8. 21 Jun, 2008 1 commit
    • netns: Don't receive new packets in a dead network namespace. · b9f75f45
      Committed by Eric W. Biederman
      Alexey Dobriyan <adobriyan@gmail.com> writes:
      > Subject: ICMP sockets destruction vs ICMP packets oops
      
      > After icmp_sk_exit() nuked ICMP sockets, we get an interrupt.
      > icmp_reply() wants ICMP socket.
      >
      > Steps to reproduce:
      >
      > 	launch shell in new netns
      > 	move real NIC to netns
      > 	setup routing
      > 	ping -i 0
      > 	exit from shell
      >
      > BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
      > IP: [<ffffffff803fce17>] icmp_sk+0x17/0x30
      > PGD 17f3cd067 PUD 17f3ce067 PMD 0 
      > Oops: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC
      > CPU 0 
      > Modules linked in: usblp usbcore
      > Pid: 0, comm: swapper Not tainted 2.6.26-rc6-netns-ct #4
      > RIP: 0010:[<ffffffff803fce17>]  [<ffffffff803fce17>] icmp_sk+0x17/0x30
      > RSP: 0018:ffffffff8057fc30  EFLAGS: 00010286
      > RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff81017c7db900
      > RDX: 0000000000000034 RSI: ffff81017c7db900 RDI: ffff81017dc41800
      > RBP: ffffffff8057fc40 R08: 0000000000000001 R09: 000000000000a815
      > R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff8057fd28
      > R13: ffffffff8057fd00 R14: ffff81017c7db938 R15: ffff81017dc41800
      > FS:  0000000000000000(0000) GS:ffffffff80525000(0000) knlGS:0000000000000000
      > CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      > CR2: 0000000000000000 CR3: 000000017fcda000 CR4: 00000000000006e0
      > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      > Process swapper (pid: 0, threadinfo ffffffff8053a000, task ffffffff804fa4a0)
      > Stack:  0000000000000000 ffff81017c7db900 ffffffff8057fcf0 ffffffff803fcfe4
      >  ffffffff804faa38 0000000000000246 0000000000005a40 0000000000000246
      >  000000000001ffff ffff81017dd68dc0 0000000000005a40 0000000055342436
      > Call Trace:
      >  <IRQ>  [<ffffffff803fcfe4>] icmp_reply+0x44/0x1e0
      >  [<ffffffff803d3a0a>] ? ip_route_input+0x23a/0x1360
      >  [<ffffffff803fd645>] icmp_echo+0x65/0x70
      >  [<ffffffff803fd300>] icmp_rcv+0x180/0x1b0
      >  [<ffffffff803d6d84>] ip_local_deliver+0xf4/0x1f0
      >  [<ffffffff803d71bb>] ip_rcv+0x33b/0x650
      >  [<ffffffff803bb16a>] netif_receive_skb+0x27a/0x340
      >  [<ffffffff803be57d>] process_backlog+0x9d/0x100
      >  [<ffffffff803bdd4d>] net_rx_action+0x18d/0x250
      >  [<ffffffff80237be5>] __do_softirq+0x75/0x100
      >  [<ffffffff8020c97c>] call_softirq+0x1c/0x30
      >  [<ffffffff8020f085>] do_softirq+0x65/0xa0
      >  [<ffffffff80237af7>] irq_exit+0x97/0xa0
      >  [<ffffffff8020f198>] do_IRQ+0xa8/0x130
      >  [<ffffffff80212ee0>] ? mwait_idle+0x0/0x60
      >  [<ffffffff8020bc46>] ret_from_intr+0x0/0xf
      >  <EOI>  [<ffffffff80212f2c>] ? mwait_idle+0x4c/0x60
      >  [<ffffffff80212f23>] ? mwait_idle+0x43/0x60
      >  [<ffffffff8020a217>] ? cpu_idle+0x57/0xa0
      >  [<ffffffff8040f380>] ? rest_init+0x70/0x80
      > Code: 10 5b 41 5c 41 5d 41 5e c9 c3 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 53
      > 48 83 ec 08 48 8b 9f 78 01 00 00 e8 2b c7 f1 ff 89 c0 <48> 8b 04 c3 48 83 c4 08
      > 5b c9 c3 66 66 66 66 66 2e 0f 1f 84 00
      > RIP  [<ffffffff803fce17>] icmp_sk+0x17/0x30
      >  RSP <ffffffff8057fc30>
      > CR2: 0000000000000000
      > ---[ end trace ea161157b76b33e8 ]---
      > Kernel panic - not syncing: Aiee, killing interrupt handler!
      
      Receiving packets while we are cleaning up a network namespace is a
      racy proposition. It is possible when the packet arrives that we have
      removed some but not all of the state we need to fully process it.  We
      have the choice of either playing whack-a-mole with the cleanup routines
      or simply dropping packets when we don't have a network namespace to
      handle them.
      
      Since the check looks inexpensive in netif_receive_skb let's just
      drop the incoming packets.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
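The "drop early instead of chasing every cleanup race" approach from the commit above might look roughly like this in a user-space model. The types, the reference-count test, and the names `net_alive_sketch` and `receive_packet_sketch` are assumptions for illustration, not the kernel's actual structures or the exact check added to netif_receive_skb.

```c
#include <stdbool.h>
#include <stddef.h>

enum rx_result_sketch {
    RX_SUCCESS_SKETCH,
    RX_DROP_SKETCH
};

/* Hypothetical namespace: considered alive while it still has users. */
struct net_sketch {
    int refcount;
};

static bool net_alive_sketch(const struct net_sketch *net)
{
    return net != NULL && net->refcount > 0;
}

/*
 * Receive path: one cheap check up front, so no later handler ever sees
 * a namespace whose state is partially torn down.
 */
static enum rx_result_sketch receive_packet_sketch(const struct net_sketch *net)
{
    if (!net_alive_sketch(net))
        return RX_DROP_SKETCH;      /* namespace is going away: drop */

    /* ... normal protocol processing would continue here ... */
    return RX_SUCCESS_SKETCH;
}
```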
  9. 20 Jun, 2008 1 commit
  10. 18 Jun, 2008 2 commits
    • netdevice: Fix promiscuity and allmulti overflow · dad9b335
      Committed by Wang Chen
      Adding a positive @inc to promiscuity or allmulti at their maximum
      value can cause overflow.
      For example: when allmulti=0xFFFFFFFF, any caller that gives
      dev_set_allmulti() a positive @inc will cause allmulti to be turned off.
      This is not what we want, though it is a rare case.
      The fix is that only a negative @inc can turn allmulti or promiscuity off,
      and when any caller makes the counters hit the ceiling, we return an error.
      
      Changes in v2:
      Change the void functions dev_set_promiscuity/allmulti to return int,
      so callers can get the overflow error.
      The callers will be fixed later.
      
      Changes in v3:
      1. Since we return an error to the caller, we don't need to print at
      KERN_ERROR; KERN_WARNING is enough.
      2. In dev_set_promiscuity(), if __dev_set_promiscuity() fails, we
      return at once.
      Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
      Acked-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
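The overflow rule described in the commit above (a positive increment must never wrap the counter to zero, and only a negative increment may switch the mode off) can be sketched as below. The struct, the function name `set_promiscuity_sketch`, and the choice to clamp at zero on an oversized decrement are assumptions for illustration; this is not the kernel's `__dev_set_promiscuity()`.

```c
#include <errno.h>
#include <limits.h>

/* Hypothetical device state: a use counter plus the resulting mode flag. */
struct ifstate_sketch {
    unsigned int promiscuity;
    int promisc_on;
};

static int set_promiscuity_sketch(struct ifstate_sketch *dev, int inc)
{
    if (inc > 0) {
        /* Refuse any increment that would wrap the counter. */
        if ((unsigned int)inc > UINT_MAX - dev->promiscuity)
            return -EOVERFLOW;
        dev->promiscuity += (unsigned int)inc;
    } else if (inc < 0) {
        /* Magnitude of inc; unsigned negation is defined even for INT_MIN. */
        unsigned int dec = 0u - (unsigned int)inc;

        /* Assumption: clamp at zero rather than wrapping below it. */
        dev->promiscuity = dec > dev->promiscuity ? 0 : dev->promiscuity - dec;
    }

    /* The mode is on exactly while at least one user holds it. */
    dev->promisc_on = dev->promiscuity != 0;
    return 0;
}
```

Returning an error code instead of silently wrapping is what lets callers notice the failure, which is the point of changing the return type from void to int.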
    • net/core: add NETDEV_BONDING_FAILOVER event · c1da4ac7
      Committed by Or Gerlitz
      Add NETDEV_BONDING_FAILOVER event to be used in a successive patch
      by bonding to announce fail-over for the active-backup mode through the
      netdev events notifier chain mechanism. Such an event can be of use for the
      RDMA CM (communication manager) to let native RDMA ULPs (eg NFS-RDMA, iSER)
      always be aligned with the IP stack, in the sense that they use the same
      ports/links as the stack does. The event can also make monitoring
      tools based on netlink events aware of bonding fail-over.
      Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
      Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
      Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
  11. 17 Jun, 2008 1 commit
    • net: Fix test for VLAN TX checksum offload capability · 6de329e2
      Committed by Ben Hutchings
      Selected device feature bits can be propagated to VLAN devices, so we
      can make use of TX checksum offload and TSO on VLAN-tagged packets.
      However, if the physical device does not do VLAN tag insertion or
      generic checksum offload then the test for TX checksum offload in
      dev_queue_xmit() will see a protocol of htons(ETH_P_8021Q) and yield
      false.
      
      This splits the checksum offload test into two functions:
      
      - can_checksum_protocol() tests a given protocol against a feature bitmask
      
      - dev_can_checksum() first tests the skb protocol against the device
        features; if that fails and the protocol is htons(ETH_P_8021Q) then
        it tests the encapsulated protocol against the effective device
        features for VLANs
      Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
      Acked-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
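The two-step test described in the commit above can be sketched as follows. The feature-bit values, the struct layouts, and all `_sketch` names are hypothetical; protocols are compared in host byte order here for simplicity (the commit message compares against htons(ETH_P_8021Q)), and treating the "effective" VLAN features as `features & vlan_features` is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits (not the kernel's NETIF_F_* values). */
#define F_IP_CSUM_SKETCH    0x1u    /* can checksum IPv4 only */
#define F_IPV6_CSUM_SKETCH  0x2u    /* can checksum IPv6 only */
#define F_GEN_CSUM_SKETCH   0x4u    /* can checksum any protocol */

/* EtherType values, host byte order in this sketch. */
#define PROTO_IPV4_SKETCH   0x0800u
#define PROTO_IPV6_SKETCH   0x86DDu
#define PROTO_8021Q_SKETCH  0x8100u

struct dev_sketch {
    uint32_t features;
    uint32_t vlan_features;
};

struct pkt_sketch {
    uint16_t protocol;               /* outer protocol */
    uint16_t encapsulated_protocol;  /* valid when protocol is 802.1Q */
};

/* Step 1: test one protocol against one feature bitmask. */
static bool can_checksum_protocol_sketch(uint32_t features, uint16_t protocol)
{
    if (features & F_GEN_CSUM_SKETCH)
        return true;
    if ((features & F_IP_CSUM_SKETCH) && protocol == PROTO_IPV4_SKETCH)
        return true;
    if ((features & F_IPV6_CSUM_SKETCH) && protocol == PROTO_IPV6_SKETCH)
        return true;
    return false;
}

/* Step 2: try the outer protocol, then fall back to the VLAN payload. */
static bool dev_can_checksum_sketch(const struct dev_sketch *dev,
                                    const struct pkt_sketch *pkt)
{
    if (can_checksum_protocol_sketch(dev->features, pkt->protocol))
        return true;

    if (pkt->protocol == PROTO_8021Q_SKETCH) {
        /* Assumption: effective VLAN features are the intersection. */
        uint32_t effective = dev->features & dev->vlan_features;

        return can_checksum_protocol_sketch(effective,
                                            pkt->encapsulated_protocol);
    }
    return false;
}
```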
  12. 25 May, 2008 1 commit
    • Remove argument from open_softirq which is always NULL · 962cf36c
      Committed by Carlos R. Mafra
      As git-grep shows, open_softirq() is always called with the last argument
      being NULL:
      
      block/blk-core.c:       open_softirq(BLOCK_SOFTIRQ, blk_done_softirq, NULL);
      kernel/hrtimer.c:       open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq, NULL);
      kernel/rcuclassic.c:    open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL);
      kernel/rcupreempt.c:    open_softirq(RCU_SOFTIRQ, rcu_process_callbacks, NULL);
      kernel/sched.c: open_softirq(SCHED_SOFTIRQ, run_rebalance_domains, NULL);
      kernel/softirq.c:       open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
      kernel/softirq.c:       open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL);
      kernel/timer.c: open_softirq(TIMER_SOFTIRQ, run_timer_softirq, NULL);
      net/core/dev.c: open_softirq(NET_TX_SOFTIRQ, net_tx_action, NULL);
      net/core/dev.c: open_softirq(NET_RX_SOFTIRQ, net_rx_action, NULL);
      
      This observation has already been made by Matthew Wilcox in June 2002
      (http://www.cs.helsinki.fi/linux/linux-kernel/2002-25/0687.html)
      
      "I notice that none of the current softirq routines use the data element
      passed to them."
      
      and the situation hasn't changed since then. So it appears we can safely
      remove that extra argument to save 128 (54) bytes of kernel data (text).
      Signed-off-by: Carlos R. Mafra <crmafra@ift.unesp.br>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  13. 21 May, 2008 1 commit
  14. 15 May, 2008 1 commit
  15. 08 May, 2008 2 commits
    • net: Added ASSERT_RTNL() to dev_open() and dev_close(). · e46b66bc
      Committed by Ben Hutchings
      dev_open() and dev_close() must be called holding the RTNL, since they
      call device functions and netdevice notifiers that are promised the RTNL.
      Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netns: Fix arbitrary net_device-s corruptions on net_ns stop. · aca51397
      Committed by Pavel Emelyanov
      When a net namespace is destroyed, some devices (those not explicitly
      killed on ns stop) are moved back to init_net.
      
      The problem is that this net_ns change has one point of failure:
      __dev_alloc_name() may be called if a name collision occurs (and
      this is easy to trigger). This allocator performs a likely-to-fail
      GFP_ATOMIC allocation to find a suitable number. Other possible
      conditions that may cause an error (the device being ns-local or not
      registered) are always false in this case.
      
      So, when this call fails, the device is unregistered. But this is
      *not* the right thing to do, since after this the device may be
      released (and kfree-ed) improperly. E.g. bridges require more
      actions (sysfs update, timer disarming, etc.), some other devices
      want to remove their private areas from lists, etc.
      
      I.e. arbitrary use-after-free cases may occur.
      
      The proposed fix is the following: since the only reason for
      dev_change_net_namespace to fail is the name generation, we may
      give it a unique fall-back name w/o %d-s in it - the dev<ifindex>
      one, since ifindexes are still unique.
      
      So make this change, raise the failure-case printk loglevel to
      EMERG and replace the unregister_netdevice call with BUG().
      
      [ Use snprintf() -DaveM ]
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
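The fall-back naming idea from the commit above (build a name from the unique ifindex with snprintf(), so no allocation is needed) can be sketched like this. The helper name `fallback_name_sketch` and its error convention are assumptions for illustration; the kernel's own code is not reproduced here.

```c
#include <stdio.h>

/*
 * Hypothetical helper: write "dev<ifindex>" into buf.
 * Returns 0 on success, -1 if the name did not fit or formatting failed.
 */
static int fallback_name_sketch(char *buf, size_t len, int ifindex)
{
    int n = snprintf(buf, len, "dev%d", ifindex);

    /* snprintf reports the length it wanted; >= len means truncation. */
    if (n < 0 || (size_t)n >= len)
        return -1;
    return 0;
}
```

Since the name contains no %d pattern left to resolve, there is no search for a free number and therefore no allocation that could fail at this point.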
  16. 03 May, 2008 2 commits
    • netns: Fix device renaming for sysfs · aaf8cdc3
      Committed by Daniel Lezcano
      When a netdev is moved across namespaces with the
      'dev_change_net_namespace' function, the 'device_rename' function is
      used to fix up the kobject and refresh the sysfs tree. The device_rename
      function calls kobject_rename, which checks whether there is already
      an object with the same name. That is the case here, because we are
      renaming the object to the name it already has.
      
      The use of 'device_rename' seems wrong to me because we usually don't
      rename the device but just move it across namespaces. As we just want
      to do a mini "netdev_[un]register", IMO the functions
      'netdev_[un]register_kobject' should be used instead, as in a usual
      network device [un]registering.
      
      This patch replaces device_rename with netdev_unregister_kobject,
      followed by netdev_register_kobject.
      
      The netdev_register_kobject will call device_initialize and will raise
      a warning indicating the device was already initialized. In order to
      fix that, I split the device initialization into a separate function
      and use it together with 'netdev_register_kobject' in
      register_netdevice. So we can safely call 'netdev_register_kobject' in
      'dev_change_net_namespace'.
      
      This fix will make it possible to properly use the per-namespace sysfs
      which is coming from the -mm tree.
      Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
      Acked-by: Benjamin Thery <benjamin.thery@bull.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: remove NR_CPUS arrays in net/core/dev.c · 0c0b0aca
      Committed by Mike Travis
      Remove the fixed size channels[NR_CPUS] array in net/core/dev.c and
      dynamically allocate the array based on nr_cpu_ids.
      Signed-off-by: Mike Travis <travis@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 29 Apr, 2008 1 commit
  18. 19 Apr, 2008 1 commit
    • [NET]: Fix and allocate less memory for ->priv'less netdevices · d1643d24
      Committed by Alexey Dobriyan
      This patch effectively reverts commit d0498d9a
      aka "[NET]: Do not allocate unneeded memory for dev->priv alignment."
      It was found to be buggy because it removed the final unconditional
      += NETDEV_ALIGN_CONST.
      
      For example, for sizeof(struct net_device) being 2048 bytes, "alloc_size"
      was also 2048 bytes, but an allocator with debugging options turned on
      started giving out memory that was not 32-byte aligned, resulting in
      redzone overwrites.
      
      The patch does a small optimization in the ->priv'less case: bumping the
      size to the next 32-byte boundary was always done to ensure ->priv will
      also be aligned. But with no ->priv, there is no need to do that.
      Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
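The size arithmetic discussed in the commit above can be worked through with a small sketch. The 32-byte alignment comes from the commit message; the helper name `netdev_alloc_size_sketch` and the macro names are hypothetical, and the function only models the size computation, not the allocation itself.

```c
#include <stddef.h>

#define ALIGN_SKETCH        32u
#define ALIGN_CONST_SKETCH  (ALIGN_SKETCH - 1u)

/*
 * Hypothetical model of the allocation size: round the device structure
 * up to a 32-byte boundary only when a private area follows it, then
 * always add 31 bytes of slack so the start of the structure can be
 * moved up to a 32-byte boundary inside the allocated block.
 */
static size_t netdev_alloc_size_sketch(size_t dev_size, size_t priv_size)
{
    size_t alloc_size = dev_size;

    if (priv_size) {
        /* Keep the private area 32-byte aligned behind the device. */
        alloc_size = (alloc_size + ALIGN_CONST_SKETCH) &
                     ~(size_t)ALIGN_CONST_SKETCH;
        alloc_size += priv_size;
    }

    /* The unconditional slack whose removal caused the bug. */
    alloc_size += ALIGN_CONST_SKETCH;
    return alloc_size;
}
```

For the 2048-byte example in the commit message and no private area, this yields 2048 + 31 = 2079 bytes, so a block that starts at an unaligned address still has room for an aligned 2048-byte structure.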
  19. 16 Apr, 2008 2 commits