1. 19 8月, 2010 1 次提交
  2. 17 8月, 2010 1 次提交
    • K
      core: Factor out flow calculation from get_rps_cpu · bfb564e7
      Krishna Kumar 提交于
      Factor out flow calculation code from get_rps_cpu, since other
      functions can use the same code.
      
      Revisions:
      
      v2 (Ben): Separate flow calcuation out and use in select queue.
      v3 (Arnd): Don't re-implement MIN.
      v4 (Changli): skb->data points to ethernet header in macvtap, and
      	make a fast path. Tested macvtap with this patch.
      v5 (Changli):
      	- Cache skb->rxhash in skb_get_rxhash
      	- macvtap may not have pow(2) queues, so change code for
      	  queue selection.
          (Arnd):
      	- Use first available queue if all fails.
      Signed-off-by: NKrishna Kumar <krkumar2@in.ibm.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bfb564e7
  3. 08 8月, 2010 1 次提交
  4. 06 8月, 2010 1 次提交
  5. 03 8月, 2010 2 次提交
  6. 01 8月, 2010 1 次提交
  7. 26 7月, 2010 1 次提交
  8. 20 7月, 2010 1 次提交
  9. 19 7月, 2010 1 次提交
    • R
      net: support time stamping in phy devices. · c1f19b51
      Richard Cochran 提交于
      This patch adds a new networking option to allow hardware time stamps
      from PHY devices. When enabled, likely candidates among incoming and
      outgoing network packets are offered to the PHY driver for possible
      time stamping. When accepted by the PHY driver, incoming packets are
      deferred for later delivery by the driver.
      
      The patch also adds phylib driver methods for the SIOCSHWTSTAMP ioctl
      and callbacks for transmit and receive time stamping. Drivers may
      optionally implement these functions.
      Signed-off-by: NRichard Cochran <richard.cochran@omicron.at>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c1f19b51
  10. 15 7月, 2010 2 次提交
    • T
      net: fix problem in reading sock TX queue · b0f77d0e
      Tom Herbert 提交于
      Fix problem in reading the tx_queue recorded in a socket.  In
      dev_pick_tx, the TX queue is read by doing a check with
      sk_tx_queue_recorded on the socket, followed by a sk_tx_queue_get.
      The problem is that there is not mutual exclusion across these
      calls in the socket so it it is possible that the queue in the
      sock can be invalidated after sk_tx_queue_recorded is called so
      that sk_tx_queue get returns -1, which sets 65535 in queue_index
      and thus dev_pick_tx returns 65536 which is a bogus queue and
      can cause crash in dev_queue_xmit.
      
      We fix this by only calling sk_tx_queue_get which does the proper
      checks.  The interface is that sk_tx_queue_get returns the TX queue
      if the sock argument is non-NULL and TX queue is recorded, else it
      returns -1.  sk_tx_queue_recorded is no longer used so it can be
      completely removed.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b0f77d0e
    • E
      net: skb_tx_hash() fix relative to skb_orphan_try() · 87fd308c
      Eric Dumazet 提交于
      commit fc6055a5 (net: Introduce skb_orphan_try()) added early
      orphaning of skbs.
      
      This unfortunately added a performance regression in skb_tx_hash() in
      case of stacked devices (bonding, vlans, ...)
      
      Since skb->sk is now NULL, we cannot access sk->sk_hash anymore to
      spread tx packets to multiple NIC queues on multiqueue devices.
      
      skb_tx_hash() in this case only uses skb->protocol, same value for all
      flows.
      
      skb_orphan_try() can copy sk->sk_hash into skb->rxhash and skb_tx_hash()
      can use this saved sk_hash value to compute its internal hash value.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      87fd308c
  11. 10 7月, 2010 2 次提交
    • B
      net: Document that dev_get_stats() returns the given pointer · d7753516
      Ben Hutchings 提交于
      Document that dev_get_stats() returns the same stats pointer it was
      given.  Remove const qualification from the returned pointer since the
      caller may do what it likes with that structure.
      Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d7753516
    • B
      net: Get rid of rtnl_link_stats64 / net_device_stats union · 3cfde79c
      Ben Hutchings 提交于
      In commit be1f3c2c "net: Enable 64-bit
      net device statistics on 32-bit architectures" I redefined struct
      net_device_stats so that it could be used in a union with struct
      rtnl_link_stats64, avoiding the need for explicit copying or
      conversion between the two.  However, this is unsafe because there is
      no locking required and no lock consistently held around calls to
      dev_get_stats() and use of the statistics structure it returns.
      
      In commit 28172739 "net: fix 64 bit
      counters on 32 bit arches" Eric Dumazet dealt with that problem by
      requiring callers of dev_get_stats() to provide storage for the
      result.  This means that the net_device::stats64 field and the padding
      in struct net_device_stats are now redundant, so remove them.
      
      Update the comment on net_device_ops::ndo_get_stats64 to reflect its
      new usage.
      
      Change dev_txq_stats_fold() to use struct rtnl_link_stats64, since
      that is what all its callers are really using and it is no longer
      going to be compatible with struct net_device_stats.
      
      Eric Dumazet suggested the separate function for the structure
      conversion.
      Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
      Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3cfde79c
  12. 08 7月, 2010 1 次提交
    • E
      net: fix 64 bit counters on 32 bit arches · 28172739
      Eric Dumazet 提交于
      There is a small possibility that a reader gets incorrect values on 32
      bit arches. SNMP applications could catch incorrect counters when a
      32bit high part is changed by another stats consumer/provider.
      
      One way to solve this is to add a rtnl_link_stats64 param to all
      ndo_get_stats64() methods, and also add such a parameter to
      dev_get_stats().
      
      Rule is that we are not allowed to use dev->stats64 as a temporary
      storage for 64bit stats, but a caller provided area (usually on stack)
      
      Old drivers (only providing get_stats() method) need no changes.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      28172739
  13. 05 7月, 2010 1 次提交
  14. 03 7月, 2010 1 次提交
    • J
      net: decreasing real_num_tx_queues needs to flush qdisc · f0796d5c
      John Fastabend 提交于
      Reducing real_num_queues needs to flush the qdisc otherwise
      skbs with queue_mappings greater then real_num_tx_queues can
      be sent to the underlying driver.
      
      The flow for this is,
      
      dev_queue_xmit()
      	dev_pick_tx()
      		skb_tx_hash()  => hash using real_num_tx_queues
      		skb_set_queue_mapping()
      	...
      	qdisc_enqueue_root() => enqueue skb on txq from hash
      ...
      dev->real_num_tx_queues -= n
      ...
      sch_direct_xmit()
      	dev_hard_start_xmit()
      		ndo_start_xmit(skb,dev) => skb queue set with old hash
      
      skbs are enqueued on the qdisc with skb->queue_mapping set
      0 < queue_mappings < real_num_tx_queues.  When the driver
      decreases real_num_tx_queues skb's may be dequeued from the
      qdisc with a queue_mapping greater then real_num_tx_queues.
      
      This fixes a case in ixgbe where this was occurring with DCB
      and FCoE. Because the driver is using queue_mapping to map
      skbs to tx descriptor rings we can potentially map skbs to
      rings that no longer exist.
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Tested-by: NRoss Brattain <ross.b.brattain@intel.com>
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f0796d5c
  15. 01 7月, 2010 1 次提交
  16. 24 6月, 2010 1 次提交
  17. 16 6月, 2010 2 次提交
  18. 13 6月, 2010 1 次提交
    • B
      net: Enable 64-bit net device statistics on 32-bit architectures · be1f3c2c
      Ben Hutchings 提交于
      Use struct rtnl_link_stats64 as the statistics structure.
      
      On 32-bit architectures, insert 32 bits of padding after/before each
      field of struct net_device_stats to make its layout compatible with
      struct rtnl_link_stats64.  Add an anonymous union in net_device; move
      stats into the union and add struct rtnl_link_stats64 stats64.
      
      Add net_device_ops::ndo_get_stats64, implementations of which will
      return a pointer to struct rtnl_link_stats64.  Drivers that implement
      this operation must not update the structure asynchronously.
      
      Change dev_get_stats() to call ndo_get_stats64 if available, and to
      return a pointer to struct rtnl_link_stats64.  Change callers of
      dev_get_stats() accordingly.
      Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      be1f3c2c
  19. 11 6月, 2010 1 次提交
    • J
      net: deliver skbs on inactive slaves to exact matches · 597a264b
      John Fastabend 提交于
      Currently, the accelerated receive path for VLAN's will
      drop packets if the real device is an inactive slave and
      is not one of the special pkts tested for in
      skb_bond_should_drop().  This behavior is different then
      the non-accelerated path and for pkts over a bonded vlan.
      
      For example,
      
      vlanx -> bond0 -> ethx
      
      will be dropped in the vlan path and not delivered to any
      packet handlers at all.  However,
      
      bond0 -> vlanx -> ethx
      
      and
      
      bond0 -> ethx
      
      will be delivered to handlers that match the exact dev,
      because the VLAN path checks the real_dev which is not a
      slave and netif_recv_skb() doesn't drop frames but only
      delivers them to exact matches.
      
      This patch adds a sk_buff flag which is used for tagging
      skbs that would previously been dropped and allows the
      skb to continue to skb_netif_recv().  Here we add
      logic to check for the deliver_no_wcard flag and if it
      is set only deliver to handlers that match exactly.  This
      makes both paths above consistent and gives pkt handlers
      a way to identify skbs that come from inactive slaves.
      Without this patch in some configurations skbs will be
      delivered to handlers with exact matches and in others
      be dropped out right in the vlan path.
      
      I have tested the following 4 configurations in failover modes
      and load balancing modes.
      
      # bond0 -> ethx
      
      # vlanx -> bond0 -> ethx
      
      # bond0 -> vlanx -> ethx
      
      # bond0 -> ethx
                  |
        vlanx -> --
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      597a264b
  20. 10 6月, 2010 1 次提交
  21. 08 6月, 2010 1 次提交
    • E
      anycast: Some RCU conversions · bb69ae04
      Eric Dumazet 提交于
      - dev_get_by_flags() changed to dev_get_by_flags_rcu()
      
      - ipv6_sock_ac_join() dont touch dev & idev refcounts
      - ipv6_sock_ac_drop() dont touch dev & idev refcounts
      - ipv6_sock_ac_close() dont touch dev & idev refcounts
      - ipv6_dev_ac_dec() dount touch idev refcount
      - ipv6_chk_acast_addr() dont touch idev refcount
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bb69ae04
  22. 07 6月, 2010 1 次提交
    • J
      net: Remove unnecessary net action assertion · 271c1dfa
      jamal 提交于
      The extra assertion to allow packet munging only when there are
      no other ptypes listening which may have worked around an old bug
      is unnecessary. It is sufficient to check if the skb is cloned before
      trampling on it. Thanks to Herbert Xu for being persistent and patient
      in getting this across.
      [Note that cloning checks and assertions are the general rule used
      by tc actions (documentation/networking/tc-actions-env-rules.txt)].
      Signed-off-by: NJamal Hadi Salim <hadi@cyberus.ca>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      271c1dfa
  23. 05 6月, 2010 1 次提交
  24. 02 6月, 2010 4 次提交
    • J
      net: replace hooks in __netif_receive_skb V5 · ab95bfe0
      Jiri Pirko 提交于
      What this patch does is it removes two receive frame hooks (for bridge and for
      macvlan) from __netif_receive_skb. These are replaced them with a single
      hook for both. It only supports one hook per device because it makes no
      sense to do bridging and macvlan on the same device.
      
      Then a network driver (of virtual netdev like macvlan or bridge) can register
      an rx_handler for needed net device.
      Signed-off-by: NJiri Pirko <jpirko@redhat.com>
      Signed-off-by: NStephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ab95bfe0
    • E
      net: add additional lock to qdisc to increase throughput · 79640a4c
      Eric Dumazet 提交于
      When many cpus compete for sending frames on a given qdisc, the qdisc
      spinlock suffers from very high contention.
      
      The cpu owning __QDISC_STATE_RUNNING bit has same priority to acquire
      the lock, and cannot dequeue packets fast enough, since it must wait for
      this lock for each dequeued packet.
      
      One solution to this problem is to force all cpus spinning on a second
      lock before trying to get the main lock, when/if they see
      __QDISC_STATE_RUNNING already set.
      
      The owning cpu then compete with at most one other cpu for the main
      lock, allowing for higher dequeueing rate.
      
      Based on a previous patch from Alexander Duyck. I added the heuristic to
      avoid the atomic in fast path, and put the new lock far away from the
      cache line used by the dequeue worker. Also try to release the busylock
      lock as late as possible.
      
      Tests with following script gave a boost from ~50.000 pps to ~600.000
      pps on a dual quad core machine (E5450 @3.00GHz), tg3 driver.
      (A single netperf flow can reach ~800.000 pps on this platform)
      
      for j in `seq 0 3`; do
        for i in `seq 0 7`; do
          netperf -H 192.168.0.1 -t UDP_STREAM -l 60 -N -T $i -- -m 6 &
        done
      done
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Acked-by: NAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      79640a4c
    • J
      net: fix conflict between null_or_orig and null_or_bond · 2df4a0fa
      John Fastabend 提交于
      If a skb is received on an inactive bond that does not meet
      the special cases checked for by skb_bond_should_drop it should
      only be delivered to exact matches as the comment in
      netif_receive_skb() says.
      
      However because null_or_bond could also be null this is not
      always true.  This patch renames null_or_bond to orig_or_bond
      and initializes it to orig_dev.  This keeps the intent of
      null_or_bond to pass frames received on VLAN interfaces stacked
      on bonding interfaces without invalidating the statement for
      null_or_orig.
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2df4a0fa
    • E
      net: Define accessors to manipulate QDISC_STATE_RUNNING · bc135b23
      Eric Dumazet 提交于
      Define three helpers to manipulate QDISC_STATE_RUNNIG flag, that a
      second patch will move on another location.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bc135b23
  25. 31 5月, 2010 2 次提交
  26. 24 5月, 2010 1 次提交
  27. 22 5月, 2010 1 次提交
    • E
      net: Expose all network devices in a namespaces in sysfs · a1b3f594
      Eric W. Biederman 提交于
      This reverts commit aaf8cdc3.
      
      Drivers like the ipw2100 call device_create_group when they
      are initialized and device_remove_group when they are shutdown.
      Moving them between namespaces deletes their sysfs groups early.
      
      In particular the following call chain results.
      netdev_unregister_kobject -> device_del -> kobject_del -> sysfs_remove_dir
      With sysfs_remove_dir recursively deleting all of it's subdirectories,
      and nothing adding them back.
      
      Ouch!
      
      Therefore we need to call something that ultimate calls sysfs_mv_dir
      as that sysfs function can move sysfs directories between namespaces
      without deleting their subdirectories or their contents.   Allowing
      us to avoid placing extra boiler plate into every driver that does
      something interesting with sysfs.
      
      Currently the function that provides that capability is device_rename.
      That is the code works without nasty side effects as originally written.
      
      So remove the misguided fix for moving devices between namespaces.  The
      bug in the kobject layer that inspired it has now been recognized and
      fixed.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
      a1b3f594
  28. 21 5月, 2010 1 次提交
    • T
      net: fix problem in dequeuing from input_pkt_queue · 76cc8b13
      Tom Herbert 提交于
      Fix some issues introduced in batch skb dequeuing for input_pkt_queue.
      The primary issue it that the queue head must be incremented only
      after a packet has been processed, that is only after
      __netif_receive_skb has been called.  This is needed for the mechanism
      to prevent OOO packet in RFS.  Also when flushing the input_pkt_queue
      and process_queue, the process queue should be done first to prevent
      OOO packets.
      
      Because the input_pkt_queue has been effectively split into two queues,
      the calculation of the tail ptr is no longer correct.  The correct value
      would be head+input_pkt_queue->len+process_queue->len.  To avoid
      this calculation we added an explict input_queue_tail in softnet_data.
      The tail value is simply incremented when queuing to input_pkt_queue.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      76cc8b13
  29. 18 5月, 2010 2 次提交
    • E
      net: add a noref bit on skb dst · 7fee226a
      Eric Dumazet 提交于
      Use low order bit of skb->_skb_dst to tell dst is not refcounted.
      
      Change _skb_dst to _skb_refdst to make sure all uses are catched.
      
      skb_dst() returns the dst, regardless of noref bit set or not, but
      with a lockdep check to make sure a noref dst is not given if current
      user is not rcu protected.
      
      New skb_dst_set_noref() helper to set an notrefcounted dst on a skb.
      (with lockdep check)
      
      skb_dst_drop() drops a reference only if skb dst was refcounted.
      
      skb_dst_force() helper is used to force a refcount on dst, when skb
      is queued and not anymore RCU protected.
      
      Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if
      !IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in
      sock_queue_rcv_skb(), in __nf_queue().
      
      Use skb_dst_force() in dev_requeue_skb().
      
      Note: dst_use_noref() still dirties dst, we might transform it
      later to do one dirtying per jiffies.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7fee226a
    • E
      rps: avoid one atomic in enqueue_to_backlog · ebda37c2
      Eric Dumazet 提交于
      If CONFIG_SMP=y, then we own a queue spinlock, we can avoid the atomic
      test_and_set_bit() from napi_schedule_prep().
      
      We now have same number of atomic ops per netif_rx() calls than with
      pre-RPS kernel.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ebda37c2
  30. 16 5月, 2010 2 次提交
    • E
      net: Consistent skb timestamping · 3b098e2d
      Eric Dumazet 提交于
      With RPS inclusion, skb timestamping is not consistent in RX path.
      
      If netif_receive_skb() is used, its deferred after RPS dispatch.
      
      If netif_rx() is used, its done before RPS dispatch.
      
      This can give strange tcpdump timestamps results.
      
      I think timestamping should be done as soon as possible in the receive
      path, to get meaningful values (ie timestamps taken at the time packet
      was delivered by NIC driver to our stack), even if NAPI already can
      defer timestamping a bit (RPS can help to reduce the gap)
      
      Tom Herbert prefer to sample timestamps after RPS dispatch. In case
      sampling is expensive (HPET/acpi_pm on x86), this makes sense.
      
      Let admins switch from one mode to another, using a new
      sysctl, /proc/sys/net/core/netdev_tstamp_prequeue
      
      Its default value (1), means timestamps are taken as soon as possible,
      before backlog queueing, giving accurate timestamps.
      
      Setting a 0 value permits to sample timestamps when processing backlog,
      after RPS dispatch, to lower the load of the pre-RPS cpu.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b098e2d
    • J
      net: adjust handle_macvlan to pass port struct to hook · a14462f1
      Jiri Pirko 提交于
      Now there's null check here and also again in the hook. Looking at bridge bits
      which are simmilar, port structure is rcu_dereferenced right away in
      handle_bridge and passed to hook. Looks nicer.
      Signed-off-by: NJiri Pirko <jpirko@redhat.com>
      Acked-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a14462f1