1. 19 3月, 2010 1 次提交
  2. 17 3月, 2010 1 次提交
    • T
      rps: Receive Packet Steering · 0a9627f2
      Tom Herbert 提交于
      This patch implements software receive side packet steering (RPS).  RPS
      distributes the load of received packet processing across multiple CPUs.
      
      Problem statement: Protocol processing done in the NAPI context for received
      packets is serialized per device queue and becomes a bottleneck under high
      packet load.  This substantially limits pps that can be achieved on a single
      queue NIC and provides no scaling with multiple cores.
      
      This solution queues packets early on in the receive path on the backlog queues
      of other CPUs.   This allows protocol processing (e.g. IP and TCP) to be
      performed on packets in parallel.   For each device (or each receive queue in
      a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
      process packets. A CPU is selected on a per packet basis by hashing contents
      of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index
      into the CPU mask.  The IPI mechanism is used to raise networking receive
      softirqs between CPUs.  This effectively emulates in software what a multi-queue
      NIC can provide, but is generic requiring no device support.
      
      Many devices now provide a hash over the 4-tuple on a per packet basis
      (e.g. the Toeplitz hash).  This patch allow drivers to set the HW reported hash
      in an skb field, and that value in turn is used to index into the RPS maps.
      Using the HW generated hash can avoid cache misses on the packet when
      steering it to a remote CPU.
      
      The CPU mask is set on a per device and per queue basis in the sysfs variable
      /sys/class/net/<device>/queues/rx-<n>/rps_cpus.  This is a set of canonical
      bit maps for receive queues in the device (numbered by <n>).  If a device
      does not support multi-queue, a single variable is used for the device (rx-0).
      
      Generally, we have found this technique increases pps capabilities of a single
      queue device with good CPU utilization.  Optimal settings for the CPU mask
      seem to depend on architectures and cache hierarcy.  Below are some results
      running 500 instances of netperf TCP_RR test with 1 byte req. and resp.
      Results show cumulative transaction rate and system CPU utilization.
      
      e1000e on 8 core Intel
         Without RPS: 108K tps at 33% CPU
         With RPS:    311K tps at 64% CPU
      
      forcedeth on 16 core AMD
         Without RPS: 156K tps at 15% CPU
         With RPS:    404K tps at 49% CPU
      
      bnx2x on 16 core AMD
         Without RPS  567K tps at 61% CPU (4 HW RX queues)
         Without RPS  738K tps at 96% CPU (8 HW RX queues)
         With RPS:    854K tps at 76% CPU (4 HW RX queues)
      
      Caveats:
      - The benefits of this patch are dependent on architecture and cache hierarchy.
      Tuning the masks to get best performance is probably necessary.
      - This patch adds overhead in the path for processing a single packet.  In
      a lightly loaded server this overhead may eliminate the advantages of
      increased parallelism, and possibly cause some relative performance degradation.
      We have found that masks that are cache aware (share same caches with
      the interrupting CPU) mitigate much of this.
      - The RPS masks can be changed dynamically, however whenever the mask is changed
      this introduces the possibility of generating out of order packets.  It's
      probably best not change the masks too frequently.
      Signed-off-by: NTom Herbert <therbert@google.com>
      
       include/linux/netdevice.h |   32 ++++-
       include/linux/skbuff.h    |    3 +
       net/core/dev.c            |  335 +++++++++++++++++++++++++++++++++++++--------
       net/core/net-sysfs.c      |  225 ++++++++++++++++++++++++++++++-
       net/core/skbuff.c         |    2 +
       5 files changed, 538 insertions(+), 59 deletions(-)
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0a9627f2
  3. 27 2月, 2010 3 次提交
    • P
      dev: support deferring device flag change notifications · bd380811
      Patrick McHardy 提交于
      Split dev_change_flags() into two functions: __dev_change_flags() to
      perform the actual changes and __dev_notify_flags() to invoke netdevice
      notifiers. This will be used by rtnl_link to defer netlink notifications
      until the device has been fully configured.
      
      This changes ordering of some operations, in particular:
      
      - netlink notifications are sent after all changes have been performed.
        As a side effect this surpresses one unnecessary netlink message when
        the IFF_UP and other flags are changed simultaneously.
      
      - The NETDEV_UP/NETDEV_DOWN and NETDEV_CHANGE notifiers are invoked
        after all changes have been performed. Their relative is unchanged.
      
      - net_dmaengine_put() is invoked before the NETDEV_DOWN notifier instead
        of afterwards. This should not make any difference since both RX and TX
        are already shut down at this point.
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bd380811
    • P
      rtnetlink: handle rtnl_link netlink notifications manually · a2835763
      Patrick McHardy 提交于
      In order to support specifying device flags during device creation,
      we must be able to roll back device registration in case setting the
      flags fails without sending any notifications related to the device
      to userspace.
      
      This patch changes rollback_registered_many() and register_netdevice()
      to manually send netlink notifications for devices not handled by
      rtnl_link and allows to defer notifications for devices handled by
      rtnl_link until setup is complete.
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a2835763
    • J
      netdevice.h: check for CONFIG_WLAN instead of CONFIG_WLAN_80211 · caf66e58
      John W. Linville 提交于
      In "wireless: remove WLAN_80211 and WLAN_PRE80211 from Kconfig" I
      inadvertantly missed a line in include/linux/netdevice.h.  I thereby
      effectively reverted "net: Set LL_MAX_HEADER properly for wireless." by
      accident. :-(  Now we should check there for CONFIG_WLAN instead.
      Signed-off-by: NJohn W. Linville <linville@tuxdriver.com>
      Reported-by: NChristoph Egger <siccegge@stud.informatik.uni-erlangen.de>
      Cc: stable@kernel.org
      caf66e58
  4. 13 2月, 2010 3 次提交
  5. 11 2月, 2010 1 次提交
  6. 05 2月, 2010 1 次提交
    • J
      net: use helpers to access mc list V2 · 6683ece3
      Jiri Pirko 提交于
      This patch introduces the similar helpers as those already done for uc list.
      However multicast lists are no list_head lists but "mademanually". The three
      macros added by this patch will make the transition of mc_list to list_head
      smooth in two steps:
      
      1) convert all drivers to use these macros (with the original iterator of type
         "struct dev_mc_list")
      2) once all drivers are converted, convert list type and iterators to "struct
         netdev_hw_addr" in one patch.
      
      >From now on, drivers can (and should) use "netdev_for_each_mc_addr" to iterate
      over the addresses with iterator of type "struct netdev_hw_addr". Also macros
      "netdev_mc_count" and "netdev_mc_empty" to read list's length. This is the state
      which should be reached in all drivers.
      Signed-off-by: NJiri Pirko <jpirko@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6683ece3
  7. 04 2月, 2010 1 次提交
  8. 26 1月, 2010 1 次提交
  9. 23 1月, 2010 1 次提交
  10. 20 1月, 2010 1 次提交
  11. 04 12月, 2009 1 次提交
  12. 02 12月, 2009 1 次提交
  13. 27 11月, 2009 1 次提交
  14. 18 11月, 2009 2 次提交
    • E
      linkwatch: linkwatch_forget_dev() to speedup device dismantle · e014debe
      Eric Dumazet 提交于
      Herbert Xu a écrit :
      > On Tue, Nov 17, 2009 at 04:26:04AM -0800, David Miller wrote:
      >> Really, the link watch stuff is just due for a redesign.  I don't
      >> think a simple hack is going to cut it this time, sorry Eric :-)
      >
      > I have no objections against any redesigns, but since the only
      > caller of linkwatch_forget_dev runs in process context with the
      > RTNL, it could also legally emit those events.
      
      Thanks guys, here an updated version then, before linkwatch surgery ?
      
      In this version, I force the event to be sent synchronously.
      
      [PATCH net-next-2.6] linkwatch: linkwatch_forget_dev() to speedup device dismantle
      
      time ip link del eth3.103 ; time ip link del eth3.104 ; time ip link del eth3.105
      
      real	0m0.266s
      user	0m0.000s
      sys	0m0.001s
      
      real	0m0.770s
      user	0m0.000s
      sys	0m0.000s
      
      real	0m1.022s
      user	0m0.000s
      sys	0m0.000s
      
      One problem of current schem in vlan dismantle phase is the
      holding of device done by following chain :
      
      vlan_dev_stop() ->
      	netif_carrier_off(dev) ->
      		linkwatch_fire_event(dev) ->
      			dev_hold() ...
      
      And __linkwatch_run_queue() runs up to one second later...
      
      A generic fix to this problem is to add a linkwatch_forget_dev() method
      to unlink the device from the list of watched devices.
      
      dev->link_watch_next becomes dev->link_watch_list (and use a bit more memory),
      to be able to unlink device in O(1).
      
      After patch :
      time ip link del eth3.103 ; time ip link del eth3.104 ; time ip link del eth3.105
      
      real    0m0.024s
      user    0m0.000s
      sys     0m0.000s
      
      real    0m0.032s
      user    0m0.000s
      sys     0m0.001s
      
      real    0m0.033s
      user    0m0.000s
      sys     0m0.000s
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e014debe
    • E
      net: add dev_txq_stats_fold() helper · d83345ad
      Eric Dumazet 提交于
      Some drivers ndo_get_stats() method need to perform txqueue stats folding.
      
      Move folding from dev_get_stats() to a new dev_txq_stats_fold() function
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d83345ad
  15. 16 11月, 2009 1 次提交
    • J
      net: Optimize hard_start_xmit() return checking · 9a1654ba
      Jarek Poplawski 提交于
      Recent changes in the TX error propagation require additional checking
      and masking of values returned from hard_start_xmit(), mainly to
      separate cases where skb was consumed. This aim can be simplified by
      changing the order of NETDEV_TX and NET_XMIT codes, because the latter
      are treated similarly to negative (ERRNO) values.
      
      After this change much simpler dev_xmit_complete() is also used in
      sch_direct_xmit(), so it is moved to netdevice.h.
      
      Additionally NET_RX definitions in netdevice.h are moved up from
      between TX codes to avoid confusion while reading the TX comment.
      Signed-off-by: NJarek Poplawski <jarkao2@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9a1654ba
  16. 14 11月, 2009 2 次提交
  17. 11 11月, 2009 1 次提交
  18. 05 11月, 2009 1 次提交
  19. 04 11月, 2009 1 次提交
  20. 02 11月, 2009 1 次提交
  21. 31 10月, 2009 1 次提交
  22. 30 10月, 2009 2 次提交
  23. 29 10月, 2009 2 次提交
  24. 28 10月, 2009 2 次提交
  25. 08 10月, 2009 1 次提交
  26. 07 10月, 2009 1 次提交
  27. 16 9月, 2009 1 次提交
  28. 15 9月, 2009 1 次提交
  29. 12 9月, 2009 1 次提交
    • M
      net: Add DEVTYPE support for Ethernet based devices · 384912ed
      Marcel Holtmann 提交于
      The Ethernet framing is used for a lot of devices these days. Most
      prominent are WiFi and WiMAX based devices. However for userspace
      application it is important to classify these devices correctly and
      not only see them as Ethernet devices. The daemons like HAL, DeviceKit
      or even NetworkManager with udev support tries to do the classification
      in userspace with a lot trickery and extra system calls. This is not
      good and actually reaches its limitations. Especially since the kernel
      does know the type of the Ethernet device it is pretty stupid.
      
      To solve this problem the underlying device type needs to be set and
      then the value will be exported as DEVTYPE via uevents and available
      within udev.
      
        # cat /sys/class/net/wlan0/uevent
        DEVTYPE=wlan
        INTERFACE=wlan0
        IFINDEX=5
      
      This is similar to subsystems like USB and SCSI that distinguish
      between hosts, devices, disks, partitions etc.
      
      The new SET_NETDEV_DEVTYPE() is a convenience helper to set the actual
      device type. All device types are free form, but for convenience the
      same strings as used with RFKILL are choosen.
      Signed-off-by: NMarcel Holtmann <marcel@holtmann.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      384912ed
  30. 06 9月, 2009 1 次提交
    • P
      net_sched: reintroduce dev->qdisc for use by sch_api · af356afa
      Patrick McHardy 提交于
      Currently the multiqueue integration with the qdisc API suffers from
      a few problems:
      
      - with multiple queues, all root qdiscs use the same handle. This means
        they can't be exposed to userspace in a backwards compatible fashion.
      
      - all API operations always refer to queue number 0. Newly created
        qdiscs are automatically shared between all queues, its not possible
        to address individual queues or restore multiqueue behaviour once a
        shared qdisc has been attached.
      
      - Dumps only contain the root qdisc of queue 0, in case of non-shared
        qdiscs this means the statistics are incomplete.
      
      This patch reintroduces dev->qdisc, which points to the (single) root qdisc
      from userspace's point of view. Currently it either points to the first
      (non-shared) default qdisc, or a qdisc shared between all queues. The
      following patches will introduce a classful dummy qdisc, which will be used
      as root qdisc and contain the per-queue qdiscs as children.
      Signed-off-by: NPatrick McHardy <kaber@trash.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      af356afa
  31. 01 9月, 2009 1 次提交
    • Y
      net: Add ndo_fcoe_enable/ndo_fcoe_disable to net_device_ops · cb454399
      Yi Zou 提交于
      Add ndo_fcoe_enable/_disable to net_device_ops so the corresponding
      HW can initialize itself for FCoE traffic or clean up after FCoE traffic is
      done. This is expected to be called by the kernel FCoE stack upon receiving
      a request for creating an FCoE instance on the corresponding netdev interface.
      When implemented by the actual HW, the HW driver check the op code to perform
      corresponding initialization or clean up for FCoE. The initialization normally
      includes allocating extra queues for FCoE, setting corresponding HW registers
      for FCoE, indicating FCoE offload features via netdev, etc. The clean-up would
      include releasing the resources allocated for FCoE.
      Signed-off-by: NYi Zou <yi.zou@intel.com>
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb454399