1. 19 March 2010, 2 commits
  2. 17 March 2010, 4 commits
    • J
      net: core: add IFLA_STATS64 support · 10708f37
      Committed by Jan Engelhardt
      `ip -s link` shows interface counters truncated to 32 bits. This is
      because interface statistics are transported to userspace only as
      32-bit quantities. This commit adds a new IFLA_STATS64 attribute that
      exports them in full 64 bits.
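
      As an illustration, a minimal userspace sketch of consuming the new
      attribute (assuming tb[] was already parsed from an RTM_NEWLINK
      message; error handling omitted):

          #include <stdio.h>
          #include <string.h>
          #include <linux/rtnetlink.h>
          #include <linux/if_link.h>

          static void print_stats64(struct rtattr *tb[])
          {
                  struct rtnl_link_stats64 st;

                  if (!tb[IFLA_STATS64])
                          return;  /* older kernel: fall back to 32-bit IFLA_STATS */
                  /* copy out: the attribute payload may be unaligned */
                  memcpy(&st, RTA_DATA(tb[IFLA_STATS64]), sizeof(st));
                  printf("rx_bytes=%llu tx_bytes=%llu\n",
                         (unsigned long long)st.rx_bytes,
                         (unsigned long long)st.tx_bytes);
          }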
      
      References: http://lkml.indiana.edu/hypermail/linux/kernel/0307.3/0215.html
      Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      10708f37
    • E
      net: remove rcu locking from fib_rules_event() · 2fb3573d
      Committed by Eric Dumazet
      We hold RTNL at this point and don't use RCU variants of list traversals,
      so we don't need rcu_read_lock()/rcu_read_unlock().
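
      Illustrative sketch of the resulting pattern (handle_rule() is a
      hypothetical helper; this is not the exact fib_rules.c code):

          static void walk_rules(struct fib_rules_ops *ops)
          {
                  struct fib_rule *rule;

                  ASSERT_RTNL();  /* caller holds RTNL; writers are excluded */
                  list_for_each_entry(rule, &ops->rules_list, list)
                          handle_rule(rule);  /* no rcu_read_lock() needed */
          }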
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2fb3573d
    • T
      rps: Receive Packet Steering · 0a9627f2
      Committed by Tom Herbert
      This patch implements software receive side packet steering (RPS).  RPS
      distributes the load of received packet processing across multiple CPUs.
      
      Problem statement: Protocol processing done in the NAPI context for received
      packets is serialized per device queue and becomes a bottleneck under high
      packet load.  This substantially limits the pps that can be achieved on a
      single-queue NIC and provides no scaling with multiple cores.
      
      This solution queues packets early on in the receive path on the backlog queues
      of other CPUs.   This allows protocol processing (e.g. IP and TCP) to be
      performed on packets in parallel.   For each device (or each receive queue in
      a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
      process packets. A CPU is selected on a per packet basis by hashing contents
      of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index
      into the CPU mask.  The IPI mechanism is used to raise networking receive
      softirqs between CPUs.  This effectively emulates in software what a multi-queue
      NIC can provide, but is generic, requiring no device support.
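
      A simplified sketch of the per-packet CPU selection (the real
      get_rps_cpu() in this patch also validates the chosen CPU, e.g.
      against the online CPUs):

          static int pick_rps_cpu(const struct rps_map *map, u32 hash)
          {
                  if (!map || !map->len)
                          return -1;  /* no steering configured */
                  /* scale the 32-bit hash into [0, map->len) without a modulo */
                  return map->cpus[((u64)hash * map->len) >> 32];
          }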
      
      Many devices now provide a hash over the 4-tuple on a per-packet basis
      (e.g. the Toeplitz hash).  This patch allows drivers to set the HW-reported
      hash in an skb field, and that value in turn is used to index into the RPS
      maps.  Using the HW-generated hash can avoid cache misses on the packet
      when steering it to a remote CPU.
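
      Driver-side, that amounts to a one-liner along these lines (rx_desc and
      its rss_hash field are hypothetical, per-NIC names):

          /* let RPS reuse the NIC's hash instead of computing one in software */
          skb->rxhash = le32_to_cpu(rx_desc->rss_hash);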
      
      The CPU mask is set on a per device and per queue basis in the sysfs variable
      /sys/class/net/<device>/queues/rx-<n>/rps_cpus.  This is a set of canonical
      bit maps for receive queues in the device (numbered by <n>).  If a device
      does not support multi-queue, a single variable is used for the device (rx-0).
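
      For example, `echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus`
      (device name hypothetical) writes a hex bitmap allowing CPUs 0-3 to
      process packets for that queue; writing 0 disables RPS for it.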
      
      Generally, we have found this technique increases the pps capabilities of a
      single-queue device with good CPU utilization.  Optimal settings for the CPU
      mask seem to depend on architecture and cache hierarchy.  Below are some
      results from running 500 instances of the netperf TCP_RR test with 1-byte
      requests and responses.  Results show cumulative transaction rate and
      system CPU utilization.
      
      e1000e on 8 core Intel
         Without RPS: 108K tps at 33% CPU
         With RPS:    311K tps at 64% CPU
      
      forcedeth on 16 core AMD
         Without RPS: 156K tps at 15% CPU
         With RPS:    404K tps at 49% CPU
      
      bnx2x on 16 core AMD
         Without RPS: 567K tps at 61% CPU (4 HW RX queues)
         Without RPS: 738K tps at 96% CPU (8 HW RX queues)
         With RPS:    854K tps at 76% CPU (4 HW RX queues)
      
      Caveats:
      - The benefits of this patch are dependent on architecture and cache
      hierarchy.  Tuning the masks to get the best performance is probably
      necessary.
      - This patch adds overhead in the path for processing a single packet.  In
      a lightly loaded server this overhead may eliminate the advantages of
      increased parallelism, and possibly cause some relative performance
      degradation.  We have found that masks that are cache-aware (sharing the
      same caches as the interrupting CPU) mitigate much of this.
      - The RPS masks can be changed dynamically; however, whenever the mask is
      changed, this introduces the possibility of generating out-of-order packets.
      It's probably best not to change the masks too frequently.
      Signed-off-by: Tom Herbert <therbert@google.com>
      
       include/linux/netdevice.h |   32 ++++-
       include/linux/skbuff.h    |    3 +
       net/core/dev.c            |  335 +++++++++++++++++++++++++++++++++++++--------
       net/core/net-sysfs.c      |  225 ++++++++++++++++++++++++++++++-
       net/core/skbuff.c         |    2 +
       5 files changed, 538 insertions(+), 59 deletions(-)
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a9627f2
    • J
      NET: netpoll, fix potential NULL ptr dereference · 21edbb22
      Committed by Jiri Slaby
      Stanse found that one error path in netpoll_setup() dereferences npinfo
      even though it is NULL. Avoid that by adding a new label and jumping to
      it instead.
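
      An illustrative sketch of the shape of the fix (labels and surrounding
      code simplified, not the exact netpoll_setup() source):

          npinfo = kmalloc(sizeof(*npinfo), GFP_KERNEL);
          if (!npinfo) {
                  err = -ENOMEM;
                  goto put;       /* npinfo was never allocated: skip the free */
          }
          /* ... further setup that, on failure, jumps to release ... */
       release:
          kfree(npinfo);          /* only reached once npinfo is valid */
       put:
          dev_put(ndev);
          return err;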
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: Daniel Borkmann <danborkmann@googlemail.com>
      Cc: David S. Miller <davem@davemloft.net>
      Acked-by: chavey@google.com
      Acked-by: Matt Mackall <mpm@selenic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      21edbb22
  3. 10 March 2010, 2 commits
  4. 09 March 2010, 1 commit
  5. 08 March 2010, 1 commit
  6. 06 March 2010, 4 commits
    • J
      ethtool: Add direct access to ops->get_sset_count · d17792eb
      Committed by Jeff Garzik
      On 03/04/2010 09:26 AM, Ben Hutchings wrote:
      > On Thu, 2010-03-04 at 00:51 -0800, Jeff Kirsher wrote:
      >> From: Jeff Garzik<jgarzik@redhat.com>
      >>
      >> This patch is an alternative approach for accessing string
      >> counts, vs. the drvinfo indirect approach.  This way the drvinfo
      >> space doesn't run out, and we don't break ABI later.
      > [...]
      >> --- a/net/core/ethtool.c
      >> +++ b/net/core/ethtool.c
      >> @@ -214,6 +214,10 @@ static noinline int ethtool_get_drvinfo(struct net_device *dev, void __user *use
      >>   	info.cmd = ETHTOOL_GDRVINFO;
      >>   	ops->get_drvinfo(dev, &info);
      >>
      >> +	/*
      >> +	 * this method of obtaining string set info is deprecated;
      >> +	 * consider using ETHTOOL_GSSET_INFO instead
      >> +	 */
      >
      > This comment belongs on the interface (ethtool.h) not the
      > implementation.
      
      Debatable -- the current comment is located at the callsite of
      ops->get_sset_count(), which is where an implementor might think to add
      a new call.  Not all the numeric fields in ethtool_drvinfo are obtained
      from ->get_sset_count().
      
      Hence the "some" in the attached patch to include/linux/ethtool.h,
      addressing your comment.
      
      > [...]
      >> +static noinline int ethtool_get_sset_info(struct net_device *dev,
      >> +                                          void __user *useraddr)
      >> +{
      > [...]
      >> +	/* calculate size of return buffer */
      >> +	for (i = 0; i < 64; i++)
      >> +		if (sset_mask & (1ULL << i))
      >> +			n_bits++;
      > [...]
      >
      > We have a function for this:
      >
      > 	n_bits = hweight64(sset_mask);
      
      Agreed.
      
      I've attached a follow-up patch, which should enable my/Jeff's kernel
      patch to be applied, followed by this one.
      Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d17792eb
    • J
      ethtool: Add direct access to ops->get_sset_count · 723b2f57
      Committed by Jeff Garzik
      This patch is an alternative approach for accessing string
      counts, vs. the drvinfo indirect approach.  This way the drvinfo
      space doesn't run out, and we don't break ABI later.
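
      A hedged userspace sketch of querying one string-set count through the
      new ETHTOOL_GSSET_INFO command (the usual SIOCETHTOOL plumbing is
      assumed: an open socket fd and a struct ifreq with ifr_name set;
      error handling omitted):

          struct {
                  struct ethtool_sset_info hdr;
                  __u32 buf[1];   /* room for one count per bit in sset_mask */
          } sset = {
                  .hdr = {
                          .cmd       = ETHTOOL_GSSET_INFO,
                          .sset_mask = 1ULL << ETH_SS_STATS,
                  },
          };

          ifr.ifr_data = (void *)&sset;
          if (ioctl(fd, SIOCETHTOOL, &ifr) == 0 && sset.hdr.sset_mask)
                  n_stats = sset.buf[0];  /* count for ETH_SS_STATS */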
      Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
      Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      723b2f57
    • Z
      net: backlog functions rename · a3a858ff
      Committed by Zhu Yi
      sk_add_backlog -> __sk_add_backlog
      sk_add_backlog_limited -> sk_add_backlog
      Signed-off-by: Zhu Yi <yi.zhu@intel.com>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a3a858ff
    • Z
      net: add limit for socket backlog · 8eae939f
      Committed by Zhu Yi
      We got a system OOM while running some UDP netperf tests over the loopback
      device. The case is multiple senders sending streams of UDP packets to a
      single receiver via loopback on the local host. Of course, the receiver is
      not able to handle all the packets in time. But we surprisingly found that
      these packets were not discarded due to the receiver's sk->sk_rcvbuf limit.
      Instead, they kept queuing to sk->sk_backlog and finally ate up all the
      memory. We believe this is a security hole that a non-privileged user can
      exploit to crash the system.
      
      The root cause of this problem is that when the receiver is doing
      __release_sock() (i.e. after userspace recv; kernel path udp_recvmsg ->
      skb_free_datagram_locked -> release_sock), it moves skbs from the backlog
      to sk_receive_queue with softirqs enabled. In the above case, multiple
      busy senders will almost make it an endless loop. The skbs in the
      backlog end up eating all the system memory.
      
      The issue is not specific to UDP. Any protocol using the socket backlog
      is potentially affected. The patch adds a limit for the socket backlog
      so that the backlog size cannot be expanded endlessly.
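
      A simplified sketch of the check being added (the exact limit
      expression and helper naming in the patch differ slightly):

          static inline int sk_add_backlog_limited(struct sock *sk,
                                                   struct sk_buff *skb)
          {
                  if (sk->sk_backlog.len >= sk->sk_rcvbuf)  /* simplified limit */
                          return -ENOBUFS;  /* caller drops and accounts the skb */
                  sk_add_backlog(sk, skb);  /* the old, unbounded queueing helper */
                  sk->sk_backlog.len += skb->truesize;
                  return 0;
          }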
      Reported-by: Alex Shi <alex.shi@intel.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Vlad Yasevich <vladislav.yasevich@hp.com>
      Cc: Sridhar Samudrala <sri@us.ibm.com>
      Cc: Jon Maloy <jon.maloy@ericsson.com>
      Cc: Allan Stephens <allan.stephens@windriver.com>
      Cc: Andrew Hendry <andrew.hendry@gmail.com>
      Signed-off-by: Zhu Yi <yi.zhu@intel.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8eae939f
  7. 01 March 2010, 1 commit
  8. 28 February 2010, 1 commit
  9. 27 February 2010, 4 commits
    • P
      rtnetlink: support specifying device flags on device creation · 3729d502
      Committed by Patrick McHardy
      commit e8469ed959c373c2ff9e6f488aa5a14971aebe1f
      Author: Patrick McHardy <kaber@trash.net>
      Date:   Tue Feb 23 20:41:30 2010 +0100
      
      Support specifying the initial device flags when creating a device through
      rtnl_link. Devices allocated by rtnl_create_link() are marked as INITIALIZING
      in order to suppress netlink registration notifications. To complete setup,
      rtnl_configure_link() must be called, which performs the device flag changes
      and invokes the deferred notifiers if everything went well.
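
      The intended call flow, sketched (argument lists elided; see the patch
      for the exact signatures):

          /* in an rtnl_link creation path:
           *
           *   dev = rtnl_create_link(...);    device marked INITIALIZING, the
           *                                   registration notification held back
           *   register_netdevice(dev);
           *   rtnl_configure_link(dev, ifm);  applies the requested flags, then
           *                                   sends the deferred RTM_NEWLINK
           */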
      
      Two examples:
      
      # add macvlan to eth0
      #
      $ ip link add link eth0 up allmulticast on type macvlan
      
      [LINK]11: macvlan0@eth0: <BROADCAST,MULTICAST,ALLMULTI,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN
          link/ether 26:f8:84:02:f9:2a brd ff:ff:ff:ff:ff:ff
      [ROUTE]ff00::/8 dev macvlan0  table local  metric 256  mtu 1500 advmss 1440 hoplimit 0
      [ROUTE]fe80::/64 dev macvlan0  proto kernel  metric 256  mtu 1500 advmss 1440 hoplimit 0
      [LINK]11: macvlan0@eth0: <BROADCAST,MULTICAST,ALLMULTI,UP,LOWER_UP> mtu 1500
          link/ether 26:f8:84:02:f9:2a
      [ADDR]11: macvlan0    inet6 fe80::24f8:84ff:fe02:f92a/64 scope link
             valid_lft forever preferred_lft forever
      [ROUTE]local fe80::24f8:84ff:fe02:f92a via :: dev lo  table local  proto none  metric 0  mtu 16436 advmss 16376 hoplimit 0
      [ROUTE]default via fe80::215:e9ff:fef0:10f8 dev macvlan0  proto kernel  metric 1024  mtu 1500 advmss 1440 hoplimit 0
      [NEIGH]fe80::215:e9ff:fef0:10f8 dev macvlan0 lladdr 00:15:e9:f0:10:f8 router STALE
      [ROUTE]2001:6f8:974::/64 dev macvlan0  proto kernel  metric 256  expires 0sec mtu 1500 advmss 1440 hoplimit 0
      [PREFIX]prefix 2001:6f8:974::/64 dev macvlan0 onlink autoconf valid 14400 preferred 131084
      [ADDR]11: macvlan0    inet6 2001:6f8:974:0:24f8:84ff:fe02:f92a/64 scope global dynamic
             valid_lft 86399sec preferred_lft 14399sec
      
      # add VLAN to eth1, eth1 is down
      #
      $ ip link add link eth1 up type vlan id 1000
      RTNETLINK answers: Network is down
      
      <no events>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3729d502
    • P
      dev: support deferring device flag change notifications · bd380811
      Committed by Patrick McHardy
      Split dev_change_flags() into two functions: __dev_change_flags() to
      perform the actual changes and __dev_notify_flags() to invoke netdevice
      notifiers. This will be used by rtnl_link to defer netlink notifications
      until the device has been fully configured; a simplified sketch follows
      the list below.
      
      This changes ordering of some operations, in particular:
      
      - netlink notifications are sent after all changes have been performed.
        As a side effect this suppresses one unnecessary netlink message when
        the IFF_UP and other flags are changed simultaneously.
      
      - The NETDEV_UP/NETDEV_DOWN and NETDEV_CHANGE notifiers are invoked
        after all changes have been performed. Their relative order is unchanged.
      
      - net_dmaengine_put() is invoked before the NETDEV_DOWN notifier instead
        of afterwards. This should not make any difference since both RX and TX
        are already shut down at this point.
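
      A simplified sketch of what the split leaves for ordinary callers (the
      real function also computes the set of changed flags to report):

          int dev_change_flags(struct net_device *dev, unsigned int flags)
          {
                  unsigned int old_flags = dev->flags;
                  int ret;

                  ret = __dev_change_flags(dev, flags);
                  if (ret < 0)
                          return ret;
                  __dev_notify_flags(dev, old_flags);  /* notifiers + netlink */
                  return ret;
          }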
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bd380811
    • P
      rtnetlink: handle rtnl_link netlink notifications manually · a2835763
      Committed by Patrick McHardy
      In order to support specifying device flags during device creation,
      we must be able to roll back device registration in case setting the
      flags fails without sending any notifications related to the device
      to userspace.
      
      This patch changes rollback_registered_many() and register_netdevice()
      to manually send netlink notifications for devices not handled by
      rtnl_link, and allows deferring notifications for devices handled by
      rtnl_link until setup is complete.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a2835763
    • P
      rtnetlink: ignore NETDEV_PRE_UP notifier in rtnetlink_event() · 10de05af
      Committed by Patrick McHardy
      Commit 3b8bcfd5 (net: introduce pre-up netdev notifier) added a new
      notifier which is run before a device is set UP for use by cfg80211.
      
      The patch neglected to add the new notifier to the ignore list in
      rtnetlink_event(), so we currently get an unnecessary netlink
      notification before a device is set UP.
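
      The fix, sketched (a simplified shape of the rtnetlink_event() switch,
      not the exact source; the point is that NETDEV_PRE_UP must not reach
      the default case):

          switch (event) {
          case NETDEV_UP:
          case NETDEV_DOWN:
          case NETDEV_PRE_UP:     /* newly ignored: device is not up yet */
                  break;
          default:
                  rtmsg_ifinfo(RTM_NEWLINK, dev, 0);
                  break;
          }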
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      10de05af
  10. 26 February 2010, 5 commits
  11. 25 February 2010, 1 commit
    • P
      net: Add checking to rcu_dereference() primitives · a898def2
      Committed by Paul E. McKenney
      Update rcu_dereference() primitives to use new lockdep-based
      checking. The rcu_dereference() in __in6_dev_get() may be
      protected either by rcu_read_lock() or RTNL, per Eric Dumazet.
      The rcu_dereference() in __sk_free() is protected by the fact
      that it is never reached if an update could change it.  Check
      for this by using rcu_dereference_check() to verify that the
      struct sock's ->sk_wmem_alloc counter is zero.
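
      The two conversions described above look roughly like this (simplified):

          /* __in6_dev_get(): safe under either rcu_read_lock() or RTNL */
          idev = rcu_dereference_check(dev->ip6_ptr,
                                       rcu_read_lock_held() ||
                                       lockdep_rtnl_is_held());

          /* __sk_free(): no update can race once sk_wmem_alloc is zero */
          filter = rcu_dereference_check(sk->sk_filter,
                                         atomic_read(&sk->sk_wmem_alloc) == 0);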
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josh@joshtriplett.org
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      LKML-Reference: <1266887105-1528-5-git-send-email-paulmck@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      a898def2
  12. 24 February 2010, 1 commit
    • A
      net: bug fix for vlan + gro issue · c4d49794
      Committed by Ajit Khaparde
      TCP traffic does not start on a VLAN interface when GRO is enabled;
      even the TCP handshake does not take place.
      This is because the eth_type_trans() call before netif_receive_skb()
      in napi_gro_finish() resets skb->dev to napi->dev from the previously
      set VLAN netdev interface. This causes ip_route_input() to drop the
      incoming packet, considering it a packet coming from a martian source.
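
      The problematic sequence, roughly (simplified from the receive path
      described above):

          skb->dev = vlan_dev;             /* set by VLAN rx handling */
          /* ... GRO completes, then: */
          skb->protocol = eth_type_trans(skb, napi->dev);  /* resets skb->dev */
          netif_receive_skb(skb);          /* ip_route_input() now sees the
                                              wrong device: martian, dropped */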
      
      I could reproduce this on 2.6.32.7 (stable) and 2.6.33-rc7.
      With this fix, the traffic starts and the test runs fine on both VLAN
      and non-VLAN interfaces.
      
      CC: Herbert Xu <herbert@gondor.apana.org.au>
      CC: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Ajit Khaparde <ajitk@serverengines.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c4d49794
  13. 23 February 2010, 1 commit
  14. 20 February 2010, 1 commit
  15. 18 February 2010, 4 commits
  16. 17 February 2010, 3 commits
  17. 16 February 2010, 3 commits
  18. 13 February 2010, 1 commit