1. 25 7月, 2010 1 次提交
    • S
      sysfs: add attribute to indicate hw address assignment type · c1f79426
      Stefan Assmann 提交于
      Add addr_assign_type to struct net_device and expose it via sysfs.
      This new attribute has the purpose of giving user-space the ability to
      distinguish between different assignment types of MAC addresses.
      
      For example user-space can treat NICs with randomly generated MAC
      addresses differently than NICs that have permanent (locally assigned)
      MAC addresses.
      For the former udev could write a persistent net rule by matching the
      device path instead of the MAC address.
      There's also the case of devices that 'steal' MAC addresses from slave
      devices. In which it is also be beneficial for user-space to be aware
      of the fact.
      
      This patch also introduces a helper function to assist adoption of
      drivers that generate MAC addresses randomly.
      Signed-off-by: NStefan Assmann <sassmann@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c1f79426
  2. 20 7月, 2010 1 次提交
  3. 19 7月, 2010 2 次提交
  4. 10 7月, 2010 2 次提交
    • B
      net: Document that dev_get_stats() returns the given pointer · d7753516
      Ben Hutchings 提交于
      Document that dev_get_stats() returns the same stats pointer it was
      given.  Remove const qualification from the returned pointer since the
      caller may do what it likes with that structure.
      Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d7753516
    • B
      net: Get rid of rtnl_link_stats64 / net_device_stats union · 3cfde79c
      Ben Hutchings 提交于
      In commit be1f3c2c "net: Enable 64-bit
      net device statistics on 32-bit architectures" I redefined struct
      net_device_stats so that it could be used in a union with struct
      rtnl_link_stats64, avoiding the need for explicit copying or
      conversion between the two.  However, this is unsafe because there is
      no locking required and no lock consistently held around calls to
      dev_get_stats() and use of the statistics structure it returns.
      
      In commit 28172739 "net: fix 64 bit
      counters on 32 bit arches" Eric Dumazet dealt with that problem by
      requiring callers of dev_get_stats() to provide storage for the
      result.  This means that the net_device::stats64 field and the padding
      in struct net_device_stats are now redundant, so remove them.
      
      Update the comment on net_device_ops::ndo_get_stats64 to reflect its
      new usage.
      
      Change dev_txq_stats_fold() to use struct rtnl_link_stats64, since
      that is what all its callers are really using and it is no longer
      going to be compatible with struct net_device_stats.
      
      Eric Dumazet suggested the separate function for the structure
      conversion.
      Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
      Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3cfde79c
  5. 08 7月, 2010 1 次提交
    • E
      net: fix 64 bit counters on 32 bit arches · 28172739
      Eric Dumazet 提交于
      There is a small possibility that a reader gets incorrect values on 32
      bit arches. SNMP applications could catch incorrect counters when a
      32bit high part is changed by another stats consumer/provider.
      
      One way to solve this is to add a rtnl_link_stats64 param to all
      ndo_get_stats64() methods, and also add such a parameter to
      dev_get_stats().
      
      Rule is that we are not allowed to use dev->stats64 as a temporary
      storage for 64bit stats, but a caller provided area (usually on stack)
      
      Old drivers (only providing get_stats() method) need no changes.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      28172739
  6. 06 7月, 2010 1 次提交
  7. 05 7月, 2010 2 次提交
  8. 03 7月, 2010 1 次提交
    • J
      net: decreasing real_num_tx_queues needs to flush qdisc · f0796d5c
      John Fastabend 提交于
      Reducing real_num_queues needs to flush the qdisc otherwise
      skbs with queue_mappings greater then real_num_tx_queues can
      be sent to the underlying driver.
      
      The flow for this is,
      
      dev_queue_xmit()
      	dev_pick_tx()
      		skb_tx_hash()  => hash using real_num_tx_queues
      		skb_set_queue_mapping()
      	...
      	qdisc_enqueue_root() => enqueue skb on txq from hash
      ...
      dev->real_num_tx_queues -= n
      ...
      sch_direct_xmit()
      	dev_hard_start_xmit()
      		ndo_start_xmit(skb,dev) => skb queue set with old hash
      
      skbs are enqueued on the qdisc with skb->queue_mapping set
      0 < queue_mappings < real_num_tx_queues.  When the driver
      decreases real_num_tx_queues skb's may be dequeued from the
      qdisc with a queue_mapping greater then real_num_tx_queues.
      
      This fixes a case in ixgbe where this was occurring with DCB
      and FCoE. Because the driver is using queue_mapping to map
      skbs to tx descriptor rings we can potentially map skbs to
      rings that no longer exist.
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Tested-by: NRoss Brattain <ross.b.brattain@intel.com>
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f0796d5c
  9. 17 6月, 2010 1 次提交
  10. 16 6月, 2010 5 次提交
  11. 13 6月, 2010 1 次提交
    • B
      net: Enable 64-bit net device statistics on 32-bit architectures · be1f3c2c
      Ben Hutchings 提交于
      Use struct rtnl_link_stats64 as the statistics structure.
      
      On 32-bit architectures, insert 32 bits of padding after/before each
      field of struct net_device_stats to make its layout compatible with
      struct rtnl_link_stats64.  Add an anonymous union in net_device; move
      stats into the union and add struct rtnl_link_stats64 stats64.
      
      Add net_device_ops::ndo_get_stats64, implementations of which will
      return a pointer to struct rtnl_link_stats64.  Drivers that implement
      this operation must not update the structure asynchronously.
      
      Change dev_get_stats() to call ndo_get_stats64 if available, and to
      return a pointer to struct rtnl_link_stats64.  Change callers of
      dev_get_stats() accordingly.
      Signed-off-by: NBen Hutchings <bhutchings@solarflare.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      be1f3c2c
  12. 08 6月, 2010 1 次提交
    • E
      anycast: Some RCU conversions · bb69ae04
      Eric Dumazet 提交于
      - dev_get_by_flags() changed to dev_get_by_flags_rcu()
      
      - ipv6_sock_ac_join() dont touch dev & idev refcounts
      - ipv6_sock_ac_drop() dont touch dev & idev refcounts
      - ipv6_sock_ac_close() dont touch dev & idev refcounts
      - ipv6_dev_ac_dec() dount touch idev refcount
      - ipv6_chk_acast_addr() dont touch idev refcount
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bb69ae04
  13. 04 6月, 2010 1 次提交
  14. 02 6月, 2010 2 次提交
  15. 31 5月, 2010 1 次提交
    • I
      arp_notify: allow drivers to explicitly request a notification event. · 06c4648d
      Ian Campbell 提交于
      Currently such notifications are only generated when the device comes up or the
      address changes. However one use case for these notifications is to enable
      faster network recovery after a virtual machine migration (by causing switches
      to relearn their MAC tables). A migration appears to the network stack as a
      temporary loss of carrier and therefore does not trigger either of the current
      conditions. Rather than adding carrier up as a trigger (which can cause issues
      when interfaces a flapping) simply add an interface which the driver can use
      to explicitly trigger the notification.
      Signed-off-by: NIan Campbell <ian.campbell@citrix.com>
      Cc: Stephen Hemminger <shemminger@linux-foundation.org>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: netdev@vger.kernel.org
      Cc: stable@kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      06c4648d
  16. 24 5月, 2010 1 次提交
  17. 21 5月, 2010 1 次提交
    • T
      net: fix problem in dequeuing from input_pkt_queue · 76cc8b13
      Tom Herbert 提交于
      Fix some issues introduced in batch skb dequeuing for input_pkt_queue.
      The primary issue it that the queue head must be incremented only
      after a packet has been processed, that is only after
      __netif_receive_skb has been called.  This is needed for the mechanism
      to prevent OOO packet in RFS.  Also when flushing the input_pkt_queue
      and process_queue, the process queue should be done first to prevent
      OOO packets.
      
      Because the input_pkt_queue has been effectively split into two queues,
      the calculation of the tail ptr is no longer correct.  The correct value
      would be head+input_pkt_queue->len+process_queue->len.  To avoid
      this calculation we added an explict input_queue_tail in softnet_data.
      The tail value is simply incremented when queuing to input_pkt_queue.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      76cc8b13
  18. 18 5月, 2010 1 次提交
    • S
      net: Add netlink support for virtual port management (was iovnl) · 57b61080
      Scott Feldman 提交于
      Add new netdev ops ndo_{set|get}_vf_port to allow setting of
      port-profile on a netdev interface.  Extends netlink socket RTM_SETLINK/
      RTM_GETLINK with two new sub msgs called IFLA_VF_PORTS and IFLA_PORT_SELF
      (added to end of IFLA_cmd list).  These are both nested atrtibutes
      using this layout:
      
                    [IFLA_NUM_VF]
                    [IFLA_VF_PORTS]
                            [IFLA_VF_PORT]
                                    [IFLA_PORT_*], ...
                            [IFLA_VF_PORT]
                                    [IFLA_PORT_*], ...
                            ...
                    [IFLA_PORT_SELF]
                            [IFLA_PORT_*], ...
      
      These attributes are design to be set and get symmetrically.  VF_PORTS
      is a list of VF_PORTs, one for each VF, when dealing with an SR-IOV
      device.  PORT_SELF is for the PF of the SR-IOV device, in case it wants
      to also have a port-profile, or for the case where the VF==PF, like in
      enic patch 2/2 of this patch set.
      
      A port-profile is used to configure/enable the external switch virtual port
      backing the netdev interface, not to configure the host-facing side of the
      netdev.  A port-profile is an identifier known to the switch.  How port-
      profiles are installed on the switch or how available port-profiles are
      made know to the host is outside the scope of this patch.
      
      There are two types of port-profiles specs in the netlink msg.  The first spec
      is for 802.1Qbg (pre-)standard, VDP protocol.  The second spec is for devices
      that run a similar protocol as VDP but in firmware, thus hiding the protocol
      details.  In either case, the specs have much in common and makes sense to
      define the netlink msg as the union of the two specs.  For example, both specs
      have a notition of associating/deassociating a port-profile.  And both specs
      require some information from the hypervisor manager, such as client port
      instance ID.
      
      The general flow is the port-profile is applied to a host netdev interface
      using RTM_SETLINK, the receiver of the RTM_SETLINK msg communicates with the
      switch, and the switch virtual port backing the host netdev interface is
      configured/enabled based on the settings defined by the port-profile.  What
      those settings comprise, and how those settings are managed is again
      outside the scope of this patch, since this patch only deals with the
      first step in the flow.
      Signed-off-by: NScott Feldman <scofeldm@cisco.com>
      Signed-off-by: NRoopa Prabhu <roprabhu@cisco.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      57b61080
  19. 16 5月, 2010 1 次提交
    • E
      net: Consistent skb timestamping · 3b098e2d
      Eric Dumazet 提交于
      With RPS inclusion, skb timestamping is not consistent in RX path.
      
      If netif_receive_skb() is used, its deferred after RPS dispatch.
      
      If netif_rx() is used, its done before RPS dispatch.
      
      This can give strange tcpdump timestamps results.
      
      I think timestamping should be done as soon as possible in the receive
      path, to get meaningful values (ie timestamps taken at the time packet
      was delivered by NIC driver to our stack), even if NAPI already can
      defer timestamping a bit (RPS can help to reduce the gap)
      
      Tom Herbert prefer to sample timestamps after RPS dispatch. In case
      sampling is expensive (HPET/acpi_pm on x86), this makes sense.
      
      Let admins switch from one mode to another, using a new
      sysctl, /proc/sys/net/core/netdev_tstamp_prequeue
      
      Its default value (1), means timestamps are taken as soon as possible,
      before backlog queueing, giving accurate timestamps.
      
      Setting a 0 value permits to sample timestamps when processing backlog,
      after RPS dispatch, to lower the load of the pre-RPS cpu.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b098e2d
  20. 11 5月, 2010 1 次提交
    • M
      PM QOS update · ed77134b
      Mark Gross 提交于
      This patch changes the string based list management to a handle base
      implementation to help with the hot path use of pm-qos, it also renames
      much of the API to use "request" as opposed to "requirement" that was
      used in the initial implementation.  I did this because request more
      accurately represents what it actually does.
      
      Also, I added a string based ABI for users wanting to use a string
      interface.  So if the user writes 0xDDDDDDDD formatted hex it will be
      accepted by the interface.  (someone asked me for it and I don't think
      it hurts anything.)
      
      This patch updates some documentation input I got from Randy.
      Signed-off-by: Nmarkgross <mgross@linux.intel.com>
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      ed77134b
  21. 06 5月, 2010 1 次提交
    • W
      netpoll: add generic support for bridge and bonding devices · 0e34e931
      WANG Cong 提交于
      This whole patchset is for adding netpoll support to bridge and bonding
      devices. I already tested it for bridge, bonding, bridge over bonding,
      and bonding over bridge. It looks fine now.
      
      To make bridge and bonding support netpoll, we need to adjust
      some netpoll generic code. This patch does the following things:
      
      1) introduce two new priv_flags for struct net_device:
         IFF_IN_NETPOLL which identifies we are processing a netpoll;
         IFF_DISABLE_NETPOLL is used to disable netpoll support for a device
         at run-time;
      
      2) introduce one new method for netdev_ops:
         ->ndo_netpoll_cleanup() is used to clean up netpoll when a device is
           removed.
      
      3) introduce netpoll_poll_dev() which takes a struct net_device * parameter;
         export netpoll_send_skb() and netpoll_poll_dev() which will be used later;
      
      4) hide a pointer to struct netpoll in struct netpoll_info, ditto.
      
      5) introduce ->real_dev for struct netpoll.
      
      6) introduce a new status NETDEV_BONDING_DESLAE, which is used to disable
         netconsole before releasing a slave, to avoid deadlocks.
      
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NWANG Cong <amwang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0e34e931
  22. 03 5月, 2010 2 次提交
  23. 28 4月, 2010 2 次提交
  24. 20 4月, 2010 2 次提交
    • E
      rps: cleanups · e36fa2f7
      Eric Dumazet 提交于
      struct softnet_data holds many queues, so consistent use "sd" name
      instead of "queue" is better.
      
      Adds a rps_ipi_queued() helper to cleanup enqueue_to_backlog()
      
      Adds a _and_irq_disable suffix to net_rps_action() name, as David
      suggested.
      
      incr_input_queue_head() becomes input_queue_head_incr()
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e36fa2f7
    • E
      rps: shortcut net_rps_action() · 88751275
      Eric Dumazet 提交于
      net_rps_action() is a bit expensive on NR_CPUS=64..4096 kernels, even if
      RPS is not active.
      
      Tom Herbert used two bitmasks to hold information needed to send IPI,
      but a single LIFO list seems more appropriate.
      
      Move all RPS logic into net_rps_action() to cleanup net_rx_action() code
      (remove two ifdefs)
      
      Move rps_remote_softirq_cpus into softnet_data to share its first cache
      line, filling an existing hole.
      
      In a future patch, we could call net_rps_action() from process_backlog()
      to make sure we send IPI before handling this cpu backlog.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      88751275
  25. 17 4月, 2010 1 次提交
    • T
      rfs: Receive Flow Steering · fec5e652
      Tom Herbert 提交于
      This patch implements receive flow steering (RFS).  RFS steers
      received packets for layer 3 and 4 processing to the CPU where
      the application for the corresponding flow is running.  RFS is an
      extension of Receive Packet Steering (RPS).
      
      The basic idea of RFS is that when an application calls recvmsg
      (or sendmsg) the application's running CPU is stored in a hash
      table that is indexed by the connection's rxhash which is stored in
      the socket structure.  The rxhash is passed in skb's received on
      the connection from netif_receive_skb.  For each received packet,
      the associated rxhash is used to look up the CPU in the hash table,
      if a valid CPU is set then the packet is steered to that CPU using
      the RPS mechanisms.
      
      The convolution of the simple approach is that it would potentially
      allow OOO packets.  If threads are thrashing around CPUs or multiple
      threads are trying to read from the same sockets, a quickly changing
      CPU value in the hash table could cause rampant OOO packets--
      we consider this a non-starter.
      
      To avoid OOO packets, this solution implements two types of hash
      tables: rps_sock_flow_table and rps_dev_flow_table.
      
      rps_sock_table is a global hash table.  Each entry is just a CPU
      number and it is populated in recvmsg and sendmsg as described above.
      This table contains the "desired" CPUs for flows.
      
      rps_dev_flow_table is specific to each device queue.  Each entry
      contains a CPU and a tail queue counter.  The CPU is the "current"
      CPU for a matching flow.  The tail queue counter holds the value
      of a tail queue counter for the associated CPU's backlog queue at
      the time of last enqueue for a flow matching the entry.
      
      Each backlog queue has a queue head counter which is incremented
      on dequeue, and so a queue tail counter is computed as queue head
      count + queue length.  When a packet is enqueued on a backlog queue,
      the current value of the queue tail counter is saved in the hash
      entry of the rps_dev_flow_table.
      
      And now the trick: when selecting the CPU for RPS (get_rps_cpu)
      the rps_sock_flow table and the rps_dev_flow table for the RX queue
      are consulted.  When the desired CPU for the flow (found in the
      rps_sock_flow table) does not match the current CPU (found in the
      rps_dev_flow table), the current CPU is changed to the desired CPU
      if one of the following is true:
      
      - The current CPU is unset (equal to RPS_NO_CPU)
      - Current CPU is offline
      - The current CPU's queue head counter >= queue tail counter in the
      rps_dev_flow table.  This checks if the queue tail has advanced
      beyond the last packet that was enqueued using this table entry.
      This guarantees that all packets queued using this entry have been
      dequeued, thus preserving in order delivery.
      
      Making each queue have its own rps_dev_flow table has two advantages:
      1) the tail queue counters will be written on each receive, so
      keeping the table local to interrupting CPU s good for locality.  2)
      this allows lockless access to the table-- the CPU number and queue
      tail counter need to be accessed together under mutual exclusion
      from netif_receive_skb, we assume that this is only called from
      device napi_poll which is non-reentrant.
      
      This patch implements RFS for TCP and connected UDP sockets.
      It should be usable for other flow oriented protocols.
      
      There are two configuration parameters for RFS.  The
      "rps_flow_entries" kernel init parameter sets the number of
      entries in the rps_sock_flow_table, the per rxqueue sysfs entry
      "rps_flow_cnt" contains the number of entries in the rps_dev_flow
      table for the rxqueue.  Both are rounded to power of two.
      
      The obvious benefit of RFS (over just RPS) is that it achieves
      CPU locality between the receive processing for a flow and the
      applications processing; this can result in increased performance
      (higher pps, lower latency).
      
      The benefits of RFS are dependent on cache hierarchy, application
      load, and other factors.  On simple benchmarks, we don't necessarily
      see improvement and sometimes see degradation.  However, for more
      complex benchmarks and for applications where cache pressure is
      much higher this technique seems to perform very well.
      
      Below are some benchmark results which show the potential benfit of
      this patch.  The netperf test has 500 instances of netperf TCP_RR
      test with 1 byte req. and resp.  The RPC test is an request/response
      test similar in structure to netperf RR test ith 100 threads on
      each host, but does more work in userspace that netperf.
      
      e1000e on 8 core Intel
         No RFS or RPS		104K tps at 30% CPU
         No RFS (best RPS config):    290K tps at 63% CPU
         RFS				303K tps at 61% CPU
      
      RPC test	tps	CPU%	50/90/99% usec latency	Latency StdDev
        No RFS/RPS	103K	48%	757/900/3185		4472.35
        RPS only:	174K	73%	415/993/2468		491.66
        RFS		223K	73%	379/651/1382		315.61
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fec5e652
  26. 15 4月, 2010 1 次提交
  27. 13 4月, 2010 1 次提交
  28. 08 4月, 2010 1 次提交
  29. 04 4月, 2010 1 次提交
    • J
      net: convert multicast list to list_head · 22bedad3
      Jiri Pirko 提交于
      Converts the list and the core manipulating with it to be the same as uc_list.
      
      +uses two functions for adding/removing mc address (normal and "global"
       variant) instead of a function parameter.
      +removes dev_mcast.c completely.
      +exposes netdev_hw_addr_list_* macros along with __hw_addr_* functions for
       manipulation with lists on a sandbox (used in bonding and 80211 drivers)
      Signed-off-by: NJiri Pirko <jpirko@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      22bedad3