1. 17 Mar, 2010 22 commits
    • S
      bridge: per-cpu packet statistics (v3) · 14bb4789
      stephen hemminger committed
      The shared packet statistics are a potential source of slowdown
      on bridged traffic. Convert to a per-cpu array, but only keep those
      statistics which change per-packet.
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
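
The per-cpu idea described in this commit can be sketched in user-space C as follows. This is only an illustration under stated assumptions: the struct, the fixed CPU count `NR_CPUS_SKETCH`, and the function names are hypothetical and are not taken from the bridge code.

```c
#include <stdint.h>

#define NR_CPUS_SKETCH 4   /* hypothetical fixed CPU count for the sketch */

/* Only the counters that change on every packet live in the per-CPU array. */
struct pcpu_stats_sketch {
    uint64_t rx_packets;
    uint64_t rx_bytes;
};

static struct pcpu_stats_sketch stats[NR_CPUS_SKETCH];

/* Hot path: touch only the slot that belongs to the current CPU. */
static void count_rx(unsigned int cpu, uint64_t len)
{
    stats[cpu].rx_packets++;
    stats[cpu].rx_bytes += len;
}

/* Slow path: fold all slots into one total when statistics are read. */
static struct pcpu_stats_sketch sum_stats(void)
{
    struct pcpu_stats_sketch total = {0, 0};

    for (unsigned int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
        total.rx_packets += stats[cpu].rx_packets;
        total.rx_bytes += stats[cpu].rx_bytes;
    }
    return total;
}
```

Writers never share a cache line for their counters in this layout (ignoring padding, which a real implementation would have to consider), and the cost of summing is paid only by the rare reader.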
    • T
      rps: Receive Packet Steering · 0a9627f2
      Tom Herbert committed
      This patch implements software receive side packet steering (RPS).  RPS
      distributes the load of received packet processing across multiple CPUs.
      
      Problem statement: Protocol processing done in the NAPI context for received
      packets is serialized per device queue and becomes a bottleneck under high
      packet load.  This substantially limits pps that can be achieved on a single
      queue NIC and provides no scaling with multiple cores.
      
      This solution queues packets early on in the receive path on the backlog queues
      of other CPUs.   This allows protocol processing (e.g. IP and TCP) to be
      performed on packets in parallel.   For each device (or each receive queue in
      a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
      process packets. A CPU is selected on a per packet basis by hashing contents
      of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index
      into the CPU mask.  The IPI mechanism is used to raise networking receive
      softirqs between CPUs.  This effectively emulates in software what a multi-queue
      NIC can provide, but is generic requiring no device support.
      
      Many devices now provide a hash over the 4-tuple on a per packet basis
      (e.g. the Toeplitz hash).  This patch allows drivers to set the HW reported hash
      in an skb field, and that value in turn is used to index into the RPS maps.
      Using the HW generated hash can avoid cache misses on the packet when
      steering it to a remote CPU.
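
A minimal sketch of the selection step described above, assuming a toy mixing function in place of whatever hash the stack or the NIC actually provides. The names `flow4`, `flow_hash` and `pick_cpu` are hypothetical, and the scaling of the hash onto the map length is one common way to index a table, not necessarily what the patch does.

```c
#include <stdint.h>

/* The TCP or UDP 4-tuple that identifies a flow. */
struct flow4 {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

/* Toy mixing function; a stand-in for a software or HW-reported hash. */
static uint32_t flow_hash(const struct flow4 *f)
{
    uint32_t h = f->saddr * 2654435761u;

    h ^= f->daddr * 2246822519u;
    h ^= (((uint32_t)f->sport << 16) | f->dport) * 3266489917u;
    h ^= h >> 15;
    return h;
}

/* Scale the 32-bit hash onto [0, len) and look up the target CPU. */
static unsigned int pick_cpu(const unsigned int *cpus, unsigned int len,
                             uint32_t hash)
{
    return cpus[((uint64_t)hash * len) >> 32];
}
```

Because the CPU is a pure function of the flow hash, all packets of one flow land on the same CPU as long as the map is unchanged, which is what keeps packets of a flow in order.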
      
      The CPU mask is set on a per device and per queue basis in the sysfs variable
      /sys/class/net/<device>/queues/rx-<n>/rps_cpus.  This is a set of canonical
      bit maps for receive queues in the device (numbered by <n>).  If a device
      does not support multi-queue, a single variable is used for the device (rx-0).
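
As an illustration of what a CPU bitmap for rps_cpus looks like, the helper below builds a hexadecimal mask string from a list of CPU numbers. It is a sketch that assumes at most 32 CPUs and a single hex word; `rps_mask_hex` is a hypothetical name, and writing the resulting string to the sysfs file is left to the caller.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Build the hex bitmap for a set of CPUs, e.g. CPUs {0, 2} -> "5".
 * Returns the string length, or -1 if a CPU number does not fit in 32 bits.
 */
static int rps_mask_hex(const unsigned int *cpus, size_t n,
                        char *buf, size_t buflen)
{
    unsigned long mask = 0;

    for (size_t i = 0; i < n; i++) {
        if (cpus[i] >= 32)
            return -1;
        mask |= 1UL << cpus[i];
    }
    return snprintf(buf, buflen, "%lx", mask);
}
```

For example, CPUs 0 through 3 give the string "f", which would then be written to /sys/class/net/<device>/queues/rx-<n>/rps_cpus for the chosen device and queue.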
      
      Generally, we have found this technique increases pps capabilities of a single
      queue device with good CPU utilization.  Optimal settings for the CPU mask
      seem to depend on architectures and cache hierarchy.  Below are some results
      running 500 instances of netperf TCP_RR test with 1 byte req. and resp.
      Results show cumulative transaction rate and system CPU utilization.
      
      e1000e on 8 core Intel
         Without RPS: 108K tps at 33% CPU
         With RPS:    311K tps at 64% CPU
      
      forcedeth on 16 core AMD
         Without RPS: 156K tps at 15% CPU
         With RPS:    404K tps at 49% CPU
      
      bnx2x on 16 core AMD
         Without RPS: 567K tps at 61% CPU (4 HW RX queues)
         Without RPS: 738K tps at 96% CPU (8 HW RX queues)
         With RPS:    854K tps at 76% CPU (4 HW RX queues)
      
      Caveats:
      - The benefits of this patch are dependent on architecture and cache hierarchy.
      Tuning the masks to get best performance is probably necessary.
      - This patch adds overhead in the path for processing a single packet.  In
      a lightly loaded server this overhead may eliminate the advantages of
      increased parallelism, and possibly cause some relative performance degradation.
      We have found that masks that are cache aware (share same caches with
      the interrupting CPU) mitigate much of this.
      - The RPS masks can be changed dynamically; however, whenever the mask is changed
      this introduces the possibility of generating out of order packets.  It's
      probably best not to change the masks too frequently.
      Signed-off-by: Tom Herbert <therbert@google.com>
      
       include/linux/netdevice.h |   32 ++++-
       include/linux/skbuff.h    |    3 +
       net/core/dev.c            |  335 +++++++++++++++++++++++++++++++++++++--------
       net/core/net-sysfs.c      |  225 ++++++++++++++++++++++++++++++-
       net/core/skbuff.c         |    2 +
       5 files changed, 538 insertions(+), 59 deletions(-)
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • T
      RDS: Enable per-cpu workqueue threads · 768bbedf
      Tina Yang committed
      Create per-cpu workqueue threads instead of a single
      krdsd thread. This is a step towards better scalability.
      Signed-off-by: Andy Grover <andy.grover@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      RDS: Do not call set_page_dirty() with irqs off · 561c7df6
      Andy Grover committed
      set_page_dirty() unconditionally re-enables interrupts, so
      if we call it with irqs off, they will be on after the call,
      and that's bad. This patch moves the call after we've re-enabled
      interrupts in send_drop_to(), so it's safe.
      
      Also, add BUG_ONs to let us know if we ever do call set_page_dirty
      with interrupts off.
      Signed-off-by: Andy Grover <andy.grover@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      RDS: Properly unmap when getting a remote access error · 450d06c0
      Sherman Pun committed
      If the RDMA op has aborted with a remote access error,
      in addition to what we already do (tell userspace it has
      completed with an error) also unmap it and put() the rm.
      
      Otherwise, hangs may occur on arches that track maps and
      will not exit without proper cleanup.
      Signed-off-by: Andy Grover <andy.grover@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      RDS: only put sockets that have seen congestion on the poll_waitq · b98ba52f
      Andy Grover committed
      rds_poll_waitq's listeners will be awoken if we receive a congestion
      notification. Bad performance may result because *all* polled sockets
      contend for this single lock. However, it should not be necessary to
      wake pollers when a congestion update arrives if they have never
      experienced congestion, and not putting these on the waitq will
      hopefully greatly reduce contention.
      Signed-off-by: Andy Grover <andy.grover@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • T
      RDS: Fix locking in rds_send_drop_to() · 550a8002
      Tina Yang committed
      It seems rds_send_drop_to() called
      __rds_rdma_send_complete(rs, rm, RDS_RDMA_CANCELED)
      with only the rds_sock lock held, but not the rds_message lock. It raced with
      other threads that are attempting to modify the rds_message as well,
      such as from within rds_rdma_send_complete().
      Signed-off-by: Tina Yang <tina.yang@oracle.com>
      Signed-off-by: Andy Grover <andy.grover@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      RDS: Turn down alarming reconnect messages · 97069788
      Andy Grover committed
      RDS's error messages when a connection goes down are a little
      extreme. A connection may go down, and it will be re-established,
      and everything is fine. This patch routes these messages through
      rdsdebug() instead of sending them to printk directly.
      Signed-off-by: Andy Grover <andy.grover@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      RDS: Workaround for in-use MRs on close causing crash · 571c02fa
      Andy Grover committed
      If a machine is shut down without closing sockets properly and
      freeing all MRs, then a BUG_ON will bring it down. This patch
      changes these to WARN_ONs -- leaking MRs is not fatal (although
      not ideal, and there is more work to do here for a proper fix).
      Signed-off-by: Andy Grover <andy.grover@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • T
      RDS: Fix send locking issue · 048c15e6
      Tina Yang committed
      Fix a deadlock between rds_rdma_send_complete() and
      rds_send_remove_from_sock() when rds socket lock and
      rds message lock are acquired out-of-order.
      Signed-off-by: Tina Yang <Tina.Yang@oracle.com>
      Signed-off-by: Andy Grover <andy.grover@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      RDS: Fix congestion issues for loopback · 2e7b3b99
      Andy Grover committed
      We have two kinds of loopback: software (via loop transport)
      and hardware (via IB). sw is used for 127.0.0.1, and doesn't
      support rdma ops. hw is used for sends to local device IPs,
      and supports rdma. Both are used in different cases.
      
      For both of these, when there is a congestion map update, we
      want to call rds_cong_map_updated() but not actually send
      anything -- since loopback local and foreign congestion maps
      point to the same spot, they're already in sync.
      
      The old code never called sw loop's xmit_cong_map(), so
      rds_cong_map_updated() wasn't being called for it. sw loop
      ports would not work right with the congestion monitor.
      
      Fixing that meant that hw loopback now would send congestion maps
      to itself. This is also undesirable (racy), so we check for this
      case in the ib-specific xmit code.
      Signed-off-by: Andy Grover <andy.grover@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      RDS/TCP: Wait to wake thread when write space available · 8e82376e
      Andy Grover committed
      Instead of waking the send thread whenever any send space is available,
      wait until it is at least half empty. This is modeled on how
      sock_def_write_space() does it, and may help to minimize context
      switches.
      Signed-off-by: Andy Grover <andy.grover@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
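
The "at least half empty" rule from this commit can be sketched as a single predicate. The function name and its plain integer parameters are hypothetical; the actual patch works on socket fields rather than on two bare numbers.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Wake the sender only when the bytes still queued for sending occupy
 * no more than half of the send buffer.
 */
static bool should_wake_sender(uint32_t queued_bytes, uint32_t sndbuf)
{
    /* Widen before doubling so large values cannot overflow. */
    return (uint64_t)queued_bytes * 2 <= sndbuf;
}
```

With a 100-byte buffer, 50 queued bytes would wake the sender and 51 would not, so a slowly draining socket causes one wakeup per half buffer rather than one per freed byte.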
    • A
      RDS: update copy_to_user state in tcp transport · b075cfdb
      Andy Grover committed
      Other transports use rds_page_copy_user, which updates our
      s_copy_to_user counter. TCP doesn't, so it needs to explicitly
      call rds_stats_add().
      Reported-by: Richard Frank <richard.frank@oracle.com>
      Signed-off-by: Andy Grover <andy.grover@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      RDS: sendmsg() should check sndtimeo, not rcvtimeo · 1123fd73
      Andy Grover committed
      Most likely a cut-and-paste error: sendmsg() was checking sock_rcvtimeo.
      Signed-off-by: Andy Grover <andy.grover@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • A
      RDS: Do not BUG() on error returned from ib_post_send · 735f61e6
      Andy Grover committed
      BUGging on a runtime error code should be avoided. This
      patch also eliminates all other BUG()s that have no real
      reason to exist.
      Signed-off-by: Andy Grover <andy.grover@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      bridge: Make first arg to deliver_clone const. · 87faf3cc
      David S. Miller committed
      Otherwise we get a warning from the call in br_forward().
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Y
      bridge br_multicast: Don't refer to BR_INPUT_SKB_CB(skb)->mrouters_only without IGMP snooping. · 32dec5dd
      YOSHIFUJI Hideaki / 吉藤英明 committed
      Without CONFIG_BRIDGE_IGMP_SNOOPING,
      BR_INPUT_SKB_CB(skb)->mrouters_only is not appropriately
      initialized, so we can see garbage.
      
      A clear option to fix this is to set it even without that
      config, but we cannot optimize out the branch.
      
      Let's introduce a macro that returns value of mrouters_only
      and let it return 0 without CONFIG_BRIDGE_IGMP_SNOOPING.
      Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
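
The macro approach described in this commit might look roughly like the sketch below. `CONFIG_BRIDGE_IGMP_SNOOPING` is the option named in the message; the struct `input_cb_sketch`, the macro name `CB_MROUTERS_ONLY` and the helper function are hypothetical stand-ins, not the identifiers used in the bridge code.

```c
/* Stand-in for the per-skb control block; the field may hold garbage
 * when snooping is compiled out and nothing initializes it. */
struct input_cb_sketch {
    int mrouters_only;
};

#ifdef CONFIG_BRIDGE_IGMP_SNOOPING
#define CB_MROUTERS_ONLY(cb) ((cb)->mrouters_only)
#else
/* Never read the field; the result is always 0, so the compiler can
 * drop any branch that tests it. */
#define CB_MROUTERS_ONLY(cb) ((void)(cb), 0)
#endif

/* Flood to every port unless the frame is marked for multicast routers only. */
static int forward_to_all_ports(const struct input_cb_sketch *cb)
{
    return !CB_MROUTERS_ONLY(cb);
}
```

When the option is not defined, `forward_to_all_ports()` returns 1 regardless of what the uninitialized field contains.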
    • V
      route: Fix caught BUG_ON during rt_secret_rebuild_oneshot() · 858a18a6
      Vitaliy Gusev committed
      Calling rt_secret_rebuild() can trigger BUG_ON(timer_pending(&net->ipv4.rt_secret_timer)) in
      add_timer(), as there is no synchronization between calls to rt_secret_rebuild_oneshot()
      for the same net namespace.

      This issue also affects rt_secret_reschedule().

      Thus use mod_timer() instead.
      Signed-off-by: Vitaliy Gusev <vgusev@openvz.org>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Y
    • Y
    • J
      NET: netpoll, fix potential NULL ptr dereference · 21edbb22
      Jiri Slaby committed
      Stanse found that one error path in netpoll_setup dereferences npinfo
      even though it is NULL. Avoid that by adding new label and go to that
      instead.
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: Daniel Borkmann <danborkmann@googlemail.com>
      Cc: David S. Miller <davem@davemloft.net>
      Acked-by: chavey@google.com
      Acked-by: Matt Mackall <mpm@selenic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • N
      tipc: fix lockdep warning on address assignment · a2f46ee1
      Neil Horman committed
      So in the forward porting of various tipc packages, I was constantly
      getting this lockdep warning every time I used tipc-config to set a network
      address for the protocol:
      
      [ INFO: possible circular locking dependency detected ]
      2.6.33 #1
      tipc-config/1326 is trying to acquire lock:
      (ref_table_lock){+.-...}, at: [<ffffffffa0315148>] tipc_ref_discard+0x53/0xd4 [tipc]
      
      but task is already holding lock:
      (&(&entry->lock)->rlock#2){+.-...}, at: [<ffffffffa03150d5>] tipc_ref_lock+0x43/0x63 [tipc]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #1 (&(&entry->lock)->rlock#2){+.-...}:
      [<ffffffff8107b508>] __lock_acquire+0xb67/0xd0f
      [<ffffffff8107b78c>] lock_acquire+0xdc/0x102
      [<ffffffff8145471e>] _raw_spin_lock_bh+0x3b/0x6e
      [<ffffffffa03152b1>] tipc_ref_acquire+0xe8/0x11b [tipc]
      [<ffffffffa031433f>] tipc_createport_raw+0x78/0x1b9 [tipc]
      [<ffffffffa031450b>] tipc_createport+0x8b/0x125 [tipc]
      [<ffffffffa030f221>] tipc_subscr_start+0xce/0x126 [tipc]
      [<ffffffffa0308fb2>] process_signal_queue+0x47/0x7d [tipc]
      [<ffffffff81053e0c>] tasklet_action+0x8c/0xf4
      [<ffffffff81054bd8>] __do_softirq+0xf8/0x1cd
      [<ffffffff8100aadc>] call_softirq+0x1c/0x30
      [<ffffffff810549f4>] _local_bh_enable_ip+0xb8/0xd7
      [<ffffffff81054a21>] local_bh_enable_ip+0xe/0x10
      [<ffffffff81454d31>] _raw_spin_unlock_bh+0x34/0x39
      [<ffffffffa0308eb8>] spin_unlock_bh.clone.0+0x15/0x17 [tipc]
      [<ffffffffa0308f47>] tipc_k_signal+0x8d/0xb1 [tipc]
      [<ffffffffa0308dd9>] tipc_core_start+0x8a/0xad [tipc]
      [<ffffffffa01b1087>] 0xffffffffa01b1087
      [<ffffffff8100207d>] do_one_initcall+0x72/0x18a
      [<ffffffff810872fb>] sys_init_module+0xd8/0x23a
      [<ffffffff81009b42>] system_call_fastpath+0x16/0x1b
      
      -> #0 (ref_table_lock){+.-...}:
      [<ffffffff8107b3b2>] __lock_acquire+0xa11/0xd0f
      [<ffffffff8107b78c>] lock_acquire+0xdc/0x102
      [<ffffffff81454836>] _raw_write_lock_bh+0x3b/0x6e
      [<ffffffffa0315148>] tipc_ref_discard+0x53/0xd4 [tipc]
      [<ffffffffa03141ee>] tipc_deleteport+0x40/0x119 [tipc]
      [<ffffffffa0316e35>] release+0xeb/0x137 [tipc]
      [<ffffffff8139dbf4>] sock_release+0x1f/0x6f
      [<ffffffff8139dc6b>] sock_close+0x27/0x2b
      [<ffffffff811116f6>] __fput+0x12a/0x1df
      [<ffffffff811117c5>] fput+0x1a/0x1c
      [<ffffffff8110e49b>] filp_close+0x68/0x72
      [<ffffffff8110e552>] sys_close+0xad/0xe7
      [<ffffffff81009b42>] system_call_fastpath+0x16/0x1b
      
      Finally decided I should fix this.  It's a straightforward inversion:
      tipc_ref_acquire takes two locks in this order:
      ref_table_lock
      entry->lock

      while tipc_deleteport takes them in this order:
      entry->lock (via tipc_port_lock())
      ref_table_lock (via tipc_ref_discard())

      When the same entry is referenced, we get the above warning.  The fix is equally
      straightforward.  There's no real relation between the entry->lock and the
      ref_table_lock (they just are needed at the same time), so move the entry->lock
      acquisition in tipc_ref_acquire down, after we unlock ref_table_lock (this is
      safe since the ref_table_lock guards changes to the reference table, and we've
      already claimed a slot there).  I've tested the below fix and confirmed that it
      clears up the lockdep issue.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      CC: Allan Stephens <allan.stephens@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
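
The reordering described in this commit can be modelled in user space with pthread mutexes. This is a sketch of the lock ordering only: the table, the `entry_sketch` struct and both function names are hypothetical, and unlike the kernel function this version drops the entry lock before returning.

```c
#include <pthread.h>

#define TABLE_SIZE 2

struct entry_sketch {
    pthread_mutex_t lock;
    int in_use;
    int data;
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct entry_sketch table[TABLE_SIZE] = {
    { PTHREAD_MUTEX_INITIALIZER, 0, 0 },
    { PTHREAD_MUTEX_INITIALIZER, 0, 0 },
};

/* Claim a slot under the table lock, then take the entry lock only after
 * the table lock has been dropped, so the two are never nested here. */
static int ref_acquire_sketch(int value)
{
    int slot = -1;

    pthread_mutex_lock(&table_lock);
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (!table[i].in_use) {
            table[i].in_use = 1;
            slot = i;
            break;
        }
    }
    pthread_mutex_unlock(&table_lock);
    if (slot < 0)
        return -1;

    pthread_mutex_lock(&table[slot].lock);
    table[slot].data = value;
    pthread_mutex_unlock(&table[slot].lock);
    return slot;
}

/* The delete path keeps its order: entry lock first, then table lock. */
static void ref_discard_sketch(int slot)
{
    pthread_mutex_lock(&table[slot].lock);
    pthread_mutex_lock(&table_lock);
    table[slot].in_use = 0;
    pthread_mutex_unlock(&table_lock);
    pthread_mutex_unlock(&table[slot].lock);
}
```

Since the acquire path no longer holds both locks at once, only one nesting order (entry lock, then table lock) remains in the program, and the circular dependency disappears.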
  2. 16 Mar, 2010 4 commits
  3. 14 Mar, 2010 2 commits
  4. 13 Mar, 2010 3 commits
  5. 12 Mar, 2010 1 commit
    • D
      ipconfig: Handle devices which take some time to come up. · 964ad81c
      David S. Miller committed
      Some network devices, particularly USB ones, take several seconds to
      fully init and appear in the device list.
      
      If the user turned ipconfig on, they are using it for NFS root or some
      other early booting purpose.  So it makes no sense to just flat out
      fail immediately if the device isn't found.
      
      It also doesn't make sense to just jack up the initial wait to
      something crazy like 10 seconds.
      
      Instead, poll immediately, and then periodically once a second,
      waiting for a usable device to appear.  Fail after 12 seconds.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Tested-by: Christian Pellegrin <chripell@fsfe.org>
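
The polling strategy from this commit (probe immediately, then once a second, give up after 12 seconds) can be sketched as a small loop. Everything here is hypothetical scaffolding: the callback types, `wait_for_device`, and the two helpers exist only so the loop can be exercised without real devices or real sleeping.

```c
#include <stdbool.h>

#define WAIT_LIMIT_SECS 12

typedef bool (*probe_fn)(void *ctx);
typedef void (*sleep_fn)(unsigned int secs);

/* Returns the number of seconds waited on success, or -1 on timeout. */
static int wait_for_device(probe_fn probe, void *ctx, sleep_fn sleep_secs)
{
    for (int waited = 0; ; waited++) {
        if (probe(ctx))              /* first probe happens with no delay */
            return waited;
        if (waited >= WAIT_LIMIT_SECS)
            return -1;
        sleep_secs(1);
    }
}

/* Demo probe: the device "appears" once the counter reaches zero. */
static bool appears_after(void *ctx)
{
    int *left = ctx;

    if (*left == 0)
        return true;
    (*left)--;
    return false;
}

/* Demo sleep that returns at once, so the sketch runs instantly. */
static void skip_sleep(unsigned int secs)
{
    (void)secs;
}
```

A device that is already present costs no delay at all, while a slow USB adapter is picked up within a second of appearing, which is the trade-off the commit message argues for over a fixed long initial wait.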
  6. 11 Mar, 2010 3 commits
  7. 10 Mar, 2010 3 commits
  8. 09 Mar, 2010 2 commits
    • N
      tipc: filter out messages not intended for this host · de586571
      Neil Horman committed
      Port commit 20deb48d16fdd07ce2fdc8d03ea317362217e085
      from git://tipc.cslab.ericsson.net/pub/git/people/allan/tipc.git

      Part of the larger effort I'm helping with to get all the downstream
      code from windriver forward ported to the upstream tree.

      Original commit message:
      Restore check to filter out inadvertently received messages
      This patch reimplements a check that allows TIPC to discard messages
      that are not intended for it.  This check was present in TIPC 1.5/1.6,
      but was removed by accident during the development of TIPC 1.7; it has
      now been updated to account for new features present in TIPC 1.7 and
      reinserted into TIPC.  The main benefit of this check is to filter
      out messages arriving from orphaned link endpoints, which can arise
      when a node exits the network and then re-enters it with a different
      TIPC network address (i.e. <Z.C.N> value).
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Originally-authored-by: Allan Stephens <allan.stephens@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • N
      tipc: fix endianness on tipc subscriber messages · d88dca79
      Neil Horman committed
      Remove htohl implementation from tipc
      
      I was working on forward porting the downstream commits for TIPC and ran across this one:
      http://tipc.cslab.ericsson.net/cgi-bin/gitweb.cgi?p=people/allan/tipc.git;a=commitdiff;h=894279b9437b63cbb02405ad5b8e033b51e4e31e
      
      I was going to just take it, when I looked closer and noted what it was doing.
      This is basically a routine to byte swap fields of data in sent/received packets
      for tipc, dependent upon the receiver's guessed endianness of the peer when a
      connection is established.  Aside from just seeming silly to me, it appears to
      violate the latest RFC draft for tipc:
      http://tipc.sourceforge.net/doc/draft-spec-tipc-02.txt
      which, according to sections 4.2 and 4.3.3, requires that all fields of all
      commands be sent in network byte order.  So instead of just taking this patch,
      I'm removing the htohl function and replacing the calls with calls to
      ntohl in the rx path and htonl in the send path.
      
      As part of this fix, I'm also changing the subscr_cancel function, which
      searches the list of subscribers, using a memcmp of the entire subscriber list,
      for the entry to tear down.  Unfortunately it memcmps the entire tipc_subscr
      structure, which has several bits that are private to the local side, so nothing
      will ever match.  Section 5.2 of the draft spec indicates the <type,upper,lower>
      tuple should uniquely identify a subscriber, so convert subscr_cancel to just
      match on those fields (properly endian swapped).
      
      I've tested this using the tipc test suite, and it passed without issue.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
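
The two changes in this commit (always using network byte order, and matching a subscription on its <type,upper,lower> fields rather than on a memcmp of the whole structure) can be sketched as below. The struct `name_seq_wire` and both function names are hypothetical; only `htonl()` and `ntohl()` are the real POSIX calls.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

/* The identifying fields as they travel on the wire (network byte order). */
struct name_seq_wire {
    uint32_t type;
    uint32_t lower;
    uint32_t upper;
};

/* Send path: convert host values with htonl(), whatever the peer's CPU is. */
static struct name_seq_wire seq_to_wire(uint32_t type, uint32_t lower,
                                        uint32_t upper)
{
    struct name_seq_wire w;

    w.type = htonl(type);
    w.lower = htonl(lower);
    w.upper = htonl(upper);
    return w;
}

/* Receive path: convert back with ntohl() and compare only the three
 * identifying fields, ignoring anything private to the local side. */
static bool seq_matches(const struct name_seq_wire *w, uint32_t type,
                        uint32_t lower, uint32_t upper)
{
    return ntohl(w->type) == type &&
           ntohl(w->lower) == lower &&
           ntohl(w->upper) == upper;
}
```

Comparing field by field avoids the failure mode described above, where locally private members of the structure made a whole-structure comparison never match.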