1. 21 Oct, 2010 9 commits
  2. 20 Oct, 2010 2 commits
    • net: avoid RCU for NOCACHE dst · 27b75c95
      Eric Dumazet committed
      There is no point using RCU for dst we allocate for a very short time
      (used once).
      
      Change dst_release() to take DST_NOCACHE into account, but also change
      skb_dst_set_noref() to force a refcount increment for such dst.
      
      This is a _huge_ gain, because we don't waste memory storing thousands
      of dsts. Instead of queueing them to RCU, we can free them instantly.
      
      CPU caches can stay hot, re-using same memory blocks to hold temporary
      dsts.
      
      Note : remove unneeded smp_mb__before_atomic_dec(); in dst_release(),
      since atomic_dec_return() implies a full memory barrier.
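
The release path described above can be sketched in userspace C. This is a toy model, not the kernel code: `toy_dst`, `TOY_DST_NOCACHE`, the deferred queue and the counters are all hypothetical names, and the singly linked queue only stands in for "queueing to RCU". It shows the one idea from the text: when the last reference to a no-cache entry is dropped, the entry is freed on the spot instead of being deferred.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical flag, standing in for DST_NOCACHE. */
#define TOY_DST_NOCACHE 0x1u

/* Toy stand-in for a dst entry; all field names are illustrative. */
struct toy_dst {
    atomic_int refcnt;
    unsigned int flags;
    struct toy_dst *next_deferred;   /* link for the deferred-free queue */
};

static struct toy_dst *deferred_queue;  /* stands in for an RCU callback queue */
static int freed_immediately;           /* counters exist only for observation */
static int queued_for_later;

struct toy_dst *toy_dst_alloc(unsigned int flags)
{
    struct toy_dst *dst = calloc(1, sizeof(*dst));

    if (dst) {
        atomic_init(&dst->refcnt, 1);
        dst->flags = flags;
    }
    return dst;
}

void toy_dst_release(struct toy_dst *dst)
{
    if (!dst)
        return;
    /* atomic_fetch_sub() is sequentially consistent by default and returns
     * the old value, so old == 1 means the last reference was just dropped. */
    if (atomic_fetch_sub(&dst->refcnt, 1) != 1)
        return;
    if (dst->flags & TOY_DST_NOCACHE) {
        free(dst);                       /* short-lived entry: free instantly */
        freed_immediately++;
    } else {
        dst->next_deferred = deferred_queue;  /* otherwise defer the free */
        deferred_queue = dst;
        queued_for_later++;
    }
}

/* Frees everything on the deferred queue and returns how many were freed. */
int toy_flush_deferred(void)
{
    int n = 0;

    while (deferred_queue) {
        struct toy_dst *d = deferred_queue;

        deferred_queue = d->next_deferred;
        free(d);
        n++;
    }
    return n;
}
```

Releasing an entry allocated with `TOY_DST_NOCACHE` returns its memory immediately, so the same block can be reused while it is still hot in the CPU cache; an entry without the flag waits on the queue until `toy_flush_deferred()` runs.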
      
      Stress test, 160.000.000 UDP frames sent, IP route cache disabled
      (DDOS).
      
      Before:
      
      real    0m38.091s
      user    0m13.189s
      sys     7m53.018s
      
      After:
      
      real	0m29.946s
      user	0m12.157s
      sys	7m40.605s
      
      For reference, if IP route cache was enabled :
      
      real	0m32.030s
      user	0m10.521s
      sys	8m15.243s
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: allocate tx queues in register_netdevice · e6484930
      Tom Herbert committed
      This patch introduces netif_alloc_netdev_queues which is called from
      register_netdevice instead of alloc_netdev_mq.  This makes TX queue
      allocation symmetric with RX allocation.  Also, queue locks allocation
      is done in netdev_init_one_queue.  Change set_real_num_tx_queues to
      fail if requested number < 1 or greater than number of allocated
      queues.
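
The bounds check described above can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_netdev` and `toy_set_real_num_tx_queues` are hypothetical names, and returning `-EINVAL` is an assumption about the error code, since the text only says the call should fail.

```c
#include <errno.h>

/* Toy device: only the two queue counts that matter for the check. */
struct toy_netdev {
    unsigned int num_tx_queues;       /* queues allocated at registration */
    unsigned int real_num_tx_queues;  /* queues currently in use */
};

/* Rejects a request below 1 or above the number of allocated queues. */
int toy_set_real_num_tx_queues(struct toy_netdev *dev, unsigned int txq)
{
    if (txq < 1 || txq > dev->num_tx_queues)
        return -EINVAL;   /* assumed error code */

    dev->real_num_tx_queues = txq;
    return 0;
}
```

With four allocated queues, a request for 0 or 5 queues fails and leaves the device unchanged, while a request for 2 succeeds.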
      Signed-off-by: Tom Herbert <therbert@google.com>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 18 Oct, 2010 3 commits
    • bonding: Fix bonding drivers improper modification of netpoll structure · c2355e1a
      Neil Horman committed
      The bonding driver currently modifies the netpoll structure in its xmit path
      while sending frames from netpoll.  This is racy, as other cpus can access the
      netpoll structure in parallel. Since the bonding driver points np->dev to a
      slave device, other cpus can inadvertently attempt to send data directly to
      slave devices, leading to improper locking with the bonding master, lost frames,
      and deadlocks.  This patch fixes that up.
      
      This patch also removes the real_dev pointer from the netpoll structure as that
      data is really only used by bonding in the poll_controller, and we can emulate
      its behavior by checking each slave for IS_UP.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • can: mcp251x: Don't use pdata->model for chip selection anymore · f1f8c6cb
      Marc Kleine-Budde committed
      Since commit e446630c, i.e. v2.6.35-rc1,
      the mcp251x chip model can be selected via the modalias member in the
      struct spi_board_info. The driver stores the actual model in the
      struct mcp251x_platform_data.
      
      From the driver point of view the platform_data should be read only.
      Since all in-tree users of the mcp251x have already been converted to
      the modalias method, this patch moves the "model" member from the
      struct mcp251x_platform_data to the driver's private data structure.
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      Cc: Christian Pellegrin <chripell@fsfe.org>
      Cc: Marc Zyngier <maz@misterjones.org>
    • netns: reorder fields in struct net · 8e602ce2
      Eric Dumazet committed
      In a network bench, I noticed an unfortunate false sharing between
      'loopback_dev' and 'count' fields in "struct net".
      
      'count' is written each time a socket is created or destroyed, while
      loopback_dev might be often read in routing code.
      
      Move loopback_dev to a read-mostly section of "struct net".
      
      Note: struct netns_xfrm is cache line aligned on SMP.
      (It contains a "struct dst_ops")
      Move it at the end to avoid holes, and reduce sizeof(struct net) by 128
      bytes on ia32.
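
The false sharing described above can be illustrated with a small layout sketch. The struct names, the 64-byte line size and the use of `alignas` are assumptions made for this example; the commit itself reorders fields inside "struct net" and does not necessarily use this mechanism. The sketch compares which cache line each field falls on, measured by its offset within the struct.

```c
#include <stdalign.h>
#include <stddef.h>

#define TOY_CACHE_LINE 64   /* assumed cache line size in bytes */

/* Before: the frequently written counter sits next to the read-mostly pointer. */
struct toy_net_before {
    int count;            /* written on every socket create/destroy */
    void *loopback_dev;   /* read often by routing code */
};

/* After: the read-mostly pointer starts on its own cache line. */
struct toy_net_after {
    int count;                                   /* hot, frequently written */
    alignas(TOY_CACHE_LINE) void *loopback_dev;  /* read-mostly section */
};

/* Index of the cache line that a given offset within the struct falls on. */
static size_t toy_cache_line_of(size_t offset)
{
    return offset / TOY_CACHE_LINE;
}
```

In `toy_net_before` both fields have offsets on line 0, so a write to `count` on one CPU can invalidate the line other CPUs are reading `loopback_dev` from. In `toy_net_after` the pointer is pushed to the next line, at the cost of padding, which is why the commit also moves the aligned `netns_xfrm` member to the end to avoid holes.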
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 17 Oct, 2010 2 commits
    • tipc: cleanup function namespace · 31e3c3f6
      Stephen Hemminger committed
      Do some cleanups of TIPC based on make namespacecheck
        1. Don't export unused symbols
        2. Eliminate dead code
        3. Make functions and variables local
        4. Rename buf_acquire to tipc_buf_acquire since it is used in several files
      
      Compile tested only.
      This may break out-of-tree kernel modules that depend on TIPC routines.
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Acked-by: Jon Maloy <jon.maloy@ericsson.com>
      Acked-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: allocate skbs on local node · 564824b0
      Eric Dumazet committed
      commit b30973f8 (node-aware skb allocation) spread a wrong habit of
      allocating net driver skbs on a given memory node: the one closest to
      the NIC hardware. This is wrong because as soon as we try to scale the
      network stack, we need to use many cpus to handle traffic, and we hit
      slub/slab management on cross-node allocations/frees when these cpus
      have to alloc/free skbs bound to a central node.
      
      skbs allocated in the RX path are ephemeral and have a very short
      lifetime: the extra cost of maintaining NUMA affinity is too high. What
      appeared to be a nice idea four years ago is in fact a bad one.
      
      In 2010, NIC hardware is multiqueue, or we use RPS to spread the load,
      and two 10Gb NICs might deliver more than 28 million packets per second,
      needing all the available cpus.
      
      The cost of cross-node handling in the network and vm stacks outweighs the
      small benefit the hardware had when doing its DMA transfer in its 'local'
      memory node at RX time. Even trying to differentiate the two allocations
      done for one skb (the sk_buff on local node, the data part on NIC
      hardware node) is not enough to bring good performance.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 14 Oct, 2010 4 commits
    • Phonet: 'connect' socket implementation for Pipe controller · b3d62553
      Kumar Sanghvi committed
      Based on a suggestion by Rémi Denis-Courmont, this patch implements the
      'connect' socket call for the Pipe controller logic.
      The patch does the following:
      - Removes setsockopts for PNPIPE_CREATE and PNPIPE_DESTROY
      - Adds setsockopt for setting the Pipe handle value
      - Implements connect socket call
      - Updates the Pipe controller logic
      
      User-space should now follow the sequence below with the Pipe controller:
      - socket
      - bind
      - setsockopt for PNPIPE_PIPE_HANDLE
      - connect
      - setsockopt for PNPIPE_ENCAP_IP
      - setsockopt for PNPIPE_ENABLE
      
      GPRS/3G data has been tested and works fine with this.
      Signed-off-by: Kumar Sanghvi <kumar.sanghvi@stericsson.com>
      Acked-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mac80211: add probe request filter flag · 7be5086d
      Johannes Berg committed
      Using the frame registration notification, we
      can see when probe requests are requested and
      notify the low-level driver via filtering. The
      flag is also set in AP and IBSS modes.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
    • cfg80211: notify drivers about frame registrations · 271733cf
      Johannes Berg committed
      Drivers may need to adjust their filters according
      to frame registrations, so notify them about them.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
    • wext: fix alignment problem in serializing 'struct iw_point' · 10d8dad8
      Gerrit Renker committed
      This fixes a typo in the definition of the serialized length of struct iw_point:
       a) wireless.h is exported to userspace, the typo causes IW_EV_POINT_PK_LEN
          to be 12 on 64-bit, and 8 on 32-bit systems (causing misalignment);
       b) in compat-64 mode iwe_stream_add_point() memcpys overlap (see below).
      
      The second case, in compat-64 mode, looks like this (variable names are as in
      include/net/iw_handler.h:iwe_stream_add_point()):
      
       point_len = IW_EV_COMPAT_POINT_LEN = 8
       lcp_len   = IW_EV_COMPAT_LCP_LEN   = 4
       2nd memcpy: IW_EV_POINT_PK_LEN - IW_EV_LCP_PK_LEN = 12 - 4 = 8
      
       IW_EV_LCP_PK_LEN
       <-------------->                *---> 'extra' data area
       +-------+-------+-------+-------+---------------+------- ...-+
       | len   | cmd   |length | flags |  (empty) -> extra      ... |
       +-------+-------+-------+-------+---------------+------- ...-+
          2       2       2       2          4
      
           lcp_len
       <-------------->                <-!! OVERLAP !!>
       <--1st memcpy--><------- 2nd memcpy ----------->
                                       <---- 3rd memcpy ------- ... >
       <--------- point_len ---------->
      
      This case could cause overrun whenever iw_point.length < 4.
      The other two cases are -
       * 32-bit systems: IW_EV_POINT_PK_LEN - IW_EV_LCP_PK_LEN =  8 - 4 = 4,
         the second memcpy copies exactly the 4 required bytes;
       * 64-bit systems: IW_EV_POINT_PK_LEN - IW_EV_LCP_PK_LEN = 12 - 4 = 8,
         the second memcpy copies a superfluous (but non overlapping) 4 bytes.
      
      The patch changes IW_EV_POINT_PK_LEN to be 8, so that in all 3 cases only
      the requested iw_point.{length,flags} (both __u16) are copied, avoiding overrun
      (compat-64) and superfluous copy (64-bit). In addition, the userspace header is
      sanitized (in agreement with version 30 of the wireless tools).
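
The arithmetic for the three cases can be checked with a short sketch. The constant and function names below are hypothetical; the numeric values are the ones quoted in the text above (a packed LCP header of 4 bytes, and a packed point length of 12 or 8 before the fix and 8 after it).

```c
#include <stddef.h>
#include <stdint.h>

/* Lengths quoted in the text; the TOY_ names are illustrative only. */
enum {
    TOY_LCP_PK_LEN          = 4,   /* len (2 bytes) + cmd (2 bytes) */
    TOY_POINT_PK_LEN_OLD_64 = 12,  /* before the fix, 64-bit and compat-64 */
    TOY_POINT_PK_LEN_OLD_32 = 8,   /* before the fix, 32-bit */
    TOY_POINT_PK_LEN_NEW    = 8    /* after the fix, everywhere */
};

/* The part of iw_point that has to be serialized: two 16-bit fields. */
struct toy_point_payload {
    uint16_t length;
    uint16_t flags;
};

/* Number of bytes the second memcpy copies for a given packed point length. */
size_t toy_second_memcpy_len(size_t point_pk_len)
{
    return point_pk_len - TOY_LCP_PK_LEN;
}

/* Bytes copied beyond the length and flags fields that are actually needed. */
size_t toy_excess_bytes(size_t point_pk_len)
{
    return toy_second_memcpy_len(point_pk_len) - sizeof(struct toy_point_payload);
}
```

With the old 64-bit value the second copy moves 8 bytes, 4 more than needed, which is the superfluous copy on native 64-bit and the overlap in compat-64 mode. With a packed length of 8 the excess is zero in every case.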
      
      Many thanks to Johannes Berg for help and review with this patch.
      Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
  6. 13 Oct, 2010 1 commit
    • net: percpu net_device refcount · 29b4433d
      Eric Dumazet committed
      We tried very hard to remove all possible dev_hold()/dev_put() pairs in
      network stack, using RCU conversions.
      
      There is still an unavoidable device refcount change for every dst we
      create/destroy, and this can slow down some workloads (routers or some
      app servers, mmap af_packet)
      
      We can switch to a percpu refcount implementation, now that the dynamic per_cpu
      infrastructure is mature. On a 64-cpu machine, this consumes 256 bytes
      per device.
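
The per-CPU counting idea can be sketched in plain C. This is a single-threaded toy under stated assumptions: `toy_netdev_ref` and its helpers are hypothetical, the CPU index is passed in explicitly, and a fixed array replaces the kernel's dynamic per-CPU allocation. It shows why the footprint is 64 x 4 = 256 bytes and why only the sum of the slots is meaningful.

```c
#include <stdint.h>

#define TOY_NR_CPUS 64   /* assumed CPU count, matching the figure in the text */

/* One 4-byte counter per CPU: 64 * 4 = 256 bytes per device. */
struct toy_netdev_ref {
    int32_t pcpu_refcnt[TOY_NR_CPUS];
};

/* Take a reference: touch only the slot of the CPU doing the work. */
void toy_dev_hold(struct toy_netdev_ref *dev, unsigned int cpu)
{
    dev->pcpu_refcnt[cpu % TOY_NR_CPUS]++;
}

/* Drop a reference, possibly on a different CPU than the matching hold. */
void toy_dev_put(struct toy_netdev_ref *dev, unsigned int cpu)
{
    dev->pcpu_refcnt[cpu % TOY_NR_CPUS]--;
}

/* The real reference count is the sum over all CPUs; one slot may be negative. */
int toy_dev_refcnt_read(const struct toy_netdev_ref *dev)
{
    int sum = 0;

    for (unsigned int i = 0; i < TOY_NR_CPUS; i++)
        sum += dev->pcpu_refcnt[i];
    return sum;
}
```

Because each CPU only ever writes its own slot, the hot path needs no shared cache line and no locked instruction. The trade-off is that reading the total means walking every slot, which is acceptable when reads are rare compared with hold/put pairs.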
      
      On x86, dev_hold(dev) code :
      
      before:
              lock    incl 0x280(%ebx)
      after:
              movl    0x260(%ebx),%eax
              incl    fs:(%eax)
      
      Stress bench :
      
      (Sending 160.000.000 UDP frames,
      IP route cache disabled, dual E5540 @2.53GHz,
      32bit kernel, FIB_TRIE)
      
      Before:
      
      real    1m1.662s
      user    0m14.373s
      sys     12m55.960s
      
      After:
      
      real    0m51.179s
      user    0m15.329s
      sys     10m15.942s
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 12 Oct, 2010 11 commits
  8. 09 Oct, 2010 1 commit
    • Phonet: cleanup pipe enable socket option · 03789f26
      Rémi Denis-Courmont committed
      The current code works like this:
      
        int garbage, status;
        socklen_t len = sizeof(status);
      
        /* enable pipe */
        setsockopt(fd, SOL_PNPIPE, PNPIPE_ENABLE, &garbage, sizeof(garbage));
        /* disable pipe */
        setsockopt(fd, SOL_PNPIPE, PNPIPE_DISABLE, &garbage, sizeof(garbage));
        /* get status */
        getsockopt(fd, SOL_PNPIPE, PNPIPE_INQ, &status, &len);
      
      ...which does not follow the usual socket option pattern. This patch
      merges all three "options" into a single gettable&settable option,
      before Linux 2.6.37 gets out:
      
        int status;
        socklen_t len = sizeof(status);
      
        /* enable pipe */
        status = 1;
        setsockopt(fd, SOL_PNPIPE, PNPIPE_ENABLE, &status, sizeof(status));
        /* disable pipe */
        status = 0;
        setsockopt(fd, SOL_PNPIPE, PNPIPE_ENABLE, &status, sizeof(status));
        /* get status */
        getsockopt(fd, SOL_PNPIPE, PNPIPE_ENABLE, &status, &len);
      
      This also fixes the error code from EFAULT to ENOTCONN.
      Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
      Cc: Kumar Sanghvi <kumar.sanghvi@stericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 08 Oct, 2010 1 commit
  10. 07 Oct, 2010 3 commits
  11. 06 Oct, 2010 3 commits
    • fib: RCU conversion of fib_lookup() · ebc0ffae
      Eric Dumazet committed
      fib_lookup() is converted to be called in an RCU-protected context, so no
      reference is taken and released on a contended cache line (fib_clntref).
      
      fib_table_lookup() and fib_semantic_match() get an additional parameter.
      
      struct fib_info gets an rcu_head field, and is freed after an rcu grace
      period.
      
      Stress test :
      (Sending 160.000.000 UDP frames on same neighbour,
      IP route cache disabled, dual E5540 @2.53GHz,
      32bit kernel, FIB_HASH) (about same results for FIB_TRIE)
      
      Before patch :
      
      real	1m31.199s
      user	0m13.761s
      sys	23m24.780s
      
      After patch:
      
      real	1m5.375s
      user	0m14.997s
      sys	15m50.115s
      
      Before patch Profile :
      
      13044.00 15.4% __ip_route_output_key vmlinux
       8438.00 10.0% dst_destroy           vmlinux
       5983.00  7.1% fib_semantic_match    vmlinux
       5410.00  6.4% fib_rules_lookup      vmlinux
       4803.00  5.7% neigh_lookup          vmlinux
       4420.00  5.2% _raw_spin_lock        vmlinux
       3883.00  4.6% rt_set_nexthop        vmlinux
       3261.00  3.9% _raw_read_lock        vmlinux
       2794.00  3.3% fib_table_lookup      vmlinux
       2374.00  2.8% neigh_resolve_output  vmlinux
       2153.00  2.5% dst_alloc             vmlinux
       1502.00  1.8% _raw_read_lock_bh     vmlinux
       1484.00  1.8% kmem_cache_alloc      vmlinux
       1407.00  1.7% eth_header            vmlinux
       1406.00  1.7% ipv4_dst_destroy      vmlinux
       1298.00  1.5% __copy_from_user_ll   vmlinux
       1174.00  1.4% dev_queue_xmit        vmlinux
       1000.00  1.2% ip_output             vmlinux
      
      After patch Profile :
      
      13712.00 15.8% dst_destroy             vmlinux
       8548.00  9.9% __ip_route_output_key   vmlinux
       7017.00  8.1% neigh_lookup            vmlinux
       4554.00  5.3% fib_semantic_match      vmlinux
       4067.00  4.7% _raw_read_lock          vmlinux
       3491.00  4.0% dst_alloc               vmlinux
       3186.00  3.7% neigh_resolve_output    vmlinux
       3103.00  3.6% fib_table_lookup        vmlinux
       2098.00  2.4% _raw_read_lock_bh       vmlinux
       2081.00  2.4% kmem_cache_alloc        vmlinux
       2013.00  2.3% _raw_spin_lock          vmlinux
       1763.00  2.0% __copy_from_user_ll     vmlinux
       1763.00  2.0% ip_output               vmlinux
       1761.00  2.0% ipv4_dst_destroy        vmlinux
       1631.00  1.9% eth_header              vmlinux
       1440.00  1.7% _raw_read_unlock_bh     vmlinux
      
      Reference results, if IP route cache is enabled :
      
      real	0m29.718s
      user	0m10.845s
      sys	7m37.341s
      
      25213.00 29.5% __ip_route_output_key   vmlinux
       9011.00 10.5% dst_release             vmlinux
       4817.00  5.6% ip_push_pending_frames  vmlinux
       4232.00  5.0% ip_finish_output        vmlinux
       3940.00  4.6% udp_sendmsg             vmlinux
       3730.00  4.4% __copy_from_user_ll     vmlinux
       3716.00  4.4% ip_route_output_flow    vmlinux
       2451.00  2.9% __xfrm_lookup           vmlinux
       2221.00  2.6% ip_append_data          vmlinux
       1718.00  2.0% _raw_spin_lock_bh       vmlinux
       1655.00  1.9% __alloc_skb             vmlinux
       1572.00  1.8% sock_wfree              vmlinux
       1345.00  1.6% kfree                   vmlinux
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bonding: add retransmit membership reports tunable · c2952c31
      Flavio Leitner committed
      Allow sysadmins to configure the number of multicast
      membership reports sent on a link failure event.
      Signed-off-by: Flavio Leitner <fleitner@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net neigh: RCU conversion of neigh hash table · d6bf7817
      Eric Dumazet committed
      This is the first step for RCU conversion of neigh code.
      
      Next patches will convert hash_buckets[] and "struct neighbour" to RCU
      protected objects.
      
      Instead of storing hash_buckets, hash_mask and hash_rnd in "struct
      neigh_table", a new structure is defined :
      
      struct neigh_hash_table {
             struct neighbour        **hash_buckets;
             unsigned int            hash_mask;
             __u32                   hash_rnd;
             struct rcu_head         rcu;
      };
      
      And "struct neigh_table" has an RCU protected pointer to such a
      neigh_hash_table.
      
      This means the signature of (*hash)() function changed: We need to add a
      third parameter with the actual hash_rnd value, since this is not
      anymore a neigh_table field.
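
The new layout and the changed hash signature can be sketched as follows. Everything prefixed with `toy_` is hypothetical, the hash function is an arbitrary placeholder rather than the one used by any protocol, and a plain pointer stands in for the RCU-protected one, so this shows only the data flow: the lookup reads the table pointer once and passes that table's `hash_rnd` to the hash callback as the third argument.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the structure shown above, without the rcu_head. */
struct toy_neigh_hash_table {
    void **hash_buckets;
    unsigned int hash_mask;
    uint32_t hash_rnd;
};

struct toy_neigh_table {
    /* RCU-protected in the kernel; a plain pointer in this sketch. */
    struct toy_neigh_hash_table *nht;
    /* hash_rnd is now an explicit third parameter. */
    uint32_t (*hash)(const void *pkey, const void *dev, uint32_t hash_rnd);
};

/* Placeholder hash over a 4-byte key; not the kernel's function. */
uint32_t toy_hash(const void *pkey, const void *dev, uint32_t hash_rnd)
{
    uint32_t key;

    (void)dev;                       /* unused in this toy hash */
    memcpy(&key, pkey, sizeof(key));
    return (key ^ hash_rnd) * 2654435761u;
}

/* Picks a bucket using one consistent snapshot of the hash table. */
unsigned int toy_bucket_index(const struct toy_neigh_table *tbl,
                              const void *pkey, const void *dev)
{
    const struct toy_neigh_hash_table *nht = tbl->nht;  /* read once */

    return tbl->hash(pkey, dev, nht->hash_rnd) & nht->hash_mask;
}
```

Keeping the buckets, mask and random seed in one object means a reader never mixes the seed of one table with the mask of another while the table is being replaced, which is the reason the seed has to travel as a parameter instead of being read from the outer table.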
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>