1. 12 10月, 2010 4 次提交
    • E
      neigh: reorder struct neighbour fields · e37ef961
      Eric Dumazet 提交于
      Le mardi 12 octobre 2010 à 00:02 +0200, Eric Dumazet a écrit :
      > Here is the followup patch.
      >
      > Thanks !
      >
      
      Oops, this was an old version, the up2date ones also took care of "used"
      field.
      
      I guess its time for a sleep, sorry again.
      
      [PATCH net-next V2] neigh: reorder struct neighbour fields
      
      (refcnt) and (ha_lock, ha, used, dev, output, ops, primary_key) should
      be placed on a separate cache lines.
      
      refcnt can be often written, while other fields are mostly read.
      
      This gave me good result on stress test :
      
      before:
      
      real    0m45.570s
      user    0m15.525s
      sys     9m56.669s
      
      After:
      
      real    0m41.841s
      user    0m15.261s
      sys     8m45.949s
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e37ef961
    • E
      net dst: use a percpu_counter to track entries · fc66f95c
      Eric Dumazet 提交于
      struct dst_ops tracks number of allocated dst in an atomic_t field,
      subject to high cache line contention in stress workload.
      
      Switch to a percpu_counter, to reduce number of time we need to dirty a
      central location. Place it on a separate cache line to avoid dirtying
      read only fields.
      
      Stress test :
      
      (Sending 160.000.000 UDP frames,
      IP route cache disabled, dual E5540 @2.53GHz,
      32bit kernel, FIB_TRIE, SLUB/NUMA)
      
      Before:
      
      real    0m51.179s
      user    0m15.329s
      sys     10m15.942s
      
      After:
      
      real	0m45.570s
      user	0m15.525s
      sys	9m56.669s
      
      With a small reordering of struct neighbour fields, subject of a
      following patch, (to separate refcnt from other read mostly fields)
      
      real	0m41.841s
      user	0m15.261s
      sys	8m45.949s
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fc66f95c
    • E
      neigh: Protect neigh->ha[] with a seqlock · 0ed8ddf4
      Eric Dumazet 提交于
      Add a seqlock in struct neighbour to protect neigh->ha[], and avoid
      dirtying neighbour in stress situation (many different flows / dsts)
      
      Dirtying takes place because of read_lock(&n->lock) and n->used writes.
      
      Switching to a seqlock, and writing n->used only on jiffies changes
      permits less dirtying.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0ed8ddf4
    • E
      neigh: speedup neigh_hh_init() · 34d101dd
      Eric Dumazet 提交于
      When a new dst is used to send a frame, neigh_resolve_output() tries to
      associate an struct hh_cache to this dst, calling neigh_hh_init() with
      the neigh rwlock write locked.
      
      Most of the time, hh_cache is already known and linked into neighbour,
      so we find it and increment its refcount.
      
      This patch changes the logic so that we call neigh_hh_init() with
      neighbour lock read locked only, so that fast path can be run in
      parallel by concurrent cpus.
      
      This brings part of the speedup we got with commit c7d4426a
      (introduce DST_NOCACHE flag) for non cached dsts, even for cached ones,
      removing one of the contention point that routers hit on multiqueue
      enabled machines.
      
      Further improvements would need to use a seqlock instead of an rwlock to
      protect neigh->ha[], to not dirty neigh too often and remove two atomic
      ops.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      34d101dd
  2. 09 10月, 2010 1 次提交
    • R
      Phonet: cleanup pipe enable socket option · 03789f26
      Rémi Denis-Courmont 提交于
      The current code works like this:
      
        int garbage, status;
        socklen_t len = sizeof(status);
      
        /* enable pipe */
        setsockopt(fd, SOL_PNPIPE, PNPIPE_ENABLE, &garbage, sizeof(garbage));
        /* disable pipe */
        setsockopt(fd, SOL_PNPIPE, PNPIPE_DISABLE, &garbage, sizeof(garbage));
        /* get status */
        getsockopt(fd, SOL_PNPIPE, PNPIPE_INQ, &status, &len);
      
      ...which does not follow the usual socket option pattern. This patch
      merges all three "options" into a single gettable&settable option,
      before Linux 2.6.37 gets out:
      
        int status;
        socklen_t len = sizeof(status);
      
        /* enable pipe */
        status = 1;
        setsockopt(fd, SOL_PNPIPE, PNPIPE_ENABLE, &status, sizeof(status));
        /* disable pipe */
        status = 0;
        setsockopt(fd, SOL_PNPIPE, PNPIPE_ENABLE, &status, sizeof(status));
        /* get status */
        getsockopt(fd, SOL_PNPIPE, PNPIPE_ENABLE, &status, &len);
      
      This also fixes the error code from EFAULT to ENOTCONN.
      Signed-off-by: NRémi Denis-Courmont <remi.denis-courmont@nokia.com>
      Cc: Kumar Sanghvi <kumar.sanghvi@stericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      03789f26
  3. 08 10月, 2010 1 次提交
  4. 07 10月, 2010 3 次提交
  5. 06 10月, 2010 12 次提交
    • E
      fib: RCU conversion of fib_lookup() · ebc0ffae
      Eric Dumazet 提交于
      fib_lookup() converted to be called in RCU protected context, no
      reference taken and released on a contended cache line (fib_clntref)
      
      fib_table_lookup() and fib_semantic_match() get an additional parameter.
      
      struct fib_info gets an rcu_head field, and is freed after an rcu grace
      period.
      
      Stress test :
      (Sending 160.000.000 UDP frames on same neighbour,
      IP route cache disabled, dual E5540 @2.53GHz,
      32bit kernel, FIB_HASH) (about same results for FIB_TRIE)
      
      Before patch :
      
      real	1m31.199s
      user	0m13.761s
      sys	23m24.780s
      
      After patch:
      
      real	1m5.375s
      user	0m14.997s
      sys	15m50.115s
      
      Before patch Profile :
      
      13044.00 15.4% __ip_route_output_key vmlinux
       8438.00 10.0% dst_destroy           vmlinux
       5983.00  7.1% fib_semantic_match    vmlinux
       5410.00  6.4% fib_rules_lookup      vmlinux
       4803.00  5.7% neigh_lookup          vmlinux
       4420.00  5.2% _raw_spin_lock        vmlinux
       3883.00  4.6% rt_set_nexthop        vmlinux
       3261.00  3.9% _raw_read_lock        vmlinux
       2794.00  3.3% fib_table_lookup      vmlinux
       2374.00  2.8% neigh_resolve_output  vmlinux
       2153.00  2.5% dst_alloc             vmlinux
       1502.00  1.8% _raw_read_lock_bh     vmlinux
       1484.00  1.8% kmem_cache_alloc      vmlinux
       1407.00  1.7% eth_header            vmlinux
       1406.00  1.7% ipv4_dst_destroy      vmlinux
       1298.00  1.5% __copy_from_user_ll   vmlinux
       1174.00  1.4% dev_queue_xmit        vmlinux
       1000.00  1.2% ip_output             vmlinux
      
      After patch Profile :
      
      13712.00 15.8% dst_destroy             vmlinux
       8548.00  9.9% __ip_route_output_key   vmlinux
       7017.00  8.1% neigh_lookup            vmlinux
       4554.00  5.3% fib_semantic_match      vmlinux
       4067.00  4.7% _raw_read_lock          vmlinux
       3491.00  4.0% dst_alloc               vmlinux
       3186.00  3.7% neigh_resolve_output    vmlinux
       3103.00  3.6% fib_table_lookup        vmlinux
       2098.00  2.4% _raw_read_lock_bh       vmlinux
       2081.00  2.4% kmem_cache_alloc        vmlinux
       2013.00  2.3% _raw_spin_lock          vmlinux
       1763.00  2.0% __copy_from_user_ll     vmlinux
       1763.00  2.0% ip_output               vmlinux
       1761.00  2.0% ipv4_dst_destroy        vmlinux
       1631.00  1.9% eth_header              vmlinux
       1440.00  1.7% _raw_read_unlock_bh     vmlinux
      
      Reference results, if IP route cache is enabled :
      
      real	0m29.718s
      user	0m10.845s
      sys	7m37.341s
      
      25213.00 29.5% __ip_route_output_key   vmlinux
       9011.00 10.5% dst_release             vmlinux
       4817.00  5.6% ip_push_pending_frames  vmlinux
       4232.00  5.0% ip_finish_output        vmlinux
       3940.00  4.6% udp_sendmsg             vmlinux
       3730.00  4.4% __copy_from_user_ll     vmlinux
       3716.00  4.4% ip_route_output_flow    vmlinux
       2451.00  2.9% __xfrm_lookup           vmlinux
       2221.00  2.6% ip_append_data          vmlinux
       1718.00  2.0% _raw_spin_lock_bh       vmlinux
       1655.00  1.9% __alloc_skb             vmlinux
       1572.00  1.8% sock_wfree              vmlinux
       1345.00  1.6% kfree                   vmlinux
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ebc0ffae
    • F
      bonding: add retransmit membership reports tunable · c2952c31
      Flavio Leitner 提交于
      Allow sysadmins to configure the number of multicast
      membership report sent on a link failure event.
      Signed-off-by: NFlavio Leitner <fleitner@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c2952c31
    • E
      net neigh: RCU conversion of neigh hash table · d6bf7817
      Eric Dumazet 提交于
      David
      
      This is the first step for RCU conversion of neigh code.
      
      Next patches will convert hash_buckets[] and "struct neighbour" to RCU
      protected objects.
      
      Thanks
      
      [PATCH net-next] net neigh: RCU conversion of neigh hash table
      
      Instead of storing hash_buckets, hash_mask and hash_rnd in "struct
      neigh_table", a new structure is defined :
      
      struct neigh_hash_table {
             struct neighbour        **hash_buckets;
             unsigned int            hash_mask;
             __u32                   hash_rnd;
             struct rcu_head         rcu;
      };
      
      And "struct neigh_table" has an RCU protected pointer to such a
      neigh_hash_table.
      
      This means the signature of (*hash)() function changed: We need to add a
      third parameter with the actual hash_rnd value, since this is not
      anymore a neigh_table field.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d6bf7817
    • E
      net: add a core netdev->rx_dropped counter · caf586e5
      Eric Dumazet 提交于
      In various situations, a device provides a packet to our stack and we
      drop it before it enters protocol stack :
      - softnet backlog full (accounted in /proc/net/softnet_stat)
      - bad vlan tag (not accounted)
      - unknown/unregistered protocol (not accounted)
      
      We can handle a per-device counter of such dropped frames at core level,
      and automatically adds it to the device provided stats (rx_dropped), so
      that standard tools can be used (ifconfig, ip link, cat /proc/net/dev)
      
      This is a generalization of commit 8990f468 (net: rx_dropped
      accounting), thus reverting it.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      caf586e5
    • E
      wait: using uninitialized member of wait queue · 231d0aef
      Evgeny Kuznetsov 提交于
      The "flags" member of "struct wait_queue_t" is used in several places in
      the kernel code without beeing initialized by init_wait().  "flags" is
      used in bitwise operations.
      
      If "flags" not initialized then unexpected behaviour may take place.
      Incorrect flags might used later in code.
      
      Added initialization of "wait_queue_t.flags" with zero value into
      "init_wait".
      Signed-off-by: NEvgeny Kuznetsov <EXT-Eugeny.Kuznetsov@nokia.com>
      [ The bit we care about does end up being initialized by both
         prepare_to_wait() and add_to_wait_queue(), so this doesn't seem to
         cause actual bugs, but is definitely the right thing to do -Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      231d0aef
    • L
      modules: Fix module_bug_list list corruption race · 5336377d
      Linus Torvalds 提交于
      With all the recent module loading cleanups, we've minimized the code
      that sits under module_mutex, fixing various deadlocks and making it
      possible to do most of the module loading in parallel.
      
      However, that whole conversion totally missed the rather obscure code
      that adds a new module to the list for BUG() handling.  That code was
      doubly obscure because (a) the code itself lives in lib/bugs.c (for
      dubious reasons) and (b) it gets called from the architecture-specific
      "module_finalize()" rather than from generic code.
      
      Calling it from arch-specific code makes no sense what-so-ever to begin
      with, and is now actively wrong since that code isn't protected by the
      module loading lock any more.
      
      So this commit moves the "module_bug_{finalize,cleanup}()" calls away
      from the arch-specific code, and into the generic code - and in the
      process protects it with the module_mutex so that the list operations
      are now safe.
      
      Future fixups:
       - move the module list handling code into kernel/module.c where it
         belongs.
       - get rid of 'module_bug_list' and just use the regular list of modules
         (called 'modules' - imagine that) that we already create and maintain
         for other reasons.
      Reported-and-tested-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Adrian Bunk <bunk@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5336377d
    • J
      nl80211: fix remain-on-channel documentation · 691895e7
      Johannes Berg 提交于
      The documentation for NL80211_CMD_REMAIN_ON_CHANNEL
      isn't accurate, an interface index is required by
      the command. Update it accordingly.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NJohn W. Linville <linville@tuxdriver.com>
      691895e7
    • J
      genetlink: introduce pre_doit/post_doit hooks · ff4c92d8
      Johannes Berg 提交于
      Each family may have some amount of boilerplate
      locking code that applies to most, or even all,
      commands.
      
      This allows a family to handle such things in
      a more generic way, by allowing it to
       a) include private flags in each operation
       b) specify a pre_doit hook that is called,
          before an operation's doit() callback and
          may return an error directly,
       c) specify a post_doit hook that can undo
          locking or similar things done by pre_doit,
          and finally
       d) include two private pointers in each info
          struct passed between all these operations
          including doit(). (It's two because I'll
          need two in nl80211 -- can be extended.)
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NJohn W. Linville <linville@tuxdriver.com>
      ff4c92d8
    • H
      mac80211: distinct between max rates and the number of rates the hw can report · 78be49ec
      Helmut Schaa 提交于
      Some drivers cannot handle multiple retry rates specified by the rc
      algorithm but instead use their own retry table (for example rt2800).
      However, if such a device registers itself with a max_rates value of 1
      the rc algorithm cannot make use of the extended information the device
      can provide about retried rates. On the other hand, if a device
      registers itself with a max_rates value > 1 the rc algorithm assumes
      that the device can handle multi rate retries.
      
      Fix this issue by introducing another hw parameter max_report_rates that
      can be set to a different value then max_rates to indicate if a device
      is capable of reporting more rates then specified in max_rates.
      Signed-off-by: NHelmut Schaa <helmut.schaa@googlemail.com>
      Signed-off-by: NIvo van Doorn <IvDoorn@gmail.com>
      Signed-off-by: NJohn W. Linville <linville@tuxdriver.com>
      78be49ec
    • B
      cfg80211: patches to allow setting the WDS peer · e8347eba
      Bill Jordan 提交于
      Added a nl interface to set the peer bssid of a WDS interface.
      Signed-off-by: NBill Jordan <bjordan@rajant.com>
      Signed-off-by: NJohn W. Linville <linville@tuxdriver.com>
      e8347eba
    • J
      cfg80211: remove spurious __KERNEL__ ifdef · ea229e68
      Johannes Berg 提交于
      The net/cfg80211.h header file isn't exported to
      userspace, so there's no need for any kind of
      __KERNEL__ protection in it. If it was exported,
      everything else in it would need protection as
      well, not just the logging stuff ...
      
      Cc:Joe Perches <joe@perches.com>
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NJohn W. Linville <linville@tuxdriver.com>
      ea229e68
    • F
      nl80211: allow drivers to indicate whether the survey data channel is in use · 17e5a808
      Felix Fietkau 提交于
      Some user space applications only want to display survey data for
      the operating channel, however there is no API to get that yet.
      Signed-off-by: NFelix Fietkau <nbd@openwrt.org>
      Signed-off-by: NJohn W. Linville <linville@tuxdriver.com>
      17e5a808
  6. 05 10月, 2010 4 次提交
  7. 04 10月, 2010 2 次提交
    • E
      net: introduce DST_NOCACHE flag · c7d4426a
      Eric Dumazet 提交于
      While doing stress tests with IP route cache disabled, and multi queue
      devices, I noticed a very high contention on one rwlock used in
      neighbour code.
      
      When many cpus are trying to send frames (possibly using a high
      performance multiqueue device) to the same neighbour, they fight for the
      neigh->lock rwlock in order to call neigh_hh_init(), and fight on
      hh->hh_refcnt (a pair of atomic_inc/atomic_dec_and_test())
      
      But we dont need to call neigh_hh_init() for dst that are used only
      once. It costs four atomic operations at least, on two contended cache
      lines, plus the high contention on neigh->lock rwlock.
      
      Introduce a new dst flag, DST_NOCACHE, that is set when dst was not
      inserted in route cache.
      
      With the stress test bench, sending 160000000 frames on one neighbour,
      results are :
      
      Before patch:
      
      real	2m28.406s
      user	0m11.781s
      sys	36m17.964s
      
      
      After patch:
      
      real	1m26.532s
      user	0m12.185s
      sys	20m3.903s
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c7d4426a
    • E
      ipmr: RCU protection for mfc_cache_array · a8c9486b
      Eric Dumazet 提交于
      Use RCU & RTNL protection for mfc_cache_array[]
      
      ipmr_cache_find() is called under rcu_read_lock();
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a8c9486b
  8. 01 10月, 2010 5 次提交
  9. 30 9月, 2010 6 次提交
  10. 29 9月, 2010 2 次提交
    • T
      ipv4: Allow configuring subnets as local addresses · 4465b469
      Tom Herbert 提交于
      This patch allows a host to be configured to respond to any address in
      a specified range as if it were local, without actually needing to
      configure the address on an interface.  This is done through routing
      table configuration.  For instance, to configure a host to respond
      to any address in 10.1/16 received on eth0 as a local address we can do:
      
      ip rule add from all iif eth0 lookup 200
      ip route add local 10.1/16 dev lo proto kernel scope host src 127.0.0.1 table 200
      
      This host is now reachable by any 10.1/16 address (route lookup on
      input for packets received on eth0 can find the route).  On output, the
      rule will not be matched so that this host can still send packets to
      10.1/16 (not sent on loopback).  Presumably, external routing can be
      configured to make sense out of this.
      
      To make this work, we needed to modify the logic in finding the
      interface which is assigned a given source address for output
      (dev_ip_find).  We perform a normal fib_lookup instead of just a
      lookup on the local table, and in the lookup we ignore the input
      interface for matching.
      
      This patch is useful to implement IP-anycast for subnets of virtual
      addresses.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4465b469
    • L
      ACPI: Fix typos · 58f87ed0
      Lucas De Marchi 提交于
      Signed-off-by: NLen Brown <len.brown@intel.com>
      58f87ed0