1. 01 Oct, 2010 1 commit
    • ipv4: __mkroute_output() speedup · dd28d1a0
      Eric Dumazet committed
      While doing stress tests with a disabled IP route cache, I found
      __mkroute_output() was touching the in_device atomic refcount three times.
      
      Use RCU to touch it once to reduce cache line ping pongs.
      
      Before patch
      
      time to perform the test
      real	1m42.009s
      user	0m12.545s
      sys	25m0.726s
      
      Profile :
      
      16109.00 26.4% ip_route_output_slow   vmlinux
       7434.00 12.2% dst_destroy            vmlinux
       3280.00  5.4% fib_rules_lookup       vmlinux
       3252.00  5.3% fib_semantic_match     vmlinux
       2622.00  4.3% fib_table_lookup       vmlinux
       2535.00  4.1% dst_alloc              vmlinux
       1750.00  2.9% _raw_read_lock         vmlinux
       1532.00  2.5% rt_set_nexthop         vmlinux
      
      After patch
      
      real	1m36.503s
      user	0m12.977s
      sys	23m25.608s
      
      14234.00 22.4% ip_route_output_slow   vmlinux
       8717.00 13.7% dst_destroy            vmlinux
       4052.00  6.4% fib_rules_lookup       vmlinux
       3951.00  6.2% fib_semantic_match     vmlinux
       3191.00  5.0% dst_alloc              vmlinux
       1764.00  2.8% fib_table_lookup       vmlinux
       1692.00  2.7% _raw_read_lock         vmlinux
       1605.00  2.5% rt_set_nexthop         vmlinux
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 27 Sep, 2010 1 commit
  3. 24 Sep, 2010 1 commit
  4. 09 Sep, 2010 1 commit
  5. 20 Aug, 2010 1 commit
  6. 23 Jul, 2010 1 commit
  7. 13 Jul, 2010 1 commit
  8. 17 Jun, 2010 1 commit
    • inetpeer: restore small inet_peer structures · 317fe0e6
      Eric Dumazet committed
      Addition of rcu_head to struct inet_peer added 16 bytes on 64-bit arches.
      
      That's a bit unfortunate, since the old size was exactly 64 bytes.
      
      This can be solved using a union between this rcu_head and four fields
      that are normally used only when a refcount is taken on inet_peer.
      rcu_head is used only when refcnt=-1, right before structure freeing.
      
      Add an inet_peer_refcheck() function to check this assertion for a while.
      
      We can bring back SLAB_HWCACHE_ALIGN qualifier in kmem cache creation.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
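The space saving described in this commit can be illustrated with a minimal sketch. The structures below are hypothetical stand-ins, not the kernel's struct inet_peer: `fake_rcu_head`, the field names and the two layouts are assumptions chosen only to show how overlaying a deferred-free head on fields that are dead at free time shrinks the structure.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct rcu_head (two pointers). */
struct fake_rcu_head {
    struct fake_rcu_head *next;
    void (*func)(struct fake_rcu_head *);
};

/* Layout A: the deferred-free head is a separate member. */
struct peer_flat {
    void *left, *right;        /* tree links */
    uint32_t addr;
    int32_t refcnt;
    uint32_t rid;              /* the four fields below are only  */
    uint16_t ip_id_count;      /* meaningful while a reference is */
    uint32_t tcp_ts;           /* held (illustrative names)       */
    uint32_t tcp_ts_stamp;
    struct fake_rcu_head rcu;  /* only needed right before freeing */
};

/* Layout B: the same four fields share storage with the head. */
struct peer_union {
    void *left, *right;
    uint32_t addr;
    int32_t refcnt;
    union {
        struct {
            uint32_t rid;
            uint16_t ip_id_count;
            uint32_t tcp_ts;
            uint32_t tcp_ts_stamp;
        } live;                    /* valid while refcnt != -1 */
        struct fake_rcu_head rcu;  /* valid once refcnt == -1  */
    } u;
};

size_t peer_flat_size(void)  { return sizeof(struct peer_flat); }
size_t peer_union_size(void) { return sizeof(struct peer_union); }
```

The union is only safe because the two views are never live at the same time, which is the assertion the commit says inet_peer_refcheck() is meant to check.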
  9. 11 Jun, 2010 1 commit
  10. 08 Jun, 2010 1 commit
  11. 04 Jun, 2010 1 commit
  12. 03 Jun, 2010 2 commits
  13. 31 May, 2010 1 commit
  14. 18 May, 2010 2 commits
    • net: implements ip_route_input_noref() · 407eadd9
      Eric Dumazet committed
      ip_route_input() is the version returning a refcounted dst, while
      ip_route_input_noref() returns a non refcounted one.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add a noref bit on skb dst · 7fee226a
      Eric Dumazet committed
      Use low order bit of skb->_skb_dst to tell dst is not refcounted.
      
      Change _skb_dst to _skb_refdst to make sure all uses are caught.
      
      skb_dst() returns the dst, regardless of noref bit set or not, but
      with a lockdep check to make sure a noref dst is not given if current
      user is not rcu protected.
      
      New skb_dst_set_noref() helper to set a non-refcounted dst on a skb.
      (with lockdep check)
      
      skb_dst_drop() drops a reference only if skb dst was refcounted.
      
      skb_dst_force() helper is used to force a refcount on dst, when skb
      is queued and no longer RCU protected.
      
      Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if
      !IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in
      sock_queue_rcv_skb(), in __nf_queue().
      
      Use skb_dst_force() in dev_requeue_skb().
      
      Note: dst_use_noref() still dirties dst, we might transform it
      later to do one dirtying per jiffies.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
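The low-order-bit scheme this commit describes can be sketched in plain C. This is an illustration under assumptions, not the kernel code: `struct dst`, `struct pkt`, the `REFDST_*` macros and the `pkt_dst_*` helpers are hypothetical names, a plain integer stands in for the atomic refcount, and the lockdep checks are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REFDST_NOREF   ((uintptr_t)1)    /* low bit set: no reference held */
#define REFDST_PTRMASK (~REFDST_NOREF)

struct dst { int refcnt; };              /* hypothetical route entry */
struct pkt { uintptr_t refdst; };        /* pointer and flag in one word */

/* Return the dst whether or not the noref bit is set. */
struct dst *pkt_dst(const struct pkt *p)
{
    return (struct dst *)(p->refdst & REFDST_PTRMASK);
}

bool pkt_dst_is_noref(const struct pkt *p)
{
    return (p->refdst & REFDST_NOREF) && pkt_dst(p) != NULL;
}

/* Attach a dst without taking a reference (caller must stay protected). */
void pkt_dst_set_noref(struct pkt *p, struct dst *d)
{
    assert(((uintptr_t)d & REFDST_NOREF) == 0);  /* needs an aligned pointer */
    p->refdst = (uintptr_t)d | REFDST_NOREF;
}

/* Upgrade a noref dst to a refcounted one, e.g. before queueing. */
void pkt_dst_force(struct pkt *p)
{
    if (pkt_dst_is_noref(p)) {
        struct dst *d = pkt_dst(p);
        d->refcnt++;
        p->refdst = (uintptr_t)d;
    }
}

/* Drop the reference only if one was actually taken. */
void pkt_dst_drop(struct pkt *p)
{
    if (p->refdst) {
        if (!(p->refdst & REFDST_NOREF))
            pkt_dst(p)->refcnt--;
        p->refdst = 0;
    }
}
```

The trick relies on the pointed-to object being at least 2-byte aligned, so the lowest pointer bit is always zero and free to carry the flag.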
  15. 08 May, 2010 1 commit
  16. 21 Apr, 2010 1 commit
  17. 30 Mar, 2010 1 commit
    • include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo committed
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities to include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch a large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the following.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and tries to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have a fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         widely available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build tests were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable on most builds of
      the specific arch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
  18. 27 Mar, 2010 2 commits
  19. 22 Mar, 2010 1 commit
    • ipv4: Don't drop redirected route cache entry unless PTMU actually expired · 5e016cbf
      Guenter Roeck committed
      TCP sessions over IPv4 can get stuck if routers between endpoints
      do not fragment packets but implement PMTU instead, and we are using
      those routers because of an ICMP redirect.
      
      Setup is as follows
      
             MTU1    MTU2   MTU1
          A--------B------C------D
      
      with MTU1 > MTU2. A and D are endpoints, B and C are routers. B and C
      implement PMTU and drop packets larger than MTU2 (for example because
      DF is set on all packets). TCP sessions are initiated between A and D.
      There is packet loss between A and D, causing frequent TCP
      retransmits.
      
      After the number of retransmits on a TCP session reaches tcp_retries1,
      tcp calls dst_negative_advice() prior to each retransmit. This results
      in route cache entries for the peer being deleted in
      ipv4_negative_advice() if the Path MTU is set.
      
      If the outstanding data on an affected TCP session is larger than
      MTU2, packets sent from the endpoints will be dropped by B or C, and
      ICMP NEEDFRAG will be returned. A and D receive NEEDFRAG messages and
      update PMTU.
      
      Before the next retransmit, tcp will again call dst_negative_advice(),
      causing the route cache entry (with correct PMTU) to be deleted. The
      retransmitted packet will be larger than MTU2, causing it to be
      dropped again.
      
      This sequence repeats until the TCP session aborts or is terminated.
      
      The problem is fixed by removing redirected route cache entries in
      ipv4_negative_advice() only if the PMTU is expired.
      Signed-off-by: Guenter Roeck <guenter.roeck@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
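The distinction this fix turns on, "a PMTU expiry is set" versus "the PMTU expiry has actually passed", can be sketched as a small predicate. This is a hedged illustration, not the kernel's ipv4_negative_advice(): `struct route_entry`, `pmtu_expired()` and `time_after_eq_ul()` are hypothetical names, and the tick counter stands in for jiffies.

```c
#include <stdbool.h>

/* Hypothetical cache entry: expires == 0 means no PMTU expiry is set. */
struct route_entry {
    unsigned long expires;   /* tick count at which the learned PMTU expires */
};

/*
 * Wrap-safe "a is at or after b" for a free-running tick counter.
 * Relies on the usual two's-complement conversion, as the kernel's
 * time comparison macros do.
 */
static bool time_after_eq_ul(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}

/* True only when an expiry is set AND it has already passed. */
bool pmtu_expired(const struct route_entry *rt, unsigned long now)
{
    return rt->expires != 0 && time_after_eq_ul(now, rt->expires);
}
```

Testing only `rt->expires != 0` would treat a freshly learned PMTU as a reason to drop the entry, which matches the retransmit loop the commit message describes.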
  20. 20 Mar, 2010 1 commit
    • ipv4: check rt_genid in dst_check · d11a4dc1
      Timo Teräs committed
      Xfrm_dst keeps a reference to ipv4 rtable entries on each
      cached bundle. The only way to renew xfrm_dst when the underlying
      route has changed, is to implement dst_check for this. This is
      what ipv6 side does too.
      
      The problems started after 87c1e12b
      ("ipsec: Fix bogus bundle flowi"), which fixed a bug that kept xfrm_dst
      from being reused. Until then, every lookup generated a new
      xfrm_dst with a new route reference, and path MTU worked. After the
      fix, old routes were reused even after they had expired,
      breaking PMTU (it would occasionally work if the rtable
      gc had run recently and marked the route obsolete, causing dst_check to
      get called).
      Signed-off-by: Timo Teras <timo.teras@iki.fi>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
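The generation-id check named in this commit's title follows a common cache-invalidation pattern, sketched below under assumptions. `struct route_table`, `struct cached_route` and the `route_*` helpers are hypothetical; the real code compares a route's rt_genid against the namespace's current value inside the dst_check callback.

```c
#include <stddef.h>

/* Hypothetical routing state with a generation counter. */
struct route_table {
    unsigned int genid;      /* bumped whenever cached routes must be renewed */
};

/* Hypothetical cached route, stamped with the generation it was built in. */
struct cached_route {
    const struct route_table *table;
    unsigned int genid;
    int mtu;
};

void route_cache_fill(struct cached_route *rt,
                      const struct route_table *tbl, int mtu)
{
    rt->table = tbl;
    rt->genid = tbl->genid;  /* remember the generation at creation time */
    rt->mtu = mtu;
}

/* Invalidate every cached route at once by moving to a new generation. */
void route_table_flush(struct route_table *tbl)
{
    tbl->genid++;
}

/* Return the route if it is still current, NULL if it must be looked up again. */
struct cached_route *route_check(struct cached_route *rt)
{
    return rt->genid == rt->table->genid ? rt : NULL;
}
```

A holder of a long-lived reference, such as the cached bundle in the commit message, would call the check before each reuse and rebuild when it returns NULL.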
  21. 17 Mar, 2010 1 commit
  22. 09 Mar, 2010 1 commit
    • net: fix route cache rebuilds · 98376387
      Eric Dumazet committed
      We added automatic route cache rebuilding in commit 1080d709
      but had to correct a few bugs. One of the assumptions of the original patch
      was that entries were kept sorted in a given way.
      
      This assumption is known to be wrong (commit 1ddbcb00 gave an
      explanation of this and corrected a leak) and expensive to respect.
      
      Paweł Staszewski reported to me that one of his machines got its routing cache
      disabled after a few messages like:
      
      [ 2677.850065] Route hash chain too long!
      [ 2677.850080] Adjust your secret_interval!
      [82839.662993] Route hash chain too long!
      [82839.662996] Adjust your secret_interval!
      [155843.731650] Route hash chain too long!
      [155843.731664] Adjust your secret_interval!
      [155843.811881] Route hash chain too long!
      [155843.811891] Adjust your secret_interval!
      [155843.858209] vlan0811: 5 rebuilds is over limit, route caching disabled
      [155843.858212] Route hash chain too long!
      [155843.858213] Adjust your secret_interval!
      
      This is because rt_intern_hash() might be fooled when computing a chain
      length, because multiple entries with same keys can differ because of
      TOS (or mark/oif) bits.
      
      In the rare case where the fast algorithm sees a chain that is too long, and
      before taking the expensive path, we call a helper function in order to not
      count duplicates of the same route that differ only in tos/mark/oif bits. This
      helper works with data already in the cpu cache and is not very
      expensive, despite its O(N^2) implementation.
      
      Paweł Staszewski successfully tested this patch on his loaded router.
      Reported-and-tested-by: Paweł Staszewski <pstaszewski@itcare.pl>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
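The O(N^2) duplicate-skipping count described in this commit can be sketched as follows. This is an illustrative reconstruction, not the kernel helper: `struct rt_entry`, `same_hash_inputs()` and `chain_length_distinct()` are hypothetical names, and the choice of destination, source and input interface as the identifying fields is an assumption.

```c
#include <stddef.h>

/* Hypothetical hash-chain entry. */
struct rt_entry {
    unsigned int dst;
    unsigned int src;
    int iif;
    unsigned char tos;       /* entries differing only in tos, mark */
    unsigned int mark;       /* or oif count as the same route here */
    int oif;
    struct rt_entry *next;
};

/* Two entries are "the same route" if the identifying fields match. */
static int same_hash_inputs(const struct rt_entry *a, const struct rt_entry *b)
{
    return a->dst == b->dst && a->src == b->src && a->iif == b->iif;
}

/*
 * Count chain entries, skipping any entry that duplicates an earlier one.
 * Quadratic in chain length, but the chain is short and already in cache.
 */
int chain_length_distinct(const struct rt_entry *head)
{
    const struct rt_entry *e, *prev;
    int length = 0;

    for (e = head; e != NULL; e = e->next) {
        for (prev = head; prev != e; prev = prev->next)
            if (same_hash_inputs(prev, e))
                break;
        if (prev == e)       /* no earlier duplicate found */
            length++;
    }
    return length;
}
```

With this count, a chain that holds many tos or mark variants of one route no longer looks like a hash collision attack, so it would not trigger a rebuild.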
  23. 25 Feb, 2010 1 commit
    • net: Add checking to rcu_dereference() primitives · a898def2
      Paul E. McKenney committed
      Update rcu_dereference() primitives to use new lockdep-based
      checking. The rcu_dereference() in __in6_dev_get() may be
      protected either by rcu_read_lock() or RTNL, per Eric Dumazet.
      The rcu_dereference() in __sk_free() is protected by the fact
      that it is never reached if an update could change it.  Check
      for this by using rcu_dereference_check() to verify that the
      struct sock's ->sk_wmem_alloc counter is zero.
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josh@joshtriplett.org
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      LKML-Reference: <1266887105-1528-5-git-send-email-paulmck@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  24. 17 Feb, 2010 1 commit
    • percpu: add __percpu sparse annotations to net · 7d720c3e
      Tejun Heo committed
      Add __percpu sparse annotations to net.
      
      These annotations are to make sparse consider percpu variables to be
      in a different address space and warn if accessed without going
      through percpu accessors.  This patch doesn't affect normal builds.
      
      The macro and type tricks around snmp stats make things a bit
      interesting.  DEFINE/DECLARE_SNMP_STAT() macros mark the target field
      as __percpu and SNMP_UPD_PO_STATS() macro is updated accordingly.  All
      snmp_mib_*() users which used to cast the argument to (void **) are
      updated to cast it to (void __percpu **).
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Vlad Yasevich <vladislav.yasevich@hp.com>
      Cc: netdev@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 18 Jan, 2010 1 commit
  26. 07 Jan, 2010 1 commit
    • net: RFC3069, private VLAN proxy arp support · 65324144
      Jesper Dangaard Brouer committed
      This is to be used together with switch technologies, like RFC3069,
      where the individual ports are not allowed to communicate with
      each other, but they are allowed to talk to the upstream router.  As
      described in RFC 3069, it is possible to allow these hosts to
      communicate through the upstream router by proxy_arp'ing.
      
      This patch basically allows proxy arp replies back to the same
      interface (from which the ARP request/solicitation was received).
      
      Tunable per device via proc "proxy_arp_pvlan":
        /proc/sys/net/ipv4/conf/*/proxy_arp_pvlan
      
      This switch technology is known by different vendor names:
       - In RFC 3069 it is called VLAN Aggregation.
       - Cisco and Allied Telesyn call it Private VLAN.
       - Hewlett-Packard call it Source-Port filtering or port-isolation.
       - Ericsson call it MAC-Forced Forwarding (RFC Draft).
      Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  27. 02 Dec, 2009 1 commit
    • net: NETDEV_UNREGISTER_PERNET -> NETDEV_UNREGISTER_BATCH · a5ee1551
      Eric W. Biederman committed
      The motivation for an additional notifier in batched netdevice
      notification is that rt_do_flush only needs to be called once per
      batch, not once per namespace.
      
      For further batching improvements I need a guarantee that the
      netdevices are unregistered in order, allowing me to unregister all
      of the network devices in a network namespace at the same time with
      the guarantee that the loopback device is really and truly
      unregistered last.
      
      Additionally it appears that we moved the route cache flush after
      the final synchronize_net, which seems wrong and there was no
      explanation.  So I have restored the original location of the final
      synchronize_net.
      
      Cc: Octavian Purdila <opurdila@ixiacom.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  28. 26 Nov, 2009 2 commits
  29. 24 Nov, 2009 1 commit
  30. 14 Nov, 2009 1 commit
    • inetpeer: Optimize inet_getid() · 2c1409a0
      Eric Dumazet committed
      While investigating network latencies, I found inet_getid() was a
      contention point for some workloads, as inet_peer_idlock is shared
      by all inet_getid() users regardless of peers.
      
      One way to fix this is to make ip_id_count an atomic_t instead
      of __u16, and use atomic_add_return().
      
      In order to keep sizeof(struct inet_peer) = 64 on 64bit arches
      tcp_ts_stamp is also converted to __u32 instead of "unsigned long".
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
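The lock-free ID allocation this commit proposes can be sketched in userspace C11. This is an illustration under assumptions: `struct peer` and `peer_getid()` are hypothetical, and C11 `atomic_fetch_add()` stands in for the kernel's atomic_add_return(), which returns the new value rather than the old one.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-peer state: the ID counter is atomic, so no shared lock. */
struct peer {
    atomic_uint ip_id_count;
};

/*
 * Reserve (more + 1) consecutive IDs for this peer and return the first one.
 * A single atomic read-modify-write replaces a lock/add/unlock sequence on
 * a lock shared by every peer.
 */
uint16_t peer_getid(struct peer *p, unsigned int more)
{
    unsigned int first = atomic_fetch_add(&p->ip_id_count, more + 1);

    return (uint16_t)first;  /* IP IDs are 16 bits wide */
}
```

Because each peer carries its own counter, callers working on different peers no longer contend with each other.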
  31. 12 Nov, 2009 1 commit
    • sysctl net: Remove unused binary sysctl code · f8572d8f
      Eric W. Biederman committed
      Now that sys_sysctl is a compatibility wrapper around /proc/sys,
      all sysctl strategy routines, and all ctl_name and strategy
      entries in the sysctl tables, are unused and can be
      removed.
      
      In addition neigh_sysctl_register has been modified to no longer
      take a strategy argument and its callers have been modified not
      to pass one.
      
      Cc: "David Miller" <davem@davemloft.net>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: netdev@vger.kernel.org
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  32. 30 Oct, 2009 1 commit
  33. 20 Oct, 2009 1 commit
  34. 24 Sep, 2009 1 commit
  35. 22 Sep, 2009 1 commit
  36. 29 Aug, 2009 1 commit