1. 27 3月, 2010 2 次提交
  2. 22 3月, 2010 1 次提交
    • G
      ipv4: Don't drop redirected route cache entry unless PTMU actually expired · 5e016cbf
      Guenter Roeck 提交于
      TCP sessions over IPv4 can get stuck if routers between endpoints
      do not fragment packets but implement PMTU instead, and we are using
      those routers because of an ICMP redirect.
      
      Setup is as follows
      
             MTU1    MTU2   MTU1
          A--------B------C------D
      
      with MTU1 > MTU2. A and D are endpoints, B and C are routers. B and C
      implement PMTU and drop packets larger than MTU2 (for example because
      DF is set on all packets). TCP sessions are initiated between A and D.
      There is packet loss between A and D, causing frequent TCP
      retransmits.
      
      After the number of retransmits on a TCP session reaches tcp_retries1,
      tcp calls dst_negative_advice() prior to each retransmit. This results
      in route cache entries for the peer to be deleted in
      ipv4_negative_advice() if the Path MTU is set.
      
      If the outstanding data on an affected TCP session is larger than
      MTU2, packets sent from the endpoints will be dropped by B or C, and
      ICMP NEEDFRAG will be returned. A and D receive NEEDFRAG messages and
      update PMTU.
      
      Before the next retransmit, tcp will again call dst_negative_advice(),
      causing the route cache entry (with correct PMTU) to be deleted. The
      retransmitted packet will be larger than MTU2, causing it to be
      dropped again.
      
      This sequence repeats until the TCP session aborts or is terminated.
      
      Problem is fixed by removing redirected route cache entries in
      ipv4_negative_advice() only if the PMTU is expired.
      Signed-off-by: NGuenter Roeck <guenter.roeck@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e016cbf
  3. 20 3月, 2010 1 次提交
    • T
      ipv4: check rt_genid in dst_check · d11a4dc1
      Timo Teräs 提交于
      Xfrm_dst keeps a reference to ipv4 rtable entries on each
      cached bundle. The only way to renew xfrm_dst when the underlying
      route has changed, is to implement dst_check for this. This is
      what ipv6 side does too.
      
      The problems started after 87c1e12b
      ("ipsec: Fix bogus bundle flowi") which fixed a bug causing xfrm_dst
      to not get reused, until that all lookups always generated new
      xfrm_dst with new route reference and path mtu worked. But after the
      fix, the old routes started to get reused even after they were expired
      causing pmtu to break (well it would occationally work if the rtable
      gc had run recently and marked the route obsolete causing dst_check to
      get called).
      Signed-off-by: NTimo Teras <timo.teras@iki.fi>
      Acked-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d11a4dc1
  4. 17 3月, 2010 1 次提交
  5. 09 3月, 2010 1 次提交
    • E
      net: fix route cache rebuilds · 98376387
      Eric Dumazet 提交于
      We added an automatic route cache rebuilding in commit 1080d709
      but had to correct few bugs. One of the assumption of original patch,
      was that entries where kept sorted in a given way.
      
      This assumption is known to be wrong (commit 1ddbcb00 gave an
      explanation of this and corrected a leak) and expensive to respect.
      
      Paweł Staszewski reported to me one of his machine got its routing cache
      disabled after few messages like :
      
      [ 2677.850065] Route hash chain too long!
      [ 2677.850080] Adjust your secret_interval!
      [82839.662993] Route hash chain too long!
      [82839.662996] Adjust your secret_interval!
      [155843.731650] Route hash chain too long!
      [155843.731664] Adjust your secret_interval!
      [155843.811881] Route hash chain too long!
      [155843.811891] Adjust your secret_interval!
      [155843.858209] vlan0811: 5 rebuilds is over limit, route caching
      disabled
      [155843.858212] Route hash chain too long!
      [155843.858213] Adjust your secret_interval!
      
      This is because rt_intern_hash() might be fooled when computing a chain
      length, because multiple entries with same keys can differ because of
      TOS (or mark/oif) bits.
      
      In the rare case the fast algorithm see a too long chain, and before
      taking expensive path, we call a helper function in order to not count
      duplicates of same routes, that only differ with tos/mark/oif bits. This
      helper works with data already in cpu cache and is not be very
      expensive, despite its O(N^2) implementation.
      
      Paweł Staszewski sucessfully tested this patch on his loaded router.
      Reported-and-tested-by: NPaweł Staszewski <pstaszewski@itcare.pl>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      98376387
  6. 25 2月, 2010 1 次提交
    • P
      net: Add checking to rcu_dereference() primitives · a898def2
      Paul E. McKenney 提交于
      Update rcu_dereference() primitives to use new lockdep-based
      checking. The rcu_dereference() in __in6_dev_get() may be
      protected either by rcu_read_lock() or RTNL, per Eric Dumazet.
      The rcu_dereference() in __sk_free() is protected by the fact
      that it is never reached if an update could change it.  Check
      for this by using rcu_dereference_check() to verify that the
      struct sock's ->sk_wmem_alloc counter is zero.
      Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josh@joshtriplett.org
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      LKML-Reference: <1266887105-1528-5-git-send-email-paulmck@linux.vnet.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a898def2
  7. 17 2月, 2010 1 次提交
    • T
      percpu: add __percpu sparse annotations to net · 7d720c3e
      Tejun Heo 提交于
      Add __percpu sparse annotations to net.
      
      These annotations are to make sparse consider percpu variables to be
      in a different address space and warn if accessed without going
      through percpu accessors.  This patch doesn't affect normal builds.
      
      The macro and type tricks around snmp stats make things a bit
      interesting.  DEFINE/DECLARE_SNMP_STAT() macros mark the target field
      as __percpu and SNMP_UPD_PO_STATS() macro is updated accordingly.  All
      snmp_mib_*() users which used to cast the argument to (void **) are
      updated to cast it to (void __percpu **).
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Vlad Yasevich <vladislav.yasevich@hp.com>
      Cc: netdev@vger.kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7d720c3e
  8. 18 1月, 2010 1 次提交
  9. 07 1月, 2010 1 次提交
    • J
      net: RFC3069, private VLAN proxy arp support · 65324144
      Jesper Dangaard Brouer 提交于
      This is to be used together with switch technologies, like RFC3069,
      that where the individual ports are not allowed to communicate with
      each other, but they are allowed to talk to the upstream router.  As
      described in RFC 3069, it is possible to allow these hosts to
      communicate through the upstream router by proxy_arp'ing.
      
      This patch basically allow proxy arp replies back to the same
      interface (from which the ARP request/solicitation was received).
      
      Tunable per device via proc "proxy_arp_pvlan":
        /proc/sys/net/ipv4/conf/*/proxy_arp_pvlan
      
      This switch technology is known by different vendor names:
       - In RFC 3069 it is called VLAN Aggregation.
       - Cisco and Allied Telesyn call it Private VLAN.
       - Hewlett-Packard call it Source-Port filtering or port-isolation.
       - Ericsson call it MAC-Forced Forwarding (RFC Draft).
      Signed-off-by: NJesper Dangaard Brouer <hawk@comx.dk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      65324144
  10. 02 12月, 2009 1 次提交
    • E
      net: NETDEV_UNREGISTER_PERNET -> NETDEV_UNREGISTER_BATCH · a5ee1551
      Eric W. Biederman 提交于
      The motivation for an additional notifier in batched netdevice
      notification (rt_do_flush) only needs to be called once per batch not
      once per namespace.
      
      For further batching improvements I need a guarantee that the
      netdevices are unregistered in order allowing me to unregister an all
      of the network devices in a network namespace at the same time with
      the guarantee that the loopback device is really and truly
      unregistered last.
      
      Additionally it appears that we moved the route cache flush after
      the final synchronize_net, which seems wrong and there was no
      explanation.  So I have restored the original location of the final
      synchronize_net.
      
      Cc: Octavian Purdila <opurdila@ixiacom.com>
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a5ee1551
  11. 26 11月, 2009 2 次提交
  12. 24 11月, 2009 1 次提交
  13. 14 11月, 2009 1 次提交
    • E
      inetpeer: Optimize inet_getid() · 2c1409a0
      Eric Dumazet 提交于
      While investigating for network latencies, I found inet_getid() was a
      contention point for some workloads, as inet_peer_idlock is shared
      by all inet_getid() users regardless of peers.
      
      One way to fix this is to make ip_id_count an atomic_t instead
      of __u16, and use atomic_add_return().
      
      In order to keep sizeof(struct inet_peer) = 64 on 64bit arches
      tcp_ts_stamp is also converted to __u32 instead of "unsigned long".
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c1409a0
  14. 12 11月, 2009 1 次提交
    • E
      sysctl net: Remove unused binary sysctl code · f8572d8f
      Eric W. Biederman 提交于
      Now that sys_sysctl is a compatiblity wrapper around /proc/sys
      all sysctl strategy routines, and all ctl_name and strategy
      entries in the sysctl tables are unused, and can be
      revmoed.
      
      In addition neigh_sysctl_register has been modified to no longer
      take a strategy argument and it's callers have been modified not
      to pass one.
      
      Cc: "David Miller" <davem@davemloft.net>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: netdev@vger.kernel.org
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      f8572d8f
  15. 30 10月, 2009 1 次提交
  16. 20 10月, 2009 1 次提交
  17. 24 9月, 2009 1 次提交
  18. 22 9月, 2009 1 次提交
  19. 29 8月, 2009 1 次提交
  20. 31 7月, 2009 1 次提交
    • N
      xfrm: select sane defaults for xfrm[4|6] gc_thresh · a33bc5c1
      Neil Horman 提交于
      Choose saner defaults for xfrm[4|6] gc_thresh values on init
      
      Currently, the xfrm[4|6] code has hard-coded initial gc_thresh values
      (set to 1024).  Given that the ipv4 and ipv6 routing caches are sized
      dynamically at boot time, the static selections can be non-sensical.
      This patch dynamically selects an appropriate gc threshold based on
      the corresponding main routing table size, using the assumption that
      we should in the worst case be able to handle as many connections as
      the routing table can.
      
      For ipv4, the maximum route cache size is 16 * the number of hash
      buckets in the route cache.  Given that xfrm4 starts garbage
      collection at the gc_thresh and prevents new allocations at 2 *
      gc_thresh, we set gc_thresh to half the maximum route cache size.
      
      For ipv6, its a bit trickier.  there is no maximum route cache size,
      but the ipv6 dst_ops gc_thresh is statically set to 1024.  It seems
      sane to select a simmilar gc_thresh for the xfrm6 code that is half
      the number of hash buckets in the v6 route cache times 16 (like the v4
      code does).
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a33bc5c1
  21. 24 6月, 2009 1 次提交
    • N
      ipv4 routing: Ensure that route cache entries are usable and reclaimable with caching is off · b6280b47
      Neil Horman 提交于
      When route caching is disabled (rt_caching returns false), We still use route
      cache entries that are created and passed into rt_intern_hash once.  These
      routes need to be made usable for the one call path that holds a reference to
      them, and they need to be reclaimed when they're finished with their use.  To be
      made usable, they need to be associated with a neighbor table entry (which they
      currently are not), otherwise iproute_finish2 just discards the packet, since we
      don't know which L2 peer to send the packet to.  To do this binding, we need to
      follow the path a bit higher up in rt_intern_hash, which calls
      arp_bind_neighbour, but not assign the route entry to the hash table.
      Currently, if caching is off, we simply assign the route to the rp pointer and
      are reutrn success.  This patch associates us with a neighbor entry first.
      
      Secondly, we need to make sure that any single use routes like this are known to
      the garbage collector when caching is off.  If caching is off, and we try to
      hash in a route, it will leak when its refcount reaches zero.  To avoid this,
      this patch calls rt_free on the route cache entry passed into rt_intern_hash.
      This places us on the gc list for the route cache garbage collector, so that
      when its refcount reaches zero, it will be reclaimed (Thanks to Alexey for this
      suggestion).
      
      I've tested this on a local system here, and with these patches in place, I'm
      able to maintain routed connectivity to remote systems, even if I set
      /proc/sys/net/ipv4/rt_cache_rebuild_count to -1, which forces rt_caching to
      return false.
      Signed-off-by: NNeil Horman <nhorman@redhat.com>
      Reported-by: NJarek Poplawski <jarkao2@gmail.com>
      Reported-by: NMaxime Bizon <mbizon@freebox.fr>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b6280b47
  22. 20 6月, 2009 1 次提交
    • N
      ipv4: fix NULL pointer + success return in route lookup path · 73e42897
      Neil Horman 提交于
      Don't drop route if we're not caching	
      
      	I recently got a report of an oops on a route lookup.  Maxime was
      testing what would happen if route caching was turned off (doing so by setting
      making rt_caching always return 0), and found that it triggered an oops.  I
      looked at it and found that the problem stemmed from the fact that the route
      lookup routines were returning success from their lookup paths (which is good),
      but never set the **rp pointer to anything (which is bad).  This happens because
      in rt_intern_hash, if rt_caching returns false, we call rt_drop and return 0.
      This almost emulates slient success.  What we should be doing is assigning *rp =
      rt and _not_ dropping the route.  This way, during slow path lookups, when we
      create a new route cache entry, we don't immediately discard it, rather we just
      don't add it into the cache hash table, but we let this one lookup use it for
      the purpose of this route request.  Maxime has tested and reports it prevents
      the oops.  There is still a subsequent routing issue that I'm looking into
      further, but I'm confident that, even if its related to this same path, this
      patch makes sense to take.
      Signed-off-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      73e42897
  23. 14 6月, 2009 1 次提交
  24. 03 6月, 2009 2 次提交
  25. 21 5月, 2009 2 次提交
    • E
      net: fix rtable leak in net/ipv4/route.c · 1ddbcb00
      Eric Dumazet 提交于
      Alexander V. Lukyanov found a regression in 2.6.29 and made a complete
      analysis found in http://bugzilla.kernel.org/show_bug.cgi?id=13339
      Quoted here because its a perfect one :
      
      begin_of_quotation
       2.6.29 patch has introduced flexible route cache rebuilding. Unfortunately the
       patch has at least one critical flaw, and another problem.
      
       rt_intern_hash calculates rthi pointer, which is later used for new entry
       insertion. The same loop calculates cand pointer which is used to clean the
       list. If the pointers are the same, rtable leak occurs, as first the cand is
       removed then the new entry is appended to it.
      
       This leak leads to unregister_netdevice problem (usage count > 0).
      
       Another problem of the patch is that it tries to insert the entries in certain
       order, to facilitate counting of entries distinct by all but QoS parameters.
       Unfortunately, referencing an existing rtable entry moves it to list beginning,
       to speed up further lookups, so the carefully built order is destroyed.
      
       For the first problem the simplest patch it to set rthi=0 when rthi==cand, but
       it will also destroy the ordering.
      end_of_quotation
      
      Problematic commit is 1080d709
      (net: implement emergency route cache rebulds when gc_elasticity is exceeded)
      
      Trying to keep dst_entries ordered is too complex and breaks the fact that
      order should depend on the frequency of use for garbage collection.
      
      A possible fix is to make rt_intern_hash() simpler, and only makes
      rt_check_expire() a litle bit smarter, being able to cope with an arbitrary
      entries order. The added loop is running on cache hot data, while cpu
      is prefetching next object, so should be unnoticied.
      Reported-and-analyzed-by: NAlexander V. Lukyanov <lav@yar.ru>
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1ddbcb00
    • E
      net: fix length computation in rt_check_expire() · cf8da764
      Eric Dumazet 提交于
      rt_check_expire() computes average and standard deviation of chain lengths,
      but not correclty reset length to 0 at beginning of each chain.
      This probably gives overflows for sum2 (and sum) on loaded machines instead
      of meaningful results.
      Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cf8da764
  26. 27 4月, 2009 1 次提交
    • A
      ipv4: Limit size of route cache hash table · c9503e0f
      Anton Blanchard 提交于
      Right now we have no upper limit on the size of the route cache hash table.
      On a 128GB POWER6 box it ends up as 32MB:
      
          IP route cache hash table entries: 4194304 (order: 9, 33554432 bytes)
      
      It would be nice to cap this for memory consumption reasons, but a massive
      hashtable also causes a significant spike when measuring OS jitter.
      
      With a 32MB hashtable and 4 million entries, rt_worker_func is taking
      5 ms to complete. On another system with more memory it's taking 14 ms.
      Even though rt_worker_func does call cond_sched() to limit its impact,
      in an HPC environment we want to keep all sources of OS jitter to a minimum.
      
      With the patch applied we limit the number of entries to 512k which
      can still be overriden by using the rt_entries boot option:
      
          IP route cache hash table entries: 524288 (order: 6, 4194304 bytes)
      
      With this patch rt_worker_func now takes 0.460 ms on the same system.
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Acked-by: NEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9503e0f
  27. 25 2月, 2009 1 次提交
  28. 01 2月, 2009 1 次提交
  29. 23 1月, 2009 1 次提交
    • B
      netns: ipmr: enable namespace support in ipv4 multicast routing code · 4feb88e5
      Benjamin Thery 提交于
      This last patch makes the appropriate changes to use and propagate the
      network namespace where needed in IPv4 multicast routing code.
      
      This consists mainly in replacing all the remaining init_net occurences
      with current netns pointer retrieved from sockets, net devices or
      mfc_caches depending on the routines' contexts.
      
      Some routines receive a new 'struct net' parameter to propagate the current
      netns:
      * vif_add/vif_delete
      * ipmr_new_tunnel
      * mroute_clean_tables
      * ipmr_cache_find
      * ipmr_cache_report
      * ipmr_cache_unresolved
      * ipmr_mfc_add/ipmr_mfc_delete
      * ipmr_get_route
      * rt_fill_info (in route.c)
      Signed-off-by: NBenjamin Thery <benjamin.thery@bull.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4feb88e5
  30. 30 12月, 2008 1 次提交
  31. 26 11月, 2008 1 次提交
  32. 12 11月, 2008 1 次提交
  33. 04 11月, 2008 1 次提交
    • A
      net: '&' redux · 6d9f239a
      Alexey Dobriyan 提交于
      I want to compile out proc_* and sysctl_* handlers totally and
      stub them to NULL depending on config options, however usage of &
      will prevent this, since taking adress of NULL pointer will break
      compilation.
      
      So, drop & in front of every ->proc_handler and every ->strategy
      handler, it was never needed in fact.
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6d9f239a
  34. 31 10月, 2008 1 次提交
  35. 29 10月, 2008 2 次提交