  1. 03 Dec, 2006 4 commits
  2. 12 Oct, 2006 1 commit
  3. 29 Sep, 2006 7 commits
  4. 23 Sep, 2006 5 commits
  5. 08 Aug, 2006 1 commit
    • [IPV4]: Limit rt cache size properly. · 8d1502de
      Kirill Korotaev committed
      From: Kirill Korotaev <dev@sw.ru>
      
      During OpenVZ stress testing we found that UDP traffic with random src
      can cause excessive rt hash growth, eventually leading to OOM and
      kernel panics.
      
      On a 4GB i686 system (1048576 total pages, 225280 normal zone pages)
      the kernel allocates the following route hash:
      syslog: IP route cache hash table entries: 262144 (order: 8, 1048576
      bytes) => ip_rt_max_size = 4194304 entries, i.e. the maximum rt cache
      size is 4194304 * 256 bytes = 1GB of RAM > normal_zone
      
      The attached patch removes the HASH_HIGHMEM flag from the
      alloc_large_system_hash() call.
      Signed-off-by: David S. Miller <davem@davemloft.net>
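The arithmetic in the message above can be checked with a minimal C sketch. The 256-byte entry size and the page counts come from the message; the factor of 16 entries per hash bucket is inferred from the two figures it quotes (262144 buckets, 4194304 entries), and the helper names and the 4096-byte page size used in the usage note are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Figures quoted in, or inferred from, the commit message. */
#define RT_ENTRY_BYTES        256u /* size of one cached route entry */
#define RT_ENTRIES_PER_BUCKET 16u  /* inferred: 4194304 / 262144 */

/* Worst-case bytes used by a full route cache (hypothetical helper). */
static uint64_t rt_cache_max_bytes(uint64_t hash_buckets)
{
    uint64_t max_entries = hash_buckets * RT_ENTRIES_PER_BUCKET;

    return max_entries * RT_ENTRY_BYTES;
}

/* Bytes covered by a memory zone of `pages` pages (hypothetical helper). */
static uint64_t zone_bytes(uint64_t pages, uint64_t page_size)
{
    assert(page_size > 0);
    return pages * page_size;
}
```

With 262144 buckets, `rt_cache_max_bytes()` gives 1 GiB, which is more than `zone_bytes(225280, 4096)`, the normal zone reported for that machine.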
  6. 03 Aug, 2006 1 commit
  7. 04 Jul, 2006 2 commits
  8. 01 Jul, 2006 1 commit
  9. 26 Jun, 2006 1 commit
  10. 18 Apr, 2006 1 commit
  11. 11 Apr, 2006 1 commit
  12. 25 Mar, 2006 1 commit
    • [IPV4]: Aggregate route entries with different TOS values · cef2685e
      Ilia Sotnikov committed
      When we get an ICMP need-to-frag message, the original TOS value in the
      ICMP payload cannot be used as a key to look up the routes to update.
      This is because the TOS field may have been modified by routers on the
      way.  Similarly, ip_rt_redirect should also ignore the TOS as the router
      that gave us the message may have modified the TOS value.
      
      The patch achieves this objective by aggregating entries with different
      TOS values (but otherwise identical) into the same bucket.  This
      makes it easy to update them at the same time when an ICMP message is
      received.
      
      In the future we should use a twin-hashing scheme where the aggregation
      occurs at the entry level.  That is, the TOS goes back into the hash
      for normal lookups while ICMP lookups will end up with a node that
      gives us a list that contains all other route entries that differ
      only by TOS.
      Signed-off-by: Ilia Sotnikov <hostcc@gmail.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
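The bucket-level aggregation described above might look roughly like the following user-space sketch. It is not the kernel's route cache: the structure, the hash function, and all names are hypothetical, chosen only to show that leaving TOS out of the hash puts entries that differ only by TOS in the same chain, so an ICMP-driven update can reach all of them in one walk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUCKETS 16u /* power of two, so masking selects a bucket */

struct route_entry {
    uint32_t daddr, saddr;
    uint8_t tos;
    uint32_t pmtu;
    struct route_entry *next;
};

static struct route_entry *table[BUCKETS];

/* TOS is deliberately not part of the hash key. */
static unsigned int bucket_of(uint32_t daddr, uint32_t saddr)
{
    uint32_t h = daddr ^ (saddr * 2654435761u);

    h ^= h >> 16;
    return h & (BUCKETS - 1);
}

static void route_insert(struct route_entry *e)
{
    unsigned int b = bucket_of(e->daddr, e->saddr);

    e->next = table[b];
    table[b] = e;
}

/* Normal lookup: TOS is still compared while walking the chain. */
static struct route_entry *route_lookup(uint32_t daddr, uint32_t saddr,
                                        uint8_t tos)
{
    struct route_entry *e = table[bucket_of(daddr, saddr)];

    for (; e != NULL; e = e->next)
        if (e->daddr == daddr && e->saddr == saddr && e->tos == tos)
            return e;
    return NULL;
}

/* ICMP-style update: ignore TOS and update every matching entry. */
static int route_update_pmtu(uint32_t daddr, uint32_t saddr, uint32_t pmtu)
{
    struct route_entry *e = table[bucket_of(daddr, saddr)];
    int updated = 0;

    for (; e != NULL; e = e->next) {
        if (e->daddr == daddr && e->saddr == saddr) {
            e->pmtu = pmtu;
            updated++;
        }
    }
    return updated;
}
```

The twin-hashing idea mentioned as future work would put TOS back into the hash for `route_lookup()` and keep a second index for the TOS-agnostic walk; the sketch covers only the single-hash variant.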
  13. 24 Feb, 2006 1 commit
  14. 18 Jan, 2006 1 commit
  15. 17 Jan, 2006 1 commit
    • [IPV4]: rt_cache_stat can be statically defined · 2f970d83
      Eric Dumazet committed
      Using __get_cpu_var(obj) is slightly faster than per_cpu_ptr(obj, 
      raw_smp_processor_id()).
      
      1) Smaller code and memory use
      For small static objects, DEFINE_PER_CPU(type, object) is preferred over
      alloc_percpu(): the code to access them is better and smaller, and no extra
      memory is needed (for storing the pointer and the percpu array of pointers).
      
      x86_64 code before patch
      
      mov    1237577(%rip),%rax        # ffffffff803e5990 <rt_cache_stat>
      not    %rax  # part of per_cpu machinery
      mov    %gs:0x3c,%edx # get cpu number
      movslq %edx,%rdx # extend 32 bits cpu number to 64 bits
      mov    (%rax,%rdx,8),%rax # get the pointer for this cpu
      incl   0x38(%rax)
      
      x86_64 code after patch
      
      mov    $per_cpu__rt_cache_stat,%rdx
      mov    %gs:0x48,%rax # get percpu data offset
      incl   0x38(%rax,%rdx,1)
      
      2) False sharing avoidance for SMP:
      For a small NR_CPUS, the array of per-CPU pointers allocated in alloc_percpu()
      can be <= 32 bytes. The slab code then hands out only part of a cache line. If
      the other part of this 64-byte (or 128-byte) cache line is used by a frequently
      written object, we can get false sharing and expensive per_cpu_ptr() operations.
      
      The size of rt_cache_stat is 64 bytes, so this patch does not risk too large an
      increase of bss (in UP mode) or of static per_cpu data for SMP
      (PERCPU_ENOUGH_ROOM is currently 32768 bytes).
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
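As a loose user-space analogy only (this is not the kernel per-CPU API, and every name below is hypothetical), the difference between the two layouts can be sketched in C. The dynamic variant needs a pointer array plus one allocation per CPU and an extra load on every access; the static variant is addressed directly.

```c
#include <assert.h>
#include <stdlib.h>

#define NCPUS 4 /* hypothetical CPU count for this sketch */

struct rt_stat {
    unsigned int in_hit;
};

/* Dynamic layout: one pointer per CPU, each to a separate allocation. */
static struct rt_stat **dyn_stat;

static void dyn_stat_free(void)
{
    if (dyn_stat == NULL)
        return;
    for (int cpu = 0; cpu < NCPUS; cpu++)
        free(dyn_stat[cpu]);
    free(dyn_stat);
    dyn_stat = NULL;
}

static int dyn_stat_init(void)
{
    dyn_stat = calloc(NCPUS, sizeof(*dyn_stat));
    if (dyn_stat == NULL)
        return -1;
    for (int cpu = 0; cpu < NCPUS; cpu++) {
        dyn_stat[cpu] = calloc(1, sizeof(**dyn_stat));
        if (dyn_stat[cpu] == NULL) {
            dyn_stat_free();
            return -1;
        }
    }
    return 0;
}

/* Two dependent loads: the per-CPU pointer, then the counter. */
static void dyn_hit(int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    dyn_stat[cpu]->in_hit++;
}

/* Static layout: storage reserved at link time, no pointer array. */
static struct rt_stat static_stat[NCPUS];

/* One address computation, then the counter update. */
static void static_hit(int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    static_stat[cpu].in_hit++;
}
```

In the kernel the static case is addressed through a per-CPU offset rather than an array index, as the "after" assembly in the message shows; the array here only stands in for that mechanism.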
  16. 30 Nov, 2005 2 commits
    • [NET]: Add const markers to various variables. · 9b5b5cff
      Arjan van de Ven committed
      The patch below marks various variables const in net/; the goal is to
      move them to the .rodata section so that they can't false-share
      cachelines with things that get written to, as well as potentially
      helping gcc a bit with optimisations.  (These were found using a gcc
      patch to warn about such variables.)
      Signed-off-by: Arjan van de Ven <arjan@infradead.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [IPV4] tcp/route: Another look at hash table sizes · 18955cfc
      Mike Stroyan committed
        The tcp_ehash hash table gets too big on systems with really big memory.
      It is worse on systems with pages larger than 4KB.  It wastes memory that
      could be better used.  It also makes the netstat command slow because reading
      /proc/net/tcp and /proc/net/tcp6 needs to go through the full hash table.
      
        The default value should not be larger for larger page sizes.  It seems
      that the effect of page size is an unintended error dating back a long
      time.  I also wonder if the default value really should be a larger
      fraction of memory for systems with more memory.  While systems with
      really big ram can afford more space for hash tables, it is not clear to
      me that they benefit from increasing the allocation ratio for this table.
      
        The amount of memory allocated is determined by net/ipv4/tcp.c:tcp_init and
      mm/page_alloc.c:alloc_large_system_hash.
      
      tcp_init calls alloc_large_system_hash passing parameters-
          bucketsize=sizeof(struct tcp_ehash_bucket)
          numentries=thash_entries
          scale=(num_physpages >= 128 * 1024) ? (25-PAGE_SHIFT) : (27-PAGE_SHIFT)
          limit=0
      
      On i386, PAGE_SHIFT is 12 for a page size of 4K
      On ia64, PAGE_SHIFT defaults to 14 for a page size of 16K
      
      The num_physpages test above makes the allocation take a larger fraction
      of the total memory on systems with larger memory.  The threshold size
      for an i386 system is 512MB.  For an ia64 system with 16KB pages the
      threshold is 2GB.
      
      For smaller memory systems-
      On i386, scale = (27 - 12) = 15
      On ia64, scale = (27 - 14) = 13
      For larger memory systems-
      On i386, scale = (25 - 12) = 13
      On ia64, scale = (25 - 14) = 11
      
        For the rest of this discussion, I'll just track the larger memory case.
      
        The default behavior has numentries=thash_entries=0, so the allocated
      size is determined by either scale or by the default limit of 1/16 of
      total memory.
      
      In alloc_large_system_hash-
      |	numentries = (flags & HASH_HIGHMEM) ? nr_all_pages : nr_kernel_pages;
      |	numentries += (1UL << (20 - PAGE_SHIFT)) - 1;
      |	numentries >>= 20 - PAGE_SHIFT;
      |	numentries <<= 20 - PAGE_SHIFT;
      
        At this point, numentries is pages for all of memory, rounded up to the
      nearest megabyte boundary.
      
      |	/* limit to 1 bucket per 2^scale bytes of low memory */
      |	if (scale > PAGE_SHIFT)
      |		numentries >>= (scale - PAGE_SHIFT);
      |	else
      |		numentries <<= (PAGE_SHIFT - scale);
      
      On i386, numentries >>= (13 - 12), so numentries is 1/8192 of
      the bytes of total memory.
      On ia64, numentries <<= (14 - 11), so numentries is 1/2048 of
      the bytes of total memory.
      
      |        log2qty = long_log2(numentries);
      |
      |        do {
      |                size = bucketsize << log2qty;
      
      bucketsize is 16, so size is 16 times numentries, rounded
      down to a power of two.
      
      On i386, size is 1/512 of bytes of total memory.
      On ia64, size is 1/128 of bytes of total memory.
      
      For smaller systems the results are
      On i386, size is 1/2048 of bytes of total memory.
      On ia64, size is 1/512 of bytes of total memory.
      
        The large page effect can be removed by just replacing
      the use of PAGE_SHIFT with a constant of 12 in the calls to
      alloc_large_system_hash.  That makes them more like the other uses of
      that function from fs/inode.c and fs/dcache.c
      Signed-off-by: David S. Miller <davem@davemloft.net>
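The sizing walk-through above can be reproduced with a small C sketch. It follows the quoted steps from alloc_large_system_hash but is a simplification: it takes total memory in bytes instead of reading nr_all_pages or nr_kernel_pages, and it ignores both a non-zero thash_entries and the default limit of 1/16 of memory. The function name and its parameters are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Floor of log2(n) for n >= 1; returns 0 for n == 0. */
static unsigned int floor_log2(uint64_t n)
{
    unsigned int log = 0;

    while (n > 1) {
        n >>= 1;
        log++;
    }
    return log;
}

/*
 * Size in bytes of the hash table chosen for a machine with `mem_bytes`
 * of memory (hypothetical helper mirroring the steps quoted above).
 */
static uint64_t hash_table_bytes(uint64_t mem_bytes, unsigned int page_shift,
                                 unsigned int scale, uint64_t bucketsize)
{
    uint64_t numentries;

    assert(page_shift <= 20);

    /* Number of pages, rounded up to a megabyte boundary. */
    numentries = mem_bytes >> page_shift;
    numentries += (1ULL << (20 - page_shift)) - 1;
    numentries >>= 20 - page_shift;
    numentries <<= 20 - page_shift;

    /* Limit to 1 bucket per 2^scale bytes of memory. */
    if (scale > page_shift)
        numentries >>= (scale - page_shift);
    else
        numentries <<= (page_shift - scale);

    /* Round the bucket count down to a power of two. */
    return bucketsize << floor_log2(numentries);
}
```

For 4 GiB of memory and a 16-byte bucket, `hash_table_bytes(1ULL << 32, 12, 25 - 12, 16)` yields 1/512 of memory (the i386 case) and `hash_table_bytes(1ULL << 32, 14, 25 - 14, 16)` yields 1/128 (the ia64 case). Passing `25 - 12` as the scale with a page shift of 14, which is the proposed fix, brings ia64 back to 1/512.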
  17. 04 Oct, 2005 1 commit
    • [IPV4]: Replace __in_dev_get with __in_dev_get_rcu/rtnl · e5ed6399
      Herbert Xu committed
      The following patch renames __in_dev_get() to __in_dev_get_rtnl() and
      introduces __in_dev_get_rcu() to cover the second case.
      
      1) RCU with refcnt should use in_dev_get().
      2) RCU without refcnt should use __in_dev_get_rcu().
      3) All others must hold RTNL and use __in_dev_get_rtnl().
      
      There is one exception in net/ipv4/route.c which is in fact a pre-existing
      race condition.  I've marked it as such so that we remember to fix it.
      
      This patch is based on suggestions and prior work by Suzanne Wood and
      Paul McKenney.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 09 Sep, 2005 1 commit
    • [IPV4]: Fix refcount damaging in net/ipv4/route.c · ce723d8e
      Julian Anastasov committed
      One place that can damage the dst refcnts is route.c with
      CONFIG_IP_ROUTE_MULTIPATH_CACHED enabled; I don't see the user's
      .config. In this new code I see that rt_intern_hash is called before
      dst->refcnt is set to 1; dst is the 2nd arg to rt_intern_hash.
      
      Arg 2 of rt_intern_hash must come with refcnt 1, as it is added to the
      table or dropped depending on error/add/update. One such example is
      ip_mkroute_input, where __mkroute_input returns rth with refcnt 0, which
      is then passed to rt_intern_hash. ip_mkroute_output looks like a 2nd such
      place. Appending an untested patch for comments and review.  The idea is
      to put the previous reference as we are going to return the next result/error.
      Signed-off-by: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 30 Aug, 2005 2 commits
  20. 12 Jul, 2005 1 commit
    • [IPV4]: Prevent oops when printing martian source · 0b7f22aa
      Olaf Kirch committed
      In some cases, we may be generating packets with a source address that
      qualifies as martian. This can happen when we're in the middle of setting
      up the network, and netfilter decides to reject a packet with an RST.
      The IPv4 routing code would try to print a warning and oops, because
      locally generated packets do not have a valid skb->mac.raw pointer
      at this point.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 06 Jul, 2005 3 commits
  22. 29 Jun, 2005 1 commit