1. 22 Jun, 2005 3 commits
    • [PATCH] Avoiding mmap fragmentation · 1363c3cd
      Committed by Wolfgang Wander
      Ingo recently introduced a great speedup for allocating new mmaps using the
      free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and
      causes huge performance increases in thread creation.
      
      The downside of this patch is that it does lead to fragmentation in the
      mmap-ed areas (visible via /proc/self/maps), such that some applications
      that work fine under 2.4 kernels quickly run out of memory on any 2.6
      kernel.
      
      The problem is twofold:
      
        1) the free_area_cache is used to continue a search for memory where
           the last search ended.  Before the change new areas were always
           searched from the base address on.
      
           So now new small areas are cluttering holes of all sizes
           throughout the whole mmap-able region, whereas before small areas
           tended to fill holes near the base, leaving holes far from the base
           large and available for larger requests.
      
        2) the free_area_cache also is set to the location of the last
           munmap-ed area, so in scenarios where we allocate e.g. five regions
           of 1K each, then free regions 4, 2, 3 in this order, the next request
           for 1K will be placed in the position of the old region 3, whereas
           before we appended it to the still active region 1, placing it at
           the location of the old region 2.  Before we had 1 free region of
           2K, now we only get two free regions of 1K -> fragmentation.
      
      The patch addresses these issues by introducing yet another cache
      descriptor, cached_hole_size, that contains the largest known hole size
      below the current free_area_cache.  If a new request comes in, the size is
      compared against cached_hole_size, and if the request can be filled with a
      hole below free_area_cache, the search is started from the base instead.
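
The idea can be sketched in user space as a toy model. This is not the kernel code: the slot array, `struct toy_mm`, `toy_alloc()` and `toy_free()` are hypothetical names for illustration, and the assumption that freeing below the cache forces the next search to restart from the base is a simplification made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSLOTS 16                 /* toy address space: 16 equal slots */

struct toy_mm {
    bool   used[NSLOTS];          /* true if the slot is mapped */
    size_t free_area_cache;       /* where the previous search ended */
    size_t cached_hole_size;      /* largest hole known below the cache */
};

/* First-fit allocation of len slots; returns the first slot or -1. */
static int toy_alloc(struct toy_mm *mm, size_t len)
{
    size_t start;

    assert(len > 0 && len <= NSLOTS);

    if (len > mm->cached_hole_size) {
        /* No known hole below the cache is big enough: continue there. */
        start = mm->free_area_cache;
    } else {
        /* A hole below the cache may fit: search from the base again. */
        start = 0;
        mm->cached_hole_size = 0;
    }

    for (;;) {
        size_t addr = start;

        while (addr + len <= NSLOTS) {
            size_t hole = 0, i;

            /* Measure the free run that begins at addr. */
            while (addr + hole < NSLOTS && !mm->used[addr + hole])
                hole++;
            if (hole >= len) {
                for (i = 0; i < len; i++)
                    mm->used[addr + i] = true;
                mm->free_area_cache = addr + len;
                return (int)addr;
            }
            /* Remember the largest hole we had to skip. */
            if (hole > mm->cached_hole_size)
                mm->cached_hole_size = hole;
            addr += hole;
            /* Skip the mapped run that follows the hole. */
            while (addr < NSLOTS && mm->used[addr])
                addr++;
        }
        if (start == 0)
            return -1;            /* full search failed */
        start = 0;                /* retry once from the base */
        mm->cached_hole_size = 0;
    }
}

/* Unmap len slots at addr (assumption: a hole below the cache
 * invalidates the cached hole size so the next search restarts). */
static void toy_free(struct toy_mm *mm, size_t addr, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        mm->used[addr + i] = false;
    if (addr < mm->free_area_cache) {
        mm->free_area_cache = addr;
        mm->cached_hole_size = SIZE_MAX;
    }
}
```

Replaying scenario 2) in this model (allocate five one-slot regions, free regions 4, 2, 3, then request one slot) places the new region where region 2 used to be and leaves one contiguous two-slot hole.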
      
      The results look promising: Whereas 2.6.12-rc4 fragments quickly, and my
      (earlier posted) leakme.c test program terminates after 50000+ iterations
      with 96 distinct and fragmented maps in /proc/self/maps, it performs nicely
      (as expected) with thread creation: Ingo's test_str02 with 20000 threads
      requires 0.7s system time.
      
      Taking out Ingo's patch (un-patch available on request) by basically
      deleting all mentions of free_area_cache from the kernel and always
      starting the search for new memory at the respective bases, we observe:
      leakme terminates successfully with 11 distinct, hardly fragmented areas in
      /proc/self/maps, but thread creation is grindingly slow: 30+s(!) system
      time for Ingo's test_str02 with 20000 threads.
      
      Now - drumroll ;-) the appended patch works fine with leakme: it ends with
      only 7 distinct areas in /proc/self/maps and also thread creation seems
      sufficiently fast with 0.71s for 20000 threads.
      Signed-off-by: Wolfgang Wander <wwc@rentec.com>
      Credit-to: "Richard Purdie" <rpurdie@rpsys.net>
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu> (partly)
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] VM: early zone reclaim · 753ee728
      Committed by Martin Hicks
      This is the core of the (much simplified) early reclaim.  The goal of this
      patch is to reclaim some easily-freed pages from a zone before falling back
      onto another zone.
      
      One of the major uses of this is NUMA machines.  With the default allocator
      behavior the allocator would look for memory in another zone, which might be
      off-node, before trying to reclaim from the current zone.
      
      This adds a zone tuneable to enable early zone reclaim.  It is selected on a
      per-zone basis and is turned on/off via syscall.
      
      Adding some extra throttling on the reclaim was also required (patch
      4/4).  Without it, the machine would grind to a crawl when doing a
      "make -j" kernel build.  Even with this patch the System Time is higher
      on average, but it seems tolerable.  Here are some numbers for kernbench
      runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:
      
                              wall  user   sys  %cpu  ctx sw.  sleeps
                              ----  ----   ---  ----  -------  ------
      No patch                1009  1384   847   258   298170  504402
      w/patch, no reclaim      880  1376   667   288   254064  396745
      w/patch & reclaim       1079  1385   926   252   291625  548873
      
      These numbers are the average of 2 runs of 3 "make -j" runs done right
      after system boot.  Run-to-run variability for "make -j" is huge, so
      these numbers aren't terribly useful except to see that with reclaim
      the benchmark still finishes in a reasonable amount of time.
      
      I also looked at the NUMA hit/miss stats for the "make -j" runs and the
      reclaim doesn't make any difference when the machine is thrashing away.
      
      Doing a "make -j8" on a single node that is filled with page cache pages
      takes 700 seconds with reclaim turned on and 735 seconds without reclaim
      (due to remote memory accesses).
      
      The simple zone_reclaim syscall program is at
      http://www.bork.org/~mort/sgi/zone_reclaim.c
      Signed-off-by: Martin Hicks <mort@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] smp_processor_id() cleanup · 39c715b7
      Committed by Ingo Molnar
      This patch implements a number of smp_processor_id() cleanup ideas that
      Arjan van de Ven and I came up with.
      
      The previous __smp_processor_id/_smp_processor_id/smp_processor_id API
      spaghetti was hard to follow on both the implementation and the
      usage side.
      
      Some of the complexity arose from picking wrong names, some of it
      from the fact that not all architectures defined
      __smp_processor_id.
      
      In the new code, there are two externally visible symbols:
      
       - smp_processor_id(): debug variant.
      
       - raw_smp_processor_id(): nondebug variant. Replaces all existing
         uses of _smp_processor_id() and __smp_processor_id(). Defined
         by every SMP architecture in include/asm-*/smp.h.
      
      There is one new internal symbol, dependent on DEBUG_PREEMPT:
      
       - debug_smp_processor_id(): internal debug variant, mapped to
                                   smp_processor_id().
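
A minimal user-space sketch of how these three symbols relate is given below. It is not the kernel implementation: the `toy_` variables are hypothetical stand-ins, and the debug check is reduced to "warn when the preempt count is zero", a deliberate simplification of whatever the real check does.

```c
#include <assert.h>

#define DEBUG_PREEMPT 1       /* remove to select the nondebug mapping */

/* Hypothetical stand-ins for per-CPU and per-task kernel state. */
static int toy_current_cpu;
static int toy_preempt_count;
static int toy_debug_warnings;

/* Nondebug variant: just report the CPU number. */
static int raw_smp_processor_id(void)
{
    assert(toy_current_cpu >= 0);
    return toy_current_cpu;
}

#ifdef DEBUG_PREEMPT
/* Internal debug variant: same result, plus a (simplified) check
 * that the caller cannot be migrated to another CPU right now. */
static int debug_smp_processor_id(void)
{
    int cpu = raw_smp_processor_id();

    if (toy_preempt_count == 0)
        toy_debug_warnings++;     /* the kernel would print a warning */
    return cpu;
}
# define smp_processor_id() debug_smp_processor_id()
#else
# define smp_processor_id() raw_smp_processor_id()
#endif
```

Callers only ever spell `smp_processor_id()` or `raw_smp_processor_id()`; which function the former resolves to is decided at build time.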
      
      Also, I moved debug_smp_processor_id() from lib/kernel_lock.c into a new
      lib/smp_processor_id.c file.  All related comments got updated and/or
      clarified.
      
      I have build/boot tested the following 8 .config combinations on x86:
      
       {SMP,UP} x {PREEMPT,!PREEMPT} x {DEBUG_PREEMPT,!DEBUG_PREEMPT}
      
      I have also build/boot tested x64 on UP/PREEMPT/DEBUG_PREEMPT.  (Other
      architectures are untested, but should work just fine.)
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Arjan van de Ven <arjan@infradead.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  2. 21 Jun, 2005 1 commit
  3. 18 Jun, 2005 1 commit
  4. 14 Jun, 2005 1 commit
  5. 01 Jun, 2005 1 commit
    • [PATCH] flush icache in correct context · ae92ef8a
      Committed by Roman Zippel
      flush_icache_range() is used in two different situations: in binfmt_elf.c
      & co for user space mappings, and in module.c for kernel modules.  On m68k
      flush_icache_range() doesn't know which data to flush, as it has separate
      address spaces and the pointer argument can be valid in either address
      space.
      
      First I considered splitting flush_icache_range(), but this patch is
      simpler.  Setting the correct context gives flush_icache_range() enough
      information to flush the correct data.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  6. 29 May, 2005 1 commit
    • [PATCH] drop note_interrupt() for per-CPU for proper scaling · b60c1f6f
      Committed by John Hawkes
      The "unhandled interrupts" catcher, note_interrupt(), increments a global
      desc->irq_count and grossly damages scaling of very large systems, e.g.,
      >192p ia64 Altix, because of this highly contended cacheline, especially
      for timer interrupts.  384p is severely crippled, and 512p is unusable.
      
      All calls to note_interrupt() can be disabled by booting with "noirqdebug",
      but this disables the useful interrupt checking for all interrupts.
      
      I propose eliminating note_interrupt() for all per-CPU interrupts.  This
      was the behavior of linux-2.6.10 and earlier, but in 2.6.11 a code
      restructuring added a call to note_interrupt() for per-CPU interrupts.
      Besides, note_interrupt() is a bit racy for concurrent CPU calls anyway, as
      the desc->irq_count++ increment isn't atomic (which, if done, would make
      scaling even worse).
      Signed-off-by: John Hawkes <hawkes@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  7. 27 May, 2005 2 commits
    • [PATCH] cpuset exit NULL dereference fix · 2efe86b8
      Committed by Paul Jackson
      There is a race in the kernel cpuset code, between the code
      to handle notify_on_release, and the code to remove a cpuset.
      The notify_on_release code can end up trying to access a
      cpuset that has been removed.  In the most common case, this
      causes a NULL pointer dereference from the routine cpuset_path.
      However all manner of bad things are possible, in theory at least.
      
      The existing code decrements the cpuset use count, and if the
      count goes to zero, processes the notify_on_release request,
      if appropriate.  However, once the count goes to zero, unless we
      are holding the global cpuset_sem semaphore, there is nothing to
      stop another task from immediately removing the cpuset entirely,
      and recycling its memory.
      
      The obvious fix would be to always hold the cpuset_sem
      semaphore while decrementing the use count and dealing with
      notify_on_release.  However we don't want to force a global
      semaphore into the mainline task exit path, as that might create
      a scaling problem.
      
      The actual fix is almost as easy - since this is only an issue
      for cpusets using notify_on_release, which the top level big
      cpusets don't normally need to use, only take the cpuset_sem
      for cpusets using notify_on_release.
      
      This code has been run for hours without a hiccup, while running
      a cpuset create/destroy stress test that could crash the existing
      kernel in seconds.  This patch applies to the current -linus
      git kernel.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Acked-by: Simon Derr <simon.derr@bull.net>
      Acked-by: Dinakar Guniguntala <dino@in.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  8. 26 May, 2005 1 commit
    • AUDIT: Defer freeing aux items until audit_free_context() · 7551ced3
      Committed by David Woodhouse
      While they were all just simple blobs it made sense to just free them
      as we walked through and logged them. Now that there are pointers to
      other objects which need refcounting, we might as well revert to
      _only_ logging them in audit_log_exit(), and put the code to free them
      properly in only one place -- in audit_free_aux().
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
  9. 25 May, 2005 1 commit
  10. 24 May, 2005 2 commits
  11. 22 May, 2005 2 commits
    • AUDIT: Assign serial number to non-syscall messages · bfb4496e
      Committed by David Woodhouse
      Move audit_serial() into audit.c and use it to generate serial numbers 
      on messages even when there is no audit context from syscall auditing.  
      This allows us to disambiguate audit records when more than one is 
      generated in the same millisecond.
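
A minimal sketch of the disambiguation idea: each record carries a timestamp plus a serial number drawn from one shared counter, so two records stamped in the same millisecond still differ. The names `struct toy_stamp`, `toy_audit_serial()` and `toy_make_stamp()` are hypothetical, and the use of C11 atomics is an assumption of this sketch, not a description of how audit.c does it.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical record stamp: time to the millisecond plus a serial. */
struct toy_stamp {
    long         sec;
    int          msec;
    unsigned int serial;
};

/* One global counter shared by every message source. */
static atomic_uint toy_serial_counter;

/* Hand out the next serial number; atomic so that concurrent
 * callers never receive the same value. */
static unsigned int toy_audit_serial(void)
{
    return atomic_fetch_add(&toy_serial_counter, 1u) + 1u;
}

/* Build a stamp for a message that has no syscall audit context. */
static struct toy_stamp toy_make_stamp(long sec, int msec)
{
    struct toy_stamp st;

    assert(msec >= 0 && msec < 1000);
    st.sec = sec;
    st.msec = msec;
    st.serial = toy_audit_serial();
    return st;
}
```

Two stamps built with identical `sec` and `msec` values compare unequal because their `serial` fields differ.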
      
      Based on a patch by Steve Grubb after he observed the problem.
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
    • [PATCH] spin_unlock_bh() and preempt_check_resched() · 10f02d1c
      Committed by Samuel Thibault
      In _spin_unlock_bh(lock):
      	do { \
      		_raw_spin_unlock(lock); \
      		preempt_enable(); \
      		local_bh_enable(); \
      		__release(lock); \
      	} while (0)
      
      there is no reason for using preempt_enable() instead of a simple
      preempt_enable_no_resched().
      
      Since we know bottom halves are disabled, preempt_schedule() will always
      return at once (preempt_count!=0), and hence preempt_check_resched() is
      useless here.
      
      This fixes it by using "preempt_enable_no_resched()" instead of the
      "preempt_enable()", and thus avoids the useless preempt_check_resched()
      just before re-enabling bottom halves.
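
The effect of the change can be mimicked with a small user-space mock. All `toy_` names are hypothetical stand-ins for the kernel primitives, and the assumption that the bottom-half enable step performs its own reschedule check is part of this model rather than something stated above.

```c
#include <assert.h>

struct toy_lock { int locked; };

/* Hypothetical stand-ins for kernel state. */
static int toy_preempt_count;
static int toy_bh_disabled;
static int toy_resched_checks;    /* how often we tested for a resched */

static void toy_preempt_check_resched(void)
{
    toy_resched_checks++;
}

/* Drop the preempt count without testing for a pending reschedule. */
static void toy_preempt_enable_no_resched(void)
{
    toy_preempt_count--;
}

/* Model assumption: re-enabling bottom halves ends with its own check. */
static void toy_local_bh_enable(void)
{
    assert(toy_bh_disabled > 0);
    toy_bh_disabled--;
    toy_preempt_check_resched();
}

static void toy_raw_spin_unlock(struct toy_lock *lock)
{
    lock->locked = 0;
}

/* Shape of the unlock path after the change: one check, not two. */
#define toy_spin_unlock_bh(lock)            \
    do {                                    \
        toy_raw_spin_unlock(lock);          \
        toy_preempt_enable_no_resched();    \
        toy_local_bh_enable();              \
    } while (0)
```

With the plain enable in the second step, the mock would count two reschedule checks per unlock, the first of which could never lead to a reschedule while bottom halves are still disabled.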
      Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  12. 21 May, 2005 4 commits
  13. 19 May, 2005 4 commits
    • AUDIT: Honour audit_backlog_limit again. · fb19b4c6
      Committed by David Woodhouse
      The limit on the number of outstanding audit messages was inadvertently
      removed with the switch to queuing skbs directly for sending by a kernel
      thread. Put it back again.
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
    • AUDIT: Quis Custodiet Ipsos Custodes? · 7ca00264
      Committed by David Woodhouse
      Nobody does. Really, it gets very silly if auditd is recording its
      own actions.
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
    • AUDIT: Send netlink messages from a separate kernel thread · b7d11258
      Committed by David Woodhouse
      netlink_unicast() will attempt to reallocate and will free messages if
      the socket's rcvbuf limit is reached unless we give it an infinite 
      timeout. So do that, from a kernel thread which is dedicated to spewing
      stuff up the netlink socket.
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
    • AUDIT: Clean up logging of untrusted strings · 168b7173
      Committed by Steve Grubb
      * If vsnprintf returns -1, it will mess up the sk buffer space accounting. 
      This is fixed by not calling skb_put with bogus len values.
      
      * audit_log_hex was a loop that called audit_log_vformat with %02X for each
      character. This is very inefficient since conversion from unsigned character
      to ASCII representation is essentially masking, shifting, and byte lookups.
      Also, the length of the converted string is well known: it's twice the
      original. Fixed by rewriting the function.
      
      * audit_log_untrustedstring had no comments. This makes it hard for 
      someone to understand what the string format will be.
      
      * audit_log_d_path was never fixed to use untrustedstring. This could mess
      up user space parsers. This was fixed to make a temp buffer, call d_path, 
      and log temp buffer using untrustedstring. 
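
The hex-conversion point can be illustrated with a short stand-alone routine. This is only a sketch of the masking, shifting and table-lookup technique, not the rewritten audit_log_hex; the name `toy_hex_encode()` and its buffer contract are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Convert len bytes from src into upper-case hex digits in dst.
 * The output length is known in advance: exactly 2 * len characters,
 * so dst must have room for 2 * len + 1 bytes (including the NUL).
 */
static void toy_hex_encode(char *dst, const unsigned char *src, size_t len)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t i;

    assert(dst != NULL && src != NULL);

    for (i = 0; i < len; i++) {
        *dst++ = hex[(src[i] & 0xF0) >> 4];   /* high nibble */
        *dst++ = hex[src[i] & 0x0F];          /* low nibble */
    }
    *dst = '\0';
}
```

Each input byte costs two masks, one shift and two table lookups, with no format-string parsing per character.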
      
      From: Steve Grubb <sgrubb@redhat.com>
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
  14. 18 May, 2005 2 commits
  15. 17 May, 2005 4 commits
  16. 14 May, 2005 3 commits
  17. 13 May, 2005 1 commit
  18. 11 May, 2005 6 commits