1. 24 Jun 2005, 2 commits
    • [PATCH] timers fixes/improvements · 55c888d6
      Authored by Oleg Nesterov
      This patch tries to solve the following problems:
      
      1. del_timer_sync() is racy. The timer can be fired again after
         del_timer_sync() has checked all CPUs and before it rechecks
         timer_pending().
      
      2. It has scalability problems: every CPU is scanned to determine
         whether the timer is running on it.
      
         With this patch del_timer_sync is O(1) and no slower than plain
         del_timer(pending_timer), unless it has to actually wait for
         completion of the currently running timer.
      
         The only restriction is that the recurring timer should not use
         add_timer_on().
      
      3. Timers are not serialized with respect to themselves.
      
         If CPU_0 does mod_timer(jiffies+1) while the timer is currently
         running on CPU_1, it is quite possible that the local timer
         interrupt on CPU_0 will start that timer before it has finished
         on CPU_1.
      
      4. Timer locking is suboptimal. __mod_timer() takes 3 locks at
         once and still requires wmb() in del_timer/run_timers.
      
         The new implementation takes 2 locks sequentially and does not
         need memory barriers.
      
      Currently ->base != NULL means that the timer is pending. In that case
      ->base.lock is used to lock the timer. __mod_timer also takes timer->lock
      because ->base can be == NULL.
      
      This patch uses timer->entry.next != NULL as indication that the timer is
      pending. So it does __list_del(), entry->next = NULL instead of list_del()
      when the timer is deleted.
      
      The ->base field is used for hashed locking only, it is initialized
      in init_timer() which sets ->base = per_cpu(tvec_bases). When the
      tvec_bases.lock is locked, it means that all timers which are tied
      to this base via timer->base are locked, and the base itself is locked
      too.
      
      So __run_timers/migrate_timers can safely modify all timers which could
      be found on ->tvX lists (pending timers).
      
      When the timer's base is locked and the timer is removed from the ->entry
      list (which means that __run_timers()/migrate_timers() can't see this
      timer), it is possible to set timer->base = NULL and drop the lock: the
      timer remains locked.
      
      This patch adds a lock_timer_base() helper, which waits for ->base !=
      NULL, locks the ->base, and re-checks that it is still the same.
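
      A minimal sketch of that helper, following the description above (the
      exact in-tree code may differ in details):

              /* Wait for ->base to become non-NULL, lock it, then re-check
               * that the timer was not migrated while we took the lock. */
              static struct timer_base_s *lock_timer_base(struct timer_list *timer,
                                                          unsigned long *flags)
              {
                      struct timer_base_s *base;

                      for (;;) {
                              base = timer->base;
                              if (base != NULL) {
                                      spin_lock_irqsave(&base->lock, *flags);
                                      if (base == timer->base)
                                              return base;  /* timer is now locked */
                                      spin_unlock_irqrestore(&base->lock, *flags);
                              }
                              cpu_relax();
                      }
              }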
      
      __mod_timer() schedules the timer on the local CPU and changes its base.
      However, it does not lock both old and new bases at once. It locks the
      timer via lock_timer_base(), deletes the timer, sets ->base = NULL, and
      unlocks old base. Then __mod_timer() locks new_base, sets ->base = new_base,
      and adds this timer. This simplifies the code, because AB-BA deadlock is not
      possible. __mod_timer() also ensures that the timer's base is not changed
      while the timer's handler is running on the old base.
      
      __run_timers(), del_timer() do not change ->base anymore, they only clear
      pending flag.
      
      So del_timer_sync() can test timer->base->running_timer == timer to detect
      whether it is running or not.
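
      Built on those pieces, an O(1) del_timer_sync() looks roughly like this
      (a sketch, not the literal patch):

              int del_timer_sync(struct timer_list *timer)
              {
                      struct timer_base_s *base;
                      unsigned long flags;
                      int ret;

                      for (;;) {
                              ret = -1;
                              base = lock_timer_base(timer, &flags);
                              /* Safe only if the handler is not running here. */
                              if (base->running_timer != timer) {
                                      ret = 0;
                                      if (timer->entry.next != NULL) {  /* pending? */
                                              __list_del(timer->entry.prev,
                                                         timer->entry.next);
                                              timer->entry.next = NULL;
                                              ret = 1;
                                      }
                              }
                              spin_unlock_irqrestore(&base->lock, flags);
                              if (ret >= 0)
                                      return ret;
                              cpu_relax();  /* handler still running: retry */
                      }
              }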
      
      We don't need timer_list->lock anymore, this patch kills it.
      
      We also don't need barriers. del_timer() and __run_timers() used smp_wmb()
      before clearing timer's pending flag. It was needed because __mod_timer()
      did not lock old_base if the timer is not pending, so __mod_timer()->list_add()
      could race with del_timer()->list_del(). With this patch these functions are
      serialized through base->lock.
      
      One problem remains: TIMER_INITIALIZER can't use per_cpu(tvec_bases),
      so this patch adds a global
      
              struct timer_base_s {
                      spinlock_t lock;
                      struct timer_list *running_timer;
              } __init_timer_base;
      
      which is used by TIMER_INITIALIZER. The corresponding fields in tvec_t_base_s
      struct are replaced by struct timer_base_s t_base.
      
      It is indeed ugly. But this can't have scalability problems. The global
      __init_timer_base.lock is used only when __mod_timer() is called for the first
      time AND the timer was compile-time initialized. After that the timer migrates
      to the local CPU.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Renaud Lienhart <renaud.lienhart@free.fr>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] i386: Selectable Frequency of the Timer Interrupt · 59121003
      Authored by Christoph Lameter
      Make the timer frequency selectable. The timer interrupt may cause bus
      and memory contention in large NUMA systems since the interrupt occurs
      on each processor HZ times per second.
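
      Schematically, HZ stops being a hard-coded constant and follows the
      configuration choice (CONFIG_HZ as the assumed config symbol; values
      illustrative):

              /* include/asm-i386/param.h, roughly: */
              #ifdef __KERNEL__
              # define HZ       CONFIG_HZ  /* 100, 250 or 1000, chosen at build time */
              # define USER_HZ  100        /* value reported to user space stays fixed */
              #endif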
      Signed-off-by: Christoph Lameter <christoph@lameter.com>
      Signed-off-by: Shai Fultheim <shai@scalex86.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  2. 22 Jun 2005, 6 commits
    • [PATCH] uml: make hw_controller_type->release exist only for archs needing it · b77d6adc
      Authored by Paolo 'Blaisorblade' Giarrusso
      With Chris Wedgwood <cw@f00f.org>
      
      As suggested by Chris, we can make the just-added ->release method
      exist only on architectures requesting it (i.e., only UML currently),
      so that other archs don't get this unneeded crud; if UML stops needing
      it, we can kill it.
      Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      CC: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] uml: add and use generic hw_controller_type->release · dbce706e
      Authored by Paolo 'Blaisorblade' Giarrusso
      With Chris Wedgwood <cw@f00f.org>
      
      Currently UML must explicitly call the UML-specific
      free_irq_by_irq_and_dev() for each free_irq() call it makes.
      
      This is needed because ->shutdown and/or ->disable are only called when the
      last "action" for that irq is removed.
      
      Instead, for UML shared IRQs (UML IRQs are very often, if not always,
      shared), some setup is done for each dev_id, and it must be cleared on
      the release of that fd.  For instance, for each open console a new
      instance (i.e. a new dev_id) of the same IRQ is requested.
      
      Specifically, an fd is stored in an array (pollfds), which is later read
      by a host thread and passed to poll().  Each event registered by poll()
      triggers an interrupt.  So, for each free_irq() we must remove the
      corresponding host fd from the table, which we do via this ->release()
      method.
      
      In this patch we add an appropriate hook for this and point it at the
      procedure described above, replacing the explicit calls to it.
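
      Schematically, the hook and its call site could look like this (a
      sketch per the description; the config-symbol name is an assumption,
      and UML would point ->release at free_irq_by_irq_and_dev()):

              struct hw_interrupt_type {
                      /* ... startup, shutdown, enable, disable, ack, end ... */
              #ifdef CONFIG_IRQ_RELEASE_METHOD
                      /* called once per dev_id from free_irq() */
                      void (*release)(unsigned int irq, void *dev_id);
              #endif
              };

              /* In free_irq(), while unlinking the action for dev_id: */
              if (desc->handler->release)
                      desc->handler->release(irq, dev_id);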
      
      Also some cosmetic improvements are included.
      
      This is heavily based on some work by Chris Wedgwood, which however didn't
      get the patch merged for something I'd call a "misunderstanding" (the need
      for this patch wasn't cleanly explained, thus adding the generic hook was
      felt as undesirable).
      Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      CC: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] dup_mmap: update comment on new vma · 45918e1a
      Authored by Hugh Dickins
      Remove part of comment on linking new vma in dup_mmap: since anon_vma rmap
      came in, try_to_unmap_one knows the vma without needing find_vma.  But add
      a comment to note that here vma is inserted without mmap_sem.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Avoiding mmap fragmentation · 1363c3cd
      Authored by Wolfgang Wander
      Ingo recently introduced a great speedup for allocating new mmaps using the
      free_area_cache pointer which boosts the specweb SSL benchmark by 4-5% and
      causes huge performance increases in thread creation.
      
      The downside of this patch is that it does lead to fragmentation in the
      mmap-ed areas (visible via /proc/self/maps), such that some applications
      that work fine under 2.4 kernels quickly run out of memory on any 2.6
      kernel.
      
      The problem is twofold:
      
        1) the free_area_cache is used to continue a search for memory where
           the last search ended.  Before the change new areas were always
           searched from the base address on.
      
           So now new small areas are cluttering holes of all sizes
           throughout the whole mmap-able region whereas before small holes
           tended to close holes near the base leaving holes far from the base
           large and available for larger requests.
      
        2) the free_area_cache also is set to the location of the last
           munmap-ed area so in scenarios where we allocate e.g.  five regions of
           1K each, then free regions 4 2 3 in this order the next request for 1K
           will be placed in the position of the old region 3, whereas before we
           appended it to the still active region 1, placing it at the location
           of the old region 2.  Before we had 1 free region of 2K, now we only
           get two free regions of 1K -> fragmentation.
      
      The patch addresses these issues by introducing yet another cache
      descriptor, cached_hole_size, which contains the largest known hole size
      below the current free_area_cache.  If a new request comes in, its size
      is compared against cached_hole_size; if the request can be filled with
      a hole below free_area_cache, the search starts from the base instead.
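
      In outline, the search in arch_get_unmapped_area() then behaves like
      this (a simplified sketch of the heuristic, not the exact patch):

              unsigned long get_unmapped_area_sketch(struct mm_struct *mm,
                                                     unsigned long len)
              {
                      struct vm_area_struct *vma;
                      unsigned long addr;

                      if (len <= mm->cached_hole_size) {
                              /* A big-enough hole is known to exist below
                               * free_area_cache: restart from the base so
                               * small holes near the base get reused first. */
                              mm->cached_hole_size = 0;
                              mm->free_area_cache = TASK_UNMAPPED_BASE;
                      }
                      addr = mm->free_area_cache;

                      for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
                              if (!vma || addr + len <= vma->vm_start) {
                                      mm->free_area_cache = addr + len;
                                      return addr;  /* fitting hole found */
                              }
                              /* Remember the largest hole seen so far that
                               * was still too small for this request. */
                              if (addr + mm->cached_hole_size < vma->vm_start)
                                      mm->cached_hole_size = vma->vm_start - addr;
                              addr = vma->vm_end;
                      }
              }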
      
      The results look promising: Whereas 2.6.12-rc4 fragments quickly and my
      (earlier posted) leakme.c test program terminates after 50000+ iterations
      with 96 distinct and fragmented maps in /proc/self/maps it performs nicely
      (as expected) with thread creation, Ingo's test_str02 with 20000 threads
      requires 0.7s system time.
      
      Taking out Ingo's patch (un-patch available per request) by basically
      deleting all mentions of free_area_cache from the kernel and always
      starting the search for new memory at the respective bases, we observe:
      leakme terminates successfully with 11 distinct, hardly fragmented areas
      in /proc/self/maps, but thread creation is grindingly slow: 30+s(!)
      system time for Ingo's test_str02 with 20000 threads.
      
      Now - drumroll ;-) the appended patch works fine with leakme: it ends with
      only 7 distinct areas in /proc/self/maps and also thread creation seems
      sufficiently fast with 0.71s for 20000 threads.
      Signed-off-by: Wolfgang Wander <wwc@rentec.com>
      Credit-to: "Richard Purdie" <rpurdie@rpsys.net>
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu> (partly)
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] VM: early zone reclaim · 753ee728
      Authored by Martin Hicks
      This is the core of the (much simplified) early reclaim.  The goal of this
      patch is to reclaim some easily-freed pages from a zone before falling back
      onto another zone.
      
      One of the major uses of this is NUMA machines.  With the default allocator
      behavior the allocator would look for memory in another zone, which might be
      off-node, before trying to reclaim from the current zone.
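
      Conceptually, the allocator's zone fallback loop changes along these
      lines (a sketch; zone_has_free_pages() and buffered_rmqueue() stand in
      for the real watermark check and allocation path):

              for (i = 0; zones[i] != NULL; i++) {
                      struct zone *z = zones[i];

                      if (!zone_has_free_pages(z, order)) {
                              /* Early reclaim: if enabled for this zone, try
                               * to free easy pages here before falling back. */
                              if (!z->reclaim_pages ||
                                  !zone_reclaim(z, gfp_mask, order) ||
                                  !zone_has_free_pages(z, order))
                                      continue;  /* next (maybe off-node) zone */
                      }
                      return buffered_rmqueue(z, order, gfp_mask);
              }
              return NULL;  /* all zones exhausted */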
      
      This adds a zone tuneable to enable early zone reclaim.  It is selected on a
      per-zone basis and is turned on/off via syscall.
      
      Adding some extra throttling on the reclaim was also required (patch
      4/4); without it, the machine would grind to a crawl when doing a
      "make -j" kernel build.  Even with this patch the system time is higher
      on average, but it seems tolerable.  Here are some numbers for kernbench
      runs on a 2-node, 4-CPU, 8GB-RAM Altix in the "make -j" run:
      
                           wall  user  sys  %cpu  ctx sw.  sleeps
                           ----  ----  ---  ----  -------  ------
      No patch             1009  1384  847   258   298170  504402
      w/patch, no reclaim   880  1376  667   288   254064  396745
      w/patch & reclaim    1079  1385  926   252   291625  548873
      
      These numbers are the average of 2 runs of 3 "make -j" runs done right
      after system boot.  Run-to-run variability for "make -j" is huge, so
      these numbers aren't terribly useful except to see that with reclaim
      the benchmark still finishes in a reasonable amount of time.
      
      I also looked at the NUMA hit/miss stats for the "make -j" runs and the
      reclaim doesn't make any difference when the machine is thrashing away.
      
      Doing a "make -j8" on a single node that is filled with page cache pages
      takes 700 seconds with reclaim turned on and 735 seconds without reclaim
      (due to remote memory accesses).
      
      The simple zone_reclaim syscall program is at
      http://www.bork.org/~mort/sgi/zone_reclaim.c
      Signed-off-by: Martin Hicks <mort@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] smp_processor_id() cleanup · 39c715b7
      Authored by Ingo Molnar
      This patch implements a number of smp_processor_id() cleanup ideas that
      Arjan van de Ven and I came up with.
      
      The previous __smp_processor_id/_smp_processor_id/smp_processor_id API
      spaghetti was hard to follow, both in its implementation and at its
      usage sites.
      
      Some of the complexity arose from picking wrong names, some of the
      complexity comes from the fact that not all architectures defined
      __smp_processor_id.
      
      In the new code, there are two externally visible symbols:
      
       - smp_processor_id(): debug variant.
      
       - raw_smp_processor_id(): nondebug variant. Replaces all existing
         uses of _smp_processor_id() and __smp_processor_id(). Defined
         by every SMP architecture in include/asm-*/smp.h.
      
      There is one new internal symbol, dependent on DEBUG_PREEMPT:
      
       - debug_smp_processor_id(): internal debug variant, mapped to
                                   smp_processor_id().
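
      Put together, the mapping between the three symbols can be sketched as:

              /* include/linux/smp.h, in outline: */
              #ifdef CONFIG_DEBUG_PREEMPT
                extern unsigned int debug_smp_processor_id(void);
              # define smp_processor_id() debug_smp_processor_id()
              #else
              # define smp_processor_id() raw_smp_processor_id()
              #endif

              /* raw_smp_processor_id() comes from each architecture's
               * include/asm-<arch>/smp.h, e.g. on i386 (illustrative): */
              #define raw_smp_processor_id() (current_thread_info()->cpu)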
      
      Also, I moved debug_smp_processor_id() from lib/kernel_lock.c into a new
      lib/smp_processor_id.c file.  All related comments got updated and/or
      clarified.
      
      I have build/boot tested the following 8 .config combinations on x86:
      
       {SMP,UP} x {PREEMPT,!PREEMPT} x {DEBUG_PREEMPT,!DEBUG_PREEMPT}
      
      I have also build/boot tested x64 on UP/PREEMPT/DEBUG_PREEMPT.  (Other
      architectures are untested, but should work just fine.)
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Arjan van de Ven <arjan@infradead.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  3. 21 Jun 2005, 1 commit
  4. 18 Jun 2005, 1 commit
  5. 14 Jun 2005, 1 commit
  6. 01 Jun 2005, 1 commit
    • [PATCH] flush icache in correct context · ae92ef8a
      Authored by Roman Zippel
      flush_icache_range() is used in two different situations: in
      binfmt_elf.c & co. for user-space mappings, and in module.c for kernel
      modules.  On m68k, flush_icache_range() doesn't know which data to
      flush, as it has separate address spaces and the pointer argument can
      be valid in either address space.
      
      First I considered splitting flush_icache_range(), but this patch is
      simpler.  Setting the correct context gives flush_icache_range() enough
      information to flush the correct data.
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  7. 29 May 2005, 1 commit
    • [PATCH] drop note_interrupt() for per-CPU for proper scaling · b60c1f6f
      Authored by John Hawkes
      The "unhandled interrupts" catcher, note_interrupt(), increments a global
      desc->irq_count and grossly damages scaling of very large systems, e.g.,
      >192p ia64 Altix, because of this highly contented cacheline, especially
      for timer interrupts.  384p is severely crippled, and 512p is unuseable.
      
      All calls to note_interrupt() can be disabled by booting with "noirqdebug",
      but this disables the useful interrupt checking for all interrupts.
      
      I propose eliminating note_interrupt() for all per-CPU interrupts.  This
      was the behavior of linux-2.6.10 and earlier, but in 2.6.11 a code
      restructuring added a call to note_interrupt() for per-CPU interrupts.
      Besides, note_interrupt() is a bit racy for concurrent CPU calls anyway, as
      the desc->irq_count++ increment isn't atomic (which, if done, would make
      scaling even worse).
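
      The shape of the change in the interrupt path is roughly this (a
      sketch; the IRQ_PER_CPU status test is how I'd express the mechanism,
      not the literal diff):

              /* In the per-CPU branch of __do_IRQ(): no note_interrupt(),
               * hence no write to the shared desc->irq_count cacheline. */
              if (desc->status & IRQ_PER_CPU) {
                      desc->handler->ack(irq);
                      action_ret = handle_IRQ_event(irq, regs, desc->action);
                      /* note_interrupt(irq, desc, action_ret) intentionally gone */
                      desc->handler->end(irq);
                      return 1;
              }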
      Signed-off-by: John Hawkes <hawkes@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  8. 27 May 2005, 2 commits
    • [PATCH] cpuset exit NULL dereference fix · 2efe86b8
      Authored by Paul Jackson
      There is a race in the kernel cpuset code, between the code
      to handle notify_on_release, and the code to remove a cpuset.
      The notify_on_release code can end up trying to access a
      cpuset that has been removed.  In the most common case, this
      causes a NULL pointer dereference from the routine cpuset_path.
      However all manner of bad things are possible, in theory at least.
      
      The existing code decrements the cpuset use count, and if the
      count goes to zero, processes the notify_on_release request,
      if appropriate.  However, once the count goes to zero, unless we
      are holding the global cpuset_sem semaphore, there is nothing to
      stop another task from immediately removing the cpuset entirely,
      and recycling its memory.
      
      The obvious fix would be to always hold the cpuset_sem
      semaphore while decrementing the use count and dealing with
      notify_on_release.  However we don't want to force a global
      semaphore into the mainline task exit path, as that might create
      a scaling problem.
      
      The actual fix is almost as easy: since this is only an issue for
      cpusets using notify_on_release, which the top-level big cpusets don't
      normally need, take cpuset_sem only for cpusets using
      notify_on_release.
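
      In outline, the exit-path release becomes (a sketch; helper names
      follow the cpuset code's conventions):

              static void cpuset_put_sketch(struct cpuset *cs)
              {
                      if (notify_on_release(cs)) {
                              /* Rare case: hold cpuset_sem so nobody can
                               * remove and recycle cs under our feet. */
                              down(&cpuset_sem);
                              if (atomic_dec_and_test(&cs->count))
                                      check_for_release(cs);
                              up(&cpuset_sem);
                      } else {
                              /* Common case: no global semaphore needed. */
                              atomic_dec(&cs->count);
                      }
              }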
      
      This code has been run for hours without a hiccup, while running
      a cpuset create/destroy stress test that could crash the existing
      kernel in seconds.  This patch applies to the current -linus
      git kernel.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Acked-by: Simon Derr <simon.derr@bull.net>
      Acked-by: Dinakar Guniguntala <dino@in.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  9. 26 May 2005, 1 commit
    • AUDIT: Defer freeing aux items until audit_free_context() · 7551ced3
      Authored by David Woodhouse
      While they were all just simple blobs it made sense to just free them
      as we walked through and logged them. Now that there are pointers to
      other objects which need refcounting, we might as well revert to
      _only_ logging them in audit_log_exit(), and put the code to free them
      properly in only one place -- in audit_free_aux().
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
  10. 25 May 2005, 1 commit
  11. 24 May 2005, 2 commits
  12. 22 May 2005, 2 commits
    • AUDIT: Assign serial number to non-syscall messages · bfb4496e
      Authored by David Woodhouse
      Move audit_serial() into audit.c and use it to generate serial numbers 
      on messages even when there is no audit context from syscall auditing.  
      This allows us to disambiguate audit records when more than one is 
      generated in the same millisecond.
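
      Such a generator is essentially a lock-protected counter; a sketch:

              static spinlock_t serial_lock = SPIN_LOCK_UNLOCKED;
              static unsigned int serial;

              /* Never returns 0, so a zero serial can mean "unset". */
              unsigned int audit_serial(void)
              {
                      unsigned long flags;
                      unsigned int ret;

                      spin_lock_irqsave(&serial_lock, flags);
                      do {
                              ret = ++serial;
                      } while (!ret);
                      spin_unlock_irqrestore(&serial_lock, flags);
                      return ret;
              }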
      
      Based on a patch by Steve Grubb after he observed the problem.
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
    • [PATCH] spin_unlock_bh() and preempt_check_resched() · 10f02d1c
      Authored by Samuel Thibault
      In _spin_unlock_bh(lock):
      	do { \
      		_raw_spin_unlock(lock); \
      		preempt_enable(); \
      		local_bh_enable(); \
      		__release(lock); \
      	} while (0)
      
      there is no reason to use preempt_enable() instead of a simple
      preempt_enable_no_resched().
      
      Since we know bottom halves are disabled, preempt_schedule() will always
      return at once (preempt_count!=0), and hence preempt_check_resched() is
      useless here...
      
      This fixes it by using "preempt_enable_no_resched()" instead of the
      "preempt_enable()", and thus avoids the useless preempt_check_resched()
      just before re-enabling bottom halves.
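
      After the fix the sequence reads (sketch of the resulting macro body):

              do { \
                      _raw_spin_unlock(lock); \
                      preempt_enable_no_resched(); \
                      local_bh_enable(); \
                      __release(lock); \
              } while (0)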
      Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  13. 21 May 2005, 4 commits
  14. 19 May 2005, 4 commits
    • AUDIT: Honour audit_backlog_limit again. · fb19b4c6
      Authored by David Woodhouse
      The limit on the number of outstanding audit messages was inadvertently
      removed with the switch to queuing skbs directly for sending by a kernel
      thread. Put it back again.
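
      The restored check is simple; in audit_log_start(), roughly (a sketch):

              /* Refuse new records once too many are queued for the
               * kernel thread to push up to auditd. */
              if (audit_backlog_limit &&
                  skb_queue_len(&audit_skb_queue) > audit_backlog_limit)
                      return NULL;  /* caller gets no audit_buffer */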
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
    • AUDIT: Quis Custodiet Ipsos Custodes? · 7ca00264
      Authored by David Woodhouse
      Nobody does. Really, it gets very silly if auditd is recording its
      own actions.
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
    • AUDIT: Send netlink messages from a separate kernel thread · b7d11258
      Authored by David Woodhouse
      netlink_unicast() will attempt to reallocate and will free messages if
      the socket's rcvbuf limit is reached unless we give it an infinite 
      timeout. So do that, from a kernel thread which is dedicated to spewing
      stuff up the netlink socket.
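
      The thread amounts to a drain loop (a sketch; error handling and the
      infinite-sndtimeo setup are omitted):

              static int kauditd_thread(void *dummy)
              {
                      struct sk_buff *skb;

                      for (;;) {
                              skb = skb_dequeue(&audit_skb_queue);
                              if (skb)
                                      /* blocking unicast: never drop a record
                                       * just because auditd's rcvbuf is full */
                                      netlink_unicast(audit_sock, skb,
                                                      audit_pid, 0);
                              else
                                      wait_event(kauditd_wait,
                                                 skb_queue_len(&audit_skb_queue));
                      }
              }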
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
    • AUDIT: Clean up logging of untrusted strings · 168b7173
      Authored by Steve Grubb
      * If vsnprintf returns -1, it will mess up the sk_buff space accounting.
      This is fixed by not calling skb_put() with bogus len values.
      
      * audit_log_hex was a loop that called audit_log_vformat with %02X for
      each byte.  This is very inefficient, since converting an unsigned byte
      to its ASCII hex representation is essentially masking, shifting, and
      byte lookups.  Also, the length of the converted string is well known:
      it's twice the original.  Fixed by rewriting the function.
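
      Rewritten, the conversion is a table lookup per nibble, done in chunks;
      a sketch (the in-tree version writes straight into the skb, but the
      conversion idea is the same):

              void audit_log_hex(struct audit_buffer *ab,
                                 const unsigned char *buf, size_t len)
              {
                      static const char hex[] = "0123456789ABCDEF";
                      char out[65];  /* 32 input bytes per chunk, plus NUL */
                      size_t i = 0, j;

                      while (i < len) {
                              for (j = 0; j + 1 < sizeof(out) && i < len;
                                   i++, j += 2) {
                                      out[j]     = hex[buf[i] >> 4];    /* high */
                                      out[j + 1] = hex[buf[i] & 0x0f];  /* low */
                              }
                              out[j] = '\0';
                              audit_log_format(ab, "%s", out);
                      }
              }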
      
      * audit_log_untrustedstring had no comments. This makes it hard for 
      someone to understand what the string format will be.
      
      * audit_log_d_path was never converted to use untrustedstring.  This
      could mess up user-space parsers.  Fixed by building a temp buffer,
      calling d_path, and logging the temp buffer using untrustedstring.
      
      From: Steve Grubb <sgrubb@redhat.com>
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>
  15. 18 May 2005, 2 commits
  16. 17 May 2005, 4 commits
  17. 14 May 2005, 3 commits
  18. 13 May 2005, 1 commit
  19. 11 May 2005, 1 commit
    • Add audit_log_type · c1b773d8
      Authored by Chris Wright
      Add audit_log_type to allow callers to specify type and pid when
      logging.  Convert audit_log to a wrapper around audit_log_type.  We
      could have converted all audit_log callers directly, but the common
      case is the default of type AUDIT_KERNEL and pid 0.  Update
      audit_log_start to take type and pid values when creating a new
      audit_buffer.  Move sequences that did audit_log_start,
      audit_log_format, audit_set_type, audit_log_end to simply call
      audit_log_type directly.  This obsoletes audit_set_type and
      audit_set_pid, so remove them.
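
      The relationship can be sketched like this (argument order is an
      assumption; audit_log() calls it with AUDIT_KERNEL and pid 0):

              void audit_log_type(struct audit_context *ctx, int type, int pid,
                                  const char *fmt, ...)
              {
                      struct audit_buffer *ab;
                      va_list args;

                      /* type and pid now travel through audit_log_start() */
                      ab = audit_log_start(ctx, type, pid);
                      if (!ab)
                              return;  /* e.g. backlog limit hit */
                      va_start(args, fmt);
                      audit_log_vformat(ab, fmt, args);
                      va_end(args);
                      audit_log_end(ab);
              }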
      Signed-off-by: Chris Wright <chrisw@osdl.org>
      Signed-off-by: David Woodhouse <dwmw2@infradead.org>