1. 01 April 2006, 1 commit
    • [PATCH] sched: reduce overhead of calc_load · db1b1fef
      Committed by Jack Steiner
      Currently, count_active_tasks() calls both nr_running() &
      nr_uninterruptible().  Each of these functions does a "for_each_cpu" & reads
      values from the runqueue of each cpu.  Although this is not a lot of
      instructions, each runqueue may be located on a different node.  Depending on
      the architecture, a unique TLB entry may be required to access each
      runqueue.
      
      Since there may be more runqueues than cpu TLB entries, a scan of all
      runqueues can thrash the TLB.  Each memory reference incurs a TLB miss &
      refill.
      
      In addition, the runqueue cacheline that contains nr_running &
      nr_uninterruptible may be evicted from the cache between the two passes.
      This causes unnecessary cache misses.
      
      Combining nr_running() & nr_uninterruptible() into a single function
      substantially reduces the TLB & cache misses on large systems.  This should
      have no measurable effect on smaller systems.
      
      On a 128p IA64 system running a memory stress workload, the new function
      reduced the overhead of calc_load() from 605 usec/call to 324 usec/call.
      Signed-off-by: Jack Steiner <steiner@sgi.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      db1b1fef
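
      A minimal userspace C model of the combined scan described in the entry above;
      NR_CPUS, struct rq_model and nr_active_model() are illustrative stand-ins for
      the kernel's per-CPU runqueues, not the actual patch:

        #include <stdio.h>

        #define NR_CPUS 4

        /* Stand-in for the two per-runqueue counters the kernel reads. */
        struct rq_model {
            unsigned long nr_running;
            unsigned long nr_uninterruptible;
        };

        static struct rq_model rq[NR_CPUS] = {
            { 3, 1 }, { 2, 0 }, { 5, 2 }, { 1, 1 },
        };

        /* One pass over the runqueues reads both counters, so each remote
         * runqueue cacheline/TLB entry is touched once per calc_load() call
         * instead of twice (once per counter in the old scheme). */
        static unsigned long nr_active_model(void)
        {
            unsigned long running = 0, uninterruptible = 0;
            int cpu;

            for (cpu = 0; cpu < NR_CPUS; cpu++) {
                running += rq[cpu].nr_running;
                uninterruptible += rq[cpu].nr_uninterruptible;
            }
            return running + uninterruptible;
        }

        int main(void)
        {
            printf("active tasks: %lu\n", nr_active_model());
            return 0;
        }
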
  2. 29 March 2006, 12 commits
  3. 28 March 2006, 2 commits
  4. 27 March 2006, 1 commit
  5. 24 March 2006, 4 commits
    • [PATCH] timer-irq-driven soft-watchdog, cleanups · 6687a97d
      Committed by Ingo Molnar
      Make the softlockup detector purely timer-interrupt driven, removing
      softirq-context (timer) dependencies.  This means that if the softlockup
      watchdog triggers, it has truly observed a scheduling delay of more than
      10 seconds for a SCHED_FIFO prio 99 task.
      
      (The patch also turns off the softlockup detector during the initial boot
      phase and makes some small style fixes.)
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      6687a97d
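
      A rough userspace C model of the timer-driven check described above, assuming
      a touch timestamp that the prio-99 watchdog thread refreshes whenever it runs;
      only the 10-second threshold comes from the entry, the rest is illustrative:

        #include <stdio.h>
        #include <time.h>

        /* Last time the per-CPU watchdog thread (SCHED_FIFO prio 99) got to run. */
        static time_t touch_timestamp;

        static void touch_softlockup_watchdog_model(void)
        {
            touch_timestamp = time(NULL);
        }

        /* In the real detector this check runs from timer-interrupt context. */
        static void softlockup_tick_model(void)
        {
            long delay = (long)(time(NULL) - touch_timestamp);

            /* If the highest-priority thread has not run for more than 10
             * seconds, something on this CPU is hogging it. */
            if (delay > 10)
                printf("BUG: soft lockup detected (%lds without reschedule)\n", delay);
        }

        int main(void)
        {
            touch_softlockup_watchdog_model();  /* watchdog thread ran */
            softlockup_tick_model();            /* periodic timer-tick check */
            return 0;
        }
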
    • [PATCH] cpuset memory spread slab cache optimizations · c61afb18
      Committed by Paul Jackson
      The hooks in the slab cache allocator code path for support of NUMA
      mempolicies and cpuset memory spreading are in an important code path.  Many
      systems will use neither feature.
      
      This patch optimizes those hooks down to a single check of some bits in the
      current task's task_struct flags.  For non-NUMA systems, this hook and related
      code are already ifdef'd out.
      
      The optimization is done by using another task flag, set if the task is using
      a non-default NUMA mempolicy.  Taking this flag bit along with the
      PF_SPREAD_PAGE and PF_SPREAD_SLAB flag bits added earlier in this 'cpuset
      memory spreading' patch set, one can check for the combination of any of these
      special-case memory placement mechanisms with a single test of the current
      task's task_struct flags.
      
      This patch also tightens up the code, to save a few bytes of kernel text
      space, and moves some of it out of line.  Due to the nested inlines called
      from multiple places, we were ending up with three copies of this code, which,
      once we get off the main code path (for local node allocation), seems a bit
      wasteful of instruction memory.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c61afb18
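
      A sketch, in userspace C, of the single-test fast path described above; the
      flag bit values and helper name are illustrative, not the kernel's actual
      constants:

        #include <stdbool.h>
        #include <stdio.h>

        /* Illustrative bit values; the real PF_* constants live in <linux/sched.h>. */
        #define PF_SPREAD_PAGE 0x01000000
        #define PF_SPREAD_SLAB 0x02000000
        #define PF_MEMPOLICY   0x04000000  /* task uses a non-default mempolicy */

        struct task_model { unsigned long flags; };

        /* One test of the task's flags decides whether any special-case
         * placement mechanism applies; only then is the out-of-line slow
         * path (mempolicy lookup or cpuset spread rotor) taken. */
        static bool needs_special_placement(const struct task_model *current_task)
        {
            return current_task->flags &
                   (PF_SPREAD_PAGE | PF_SPREAD_SLAB | PF_MEMPOLICY);
        }

        int main(void)
        {
            struct task_model t = { .flags = PF_SPREAD_SLAB };

            printf("slow path needed: %d\n", needs_special_placement(&t));
            return 0;
        }
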
    • [PATCH] cpuset memory spread basic implementation · 825a46af
      Committed by Paul Jackson
      This patch provides the implementation and cpuset interface for an alternative
      memory allocation policy that can be applied to certain kinds of memory
      allocations, such as the page cache (file system buffers) and some slab caches
      (such as inode caches).
      
      The policy is called "memory spreading." If enabled, it spreads out these
      kinds of memory allocations over all the nodes allowed to a task, instead of
      preferring to place them on the node where the task is executing.
      
      All other kinds of allocations, including anonymous pages for a task's stack
      and data regions, are not affected by this policy choice, and continue to be
      allocated preferring the node local to execution, as modified by the NUMA
      mempolicy.
      
      There are two boolean flag files per cpuset that control where the kernel
      allocates pages for the file system buffers and related in kernel data
      structures.  They are called 'memory_spread_page' and 'memory_spread_slab'.
      
      If the per-cpuset boolean flag file 'memory_spread_page' is set, then the
      kernel will spread the file system buffers (page cache) evenly over all the
      nodes that the faulting task is allowed to use, instead of preferring to put
      those pages on the node where the task is running.
      
      If the per-cpuset boolean flag file 'memory_spread_slab' is set, then the
      kernel will spread some file system related slab caches, such as those for
      inodes and dentries, evenly over all the nodes that the faulting task is
      allowed to use, instead of preferring to put those pages on the node where
      the task is running.
      
      The implementation is simple.  Setting the cpuset flags 'memory_spread_page'
      or 'memory_spread_slab' turns on the per-process flags PF_SPREAD_PAGE or
      PF_SPREAD_SLAB, respectively, for each task that is in the cpuset or
      subsequently joins that cpuset.  In subsequent patches, the page allocation
      calls for the affected page cache and slab caches are modified to perform an
      inline check for these flags, and if set, a call to a new routine
      cpuset_mem_spread_node() returns the node to prefer for the allocation.
      
      The cpuset_mem_spread_node() routine is also simple.  It uses the value of a
      per-task rotor cpuset_mem_spread_rotor to select the next node in the current
      task's mems_allowed to prefer for the allocation.
      
      This policy can provide substantial improvements for jobs that need to place
      thread-local data on the corresponding node, but that need to access large
      file system data sets that must be spread across the several nodes in the
      job's cpuset in order to fit.  Without this patch, especially for jobs that
      might have one thread reading in the data set, the memory allocation across
      the nodes in the job's cpuset can become very uneven.
      
      A couple of Copyright year ranges are updated as well.  And a couple of email
      addresses that can be found in the MAINTAINERS file are removed.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      825a46af
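
      A small userspace C model of the per-task rotor described above; the node-mask
      representation and names are simplified stand-ins for the kernel's nodemask
      machinery:

        #include <stdio.h>

        #define MAX_NODES 8

        struct task_model {
            unsigned char mems_allowed[MAX_NODES];  /* 1 if the node is allowed */
            int cpuset_mem_spread_rotor;            /* per-task rotor position */
        };

        /* Pick the next allowed node after the rotor, round-robin, and advance
         * the rotor so successive allocations spread over the cpuset's nodes. */
        static int mem_spread_node_model(struct task_model *t)
        {
            int i, node;

            for (i = 1; i <= MAX_NODES; i++) {
                node = (t->cpuset_mem_spread_rotor + i) % MAX_NODES;
                if (t->mems_allowed[node]) {
                    t->cpuset_mem_spread_rotor = node;
                    return node;
                }
            }
            return -1;  /* no node allowed; cannot happen for a valid cpuset */
        }

        int main(void)
        {
            struct task_model t = { .mems_allowed = { 0, 1, 1, 0, 1 } };

            for (int i = 0; i < 5; i++)
                printf("allocation %d -> node %d\n", i, mem_spread_node_model(&t));
            return 0;
        }
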
    • 2056a782
  6. 12 March 2006, 1 commit
  7. 01 March 2006, 1 commit
  8. 15 February 2006, 1 commit
    • [PATCH] sched: revert "filter affine wakeups" · d6077cb8
      Committed by Chen, Kenneth W
      Revert commit d7102e95:
      
          [PATCH] sched: filter affine wakeups
      
      It apparently caused a more than 10% performance regression on the aim7
      benchmark.  The setup in use is a 16-CPU HP rx8620 with 64 GB of memory and
      12 MSA1000s with 144 disks.  Each disk is 72 GB with a single ext3 filesystem
      (courtesy of HP, who supplied the benchmark results).
      
      The problem is that, for aim7, the wake-up pattern is random, but it still
      needs load-balancing action in the wake-up path to achieve the best
      performance.  With the above commit, the lack of load balancing hurts that
      workload.
      
      However, for workloads like database transaction processing, the requirement
      is exactly the opposite.  In the wake-up path, best performance is achieved
      with absolutely zero load balancing: we simply wake up the process on the CPU
      where it previously ran.  Worst performance is obtained when we do load
      balancing at wake-up.
      
      There isn't an easy way to auto-detect the workload characteristics.  Ingo's
      earlier patch, which detects an idle CPU and decides whether to load balance
      or not, doesn't help aim7 either, since all CPUs are busy (it causes an even
      bigger performance regression).
      
      Revert commit d7102e95, which causes more
      than 10% performance regression with aim7.
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d6077cb8
  9. 10 February 2006, 1 commit
  10. 19 January 2006, 1 commit
  11. 15 January 2006, 1 commit
    • [PATCH] sched: add new SCHED_BATCH policy · b0a9499c
      Committed by Ingo Molnar
      Add a new SCHED_BATCH (3) scheduling policy: such tasks are presumed
      CPU-intensive and acquire a constant +5 priority level penalty.  This
      policy is nice for workloads that are non-interactive but do not want to
      give up their nice levels.  The policy is also useful for workloads
      that want a deterministic scheduling policy without interactivity causing
      extra preemptions (between that workload's tasks).
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b0a9499c
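
      A hedged userspace example of opting a process into the new policy via
      sched_setscheduler(2); SCHED_BATCH takes a static priority of 0, and the
      fallback #define is only needed on headers that predate this patch:

        /* Build with: cc -D_GNU_SOURCE -o batch batch.c */
        #define _GNU_SOURCE
        #include <sched.h>
        #include <stdio.h>
        #include <string.h>
        #include <errno.h>

        #ifndef SCHED_BATCH
        #define SCHED_BATCH 3            /* value introduced by this patch */
        #endif

        int main(void)
        {
            /* SCHED_BATCH, like SCHED_OTHER, uses a static priority of 0. */
            struct sched_param param = { .sched_priority = 0 };

            /* Mark the calling process as a CPU-intensive batch task. */
            if (sched_setscheduler(0, SCHED_BATCH, &param) == -1) {
                fprintf(stderr, "sched_setscheduler: %s\n", strerror(errno));
                return 1;
            }
            printf("now running under SCHED_BATCH\n");
            return 0;
        }
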
  12. 13 January 2006, 3 commits
    • [PATCH] missing helper - task_stack_page() · 9fc65876
      Committed by Al Viro
      This patchset annotates arch/* uses of ->thread_info.  Ones that really are
      about access to the thread_info of a given process are simply switched to
      task_thread_info(task); ones that deal with access to objects on the stack
      are switched to a new helper, task_stack_page().  A _lot_ of the latter are
      actually open-coded instances of "find where pt_regs are"; those are
      consolidated into task_pt_regs(task) (many architectures actually have such
      a helper already).
      
      Note that these annotations are not mandatory - any code not converted to
      these helpers still works.  However, they clean up a lot of places and have
      actually caught a number of bugs, so converting out-of-tree ports would be a
      good idea...
      
      As an example of breakage caught by this stuff, see the i386 pt_regs mess - we
      used to have it open-coded in a bunch of places, and when Stas fixed a bug in
      copy_thread() back in April, the rest was left out of sync.  That
      required two follow-up patches (the latest just before 2.6.15) _and_ still
      left the /proc/*/stat eip field broken.  Try ps -eo eip on i386 and watch the
      junk...
      
      This patch:
      
      New helper: task_stack_page(task).  It returns a pointer to the memory object
      containing the task's stack; usually the task's thread_info sits at the
      beginning of that object.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9fc65876
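
      A userspace C sketch of the layout relationship the helpers encode; the
      structure contents and stack size are illustrative, and only the "thread_info
      sits at the start of the stack object" convention comes from the text:

        #include <stdio.h>
        #include <stdlib.h>

        #define STACK_SIZE 8192          /* illustrative; the real size is per-arch */

        struct thread_info_model { int cpu; };
        struct pt_regs_model     { long regs[16]; };

        struct task_model {
            /* ->thread_info points at the start of the stack object. */
            struct thread_info_model *thread_info;
        };

        /* Returns the memory object holding the task's stack; the thread_info
         * usually sits at the very beginning of that object. */
        static void *task_stack_page_model(const struct task_model *t)
        {
            return t->thread_info;
        }

        /* The open-coded "find where pt_regs are" idiom the patchset consolidates:
         * on many architectures pt_regs sit at the top of the stack object. */
        static struct pt_regs_model *task_pt_regs_model(const struct task_model *t)
        {
            return (struct pt_regs_model *)
                ((char *)task_stack_page_model(t) + STACK_SIZE
                 - sizeof(struct pt_regs_model));
        }

        int main(void)
        {
            struct task_model task = { .thread_info = malloc(STACK_SIZE) };

            printf("stack object: %p, pt_regs: %p\n",
                   task_stack_page_model(&task), (void *)task_pt_regs_model(&task));
            free(task.thread_info);
            return 0;
        }
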
    • [PATCH] sched: filter affine wakeups · d7102e95
      Committed by akpm@osdl.org
      
      From: Nick Piggin <nickpiggin@yahoo.com.au>
      
      Track the last waker CPU, and only consider wakeup-balancing if there's a
      match between the current waker CPU and the previous waker CPU.  This ensures
      that there is some correlation between two subsequent wakeup events before
      we move the task.  Should help random-wakeup workloads on large SMP
      systems, by reducing the migration attempts by a factor of nr_cpus.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d7102e95
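
      A tiny userspace C model of the filter described above; the field and function
      names are illustrative, but the logic mirrors the "same waker CPU twice in a
      row" check:

        #include <stdbool.h>
        #include <stdio.h>

        struct task_model {
            int last_waker_cpu;  /* CPU that last woke this task */
        };

        /* Only consider pulling the woken task toward the waker when the same
         * CPU wakes it twice in a row, i.e. the wakeup pattern shows some
         * correlation rather than being random. */
        static bool consider_wake_balance(struct task_model *p, int waking_cpu)
        {
            bool correlated = (p->last_waker_cpu == waking_cpu);

            p->last_waker_cpu = waking_cpu;
            return correlated;
        }

        int main(void)
        {
            struct task_model p = { .last_waker_cpu = -1 };

            printf("%d\n", consider_wake_balance(&p, 2));  /* 0: first wake from CPU 2 */
            printf("%d\n", consider_wake_balance(&p, 2));  /* 1: same waker, correlated */
            printf("%d\n", consider_wake_balance(&p, 5));  /* 0: waker changed */
            return 0;
        }
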
    • [PATCH] scheduler cache-hot-autodetect · 198e2f18
      Committed by akpm@osdl.org
      
      From: Ingo Molnar <mingo@elte.hu>
      
      This is the latest version of the scheduler cache-hot-auto-tune patch.
      
      The first problem was that detection time scaled with O(N^2), which is
      unacceptable on larger SMP and NUMA systems. To solve this:
      
      - I've added a 'domain distance' function, which is used to cache
        measurement results. Each distance is only measured once. This means
        that e.g. on NUMA distances of 0, 1 and 2 might be measured, on HT
        distances 0 and 1, and on SMP distance 0 is measured. The code walks
        the domain tree to determine the distance, so it automatically follows
        whatever hierarchy an architecture sets up. This cuts down on the boot
        time significantly and removes the O(N^2) limit. The only assumption
        is that migration costs can be expressed as a function of domain
        distance - this covers the overwhelming majority of existing systems,
  and is a good guess even for more asymmetric systems.
      
        [ People hacking systems that have asymmetries that break this
          assumption (e.g. different CPU speeds) should experiment a bit with
          the cpu_distance() function. Adding a ->migration_distance factor to
          the domain structure would be one possible solution - but let's first
          see the problem systems, if they exist at all. Let's not overdesign. ]
      
      Another problem was that only a single cache-size was used for measuring
      the cost of migration, and most architectures didn't set that variable
      up. Furthermore, a single cache-size does not fit NUMA hierarchies with
      L3 caches and does not fit HT setups, where different CPUs will often
      have different 'effective cache sizes'. To solve this problem:
      
      - Instead of relying on a single cache-size provided by the platform and
        sticking to it, the code now auto-detects the 'effective migration
        cost' between two measured CPUs, via iterating through a wide range of
        cachesizes. The code searches for the maximum migration cost, which
        occurs when the working set of the test-workload falls just below the
        'effective cache size'. I.e. real-life optimized search is done for
        the maximum migration cost, between two real CPUs.
      
        This, amongst other things, has the positive effect that if e.g. two
        CPUs share an L2/L3 cache, a different (and accurate) migration cost
        will be found than between two CPUs on the same system that don't share
        any caches.
      
      (The reliable measurement of migration costs is tricky - see the source
      for details.)
      
      Furthermore i've added various boot-time options to override/tune
      migration behavior.
      
      Firstly, there's a blanket override for autodetection:
      
      	migration_cost=1000,2000,3000
      
      will override the depth 0/1/2 values with 1msec/2msec/3msec values.
      
      Secondly, there's a global factor that can be used to increase (or
      decrease) the autodetected values:
      
      	migration_factor=120
      
      will increase the autodetected values by 20%. This option is useful to
      tune things in a workload-dependent way - e.g. if a workload is
      cache-insensitive then CPU utilization can be maximized by specifying
      migration_factor=0.
      
      I've tested the autodetection code quite extensively on x86, on 3
      P3/Xeon/2MB, and the autodetected values look pretty good:
      
      Dual Celeron (128K L2 cache):
      
       ---------------------
       migration cost matrix (max_cache_size: 131072, cpu: 467 MHz):
       ---------------------
                 [00]    [01]
       [00]:     -     1.7(1)
       [01]:   1.7(1)    -
       ---------------------
       cacheflush times [2]: 0.0 (0) 1.7 (1784008)
       ---------------------
      
      Here the slow memory subsystem dominates system performance, and even
      though caches are small, the migration cost is 1.7 msecs.
      
      Dual HT P4 (512K L2 cache):
      
       ---------------------
       migration cost matrix (max_cache_size: 524288, cpu: 2379 MHz):
       ---------------------
                 [00]    [01]    [02]    [03]
       [00]:     -     0.4(1)  0.0(0)  0.4(1)
       [01]:   0.4(1)    -     0.4(1)  0.0(0)
       [02]:   0.0(0)  0.4(1)    -     0.4(1)
       [03]:   0.4(1)  0.0(0)  0.4(1)    -
       ---------------------
       cacheflush times [2]: 0.0 (33900) 0.4 (448514)
       ---------------------
      
      Here it can be seen that there is no migration cost between two HT
      siblings (CPU#0/2 and CPU#1/3 are separate physical CPUs). A fast memory
      system makes inter-physical-CPU migration pretty cheap: 0.4 msecs.
      
      8-way P3/Xeon [2MB L2 cache]:
      
       ---------------------
       migration cost matrix (max_cache_size: 2097152, cpu: 700 MHz):
       ---------------------
                 [00]    [01]    [02]    [03]    [04]    [05]    [06]    [07]
       [00]:     -    19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
       [01]:  19.2(1)    -    19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
       [02]:  19.2(1) 19.2(1)    -    19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)
       [03]:  19.2(1) 19.2(1) 19.2(1)    -    19.2(1) 19.2(1) 19.2(1) 19.2(1)
       [04]:  19.2(1) 19.2(1) 19.2(1) 19.2(1)    -    19.2(1) 19.2(1) 19.2(1)
       [05]:  19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)    -    19.2(1) 19.2(1)
       [06]:  19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)    -    19.2(1)
       [07]:  19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1) 19.2(1)    -
       ---------------------
       cacheflush times [2]: 0.0 (0) 19.2 (19281756)
       ---------------------
      
      This one has huge caches and a relatively slow memory subsystem - so the
      migration cost is 19 msecs.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Ashok Raj <ashok.raj@intel.com>
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Cc: <wilder@us.ibm.com>
      Signed-off-by: John Hawkes <hawkes@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      198e2f18
  13. 12 January 2006, 2 commits
  14. 11 January 2006, 2 commits
  15. 10 January 2006, 1 commit
  16. 09 January 2006, 4 commits
    • [PATCH] keys: Permit running process to instantiate keys · b5f545c8
      Committed by David Howells
      Make it possible for a running process (such as gssapid) to instantiate a
      key, as was requested by Trond Myklebust for NFS4.
      
      The patch makes the following changes:
      
       (1) A new, optional key type method has been added. This permits a key type
           to intercept requests at the point /sbin/request-key is about to be
           spawned and do something else with them - passing them over the
           rpc_pipefs files or netlink sockets for instance.
      
           The uninstantiated key, the authorisation key and the intended operation
           name are passed to the method.
      
       (2) The callout_info is no longer passed as an argument to /sbin/request-key
           to prevent unauthorised viewing of this data using ps or by looking in
           /proc/pid/cmdline.
      
           This means that the old /sbin/request-key program will not work with the
           patched kernel as it will expect to see an extra argument that is no
           longer there.
      
           A revised keyutils package will be made available tomorrow.
      
       (3) The callout_info is now attached to the authorisation key. Reading this
           key will retrieve the information.
      
       (4) A new field has been added to the task_struct. This holds the
           authorisation key currently active for a thread. Searches now look here
           for the caller's set of keys rather than looking for an auth key in the
           lowest level of the session keyring.
      
           This permits a thread to be servicing multiple requests at once and to
           switch between them. Note that this is per-thread, not per-process, and
           so is usable in multithreaded programs.
      
           The setting of this field is inherited across fork and exec.
      
       (5) A new keyctl function (KEYCTL_ASSUME_AUTHORITY) has been added that
           permits a thread to assume the authority to deal with an uninstantiated
           key. Assumption is only permitted if the authorisation key associated
           with the uninstantiated key is somewhere in the thread's keyrings.
      
           This function can also clear the assumption.
      
       (6) A new magic key specifier has been added to refer to the currently
           assumed authorisation key (KEY_SPEC_REQKEY_AUTH_KEY).
      
       (7) Instantiation will only proceed if the appropriate authorisation key is
           assumed first. The assumed authorisation key is discarded if
           instantiation is successful.
      
       (8) key_validate() is moved from the file of request_key functions to the
           file of permissions functions.
      
       (9) The documentation is updated.
      
      From: <Valdis.Kletnieks@vt.edu>
      
          Build fix.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Alexander Zangerl <az@bond.edu.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      b5f545c8
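
      A hedged userspace sketch of how an upcall daemon might use the new pieces,
      assuming the libkeyutils wrappers (keyctl_assume_authority(),
      keyctl_read_alloc(), keyctl_instantiate()) are available; the payload and
      destination keyring are purely illustrative:

        /* Build with: cc -o worker worker.c -lkeyutils */
        #include <keyutils.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(int argc, char *argv[])
        {
            if (argc != 2) {
                fprintf(stderr, "usage: %s <uninstantiated-key-id>\n", argv[0]);
                return 1;
            }
            key_serial_t key = (key_serial_t)atol(argv[1]);

            /* (5) Assume the authority carried by the authorisation key that
             *     request_key() attached to this uninstantiated key. */
            if (keyctl_assume_authority(key) < 0) {
                perror("keyctl_assume_authority");
                return 1;
            }

            /* (3) The callout info can now be read back from the assumed
             *     authorisation key rather than arriving on the command line. */
            void *callout = NULL;
            long n = keyctl_read_alloc(KEY_SPEC_REQKEY_AUTH_KEY, &callout);
            if (n >= 0)
                printf("read %ld bytes of callout info\n", n);

            /* (7) Instantiate the key; the assumed authority is dropped on success. */
            if (keyctl_instantiate(key, "example-payload", 15,
                                   KEY_SPEC_SESSION_KEYRING) < 0) {
                perror("keyctl_instantiate");
                free(callout);
                return 1;
            }
            free(callout);
            return 0;
        }
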
    • [PATCH] remove get_task_struct_rcu() · d4829cd5
      Committed by Paul E. McKenney
      The latest set of signal-RCU patches does not use get_task_struct_rcu().
      Attached is a patch that removes it.
      Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d4829cd5
    • [PATCH] RCU signal handling · e56d0903
      Committed by Ingo Molnar
      RCU tasklist_lock and RCU signal handling: send signals RCU-read-locked
      instead of tasklist_lock read-locked.  This is a scalability improvement on
      SMP and a preemption-latency improvement under PREEMPT_RCU.
      Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: William Irwin <wli@holomorphy.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      e56d0903
    • [PATCH] Swap Migration V5: PF_SWAPWRITE to allow writing to swap · 930d9152
      Committed by Christoph Lameter
      Add PF_SWAPWRITE to control a process's permission to write to swap.
      
      - Use PF_SWAPWRITE in may_write_to_queue() instead of checking for kswapd
        and pdflush
      
      - Set PF_SWAPWRITE flag for kswapd and pdflush
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      930d9152
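
      A minimal userspace C model of the new permission check; the flag value,
      structures and congestion handling are illustrative, and only the "check a
      task flag instead of the task's identity" idea comes from the entry:

        #include <stdbool.h>
        #include <stdio.h>

        #define PF_SWAPWRITE 0x00800000  /* illustrative bit; real value is in sched.h */

        struct task_model        { unsigned long flags; };
        struct backing_dev_model { bool write_congested; };

        /* Old check: "is current kswapd or pdflush?" (identity-based).
         * New check: a flag on the task itself, so any thread doing swap
         * writeback can opt in by setting PF_SWAPWRITE. */
        static bool may_write_to_queue_model(const struct task_model *current_task,
                                             const struct backing_dev_model *bdi)
        {
            if (current_task->flags & PF_SWAPWRITE)
                return true;
            return !bdi->write_congested;
        }

        int main(void)
        {
            struct task_model kswapd = { .flags = PF_SWAPWRITE };
            struct backing_dev_model bdi = { .write_congested = true };

            printf("kswapd may write: %d\n", may_write_to_queue_model(&kswapd, &bdi));
            return 0;
        }
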
  17. 07 January 2006, 1 commit
  18. 29 November 2005, 1 commit
    • [PATCH] clean up lock_cpu_hotplug() in cpufreq · a9d9baa1
      Committed by Ashok Raj
      There are some callers in the cpufreq hotplug notify path where the
      lowest-level function calls lock_cpu_hotplug().  The lock is already held
      during cpu_up() and cpu_down() calls when the notify calls are broadcast to
      registered clients.
      
      Ideally, if possible, we could preempt_disable() at the highest caller and
      make sure we don't sleep in the path down to the cpufreq->driver_target()
      calls, but the calls are so intertwined that they are cumbersome to clean up.
      
      Hence we consistently use lock_cpu_hotplug() and unlock_cpu_hotplug() in
      all places.
      
       - Removed export of the cpucontrol semaphore and made it static.
       - Removed explicit uses of up/down with lock_cpu_hotplug()
         so we can keep track of the callers in the same thread context and
         just keep refcounts without calling a down() that causes a deadlock.
       - Removed current_in_hotplug() uses.
       - Removed PF_HOTPLUG_CPU in sched.h, introduced for the current_in_hotplug()
         temporary workaround.
      
      Tested with insmod of cpufreq_stat.ko and logical online/offline
      to make sure we don't have any hang situations.
      Signed-off-by: Ashok Raj <ashok.raj@intel.com>
      Cc: Zwane Mwaikambo <zwane@linuxpower.ca>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      a9d9baa1
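
      A userspace C model of the recursion-tolerant locking described above: the
      first acquisition in a thread takes the real lock, nested acquisitions only
      bump a per-thread refcount, so callees on the notify path can take the lock
      again without deadlocking. The names and the pthread-based implementation are
      illustrative:

        #include <pthread.h>
        #include <stdio.h>

        static pthread_mutex_t cpucontrol = PTHREAD_MUTEX_INITIALIZER;
        static __thread int hotplug_depth;  /* per-thread recursion refcount */

        static void lock_cpu_hotplug_model(void)
        {
            if (hotplug_depth++ == 0)
                pthread_mutex_lock(&cpucontrol);  /* outermost caller takes the lock */
            /* nested callers in the same thread only bump the refcount */
        }

        static void unlock_cpu_hotplug_model(void)
        {
            if (--hotplug_depth == 0)
                pthread_mutex_unlock(&cpucontrol);
        }

        static void cpufreq_notify_model(void)
        {
            /* A callee deep in the notify path may take the lock again safely. */
            lock_cpu_hotplug_model();
            puts("frequency transition with hotplug lock held");
            unlock_cpu_hotplug_model();
        }

        int main(void)
        {
            lock_cpu_hotplug_model();   /* as cpu_up()/cpu_down() would */
            cpufreq_notify_model();     /* would deadlock with a plain, non-recursive lock */
            unlock_cpu_hotplug_model();
            return 0;
        }
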