1. 23 June 2006, 2 commits
    • [PATCH] Kill PF_SYNCWRITE flag · b31dc66a
      Committed by Jens Axboe
      A process flag to indicate whether we are doing sync io is incredibly
      ugly. It also causes performance problems when one does a lot of async
      io and then proceeds to sync it. Part of the io will go out as async,
      and the other part as sync. This causes a disconnect between the
      previously submitted io and the synced io. For io schedulers such as CFQ,
      this costs us merges and leads to suboptimal scheduling behaviour.
      
      Remove PF_SYNCWRITE completely from the fsync/msync paths, and let
      the O_DIRECT path indicate directly that its writes are synchronous
      by using WRITE_SYNC instead (see the sketch after this entry).
      Signed-off-by: Jens Axboe <axboe@suse.de>
      b31dc66a
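
      A minimal sketch of the idea, not the patch itself: it assumes the
      2006-era submit_bio(int rw, struct bio *bio) interface, and only the
      flag name WRITE_SYNC comes from this change.

      /* Illustrative only: tag the io itself as synchronous at submission
       * time, instead of tagging the submitting process with PF_SYNCWRITE. */
      static void submit_odirect_write(struct bio *bio)
      {
      	submit_bio(WRITE_SYNC, bio);	/* CFQ sees a sync request directly */
      }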
    • [PATCH] ptrace: document the locking rules · 260ea101
      Committed by Eric W. Biederman
      After a lot of reading the code and thinking about how it behaves, I have
      managed to figure out what the current ptrace locking rules are.  The
      current code is in much better shape than it appears at first glance.  The
      troublesome code paths are actually the code paths that violate the current
      rules.
      
      ptrace uses simple exclusive access as its locking.  You can only touch
      task->ptrace if the task is stopped and you are the ptracer, or if the task
      is running and you are the task itself (a hypothetical restatement of this
      rule in code follows the entry).
      
      Very simple, very easy to maintain.  It just needs to be documented so
      people know not to touch ptrace from elsewhere.
      
      Currently we do have a few pieces of code that are in violation of this
      rule.  Particularly the core dump code, and ptrace_attach.  But so far the
      code looks fixable.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      260ea101
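
      A hypothetical helper, not part of the patch, that just restates the rule
      above in code; the state and field names are 2.6-era assumptions.

      /* Hypothetical illustration of the ptrace locking rule. */
      static int may_touch_ptrace(struct task_struct *task)
      {
      	/* stopped and we are the tracer, or we are the task itself */
      	return (task->state == TASK_TRACED && task->parent == current) ||
      	       task == current;
      }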
  2. 20 June 2006, 1 commit
  3. 27 April 2006, 1 commit
  4. 25 April 2006, 1 commit
  5. 20 April 2006, 1 commit
  6. 15 April 2006, 1 commit
  7. 11 April 2006, 3 commits
    • [PATCH] Reinstate const in next_thread() · a9cdf410
      Committed by Keith Owens
      Before commit 47e65328, next_thread() took a const task_t.  Reinstate the
      const qualifier; getting the next thread never changes the current thread
      (see the sketch after this entry).
      Signed-off-by: Keith Owens <kaos@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      a9cdf410
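
      A sketch of the resulting declaration, assuming the 2.6-era task_t typedef
      for struct task_struct; the exact form in include/linux/sched.h may differ.

      /* Illustrative: the argument is const again. */
      extern task_t *next_thread(const task_t *p);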
    • [PATCH] splice: add direct fd <-> fd splicing support · b92ce558
      Committed by Jens Axboe
      It's more efficient for sendfile() emulation. Basically we cache an
      internal private pipe and just use that as the intermediate area for
      pages. Direct splicing is not available from sys_splice(); it is only
      meant to be used for sendfile() emulation (a hedged sketch of the call
      path follows this entry).
      
      Additional patch from Ingo Molnar to avoid the PIPE_BUFFERS loop at
      exit for the normal fast path.
      Signed-off-by: Jens Axboe <axboe@suse.de>
      b92ce558
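
      A rough sketch of how sendfile() emulation might route through the direct
      path; do_splice_direct() is the in-kernel helper this work is aimed at,
      but the exact signature shown here is an assumption based on 2.6.17-era
      code, not a quote from the patch.

      /* Illustrative only: sendfile(out, in, &pos, count) via splice. */
      static long sendfile_via_splice(struct file *in, struct file *out,
      				loff_t *ppos, size_t count)
      {
      	/* Internally reuses a cached private pipe as the buffer area. */
      	return do_splice_direct(in, ppos, out, count, 0);
      }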
    • [PATCH] de_thread: Don't confuse users do_each_thread. · de12a787
      Committed by Eric W. Biederman
      Oleg Nesterov spotted two interesting bugs with the current de_thread
      code.  The simplest is a long-standing double decrement of
      __get_cpu_var(process_counts) in __unhash_process, caused by
      two processes exiting when only one was created.
      
      The other is that, since we no longer detach from the thread_group list,
      it is possible for do_each_thread, when run under the tasklist_lock, to
      see the same task_struct twice: once on the task list as a
      thread_group_leader, and once on the thread list of another
      thread.
      
      The double appearance in do_each_thread can cause a double increment
      of mm_core_waiters in zap_threads resulting in problems later on in
      coredump_wait.
      
      To remedy those two problems this patch takes the simple approach
      of changing the old thread group leader into a child thread.
      The only routine in release_task that cares is __unhash_process,
      and it can be trivially seen that we handle cleaning up a
      thread group leader properly.
      
      Since de_thread doesn't change the pid of the exiting leader process
      and instead shares it with the new leader process, I change
      thread_group_leader to recognize group leadership based on the
      group_leader field and not based on pids (see the sketch after this
      entry).  This should also be slightly cheaper than the existing
      thread_group_leader macro.
      
      I performed a quick audit and I couldn't see any user of
      thread_group_leader that cared about the difference.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      de12a787
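
      A minimal sketch of the macro change described above; the real definition
      lives in include/linux/sched.h, and the old form is shown only for
      contrast.

      /* Before (illustrative): leadership inferred from pids.
       *   #define thread_group_leader(p)  ((p)->pid == (p)->tgid)
       * After (illustrative): leadership recorded explicitly.
       */
      #define thread_group_leader(p)  ((p) == (p)->group_leader)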
  8. 01 April 2006, 6 commits
    • [PATCH] pidhash: Refactor the pid hash table · 92476d7f
      Committed by Eric W. Biederman
      Simplifies the code, reduces the need for 4 pid hash tables, and makes the
      code more capable.
      
      In the discussions I had with Oleg it was felt that to a large extent the
      cleanup itself justified the work.  With struct pid being dynamically
      allocated, we can create the hash table entry when the pid is allocated
      and free it when the pid is freed, instead of playing with the hash lists
      whenever a process attaches or detaches.
      
      For myself, the fact that it gave what my previous task_ref patch gave for
      free, with simpler code, was a big win.  The problem is that if you hold a
      reference to struct task_struct you lock in 10K of low memory.  If you do
      that in a user controllable way like /proc does, with an unprivileged but
      hostile user space application with typical resource limits of 1000 fds and
      100 processes, I can trigger the OOM killer by consuming all of low memory
      with task structs, on a machine with 1GB of low memory.
      
      If I instead hold a reference to struct pid which holds a pointer to my
      task_struct, I don't suffer from that problem because struct pid is 2 orders
      of magnitude smaller.  In fact struct pid is small enough that most other
      kernel data structures dwarf it, so simply limiting the number of referring
      data structures is enough to prevent exhaustion of low memory.
      
      This splits the current struct pid into two structures, struct pid and struct
      pid_link, and reduces our number of hash tables from PIDTYPE_MAX to just one
      (a rough sketch of the split follows this entry).  struct pid_link is the per
      process linkage into the hash tables and lives in struct task_struct.  struct
      pid is given an independent lifetime, and holds pointers to each of the pid
      types.
      
      The independent life of struct pid simplifies attach_pid and detach_pid,
      because we are always manipulating the list of pids and not the hash table.
      In addition, giving struct pid an independent life makes the concept much
      more powerful.
      
      Kernel data structures can now embed a struct pid * instead of a pid_t and
      not suffer from pid wrap around problems or from keeping unnecessarily
      large amounts of memory allocated.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      92476d7f
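
      A rough, abridged sketch of the split described above; treat the exact
      field layout as illustrative rather than the patch's literal definitions.

      /* Illustrative, fields abridged. */
      struct pid {
      	atomic_t count;				/* reference count */
      	int nr;					/* the pid value itself */
      	struct hlist_node pid_chain;		/* entry in the single hash table */
      	struct hlist_head tasks[PIDTYPE_MAX];	/* attached tasks, per pid type */
      	struct rcu_head rcu;
      };

      struct pid_link {
      	struct hlist_node node;	/* lives in task_struct, links into pid->tasks[] */
      	struct pid *pid;
      };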
    • [PATCH] task: RCU protect task->usage · 8c7904a0
      Committed by Eric W. Biederman
      A big problem with rcu protected data structures that are also reference
      counted is that you must jump through several hoops to increase the reference
      count.  I think someone finally implemented atomic_inc_not_zero(&count) to
      automate the common case.  Unfortunately this means you must special case the
      rcu access case.
      
      When data structures are only visible via rcu in a manner that is not
      determined by the reference count on the object (i.e.  tasks are visible until
      their zombies are reaped), there is a much simpler technique we can employ:
      simply delay the decrement of the reference count until the rcu interval is
      over.
      
      What that means is that the proc code that looks up a task and later
      wants to sleep can now do:
      
      rcu_read_lock();
      task = find_task_by_pid(some_pid);
      if (task) {
      	get_task_struct(task);
      }
      rcu_read_unlock();
      
      The effect on the rest of the kernel is that put_task_struct becomes cheaper
      and immediate, and in the case where the task has been reaped it frees the
      task immediately instead of unnecessarily waiting until the rcu interval is
      over.
      
      Cleanup of task_struct does not happen when its reference count drops to
      zero, instead cleanup happens when release_task is called.  Tasks can only
      be looked up via rcu before release_task is called.  All rcu protected
      members of task_struct are freed by release_task.
      
      Therefore we can move call_rcu from put_task_struct into release_task.  And
      we can modify release_task to not immediately release the reference count
      but instead have it call put_task_struct from the function it gives to
      call_rcu (a simplified sketch follows this entry).
      
      The end result:
      
      - get_task_struct is safe in an rcu context where we have just looked
        up the task.
      
      - put_task_struct() simplifies into its old pre-rcu self.
      
      This reorganization also makes put_task_struct uncallable from modules, as
      it is not exported, but it does not appear to be called from any modules, so
      this should not be an issue and is trivially fixed if it ever becomes one.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      8c7904a0
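
      A simplified sketch of the reorganization, assuming task_struct carries an
      rcu_head (as it did in this era); this is not the full patch.

      /* Illustrative only. */
      static void delayed_put_task_struct(struct rcu_head *rhp)
      {
      	put_task_struct(container_of(rhp, struct task_struct, rcu));
      }

      void release_task(struct task_struct *p)
      {
      	/* ... existing teardown of rcu-visible state ... */
      	call_rcu(&p->rcu, delayed_put_task_struct);	/* drop the ref after a grace period */
      }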
    • [PATCH] resurrect __put_task_struct · 158d9ebd
      Committed by Andrew Morton
      This just got nuked in mainline.  Bring it back because Eric's patches use it.
      
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      158d9ebd
    • [PATCH] sched: activate SCHED BATCH expired · d425b274
      Committed by Con Kolivas
      To increase the strength of SCHED_BATCH as a scheduling hint, we can
      activate batch tasks on the expired array, since by definition they are
      latency-insensitive tasks (see the sketch after this entry).
      Signed-off-by: Con Kolivas <kernel@kolivas.org>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d425b274
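
      An illustrative sketch against the O(1)-scheduler structures of the time;
      names such as batch_task() and rq->expired are era-based assumptions, and
      the real patch differs in detail.

      /* Illustrative only. */
      static void __activate_task(task_t *p, runqueue_t *rq)
      {
      	prio_array_t *target = rq->active;

      	if (batch_task(p))		/* SCHED_BATCH: latency insensitive */
      		target = rq->expired;	/* start it on the expired array */
      	enqueue_task(p, target);
      	rq->nr_running++;
      }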
    • [PATCH] sched: cleanup task_activated() · 3dee386e
      Committed by Con Kolivas
      The activated flag in task_struct is used to track different sleep types and
      its usage is somewhat obfuscated.  Convert the variable to an enum with more
      descriptive names without altering the function (sketched after this entry).
      Signed-off-by: Con Kolivas <kernel@kolivas.org>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      3dee386e
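
      A sketch of the kind of enum this describes; the member names are modelled
      on the 2.6.17-era scheduler and should be read as assumptions, not quotes
      from the patch.

      /* Illustrative: descriptive names replacing the numeric 'activated' flag. */
      enum sleep_type {
      	SLEEP_NORMAL,
      	SLEEP_NONINTERACTIVE,
      	SLEEP_INTERACTIVE,
      	SLEEP_INTERRUPTED,
      };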
    • [PATCH] sched: reduce overhead of calc_load · db1b1fef
      Committed by Jack Steiner
      Currently, count_active_tasks() calls both nr_running() &
      nr_uninterruptible().  Each of these functions does a "for_each_cpu" & reads
      values from the runqueue of each cpu.  Although this is not a lot of
      instructions, each runqueue may be located on a different node.  Depending on
      the architecture, a unique TLB entry may be required to access each
      runqueue.
      
      Since there may be more runqueues than cpu TLB entries, a scan of all
      runqueues can thrash the TLB.  Each memory reference incurs a TLB miss &
      refill.
      
      In addition, the runqueue cacheline that contains nr_running &
      nr_uninterruptible may be evicted from the cache between the two passes.
      This causes unnecessary cache misses.
      
      Combining nr_running() & nr_uninterruptible() into a single function
      substantially reduces the TLB & cache misses on large systems (a sketch
      follows this entry).  This should have no measurable effect on smaller
      systems.
      
      On a 128p IA64 system running a memory stress workload, the new function
      reduced the overhead of calc_load() from 605 usec/call to 324 usec/call.
      Signed-off-by: Jack Steiner <steiner@sgi.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      db1b1fef
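
      An illustrative one-pass combination; cpu_rq() is internal to
      kernel/sched.c, and the helper name nr_active() is an assumption rather
      than a quote from the patch.

      /* Illustrative only: one pass over the runqueues instead of two. */
      unsigned long nr_active(void)
      {
      	unsigned long running = 0, uninterruptible = 0;
      	int i;

      	for_each_online_cpu(i) {
      		running += cpu_rq(i)->nr_running;
      		uninterruptible += cpu_rq(i)->nr_uninterruptible;
      	}
      	return running + uninterruptible;
      }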
  9. 29 March 2006, 12 commits
  10. 28 March 2006, 2 commits
  11. 27 March 2006, 1 commit
  12. 24 March 2006, 4 commits
    • [PATCH] timer-irq-driven soft-watchdog, cleanups · 6687a97d
      Committed by Ingo Molnar
      Make the softlockup detector purely timer-interrupt driven, removing
      softirq-context (timer) dependencies.  This means that if the softlockup
      watchdog triggers, it has truly observed a scheduling delay of more than
      10 seconds for a SCHED_FIFO prio 99 task.
      
      (The patch also turns off the softlockup detector during the initial bootup
      phase and does small style fixes.)
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      6687a97d
    • [PATCH] cpuset memory spread slab cache optimizations · c61afb18
      Committed by Paul Jackson
      The hooks in the slab cache allocator for support of NUMA mempolicies and
      cpuset memory spreading sit in an important code path.  Many systems will
      use neither feature.
      
      This patch optimizes those hooks down to a single check of some bits in the
      current task's task_struct flags (see the sketch after this entry).  For
      non-NUMA systems, this hook and related code is already ifdef'd out.
      
      The optimization is done by using another task flag, set if the task is using
      a non-default NUMA mempolicy.  Taking this flag bit along with the
      PF_SPREAD_PAGE and PF_SPREAD_SLAB flag bits added earlier in this 'cpuset
      memory spreading' patch set, one can check for the combination of any of these
      special case memory placement mechanisms with a single test of the current
      task's task_struct flags.
      
      This patch also tightens up the code, to save a few bytes of kernel text
      space, and moves some of it out of line.  Due to the nested inlines called
      from multiple places, we were ending up with three copies of this code, which
      once we get off the main code path (for local node allocation) seems a bit
      wasteful of instruction memory.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c61afb18
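
      A sketch of the single-test idea; PF_SPREAD_SLAB comes from this patch
      series, while the mempolicy flag name and the helper shown here are
      assumptions for illustration only.

      /* Illustrative: one flags test gates all the special-placement paths. */
      static inline int slab_needs_special_placement(void)
      {
      	return unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY));
      }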
    • [PATCH] cpuset memory spread basic implementation · 825a46af
      Committed by Paul Jackson
      This patch provides the implementation and cpuset interface for an alternative
      memory allocation policy that can be applied to certain kinds of memory
      allocations, such as the page cache (file system buffers) and some slab caches
      (such as inode caches).
      
      The policy is called "memory spreading." If enabled, it spreads out these
      kinds of memory allocations over all the nodes allowed to a task, instead of
      preferring to place them on the node where the task is executing.
      
      All other kinds of allocations, including anonymous pages for a task's stack
      and data regions, are not affected by this policy choice, and continue to be
      allocated preferring the node local to execution, as modified by the NUMA
      mempolicy.
      
      There are two boolean flag files per cpuset that control where the kernel
      allocates pages for the file system buffers and related in kernel data
      structures.  They are called 'memory_spread_page' and 'memory_spread_slab'.
      
      If the per-cpuset boolean flag file 'memory_spread_page' is set, then the
      kernel will spread the file system buffers (page cache) evenly over all the
      nodes that the faulting task is allowed to use, instead of preferring to put
      those pages on the node where the task is running.
      
      If the per-cpuset boolean flag file 'memory_spread_slab' is set, then the
      kernel will spread some file system related slab caches, such as those for
      inodes and dentries, evenly over all the nodes that the faulting task is
      allowed to use, instead of preferring to put those pages on the node where
      the task is running.
      
      The implementation is simple.  Setting the cpuset flags 'memory_spread_page'
      or 'memory_spread_slab' turns on the per-process flags PF_SPREAD_PAGE or
      PF_SPREAD_SLAB, respectively, for each task that is in the cpuset or
      subsequently joins that cpuset.  In subsequent patches, the page allocation
      calls for the affected page cache and slab caches are modified to perform an
      inline check for these flags, and if set, a call to a new routine
      cpuset_mem_spread_node() returns the node to prefer for the allocation
      (sketched after this entry).
      
      The cpuset_mem_spread_node() routine is also simple.  It uses the value of a
      per-task rotor, cpuset_mem_spread_rotor, to select the next node in the
      current task's mems_allowed to prefer for the allocation.
      
      This policy can provide substantial improvements for jobs that need to place
      thread local data on the corresponding node, but that need to access large
      file system data sets that have to be spread across the several nodes in the
      job's cpuset in order to fit.  Without this patch, especially for jobs that
      might have one thread reading in the data set, the memory allocation across
      the nodes in the job's cpuset can become very uneven.
      
      A couple of Copyright year ranges are updated as well.  And a couple of email
      addresses that can be found in the MAINTAINERS file are removed.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      825a46af
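
      A minimal sketch of the rotor described above, assuming the per-task fields
      named in this message; the real helper also has to cope with an empty
      mems_allowed.

      /* Illustrative only. */
      int cpuset_mem_spread_node(void)
      {
      	int node;

      	node = next_node(current->cpuset_mem_spread_rotor, current->mems_allowed);
      	if (node == MAX_NUMNODES)	/* wrapped past the last allowed node */
      		node = first_node(current->mems_allowed);
      	current->cpuset_mem_spread_rotor = node;
      	return node;
      }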
    • 2056a782
  13. 12 March 2006, 1 commit
  14. 01 March 2006, 1 commit
  15. 15 February 2006, 1 commit
    • [PATCH] sched: revert "filter affine wakeups" · d6077cb8
      Committed by Chen, Kenneth W
      Revert commit d7102e95:
      
          [PATCH] sched: filter affine wakeups
      
      It apparently caused a more than 10% performance regression on the aim7
      benchmark.  The setup in use is a 16-cpu HP rx8620, 64GB of memory and 12
      MSA1000s with 144 disks.  Each disk is 72GB with a single ext3 filesystem
      (courtesy of HP, who supplied the benchmark results).
      
      The problem is, for aim7, the wake-up pattern is random, but it still needs
      load balancing action in the wake-up path to achieve best performance.  With
      the above commit, lack of load balancing hurts that workload.
      
      However, for workloads like database transaction processing, the requirement
      is exactly the opposite.  In the wake-up path, best performance is achieved
      with absolutely zero load balancing: we simply wake up the process on the CPU
      it previously ran on.  Worst performance is obtained when we do load
      balancing at wake up.
      
      There isn't an easy way to auto-detect the workload characteristics.  Ingo's
      earlier patch, which detects an idle CPU and decides whether or not to load
      balance, doesn't perform well with aim7 either, since all CPUs are busy (it
      causes an even bigger performance regression).
      
      Revert commit d7102e95, which causes more
      than 10% performance regression with aim7.
      Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d6077cb8
  16. 10 February 2006, 1 commit
  17. 19 January 2006, 1 commit