1. 15 7月, 2006 4 次提交
  2. 04 7月, 2006 5 次提交
    • I
      [PATCH] sched: cleanup, convert sched.c-internal typedefs to struct · 70b97a7f
      Ingo Molnar 提交于
      convert:
      
       - runqueue_t to 'struct rq'
       - prio_array_t to 'struct prio_array'
       - migration_req_t to 'struct migration_req'
      
      I was the one who added these but they are both against the kernel coding
      style and also were used inconsistently at places.  So just get rid of them at
      once, now that we are flushing the scheduler patch-queue anyway.
      
      Conversion was mostly scripted, the result was reviewed and all secondary
      whitespace and style impact (if any) was fixed up by hand.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      70b97a7f
    • I
      [PATCH] sched: cleanup, remove task_t, convert to struct task_struct · 36c8b586
      Ingo Molnar 提交于
      cleanup: remove task_t and convert all the uses to struct task_struct. I
      introduced it for the scheduler anno and it was a mistake.
      
      Conversion was mostly scripted, the result was reviewed and all
      secondary whitespace and style impact (if any) was fixed up by hand.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      36c8b586
    • I
      [PATCH] lockdep: core · fbb9ce95
      Ingo Molnar 提交于
      Do 'make oldconfig' and accept all the defaults for new config options -
      reboot into the kernel and if everything goes well it should boot up fine and
      you should have /proc/lockdep and /proc/lockdep_stats files.
      
      Typically if the lock validator finds some problem it will print out
      voluminous debug output that begins with "BUG: ..." and which syslog output
      can be used by kernel developers to figure out the precise locking scenario.
      
      What does the lock validator do?  It "observes" and maps all locking rules as
      they occur dynamically (as triggered by the kernel's natural use of spinlocks,
      rwlocks, mutexes and rwsems).  Whenever the lock validator subsystem detects a
      new locking scenario, it validates this new rule against the existing set of
      rules.  If this new rule is consistent with the existing set of rules then the
      new rule is added transparently and the kernel continues as normal.  If the
      new rule could create a deadlock scenario then this condition is printed out.
      
      When determining validity of locking, all possible "deadlock scenarios" are
      considered: assuming arbitrary number of CPUs, arbitrary irq context and task
      context constellations, running arbitrary combinations of all the existing
      locking scenarios.  In a typical system this means millions of separate
      scenarios.  This is why we call it a "locking correctness" validator - for all
      rules that are observed the lock validator proves it with mathematical
      certainty that a deadlock could not occur (assuming that the lock validator
      implementation itself is correct and its internal data structures are not
      corrupted by some other kernel subsystem).  [see more details and conditionals
      of this statement in include/linux/lockdep.h and
      Documentation/lockdep-design.txt]
      
      Furthermore, this "all possible scenarios" property of the validator also
      enables the finding of complex, highly unlikely multi-CPU multi-context races
      via single single-context rules, increasing the likelyhood of finding bugs
      drastically.  In practical terms: the lock validator already found a bug in
      the upstream kernel that could only occur on systems with 3 or more CPUs, and
      which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
      That bug was found and reported on a single-CPU system (!).  So in essence a
      race will be found "piecemail-wise", triggering all the necessary components
      for the race, without having to reproduce the race scenario itself!  In its
      short existence the lock validator found and reported many bugs before they
      actually caused a real deadlock.
      
      To further increase the efficiency of the validator, the mapping is not per
      "lock instance", but per "lock-class".  For example, all struct inode objects
      in the kernel have inode->inotify_mutex.  If there are 10,000 inodes cached,
      then there are 10,000 lock objects.  But ->inotify_mutex is a single "lock
      type", and all locking activities that occur against ->inotify_mutex are
      "unified" into this single lock-class.  The advantage of the lock-class
      approach is that all historical ->inotify_mutex uses are mapped into a single
      (and as narrow as possible) set of locking rules - regardless of how many
      different tasks or inode structures it took to build this set of rules.  The
      set of rules persist during the lifetime of the kernel.
      
      To see the rough magnitude of checking that the lock validator does, here's a
      portion of /proc/lockdep_stats, fresh after bootup:
      
       lock-classes:                            694 [max: 2048]
       direct dependencies:                  1598 [max: 8192]
       indirect dependencies:               17896
       all direct dependencies:             16206
       dependency chains:                    1910 [max: 8192]
       in-hardirq chains:                      17
       in-softirq chains:                     105
       in-process chains:                    1065
       stack-trace entries:                 38761 [max: 131072]
       combined max dependencies:         2033928
       hardirq-safe locks:                     24
       hardirq-unsafe locks:                  176
       softirq-safe locks:                     53
       softirq-unsafe locks:                  137
       irq-safe locks:                         59
       irq-unsafe locks:                      176
      
      The lock validator has observed 1598 actual single-thread locking patterns,
      and has validated all possible 2033928 distinct locking scenarios.
      
      More details about the design of the lock validator can be found in
      Documentation/lockdep-design.txt, which can also found at:
      
         http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt
      
      [bunk@stusta.de: cleanups]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NAdrian Bunk <bunk@stusta.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      fbb9ce95
    • I
      [PATCH] lockdep: irqtrace subsystem, core · de30a2b3
      Ingo Molnar 提交于
      Accurate hard-IRQ-flags and softirq-flags state tracing.
      
      This allows us to attach extra functionality to IRQ flags on/off
      events (such as trace-on/off).
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      de30a2b3
    • I
      [PATCH] lockdep: better lock debugging · 9a11b49a
      Ingo Molnar 提交于
      Generic lock debugging:
      
       - generalized lock debugging framework. For example, a bug in one lock
         subsystem turns off debugging in all lock subsystems.
      
       - got rid of the caller address passing (__IP__/__IP_DECL__/etc.) from
         the mutex/rtmutex debugging code: it caused way too much prototype
         hackery, and lockdep will give the same information anyway.
      
       - ability to do silent tests
      
       - check lock freeing in vfree too.
      
       - more finegrained debugging options, to allow distributions to
         turn off more expensive debugging features.
      
      There's no separate 'held mutexes' list anymore - but there's a 'held locks'
      stack within lockdep, which unifies deadlock detection across all lock
      classes.  (this is independent of the lockdep validation stuff - lockdep first
      checks whether we are holding a lock already)
      
      Here are the current debugging options:
      
      CONFIG_DEBUG_MUTEXES=y
      CONFIG_DEBUG_LOCK_ALLOC=y
      
      which do:
      
       config DEBUG_MUTEXES
                bool "Mutex debugging, basic checks"
      
       config DEBUG_LOCK_ALLOC
               bool "Detect incorrect freeing of live mutexes"
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      9a11b49a
  3. 01 7月, 2006 1 次提交
  4. 28 6月, 2006 8 次提交
    • T
      [PATCH] rtmutex: Propagate priority settings into PI lock chains · 95e02ca9
      Thomas Gleixner 提交于
      When the priority of a task, which is blocked on a lock, changes we must
      propagate this change into the PI lock chain.  Therefor the chain walk code
      is changed to get rid of the references to current to avoid false positives
      in the deadlock detector, as setscheduler might be called by a task which
      holds the lock on which the task whose priority is changed is blocked.
      
      Also add some comments about the get/put_task_struct usage to avoid
      confusion.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      95e02ca9
    • I
      [PATCH] pi-futex: futex_lock_pi/futex_unlock_pi support · c87e2837
      Ingo Molnar 提交于
      This adds the actual pi-futex implementation, based on rt-mutexes.
      
      [dino@in.ibm.com: fix an oops-causing race]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NDinakar Guniguntala <dino@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c87e2837
    • T
      [PATCH] pi-futex: rt mutex tester · 61a87122
      Thomas Gleixner 提交于
      RT-mutex tester: scriptable tester for rt mutexes, which allows userspace
      scripting of mutex unit-tests (and dynamic tests as well), using the actual
      rt-mutex implementation of the kernel.
      
      [akpm@osdl.org: fixlet]
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      61a87122
    • I
      [PATCH] pi-futex: rt mutex core · 23f78d4a
      Ingo Molnar 提交于
      Core functions for the rt-mutex subsystem.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      23f78d4a
    • I
      [PATCH] pi-futex: scheduler support for pi · b29739f9
      Ingo Molnar 提交于
      Add framework to boost/unboost the priority of RT tasks.
      
      This consists of:
      
       - caching the 'normal' priority in ->normal_prio
       - providing a functions to set/get the priority of the task
       - make sched_setscheduler() aware of boosting
      
      The effective_prio() cleanups also fix a priority-calculation bug pointed out
      by Andrey Gelman, in set_user_nice().
      
      has_rt_policy() fix: Peter Williams <pwil3058@bigpond.net.au>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Cc: Andrey Gelman <agelman@012.net.il>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b29739f9
    • S
      [PATCH] sched: mc/smt power savings sched policy · 5c45bf27
      Siddha, Suresh B 提交于
      sysfs entries 'sched_mc_power_savings' and 'sched_smt_power_savings' in
      /sys/devices/system/cpu/ control the MC/SMT power savings policy for the
      scheduler.
      
      Based on the values (1-enable, 0-disable) for these controls, sched groups
      cpu power will be determined for different domains.  When power savings
      policy is enabled and under light load conditions, scheduler will minimize
      the physical packages/cpu cores carrying the load and thus conserving
      power(with a perf impact based on the workload characteristics...  see OLS
      2005 CMP kernel scheduler paper for more details..)
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Con Kolivas <kernel@kolivas.org>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      5c45bf27
    • S
      [PATCH] sched_domain: handle kmalloc failure · 51888ca2
      Srivatsa Vaddagiri 提交于
      Try to handle mem allocation failures in build_sched_domains by bailing out
      and cleaning up thus-far allocated memory.  The patch has a direct consequence
      that we disable load balancing completely (even at sibling level) upon *any*
      memory allocation failure.
      
      [Lee.Schermerhorn@hp.com: bugfix]
      Signed-off-by: NSrivatsa Vaddagir <vatsa@in.ibm.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      51888ca2
    • P
      [PATCH] sched: implement smpnice · 2dd73a4f
      Peter Williams 提交于
      Problem:
      
      The introduction of separate run queues per CPU has brought with it "nice"
      enforcement problems that are best described by a simple example.
      
      For the sake of argument suppose that on a single CPU machine with a
      nice==19 hard spinner and a nice==0 hard spinner running that the nice==0
      task gets 95% of the CPU and the nice==19 task gets 5% of the CPU.  Now
      suppose that there is a system with 2 CPUs and 2 nice==19 hard spinners and
      2 nice==0 hard spinners running.  The user of this system would be entitled
      to expect that the nice==0 tasks each get 95% of a CPU and the nice==19
      tasks only get 5% each.  However, whether this expectation is met is pretty
      much down to luck as there are four equally likely distributions of the
      tasks to the CPUs that the load balancing code will consider to be balanced
      with loads of 2.0 for each CPU.  Two of these distributions involve one
      nice==0 and one nice==19 task per CPU and in these circumstances the users
      expectations will be met.  The other two distributions both involve both
      nice==0 tasks being on one CPU and both nice==19 being on the other CPU and
      each task will get 50% of a CPU and the user's expectations will not be
      met.
      
      Solution:
      
      The solution to this problem that is implemented in the attached patch is
      to use weighted loads when determining if the system is balanced and, when
      an imbalance is detected, to move an amount of weighted load between run
      queues (as opposed to a number of tasks) to restore the balance.  Once
      again, the easiest way to explain why both of these measures are necessary
      is to use a simple example.  Suppose that (in a slight variation of the
      above example) that we have a two CPU system with 4 nice==0 and 4 nice=19
      hard spinning tasks running and that the 4 nice==0 tasks are on one CPU and
      the 4 nice==19 tasks are on the other CPU.  The weighted loads for the two
      CPUs would be 4.0 and 0.2 respectively and the load balancing code would
      move 2 tasks resulting in one CPU with a load of 2.0 and the other with
      load of 2.2.  If this was considered to be a big enough imbalance to
      justify moving a task and that task was moved using the current
      move_tasks() then it would move the highest priority task that it found and
      this would result in one CPU with a load of 3.0 and the other with a load
      of 1.2 which would result in the movement of a task in the opposite
      direction and so on -- infinite loop.  If, on the other hand, an amount of
      load to be moved is calculated from the imbalance (in this case 0.1) and
      move_tasks() skips tasks until it find ones whose contributions to the
      weighted load are less than this amount it would move two of the nice==19
      tasks resulting in a system with 2 nice==0 and 2 nice=19 on each CPU with
      loads of 2.1 for each CPU.
      
      One of the advantages of this mechanism is that on a system where all tasks
      have nice==0 the load balancing calculations would be mathematically
      identical to the current load balancing code.
      
      Notes:
      
      struct task_struct:
      
      has a new field load_weight which (in a trade off of space for speed)
      stores the contribution that this task makes to a CPU's weighted load when
      it is runnable.
      
      struct runqueue:
      
      has a new field raw_weighted_load which is the sum of the load_weight
      values for the currently runnable tasks on this run queue.  This field
      always needs to be updated when nr_running is updated so two new inline
      functions inc_nr_running() and dec_nr_running() have been created to make
      sure that this happens.  This also offers a convenient way to optimize away
      this part of the smpnice mechanism when CONFIG_SMP is not defined.
      
      int try_to_wake_up():
      
      in this function the value SCHED_LOAD_BALANCE is used to represent the load
      contribution of a single task in various calculations in the code that
      decides which CPU to put the waking task on.  While this would be a valid
      on a system where the nice values for the runnable tasks were distributed
      evenly around zero it will lead to anomalous load balancing if the
      distribution is skewed in either direction.  To overcome this problem
      SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
      or by the average load_weight per task for the queue in question (as
      appropriate).
      
      int move_tasks():
      
      The modifications to this function were complicated by the fact that
      active_load_balance() uses it to move exactly one task without checking
      whether an imbalance actually exists.  This precluded the simple
      overloading of max_nr_move with max_load_move and necessitated the addition
      of the latter as an extra argument to the function.  The internal
      implementation is then modified to move up to max_nr_move tasks and
      max_load_move of weighted load.  This slightly complicates the code where
      move_tasks() is called and if ever active_load_balance() is changed to not
      use move_tasks() the implementation of move_tasks() should be simplified
      accordingly.
      
      struct sched_group *find_busiest_group():
      
      Similar to try_to_wake_up(), there are places in this function where
      SCHED_LOAD_SCALE is used to represent the load contribution of a single
      task and the same issues are created.  A similar solution is adopted except
      that it is now the average per task contribution to a group's load (as
      opposed to a run queue) that is required.  As this value is not directly
      available from the group it is calculated on the fly as the queues in the
      groups are visited when determining the busiest group.
      
      A key change to this function is that it is no longer to scale down
      *imbalance on exit as move_tasks() uses the load in its scaled form.
      
      void set_user_nice():
      
      has been modified to update the task's load_weight field when it's nice
      value and also to ensure that its run queue's raw_weighted_load field is
      updated if it was runnable.
      
      From: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      
      With smpnice, sched groups with highest priority tasks can mask the imbalance
      between the other sched groups with in the same domain.  This patch fixes some
      of the listed down scenarios by not considering the sched groups which are
      lightly loaded.
      
      a) on a simple 4-way MP system, if we have one high priority and 4 normal
         priority tasks, with smpnice we would like to see the high priority task
         scheduled on one cpu, two other cpus getting one normal task each and the
         fourth cpu getting the remaining two normal tasks.  but with current
         smpnice extra normal priority task keeps jumping from one cpu to another
         cpu having the normal priority task.  This is because of the
         busiest_has_loaded_cpus, nr_loaded_cpus logic..  We are not including the
         cpu with high priority task in max_load calculations but including that in
         total and avg_load calcuations..  leading to max_load < avg_load and load
         balance between cpus running normal priority tasks(2 Vs 1) will always show
         imbalanace as one normal priority and the extra normal priority task will
         keep moving from one cpu to another cpu having normal priority task..
      
      b) 4-way system with HT (8 logical processors).  Package-P0 T0 has a
         highest priority task, T1 is idle.  Package-P1 Both T0 and T1 have 1 normal
         priority task each..  P2 and P3 are idle.  With this patch, one of the
         normal priority tasks on P1 will be moved to P2 or P3..
      
      c) With the current weighted smp nice calculations, it doesn't always make
         sense to look at the highest weighted runqueue in the busy group..
         Consider a load balance scenario on a DP with HT system, with Package-0
         containing one high priority and one low priority, Package-1 containing one
         low priority(with other thread being idle)..  Package-1 thinks that it need
         to take the low priority thread from Package-0.  And find_busiest_queue()
         returns the cpu thread with highest priority task..  And ultimately(with
         help of active load balance) we move high priority task to Package-1.  And
         same continues with Package-0 now, moving high priority task from package-1
         to package-0..  Even without the presence of active load balance, load
         balance will fail to balance the above scenario..  Fix find_busiest_queue
         to use "imbalance" when it is lightly loaded.
      
      [kernel@kolivas.org: sched: store weighted load on up]
      [kernel@kolivas.org: sched: add discrete weighted cpu load function]
      [suresh.b.siddha@intel.com: sched: remove dead code]
      Signed-off-by: NPeter Williams <pwil3058@bigpond.com.au>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NCon Kolivas <kernel@kolivas.org>
      Cc: John Hawkes <hawkes@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      2dd73a4f
  5. 27 6月, 2006 1 次提交
  6. 26 6月, 2006 3 次提交
    • K
      [PATCH] pacct: none-delayed process accounting accumulation · 77787bfb
      KaiGai Kohei 提交于
      In current 2.6.17 implementation, signal_struct refered from task_struct is
      used for per-process data structure.  The pacct facility also uses it as a
      per-process data structure to store stime, utime, minflt, majflt.  But those
      members are saved in __exit_signal().  It's too late.
      
      For example, if some threads exits at same time, pacct facility has a
      possibility to drop accountings for a part of those threads.  (see, the
      following 'The results of original 2.6.17 kernel') I think accounting
      information should be completely collected into the per-process data structure
      before writing out an accounting record.
      
      This patch fixes this matter.  Accumulation of stime, utime, minflt and majflt
      are done before generating accounting record.
      
      [mingo@elte.hu: fix acct_collect() siglock bug found by lockdep]
      Signed-off-by: NKaiGai Kohei <kaigai@ak.jp.nec.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      77787bfb
    • K
      [PATCH] pacct: avoidance to refer the last thread as a representation of the process · f6ec29a4
      KaiGai Kohei 提交于
      When pacct facility generate an 'ac_flag' field in accounting record, it
      refers a task_struct of the thread which died last in the process.  But any
      other task_structs are ignored.
      
      Therefore, pacct facility drops ASU flag even if root-privilege operations are
      used by any other threads except the last one.  In addition, AFORK flag is
      always set when the thread of group-leader didn't die last, although this
      process has called execve() after fork().
      
      We have a same matter in ac_exitcode.  The recorded ac_exitcode is an exit
      code of the last thread in the process.  There is a possibility this exitcode
      is not the group leader's one.
      f6ec29a4
    • K
      [PATCH] pacct: add pacct_struct to fix some pacct bugs. · 0e464814
      KaiGai Kohei 提交于
      The pacct facility need an i/o operation when an accounting record is
      generated.  There is a possibility to wake OOM killer up.  If OOM killer is
      activated, it kills some processes to make them release process memory
      regions.
      
      But acct_process() is called in the killed processes context before calling
      exit_mm(), so those processes cannot release own memory.  In the results, any
      processes stop in this point and it finally cause a system stall.
      0e464814
  7. 23 6月, 2006 2 次提交
    • J
      [PATCH] Kill PF_SYNCWRITE flag · b31dc66a
      Jens Axboe 提交于
      A process flag to indicate whether we are doing sync io is incredibly
      ugly. It also causes performance problems when one does a lot of async
      io and then proceeds to sync it. Part of the io will go out as async,
      and the other part as sync. This causes a disconnect between the
      previously submitted io and the synced io. For io schedulers such as CFQ,
      this will cause us lost merges and suboptimal behaviour in scheduling.
      
      Remove PF_SYNCWRITE completely from the fsync/msync paths, and let
      the O_DIRECT path just directly indicate that the writes are sync
      by using WRITE_SYNC instead.
      Signed-off-by: NJens Axboe <axboe@suse.de>
      b31dc66a
    • E
      [PATCH] ptrace: document the locking rules · 260ea101
      Eric W. Biederman 提交于
      After a lot of reading the code and thinking about how it behaves I have
      managed to figure out what the current ptrace locking rules are.  The
      current code is in much better that it appears at first glance.  The
      troublesome code paths are actually the code paths that violate the current
      rules.
      
      ptrace uses simple exclusive access as it's locking.  You can only touch
      task->ptrace if the task is stopped and you are the ptracer, or if the task
      is running and are the task itself.
      
      Very simple, very easy to maintain.  It just needs to be documented so
      people know not to touch ptrace from elsewhere.
      
      Currently we do have a few pieces of code that are in violation of this
      rule.  Particularly the core dump code, and ptrace_attach.  But so far the
      code looks fixable.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      260ea101
  8. 20 6月, 2006 1 次提交
  9. 27 4月, 2006 1 次提交
  10. 25 4月, 2006 1 次提交
  11. 20 4月, 2006 1 次提交
  12. 15 4月, 2006 1 次提交
  13. 11 4月, 2006 3 次提交
    • K
      [PATCH] Reinstate const in next_thread() · a9cdf410
      Keith Owens 提交于
      Before commit 47e65328, next_thread() took
      a const task_t.  Reinstate the const qualifier, getting the next thread
      never changes the current thread.
      Signed-off-by: NKeith Owens <kaos@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      a9cdf410
    • J
      [PATCH] splice: add direct fd <-> fd splicing support · b92ce558
      Jens Axboe 提交于
      It's more efficient for sendfile() emulation. Basically we cache an
      internal private pipe and just use that as the intermediate area for
      pages. Direct splicing is not available from sys_splice(), it is only
      meant to be used for sendfile() emulation.
      
      Additional patch from Ingo Molnar to avoid the PIPE_BUFFERS loop at
      exit for the normal fast path.
      Signed-off-by: NJens Axboe <axboe@suse.de>
      b92ce558
    • E
      [PATCH] de_thread: Don't confuse users do_each_thread. · de12a787
      Eric W. Biederman 提交于
      Oleg Nesterov spotted two interesting bugs with the current de_thread
      code.  The simplest is a long standing double decrement of
      __get_cpu_var(process_counts) in __unhash_process.  Caused by
      two processes exiting when only one was created.
      
      The other is that since we no longer detach from the thread_group list
      it is possible for do_each_thread when run under the tasklist_lock to
      see the same task_struct twice.  Once on the task list as a
      thread_group_leader, and once on the thread list of another
      thread.
      
      The double appearance in do_each_thread can cause a double increment
      of mm_core_waiters in zap_threads resulting in problems later on in
      coredump_wait.
      
      To remedy those two problems this patch takes the simple approach
      of changing the old thread group leader into a child thread.
      The only routine in release_task that cares is __unhash_process,
      and it can be trivially seen that we handle cleaning up a
      thread group leader properly.
      
      Since de_thread doesn't change the pid of the exiting leader process
      and instead shares it with the new leader process.  I change
      thread_group_leader to recognize group leadership based on the
      group_leader field and not based on pids.  This should also be
      slightly cheaper then the existing thread_group_leader macro.
      
      I performed a quick audit and I couldn't see any user of
      thread_group_leader that cared about the difference.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      de12a787
  14. 01 4月, 2006 6 次提交
    • E
      [PATCH] pidhash: Refactor the pid hash table · 92476d7f
      Eric W. Biederman 提交于
      Simplifies the code, reduces the need for 4 pid hash tables, and makes the
      code more capable.
      
      In the discussions I had with Oleg it was felt that to a large extent the
      cleanup itself justified the work.  With struct pid being dynamically
      allocated meant we could create the hash table entry when the pid was
      allocated and free the hash table entry when the pid was freed.  Instead of
      playing with the hash lists when ever a process would attach or detach to a
      process.
      
      For myself the fact that it gave what my previous task_ref patch gave for free
      with simpler code was a big win.  The problem is that if you hold a reference
      to struct task_struct you lock in 10K of low memory.  If you do that in a user
      controllable way like /proc does, with an unprivileged but hostile user space
      application with typical resource limits of 1000 fds and 100 processes I can
      trigger the OOM killer by consuming all of low memory with task structs, on a
      machine wight 1GB of low memory.
      
      If I instead hold a reference to struct pid which holds a pointer to my
      task_struct, I don't suffer from that problem because struct pid is 2 orders
      of magnitude smaller.  In fact struct pid is small enough that most other
      kernel data structures dwarf it, so simply limiting the number of referring
      data structures is enough to prevent exhaustion of low memory.
      
      This splits the current struct pid into two structures, struct pid and struct
      pid_link, and reduces our number of hash tables from PIDTYPE_MAX to just one.
      struct pid_link is the per process linkage into the hash tables and lives in
      struct task_struct.  struct pid is given an indepedent lifetime, and holds
      pointers to each of the pid types.
      
      The independent life of struct pid simplifies attach_pid, and detach_pid,
      because we are always manipulating the list of pids and not the hash table.
      In addition in giving struct pid an indpendent life it makes the concept much
      more powerful.
      
      Kernel data structures can now embed a struct pid * instead of a pid_t and
      not suffer from pid wrap around problems or from keeping unnecessarily
      large amounts of memory allocated.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      92476d7f
    • E
      [PATCH] task: RCU protect task->usage · 8c7904a0
      Eric W. Biederman 提交于
      A big problem with rcu protected data structures that are also reference
      counted is that you must jump through several hoops to increase the reference
      count.  I think someone finally implemented atomic_inc_not_zero(&count) to
      automate the common case.  Unfortunately this means you must special case the
      rcu access case.
      
      When data structures are only visible via rcu in a manner that is not
      determined by the reference count on the object (i.e.  tasks are visible until
      their zombies are reaped) there is a much simpler technique we can employ.
      Simply delaying the decrement of the reference count until the rcu interval is
      over.
      
      What that means is that the proc code that looks up a task and later
      wants to sleep can now do:
      
      rcu_read_lock();
      task = find_task_by_pid(some_pid);
      if (task) {
      	get_task_struct(task);
      }
      rcu_read_unlock();
      
      The effect on the rest of the kernel is that put_task_struct becomes cheaper
      and immediate, and in the case where the task has been reaped it frees the
      task immediate instead of unnecessarily waiting an until the rcu interval is
      over.
      
      Cleanup of task_struct does not happen when its reference count drops to
      zero, instead cleanup happens when release_task is called.  Tasks can only
      be looked up via rcu before release_task is called.  All rcu protected
      members of task_struct are freed by release_task.
      
      Therefore we can move call_rcu from put_task_struct into release_task.  And
      we can modify release_task to not immediately release the reference count
      but instead have it call put_task_struct from the function it gives to
      call_rcu.
      
      The end result:
      
      - get_task_struct is safe in an rcu context where we have just looked
        up the task.
      
      - put_task_struct() simplifies into its old pre rcu self.
      
      This reorganization also makes put_task_struct uncallable from modules as
      it is not exported but it does not appear to be called from any modules so
      this should not be an issue, and is trivially fixed.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      8c7904a0
    • A
      [PATCH] resurrect __put_task_struct · 158d9ebd
      Andrew Morton 提交于
      This just got nuked in mainline.  Bring it back because Eric's patches use it.
      
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      158d9ebd
    • C
      [PATCH] sched: activate SCHED BATCH expired · d425b274
      Con Kolivas 提交于
      To increase the strength of SCHED_BATCH as a scheduling hint we can
      activate batch tasks on the expired array since by definition they are
      latency insensitive tasks.
      Signed-off-by: NCon Kolivas <kernel@kolivas.org>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      d425b274
    • C
      [PATCH] sched: cleanup task_activated() · 3dee386e
      Con Kolivas 提交于
      The activated flag in task_struct is used to track different sleep types and
      its usage is somewhat obfuscated.  Convert the variable to an enum with more
      descriptive names without altering the function.
      Signed-off-by: NCon Kolivas <kernel@kolivas.org>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3dee386e
    • J
      [PATCH] sched: reduce overhead of calc_load · db1b1fef
      Jack Steiner 提交于
      Currently, count_active_tasks() calls both nr_running() &
      nr_interruptible().  Each of these functions does a "for_each_cpu" & reads
      values from the runqueue of each cpu.  Although this is not a lot of
      instructions, each runqueue may be located on different node.  Depending on
      the architecture, a unique TLB entry may be required to access each
      runqueue.
      
      Since there may be more runqueues than cpu TLB entries, a scan of all
      runqueues can trash the TLB.  Each memory reference incurs a TLB miss &
      refill.
      
      In addition, the runqueue cacheline that contains nr_running &
      nr_uninterruptible may be evicted from the cache between the two passes.
      This causes unnecessary cache misses.
      
      Combining nr_running() & nr_interruptible() into a single function
      substantially reduces the TLB & cache misses on large systems.  This should
      have no measureable effect on smaller systems.
      
      On a 128p IA64 system running a memory stress workload, the new function
      reduced the overhead of calc_load() from 605 usec/call to 324 usec/call.
      Signed-off-by: NJack Steiner <steiner@sgi.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      db1b1fef
  15. 29 3月, 2006 2 次提交