  1. 09 Aug, 2007 6 commits
    • sched: remove the 'u64 now' parameter from ->pick_next_task() · fb8d4724
      Ingo Molnar authored
      remove the 'u64 now' parameter from ->pick_next_task().
      
      ( identity transformation that causes no change in functionality. )
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
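The 'u64 now' removals in this group of commits all have the same shape: a scheduling-class hook stops receiving a timestamp argument. The following is a minimal user-space sketch of the resulting interface, assuming that a hook which needs a timestamp can read it from the run queue. `struct rq_sketch`, its `clock` field, `struct sched_class_sketch` and the `toy_*` functions are simplified stand-ins invented for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for kernel types (assumptions for illustration). */
struct task_sketch {
    int prio;
    uint64_t enqueued_at;
};

struct rq_sketch {
    uint64_t clock;               /* per-runqueue timestamp */
    struct task_sketch *queued;   /* at most one queued task in this toy model */
};

/* After the cleanup: none of the hooks takes a 'u64 now' argument. */
struct sched_class_sketch {
    void (*enqueue_task)(struct rq_sketch *rq, struct task_sketch *p);
    void (*dequeue_task)(struct rq_sketch *rq, struct task_sketch *p);
    struct task_sketch *(*pick_next_task)(struct rq_sketch *rq);
};

static void toy_enqueue(struct rq_sketch *rq, struct task_sketch *p)
{
    p->enqueued_at = rq->clock;   /* timestamp comes from the run queue */
    rq->queued = p;
}

static void toy_dequeue(struct rq_sketch *rq, struct task_sketch *p)
{
    if (rq->queued == p)
        rq->queued = NULL;
}

static struct task_sketch *toy_pick_next(struct rq_sketch *rq)
{
    return rq->queued;
}

static const struct sched_class_sketch toy_class = {
    .enqueue_task = toy_enqueue,
    .dequeue_task = toy_dequeue,
    .pick_next_task = toy_pick_next,
};
```

Because callers no longer thread a timestamp through every call, the change is purely one of signatures, which is consistent with the "identity transformation" note in the commit messages.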
    • sched: remove the 'u64 now' parameter from ->dequeue_task() · f02231e5
      Ingo Molnar authored
      remove the 'u64 now' parameter from ->dequeue_task().
      
      ( identity transformation that causes no change in functionality. )
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: remove the 'u64 now' parameter from ->enqueue_task() · fd390f6a
      Ingo Molnar authored
      remove the 'u64 now' parameter from ->enqueue_task().
      
      ( identity transformation that causes no change in functionality. )
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: remove the 'u64 now' parameter from print_cfs_rq() · 5cef9eca
      Ingo Molnar authored
      remove the 'u64 now' parameter from print_cfs_rq().
      
      ( identity transformation that causes no change in functionality. )
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: fix bug in balance_tasks() · a4ac01c3
      Peter Williams authored
      There are two problems with balance_tasks() and how it is used:
      
      1. The variables best_prio and best_prio_seen (inherited from the old
      move_tasks()) were only required to handle problems caused by the
      active/expired arrays, the order in which they were processed and the
      possibility that the task with the highest priority could be on either.
      These issues are no longer present and the extra overhead associated
      with their use is unnecessary (and possibly wrong).
      
      2. In the absence of CONFIG_FAIR_GROUP_SCHED being set, the same
      this_best_prio variable needs to be used by all scheduling classes or
      there is a risk of moving too much load.  E.g. if the highest priority
      task on this_rq at the beginning is a fairly low priority task and the rt
      class migrates a task (during its turn) then that moved task becomes the
      new highest priority task on this_rq but when the sched_fair class
      initializes its copy of this_best_prio it will get the priority of the
      original highest priority task as, due to the run queue locks being
      held, the reschedule triggered by pull_task() will not have taken place.
      This could result in inappropriate overriding of skip_for_load and
      excessive load being moved.
      
      The attached patch addresses these problems by deleting all references to
      best_prio and best_prio_seen and making this_best_prio a reference
      parameter to the various functions involved.
      
      load_balance_fair() has also been modified so that this_best_prio is
      only reset (in the loop) if CONFIG_FAIR_GROUP_SCHED is set.  This should
      preserve the effect of helping spread groups' higher priority tasks
      around the available CPUs while improving system performance when
      CONFIG_FAIR_GROUP_SCHED isn't set.
      Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
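The second problem above is easier to see in a small model. The following is a rough user-space sketch, not the kernel code: it only shows why passing this_best_prio by reference lets the fair class see what the rt class has already moved. `struct busiest_sketch`, `pull_one()` and `balance_all_classes()` are hypothetical names, and each class is reduced to a single candidate task.

```c
#include <assert.h>

/* Lower number means higher priority, as in the kernel. */
struct busiest_sketch {
    int rt_prio;     /* best rt task on the busiest queue, or -1 if none */
    int fair_prio;   /* best fair task on the busiest queue, or -1 if none */
};

/* Pretend to pull one task of priority 'prio'; record it if it is the new best. */
static int pull_one(int prio, int *this_best_prio)
{
    if (prio < 0)
        return 0;                 /* nothing to move in this class */
    if (prio < *this_best_prio)
        *this_best_prio = prio;   /* visible to every class that runs later */
    return 1;
}

/* Both classes receive the same pointer, so no class works from a stale copy. */
static int balance_all_classes(const struct busiest_sketch *busiest,
                               int *this_best_prio)
{
    int moved = 0;

    moved += pull_one(busiest->rt_prio, this_best_prio);
    moved += pull_one(busiest->fair_prio, this_best_prio);
    return moved;
}
```

With a per-class copy initialized from the run queue, the fair class would still see the original (lower) best priority, which is the situation the commit message describes as risking excessive load being moved.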
    • sched: simplify move_tasks() · 43010659
      Peter Williams authored
      The move_tasks() function is currently multiplexed with two distinct
      capabilities:
      
      1. attempt to move a specified amount of weighted load from one run
      queue to another; and
      2. attempt to move a specified number of tasks from one run queue to
      another.
      
      The first of these capabilities is used in two places, load_balance()
      and load_balance_idle(), and in both of these cases the return value of
      move_tasks() is used purely to decide if tasks/load were moved and no
      notice of the actual number of tasks moved is taken.
      
      The second capability is used in exactly one place,
      active_load_balance(), to attempt to move exactly one task and, as
      before, the return value is only used as an indicator of success or failure.
      
      This multiplexing of move_tasks() was introduced, by me, as part of the
      smpnice patches and was motivated by the fact that the alternative, one
      function to move specified load and one to move a single task, would
      have led to two functions of roughly the same complexity as the old
      move_tasks() (or the new balance_tasks()).  However, the new modular
      design of the new CFS scheduler allows a simpler solution to be adopted
      and this patch addresses that solution by:
      
      1. adding a new function, move_one_task(), to be used by
      active_load_balance(); and
      2. making move_tasks() a single purpose function that tries to move a
      specified weighted load and returns 1 for success and 0 for failure.
      
      One of the consequences of these changes is that neither move_one_task()
      nor the new move_tasks() cares how many tasks sched_class.load_balance()
      moves and this enables its interface to be simplified by returning the
      amount of load moved as its result and removing the load_moved pointer
      from the argument list.  This helps simplify the new move_tasks() and
      slightly reduces the amount of work done in each of
      sched_class.load_balance()'s implementations.
      
      Further simplification, e.g. changes to balance_tasks(), are possible
      but (slightly) complicated by the special needs of load_balance_fair()
      so I've left them to a later patch (if this one gets accepted).
      
      NB: Since move_tasks() gets called with two run queue locks held, even
      small reductions in overhead are worthwhile.
      
      [ mingo@elte.hu ]
      
      this change also reduces code size nicely:
      
         text    data     bss     dec     hex filename
        39216    3618      24   42858    a76a sched.o.before
        39173    3618      24   42815    a73f sched.o.after
      Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
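The split described above can be sketched in user space as follows. This is an illustrative model only: `struct class_queue` (a queue of equally weighted tasks per scheduling class), `NR_CLASSES` and the parameter names are assumptions, while the three function roles follow the commit message: a per-class hook that returns the load it moved, a move_tasks() that returns 1 or 0, and a move_one_task() for the single-task case.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CLASSES 2

/* Toy per-class queue: 'nr' tasks, each of weight 'weight' (assumption). */
struct class_queue {
    unsigned long weight;
    int nr;
};

/* Class hook: move up to max_load and return the amount of load moved. */
static unsigned long class_load_balance(struct class_queue *src,
                                        struct class_queue *dst,
                                        unsigned long max_load)
{
    unsigned long moved = 0;

    while (src->nr > 0 && moved + src->weight <= max_load) {
        src->nr--;
        dst->nr++;
        moved += src->weight;
    }
    return moved;
}

/* Single purpose: try to move max_load_move; 1 for success, 0 for failure. */
static int move_tasks(struct class_queue *src, struct class_queue *dst,
                      unsigned long max_load_move)
{
    unsigned long total = 0;

    for (size_t i = 0; i < NR_CLASSES && total < max_load_move; i++)
        total += class_load_balance(&src[i], &dst[i], max_load_move - total);
    return total > 0;
}

/* For callers that want exactly one task moved, whatever its weight. */
static int move_one_task(struct class_queue *src, struct class_queue *dst)
{
    for (size_t i = 0; i < NR_CLASSES; i++) {
        if (src[i].nr > 0) {
            src[i].nr--;
            dst[i].nr++;
            return 1;
        }
    }
    return 0;
}
```

Neither caller counts tasks, so the class hook needs no out-parameter for load: its return value is enough.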
  2. 02 Aug, 2007 3 commits
  3. 26 Jul, 2007 3 commits
    • [PATCH] sched: add above_background_load() function · d02c7a8c
      Con Kolivas authored
      Add an above_background_load() function which can be used by other
      subsystems to detect if there is anything besides niced tasks running.
      
      Place it in sched.h to allow it to be compiled out if not used.
      
      Unused for now, but it is a useful hint to the IO scheduler and to
      swap-prefetch.
      Signed-off-by: Con Kolivas <kernel@kolivas.org>
      Cc: Peter Williams <pwil3058@bigpond.net.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
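One plausible shape for such a check, sketched in user space: report "above background" as soon as any CPU carries at least one un-niced task's worth of weighted load. The threshold `LOAD_SCALE_SKETCH`, the array-based interface and the function name are assumptions made for this example, not the patch's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: the weighted load contributed by one nice-0 task. */
#define LOAD_SCALE_SKETCH 1024UL

/* Return 1 if any CPU carries at least one nice-0 task's worth of load. */
static int above_background_load_sketch(const unsigned long *cpu_load,
                                        size_t nr_cpus)
{
    for (size_t cpu = 0; cpu < nr_cpus; cpu++) {
        if (cpu_load[cpu] >= LOAD_SCALE_SKETCH)
            return 1;
    }
    return 0;
}
```

A consumer such as an I/O scheduler could use the result as a hint to defer optional background work while it returns 1.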
    • [PATCH] sched: arch preempt notifier mechanism · e107be36
      Avi Kivity authored
      This adds a general mechanism whereby a task can request the scheduler to
      notify it whenever it is preempted or scheduled back in.  This allows the
      task to swap any special-purpose registers like the FPU or Intel's VT
      registers.
      Signed-off-by: Avi Kivity <avi@qumranet.com>
      [ mingo@elte.hu: fixes, cleanups ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
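The mechanism can be pictured as a pair of callbacks attached to a task and invoked around a context switch. The sketch below is a user-space model under that assumption. The type and function names (`notifier_sketch`, `context_switch_sketch`, `vcpu_ops`) are invented for illustration and do not reproduce the kernel's interface.

```c
#include <assert.h>
#include <stddef.h>

struct notifier_sketch;

struct notifier_ops_sketch {
    void (*sched_in)(struct notifier_sketch *n, int cpu);
    void (*sched_out)(struct notifier_sketch *n);
};

struct notifier_sketch {
    const struct notifier_ops_sketch *ops;
    int regs_loaded;   /* 1 while the special registers belong to this task */
    int last_cpu;
};

struct task_sketch {
    struct notifier_sketch *notifier;   /* NULL if the task did not register */
};

/* Called by the toy scheduler when 'prev' is switched out and 'next' in. */
static void context_switch_sketch(struct task_sketch *prev,
                                  struct task_sketch *next, int cpu)
{
    if (prev && prev->notifier)
        prev->notifier->ops->sched_out(prev->notifier);
    if (next && next->notifier)
        next->notifier->ops->sched_in(next->notifier, cpu);
}

/* Example callbacks: save special registers on the way out, reload on the way in. */
static void save_regs(struct notifier_sketch *n)
{
    n->regs_loaded = 0;
}

static void load_regs(struct notifier_sketch *n, int cpu)
{
    n->regs_loaded = 1;
    n->last_cpu = cpu;
}

static const struct notifier_ops_sketch vcpu_ops = {
    .sched_in = load_regs,
    .sched_out = save_regs,
};
```

Tasks that never register pay only a NULL check per switch in this model.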
    • [PATCH] sched: increase SCHED_LOAD_SCALE_FUZZ · b47e8608
      Ingo Molnar authored
      increase SCHED_LOAD_SCALE_FUZZ, which adds a small amount of
      over-balancing, to help distribute CPU-bound tasks more fairly on SMP
      systems.
      
      the problem of unfair balancing was noticed and reported by Tong N Li.
      
      10 CPU-bound tasks running on 8 CPUs, v2.6.23-rc1:
      
        PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
       2572 mingo     20   0  1576  244  196 R  100  0.0   1:03.61 loop
       2578 mingo     20   0  1576  248  196 R  100  0.0   1:03.59 loop
       2576 mingo     20   0  1576  248  196 R  100  0.0   1:03.52 loop
       2571 mingo     20   0  1576  244  196 R  100  0.0   1:03.46 loop
       2569 mingo     20   0  1576  244  196 R   99  0.0   1:03.36 loop
       2570 mingo     20   0  1576  244  196 R   95  0.0   1:00.55 loop
       2577 mingo     20   0  1576  248  196 R   50  0.0   0:31.88 loop
       2574 mingo     20   0  1576  248  196 R   50  0.0   0:31.87 loop
       2573 mingo     20   0  1576  248  196 R   50  0.0   0:31.86 loop
       2575 mingo     20   0  1576  248  196 R   50  0.0   0:31.86 loop
      
      v2.6.23-rc1 + patch:
      
        PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
       2681 mingo     20   0  1576  244  196 R   85  0.0   3:51.68 loop
       2688 mingo     20   0  1576  244  196 R   81  0.0   3:46.35 loop
       2682 mingo     20   0  1576  244  196 R   80  0.0   3:43.68 loop
       2685 mingo     20   0  1576  248  196 R   80  0.0   3:45.97 loop
       2683 mingo     20   0  1576  248  196 R   80  0.0   3:40.25 loop
       2679 mingo     20   0  1576  244  196 R   80  0.0   3:33.53 loop
       2680 mingo     20   0  1576  244  196 R   79  0.0   3:43.53 loop
       2686 mingo     20   0  1576  244  196 R   79  0.0   3:39.31 loop
       2687 mingo     20   0  1576  244  196 R   78  0.0   3:33.31 loop
       2684 mingo     20   0  1576  244  196 R   77  0.0   3:27.52 loop
      
      so they now nicely converge to the expected 80% long-term CPU usage.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
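The "expected 80%" figure is simply the fair share of 8 CPUs among 10 CPU-bound tasks. As a small worked check (the helper name is made up for this example):

```c
#include <assert.h>

/* Long-term fair CPU share, in percent, of each of 'tasks' CPU-bound tasks
 * running on 'cpus' CPUs; a single task can never exceed one full CPU. */
static unsigned int expected_share_percent(unsigned int cpus, unsigned int tasks)
{
    assert(tasks > 0);
    unsigned int share = 100u * cpus / tasks;
    return share > 100u ? 100u : share;
}
```

For 8 CPUs and 10 tasks this gives 100 * 8 / 10 = 80, which is the value the patched kernel's per-task %CPU column clusters around.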
  4. 20 Jul, 2007 3 commits
  5. 17 Jul, 2007 4 commits
    • user namespace: add unshare · 77ec739d
      Serge E. Hallyn authored
      This patch enables the unshare of user namespaces.
      
      It adds a new clone flag, CLONE_NEWUSER, and implements copy_user_ns(), which
      resets the current user_struct and adds a new root user (uid == 0).
      
      For now, unsharing the user namespace allows a process to reset its
      user_struct accounting. Uid 0 in the new user namespace should be contained
      using appropriate means, for instance SELinux.
      
      The plan, when the full support is complete (all uid checks covered), is to
      keep the original user's rights in the original namespace, and let a process
      become uid 0 in the new namespace, with full capabilities to the new
      namespace.
      Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
      Acked-by: Pavel Emelianov <xemul@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: James Morris <jmorris@namei.org>
      Cc: Andrew Morgan <agm@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • user namespace: add the framework · acce292c
      Cedric Le Goater authored
      Basically, it will allow a process to unshare its user_struct table,
      resetting at the same time its own user_struct and all the associated
      accounting.
      
      A new root user (uid == 0) is added to the user namespace upon creation.
      Such root users have full privileges and it seems that these privileges
      should be controlled through some means (process capabilities?)
      
      The unshare is not included in this patch.
      
      Changes since [try #4]:
      	- Updated get_user_ns and put_user_ns to accept NULL, and
      	  get_user_ns to return the namespace.
      
      Changes since [try #3]:
      	- moved struct user_namespace to files user_namespace.{c,h}
      
      Changes since [try #2]:
      	- removed struct user_namespace* argument from find_user()
      
      Changes since [try #1]:
      	- removed struct user_namespace* argument from find_user()
      	- added a root_user per user namespace
      Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
      Acked-by: Pavel Emelianov <xemul@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: James Morris <jmorris@namei.org>
      Cc: Andrew Morgan <agm@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
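The [try #4] note says the get/put helpers accept NULL and that the get helper returns the namespace. A minimal user-space sketch of that calling convention follows. It uses a plain integer reference count and a `freed` marker instead of the kernel's reference-counting primitives, and the `_sketch` names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

struct user_ns_sketch {
    int refcount;
    int freed;      /* set when the last reference is dropped */
};

/* Take a reference; NULL is accepted and returned unchanged. */
static struct user_ns_sketch *get_user_ns_sketch(struct user_ns_sketch *ns)
{
    if (ns)
        ns->refcount++;
    return ns;
}

/* Drop a reference; NULL is a no-op. */
static void put_user_ns_sketch(struct user_ns_sketch *ns)
{
    if (ns && --ns->refcount == 0)
        ns->freed = 1;
}
```

Returning the pointer lets a caller write an assignment such as `new->ns = get_user_ns_sketch(old->ns);` without a separate NULL check.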
    • Audit: add TTY input auditing · 522ed776
      Miloslav Trmac authored
      Add TTY input auditing, used to audit system administrator's actions.  This is
      required by various security standards such as DCID 6/3 and PCI to provide
      non-repudiation of administrator's actions and to allow a review of past
      actions if the administrator seems to overstep their duties or if the system
      becomes misconfigured for unknown reasons.  These requirements do not make it
      necessary to audit TTY output as well.
      
      Compared to a user-space keylogger, this approach records TTY input using the
      audit subsystem, correlated with other audit events, and it is completely
      transparent to the user-space application (e.g.  the console ioctls still
      work).
      
      TTY input auditing works on a higher level than auditing all system calls
      within the session, which would produce an overwhelming amount of mostly
      useless audit events.
      
      Add an "audit_tty" attribute, inherited across fork().  Data read from TTYs
      by a process with the attribute is sent to the audit subsystem by the kernel.
      The audit netlink interface is extended to allow modifying the audit_tty
      attribute, and to allow sending explanatory audit events from user-space (for
      example, a shell might send an event containing the final command, after the
      interactive command-line editing and history expansion is performed, which
      might be difficult to decipher from the TTY input alone).
      
      Because the "audit_tty" attribute is inherited across fork(), it would be set
      e.g.  for sshd restarted within an audited session.  To prevent this, the
      audit_tty attribute is cleared when a process with no open TTY file
      descriptors (e.g.  after daemon startup) opens a TTY.
      
      See https://www.redhat.com/archives/linux-audit/2007-June/msg00000.html for a
      more detailed rationale document for an older version of this patch.
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: Miloslav Trmac <mitr@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Paul Fulghum <paulkf@microgate.com>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Cc: Steve Grubb <sgrubb@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
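The inheritance and clearing rules for the audit_tty attribute can be modelled as a small state machine. This is a user-space sketch of the rules as stated in the commit message only. `struct proc_sketch`, the open-TTY counter and the function names are assumptions, not the kernel's data structures.

```c
#include <assert.h>

struct proc_sketch {
    int audit_tty;   /* the inherited attribute */
    int open_ttys;   /* number of open TTY file descriptors */
};

/* fork(): the child inherits the attribute (and, here, the open descriptors). */
static struct proc_sketch fork_sketch(const struct proc_sketch *parent)
{
    struct proc_sketch child = *parent;
    return child;
}

/* Opening a TTY with no TTY currently open clears the attribute. */
static void tty_open_sketch(struct proc_sketch *p)
{
    if (p->open_ttys == 0)
        p->audit_tty = 0;
    p->open_ttys++;
}

static void tty_close_sketch(struct proc_sketch *p)
{
    if (p->open_ttys > 0)
        p->open_ttys--;
}

/* TTY reads are reported to the audit subsystem only while the attribute is set. */
static int tty_read_is_audited(const struct proc_sketch *p)
{
    return p->audit_tty;
}
```

In this model a daemon restarted from an audited shell starts out audited, closes its TTYs during startup, and drops the attribute the next time it opens one, while the shell itself stays audited.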
    • Use boot based time for process start time and boot time in /proc · 924b42d5
      Tomas Janousek authored
      Commit 411187fb caused boot time to move and
      process start times to become invalid after suspend.  Using boot based time
      for those restores the old behaviour and fixes the issue.
      
      [akpm@linux-foundation.org: little cleanup]
      Signed-off-by: Tomas Janousek <tjanouse@redhat.com>
      Cc: Tomas Smetana <tsmetana@redhat.com>
      Acked-by: John Stultz <johnstul@us.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
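A small arithmetic model shows how a reported boot time can "move" across a suspend. It assumes, for illustration, that boot time is derived as wall-clock time minus elapsed time since boot, and that the monotonic clock in question does not advance while the machine is suspended. The struct and helper names are invented for this sketch.

```c
#include <assert.h>

struct clocks_sketch {
    long wall;        /* seconds since the epoch */
    long monotonic;   /* seconds of uptime, not counting suspend (assumption) */
    long suspended;   /* total seconds spent suspended */
};

/* Boot based time: uptime including the time spent suspended. */
static long boot_based(const struct clocks_sketch *c)
{
    return c->monotonic + c->suspended;
}

static long btime_from_monotonic(const struct clocks_sketch *c)
{
    return c->wall - c->monotonic;
}

static long btime_from_boot_based(const struct clocks_sketch *c)
{
    return c->wall - boot_based(c);
}

static void run_for(struct clocks_sketch *c, long s)
{
    c->wall += s;
    c->monotonic += s;
}

static void suspend_for(struct clocks_sketch *c, long s)
{
    c->wall += s;
    c->suspended += s;
}
```

After a suspend, the monotonic-derived boot time shifts forward by the time spent asleep, while the boot-based value stays fixed. Process start times computed relative to boot behave the same way.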
  6. 10 Jul, 2007 19 commits
  7. 09 Jun, 2007 1 commit
    • pi-futex: fix exit races and locking problems · 778e9a9c
      Alexey Kuznetsov authored
      1. New entries can be added to tsk->pi_state_list after the task has completed
         exit_pi_state_list(). The result is memory leakage and deadlocks.
      
      2. handle_mm_fault() is called under spinlock. The result is obvious.
      
      3. Self-inflicted deadlock inside glibc.
         Sometimes futex_lock_pi() returns -ESRCH when it is not expected,
         and glibc enters a for(;;) sleep() loop to simulate deadlock. This problem
         is quite obvious and I think the patch is right. Though it looks like
         each "if" in futex_lock_pi() got some stupid special case "else if". :-)
      
      4. sometimes futex_lock_pi() returns -EDEADLK,
         when nobody has the lock. The reason is also obvious (see comment
         in the patch), but the correct fix is far beyond my comprehension.
         I guess someone already saw this, the chunk:
      
                              if (rt_mutex_trylock(&q.pi_state->pi_mutex))
                                      ret = 0;
      
         is obviously from the same opera. But it does not work, because the
         rtmutex is really taken at this point: wake_futex_pi() of previous
         owner reassigned it to us. My fix works. But it looks very stupid.
         I would think about removal of shift of ownership in wake_futex_pi()
         and making all the work in context of process taking lock.
      
      From: Thomas Gleixner <tglx@linutronix.de>
      
      Fix 1) Avoid the tasklist lock variant of the exit race fix by adding
          an additional state transition to the exit code.
      
          This also fixes the issue where a task with recursive segfaults
          is not able to release the futexes.
      
      Fix 2) Cleanup the lookup_pi_state() failure path and solve the -ESRCH
          problem finally.
      
      Fix 3) Solve the fixup_pi_state_owner() problem which needs to do the fixup
          in the lock protected section by using the in_atomic userspace access
          functions.
      
          This also removes the ugly lock drop / unqueue inside of fixup_pi_state()
      
      Fix 4) Fix a stale lock in the error path of futex_wake_pi()
      
      Added some error checks for verification.
      
      The -EDEADLK problem is solved by the rtmutex fixups.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
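Fix 1 speaks of "an additional state transition" in the exit code. One way to picture such a transition is a three-state model in which attaching to an exiting owner is refused, with a different result before and after the owner's PI state list has been cleaned up. The state names, the struct and the choice of -EAGAIN versus -ESRCH below are assumptions for illustration, not a transcription of the patch.

```c
#include <assert.h>
#include <errno.h>

enum exit_state_sketch {
    TASK_ALIVE,          /* not exiting */
    TASK_EXITING,        /* exiting, PI state list not yet cleaned up */
    TASK_EXIT_PI_DONE    /* the extra state: cleanup has completed */
};

struct exiting_task_sketch {
    enum exit_state_sketch state;
    int pi_state_entries;   /* stands in for the length of pi_state_list */
};

/* Try to add an entry to the owner's list. */
static int attach_pi_state_sketch(struct exiting_task_sketch *owner)
{
    switch (owner->state) {
    case TASK_ALIVE:
        owner->pi_state_entries++;
        return 0;
    case TASK_EXITING:
        return -EAGAIN;   /* cleanup still pending: the caller should retry */
    default:
        return -ESRCH;    /* cleanup done: the owner is effectively gone */
    }
}

/* Cleanup, followed by the transition into the final state. */
static void exit_pi_state_list_sketch(struct exiting_task_sketch *t)
{
    t->state = TASK_EXITING;
    t->pi_state_entries = 0;
    t->state = TASK_EXIT_PI_DONE;
}
```

In this model no entry can be added once cleanup has run, which is the leak described in item 1 of the original report.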
  8. 24 May, 2007 1 commit
    • recalc_sigpending_tsk fixes · 7bb44ade
      Roland McGrath authored
      Steve Hawkes discovered a problem where recalc_sigpending_tsk was called in
      do_sigaction but no signal_wake_up call was made, preventing later signals
      from waking up blocked threads with TIF_SIGPENDING already set.
      
      In fact, the few other calls to recalc_sigpending_tsk outside the signals
      code are also subject to this problem in other race conditions.
      
      This change makes recalc_sigpending_tsk private to the signals code.  It
      changes the outside calls, as well as do_sigaction, to use the new
      recalc_sigpending_and_wake instead.
      Signed-off-by: Roland McGrath <roland@redhat.com>
      Cc: <Steve.Hawkes@motorola.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
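The idea behind the new helper is to pair the recalculation with a wake-up, so that a caller cannot set the pending flag without also waking the task. A user-space sketch of that pairing follows. The bitmask fields, the `woken` marker and the `_sketch` names are stand-ins invented for this example.

```c
#include <assert.h>

struct sig_task_sketch {
    unsigned long pending;   /* bitmask of pending signals */
    unsigned long blocked;   /* bitmask of blocked signals */
    int sigpending_flag;     /* stands in for TIF_SIGPENDING */
    int woken;               /* set by the wake-up stand-in */
};

/* Recompute the flag; return 1 if an unblocked signal is pending. */
static int recalc_sigpending_tsk_sketch(struct sig_task_sketch *t)
{
    if (t->pending & ~t->blocked) {
        t->sigpending_flag = 1;
        return 1;
    }
    t->sigpending_flag = 0;
    return 0;
}

static void signal_wake_up_sketch(struct sig_task_sketch *t)
{
    t->woken = 1;
}

/* What outside callers use: recalculate, and wake the task if needed. */
static void recalc_sigpending_and_wake_sketch(struct sig_task_sketch *t)
{
    if (recalc_sigpending_tsk_sketch(t))
        signal_wake_up_sketch(t);
}
```

Keeping the bare recalculation private to the signals code, as the commit does, means the flag-without-wake-up combination cannot be produced from outside.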