1. 07 Dec, 2007 1 commit
  2. 06 Dec, 2007 2 commits
    • Avoid potential NULL dereference in unregister_sysctl_table · f1dad166
      Pavel Emelyanov committed
      register_sysctl_table() can return NULL sometimes, e.g.  when kmalloc()
      returns NULL or when sysctl check fails.
      
      I've also noticed that much (most?) of the code in the kernel doesn't check
      the return value from register_sysctl_table() and later simply calls
      unregister_sysctl_table() with a potentially NULL argument.
      
      This is unlikely on a common kernel configuration, but in case we're
      dealing with modules and/or fault-injection support, there's a slight
      possibility of an OOPS.
      
      Changing all the users to check the return code from registration does
      not look like a good solution: too much code does this, and a failure
      to register sysctl tables is not a good reason to abort module
      loading (in most cases).
      
      So I think we can just have this check in unregister_sysctl_table()
      to avoid accidental OOPSes (actually, unregister_sysctl_table()
      did exactly this before start_unregistering() appeared).
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
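The guard described in this commit can be illustrated with a small user-space model. This is only a sketch: `struct table_header`, `register_table()`, `unregister_table()` and `teardown_count` are hypothetical stand-ins, not the kernel's sysctl types or functions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a registered table header (not the kernel struct). */
struct table_header {
    int registered;
};

/* Counts real teardowns, purely so the effect of the guard is observable. */
static int teardown_count;

/* Simulated registration: returns NULL when `fail` is nonzero, mimicking
 * an allocation failure or a failed sysctl check. */
static struct table_header *register_table(int fail)
{
    struct table_header *h;

    if (fail)
        return NULL;
    h = malloc(sizeof(*h));
    if (h)
        h->registered = 1;
    return h;
}

/* The guard: a NULL header means nothing was registered, so return early
 * instead of dereferencing it. */
static void unregister_table(struct table_header *h)
{
    if (h == NULL)
        return;
    assert(h->registered);
    h->registered = 0;
    teardown_count++;
    free(h);
}
```

With the early return in place, a caller that never checked the registration result can still call `unregister_table()` unconditionally.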
    • fix clone(CLONE_NEWPID) · 5cd17569
      Eric W. Biederman committed
      Currently we are complicating the code in copy_process, the clone ABI, and
      (if we fix the bugs) sys_setsid itself, with an unnecessary open-coded
      version of sys_setsid.
      
      So just simplify everything and don't special case the session and pgrp of
      the initial process in a pid namespace.
      
      Having this special case actually presents to user space the classic linux
      startup conditions with session == pgrp == 0 for /sbin/init.
      
      We already handle sending signals to processes in a child pid namespace.
      
      We need to handle sending signals to processes in a parent pid namespace
      for cases like SIGCHLD and SIGIO.
      
      This makes nothing extra visible inside a pid namespace.  So this extra
      special case appears to have no redeeming merits.
      
      Further, removing this special case increases the flexibility of how we can
      use pid namespaces, by not requiring the initial process in a pid namespace
      to be a daemon.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 05 Dec, 2007 8 commits
    • futex: correctly return -EFAULT not -EINVAL · cde898fa
      Thomas Gleixner committed
      Return -EFAULT, not -EINVAL. Found by review.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • lockdep: in_range() fix · 54561783
      Oleg Nesterov committed
      Torsten Kaiser wrote:
      
      | static inline int in_range(const void *start, const void *addr, const void *end)
      | {
      |         return addr >= start && addr <= end;
      | }
      | This will return true if addr is in the range of start (inclusive)
      | to end (inclusive).
      |
      | But debug_check_no_locks_freed() seems to do:
      | const void *mem_to = mem_from + mem_len
      | -> mem_to is the last byte of the freed range, that fits in_range
      | lock_from = (void *)hlock->instance;
      | -> first byte of the lock
      | lock_to = (void *)(hlock->instance + 1);
      | -> first byte of the next lock, not last byte of the lock that is being checked!
      |
      | The test is:
      | if (!in_range(mem_from, lock_from, mem_to) &&
      |                                         !in_range(mem_from, lock_to, mem_to))
      |                         continue;
      | So it tests, if the first byte of the lock is in the range that is freed ->OK
      | And if the first byte of the *next* lock is in the range that is freed
      | -> Not OK.
      
      We can also simplify the in_range checks: we need only 2 comparisons, not 4.
      If the lock is not in the memory range, it must lie either to the left of
      the range or to the right.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
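The "left of the range or right of it" observation can be written as a two-comparison test on half-open ranges. The sketch below is illustrative only: the function name and signature are assumptions, and both pointers are assumed to point into the same buffer so that comparing them is well defined in standard C.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 when [lock_from, lock_from + lock_len) does not overlap
 * [mem_from, mem_from + mem_len), and 0 when the two ranges overlap.
 * Only two comparisons are needed: the lock ends at or before the start
 * of the region, or the region ends at or before the start of the lock. */
static int not_in_range(const char *mem_from, size_t mem_len,
                        const char *lock_from, size_t lock_len)
{
    assert(mem_from != NULL && lock_from != NULL);
    return lock_from + lock_len <= mem_from ||
           mem_from + mem_len <= lock_from;
}
```

Treating both ranges as half-open avoids the off-by-one described in the quoted report, where the first byte of the *next* lock was tested against an inclusive end.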
    • lockdep: fix debug_show_all_locks() · 85684873
      Ingo Molnar committed
      fix the oops that can be seen in:
      
         http://bugzilla.kernel.org/attachment.cgi?id=13828&action=view
      
      it is not safe to print the locks of running tasks.
      
      (even with this fix we have a small race - but this is a debug
       function after all.)
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    • sched: style cleanups · 41a2d6cf
      Ingo Molnar committed
      style cleanup of various changes that were done recently.
      
      no code changed:
      
            text    data     bss     dec     hex filename
           23680    2542      28   26250    668a sched.o.before
           23680    2542      28   26250    668a sched.o.after
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • futex: fix for futex_wait signal stack corruption · ce6bd420
      Steven Rostedt committed
      David Holmes found a bug in the -rt tree with respect to
      pthread_cond_timedwait. After trying his test program on the latest git
      from mainline, I found the bug was there too.  The bug his test program
      showed was this: if one did a "Ctrl-Z" on a process that was in
      pthread_cond_timedwait, and then did a "bg" on that process, it would
      return with "-ETIMEDOUT", but early. That is, the timer would go off
      early.
      
      Looking into this, I found the source of the problem. And it is a rather
      nasty bug at that.
      
      Here's the relevant code from kernel/futex.c: (not in order in the file)
      
      [...]
      asmlinkage long sys_futex(u32 __user *uaddr, int op, u32 val,
                                struct timespec __user *utime, u32 __user *uaddr2,
                                u32 val3)
      {
              struct timespec ts;
              ktime_t t, *tp = NULL;
              u32 val2 = 0;
              int cmd = op & FUTEX_CMD_MASK;
      
              if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) {
                      if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
                              return -EFAULT;
                      if (!timespec_valid(&ts))
                              return -EINVAL;
      
                      t = timespec_to_ktime(ts);
                      if (cmd == FUTEX_WAIT)
                              t = ktime_add(ktime_get(), t);
                      tp = &t;
              }
      [...]
              return do_futex(uaddr, op, val, tp, uaddr2, val2, val3);
      }
      
      [...]
      
      long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
                      u32 __user *uaddr2, u32 val2, u32 val3)
      {
              int ret;
              int cmd = op & FUTEX_CMD_MASK;
              struct rw_semaphore *fshared = NULL;
      
              if (!(op & FUTEX_PRIVATE_FLAG))
                      fshared = &current->mm->mmap_sem;
      
              switch (cmd) {
              case FUTEX_WAIT:
                      ret = futex_wait(uaddr, fshared, val, timeout);
      
      [...]
      
      static int futex_wait(u32 __user *uaddr, struct rw_semaphore *fshared,
                            u32 val, ktime_t *abs_time)
      {
      [...]
                      struct restart_block *restart;
                      restart = &current_thread_info()->restart_block;
                      restart->fn = futex_wait_restart;
                      restart->arg0 = (unsigned long)uaddr;
                      restart->arg1 = (unsigned long)val;
                      restart->arg2 = (unsigned long)abs_time;
                      restart->arg3 = 0;
                      if (fshared)
                              restart->arg3 |= ARG3_SHARED;
                      return -ERESTART_RESTARTBLOCK;
      [...]
      
      static long futex_wait_restart(struct restart_block *restart)
      {
              u32 __user *uaddr = (u32 __user *)restart->arg0;
              u32 val = (u32)restart->arg1;
              ktime_t *abs_time = (ktime_t *)restart->arg2;
              struct rw_semaphore *fshared = NULL;
      
              restart->fn = do_no_restart_syscall;
              if (restart->arg3 & ARG3_SHARED)
                      fshared = &current->mm->mmap_sem;
              return (long)futex_wait(uaddr, fshared, val, abs_time);
      }
      
      So when futex_wait is interrupted by a signal we break out of the
      hrtimer code and set up or return from signal. This code does not return
      back to userspace, so we set up a RESTARTBLOCK.  The bug here is that we
      save "abs_time", which is a pointer to the stack variable "ktime_t t"
      from sys_futex.
      
      This returns and unwinds the stack before we get to call our signal. On
      return from the signal we go to futex_wait_restart, where we update all
      the parameters for futex_wait and call it. But here we have a problem
      where abs_time is no longer valid.
      
      I verified this with print statements, and sure enough, what abs_time
      was set to ends up being garbage when we get to futex_wait_restart.
      
      My solution (with input from Linus Torvalds) was to add unions to the
      restart_block so that system calls can use the restart with their own
      specific parameters.  This way the futex code now saves the time as a
      64-bit value in the restart block instead of storing it on the stack.
      
      Note: I'm a bit nervous about adding "linux/types.h" and using u32 and u64
      in thread_info.h, when there's a #ifdef __KERNEL__ just below that.
      Not sure what that is there for.  If this turns out to be a problem, I've
      tested this using "unsigned int" for u32 and "unsigned long long" for
      u64 and it worked just the same. I'm using u32 and u64 just to be
      consistent with what the futex code uses.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    • [SYSCTL_CHECK]: Fix typo in KERN_SPARC_SCONS_PWROFF entry string. · 874a5f87
      David S. Miller committed
      Based upon a report by Mikael Pettersson.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sched: default to more agressive yield for SCHED_BATCH tasks · db292ca3
      Ingo Molnar committed
      Do a more aggressive yield for SCHED_BATCH tuned tasks: they are all
      about throughput anyway. This allows a gentler migration path for
      any apps that relied on a stronger yield.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: fix crash in sys_sched_rr_get_interval() · 77034937
      Ingo Molnar committed
      Luiz Fernando N. Capitulino reported that sched_rr_get_interval()
      crashes for SCHED_OTHER tasks that are on an idle runqueue.
      
      The fix is to return a 0 timeslice for tasks that are on an idle
      runqueue. (and which are not running, obviously)
      
      this also shrinks the code a bit:
      
         text    data     bss     dec     hex filename
        47903    3934     336   52173    cbcd sched.o.before
        47885    3934     336   52155    cbbb sched.o.after
      Reported-by: Luiz Fernando N. Capitulino <lcapitulino@mandriva.com.br>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  4. 04 Dec, 2007 1 commit
  5. 03 Dec, 2007 1 commit
    • sched: cpu accounting controller (V2) · d842de87
      Srivatsa Vaddagiri committed
      Commit cfb52856 removed a feature that is useful to us: a cpu accounting
      resource controller.  This feature would be useful if someone wants to
      group tasks only for accounting purposes and doesn't really want to
      exercise any control over their cpu consumption.
      
      The patch below reintroduces the feature. It is based on Paul Menage's
      original patch (Commit 62d0df64), with
      these differences:
      
              - Removed load average information. I felt it needs more thought (esp
                to deal with SMP and virtualized platforms) and can be added for
                2.6.25 after more discussions.
              - Convert group cpu usage to be nanosecond accurate (as rest of the cfs
                stats are) and invoke cpuacct_charge() from the respective scheduler
                classes
              - Make accounting scalable on SMP systems by splitting the usage
                counter to be per-cpu
              - Move the code from kernel/cpu_acct.c to kernel/sched.c (since the
                code is not big enough to warrant a new file and also this rightly
                needs to live inside the scheduler. Also things like accessing
                rq->lock while reading cpu usage becomes easier if the code lived in
                kernel/sched.c)
      
      The patch also modifies the cpu controller not to provide the same accounting
      information.
      Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      
       Tested the patches on top of 2.6.24-rc3. The patches work fine. Ran
       some simple tests like cpuspin (spin on the cpu), ran several tasks in
       the same group and timed them. Compared their time stamps with
       cpuacct.usage.
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
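Splitting the usage counter per CPU, as listed among the differences above, can be sketched as an array of nanosecond counters that are summed on read. The names here (`NR_CPUS_SKETCH`, `struct cpuacct_sketch`, `charge()`, `total_usage()`) are hypothetical, and the sketch leaves out the locking the commit message mentions (`rq->lock`).

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 4    /* illustrative CPU count, not a kernel constant */

/* One accounting group: a separate nanosecond counter per CPU, so that
 * charging on one CPU never writes to another CPU's counter. */
struct cpuacct_sketch {
    uint64_t usage_ns[NR_CPUS_SKETCH];
};

/* Charge delta_ns of execution time to the counter of the given CPU. */
static void charge(struct cpuacct_sketch *ca, int cpu, uint64_t delta_ns)
{
    assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
    ca->usage_ns[cpu] += delta_ns;
}

/* Reading the group's usage sums the per-CPU counters. */
static uint64_t total_usage(const struct cpuacct_sketch *ca)
{
    uint64_t sum = 0;
    int cpu;

    for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        sum += ca->usage_ns[cpu];
    return sum;
}
```

The write path stays cheap and local, while the cost of combining the counters is paid only when the total is read.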
  6. 30 Nov, 2007 4 commits
  7. 28 Nov, 2007 5 commits
  8. 27 Nov, 2007 5 commits
  9. 20 Nov, 2007 5 commits
  10. 19 Nov, 2007 1 commit
  11. 17 Nov, 2007 2 commits
    • ntp: fix typo that makes sync_cmos_clock erratic · fa6a1a55
      David P. Reed committed
      Fix a typo in ntp.c that has caused updating of the persistent (RTC)
      clock when synced to NTP to behave erratically.
      
      When debugging a freeze that arises on my AMD64 machines when I
      run the ntpd service, I added a number of printk's to monitor the
      sync_cmos_clock procedure.  I discovered that it was not syncing to
      cmos RTC every 11 minutes as documented, but instead would keep trying
      every second for hours at a time.  The reason turned out to be a typo
      in sync_cmos_clock, where it attempts to ensure that
      update_persistent_clock is called very close to 500 msec. after a 1
      second boundary (required by the PC RTC's spec). That typo referred to
      "xtime" in one spot, rather than "now", which is derived from "xtime"
      but not equal to it.  This makes the test erratic, creating a
      "coin-flip" that decides when update_persistent_clock is called - when
      it is called, which is rarely, it may be at any time during the one
      second period, rather than close to 500 msec, so the value written is
      needlessly incorrect, too.
      
      Signed-off-by: David P. Reed
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
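The "close to 500 msec after a second boundary" test can be sketched as a check on the nanosecond part of the current time. This is an illustration under assumptions: the function name, the `NSEC_PER_SEC_SKETCH` constant, and the tolerance of half a tick are not taken from the commit message, which only says that the check must use "now" rather than "xtime".

```c
#include <assert.h>
#include <stdlib.h>

#define NSEC_PER_SEC_SKETCH 1000000000L

/* Returns 1 when now_nsec (the sub-second part of the current time) lies
 * within half a tick of the 500 msec mark, 0 otherwise. The value passed
 * in must come from the same timestamp that will be written out; testing
 * a different variable makes the outcome effectively random. */
static int near_half_second(long now_nsec, long tick_nsec)
{
    assert(now_nsec >= 0 && now_nsec < NSEC_PER_SEC_SKETCH);
    return labs(now_nsec - NSEC_PER_SEC_SKETCH / 2) <= tick_nsec / 2;
}
```

With a 10 msec tick, for example, the window in this sketch spans 495 msec to 505 msec after the second boundary.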
    • x86: ignore the sys_getcpu() tcache parameter · 4307d1e5
      Ingo Molnar committed
      Don't use the vgetcpu tcache: it causes problems with migrating
      tasks, which see the old cache for up to a jiffy after the
      migration, further increasing the cost of the migration.
      
      In the worst case they see completely bogus information from
      the tcache, when a sys_getcpu() call "invalidated" the cache
      info by incrementing the jiffies _and_ the cpuid info in the
      cache and the following vdso_getcpu() call happens after
      vdso_jiffies have been incremented.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Ulrich Drepper <drepper@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  12. 16 Nov, 2007 5 commits
    • sched: reorder SCHED_FEAT_ bits · 9612633a
      Ingo Molnar committed
      reorder SCHED_FEAT_ bits so that the used ones come first. Makes
      tuning instructions easier.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: make sched_nr_latency static · 518b22e9
      Adrian Bunk committed
      sched_nr_latency can now become static.
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: remove activate_idle_task() · 94bc9a7b
      Dmitry Adamushko committed
      cpu_down() code is ok wrt sched_idle_next() placing the 'idle' task not
      at the beginning of the queue.
      
      So get rid of activate_idle_task() and make use of activate_task() instead.
      It is the same as activate_task(), except for the update_rq_clock(rq) call
      that is redundant.
      
      Code size goes down:
      
         text    data     bss     dec     hex filename
        47853    3934     336   52123    cb9b sched.o.before
        47828    3934     336   52098    cb82 sched.o.after
      Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: fix __set_task_cpu() SMP race · ce96b5ac
      Dmitry Adamushko committed
      Grant Wilson has reported rare SCHED_FAIR_USER crashes on his quad-core
      system, crashes that can only be explained by runqueue corruption.
      
      There is a narrow SMP race in __set_task_cpu(): after ->cpu is set to
      a new value, task_rq_lock(p, ...) can be successfully executed on another
      CPU. We must ensure that updates of per-task data have been completed by
      this moment.
      
      this bug has been hiding in the Linux scheduler for an eternity (we never
      had any explicit barrier for task->cpu in set_task_cpu() - so the bug was
      introduced in 2.5.1), but only became visible via set_task_cfs_rq() being
      accidentally put after the task->cpu update. It also probably needs a
      sufficiently out-of-order CPU to trigger.
      Reported-by: Grant Wilson <grant.wilson@zen.co.uk>
      Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
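The ordering requirement in this commit (per-task data must be complete before the new ->cpu value becomes visible) can be modelled in user space with C11 atomics. Note that this swaps in a different technique: the commit message speaks of an explicit barrier in the kernel, whereas the sketch below uses a release store paired with an acquire load. `struct task_sketch` and both functions are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical per-task state: cfs_rq_id stands in for the per-task
 * runqueue data, cpu is the field other CPUs use to find the task. */
struct task_sketch {
    int cfs_rq_id;
    atomic_int cpu;
};

/* Writer: update the per-task data first, then publish the new cpu with
 * release semantics so the earlier store cannot be reordered after it. */
static void set_task_cpu_sketch(struct task_sketch *p, int cpu)
{
    assert(cpu >= 0);
    p->cfs_rq_id = cpu;
    atomic_store_explicit(&p->cpu, cpu, memory_order_release);
}

/* Reader: the acquire load pairs with the release store above, so a
 * reader that observes the new cpu also observes the matching data. */
static int task_rq_sketch(struct task_sketch *p)
{
    int cpu = atomic_load_explicit(&p->cpu, memory_order_acquire);

    (void)cpu;
    return p->cfs_rq_id;
}
```

Writing the per-task data after the publishing store instead would reopen the window the commit describes, in which another CPU acts on the new ->cpu while the rest of the task state is stale.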
    • sched: fix SCHED_FIFO tasks & FAIR_GROUP_SCHED · dae51f56
      Oleg Nesterov committed
      Suppose that the SCHED_FIFO task does
      
      	switch_uid(new_user);
      
      Now, p->se.cfs_rq and p->se.parent both point into the old
      user_struct->tg because sched_move_task() doesn't call set_task_cfs_rq()
      for !fair_sched_class case.
      
      Suppose that old user_struct/task_group is freed/reused, and the task
      does
      
      	sched_setscheduler(SCHED_NORMAL);
      
      __setscheduler() sets fair_sched_class, but doesn't update
      ->se.cfs_rq/parent which point to the freed memory.
      
      This means that check_preempt_wakeup() doing
      
      		while (!is_same_group(se, pse)) {
      			se = parent_entity(se);
      			pse = parent_entity(pse);
      		}
      
      may OOPS in a similar way if rq->curr or p did something like above.
      
      Perhaps we need something like the patch below, note that
      __setscheduler() can't do set_task_cfs_rq().
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>