1. 05 Dec 2007, 3 commits
    • futex: fix for futex_wait signal stack corruption · ce6bd420
      Authored by Steven Rostedt
      David Holmes found a bug in the -rt tree with respect to
      pthread_cond_timedwait. After trying his test program on the latest git
      from mainline, I found the bug was there too. The bug his test program
      showed was that if one did a "Ctrl-Z" on a process that was in
      pthread_cond_timedwait, and then did a "bg" on that process, it would
      return -ETIMEDOUT, but early. That is, the timer would go off early.
      
      Looking into this, I found the source of the problem. And it is a rather
      nasty bug at that.
      
      Here's the relevant code from kernel/futex.c: (not in order in the file)
      
      [...]
      asmlinkage long sys_futex(u32 __user *uaddr, int op, u32 val,
                                struct timespec __user *utime, u32 __user *uaddr2,
                                u32 val3)
      {
              struct timespec ts;
              ktime_t t, *tp = NULL;
              u32 val2 = 0;
              int cmd = op & FUTEX_CMD_MASK;
      
              if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) {
                      if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
                              return -EFAULT;
                      if (!timespec_valid(&ts))
                              return -EINVAL;
      
                      t = timespec_to_ktime(ts);
                      if (cmd == FUTEX_WAIT)
                              t = ktime_add(ktime_get(), t);
                      tp = &t;
              }
      [...]
              return do_futex(uaddr, op, val, tp, uaddr2, val2, val3);
      }
      
      [...]
      
      long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
                      u32 __user *uaddr2, u32 val2, u32 val3)
      {
              int ret;
              int cmd = op & FUTEX_CMD_MASK;
              struct rw_semaphore *fshared = NULL;
      
              if (!(op & FUTEX_PRIVATE_FLAG))
                      fshared = &current->mm->mmap_sem;
      
              switch (cmd) {
              case FUTEX_WAIT:
                      ret = futex_wait(uaddr, fshared, val, timeout);
      
      [...]
      
      static int futex_wait(u32 __user *uaddr, struct rw_semaphore *fshared,
                            u32 val, ktime_t *abs_time)
      {
      [...]
                      struct restart_block *restart;
                      restart = &current_thread_info()->restart_block;
                      restart->fn = futex_wait_restart;
                      restart->arg0 = (unsigned long)uaddr;
                      restart->arg1 = (unsigned long)val;
                      restart->arg2 = (unsigned long)abs_time;
                      restart->arg3 = 0;
                      if (fshared)
                              restart->arg3 |= ARG3_SHARED;
                      return -ERESTART_RESTARTBLOCK;
      [...]
      
      static long futex_wait_restart(struct restart_block *restart)
      {
              u32 __user *uaddr = (u32 __user *)restart->arg0;
              u32 val = (u32)restart->arg1;
              ktime_t *abs_time = (ktime_t *)restart->arg2;
              struct rw_semaphore *fshared = NULL;
      
              restart->fn = do_no_restart_syscall;
              if (restart->arg3 & ARG3_SHARED)
                      fshared = &current->mm->mmap_sem;
              return (long)futex_wait(uaddr, fshared, val, abs_time);
      }
      
      So when futex_wait() is interrupted by a signal, we break out of the
      hrtimer code and set up the return from the signal. This code does not
      return directly to userspace, so we set up a RESTARTBLOCK. The bug here
      is that we save "abs_time", which is a pointer to the stack variable
      "ktime_t t" in sys_futex().
      
      The syscall returns and the stack unwinds before we get to call our
      signal handler. On return from the signal we go to futex_wait_restart(),
      where we restore all the parameters for futex_wait() and call it. But by
      then abs_time is no longer valid.
      
      I verified this with print statements, and sure enough, what abs_time
      was set to ends up being garbage when we get to futex_wait_restart.
      
      The solution (worked out with input from Linus Torvalds) was to add
      unions to the restart_block, so that system calls can use the restart
      mechanism with call-specific parameters. The futex code now saves the
      time as a 64-bit value in the restart block instead of keeping a pointer
      to the stack.
      
      Note: I'm a bit nervous about adding "linux/types.h" and using u32 and
      u64 in thread_info.h, when there's a #ifdef __KERNEL__ just below that;
      I'm not sure what it is there for. If this turns out to be a problem,
      I've tested this using "unsigned int" for u32 and "unsigned long long"
      for u64, and it worked just the same. I'm using u32 and u64 only to be
      consistent with what the futex code uses.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sched: default to more aggressive yield for SCHED_BATCH tasks · db292ca3
      Authored by Ingo Molnar
      Do a more aggressive yield for SCHED_BATCH tuned tasks: they are all
      about throughput anyway. This also allows a gentler migration path for
      any apps that relied on the stronger yield.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: fix crash in sys_sched_rr_get_interval() · 77034937
      Authored by Ingo Molnar
      Luiz Fernando N. Capitulino reported that sched_rr_get_interval()
      crashes for SCHED_OTHER tasks that are on an idle runqueue.
      
      The fix is to return a 0 timeslice for tasks that are on an idle
      runqueue (and which are not running, obviously).
      
      This also shrinks the code a bit:
      
         text    data     bss     dec     hex filename
        47903    3934     336   52173    cbcd sched.o.before
        47885    3934     336   52155    cbbb sched.o.after
      Reported-by: Luiz Fernando N. Capitulino <lcapitulino@mandriva.com.br>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  2. 04 Dec 2007, 1 commit
  3. 03 Dec 2007, 1 commit
    • sched: cpu accounting controller (V2) · d842de87
      Authored by Srivatsa Vaddagiri
      Commit cfb52856 removed a useful feature for us: a cpu accounting
      resource controller. This feature is useful if someone wants to group
      tasks only for accounting purposes and doesn't really want to exercise
      any control over their cpu consumption.
      
      The patch below reintroduces the feature. It is based on Paul Menage's
      original patch (Commit 62d0df64), with
      these differences:
      
              - Removed load average information. I felt it needs more thought
                (especially to deal with SMP and virtualized platforms) and
                can be added for 2.6.25 after more discussion.
              - Converted group cpu usage to be nanosecond accurate (as the
                rest of the cfs stats are) and invoke cpuacct_charge() from
                the respective scheduler classes.
              - Made accounting scalable on SMP systems by splitting the
                usage counter to be per-cpu.
              - Moved the code from kernel/cpu_acct.c to kernel/sched.c (the
                code is not big enough to warrant a new file, and it rightly
                needs to live inside the scheduler; things like accessing
                rq->lock while reading cpu usage also become easier if the
                code lives in kernel/sched.c).
      
      The patch also modifies the cpu controller not to provide the same accounting
      information.
      Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      
       Tested the patches on top of 2.6.24-rc3. The patches work fine. Ran
       some simple tests like cpuspin (spin on the cpu), ran several tasks in
       the same group and timed them. Compared their time stamps with
       cpuacct.usage.
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  4. 30 Nov 2007, 4 commits
  5. 28 Nov 2007, 5 commits
  6. 27 Nov 2007, 5 commits
  7. 20 Nov 2007, 5 commits
  8. 19 Nov 2007, 1 commit
  9. 17 Nov 2007, 2 commits
    • ntp: fix typo that makes sync_cmos_clock erratic · fa6a1a55
      Authored by David P. Reed
      Fix a typo in ntp.c that has caused updating of the persistent (RTC)
      clock when synced to NTP to behave erratically.
      
      When debugging a freeze that arises on my AMD64 machines when I
      run the ntpd service, I added a number of printk's to monitor the
      sync_cmos_clock procedure.  I discovered that it was not syncing to
      cmos RTC every 11 minutes as documented, but instead would keep trying
      every second for hours at a time. The reason turned out to be a typo in
      sync_cmos_clock, where it attempts to ensure that
      update_persistent_clock is called very close to 500 msec after a
      1-second boundary (required by the PC RTC's spec). The typo referred to
      "xtime" in one spot rather than "now", which is derived from "xtime"
      but not equal to it. This makes the test erratic, creating a
      "coin-flip" that decides when update_persistent_clock is called; when
      it is called, which is rarely, it may happen at any time during the
      one-second period rather than close to 500 msec, so the value written
      is needlessly incorrect, too.
      
      Signed-off-by: David P. Reed
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86: ignore the sys_getcpu() tcache parameter · 4307d1e5
      Authored by Ingo Molnar
      Don't use the vgetcpu tcache: it causes problems for migrating tasks,
      which see the old cache for up to a jiffy after the migration, further
      increasing the cost of the migration.
      
      In the worst case they see completely bogus information from the
      tcache, when a sys_getcpu() call "invalidated" the cache info by
      incrementing the jiffies _and_ the cpuid info in the cache, and the
      following vdso_getcpu() call happens after vdso_jiffies has been
      incremented.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Ulrich Drepper <drepper@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  10. 16 Nov 2007, 7 commits
    • sched: reorder SCHED_FEAT_ bits · 9612633a
      Authored by Ingo Molnar
      Reorder the SCHED_FEAT_ bits so that the used ones come first. This
      makes tuning instructions easier.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: make sched_nr_latency static · 518b22e9
      Authored by Adrian Bunk
      sched_nr_latency can now become static.
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: remove activate_idle_task() · 94bc9a7b
      Authored by Dmitry Adamushko
      The cpu_down() code is OK wrt sched_idle_next() placing the 'idle' task
      not at the beginning of the queue.
      
      So get rid of activate_idle_task() and use activate_task() instead. It
      was the same as activate_task(), except for an update_rq_clock(rq) call
      that is redundant.
      
      Code size goes down:
      
         text    data     bss     dec     hex filename
        47853    3934     336   52123    cb9b sched.o.before
        47828    3934     336   52098    cb82 sched.o.after
      Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: fix __set_task_cpu() SMP race · ce96b5ac
      Authored by Dmitry Adamushko
      Grant Wilson has reported rare SCHED_FAIR_USER crashes on his quad-core
      system; these crashes can only be explained by runqueue corruption.
      
      There is a narrow SMP race in __set_task_cpu(): after ->cpu is set to a
      new value, task_rq_lock(p, ...) can be successfully executed on another
      CPU. We must ensure that updates of per-task data have been completed
      by that moment.
      
      This bug has been hiding in the Linux scheduler for an eternity (we
      never had any explicit barrier for task->cpu in set_task_cpu(), so the
      bug was introduced in 2.5.1), but it only became visible via
      set_task_cfs_rq() being accidentally put after the task->cpu update. It
      also probably needs a sufficiently out-of-order CPU to trigger.
      Reported-by: Grant Wilson <grant.wilson@zen.co.uk>
      Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: fix SCHED_FIFO tasks & FAIR_GROUP_SCHED · dae51f56
      Authored by Oleg Nesterov
      Suppose that the SCHED_FIFO task does
      
      	switch_uid(new_user);
      
      Now, p->se.cfs_rq and p->se.parent both point into the old
      user_struct->tg because sched_move_task() doesn't call set_task_cfs_rq()
      for !fair_sched_class case.
      
      Suppose that old user_struct/task_group is freed/reused, and the task
      does
      
      	sched_setscheduler(SCHED_NORMAL);
      
      __setscheduler() sets fair_sched_class, but doesn't update
      ->se.cfs_rq/parent which point to the freed memory.
      
      This means that check_preempt_wakeup() doing
      
      		while (!is_same_group(se, pse)) {
      			se = parent_entity(se);
      			pse = parent_entity(pse);
      		}
      
      may OOPS in a similar way if rq->curr or p did something like above.
      
      Perhaps we need something like the patch below, note that
      __setscheduler() can't do set_task_cfs_rq().
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: fix accounting of interrupts during guest execution on s390 · 9778385d
      Authored by Christian Borntraeger
      Currently the scheduler checks for PF_VCPU to decide whether this
      timeslice has to be accounted as guest time. On s390, host interrupts
      are not disabled during guest execution, which causes these interrupts
      to be accounted as guest time if CONFIG_VIRT_CPU_ACCOUNTING is set. The
      solution is to check whether an interrupt triggered
      account_system_time(). As the tick is timer-interrupt based, we have to
      subtract hardirq_offset.
      
      I tested the patch on s390 with CONFIG_VIRT_CPU_ACCOUNTING and on
      x86_64. Seems to work.
      
      CC: Avi Kivity <avi@qumranet.com>
      CC: Laurent Vivier <Laurent.Vivier@bull.net>
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • wait_task_stopped: check p->exit_state instead of TASK_TRACED · a3474224
      Authored by Roland McGrath
      The original meaning of the old test (p->state > TASK_STOPPED) was "not
      dead", since it predates both TASK_TRACED and the state/exit_state
      split. It was a wrong correction in commit 14bf01bb to make this a test
      for TASK_TRACED instead. It should have been changed when TASK_TRACED
      was introduced, and again when exit_state was introduced.
      Signed-off-by: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Alexey Dobriyan <adobriyan@sw.ru>
      Cc: Kees Cook <kees@ubuntu.com>
      Acked-by: Scott James Remnant <scott@ubuntu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 15 Nov 2007, 6 commits