  1. 17 Nov 2007, 2 commits
    • ntp: fix typo that makes sync_cmos_clock erratic · fa6a1a55
      David P. Reed authored
      Fix a typo in ntp.c that has caused updating of the persistent (RTC)
      clock when synced to NTP to behave erratically.
      
      When debugging a freeze that arises on my AMD64 machines when I
      run the ntpd service, I added a number of printk's to monitor the
      sync_cmos_clock procedure.  I discovered that it was not syncing to
      cmos RTC every 11 minutes as documented, but instead would keep trying
      every second for hours at a time.  The reason turned out to be a typo
      in sync_cmos_clock, where it attempts to ensure that
      update_persistent_clock is called very close to 500 msec after a
      1-second boundary (as required by the PC RTC's spec). That typo
      referred to "xtime" in one spot, rather than "now", which is
      derived from "xtime" but not equal to it. This makes the test
      erratic, creating a "coin-flip" that decides when
      update_persistent_clock is called. When it is called, which is
      rarely, it may be at any time during the one-second period rather
      than close to 500 msec, so the value written is needlessly
      incorrect, too.
      
      Signed-off-by: David P. Reed
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      fa6a1a55
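
      A minimal sketch of the corrected test, assuming the surrounding
      sync_cmos_clock() code reads the current time into a local 'now'
      first; variable names follow the commit message, and the details
      are hedged rather than the exact upstream diff:

      	struct timespec now;
      	int fail = 1;

      	now = current_kernel_time();
      	/* Write the RTC only within half a tick of the 500 msec point
      	 * after a second boundary; the typo compared xtime.tv_nsec here. */
      	if (abs(now.tv_nsec - (NSEC_PER_SEC / 2)) <= tick_nsec / 2)
      		fail = update_persistent_clock(now);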
    • x86: ignore the sys_getcpu() tcache parameter · 4307d1e5
      Ingo Molnar authored
      don't use the vgetcpu tcache - it's causing problems for migrating
      tasks, which will see the old cached value for up to a jiffy after
      the migration, further increasing the cost of the migration.
      
      In the worst case they see completely bogus information from the
      tcache, when a sys_getcpu() call "invalidates" the cache info by
      incrementing the jiffies _and_ the cpuid info in the cache, and
      the following vdso_getcpu() call happens after vdso_jiffies has
      been incremented.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Ulrich Drepper <drepper@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      4307d1e5
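
      A hedged sketch of what ignoring the tcache parameter amounts to:
      sys_getcpu() keeps accepting the third argument for ABI
      compatibility but never reads or writes it (signature assumed from
      the 2.6.24-era kernel):

      	asmlinkage long sys_getcpu(unsigned __user *cpup,
      				   unsigned __user *nodep,
      				   struct getcpu_cache __user *unused)
      	{
      		int err = 0;
      		int cpu = raw_smp_processor_id();

      		/* the tcache argument is deliberately ignored */
      		if (cpup)
      			err |= put_user(cpu, cpup);
      		if (nodep)
      			err |= put_user(cpu_to_node(cpu), nodep);
      		return err ? -EFAULT : 0;
      	}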
  2. 16 Nov 2007, 7 commits
    • sched: reorder SCHED_FEAT_ bits · 9612633a
      Ingo Molnar authored
      reorder SCHED_FEAT_ bits so that the used ones come first. Makes
      tuning instructions easier.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9612633a
    • sched: make sched_nr_latency static · 518b22e9
      Adrian Bunk authored
      sched_nr_latency can now become static.
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      518b22e9
    • sched: remove activate_idle_task() · 94bc9a7b
      Dmitry Adamushko authored
      cpu_down() code is ok with sched_idle_next() placing the 'idle'
      task somewhere other than the beginning of the queue.
      
      So get rid of activate_idle_task() and use activate_task() instead.
      activate_idle_task() is identical to activate_task() apart from an
      update_rq_clock(rq) call, which is redundant here.
      
      Code size goes down:
      
         text    data     bss     dec     hex filename
        47853    3934     336   52123    cb9b sched.o.before
        47828    3934     336   52098    cb82 sched.o.after
      Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      94bc9a7b
    • sched: fix __set_task_cpu() SMP race · ce96b5ac
      Dmitry Adamushko authored
      Grant Wilson has reported rare SCHED_FAIR_USER crashes on his
      quad-core system; these crashes can only be explained by runqueue
      corruption.
      
      There is a narrow SMP race in __set_task_cpu(): after ->cpu is set
      to a new value, task_rq_lock(p, ...) can be successfully executed
      on another CPU. We must ensure that updates of per-task data have
      been completed by this moment.
      
      This bug has been hiding in the Linux scheduler for an eternity (we
      never had any explicit barrier for task->cpu in set_task_cpu() - so
      the bug was introduced in 2.5.1), but it only became visible via
      set_task_cfs_rq() being accidentally placed after the task->cpu
      update. It also probably needs a sufficiently out-of-order CPU to
      trigger.
      Reported-by: Grant Wilson <grant.wilson@zen.co.uk>
      Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      ce96b5ac
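
      A hedged sketch of the kind of fix this implies: finish all
      per-task updates, then issue a write barrier before publishing the
      new ->cpu value (helper names assumed from the scheduler of that
      era):

      	static inline void __set_task_cpu(struct task_struct *p,
      					  unsigned int cpu)
      	{
      		set_task_cfs_rq(p, cpu);
      	#ifdef CONFIG_SMP
      		/* make the updates above visible before task_rq_lock()
      		 * on another CPU can observe the new ->cpu value */
      		smp_wmb();
      		task_thread_info(p)->cpu = cpu;
      	#endif
      	}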
    • sched: fix SCHED_FIFO tasks & FAIR_GROUP_SCHED · dae51f56
      Oleg Nesterov authored
      Suppose that the SCHED_FIFO task does
      
      	switch_uid(new_user);
      
      Now, p->se.cfs_rq and p->se.parent both point into the old
      user_struct->tg, because sched_move_task() doesn't call
      set_task_cfs_rq() for the !fair_sched_class case.
      
      Suppose that old user_struct/task_group is freed/reused, and the task
      does
      
      	sched_setscheduler(SCHED_NORMAL);
      
      __setscheduler() sets fair_sched_class, but doesn't update
      ->se.cfs_rq/parent which point to the freed memory.
      
      This means that check_preempt_wakeup() doing
      
      		while (!is_same_group(se, pse)) {
      			se = parent_entity(se);
      			pse = parent_entity(pse);
      		}
      
      may OOPS in a similar way if rq->curr or p did something like above.
      
      Perhaps we need something like the patch below, note that
      __setscheduler() can't do set_task_cfs_rq().
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      dae51f56
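
      Since __setscheduler() can't do set_task_cfs_rq(), the natural fix
      is to refresh ->se.cfs_rq/parent unconditionally when a task
      changes groups. A hedged sketch, with helper names assumed from
      the 2.6.24-era scheduler:

      	void sched_move_task(struct task_struct *tsk)
      	{
      		int on_rq, running;
      		unsigned long flags;
      		struct rq *rq;

      		rq = task_rq_lock(tsk, &flags);
      		update_rq_clock(rq);

      		running = task_running(rq, tsk);
      		on_rq = tsk->se.on_rq;

      		if (on_rq) {
      			dequeue_task(rq, tsk, 0);
      			if (unlikely(running))
      				tsk->sched_class->put_prev_task(rq, tsk);
      		}

      		/* for every class, not only fair_sched_class, so a later
      		 * switch to SCHED_NORMAL never sees stale pointers */
      		set_task_cfs_rq(tsk, task_cpu(tsk));

      		if (on_rq) {
      			if (unlikely(running))
      				tsk->sched_class->set_curr_task(rq);
      			enqueue_task(rq, tsk, 0);
      		}

      		task_rq_unlock(rq, &flags);
      	}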
    • sched: fix accounting of interrupts during guest execution on s390 · 9778385d
      Christian Borntraeger authored
      Currently the scheduler checks for PF_VCPU to decide whether this
      timeslice has to be accounted as guest time. On s390, host
      interrupts are not disabled during guest execution, which causes
      these interrupts to be accounted as guest time if
      CONFIG_VIRT_CPU_ACCOUNTING is set. The solution is to check whether
      an interrupt triggered account_system_time(). As the tick is timer
      interrupt based, we have to subtract hardirq_offset.
      
      I tested the patch on s390 with CONFIG_VIRT_CPU_ACCOUNTING and on
      x86_64. Seems to work.
      
      CC: Avi Kivity <avi@qumranet.com>
      CC: Laurent Vivier <Laurent.Vivier@bull.net>
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9778385d
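
      A hedged sketch of the check in account_system_time(): only account
      guest time when no hard interrupt (beyond the tick itself, which
      hardirq_offset subtracts out) triggered the accounting:

      	/* PF_VCPU alone is not enough on s390, where host interrupts
      	 * stay enabled during guest execution */
      	if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
      		account_guest_time(p, cputime);
      		return;
      	}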
    • wait_task_stopped: Check p->exit_state instead of TASK_TRACED · a3474224
      Roland McGrath authored
      The original meaning of the old test (p->state > TASK_STOPPED) was
      "not dead", since it predates both TASK_TRACED and the
      state/exit_state split. Commit 14bf01bb wrongly changed this into a
      test for TASK_TRACED. The test should have been updated when
      TASK_TRACED was introduced and again when exit_state was
      introduced.
      Signed-off-by: Roland McGrath <roland@redhat.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Alexey Dobriyan <adobriyan@sw.ru>
      Cc: Kees Cook <kees@ubuntu.com>
      Acked-by: Scott James Remnant <scott@ubuntu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3474224
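
      A hedged before/after sketch of the test in wait_task_stopped();
      the surrounding control flow is an assumption. The point is that
      the condition means "the task is not dead yet", which exit_state
      expresses directly:

      	/* before: wrongly tied to the TASK_TRACED state bit */
      	if (unlikely(!exit_code) || unlikely(p->state & TASK_TRACED))
      		goto bail_ref;

      	/* after: "not dead yet" is what the test always meant */
      	if (unlikely(!exit_code) || unlikely(p->exit_state))
      		goto bail_ref;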
  3. 15 Nov 2007, 10 commits
  4. 14 Nov 2007, 1 commit
  5. 13 Nov 2007, 1 commit
  6. 10 Nov 2007, 16 commits
    • [FUTEX] Fix address computation in compat code. · 3c5fd9c7
      David Miller authored
      compat_exit_robust_list() computes a pointer to the
      futex entry in userspace as follows:
      
      	(void __user *)entry + futex_offset
      
      'entry' is a 'struct robust_list __user *', and
      'futex_offset' is a 'compat_long_t' (typically a 's32').
      
      Things explode if the 32-bit sign bit is set in futex_offset.
      
      Type promotion sign extends futex_offset to a 64-bit value before
      adding it to 'entry'.
      
      This triggered a problem on sparc64 running 32-bit applications which
      would lock up a cpu looping forever in the fault handling for the
      userspace load in handle_futex_death().
      
      Compat userspace runs with address masking (wherein the cpu zeros
      out the top 32 bits of every effective address given to a memory
      operation instruction), so the sparc64 fault handler accounts for
      this by zeroing out the top 32 bits of the fault address too.
      
      Since the kernel properly uses the compat_uptr interfaces, kernel
      side accesses to compat userspace work too, as they only use
      addresses with the top 32 bits clear.
      
      Because of this compat futex layer bug we get into the following loop
      when executing the get_user() load near the top of handle_futex_death():
      
      1) load from address '0xfffffffff7f16bd8', FAULT
      2) fault handler clears upper 32-bits, processes fault
         for address '0xf7f16bd8' which succeeds
      3) goto #1
      
      I want to thank Bernd Zeimetz, Josip Rodin, and Fabio Massimo Di Nitto
      for their tireless efforts helping me track down this bug.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c5fd9c7
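
      A hedged sketch of the corrected address computation: do the offset
      arithmetic in 32-bit compat space and only then convert to a user
      pointer, so a negative futex_offset can never sign-extend into the
      upper 32 bits:

      	static void __user *futex_uaddr(struct robust_list __user *entry,
      					compat_long_t futex_offset)
      	{
      		compat_uptr_t base = ptr_to_compat(entry);
      		void __user *uaddr = compat_ptr(base + futex_offset);

      		return uaddr;
      	}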
    • sched: proper prototype for kernel/sched.c:migration_init() · e6fe6649
      Adrian Bunk authored
      This patch adds a proper prototype for migration_init() in
      include/linux/sched.h.
      
      Since there's no point in always returning 0 to a caller that
      doesn't check the return value, it also changes the function to
      return void.
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      e6fe6649
    • sched: avoid large irq-latencies in smp-balancing · b82d9fdd
      Peter Zijlstra authored
      SMP balancing is done with IRQs disabled and can iterate over the
      full runqueue. When runqueues are large this can cause large
      irq-latencies. Limit the number of iterations on each run.
      
      This fixes a scheduling latency regression reported by the -rt folks.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Tested-by: Gregory Haskins <ghaskins@novell.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b82d9fdd
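
      A hedged sketch of the iteration cap inside the balancing loop;
      sysctl_sched_nr_migrate is the tunable this patch introduces, and
      the exact placement is an assumption:

      	/* in balance_tasks(): bail out after a bounded number of
      	 * iterations so the irqs-off section stays short on huge rqs */
      	if (loops++ > sysctl_sched_nr_migrate)
      		break;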
    • sched: fix copy_namespace() <-> sched_fork() dependency in do_fork · 3c90e6e9
      Srivatsa Vaddagiri authored
      Sukadev Bhattiprolu reported a kernel crash with control groups.
      There are a couple of problems uncovered by Suka's test:
      
      - The test requires the cgroup filesystem to be mounted with
        at least the cpu and ns options (i.e. both the namespace and cpu
        controllers active in the same hierarchy).
      
      	# mkdir /dev/cpuctl
      	# mount -t cgroup -ocpu,ns none cpuctl
      	(or simply)
      	# mount -t cgroup none cpuctl -> Will activate all controllers
      					 in same hierarchy.
      
      - The test invokes clone() with CLONE_NEWNS set. This causes a new
        child to be created, as well as a new group (do_fork->
        copy_namespaces->ns_cgroup_clone->cgroup_clone), and the child is
        attached to the new group (cgroup_clone->attach_task->
        sched_move_task). At this point in time, the child's scheduler-
        related fields are uninitialized (including its on_rq field,
        which it has inherited from the parent). As a result
        sched_move_task thinks the child is on a runqueue when it isn't.
      
        As a solution to this problem, I moved sched_fork() call, which
        initializes scheduler related fields on a new task, before
        copy_namespaces(). I am not sure though whether moving up will
        cause other side-effects. Do you see any issue?
      
      - The second problem exposed by this test is that task_new_fair()
        assumes that parent and child will be part of the same group
        (which needn't be the case, as this test shows). As a result,
        cfs_rq->curr can be NULL for the child.
      
        The solution is to test for curr pointer being NULL in
        task_new_fair().
      
      With the patch below, I could run ns_exec() fine w/o a crash.
      Reported-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      3c90e6e9
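
      A hedged sketch of the second fix: task_new_fair() checks
      cfs_rq->curr before use, since parent and child need not share a
      group (surrounding code assumed from the 2.6.24-era CFS):

      	struct sched_entity *se = &p->se, *curr = cfs_rq->curr;

      	update_curr(cfs_rq);
      	place_entity(cfs_rq, se, 1);

      	/* curr can be NULL: parent and child may be in different groups */
      	if (sysctl_sched_child_runs_first && curr &&
      	    curr->vruntime < se->vruntime)
      		swap(curr->vruntime, se->vruntime);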
    • sched: clean up the wakeup preempt check, #2 · 502d26b5
      Ingo Molnar authored
      clean up the preemption check to not use unnecessary 64-bit
      variables. This improves code size:
      
         text    data     bss     dec     hex filename
        44227    3326      36   47589    b9e5 sched.o.before
        44201    3326      36   47563    b9cb sched.o.after
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      502d26b5
    • sched: clean up the wakeup preempt check · 77d9cc44
      Ingo Molnar authored
      clean up the wakeup preemption check. No code changed:
      
         text    data     bss     dec     hex filename
        44227    3326      36   47589    b9e5 sched.o.before
        44227    3326      36   47589    b9e5 sched.o.after
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      77d9cc44
    • sched: wakeup preemption fix · 8bc6767a
      Ingo Molnar authored
      wakeup preemption fix: do not make it dependent on p->prio.
      Preemption purely depends on ->vruntime.
      
      This improves preemption in mixed-nice-level workloads.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8bc6767a
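
      A hedged sketch of the vruntime-only check in check_preempt_wakeup(),
      where se is the current task's entity and pse the woken task's
      (names and the calc_delta_fair() scaling are assumptions):

      	gran = sysctl_sched_wakeup_granularity;
      	if (unlikely(se->load.weight != NICE_0_LOAD))
      		gran = calc_delta_fair(gran, &se->load);

      	/* preempt purely on vruntime, never on p->prio */
      	if (pse->vruntime + gran < se->vruntime)
      		resched_task(curr);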
    • sched: remove PREEMPT_RESTRICT · 3e3e13f3
      Ingo Molnar authored
      remove PREEMPT_RESTRICT. (this is a separate commit so that any
      regression related to the removal itself is bisectable)
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      3e3e13f3
    • sched: turn off PREEMPT_RESTRICT · 52d3da1a
      Ingo Molnar authored
      PREEMPT_RESTRICT was a method aimed at reducing the amount of
      wakeup-related preemption. It has a disadvantage though: it can
      prevent legitimate wakeups if a task is 'unlucky' enough to be hit
      too early by a tick that clears peer_preempt.
      
      Now that the wakeup preemption has been cleaned up we don't seem to
      have excessive preemptions anymore, so this feature can be turned
      off (and removed in the next patch).
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      52d3da1a
    • sched: cleanup, use NSEC_PER_MSEC and NSEC_PER_SEC · d6322faf
      Eric Dumazet authored
      1) The hardcoded value 1000000000 is used five times in places
         where NSEC_PER_SEC would be more readable.
      
      2) A conversion from nsec to msec uses the hardcoded value 1000000,
         a natural candidate for NSEC_PER_MSEC.
      
      no code changed:
      
          text    data     bss     dec     hex filename
         44359    3326      36   47721    ba69 sched.o.before
         44359    3326      36   47721    ba69 sched.o.after
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d6322faf
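
      A hedged illustration of the substitution; the constants come from
      <linux/time.h>:

      	/* before: magic numbers */
      	do_div(nsecs, 1000000);		/* ns -> ms */
      	period = 1000000000;		/* one second in ns */

      	/* after: self-documenting */
      	do_div(nsecs, NSEC_PER_MSEC);
      	period = NSEC_PER_SEC;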
    • sched: reintroduce SMP tunings again · 19978ca6
      Ingo Molnar authored
      Yanmin Zhang reported an aim7 regression and bisected it down to:
      
       |  commit 38ad464d
       |  Author: Ingo Molnar <mingo@elte.hu>
       |  Date:   Mon Oct 15 17:00:02 2007 +0200
       |
       |     sched: uniform tunings
       |
       |     use the same defaults on both UP and SMP.
      
      fix this by reintroducing similar SMP tunings again. This resolves
      the regression.
      
      (also update the comments to match the ilog2(nr_cpus) tuning effect)
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      19978ca6
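
      A hedged sketch of the SMP scaling being reintroduced: the UP
      defaults are multiplied by a factor that grows logarithmically with
      the number of online CPUs (function and tunable names assumed):

      	static void __init sched_init_granularity(void)
      	{
      		unsigned int factor = 1 + ilog2(num_online_cpus());

      		sysctl_sched_latency *= factor;
      		sysctl_sched_min_granularity *= factor;
      		sysctl_sched_wakeup_granularity *= factor;
      	}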
    • sched: restore deterministic CPU accounting on powerpc · fa13a5a1
      Paul Mackerras authored
      Since powerpc started using CONFIG_GENERIC_CLOCKEVENTS, the
      deterministic CPU accounting (CONFIG_VIRT_CPU_ACCOUNTING) has been
      broken on powerpc, because we end up counting user time twice: once in
      timer_interrupt() and once in update_process_times().
      
      This fixes the problem by pulling the code in update_process_times
      that updates utime and stime into a separate function called
      account_process_tick.  If CONFIG_VIRT_CPU_ACCOUNTING is not defined,
      there is a version of account_process_tick in kernel/timer.c that
      simply accounts a whole tick to either utime or stime as before.  If
      CONFIG_VIRT_CPU_ACCOUNTING is defined, then arch code gets to
      implement account_process_tick.
      
      This also lets us simplify the s390 code a bit; it means that the s390
      timer interrupt can now call update_process_times even when
      CONFIG_VIRT_CPU_ACCOUNTING is turned on, and can just implement a
      suitable account_process_tick().
      
      account_process_tick() now takes the task_struct * as an argument.
      Tested both with and without CONFIG_VIRT_CPU_ACCOUNTING.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      fa13a5a1
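
      A hedged sketch of the generic fallback described above; when
      CONFIG_VIRT_CPU_ACCOUNTING is set, the architecture supplies
      account_process_tick() itself instead:

      	#ifndef CONFIG_VIRT_CPU_ACCOUNTING
      	void account_process_tick(struct task_struct *p, int user_tick)
      	{
      		cputime_t one_jiffy = jiffies_to_cputime(1);

      		/* account the whole tick to either utime or stime */
      		if (user_tick)
      			account_user_time(p, one_jiffy);
      		else
      			account_system_time(p, HARDIRQ_OFFSET, one_jiffy);
      	}
      	#endif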
    • sched: fix delay accounting regression · 9a41785c
      Balbir Singh authored
      Fix the delay accounting regression introduced by commit
      75d4ef16: rq no longer has sched_info data associated with it, and
      delay accounting uses the task_struct sched_info structure to
      report statistics back to user space.
      
      Also remove the direct use of sched_clock() (which is no longer a
      valid thing to do) and use rq->clock instead.
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9a41785c
    • sched: reintroduce the sched_min_granularity tunable · b2be5e96
      Peter Zijlstra authored
      we lost the sched_min_granularity tunable to a clever optimization
      that uses the sched_latency/min_granularity ratio - but the ratio
      is quite unintuitive to users and can also crash the kernel if the
      ratio is set to 0. So reintroduce the min_granularity tunable,
      while keeping the ratio maintained internally.
      
      no functionality changed.
      
      [ mingo@elte.hu: some fixlets. ]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b2be5e96
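
      A hedged sketch of how the internal ratio can be kept consistent
      whenever either tunable is written through its sysctl handler:

      	/* recompute after a write; never lets the ratio reach 0 */
      	sched_nr_latency = DIV_ROUND_UP(sysctl_sched_latency,
      					sysctl_sched_min_granularity);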
    • sched: documentation: place_entity() comments · 2cb8600e
      Peter Zijlstra authored
      Add a few comments to place_entity(). No code changed.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      2cb8600e
    • sched: fix vslice · 10b77724
      Peter Zijlstra authored
      vslice was missing a factor of NICE_0_LOAD, as weight is in
      weight*NICE_0_LOAD units.
      
      the effect of this bug was larger initial slices and
      thus latency-noisier forks.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      10b77724
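
      A hedged sketch of the corrected computation: because weights carry
      an implicit NICE_0_LOAD scale, vslice multiplies the period by
      NICE_0_LOAD before dividing by the weight:

      	static u64 __sched_vslice(unsigned long weight,
      				  unsigned long nr_running)
      	{
      		u64 vslice = __sched_period(nr_running);

      		/* weight is in weight * NICE_0_LOAD units */
      		vslice *= NICE_0_LOAD;
      		do_div(vslice, weight);

      		return vslice;
      	}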
  7. 06 Nov 2007, 2 commits
  8. 05 Nov 2007, 1 commit