1. 25 Sep, 2008 1 commit
  2. 23 Sep, 2008 8 commits
  3. 22 Sep, 2008 2 commits
    • sched: turn off WAKEUP_OVERLAP · f681bbd6
      Ingo Molnar committed
      WAKEUP_OVERLAP is not a winner on a 16way box, running psql+sysbench:
      
             .27-rc7-NO_WAKEUP_OVERLAP  .27-rc7-WAKEUP_OVERLAP
      -------------------------------------------------
          1:             694              811    +14.39%
          2:            1454             1427    -1.86%
          4:            3017             3070    +1.70%
          8:            5694             5808    +1.96%
         16:           10592            10612    +0.19%
         32:            9693             9647    -0.48%
         64:            8507             8262    -2.97%
        128:            8402             7087    -18.55%
        256:            8419             5124    -64.30%
        512:            7990             3671    -117.62%
      -------------------------------------------------
        SUM:           64466            55524    -16.11%
      
      ... so turn it off by default.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f681bbd6
    • sched: wakeup preempt when small overlap · 15afe09b
      Peter Zijlstra committed
      Lin Ming reported a 10% OLTP regression against 2.6.27-rc4.
      
      The difference seems to come from different preemption aggressiveness,
      which affects the cache footprint of the workload and its effective
      cache thrashing.
      
      Aggressively preempt a task if its avg overlap is very small; this should
      avoid the task going to sleep and find it still running when we schedule
      back to it, saving a wakeup.
      Reported-by: Lin Ming <ming.m.lin@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      15afe09b
  4. 17 Sep, 2008 1 commit
    • clockevents: make device shutdown robust · 2344abbc
      Thomas Gleixner committed
      The device shutdown does not clean up the next_event variable of the
      clock event device. So when the device is reactivated, the possibly
      stale next_event value can prevent the device from being reprogrammed,
      as it claims to be waiting on an event already.
      
      This is the root cause of the resurfacing suspend/resume problem,
      where systems need a key press to come back to life.
      
      Fix this by setting next_event to KTIME_MAX when the device is shut
      down. Use a separate function for shutdown which takes care of that
      and only keep the direct set mode call in the broadcast code, where we
      can not touch the next_event value.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      2344abbc
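The shutdown fix described above can be sketched in user-space C. This is a simplified model, not the kernel code: the struct, the `sketch_*` names, and the rule that arming is skipped when `next_event` already equals the requested expiry are all assumptions made for illustration. Only the idea of resetting `next_event` to `KTIME_MAX` on shutdown comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's KTIME_MAX; the name is illustrative. */
#define SKETCH_KTIME_MAX INT64_MAX

struct sketch_clock_dev {
	int64_t next_event;   /* expiry the device believes it is armed for */
	int programmed;       /* how often the "hardware" was really programmed */
};

/* Arm the device, unless it already claims to wait for this event. */
static int sketch_arm(struct sketch_clock_dev *dev, int64_t expires)
{
	assert(dev);
	if (dev->next_event == expires)
		return 0;                     /* looks armed already: nothing done */
	dev->next_event = expires;
	dev->programmed++;
	return 1;
}

/* Shut the device down and forget the stale expiry. */
static void sketch_shutdown(struct sketch_clock_dev *dev)
{
	assert(dev);
	dev->next_event = SKETCH_KTIME_MAX;
}
```

Without the reset in `sketch_shutdown()`, a second `sketch_arm()` call for the same expiry after a shutdown would be skipped in this model, which mirrors the "claims to wait on an event already" symptom.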
  5. 14 Sep, 2008 1 commit
  6. 11 Sep, 2008 2 commits
    • sched: fix deadlock in setting scheduler parameter to zero · ec5d4989
      Hiroshi Shimamoto committed
      Andrei Gusev wrote:
      
      > I played witch scheduler settings. After doing something like:
      > echo -n 1000000 >sched_rt_period_us
      >
      > command is locked. I found in kernel.log:
      >
      > Sep 11 00:39:34 zaratustra
      > Sep 11 00:39:34 zaratustra Pid: 4495, comm: bash Tainted: G        W
      > (2.6.26.3 #12)
      > Sep 11 00:39:34 zaratustra EIP: 0060:[<c0213fc7>] EFLAGS: 00210246 CPU: 0
      > Sep 11 00:39:34 zaratustra EIP is at div64_u64+0x57/0x80
      > Sep 11 00:39:34 zaratustra EAX: 0000389f EBX: 00000000 ECX: 00000000
      > EDX: 00000000
      > Sep 11 00:39:34 zaratustra ESI: d9800000 EDI: d9800000 EBP: 0000389f
      > ESP: ea7a6edc
      > Sep 11 00:39:34 zaratustra DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
      > Sep 11 00:39:34 zaratustra Process bash (pid: 4495, ti=ea7a6000
      > task=ea744000 task.ti=ea7a6000)
      > Sep 11 00:39:34 zaratustra Stack: 00000000 000003e8 d9800000 0000389f
      > c0119042 00000000 00000000 00000001
      > Sep 11 00:39:34 zaratustra 00000000 00000000 ea7a6f54 00010000 00000000
      > c04d2e80 00000001 000e7ef0
      > Sep 11 00:39:34 zaratustra c01191a3 00000000 00000000 ea7a6fa0 00000001
      > ffffffff c04d2e80 ea5b2480
      > Sep 11 00:39:34 zaratustra Call Trace:
      > Sep 11 00:39:34 zaratustra [<c0119042>] __rt_schedulable+0x52/0x130
      > Sep 11 00:39:34 zaratustra [<c01191a3>] sched_rt_handler+0x83/0x120
      > Sep 11 00:39:34 zaratustra [<c01a76a6>] proc_sys_call_handler+0xb6/0xd0
      > Sep 11 00:39:34 zaratustra [<c01a76c0>] proc_sys_write+0x0/0x20
      > Sep 11 00:39:34 zaratustra [<c01a76d9>] proc_sys_write+0x19/0x20
      > Sep 11 00:39:34 zaratustra [<c016cc68>] vfs_write+0xa8/0x140
      > Sep 11 00:39:34 zaratustra [<c016cdd1>] sys_write+0x41/0x80
      > Sep 11 00:39:34 zaratustra [<c0103051>] sysenter_past_esp+0x6a/0x91
      > Sep 11 00:39:34 zaratustra =======================
      > Sep 11 00:39:34 zaratustra Code: c8 41 0f ad f3 d3 ee f6 c1 20 0f 45 de
      > 31 f6 0f ad ef d3 ed f6 c1 20 0f 45 fd 0f 45 ee 31 c9 39 eb 89 fe 89 ea
      > 77 08 89 e8 31 d2 <f7> f3 89 c1 89 f0 8b 7c 24 08 f7 f3 8b 74 24 04 89
      > ca 8b 1c 24
      > Sep 11 00:39:34 zaratustra EIP: [<c0213fc7>] div64_u64+0x57/0x80 SS:ESP
      > 0068:ea7a6edc
      > Sep 11 00:39:34 zaratustra ---[ end trace 4eaa2a86a8e2da22 ]---
      
      Fix the boundary condition.
      
      sysctl_sched_rt_period=0 causes an exception in to_ratio().
      Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      ec5d4989
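A minimal sketch of the boundary check, assuming the ratio is computed as a fixed-point division of runtime by period. The `sketch_*` names and the 16-bit shift are illustrative assumptions, not the kernel's definitions; the point is only that a zero period has to be rejected before any division happens.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Share of each period usable as runtime, as a 16.16 fixed-point ratio. */
static uint64_t sketch_to_ratio(uint64_t period, uint64_t runtime)
{
	assert(period != 0);   /* callers must have rejected a zero period */
	return (runtime << 16) / period;
}

/* Validate a new period before storing it; zero would divide by zero later. */
static int sketch_set_rt_period(uint64_t *period_us, uint64_t new_period_us)
{
	if (new_period_us == 0)
		return -EINVAL;
	*period_us = new_period_us;
	return 0;
}
```

With this ordering, a write of 0 fails with `-EINVAL` and the previously stored period stays in effect.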
    • sched: fix 2.6.27-rc5 couldn't boot on tulsa machine randomly · baf25731
      Zhang, Yanmin committed
      On my tulsa x86-64 machine, kernel 2.6.27-rc5 randomly fails to boot.
      
      Basically, function __enable_runtime forgets to reset rt_rq->rt_throttled
      to 0. When every cpu is up, the per-cpu migration_thread is created and it
      runs very fast, sometimes marking the corresponding rt_rq->rt_throttled as 1
      very quickly. After all cpus are up, with the calling chain below:
      
         sched_init_smp => arch_init_sched_domains => build_sched_domains => ...
      => cpu_attach_domain => rq_attach_root => set_rq_online => ...
      => __enable_runtime
      
      __enable_runtime is called against every rt_rq again, so rt_rq->rt_time is
      reset to 0, but rt_rq->rt_throttled might still be 1. Later on, function
      do_sched_rt_period_timer can't reset it, and no RT task can be scheduled
      to run on that cpu. One such RT task is migration_thread, which is
      woken up when a task is migrated to another cpu.
      
      The patch below fixes it against 2.6.27-rc5.
      Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      baf25731
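The missing reset can be illustrated with a small user-space model. The field names follow the commit message (`rt_time`, `rt_throttled`), while the struct layout, the `rt_runtime` field, and the `sketch_*` functions are assumptions for the sake of the example.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_rt_rq {
	uint64_t rt_time;      /* RT time consumed in the current period */
	uint64_t rt_runtime;   /* RT time allowed per period */
	int rt_throttled;      /* non-zero: RT tasks may not run */
};

/* Re-enable runtime accounting for one runqueue. */
static void sketch_enable_runtime(struct sketch_rt_rq *rt_rq,
				  uint64_t global_runtime)
{
	assert(rt_rq);
	rt_rq->rt_runtime = global_runtime;
	rt_rq->rt_time = 0;
	rt_rq->rt_throttled = 0;   /* the reset the commit message says was missing */
}

/* RT tasks can be picked only while the runqueue is not throttled. */
static int sketch_can_run_rt(const struct sketch_rt_rq *rt_rq)
{
	return !rt_rq->rt_throttled;
}
```

If the last assignment in `sketch_enable_runtime()` is dropped, a runqueue that was throttled before the call stays throttled even though its consumed time is back at zero.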
  7. 10 Sep, 2008 2 commits
    • clockevents: remove WARN_ON which was used to gather information · e75b986a
      Thomas Gleixner committed
      The issue of the endless reprogramming loop due to a too small
      min_delta_ns was fixed with the previous updates of the clock events
      code, but we had no information about the spread of this problem. I
      added a WARN_ON to get automated information via kerneloops.org and to
      get some direct reports, which allowed me to analyse the affected
      machines.
      
      The WARN_ON has served its purpose and would be annoying for a release
      kernel. Remove it and just keep the information about the increase of
      the min_delta_ns value.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      e75b986a
    • clockevents: remove WARN_ON which was used to gather information · 61c22c34
      Thomas Gleixner committed
      The issue of the endless reprogramming loop due to a too small
      min_delta_ns was fixed with the previous updates of the clock events
      code, but we had no information about the spread of this problem. I
      added a WARN_ON to get automated information via kerneloops.org and to
      get some direct reports, which allowed me to analyse the affected
      machines.
      
      The WARN_ON has served its purpose and would be annoying for a release
      kernel. Remove it and just keep the information about the increase of
      the min_delta_ns value.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      61c22c34
  8. 07 Sep, 2008 2 commits
    • sched: arch_reinit_sched_domains() must destroy domains to force rebuild · dfb512ec
      Max Krasnyansky committed
      What I realized recently is that calling rebuild_sched_domains() in
      arch_reinit_sched_domains() by itself is not enough when cpusets are enabled.
      partition_sched_domains() code is trying to avoid unnecessary domain rebuilds
      and will not actually rebuild anything if new domain masks match the old ones.
      
      What this means is that doing
           echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
      on a system with cpusets enabled will not take effect until something changes
      in the cpuset setup (i.e. new sets are created or deleted).
      
      This patch restores the correct behaviour, where domains must be rebuilt in
      order to enable MC powersaving flags.
      
      Tested on a quad-core Core2 box with both CONFIG_CPUSETS and !CONFIG_CPUSETS.
      Also tested on a dual-core Core2 laptop. Lockdep is happy and things are working
      as expected.
      Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
      Tested-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      dfb512ec
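The "destroy first, then rebuild" idea can be sketched as follows. This is a toy model under stated assumptions: a single bitmask stands in for the domain masks, and the `sketch_*` names are hypothetical. It only shows why a rebuild that compares old and new masks has to be preceded by a destroy step when the masks are unchanged.

```c
#include <assert.h>

struct sketch_domains {
	unsigned long mask;   /* cpus covered by the current domain */
	int built;            /* non-zero once a domain exists */
	int rebuilds;         /* how many real rebuilds happened */
};

/* Rebuild only if something changed, as an optimisation. */
static void sketch_partition(struct sketch_domains *d, unsigned long new_mask)
{
	assert(d);
	if (d->built && d->mask == new_mask)
		return;               /* masks match: nothing is rebuilt */
	d->mask = new_mask;
	d->built = 1;
	d->rebuilds++;
}

/* Force a rebuild with unchanged masks by destroying the domain first. */
static void sketch_reinit(struct sketch_domains *d)
{
	unsigned long mask;

	assert(d);
	mask = d->mask;
	d->built = 0;                 /* destroy, so the comparison cannot match */
	sketch_partition(d, mask);
}
```

Calling `sketch_partition()` alone with the same mask leaves `rebuilds` untouched, while `sketch_reinit()` always increments it.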
    • kernel/cpu.c: Move the CPU_DYING notifiers · 3ba35573
      Manfred Spraul committed
      When a cpu is taken offline, the CPU_DYING notifiers are called on the
      dying cpu. According to <linux/notifiers.h>, the cpu should be "not
      running any task, not handling interrupts, soon dead".
      
      For the current implementation, this is not true:
      - __cpu_disable can fail. If it fails, then the cpu will remain alive
        and happy.
      - At least on x86, __cpu_disable() briefly enables the local interrupts
        to handle any outstanding interrupts.
      
      What about moving CPU_DYING down a few lines, behind the __cpu_disable()
      line?
      There are only two CPU_DYING handlers in the kernel right now: one in
      kvm, one in the scheduler. Both should work with the patch applied
      [and: I'm not sure if either one handles a failing __cpu_disable()]
      
      The patch survives simply offlining a cpu. kvm is untested due to lack
      of a test setup.
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      3ba35573
  9. 06 Sep, 2008 6 commits
    • sched: fix __load_balance_iterator() for cfq with only one task · 38736f47
      Gautham R Shenoy committed
      The __load_balance_iterator() returns a NULL when there's only one
      sched_entity which is a task. It is caused by the following code-path.
      
      	/* Skip over entities that are not tasks */
      	do {
      		se = list_entry(next, struct sched_entity, group_node);
      		next = next->next;
      	} while (next != &cfs_rq->tasks && !entity_is_task(se));
      
      	if (next == &cfs_rq->tasks)
      		return NULL;
      	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            This will return NULL even when se is a task.
      
      As a side-effect, there was a regression in sched_mc behavior since 2.6.25,
      since iter_move_one_task() when it calls load_balance_start_fair(),
      would not get any tasks to move!
      
      Fix this by checking if the last entity was a task or not.
      Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      38736f47
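The fix described above ("checking if the last entity was a task or not") can be sketched on a plain circular list. The struct and function names are hypothetical stand-ins for `sched_entity` and the iterator, and the list handling is simplified; only the shape of the loop and the corrected exit check follow the commit message.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_entity {
	struct sketch_entity *next;   /* circular list; head is a sentinel */
	int is_task;                  /* 0 for a group entity */
};

/* Return the first task at or after start, or NULL if there is none. */
static struct sketch_entity *
sketch_next_task(struct sketch_entity *head, struct sketch_entity *start)
{
	struct sketch_entity *se, *next = start;

	assert(head && start);
	if (next == head)
		return NULL;              /* empty list */

	/* Skip over entities that are not tasks */
	do {
		se = next;
		next = next->next;
	} while (next != head && !se->is_task);

	/* Buggy form: "if (next == head) return NULL;" drops a lone task. */
	if (next == head && !se->is_task)
		return NULL;

	return se;
}
```

With a list that holds exactly one task, the loop ends with `next == head` while `se` is that task, so the extra `!se->is_task` condition is what keeps it from being discarded.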
    • ntp: fix calculation of the next jiffie to trigger RTC sync · 4ff4b9e1
      Maciej W. Rozycki committed
      We have a bug in the calculation of the next jiffie to trigger the RTC
      synchronisation.  The aim here is to run sync_cmos_clock() as close as
      possible to the middle of a second.  Which means we want this function to
      be called less than or equal to half a jiffie away from when now.tv_nsec
      equals 5e8 (500000000).
      
      If this is not the case for a given call to the function, for this purpose
      instead of updating the RTC we calculate the offset in nanoseconds to the
      next point in time where now.tv_nsec will be equal 5e8.  The calculated
      offset is then converted to jiffies as these are the unit used by the
      timer.
      
      However, timespec_to_jiffies() used here uses a ceil()-type rounding mode,
      where the resulting value is rounded up.  As a result the range of
      now.tv_nsec when the timer will trigger is from 5e8 to 5e8 + TICK_NSEC
      rather than the desired 5e8 - TICK_NSEC / 2 to 5e8 + TICK_NSEC / 2.
      
      As a result, if for example sync_cmos_clock() happens to be called at the
      time when now.tv_nsec is between 5e8 + TICK_NSEC / 2 and 5e8 +
      TICK_NSEC, it will simply be rescheduled HZ jiffies later, falling in the
      same range of now.tv_nsec again.  Similarly for cases offset by an
      integer multiple of TICK_NSEC.
      
      This change addresses the problem by subtracting TICK_NSEC / 2 from the
      nanosecond offset to the next point in time where now.tv_nsec will be
      equal 5e8, effectively shifting the following rounding in
      timespec_to_jiffies() so that it produces a rounded-to-nearest result.
      Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      4ff4b9e1
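The rounding argument can be checked with a little arithmetic. The sketch below assumes HZ = 100 (so one tick is 10 ms), models the ceil()-type conversion with integer division, and assumes the timer fires exactly on the computed tick; the `sketch_*` names are hypothetical. It compares the delay with and without the half-tick subtraction.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000LL
#define SKETCH_TICK_NSEC    10000000LL   /* assumes HZ = 100 */

/* Convert nanoseconds to ticks, rounding up (ceil-type rounding). */
static int64_t sketch_ceil_jiffies(int64_t nsec)
{
	assert(nsec > 0);
	return (nsec + SKETCH_TICK_NSEC - 1) / SKETCH_TICK_NSEC;
}

/* Ticks to wait until tv_nsec is next near 5e8. */
static int64_t sketch_delay_jiffies(int64_t now_nsec, int shift_half_tick)
{
	int64_t next = SKETCH_NSEC_PER_SEC / 2 - now_nsec;

	if (shift_half_tick)
		next -= SKETCH_TICK_NSEC / 2;   /* turns ceil into round-to-nearest */
	if (next <= 0)
		next += SKETCH_NSEC_PER_SEC;
	return sketch_ceil_jiffies(next);
}

/* tv_nsec at which the timer would fire. */
static int64_t sketch_fire_nsec(int64_t now_nsec, int shift_half_tick)
{
	int64_t delay = sketch_delay_jiffies(now_nsec, shift_half_tick);

	return (now_nsec + delay * SKETCH_TICK_NSEC) % SKETCH_NSEC_PER_SEC;
}
```

For a call at `now_nsec = 507000000` (7 ms past the half second), the unshifted version waits 100 ticks and lands on 507000000 again, whereas the shifted version waits 99 ticks and lands on 497000000, which is within half a tick of 5e8.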
    • sched: compilation fix with gcc 3.4.6 · c8bfff6d
      Krzysztof Helt committed
      I found that 2.6.27-rc5-mm1 does not compile with gcc 3.4.6.
      The error is:
        CC      kernel/sched.o
      kernel/sched.c: In function `start_rt_bandwidth':
      kernel/sched.c:208: sorry, unimplemented: inlining failed in call to 'rt_bandwidth_enabled': function body not available
      kernel/sched.c:214: sorry, unimplemented: called from here
      make[1]: *** [kernel/sched.o] Error 1
      make: *** [kernel] Error 2
      
      It seems that gcc 3.4.6 requires the full inline definition before first use.
      The patch below fixes the compilation problem.
      
      Signed-off-by: Krzysztof Helt <krzysztof.h1@wp.pl> (if needed>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      c8bfff6d
    • clockevents: broadcast fixup possible waiters · 7300711e
      Thomas Gleixner committed
      Until the C1E patches arrived there were no users of periodic broadcast
      before switching to oneshot mode. Now we need to trigger a possible
      waiter for a periodic broadcast when switching to oneshot mode.
      Otherwise we can starve them forever.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      7300711e
    • sched: fix process time monotonicity · 49048622
      Balbir Singh committed
      Spencer reported a problem where utime and stime were going negative despite
      the fixes in commit b27f03d4. The suspected
      reason for the problem is that signal_struct maintains its own utime and
      stime (of exited tasks), which are not updated using the new task_utime()
      routine; hence sig->utime can go backwards and cause the same problem
      to occur (sig->utime adds tsk->utime and not task_utime()). This patch
      fixes the problem.
      
      TODO: using max(task->prev_utime, derived utime) works for now, but a more
      generic solution is to implement cputime_max() and use the cputime_gt()
      function for comparison.
      
      Reported-by: spencer@bluehost.com
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      49048622
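The `max(task->prev_utime, derived utime)` idea from the TODO can be sketched like this. The struct and the `sketch_*` names are hypothetical, and plain 64-bit integers stand in for the kernel's cputime type; the sketch only shows how clamping against the previously reported value keeps both the per-task and the accumulated value from going backwards.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_task {
	uint64_t prev_utime;   /* largest utime reported so far */
};

/* Report utime, never smaller than any earlier report. */
static uint64_t sketch_task_utime(struct sketch_task *t, uint64_t derived)
{
	assert(t);
	if (derived > t->prev_utime)
		t->prev_utime = derived;
	return t->prev_utime;
}

/* Fold an exiting task into the group total via the clamped value. */
static void sketch_account_exit(uint64_t *sig_utime, struct sketch_task *t,
				uint64_t derived)
{
	*sig_utime += sketch_task_utime(t, derived);
}
```

If `sketch_account_exit()` added the raw derived value instead, the group total could end up below what had already been reported for the task.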
    • sched_clock: fix NOHZ interaction · 56c7426b
      Peter Zijlstra committed
      If HLT stops the TSC, we'll fail to account idle time, thereby inflating the
      actual process times. Fix this by re-calibrating the clock against GTOD when
      leaving nohz mode.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Tested-by: Avi Kivity <avi@qumranet.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      56c7426b
  10. 05 Sep, 2008 6 commits
    • clockevents: prevent endless loop lockup · 1fb9b7d2
      Thomas Gleixner committed
      The C1E/HPET bug reports on AMDX2/RS690 systems were tracked down to a
      too small value of the HPET minimum delta for programming an event.
      
      The clockevents code needs to enforce an interrupt event on the clock event
      device in some cases. The enforcement code was stupid and naive, as it just
      added the minimum delta to the current time and tried to reprogram the device.
      When the minimum delta is too small, then this loops forever.
      
      Add a sanity check. Allow reprogramming to fail 3 times, then print a warning
      and double the minimum delta value to make sure that this does not happen again.
      Use the same function for both tick-oneshot and tick-broadcast code.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      1fb9b7d2
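The sanity check described above (three failed attempts, then double the minimum delta) can be modelled in a few lines. Everything named `sketch_*` is hypothetical, and the "hardware" is reduced to a latency threshold below which programming fails; the real code also prints a warning at the point marked in the comment.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_hpet_dev {
	int64_t min_delta_ns;    /* smallest delta the code will ask for */
	int64_t hw_latency_ns;   /* deltas below this are missed by the hardware */
};

/* Fails (like -ETIME) when the delta is too short for the hardware. */
static int sketch_try_program(const struct sketch_hpet_dev *dev,
			      int64_t delta_ns)
{
	return delta_ns >= dev->hw_latency_ns ? 0 : -1;
}

/* Force an event; returns how often min_delta_ns had to be doubled. */
static int sketch_force_program(struct sketch_hpet_dev *dev)
{
	int failures = 0, doublings = 0;

	assert(dev && dev->min_delta_ns > 0);
	for (;;) {
		if (sketch_try_program(dev, dev->min_delta_ns) == 0)
			return doublings;
		if (++failures >= 3) {
			/* the kernel prints a warning at this point */
			dev->min_delta_ns *= 2;
			doublings++;
			failures = 0;
		}
	}
}
```

Because the delta grows geometrically, the loop terminates for any positive starting value, which is the property the original retry loop lacked.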
    • clockevents: prevent multiple init/shutdown · 9c17bcda
      Thomas Gleixner committed
      While chasing the C1E/HPET bug reports I went through the clock events
      code inch by inch and found that the broadcast device can be initialized
      and shut down multiple times. Multiple shutdowns are not critical, but
      a useless waste of time. Multiple initializations are simply broken. Another
      CPU might have the device in use already after the first initialization and
      the second init could just render it unusable again.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9c17bcda
    • clockevents: enforce reprogram in oneshot setup · 7205656a
      Thomas Gleixner committed
      In tick_setup_oneshot() we program the device to the given next_event,
      but we do not check the return value. We need to make sure that the
      device is forcibly programmed so the interrupt handler engine starts
      working. Split out the reprogramming function from tick_program_event()
      and call it with the device, which was handed in to tick_setup_oneshot().
      Set the force argument, so the device fires an interrupt.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      7205656a
    • clockevents: prevent endless loop in periodic broadcast handler · d4496b39
      Thomas Gleixner committed
      The reprogramming of the periodic broadcast handler was broken
      when the first programming returned -ETIME. The clockevents code
      stores the new expiry value in the clock events device next_event field
      only when the programming time has not elapsed yet. The loop in
      question calculates the new expiry value from the next_event value,
      which therefore never increases.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d4496b39
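One way to avoid the stuck loop is to carry the candidate expiry in a local variable instead of re-reading `next_event`, which is only updated on success. The sketch below is a model under assumptions: time is a fixed `now` value, and the `sketch_*` names are hypothetical rather than the kernel's functions.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_bc_dev {
	int64_t next_event;   /* updated only when programming succeeds */
};

/* Fails (like -ETIME) if the expiry has already passed. */
static int sketch_bc_program(struct sketch_bc_dev *dev, int64_t expires,
			     int64_t now)
{
	if (expires <= now)
		return -1;
	dev->next_event = expires;
	return 0;
}

/* Re-arm the periodic event, advancing a local expiry on every pass. */
static int64_t sketch_bc_rearm(struct sketch_bc_dev *dev, int64_t now,
			       int64_t period)
{
	int64_t next;

	assert(dev && period > 0);
	next = dev->next_event;
	for (;;) {
		next += period;   /* grows even when programming fails */
		if (sketch_bc_program(dev, next, now) == 0)
			return next;
	}
}
```

Recomputing `dev->next_event + period` inside the loop instead would retry the same expired value forever, since `next_event` does not change on failure.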
    • clockevents: prevent clockevent event_handler ending up handler_noop · 7c1e7689
      Venkatesh Pallipadi committed
      There is an ordering-related problem with the clockevents code, due to which
      clockevents_register_device() called after the tickless/highres switch
      will not work. The new clockevent ends up with clockevents_handle_noop as
      event handler, resulting in no timer activity.
      
      The problematic path seems to be
      
      * old device already has hrtimer_interrupt as the event_handler
      * new clockevent device registers with a higher rating
      * tick_check_new_device() is called
        * clockevents_exchange_device() gets called
          * old->event_handler is set to clockevents_handle_noop
        * tick_setup_device() is called for the new device
          * which sets new->event_handler using the old->event_handler which is noop.
      
      Change the ordering so that new device inherits the proper handler.
      
      This does not cause any issue in the normal case, as most likely all the clockevent
      devices are set up before the highres switch. But it can potentially affect
      some corner cases where HPET force detect happens after the highres switch.
      This was a problem with the HPET in MSI mode code that we have been experimenting
      with.
      Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      7c1e7689
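The ordering problem in the bullet list above can be reproduced in a small model. The `sketch_*` names are hypothetical, and capturing the old handler before the exchange is just one possible reordering, shown here to illustrate the principle; it is not claimed to be the exact change in the patch.

```c
#include <assert.h>

typedef void (*sketch_handler_t)(void);

static void sketch_handle_noop(void) { }
static void sketch_hrtimer_interrupt(void) { }

struct sketch_tick_dev {
	sketch_handler_t event_handler;
};

/* Retire the old device: from now on its events are ignored. */
static void sketch_exchange(struct sketch_tick_dev *old)
{
	old->event_handler = sketch_handle_noop;
}

/* Install a handler on the new device. */
static void sketch_setup(struct sketch_tick_dev *newdev, sketch_handler_t h)
{
	newdev->event_handler = h;
}

/* Replace old with newdev so that newdev inherits the live handler. */
static void sketch_replace_device(struct sketch_tick_dev *old,
				  struct sketch_tick_dev *newdev)
{
	sketch_handler_t handler;

	assert(old && newdev);
	handler = old->event_handler;   /* capture before old is neutered */
	sketch_exchange(old);
	sketch_setup(newdev, handler);
}
```

Reading `old->event_handler` after `sketch_exchange()` instead would hand the noop handler to the new device, which matches the "no timer activity" symptom.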
    • forgotten refcount on sysctl root table · b380b0d4
      Al Viro committed
      We should've set refcount on the root sysctl table; otherwise we'll blow
      up the first time we get down to zero dynamically registered sysctl
      tables.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Tested-by: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b380b0d4
  11. 03 Sep, 2008 5 commits
  12. 02 Sep, 2008 1 commit
  13. 30 Aug, 2008 2 commits
  14. 29 Aug, 2008 1 commit