1. 30 1月, 2008 16 次提交
  2. 15 1月, 2008 1 次提交
    • S
      Kick CPUS that might be sleeping in cpus_idle_wait · 40d6a146
      Steven Rostedt 提交于
      Sometimes cpu_idle_wait gets stuck because it might miss CPUS that are
      already in idle, have no tasks waiting to run and have no interrupts going
      to them.  This is common on bootup when switching cpu idle governors.
      
      This patch gives those CPUS that don't check in an IPI kick.
      
       Background:
       -----------
      I notice this while developing the mcount patches, that every once in a
      while the system would hang. Looking deeper, the hang was always at boot
      up when registering init_menu of the cpu_idle menu governor. Talking
      with Thomas Gliexner, we discovered that one of the CPUS had no timer
      events scheduled for it and it was in idle (running with NO_HZ). So the
      CPU would not set the cpu_idle_state bit.
      
      Hitting sysrq-t a few times would eventually route the interrupt to the
      stuck CPU and the system would continue.
      
      Note, I would have used the PDA isidle but that is set after the
      cpu_idle_state bit is cleared, and would leave a window open where we
      may miss being kicked.
      
      hmm, looking closer at this, we still have a small race window between
      clearing the cpu_idle_state and disabling interrupts (hence the RFC).
      
          CPU0:                          CPU 1:
        ---------                       ---------
       cpu_idle_wait():                 cpu_idle():
            |                           __cpu_cpu_var(is_idle) = 1;
            |                           if (__get_cpu_var(cpu_idle_state)) /* == 0 */
       per_cpu(cpu_idle_state, 1) = 1;         |
       if (per_cpu(is_idle, 1)) /* == 1 */     |
       smp_call_function(1)                    |
            |                             receives ipi and runs do_nothing.
       wait on map == empty               idle();
         /* waits forever */
      
      So really we need interrupts off for most of this then. One might think
      that we could simply clear the cpu_idle_state from do_nothing, but I'm
      assuming that cpu_idle governors can be removed, and this might cause a
      race that a governor might be used after the module was removed.
      
      Venki said:
      
        I think your RFC patch is the right solution here.  As I see it, there is
        no race with your RFC patch.  As long as you call a dummy smp_call_function
        on all CPUs, we should be OK.  We can get rid of cpu_idle_state and the
        current wait forever logic altogether with dummy smp_call_function.  And so
        there wont be any wait forever scenario.
      
        The whole point of cpu_idle_wait() is to make all CPUs come out of idle
        loop atleast once.  The caller will use cpu_idle_wait something like this.
      
        // Want to change idle handler
      
        - Switch global idle handler to always present default_idle
      
        - call cpu_idle_wait so that all cpus come out of idle for an instant
          and stop using old idle pointer and start using default idle
      
        - Change the idle handler to a new handler
      
        - optional cpu_idle_wait if you want all cpus to start using the new
          handler immediately.
      
      Maybe the below 1s patch is safe bet for .24.  But for .25, I would say we
      just replace all complicated logic by simple dummy smp_call_function and
      remove cpu_idle_state altogether.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Len Brown <lenb@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      40d6a146
  3. 20 12月, 2007 1 次提交
  4. 20 10月, 2007 2 次提交
  5. 14 10月, 2007 1 次提交
    • D
      Delete filenames in comments. · 835c34a1
      Dave Jones 提交于
      Since the x86 merge, lots of files that referenced their own filenames
      are no longer correct.  Rather than keep them up to date, just delete
      them, as they add no real value.
      
      Additionally:
      - fix up comment formatting in scx200_32.c
      - Remove a credit from myself in setup_64.c from a time when we had no SCM
      - remove longwinded history from tsc_32.c which can be figured out from
        git.
      Signed-off-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      835c34a1
  6. 11 10月, 2007 2 次提交
  7. 22 7月, 2007 1 次提交
  8. 17 7月, 2007 1 次提交
  9. 13 5月, 2007 1 次提交
  10. 09 5月, 2007 1 次提交
  11. 03 5月, 2007 3 次提交
    • J
      [PATCH] i386: Convert PDA into the percpu section · 7c3576d2
      Jeremy Fitzhardinge 提交于
      Currently x86 (similar to x84-64) has a special per-cpu structure
      called "i386_pda" which can be easily and efficiently referenced via
      the %fs register.  An ELF section is more flexible than a structure,
      allowing any piece of code to use this area.  Indeed, such a section
      already exists: the per-cpu area.
      
      So this patch:
      (1) Removes the PDA and uses per-cpu variables for each current member.
      (2) Replaces the __KERNEL_PDA segment with __KERNEL_PERCPU.
      (3) Creates a per-cpu mirror of __per_cpu_offset called this_cpu_off, which
          can be used to calculate addresses for this CPU's variables.
      (4) Simplifies startup, because %fs doesn't need to be loaded with a
          special segment at early boot; it can be deferred until the first
          percpu area is allocated (or never for UP).
      
      The result is less code and one less x86-specific concept.
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Cc: Andi Kleen <ak@suse.de>
      7c3576d2
    • R
      [PATCH] i386: i386 separate hardware-defined TSS from Linux additions · a75c54f9
      Rusty Russell 提交于
      On Thu, 2007-03-29 at 13:16 +0200, Andi Kleen wrote:
      > Please clean it up properly with two structs.
      
      Not sure about this, now I've done it.  Running it here.
      
      If you like it, I can do x86-64 as well.
      
      ==
      lguest defines its own TSS struct because the "struct tss_struct"
      contains linux-specific additions.  Andi asked me to split the struct
      in processor.h.
      
      Unfortunately it makes usage a little awkward.
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      a75c54f9
    • A
      [PATCH] x86: Don't use MWAIT on AMD Family 10 · f039b754
      Andi Kleen 提交于
      It doesn't put the CPU into deeper sleep states, so it's better to use the standard
      idle loop to save power. But allow to reenable it anyways for benchmarking.
      
      I also removed the obsolete idle=halt on i386
      
      Cc: andreas.herrmann@amd.com
      Signed-off-by: NAndi Kleen <ak@suse.de>
      f039b754
  12. 27 2月, 2007 1 次提交
    • L
      Revert "[PATCH] i386: add idle notifier" · ea3d5226
      Linus Torvalds 提交于
      This reverts commit 2ff2d3d7.
      
      Uwe Bugla reports that he cannot mount a floppy drive any more, and Jiri
      Slaby bisected it down to this commit.
      
      Benjamin LaHaise also points out that this is a big hot-path, and that
      interrupt delivery while idle is very common and should not go through
      all these expensive gyrations.
      
      Fix up conflicts in arch/i386/kernel/apic.c and arch/i386/kernel/irq.c
      due to other unrelated irq changes.
      
      Cc: Stephane Eranian <eranian@hpl.hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Andrew Morton <akpm@osdl.org>
      Cc: Uwe Bugla <uwe.bugla@gmx.de>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea3d5226
  13. 17 2月, 2007 1 次提交
  14. 13 2月, 2007 4 次提交
    • S
      [PATCH] i386: add idle notifier · 2ff2d3d7
      Stephane Eranian 提交于
      Add a notifier mechanism to the low level idle loop.  You can register a
      callback function which gets invoked on entry and exit from the low level idle
      loop.  The low level idle loop is defined as the polling loop, low-power call,
      or the mwait instruction.  Interrupts processed by the idle thread are not
      considered part of the low level loop.
      
      The notifier can be used to measure precisely how much is spent in useless
      execution (or low power mode).  The perfmon subsystem uses it to turn on/off
      monitoring.
      Signed-off-by: Nstephane eranian <eranian@hpl.hp.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      2ff2d3d7
    • Z
      [PATCH] i386: iOPL handling for paravirt guests · 8b151144
      Zachary Amsden 提交于
      I found a clever way to make the extra IOPL switching invisible to
      non-paravirt compiles - since kernel_rpl is statically defined to be zero
      there, and only non-zero rpl kernel have a problem restoring IOPL, as popf
      does not restore IOPL flags unless run at CPL-0.
      Signed-off-by: NZachary Amsden <zach@vmware.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      8b151144
    • Z
      [PATCH] i386: paravirt CPU hypercall batching mode · 9226d125
      Zachary Amsden 提交于
      The VMI ROM has a mode where hypercalls can be queued and batched.  This turns
      out to be a significant win during context switch, but must be done at a
      specific point before side effects to CPU state are visible to subsequent
      instructions.  This is similar to the MMU batching hooks already provided.
      The same hooks could be used by the Xen backend to implement a context switch
      multicall.
      
      To explain a bit more about lazy modes in the paravirt patches, basically, the
      idea is that only one of lazy CPU or MMU mode can be active at any given time.
       Lazy MMU mode is similar to this lazy CPU mode, and allows for batching of
      multiple PTE updates (say, inside a remap loop), but to avoid keeping some
      kind of state machine about when to flush cpu or mmu updates, we just allow
      one or the other to be active.  Although there is no real reason a more
      comprehensive scheme could not be implemented, there is also no demonstrated
      need for this extra complexity.
      Signed-off-by: NZachary Amsden <zach@vmware.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      9226d125
    • J
      [PATCH] i386: Convert i386 PDA code to use %fs · 464d1a78
      Jeremy Fitzhardinge 提交于
      Convert the PDA code to use %fs rather than %gs as the segment for
      per-processor data.  This is because some processors show a small but
      measurable performance gain for reloading a NULL segment selector (as %fs
      generally is in user-space) versus a non-NULL one (as %gs generally is).
      
      On modern processors the difference is very small, perhaps undetectable.
      Some old AMD "K6 3D+" processors are noticably slower when %fs is used
      rather than %gs; I have no idea why this might be, but I think they're
      sufficiently rare that it doesn't matter much.
      
      This patch also fixes the math emulator, which had not been adjusted to
      match the changed struct pt_regs.
      
      [frederik.deweerdt@gmail.com: fixit with gdb]
      [mingo@elte.hu: Fix KVM too]
      Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Cc: Ian Campbell <Ian.Campbell@XenSource.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Acked-by: NZachary Amsden <zach@vmware.com>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: NFrederik Deweerdt <frederik.deweerdt@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      464d1a78
  15. 23 12月, 2006 1 次提交
    • I
      [PATCH] sched: fix bad missed wakeups in the i386, x86_64, ia64, ACPI and APM idle code · 0888f06a
      Ingo Molnar 提交于
      Fernando Lopez-Lezcano reported frequent scheduling latencies and audio
      xruns starting at the 2.6.18-rt kernel, and those problems persisted all
      until current -rt kernels. The latencies were serious and unjustified by
      system load, often in the milliseconds range.
      
      After a patient and heroic multi-month effort of Fernando, where he
      tested dozens of kernels, tried various configs, boot options,
      test-patches of mine and provided latency traces of those incidents, the
      following 'smoking gun' trace was captured by him:
      
                       _------=> CPU#
                      / _-----=> irqs-off
                     | / _----=> need-resched
                     || / _---=> hardirq/softirq
                     ||| / _--=> preempt-depth
                     |||| /
                     |||||     delay
         cmd     pid ||||| time  |   caller
            \   /    |||||   \   |   /
        IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup (try_to_wake_up)
        IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup <<...>-5856> (37 0)
        IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup (c01262ba 0 0)
        IRQ_19-1479  1D..1    0us : resched_task (try_to_wake_up)
        IRQ_19-1479  1D..1    0us : __spin_unlock_irqrestore (try_to_wake_up)
        ...
        <idle>-0     1...1   11us!: default_idle (cpu_idle)
        ...
        <idle>-0     0Dn.1  602us : smp_apic_timer_interrupt (c0103baf 1 0)
        ...
         <...>-5856  0D..2  618us : __switch_to (__schedule)
         <...>-5856  0D..2  618us : __schedule <<idle>-0> (20 162)
         <...>-5856  0D..2  619us : __spin_unlock_irq (__schedule)
         <...>-5856  0...1  619us : trace_stop_sched_switched (__schedule)
         <...>-5856  0D..1  619us : trace_stop_sched_switched <<...>-5856> (37 0)
      
      what is visible in this trace is that CPU#1 ran try_to_wake_up() for
      PID:5856, it placed PID:5856 on CPU#0's runqueue and ran resched_task()
      for CPU#0. But it decided to not send an IPI that no CPU - due to
      TS_POLLING. But CPU#0 never woke up after its NEED_RESCHED bit was set,
      and only rescheduled to PID:5856 upon the next lapic timer IRQ. The
      result was a 600+ usecs latency and a missed wakeup!
      
      the bug turned out to be an idle-wakeup bug introduced into the mainline
      kernel this summer via an optimization in the x86_64 tree:
      
          commit 495ab9c0
          Author: Andi Kleen <ak@suse.de>
          Date:   Mon Jun 26 13:59:11 2006 +0200
      
          [PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status
      
          During some profiling I noticed that default_idle causes a lot of
          memory traffic. I think that is caused by the atomic operations
          to clear/set the polling flag in thread_info. There is actually
          no reason to make this atomic - only the idle thread does it
          to itself, other CPUs only read it. So I moved it into ti->status.
      
      the problem is this type of change:
      
              if (!hlt_counter && boot_cpu_data.hlt_works_ok) {
      -               clear_thread_flag(TIF_POLLING_NRFLAG);
      +               current_thread_info()->status &= ~TS_POLLING;
                      smp_mb__after_clear_bit();
                      while (!need_resched()) {
                              local_irq_disable();
      
      this changes clear_thread_flag() to an explicit clearing of TS_POLLING.
      clear_thread_flag() is defined as:
      
              clear_bit(flag, &ti->flags);
      
      and clear_bit() is a LOCK-ed atomic instruction on all x86 platforms:
      
        static inline void clear_bit(int nr, volatile unsigned long * addr)
        {
                __asm__ __volatile__( LOCK_PREFIX
                        "btrl %1,%0"
      
      hence smp_mb__after_clear_bit() is defined as a simple compile barrier:
      
        #define smp_mb__after_clear_bit()       barrier()
      
      but the explicit TS_POLLING clearing introduced by the patch:
      
      +               current_thread_info()->status &= ~TS_POLLING;
      
      is not an atomic op! So the clearing of the TS_POLLING bit is freely
      reorderable with the reading of the NEED_RESCHED bit - and both now
      reside in different memory addresses.
      
      CPU idle wakeup very much depends on ordered memory ops, the clearing of
      the TS_POLLING flag must always be done before we test need_resched()
      and hit the idle instruction(s). [Symmetrically, the wakeup code needs
      to set NEED_RESCHED before it tests the TS_POLLING flag, so memory
      ordering is paramount.]
      
      Fernando's dual-core Athlon64 system has a sufficiently advanced memory
      ordering model so that it triggered this scenario very often.
      
      ( And it also turned out that the reason why these latencies never
        triggered on my testsystems is that i routinely use idle=poll, which
        was the only idle variant not affected by this bug. )
      
      The fix is to change the smp_mb__after_clear_bit() to an smp_mb(), to
      act as an absolute barrier between the TS_POLLING write and the
      NEED_RESCHED read. This affects almost all idling methods (default,
      ACPI, APM), on all 3 x86 architectures: i386, x86_64, ia64.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Tested-by: NFernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      0888f06a
  16. 07 12月, 2006 3 次提交