1. 09 11月, 2005 2 次提交
    • N
      [PATCH] sched: resched and cpu_idle rework · 64c7c8f8
      Nick Piggin 提交于
      Make some changes to the NEED_RESCHED and POLLING_NRFLAG to reduce
      confusion, and make their semantics rigid.  Improves efficiency of
      resched_task and some cpu_idle routines.
      
      * In resched_task:
      - TIF_NEED_RESCHED is only cleared with the task's runqueue lock held,
        and as we hold it during resched_task, then there is no need for an
        atomic test and set there. The only other time this should be set is
        when the task's quantum expires, in the timer interrupt - this is
        protected against because the rq lock is irq-safe.
      
      - If TIF_NEED_RESCHED is set, then we don't need to do anything. It
        won't get unset until the task get's schedule()d off.
      
      - If we are running on the same CPU as the task we resched, then set
        TIF_NEED_RESCHED and no further action is required.
      
      - If we are running on another CPU, and TIF_POLLING_NRFLAG is *not* set
        after TIF_NEED_RESCHED has been set, then we need to send an IPI.
      
      Using these rules, we are able to remove the test and set operation in
      resched_task, and make clear the previously vague semantics of
      POLLING_NRFLAG.
      
      * In idle routines:
      - Enter cpu_idle with preempt disabled. When the need_resched() condition
        becomes true, explicitly call schedule(). This makes things a bit clearer
        (IMO), but haven't updated all architectures yet.
      
      - Many do a test and clear of TIF_NEED_RESCHED for some reason. According
        to the resched_task rules, this isn't needed (and actually breaks the
        assumption that TIF_NEED_RESCHED is only cleared with the runqueue lock
        held). So remove that. Generally one less locked memory op when switching
        to the idle thread.
      
      - Many idle routines clear TIF_POLLING_NRFLAG, and only set it in the inner
        most polling idle loops. The above resched_task semantics allow it to be
        set until before the last time need_resched() is checked before going into
        a halt requiring interrupt wakeup.
      
        Many idle routines simply never enter such a halt, and so POLLING_NRFLAG
        can be always left set, completely eliminating resched IPIs when rescheduling
        the idle task.
      
        POLLING_NRFLAG width can be increased, to reduce the chance of resched IPIs.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Con Kolivas <kernel@kolivas.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      64c7c8f8
    • N
      [PATCH] sched: disable preempt in idle tasks · 5bfb5d69
      Nick Piggin 提交于
      Run idle threads with preempt disabled.
      
      Also corrected a bugs in arm26's cpu_idle (make it actually call schedule()).
      How did it ever work before?
      
      Might fix the CPU hotplugging hang which Nigel Cunningham noted.
      
      We think the bug hits if the idle thread is preempted after checking
      need_resched() and before going to sleep, then the CPU offlined.
      
      After calling stop_machine_run, the CPU eventually returns from preemption and
      into the idle thread and goes to sleep.  The CPU will continue executing
      previous idle and have no chance to call play_dead.
      
      By disabling preemption until we are ready to explicitly schedule, this bug is
      fixed and the idle threads generally become more robust.
      
      From: alexs <ashepard@u.washington.edu>
      
        PPC build fix
      
      From: Yoichi Yuasa <yuasa@hh.iij4u.or.jp>
      
        MIPS build fix
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NYoichi Yuasa <yuasa@hh.iij4u.or.jp>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      5bfb5d69
  2. 27 9月, 2005 1 次提交
  3. 05 9月, 2005 4 次提交
    • Z
      [PATCH] x86: make IOPL explicit · a5201129
      Zachary Amsden 提交于
      The pushf/popf in switch_to are ONLY used to switch IOPL.  Making this
      explicit in C code is more clear.  This pushf/popf pair was added as a
      bugfix for leaking IOPL to unprivileged processes when using
      sysenter/sysexit based system calls (sysexit does not restore flags).
      
      When requesting an IOPL change in sys_iopl(), it is just as easy to change
      the current flags and the flags in the stack image (in case an IRET is
      required), but there is no reason to force an IRET if we came in from the
      SYSENTER path.
      
      This change is the minimal solution for supporting a paravirtualized Linux
      kernel that allows user processes to run with I/O privilege.  Other
      solutions require radical rewrites of part of the low level fault / system
      call handling code, or do not fully support sysenter based system calls.
      
      Unfortunately, this added one field to the thread_struct.  But as a bonus,
      on P4, the fastest time measured for switch_to() went from 312 to 260
      cycles, a win of about 17% in the fast case through this performance
      critical path.
      Signed-off-by: NZachary Amsden <zach@vmware.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      a5201129
    • Z
      [PATCH] x86: more asm cleanups · f2ab4461
      Zachary Amsden 提交于
      Some more assembler cleanups I noticed along the way.
      Signed-off-by: NZachary Amsden <zach@vmware.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f2ab4461
    • Z
      [PATCH] i386: load_tls() fix · e7a2ff59
      Zachary Amsden 提交于
      Subtle fix: load_TLS has been moved after saving %fs and %gs segments to avoid
      creating non-reversible segments.  This could conceivably cause a bug if the
      kernel ever needed to save and restore fs/gs from the NMI handler.  It
      currently does not, but this is the safest approach to avoiding fs/gs
      corruption.  SMIs are safe, since SMI saves the descriptor hidden state.
      Signed-off-by: NZachary Amsden <zach@vmware.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      e7a2ff59
    • Z
      [PATCH] i386: inline asm cleanup · 4bb0d3ec
      Zachary Amsden 提交于
      i386 Inline asm cleanup.  Use cr/dr accessor functions.
      
      Also, a potential bugfix.  Also, some CR accessors really should be volatile.
      Reads from CR0 (numeric state may change in an exception handler), writes to
      CR4 (flipping CR4.TSD) and reads from CR2 (page fault) prevent instruction
      re-ordering.  I did not add memory clobber to CR3 / CR4 / CR0 updates, as it
      was not there to begin with, and in no case should kernel memory be clobbered,
      except when doing a TLB flush, which already has memory clobber.
      
      I noticed that page invalidation does not have a memory clobber.  I can't find
      a bug as a result, but there is definitely a potential for a bug here:
      
      #define __flush_tlb_single(addr) \
      	__asm__ __volatile__("invlpg %0": :"m" (*(char *) addr))
      Signed-off-by: NZachary Amsden <zach@vmware.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4bb0d3ec
  4. 28 7月, 2005 1 次提交
  5. 23 7月, 2005 1 次提交
    • L
      Fix up incorrect "unlikely()" on %gs reload in x86 __switch_to · b339a18b
      Linus Torvalds 提交于
      These days %gs is normally the TLS segment, so it's no longer zero.  As
      a result, we shouldn't just assume that %fs/%gs tend to be zero
      together, but test them independently instead.
      
      Also, fix setting of debug registers to use the "next" pointer instead
      of "current".  It so happens that the scheduler will have set the new
      current pointer before calling __switch_to(), but that's just an
      implementation detail.
      b339a18b
  6. 28 6月, 2005 1 次提交
    • A
      [PATCH] seccomp: tsc disable · ffaa8bd6
      Andrea Arcangeli 提交于
      I believe at least for seccomp it's worth to turn off the tsc, not just for
      HT but for the L2 cache too.  So it's up to you, either you turn it off
      completely (which isn't very nice IMHO) or I recommend to apply this below
      patch.
      
      This has been tested successfully on x86-64 against current cogito
      repository (i686 compiles so I didn't bother testing ;).  People selling
      the cpu through cpushare may appreciate this bit for a peace of mind.
      
      There's no way to get any timing info anymore with this applied
      (gettimeofday is forbidden of course).  The seccomp environment is
      completely deterministic so it can't be allowed to get timing info, it has
      to be deterministic so in the future I can enable a computing mode that
      does a parallel computing for each task with server side transparent
      checkpointing and verification that the output is the same from all the 2/3
      seller computers for each task, without the buyer even noticing (for now
      the verification is left to the buyer client side and there's no
      checkpointing, since that would require more kernel changes to track the
      dirty bits but it'll be easy to extend once the basic mode is finished).
      
      Eliminating a cold-cache read of the cr4 global variable will save one
      cacheline during the tlb flush while making the code per-cpu-safe at the
      same time.  Thanks to Mikael Pettersson for noticing the tlb flush wasn't
      per-cpu-safe.
      
      The global tlb flush can run from irq (IPI calling do_flush_tlb_all) but
      it'll be transparent to the switch_to code since the IPI won't make any
      change to the cr4 contents from the point of view of the interrupted code
      and since it's now all per-cpu stuff, it will not race.  So no need to
      disable irqs in switch_to slow path.
      Signed-off-by: NAndrea Arcangeli <andrea@cpushare.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      ffaa8bd6
  7. 26 6月, 2005 3 次提交
    • L
      [PATCH] cpu state clean after hot remove · e1367daf
      Li Shaohua 提交于
      Clean CPU states in order to reuse smp boot code for CPU hotplug.
      
      Signed-off-by: Li Shaohua<shaohua.li@intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      e1367daf
    • L
      [PATCH] init call cleanup · 0bb3184d
      Li Shaohua 提交于
      Trival patch for CPU hotplug.  In CPU identify part, only did cleaup for intel
      CPUs.  Need do for other CPUs if they support S3 SMP.
      
      Signed-off-by: Li Shaohua<shaohua.li@intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      0bb3184d
    • Z
      [PATCH] i386 CPU hotplug · f3705136
      Zwane Mwaikambo 提交于
      (The i386 CPU hotplug patch provides infrastructure for some work which Pavel
      is doing as well as for ACPI S3 (suspend-to-RAM) work which Li Shaohua
      <shaohua.li@intel.com> is doing)
      
      The following provides i386 architecture support for safely unregistering and
      registering processors during runtime, updated for the current -mm tree.  In
      order to avoid dumping cpu hotplug code into kernel/irq/* i dropped the
      cpu_online check in do_IRQ() by modifying fixup_irqs().  The difference being
      that on cpu offline, fixup_irqs() is called before we clear the cpu from
      cpu_online_map and a long delay in order to ensure that we never have any
      queued external interrupts on the APICs.  There are additional changes to s390
      and ppc64 to account for this change.
      
      1) Add CONFIG_HOTPLUG_CPU
      2) disable local APIC timer on dead cpus.
      3) Disable preempt around irq balancing to prevent CPUs going down.
      4) Print irq stats for all possible cpus.
      5) Debugging check for interrupts on offline cpus.
      6) Hacky fixup_irqs() to redirect irqs when cpus go off/online.
      7) play_dead() for offline cpus to spin inside.
      8) Handle offline cpus set in flush_tlb_others().
      9) Grab lock earlier in smp_call_function() to prevent CPUs going down.
      10) Implement __cpu_disable() and __cpu_die().
      11) Enable local interrupts in cpu_enable() after fixup_irqs()
      12) Don't fiddle with NMI on dead cpu, but leave intact on other cpus.
      13) Program IRQ affinity whilst cpu is still in cpu_online_map on offline.
      Signed-off-by: NZwane Mwaikambo <zwane@linuxpower.ca>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f3705136
  8. 24 6月, 2005 4 次提交
    • H
      [PATCH] kprobes: function-return probes · b94cce92
      Hien Nguyen 提交于
      This patch adds function-return probes to kprobes for the i386
      architecture.  This enables you to establish a handler to be run when a
      function returns.
      
      1. API
      
      Two new functions are added to kprobes:
      
      	int register_kretprobe(struct kretprobe *rp);
      	void unregister_kretprobe(struct kretprobe *rp);
      
      2. Registration and unregistration
      
      2.1 Register
      
        To register a function-return probe, the user populates the following
        fields in a kretprobe object and calls register_kretprobe() with the
        kretprobe address as an argument:
      
        kp.addr - the function's address
      
        handler - this function is run after the ret instruction executes, but
        before control returns to the return address in the caller.
      
        maxactive - The maximum number of instances of the probed function that
        can be active concurrently.  For example, if the function is non-
        recursive and is called with a spinlock or mutex held, maxactive = 1
        should be enough.  If the function is non-recursive and can never
        relinquish the CPU (e.g., via a semaphore or preemption), NR_CPUS should
        be enough.  maxactive is used to determine how many kretprobe_instance
        objects to allocate for this particular probed function.  If maxactive <=
        0, it is set to a default value (if CONFIG_PREEMPT maxactive=max(10, 2 *
        NR_CPUS) else maxactive=NR_CPUS)
      
        For example:
      
          struct kretprobe rp;
          rp.kp.addr = /* entrypoint address */
          rp.handler = /*return probe handler */
          rp.maxactive = /* e.g., 1 or NR_CPUS or 0, see the above explanation */
          register_kretprobe(&rp);
      
        The following field may also be of interest:
      
        nmissed - Initialized to zero when the function-return probe is
        registered, and incremented every time the probed function is entered but
        there is no kretprobe_instance object available for establishing the
        function-return probe (i.e., because maxactive was set too low).
      
      2.2 Unregister
      
        To unregiter a function-return probe, the user calls
        unregister_kretprobe() with the same kretprobe object as registered
        previously.  If a probed function is running when the return probe is
        unregistered, the function will return as expected, but the handler won't
        be run.
      
      3. Limitations
      
      3.1 This patch supports only the i386 architecture, but patches for
          x86_64 and ppc64 are anticipated soon.
      
      3.2 Return probes operates by replacing the return address in the stack
          (or in a known register, such as the lr register for ppc).  This may
          cause __builtin_return_address(0), when invoked from the return-probed
          function, to return the address of the return-probes trampoline.
      
      3.3 This implementation uses the "Multiprobes at an address" feature in
          2.6.12-rc3-mm3.
      
      3.4 Due to a limitation in multi-probes, you cannot currently establish
          a return probe and a jprobe on the same function.  A patch to remove
          this limitation is being tested.
      
      This feature is required by SystemTap (http://sourceware.org/systemtap),
      and reflects ideas contributed by several SystemTap developers, including
      Will Cohen and Ananth Mavinakayanahalli.
      Signed-off-by: NHien Nguyen <hien@us.ibm.com>
      Signed-off-by: NPrasanna S Panchamukhi <prasanna@in.ibm.com>
      Signed-off-by: NFrederik Deweerdt <frederik.deweerdt@laposte.net>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b94cce92
    • V
      [PATCH] xen: x86: Use more usermode macro · 717b594a
      Vincent Hanquez 提交于
      Use the user_mode macro where it's possible.
      Signed-off-by: NVincent Hanquez <vincent.hanquez@cl.cam.ac.uk>
      Cc: Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      717b594a
    • V
      [PATCH] xen: x86: Use new macro for debugreg · 1cc6f12e
      Vincent Hanquez 提交于
      Make use of the 2 new macro set_debugreg and get_debugreg.
      Signed-off-by: NVincent Hanquez <vincent.hanquez@cl.cam.ac.uk>
      Cc: Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      1cc6f12e
    • A
      [PATCH] Remove i386_ksyms.c, almost. · 129f6946
      Alexey Dobriyan 提交于
      * EXPORT_SYMBOL's moved to other files
      * #include <linux/config.h>, <linux/module.h> where needed
      * #include's in i386_ksyms.c cleaned up
      * After copy-paste, redundant due to Makefiles rules preprocessor directives
        removed:
      
      	#ifdef CONFIG_FOO
      	EXPORT_SYMBOL(foo);
      	#endif
      
      	obj-$(CONFIG_FOO) += foo.o
      
      * Tiny reformat to fit in 80 columns
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      129f6946
  9. 06 5月, 2005 1 次提交
    • A
      [PATCH] x86 stack initialisation fix · f48d9663
      Alexander Nyberg 提交于
      The recent change fix-crash-in-entrys-restore_all.patch
      
       	childregs->esp = esp;
      
       	p->thread.esp = (unsigned long) childregs;
      -	p->thread.esp0 = (unsigned long) (childregs+1);
      +	p->thread.esp0 = (unsigned long) (childregs+1) - 8;
      
       	p->thread.eip = (unsigned long) ret_from_fork;
      
      introduces an inconsistency between esp and esp0 before the task is run the
      first time.  esp0 is no longer the actual start of the stack, but 8 bytes
      off.
      
      This shows itself clearly in a scenario when a ptracer that is set to also
      ptrace eventual children traces program1 which then clones thread1.  Now
      the ptracer wants to modify the registers of thread1.  The x86 ptrace
      implementation bases it's knowledge about saved user-space registers upon
      p->thread.esp0.  But this will be a few bytes off causing certain writes to
      the kernel stack to overwrite a saved kernel function address making the
      kernel when actually running thread1 jump out into user-space.  Very
      spectacular.
      
      The testcase I've used is:
      /* start with strace -f ./a.out */
      #include <pthread.h>
      #include <stdio.h>
      
      void *do_thread(void *p)
      {
      	for (;;);
      }
      
      int main()
      {
      	pthread_t one;
      	pthread_create(&one, NULL, &do_thread, NULL);
      	for (;;);
      	return 0;
      }
      
      So, my solution is to instead of just adjusting esp0 that creates an
      inconsitent state I adjust where the user-space registers are saved with -8
      bytes.  This gives us the wanted extra bytes on the start of the stack and
      esp0 is now correct.  This solves the issues I saw from the original
      testcase from Mateusz Berezecki and has survived testing here.  I think
      this should go into -mm a round or two first however as there might be some
      cruft around depending on pt_regs lying on the start of the stack.  That
      however would have broken with the first change too!
      
      It's actually a 2-line diff but I had to move the comment of why the -8 bytes
      are there a few lines up. Thanks to Zwane for helping me with this.
      Signed-off-by: NAlexander Nyberg <alexn@telia.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      f48d9663
  10. 01 5月, 2005 1 次提交
    • H
      [PATCH] i386/x86_64 segment register access update · fd51f666
      H. J. Lu 提交于
      The new i386/x86_64 assemblers no longer accept instructions for moving
      between a segment register and a 32bit memory location, i.e.,
      
              movl (%eax),%ds
              movl %ds,(%eax)
      
      To generate instructions for moving between a segment register and a
      16bit memory location without the 16bit operand size prefix, 0x66,
      
              mov (%eax),%ds
              mov %ds,(%eax)
      
      should be used. It will work with both new and old assemblers. The
      assembler starting from 2.16.90.0.1 will also support
      
              movw (%eax),%ds
              movw %ds,(%eax)
      
      without the 0x66 prefix. I am enclosing patches for 2.4 and 2.6 kernels
      here. The resulting kernel binaries should be unchanged as before, with
      old and new assemblers, if gcc never generates memory access for
      
                     unsigned gsindex;
                     asm volatile("movl %%gs,%0" : "=g" (gsindex));
      
      If gcc does generate memory access for the code above, the upper bits
      in gsindex are undefined and the new assembler doesn't allow it.
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      fd51f666
  11. 17 4月, 2005 3 次提交
    • R
      [PATCH] i386: Use loaddebug macro consistently · ecd02ddd
      Roland McGrath 提交于
      This moves the macro loaddebug from asm-i386/suspend.h to
      asm-i386/processor.h, which is the place that makes sense for it to be
      defined, removes the extra copy of the same macro in
      arch/i386/kernel/process.c, and makes arch/i386/kernel/signal.c use the
      macro in place of its expansion.
      
      This is a purely cosmetic cleanup for the normal i386 kernel.  However, it
      is handy for Xen to be able to just redefine the loaddebug macro once
      instead of also changing the signal.c code.
      Signed-off-by: NRoland McGrath <roland@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      ecd02ddd
    • S
      [PATCH] fix crash in entry.S restore_all · 5df24082
      Stas Sergeev 提交于
      Fix the access-above-bottom-of-stack crash.
      
      1. Allows to preserve the valueable optimization
      
      2. Works for NMIs
      
      3.  Doesn't care whether or not there are more of the like instances
         where the stack is left empty.
      
      4. Seems to work for me without the crashes:) 
      
      (akpm: this is still under discussion, although I _think_ it's OK.  You might
      want to hold off)
      
      Signed-off-by: Stas Sergeev <stsp@aknet.ru> 
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      5df24082
    • L
      Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds 提交于
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!
      1da177e4