1. 14 10月, 2008 1 次提交
    • S
      ftrace: x86 mcount stub · 0a37605c
      Steven Rostedt 提交于
      x86 now sets up the mcount locations through the build and no longer
      needs to record the ip when the function is executed. This patch changes
      the initial mcount to simply return. There's no need to do any other work.
      If the ftrace start up test fails, the original mcount will be what everything
      will use, so having this as fast as possible is a good thing.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      0a37605c
  2. 13 10月, 2008 6 次提交
  3. 24 7月, 2008 1 次提交
    • R
      i386 syscall audit fast-path · af0575bb
      Roland McGrath 提交于
      This adds fast paths for 32-bit syscall entry and exit when
      TIF_SYSCALL_AUDIT is set, but no other kind of syscall tracing.
      These paths does not need to save and restore all registers as
      the general case of tracing does.  Avoiding the iret return path
      when syscall audit is enabled helps performance a lot.
      Signed-off-by: NRoland McGrath <roland@redhat.com>
      af0575bb
  4. 19 7月, 2008 1 次提交
  5. 17 7月, 2008 2 次提交
    • R
      x86 ptrace: unify syscall tracing · d4d67150
      Roland McGrath 提交于
      This unifies and cleans up the syscall tracing code on i386 and x86_64.
      
      Using a single function for entry and exit tracing on 32-bit made the
      do_syscall_trace() into some terrible spaghetti.  The logic is clear and
      simple using separate syscall_trace_enter() and syscall_trace_leave()
      functions as on 64-bit.
      
      The unification adds PTRACE_SYSEMU and PTRACE_SYSEMU_SINGLESTEP support
      on x86_64, for 32-bit ptrace() callers and for 64-bit ptrace() callers
      tracing either 32-bit or 64-bit tasks.  It behaves just like 32-bit.
      
      Changing syscall_trace_enter() to return the syscall number shortens
      all the assembly paths, while adding the SYSEMU feature in a simple way.
      Signed-off-by: NRoland McGrath <roland@redhat.com>
      d4d67150
    • R
      x86 ptrace: unify TIF_SINGLESTEP · 64f09733
      Roland McGrath 提交于
      This unifies the treatment of TIF_SINGLESTEP on i386 and x86_64.
      The bit is now excluded from _TIF_WORK_MASK on i386 as it has been
      on x86_64.  This means the do_notify_resume() path using it is never
      used, so TIF_SINGLESTEP is not cleared on returning to user mode.
      
      Both now leave TIF_SINGLESTEP set when returning to user, so that
      it's already set on an int $0x80 system call entry.  This removes
      the need for testing TF on the system_call path.  Doing it this way
      fixes the regression for PTRACE_SINGLESTEP into a sigreturn syscall,
      introduced by commit 1e2e99f0.
      
      The clear_TF_reenable case that sets TIF_SINGLESTEP can only happen
      on a non-exception kernel entry, i.e. sysenter/syscall instruction.
      That will always get to the syscall exit tracing path.
      Signed-off-by: NRoland McGrath <roland@redhat.com>
      64f09733
  6. 12 7月, 2008 1 次提交
  7. 08 7月, 2008 1 次提交
    • J
      x86/paravirt: split sysret and sysexit · d75cd22f
      Jeremy Fitzhardinge 提交于
      Don't conflate sysret and sysexit; they're different instructions with
      different semantics, and may be in use at the same time (at least
      within the same kernel, depending on whether its an Intel or AMD
      system).
      
      sysexit - just return to userspace, does no register restoration of
          any kind; must explicitly atomically enable interrupts.
      
      sysret - reloads flags from r11, so no need to explicitly enable
          interrupts on 64-bit, responsible for restoring usermode %gs
      Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citirx.com>
      Cc: xen-devel <xen-devel@lists.xensource.com>
      Cc: Stephen Tweedie <sct@redhat.com>
      Cc: Eduardo Habkost <ehabkost@redhat.com>
      Cc: Mark McLoughlin <markmc@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      d75cd22f
  8. 24 6月, 2008 1 次提交
  9. 13 6月, 2008 1 次提交
    • P
      x86: fix lockdep warning during suspend-to-ram · e32e58a9
      Peter Zijlstra 提交于
      Andrew Morton wrote:
      
      > I've been seeing the below for a long time during suspend-to-ram on the Vaio.
      >
      >
      > PM: Syncing filesystems ... done.
      > PM: Preparing system for mem sleep
      > Freezing user space processes ... <4>------------[ cut here ]------------
      > WARNING: at kernel/lockdep.c:2658 check_flags+0x4c/0x127()
      > Modules linked in: i915 drm ipw2200 sonypi ipv6 autofs4 hidp l2cap bluetooth sunrpc nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 xt_state nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables acpi_cpufreq nvram ohci1394 ieee1394 ehci_hcd uhci_hcd sg joydev snd_hda_intel snd_seq_dummy sr_mod snd_seq_oss cdrom snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss ieee80211 pcspkr ieee80211_crypt snd_pcm i2c_i801 snd_timer i2c_core ide_pci_generic piix snd soundcore snd_page_alloc button ext3 jbd ide_disk ide_core [last unloaded: ipw2200]
      > Pid: 3250, comm: zsh Not tainted 2.6.26-rc5 #1
      >  [<c011c5f5>] warn_on_slowpath+0x41/0x6d
      >  [<c01080e6>] ? native_sched_clock+0x82/0x96
      >  [<c013789c>] ? mark_held_locks+0x41/0x5c
      >  [<c0315688>] ? _spin_unlock_irqrestore+0x36/0x58
      >  [<c0137a29>] ? trace_hardirqs_on+0xe6/0x10d
      >  [<c0138637>] ? __lock_acquire+0xae3/0xb2b
      >  [<c0313413>] ? schedule+0x39b/0x3b4
      >  [<c0135596>] check_flags+0x4c/0x127
      >  [<c01386b9>] lock_acquire+0x3a/0x86
      >  [<c0315075>] _spin_lock+0x26/0x53
      >  [<c0140660>] ? refrigerator+0x13/0xc3
      >  [<c0140660>] refrigerator+0x13/0xc3
      >  [<c012684a>] get_signal_to_deliver+0x3c/0x31e
      >  [<c0102fe7>] do_notify_resume+0x91/0x6ee
      >  [<c01359fd>] ? lock_release_holdtime+0x50/0x56
      >  [<c0315688>] ? _spin_unlock_irqrestore+0x36/0x58
      >  [<c0235d24>] ? read_chan+0x0/0x58c
      >  [<c0137a29>] ? trace_hardirqs_on+0xe6/0x10d
      >  [<c0315694>] ? _spin_unlock_irqrestore+0x42/0x58
      >  [<c0230afa>] ? tty_ldisc_deref+0x5c/0x63
      >  [<c0233104>] ? tty_read+0x66/0x98
      >  [<c014b3f0>] ? audit_syscall_exit+0x2aa/0x2c5
      >  [<c0109430>] ? do_syscall_trace+0x6b/0x16f
      >  [<c0103a9c>] work_notifysig+0x13/0x1b
      >  =======================
      > ---[ end trace 25b49fe59a25afa5 ]---
      > possible reason: unannotated irqs-off.
      > irq event stamp: 58919
      > hardirqs last  enabled at (58919): [<c0103afd>] syscall_exit_work+0x11/0x26
      
      Joy - I so love entry.S
      
      Best I can make of it:
      
      syscall_exit_work
        resume_userspace
          DISABLE_INTERRUPTS
          (no TRACE_IRQS_OFF)
            work_pending
              work_notifysig
                do_notify_resume()
                  do_signal()
                    get_signal_to_deliver()
                      try_to_freeze()
                        refrigerator()
                          task_lock() -> check_flags() -> BANG
      
      The normal path is:
      
      syscall_exit_work
        resume_userspace
          DISABLE_INTERRUPTS
          restore_all
            TRACE_IRQS_IRET
            iret
      
      No idea why that would not warn..
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e32e58a9
  10. 24 5月, 2008 2 次提交
    • S
      ftrace: use dynamic patching for updating mcount calls · d61f82d0
      Steven Rostedt 提交于
      This patch replaces the indirect call to the mcount function
      pointer with a direct call that will be patched by the
      dynamic ftrace routines.
      
      On boot up, the mcount function calls the ftace_stub function.
      When the dynamic ftrace code is initialized, the ftrace_stub
      is replaced with a call to the ftrace_record_ip, which records
      the instruction pointers of the locations that call it.
      
      Later, the ftraced daemon will call kstop_machine and patch all
      the locations to nops.
      
      When a ftrace is enabled, the original calls to mcount will now
      be set top call ftrace_caller, which will do a direct call
      to the registered ftrace function. This direct call is also patched
      when the function that should be called is updated.
      
      All patching is performed by a kstop_machine routine to prevent any
      type of race conditions that is associated with modifying code
      on the fly.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      d61f82d0
    • A
      ftrace: add basic support for gcc profiler instrumentation · 16444a8a
      Arnaldo Carvalho de Melo 提交于
      If CONFIG_FTRACE is selected and /proc/sys/kernel/ftrace_enabled is
      set to a non-zero value the ftrace routine will be called everytime
      we enter a kernel function that is not marked with the "notrace"
      attribute.
      
      The ftrace routine will then call a registered function if a function
      happens to be registered.
      
      [ This code has been highly hacked by Steven Rostedt and Ingo Molnar,
        so don't blame Arnaldo for all of this ;-) ]
      
      Update:
        It is now possible to register more than one ftrace function.
        If only one ftrace function is registered, that will be the
        function that ftrace calls directly. If more than one function
        is registered, then ftrace will call a function that will loop
        through the functions to call.
      Signed-off-by: NArnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      16444a8a
  11. 13 5月, 2008 1 次提交
  12. 25 4月, 2008 4 次提交
  13. 20 4月, 2008 1 次提交
  14. 17 4月, 2008 2 次提交
  15. 19 2月, 2008 1 次提交
  16. 10 2月, 2008 1 次提交
  17. 30 1月, 2008 4 次提交
  18. 17 10月, 2007 1 次提交
    • J
      paravirt: refactor struct paravirt_ops into smaller pv_*_ops · 93b1eab3
      Jeremy Fitzhardinge 提交于
      This patch refactors the paravirt_ops structure into groups of
      functionally related ops:
      
      pv_info - random info, rather than function entrypoints
      pv_init_ops - functions used at boot time (some for module_init too)
      pv_misc_ops - lazy mode, which didn't fit well anywhere else
      pv_time_ops - time-related functions
      pv_cpu_ops - various privileged instruction ops
      pv_irq_ops - operations for managing interrupt state
      pv_apic_ops - APIC operations
      pv_mmu_ops - operations for managing pagetables
      
      There are several motivations for this:
      
      1. Some of these ops will be general to all x86, and some will be
         i386/x86-64 specific.  This makes it easier to share common stuff
         while allowing separate implementations where needed.
      
      2. At the moment we must export all of paravirt_ops, but modules only
         need selected parts of it.  This allows us to export on a case by case
         basis (and also choose which export license we want to apply).
      
      3. Functional groupings make things a bit more readable.
      
      Struct paravirt_ops is now only used as a template to generate
      patch-site identifiers, and to extract function pointers for inserting
      into jmp/calls when patching.  It is only instantiated when needed.
      Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Zach Amsden <zach@vmware.com>
      Cc: Avi Kivity <avi@qumranet.com>
      Cc: Anthony Liguory <aliguori@us.ibm.com>
      Cc: "Glauber de Oliveira Costa" <glommer@gmail.com>
      Cc: Jun Nakajima <jun.nakajima@intel.com>
      93b1eab3
  19. 12 10月, 2007 1 次提交
  20. 11 10月, 2007 3 次提交
  21. 19 7月, 2007 1 次提交
  22. 18 7月, 2007 2 次提交
    • J
      xen: use iret directly when possible · 9ec2b804
      Jeremy Fitzhardinge 提交于
      Most of the time we can simply use the iret instruction to exit the
      kernel, rather than having to use the iret hypercall - the only
      exception is if we're returning into vm86 mode, or from delivering an
      NMI (which we don't support yet).
      
      When running native, iret has the behaviour of testing for a pending
      interrupt atomically with re-enabling interrupts.  Unfortunately
      there's no way to do this with Xen, so there's a window in which we
      could get a recursive exception after enabling events but before
      actually returning to userspace.
      
      This causes a problem: if the nested interrupt causes one of the
      task's TIF_WORK_MASK flags to be set, they will not be checked again
      before returning to userspace.  This means that pending work may be
      left pending indefinitely, until the process enters and leaves the
      kernel again.  The net effect is that a pending signal or reschedule
      event could be delayed for an unbounded amount of time.
      
      To deal with this, the xen event upcall handler checks to see if the
      EIP is within the critical section of the iret code, after events
      are (potentially) enabled up to the iret itself.  If its within this
      range, it calls the iret critical section fixup, which adjusts the
      stack to deal with any unrestored registers, and then shifts the
      stack frame up to replace the previous invocation.
      Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
      9ec2b804
    • J
      xen: Core Xen implementation · 5ead97c8
      Jeremy Fitzhardinge 提交于
      This patch is a rollup of all the core pieces of the Xen
      implementation, including:
       - booting and setup
       - pagetable setup
       - privileged instructions
       - segmentation
       - interrupt flags
       - upcalls
       - multicall batching
      
      BOOTING AND SETUP
      
      The vmlinux image is decorated with ELF notes which tell the Xen
      domain builder what the kernel's requirements are; the domain builder
      then constructs the address space accordingly and starts the kernel.
      
      Xen has its own entrypoint for the kernel (contained in an ELF note).
      The ELF notes are set up by xen-head.S, which is included into head.S.
      In principle it could be linked separately, but it seems to provoke
      lots of binutils bugs.
      
      Because the domain builder starts the kernel in a fairly sane state
      (32-bit protected mode, paging enabled, flat segments set up), there's
      not a lot of setup needed before starting the kernel proper.  The main
      steps are:
        1. Install the Xen paravirt_ops, which is simply a matter of a
           structure assignment.
        2. Set init_mm to use the Xen-supplied pagetables (analogous to the
           head.S generated pagetables in a native boot).
        3. Reserve address space for Xen, since it takes a chunk at the top
           of the address space for its own use.
        4. Call start_kernel()
      
      PAGETABLE SETUP
      
      Once we hit the main kernel boot sequence, it will end up calling back
      via paravirt_ops to set up various pieces of Xen specific state.  One
      of the critical things which requires a bit of extra care is the
      construction of the initial init_mm pagetable.  Because Xen places
      tight constraints on pagetables (an active pagetable must always be
      valid, and must always be mapped read-only to the guest domain), we
      need to be careful when constructing the new pagetable to keep these
      constraints in mind.  It turns out that the easiest way to do this is
      use the initial Xen-provided pagetable as a template, and then just
      insert new mappings for memory where a mapping doesn't already exist.
      
      This means that during pagetable setup, it uses a special version of
      xen_set_pte which ignores any attempt to remap a read-only page as
      read-write (since Xen will map its own initial pagetable as RO), but
      lets other changes to the ptes happen, so that things like NX are set
      properly.
      
      PRIVILEGED INSTRUCTIONS AND SEGMENTATION
      
      When the kernel runs under Xen, it runs in ring 1 rather than ring 0.
      This means that it is more privileged than user-mode in ring 3, but it
      still can't run privileged instructions directly.  Non-performance
      critical instructions are dealt with by taking a privilege exception
      and trapping into the hypervisor and emulating the instruction, but
      more performance-critical instructions have their own specific
      paravirt_ops.  In many cases we can avoid having to do any hypercalls
      for these instructions, or the Xen implementation is quite different
      from the normal native version.
      
      The privileged instructions fall into the broad classes of:
        Segmentation: setting up the GDT and the GDT entries, LDT,
           TLS and so on.  Xen doesn't allow the GDT to be directly
           modified; all GDT updates are done via hypercalls where the new
           entries can be validated.  This is important because Xen uses
           segment limits to prevent the guest kernel from damaging the
           hypervisor itself.
        Traps and exceptions: Xen uses a special format for trap entrypoints,
           so when the kernel wants to set an IDT entry, it needs to be
           converted to the form Xen expects.  Xen sets int 0x80 up specially
           so that the trap goes straight from userspace into the guest kernel
           without going via the hypervisor.  sysenter isn't supported.
        Kernel stack: The esp0 entry is extracted from the tss and provided to
           Xen.
        TLB operations: the various TLB calls are mapped into corresponding
           Xen hypercalls.
        Control registers: all the control registers are privileged.  The most
           important is cr3, which points to the base of the current pagetable,
           and we handle it specially.
      
      Another instruction we treat specially is CPUID, even though its not
      privileged.  We want to control what CPU features are visible to the
      rest of the kernel, and so CPUID ends up going into a paravirt_op.
      Xen implements this mainly to disable the ACPI and APIC subsystems.
      
      INTERRUPT FLAGS
      
      Xen maintains its own separate flag for masking events, which is
      contained within the per-cpu vcpu_info structure.  Because the guest
      kernel runs in ring 1 and not 0, the IF flag in EFLAGS is completely
      ignored (and must be, because even if a guest domain disables
      interrupts for itself, it can't disable them overall).
      
      (A note on terminology: "events" and interrupts are effectively
      synonymous.  However, rather than using an "enable flag", Xen uses a
      "mask flag", which blocks event delivery when it is non-zero.)
      
      There are paravirt_ops for each of cli/sti/save_fl/restore_fl, which
      are implemented to manage the Xen event mask state.  The only thing
      worth noting is that when events are unmasked, we need to explicitly
      see if there's a pending event and call into the hypervisor to make
      sure it gets delivered.
      
      UPCALLS
      
      Xen needs a couple of upcall (or callback) functions to be implemented
      by each guest.  One is the event upcalls, which is how events
      (interrupts, effectively) are delivered to the guests.  The other is
      the failsafe callback, which is used to report errors in either
      reloading a segment register, or caused by iret.  These are
      implemented in i386/kernel/entry.S so they can jump into the normal
      iret_exc path when necessary.
      
      MULTICALL BATCHING
      
      Xen provides a multicall mechanism, which allows multiple hypercalls
      to be issued at once in order to mitigate the cost of trapping into
      the hypervisor.  This is particularly useful for context switches,
      since the 4-5 hypercalls they would normally need (reload cr3, update
      TLS, maybe update LDT) can be reduced to one.  This patch implements a
      generic batching mechanism for hypercalls, which gets used in many
      places in the Xen code.
      Signed-off-by: NJeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: NChris Wright <chrisw@sous-sol.org>
      Cc: Ian Pratt <ian.pratt@xensource.com>
      Cc: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Cc: Adrian Bunk <bunk@stusta.de>
      5ead97c8
  23. 07 7月, 2007 1 次提交
    • J
      i386: fix regression, endless loop in ptrace singlestep over an int80 · 1e2e99f0
      Jason Wessel 提交于
      The commit 635cf99a introduced a
      regression.  Executing a ptrace single step after certain int80
      accesses will infinitely loop and never advance the PC.
      
      The TIF_SINGLESTEP check should be done on the return from the syscall
      and not before it.
      
      I loops on each single step on the pop right after the int80 which writes out
      to the console.  At that point you can issue as many single steps as you want
      and it will not advance any further.
      
      The test case is below:
      
      /* Test whether singlestep through an int80 syscall works.
       */
      #define _GNU_SOURCE
      #include <stdio.h>
      #include <unistd.h>
      #include <fcntl.h>
      #include <sys/ptrace.h>
      #include <sys/wait.h>
      #include <sys/mman.h>
      #include <asm/user.h>
      #include <string.h>
      
      static int child, status;
      static struct user_regs_struct regs;
      
      static void do_child()
      {
      	char str[80] = "child: int80 test\n";
      
      	ptrace(PTRACE_TRACEME, 0, 0, 0);
      	kill(getpid(), SIGUSR1);
      	write(fileno(stdout),str,strlen(str));
      	asm ("int $0x80" : : "a" (20)); /* getpid */
      }
      
      static void do_parent()
      {
      	unsigned long eip, expected = 0;
      again:
      	waitpid(child, &status, 0);
      	if (WIFEXITED(status) || WIFSIGNALED(status))
      		return;
      
      	if (WIFSTOPPED(status)) {
      		ptrace(PTRACE_GETREGS, child, 0, &regs);
      		eip = regs.eip;
      		if (expected)
      			fprintf(stderr, "child stop @ %08lx, expected %08lx %s\n",
      					eip, expected,
      					eip == expected ? "" : " <== ERROR");
      
      		if (*(unsigned short *)eip == 0x80cd) {
      			fprintf(stderr, "int 0x80 at %08x\n", (unsigned int)eip);
      			expected = eip + 2;
      		} else
      			expected = 0;
      
      		ptrace(PTRACE_SINGLESTEP, child, NULL, NULL);
      	}
      	goto again;
      }
      
      int main(int argc, char * const argv[])
      {
      	child = fork();
      	if (child)
      		do_parent();
      	else
      		do_child();
      	return 0;
      }
      Signed-off-by: NJason Wessel <jason.wessel@windriver.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: <stable@kernel.org>
      Cc: Chuck Ebbert <76306.1226@compuserve.com>
      Acked-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1e2e99f0