1. 06 Jun, 2011 1 commit
  2. 12 Jan, 2011 1 commit
  3. 27 Jul, 2010 2 commits
  4. 10 Sep, 2009 1 commit
    •
      xen: make -fstack-protector work under Xen · 577eebea
      Jeremy Fitzhardinge committed
      -fstack-protector uses a special per-cpu "stack canary" value.
      gcc generates special code in each function to test the canary to make
      sure that the function's stack hasn't been overrun.
      
      On x86-64, this is simply an offset of %gs, which is the usual per-cpu
      base segment register, so setting it up simply requires loading %gs's
      base as normal.
      
      On i386, the stack protector segment is %gs (rather than the usual kernel
      percpu %fs segment register).  This requires setting up the full kernel
      GDT and then loading %gs accordingly.  We also need to make sure %gs is
      initialized when bringing up secondary cpus too.
      
      To keep things consistent, we do the full GDT/segment register setup on
      both architectures.
      
      Because we need to avoid -fstack-protected code before setting up the GDT
      and because there's no way to disable it on a per-function basis, several
      files need to have stack-protector inhibited.
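As a rough illustration of the canary test described above, here is a minimal user-space sketch. It is not the code gcc emits: the names `demo_canary`, `demo_frame` and `demo_copy_and_check` are hypothetical, and the canary is an ordinary global rather than a per-cpu value read through %gs.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the per-cpu canary that real code reads via %gs. */
static const uintptr_t demo_canary = (uintptr_t)0x5ca1ab1e;

/* Models a stack frame: a local buffer with the canary stored just above it. */
struct demo_frame {
	char buf[16];
	uintptr_t canary;
};

/*
 * Copies len bytes into the frame without checking against the size of buf,
 * the way an overrun would.  Returns 1 if the canary survived and 0 if it was
 * clobbered (the point where gcc-generated code calls __stack_chk_fail()).
 */
int demo_copy_and_check(const char *src, size_t len)
{
	struct demo_frame frame;

	frame.canary = demo_canary;          /* prologue: plant the canary */
	if (len > sizeof(frame))             /* keep the demo itself in bounds */
		len = sizeof(frame);
	memcpy(&frame, src, len);            /* may run past buf into canary */
	return frame.canary == demo_canary;  /* epilogue: verify the canary */
}
```

A copy that stays within `buf` leaves the canary intact; a copy that covers the whole frame overwrites it and is detected.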
      
      [ Impact: allow Xen booting with stack-protector enabled ]
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
  5. 20 Aug, 2009 1 commit
  6. 16 May, 2009 1 commit
    •
      x86: Fix performance regression caused by paravirt_ops on native kernels · b4ecc126
      Jeremy Fitzhardinge committed
      Xiaohui Xin and some other folks at Intel have been looking into what's
      behind the performance hit of paravirt_ops when running native.
      
      It appears that the hit is entirely due to the paravirtualized
      spinlocks introduced by:
      
       | commit 8efcbab6
       | Date:   Mon Jul 7 12:07:51 2008 -0700
       |
       |     paravirt: introduce a "lock-byte" spinlock implementation
      
      The extra call/return in the spinlock path is somehow
      causing an increase in the cycles/instruction of somewhere around 2-7%
      (seems to vary quite a lot from test to test).  The working theory is
      that the CPU's pipeline is getting upset about the
      call->call->locked-op->return->return, and seems to be failing to
      speculate (though I haven't seen anything definitive about the precise
      reasons).  This doesn't entirely make sense, because the performance
      hit is also visible on unlock and other operations which don't involve
      locked instructions.  But spinlock operations clearly swamp all the
      other pvops operations, even though I can't imagine that they're
      nearly as common (there's only a .05% increase in instructions
      executed).
      
      If I disable just the pv-spinlock calls, my tests show that pvops is
      identical to non-pvops performance on native (my measurements show that
      it is actually about .1% faster, but Xiaohui shows a .05% slowdown).
      
      Summary of results, averaging 10 runs of the "mmperf" test, using a
      no-pvops build as baseline:
      
                        nopv        Pv-nospin   Pv-spin
      CPU cycles        100.00%      99.89%     102.18%
      instructions      100.00%     100.10%     100.15%
      CPI               100.00%      99.79%     102.03%
      cache ref         100.00%     100.84%     100.28%
      cache miss        100.00%      90.47%      88.56%
      cache miss rate   100.00%      89.72%      88.31%
      branches          100.00%      99.93%     100.04%
      branch miss       100.00%     103.66%     107.72%
      branch miss rate  100.00%     103.73%     107.67%
      wallclock         100.00%      99.90%     102.20%
      
      The clear effect here is that the 2% increase in CPI is
      directly reflected in the final wallclock time.
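The CPI row follows from the two rows above it: relative CPI is relative cycles divided by relative instructions. A small sketch of that arithmetic (the function name `relative_cpi` is made up for illustration):

```c
/*
 * Both inputs are percentages of the no-pvops baseline, as in the table.
 * The result is the CPI as a percentage of the baseline CPI.
 */
double relative_cpi(double cycles_pct, double instructions_pct)
{
	return 100.0 * cycles_pct / instructions_pct;
}
```

Feeding in the Pv-spin column (102.18 and 100.15) gives roughly 102.03, and the Pv-nospin column (99.89 and 100.10) gives roughly 99.79, matching the table to two decimal places.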
      
      (The other interesting effect is that the more ops are
      out of line calls via pvops, the lower the cache access
      and miss rates.  Not too surprising, but it suggests that
      the non-pvops kernel is over-inlined.  On the flipside,
      the branch misses go up correspondingly...)
      
      So, what's the fix?
      
      Paravirt patching turns all the pvops calls into direct calls, so
      _spin_lock etc do end up having direct calls.  For example, the compiler
      generated code for paravirtualized _spin_lock is:
      
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq  *0xffffffff805a5b30
      <_spin_lock+22>:	retq
      
      The indirect call will get patched to:
      <_spin_lock+0>:		mov    %gs:0xb4c8,%rax
      <_spin_lock+9>:		incl   0xffffffffffffe044(%rax)
      <_spin_lock+15>:	callq <__ticket_spin_lock>
      <_spin_lock+20>:	nop; nop		/* or whatever 2-byte nop */
      <_spin_lock+22>:	retq
      
      One possibility is to inline _spin_lock, etc, when building an
      optimised kernel (ie, when there's no spinlock/preempt
      instrumentation/debugging enabled).  That will remove the outer
      call/return pair, returning the instruction stream to a single
      call/return, which will presumably execute the same as the non-pvops
      case.  The downsides are: 1) it will replicate the
      preempt_disable/enable code at each lock/unlock callsite; this code is
      fairly small, but not nothing; and 2) the spinlock definitions are
      already a very heavily tangled mass of #ifdefs and other preprocessor
      magic, and making any changes will be non-trivial.
      
      The other obvious answer is to disable pv-spinlocks.  Making them a
      separate config option is fairly easy, and it would be trivial to
      enable them only when Xen is enabled (as the only non-default user).
      But it doesn't really address the common case of a distro build which
      is going to have Xen support enabled, and leaves the open question of
      whether the native performance cost of pv-spinlocks is worth the
      performance improvement on a loaded Xen system (10% saving of overall
      system CPU when guests block rather than spin).  Still it is a
      reasonable short-term workaround.
      
      [ Impact: fix pvops performance regression when running native ]
      Analysed-by: "Xin Xiaohui" <xiaohui.xin@intel.com>
      Analysed-by: "Li Xin" <xin.li@intel.com>
      Analysed-by: "Nakajima Jun" <jun.nakajima@intel.com>
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Xen-devel <xen-devel@lists.xensource.com>
      LKML-Reference: <4A0B62F7.5030802@goop.org>
      [ fixed the help text ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  7. 05 Feb, 2009 1 commit
  8. 21 Oct, 2008 1 commit
  9. 21 Aug, 2008 1 commit
  10. 31 Jul, 2008 1 commit
  11. 24 Jul, 2008 1 commit
  12. 16 Jul, 2008 1 commit
  13. 27 May, 2008 2 commits
  14. 25 Apr, 2008 3 commits
  15. 11 Oct, 2007 1 commit
  16. 18 Jul, 2007 7 commits
    •
      xen: Attempt to patch inline versions of common operations · 6487673b
      Jeremy Fitzhardinge committed
      This patch adds the mechanism to allow us to patch inline versions of
      common operations.
      
      The implementations of the direct-access versions save_fl, restore_fl,
      irq_enable and irq_disable are now in assembler, and the same code is
      used for both out of line and inline uses.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Keir Fraser <keir@xensource.com>
    •
      xen: handle external requests for shutdown, reboot and sysrq · 3e2b8fbe
      Jeremy Fitzhardinge committed
      The guest domain can be asked to shutdown or reboot itself, or have a
      sysrq key injected, via xenbus.  This patch adds a watcher for those
      events, and does the appropriate action.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
    •
      xen: SMP guest support · f87e4cac
      Jeremy Fitzhardinge committed
      This is a fairly straightforward Xen implementation of smp_ops.
      
      Xen has its own IPI mechanisms, and has no dependency on any
      APIC-based IPI.  The smp_ops hooks and the flush_tlb_others pv_op
      allow a Xen guest to avoid all APIC code in arch/i386 (the only apic
      operation is a single apic_read for the apic version number).
      
      One subtle point which needs to be addressed is unpinning pagetables
      when another cpu may have a lazy tlb reference to the pagetable. Xen
      will not allow an in-use pagetable to be unpinned, so we must find any
      other cpus with a reference to the pagetable and get them to shoot
      down their references.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Andi Kleen <ak@suse.de>
    •
      xen: time implementation · 15c84731
      Jeremy Fitzhardinge committed
      Xen maintains a base clock which measures nanoseconds since system
      boot.  This is provided to guests via a shared page which contains a
      base time in ns, a tsc timestamp at that point and tsc frequency
      parameters.  Guests can compute the current time by reading the tsc
      and using it to extrapolate the current time from the basetime.  The
      hypervisor makes sure that the frequency parameters are updated
      regularly, particularly if the tsc changes rate or stops.
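The extrapolation can be sketched as follows. This assumes the frequency parameters take the form of a pre-shift plus a 32.32 fixed-point multiplier that converts tsc ticks to nanoseconds; the struct layout and all `demo_` names are hypothetical and are not taken from the Xen headers.

```c
#include <stdint.h>

/* Hypothetical snapshot of the time fields published on the shared page. */
struct demo_time_info {
	uint64_t tsc_timestamp;   /* tsc value when the snapshot was taken */
	uint64_t system_time_ns;  /* ns since boot at that same moment */
	uint32_t tsc_to_ns_mul;   /* 32.32 fixed-point ns-per-tick multiplier */
	int8_t   tsc_shift;       /* shift applied to the tick delta first */
};

/* Extrapolates the current time in ns from a fresh tsc reading. */
uint64_t demo_now_ns(const struct demo_time_info *t, uint64_t tsc_now)
{
	uint64_t delta = tsc_now - t->tsc_timestamp;
	uint64_t hi, lo;

	if (t->tsc_shift >= 0)
		delta <<= t->tsc_shift;
	else
		delta >>= -t->tsc_shift;

	/*
	 * Compute (delta * mul) >> 32 without a 128-bit type by splitting
	 * delta into its high and low 32-bit halves.
	 */
	hi = (delta >> 32) * t->tsc_to_ns_mul;
	lo = ((delta & 0xffffffffu) * t->tsc_to_ns_mul) >> 32;
	return t->system_time_ns + hi + lo;
}
```

For example, a 2GHz tsc corresponds to 0.5ns per tick, which is a multiplier of 0x80000000 with a shift of 0; 2000 ticks after a snapshot taken at 5000ns the function returns 6000ns. A real implementation would also need to re-read the snapshot if the hypervisor updated it during the computation.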
      
      This is implemented as a clocksource, so the interface to the rest of
      the kernel is a simple clocksource which simply returns the current
      time directly in nanoseconds.
      
      Xen also provides a simple timer mechanism, which allows a timeout to
      be set in the future.  When that time arrives, a timer event is sent
      to the guest.  There are two timer interfaces:
       - An old one which also delivers a stream of (unused) ticks at 100Hz,
         and on the same event, the actual timer events.  The 100Hz ticks
         cause a lot of spurious wakeups, but are basically harmless.
       - The new timer interface doesn't have the 100Hz ticks, and can also
         fail if the specified time is in the past.
      
      This code presents the Xen timer as a clockevent driver, and uses the
      new interface by preference.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
    •
      xen: event channels · e46cdb66
      Jeremy Fitzhardinge committed
      Xen implements interrupts in terms of event channels.  Each guest
      domain gets 1024 event channels which can be used for a variety of
      purposes, such as Xen timer events, inter-domain events,
      inter-processor events (IPI) or for real hardware IRQs.
      
      Within the kernel, we map the event channels to IRQs, and implement
      the whole interrupt handling using a Xen irq_chip.
      
      Rather than setting NR_IRQ to 1024 under PARAVIRT in order to
      accommodate Xen, we create a dynamic mapping between event channels and
      IRQs.  Ideally, Linux will eventually move towards dynamically
      allocating per-irq structures, and we can use a 1:1 mapping between
      event channels and irqs.
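A dynamic mapping of this kind could look roughly like the sketch below: a pair of lookup tables, with an IRQ allocated the first time an event channel is bound. The table sizes, the first-free allocation policy and all `demo_` names are assumptions for illustration, not the patch's actual data structures.

```c
#define DEMO_NR_EVENT_CHANNELS 1024
#define DEMO_NR_IRQS           64

/* Both tables store index + 1 so that zero-initialised means "unbound". */
static unsigned short demo_evtchn_to_irq[DEMO_NR_EVENT_CHANNELS];
static unsigned short demo_irq_to_evtchn[DEMO_NR_IRQS];

/* Returns the IRQ bound to evtchn, or -1 if there is none. */
int demo_irq_from_evtchn(unsigned evtchn)
{
	if (evtchn >= DEMO_NR_EVENT_CHANNELS)
		return -1;
	return (int)demo_evtchn_to_irq[evtchn] - 1;
}

/* Binds evtchn to the first free IRQ (or returns its existing binding). */
int demo_bind_evtchn_to_irq(unsigned evtchn)
{
	int irq;

	if (evtchn >= DEMO_NR_EVENT_CHANNELS)
		return -1;
	irq = demo_irq_from_evtchn(evtchn);
	if (irq >= 0)
		return irq;
	for (irq = 0; irq < DEMO_NR_IRQS; irq++) {
		if (demo_irq_to_evtchn[irq] == 0) {
			demo_irq_to_evtchn[irq] = (unsigned short)(evtchn + 1);
			demo_evtchn_to_irq[evtchn] = (unsigned short)(irq + 1);
			return irq;
		}
	}
	return -1;  /* out of IRQs */
}

/* Releases an IRQ so that it can be reused by another event channel. */
void demo_unbind_irq(int irq)
{
	if (irq < 0 || irq >= DEMO_NR_IRQS || demo_irq_to_evtchn[irq] == 0)
		return;
	demo_evtchn_to_irq[demo_irq_to_evtchn[irq] - 1] = 0;
	demo_irq_to_evtchn[irq] = 0;
}
```

The point of the indirection is that the IRQ space only has to be as large as the number of event channels actually in use, not the full 1024.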
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
    •
      xen: virtual mmu · 3b827c1b
      Jeremy Fitzhardinge committed
      Xen pagetable handling, including the machinery to implement direct
      pagetables.
      
      Xen presents the real CPU's pagetables directly to guests, with no
      added shadowing or other layer of abstraction.  Naturally this means
      the hypervisor must maintain close control over what the guest can put
      into the pagetable.
      
      When the guest modifies the pte/pmd/pgd, it must convert its
      domain-specific notion of a "physical" pfn into a global machine frame
      number (mfn) before inserting the entry into the pagetable.  Xen will
      check to make sure the domain is allowed to create a mapping of the
      given mfn.
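The pfn-to-mfn conversion can be pictured as a table lookup performed while building the pagetable entry. The sketch below uses a tiny made-up translation table; `demo_p2m`, `demo_pfn_to_mfn` and `demo_make_pte` are hypothetical names, and the mfn values are arbitrary.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT  12
#define DEMO_FLAGS_MASK  0xfffu
#define DEMO_INVALID_MFN (~0UL)

/* Made-up pfn -> mfn table: the domain's pages are scattered in machine memory. */
static const unsigned long demo_p2m[] = { 0x1a2b0, 0x00f13, 0x7c001, 0x00042 };

/* Translates a domain-local pfn into a machine frame number. */
unsigned long demo_pfn_to_mfn(unsigned long pfn)
{
	if (pfn >= sizeof(demo_p2m) / sizeof(demo_p2m[0]))
		return DEMO_INVALID_MFN;
	return demo_p2m[pfn];
}

/*
 * Builds the value that would be written into the pagetable: the machine
 * frame in the address bits plus the low flag bits.  Returns 0 (a
 * not-present entry) if the pfn has no translation.
 */
uint64_t demo_make_pte(unsigned long pfn, uint64_t flags)
{
	unsigned long mfn = demo_pfn_to_mfn(pfn);

	if (mfn == DEMO_INVALID_MFN)
		return 0;
	return ((uint64_t)mfn << DEMO_PAGE_SHIFT) | (flags & DEMO_FLAGS_MASK);
}
```

Reading an entry back requires the inverse (mfn to pfn) translation, which is omitted here.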
      
      Xen also requires that all mappings the guest has of its own active
      pagetable are read-only.  This is relatively easy to implement in
      Linux because all pagetables share the same pte pages for kernel
      mappings, so updating the pte in one pagetable will implicitly update
      the mapping in all pagetables.
      
      Normally a pagetable becomes active when you point to it with cr3 (or
      the Xen equivalent), but when you do so, Xen must check the whole
      pagetable for correctness, which is clearly a performance problem.
      
      Xen solves this with pinning which keeps a pagetable effectively
      active even if it's currently unused, which means that all the normal
      update rules are enforced.  This means that it need not revalidate the
      pagetable when loading cr3.
      
      This patch has a first-cut implementation of pinning, but it is more
      fully implemented in a later patch.
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
    •
      xen: Core Xen implementation · 5ead97c8
      Jeremy Fitzhardinge committed
      This patch is a rollup of all the core pieces of the Xen
      implementation, including:
       - booting and setup
       - pagetable setup
       - privileged instructions
       - segmentation
       - interrupt flags
       - upcalls
       - multicall batching
      
      BOOTING AND SETUP
      
      The vmlinux image is decorated with ELF notes which tell the Xen
      domain builder what the kernel's requirements are; the domain builder
      then constructs the address space accordingly and starts the kernel.
      
      Xen has its own entrypoint for the kernel (contained in an ELF note).
      The ELF notes are set up by xen-head.S, which is included into head.S.
      In principle it could be linked separately, but it seems to provoke
      lots of binutils bugs.
      
      Because the domain builder starts the kernel in a fairly sane state
      (32-bit protected mode, paging enabled, flat segments set up), there's
      not a lot of setup needed before starting the kernel proper.  The main
      steps are:
        1. Install the Xen paravirt_ops, which is simply a matter of a
           structure assignment.
        2. Set init_mm to use the Xen-supplied pagetables (analogous to the
           head.S generated pagetables in a native boot).
        3. Reserve address space for Xen, since it takes a chunk at the top
           of the address space for its own use.
        4. Call start_kernel()
      
      PAGETABLE SETUP
      
      Once we hit the main kernel boot sequence, it will end up calling back
      via paravirt_ops to set up various pieces of Xen specific state.  One
      of the critical things which requires a bit of extra care is the
      construction of the initial init_mm pagetable.  Because Xen places
      tight constraints on pagetables (an active pagetable must always be
      valid, and must always be mapped read-only to the guest domain), we
      need to be careful when constructing the new pagetable to keep these
      constraints in mind.  It turns out that the easiest way to do this is
      use the initial Xen-provided pagetable as a template, and then just
      insert new mappings for memory where a mapping doesn't already exist.
      
      This means that during pagetable setup, it uses a special version of
      xen_set_pte which ignores any attempt to remap a read-only page as
      read-write (since Xen will map its own initial pagetable as RO), but
      lets other changes to the ptes happen, so that things like NX are set
      properly.
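The rule for the setup-time set_pte can be sketched like this. The bit values follow the usual x86 layout, but the `demo_` names are hypothetical and the function writes to plain memory rather than going through Xen.

```c
#include <stdint.h>

#define DEMO_PTE_PRESENT 0x1ULL
#define DEMO_PTE_RW      0x2ULL
#define DEMO_PTE_NX      (1ULL << 63)

/*
 * Setup-time pte update: if the slot already holds a present read-only
 * mapping, drop the RW bit from the new value so the page stays read-only.
 * Every other bit of the new value (NX, for instance) is applied as asked.
 */
void demo_set_pte_init(uint64_t *ptep, uint64_t newval)
{
	uint64_t old = *ptep;

	if ((old & DEMO_PTE_PRESENT) && !(old & DEMO_PTE_RW))
		newval &= ~DEMO_PTE_RW;
	*ptep = newval;
}
```

An empty slot accepts a read-write mapping unchanged, while a slot that Xen mapped read-only keeps that protection but still picks up NX.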
      
      PRIVILEGED INSTRUCTIONS AND SEGMENTATION
      
      When the kernel runs under Xen, it runs in ring 1 rather than ring 0.
      This means that it is more privileged than user-mode in ring 3, but it
      still can't run privileged instructions directly.  Non-performance
      critical instructions are dealt with by taking a privilege exception
      and trapping into the hypervisor and emulating the instruction, but
      more performance-critical instructions have their own specific
      paravirt_ops.  In many cases we can avoid having to do any hypercalls
      for these instructions, or the Xen implementation is quite different
      from the normal native version.
      
      The privileged instructions fall into the broad classes of:
        Segmentation: setting up the GDT and the GDT entries, LDT,
           TLS and so on.  Xen doesn't allow the GDT to be directly
           modified; all GDT updates are done via hypercalls where the new
           entries can be validated.  This is important because Xen uses
           segment limits to prevent the guest kernel from damaging the
           hypervisor itself.
        Traps and exceptions: Xen uses a special format for trap entrypoints,
           so when the kernel wants to set an IDT entry, it needs to be
           converted to the form Xen expects.  Xen sets int 0x80 up specially
           so that the trap goes straight from userspace into the guest kernel
           without going via the hypervisor.  sysenter isn't supported.
        Kernel stack: The esp0 entry is extracted from the tss and provided to
           Xen.
        TLB operations: the various TLB calls are mapped into corresponding
           Xen hypercalls.
        Control registers: all the control registers are privileged.  The most
           important is cr3, which points to the base of the current pagetable,
           and we handle it specially.
      
      Another instruction we treat specially is CPUID, even though it's not
      privileged.  We want to control what CPU features are visible to the
      rest of the kernel, and so CPUID ends up going into a paravirt_op.
      Xen implements this mainly to disable the ACPI and APIC subsystems.
      
      INTERRUPT FLAGS
      
      Xen maintains its own separate flag for masking events, which is
      contained within the per-cpu vcpu_info structure.  Because the guest
      kernel runs in ring 1 and not 0, the IF flag in EFLAGS is completely
      ignored (and must be, because even if a guest domain disables
      interrupts for itself, it can't disable them overall).
      
      (A note on terminology: "events" and interrupts are effectively
      synonymous.  However, rather than using an "enable flag", Xen uses a
      "mask flag", which blocks event delivery when it is non-zero.)
      
      There are paravirt_ops for each of cli/sti/save_fl/restore_fl, which
      are implemented to manage the Xen event mask state.  The only thing
      worth noting is that when events are unmasked, we need to explicitly
      see if there's a pending event and call into the hypervisor to make
      sure it gets delivered.
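A single-cpu sketch of these four operations, assuming a per-vcpu structure with a pending flag and a mask flag. The struct, the `demo_` names and the callback stub are hypothetical; real code also needs compiler barriers and must guard against preemption between clearing the mask and testing for a pending event.

```c
#include <stdint.h>

#define DEMO_X86_EFLAGS_IF 0x200UL

/* Hypothetical per-vcpu event state shared with the hypervisor. */
struct demo_vcpu_info {
	uint8_t pending;  /* an event is waiting to be delivered */
	uint8_t mask;     /* non-zero blocks event delivery */
};

struct demo_vcpu_info demo_vcpu;
unsigned demo_forced_callbacks;  /* counts simulated calls into Xen */

/* Stand-in for asking the hypervisor to deliver the pending event now. */
static void demo_force_evtchn_callback(void)
{
	demo_forced_callbacks++;
	demo_vcpu.pending = 0;
}

/* cli equivalent: just set the mask, no hypercall needed. */
void demo_irq_disable(void)
{
	demo_vcpu.mask = 1;
}

/* sti equivalent: unmask, then make sure a waiting event gets delivered. */
void demo_irq_enable(void)
{
	demo_vcpu.mask = 0;
	if (demo_vcpu.pending)
		demo_force_evtchn_callback();
}

/* Presents the mask as an EFLAGS-style value: IF set means enabled. */
unsigned long demo_save_fl(void)
{
	return demo_vcpu.mask ? 0 : DEMO_X86_EFLAGS_IF;
}

void demo_restore_fl(unsigned long flags)
{
	if (flags & DEMO_X86_EFLAGS_IF)
		demo_irq_enable();
	else
		demo_irq_disable();
}
```

Note the inverted sense: the rest of the kernel thinks in terms of an enable flag, so save_fl and restore_fl translate between that and the mask.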
      
      UPCALLS
      
      Xen needs a couple of upcall (or callback) functions to be implemented
      by each guest.  One is the event upcalls, which is how events
      (interrupts, effectively) are delivered to the guests.  The other is
      the failsafe callback, which is used to report errors in either
      reloading a segment register, or caused by iret.  These are
      implemented in i386/kernel/entry.S so they can jump into the normal
      iret_exc path when necessary.
      
      MULTICALL BATCHING
      
      Xen provides a multicall mechanism, which allows multiple hypercalls
      to be issued at once in order to mitigate the cost of trapping into
      the hypervisor.  This is particularly useful for context switches,
      since the 4-5 hypercalls they would normally need (reload cr3, update
      TLS, maybe update LDT) can be reduced to one.  This patch implements a
      generic batching mechanism for hypercalls, which gets used in many
      places in the Xen code.
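The batching idea can be sketched as a small queue that is handed over in one go. Nothing here talks to a hypervisor: `demo_traps` merely counts how often a trap would have happened, and the batch size, op numbers and `demo_` names are made up.

```c
#define DEMO_BATCH_MAX 8

/* One queued hypercall: an operation number and a single argument. */
struct demo_call {
	unsigned op;
	unsigned long arg;
};

static struct demo_call demo_batch[DEMO_BATCH_MAX];
unsigned demo_batch_len;
unsigned demo_traps;       /* simulated entries into the hypervisor */
unsigned demo_calls_done;  /* simulated hypercalls carried out */

/* Issues everything queued so far with a single (simulated) trap. */
void demo_flush(void)
{
	if (demo_batch_len == 0)
		return;
	demo_traps++;
	demo_calls_done += demo_batch_len;
	demo_batch_len = 0;
}

/* Queues a call, flushing first if the batch is already full. */
void demo_queue(unsigned op, unsigned long arg)
{
	if (demo_batch_len == DEMO_BATCH_MAX)
		demo_flush();
	demo_batch[demo_batch_len].op = op;
	demo_batch[demo_batch_len].arg = arg;
	demo_batch_len++;
}
```

A context switch would queue its cr3, TLS and LDT updates and then flush once, paying for one trap instead of three or more.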
      Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>
      Cc: Ian Pratt <ian.pratt@xensource.com>
      Cc: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
      Cc: Adrian Bunk <bunk@stusta.de>