Commits · 6abcd98ffafbff81f0bfd7ee1d129e634af13245 · xiphi1978 / linux

30 Jan, 2008 2 commits

Authored by Glauber de Oliveira Costa on Jan 30, 2008

This patch consolidates the irqflags include files containing common
paravirt definitions. The native definitions for interrupt handling, halt,
and such are the same for 32 and 64 bit, and they are kept in irqflags.h.
The differences are split into the arch-specific files.

The syscall function, irq_enable_sysexit, has a very i386-specific name,
so it is renamed to a more general one.
Signed-off-by: Glauber de Oliveira Costa <gcosta@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

6abcd98f

x86: use u32 for some lapic functions · 42e0a9aa

Authored by Thomas Gleixner on Jan 30, 2008

Use u32 so 32 and 64bit have the same interface.

Andrew Morton: xen, lguest build fixes
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

42e0a9aa

24 Jan, 2008 1 commit

xen: disable vcpu_info placement for now · f9c4cfe9

Authored by Jeremy Fitzhardinge on Jan 23, 2008

There have been several reports of Xen guest domains locking up when
using vcpu_info structure placement. Disable it for now.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f9c4cfe9

11 Dec, 2007 1 commit

xen: relax signature check · 7999f4b4

Authored by Jeremy Fitzhardinge on Dec 10, 2007

Some versions of Xen 3.x set their magic number to "xen-3.[12]", so
relax the test to match them.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7999f4b4
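
The relaxed check described in the commit above can be pictured as a prefix comparison rather than an exact string match. This is a minimal sketch, not the kernel's code: the helper name `xen_magic_ok` and the choice of `"xen-3"` as the accepted prefix are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: accept any magic string that begins with "xen-3"
 * instead of demanding one exact version string. The prefix is an
 * assumption chosen so that "xen-3.0", "xen-3.1" and "xen-3.2" all pass. */
static bool xen_magic_ok(const char *magic)
{
    static const char prefix[] = "xen-3";

    assert(magic != NULL);
    /* sizeof(prefix) - 1 skips the terminating NUL of the prefix. */
    return strncmp(magic, prefix, sizeof(prefix) - 1) == 0;
}
```

A prefix test keeps working when a later 3.x hypervisor changes only the minor version, at the price of accepting strings the kernel has never been tested against.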

18 Oct, 2007 1 commit

i386: Clean up duplicate includes in arch/i386/xen/ · fb893e99

Authored by Jesper Juhl on Oct 17, 2007

This patch cleans up duplicate includes in
	arch/i386/xen/

[ tglx: arch/x86 adaptation ]
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

fb893e99

17 Oct, 2007 8 commits

[x86] remove uses of magic macros for boot_params access · 30c82645

Authored by H. Peter Anvin on Oct 15, 2007

Instead of using magic macros for boot_params access, simply use the
boot_params structure.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>

30c82645

xen: fix incorrect vcpu_register_vcpu_info hypercall argument · e3d26976

Authored by Jeremy Fitzhardinge on Oct 16, 2007

The kernel's copy of struct vcpu_register_vcpu_info was out of date,
at best causing the hypercall to fail and the guest kernel to fall
back to the old mechanism, or worse, causing random memory corruption.

[ Stable folks: applies to 2.6.23 ]
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Stable Kernel <stable@kernel.org>
Cc: Morten Bøgeskov <xen-users@morten.bogeskov.dk>
Cc: Mark Williamson <mark.williamson@cl.cam.ac.uk>

e3d26976

xen: ask the hypervisor how much space it needs reserved · fb1d8404

Authored by Jeremy Fitzhardinge on Oct 16, 2007

Ask the hypervisor how much space it needs reserved, since 32-on-64
doesn't need any space, and it may change in future.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>

fb1d8404

xen: lock pte pages while pinning/unpinning · 74260714

Authored by Jeremy Fitzhardinge on Oct 16, 2007

When a pagetable is created, it is made globally visible in the rmap
prio tree before it is pinned via arch_dup_mmap(), and remains in the
rmap tree while it is unpinned with arch_exit_mmap().

This means that other CPUs may race with the pinning/unpinning
process, and see a pte between when it gets marked RO and actually
pinned, causing any pte updates to fail with write-protect faults.

As a result, all pte pages must be properly locked, and only unlocked
once the pinning/unpinning process has finished.

In order to avoid taking spinlocks for the whole pagetable - which may
overflow the PREEMPT_BITS portion of preempt counter - it locks and pins
each pte page individually, and then finally pins the whole pagetable.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickens <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <ak@suse.de>
Cc: Keir Fraser <keir@xensource.com>
Cc: Jan Beulich <jbeulich@novell.com>

74260714

xen: deal with stale cr3 values when unpinning pagetables · 9f79991d

Authored by Jeremy Fitzhardinge on Oct 16, 2007

When a pagetable is no longer in use, it must be unpinned so that its
pages can be freed.  However, this is only possible if there are no
stray uses of the pagetable.  The code currently deals with all the
usual cases, but there's a rare case where a vcpu is changing cr3, but
is doing so lazily, and the change hasn't actually happened by the time
the pagetable is unpinned, even though it appears to have been completed.

This change adds a second per-cpu cr3 variable - xen_current_cr3 -
which tracks the actual state of the vcpu cr3.  It is only updated once
the actual hypercall to set cr3 has been completed.  Other processors
wishing to unpin a pagetable can check other vcpus' xen_current_cr3
values to see if any cross-cpu IPIs are needed to clean things up.

[ Stable folks: 2.6.23 bugfix ]
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Stable Kernel <stable@kernel.org>

9f79991d
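
The two-variable scheme from the commit above can be sketched as follows. The variable names `xen_cr3` and `xen_current_cr3` come from the commit text; modelling per-cpu data as plain arrays, and the helper names `request_cr3`, `cr3_hypercall_done` and `cpus_needing_ipi`, are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4

/* Per-cpu values modelled as arrays (an assumption of this sketch). */
static uintptr_t xen_cr3[NR_CPUS];          /* what the kernel asked for   */
static uintptr_t xen_current_cr3[NR_CPUS];  /* what the vcpu actually uses */

/* Request a cr3 change. With lazy batching the hypercall may run later,
 * so only the "requested" value changes here. */
static void request_cr3(int cpu, uintptr_t cr3)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    xen_cr3[cpu] = cr3;
}

/* Called once the set-cr3 hypercall has really completed. */
static void cr3_hypercall_done(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    xen_current_cr3[cpu] = xen_cr3[cpu];
}

/* Bitmask of cpus whose *actual* cr3 still points at pgd; those are the
 * ones that would need a cross-cpu IPI before pgd can be unpinned. */
static unsigned int cpus_needing_ipi(uintptr_t pgd)
{
    unsigned int mask = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (xen_current_cr3[cpu] == pgd)
            mask |= 1u << cpu;
    return mask;
}
```

The point of the second variable is visible when a cpu has requested a new cr3 but not yet flushed the batch: checking only the requested value would wrongly conclude that the old pagetable is free.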

Clean up duplicate includes in arch/i386/xen/ · d626a1f1

Authored by Jesper Juhl on Oct 16, 2007

This patch cleans up duplicate includes in
	arch/i386/xen/
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>

d626a1f1

paravirt: clean up lazy mode handling · 8965c1c0

Authored by Jeremy Fitzhardinge on Oct 16, 2007

Currently, the set_lazy_mode pv_op is overloaded with 5 functions:
 1. enter lazy cpu mode
 2. leave lazy cpu mode
 3. enter lazy mmu mode
 4. leave lazy mmu mode
 5. flush pending batched operations

This complicates each paravirt backend, since it needs to deal with
all the possible state transitions, handling flushing, etc. In
particular, flushing is quite distinct from the other 4 functions, and
seems to just cause complication.

This patch removes the set_lazy_mode operation, and adds "enter" and
"leave" lazy mode operations on mmu_ops and cpu_ops.  All the logic
associated with enter and leaving lazy states is now in common code
(basically BUG_ONs to make sure that no mode is current when entering
a lazy mode, and make sure that the mode is current when leaving).
Also, flush is handled in a common way, by simply leaving and
re-entering the lazy mode.

The result is that the Xen, lguest and VMI lazy mode implementations
are much simpler.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Zach Amsden <zach@vmware.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Anthony Liguory <aliguori@us.ibm.com>
Cc: "Glauber de Oliveira Costa" <glommer@gmail.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>

8965c1c0
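
The common-code logic described in the commit above (sanity checks on enter and leave, and flush expressed as leave followed by re-enter) might look roughly like this. It is a sketch under assumptions: the names `enter_lazy`, `leave_lazy`, `lazy_flush` and `backend_flush` are hypothetical, a single global stands in for per-cpu state, and `assert` plays the role of `BUG_ON`.

```c
#include <assert.h>

enum lazy_mode { LAZY_NONE, LAZY_MMU, LAZY_CPU };

static enum lazy_mode current_lazy = LAZY_NONE; /* per-cpu in a real kernel */
static int flush_count;                         /* counts backend flushes   */

/* Stand-in for a backend hook that issues any batched operations. */
static void backend_flush(void)
{
    flush_count++;
}

static void enter_lazy(enum lazy_mode mode)
{
    assert(current_lazy == LAZY_NONE);  /* no mode may already be current */
    current_lazy = mode;
}

static void leave_lazy(enum lazy_mode mode)
{
    assert(current_lazy == mode);       /* must leave the mode we are in */
    backend_flush();                    /* leaving drains the batch */
    current_lazy = LAZY_NONE;
}

/* Flush needs no dedicated operation: leave the current mode, which
 * drains the batch, then re-enter the same mode. */
static void lazy_flush(void)
{
    enum lazy_mode mode = current_lazy;

    if (mode != LAZY_NONE) {
        leave_lazy(mode);
        enter_lazy(mode);
    }
}
```

With this shape a backend only has to supply "enter" and "leave" hooks; the state-transition checks live in one place.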

paravirt: refactor struct paravirt_ops into smaller pv_*_ops · 93b1eab3

Authored by Jeremy Fitzhardinge on Oct 16, 2007

This patch refactors the paravirt_ops structure into groups of
functionally related ops:

pv_info - random info, rather than function entrypoints
pv_init_ops - functions used at boot time (some for module_init too)
pv_misc_ops - lazy mode, which didn't fit well anywhere else
pv_time_ops - time-related functions
pv_cpu_ops - various privileged instruction ops
pv_irq_ops - operations for managing interrupt state
pv_apic_ops - APIC operations
pv_mmu_ops - operations for managing pagetables

There are several motivations for this:

1. Some of these ops will be general to all x86, and some will be
   i386/x86-64 specific.  This makes it easier to share common stuff
   while allowing separate implementations where needed.

2. At the moment we must export all of paravirt_ops, but modules only
   need selected parts of it.  This allows us to export on a case by case
   basis (and also choose which export license we want to apply).

3. Functional groupings make things a bit more readable.

Struct paravirt_ops is now only used as a template to generate
patch-site identifiers, and to extract function pointers for inserting
into jmp/calls when patching.  It is only instantiated when needed.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@suse.de>
Cc: Zach Amsden <zach@vmware.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Anthony Liguory <aliguori@us.ibm.com>
Cc: "Glauber de Oliveira Costa" <glommer@gmail.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>

93b1eab3

11 Oct, 2007 1 commit

i386: move xen · 9702785a

Authored by Thomas Gleixner on Oct 11, 2007

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

9702785a

20 Sep, 2007 1 commit

xen: don't bother trying to set cr4 · 389a3c02

Authored by Jeremy Fitzhardinge on Sep 18, 2007

Xen ignores all updates to cr4, and some versions will kill the domain if
you try to change its value.  Just ignore all changes.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

389a3c02

12 Aug, 2007 1 commit

i386: Make patching more robust, fix paravirt issue · ab144f5e

Authored by Andi Kleen on Aug 10, 2007

Commit 19d36ccd "x86: Fix alternatives and kprobes to remap
write-protected kernel text" does its patching with code which is
itself being patched.

In particular, paravirt_ops does patching in two stages: first it
calls paravirt_ops.patch, then it fills any remaining instructions
with nop_out().  nop_out calls text_poke() which calls
lookup_address() which calls pgd_val() (aka paravirt_ops.pgd_val):
that call site is one of the places we patch.

If we always do patching as one single call to text_poke(), we only
need make sure we're not patching the memcpy in text_poke itself.
This means the prototype to paravirt_ops.patch needs to change, to
marshal the new code into a buffer rather than patching in place as it
does now.  It also means all patching goes through text_poke(), which
is known to be safe (apply_alternatives is also changed to make a
single patch).

AK: fix compilation on x86-64 (bad rusty!)
AK: fix boot on x86-64 (sigh)
AK: merged with other patches
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ab144f5e
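
The "marshal into a buffer, then write once" idea from the commit above can be illustrated like this. It is only a sketch: `text_poke_sketch` is a stand-in that merely copies bytes, `patch_site` and `MAX_PATCH` are hypothetical names, and the real `text_poke()` deals with write-protected text, which is ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NOP       0x90   /* one-byte x86 nop */
#define MAX_PATCH 32     /* arbitrary scratch-buffer size for the sketch */

static int poke_calls;   /* how many times "kernel text" was written */

/* Stand-in for text_poke(): the only function that writes to the site. */
static void text_poke_sketch(unsigned char *addr, const unsigned char *src,
                             size_t len)
{
    memcpy(addr, src, len);
    poke_calls++;
}

/* Build the replacement in a scratch buffer, pad the tail with nops, and
 * write the whole call site with a single poke. Returns 0 on success and
 * -1 if the new code does not fit. */
static int patch_site(unsigned char *site, size_t site_len,
                      const unsigned char *insn, size_t insn_len)
{
    unsigned char buf[MAX_PATCH];

    assert(site != NULL && insn != NULL);
    if (site_len > MAX_PATCH || insn_len > site_len)
        return -1;
    memcpy(buf, insn, insn_len);
    memset(buf + insn_len, NOP, site_len - insn_len);
    text_poke_sketch(site, buf, site_len);
    return 0;
}
```

Because the site is never left half-written between two separate pokes, the only code that must not itself be a patch target is the copy inside the poke routine.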

18 Jul, 2007 15 commits

xen: use iret directly when possible · 9ec2b804

Authored by Jeremy Fitzhardinge on Jul 17, 2007

Most of the time we can simply use the iret instruction to exit the
kernel, rather than having to use the iret hypercall - the only
exception is if we're returning into vm86 mode, or from delivering an
NMI (which we don't support yet).

When running native, iret has the behaviour of testing for a pending
interrupt atomically with re-enabling interrupts. Unfortunately
there's no way to do this with Xen, so there's a window in which we
could get a recursive exception after enabling events but before
actually returning to userspace.

This causes a problem: if the nested interrupt causes one of the
task's TIF_WORK_MASK flags to be set, they will not be checked again
before returning to userspace. This means that pending work may be
left pending indefinitely, until the process enters and leaves the
kernel again. The net effect is that a pending signal or reschedule
event could be delayed for an unbounded amount of time.

To deal with this, the xen event upcall handler checks to see if the
EIP is within the critical section of the iret code, after events
are (potentially) enabled up to the iret itself. If it's within this
range, it calls the iret critical section fixup, which adjusts the
stack to deal with any unrestored registers, and then shifts the
stack frame up to replace the previous invocation.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>

9ec2b804
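
The test the upcall handler performs in the commit above boils down to a range check on the interrupted instruction pointer. A minimal sketch, assuming the critical section is described by a half-open address range; the type `crit_range` and the function `in_iret_critical` are hypothetical names, and the fixup itself is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of the iret critical section: it starts where
 * events are (potentially) re-enabled and ends at the iret instruction.
 * The range is half-open: [start, end). */
struct crit_range {
    uintptr_t start;
    uintptr_t end;
};

/* True if the interrupted EIP lies inside the critical section, in which
 * case the handler would run the fixup before continuing. */
static bool in_iret_critical(const struct crit_range *r, uintptr_t eip)
{
    assert(r != 0 && r->start <= r->end);
    return eip >= r->start && eip < r->end;
}
```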

xen: Attempt to patch inline versions of common operations · 6487673b

Authored by Jeremy Fitzhardinge on Jul 17, 2007

This patch adds the mechanism to allow us to patch inline versions of
common operations.

The implementations of the direct-access versions save_fl, restore_fl,
irq_enable and irq_disable are now in assembler, and the same code is
used for both out of line and inline uses.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Keir Fraser <keir@xensource.com>

6487673b

xen: Place vcpu_info structure into per-cpu memory · 60223a32

Authored by Jeremy Fitzhardinge on Jul 17, 2007

An experimental patch for Xen allows guests to place their vcpu_info
structs anywhere.  We try to use this to place the vcpu_info into the
PDA, which allows direct access.

If this works, then switch to using direct access operations for
irq_enable, disable, save_fl and restore_fl.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Keir Fraser <keir@xensource.com>

60223a32

xen: machine operations · fefa629a

Authored by Jeremy Fitzhardinge on Jul 17, 2007

Make the appropriate hypercalls to halt and reboot the virtual machine.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>

fefa629a

xen: hack to prevent bad segment register reload · 8b84ad94

Authored by Jeremy Fitzhardinge on Jul 17, 2007

The hypervisor saves and restores the segment registers as part of the
state it saves while context switching.  If, during a context switch,
the next process doesn't use the TLS segments, it invalidates the GDT
entry, causing the segment register reload to fault.  This fault
effectively doubles the cost of a context switch.

This patch is a band-aid workaround which clears the usermode %gs
after it has been saved for the previous process, but before it gets
reloaded for the next, and it avoids having the hypervisor attempt to
erroneously reload it.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>

8b84ad94

xen: lazy-mmu operations · d66bf8fc

Authored by Jeremy Fitzhardinge on Jul 17, 2007

This patch uses the lazy-mmu hooks to batch mmu operations where
possible.  This is primarily useful for batching operations applied to
active pagetables, which happens during mprotect, munmap, mremap and
the like (mmap does not do bulk pagetable operations, so it isn't
helped).
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>

d66bf8fc

xen: Add support for preemption · f120f13e

Authored by Jeremy Fitzhardinge on Jul 17, 2007

Add Xen support for preemption.  This is mostly a cleanup of existing
preempt_enable/disable calls, or just comments to explain the current
usage.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>

f120f13e

xen: SMP guest support · f87e4cac

Authored by Jeremy Fitzhardinge on Jul 17, 2007

This is a fairly straightforward Xen implementation of smp_ops.

Xen has its own IPI mechanisms, and has no dependency on any
APIC-based IPI.  The smp_ops hooks and the flush_tlb_others pv_op
allow a Xen guest to avoid all APIC code in arch/i386 (the only apic
operation is a single apic_read for the apic version number).

One subtle point which needs to be addressed is unpinning pagetables
when another cpu may have a lazy tlb reference to the pagetable. Xen
will not allow an in-use pagetable to be unpinned, so we must find any
other cpus with a reference to the pagetable and get them to shoot
down their references.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andi Kleen <ak@suse.de>

f87e4cac

xen: Implement sched_clock · ab550288

Authored by Jeremy Fitzhardinge on Jul 17, 2007

Implement xen_sched_clock, which returns the number of ns the current
vcpu has actually been in an unstolen state (i.e. running or blocked,
vs runnable-but-not-running, or offline) since boot.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Cc: john stultz <johnstul@us.ibm.com>

ab550288
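
The notion of "unstolen" time in the commit above can be expressed as a sum over per-state time counters. This is a sketch under assumptions: the enum and struct names are hypothetical, and it presumes the hypervisor exposes one accumulated nanosecond counter per run state.

```c
#include <assert.h>
#include <stdint.h>

/* The four states named in the commit text; identifiers are hypothetical. */
enum runstate { RS_RUNNING, RS_RUNNABLE, RS_BLOCKED, RS_OFFLINE, RS_COUNT };

/* Accumulated time, in ns since boot, that a vcpu spent in each state. */
struct runstate_times {
    uint64_t ns[RS_COUNT];
};

/* Unstolen time: the vcpu was either running or voluntarily blocked.
 * Runnable-but-not-running and offline time counts as stolen. */
static uint64_t unstolen_ns(const struct runstate_times *t)
{
    assert(t != 0);
    return t->ns[RS_RUNNING] + t->ns[RS_BLOCKED];
}
```

Excluding stolen time keeps the scheduler's view of how long a task ran from being inflated by periods when the hypervisor gave the physical cpu to someone else.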

xen: ignore RW mapping of RO pages in pagetable_init · 9a4029fd

Authored by Jeremy Fitzhardinge on Jul 17, 2007

When setting up the initial pagetable, which includes mappings of all
low physical memory, ignore a mapping which tries to set the RW bit on
an RO pte.  An RO pte indicates a page which is part of the current
pagetable, and so it cannot be allowed to become RW.

Once xen_pagetable_setup_done is called, set_pte reverts to its normal
behaviour.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Cc: ebiederm@xmission.com (Eric W. Biederman)

9a4029fd
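
One way to read the init-time behaviour described in the commit above is "never let an existing read-only entry gain the RW bit, but let every other bit through". The sketch below assumes that reading; `set_pte_init` is a hypothetical name, and only the x86 present (bit 0) and read/write (bit 1) flags are modelled.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT 0x1u   /* x86 pte bit 0 */
#define PTE_RW      0x2u   /* x86 pte bit 1 */

typedef uint64_t pte_t;

/* Init-time set_pte: if the existing entry is present and read-only,
 * strip RW from the new value so the page stays read-only. All other
 * bits of the new value are applied unchanged. */
static void set_pte_init(pte_t *ptep, pte_t newval)
{
    pte_t old;

    assert(ptep != 0);
    old = *ptep;
    if ((old & PTE_PRESENT) && !(old & PTE_RW))
        newval &= ~(pte_t)PTE_RW;
    *ptep = newval;
}
```

Entries that were empty or already writable are written as requested, so only the pages Xen mapped read-only (the pagetable pages themselves) are protected.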

xen: Complete pagetable pinning · f4f97b3e

Authored by Jeremy Fitzhardinge on Jul 17, 2007

Xen requires all active pagetables to be marked read-only.  When the
base of the pagetable is loaded into %cr3, the hypervisor validates
the entire pagetable and only allows the load to proceed if it all
checks out.

This is pretty slow, so to mitigate this cost Xen has a notion of
pinned pagetables.  Pinned pagetables are pagetables which are
considered to be active even if no processor's cr3 is pointing to them.
This means that they must remain read-only and all updates are validated
by the hypervisor.  This makes context switches much cheaper, because
the hypervisor doesn't need to revalidate the pagetable each time.

This also adds a new paravirt hook which is called during setup once
the zones and memory allocator have been initialized.  When the
init_mm pagetable is first built, the struct page array does not yet
exist, and so there's nowhere to put the init_mm pagetable's PG_pinned
flags.  Once the zones are initialized and the struct page array
exists, we can set the PG_pinned flags for those pages.

This patch also adds the Xen support for pte pages allocated out of
highmem (highpte) by implementing xen_kmap_atomic_pte.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Zach Amsden <zach@vmware.com>

f4f97b3e

xen: time implementation · 15c84731

Authored by Jeremy Fitzhardinge on Jul 17, 2007

Xen maintains a base clock which measures nanoseconds since system
boot.  This is provided to guests via a shared page which contains a
base time in ns, a tsc timestamp at that point and tsc frequency
parameters.  Guests can compute the current time by reading the tsc
and using it to extrapolate the current time from the basetime.  The
hypervisor makes sure that the frequency parameters are updated
regularly, particularly if the tsc changes rate or stops.

This is implemented as a clocksource, so the interface to the rest of
the kernel is a simple clocksource which simply returns the current
time directly in nanoseconds.

Xen also provides a simple timer mechanism, which allows a timeout to
be set in the future.  When that time arrives, a timer event is sent
to the guest.  There are two timer interfaces:
 - An old one which also delivers a stream of (unused) ticks at 100Hz,
   and on the same event, the actual timer events.  The 100Hz ticks
   cause a lot of spurious wakeups, but are basically harmless.
 - The new timer interface doesn't have the 100Hz ticks, and can also
   fail if the specified time is in the past.

This code presents the Xen timer as a clockevent driver, and uses the
new interface by preference.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>

15c84731
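
The extrapolation described in the commit above (current time = base time plus the scaled tsc delta) can be sketched as below. The struct layout and the representation of the "frequency parameters" as a 32-bit fixed-point multiplier plus a shift are assumptions of this sketch; `time_info`, `scale_delta` and `current_ns` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed contents of the shared time record. */
struct time_info {
    uint64_t base_ns;    /* base time in ns */
    uint64_t tsc_stamp;  /* tsc value sampled together with base_ns */
    uint32_t mul;        /* ns per tick as a fraction: mul / 2^32 */
    int8_t   shift;      /* power-of-two pre-scale applied to the delta */
};

/* Compute (delta * mul) >> 32 exactly, without a 128-bit intermediate,
 * by splitting delta into its high and low 32-bit halves. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int shift)
{
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;
    return (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);
}

/* Extrapolate the current time from a fresh tsc reading. */
static uint64_t current_ns(const struct time_info *t, uint64_t tsc)
{
    assert(t != 0 && t->shift > -64 && t->shift < 64);
    return t->base_ns + scale_delta(tsc - t->tsc_stamp, t->mul, t->shift);
}
```

For example, a multiplier of 0x80000000 with a shift of 0 means half a nanosecond per tick, so 2000 ticks after the timestamp the clock has advanced by 1000 ns.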

xen: event channels · e46cdb66

Authored by Jeremy Fitzhardinge on Jul 17, 2007

Xen implements interrupts in terms of event channels.  Each guest
domain gets 1024 event channels which can be used for a variety of
purposes, such as Xen timer events, inter-domain events,
inter-processor events (IPI) or for real hardware IRQs.

Within the kernel, we map the event channels to IRQs, and implement
the whole interrupt handling using a Xen irq_chip.

Rather than setting NR_IRQ to 1024 under PARAVIRT in order to
accommodate Xen, we create a dynamic mapping between event channels and
IRQs.  Ideally, Linux will eventually move towards dynamically
allocating per-irq structures, and we can use a 1:1 mapping between
event channels and irqs.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Eric W. Biederman <ebiederm@xmission.com>

e46cdb66
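
The dynamic mapping mentioned in the commit above can be pictured as a pair of lookup tables that hand out a free IRQ on first use. This is an illustrative sketch only: the table layout, the deliberately tiny `NR_IRQS`, and the function names `bind_evtchn_to_irq` and `unbind_irq` are assumptions here, and no locking is shown.

```c
#include <assert.h>

#define NR_EVENT_CHANNELS 1024
#define NR_IRQS           16    /* deliberately small for the sketch */

/* Both tables store "index + 1" so that zero-initialised means unbound. */
static short evtchn_to_irq[NR_EVENT_CHANNELS];
static short irq_to_evtchn[NR_IRQS];

/* Bind an event channel to a free IRQ, or return the existing binding.
 * Returns the IRQ number, or -1 if every IRQ is already in use. */
static int bind_evtchn_to_irq(int evtchn)
{
    assert(evtchn >= 0 && evtchn < NR_EVENT_CHANNELS);
    if (evtchn_to_irq[evtchn])
        return evtchn_to_irq[evtchn] - 1;
    for (int irq = 0; irq < NR_IRQS; irq++) {
        if (!irq_to_evtchn[irq]) {
            irq_to_evtchn[irq] = (short)(evtchn + 1);
            evtchn_to_irq[evtchn] = (short)(irq + 1);
            return irq;
        }
    }
    return -1;
}

/* Release an IRQ so it can be reused by another event channel. */
static void unbind_irq(int irq)
{
    assert(irq >= 0 && irq < NR_IRQS);
    if (irq_to_evtchn[irq]) {
        evtchn_to_irq[irq_to_evtchn[irq] - 1] = 0;
        irq_to_evtchn[irq] = 0;
    }
}
```

Only channels that are actually bound consume an IRQ, which is what lets the IRQ table stay much smaller than the 1024 possible event channels.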

xen: virtual mmu · 3b827c1b

Authored by Jeremy Fitzhardinge on Jul 17, 2007

Xen pagetable handling, including the machinery to implement direct
pagetables.

Xen presents the real CPU's pagetables directly to guests, with no
added shadowing or other layer of abstraction.  Naturally this means
the hypervisor must maintain close control over what the guest can put
into the pagetable.

When the guest modifies the pte/pmd/pgd, it must convert its
domain-specific notion of a "physical" pfn into a global machine frame
number (mfn) before inserting the entry into the pagetable.  Xen will
check to make sure the domain is allowed to create a mapping of the
given mfn.

Xen also requires that all mappings the guest has of its own active
pagetable are read-only.  This is relatively easy to implement in
Linux because all pagetables share the same pte pages for kernel
mappings, so updating the pte in one pagetable will implicitly update
the mapping in all pagetables.

Normally a pagetable becomes active when you point to it with cr3 (or
the Xen equivalent), but when you do so, Xen must check the whole
pagetable for correctness, which is clearly a performance problem.

Xen solves this with pinning which keeps a pagetable effectively
active even if it's currently unused, which means that all the normal
update rules are enforced.  This means that it need not revalidate the
pagetable when loading cr3.

This patch has a first-cut implementation of pinning, but it is more
fully implemented in a later patch.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>

3b827c1b
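
The pfn-to-mfn conversion described in the commit above can be sketched with a small translation table. Everything concrete here is made up for illustration: the four-entry `p2m` table and its values, and the helper names `pfn_to_mfn` and `make_pte`. A 4 KiB page size (12-bit shift) is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define NR_PAGES   4

/* Hypothetical physical-to-machine table for a 4-page domain. The guest's
 * contiguous "physical" frames are scattered across machine memory. */
static const uint64_t p2m[NR_PAGES] = { 0x8a31, 0x1207, 0x8a30, 0x0044 };

static uint64_t pfn_to_mfn(uint64_t pfn)
{
    assert(pfn < NR_PAGES);
    return p2m[pfn];
}

/* Build a pte from a guest pfn: translate the frame number to an mfn and
 * keep only the low flag bits of the flags argument. */
static uint64_t make_pte(uint64_t pfn, uint64_t flags)
{
    return (pfn_to_mfn(pfn) << PAGE_SHIFT) |
           (flags & ((1u << PAGE_SHIFT) - 1));
}
```

The hypervisor would still check that the resulting mfn belongs to the domain; the translation only makes the entry meaningful to the real MMU.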

xen: Core Xen implementation · 5ead97c8

Authored by Jeremy Fitzhardinge on Jul 17, 2007

This patch is a rollup of all the core pieces of the Xen
implementation, including:
 - booting and setup
 - pagetable setup
 - privileged instructions
 - segmentation
 - interrupt flags
 - upcalls
 - multicall batching

BOOTING AND SETUP

The vmlinux image is decorated with ELF notes which tell the Xen
domain builder what the kernel's requirements are; the domain builder
then constructs the address space accordingly and starts the kernel.

Xen has its own entrypoint for the kernel (contained in an ELF note).
The ELF notes are set up by xen-head.S, which is included into head.S.
In principle it could be linked separately, but it seems to provoke
lots of binutils bugs.

Because the domain builder starts the kernel in a fairly sane state
(32-bit protected mode, paging enabled, flat segments set up), there's
not a lot of setup needed before starting the kernel proper.  The main
steps are:
  1. Install the Xen paravirt_ops, which is simply a matter of a
     structure assignment.
  2. Set init_mm to use the Xen-supplied pagetables (analogous to the
     head.S generated pagetables in a native boot).
  3. Reserve address space for Xen, since it takes a chunk at the top
     of the address space for its own use.
  4. Call start_kernel()

PAGETABLE SETUP

Once we hit the main kernel boot sequence, it will end up calling back
via paravirt_ops to set up various pieces of Xen specific state.  One
of the critical things which requires a bit of extra care is the
construction of the initial init_mm pagetable.  Because Xen places
tight constraints on pagetables (an active pagetable must always be
valid, and must always be mapped read-only to the guest domain), we
need to be careful when constructing the new pagetable to keep these
constraints in mind.  It turns out that the easiest way to do this is
use the initial Xen-provided pagetable as a template, and then just
insert new mappings for memory where a mapping doesn't already exist.

This means that during pagetable setup, it uses a special version of
xen_set_pte which ignores any attempt to remap a read-only page as
read-write (since Xen will map its own initial pagetable as RO), but
lets other changes to the ptes happen, so that things like NX are set
properly.

PRIVILEGED INSTRUCTIONS AND SEGMENTATION

When the kernel runs under Xen, it runs in ring 1 rather than ring 0.
This means that it is more privileged than user-mode in ring 3, but it
still can't run privileged instructions directly.  Non-performance
critical instructions are dealt with by taking a privilege exception
and trapping into the hypervisor and emulating the instruction, but
more performance-critical instructions have their own specific
paravirt_ops.  In many cases we can avoid having to do any hypercalls
for these instructions, or the Xen implementation is quite different
from the normal native version.

The privileged instructions fall into the broad classes of:
  Segmentation: setting up the GDT and the GDT entries, LDT,
     TLS and so on.  Xen doesn't allow the GDT to be directly
     modified; all GDT updates are done via hypercalls where the new
     entries can be validated.  This is important because Xen uses
     segment limits to prevent the guest kernel from damaging the
     hypervisor itself.
  Traps and exceptions: Xen uses a special format for trap entrypoints,
     so when the kernel wants to set an IDT entry, it needs to be
     converted to the form Xen expects.  Xen sets int 0x80 up specially
     so that the trap goes straight from userspace into the guest kernel
     without going via the hypervisor.  sysenter isn't supported.
  Kernel stack: The esp0 entry is extracted from the tss and provided to
     Xen.
  TLB operations: the various TLB calls are mapped into corresponding
     Xen hypercalls.
  Control registers: all the control registers are privileged.  The most
     important is cr3, which points to the base of the current pagetable,
     and we handle it specially.

Another instruction we treat specially is CPUID, even though it's not
privileged.  We want to control what CPU features are visible to the
rest of the kernel, and so CPUID ends up going into a paravirt_op.
Xen implements this mainly to disable the ACPI and APIC subsystems.

INTERRUPT FLAGS

Xen maintains its own separate flag for masking events, which is
contained within the per-cpu vcpu_info structure.  Because the guest
kernel runs in ring 1 and not 0, the IF flag in EFLAGS is completely
ignored (and must be, because even if a guest domain disables
interrupts for itself, it can't disable them overall).

(A note on terminology: "events" and interrupts are effectively
synonymous.  However, rather than using an "enable flag", Xen uses a
"mask flag", which blocks event delivery when it is non-zero.)

There are paravirt_ops for each of cli/sti/save_fl/restore_fl, which
are implemented to manage the Xen event mask state.  The only thing
worth noting is that when events are unmasked, we need to explicitly
see if there's a pending event and call into the hypervisor to make
sure it gets delivered.
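
As an illustration of the mask-flag handling just described, the sketch below models the two relevant per-vcpu fields and the check for a pending event on unmask. The struct and function names are assumptions, `force_evtchn_callback` is a stand-in for the call into the hypervisor, and the barriers a real implementation needs are omitted.

```c
#include <assert.h>
#include <stdint.h>

#define X86_EFLAGS_IF 0x200ul   /* interrupt-enable flag, EFLAGS bit 9 */

/* The two fields of interest in the per-vcpu shared structure. */
struct vcpu_info_sketch {
    uint8_t evtchn_upcall_pending;  /* an event is waiting for delivery */
    uint8_t evtchn_upcall_mask;     /* non-zero blocks event delivery   */
};

static int forced_callbacks;

/* Stand-in for the hypercall that makes Xen deliver pending events. */
static void force_evtchn_callback(struct vcpu_info_sketch *v)
{
    forced_callbacks++;
    v->evtchn_upcall_pending = 0;
}

static void irq_disable(struct vcpu_info_sketch *v)
{
    v->evtchn_upcall_mask = 1;
}

static void irq_enable(struct vcpu_info_sketch *v)
{
    v->evtchn_upcall_mask = 0;
    /* An event may have arrived while masked: deliver it explicitly. */
    if (v->evtchn_upcall_pending)
        force_evtchn_callback(v);
}

/* Report the mask state in the shape the kernel expects from EFLAGS. */
static unsigned long save_fl(const struct vcpu_info_sketch *v)
{
    assert(v != 0);
    return v->evtchn_upcall_mask ? 0 : X86_EFLAGS_IF;
}
```

Note the inverted sense: the kernel thinks in terms of an enable flag, while the shared structure holds a mask flag, so `save_fl` has to translate between the two.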

UPCALLS

Xen needs a couple of upcall (or callback) functions to be implemented
by each guest.  One is the event upcall, which is how events
(interrupts, effectively) are delivered to the guests.  The other is
the failsafe callback, which is used to report errors in either
reloading a segment register, or caused by iret.  These are
implemented in i386/kernel/entry.S so they can jump into the normal
iret_exc path when necessary.

MULTICALL BATCHING

Xen provides a multicall mechanism, which allows multiple hypercalls
to be issued at once in order to mitigate the cost of trapping into
the hypervisor.  This is particularly useful for context switches,
since the 4-5 hypercalls they would normally need (reload cr3, update
TLS, maybe update LDT) can be reduced to one.  This patch implements a
generic batching mechanism for hypercalls, which gets used in many
places in the Xen code.
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: Ian Pratt <ian.pratt@xensource.com>
Cc: Christian Limpach <Christian.Limpach@cl.cam.ac.uk>
Cc: Adrian Bunk <bunk@stusta.de>

5ead97c8
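
The multicall batching described in the commit above amounts to queueing operations in a buffer and paying for one trap per batch. The sketch below is a simplified model, not the kernel's interface: `mc_queue`, `mc_flush`, the entry layout and the batch size are all hypothetical, and the "hypercall" is simulated with a counter.

```c
#include <assert.h>
#include <stddef.h>

#define MC_BATCH 8   /* arbitrary batch capacity for the sketch */

struct mc_entry {
    int op;
    unsigned long arg;
};

static struct mc_entry mc_buf[MC_BATCH];
static size_t mc_len;       /* operations currently queued */
static int traps;           /* simulated trips into the hypervisor */
static size_t last_batch;   /* size of the most recently issued batch */

/* Stand-in for the multicall hypercall: one trap for the whole batch. */
static void mc_flush(void)
{
    if (mc_len == 0)
        return;
    traps++;
    last_batch = mc_len;
    mc_len = 0;
}

/* Queue one operation, flushing first if the buffer is already full. */
static void mc_queue(int op, unsigned long arg)
{
    if (mc_len == MC_BATCH)
        mc_flush();
    assert(mc_len < MC_BATCH);
    mc_buf[mc_len].op = op;
    mc_buf[mc_len].arg = arg;
    mc_len++;
}
```

A context switch that queues a cr3 reload, a TLS update and an LDT update would then cost a single trap when the batch is flushed, instead of one trap per operation.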