提交 · bd3d1ec3d26b61120bb4f60b18ee99aa81839e6b · openanolis / cloud-kernel

18 3月, 2011 19 次提交

KVM: SVM: check for progress after IRET interception · bd3d1ec3

由 Avi Kivity 提交于 2月 03, 2011

When we enable an NMI window, we ask for an IRET intercept, since
the IRET re-enables NMIs. However, the IRET intercept happens before
the instruction executes, while the NMI window architecturally opens
afterwards.

To compensate for this mismatch, we only open the NMI window in the
following exit, assuming that the IRET has by then executed; however,
this assumption is not always correct; we may exit due to a host interrupt
or page fault, without having executed the instruction.

Fix by checking for forward progress by recording and comparing the IRET's
rip. This is somewhat of a hack, since an unchaging rip does not mean that
no forward progress has been made, but is the simplest fix for now.
Signed-off-by: NAvi Kivity <avi@redhat.com>

bd3d1ec3

KVM: Fix race between nmi injection and enabling nmi window · f8636849

由 Avi Kivity 提交于 2月 03, 2011

The interrupt injection logic looks something like

  if an nmi is pending, and nmi injection allowed
    inject nmi
  if an nmi is pending
    request exit on nmi window

the problem is that "nmi is pending" can be set asynchronously by
the PIT; if it happens to fire between the two if statements, we
will request an nmi window even though nmi injection is allowed.  On
SVM, this has disasterous results, since it causes eflags.TF to be
set in random guest code.

The fix is simple; make nmi_pending synchronous using the standard
vcpu->requests mechanism; this ensures the code above is completely
synchronous wrt nmi_pending.
Signed-off-by: NAvi Kivity <avi@redhat.com>

f8636849

KVM: Drop ad-hoc vendor specific instruction restriction · 4005996e

由 Avi Kivity 提交于 2月 01, 2011

Use the new support in the emulator, and drop the ad-hoc code in x86.c.
Signed-off-by: NAvi Kivity <avi@redhat.com>
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>

4005996e

KVM: x86 emulator: vendor specific instructions · d867162c

由 Avi Kivity 提交于 2月 01, 2011

Mark some instructions as vendor specific, and allow the caller to request
emulation only of vendor specific instructions.  This is useful in some
circumstances (responding to a #UD fault).
Signed-off-by: NAvi Kivity <avi@redhat.com>
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>

d867162c

KVM: Drop bogus x86_decode_insn() error check · 3e909439

由 Avi Kivity 提交于 2月 01, 2011

x86_decode_insn() doesn't return X86EMUL_* values, so the check
for X86EMUL_PROPOGATE_FAULT will always fail.  There is a proper
check later on, so there is no need for a replacement for this
code.
Signed-off-by: NAvi Kivity <avi@redhat.com>
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>

3e909439

KVM: x86: Drop obsolete warning about INIT on runnable VCPU · 0bb88659

由 Jan Kiszka 提交于 2月 01, 2011

This warning was once used for debugging QEMU user space. Though
uncommon, it is actually possible to send an INIT request to a running
VCPU. So better drop this warning before someone misuses it to flood
kernel logs this way.
Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>

0bb88659

KVM: x86: release kvmclock page on reset · 12f9a48f

由 Glauber Costa 提交于 2月 01, 2011

When a vcpu is reset, kvmclock page keeps being written to this days.
This is wrong and inconsistent: a cpu reset should take it to its
initial state.
Signed-off-by: NGlauber Costa <glommer@redhat.com>
CC: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>

12f9a48f

KVM: x86: handle guest access to BBL_CR_CTL3 MSR · 91c9c3ed

由 john cooper 提交于 1月 21, 2011

A correction to Intel cpu model CPUID data (patch queued)
caused winxp to BSOD when booted with a Penryn model.
This was traced to the CPUID "model" field correction from
6 -> 23 (as is proper for a Penryn class of cpu).  Only in
this case does the problem surface.

The cause for this failure is winxp accessing the BBL_CR_CTL3
MSR which is unsupported by current kvm, appears to be a
legacy MSR not fully characterized yet existing in current
silicon, and is apparently carried forward in MSR space to
accommodate vintage code as here.  It is not yet conclusive
whether this MSR implements any of its legacy functionality
or is just an ornamental dud for compatibility.  While I
found no silicon version specific documentation link to
this MSR, a general description exists in Intel's developer's
reference which agrees with the functional behavior of
other bootloader/kernel code I've examined accessing
BBL_CR_CTL3.  Regrettably winxp appears to be setting bit #19
called out as "reserved" in the above document.

So to minimally accommodate this MSR, kvm msr get will provide
the equivalent mock data and kvm msr write will simply toss the
guest passed data without interpretation.  While this treatment
of BBL_CR_CTL3 addresses the immediate problem, the approach may
be modified pending clarification from Intel.
Signed-off-by: Njohn cooper <john.cooper@redhat.com>
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>

91c9c3ed

KVM: Add "exiting guest mode" state · 6b7e2d09

由 Xiao Guangrong 提交于 1月 12, 2011

Currently we keep track of only two states: guest mode and host
mode.  This patch adds an "exiting guest mode" state that tells
us that an IPI will happen soon, so unless we need to wait for the
IPI, we can avoid it completely.

Also
1: No need atomically to read/write ->mode in vcpu's thread

2: reorganize struct kvm_vcpu to make ->mode and ->requests
   in the same cache line explicitly
Signed-off-by: NXiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: NAvi Kivity <avi@redhat.com>

6b7e2d09

KVM: x86: Remove user space triggerable MCE error message · 9ca52318

由 Jan Kiszka 提交于 1月 15, 2011

This case is a pure user space error we do not need to record. Moreover,
it can be misused to flood the kernel log. Remove it.
Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>

9ca52318

KVM: fix rcu usage warning in kvm_arch_vcpu_ioctl_set_sregs() · 63f42e02

由 Xiao Guangrong 提交于 1月 12, 2011

Fix:

[ 1001.499596] ===================================================
[ 1001.499599] [ INFO: suspicious rcu_dereference_check() usage. ]
[ 1001.499601] ---------------------------------------------------
[ 1001.499604] include/linux/kvm_host.h:301 invoked rcu_dereference_check() without protection!
	......
[ 1001.499636] Pid: 6035, comm: qemu-system-x86 Not tainted 2.6.37-rc6+ #62
[ 1001.499638] Call Trace:
[ 1001.499644]  [] lockdep_rcu_dereference+0x9d/0xa5
[ 1001.499653]  [] gfn_to_memslot+0x8d/0xc8 [kvm]
[ 1001.499661]  [] gfn_to_hva+0x16/0x3f [kvm]
[ 1001.499669]  [] kvm_read_guest_page+0x1e/0x5e [kvm]
[ 1001.499681]  [] kvm_read_guest_page_mmu+0x53/0x5e [kvm]
[ 1001.499699]  [] load_pdptrs+0x3f/0x9c [kvm]
[ 1001.499705]  [] ? vmx_set_cr0+0x507/0x517 [kvm_intel]
[ 1001.499717]  [] kvm_arch_vcpu_ioctl_set_sregs+0x1f3/0x3c0 [kvm]
[ 1001.499727]  [] kvm_vcpu_ioctl+0x6a5/0xbc5 [kvm]
Signed-off-by: NXiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>

63f42e02

KVM: VMX: Avoid atomic operation in vmx_vcpu_run · 40712fae

由 Avi Kivity 提交于 1月 06, 2011

Instead of exchanging the guest and host rcx, have separate storage
for each.  This allows us to avoid using the xchg instruction, which
is is a little slower than normal operations.
Signed-off-by: NAvi Kivity <avi@redhat.com>
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>

40712fae

KVM: VMX: Simplify saving guest rcx in vmx_vcpu_run · 1c696d0e

由 Avi Kivity 提交于 1月 06, 2011

Change

  push top-of-stack
  pop guest-rcx
  pop dummy

to

  pop guest-rcx

which is the same thing, only simpler.
Signed-off-by: NAvi Kivity <avi@redhat.com>
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>

1c696d0e

KVM: VMX: increase ple_gap default to 128 · 00c25bce

由 Rik van Riel 提交于 1月 04, 2011

On some CPUs, a ple_gap of 41 is simply insufficient to ever trigger
PLE exits, even with the minimalistic PLE test from kvm-unit-tests.

http://git.kernel.org/?p=virt/kvm/kvm-unit-tests.git;a=commitdiff;h=eda71b28fa122203e316483b35f37aaacd42f545

For example, the Xeon X5670 CPU needs a ple_gap of at least 48 in
order to get pause loop exits:

# modprobe kvm_intel ple_gap=47
# taskset 1 /usr/local/bin/qemu-system-x86_64 \
  -device testdev,chardev=log -chardev stdio,id=log \
  -kernel x86/vmexit.flat -append ple-round-robin -smp 2
VNC server running on `::1:5900'
enabling apic
enabling apic
ple-round-robin 58298446
# rmmod kvm_intel
# modprobe kvm_intel ple_gap=48
# taskset 1 /usr/local/bin/qemu-system-x86_64 \
   -device testdev,chardev=log -chardev stdio,id=log \
   -kernel x86/vmexit.flat -append ple-round-robin -smp 2
VNC server running on `::1:5900'
enabling apic
enabling apic
ple-round-robin 36616

Increase the ple_gap to 128 to be on the safe side.
Signed-off-by: NRik van Riel <riel@redhat.com>
Acked-by: NZhai, Edwin <edwin.zhai@intel.com>
Signed-off-by: NAvi Kivity <avi@redhat.com>

00c25bce

KVM: SVM: Add support for perf-kvm · 3781c01c

由 Joerg Roedel 提交于 1月 14, 2011

This patch adds the necessary code to run perf-kvm on AMD
machines.
Signed-off-by: NJoerg Roedel <joerg.roedel@amd.com>
Signed-off-by: NAvi Kivity <avi@redhat.com>

3781c01c

KVM: VMX: Avoid leaking fake realmode state to userspace · a9179499

由 Avi Kivity 提交于 1月 03, 2011

When emulating real mode, we fake some state:

 - tr.base points to a fake vm86 tss
 - segment registers are made to conform to vm86 restrictions

change vmx_get_segment() not to expose this fake state to userspace;
instead, return the original state.
Signed-off-by: NAvi Kivity <avi@redhat.com>
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>

a9179499

KVM: VMX: Save and restore tr selector across mode switches · d0ba64f9

由 Avi Kivity 提交于 1月 03, 2011

When emulating real mode we play with tr hidden state, but leave
tr.selector alone.  That works well, except for save/restore, since
loading TR writes it to the hidden state in vmx->rmode.

Fix by also saving and restoring the tr selector; this makes things
more consistent and allows migration to work during the early
boot stages of Windows XP.
Signed-off-by: NAvi Kivity <avi@redhat.com>
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>

d0ba64f9

KVM guest: Fix section mismatch derived from kvm_guest_cpu_online() · 775077a0

由 Sedat Dilek 提交于 1月 03, 2011

WARNING: arch/x86/built-in.o(.text+0x1bb74): Section mismatch in reference from the function kvm_guest_cpu_online() to the function .cpuinit.text:kvm_guest_cpu_init()
The function kvm_guest_cpu_online() references
the function __cpuinit kvm_guest_cpu_init().
This is often because kvm_guest_cpu_online lacks a __cpuinit
annotation or the annotation of kvm_guest_cpu_init is wrong.

This patch fixes the warning.

Tested with linux-next (next-20101231)
Signed-off-by: NSedat Dilek <sedat.dilek@gmail.com>
Acked-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>

775077a0

KVM: MMU: Don't flush shadow when enabling dirty tracking · 8234b22e

由 Avi Kivity 提交于 12月 27, 2010

Instead, drop large mappings, which were the reason we dropped shadow.
Signed-off-by: NAvi Kivity <avi@redhat.com>
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>

8234b22e

16 3月, 2011 2 次提交

x86, AMD: Set ARAT feature on AMD processors · b87cf80a

由 Boris Ostrovsky 提交于 3月 15, 2011

Support for Always Running APIC timer (ARAT) was introduced in
commit db954b58. This feature
allows us to avoid switching timers from LAPIC to something else
(e.g. HPET) and go into timer broadcasts when entering deep
C-states.

AMD processors don't provide a CPUID bit for that feature but
they also keep APIC timers running in deep C-states (except for
cases when the processor is affected by erratum 400). Therefore
we should set ARAT feature bit on AMD CPUs.
Tested-by: NBorislav Petkov <borislav.petkov@amd.com>
Acked-by: NAndreas Herrmann <andreas.herrmann3@amd.com>
Acked-by: NMark Langsdorf <mark.langsdorf@amd.com>
Acked-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@amd.com>
LKML-Reference: <1300205624-4813-1-git-send-email-ostr@amd64.org>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

b87cf80a

x86, quirk: Fix SB600 revision check · 1d3e09a3

由 Andreas Herrmann 提交于 3月 15, 2011

Commit 7f74f8f2
(x86 quirk: Fix polarity for IRQ0 pin2 override on SB800
systems) introduced a regression. It removed some SB600 specific
code to determine the revision ID without adapting a
corresponding revision ID check for SB600.

See this mail thread:

  http://marc.info/?l=linux-kernel&m=129980296006380&w=2

This patch adapts the corresponding check to cover all SB600
revisions.
Tested-by: NWang Lei <f3d27b@gmail.com>
Signed-off-by: NAndreas Herrmann <andreas.herrmann3@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@kernel.org # 38.x, 37.x, 32.x
LKML-Reference: <20110315143137.GD29499@alberich.amd.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

1d3e09a3

15 3月, 2011 6 次提交

x86: stop_machine_text_poke() should issue sync_core() · 0e00f7ae

由 Mathieu Desnoyers 提交于 3月 03, 2011

Intel Archiecture Software Developer's Manual section 7.1.3 specifies that a
core serializing instruction such as "cpuid" should be executed on _each_ core
before the new instruction is made visible.

Failure to do so can lead to unspecified behavior (Intel XMC erratas include
General Protection Fault in the list), so we should avoid this at all cost.

This problem can affect modified code executed by interrupt handlers after
interrupt are re-enabled at the end of stop_machine, because no core serializing
instruction is executed between the code modification and the moment interrupts
are reenabled.

Because stop_machine_text_poke performs the text modification from the first CPU
decrementing stop_machine_first, modified code executed in thread context is
also affected by this problem. To explain why, we have to split the CPUs in two
categories: the CPU that initiates the text modification (calls text_poke_smp)
and all the others. The scheduler, executed on all other CPUs after
stop_machine, issues an "iret" core serializing instruction, and therefore
handles core serialization for all these CPUs. However, the text modification
initiator can continue its execution on the same thread and access the modified
text without any scheduler call. Given that the CPU that initiates the code
modification is not guaranteed to be the one actually performing the code
modification, it falls into the XMC errata.

Q: Isn't this executed from an IPI handler, which will return with IRET (a
   serializing instruction) anyway?
A: No, now stop_machine uses per-cpu workqueue, so that handler will be
   executed from worker threads. There is no iret anymore.
Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
LKML-Reference: <20110303160137.GB1590@Krystal>
Reviewed-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: <stable@kernel.org>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>

0e00f7ae

x86, tlb, UV: Do small micro-optimization for native_flush_tlb_others() · 25542c64

由 Xiao Guangrong 提交于 3月 15, 2011

native_flush_tlb_others() is called from:

 flush_tlb_current_task()
 flush_tlb_mm()
 flush_tlb_page()

All these functions disable preemption explicitly, so we can use
smp_processor_id() instead of get_cpu() and put_cpu().
Signed-off-by: NXiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Cc: Cliff Wickman <cpw@sgi.com>
LKML-Reference: <4D7EC791.4040003@cn.fujitsu.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

25542c64

x86: Add new syscalls for x86_64 · 6aae5f2b

由 Aneesh Kumar K.V 提交于 1月 29, 2011

This patch add new syscalls to x86_64
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

6aae5f2b

x86: Add new syscalls for x86_32 · 7dadb755

由 Aneesh Kumar K.V 提交于 1月 29, 2011

This patch adds new syscalls to x86_32
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

7dadb755

PM: Drop pm_flags that is not necessary · 6831c6ed

由 Rafael J. Wysocki 提交于 2月 15, 2011

The variable pm_flags is used to prevent APM from being enabled
along with ACPI, which would lead to problems.  However, acpi_init()
is always called before apm_init() and after acpi_init() has
returned, it is known whether or not ACPI will be used.  Namely, if
acpi_disabled is not set after acpi_init() has returned, this means
that ACPI is enabled.  Thus, it is sufficient to check acpi_disabled
in apm_init() to prevent APM from being enabled in parallel with
ACPI.
Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
Acked-by: NLen Brown <len.brown@intel.com>

6831c6ed

PM: Make CONFIG_PM depend on (CONFIG_PM_SLEEP || CONFIG_PM_RUNTIME) · 1eb208ae

由 Rafael J. Wysocki 提交于 2月 11, 2011

From the users' point of view CONFIG_PM is really only used for
making it possible to set CONFIG_SUSPEND, CONFIG_HIBERNATION,
CONFIG_PM_RUNTIME and (surprisingly enough) CONFIG_XEN_SAVE_RESTORE
(CONFIG_PM_OPP also depends on CONFIG_PM, but quite artificially).
However, both CONFIG_SUSPEND and CONFIG_HIBERNATION require platform
support (independent of CONFIG_PM) and it is not quite obvious that
CONFIG_PM has to be set for CONFIG_XEN_SAVE_RESTORE to be available.
Thus, from the users' point of view, it would be more logical to
automatically select CONFIG_PM if any of the above options depending
on it are set.

Make CONFIG_PM depend on (CONFIG_PM_SLEEP || CONFIG_PM_RUNTIME),
which will cause it to be selected when any of CONFIG_SUSPEND,
CONFIG_HIBERNATION, CONFIG_PM_RUNTIME, CONFIG_XEN_SAVE_RESTORE is
set and will clarify its meaning.
Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>

1eb208ae

14 3月, 2011 9 次提交

xen/m2p: Check whether the MFN has IDENTITY_FRAME bit set.. · 706cc9d2

由 Stefano Stabellini 提交于 2月 02, 2011

If there is no proper PFN value in the M2P for the MFN
(so we get 0xFFFFF.. or 0x55555, or 0x0), we should
consult the M2P override to see if there is an entry for this.
[Note: we also consult the M2P override if the MFN
is past our machine_to_phys size].

We consult the P2M with the PFN. In case the returned
MFN is one of the special values: 0xFFF.., 0x5555
(which signify that the MFN can be either "missing" or it
belongs to DOMID_IO) or the p2m(m2p(mfn)) != mfn, we check
the M2P override. If we fail the M2P override check, we reset
the PFN value to INVALID_P2M_ENTRY.

Next we try to find the MFN in the P2M using the MFN
value (not the PFN value) and if found, we know
that this MFN is an identity value and return it as so.

Otherwise we have exhausted all the posibilities and we
return the PFN, which at this stage can either be a real
PFN value found in the machine_to_phys.. array, or
INVALID_P2M_ENTRY value.

[v1: Added Review-by tag]
Reviewed-by: NIan Campbell <ian.campbell@citrix.com>
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>

706cc9d2

xen/m2p: No need to catch exceptions when we know that there is no RAM · 146c4e51

由 Konrad Rzeszutek Wilk 提交于 1月 14, 2011

.. beyound what we think is the end of memory. However there might
be more System RAM - but assigned to a guest. Hence jump to the
M2P override check and consult.

[v1: Added Review-by tag]
Reviewed-by: NIan Campbell <ian.campbell@citrix.com>
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>

146c4e51

xen/debug: WARN_ON when identity PFN has no _PAGE_IOMAP flag set. · fc25151d

由 Konrad Rzeszutek Wilk 提交于 12月 23, 2010

Only enabled if XEN_DEBUG is enabled. We print a warning
when:

 pfn_to_mfn(pfn) == pfn, but no VM_IO (_PAGE_IOMAP) flag set
	(and pfn is an identity mapped pfn)
 pfn_to_mfn(pfn) != pfn, and VM_IO flag is set.
	(ditto, pfn is an identity mapped pfn)

[v2: Make it dependent on CONFIG_XEN_DEBUG instead of ..DEBUG_FS]
[v3: Fix compiler warning]
Reviewed-by: NIan Campbell <ian.campbell@citrix.com>
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>

fc25151d

xen/debugfs: Add 'p2m' file for printing out the P2M layout. · 2222e71b

由 Konrad Rzeszutek Wilk 提交于 12月 22, 2010

We walk over the whole P2M tree and construct a simplified view of
which PFN regions belong to what level and what type they are.

Only enabled if CONFIG_XEN_DEBUG_FS is set.

[v2: UNKN->UNKNOWN, use uninitialized_var]
[v3: Rebased on top of mmu->p2m code split]
[v4: Fixed the else if]
Reviewed-by: NIan Campbell <Ian.Campbell@eu.citrix.com>
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>

2222e71b

xen/setup: Set identity mapping for non-RAM E820 and E820 gaps. · 68df0da7

由 Konrad Rzeszutek Wilk 提交于 2月 01, 2011

We walk the E820 region and start at 0 (for PV guests we start
at ISA_END_ADDRESS) and skip any E820 RAM regions. For all other
regions and as well the gaps we set them to be identity mappings.

The reasons we do not want to set the identity mapping from 0->
ISA_END_ADDRESS when running as PV is b/c that the kernel would
try to read DMI information and fail (no permissions to read that).
There is a lot of gnarly code to deal with that weird region so
we won't try to do a cleanup in this patch.

This code ends up calling 'set_phys_to_identity' with the start
and end PFN of the the E820 that are non-RAM or have gaps.
On 99% of machines that means one big region right underneath the
4GB mark. Usually starts at 0xc0000 (or 0x80000) and goes to
0x100000.

[v2: Fix for E820 crossing 1MB region and clamp the start]
[v3: Squshed in code that does this over ranges]
[v4: Moved the comment to the correct spot]
[v5: Use the "raw" E820 from the hypervisor]
[v6: Added Review-by tag]
Reviewed-by: NIan Campbell <ian.campbell@citrix.com>
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>

68df0da7

xen/mmu: WARN_ON when racing to swap middle leaf. · c7617798

由 Konrad Rzeszutek Wilk 提交于 1月 18, 2011

The initial bootup code uses set_phys_to_machine quite a lot, and after
bootup it would be used by the balloon driver. The balloon driver does have
mutex lock so this should not be necessary - but just in case, add
a WARN_ON if we do hit this scenario. If we do fail this, it is OK
to continue as there is a backup mechanism (VM_IO) that can bypass
the P2M and still set the _PAGE_IOMAP flags.

[v2: Change from WARN to BUG_ON]
[v3: Rebased on top of xen->p2m code split]
[v4: Change from BUG_ON to WARN]
Reviewed-by: NIan Campbell <Ian.Campbell@eu.citrix.com>
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>

c7617798

xen/mmu: Set _PAGE_IOMAP if PFN is an identity PFN. · fb38923e

由 Konrad Rzeszutek Wilk 提交于 1月 05, 2011

If we find that the PFN is within the P2M as an identity
PFN make sure to tack on the _PAGE_IOMAP flag.
Reviewed-by: NIan Campbell <ian.campbell@citrix.com>
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>

fb38923e

xen/mmu: Add the notion of identity (1-1) mapping. · f4cec35b

由 Konrad Rzeszutek Wilk 提交于 1月 18, 2011

Our P2M tree structure is a three-level. On the leaf nodes
we set the Machine Frame Number (MFN) of the PFN. What this means
is that when one does: pfn_to_mfn(pfn), which is used when creating
PTE entries, you get the real MFN of the hardware. When Xen sets
up a guest it initially populates a array which has descending
(or ascending) MFN values, as so:

 idx: 0,  1,       2
 [0x290F, 0x290E, 0x290D, ..]

so pfn_to_mfn(2)==0x290D. If you start, restart many guests that list
starts looking quite random.

We graft this structure on our P2M tree structure and stick in
those MFN in the leafs. But for all other leaf entries, or for the top
root, or middle one, for which there is a void entry, we assume it is
"missing". So
 pfn_to_mfn(0xc0000)=INVALID_P2M_ENTRY.

We add the possibility of setting 1-1 mappings on certain regions, so
that:
 pfn_to_mfn(0xc0000)=0xc0000

The benefit of this is, that we can assume for non-RAM regions (think
PCI BARs, or ACPI spaces), we can create mappings easily b/c we
get the PFN value to match the MFN.

For this to work efficiently we introduce one new page p2m_identity and
allocate (via reserved_brk) any other pages we need to cover the sides
(1GB or 4MB boundary violations). All entries in p2m_identity are set to
INVALID_P2M_ENTRY type (Xen toolstack only recognizes that and MFNs,
no other fancy value).

On lookup we spot that the entry points to p2m_identity and return the identity
value instead of dereferencing and returning INVALID_P2M_ENTRY. If the entry
points to an allocated page, we just proceed as before and return the PFN.
If the PFN has IDENTITY_FRAME_BIT set we unmask that in appropriate functions
(pfn_to_mfn).

The reason for having the IDENTITY_FRAME_BIT instead of just returning the
PFN is that we could find ourselves where pfn_to_mfn(pfn)==pfn for a
non-identity pfn. To protect ourselves against we elect to set (and get) the
IDENTITY_FRAME_BIT on all identity mapped PFNs.

This simplistic diagram is used to explain the more subtle piece of code.
There is also a digram of the P2M at the end that can help.
Imagine your E820 looking as so:

                   1GB                                           2GB
/-------------------+---------\/----\         /----------\    /---+-----\
| System RAM        | Sys RAM ||ACPI|         | reserved |    | Sys RAM |
\-------------------+---------/\----/         \----------/    \---+-----/
                              ^- 1029MB                       ^- 2001MB

[1029MB = 263424 (0x40500), 2001MB = 512256 (0x7D100), 2048MB = 524288 (0x80000)]

And dom0_mem=max:3GB,1GB is passed in to the guest, meaning memory past 1GB
is actually not present (would have to kick the balloon driver to put it in).

When we are told to set the PFNs for identity mapping (see patch: "xen/setup:
Set identity mapping for non-RAM E820 and E820 gaps.") we pass in the start
of the PFN and the end PFN (263424 and 512256 respectively). The first step is
to reserve_brk a top leaf page if the p2m[1] is missing. The top leaf page
covers 512^2 of page estate (1GB) and in case the start or end PFN is not
aligned on 512^2*PAGE_SIZE (1GB) we loop on aligned 1GB PFNs from start pfn to
end pfn.  We reserve_brk top leaf pages if they are missing (means they point
to p2m_mid_missing).

With the E820 example above, 263424 is not 1GB aligned so we allocate a
reserve_brk page which will cover the PFNs estate from 0x40000 to 0x80000.
Each entry in the allocate page is "missing" (points to p2m_missing).

Next stage is to determine if we need to do a more granular boundary check
on the 4MB (or 2MB depending on architecture) off the start and end pfn's.
We check if the start pfn and end pfn violate that boundary check, and if
so reserve_brk a middle (p2m[x][y]) leaf page. This way we have a much finer
granularity of setting which PFNs are missing and which ones are identity.
In our example 263424 and 512256 both fail the check so we reserve_brk two
pages. Populate them with INVALID_P2M_ENTRY (so they both have "missing" values)
and assign them to p2m[1][2] and p2m[1][488] respectively.

At this point we would at minimum reserve_brk one page, but could be up to
three. Each call to set_phys_range_identity has at maximum a three page
cost. If we were to query the P2M at this stage, all those entries from
start PFN through end PFN (so 1029MB -> 2001MB) would return INVALID_P2M_ENTRY
("missing").

The next step is to walk from the start pfn to the end pfn setting
the IDENTITY_FRAME_BIT on each PFN. This is done in 'set_phys_range_identity'.
If we find that the middle leaf is pointing to p2m_missing we can swap it over
to p2m_identity - this way covering 4MB (or 2MB) PFN space.  At this point we
do not need to worry about boundary aligment (so no need to reserve_brk a middle
page, figure out which PFNs are "missing" and which ones are identity), as that
has been done earlier.  If we find that the middle leaf is not occupied by
p2m_identity or p2m_missing, we dereference that page (which covers
512 PFNs) and set the appropriate PFN with IDENTITY_FRAME_BIT. In our example
263424 and 512256 end up there, and we set from p2m[1][2][256->511] and
p2m[1][488][0->256] with IDENTITY_FRAME_BIT set.

All other regions that are void (or not filled) either point to p2m_missing
(considered missing) or have the default value of INVALID_P2M_ENTRY (also
considered missing). In our case, p2m[1][2][0->255] and p2m[1][488][257->511]
contain the INVALID_P2M_ENTRY value and are considered "missing."

This is what the p2m ends up looking (for the E820 above) with this
fabulous drawing:

   p2m         /--------------\
 /-----\       | &mfn_list[0],|                           /-----------------\
 |  0  |------>| &mfn_list[1],|    /---------------\      | ~0, ~0, ..      |
 |-----|       |  ..., ~0, ~0 |    | ~0, ~0, [x]---+----->| IDENTITY [@256] |
 |  1  |---\   \--------------/    | [p2m_identity]+\     | IDENTITY [@257] |
 |-----|    \                      | [p2m_identity]+\\    | ....            |
 |  2  |--\  \-------------------->|  ...          | \\   \----------------/
 |-----|   \                       \---------------/  \\
 |  3  |\   \                                          \\  p2m_identity
 |-----| \   \-------------------->/---------------\   /-----------------\
 | ..  +->+                        | [p2m_identity]+-->| ~0, ~0, ~0, ... |
 \-----/ /                         | [p2m_identity]+-->| ..., ~0         |
        / /---------------\        | ....          |   \-----------------/
       /  | IDENTITY[@0]  |      /-+-[x], ~0, ~0.. |
      /   | IDENTITY[@256]|<----/  \---------------/
     /    | ~0, ~0, ....  |
    |     \---------------/
    |
    p2m_missing             p2m_missing
/------------------\     /------------\
| [p2m_mid_missing]+---->| ~0, ~0, ~0 |
| [p2m_mid_missing]+---->| ..., ~0    |
\------------------/     \------------/

where ~0 is INVALID_P2M_ENTRY. IDENTITY is (PFN | IDENTITY_BIT)
Reviewed-by: NIan Campbell <ian.campbell@citrix.com>
[v5: Changed code to use ranges, added ASCII art]
[v6: Rebased on top of xen->p2m code split]
[v4: Squished patches in just this one]
[v7: Added RESERVE_BRK for potentially allocated pages]
[v8: Fixed alignment problem]
[v9: Changed 1<<3X to 1<<BITS_PER_LONG-X]
[v10: Copied git commit description in the p2m code + Add Review tag]
[v11: Title had '2-1' - should be '1-1' mapping]
Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>

f4cec35b

x86: ce4100: Set pci ops via callback instead of module init · 03150171

由 Sebastian Andrzej Siewior 提交于 3月 14, 2011

Setting the pci ops on subsys initcall unconditionally will break
multi platform kernels on anything except ce4100.

Use x86_init.pci.init ops to call this only on real ce4100 platforms.
Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: sodaville@linutronix.de
LKML-Reference: <20110314093340.GA21026@www.tglx.de>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

03150171

12 3月, 2011 4 次提交

x86: Enable forced interrupt threading support · c0185808

由 Thomas Gleixner 提交于 2月 07, 2011

All non threadeable interrupts are marked. Enable forced irq threading
support.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

c0185808

T
x86: Mark low level interrupts IRQF_NO_THREAD · 9bbbff25
由 Thomas Gleixner 提交于 1月 27, 2011
```
These cannot be threaded.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
```
9bbbff25
T
x86: Use generic show_interrupts · 517e4981
由 Thomas Gleixner 提交于 12月 16, 2010
```
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
```
517e4981

x86: ioapic: Avoid redundant lookup of irq_cfg · 1a0e62a4

由 Thomas Gleixner 提交于 3月 12, 2011

The caller of ioapic_register_intr() has a pointer to the irq_cfg for
the irq already. Hand it in to avoid a full lookup.

In msi_compose_msg() the pointer to irq_cfg is already available. No
need to look it up again.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

1a0e62a4

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功