Commits · 1e3f42f03c38c29c1814199a6f0a2f01b919ea3f · openanolis / cloud-kernel

08 Apr, 2012 40 commits

KVM: MMU: Improve iteration through sptes from rmap · 1e3f42f0

Committed by Takuya Yoshikawa on Mar 21, 2012

Iteration using rmap_next(), whose actual body is pte_list_next(), is
inefficient: every time we call it we start from checking whether rmap
holds a single spte or points to a descriptor which links more sptes.

In the case of shadow paging, this quadratic total iteration cost is a
problem.  Even for two dimensional paging, with EPT/NPT on, in which we
almost always have a single mapping, the extra checks at the end of the
iteration should be eliminated.

This patch fixes this by introducing rmap_iterator which keeps the
iteration context for the next search.  Furthermore the implementation
of rmap_next() is split into two functions, rmap_get_first() and
rmap_get_next(), to avoid repeatedly checking whether the rmap being
iterated on has only one spte.
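The split described above can be sketched in userspace C as follows. This is a simplified illustration, not the patch itself: the rmap encoding (0 for empty, low bit clear for a single spte pointer, low bit set for a descriptor pointer), the value of PTE_LIST_EXT, and the rmap_count() helper are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_LIST_EXT 3 /* assumed value for this sketch */

/* A descriptor links several sptes and chains to the next descriptor. */
struct pte_list_desc {
	uint64_t *sptes[PTE_LIST_EXT];
	struct pte_list_desc *more;
};

/* Iteration context carried from one call to the next. */
struct rmap_iterator {
	struct pte_list_desc *desc; /* NULL when the rmap holds a single spte */
	int pos;                    /* index into desc->sptes */
};

/* The single-spte versus descriptor check happens once, here. */
static uint64_t *rmap_get_first(uintptr_t rmap, struct rmap_iterator *iter)
{
	if (!rmap)
		return NULL;

	if (!(rmap & 1)) {
		iter->desc = NULL;
		return (uint64_t *)rmap;
	}

	iter->desc = (struct pte_list_desc *)(rmap & ~(uintptr_t)1);
	iter->pos = 0;
	assert(iter->desc->sptes[0] != NULL); /* a descriptor is never empty */
	return iter->desc->sptes[0];
}

/* Continues from the saved position instead of restarting from the rmap. */
static uint64_t *rmap_get_next(struct rmap_iterator *iter)
{
	if (!iter->desc)
		return NULL;

	if (iter->pos < PTE_LIST_EXT - 1) {
		uint64_t *sptep = iter->desc->sptes[++iter->pos];

		if (sptep)
			return sptep;
	}

	iter->desc = iter->desc->more;
	if (!iter->desc)
		return NULL;

	iter->pos = 0;
	return iter->desc->sptes[0];
}

/* Hypothetical helper: count every spte reachable from one rmap word. */
static int rmap_count(uintptr_t rmap)
{
	struct rmap_iterator iter;
	int n = 0;

	for (uint64_t *sptep = rmap_get_first(rmap, &iter); sptep;
	     sptep = rmap_get_next(&iter))
		n++;
	return n;
}
```

Each step of the walk is constant time, so visiting n sptes costs O(n) rather than restarting the search on every call.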

Although there seemed to be only a slight change for EPT/NPT, the actual
improvement was significant: we observed that GET_DIRTY_LOG for 1GB
dirty memory became 15% faster than before.  This is probably because
branches in the new code are easier to predict.

Note: we just remove pte_list_next() because we can think of parent_ptes
as a reverse mapping.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: MMU: Make pte_list_desc fit cache lines well · 220f773a

Committed by Takuya Yoshikawa on Mar 21, 2012

We have PTE_LIST_EXT + 1 pointers in this structure and these 40/20
bytes do not fit cache lines well. Furthermore, some allocators may
use 64/32-byte objects for the pte_list_desc cache.

This patch solves this problem by changing PTE_LIST_EXT from 4 to 3.
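The arithmetic behind the 40/20-byte figures can be checked with a small sketch. The helper names and the 64-byte cache line are assumptions for illustration; the struct layout simply mirrors the "PTE_LIST_EXT + 1 pointers" description above.

```c
#include <stddef.h>

#define PTE_LIST_EXT 3 /* value after this patch */

/* PTE_LIST_EXT spte pointers plus one link pointer. */
struct pte_list_desc {
	unsigned long long *sptes[PTE_LIST_EXT];
	struct pte_list_desc *more;
};

/* Bytes taken by a descriptor holding `ext` spte pointers plus one link. */
static size_t desc_bytes(size_t ext, size_t ptr_bytes)
{
	return (ext + 1) * ptr_bytes;
}

/* True when whole descriptors tile a cache line with no remainder. */
static int tiles_cache_line(size_t obj_bytes, size_t line_bytes)
{
	return obj_bytes != 0 && line_bytes % obj_bytes == 0;
}
```

With PTE_LIST_EXT at 4 a descriptor takes 40 bytes (64-bit) or 20 bytes (32-bit), neither of which divides a 64-byte line; with 3 it takes 32 or 16 bytes, both of which do.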

For shadow paging, the new size is still large enough to hold both the
kernel and process mappings for usual anonymous pages. For file
mappings, there may be a slight change in the cache usage.

Note: with EPT/NPT we almost always have a single spte in each reverse
mapping and we will not see any change by this.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: x86: add paging gcc optimization · c36fc04e

Committed by Davidlohr Bueso on Mar 08, 2012

Since most guests will have paging enabled for memory management, add likely() optimization
around CR0.PG checks.
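A minimal sketch of such a hint, assuming GCC or Clang. The is_paging() helper here takes a raw CR0 value for illustration only; how the kernel actually reads CR0 is not shown in this message.

```c
/* Branch hint in the style of the kernel's likely() macro. */
#define likely(x) __builtin_expect(!!(x), 1)

#define X86_CR0_PG 0x80000000UL /* CR0 bit 31: paging enabled */

/* Returns 1 when paging is on; the compiler lays this out as the hot path. */
static inline int is_paging(unsigned long cr0)
{
	return likely(cr0 & X86_CR0_PG);
}
```

The hint changes only code layout and static prediction, not the result of the check.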
Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: VMX: Auto-load on CPUs with VMX · e9bda3b3

Committed by Josh Triplett on Mar 20, 2012

Enable x86 feature-based autoloading for the kvm-intel module on CPUs
with X86_FEATURE_VMX.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Kay Sievers <kay@vrfy.org>
Signed-off-by: Avi Kivity <avi@redhat.com>

powerpc/kvm: Fix magic page vs. 32-bit RTAS on ppc64 · bbcc9c06

Committed by Benjamin Herrenschmidt on Mar 13, 2012

When the kernel calls into RTAS, it switches to 32-bit mode. The
magic page is no longer accessible in that case, causing the
patched instructions in the RTAS call wrapper to crash.

This fixes it by making available a 32-bit mapping of the magic
page in that case. This mapping is flushed whenever we switch
the kernel back to 64-bit mode.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[agraf: add a check if the magic page is mapped]
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: Ignore unhalt request from kvm_vcpu_block · 966cd0f3

Committed by Alexander Graf on Mar 14, 2012

When kvm_vcpu_block runs and realizes that the CPU is actually good
to run, we get a request bit set for KVM_REQ_UNHALT. Right now, there's
nothing we can do with that bit, so let's unset it right after the call
again so we don't get confused in our later checks for pending work.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: Book3s: PR: Add HV traps so we can run in HV=1 mode on p7 · 4f225ae0

Committed by Alexander Graf on Mar 13, 2012

When running PR KVM on a p7 system in bare metal, we get HV exits instead
of normal supervisor traps. Semantically they are identical though and the
HSRR vs SRR difference is already taken care of in the exit code.

So all we need to do is handle them in addition to our normal exits.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: Emulate tw and td instructions · 6df79df5

Committed by Alexander Graf on Mar 13, 2012

There are 4 conditional trapping instructions: tw, twi, td, tdi. The
ones with an i take an immediate comparison, the others compare two
registers. All of them arrive in the emulator when the condition to
trap was successfully fulfilled.

Unfortunately, we were only implementing the i versions so far, so
let's also add support for the other two.
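Since the condition has already been met by the time the instruction reaches the emulator, recognising all four forms is mostly a decode problem. A hedged sketch of that decode step follows; the opcode numbers come from the Power ISA, while the macro and function names are illustrative and not taken from the patch.

```c
#include <stdint.h>

#define OP_TDI 2   /* tdi: primary opcode 2 */
#define OP_TWI 3   /* twi: primary opcode 3 */
#define OP_31  31  /* primary opcode shared by tw and td */
#define XOP_TW 4   /* tw: extended opcode 4 */
#define XOP_TD 68  /* td: extended opcode 68 */

/* Primary opcode: the top 6 bits of the instruction word. */
static unsigned int get_op(uint32_t inst)
{
	return inst >> 26;
}

/* Extended opcode of an X-form instruction: 10 bits above the Rc bit. */
static unsigned int get_xop(uint32_t inst)
{
	return (inst >> 1) & 0x3ff;
}

/* Returns 1 for tw, twi, td and tdi, 0 for everything else. */
static int is_trap_insn(uint32_t inst)
{
	switch (get_op(inst)) {
	case OP_TDI:
	case OP_TWI:
		return 1;
	case OP_31:
		return get_xop(inst) == XOP_TW || get_xop(inst) == XOP_TD;
	default:
		return 0;
	}
}
```

For example, 0x7fe00008 is the unconditional `trap` mnemonic (tw 31,0,0), which the immediate-only handling would have missed.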

This fixes kernel booting with recent book3s_32 guest kernels.
Reported-by: Jörg Sommer <joerg@alea.gnuu.de>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: Pass EA to updating emulation ops · 6020c0f6

Committed by Alexander Graf on Mar 12, 2012

When emulating updating load/store instructions (lwzu, stwu, ...) we need to
write the effective address of the load/store into a register.

Currently, we write the physical address in there, which is very wrong. So
instead let's save off where the virtual fault was on MMIO and use that
information as value to put into the register.
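The fix can be pictured with a small sketch, assuming the fault handler records both addresses before emulation starts. The struct, its field names, and finish_update_form() are hypothetical stand-ins for the example.

```c
#include <stdint.h>

/* Hypothetical per-vcpu state for this sketch. */
struct vcpu_sketch {
	uint64_t gpr[32];
	uint64_t vaddr_accessed; /* guest effective address of the MMIO fault */
	uint64_t paddr_accessed; /* guest physical address it translated to */
};

/* After emulating e.g. lwzu or stwu, rA must receive the effective address. */
static void finish_update_form(struct vcpu_sketch *vcpu, unsigned int ra)
{
	vcpu->gpr[ra] = vcpu->vaddr_accessed; /* not paddr_accessed */
}
```

The guest later uses rA as a pointer, so giving it the physical address would make subsequent accesses go to the wrong place.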

While at it, also move the XOP variants of the above instructions to the new
scheme of using the already known vaddr instead of calculating it themselves.
Reported-by: Jörg Sommer <joerg@alea.gnuu.de>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: Work around POWER7 DABR corruption problem · 8943633c

Committed by Paul Mackerras on Mar 02, 2012

It turns out that on POWER7, writing to the DABR can cause a corrupted
value to be written if the PMU is active and updating SDAR in continuous
sampling mode.  To work around this, we make sure that the PMU is inactive
and SDAR updates are disabled (via MMCRA) when we are context-switching
DABR.

When the guest sets DABR via the H_SET_DABR hypercall, we use a slightly
different workaround, which is to read back the DABR and write it again
if it got corrupted.

While we are at it, make it consistent that the saving and restoring
of the guest's non-volatile GPRs and the FPRs are done with the guest
setup of the PMU active.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: Book 3S: Fix compilation for !HV configs · 7657f408

Committed by Paul Mackerras on Mar 05, 2012

Commits 2f5cdd5487 ("KVM: PPC: Book3S HV: Make secondary threads more
robust against stray IPIs") and 1c2066b0f7 ("KVM: PPC: Book3S HV: Make
virtual processor area registration more robust") added fields to
struct kvm_vcpu_arch inside #ifdef CONFIG_KVM_BOOK3S_64_HV regions,
and added lines to arch/powerpc/kernel/asm-offsets.c to generate
assembler constants for their offsets. Unfortunately this led to
compile errors on Book 3S machines for configs that had KVM enabled
but not CONFIG_KVM_BOOK3S_64_HV. This fixes the problem by moving
the offending lines inside #ifdef CONFIG_KVM_BOOK3S_64_HV regions.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

Restore guest CR after exit timing calculation · c0fe7b09

Committed by Bharat Bhushan on Mar 05, 2012

No instruction which can change Condition Register (CR) should be executed after
Guest CR is loaded. So the guest CR is restored after the Exit Timing in
lightweight_exit executes cmpw, which can clobber CR.
Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: Book3S HV: Report stolen time to guest through dispatch trace log · 0456ec4f

Committed by Paul Mackerras on Feb 03, 2012

This adds code to measure "stolen" time per virtual core in units of
timebase ticks, and to report the stolen time to the guest using the
dispatch trace log (DTL).  The guest can register an area of memory
for the DTL for a given vcpu.  The DTL is a ring buffer where KVM
fills in one entry every time it enters the guest for that vcpu.

Stolen time is measured as time when the virtual core is not running,
either because the vcore is not runnable (e.g. some of its vcpus are
executing elsewhere in the kernel or in userspace), or when the vcpu
thread that is running the vcore is preempted.  This includes time
when all the vcpus are idle (i.e. have executed the H_CEDE hypercall),
which is OK because the guest accounts stolen time while idle as idle
time.

Each vcpu keeps a record of how much stolen time has been reported to
the guest for that vcpu so far.  When we are about to enter the guest,
we create a new DTL entry (if the guest vcpu has a DTL) and report the
difference between total stolen time for the vcore and stolen time
reported so far for the vcpu as the "enqueue to dispatch" time in the
DTL entry.
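The per-vcpu bookkeeping in the last paragraph amounts to reporting a delta and remembering what has been reported. A minimal sketch, with hypothetical struct and function names and with locking and the DTL entry layout left out:

```c
#include <stdint.h>

/* Hypothetical types for this sketch; times are in timebase ticks. */
struct vcore_sketch {
	uint64_t stolen_tb;   /* total stolen time accumulated for the vcore */
};

struct dtl_vcpu_sketch {
	struct vcore_sketch *vcore;
	uint64_t reported_tb; /* stolen time already reported to this vcpu */
};

/*
 * Called when about to enter the guest: returns the not-yet-reported
 * stolen time and records it as reported.
 */
static uint64_t dtl_take_stolen(struct dtl_vcpu_sketch *vcpu)
{
	uint64_t delta = vcpu->vcore->stolen_tb - vcpu->reported_tb;

	vcpu->reported_tb += delta;
	return delta;
}
```

The returned value is what would go into the "enqueue to dispatch" field of the new entry; a second call with no additional stolen time yields zero.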
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: Book3S HV: Make virtual processor area registration more robust · 2e25aa5f

Committed by Paul Mackerras on Feb 19, 2012

The PAPR API allows three sorts of per-virtual-processor areas to be
registered (VPA, SLB shadow buffer, and dispatch trace log), and
furthermore, these can be registered and unregistered for another
virtual CPU.  Currently we just update the vcpu fields pointing to
these areas at the time of registration or unregistration.  If this
is done on another vcpu, there is the possibility that the target vcpu
is using those fields at the time and could end up using a bogus
pointer and corrupting memory.

This fixes the race by making the target cpu itself do the update, so
we can be sure that the update happens at a time when the fields
aren't being used.  Each area now has a struct kvmppc_vpa which is
used to manage these updates.  There is also a spinlock which protects
access to all of the kvmppc_vpa structs, other than to the pinned_addr
fields.  (We could have just taken the spinlock when using the vpa,
slb_shadow or dtl fields, but that would mean taking the spinlock on
every guest entry and exit.)

This also changes 'struct dtl' (which was undefined) to 'struct dtl_entry',
which is what the rest of the kernel uses.

Thanks to Michael Ellerman <michael@ellerman.id.au> for pointing out
the need to initialize vcpu->arch.vpa_update_lock.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: Book3S HV: Make secondary threads more robust against stray IPIs · f0888f70

Committed by Paul Mackerras on Feb 03, 2012

Currently on POWER7, if we are running the guest on a core and we don't
need all the hardware threads, we do nothing to ensure that the unused
threads aren't executing in the kernel (other than checking that they
are offline). We just assume they're napping and we don't do anything
to stop them trying to enter the kernel while the guest is running.
This means that a stray IPI can wake up the hardware thread and it will
then try to enter the kernel, but since the core is in guest context,
it will execute code from the guest in hypervisor mode once it turns the
MMU on, which tends to lead to crashes or hangs in the host.

This fixes the problem by adding two new one-byte flags in the
kvmppc_host_state structure in the PACA which are used to interlock
between the primary thread and the unused secondary threads when entering
the guest. With these flags, the primary thread can ensure that the
unused secondaries are not already in kernel mode (i.e. handling a stray
IPI) and then indicate that they should not try to enter the kernel
if they do get woken for any reason. Instead they will go into KVM code,
find that there is no vcpu to run, acknowledge and clear the IPI and go
back to nap mode.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: Save/Restore CR over vcpu_run · f6127716

Committed by Alexander Graf on Mar 05, 2012

On PPC, CR2-CR4 are nonvolatile, thus have to be saved across function calls.
We didn't respect that for any architecture until Paul spotted it in his
patch for Book3S-HV. This patch saves/restores CR for all KVM capable PPC hosts.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: Book3s: PR: Add SPAPR H_BULK_REMOVE support · 3aaefef2

Committed by Matt Evans on Jan 30, 2012

SPAPR support includes various in-kernel hypercalls, improving performance
by cutting out the exit to userspace.  H_BULK_REMOVE is implemented in this
patch.
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: Booke: only prepare to enter when we enter · 03660ba2

Committed by Alexander Graf on Feb 28, 2012

So far, we've always called prepare_to_enter even when all we did was return
to the host. This patch changes that semantic to only call prepare_to_enter
when we actually want to get back into the guest.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: booke: Reinject performance monitor interrupts · 7cc1e8ee

Committed by Alexander Graf on Feb 22, 2012

When we get a performance monitor interrupt, we need to make sure that
the host receives it. So reinject it like we reinject the other host
destined interrupts.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: booke: expose good state on irq reinject · 4e642ccb

Committed by Alexander Graf on Feb 20, 2012

When reinjecting an interrupt into the host interrupt handler after we're
back in host kernel land, we need to tell the kernel where the interrupt
happened. We can't tell it that we were in guest state, because that might
lead to random code walking host addresses. So instead, we tell it that
we came from the interrupt reinject code.

This helps getting reasonable numbers out of perf.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: booke: Support perfmon interrupts · 95f2e921

Committed by Alexander Graf on Feb 20, 2012

When during guest context we get a performance monitor interrupt, we
currently bail out and oops. Let's route it to its correct handler
instead.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: e500: fix typo in tlb code · c6b3733b

Committed by Alexander Graf on Feb 20, 2012

The tlbncfg registers should be populated with their respective TLB's
values. Fix the obvious typo.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: bookehv: remove unused code · 55cdf08b

Committed by Alexander Graf on Feb 20, 2012

There was some unused code in the exit code path that must have been
a leftover from earlier iterations. While it did no harm, it's superfluous
and thus should be removed.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: booke: add GS documentation for program interrupt · 0268597c

Committed by Alexander Graf on Feb 20, 2012

The comment for program interrupts triggered when using bookehv was
misleading. Update it to mention why MSR_GS indicates that we have
to inject an interrupt into the guest again, not emulate it.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: booke: Readd debug abort code for machine check · c35c9d84

Committed by Alexander Graf on Feb 20, 2012

When during guest execution we get a machine check interrupt, we don't
know how to handle it yet. So let's add the error printing code back
again that we dropped accidentally earlier and tell user space that something
went really wrong.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: bookehv: add comment about shadow_msr · 5fd8505e

Committed by Alexander Graf on Feb 16, 2012

For BookE HV the guest visible MSR is shared->msr and is identical to
the MSR that is in use while the guest is running, because we can't trap
reads from or writes to the MSR.

So shadow_msr is unused there. Indicate that with a comment.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: bookehv: disable MAS register updates early · e9ba39c1

Committed by Alexander Graf on Feb 16, 2012

We need to make sure that no MAS updates happen automatically while we
have the guest MAS registers loaded. So move the disabling code a bit
higher up so that it covers the full time we have guest values in MAS
registers.

The race this patch fixes should never occur, but it makes the code a
bit more logical to do it this way around.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: bookehv: remove SET_VCPU · 8a3da557

Committed by Alexander Graf on Feb 16, 2012

The SET_VCPU macro is a leftover from times when the vcpu struct wasn't
stored in the thread on vcpu_load/put. It's not needed anymore. Remove it.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: bookehv: remove negation for CONFIG_64BIT · 8764b46e

Committed by Alexander Graf on Feb 16, 2012

Instead of doing

  #ifndef CONFIG_64BIT
  ...
  #else
  ...
  #endif

we should rather do

  #ifdef CONFIG_64BIT
  ...
  #else
  ...
  #endif

which is a lot easier to read. Change the bookehv implementation to
stick with this rule.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: bookehv: fix exit timing · 73ede8d3

Committed by Alexander Graf on Feb 16, 2012

When using exit timing stats, we clobber r9 in the NEED_EMU case,
so better move that part down a few lines and fix it that way.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: booke: BOOKE_IRQPRIO_MAX is n+1 · 8b3a00fc

Committed by Alexander Graf on Feb 16, 2012

The semantics of BOOKE_IRQPRIO_MAX changed to denote the highest available
irqprio + 1, so let's reflect that in the code too.
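With MAX defined as "highest + 1", loops over priorities use an exclusive upper bound. A small sketch of that convention; the enum values and first_pending() are hypothetical and do not reproduce the real priority list.

```c
/* Hypothetical priority list; the last enumerator is highest + 1. */
enum {
	IRQPRIO_A,
	IRQPRIO_B,
	IRQPRIO_C,
	IRQPRIO_MAX, /* number of priorities, not a valid priority itself */
};

/* Returns the lowest-numbered pending priority, or -1 if none is valid. */
static int first_pending(unsigned long pending)
{
	for (int prio = 0; prio < IRQPRIO_MAX; prio++) { /* '<', not '<=' */
		if (pending & (1UL << prio))
			return prio;
	}
	return -1;
}
```

A bit set at position IRQPRIO_MAX itself is ignored, which is the behaviour the exclusive bound is meant to guarantee.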
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: booke: rework rescheduling checks · a8e4ef84

Committed by Alexander Graf on Feb 16, 2012

Instead of checking whether we should reschedule only when we exited
due to an interrupt, let's always check before entering the guest back
again. This gets the target more in line with the other archs.

Also while at it, generalize the whole thing so that eventually we could
have a single kvmppc_prepare_to_enter function for all ppc targets that
does signal and reschedule checking for us.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: booke: deliver program int on emulation failure · d1ff5499

Committed by Alexander Graf on Feb 16, 2012

When we fail to emulate an instruction for the guest, we better go in and
tell it that we failed to emulate it, by throwing an illegal instruction
exception.

Please beware that we basically never get around to telling the guest that
we failed thanks to the debugging code right above it. If user space however
decides that it wants to ignore the debug, we would at least do "the right
thing" afterwards.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: booke: remove leftover debugging · acab0529

Committed by Alexander Graf on Feb 16, 2012

The e500mc patches left some debug code in that we don't need. Remove it.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: make e500v2 kvm and e500mc cpu mutually exclusive · b2e19b20

Committed by Alexander Graf on Feb 15, 2012

We can't run e500v2 kvm on e500mc kernels, so indicate that by
making the 2 options mutually exclusive in kconfig.
Signed-off-by: NAlexander Graf <agraf@suse.de>
Signed-off-by: NAvi Kivity <avi@redhat.com>

b2e19b20

KVM: PPC: rename CONFIG_KVM_E500 -> CONFIG_KVM_E500V2 · bf7ca4bd

Committed by Alexander Graf on Feb 15, 2012

The CONFIG_KVM_E500 option really indicates that we're running on a V2 machine,
not on a machine of the generic E500 class. So indicate that properly and
change the config name accordingly.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: e500mc: add load inst fixup · 1d628af7

Committed by Alexander Graf on Feb 15, 2012

There's always a chance we're unable to read a guest instruction. The guest
could have its TLB entry mapped executable but not readable, or something odd
could happen and our TLB could get flushed. So it's a good idea to be prepared
for that case and have a fallback that allows us to fix things up.

Add fixup code that keeps guest code from potentially crashing our host kernel.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: e500mc: Move r1/r2 restoration very early · a2723ce7

Committed by Alexander Graf on Feb 15, 2012

If we hit any exception whatsoever in the restore path and r1/r2 aren't the
host registers, we don't get a working oops. So it's always a good idea to
restore them as early as possible.

This time, it actually has practical reasons to do so too, since we need to
have the host page fault handler fix up our guest instruction read code. And
for that to work we need r1/r2 restored.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: e500mc: implicitly set MSR_GS · 79300f8c

Committed by Alexander Graf on Feb 15, 2012

When setting MSR for an e500mc guest, we implicitly always set MSR_GS
to make sure the guest is in guest state. Since we have this implicit
rule there, we don't need to explicitly pass MSR_GS to set_msr().

Remove all explicit setters of MSR_GS.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: PPC: e500mc: Add doorbell emulation support · 4ab96919

Committed by Alexander Graf on Feb 15, 2012

When one vcpu wants to kick another, it can issue a special IPI instruction
called msgsnd. This patch emulates this instruction, its clearing counterpart
and the infrastructure required to actually trigger that interrupt inside
a guest vcpu.

With this patch, SMP guests on e500mc work.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
