1. 19 Feb 2019 (3 commits)
  2. 04 Jan 2019 (1 commit)
    • Remove 'type' argument from access_ok() function · 96d4f267
      Linus Torvalds committed
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted in this patch, because there is no way we're going
      to move the old VERIFY_xyz interface to that model.  And it's best done at
      the end of the merge window when I've done most of my merges, so let's
      just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 30 Dec 2018 (1 commit)
    • KVM: PPC: Book3S HV: radix: Fix uninitialized var build error · f4607722
      Michael Ellerman committed
      Old GCCs (4.6.3 at least) aren't able to follow the logic in
      __kvmhv_copy_tofrom_guest_radix() and warn that old_pid is used
      uninitialized:
      
        arch/powerpc/kvm/book3s_64_mmu_radix.c:75:3: error: 'old_pid' may be
        used uninitialized in this function
      
      The logic is OK: we only use old_pid if quadrant == 1, and in that
      case it has definitely been initialised, eg:
      
      	if (quadrant == 1) {
      		old_pid = mfspr(SPRN_PID);
      	...
      	if (quadrant == 1 && pid != old_pid)
      		mtspr(SPRN_PID, old_pid);
      
      Annotate it to fix the error.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  4. 21 Dec 2018 (8 commits)
    • treewide: surround Kconfig file paths with double quotes · 8636a1f9
      Masahiro Yamada committed
      The Kconfig lexer supports special characters such as '.' and '/' in
      the parameter context. In my understanding, the reason is just to
      support bare file paths in the source statement.
      
      I do not see a good reason to complicate Kconfig for the room of
      ambiguity.
      
      The majority of code already surrounds file paths with double quotes,
      and it makes sense since file paths are constant string literals.
      
      Make it treewide consistent now.
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Wolfram Sang <wsa@the-dreams.de>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Ingo Molnar <mingo@kernel.org>
    • KVM: Make kvm_set_spte_hva() return int · 748c0e31
      Lan Tianyu committed
      Make kvm_set_spte_hva() return int so that the caller can check the
      return value to determine whether or not to flush the TLB.
      Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
      Acked-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • powerpc/vfio/iommu/kvm: Do not pin device memory · c10c21ef
      Alexey Kardashevskiy committed
      This new memory does not have page structs as it is not plugged into
      the host, so gup() will fail anyway.
      
      This adds 2 helpers:
      - mm_iommu_newdev() to preregister the "memory device" memory so
      the rest of the API can still be used;
      - mm_iommu_is_devmem() to know if the physical address is one of these
      new regions which we must avoid unpinning.
      
      This adds @mm to tce_page_is_contained() and iommu_tce_xchg() to test
      if the memory is device memory to avoid pfn_to_page().
      
      This adds a check for device memory in mm_iommu_ua_mark_dirty_rm() which
      does delayed pages dirtying.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Keep rc bits in shadow pgtable in sync with host · ae59a7e1
      Suraj Jitindar Singh committed
      The rc bits contained in ptes are used to track whether a page has been
      accessed and whether it is dirty. The accessed bit is used to age a page
      and the dirty bit to track whether a page is dirty or not.
      
      Now that we support nested guests there are three ptes which track the
      state of the same page:
      - The partition-scoped page table in the L1 guest, mapping L2->L1 address
      - The partition-scoped page table in the host for the L1 guest, mapping
        L1->L0 address
      - The shadow partition-scoped page table for the nested guest in the host,
        mapping L2->L0 address
      
      The idea is to attempt to keep the rc state of these three ptes in sync,
      both when setting and when clearing rc bits.
      
      When setting the bits we achieve consistency by:
      - Initially setting the bits in the shadow page table as the 'and' of the
        other two.
      - When updating in software the rc bits in the shadow page table we
        ensure the state is consistent with the other two locations first, and
        update these before reflecting the change into the shadow page table.
        i.e. only set the bits in the L2->L0 pte if also set in both the
             L2->L1 and the L1->L0 pte.
      
      When clearing the bits we achieve consistency by:
      - The rc bits in the shadow page table are only cleared when discarding
        a pte, and we don't need to record this as if either bit is set then
        it must also be set in the pte mapping L1->L0.
      - When L1 clears an rc bit in the L2->L1 mapping it __should__ issue a
        tlbie instruction
        - This means we will discard the pte from the shadow page table
          meaning the mapping will have to be setup again.
        - When we set up the pte again in the shadow page table we will ensure
          consistency with the L2->L1 pte.
      - When the host clears an rc bit in the L1->L0 mapping we need to also
        clear the bit in any ptes in the shadow page table which map the same
        gfn so we will be notified if a nested guest accesses the page.
        This case is what this patch specifically concerns.
        - We can search the nest_rmap list for that given gfn and clear the
          same bit from all corresponding ptes in shadow page tables.
        - If a nested guest causes either of the rc bits to be set by software
          in future then we will update the L1->L0 pte and maintain consistency.
      
      With the process outlined above we aim to maintain consistency of the 3
      pte locations where we track rc for a given guest page.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Introduce kvmhv_update_nest_rmap_rc_list() · 90165d3d
      Suraj Jitindar Singh committed
      Introduce a function kvmhv_update_nest_rmap_rc_list() which for a given
      nest_rmap list will traverse it, find the corresponding pte in the shadow
      page tables, and if it still maps the same host page update the rc bits
      accordingly.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Apply combination of host and l1 pte rc for nested guest · 8b23eee4
      Suraj Jitindar Singh committed
      The shadow page table contains ptes for translations from nested guest
      address to host address. Currently when creating these ptes we take the
      rc bits from the pte for the L1 guest address to host address
      translation. This is incorrect as we must also factor in the rc bits
      from the pte for the nested guest address to L1 guest address
      translation (as contained in the L1 guest partition table for the nested
      guest).
      
      By not calculating these bits correctly L1 may not have been correctly
      notified when it needed to update its rc bits in the partition table it
      maintains for its nested guest.
      
      Modify the code so that the rc bits in the resultant pte for the L2->L0
      translation are the 'and' of the rc bits in the L2->L1 pte and the L1->L0
      pte, also accounting for whether this was a write access when setting
      the dirty bit.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Align gfn to L1 page size when inserting nest-rmap entry · 8400f874
      Suraj Jitindar Singh committed
      Nested rmap entries are used to store the translation from L1 gpa to L2
      gpa when entries are inserted into the shadow (nested) page tables. This
      rmap list is located by indexing the rmap array in the memslot by L1
      gfn. When we come to search for these entries we only know the L1 page size
      (which could be PAGE_SIZE, 2M or a 1G page) and so can only select a gfn
      aligned to that size. This means that when we insert an entry, we need
      to align the gfn used to select the rmap list to the L1 page size as
      well, so that the entry can be found later.
      
      By not doing this we were missing nested rmap entries when modifying L1
      ptes which were for a page also passed through to an L2 guest.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Hold kvm->mmu_lock across updating nested pte rc bits · bec6e03b
      Suraj Jitindar Singh committed
      We already hold the kvm->mmu_lock spin lock across updating the rc bits
      in the pte for the L1 guest. Continue to hold the lock across updating
      the rc bits in the pte for the nested guest as well to prevent
      invalidations from occurring.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  5. 20 Dec 2018 (2 commits)
  6. 17 Dec 2018 (12 commits)
    • KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L3 guest · 95d386c2
      Suraj Jitindar Singh committed
      Previously when a device was being emulated by an L1 guest for an L2
      guest, that device couldn't then be passed through to an L3 guest. This
      was because the L1 guest had no method for accessing L3 memory.
      
      The hcall H_COPY_TOFROM_GUEST provides this access. Thus this setup for
      passthrough can now be allowed.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S: Introduce new hcall H_COPY_TOFROM_GUEST to access quadrants 1 & 2 · 6ff887b8
      Suraj Jitindar Singh committed
      A guest cannot access quadrants 1 or 2 as this would result in an
      exception. Thus introduce the hcall H_COPY_TOFROM_GUEST to be used by a
      guest when it wants to perform an access to quadrants 1 or 2, for
      example when it wants to access memory for one of its nested guests.
      
      Also provide an implementation for the kvm-hv module.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L2 guest · 873db2cd
      Suraj Jitindar Singh committed
      Allow for a device which is being emulated at L0 (the host) for an L1
      guest to be passed through to a nested (L2) guest.
      
      The existing kvmppc_hv_emulate_mmio function can be used here. The main
      challenge is that for a load the result must be stored into the L2 gpr,
      not an L1 gpr as would normally be the case after going out to qemu to
      complete the operation. This presents a challenge as at this point the
      L2 gpr state has been written back into L1 memory.
      
      To work around this we store the address in L1 memory of the L2 gpr
      where the result of the load is to be stored and use the new io_gpr
      value KVM_MMIO_REG_NESTED_GPR to indicate that this is a nested load for
      which completion must be done when returning back into the kernel. Then
      in kvmppc_complete_mmio_load() the resultant value is written into L1
      memory at the location of the indicated L2 gpr.
      
      Note that we don't currently let an L1 guest emulate a device for an L2
      guest which is then passed through to an L3 guest.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Update kvmppc_st and kvmppc_ld to use quadrants · cc6929cc
      Suraj Jitindar Singh committed
      The functions kvmppc_st and kvmppc_ld are used to access guest memory
      from the host using a guest effective address. They do so by translating
      through the process table to obtain a guest real address and then using
      kvm_read_guest or kvm_write_guest to make the access with the guest real
      address.
      
      This method of access however only works for L1 guests and will give the
      incorrect results for a nested guest.
      
      We can however use the store_to_eaddr and load_from_eaddr kvmppc_ops to
      perform the access for a nested guest (and an L1 guest). So attempt this
      method first and fall back to the old method if this fails and we aren't
      running a nested guest.
      
      At this stage there is no fall back method to perform the access for a
      nested guest and this is left as a future improvement. For now we will
      return to the nested guest and rely on the fact that a translation
      should be faulted in before retrying the access.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Add load_from_eaddr and store_to_eaddr to the kvmppc_ops struct · dceadcf9
      Suraj Jitindar Singh committed
      The kvmppc_ops struct is used to store function pointers to kvm
      implementation specific functions.
      
      Introduce two new functions load_from_eaddr and store_to_eaddr to be
      used to load from and store to a guest effective address respectively.
      
      Also implement these for the kvm-hv module. If we are using the radix
      mmu then we can call the functions to access quadrant 1 and 2.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Implement functions to access quadrants 1 & 2 · d7b45615
      Suraj Jitindar Singh committed
      The POWER9 radix mmu has the concept of quadrants. The quadrant number
      is the two high bits of the effective address and determines the fully
      qualified address to be used for the translation. The fully qualified
      address consists of the effective lpid, the effective pid and the
      effective address. This gives 4 possible quadrants: 0, 1, 2 and 3.
      
      When accessing these quadrants the fully qualified address is obtained
      as follows:
      
      Quadrant		| Hypervisor		| Guest
      --------------------------------------------------------------------------
      			| EA[0:1] = 0b00	| EA[0:1] = 0b00
      0			| effLPID = 0		| effLPID = LPIDR
      			| effPID  = PIDR	| effPID  = PIDR
      --------------------------------------------------------------------------
      			| EA[0:1] = 0b01	|
      1			| effLPID = LPIDR	| Invalid Access
      			| effPID  = PIDR	|
      --------------------------------------------------------------------------
      			| EA[0:1] = 0b10	|
      2			| effLPID = LPIDR	| Invalid Access
      			| effPID  = 0		|
      --------------------------------------------------------------------------
      			| EA[0:1] = 0b11	| EA[0:1] = 0b11
      3			| effLPID = 0		| effLPID = LPIDR
      			| effPID  = 0		| effPID  = 0
      --------------------------------------------------------------------------
      
      In the guest:
      Quadrant 3 is normally used to address the operating system since this
      uses effPID=0 and effLPID=LPIDR, meaning the PID register doesn't need to
      be switched.
      Quadrant 0 is normally used to address user space since the effLPID and
      effPID are taken from the corresponding registers.
      
      In the host:
      Quadrant 0 and 3 are used as above, however the effLPID is always 0 to
      address the host.
      
      Quadrants 1 and 2 can be used by the host to address guest memory using
      a guest effective address. Since the effLPID comes from the LPID register,
      the host loads the LPID of the guest it would like to access (and the
      PID of the process) and can perform accesses to a guest effective
      address.
      
      This means quadrant 1 can be used to address the guest user space and
      quadrant 2 can be used to address the guest operating system from the
      hypervisor, using a guest effective address.
      
      Access to the quadrants can cause a Hypervisor Data Storage Interrupt
      (HDSI) due to being unable to perform partition scoped translation.
      Previously this could only be generated from a guest and so the code
      path expects us to take the KVM trampoline in the interrupt handler.
      This is no longer the case so we modify the handler to call
      bad_page_fault() to check if we were expecting this fault so we can
      handle it gracefully and just return with an error code. In the hash mmu
      case we still raise an unknown exception since quadrants aren't defined
      for the hash mmu.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Add function kvmhv_vcpu_is_radix() · d232afeb
      Suraj Jitindar Singh committed
      There exists a function kvm_is_radix() which is used to determine if a
      kvm instance is using the radix mmu. However this only applies to the
      first level (L1) guest. Add a function kvmhv_vcpu_is_radix() which can
      be used to determine if the current execution context of the vcpu is
      radix, accounting for if the vcpu is running a nested guest.
      
      Currently all nested guests must be radix but this may change in the
      future.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S: Only report KVM_CAP_SPAPR_TCE_VFIO on powernv machines · 693ac10a
      Suraj Jitindar Singh committed
      The kvm capability KVM_CAP_SPAPR_TCE_VFIO is used to indicate the
      availability of in-kernel TCE acceleration for VFIO. However it is
      currently the case that this is only available on a powernv machine,
      not for a pseries machine.
      
      Thus make this capability dependent on having the cpu feature
      CPU_FTR_HVMODE.
      
      [paulus@ozlabs.org - fixed compilation for Book E.]
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Flush guest mappings when turning dirty tracking on/off · 5af3e9d0
      Paul Mackerras committed
      This adds code to flush the partition-scoped page tables for a radix
      guest when dirty tracking is turned on or off for a memslot.  Only the
      guest real addresses covered by the memslot are flushed.  The reason
      for this is to get rid of any 2M PTEs in the partition-scoped page
      tables that correspond to host transparent huge pages, so that page
      dirtiness is tracked at a system page (4k or 64k) granularity rather
      than a 2M granularity.  The page tables are also flushed when turning
      dirty tracking off so that the memslot's address space can be
      repopulated with THPs if possible.
      
      To do this, we add a new function kvmppc_radix_flush_memslot().  Since
      this does what's needed for kvmppc_core_flush_memslot_hv() on a radix
      guest, we now make kvmppc_core_flush_memslot_hv() call the new
      kvmppc_radix_flush_memslot() rather than calling kvm_unmap_radix()
      for each page in the memslot.  This has the effect of fixing a bug in
      that kvmppc_core_flush_memslot_hv() was previously calling
      kvm_unmap_radix() without holding the kvm->mmu_lock spinlock, which
      is required to be held.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Cleanups - constify memslots, fix comments · c43c3a86
      Paul Mackerras committed
      This adds 'const' to the declarations for the struct kvm_memory_slot
      pointer parameters of some functions, which will make it possible to
      call those functions from kvmppc_core_commit_memory_region_hv()
      in the next patch.
      
      This also fixes some comments about locking.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Map single pages when doing dirty page logging · f460f679
      Paul Mackerras committed
      For radix guests, this makes KVM map guest memory as individual pages
      when dirty page logging is enabled for the memslot corresponding to the
      guest real address.  Having a separate partition-scoped PTE for each
      system page mapped to the guest means that we have a separate dirty
      bit for each page, thus making the reported dirty bitmap more accurate.
      Without this, if part of guest memory is backed by transparent huge
      pages, the dirty status is reported at a 2MB granularity rather than
      a 64kB (or 4kB) granularity for that part, causing userspace to have
      to transmit more data when migrating the guest.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Pass change type down to memslot commit function · f032b734
      Bharata B Rao committed
      Currently, kvm_arch_commit_memory_region() gets called with a
      parameter indicating what type of change is being made to the memslot,
      but it doesn't pass it down to the platform-specific memslot commit
      functions.  This adds the `change' parameter to the lower-level
      functions so that they can use it in future.
      
      [paulus@ozlabs.org - fix book E also.]
      Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  7. 14 Dec 2018 (4 commits)
  8. 04 Dec 2018 (1 commit)
  9. 15 Nov 2018 (1 commit)
    • KVM: PPC: Book3S HV: Fix handling for interrupted H_ENTER_NESTED · 6c08ec12
      Michael Roth committed
      While running a nested guest VCPU on L0 via H_ENTER_NESTED hcall, a
      pending signal in the L0 QEMU process can generate the following
      sequence:
      
        ret0 = kvmppc_pseries_do_hcall()
          ret1 = kvmhv_enter_nested_guest()
            ret2 = kvmhv_run_single_vcpu()
            if (ret2 == -EINTR)
              return H_INTERRUPT
          if (ret1 == H_INTERRUPT)
            kvmppc_set_gpr(vcpu, 3, 0)
            return -EINTR
          /* skipped: */
          kvmppc_set_gpr(vcpu, 3, ret)
          vcpu->arch.hcall_needed = 0
          return RESUME_GUEST
      
      which causes an exit to L0 userspace with ret0 == -EINTR.
      
      The intention seems to be to set the hcall return value to 0 (via
      VCPU r3) so that L1 will see a successful return from H_ENTER_NESTED
      once we resume executing the VCPU. However, because we don't set
      vcpu->arch.hcall_needed = 0, we do the following once userspace
      resumes execution via kvm_arch_vcpu_ioctl_run():
      
        ...
        } else if (vcpu->arch.hcall_needed) {
           int i;
      
          kvmppc_set_gpr(vcpu, 3, run->papr_hcall.ret);
          for (i = 0; i < 9; ++i)
                 kvmppc_set_gpr(vcpu, 4 + i, run->papr_hcall.args[i]);
          vcpu->arch.hcall_needed = 0;
      
      since vcpu->arch.hcall_needed == 1 indicates that userspace should
      have handled the hcall and stored the return value in
      run->papr_hcall.ret. Since that's not the case here, we can get an
      unexpected value in VCPU r3, which can result in
      kvmhv_p9_guest_entry() reporting an unexpected trap value when it
      returns from H_ENTER_NESTED, causing the following register dump to
      console via subsequent call to kvmppc_handle_exit_hv() in L1:
      
        [  350.612854] vcpu 00000000f9564cf8 (0):
        [  350.612915] pc  = c00000000013eb98  msr = 8000000000009033  trap = 1
        [  350.613020] r 0 = c0000000004b9044  r16 = 0000000000000000
        [  350.613075] r 1 = c00000007cffba30  r17 = 0000000000000000
        [  350.613120] r 2 = c00000000178c100  r18 = 00007fffc24f3b50
        [  350.613166] r 3 = c00000007ef52480  r19 = 00007fffc24fff58
        [  350.613212] r 4 = 0000000000000000  r20 = 00000a1e96ece9d0
        [  350.613253] r 5 = 70616d00746f6f72  r21 = 00000a1ea117c9b0
        [  350.613295] r 6 = 0000000000000020  r22 = 00000a1ea1184360
        [  350.613338] r 7 = c0000000783be440  r23 = 0000000000000003
        [  350.613380] r 8 = fffffffffffffffc  r24 = 00000a1e96e9e124
        [  350.613423] r 9 = c00000007ef52490  r25 = 00000000000007ff
        [  350.613469] r10 = 0000000000000004  r26 = c00000007eb2f7a0
        [  350.613513] r11 = b0616d0009eccdb2  r27 = c00000007cffbb10
        [  350.613556] r12 = c0000000004b9000  r28 = c00000007d83a2c0
        [  350.613597] r13 = c000000001b00000  r29 = c0000000783cdf68
        [  350.613639] r14 = 0000000000000000  r30 = 0000000000000000
        [  350.613681] r15 = 0000000000000000  r31 = c00000007cffbbf0
        [  350.613723] ctr = c0000000004b9000  lr  = c0000000004b9044
        [  350.613765] srr0 = 0000772f954dd48c srr1 = 800000000280f033
        [  350.613808] sprg0 = 0000000000000000 sprg1 = c000000001b00000
        [  350.613859] sprg2 = 0000772f9565a280 sprg3 = 0000000000000000
        [  350.613911] cr = 88002848  xer = 0000000020040000  dsisr = 42000000
        [  350.613962] dar = 0000772f95390000
        [  350.614031] fault dar = c000000244b278c0 dsisr = 00000000
        [  350.614073] SLB (0 entries):
        [  350.614157] lpcr = 0040000003d40413 sdr1 = 0000000000000000 last_inst = ffffffff
        [  350.614252] trap=0x1 | pc=0xc00000000013eb98 | msr=0x8000000000009033
      
      followed by L1's QEMU reporting the following before stopping execution
      of the nested guest:
      
        KVM: unknown exit, hardware reason 1
        NIP c00000000013eb98   LR c0000000004b9044 CTR c0000000004b9000 XER 0000000020040000 CPU#0
        MSR 8000000000009033 HID0 0000000000000000  HF 8000000000000000 iidx 3 didx 3
        TB 00000000 00000000 DECR 00000000
        GPR00 c0000000004b9044 c00000007cffba30 c00000000178c100 c00000007ef52480
        GPR04 0000000000000000 70616d00746f6f72 0000000000000020 c0000000783be440
        GPR08 fffffffffffffffc c00000007ef52490 0000000000000004 b0616d0009eccdb2
        GPR12 c0000000004b9000 c000000001b00000 0000000000000000 0000000000000000
        GPR16 0000000000000000 0000000000000000 00007fffc24f3b50 00007fffc24fff58
        GPR20 00000a1e96ece9d0 00000a1ea117c9b0 00000a1ea1184360 0000000000000003
        GPR24 00000a1e96e9e124 00000000000007ff c00000007eb2f7a0 c00000007cffbb10
        GPR28 c00000007d83a2c0 c0000000783cdf68 0000000000000000 c00000007cffbbf0
        CR 88002848  [ L  L  -  -  E  L  G  L  ]             RES ffffffffffffffff
         SRR0 0000772f954dd48c  SRR1 800000000280f033    PVR 00000000004e1202 VRSAVE 0000000000000000
        SPRG0 0000000000000000 SPRG1 c000000001b00000  SPRG2 0000772f9565a280  SPRG3 0000000000000000
        SPRG4 0000000000000000 SPRG5 0000000000000000  SPRG6 0000000000000000  SPRG7 0000000000000000
        HSRR0 0000000000000000 HSRR1 0000000000000000
         CFAR 0000000000000000
         LPCR 0000000003d40413
         PTCR 0000000000000000   DAR 0000772f95390000  DSISR 0000000042000000
      
      Fix this by setting vcpu->arch.hcall_needed = 0 to indicate completion
      of H_ENTER_NESTED before we exit to L0 userspace.
      
      Fixes: 360cae31 ("KVM: PPC: Book3S HV: Nested guest entry via hypercall")
      Cc: linuxppc-dev@ozlabs.org
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
      Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  10. 07 Nov 2018 (1 commit)
    • KVM: PPC: Move and undef TRACE_INCLUDE_PATH/FILE · 28c5bcf7
      Scott Wood committed
      TRACE_INCLUDE_PATH and TRACE_INCLUDE_FILE are used by
      <trace/define_trace.h>, so like that #include, they should
      be outside #ifdef protection.
      
      They also need to be #undefed before defining, in case multiple trace
      headers are included by the same C file.  This became the case on
      book3e after commit cf4a6085 ("powerpc/mm: Add missing tracepoint for
      tlbie"), leading to the following build error:
      
         CC      arch/powerpc/kvm/powerpc.o
      In file included from arch/powerpc/kvm/powerpc.c:51:0:
      arch/powerpc/kvm/trace.h:9:0: error: "TRACE_INCLUDE_PATH" redefined
      [-Werror]
        #define TRACE_INCLUDE_PATH .
        ^
      In file included from arch/powerpc/kvm/../mm/mmu_decl.h:25:0,
                        from arch/powerpc/kvm/powerpc.c:48:
      ./arch/powerpc/include/asm/trace.h:224:0: note: this is the location of
      the previous definition
        #define TRACE_INCLUDE_PATH asm
        ^
      cc1: all warnings being treated as errors
      Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
      Signed-off-by: Scott Wood <oss@buserror.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  11. 26 Oct 2018 (1 commit)
  12. 20 Oct 2018 (1 commit)
    • KVM: PPC: Optimize clearing TCEs for sparse tables · 6e301a8e
      Alexey Kardashevskiy committed
      The powernv platform maintains 2 TCE tables for VFIO - a hardware TCE
      table and a table with userspace addresses. These tables are radix trees,
      we allocate indirect levels when they are written to. Since
      the memory allocation is problematic in real mode, we have 2 accessors
      to the entries:
      - for virtual mode: it allocates the memory and it is always expected
      to return non-NULL;
      - for real mode: it does not allocate and can return NULL.
      
      Also, DMA windows can span up to 55 bits of the address space and since
      we never have this much RAM, such windows are sparse. However currently
      the SPAPR TCE IOMMU driver walks through all TCEs to unpin DMA memory.
      
      Since we maintain a userspace address table for VFIO which is a mirror
      of the hardware table, we can use it to know which parts of the DMA
      window have not been mapped and skip those, which is what this patch does.
      
      The bare metal systems do not have this problem as they use a bypass mode
      of a PHB which maps RAM directly.
      
      This helps a lot with sparse DMA windows, reducing the shutdown time from
      about 3 minutes per 1 billion TCEs to a few seconds for a 32GB sparse
      guest. Just skipping the last level seems to be good enough.
      
      As the non-allocating accessor is now used in virtual mode as well, rename
      it from IOMMU_TABLE_USERSPACE_ENTRY_RM (real mode) to _RO (read only).
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  13. 19 Oct 2018 (1 commit)
    • KVM: PPC: Book3S HV: Don't use streamlined entry path on early POWER9 chips · 8d9fcacf
      Paul Mackerras committed
      This disables the use of the streamlined entry path for radix guests
      on early POWER9 chips that need the workaround added in commit
      a25bd72b ("powerpc/mm/radix: Workaround prefetch issue with KVM",
      2017-07-24), because the streamlined entry path does not include
      that workaround.  This also means that we can't do nested HV-KVM
      on those chips.
      
      Since the chips that need that workaround are the same ones that can't
      run both radix and HPT guests at the same time on different threads of
      a core, we use the existing 'no_mixing_hpt_and_radix' variable that
      identifies those chips to identify when we can't use the new guest
      entry path, and when we can't do nested virtualization.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  14. 18 Oct 2018 (1 commit)
    • powerpc: Add -Werror at arch/powerpc level · 23ad1a27
      Michael Ellerman committed
      Back when I added -Werror in commit ba55bd74 ("powerpc: Add
      configurable -Werror for arch/powerpc") I did it by adding it to most
      of the arch Makefiles.
      
      At the time we excluded math-emu, because apparently it didn't build
      cleanly. But that seems to have been fixed somewhere in the interim.
      
      So move the -Werror addition to the top-level of the arch, this saves
      us from repeating it in every Makefile and means we won't forget to
      add it to any new sub-dirs.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  15. 09 Oct 2018 (2 commits)