1. 22 February 2019, 3 commits
    • powerpc/kvm: Save and restore host AMR/IAMR/UAMOR · c3c7470c
      Committed by Michael Ellerman
      When the hash MMU is active the AMR, IAMR and UAMOR are used for
      pkeys. The AMR is directly writable by user space, and the UAMOR masks
      those writes, meaning both registers are effectively user register
      state. The IAMR is used to create an execute only key.
      
      Also we must maintain the value of at least the AMR when running in
      process context, so that any memory accesses done by the kernel on
      behalf of the process are correctly controlled by the AMR.
      
      Although we are correctly switching all registers when going into a
      guest, on returning to the host we just write 0 into all regs, except
      on Power9 where we restore the IAMR correctly.
      
      This could be observed by a user process if it writes the AMR, then
      runs a guest, and we return immediately to it without rescheduling.
      Because we have written 0 to the AMR, that has the effect of granting
      read/write permission to pages that the process was trying to protect.
      
      In addition, when using the Radix MMU the AMR can prevent inadvertent
      kernel access to userspace data; writing 0 to the AMR disables that
      protection.
      
      So save and restore AMR, IAMR and UAMOR.
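
      A minimal sketch of the idea (hedged: the patch itself changes the
      assembly guest entry/exit path, and the names here are illustrative):

      	/* save the host pkey registers before loading guest values */
      	unsigned long host_amr = mfspr(SPRN_AMR);
      	unsigned long host_iamr = mfspr(SPRN_IAMR);
      	unsigned long host_uamor = mfspr(SPRN_UAMOR);

      	/* ... switch to guest values and run the guest ... */

      	/* on exit, restore the host values instead of writing 0 */
      	mtspr(SPRN_AMR, host_amr);
      	mtspr(SPRN_IAMR, host_iamr);
      	mtspr(SPRN_UAMOR, host_uamor);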
      
      Fixes: cf43d3b2 ("powerpc: Enable pkey subsystem")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Russell Currey <ruscur@russell.cc>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Acked-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S: Improve KVM reference counting · 716cb116
      Committed by Alexey Kardashevskiy
      The anon fd's ops release the KVM reference in the release hook.
      However, we take the KVM reference after creating the fd, so there is
      a small window in which the release function could run and dereference
      the KVM object, potentially freeing it.
      
      It is not a problem at the moment, as the file is created and KVM is
      referenced under the KVM lock, and the release function obtains the same
      lock before dereferencing KVM (although the lock is not held when
      calling kvm_put_kvm()), but it is potentially fragile against future changes.
      
      This references the KVM object before creating a file.
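
      A minimal sketch of the reordering (hedged; error handling in the real
      patch may differ):

      	kvm_get_kvm(kvm);		/* take the reference first */
      	fd = anon_inode_getfd("kvm-spapr-tce", &kvm_spapr_tce_fops,
      			      stt, O_RDWR | O_CLOEXEC);
      	if (fd < 0)
      		kvm_put_kvm(kvm);	/* drop it again on failure */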
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Fix build failure without IOMMU support · e40542af
      Committed by Jordan Niethe
      Currently trying to build without IOMMU support will fail:
      
        (.text+0x1380): undefined reference to `kvmppc_h_get_tce'
        (.text+0x1384): undefined reference to `kvmppc_rm_h_put_tce'
        (.text+0x149c): undefined reference to `kvmppc_rm_h_stuff_tce'
        (.text+0x14a0): undefined reference to `kvmppc_rm_h_put_tce_indirect'
      
      This happens because turning off IOMMU support prevents
      book3s_64_vio_hv.c from being built; it is only built when
      SPAPR_TCE_IOMMU is set, which depends on IOMMU support.
      
      Fix it using ifdefs for the undefined references.
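
      The fix is of this shape (an illustrative C fragment; the variable
      names are placeholders and the actual change guards the real-mode
      hcall dispatch):

      	#ifdef CONFIG_SPAPR_TCE_IOMMU
      		ret = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
      	#else
      		ret = H_FUNCTION;	/* hcall not supported in this config */
      	#endif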
      
      Fixes: 76d837a4 ("KVM: PPC: Book3S PR: Don't include SPAPR TCE code on non-pseries platforms")
      Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  2. 21 February 2019, 6 commits
    • powerpc/64s: Better printing of machine check info for guest MCEs · c0577201
      Committed by Paul Mackerras
      This adds an "in_guest" parameter to machine_check_print_event_info()
      so that we can avoid trying to translate guest NIP values into
      symbolic form using the host kernel's symbol table.
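
      The resulting prototype is roughly (hedged):

      	void machine_check_print_event_info(struct machine_check_event *evt,
      					    bool user_mode, bool in_guest);

      with the symbol lookup for the NIP skipped when in_guest is true.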
      Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Simplify machine check handling · 884dfb72
      Committed by Paul Mackerras
      This makes the handling of machine check interrupts that occur inside
      a guest simpler and more robust, with less done in assembler code and
      in real mode.
      
      Now, when a machine check occurs inside a guest, we always get the
      machine check event struct and put a copy in the vcpu struct for the
      vcpu where the machine check occurred.  We no longer call
      machine_check_queue_event() from kvmppc_realmode_mc_power7(), because
      on POWER8, when a vcpu is running on an offline secondary thread and
      we call machine_check_queue_event(), that calls irq_work_queue(),
      which doesn't work because the CPU is offline, but instead triggers
      the WARN_ON(lazy_irq_pending()) in pnv_smp_cpu_kill_self() (which
      fires again and again because nothing clears the condition).
      
      All that machine_check_queue_event() actually does is to cause the
      event to be printed to the console.  For a machine check occurring in
      the guest, we now print the event in kvmppc_handle_exit_hv()
      instead.
      
      The assembly code at label machine_check_realmode now just calls C
      code and then continues exiting the guest.  We no longer either
      synthesize a machine check for the guest in assembly code or return
      to the guest without a machine check.
      
      The code in kvmppc_handle_exit_hv() is extended to handle the case
      where the guest is not FWNMI-capable.  In that case we now always
      synthesize a machine check interrupt for the guest.  Previously, if
      the host thinks it has recovered the machine check fully, it would
      return to the guest without any notification that the machine check
      had occurred.  If the machine check was caused by some action of the
      guest (such as creating duplicate SLB entries), it is much better to
      tell the guest that it has caused a problem.  Therefore we now always
      generate a machine check interrupt for guests that are not
      FWNMI-capable.
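
      The non-FWNMI exit path then ends up doing something like this (a
      hedged sketch; 'flags' stands for the SRR1 bits presented to the
      guest and 'r' for the exit handler's return value):

      	if (!vcpu->kvm->arch.fwnmi_enabled) {
      		/* reflect the event as a machine check interrupt */
      		kvmppc_core_queue_machine_check(vcpu, flags);
      		r = RESUME_GUEST;
      	}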
      Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Context switch AMR on Power9 · d976f680
      Committed by Michael Ellerman
      kvmhv_p9_guest_entry() implements a fast-path guest entry for Power9
      when guest and host are both running with the Radix MMU.
      
      Currently in that path we don't save the host AMR (Authority Mask
      Register) value, and we always restore 0 on return to the host. That
      is OK at the moment because the AMR is not used for storage keys with
      the Radix MMU.
      
      However we plan to start using the AMR on Radix to prevent the kernel
      from reading/writing to userspace outside of copy_to/from_user(). In
      order to make that work we need to save/restore the AMR value.
      
      We only restore the value if it is different from the guest value,
      which is already in the register when we exit to the host. This should
      mean we rarely need to actually restore the value when running a
      modern Linux as a guest, because it will be using the same value as
      us.
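
      A sketch of the logic in kvmhv_p9_guest_entry() (hedged, but it
      follows the description above):

      	unsigned long host_amr = mfspr(SPRN_AMR);

      	mtspr(SPRN_AMR, vcpu->arch.amr);	/* load the guest value */
      	/* ... run the guest ... */

      	/* the guest value is still in the register on exit, so only
      	 * write the SPR if the host value actually differs */
      	if (host_amr != vcpu->arch.amr)
      		mtspr(SPRN_AMR, host_amr);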
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Tested-by: Russell Currey <ruscur@russell.cc>
    • KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start · dee339b5
      Committed by Nir Weiner
      grow_halt_poll_ns() has strange behaviour in the case
      (vcpu->halt_poll_ns != 0) &&
      (vcpu->halt_poll_ns < halt_poll_ns_grow_start).
      
      In this case, vcpu->halt_poll_ns will be multiplied by the grow factor
      (halt_poll_ns_grow), which requires several grow iterations to reach a
      value bigger than halt_poll_ns_grow_start. This means that growing
      vcpu->halt_poll_ns from a value of 0 is slower than growing it from a
      positive value less than halt_poll_ns_grow_start, which is misleading
      and inaccurate.
      
      Fix the issue by changing grow_halt_poll_ns() to set vcpu->halt_poll_ns
      to halt_poll_ns_grow_start whenever
      (vcpu->halt_poll_ns < halt_poll_ns_grow_start), regardless of whether
      vcpu->halt_poll_ns is 0.
      
      Use READ_ONCE() to get a consistent value in all cases.
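
      The resulting logic is approximately (a sketch of the generic code in
      virt/kvm/kvm_main.c):

      	static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
      	{
      		unsigned int old, val, grow, grow_start;

      		old = val = vcpu->halt_poll_ns;
      		grow_start = READ_ONCE(halt_poll_ns_grow_start);
      		grow = READ_ONCE(halt_poll_ns_grow);
      		if (!grow)
      			goto out;

      		val *= grow;
      		if (val < grow_start)
      			val = grow_start;	/* never start below the floor */

      		vcpu->halt_poll_ns = val;
      	out:
      		trace_kvm_halt_poll_ns_grow(vcpu->vcpu_id, val, old);
      	}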
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter · 49113d36
      Committed by Nir Weiner
      The hard-coded value 10000 in grow_halt_poll_ns() is the initial start
      value used when raising vcpu->halt_poll_ns; it effectively sets the
      timeout of the first polling session.
      This value has a significant effect on how tolerant we are of outliers.
      In the common case a higher value is better: we spend more time in the
      polling busy-loop, handle events/interrupts faster, and get better
      performance.
      But on outliers it puts us in a busy loop that does nothing; even if
      the shrink factor is zero, we still waste time on the first iteration.
      The optimal value varies between workloads, depending on the outlier
      rate and the length of polling sessions.
      As this value has a significant effect on the dynamic halt-polling
      algorithm, it should be configurable and exposed.
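
      The parameter definition amounts to (hedged sketch of the generic code):

      	/* default start value for halt polling growth: 10 us */
      	static unsigned int halt_poll_ns_grow_start = 10000;
      	module_param(halt_poll_ns_grow_start, uint, 0644);
      	EXPORT_SYMBOL_GPL(halt_poll_ns_grow_start);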
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns · 7fa08e71
      Committed by Nir Weiner
      grow_halt_poll_ns() has strange behaviour in the case
      (halt_poll_ns_grow == 0) && (vcpu->halt_poll_ns != 0).
      
      In this case, vcpu->halt_poll_ns will be set to zero, which results in
      shrinking instead of growing.
      
      Fix the issue by changing grow_halt_poll_ns() not to modify
      vcpu->halt_poll_ns when halt_poll_ns_grow is zero.
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
      Suggested-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 19 February 2019, 7 commits
    • KVM: PPC: Book3S HV: Add KVM stat largepages_[2M/1G] · 8f1f7b9b
      Committed by Suraj Jitindar Singh
      This adds an entry to the kvm_stats_debugfs directory which reports the
      number of large (2M or 1G) pages that have been used to set up the
      guest mappings, for radix guests.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Release all hardware TCE tables attached to a group · a67614cc
      Committed by Alexey Kardashevskiy
      The SPAPR TCE KVM device references all hardware IOMMU tables assigned to
      some IOMMU group to ensure that in-kernel KVM acceleration of H_PUT_TCE
      can work. The tables are referenced when an IOMMU group gets registered
      with the VFIO KVM device by the KVM_DEV_VFIO_GROUP_ADD ioctl;
      KVM_DEV_VFIO_GROUP_DEL calls into the dereferencing code
      in kvm_spapr_tce_release_iommu_group(), which walks through the list of
      LIOBNs, finds a matching IOMMU table and calls kref_put() when found.
      
      However that code stops after the very first successful dereferencing,
      leaving the other tables referenced until the SPAPR TCE KVM device is
      destroyed, which normally happens on guest reboot or termination; so if
      we do hotplug and unplug in a loop, we leak IOMMU tables.
      
      This removes a premature return to let kvm_spapr_tce_release_iommu_group()
      find and dereference all attached tables.
      
      Fixes: 121f80ba ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Optimise mmio emulation for devices on FAST_MMIO_BUS · 1b642257
      Committed by Suraj Jitindar Singh
      Devices on the KVM_FAST_MMIO_BUS by definition have length zero and are
      thus used for notification purposes rather than data transfer, for
      example eventfds for virtio devices.
      
      This means that when emulating mmio instructions which target devices on
      this bus we can immediately handle them and return without needing to load
      the instruction from guest memory.
      
      For now we restrict this to stores as this is the only use case at
      present.
      
      For a normal guest the effect is negligible, however for a nested guest
      we save on the order of 5us per access.
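
      The short-circuit looks roughly like this (hedged sketch):

      	/* try the zero-length fast-mmio bus before fetching and
      	 * emulating the instruction */
      	idx = srcu_read_lock(&vcpu->kvm->srcu);
      	ret = kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, (gpa_t)gpa, 0, NULL);
      	srcu_read_unlock(&vcpu->kvm->srcu, idx);
      	if (!ret) {
      		kvmppc_set_pc(vcpu, kvmppc_get_pc(vcpu) + 4);
      		return RESUME_GUEST;
      	}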
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S: Allow XICS emulation to work in nested hosts using XIVE · 03f95332
      Committed by Paul Mackerras
      Currently, the KVM code assumes that if the host kernel is using the
      XIVE interrupt controller (the new interrupt controller that first
      appeared in POWER9 systems), then the in-kernel XICS emulation will
      use the XIVE hardware to deliver interrupts to the guest.  However,
      this only works when the host is running in hypervisor mode and has
      full access to all of the XIVE functionality.  It doesn't work in any
      nested virtualization scenario, either with PR KVM or nested-HV KVM,
      because the XICS-on-XIVE code calls directly into the native-XIVE
      routines, which are not initialized and cannot function correctly
      because they use OPAL calls, and OPAL is not available in a guest.
      
      This means that using the in-kernel XICS emulation in a nested
      hypervisor that is using XIVE as its interrupt controller will cause a
      (nested) host kernel crash.  To fix this, we change most of the places
      where the current code calls xive_enabled() to select between the
      XICS-on-XIVE emulation and the plain XICS emulation to call a new
      function, xics_on_xive(), which returns false in a guest.
      
      However, there is a further twist.  The plain XICS emulation has some
      functions which are used in real mode and access the underlying XICS
      controller (the interrupt controller of the host) directly.  In the
      case of a nested hypervisor, this means doing XICS hypercalls
      directly.  When the nested host is using XIVE as its interrupt
      controller, these hypercalls will fail.  Therefore this also adds
      checks in the places where the XICS emulation wants to access the
      underlying interrupt controller directly, and if that is XIVE, makes
      the code use the virtual mode fallback paths, which call generic
      kernel infrastructure rather than doing direct XICS access.
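
      The new helper effectively reads:

      	static inline bool xics_on_xive(void)
      	{
      		/* XICS-on-XIVE needs hypervisor-level access to the XIVE
      		 * hardware, which a nested host does not have */
      		return xive_enabled() && cpu_has_feature(CPU_FTR_HVMODE);
      	}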
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Remove -I. header search paths · f1adb9c4
      Committed by Masahiro Yamada
      The header search path -I. in kernel Makefiles is very suspicious;
      it allows the compiler to search for headers in the top of $(srctree),
      where obviously no header file exists.
      
      Commit 46f43c6e ("KVM: powerpc: convert marker probes to event
      trace") first added these options, but they are completely useless.
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Replace kmalloc_node+memset with kzalloc_node · 08434ab4
      Committed by wangbo
      Replace kmalloc_node and memset with kzalloc_node.
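
      The change has this shape:

      	/* before */
      	ptr = kmalloc_node(size, GFP_KERNEL, node);
      	memset(ptr, 0, size);

      	/* after: allocate and zero in one call */
      	ptr = kzalloc_node(size, GFP_KERNEL, node);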
      Signed-off-by: wangbo <wang.bo116@zte.com.cn>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S PR: Add emulation for slbfee. instruction · 41a8645a
      Committed by Paul Mackerras
      Recent kernels, since commit e15a4fea ("powerpc/64s/hash: Add
      some SLB debugging tests", 2018-10-03) use the slbfee. instruction,
      which PR KVM currently does not have code to emulate.  Consequently
      recent kernels fail to boot under PR KVM.  This adds emulation of
      slbfee., enabling these kernels to boot successfully.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  4. 04 January 2019, 1 commit
    • Remove 'type' argument from access_ok() function · 96d4f267
      Committed by Linus Torvalds
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted in this patch, because there is no way we're going to
      move the old VERIFY_xyz interface to that model.  And it's best done at
      the end of the merge window when I've done most of my merges, so let's
      just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
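
      i.e. conversions of this shape:

      	/* before */
      	if (!access_ok(VERIFY_WRITE, buf, len))
      		return -EFAULT;

      	/* after: the type argument is gone */
      	if (!access_ok(buf, len))
      		return -EFAULT;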
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 30 December 2018, 1 commit
    • KVM: PPC: Book3S HV: radix: Fix uninitialized var build error · f4607722
      Committed by Michael Ellerman
      Old GCCs (4.6.3 at least) aren't able to follow the logic in
      __kvmhv_copy_tofrom_guest_radix() and warn that old_pid is used
      uninitialized:
      
        arch/powerpc/kvm/book3s_64_mmu_radix.c:75:3: error: 'old_pid' may be
        used uninitialized in this function
      
      The logic is OK; we only use old_pid if quadrant == 1, and in that
      case it has definitely been initialised, e.g.:
      
      	if (quadrant == 1) {
      		old_pid = mfspr(SPRN_PID);
      	...
      	if (quadrant == 1 && pid != old_pid)
      		mtspr(SPRN_PID, old_pid);
      
      Annotate it to fix the error.
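
      One way to silence such a warning (hedged; the patch may use a
      different annotation) is an explicit dummy initialisation:

      	unsigned long old_pid = 0;	/* only read when quadrant == 1 */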
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  6. 21 December 2018, 8 commits
    • treewide: surround Kconfig file paths with double quotes · 8636a1f9
      Committed by Masahiro Yamada
      The Kconfig lexer supports special characters such as '.' and '/' in
      the parameter context. In my understanding, the reason is just to
      support bare file paths in the source statement.
      
      I do not see a good reason to complicate Kconfig just to allow this
      ambiguity.
      
      The majority of code already surrounds file paths with double quotes,
      and it makes sense since file paths are constant string literals.
      
      Make it treewide consistent now.
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Wolfram Sang <wsa@the-dreams.de>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Ingo Molnar <mingo@kernel.org>
    • KVM: Make kvm_set_spte_hva() return int · 748c0e31
      Committed by Lan Tianyu
      This makes kvm_set_spte_hva() return int so that the caller can check
      the return value to determine whether to flush the TLB.
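
      The signature change is simply:

      	/* before */
      	void kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);

      	/* after: a non-zero return tells the caller to flush the TLB */
      	int kvm_set_spte_hva(struct kvm *kvm, unsigned long hva, pte_t pte);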
      Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
      Acked-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • powerpc/vfio/iommu/kvm: Do not pin device memory · c10c21ef
      Committed by Alexey Kardashevskiy
      This new memory does not have page structs as it is not plugged into
      the host, so gup() will fail anyway.
      
      This adds 2 helpers:
      - mm_iommu_newdev() to preregister the "memory device" memory so
      the rest of the API can still be used;
      - mm_iommu_is_devmem() to know if a physical address is in one of these
      new regions, which we must avoid unpinning.
      
      This adds @mm to tce_page_is_contained() and iommu_tce_xchg() to test
      if the memory is device memory to avoid pfn_to_page().
      
      This adds a check for device memory in mm_iommu_ua_mark_dirty_rm() which
      does delayed pages dirtying.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Keep rc bits in shadow pgtable in sync with host · ae59a7e1
      Committed by Suraj Jitindar Singh
      The rc bits contained in ptes are used to track whether a page has been
      accessed and whether it is dirty. The accessed bit is used to age a page
      and the dirty bit to track whether a page is dirty or not.
      
      Now that we support nested guests there are three ptes which track the
      state of the same page:
      - The partition-scoped page table in the L1 guest, mapping L2->L1 address
      - The partition-scoped page table in the host for the L1 guest, mapping
        L1->L0 address
      - The shadow partition-scoped page table for the nested guest in the host,
        mapping L2->L0 address
      
      The idea is to attempt to keep the rc state of these three ptes in sync,
      both when setting and when clearing rc bits.
      
      When setting the bits we achieve consistency by:
      - Initially setting the bits in the shadow page table as the 'and' of the
        other two.
      - When updating in software the rc bits in the shadow page table we
        ensure the state is consistent with the other two locations first, and
        update these before reflecting the change into the shadow page table.
        i.e. only set the bits in the L2->L0 pte if also set in both the
             L2->L1 and the L1->L0 pte.
      
      When clearing the bits we achieve consistency by:
      - The rc bits in the shadow page table are only cleared when discarding
        a pte, and we don't need to record this as if either bit is set then
        it must also be set in the pte mapping L1->L0.
      - When L1 clears an rc bit in the L2->L1 mapping it __should__ issue a
        tlbie instruction
        - This means we will discard the pte from the shadow page table
          meaning the mapping will have to be setup again.
        - When setup the pte again in the shadow page table we will ensure
          consistency with the L2->L1 pte.
      - When the host clears an rc bit in the L1->L0 mapping we need to also
        clear the bit in any ptes in the shadow page table which map the same
        gfn so we will be notified if a nested guest accesses the page.
        This case is what this patch specifically concerns.
        - We can search the nest_rmap list for that given gfn and clear the
          same bit from all corresponding ptes in shadow page tables.
        - If a nested guest causes either of the rc bits to be set by software
          in future then we will update the L1->L0 pte and maintain consistency.
      
      With the process outlined above we aim to maintain consistency of the 3
      pte locations where we track rc for a given guest page.
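
      In code terms, setting bits in the shadow pte reduces to an 'and' (an
      illustrative fragment; the variable names are hypothetical):

      	/* an rc bit may appear in the shadow (L2->L0) pte only if it is
      	 * set in both the L2->L1 pte and the L1->L0 pte */
      	unsigned long rc = (l2_l1_pte & l1_l0_pte) &
      			   (_PAGE_ACCESSED | _PAGE_DIRTY);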
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Introduce kvmhv_update_nest_rmap_rc_list() · 90165d3d
      Committed by Suraj Jitindar Singh
      Introduce a function kvmhv_update_nest_rmap_rc_list() which for a given
      nest_rmap list will traverse it, find the corresponding pte in the shadow
      page tables, and if it still maps the same host page update the rc bits
      accordingly.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Apply combination of host and l1 pte rc for nested guest · 8b23eee4
      Committed by Suraj Jitindar Singh
      The shadow page table contains ptes for translations from nested guest
      address to host address. Currently when creating these ptes we take the
      rc bits from the pte for the L1 guest address to host address
      translation. This is incorrect as we must also factor in the rc bits
      from the pte for the nested guest address to L1 guest address
      translation (as contained in the L1 guest partition table for the nested
      guest).
      
      By not calculating these bits correctly L1 may not have been correctly
      notified when it needed to update its rc bits in the partition table it
      maintains for its nested guest.
      
      Modify the code so that the rc bits in the resultant pte for the L2->L0
      translation are the 'and' of the rc bits in the L2->L1 pte and the L1->L0
      pte, also accounting for whether this was a write access when setting
      the dirty bit.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Align gfn to L1 page size when inserting nest-rmap entry · 8400f874
      Committed by Suraj Jitindar Singh
      Nested rmap entries are used to store the translation from L1 gpa to L2
      gpa when entries are inserted into the shadow (nested) page tables. This
      rmap list is located by indexing the rmap array in the memslot by L1
      gfn. When we come to search for these entries we only know the L1 page
      size (which could be PAGE_SIZE, 2M or 1G) and so can only compute a gfn
      aligned to that size. This means that when we insert an entry, we must
      align the gfn used to select the rmap list to the L1 page size as well,
      so that the entry can be found later.
      
      By not doing this we were missing nested rmap entries when modifying L1
      ptes which were for a page also passed through to an L2 guest.
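
      The alignment itself is simple (illustrative; the names are
      hypothetical):

      	/* mask the gfn down to the start of the containing L1 page */
      	gfn &= ~((l1_page_size >> PAGE_SHIFT) - 1);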
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Hold kvm->mmu_lock across updating nested pte rc bits · bec6e03b
      Committed by Suraj Jitindar Singh
      We already hold the kvm->mmu_lock spin lock across updating the rc bits
      in the pte for the L1 guest. Continue to hold the lock across updating
      the rc bits in the pte for the nested guest as well to prevent
      invalidations from occurring.
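
      Schematically (a hedged sketch):

      	spin_lock(&kvm->mmu_lock);
      	/* update rc bits in the L1 pte ... */
      	/* ... and in the nested (shadow) ptes, under the same lock, so
      	 * the ptes cannot be invalidated in between */
      	spin_unlock(&kvm->mmu_lock);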
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  7. 20 December 2018, 2 commits
  8. 17 December 2018, 12 commits
    • KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L3 guest · 95d386c2
      Committed by Suraj Jitindar Singh
      Previously when a device was being emulated by an L1 guest for an L2
      guest, that device couldn't then be passed through to an L3 guest. This
      was because the L1 guest had no method for accessing L3 memory.
      
      The hcall H_COPY_TOFROM_GUEST provides this access. Thus this setup for
      passthrough can now be allowed.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S: Introduce new hcall H_COPY_TOFROM_GUEST to access quadrants 1 & 2 · 6ff887b8
      Committed by Suraj Jitindar Singh
      A guest cannot access quadrants 1 or 2 as this would result in an
      exception. Thus introduce the hcall H_COPY_TOFROM_GUEST to be used by a
      guest when it wants to perform an access to quadrants 1 or 2, for
      example when it wants to access memory for one of its nested guests.
      
      Also provide an implementation for the kvm-hv module.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L2 guest · 873db2cd
      Committed by Suraj Jitindar Singh
      Allow for a device which is being emulated at L0 (the host) for an L1
      guest to be passed through to a nested (L2) guest.
      
      The existing kvmppc_hv_emulate_mmio function can be used here. The main
      challenge is that for a load the result must be stored into the L2 gpr,
      not an L1 gpr as would normally be the case after going out to qemu to
      complete the operation. This presents a challenge as at this point the
      L2 gpr state has been written back into L1 memory.
      
      To work around this we store the address in L1 memory of the L2 gpr
      where the result of the load is to be stored and use the new io_gpr
      value KVM_MMIO_REG_NESTED_GPR to indicate that this is a nested load for
      which completion must be done when returning back into the kernel. Then
      in kvmppc_complete_mmio_load() the resultant value is written into L1
      memory at the location of the indicated L2 gpr.
      
      Note that we don't currently let an L1 guest emulate a device for an L2
      guest which is then passed through to an L3 guest.
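
      The completion path might look roughly like this (hedged; the field
      name nested_io_gpr is taken from the description above):

      	case KVM_MMIO_REG_NESTED_GPR:
      		/* write the loaded value into L1 memory at the saved
      		 * address of the L2 gpr, not into an L1 gpr */
      		kvm_vcpu_write_guest(vcpu, vcpu->arch.nested_io_gpr,
      				     &gpr, sizeof(gpr));
      		break;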
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Update kvmppc_st and kvmppc_ld to use quadrants · cc6929cc
      Committed by Suraj Jitindar Singh
      The functions kvmppc_st and kvmppc_ld are used to access guest memory
      from the host using a guest effective address. They do so by translating
      through the process table to obtain a guest real address and then using
      kvm_read_guest or kvm_write_guest to make the access with the guest real
      address.
      
      This method of access however only works for L1 guests and will give the
      incorrect results for a nested guest.
      
      We can however use the store_to_eaddr and load_from_eaddr kvmppc_ops to
      perform the access for a nested guest (and for an L1 guest). So attempt
      this method first, and fall back to the old method if it fails and we
      aren't running a nested guest.
      
      At this stage there is no fall back method to perform the access for a
      nested guest and this is left as a future improvement. For now we will
      return to the nested guest and rely on the fact that a translation
      should be faulted in before retrying the access.
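
      The access logic becomes roughly (hedged sketch):

      	int r = EMULATE_FAIL;

      	if (vcpu->kvm->arch.kvm_ops->store_to_eaddr)
      		r = vcpu->kvm->arch.kvm_ops->store_to_eaddr(vcpu, &eaddr,
      							    ptr, size);
      	if (r == EMULATE_DONE)
      		return r;
      	/* otherwise fall back to translating through the process table,
      	 * which is only correct for an L1 guest */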
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Add load_from_eaddr and store_to_eaddr to the kvmppc_ops struct · dceadcf9
      Committed by Suraj Jitindar Singh
      The kvmppc_ops struct is used to store function pointers to kvm
      implementation specific functions.
      
      Introduce two new functions load_from_eaddr and store_to_eaddr to be
      used to load from and store to a guest effective address respectively.
      
      Also implement these for the kvm-hv module. If we are using the radix
      mmu then we can call the functions to access quadrant 1 and 2.
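
      Plausible prototypes for the two hooks (hedged; see the patch for the
      exact signatures):

      	int (*load_from_eaddr)(struct kvm_vcpu *vcpu, ulong *eaddr,
      			       void *ptr, int size);
      	int (*store_to_eaddr)(struct kvm_vcpu *vcpu, ulong *eaddr,
      			      void *ptr, int size);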
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Implement functions to access quadrants 1 & 2 · d7b45615
      Committed by Suraj Jitindar Singh
      The POWER9 radix mmu has the concept of quadrants. The quadrant number
      is the two high bits of the effective address and determines the fully
      qualified address to be used for the translation. The fully qualified
      address consists of the effective lpid, the effective pid and the
      effective address. This gives then 4 possible quadrants 0, 1, 2, and 3.
      
      When accessing these quadrants the fully qualified address is obtained
      as follows:
      
      Quadrant		| Hypervisor		| Guest
      --------------------------------------------------------------------------
      			| EA[0:1] = 0b00	| EA[0:1] = 0b00
      0			| effLPID = 0		| effLPID = LPIDR
      			| effPID  = PIDR	| effPID  = PIDR
      --------------------------------------------------------------------------
      			| EA[0:1] = 0b01	|
      1			| effLPID = LPIDR	| Invalid Access
      			| effPID  = PIDR	|
      --------------------------------------------------------------------------
      			| EA[0:1] = 0b10	|
      2			| effLPID = LPIDR	| Invalid Access
      			| effPID  = 0		|
      --------------------------------------------------------------------------
      			| EA[0:1] = 0b11	| EA[0:1] = 0b11
      3			| effLPID = 0		| effLPID = LPIDR
      			| effPID  = 0		| effPID  = 0
      --------------------------------------------------------------------------
      
      In the guest:
      Quadrant 3 is normally used to address the operating system, since this
      uses effPID=0 and effLPID=LPIDR, meaning the PID register doesn't need
      to be switched.
      Quadrant 0 is normally used to address user space, since the effLPID
      and effPID are taken from the corresponding registers.
      
      In the host:
      Quadrants 0 and 3 are used as above, except that the effLPID is always
      0 to address the host.
      
      Quadrants 1 and 2 can be used by the host to address guest memory using
      a guest effective address. Since the effLPID comes from the LPID register,
      the host loads the LPID of the guest it would like to access (and the
      PID of the process) and can perform accesses to a guest effective
      address.
      
      This means quadrant 1 can be used to address the guest user space and
      quadrant 2 can be used to address the guest operating system from the
      hypervisor, using a guest effective address.
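
      For example, forming a quadrant-1 effective address just means setting
      EA[0:1] = 0b01 (an illustrative fragment):

      	/* replace the top two EA bits with 0b01 to select quadrant 1 */
      	unsigned long q1_ea = (ea & ~(3UL << 62)) | (1UL << 62);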
      
      Access to the quadrants can cause a Hypervisor Data Storage Interrupt
      (HDSI) due to being unable to perform partition scoped translation.
      Previously this could only be generated from a guest and so the code
      path expects us to take the KVM trampoline in the interrupt handler.
      This is no longer the case so we modify the handler to call
      bad_page_fault() to check if we were expecting this fault so we can
      handle it gracefully and just return with an error code. In the hash mmu
      case we still raise an unknown exception since quadrants aren't defined
      for the hash mmu.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Add function kvmhv_vcpu_is_radix() · d232afeb
      Committed by Suraj Jitindar Singh
      There exists a function kvm_is_radix() which is used to determine if a
      kvm instance is using the radix mmu. However this only applies to the
      first level (L1) guest. Add a function kvmhv_vcpu_is_radix() which can
      be used to determine if the current execution context of the vcpu is
      radix, accounting for if the vcpu is running a nested guest.
      
      Currently all nested guests must be radix but this may change in the
      future.
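
      The helper amounts to (hedged, but it follows the description):

      	static inline bool kvmhv_vcpu_is_radix(struct kvm_vcpu *vcpu)
      	{
      		if (vcpu->arch.nested)
      			return vcpu->arch.nested->radix;
      		return kvm_is_radix(vcpu->kvm);
      	}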
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S: Only report KVM_CAP_SPAPR_TCE_VFIO on powernv machines · 693ac10a
      Committed by Suraj Jitindar Singh
      The kvm capability KVM_CAP_SPAPR_TCE_VFIO is used to indicate the
      availability of in-kernel TCE acceleration for VFIO. However, this is
      currently only available on a powernv machine, not on a pseries
      machine.
      
      Thus make this capability dependent on having the cpu feature
      CPU_FTR_HVMODE.
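
      The capability check then becomes (hedged sketch):

      	case KVM_CAP_SPAPR_TCE_VFIO:
      		r = !!cpu_has_feature(CPU_FTR_HVMODE);
      		break;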
      
      [paulus@ozlabs.org - fixed compilation for Book E.]
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Flush guest mappings when turning dirty tracking on/off · 5af3e9d0
      Committed by Paul Mackerras
      This adds code to flush the partition-scoped page tables for a radix
      guest when dirty tracking is turned on or off for a memslot.  Only the
      guest real addresses covered by the memslot are flushed.  The reason
      for this is to get rid of any 2M PTEs in the partition-scoped page
      tables that correspond to host transparent huge pages, so that page
      dirtiness is tracked at a system page (4k or 64k) granularity rather
      than a 2M granularity.  The page tables are also flushed when turning
      dirty tracking off so that the memslot's address space can be
      repopulated with THPs if possible.
      
      To do this, we add a new function kvmppc_radix_flush_memslot().  Since
      this does what's needed for kvmppc_core_flush_memslot_hv() on a radix
      guest, we now make kvmppc_core_flush_memslot_hv() call the new
      kvmppc_radix_flush_memslot() rather than calling kvm_unmap_radix()
      for each page in the memslot.  This has the effect of fixing a bug in
      that kvmppc_core_flush_memslot_hv() was previously calling
      kvm_unmap_radix() without holding the kvm->mmu_lock spinlock, which
      is required to be held.
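
      A plausible shape for the new function (hedged):

      	/* flush the partition-scoped PTEs covering a memslot's range */
      	void kvmppc_radix_flush_memslot(struct kvm *kvm,
      					const struct kvm_memory_slot *memslot);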
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Cleanups - constify memslots, fix comments · c43c3a86
      Committed by Paul Mackerras
      This adds 'const' to the declarations for the struct kvm_memory_slot
      pointer parameters of some functions, which will make it possible to
      call those functions from kvmppc_core_commit_memory_region_hv()
      in the next patch.
      
      This also fixes some comments about locking.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Map single pages when doing dirty page logging · f460f679
      Committed by Paul Mackerras
      For radix guests, this makes KVM map guest memory as individual pages
      when dirty page logging is enabled for the memslot corresponding to the
      guest real address.  Having a separate partition-scoped PTE for each
      system page mapped to the guest means that we have a separate dirty
      bit for each page, thus making the reported dirty bitmap more accurate.
      Without this, if part of guest memory is backed by transparent huge
      pages, the dirty status is reported at a 2MB granularity rather than
      a 64kB (or 4kB) granularity for that part, causing userspace to have
      to transmit more data when migrating the guest.
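
      Conceptually, the page fault path does something like this when
      choosing the mapping size (a hedged sketch; the real patch adjusts the
      chosen page-table level, and 'use_large_page' is hypothetical):

      	/* with dirty logging on, map one system page at a time so each
      	 * page gets its own dirty bit */
      	if (memslot->flags & KVM_MEM_LOG_DIRTY_PAGES)
      		use_large_page = false;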
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Pass change type down to memslot commit function · f032b734
      Committed by Bharata B Rao
      Currently, kvm_arch_commit_memory_region() gets called with a
      parameter indicating what type of change is being made to the memslot,
      but it doesn't pass it down to the platform-specific memslot commit
      functions.  This adds the `change' parameter to the lower-level
      functions so that they can use it in future.
      
      [paulus@ozlabs.org - fix book E also.]
      Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>