1. 31 Jan 2017, 3 commits
  2. 27 Sep 2016, 1 commit
    • KVM: PPC: Book3S: Treat VTB as a per-subcore register, not per-thread · 88b02cf9
      Authored by Paul Mackerras
      POWER8 has one virtual timebase (VTB) register per subcore, not one
      per CPU thread.  The HV KVM code currently treats VTB as a per-thread
      register, which can lead to spurious soft lockup messages from guests
      which use the VTB as the time source for the soft lockup detector.
      (CPUs before POWER8 did not have the VTB register.)
      
      For HV KVM, this fixes the problem by making only the primary thread
      in each virtual core save and restore the VTB value.  With this,
      the VTB state becomes part of the kvmppc_vcore structure.  This
      also means that "piggybacking" of multiple virtual cores onto one
      subcore is not possible on POWER8, because then the virtual cores
      would share a single VTB register.
      
      PR KVM emulates a VTB register, which is per-vcpu because PR KVM
      has no notion of CPU threads or SMT.  For PR KVM we move the VTB
      state into the kvmppc_vcpu_book3s struct.
      
      Cc: stable@vger.kernel.org # v3.14+
      Reported-by: Thomas Huth <thuth@redhat.com>
      Tested-by: Thomas Huth <thuth@redhat.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      88b02cf9
  3. 12 Sep 2016, 1 commit
  4. 08 Sep 2016, 3 commits
    • KVM: PPC: Book3S HV: Implement halt polling · 0cda69dd
      Authored by Suraj Jitindar Singh
      This patch introduces new halt polling functionality into the kvm_hv kernel
      module. When a vcore is idle it will poll for some period of time before
      scheduling itself out.
      
      When all of the runnable vcpus on a vcore have ceded (and thus the vcore is
      idle) we schedule ourselves out to allow something else to run. In the
      event that we need to wake up very quickly (for example an interrupt
      arrives), we are required to wait until we get scheduled again.
      
      Implement halt polling so that when a vcore is idle, and before scheduling
      ourselves, we poll for vcpus in the runnable_threads list which have
      pending exceptions or which leave the ceded state. If we poll successfully
      then we can get back into the guest very quickly without ever scheduling
      ourselves, otherwise we schedule ourselves out as before.
      
      Generic halt-polling code already exists in virt/kvm_main.c; however, on
      powerpc the polling conditions differ from the generic case. It would
      be nice if we could just implement an arch-specific kvm_check_block()
      function, but there are still other arch-specific things which need to be
      done for kvm_hv (for example manipulating vcore states), which means that a
      separate implementation is the best option.
      
      Testing of this patch with a TCP round-robin test between two guests with
      virtio network interfaces has found a decrease in round-trip time of ~15us
      on average. A performance gain is only seen when going out of and
      back into the guest often and quickly; otherwise there is no net benefit
      from the polling. The polling interval is adjusted such that when we are
      often scheduled out for long periods of time it is reduced, and when we
      often poll successfully it is increased. The rate at which the polling
      interval increases or decreases, and the maximum polling interval, can
      be set through module parameters (a standalone sketch of this grow/shrink
      adjustment follows this entry).

      Based on the implementation in the generic kvm module by Wanpeng Li and
      Paolo Bonzini, and on direction from Paul Mackerras.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      0cda69dd
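      A minimal standalone sketch of the grow/shrink adjustment described above.
      The parameter names (halt_poll_max_ns, halt_poll_ns_grow, halt_poll_ns_shrink)
      and the numeric defaults are assumptions made for illustration; they are not
      copied from the kvm_hv module.

        #include <stdint.h>
        #include <stdio.h>

        /* Assumed tunables standing in for the module parameters. */
        static uint64_t halt_poll_max_ns    = 10000; /* cap on the polling window  */
        static uint64_t halt_poll_ns_grow   = 2;     /* multiplier after a hit     */
        static uint64_t halt_poll_ns_shrink = 2;     /* divisor after a long sleep */

        /* Widen the vcore's polling window after a poll found runnable work. */
        static uint64_t grow_poll_ns(uint64_t cur)
        {
            uint64_t val = cur ? cur * halt_poll_ns_grow : 1000;
            return val > halt_poll_max_ns ? halt_poll_max_ns : val;
        }

        /* Narrow it again after the vcore ended up sleeping for a long time. */
        static uint64_t shrink_poll_ns(uint64_t cur)
        {
            return halt_poll_ns_shrink ? cur / halt_poll_ns_shrink : 0;
        }

        int main(void)
        {
            uint64_t poll_ns = 0;

            poll_ns = grow_poll_ns(poll_ns);    /* poll paid off: widen  */
            poll_ns = grow_poll_ns(poll_ns);
            poll_ns = shrink_poll_ns(poll_ns);  /* long sleep: narrow    */
            printf("polling window: %llu ns\n", (unsigned long long)poll_ns);
            return 0;
        }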
    • KVM: PPC: Book3S HV: Change vcore element runnable_threads from linked-list to array · 7b5f8272
      Authored by Suraj Jitindar Singh
      The struct kvmppc_vcore is a structure used to store various information
      about a virtual core for a kvm guest. The runnable_threads element of the
      struct provides a list of all of the currently runnable vcpus on the core
      (those in the KVMPPC_VCPU_RUNNABLE state). The previous implementation of
      this list was a linked list. The next patch requires that the list be able
      to be iterated over without holding the vcore lock.

      Reimplement the runnable_threads list in the kvmppc_vcore struct as an
      array. Implement a function to iterate over the valid entries in the array
      and update all access sites accordingly (see the sketch after this entry).
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      7b5f8272
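      A standalone model of the array-based iteration: valid entries are found by
      skipping NULL slots, so no list linkage or vcore lock is needed just to walk
      the runnable vcpus. MAX_SMT_THREADS comes from the description above; the
      helper and field names are illustrative assumptions.

        #include <stdio.h>

        #define MAX_SMT_THREADS 8

        struct vcpu { int id; };                 /* stand-in for struct kvm_vcpu */

        struct vcore {
            struct vcpu *runnable_threads[MAX_SMT_THREADS];
        };

        /* Return the next non-NULL entry at or after *ip, or NULL when done. */
        static struct vcpu *next_runnable_thread(struct vcore *vc, int *ip)
        {
            for (int i = *ip; i < MAX_SMT_THREADS; i++) {
                if (vc->runnable_threads[i]) {
                    *ip = i;
                    return vc->runnable_threads[i];
                }
            }
            return NULL;
        }

        #define for_each_runnable_thread(i, vcpu, vc) \
            for ((i) = 0; ((vcpu) = next_runnable_thread((vc), &(i))) != NULL; (i)++)

        int main(void)
        {
            struct vcpu a = { 0 }, b = { 5 };
            struct vcore vc = { { &a, NULL, NULL, NULL, NULL, &b, NULL, NULL } };
            struct vcpu *vcpu;
            int i;

            for_each_runnable_thread(i, vcpu, &vc)
                printf("runnable vcpu %d in slot %d\n", vcpu->id, i);
            return 0;
        }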
    • KVM: PPC: Book3S HV: Move struct kvmppc_vcore from kvm_host.h to kvm_book3s.h · e64fb7e2
      Authored by Suraj Jitindar Singh
      The next commit will introduce a member to the kvmppc_vcore struct which
      references MAX_SMT_THREADS, which is defined in kvm_book3s_asm.h; however,
      that file isn't included by kvm_host.h directly. Thus compiling for
      certain platforms such as pmac32_defconfig and ppc64e_defconfig with KVM
      fails because MAX_SMT_THREADS is not defined.
      
      Move the struct kvmppc_vcore definition to kvm_book3s.h which explicitly
      includes kvm_book3s_asm.h.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      e64fb7e2
  5. 16 Jan 2016, 1 commit
    • kvm: rename pfn_t to kvm_pfn_t · ba049e93
      Authored by Dan Williams
      To date, we have implemented two I/O usage models for persistent memory,
      PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
      userspace).  This series adds a third, DAX-GUP, that allows DAX mappings
      to be the target of direct-i/o.  It allows userspace to coordinate
      DMA/RDMA from/to persistent memory.
      
      The implementation leverages the ZONE_DEVICE mm-zone that went into
      4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
      and dynamically mapped by a device driver.  The pmem driver, after
      mapping a persistent memory range into the system memmap via
      devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
      page-backed pmem-pfns via flags in the new pfn_t type.
      
      The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
      resulting pte(s) inserted into the process page tables with a new
      _PAGE_DEVMAP flag.  Later, when get_user_pages() is walking ptes it keys
      off _PAGE_DEVMAP to pin the device hosting the page range active.
      Finally, get_page() and put_page() are modified to take references
      against the device driver established page mapping.
      
      Lastly, this need for "struct page" for persistent memory requires
      memory capacity to store the memmap array.  Given that the memmap array for
      a large pool of persistent memory may exhaust available DRAM, introduce a
      mechanism to allocate the memmap from persistent memory.  The new
      "struct vmem_altmap *" parameter to devm_memremap_pages() enables
      arch_add_memory() to use reserved pmem capacity rather than the page
      allocator.
      
      This patch (of 18):
      
      The core has developed a need for a "pfn_t" type [1].  Move the existing
      pfn_t in KVM to kvm_pfn_t [2].
      
      [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
      [2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba049e93
  6. 22 Aug 2015, 2 commits
    • KVM: PPC: Book3S: correct width in XER handling · c63517c2
      Authored by Sam Bobroff
      In 64 bit kernels, the Fixed Point Exception Register (XER) is a 64
      bit field (e.g. in kvm_regs and kvm_vcpu_arch) and in most places it is
      accessed as such.
      
      This patch corrects places where it is accessed as a 32 bit field by a
      64 bit kernel.  In some cases this is via a 32 bit load or store
      instruction which, depending on endianness, will cause either the
      lower or upper 32 bits to be missed.  In another case it is cast as a
      u32, causing the upper 32 bits to be cleared.
      
      This patch corrects those places by extending the access methods to
      64 bits (the truncation is illustrated by the sketch after this entry).
      Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
      Reviewed-by: Laurent Vivier <lvivier@redhat.com>
      Reviewed-by: Thomas Huth <thuth@redhat.com>
      Tested-by: Thomas Huth <thuth@redhat.com>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      c63517c2
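      A standalone illustration of the truncation this patch fixes: accessing a
      64-bit XER value through a u32 silently drops the upper 32 bits (the example
      value is arbitrary).

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            uint64_t xer = 0x00000001a0000000ULL; /* 64-bit value with high bits set */

            uint32_t narrow = (uint32_t)xer;  /* old behaviour: upper half lost */
            uint64_t wide   = xer;            /* fixed behaviour: full width    */

            printf("32-bit access: 0x%08x\n", narrow);
            printf("64-bit access: 0x%016llx\n", (unsigned long long)wide);
            return 0;
        }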
    • KVM: PPC: Book3S HV: Fix bug in dirty page tracking · 08fe1e7b
      Authored by Paul Mackerras
      This fixes a bug in the tracking of pages that get modified by the
      guest.  If the guest creates a large-page HPTE, writes to memory
      somewhere within the large page, and then removes the HPTE, we only
      record the modified state for the first normal page within the large
      page, when in fact the guest might have modified some other normal
      page within the large page.
      
      To fix this we use some unused bits in the rmap entry to record the
      order (log base 2) of the size of the page that was modified, when
      removing an HPTE.  Then in kvm_test_clear_dirty_npages() we use that
      order to return the correct number of modified pages (a standalone model
      of this encoding follows this entry).
      
      The same thing could in principle happen when removing a HPTE at the
      host's request, i.e. when paging out a page, except that we never
      page out large pages, and the guest can only create large-page HPTEs
      if the guest RAM is backed by large pages.  However, we also fix
      this case for the sake of future-proofing.
      
      The reference bit is also subject to the same loss of information.  We
      don't make the same fix here for the reference bit because there isn't
      an interface for userspace to find out which pages the guest has
      referenced, whereas there is one for userspace to find out which pages
      the guest has modified.  Because of this loss of information, the
      kvm_age_hva_hv() and kvm_test_age_hva_hv() functions might incorrectly
      say that a page has not been referenced when it has, but that doesn't
      matter greatly because we never page or swap out large pages.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      08fe1e7b
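      A standalone model of the fix: borrow unused bits of the rmap entry to
      remember the order of the page that was dirtied when its HPTE is removed,
      then expand that order into a page count when harvesting the dirty log. The
      bit positions and the 4k base size below are illustrative assumptions, not
      the kernel's actual rmap layout.

        #include <stdint.h>
        #include <stdio.h>

        #define CHG_SHIFT 48                      /* assumed free bit range      */
        #define CHG_ORDER (0x3fULL << CHG_SHIFT)  /* order of the modified page  */

        /* Record the order of a dirtied page when its HPTE is torn down. */
        static void rmap_note_change(uint64_t *rmap, unsigned int order)
        {
            unsigned int cur = (*rmap & CHG_ORDER) >> CHG_SHIFT;

            if (order > cur)      /* remember the largest page seen */
                *rmap = (*rmap & ~CHG_ORDER) | ((uint64_t)order << CHG_SHIFT);
        }

        /* Turn the recorded order into the number of 4k pages to mark dirty. */
        static unsigned long rmap_dirty_npages(uint64_t rmap)
        {
            unsigned int order = (rmap & CHG_ORDER) >> CHG_SHIFT;

            return order > 12 ? 1UL << (order - 12) : 1;
        }

        int main(void)
        {
            uint64_t rmap = 0;

            rmap_note_change(&rmap, 24);   /* e.g. a 16MB large page was dirtied */
            printf("4k pages to report dirty: %lu\n", rmap_dirty_npages(rmap));
            return 0;
        }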
  7. 21 Apr 2015, 1 commit
    • kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM · 99342cf8
      Authored by David Gibson
      On POWER, storage caching is usually configured via the MMU - attributes
      such as cache-inhibited are stored in the TLB and the hashed page table.
      
      This makes correctly performing cache inhibited IO accesses awkward when
      the MMU is turned off (real mode).  Some CPU models provide special
      registers to control the cache attributes of real mode load and stores but
      this is not at all consistent.  This is a problem in particular for SLOF,
      the firmware used on KVM guests, which runs entirely in real mode, but
      which needs to do IO to load the kernel.
      
      To simplify this, qemu implements two special hypercalls, H_LOGICAL_CI_LOAD
      and H_LOGICAL_CI_STORE, which simulate a cache-inhibited load or store to
      a logical address (aka guest physical address).  SLOF uses these for IO.
      
      However, because these are implemented within qemu, not the host kernel,
      they bypass any IO devices emulated within KVM itself.  The simplest way
      to see this problem is to attempt to boot a KVM guest from a virtio-blk
      device with iothread / dataplane enabled.  The iothread code relies on an
      in-kernel implementation of the virtio queue notification, which is not
      triggered by the IO hcalls, and so the guest will stall in SLOF, unable to
      load the guest OS.

      This patch addresses this by providing in-kernel implementations of the
      two hypercalls, which correctly scan the KVM IO bus (modelled by the
      sketch after this entry).  Any access to an address not handled by the
      KVM IO bus will cause a VM exit, hitting the qemu implementation as before.
      
      Note that a userspace change is also required, in order to enable these
      new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL.
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      [agraf: fix compilation]
      Signed-off-by: Alexander Graf <agraf@suse.de>
      99342cf8
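      A standalone model of the dispatch described above: the in-kernel handler
      tries the KVM MMIO bus first and only falls back to a userspace exit when no
      in-kernel device claims the address. The function names, address ranges and
      return codes are illustrative assumptions, not the kernel's exact API.

        #include <stdint.h>
        #include <stdio.h>

        enum outcome { HANDLED_IN_KERNEL, EXIT_TO_USERSPACE };

        /* Stand-in for scanning the KVM IO bus for a device covering addr. */
        static int io_bus_read(uint64_t addr, int size, uint64_t *val)
        {
            (void)size;
            if (addr >= 0x10000 && addr < 0x20000) {  /* pretend a device lives here */
                *val = 0xabcd;
                return 0;
            }
            return -1;                                /* nobody claimed the access */
        }

        /* Model of an in-kernel H_LOGICAL_CI_LOAD handler. */
        static enum outcome h_logical_ci_load(uint64_t addr, int size, uint64_t *val)
        {
            if (io_bus_read(addr, size, val) == 0)
                return HANDLED_IN_KERNEL;   /* e.g. an in-kernel virtio notification */
            return EXIT_TO_USERSPACE;       /* qemu handles the access as before     */
        }

        int main(void)
        {
            uint64_t val = 0;

            printf("0x10004 -> %s\n", h_logical_ci_load(0x10004, 4, &val) ==
                   HANDLED_IN_KERNEL ? "in kernel" : "userspace");
            printf("0x90000 -> %s\n", h_logical_ci_load(0x90000, 4, &val) ==
                   HANDLED_IN_KERNEL ? "in kernel" : "userspace");
            return 0;
        }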
  8. 09 Mar 2015, 1 commit
  9. 17 Dec 2014, 1 commit
    • KVM: PPC: Book3S HV: Remove code for PPC970 processors · c17b98cf
      Authored by Paul Mackerras
      This removes the code that was added to enable HV KVM to work
      on PPC970 processors.  The PPC970 is an old CPU that doesn't
      support virtualizing guest memory.  Removing PPC970 support also
      lets us remove the code for allocating and managing contiguous
      real-mode areas, the code for the !kvm->arch.using_mmu_notifiers
      case, the code for pinning pages of guest memory when first
      accessed and keeping track of which pages have been pinned, and
      the code for handling H_ENTER hypercalls in virtual mode.
      
      Book3S HV KVM is now supported only on POWER7 and POWER8 processors.
      The KVM_CAP_PPC_RMA capability now always returns 0.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      c17b98cf
  10. 31 Jul 2014, 1 commit
    • KVM: PPC: PR: Handle FSCR feature deselects · 8e6afa36
      Authored by Alexander Graf
      We handle FSCR feature bits (well, TAR only really today) lazily when the guest
      starts using them. So when a guest activates the bit and later uses that feature
      we enable it for real in hardware.
      
      However, when the guest stops using that bit we don't stop setting it in
      hardware. That means we can potentially lose a trap that the guest expects to
      happen because it thinks a feature is not active.
      
      This patch adds support to drop TAR when the guest turns it off in FSCR
      (sketched after this entry). While at it, it also restricts FSCR access to
      64-bit systems - 32-bit ones don't have it.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      8e6afa36
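      A standalone model of the lazy FSCR handling described above: the facility
      is switched on in hardware only when the guest first uses it, and dropped
      again as soon as the guest clears the bit, so the expected
      facility-unavailable trap is not lost. The bit position and helper names
      are illustrative assumptions.

        #include <stdint.h>
        #include <stdio.h>

        #define FSCR_TAR (1ULL << 8)    /* assumed bit position, for illustration */

        struct vcpu_model {
            uint64_t guest_fscr;        /* what the guest thinks FSCR contains     */
            uint64_t hw_fscr;           /* what we actually programmed in hardware */
        };

        /* Lazy enable path: the first guest use of TAR turns it on for real. */
        static void handle_tar_facility_trap(struct vcpu_model *v)
        {
            if (v->guest_fscr & FSCR_TAR)
                v->hw_fscr |= FSCR_TAR;
        }

        /* Emulated mtspr(FSCR): bits the guest clears must be dropped in hardware. */
        static void emulate_mtspr_fscr(struct vcpu_model *v, uint64_t new_fscr)
        {
            uint64_t dropped = v->guest_fscr & ~new_fscr;

            v->hw_fscr &= ~dropped;
            v->guest_fscr = new_fscr;
        }

        int main(void)
        {
            struct vcpu_model v = { .guest_fscr = FSCR_TAR, .hw_fscr = 0 };

            handle_tar_facility_trap(&v);   /* guest used TAR: enabled in hardware */
            emulate_mtspr_fscr(&v, 0);      /* guest turned TAR off again          */
            printf("hardware TAR is %s\n", (v.hw_fscr & FSCR_TAR) ? "on" : "off");
            return 0;
        }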
  11. 28 Jul 2014, 8 commits
    • KVM: PPC: Handle magic page in kvmppc_ld/st · c12fb43c
      Authored by Alexander Graf
      We use kvmppc_ld and kvmppc_st to emulate load/store instructions that may
      also access the magic page. Special-case it so that we can access it
      properly.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      c12fb43c
    • KVM: PPC: Move kvmppc_ld/st to common code · 35c4a733
      Authored by Alexander Graf
      We have enough common infrastructure now to resolve GVA->GPA mappings at
      runtime. With this we can move our book3s-specific helpers for loads and
      stores in guest virtual address space to common code as well.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      35c4a733
    • KVM: PPC: Allow kvmppc_get_last_inst() to fail · 51f04726
      Authored by Mihai Caraman
      On book3e, the guest's last instruction is read on the exit path using the
      dedicated load-external-PID (lwepx) instruction. This load operation may fail
      due to TLB eviction and execute-but-not-read entries.

      This patch lays down the path for an alternative solution to read the guest's
      last instruction, by allowing the kvmppc_get_last_inst() function to fail.
      Architecture-specific implementations of kvmppc_load_last_inst() may read the
      last guest instruction and instruct the emulation layer to re-execute the
      guest in case of failure (modelled by the sketch after this entry).

      Make the kvmppc_get_last_inst() definition common between architectures.
      Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      51f04726
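      A standalone model of the failable fetch: the caller retries (by re-entering
      the guest) whenever the architecture-specific load cannot read the
      instruction. The enum and function names are illustrative, not the kernel's
      exact EMULATE_* definitions.

        #include <stdint.h>
        #include <stdio.h>

        enum fetch_result { FETCH_OK, FETCH_AGAIN };

        /* Stand-in for an arch hook that may fail, e.g. lwepx hitting an evicted TLB entry. */
        static enum fetch_result load_last_inst(uint32_t *inst)
        {
            static int attempt;

            if (attempt++ == 0)
                return FETCH_AGAIN;    /* translation gone: re-execute the guest */
            *inst = 0x60000000;        /* example opcode once the fetch works    */
            return FETCH_OK;
        }

        int main(void)
        {
            uint32_t inst;

            while (load_last_inst(&inst) == FETCH_AGAIN)
                puts("fetch failed, re-entering guest to retry");
            printf("emulating instruction 0x%08x\n", inst);
            return 0;
        }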
    • KVM: PPC: Book3S: Make magic page properly 4k mappable · 89b68c96
      Authored by Alexander Graf
      The magic page is defined as a 4k page of per-vCPU data that is shared
      between the guest and the host to accelerate accesses to privileged
      registers.
      
      However, when the host is using 64k page size granularity we weren't quite
      as strict about that rule anymore. Instead, we partially treated all of the
      upper 64k as magic page and mapped only the uppermost 4k with the actual
      magic contents.
      
      This works well enough for Linux, which doesn't use any memory in kernel
      space in the upper 64k, but Mac OS X got upset. So this patch makes the magic
      page actually stay in a 4k range even on 64k page size hosts.

      This patch fixes magic page usage with Mac OS X (using MOL) on 64k PAGE_SIZE
      hosts for me.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      89b68c96
    • KVM: PPC: Book3S: Add hack for split real mode · c01e3f66
      Authored by Alexander Graf
      Today we handle split real mode by mapping both instruction and data faults
      into a special virtual address space that only exists during the split mode
      phase.
      
      This is good enough to catch 32bit Linux guests that use split real mode for
      copy_from/to_user. In this case we're always prefixed with 0xc0000000 for our
      instruction pointer and can map the user space process freely below there.
      
      However, that approach fails when we're running KVM inside of KVM. Here the 1st
      level last_inst reader may well be in the same virtual page as a 2nd level
      interrupt handler.
      
      It also fails when running Mac OS X guests. Here we have a 4G/4G split, so a
      kernel copy_from/to_user implementation can easily overlap with user space
      addresses.
      
      The architecturally correct way to fix this would be to implement an instruction
      interpreter in KVM that kicks in whenever we go into split real mode. This
      interpreter however would not receive a great amount of testing and be a lot of
      bloat for a reasonably isolated corner case.
      
      So I went back to the drawing board and tried to come up with a way to make
      split real mode work with a single flat address space. And then I realized that
      we could get away with the same trick that makes it work for Linux:
      
      Whenever we see an instruction address during split real mode that may collide,
      we just move it higher up the virtual address space to a place that hopefully
      does not collide (keep your fingers crossed!).
      
      That approach does work surprisingly well. I am able to successfully run
      Mac OS X guests with KVM and QEMU (no split real mode hacks like MOL) when I
      apply a tiny timing probe hack to QEMU. I'd say this is a win over even more
      broken split real mode :).
      Signed-off-by: Alexander Graf <agraf@suse.de>
      c01e3f66
    • KVM: PPC: Book3S HV: Make HTAB code LE host aware · 6f22bd32
      Authored by Alexander Graf
      When running on an LE host all data structures are kept in little endian
      byte order. However, the HTAB still needs to be maintained in big endian.
      
      So every time we access any HTAB we need to make sure we do so in the right
      byte order. Fix up all accesses to byte-swap manually (see the sketch after
      this entry).
      Signed-off-by: Alexander Graf <agraf@suse.de>
      6f22bd32
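      A standalone illustration of the rule above: HPTEs stay big endian in
      memory, so a little-endian host must swap on every access. be64_to_cpu() is
      modelled here with a compiler builtin rather than kernel headers.

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        static uint64_t be64_to_cpu(uint64_t v)
        {
        #if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
            return __builtin_bswap64(v);   /* LE host: swap to read the BE HTAB   */
        #else
            return v;                      /* BE host: already in the right order */
        #endif
        }

        int main(void)
        {
            /* One HPTE doubleword as its bytes sit in the (big endian) HTAB. */
            unsigned char htab_bytes[8] = { 0, 0, 0, 0, 0, 0, 0x12, 0x01 };
            uint64_t raw;

            memcpy(&raw, htab_bytes, sizeof(raw));     /* naive native-order read */
            printf("without swap:     0x%016llx\n", (unsigned long long)raw);
            printf("with be64_to_cpu: 0x%016llx\n",
                   (unsigned long long)be64_to_cpu(raw));
            return 0;
        }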
    • KVM: PPC: Book3S: Allow only implemented hcalls to be enabled or disabled · ae2113a4
      Authored by Paul Mackerras
      This adds code to check that, when the KVM_CAP_PPC_ENABLE_HCALL
      capability is used to enable or disable in-kernel handling of an
      hcall, the hcall is actually implemented by the kernel.
      If not, an EINVAL error is returned.
      
      This also checks the default-enabled list of hcalls and prints a
      warning if any hcall there is not actually implemented.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      ae2113a4
    • KVM: PPC: Book3S: Controls for in-kernel sPAPR hypercall handling · 699a0ea0
      Authored by Paul Mackerras
      This provides a way for userspace to control which sPAPR hcalls get
      handled in the kernel (a userspace sketch of the new ioctl follows this
      entry).  Each hcall can be individually enabled or
      disabled for in-kernel handling, except for H_RTAS.  The exception
      for H_RTAS is because userspace can already control whether
      individual RTAS functions are handled in-kernel or not via the
      KVM_PPC_RTAS_DEFINE_TOKEN ioctl, and because the numeric value for
      H_RTAS is out of the normal sequence of hcall numbers.
      
      Hcalls are enabled or disabled using the KVM_ENABLE_CAP ioctl for the
      KVM_CAP_PPC_ENABLE_HCALL capability on the file descriptor for the VM.
      The args field of the struct kvm_enable_cap specifies the hcall number
      in args[0] and the enable/disable flag in args[1]; 0 means disable
      in-kernel handling (so that the hcall will always cause an exit to
      userspace) and 1 means enable.  Enabling or disabling in-kernel
      handling of an hcall is effective across the whole VM.
      
      The ability for KVM_ENABLE_CAP to be used on a VM file descriptor
      on PowerPC is new, added by this commit.  The KVM_CAP_ENABLE_CAP_VM
      capability advertises that this ability exists.
      
      When a VM is created, an initial set of hcalls are enabled for
      in-kernel handling.  The set that is enabled is the set that have
      an in-kernel implementation at this point.  Any new hcall
      implementations from this point onwards should not be added to the
      default set without a good reason.
      
      No distinction is made between real-mode and virtual-mode hcall
      implementations; the one setting controls them both.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      699a0ea0
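      A userspace sketch of the control described above, assuming a Linux system
      whose headers already carry this capability. vmfd is the VM file descriptor
      obtained from KVM_CREATE_VM, and hcall is the PAPR hcall number (see
      asm/hvcall.h); error handling is left to the caller.

        #include <string.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Enable (1) or disable (0) in-kernel handling of one hcall on a VM fd. */
        static int set_hcall_in_kernel(int vmfd, unsigned long hcall, int enable)
        {
            struct kvm_enable_cap cap;

            memset(&cap, 0, sizeof(cap));
            cap.cap = KVM_CAP_PPC_ENABLE_HCALL;
            cap.args[0] = hcall;     /* which hcall to control                      */
            cap.args[1] = enable;    /* 0 = always exit to userspace, 1 = in kernel */

            return ioctl(vmfd, KVM_ENABLE_CAP, &cap);
        }

      A caller would typically confirm KVM_CAP_ENABLE_CAP_VM (and
      KVM_CAP_PPC_ENABLE_HCALL itself) with KVM_CHECK_EXTENSION before issuing
      the ioctl.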
  12. 06 Jul 2014, 1 commit
  13. 30 May 2014, 1 commit
    • KVM: PPC: Make shared struct aka magic page guest endian · 5deb8e7a
      Authored by Alexander Graf
      The shared (magic) page is a data structure that contains often used
      supervisor privileged SPRs accessible via memory to the user to reduce
      the number of exits we have to take to read/write them.
      
      When we actually share this structure with the guest we have to maintain
      it in guest endianness, because some of the patch tricks only work with
      native endian load/store operations.
      
      Since we only share the structure with either host or guest in little
      endian on book3s_64 pr mode, we don't have to worry about booke or book3s hv.
      
      For booke, the shared struct stays big endian. For book3s_64 hv we maintain
      the struct in host native endian, since it never gets shared with the guest.
      
      For book3s_64 pr we introduce a variable that tells us which endianness the
      shared struct is in and route every access to it through helper inline
      functions that evaluate this variable (modelled by the sketch after this
      entry).
      Signed-off-by: Alexander Graf <agraf@suse.de>
      5deb8e7a
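      A standalone model of the accessor routing: every read or write of the
      shared (magic) page goes through helpers that byte-swap only when the
      guest's byte order differs from the host's. The field and flag names are
      illustrative, not the kernel's.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        struct shared_page { uint64_t sprg0; /* ... more often-used SPRs ... */ };

        struct vcpu_model {
            struct shared_page shared;
            bool shared_big_endian;      /* byte order the guest expects */
        };

        static bool host_is_big_endian(void)
        {
            return __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__;
        }

        static uint64_t fixup(const struct vcpu_model *v, uint64_t raw)
        {
            if (v->shared_big_endian != host_is_big_endian())
                return __builtin_bswap64(raw);   /* guest and host disagree */
            return raw;
        }

        /* Store/load an SPR in whatever order the guest uses for the page. */
        static void set_sprg0(struct vcpu_model *v, uint64_t val)
        {
            v->shared.sprg0 = fixup(v, val);
        }

        static uint64_t get_sprg0(const struct vcpu_model *v)
        {
            return fixup(v, v->shared.sprg0);
        }

        int main(void)
        {
            struct vcpu_model v = { .shared_big_endian = true };

            set_sprg0(&v, 0x1234);    /* kept in guest order in the shared page */
            printf("sprg0 = 0x%llx\n", (unsigned long long)get_sprg0(&v));
            return 0;
        }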
  14. 26 Mar 2014, 1 commit
    • KVM: PPC: Book3S HV: Fix incorrect userspace exit on ioeventfd write · e59d24e6
      Authored by Greg Kurz
      When the guest does an MMIO write which is handled successfully by an
      ioeventfd, ioeventfd_write() returns 0 (success) and
      kvmppc_handle_store() returns EMULATE_DONE.  Then
      kvmppc_emulate_mmio() converts EMULATE_DONE to RESUME_GUEST_NV and
      this causes an exit from the loop in kvmppc_vcpu_run_hv(), causing an
      exit back to userspace with a bogus exit reason code, typically
      causing userspace (e.g. qemu) to crash with a message about an unknown
      exit code.
      
      This adds handling of RESUME_GUEST_NV in kvmppc_vcpu_run_hv() in order
      to fix that.  For generality, we define a helper to check for either
      of the return-to-guest codes we use, RESUME_GUEST and RESUME_GUEST_NV,
      to make it easy to check for either and provide one place to update if
      any other return-to-guest code gets defined in future (a sketch of such a
      helper follows this entry).
      
      Since it only affects Book3S HV for now, the helper is added to
      the kvm_book3s.h header file.
      
      We use the helper in two places in kvmppc_run_core() as well for
      future-proofing, though we don't see RESUME_GUEST_NV in either place
      at present.
      
      [paulus@samba.org - combined 4 patches into one, rewrote description]
      Suggested-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      e59d24e6
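      A sketch of the helper this message describes: a single predicate that
      recognises both return-to-guest codes, so the run loop no longer exits to
      userspace on RESUME_GUEST_NV. The constant values below are placeholders
      for illustration, not the kernel's definitions.

        #include <stdbool.h>
        #include <stdio.h>

        #define RESUME_GUEST    0    /* placeholder values */
        #define RESUME_GUEST_NV 1
        #define RESUME_HOST     2

        static inline bool is_resume_guest(int r)
        {
            return r == RESUME_GUEST || r == RESUME_GUEST_NV;
        }

        int main(void)
        {
            /* An ioeventfd-handled MMIO write comes back as RESUME_GUEST_NV;
             * the helper keeps the run loop going instead of exiting. */
            int r = RESUME_GUEST_NV;

            printf("keep running guest: %s\n", is_resume_guest(r) ? "yes" : "no");
            return 0;
        }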
  15. 27 Jan 2014, 1 commit
    • KVM: PPC: Book3S: MMIO emulation support for little endian guests · 73601775
      Authored by Cédric Le Goater
      MMIO emulation reads the last instruction executed by the guest
      and then emulates it. If the guest is running in little-endian order,
      or more generally in a different endian order from the host, the
      instruction needs to be byte-swapped before being emulated.
      
      This patch adds a helper routine which tests the endian order of
      the host and the guest in order to decide whether a byteswap is
      needed or not. It is then used to byteswap the last instruction
      of the guest in the endian order of the host before MMIO emulation
      is performed.
      
      Finally, kvmppc_handle_load() and kvmppc_handle_store() are modified
      to reverse the endianness of the MMIO if required (see the sketch after
      this entry).
      Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
      [agraf: add booke handling]
      Signed-off-by: Alexander Graf <agraf@suse.de>
      73601775
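      A standalone model of the helper described above: the fetched opcode is
      byte-swapped only when guest and host byte orders differ. Names are
      illustrative; the example opcode is arbitrary.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        static bool host_is_little_endian(void)
        {
            return __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__;
        }

        /* Return the last guest instruction in host byte order, ready to emulate. */
        static uint32_t fixup_last_inst(uint32_t raw, bool guest_is_little_endian)
        {
            if (guest_is_little_endian != host_is_little_endian())
                return __builtin_bswap32(raw);   /* cross-endian: swap first */
            return raw;
        }

        int main(void)
        {
            uint32_t raw = 0x60000000u;   /* example opcode as fetched from the guest */

            printf("same endian:  0x%08x\n",
                   fixup_last_inst(raw, host_is_little_endian()));
            printf("cross endian: 0x%08x\n",
                   fixup_last_inst(raw, !host_is_little_endian()));
            return 0;
        }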
  16. 09 Jan 2014, 2 commits
  17. 09 Dec 2013, 1 commit
  18. 17 Oct 2013, 10 commits
    • kvm: powerpc: book3s: Add is_hv_enabled to kvmppc_ops · 699cc876
      Authored by Aneesh Kumar K.V
      This helps us identify whether we are running with hypervisor-mode KVM
      enabled. The change is needed so that we can have both HV and PR KVM
      enabled in the same kernel.
      
      If both HV and PR KVM are included, interrupts come in to the HV version
      of the kvmppc_interrupt code, which then jumps to the PR handler,
      renamed to kvmppc_interrupt_pr, if the guest is a PR guest.
      
      Allowing both PR and HV in the same kernel required some changes to
      kvm_dev_ioctl_check_extension(), since the values returned now can't
      be selected with #ifdefs as much as previously. We look at is_hv_enabled
      to return the right value when checking for capabilities. For capabilities
      that are only provided by HV KVM, we return the HV value only if
      is_hv_enabled is true. For capabilities provided by PR KVM but not HV,
      we return the PR value only if is_hv_enabled is false.

      NOTE: in a later patch we replace is_hv_enabled with a static inline
      function that compares kvm_ppc_ops.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      699cc876
    • kvm: powerpc: Add kvmppc_ops callback · 3a167bea
      Authored by Aneesh Kumar K.V
      This patch adds a new callback structure, kvmppc_ops. This will help us in
      enabling both HV and PR KVM together in the same kernel. The actual change to
      enable them together is done in a later patch in the series.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      [agraf: squash in booke changes]
      Signed-off-by: Alexander Graf <agraf@suse.de>
      3a167bea
    • kvm: powerpc: book3s: Add a new config variable CONFIG_KVM_BOOK3S_HV_POSSIBLE · 9975f5e3
      Authored by Aneesh Kumar K.V
      This helps us select the relevant code in the kernel
      when we later move the HV and PR bits into separate modules. The patch
      also makes the config options for PR KVM selectable.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      9975f5e3
    • kvm: powerpc: book3s: pr: Rename KVM_BOOK3S_PR to KVM_BOOK3S_PR_POSSIBLE · 7aa79938
      Authored by Aneesh Kumar K.V
      With later patches supporting PR KVM as a kernel module, the changes
      that have to be built into the main kernel binary to enable the PR KVM module
      are now selected via KVM_BOOK3S_PR_POSSIBLE.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      7aa79938
    • KVM: PPC: Book3S PR: Use mmu_notifier_retry() in kvmppc_mmu_map_page() · d78bca72
      Authored by Paul Mackerras
      When the MM code is invalidating a range of pages, it calls the KVM
      kvm_mmu_notifier_invalidate_range_start() notifier function, which calls
      kvm_unmap_hva_range(), which arranges to flush all the existing host
      HPTEs for guest pages.  However, the Linux PTEs for the range being
      flushed are still valid at that point.  We are not supposed to establish
      any new references to pages in the range until the ...range_end()
      notifier gets called.  The PPC-specific KVM code doesn't get any
      explicit notification of that; instead, we are supposed to use
      mmu_notifier_retry() to test whether we are or have been inside a
      range flush notifier pair while we have been getting a page and
      instantiating a host HPTE for the page.
      
      This therefore adds a call to mmu_notifier_retry inside
      kvmppc_mmu_map_page(); a standalone model of this pattern follows this
      entry.  This call is inside a region locked with kvm->mmu_lock, which is
      the same lock that is taken by the KVM MMU notifier functions, thus
      ensuring that no new notification can
      proceed while we are in the locked region.  Inside this region we
      also create the host HPTE and link the corresponding hpte_cache
      structure into the lists used to find it later.  We cannot allocate
      the hpte_cache structure inside this locked region because that can
      lead to deadlock, so we allocate it outside the region and free it
      if we end up not using it.
      
      This also moves the updates of vcpu3s->hpte_cache_count inside the
      regions locked with vcpu3s->mmu_lock, and does the increment in
      kvmppc_mmu_hpte_cache_map() when the pte is added to the cache
      rather than when it is allocated, in order that the hpte_cache_count
      is accurate.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      d78bca72
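      A standalone model of the mmu_notifier_retry() pattern this relies on:
      sample a sequence count before looking up the page, and throw the work away
      if an invalidation window overlapped the lookup. This models the idea only;
      the kernel's real primitives also involve memory barriers and kvm->mmu_lock.

        #include <stdbool.h>
        #include <stdio.h>

        struct kvm_model {
            unsigned long notifier_seq;     /* bumped when a range flush completes   */
            unsigned long notifier_count;   /* nonzero while a range flush is active */
        };

        static bool retry_needed(const struct kvm_model *kvm, unsigned long sampled_seq)
        {
            return kvm->notifier_count != 0 || kvm->notifier_seq != sampled_seq;
        }

        int main(void)
        {
            struct kvm_model kvm = { 0, 0 };
            unsigned long seq = kvm.notifier_seq;   /* sample before getting the page */

            kvm.notifier_seq++;                     /* a range flush ran meanwhile    */

            if (retry_needed(&kvm, seq))
                puts("invalidation raced with the lookup: discard the HPTE and retry");
            else
                puts("safe to install the host HPTE");
            return 0;
        }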
    • KVM: PPC: Book3S PR: Better handling of host-side read-only pages · 93b159b4
      Authored by Paul Mackerras
      Currently we request write access to all pages that get mapped into the
      guest, even if the guest is only loading from the page.  This reduces
      the effectiveness of KSM because it means that we unshare every page we
      access.  Also, we always set the changed (C) bit in the guest HPTE if
      it allows writing, even for a guest load.
      
      This fixes both these problems.  We pass an 'iswrite' flag to the
      mmu.xlate() functions and to kvmppc_mmu_map_page() to indicate whether
      the access is a load or a store.  The mmu.xlate() functions now only
      set C for stores.  kvmppc_gfn_to_pfn() now calls gfn_to_pfn_prot()
      instead of gfn_to_pfn() so that it can indicate whether we need write
      access to the page, and get back a 'writable' flag to indicate whether
      the page is writable or not.  If that 'writable' flag is clear, we then
      make the host HPTE read-only even if the guest HPTE allowed writing
      (modelled by the sketch after this entry).
      
      This means that we can get a protection fault when the guest writes to a
      page that it has mapped read-write but which is read-only on the host
      side (perhaps due to KSM having merged the page).  Thus we now call
      kvmppc_handle_pagefault() for protection faults as well as HPTE not found
      faults.  In kvmppc_handle_pagefault(), if the access was allowed by the
      guest HPTE and we thus need to install a new host HPTE, we then need to
      remove the old host HPTE if there is one.  This is done with a new
      function, kvmppc_mmu_unmap_page(), which uses kvmppc_mmu_pte_vflush() to
      find and remove the old host HPTE.
      
      Since the memslot-related functions require the KVM SRCU read lock to
      be held, this adds srcu_read_lock/unlock pairs around the calls to
      kvmppc_handle_pagefault().
      
      Finally, this changes kvmppc_mmu_book3s_32_xlate_pte() to not ignore
      guest HPTEs that don't permit access, and to return -EPERM for accesses
      that are not permitted by the page protections.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      93b159b4
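      A standalone model of the mapping decision described above: request the page
      with the access type, learn whether the host side is really writable, and
      downgrade the host mapping when it is not (for example a KSM-merged page).
      Names and the "odd gfns are shared" rule are illustrative assumptions.

        #include <stdbool.h>
        #include <stdio.h>

        enum host_prot { HOST_RW, HOST_RO };

        /* Stand-in for gfn_to_pfn_prot(): the lookup reports real writability. */
        static unsigned long lookup_pfn(unsigned long gfn, bool want_write, bool *writable)
        {
            *writable = want_write && (gfn % 2 == 0);  /* pretend odd gfns are KSM-shared */
            return 0x100000 + gfn;
        }

        static enum host_prot map_guest_page(unsigned long gfn, bool iswrite)
        {
            bool writable;
            unsigned long pfn = lookup_pfn(gfn, iswrite, &writable);

            (void)pfn;
            /* The guest HPTE may allow writing, but the host HPTE must not grant
             * more than the host mapping really permits. */
            return writable ? HOST_RW : HOST_RO;
        }

        int main(void)
        {
            printf("store to gfn 4 -> %s\n",
                   map_guest_page(4, true) == HOST_RW ? "RW" : "RO");
            printf("store to gfn 5 -> %s\n",
                   map_guest_page(5, true) == HOST_RW ? "RW" : "RO");
            return 0;
        }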
    • KVM: PPC: Book3S PR: Allocate kvm_vcpu structs from kvm_vcpu_cache · 3ff95502
      Authored by Paul Mackerras
      This makes PR KVM allocate its kvm_vcpu structs from the kvm_vcpu_cache
      rather than having them embedded in the kvmppc_vcpu_book3s struct,
      which is allocated with vzalloc.  The reason is to reduce the
      differences between PR and HV KVM in order to make it easier to have
      them coexist in one kernel binary.
      
      With this, the kvm_vcpu struct has a pointer to the kvmppc_vcpu_book3s
      struct.  The pointer to the kvmppc_book3s_shadow_vcpu struct has moved
      from the kvmppc_vcpu_book3s struct to the kvm_vcpu struct, and is only
      present for 32-bit, since it is only used for 32-bit.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      [agraf: squash in compile fix from Aneesh]
      Signed-off-by: Alexander Graf <agraf@suse.de>
      3ff95502
    • KVM: PPC: Book3S PR: Use 64k host pages where possible · c9029c34
      Authored by Paul Mackerras
      Currently, PR KVM uses 4k pages for the host-side mappings of guest
      memory, regardless of the host page size.  When the host page size is
      64kB, we might as well use 64k host page mappings for guest mappings
      of 64kB and larger pages and for guest real-mode mappings.  However,
      the magic page has to remain a 4k page.
      
      To implement this, we first add another flag bit to the guest VSID
      values we use, to indicate that this segment is one where host pages
      should be mapped using 64k pages.  For segments with this bit set
      we set the bits in the shadow SLB entry to indicate a 64k base page
      size.  When faulting in host HPTEs for this segment, we make them
      64k HPTEs instead of 4k.  We record the pagesize in struct hpte_cache
      for use when invalidating the HPTE.
      
      For now we restrict the segment containing the magic page (if any) to
      4k pages.  It should be possible to lift this restriction in future
      by ensuring that the magic 4k page is appropriately positioned within
      a host 64k page.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      c9029c34
    • KVM: PPC: Book3S PR: Allow guest to use 64k pages · a4a0f252
      Authored by Paul Mackerras
      This adds the code to interpret 64k HPTEs in the guest hashed page
      table (HPT), 64k SLB entries, and to tell the guest about 64k pages
      in kvm_vm_ioctl_get_smmu_info().  Guest 64k pages are still shadowed
      by 4k pages.
      
      This also adds another hash table to the four we have already in
      book3s_mmu_hpte.c to allow us to find all the PTEs that we have
      instantiated that match a given 64k guest page.
      
      The tlbie instruction changed starting with POWER6 to use a bit in
      the RB operand to indicate large page invalidations, and to use other
      RB bits to indicate the base and actual page sizes and the segment
      size.  64k pages came in slightly earlier, with POWER5++.
      We use one bit in vcpu->arch.hflags to indicate that the emulated
      cpu supports 64k pages, and another to indicate that it has the new
      tlbie definition.
      
      The KVM_PPC_GET_SMMU_INFO ioctl presents a bit of a problem, because
      the MMU capabilities depend on which CPU model we're emulating, but it
      is a VM ioctl not a VCPU ioctl and therefore doesn't get passed a VCPU
      fd.  In addition, commonly-used userspace (QEMU) calls it before
      setting the PVR for any VCPU.  Therefore, as a best effort we look at
      the first vcpu in the VM and return 64k pages or not depending on its
      capabilities.  We also make the PVR default to the host PVR on recent
      CPUs that support 1TB segments (and therefore multiple page sizes as
      well) so that KVM_PPC_GET_SMMU_INFO will include 64k page and 1TB
      segment support on those CPUs.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      a4a0f252
    • KVM: PPC: Book3S PR: Keep volatile reg values in vcpu rather than shadow_vcpu · a2d56020
      Authored by Paul Mackerras
      Currently PR-style KVM keeps the volatile guest register values
      (R0 - R13, CR, LR, CTR, XER, PC) in a shadow_vcpu struct rather than
      the main kvm_vcpu struct.  For 64-bit, the shadow_vcpu exists in two
      places, a kmalloc'd struct and in the PACA, and it gets copied back
      and forth in kvmppc_core_vcpu_load/put(), because the real-mode code
      can't rely on being able to access the kmalloc'd struct.
      
      This changes the code to copy the volatile values into the shadow_vcpu
      as one of the last things done before entering the guest.  Similarly
      the values are copied back out of the shadow_vcpu to the kvm_vcpu
      immediately after exiting the guest.  We arrange for interrupts to be
      still disabled at this point so that we can't get preempted on 64-bit
      and end up copying values from the wrong PACA.
      
      This means that the accessor functions in kvm_book3s.h for these
      registers are greatly simplified, and are the same between PR and HV KVM.
      In places where accesses to shadow_vcpu fields are now replaced by
      accesses to the kvm_vcpu, we can also remove the svcpu_get/put pairs.
      Finally, on 64-bit, we don't need the kmalloc'd struct at all any more.
      
      With this, the time to read the PVR one million times in a loop went
      from 567.7ms to 575.5ms (averages of 6 values), an increase of about
      1.4% for this worse-case test for guest entries and exits.  The
      standard deviation of the measurements is about 11ms, so the
      difference is only marginally significant statistically.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      a2d56020