1. 16 2月, 2016 1 次提交
  2. 16 1月, 2016 1 次提交
    • D
      kvm: rename pfn_t to kvm_pfn_t · ba049e93
      Dan Williams 提交于
      To date, we have implemented two I/O usage models for persistent memory,
      PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
      userspace).  This series adds a third, DAX-GUP, that allows DAX mappings
      to be the target of direct-i/o.  It allows userspace to coordinate
      DMA/RDMA from/to persistent memory.
      
      The implementation leverages the ZONE_DEVICE mm-zone that went into
      4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
      and dynamically mapped by a device driver.  The pmem driver, after
      mapping a persistent memory range into the system memmap via
      devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
      page-backed pmem-pfns via flags in the new pfn_t type.
      
      The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
      resulting pte(s) inserted into the process page tables with a new
      _PAGE_DEVMAP flag.  Later, when get_user_pages() is walking ptes it keys
      off _PAGE_DEVMAP to pin the device hosting the page range active.
      Finally, get_page() and put_page() are modified to take references
      against the device driver established page mapping.
      
      Finally, this need for "struct page" for persistent memory requires
      memory capacity to store the memmap array.  Given the memmap array for a
      large pool of persistent may exhaust available DRAM introduce a
      mechanism to allocate the memmap from persistent memory.  The new
      "struct vmem_altmap *" parameter to devm_memremap_pages() enables
      arch_add_memory() to use reserved pmem capacity rather than the page
      allocator.
      
      This patch (of 18):
      
      The core has developed a need for a "pfn_t" type [1].  Move the existing
      pfn_t in KVM to kvm_pfn_t [2].
      
      [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
      [2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.htmlSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NChristoffer Dall <christoffer.dall@linaro.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba049e93
  3. 21 9月, 2015 1 次提交
  4. 16 9月, 2015 1 次提交
    • P
      KVM: add halt_attempted_poll to VCPU stats · 62bea5bf
      Paolo Bonzini 提交于
      This new statistic can help diagnosing VCPUs that, for any reason,
      trigger bad behavior of halt_poll_ns autotuning.
      
      For example, say halt_poll_ns = 480000, and wakeups are spaced exactly
      like 479us, 481us, 479us, 481us. Then KVM always fails polling and wastes
      10+20+40+80+160+320+480 = 1110 microseconds out of every
      479+481+479+481+479+481+479 = 3359 microseconds. The VCPU then
      is consuming about 30% more CPU than it would use without
      polling.  This would show as an abnormally high number of
      attempted polling compared to the successful polls.
      
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com<
      Reviewed-by: NDavid Matlack <dmatlack@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      62bea5bf
  5. 22 8月, 2015 1 次提交
    • T
      KVM: PPC: Fix warnings from sparse · 5358a963
      Thomas Huth 提交于
      When compiling the KVM code for POWER with "make C=1", sparse
      complains about functions missing proper prototypes and a 64-bit
      constant missing the ULL prefix. Let's fix this by making the
      functions static or by including the proper header with the
      prototypes, and by appending a ULL prefix to the constant
      PPC_MPPE_ADDRESS_MASK.
      Signed-off-by: NThomas Huth <thuth@redhat.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      5358a963
  6. 07 8月, 2015 1 次提交
  7. 28 5月, 2015 1 次提交
  8. 26 5月, 2015 1 次提交
  9. 21 4月, 2015 1 次提交
    • D
      kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVM · 99342cf8
      David Gibson 提交于
      On POWER, storage caching is usually configured via the MMU - attributes
      such as cache-inhibited are stored in the TLB and the hashed page table.
      
      This makes correctly performing cache inhibited IO accesses awkward when
      the MMU is turned off (real mode).  Some CPU models provide special
      registers to control the cache attributes of real mode load and stores but
      this is not at all consistent.  This is a problem in particular for SLOF,
      the firmware used on KVM guests, which runs entirely in real mode, but
      which needs to do IO to load the kernel.
      
      To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD
      and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to
      a logical address (aka guest physical address).  SLOF uses these for IO.
      
      However, because these are implemented within qemu, not the host kernel,
      these bypass any IO devices emulated within KVM itself.  The simplest way
      to see this problem is to attempt to boot a KVM guest from a virtio-blk
      device with iothread / dataplane enabled.  The iothread code relies on an
      in kernel implementation of the virtio queue notification, which is not
      triggered by the IO hcalls, and so the guest will stall in SLOF unable to
      load the guest OS.
      
      This patch addresses this by providing in-kernel implementations of the
      2 hypercalls, which correctly scan the KVM IO bus.  Any access to an
      address not handled by the KVM IO bus will cause a VM exit, hitting the
      qemu implementation as before.
      
      Note that a userspace change is also required, in order to enable these
      new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL.
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      [agraf: fix compilation]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      99342cf8
  10. 06 2月, 2015 1 次提交
    • P
      kvm: add halt_poll_ns module parameter · f7819512
      Paolo Bonzini 提交于
      This patch introduces a new module parameter for the KVM module; when it
      is present, KVM attempts a bit of polling on every HLT before scheduling
      itself out via kvm_vcpu_block.
      
      This parameter helps a lot for latency-bound workloads---in particular
      I tested it with O_DSYNC writes with a battery-backed disk in the host.
      In this case, writes are fast (because the data doesn't have to go all
      the way to the platters) but they cannot be merged by either the host or
      the guest.  KVM's performance here is usually around 30% of bare metal,
      or 50% if you use cache=directsync or cache=writethrough (these
      parameters avoid that the guest sends pointless flush requests, and
      at the same time they are not slow because of the battery-backed cache).
      The bad performance happens because on every halt the host CPU decides
      to halt itself too.  When the interrupt comes, the vCPU thread is then
      migrated to a new physical CPU, and in general the latency is horrible
      because the vCPU thread has to be scheduled back in.
      
      With this patch performance reaches 60-65% of bare metal and, more
      important, 99% of what you get if you use idle=poll in the guest.  This
      means that the tunable gets rid of this particular bottleneck, and more
      work can be done to improve performance in the kernel or QEMU.
      
      Of course there is some price to pay; every time an otherwise idle vCPUs
      is interrupted by an interrupt, it will poll unnecessarily and thus
      impose a little load on the host.  The above results were obtained with
      a mostly random value of the parameter (500000), and the load was around
      1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
      
      The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
      that can be used to tune the parameter.  It counts how many HLT
      instructions received an interrupt during the polling period; each
      successful poll avoids that Linux schedules the VCPU thread out and back
      in, and may also avoid a likely trip to C1 and back for the physical CPU.
      
      While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
      Of these halts, almost all are failed polls.  During the benchmark,
      instead, basically all halts end within the polling period, except a more
      or less constant stream of 50 per second coming from vCPUs that are not
      running the benchmark.  The wasted time is thus very low.  Things may
      be slightly different for Windows VMs, which have a ~10 ms timer tick.
      
      The effect is also visible on Marcelo's recently-introduced latency
      test for the TSC deadline timer.  Though of course a non-RT kernel has
      awful latency bounds, the latency of the timer is around 8000-10000 clock
      cycles compared to 20000-120000 without setting halt_poll_ns.  For the TSC
      deadline timer, thus, the effect is both a smaller average latency and
      a smaller variance.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f7819512
  11. 17 12月, 2014 1 次提交
  12. 24 9月, 2014 1 次提交
    • A
      kvm: Fix page ageing bugs · 57128468
      Andres Lagar-Cavilla 提交于
      1. We were calling clear_flush_young_notify in unmap_one, but we are
      within an mmu notifier invalidate range scope. The spte exists no more
      (due to range_start) and the accessed bit info has already been
      propagated (due to kvm_pfn_set_accessed). Simply call
      clear_flush_young.
      
      2. We clear_flush_young on a primary MMU PMD, but this may be mapped
      as a collection of PTEs by the secondary MMU (e.g. during log-dirty).
      This required expanding the interface of the clear_flush_young mmu
      notifier, so a lot of code has been trivially touched.
      
      3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate
      the access bit by blowing the spte. This requires proper synchronizing
      with MMU notifier consumers, like every other removal of spte's does.
      Signed-off-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      57128468
  13. 22 9月, 2014 4 次提交
  14. 29 7月, 2014 1 次提交
  15. 28 7月, 2014 10 次提交
    • A
      KVM: PPC: Move kvmppc_ld/st to common code · 35c4a733
      Alexander Graf 提交于
      We have enough common infrastructure now to resolve GVA->GPA mappings at
      runtime. With this we can move our book3s specific helpers to load / store
      in guest virtual address space to common code as well.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      35c4a733
    • A
      KVM: PPC: Implement kvmppc_xlate for all targets · 7d15c06f
      Alexander Graf 提交于
      We have a nice API to find the translated GPAs of a GVA including protection
      flags. So far we only use it on Book3S, but there's no reason the same shouldn't
      be used on BookE as well.
      
      Implement a kvmppc_xlate() version for BookE and clean it up to make it more
      readable in general.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      7d15c06f
    • P
      KVM: PPC: Book3S: Make kvmppc_ld return a more accurate error indication · 1b2e33b0
      Paul Mackerras 提交于
      At present, kvmppc_ld calls kvmppc_xlate, and if kvmppc_xlate returns
      any error indication, it returns -ENOENT, which is taken to mean an
      HPTE not found error.  However, the error could have been a segment
      found (no SLB entry) or a permission error.  Similarly,
      kvmppc_pte_to_hva currently does permission checking, but any error
      from it is taken by kvmppc_ld to mean that the access is an emulated
      MMIO access.  Also, kvmppc_ld does no execute permission checking.
      
      This fixes these problems by (a) returning any error from kvmppc_xlate
      directly, (b) moving the permission check from kvmppc_pte_to_hva
      into kvmppc_ld, and (c) adding an execute permission check to kvmppc_ld.
      
      This is similar to what was done for kvmppc_st() by commit 82ff911317c3
      ("KVM: PPC: Deflect page write faults properly in kvmppc_st").
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      1b2e33b0
    • M
      KVM: PPC: Allow kvmppc_get_last_inst() to fail · 51f04726
      Mihai Caraman 提交于
      On book3e, guest last instruction is read on the exit path using load
      external pid (lwepx) dedicated instruction. This load operation may fail
      due to TLB eviction and execute-but-not-read entries.
      
      This patch lay down the path for an alternative solution to read the guest
      last instruction, by allowing kvmppc_get_lat_inst() function to fail.
      Architecture specific implmentations of kvmppc_load_last_inst() may read
      last guest instruction and instruct the emulation layer to re-execute the
      guest in case of failure.
      
      Make kvmppc_get_last_inst() definition common between architectures.
      Signed-off-by: NMihai Caraman <mihai.caraman@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      51f04726
    • A
      KVM: PPC: Book3S: Make magic page properly 4k mappable · 89b68c96
      Alexander Graf 提交于
      The magic page is defined as a 4k page of per-vCPU data that is shared
      between the guest and the host to accelerate accesses to privileged
      registers.
      
      However, when the host is using 64k page size granularity we weren't quite
      as strict about that rule anymore. Instead, we partially treated all of the
      upper 64k as magic page and mapped only the uppermost 4k with the actual
      magic contents.
      
      This works well enough for Linux which doesn't use any memory in kernel
      space in the upper 64k, but Mac OS X got upset. So this patch makes magic
      page actually stay in a 4k range even on 64k page size hosts.
      
      This patch fixes magic page usage with Mac OS X (using MOL) on 64k PAGE_SIZE
      hosts for me.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      89b68c96
    • A
      KVM: PPC: Book3S: Add hack for split real mode · c01e3f66
      Alexander Graf 提交于
      Today we handle split real mode by mapping both instruction and data faults
      into a special virtual address space that only exists during the split mode
      phase.
      
      This is good enough to catch 32bit Linux guests that use split real mode for
      copy_from/to_user. In this case we're always prefixed with 0xc0000000 for our
      instruction pointer and can map the user space process freely below there.
      
      However, that approach fails when we're running KVM inside of KVM. Here the 1st
      level last_inst reader may well be in the same virtual page as a 2nd level
      interrupt handler.
      
      It also fails when running Mac OS X guests. Here we have a 4G/4G split, so a
      kernel copy_from/to_user implementation can easily overlap with user space
      addresses.
      
      The architecturally correct way to fix this would be to implement an instruction
      interpreter in KVM that kicks in whenever we go into split real mode. This
      interpreter however would not receive a great amount of testing and be a lot of
      bloat for a reasonably isolated corner case.
      
      So I went back to the drawing board and tried to come up with a way to make
      split real mode work with a single flat address space. And then I realized that
      we could get away with the same trick that makes it work for Linux:
      
      Whenever we see an instruction address during split real mode that may collide,
      we just move it higher up the virtual address space to a place that hopefully
      does not collide (keep your fingers crossed!).
      
      That approach does work surprisingly well. I am able to successfully run
      Mac OS X guests with KVM and QEMU (no split real mode hacks like MOL) when I
      apply a tiny timing probe hack to QEMU. I'd say this is a win over even more
      broken split real mode :).
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      c01e3f66
    • A
      KVM: PPC: Deflect page write faults properly in kvmppc_st · 17824b5a
      Alexander Graf 提交于
      When we have a page that we're not allowed to write to, xlate() will already
      tell us -EPERM on lookup of that page. With the code as is we change it into
      a "page missing" error which a guest may get confused about. Instead, just
      tell the caller about the -EPERM directly.
      
      This fixes Mac OS X guests when run with DCBZ32 emulation.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      17824b5a
    • P
      KVM: PPC: Book3S: Allow only implemented hcalls to be enabled or disabled · ae2113a4
      Paul Mackerras 提交于
      This adds code to check that when the KVM_CAP_PPC_ENABLE_HCALL
      capability is used to enable or disable in-kernel handling of an
      hcall, that the hcall is actually implemented by the kernel.
      If not an EINVAL error is returned.
      
      This also checks the default-enabled list of hcalls and prints a
      warning if any hcall there is not actually implemented.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      ae2113a4
    • A
      KVM: PPC: BOOK3S: PR: Emulate instruction counter · 06da28e7
      Aneesh Kumar K.V 提交于
      Writing to IC is not allowed in the privileged mode.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      06da28e7
    • A
      KVM: PPC: BOOK3S: PR: Emulate virtual timebase register · 8f42ab27
      Aneesh Kumar K.V 提交于
      virtual time base register is a per VM, per cpu register that needs
      to be saved and restored on vm exit and entry. Writing to VTB is not
      allowed in the privileged mode.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      [agraf: fix compile error]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      8f42ab27
  16. 30 5月, 2014 4 次提交
    • A
      KVM: PPC: Book3S PR: Expose EBB registers · 2e23f544
      Alexander Graf 提交于
      POWER8 introduces a new facility called the "Event Based Branch" facility.
      It contains of a few registers that indicate where a guest should branch to
      when a defined event occurs and it's in PR mode.
      
      We don't want to really enable EBB as it will create a big mess with !PR guest
      mode while hardware is in PR and we don't really emulate the PMU anyway.
      
      So instead, let's just leave it at emulation of all its registers.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      2e23f544
    • A
      KVM: PPC: Book3S PR: Expose TAR facility to guest · e14e7a1e
      Alexander Graf 提交于
      POWER8 implements a new register called TAR. This register has to be
      enabled in FSCR and then from KVM's point of view is mere storage.
      
      This patch enables the guest to use TAR.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      e14e7a1e
    • A
      KVM: PPC: Book3S PR: Handle Facility interrupt and FSCR · 616dff86
      Alexander Graf 提交于
      POWER8 introduced a new interrupt type called "Facility unavailable interrupt"
      which contains its status message in a new register called FSCR.
      
      Handle these exits and try to emulate instructions for unhandled facilities.
      Follow-on patches enable KVM to expose specific facilities into the guest.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      616dff86
    • A
      KVM: PPC: Make shared struct aka magic page guest endian · 5deb8e7a
      Alexander Graf 提交于
      The shared (magic) page is a data structure that contains often used
      supervisor privileged SPRs accessible via memory to the user to reduce
      the number of exits we have to take to read/write them.
      
      When we actually share this structure with the guest we have to maintain
      it in guest endianness, because some of the patch tricks only work with
      native endian load/store operations.
      
      Since we only share the structure with either host or guest in little
      endian on book3s_64 pr mode, we don't have to worry about booke or book3s hv.
      
      For booke, the shared struct stays big endian. For book3s_64 hv we maintain
      the struct in host native endian, since it never gets shared with the guest.
      
      For book3s_64 pr we introduce a variable that tells us which endianness the
      shared struct is in and route every access to it through helper inline
      functions that evaluate this variable.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      5deb8e7a
  17. 28 4月, 2014 1 次提交
  18. 09 1月, 2014 2 次提交
  19. 18 10月, 2013 2 次提交
  20. 17 10月, 2013 4 次提交