1. 15 Dec, 2014 2 commits
  2. 24 Sep, 2014 1 commit
    • kvm: Fix page ageing bugs · 57128468
      Andres Lagar-Cavilla committed
      1. We were calling clear_flush_young_notify in unmap_one, but we are
      within an mmu notifier invalidate range scope. The spte exists no more
      (due to range_start) and the accessed bit info has already been
      propagated (due to kvm_pfn_set_accessed). Simply call
      clear_flush_young.
      
      2. We clear_flush_young on a primary MMU PMD, but this may be mapped
      as a collection of PTEs by the secondary MMU (e.g. during log-dirty).
      This required expanding the interface of the clear_flush_young mmu
      notifier, so a lot of code has been trivially touched.
      
      3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate
      the access bit by blowing the spte. This requires proper synchronizing
      with MMU notifier consumers, like every other removal of spte's does.
      Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 03 Sep, 2014 1 commit
    • powerpc/kvm/cma: Fix panic introduced by signed shift operation · 02a68d05
      Laurent Dufour committed
      fc95ca72 introduces a memset in
      kvmppc_alloc_hpt since the general CMA doesn't clear the memory it
      allocates.
      
      However, the size argument passed to memset is computed from a signed value,
      and its sign bit is extended by the compiler's implicit conversion. This leads
      to an extremely large size value when the order is >= 31, and almost all the
      memory following the allocated space is cleared. As a consequence, the system
      panics and may even fail to spawn the kdump kernel.
      
      This fix uses an unsigned value for memset's size argument to avoid sign
      extension. Along with this fix, another shift operation that could also
      produce a sign-extended value is fixed.
      
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Alexander Graf <agraf@suse.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
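The sign-extension problem in this commit message can be modelled in a few lines of C. This is a minimal sketch, not the kernel code: the function names are hypothetical, and because evaluating `1 << 31` on a signed int is undefined behaviour in ISO C, the buggy case is modelled by widening an already-negative 32-bit value.

```c
#include <stdint.h>

/* Models the buggy pattern: a negative 32-bit signed result is widened
 * to 64 bits before being used as a size. The widening sign-extends. */
uint64_t size_from_signed(int32_t shifted)
{
    return (uint64_t)(int64_t)shifted;
}

/* Models the fixed pattern: the shift is done on an unsigned 64-bit one,
 * so no sign bit exists to be extended. Assumes order < 64. */
uint64_t size_from_unsigned(unsigned int order)
{
    return (uint64_t)1 << order;
}
```

With an order of 31, the signed path yields a size close to 2^64 while the unsigned path yields the intended 2 GiB.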
  4. 07 Aug, 2014 1 commit
  5. 28 Jul, 2014 2 commits
    • KVM: PPC: Allow kvmppc_get_last_inst() to fail · 51f04726
      Mihai Caraman committed
      On book3e, guest last instruction is read on the exit path using load
      external pid (lwepx) dedicated instruction. This load operation may fail
      due to TLB eviction and execute-but-not-read entries.
      
      This patch lays down the path for an alternative solution to read the guest
      last instruction, by allowing the kvmppc_get_last_inst() function to fail.
      Architecture-specific implementations of kvmppc_load_last_inst() may read
      the last guest instruction and instruct the emulation layer to re-execute the
      guest in case of failure.
      
      Make kvmppc_get_last_inst() definition common between architectures.
      Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
      Signed-off-by: Alexander Graf <agraf@suse.de>
    • KVM: PPC: Book3S HV: Make HTAB code LE host aware · 6f22bd32
      Alexander Graf committed
      When running on an LE host all data structures are kept in little endian
      byte order. However, the HTAB still needs to be maintained in big endian.
      
      So every time we access any HTAB we need to make sure we do so in the right
      byte order. Fix up all accesses to manually byte swap.
      Signed-off-by: Alexander Graf <agraf@suse.de>
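The idea of always accessing a big-endian table in the right byte order can be sketched portably as follows. The helper names are hypothetical; the kernel itself has `be64_to_cpu()` and `cpu_to_be64()` for this purpose, and this sketch only illustrates the principle with byte-wise access.

```c
#include <stdint.h>

/* Read a 64-bit big-endian value from memory. Works the same on
 * little-endian and big-endian hosts because it goes byte by byte. */
uint64_t load_be64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Store a 64-bit value in big-endian order, most significant byte first. */
void store_be64(unsigned char *p, uint64_t v)
{
    for (int i = 7; i >= 0; i--) {
        p[i] = (unsigned char)(v & 0xff);
        v >>= 8;
    }
}
```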
  6. 25 Jun, 2014 1 commit
  7. 30 May, 2014 4 commits
    • KVM: PPC: Book3S HV: Make sure we don't miss dirty pages · 6c576e74
      Paul Mackerras committed
      Currently, when testing whether a page is dirty (when constructing the
      bitmap for the KVM_GET_DIRTY_LOG ioctl), we test the C (changed) bit
      in the HPT entries mapping the page, and if it is 0, we consider the
      page to be clean.  However, the Power ISA doesn't require processors
      to set the C bit to 1 immediately when writing to a page, and in fact
      allows them to delay the writeback of the C bit until they receive a
      TLB invalidation for the page.  Thus it is possible that the page
      could be dirty and we miss it.
      
      Now, if there are vcpus running, this is not serious since the
      collection of the dirty log is racy already - some vcpu could dirty
      the page just after we check it.  But if there are no vcpus running we
      should return definitive results, in case we are in the final phase of
      migrating the guest.
      
      Also, if the permission bits in the HPTE don't allow writing, then we
      know that no CPU can set C.  If the HPTE was previously writable and
      the page was modified, any C bit writeback would have been flushed out
      by the tlbie that we did when changing the HPTE to read-only.
      
      Otherwise we need to do a TLB invalidation even if the C bit is 0, and
      then check the C bit.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
    • KVM: PPC: Book3S HV: Fix dirty map for hugepages · 687414be
      Alexey Kardashevskiy committed
      The dirty map that we construct for the KVM_GET_DIRTY_LOG ioctl has
      one bit per system page (4K/64K).  Currently, we only set one bit in
      the map for each HPT entry with the Change bit set, even if the HPT is
      for a large page (e.g., 16MB).  Userspace then considers only the
      first system page dirty, though in fact the guest may have modified
      anywhere in the large page.
      
      To fix this, we make kvm_test_clear_dirty() return the actual number
      of pages that are dirty (and rename it to kvm_test_clear_dirty_npages()
      to emphasize that that's what it returns).  In kvmppc_hv_get_dirty_log()
      we then set that many bits in the dirty map.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
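The change described above, setting one bit per system page covered by a dirty huge page, can be sketched like this. The function names and the bitmap layout are assumptions made for illustration and do not reproduce kvmppc_hv_get_dirty_log().

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Mark npages consecutive system pages dirty, starting at page index first. */
void mark_dirty_range(unsigned long *map, size_t first, size_t npages)
{
    for (size_t i = first; i < first + npages; i++)
        map[i / BITS_PER_WORD] |= 1UL << (i % BITS_PER_WORD);
}

/* Count the dirty bits among the first nbits bits of the map. */
size_t count_dirty(const unsigned long *map, size_t nbits)
{
    size_t n = 0;
    for (size_t i = 0; i < nbits; i++)
        if (map[i / BITS_PER_WORD] & (1UL << (i % BITS_PER_WORD)))
            n++;
    return n;
}
```

For a 16MB huge page on a host with 64K system pages, npages would be 256, so 256 bits are set instead of one.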
    • KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address · 1066f772
      Paul Mackerras committed
      Currently, when a huge page is faulted in for a guest, we select the
      rmap chain to insert the HPTE into based on the guest physical address
      that the guest tried to access.  Since there is an rmap chain for each
      system page, there are many rmap chains for the area covered by a huge
      page (e.g. 256 for 16MB pages when PAGE_SIZE = 64kB), and the huge-page
      HPTE could end up in any one of them.
      
      For consistency, and to make the huge-page HPTEs easier to find, we now
      put huge-page HPTEs in the rmap chain corresponding to the base address
      of the huge page.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
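Selecting the rmap chain for the base address of a huge page amounts to rounding the guest frame number down to a huge-page boundary. A minimal sketch, assuming power-of-two page sizes and a hypothetical helper name:

```c
#include <stdint.h>

/* Return the guest frame number of the first system page in the huge
 * page that contains gfn. Both sizes are assumed to be powers of two,
 * with huge_size >= page_size. */
uint64_t rmap_base_gfn(uint64_t gfn, uint64_t huge_size, uint64_t page_size)
{
    uint64_t npages = huge_size / page_size; /* e.g. 256 for 16MB over 64kB */
    return gfn & ~(npages - 1);
}
```

Every access inside the same huge page then maps to the same chain, whichever system page the guest happened to touch.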
    • KVM: PPC: BOOK3S: HV: Prefer CMA region for hash page table allocation · 792fc497
      Aneesh Kumar K.V committed
      Today when KVM tries to reserve memory for the hash page table it
      allocates from the normal page allocator first. If that fails it
      falls back to CMA's reserved region. One of the side effects of
      this is that we could end up exhausting the page allocator and
      get linux into OOM conditions while we still have plenty of space
      available in CMA.
      
      This patch addresses this issue by first trying hash page table
      allocation from CMA's reserved region before falling back to the normal
      page allocator. So if we run out of memory, we really are out of memory.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Alexander Graf <agraf@suse.de>
  8. 29 Mar, 2014 1 commit
  9. 27 Jan, 2014 2 commits
    • KVM: PPC: Book3S HV: Basic little-endian guest support · d682916a
      Anton Blanchard committed
      We create a guest MSR from scratch when delivering exceptions in
      a few places.  Instead of extracting LPCR[ILE] and inserting it
      into MSR_LE each time, we simply create a new variable intr_msr which
      contains the entire MSR to use.  For a little-endian guest, userspace
      needs to set the ILE (interrupt little-endian) bit in the LPCR for
      each vcpu (or at least one vcpu in each virtual core).
      
      [paulus@samba.org - removed H_SET_MODE implementation from original
      version of the patch, and made kvmppc_set_lpcr update vcpu->arch.intr_msr.]
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
    • KVM: PPC: Book3S: MMIO emulation support for little endian guests · 73601775
      Cédric Le Goater committed
      MMIO emulation reads the last instruction executed by the guest
      and then emulates it. If the guest is running in little-endian order,
      or more generally in an endian order different from that of the host,
      the instruction needs to be byte-swapped before being emulated.
      
      This patch adds a helper routine which tests the endian order of
      the host and the guest in order to decide whether a byteswap is
      needed or not. It is then used to byteswap the last instruction
      of the guest in the endian order of the host before MMIO emulation
      is performed.
      
      Finally, kvmppc_handle_load() and kvmppc_handle_store() are modified
      to reverse the endianness of the MMIO if required.
      Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
      [agraf: add booke handling]
      Signed-off-by: Alexander Graf <agraf@suse.de>
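The helper described in this commit compares host and guest endianness and swaps the fetched instruction only when they differ. A minimal sketch under that reading, with hypothetical names that are not taken from the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* A swap is needed exactly when host and guest disagree on byte order. */
bool need_byteswap(bool host_is_le, bool guest_is_le)
{
    return host_is_le != guest_is_le;
}

/* Reverse the four bytes of a 32-bit word. */
uint32_t swap32_bytes(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Bring a raw guest instruction into host byte order before emulation. */
uint32_t fixup_inst(uint32_t raw, bool host_is_le, bool guest_is_le)
{
    return need_byteswap(host_is_le, guest_is_le) ? swap32_bytes(raw) : raw;
}
```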
  10. 19 Nov, 2013 2 commits
    • powerpc: kvm: fix rare but potential deadlock scene · 91648ec0
      pingfan liu committed
      Since kvmppc_hv_find_lock_hpte() is called from both virtmode and
      realmode, it can trigger a deadlock.
      
      Suppose the following scenario:
      
      Two physical cpus, cpuM and cpuN, and two VM instances, A and B, each
      with a group of vcpus.
      
      If on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK) and is then switched
      out, and on cpuN, vcpu_A_2 tries to lock X in realmode, then cpuN will be
      stuck in realmode for a long time.
      
      Things get even worse if the following happens:
        On cpuM, bitlock X is held; on cpuN, Y is held.
        vcpu_B_2 tries to lock Y on cpuM in realmode
        vcpu_A_2 tries to lock X on cpuN in realmode
      
      Oops! A deadlock occurs.
      Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
      Reviewed-by: Paul Mackerras <paulus@samba.org>
      CC: stable@vger.kernel.org
      Signed-off-by: Alexander Graf <agraf@suse.de>
    • KVM: PPC: Book3S HV: Fix physical address calculations · caaa4c80
      Paul Mackerras committed
      This fixes a bug in kvmppc_do_h_enter() where the physical address
      for a page can be calculated incorrectly if transparent huge pages
      (THP) are active.  Until THP came along, it was true that if we
      encountered a large (16M) page in kvmppc_do_h_enter(), then the
      associated memslot must be 16M aligned for both its guest physical
      address and the userspace address, and the physical address
      calculations in kvmppc_do_h_enter() assumed that.  With THP, that
      is no longer true.
      
      In the case where we are using MMU notifiers and the page size that
      we get from the Linux page tables is larger than the page being mapped
      by the guest, we need to fill in some low-order bits of the physical
      address.  Without THP, these bits would be the same in the guest
      physical address (gpa) and the host virtual address (hva).  With THP,
      they can be different, and we need to use the bits from hva rather
      than gpa.
      
      In the case where we are not using MMU notifiers, the host physical
      address we get from the memslot->arch.slot_phys[] array already
      includes the low-order bits down to the PAGE_SIZE level, even if
      we are using large pages.  Thus we can simplify the calculation in
      this case to just add in the remaining bits in the case where
      PAGE_SIZE is 64k and the guest is mapping a 4k page.
      
      The same bug exists in kvmppc_book3s_hv_page_fault().  The basic fix
      is to use psize (the page size from the HPTE) rather than pte_size
      (the page size from the Linux PTE) when updating the HPTE low word
      in r.  That means that pfn needs to be computed to PAGE_SIZE
      granularity even if the Linux PTE is a huge page PTE.  That can be
      arranged simply by doing the page_to_pfn() before setting page to
      the head of the compound page.  If psize is less than PAGE_SIZE,
      then we need to make sure we only update the bits from PAGE_SIZE
      upwards, in order not to lose any sub-page offset bits in r.
      On the other hand, if psize is greater than PAGE_SIZE, we need to
      make sure we don't bring in non-zero low order bits in pfn, hence
      we mask (pfn << PAGE_SHIFT) with ~(psize - 1).
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
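The masking rule in the last paragraph of this commit message can be sketched as below. This is a simplification under stated assumptions: the host uses 64k pages (PAGE_SHIFT of 16), `r` is treated as a plain address plus low-order bits, the high-order flag bits of a real HPTE are ignored, and `update_hpte_r` is a hypothetical name.

```c
#include <stdint.h>

#define PAGE_SHIFT 16 /* assumption: 64k host pages */

/* Merge the host page frame into r. Bits of r below the effective page
 * size are kept; bits of the physical address below it are dropped.
 * psize is the HPTE page size and must be a power of two. */
uint64_t update_hpte_r(uint64_t r, uint64_t pfn, uint64_t psize)
{
    uint64_t page_size = 1ULL << PAGE_SHIFT;
    /* For psize < PAGE_SIZE, only bits from PAGE_SIZE upwards change. */
    uint64_t eff = psize < page_size ? page_size : psize;
    uint64_t pa = pfn << PAGE_SHIFT;

    return (r & (eff - 1)) | (pa & ~(eff - 1));
}
```

With a 4k HPTE the sub-page offset bits in r survive, and with a 16M HPTE any non-zero low-order bits in pfn are masked away.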
  11. 17 Oct, 2013 3 commits
    • kvm: powerpc: Add kvmppc_ops callback · 3a167bea
      Aneesh Kumar K.V committed
      This patch adds a new callback, kvmppc_ops. This will help us enable
      both HV and PR KVM together in the same kernel. The actual change to
      enable them together is done in a later patch in the series.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      [agraf: squash in booke changes]
      Signed-off-by: Alexander Graf <agraf@suse.de>
    • KVM: PPC: Book3S PR: Better handling of host-side read-only pages · 93b159b4
      Paul Mackerras committed
      Currently we request write access to all pages that get mapped into the
      guest, even if the guest is only loading from the page.  This reduces
      the effectiveness of KSM because it means that we unshare every page we
      access.  Also, we always set the changed (C) bit in the guest HPTE if
      it allows writing, even for a guest load.
      
      This fixes both these problems.  We pass an 'iswrite' flag to the
      mmu.xlate() functions and to kvmppc_mmu_map_page() to indicate whether
      the access is a load or a store.  The mmu.xlate() functions now only
      set C for stores.  kvmppc_gfn_to_pfn() now calls gfn_to_pfn_prot()
      instead of gfn_to_pfn() so that it can indicate whether we need write
      access to the page, and get back a 'writable' flag to indicate whether
      the page is writable or not.  If that 'writable' flag is clear, we then
      make the host HPTE read-only even if the guest HPTE allowed writing.
      
      This means that we can get a protection fault when the guest writes to a
      page that it has mapped read-write but which is read-only on the host
      side (perhaps due to KSM having merged the page).  Thus we now call
      kvmppc_handle_pagefault() for protection faults as well as HPTE not found
      faults.  In kvmppc_handle_pagefault(), if the access was allowed by the
      guest HPTE and we thus need to install a new host HPTE, we then need to
      remove the old host HPTE if there is one.  This is done with a new
      function, kvmppc_mmu_unmap_page(), which uses kvmppc_mmu_pte_vflush() to
      find and remove the old host HPTE.
      
      Since the memslot-related functions require the KVM SRCU read lock to
      be held, this adds srcu_read_lock/unlock pairs around the calls to
      kvmppc_handle_pagefault().
      
      Finally, this changes kvmppc_mmu_book3s_32_xlate_pte() to not ignore
      guest HPTEs that don't permit access, and to return -EPERM for accesses
      that are not permitted by the page protections.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
    • KVM: PPC: Book3S HV: Store LPCR value for each virtual core · a0144e2a
      Paul Mackerras committed
      This adds the ability to have a separate LPCR (Logical Partitioning
      Control Register) value relating to a guest for each virtual core,
      rather than only having a single value for the whole VM.  This
      corresponds to what real POWER hardware does, where there is a LPCR
      per CPU thread but most of the fields are required to have the same
      value on all active threads in a core.
      
      The per-virtual-core LPCR can be read and written using the
      GET/SET_ONE_REG interface.  Userspace can only modify the
      following fields of the LPCR value:
      
      DPFD	Default prefetch depth
      ILE	Interrupt little-endian
      TC	Translation control (secondary HPT hash group search disable)
      
      We still maintain a per-VM default LPCR value in kvm->arch.lpcr, which
      contains bits relating to memory management, i.e. the Virtualized
      Partition Memory (VPM) bits and the bits relating to guest real mode.
      When this default value is updated, the update needs to be propagated
      to the per-vcore values, so we add a kvmppc_update_lpcr() helper to do
      that.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      [agraf: fix whitespace]
      Signed-off-by: Alexander Graf <agraf@suse.de>
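Restricting userspace to the DPFD, ILE and TC fields is a standard masked update. In the sketch below the bit positions are illustrative placeholders rather than authoritative values from the kernel headers, and `set_lpcr_from_user` is a hypothetical name.

```c
#include <stdint.h>

/* Placeholder bit positions, for illustration only. */
#define LPCR_DPFD (7ULL << 52)   /* default prefetch depth */
#define LPCR_ILE  0x02000000ULL  /* interrupt little-endian */
#define LPCR_TC   0x00000200ULL  /* translation control */

#define LPCR_USER_MASK (LPCR_DPFD | LPCR_ILE | LPCR_TC)

/* Take the user-modifiable fields from requested and everything else
 * from the current value. */
uint64_t set_lpcr_from_user(uint64_t current, uint64_t requested)
{
    return (current & ~LPCR_USER_MASK) | (requested & LPCR_USER_MASK);
}
```

Bits outside the mask in the requested value are silently ignored in this sketch; whether to ignore or reject them is a design choice the commit message does not spell out.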
  12. 26 Aug, 2013 1 commit
  13. 08 Jul, 2013 2 commits
  14. 21 Jun, 2013 1 commit
  15. 27 Apr, 2013 2 commits
    • KVM: PPC: Book3S HV: Report VPA and DTL modifications in dirty map · c35635ef
      Paul Mackerras committed
      At present, the KVM_GET_DIRTY_LOG ioctl doesn't report modifications
      done by the host to the virtual processor areas (VPAs) and dispatch
      trace logs (DTLs) registered by the guest.  This is because those
      modifications are done either in real mode or in the host kernel
      context, and in neither case does the access go through the guest's
      HPT, and thus no change (C) bit gets set in the guest's HPT.
      
      However, the changes done by the host do need to be tracked so that
      the modified pages get transferred when doing live migration.  In
      order to track these modifications, this adds a dirty flag to the
      struct representing the VPA/DTL areas, and arranges to set the flag
      when the VPA/DTL gets modified by the host.  Then, when we are
      collecting the dirty log, we also check the dirty flags for the
      VPA and DTL for each vcpu and set the relevant bit in the dirty log
      if necessary.  Doing this also means we now need to keep track of
      the guest physical address of the VPA/DTL areas.
      
      So as not to lose track of modifications to a VPA/DTL area when it gets
      unregistered, or when a new area gets registered in its place, we need
      to transfer the dirty state to the rmap chain.  This adds code to
      kvmppc_unpin_guest_page() to do that if the area was dirty.  To simplify
      that code, we now require that all VPA, DTL and SLB shadow buffer areas
      fit within a single host page.  Guests already comply with this
      requirement because pHyp requires that these areas not cross a 4k
      boundary.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
    • KVM: PPC: Book3S HV: Make HPT reading code notice R/C bit changes · a1b4a0f6
      Paul Mackerras committed
      At present, the code that determines whether a HPT entry has changed,
      and thus needs to be sent to userspace when it is copying the HPT,
      doesn't consider a hardware update to the reference and change bits
      (R and C) in the HPT entries to constitute a change that needs to
      be sent to userspace.  This adds code to check for changes in R and C
      when we are scanning the HPT to find changed entries, and adds code
      to set the changed flag for the HPTE when we update the R and C bits
      in the guest view of the HPTE.
      
      Since we now need to set the HPTE changed flag in book3s_64_mmu_hv.c
      as well as book3s_hv_rm_mmu.c, we move the note_hpte_modification()
      function into kvm_book3s_64.h.
      
      Current Linux guest kernels don't use the hardware updates of R and C
      in the HPT, so this change won't affect them.  Linux (or other) kernels
      might in future want to use the R and C bits and have them correctly
      transferred across when a guest is migrated, so it is better to correct
      this deficiency.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
  16. 10 Apr, 2013 1 commit
  17. 06 Dec, 2012 5 commits
    • KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations · 1b400ba0
      Paul Mackerras committed
      When we change or remove a HPT (hashed page table) entry, we can do
      either a global TLB invalidation (tlbie) that works across the whole
      machine, or a local invalidation (tlbiel) that only affects this core.
      Currently we do local invalidations if the VM has only one vcpu or if
      the guest requests it with the H_LOCAL flag, though the guest Linux
      kernel currently doesn't ever use H_LOCAL.  Then, to cope with the
      possibility that vcpus moving around to different physical cores might
      expose stale TLB entries, there is some code in kvmppc_hv_entry to
      flush the whole TLB of entries for this VM if either this vcpu is now
      running on a different physical core from where it last ran, or if this
      physical core last ran a different vcpu.
      
      There are a number of problems on POWER7 with this as it stands:
      
      - The TLB invalidation is done per thread, whereas it only needs to be
        done per core, since the TLB is shared between the threads.
      - With the possibility of the host paging out guest pages, the use of
        H_LOCAL by an SMP guest is dangerous since the guest could possibly
        retain and use a stale TLB entry pointing to a page that had been
        removed from the guest.
      - The TLB invalidations that we do when a vcpu moves from one physical
        core to another are unnecessary in the case of an SMP guest that isn't
        using H_LOCAL.
      - The optimization of using local invalidations rather than global should
        apply to guests with one virtual core, not just one vcpu.
      
      (None of this applies on PPC970, since there we always have to
      invalidate the whole TLB when entering and leaving the guest, and we
      can't support paging out guest memory.)
      
      To fix these problems and simplify the code, we now maintain a simple
      cpumask of which cpus need to flush the TLB on entry to the guest.
      (This is indexed by cpu, though we only ever use the bits for thread
      0 of each core.)  Whenever we do a local TLB invalidation, we set the
      bits for every cpu except the bit for thread 0 of the core that we're
      currently running on.  Whenever we enter a guest, we test and clear the
      bit for our core, and flush the TLB if it was set.
      
      On initial startup of the VM, and when resetting the HPT, we set all the
      bits in the need_tlb_flush cpumask, since any core could potentially have
      stale TLB entries from the previous VM to use the same LPID, or the
      previous contents of the HPT.
      
      Then, we maintain a count of the number of online virtual cores, and use
      that when deciding whether to use a local invalidation rather than the
      number of online vcpus.  The code to make that decision is extracted out
      into a new function, global_invalidates().  For multi-core guests on
      POWER7 (i.e. when we are using mmu notifiers), we now never do local
      invalidations regardless of the H_LOCAL flag.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
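The need_tlb_flush bookkeeping described above can be modelled with a plain bitmask. This is a single-threaded sketch with hypothetical names: it uses one bit per core in a 64-bit word (so core must be below 64) rather than a real cpumask, and it has none of the atomicity or memory barriers the kernel would need.

```c
#include <stdbool.h>
#include <stdint.h>

struct vm_tlb_state {
    uint64_t need_tlb_flush; /* bit n set: core n must flush on guest entry */
};

/* VM startup or HPT reset: any core may hold stale entries. */
void vm_reset(struct vm_tlb_state *vm)
{
    vm->need_tlb_flush = ~0ULL;
}

/* After a local invalidation on this_core, every other core may be stale. */
void note_local_invalidation(struct vm_tlb_state *vm, unsigned int this_core)
{
    vm->need_tlb_flush |= ~(1ULL << this_core);
}

/* On guest entry: test and clear our bit; true means flush the TLB now. */
bool enter_guest_needs_flush(struct vm_tlb_state *vm, unsigned int core)
{
    uint64_t bit = 1ULL << core;
    bool flush = (vm->need_tlb_flush & bit) != 0;

    vm->need_tlb_flush &= ~bit;
    return flush;
}
```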
    • KVM: PPC: Book3S HV: Report correct HPT entry index when reading HPT · 05dd85f7
      Paul Mackerras committed
      This fixes a bug in the code which allows userspace to read out the
      contents of the guest's hashed page table (HPT).  On the second and
      subsequent passes through the HPT, when we are reporting only those
      entries that have changed, we were incorrectly initializing the index
      field of the header with the index of the first entry we skipped
      rather than the first changed entry.  This fixes it.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
    • KVM: PPC: Book3S HV: Reset reverse-map chains when resetting the HPT · a64fd707
      Paul Mackerras committed
      With HV-style KVM, we maintain reverse-mapping lists that enable us to
      find all the HPT (hashed page table) entries that reference each guest
      physical page, with the heads of the lists in the memslot->arch.rmap
      arrays.  When we reset the HPT (i.e. when we reboot the VM), we clear
      out all the HPT entries but we were not clearing out the reverse
      mapping lists.  The result is that as we create new HPT entries, the
      lists get corrupted, which can easily lead to loops, resulting in the
      host kernel hanging when it tries to traverse those lists.
      
      This fixes the problem by zeroing out all the reverse mapping lists
      when we zero out the HPT.  This incidentally means that we are also
      zeroing our record of the referenced and changed bits (not the bits
      in the Linux PTEs, used by the Linux MM subsystem, but the bits used
      by the KVM_GET_DIRTY_LOG ioctl, and those used by kvm_age_hva() and
      kvm_test_age_hva()).
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
    • KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT · a2932923
      Paul Mackerras committed
      A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor.  Reads on
      this fd return the contents of the HPT (hashed page table), writes
      create and/or remove entries in the HPT.  There is a new capability,
      KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl.  The ioctl
      takes an argument structure with the index of the first HPT entry to
      read out and a set of flags.  The flags indicate whether the user is
      intending to read or write the HPT, and whether to return all entries
      or only the "bolted" entries (those with the bolted bit, 0x10, set in
      the first doubleword).
      
      This is intended for use in implementing qemu's savevm/loadvm and for
      live migration.  Therefore, on reads, the first pass returns information
      about all HPTEs (or all bolted HPTEs).  When the first pass reaches the
      end of the HPT, it returns from the read.  Subsequent reads only return
      information about HPTEs that have changed since they were last read.
      A read that finds no changed HPTEs in the HPT following where the last
      read finished will return 0 bytes.
      
      The format of the data provides a simple run-length compression of the
      invalid entries.  Each block of data starts with a header that indicates
      the index (position in the HPT, which is just an array), the number of
      valid entries starting at that index (may be zero), and the number of
      invalid entries following those valid entries.  The valid entries, 16
      bytes each, follow the header.  The invalid entries are not explicitly
      represented.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      [agraf: fix documentation]
      Signed-off-by: Alexander Graf <agraf@suse.de>
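The stream format described in the last paragraph (a header, then the valid 16-byte entries, with invalid entries only counted) can be walked as sketched below. The header layout here is an assumption modelled on that description, with a 32-bit index and two 16-bit counts; `struct htab_header` and `count_valid_hptes` are hypothetical names, and real code should use the definitions from the kernel's KVM UAPI headers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed header layout, for illustration only. */
struct htab_header {
    uint32_t index;     /* position of the first entry in the HPT */
    uint16_t n_valid;   /* valid entries that follow this header */
    uint16_t n_invalid; /* invalid entries after those, not stored */
};

#define HPTE_SIZE 16 /* bytes per valid entry */

/* Walk a buffer read from the fd and return the number of valid HPTEs
 * it carries, or -1 if the buffer ends in the middle of a block. */
long count_valid_hptes(const unsigned char *buf, size_t len)
{
    long total = 0;
    size_t off = 0;

    while (off < len) {
        struct htab_header hdr;
        size_t payload;

        if (len - off < sizeof(hdr))
            return -1;
        memcpy(&hdr, buf + off, sizeof(hdr));
        off += sizeof(hdr);

        payload = (size_t)hdr.n_valid * HPTE_SIZE;
        if (len - off < payload)
            return -1;
        off += payload; /* invalid entries occupy no bytes */
        total += hdr.n_valid;
    }
    return total;
}
```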
    • KVM: PPC: Book3S HV: Restructure HPT entry creation code · 7ed661bf
      Paul Mackerras committed
      This restructures the code that creates HPT (hashed page table)
      entries so that it can be called in situations where we don't have a
      struct vcpu pointer, only a struct kvm pointer.  It also fixes a bug
      where kvmppc_map_vrma() would corrupt the guest R4 value.
      
      Most of the work of kvmppc_virtmode_h_enter is now done by a new
      function, kvmppc_virtmode_do_h_enter, which itself calls another new
      function, kvmppc_do_h_enter, which contains most of the old
      kvmppc_h_enter.  The new kvmppc_do_h_enter takes explicit arguments
      for the place to return the HPTE index, the Linux page tables to use,
      and whether it is being called in real mode, thus removing the need
      for it to have the vcpu as an argument.
      
      Currently kvmppc_map_vrma creates the VRMA (virtual real mode area)
      HPTEs by calling kvmppc_virtmode_h_enter, which is designed primarily
      to handle H_ENTER hcalls from the guest that need to pin a page of
      memory.  Since H_ENTER returns the index of the created HPTE in R4,
      kvmppc_virtmode_h_enter updates the guest R4, corrupting the guest R4
      in the case when it gets called from kvmppc_map_vrma on the first
      VCPU_RUN ioctl.  With this, kvmppc_map_vrma instead calls
      kvmppc_virtmode_do_h_enter with the address of a dummy word as the
      place to store the HPTE index, thus avoiding corrupting the guest R4.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
  18. 23 Oct, 2012 1 commit
  19. 06 Oct, 2012 4 commits
    • KVM: PPC: Book3S HV: Fix calculation of guest phys address for MMIO emulation · 70bddfef
      Paul Mackerras committed
      In the case where the host kernel is using a 64kB base page size and
      the guest uses a 4k HPTE (hashed page table entry) to map an emulated
      MMIO device, we were calculating the guest physical address wrongly.
      We were calculating a gfn as the guest physical address shifted right
      16 bits (PAGE_SHIFT) but then only adding back in 12 bits from the
      effective address, since the HPTE had a 4k page size.  Thus the gpa
      reported to userspace was missing 4 bits.
      
      Instead, we now compute the guest physical address from the HPTE
      without reference to the host page size, and then compute the gfn
      by shifting the gpa right PAGE_SHIFT bits.
      Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
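The four lost bits are easy to see in a small worked example. The sketch assumes a 64kB host page size (PAGE_SHIFT of 16) and treats the HPTE's real page address as a plain 64-bit value; all function names are hypothetical.

```c
#include <stdint.h>

#define PAGE_SHIFT 16 /* assumption: 64kB host pages */

/* Fixed calculation: build the gpa from the HPTE page address and the
 * offset within the HPTE's own page size (psize, a power of two). */
uint64_t gpa_from_hpte(uint64_t hpte_page_addr, uint64_t ea, uint64_t psize)
{
    return (hpte_page_addr & ~(psize - 1)) | (ea & (psize - 1));
}

/* The gfn is then derived from the gpa using the host page size. */
uint64_t gfn_from_gpa(uint64_t gpa)
{
    return gpa >> PAGE_SHIFT;
}

/* Old calculation, kept for comparison: shifting by PAGE_SHIFT first
 * discards bits 12..15 when the HPTE maps only a 4k page. */
uint64_t buggy_gpa(uint64_t hpte_page_addr, uint64_t ea, uint64_t psize)
{
    uint64_t gfn = hpte_page_addr >> PAGE_SHIFT;

    return (gfn << PAGE_SHIFT) | (ea & (psize - 1));
}
```

For a 4k HPTE at 0x1234F000 and an effective address ending in 0xABC, the fixed path gives 0x1234FABC while the old path gives 0x12340ABC.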
    • KVM: PPC: Book3S HV: Handle memory slot deletion and modification correctly · dfe49dbd
      Paul Mackerras committed
      This adds an implementation of kvm_arch_flush_shadow_memslot for
      Book3S HV, and arranges for kvmppc_core_commit_memory_region to
      flush the dirty log when modifying an existing slot.  With this,
      we can handle deletion and modification of memory slots.
      
      kvm_arch_flush_shadow_memslot calls kvmppc_core_flush_memslot, which
      on Book3S HV now traverses the reverse map chains to remove any HPT
      (hashed page table) entries referring to pages in the memslot.  This
      gets called by generic code whenever deleting a memslot or changing
      the guest physical address for a memslot.
      
      We flush the dirty log in kvmppc_core_commit_memory_region for
      consistency with what x86 does.  We only need to flush when an
      existing memslot is being modified, because for a new memslot the
      rmap array (which stores the dirty bits) is all zero, meaning that
      every page is considered clean already, and when deleting a memslot
      we obviously don't care about the dirty bits any more.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      dfe49dbd
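The reasoning about when the dirty log needs flushing reduces to a small decision rule. The sketch below is illustrative only: `needs_dirty_log_flush` and its page-count parameters are hypothetical names, and it assumes that a slot with zero pages stands for "no slot" (newly created when the old count is zero, deleted when the new count is zero).

```python
def needs_dirty_log_flush(old_npages, new_npages):
    """Return True only when an existing memslot is being modified.

    Assumed encoding: old_npages == 0 means the slot is new, so its rmap
    array is all zero and every page already counts as clean;
    new_npages == 0 means the slot is being deleted, so its dirty bits
    no longer matter.
    """
    return old_npages > 0 and new_npages > 0


if __name__ == "__main__":
    for old, new in [(0, 256), (256, 0), (256, 512)]:
        print(old, new, needs_dirty_log_flush(old, new))
```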
    • P
      KVM: PPC: Move kvm->arch.slot_phys into memslot.arch · a66b48c3
      Paul Mackerras committed
      Now that we have an architecture-specific field in the kvm_memory_slot
      structure, we can use it to store the array of page physical addresses
      that we need for Book3S HV KVM on PPC970 processors.  This reduces the
      size of struct kvm_arch for Book3S HV, and also reduces the size of
      struct kvm_arch_memory_slot for other PPC KVM variants since the fields
      in it are now only compiled in for Book3S HV.
      
      This necessitates making the kvm_arch_create_memslot and
      kvm_arch_free_memslot operations specific to each PPC KVM variant.
      That in turn means that we now don't allocate the rmap arrays on
      Book3S PR and Book E.
      
      Since we now unpin pages and free the slot_phys array in
      kvmppc_core_free_memslot, we no longer need to do it in
      kvmppc_core_destroy_vm, since the generic code takes care to free
      all the memslots when destroying a VM.
      
      We now need the new memslot to be passed in to
      kvmppc_core_prepare_memory_region, since we need to initialize its
      arch.slot_phys member on Book3S HV.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      a66b48c3
    • P
      KVM: PPC: Book3S HV: Take the SRCU read lock before looking up memslots · 2c9097e4
      Paul Mackerras committed
      The generic KVM code uses SRCU (sleeping RCU) to protect accesses
      to the memslots data structures against updates due to userspace
      adding, modifying or removing memory slots.  We need to do that too,
      both to avoid accessing stale copies of the memslots and to avoid
      lockdep warnings.  This therefore adds srcu_read_lock/unlock pairs
      around code that accesses and uses memslots.
      
      Since the real-mode handlers for H_ENTER, H_REMOVE and H_BULK_REMOVE
      need to access the memslots, and we don't want to call the SRCU code
      in real mode (since we have no assurance that it would only access
      the linear mapping), we hold the SRCU read lock for the VM while
      in the guest.  This does mean that adding or removing memory slots
      while some vcpus are executing in the guest will block for up to
      two jiffies.  This tradeoff is acceptable since adding/removing
      memory slots only happens rarely, while H_ENTER/H_REMOVE/H_BULK_REMOVE
      are performance-critical hot paths.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      2c9097e4
  20. 06 Aug, 2012 1 commit
  21. 19 Jul, 2012 2 commits
    • T
      KVM: Introduce kvm_unmap_hva_range() for kvm_mmu_notifier_invalidate_range_start() · b3ae2096
      Takuya Yoshikawa committed
      When we tested KVM under memory pressure, with THP enabled on the host,
      we noticed that the MMU notifier took a long time to invalidate huge pages.
      
      Since the invalidation was done with mmu_lock held, it not only wasted
      CPU time but also made the host less responsive.
      
      This patch mitigates this by using kvm_handle_hva_range().
      Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
      Cc: Alexander Graf <agraf@suse.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      b3ae2096
    • T
      KVM: MMU: Make kvm_handle_hva() handle range of addresses · 84504ef3
      Takuya Yoshikawa committed
      When the guest's memory is backed by THP pages, the MMU notifier needs to
      call kvm_unmap_hva(), which in turn leads to kvm_handle_hva(), in a loop to
      invalidate the range of pages which constitute one huge page:
      
        for each page
          for each memslot
            if page is in memslot
              unmap using rmap
      
      This means that although every page in that range is expected to be found
      in the same memslot, we are forced to check unrelated memslots many times.
      The more memslots the guest has, the worse this becomes.
      
      Furthermore, if the range does not include any pages in the guest's
      memory, the loop over the pages will just consume extra time.
      
      This patch, together with the following patches, solves this problem by
      introducing kvm_handle_hva_range() which makes the loop look like this:
      
        for each memslot
          for each page in memslot
            unmap using rmap
      
      In this new processing, the actual work is converted to a loop over rmap
      which is much more cache friendly than before.
      Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
      Cc: Alexander Graf <agraf@suse.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      84504ef3
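The two loop orders from this commit message can be compared with a small model. This is a sketch under stated assumptions, not the kernel implementation: the `Memslot` class, the function names, the 4 kB page size and the sample layout are all hypothetical, and addresses are assumed to be page-aligned. It counts how many memslot checks each ordering performs for the same address range.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size for the model


@dataclass(frozen=True)
class Memslot:
    base_hva: int  # first host virtual address covered by the slot
    npages: int    # number of pages in the slot

    @property
    def end_hva(self):
        return self.base_hva + self.npages * PAGE_SIZE


def unmap_per_page(slots, start, end):
    """Old order: for each page, search every memslot."""
    unmapped, checks = [], 0
    for hva in range(start, end, PAGE_SIZE):
        for slot in slots:
            checks += 1
            if slot.base_hva <= hva < slot.end_hva:
                unmapped.append(hva)
    return unmapped, checks


def unmap_range(slots, start, end):
    """New order: for each memslot, walk only the overlapping pages."""
    unmapped, checks = [], 0
    for slot in slots:
        checks += 1
        lo = max(start, slot.base_hva)
        hi = min(end, slot.end_hva)
        # range() is empty when the slot does not overlap [start, end)
        unmapped.extend(range(lo, hi, PAGE_SIZE))
    return unmapped, checks


if __name__ == "__main__":
    slots = [Memslot(0x0, 16), Memslot(0x100000, 512), Memslot(0x400000, 16)]
    start, end = 0x100000, 0x100000 + 512 * PAGE_SIZE  # one 2 MB huge page
    print(unmap_per_page(slots, start, end)[1])
    print(unmap_range(slots, start, end)[1])
```

Both functions unmap the same pages. The per-page version performs one memslot check per page per slot, while the range version performs one check per slot and then walks a contiguous run of pages, which is the access pattern the commit describes as more cache friendly.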