1. 14 12月, 2012 1 次提交
  2. 06 12月, 2012 3 次提交
    • P
      KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking · b4072df4
      Paul Mackerras 提交于
      Currently, if a machine check interrupt happens while we are in the
      guest, we exit the guest and call the host's machine check handler,
      which tends to cause the host to panic.  Some machine checks can be
      triggered by the guest; for example, if the guest creates two entries
      in the SLB that map the same effective address, and then accesses that
      effective address, the CPU will take a machine check interrupt.
      
      To handle this better, when a machine check happens inside the guest,
      we call a new function, kvmppc_realmode_machine_check(), while still in
      real mode before exiting the guest.  On POWER7, it handles the cases
      that the guest can trigger, either by flushing and reloading the SLB,
      or by flushing the TLB, and then it delivers the machine check interrupt
      directly to the guest without going back to the host.  On POWER7, the
      OPAL firmware patches the machine check interrupt vector so that it
      gets control first, and it leaves behind its analysis of the situation
      in a structure pointed to by the opal_mc_evt field of the paca.  The
      kvmppc_realmode_machine_check() function looks at this, and if OPAL
      reports that there was no error, or that it has handled the error, we
      also go straight back to the guest with a machine check.  We have to
      deliver a machine check to the guest since the machine check interrupt
      might have trashed valid values in SRR0/1.
      
      If the machine check is one we can't handle in real mode, and one that
      OPAL hasn't already handled, or on PPC970, we exit the guest and call
      the host's machine check handler.  We do this by jumping to the
      machine_check_fwnmi label, rather than absolute address 0x200, because
      we don't want to re-execute OPAL's handler on POWER7.  On PPC970, the
      two are equivalent because address 0x200 just contains a branch.
      
      Then, if the host machine check handler decides that the system can
      continue executing, kvmppc_handle_exit() delivers a machine check
      interrupt to the guest -- once again to let the guest know that SRR0/1
      have been modified.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      [agraf: fix checkpatch warnings]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      b4072df4
    • P
      KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations · 1b400ba0
      Paul Mackerras 提交于
      When we change or remove a HPT (hashed page table) entry, we can do
      either a global TLB invalidation (tlbie) that works across the whole
      machine, or a local invalidation (tlbiel) that only affects this core.
      Currently we do local invalidations if the VM has only one vcpu or if
      the guest requests it with the H_LOCAL flag, though the guest Linux
      kernel currently doesn't ever use H_LOCAL.  Then, to cope with the
      possibility that vcpus moving around to different physical cores might
      expose stale TLB entries, there is some code in kvmppc_hv_entry to
      flush the whole TLB of entries for this VM if either this vcpu is now
      running on a different physical core from where it last ran, or if this
      physical core last ran a different vcpu.
      
      There are a number of problems on POWER7 with this as it stands:
      
      - The TLB invalidation is done per thread, whereas it only needs to be
        done per core, since the TLB is shared between the threads.
      - With the possibility of the host paging out guest pages, the use of
        H_LOCAL by an SMP guest is dangerous since the guest could possibly
        retain and use a stale TLB entry pointing to a page that had been
        removed from the guest.
      - The TLB invalidations that we do when a vcpu moves from one physical
        core to another are unnecessary in the case of an SMP guest that isn't
        using H_LOCAL.
      - The optimization of using local invalidations rather than global should
        apply to guests with one virtual core, not just one vcpu.
      
      (None of this applies on PPC970, since there we always have to
      invalidate the whole TLB when entering and leaving the guest, and we
      can't support paging out guest memory.)
      
      To fix these problems and simplify the code, we now maintain a simple
      cpumask of which cpus need to flush the TLB on entry to the guest.
      (This is indexed by cpu, though we only ever use the bits for thread
      0 of each core.)  Whenever we do a local TLB invalidation, we set the
      bits for every cpu except the bit for thread 0 of the core that we're
      currently running on.  Whenever we enter a guest, we test and clear the
      bit for our core, and flush the TLB if it was set.
      
      On initial startup of the VM, and when resetting the HPT, we set all the
      bits in the need_tlb_flush cpumask, since any core could potentially have
      stale TLB entries from the previous VM to use the same LPID, or the
      previous contents of the HPT.
      
      Then, we maintain a count of the number of online virtual cores, and use
      that when deciding whether to use a local invalidation rather than the
      number of online vcpus.  The code to make that decision is extracted out
      into a new function, global_invalidates().  For multi-core guests on
      POWER7 (i.e. when we are using mmu notifiers), we now never do local
      invalidations regardless of the H_LOCAL flag.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      1b400ba0
    • P
      KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT · a2932923
      Paul Mackerras 提交于
      A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor.  Reads on
      this fd return the contents of the HPT (hashed page table), writes
      create and/or remove entries in the HPT.  There is a new capability,
      KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl.  The ioctl
      takes an argument structure with the index of the first HPT entry to
      read out and a set of flags.  The flags indicate whether the user is
      intending to read or write the HPT, and whether to return all entries
      or only the "bolted" entries (those with the bolted bit, 0x10, set in
      the first doubleword).
      
      This is intended for use in implementing qemu's savevm/loadvm and for
      live migration.  Therefore, on reads, the first pass returns information
      about all HPTEs (or all bolted HPTEs).  When the first pass reaches the
      end of the HPT, it returns from the read.  Subsequent reads only return
      information about HPTEs that have changed since they were last read.
      A read that finds no changed HPTEs in the HPT following where the last
      read finished will return 0 bytes.
      
      The format of the data provides a simple run-length compression of the
      invalid entries.  Each block of data starts with a header that indicates
      the index (position in the HPT, which is just an array), the number of
      valid entries starting at that index (may be zero), and the number of
      invalid entries following those valid entries.  The valid entries, 16
      bytes each, follow the header.  The invalid entries are not explicitly
      represented.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      [agraf: fix documentation]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      a2932923
  3. 30 10月, 2012 7 次提交
    • P
      KVM: PPC: Book3S HV: Allow DTL to be set to address 0, length 0 · 9f8c8c78
      Paul Mackerras 提交于
      Commit 55b665b0 ("KVM: PPC: Book3S HV: Provide a way for userspace
      to get/set per-vCPU areas") includes a check on the length of the
      dispatch trace log (DTL) to make sure the buffer is at least one entry
      long.  This is appropriate when registering a buffer, but the
      interface also allows for any existing buffer to be unregistered by
      specifying a zero address.  In this case the length check is not
      appropriate.  This makes the check conditional on the address being
      non-zero.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      9f8c8c78
    • P
      KVM: PPC: Book3S HV: Fix accounting of stolen time · c7b67670
      Paul Mackerras 提交于
      Currently the code that accounts stolen time tends to overestimate the
      stolen time, and will sometimes report more stolen time in a DTL
      (dispatch trace log) entry than has elapsed since the last DTL entry.
      This can cause guests to underflow the user or system time measured
      for some tasks, leading to ridiculous CPU percentages and total runtimes
      being reported by top and other utilities.
      
      In addition, the current code was designed for the previous policy where
      a vcore would only run when all the vcpus in it were runnable, and so
      only counted stolen time on a per-vcore basis.  Now that a vcore can
      run while some of the vcpus in it are doing other things in the kernel
      (e.g. handling a page fault), we need to count the time when a vcpu task
      is preempted while it is not running as part of a vcore as stolen also.
      
      To do this, we bring back the BUSY_IN_HOST vcpu state and extend the
      vcpu_load/put functions to count preemption time while the vcpu is
      in that state.  Handling the transitions between the RUNNING and
      BUSY_IN_HOST states requires checking and updating two variables
      (accumulated time stolen and time last preempted), so we add a new
      spinlock, vcpu->arch.tbacct_lock.  This protects both the per-vcpu
      stolen/preempt-time variables, and the per-vcore variables while this
      vcpu is running the vcore.
      
      Finally, we now don't count time spent in userspace as stolen time.
      The task could be executing in userspace on behalf of the vcpu, or
      it could be preempted, or the vcpu could be genuinely stopped.  Since
      we have no way of dividing up the time between these cases, we don't
      count any of it as stolen.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      c7b67670
    • P
      KVM: PPC: Book3S HV: Run virtual core whenever any vcpus in it can run · 8455d79e
      Paul Mackerras 提交于
      Currently the Book3S HV code implements a policy on multi-threaded
      processors (i.e. POWER7) that requires all of the active vcpus in a
      virtual core to be ready to run before we run the virtual core.
      However, that causes problems on reset, because reset stops all vcpus
      except vcpu 0, and can also reduce throughput since all four threads
      in a virtual core have to wait whenever any one of them hits a
      hypervisor page fault.
      
      This relaxes the policy, allowing the virtual core to run as soon as
      any vcpu in it is runnable.  With this, the KVMPPC_VCPU_STOPPED state
      and the KVMPPC_VCPU_BUSY_IN_HOST state have been combined into a single
      KVMPPC_VCPU_NOTREADY state, since we no longer need to distinguish
      between them.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      8455d79e
    • P
      KVM: PPC: Book3S HV: Fixes for late-joining threads · 2f12f034
      Paul Mackerras 提交于
      If a thread in a virtual core becomes runnable while other threads
      in the same virtual core are already running in the guest, it is
      possible for the latecomer to join the others on the core without
      first pulling them all out of the guest.  Currently this only happens
      rarely, when a vcpu is first started.  This fixes some bugs and
      omissions in the code in this case.
      
      First, we need to check for VPA updates for the latecomer and make
      a DTL entry for it.  Secondly, if it comes along while the master
      vcpu is doing a VPA update, we don't need to do anything since the
      master will pick it up in kvmppc_run_core.  To handle this correctly
      we introduce a new vcore state, VCORE_STARTING.  Thirdly, there is
      a race because we currently clear the hardware thread's hwthread_req
      before waiting to see it get to nap.  A latecomer thread could have
      its hwthread_req cleared before it gets to test it, and therefore
      never increment the nap_count, leading to messages about wait_for_nap
      timeouts.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      2f12f034
    • P
      KVM: PPC: Book3s HV: Don't access runnable threads list without vcore lock · 913d3ff9
      Paul Mackerras 提交于
      There were a few places where we were traversing the list of runnable
      threads in a virtual core, i.e. vc->runnable_threads, without holding
      the vcore spinlock.  This extends the places where we hold the vcore
      spinlock to cover everywhere that we traverse that list.
      
      Since we possibly need to sleep inside kvmppc_book3s_hv_page_fault,
      this moves the call of it from kvmppc_handle_exit out to
      kvmppc_vcpu_run, where we don't hold the vcore lock.
      
      In kvmppc_vcore_blocked, we don't actually need to check whether
      all vcpus are ceded and don't have any pending exceptions, since the
      caller has already done that.  The caller (kvmppc_run_vcpu) wasn't
      actually checking for pending exceptions, so we add that.
      
      The change of if to while in kvmppc_run_vcpu is to make sure that we
      never call kvmppc_remove_runnable() when the vcore state is RUNNING or
      EXITING.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      913d3ff9
    • P
      KVM: PPC: Book3S HV: Fix some races in starting secondary threads · 7b444c67
      Paul Mackerras 提交于
      Subsequent patches implementing in-kernel XICS emulation will make it
      possible for IPIs to arrive at secondary threads at arbitrary times.
      This fixes some races in how we start the secondary threads, which
      if not fixed could lead to occasional crashes of the host kernel.
      
      This makes sure that (a) we have grabbed all the secondary threads,
      and verified that they are no longer in the kernel, before we start
      any thread, (b) that the secondary thread loads its vcpu pointer
      after clearing the IPI that woke it up (so we don't miss a wakeup),
      and (c) that the secondary thread clears its vcpu pointer before
      incrementing the nap count.  It also removes unnecessary setting
      of the vcpu and vcore pointers in the paca in kvmppc_core_vcpu_load.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      7b444c67
    • P
      KVM: PPC: Book3S HV: Allow KVM guests to stop secondary threads coming online · 512691d4
      Paul Mackerras 提交于
      When a Book3S HV KVM guest is running, we need the host to be in
      single-thread mode, that is, all of the cores (or at least all of
      the cores where the KVM guest could run) to be running only one
      active hardware thread.  This is because of the hardware restriction
      in POWER processors that all of the hardware threads in the core
      must be in the same logical partition.  Complying with this restriction
      is much easier if, from the host kernel's point of view, only one
      hardware thread is active.
      
      This adds two hooks in the SMP hotplug code to allow the KVM code to
      make sure that secondary threads (i.e. hardware threads other than
      thread 0) cannot come online while any KVM guest exists.  The KVM
      code still has to check that any core where it runs a guest has the
      secondary threads offline, but having done that check it can now be
      sure that they will not come online while the guest is running.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      512691d4
  4. 09 10月, 2012 1 次提交
    • K
      mm: kill vma flag VM_RESERVED and mm->reserved_vm counter · 314e51b9
      Konstantin Khlebnikov 提交于
      A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
      currently it lost original meaning but still has some effects:
      
       | effect                 | alternative flags
      -+------------------------+---------------------------------------------
      1| account as reserved_vm | VM_IO
      2| skip in core dump      | VM_IO, VM_DONTDUMP
      3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      
      This patch removes reserved_vm counter from mm_struct.  Seems like nobody
      cares about it, it does not exported into userspace directly, it only
      reduces total_vm showed in proc.
      
      Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.
      
      remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
      remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.
      
      [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      314e51b9
  5. 06 10月, 2012 7 次提交
    • P
      KVM: PPC: Book3S HV: Provide a way for userspace to get/set per-vCPU areas · 55b665b0
      Paul Mackerras 提交于
      The PAPR paravirtualization interface lets guests register three
      different types of per-vCPU buffer areas in its memory for communication
      with the hypervisor.  These are called virtual processor areas (VPAs).
      Currently the hypercalls to register and unregister VPAs are handled
      by KVM in the kernel, and userspace has no way to know about or save
      and restore these registrations across a migration.
      
      This adds "register" codes for these three areas that userspace can
      use with the KVM_GET/SET_ONE_REG ioctls to see what addresses have
      been registered, and to register or unregister them.  This will be
      needed for guest hibernation and migration, and is also needed so
      that userspace can unregister them on reset (otherwise we corrupt
      guest memory after reboot by writing to the VPAs registered by the
      previous kernel).
      
      The "register" for the VPA is a 64-bit value containing the address,
      since the length of the VPA is fixed.  The "registers" for the SLB
      shadow buffer and dispatch trace log (DTL) are 128 bits long,
      consisting of the guest physical address in the high (first) 64 bits
      and the length in the low 64 bits.
      
      This also fixes a bug where we were calling init_vpa unconditionally,
      leading to an oops when unregistering the VPA.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      55b665b0
    • P
      KVM: PPC: Book3S: Get/set guest FP regs using the GET/SET_ONE_REG interface · a8bd19ef
      Paul Mackerras 提交于
      This enables userspace to get and set all the guest floating-point
      state using the KVM_[GS]ET_ONE_REG ioctls.  The floating-point state
      includes all of the traditional floating-point registers and the
      FPSCR (floating point status/control register), all the VMX/Altivec
      vector registers and the VSCR (vector status/control register), and
      on POWER7, the vector-scalar registers (note that each FP register
      is the high-order half of the corresponding VSR).
      
      Most of these are implemented in common Book 3S code, except for VSX
      on POWER7.  Because HV and PR differ in how they store the FP and VSX
      registers on POWER7, the code for these cases is not common.  On POWER7,
      the FP registers are the upper halves of the VSX registers vsr0 - vsr31.
      PR KVM stores vsr0 - vsr31 in two halves, with the upper halves in the
      arch.fpr[] array and the lower halves in the arch.vsr[] array, whereas
      HV KVM on POWER7 stores the whole VSX register in arch.vsr[].
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      [agraf: fix whitespace, vsx compilation]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      a8bd19ef
    • P
      KVM: PPC: Book3S: Get/set guest SPRs using the GET/SET_ONE_REG interface · a136a8bd
      Paul Mackerras 提交于
      This enables userspace to get and set various SPRs (special-purpose
      registers) using the KVM_[GS]ET_ONE_REG ioctls.  With this, userspace
      can get and set all the SPRs that are part of the guest state, either
      through the KVM_[GS]ET_REGS ioctls, the KVM_[GS]ET_SREGS ioctls, or
      the KVM_[GS]ET_ONE_REG ioctls.
      
      The SPRs that are added here are:
      
      - DABR:  Data address breakpoint register
      - DSCR:  Data stream control register
      - PURR:  Processor utilization of resources register
      - SPURR: Scaled PURR
      - DAR:   Data address register
      - DSISR: Data storage interrupt status register
      - AMR:   Authority mask register
      - UAMOR: User authority mask override register
      - MMCR0, MMCR1, MMCRA: Performance monitor unit control registers
      - PMC1..PMC8: Performance monitor unit counter registers
      
      In order to reduce code duplication between PR and HV KVM code, this
      moves the kvm_vcpu_ioctl_[gs]et_one_reg functions into book3s.c and
      centralizes the copying between user and kernel space there.  The
      registers that are handled differently between PR and HV, and those
      that exist only in one flavor, are handled in kvmppc_[gs]et_one_reg()
      functions that are specific to each flavor.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      [agraf: minimal style fixes]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      a136a8bd
    • P
      KVM: PPC: Book3S HV: Remove bogus update of physical thread IDs · 964ee98c
      Paul Mackerras 提交于
      When making a vcpu non-runnable we incorrectly changed the
      thread IDs of all other threads on the core, just remove that
      code.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      964ee98c
    • P
      KVM: PPC: Book3S HV: Handle memory slot deletion and modification correctly · dfe49dbd
      Paul Mackerras 提交于
      This adds an implementation of kvm_arch_flush_shadow_memslot for
      Book3S HV, and arranges for kvmppc_core_commit_memory_region to
      flush the dirty log when modifying an existing slot.  With this,
      we can handle deletion and modification of memory slots.
      
      kvm_arch_flush_shadow_memslot calls kvmppc_core_flush_memslot, which
      on Book3S HV now traverses the reverse map chains to remove any HPT
      (hashed page table) entries referring to pages in the memslot.  This
      gets called by generic code whenever deleting a memslot or changing
      the guest physical address for a memslot.
      
      We flush the dirty log in kvmppc_core_commit_memory_region for
      consistency with what x86 does.  We only need to flush when an
      existing memslot is being modified, because for a new memslot the
      rmap array (which stores the dirty bits) is all zero, meaning that
      every page is considered clean already, and when deleting a memslot
      we obviously don't care about the dirty bits any more.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      dfe49dbd
    • P
      KVM: PPC: Move kvm->arch.slot_phys into memslot.arch · a66b48c3
      Paul Mackerras 提交于
      Now that we have an architecture-specific field in the kvm_memory_slot
      structure, we can use it to store the array of page physical addresses
      that we need for Book3S HV KVM on PPC970 processors.  This reduces the
      size of struct kvm_arch for Book3S HV, and also reduces the size of
      struct kvm_arch_memory_slot for other PPC KVM variants since the fields
      in it are now only compiled in for Book3S HV.
      
      This necessitates making the kvm_arch_create_memslot and
      kvm_arch_free_memslot operations specific to each PPC KVM variant.
      That in turn means that we now don't allocate the rmap arrays on
      Book3S PR and Book E.
      
      Since we now unpin pages and free the slot_phys array in
      kvmppc_core_free_memslot, we no longer need to do it in
      kvmppc_core_destroy_vm, since the generic code takes care to free
      all the memslots when destroying a VM.
      
      We now need the new memslot to be passed in to
      kvmppc_core_prepare_memory_region, since we need to initialize its
      arch.slot_phys member on Book3S HV.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      a66b48c3
    • P
      KVM: PPC: Book3S HV: Take the SRCU read lock before looking up memslots · 2c9097e4
      Paul Mackerras 提交于
      The generic KVM code uses SRCU (sleeping RCU) to protect accesses
      to the memslots data structures against updates due to userspace
      adding, modifying or removing memory slots.  We need to do that too,
      both to avoid accessing stale copies of the memslots and to avoid
      lockdep warnings.  This therefore adds srcu_read_lock/unlock pairs
      around code that accesses and uses memslots.
      
      Since the real-mode handlers for H_ENTER, H_REMOVE and H_BULK_REMOVE
      need to access the memslots, and we don't want to call the SRCU code
      in real mode (since we have no assurance that it would only access
      the linear mapping), we hold the SRCU read lock for the VM while
      in the guest.  This does mean that adding or removing memory slots
      while some vcpus are executing in the guest will block for up to
      two jiffies.  This tradeoff is acceptable since adding/removing
      memory slots only happens rarely, while H_ENTER/H_REMOVE/H_BULK_REMOVE
      are performance-critical hot paths.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      2c9097e4
  6. 19 6月, 2012 1 次提交
    • P
      KVM: PPC: Book3S HV: Drop locks around call to kvmppc_pin_guest_page · 081f323b
      Paul Mackerras 提交于
      At the moment we call kvmppc_pin_guest_page() in kvmppc_update_vpa()
      with two spinlocks held: the vcore lock and the vcpu->vpa_update_lock.
      This is not good, since kvmppc_pin_guest_page() calls down_read() and
      get_user_pages_fast(), both of which can sleep.  This bug was introduced
      in 2e25aa5f ("KVM: PPC: Book3S HV: Make virtual processor area
      registration more robust").
      
      This arranges to drop those spinlocks before calling
      kvmppc_pin_guest_page() and re-take them afterwards.  Dropping the
      vcore lock in kvmppc_run_core() means we have to set the vcore_state
      field to VCORE_RUNNING before we drop the lock, so that other vcpus
      won't try to run this vcore.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Acked-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      081f323b
  7. 30 5月, 2012 1 次提交
    • P
      KVM: PPC: Book3S HV: Make the guest hash table size configurable · 32fad281
      Paul Mackerras 提交于
      This adds a new ioctl to enable userspace to control the size of the guest
      hashed page table (HPT) and to clear it out when resetting the guest.
      The KVM_PPC_ALLOCATE_HTAB ioctl is a VM ioctl and takes as its parameter
      a pointer to a u32 containing the desired order of the HPT (log base 2
      of the size in bytes), which is updated on successful return to the
      actual order of the HPT which was allocated.
      
      There must be no vcpus running at the time of this ioctl.  To enforce
      this, we now keep a count of the number of vcpus running in
      kvm->arch.vcpus_running.
      
      If the ioctl is called when a HPT has already been allocated, we don't
      reallocate the HPT but just clear it out.  We first clear the
      kvm->arch.rma_setup_done flag, which has two effects: (a) since we hold
      the kvm->lock mutex, it will prevent any vcpus from starting to run until
      we're done, and (b) it means that the first vcpu to run after we're done
      will re-establish the VRMA if necessary.
      
      If userspace doesn't call this ioctl before running the first vcpu, the
      kernel will allocate a default-sized HPT at that point.  We do it then
      rather than when creating the VM, as the code did previously, so that
      userspace has a chance to do the ioctl if it wants.
      
      When allocating the HPT, we can allocate either from the kernel page
      allocator, or from the preallocated pool.  If userspace is asking for
      a different size from the preallocated HPTs, we first try to allocate
      using the kernel page allocator.  Then we try to allocate from the
      preallocated pool, and then if that fails, we try allocating decreasing
      sizes from the kernel page allocator, down to the minimum size allowed
      (256kB).  Note that the kernel page allocator limits allocations to
      1 << CONFIG_FORCE_MAX_ZONEORDER pages, which by default corresponds to
      16MB (on 64-bit powerpc, at least).
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      [agraf: fix module compilation]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      32fad281
  8. 08 5月, 2012 1 次提交
    • D
      KVM: PPC: Book3S HV: Fix refcounting of hugepages · de6c0b02
      David Gibson 提交于
      The H_REGISTER_VPA hcall implementation in HV Power KVM needs to pin some
      guest memory pages into host memory so that they can be safely accessed
      from usermode.  It does this used get_user_pages_fast().  When the VPA is
      unregistered, or the VCPUs are cleaned up, these pages are released using
      put_page().
      
      However, the get_user_pages() is invoked on the specific memory are of the
      VPA which could lie within hugepages.  In case the pinned page is huge,
      we explicitly find the head page of the compound page before calling
      put_page() on it.
      
      At least with the latest kernel, this is not correct.  put_page() already
      handles finding the correct head page of a compound, and also deals with
      various counts on the individual tail page which are important for
      transparent huge pages.  We don't support transparent hugepages on Power,
      but even so, bypassing this count maintenance can lead (when the VM ends)
      to a hugepage being released back to the pool with a non-zero mapcount on
      one of the tail pages.  This can then lead to a bad_page() when the page
      is released from the hugepage pool.
      
      This removes the explicit compound_head() call to correct this bug.
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Acked-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      de6c0b02
  9. 06 5月, 2012 3 次提交
  10. 08 4月, 2012 3 次提交
    • P
      KVM: PPC: Book3S HV: Report stolen time to guest through dispatch trace log · 0456ec4f
      Paul Mackerras 提交于
      This adds code to measure "stolen" time per virtual core in units of
      timebase ticks, and to report the stolen time to the guest using the
      dispatch trace log (DTL).  The guest can register an area of memory
      for the DTL for a given vcpu.  The DTL is a ring buffer where KVM
      fills in one entry every time it enters the guest for that vcpu.
      
      Stolen time is measured as time when the virtual core is not running,
      either because the vcore is not runnable (e.g. some of its vcpus are
      executing elsewhere in the kernel or in userspace), or when the vcpu
      thread that is running the vcore is preempted.  This includes time
      when all the vcpus are idle (i.e. have executed the H_CEDE hypercall),
      which is OK because the guest accounts stolen time while idle as idle
      time.
      
      Each vcpu keeps a record of how much stolen time has been reported to
      the guest for that vcpu so far.  When we are about to enter the guest,
      we create a new DTL entry (if the guest vcpu has a DTL) and report the
      difference between total stolen time for the vcore and stolen time
      reported so far for the vcpu as the "enqueue to dispatch" time in the
      DTL entry.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      0456ec4f
    • P
      KVM: PPC: Book3S HV: Make virtual processor area registration more robust · 2e25aa5f
      Paul Mackerras 提交于
      The PAPR API allows three sorts of per-virtual-processor areas to be
      registered (VPA, SLB shadow buffer, and dispatch trace log), and
      furthermore, these can be registered and unregistered for another
      virtual CPU.  Currently we just update the vcpu fields pointing to
      these areas at the time of registration or unregistration.  If this
      is done on another vcpu, there is the possibility that the target vcpu
      is using those fields at the time and could end up using a bogus
      pointer and corrupting memory.
      
      This fixes the race by making the target cpu itself do the update, so
      we can be sure that the update happens at a time when the fields
      aren't being used.  Each area now has a struct kvmppc_vpa which is
      used to manage these updates.  There is also a spinlock which protects
      access to all of the kvmppc_vpa structs, other than to the pinned_addr
      fields.  (We could have just taken the spinlock when using the vpa,
      slb_shadow or dtl fields, but that would mean taking the spinlock on
      every guest entry and exit.)
      
      This also changes 'struct dtl' (which was undefined) to 'struct dtl_entry',
      which is what the rest of the kernel uses.
      
      Thanks to Michael Ellerman <michael@ellerman.id.au> for pointing out
      the need to initialize vcpu->arch.vpa_update_lock.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      2e25aa5f
    • P
      KVM: PPC: Book3S HV: Make secondary threads more robust against stray IPIs · f0888f70
      Paul Mackerras 提交于
      Currently on POWER7, if we are running the guest on a core and we don't
      need all the hardware threads, we do nothing to ensure that the unused
      threads aren't executing in the kernel (other than checking that they
      are offline).  We just assume they're napping and we don't do anything
      to stop them trying to enter the kernel while the guest is running.
      This means that a stray IPI can wake up the hardware thread and it will
      then try to enter the kernel, but since the core is in guest context,
      it will execute code from the guest in hypervisor mode once it turns the
      MMU on, which tends to lead to crashes or hangs in the host.
      
      This fixes the problem by adding two new one-byte flags in the
      kvmppc_host_state structure in the PACA which are used to interlock
      between the primary thread and the unused secondary threads when entering
      the guest.  With these flags, the primary thread can ensure that the
      unused secondaries are not already in kernel mode (i.e. handling a stray
      IPI) and then indicate that they should not try to enter the kernel
      if they do get woken for any reason.  Instead they will go into KVM code,
      find that there is no vcpu to run, acknowledge and clear the IPI and go
      back to nap mode.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      f0888f70
  11. 29 3月, 2012 1 次提交
  12. 08 3月, 2012 1 次提交
  13. 05 3月, 2012 10 次提交
    • A
      KVM: PPC: Convert RMA allocation into generic code · b4e70611
      Alexander Graf 提交于
      We have code to allocate big chunks of linear memory on bootup for later use.
      This code is currently used for RMA allocation, but can be useful beyond that
      extent.
      
      Make it generic so we can reuse it for other stuff later.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Acked-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      b4e70611
    • P
      KVM: PPC: Move kvm_vcpu_ioctl_[gs]et_one_reg down to platform-specific code · 31f3438e
      Paul Mackerras 提交于
      This moves the get/set_one_reg implementation down from powerpc.c into
      booke.c, book3s_pr.c and book3s_hv.c.  This avoids #ifdefs in C code,
      but more importantly, it fixes a bug on Book3s HV where we were
      accessing beyond the end of the kvm_vcpu struct (via the to_book3s()
      macro) and corrupting memory, causing random crashes and file corruption.
      
      On Book3s HV we only accept setting the HIOR to zero, since the guest
      runs in supervisor mode and its vectors are never offset from zero.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      [agraf update to apply on top of changed ONE_REG patches]
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      31f3438e
    • S
      KVM: PPC: Use the vcpu kmem_cache when allocating new VCPUs · 6b75e6bf
      Sasha Levin 提交于
      Currently the code kzalloc()s new VCPUs instead of using the kmem_cache
      which is created when KVM is initialized.
      
      Modify it to allocate VCPUs from that kmem_cache.
      Signed-off-by: NSasha Levin <levinsasha928@gmail.com>
      Acked-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      6b75e6bf
    • P
      KVM: PPC: Book3s HV: Implement get_dirty_log using hardware changed bit · 82ed3616
      Paul Mackerras 提交于
      This changes the implementation of kvm_vm_ioctl_get_dirty_log() for
      Book3s HV guests to use the hardware C (changed) bits in the guest
      hashed page table.  Since this makes the implementation quite different
      from the Book3s PR case, this moves the existing implementation from
      book3s.c to book3s_pr.c and creates a new implementation in book3s_hv.c.
      That implementation calls kvmppc_hv_get_dirty_log() to do the actual
      work by calling kvm_test_clear_dirty on each page.  It iterates over
      the HPTEs, clearing the C bit if set, and returns 1 if any C bit was
      set (including the saved C bit in the rmap entry).
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      82ed3616
    • P
      KVM: PPC: Implement MMU notifiers for Book3S HV guests · 342d3db7
      Paul Mackerras 提交于
      This adds the infrastructure to enable us to page out pages underneath
      a Book3S HV guest, on processors that support virtualized partition
      memory, that is, POWER7.  Instead of pinning all the guest's pages,
      we now look in the host userspace Linux page tables to find the
      mapping for a given guest page.  Then, if the userspace Linux PTE
      gets invalidated, kvm_unmap_hva() gets called for that address, and
      we replace all the guest HPTEs that refer to that page with absent
      HPTEs, i.e. ones with the valid bit clear and the HPTE_V_ABSENT bit
      set, which will cause an HDSI when the guest tries to access them.
      Finally, the page fault handler is extended to reinstantiate the
      guest HPTE when the guest tries to access a page which has been paged
      out.
      
      Since we can't intercept the guest DSI and ISI interrupts on PPC970,
      we still have to pin all the guest pages on PPC970.  We have a new flag,
      kvm->arch.using_mmu_notifiers, that indicates whether we can page
      guest pages out.  If it is not set, the MMU notifier callbacks do
      nothing and everything operates as before.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      342d3db7
    • P
      KVM: PPC: Implement MMIO emulation support for Book3S HV guests · 697d3899
      Paul Mackerras 提交于
      This provides the low-level support for MMIO emulation in Book3S HV
      guests.  When the guest tries to map a page which is not covered by
      any memslot, that page is taken to be an MMIO emulation page.  Instead
      of inserting a valid HPTE, we insert an HPTE that has the valid bit
      clear but another hypervisor software-use bit set, which we call
      HPTE_V_ABSENT, to indicate that this is an absent page.  An
      absent page is treated much like a valid page as far as guest hcalls
      (H_ENTER, H_REMOVE, H_READ etc.) are concerned, except of course that
      an absent HPTE doesn't need to be invalidated with tlbie since it
      was never valid as far as the hardware is concerned.
      
      When the guest accesses a page for which there is an absent HPTE, it
      will take a hypervisor data storage interrupt (HDSI) since we now set
      the VPM1 bit in the LPCR.  Our HDSI handler for HPTE-not-present faults
      looks up the hash table and if it finds an absent HPTE mapping the
      requested virtual address, will switch to kernel mode and handle the
      fault in kvmppc_book3s_hv_page_fault(), which at present just calls
      kvmppc_hv_emulate_mmio() to set up the MMIO emulation.
      
      This is based on an earlier patch by Benjamin Herrenschmidt, but since
      heavily reworked.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      697d3899
    • P
      KVM: PPC: Allow use of small pages to back Book3S HV guests · da9d1d7f
      Paul Mackerras 提交于
      This relaxes the requirement that the guest memory be provided as
      16MB huge pages, allowing it to be provided as normal memory, i.e.
      in pages of PAGE_SIZE bytes (4k or 64k).  To allow this, we index
      the kvm->arch.slot_phys[] arrays with a small page index, even if
      huge pages are being used, and use the low-order 5 bits of each
      entry to store the order of the enclosing page with respect to
      normal pages, i.e. log_2(enclosing_page_size / PAGE_SIZE).
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      da9d1d7f
    • P
      KVM: PPC: Only get pages when actually needed, not in prepare_memory_region() · c77162de
      Paul Mackerras 提交于
      This removes the code from kvmppc_core_prepare_memory_region() that
      looked up the VMA for the region being added and called hva_to_page
      to get the pfns for the memory.  We have no guarantee that there will
      be anything mapped there at the time of the KVM_SET_USER_MEMORY_REGION
      ioctl call; userspace can do that ioctl and then map memory into the
      region later.
      
      Instead we defer looking up the pfn for each memory page until it is
      needed, which generally means when the guest does an H_ENTER hcall on
      the page.  Since we can't call get_user_pages in real mode, if we don't
      already have the pfn for the page, kvmppc_h_enter() will return
      H_TOO_HARD and we then call kvmppc_virtmode_h_enter() once we get back
      to kernel context.  That calls kvmppc_get_guest_page() to get the pfn
      for the page, and then calls back to kvmppc_h_enter() to redo the HPTE
      insertion.
      
      When the first vcpu starts executing, we need to have the RMO or VRMA
      region mapped so that the guest's real mode accesses will work.  Thus
      we now have a check in kvmppc_vcpu_run() to see if the RMO/VRMA is set
      up and if not, call kvmppc_hv_setup_rma().  It checks if the memslot
      starting at guest physical 0 now has RMO memory mapped there; if so it
      sets it up for the guest, otherwise on POWER7 it sets up the VRMA.
      The function that does that, kvmppc_map_vrma, is now a bit simpler,
      as it calls kvmppc_virtmode_h_enter instead of creating the HPTE itself.
      
      Since we are now potentially updating entries in the slot_phys[]
      arrays from multiple vcpu threads, we now have a spinlock protecting
      those updates to ensure that we don't lose track of any references
      to pages.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      c77162de
    • P
      KVM: PPC: Add an interface for pinning guest pages in Book3s HV guests · 93e60249
      Paul Mackerras 提交于
      This adds two new functions, kvmppc_pin_guest_page() and
      kvmppc_unpin_guest_page(), and uses them to pin the guest pages where
      the guest has registered areas of memory for the hypervisor to update,
      (i.e. the per-cpu virtual processor areas, SLB shadow buffers and
      dispatch trace logs) and then unpin them when they are no longer
      required.
      
      Although it is not strictly necessary to pin the pages at this point,
      since all guest pages are already pinned, later commits in this series
      will mean that guest pages aren't all pinned.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      93e60249
    • P
      KVM: PPC: Keep page physical addresses in per-slot arrays · b2b2f165
      Paul Mackerras 提交于
      This allocates an array for each memory slot that is added to store
      the physical addresses of the pages in the slot.  This array is
      vmalloc'd and accessed in kvmppc_h_enter using real_vmalloc_addr().
      This allows us to remove the ram_pginfo field from the kvm_arch
      struct, and removes the 64GB guest RAM limit that we had.
      
      We use the low-order bits of the array entries to store a flag
      indicating that we have done get_page on the corresponding page,
      and therefore need to call put_page when we are finished with the
      page.  Currently this is set for all pages except those in our
      special RMO regions.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      b2b2f165