1. 26 Sep, 2011 1 commit
    • A
      KVM: PPC: Add sanity checking to vcpu_run · af8f38b3
      Alexander Graf committed
      There are multiple features in PowerPC KVM that can now be enabled
      depending on the user's wishes. Some of the combinations don't make
      sense or don't work though.
      
      So this patch adds a way to check if the executing environment would
      actually be able to run the guest properly. It also adds sanity
      checks if PVR is set (should always be true given the current code
      flow), if PAPR is only used with book3s_64 where it works and that
      HV KVM is only used in PAPR mode.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      af8f38b3
  2. 12 Jul, 2011 8 commits
    • P
      KVM: PPC: Allocate RMAs (Real Mode Areas) at boot for use by guests · aa04b4cc
      Paul Mackerras committed
      This adds infrastructure which will be needed to allow book3s_hv KVM to
      run on older POWER processors, including PPC970, which don't support
      the Virtual Real Mode Area (VRMA) facility, but only the Real Mode
      Offset (RMO) facility.  These processors require a physically
      contiguous, aligned area of memory for each guest.  When the guest does
      an access in real mode (MMU off), the address is compared against a
      limit value, and if it is lower, the address is ORed with an offset
      value (from the Real Mode Offset Register (RMOR)) and the result becomes
      the real address for the access.  The size of the RMA has to be one of
      a set of supported values, which usually includes 64MB, 128MB, 256MB
      and some larger powers of 2.
      
      Since we are unlikely to be able to allocate 64MB or more of physically
      contiguous memory after the kernel has been running for a while, we
      allocate a pool of RMAs at boot time using the bootmem allocator.  The
      size and number of the RMAs can be set using the kvm_rma_size=xx and
      kvm_rma_count=xx kernel command line options.
      
      KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability
      of the pool of preallocated RMAs.  The capability value is 1 if the
      processor can use an RMA but doesn't require one (because it supports
      the VRMA facility), or 2 if the processor requires an RMA for each guest.
      
      This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the
      pool and returns a file descriptor which can be used to map the RMA.  It
      also returns the size of the RMA in the argument structure.
      
      Having an RMA means we will get multiple KVM_SET_USER_MEMORY_REGION
      ioctl calls from userspace.  To cope with this, we now preallocate the
      kvm->arch.ram_pginfo array when the VM is created with a size sufficient
      for up to 64GB of guest memory.  Subsequently we will get rid of this
      array and use memory associated with each memslot instead.
      
      This moves most of the code that translates the user addresses into
      host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level
      to kvmppc_core_prepare_memory_region.  Also, instead of having to look
      up the VMA for each page in order to check the page size, we now check
      that the pages we get are compound pages of 16MB.  However, if we are
      adding memory that is mapped to an RMA, we don't bother with calling
      get_user_pages_fast and instead just offset from the base pfn for the
      RMA.
      
      Typically the RMA gets added after vcpus are created, which makes it
      inconvenient to have the LPCR (logical partition control register) value
      in the vcpu->arch struct, since the LPCR controls whether the processor
      uses RMA or VRMA for the guest.  This moves the LPCR value into the
      kvm->arch struct and arranges for the MER (mediated external request)
      bit, which is the only bit that varies between vcpus, to be set in
      assembly code when going into the guest if there is a pending external
      interrupt request.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      aa04b4cc
    • P
      KVM: PPC: Allow book3s_hv guests to use SMT processor modes · 371fefd6
      Paul Mackerras committed
      This lifts the restriction that book3s_hv guests can only run one
      hardware thread per core, and allows them to use up to 4 threads
      per core on POWER7.  The host still has to run single-threaded.
      
      This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
      capability.  The return value of the ioctl querying this capability
      is the number of vcpus per virtual CPU core (vcore), currently 4.
      
      To use this, the host kernel should be booted with all threads
      active, and then all the secondary threads should be offlined.
      This will put the secondary threads into nap mode.  KVM will then
      wake them from nap mode and use them for running guest code (while
      they are still offline).  To wake the secondary threads, we send
      them an IPI using a new xics_wake_cpu() function, implemented in
      arch/powerpc/sysdev/xics/icp-native.c.  In other words, at this stage
      we assume that the platform has a XICS interrupt controller and
      we are using icp-native.c to drive it.  Since the woken thread will
      need to acknowledge and clear the IPI, we also export the base
      physical address of the XICS registers using kvmppc_set_xics_phys()
      for use in the low-level KVM book3s code.
      
      When a vcpu is created, it is assigned to a virtual CPU core.
      The vcore number is obtained by dividing the vcpu number by the
      number of threads per core in the host.  This number is exported
      to userspace via the KVM_CAP_PPC_SMT capability.  If qemu wishes
      to run the guest in single-threaded mode, it should make all vcpu
      numbers be multiples of the number of threads per core.
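
      The vcpu-to-vcore mapping above can be sketched as follows.  Both helper
      names are hypothetical; the only rule taken from this message is that the
      vcore number is the vcpu number divided by the threads per core.

```c
#include <assert.h>

/* Hypothetical helper: the virtual core that a given vcpu number lands on. */
static int vcpu_to_vcore(int vcpu_id, int threads_per_core)
{
    assert(vcpu_id >= 0 && threads_per_core > 0);
    return vcpu_id / threads_per_core;
}

/*
 * Hypothetical helper: the vcpu number userspace could pick for its n-th
 * vcpu when it wants a single-threaded guest, so that every vcpu gets a
 * vcore of its own.
 */
static int single_threaded_vcpu_id(int n, int threads_per_core)
{
    assert(n >= 0 && threads_per_core > 0);
    return n * threads_per_core;
}
```

      With 4 threads per core, vcpus 0 to 3 share vcore 0, while vcpu numbers
      0, 4, 8, ... each map to a separate vcore.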
      
      We distinguish three states of a vcpu: runnable (i.e., ready to execute
      the guest), blocked (that is, idle), and busy in host.  We currently
      implement a policy that the vcore can run only when all its threads
      are runnable or blocked.  This way, if a vcpu needs to execute elsewhere
      in the kernel or in qemu, it can do so without being starved of CPU
      by the other vcpus.
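
      A minimal sketch of that policy, assuming the three states are kept in a
      plain array.  The enum and function names are hypothetical and the real
      code also has to deal with locking, which is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for the three vcpu states listed above. */
enum vcpu_state { VCPU_RUNNABLE, VCPU_BLOCKED, VCPU_BUSY_IN_HOST };

/*
 * A vcore may enter the guest only if none of its vcpus is busy in the
 * host, i.e. every vcpu is either runnable or blocked.
 */
static bool vcore_can_run(const enum vcpu_state *states, size_t n)
{
    assert(states || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (states[i] == VCPU_BUSY_IN_HOST)
            return false;
    }
    return true;
}
```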
      
      When a vcore starts to run, it executes in the context of one of the
      vcpu threads.  The other vcpu threads all go to sleep and stay asleep
      until something happens requiring the vcpu thread to return to qemu,
      or to wake up to run the vcore (this can happen when another vcpu
      thread goes from busy in host state to blocked).
      
      It can happen that a vcpu goes from blocked to runnable state (e.g.
      because of an interrupt), and the vcore it belongs to is already
      running.  In that case it can start to run immediately as long as
      none of the vcpus in the vcore have started to exit the guest.
      We send the next free thread in the vcore an IPI to get it to start
      to execute the guest.  It synchronizes with the other threads via
      the vcore->entry_exit_count field to make sure that it doesn't go
      into the guest if the other vcpus are exiting by the time that it
      is ready to actually enter the guest.
      
      Note that there is no fixed relationship between the hardware thread
      number and the vcpu number.  Hardware threads are assigned to vcpus
      as they become runnable, so we will always use the lower-numbered
      hardware threads in preference to higher-numbered threads if not all
      the vcpus in the vcore are runnable, regardless of which vcpus are
      runnable.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      371fefd6
    • D
      KVM: PPC: Accelerate H_PUT_TCE by implementing it in real mode · 54738c09
      David Gibson committed
      This improves I/O performance for guests using the PAPR
      paravirtualization interface by making the H_PUT_TCE hcall faster, by
      implementing it in real mode.  H_PUT_TCE is used for updating virtual
      IOMMU tables, and is used both for virtual I/O and for real I/O in the
      PAPR interface.
      
      Since this moves the IOMMU tables into the kernel, we define a new
      KVM_CREATE_SPAPR_TCE ioctl to allow qemu to create the tables.  The
      ioctl returns a file descriptor which can be used to mmap the newly
      created table.  The qemu driver models use them in the same way as
      userspace managed tables, but they can be updated directly by the
      guest with a real-mode H_PUT_TCE implementation, reducing the number
      of host/guest context switches during guest IO.
      
      There are certain circumstances where it is useful for userland qemu
      to write to the TCE table even if the kernel H_PUT_TCE path is used
      most of the time.  Specifically, allowing this will avoid awkwardness
      when we need to reset the table.  More importantly, we will in the
      future need to write the table in order to restore its state after a
      checkpoint resume or migration.
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      54738c09
    • P
      KVM: PPC: Handle some PAPR hcalls in the kernel · a8606e20
      Paul Mackerras committed
      This adds the infrastructure for handling PAPR hcalls in the kernel,
      either early in the guest exit path while we are still in real mode,
      or later once the MMU has been turned back on and we are in the full
      kernel context.  The advantage of handling hcalls in real mode if
      possible is that we avoid two partition switches -- and this will
      become more important when we support SMT4 guests, since a partition
      switch means we have to pull all of the threads in the core out of
      the guest.  The disadvantage is that we can only access the kernel
      linear mapping, not anything vmalloced or ioremapped, since the MMU
      is off.
      
      This also adds code to handle the following hcalls in real mode:
      
      H_ENTER       Add an HPTE to the hashed page table
      H_REMOVE      Remove an HPTE from the hashed page table
      H_READ        Read HPTEs from the hashed page table
      H_PROTECT     Change the protection bits in an HPTE
      H_BULK_REMOVE Remove up to 4 HPTEs from the hashed page table
      H_SET_DABR    Set the data address breakpoint register
      
      Plus code to handle the following hcalls in the kernel:
      
      H_CEDE        Idle the vcpu until an interrupt or H_PROD hcall arrives
      H_PROD        Wake up a ceded vcpu
      H_REGISTER_VPA Register a virtual processor area (VPA)
      
      The code that runs in real mode has to be in the base kernel, not in
      the module, if KVM is compiled as a module.  The real-mode code can
      only access the kernel linear mapping, not vmalloc or ioremap space.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      a8606e20
    • P
      KVM: PPC: Add support for Book3S processors in hypervisor mode · de56a948
      Paul Mackerras committed
      This adds support for KVM running on 64-bit Book 3S processors,
      specifically POWER7, in hypervisor mode.  Using hypervisor mode means
      that the guest can use the processor's supervisor mode.  That means
      that the guest can execute privileged instructions and access privileged
      registers itself without trapping to the host.  This gives excellent
      performance, but does mean that KVM cannot emulate a processor
      architecture other than the one that the hardware implements.
      
      This code assumes that the guest is running paravirtualized using the
      PAPR (Power Architecture Platform Requirements) interface, which is the
      interface that IBM's PowerVM hypervisor uses.  That means that existing
      Linux distributions that run on IBM pSeries machines will also run
      under KVM without modification.  In order to communicate the PAPR
      hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
      to include/linux/kvm.h.
      
      Currently the choice between book3s_hv support and book3s_pr support
      (i.e. the existing code, which runs the guest in user mode) has to be
      made at kernel configuration time, so a given kernel binary can only
      do one or the other.
      
      This new book3s_hv code doesn't support MMIO emulation at present.
      Since we are running paravirtualized guests, this isn't a serious
      restriction.
      
      With the guest running in supervisor mode, most exceptions go straight
      to the guest.  We will never get data or instruction storage or segment
      interrupts, alignment interrupts, decrementer interrupts, program
      interrupts, single-step interrupts, etc., coming to the hypervisor from
      the guest.  Therefore this introduces a new KVMTEST_NONHV macro for the
      exception entry path so that we don't have to do the KVM test on entry
      to those exception handlers.
      
      We do however get hypervisor decrementer, hypervisor data storage,
      hypervisor instruction storage, and hypervisor emulation assist
      interrupts, so we have to handle those.
      
      In hypervisor mode, real-mode accesses can access all of RAM, not just
      a limited amount.  Therefore we put all the guest state in the vcpu.arch
      and use the shadow_vcpu in the PACA only for temporary scratch space.
      We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
      anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
      We don't have a shared page with the guest, but we still need a
      kvm_vcpu_arch_shared struct to store the values of various registers,
      so we include one in the vcpu_arch struct.
      
      The POWER7 processor has a restriction that all threads in a core have
      to be in the same partition.  MMU-on kernel code counts as a partition
      (partition 0), so we have to do a partition switch on every entry to and
      exit from the guest.  At present we require the host and guest to run
      in single-thread mode because of this hardware restriction.
      
      This code allocates a hashed page table for the guest and initializes
      it with HPTEs for the guest's Virtual Real Mode Area (VRMA).  We
      require that the guest memory is allocated using 16MB huge pages, in
      order to simplify the low-level memory management.  This also means that
      we can get away without tracking paging activity in the host for now,
      since huge pages can't be paged or swapped.
      
      This also adds a few new exports needed by the book3s_hv code.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      de56a948
    • P
      KVM: PPC: Move guest enter/exit down into subarch-specific code · df6909e5
      Paul Mackerras committed
      Instead of doing the kvm_guest_enter/exit() and local_irq_dis/enable()
      calls in powerpc.c, this moves them down into the subarch-specific
      book3s_pr.c and booke.c.  This eliminates an extra local_irq_enable()
      call in book3s_pr.c, and will be needed for when we do SMT4 guest
      support in the book3s hypervisor mode code.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      df6909e5
    • P
      KVM: PPC: Pass init/destroy vm and prepare/commit memory region ops down · f9e0554d
      Paul Mackerras committed
      This arranges for the top-level arch/powerpc/kvm/powerpc.c file to
      pass down some of the calls it gets to the lower-level subarchitecture
      specific code.  The lower-level implementations (in booke.c and book3s.c)
      are no-ops.  The coming book3s_hv.c will need this.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      f9e0554d
    • S
      KVM: PPC: e500: enable magic page · a4cd8b23
      Scott Wood committed
      This is a shared page used for paravirtualization.  It is always present
      in the guest kernel's effective address space at the address indicated
      by the hypercall that enables it.
      
      The physical address specified by the hypercall is not used, as
      e500 does not have real mode.
      Signed-off-by: Scott Wood <scottwood@freescale.com>
      Signed-off-by: Alexander Graf <agraf@suse.de>
      a4cd8b23
  3. 22 May, 2011 1 commit
  4. 24 Oct, 2010 1 commit
    • A
      KVM: PPC: Implement hypervisor interface · 2a342ed5
      Alexander Graf committed
      To communicate with KVM directly we need to plumb some sort of interface
      between the guest and KVM. Usually those interfaces use hypercalls.
      
      This hypercall implementation is described in the last patch of the series
      in a special documentation file. Please read that for further information.
      
      This patch implements stubs to handle KVM PPC hypercalls on the host and
      guest side alike.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      2a342ed5
  5. 17 May, 2010 3 commits
    • A
      KVM: PPC: Extract MMU init · 9cc5e953
      Alexander Graf committed
      The host shadow mmu code needs to get initialized. It needs to fetch a
      segment it can use to put shadow PTEs into.
      
      That initialization code was in generic code, which is icky. Let's move
      it over to the respective MMU file.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      9cc5e953
    • A
      KVM: PPC: Improve indirect svcpu accessors · c7f38f46
      Alexander Graf committed
      We already have some inline functions we use to access vcpu or svcpu structs,
      depending on whether we're on booke or book3s. Since we just put a few more
      registers into the svcpu, we also need to make sure the respective callbacks
      are available and get used.
      
      So this patch converts direct uses of the fields that now live in the svcpu
      struct into inline function calls. While at it, it also moves the definitions
      of those inline functions to the respective header files for booke and book3s,
      greatly improving readability.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      c7f38f46
    • A
      KVM: PPC: Allow userspace to unset the IRQ line · 18978768
      Alexander Graf committed
      Userspace can tell us that it wants to trigger an interrupt. But
      so far it can't tell us that it wants to stop triggering one.
      
      So let's interpret the parameter to the ioctl that we have anyways
      to tell us if we want to raise or lower the interrupt line.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      
      v2 -> v3:
      
       - Add CAP for unset irq
      Signed-off-by: Avi Kivity <avi@redhat.com>
      18978768
  6. 25 Apr, 2010 4 commits
    • A
      KVM: PPC: Add helpers to modify ppc fields · 0564ee8a
      Alexander Graf committed
      The PowerPC specification always lists bits from MSB to LSB. That is
      really confusing when you're trying to write C code, because it fits
      in pretty badly with the normal (1 << xx) schemes.
      
      So I came up with some nice wrappers that allow getting and setting fields
      in a u64 with bit numbers exactly as given in the spec. That makes the
      code in KVM easier to compare with the spec.
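
      A minimal sketch of what such wrappers could look like, assuming bit 0 is
      the most significant bit of a 64-bit value and a field is given as an
      inclusive msb..lsb range. The names get_field, set_field and field_mask
      are hypothetical and may differ from the helpers this patch adds.

```c
#include <assert.h>
#include <stdint.h>

/* Mask covering spec bits msb..lsb (inclusive), with bit 0 being the MSB. */
static uint64_t field_mask(int msb, int lsb)
{
    assert(0 <= msb && msb <= lsb && lsb <= 63);

    int width = lsb - msb + 1;
    /* Avoid shifting by 64, which is undefined in C. */
    uint64_t ones = (width == 64) ? ~0ULL : ((1ULL << width) - 1);

    return ones << (63 - lsb);
}

/* Extract the field at spec bits msb..lsb, right-aligned. */
static uint64_t get_field(uint64_t val, int msb, int lsb)
{
    return (val & field_mask(msb, lsb)) >> (63 - lsb);
}

/* Return val with spec bits msb..lsb replaced by the low bits of field. */
static uint64_t set_field(uint64_t val, uint64_t field, int msb, int lsb)
{
    uint64_t mask = field_mask(msb, lsb);

    return (val & ~mask) | ((field << (63 - lsb)) & mask);
}
```

      With this numbering, spec bit 63 corresponds to (1 << 0) in C and spec
      bit 0 to (1ULL << 63), which is the mismatch the wrappers hide.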
      Signed-off-by: Alexander Graf <agraf@suse.de>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      0564ee8a
    • A
      KVM: PPC: Add AGAIN type for emulation return · 37f5bca6
      Alexander Graf committed
      Emulation of an instruction can have different outcomes. It can succeed,
      fail, require MMIO, do funky BookE stuff - or it can just realize something's
      odd and will be fixed the next time around.
      
      Exactly that is what EMULATE_AGAIN means. Using that flag we can now tell
      the caller that nothing happened, but we still want to go back to the
      guest and see what happens next time we come around.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      37f5bca6
    • A
      KVM: PPC: Teach MMIO Signedness · 3587d534
      Alexander Graf committed
      The guest I was trying to get to run uses the LHA and LHAU instructions.
      Those instructions basically do a load, but also sign extend the result.
      
      Since we need to fill our registers by hand when doing MMIO, we also need
      to sign extend manually.
      
      This patch implements sign extended MMIO and the LHA(U) instructions.
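
      A minimal sketch of the manual sign extension, assuming the MMIO result
      arrives as a raw value plus an access width in bytes. The function name
      is hypothetical; LHA and LHAU load a halfword, so they correspond to a
      width of 2.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sign extend the low len_bytes bytes of raw to 64 bits, as needed when
 * filling a register by hand after a sign-extending load.
 */
static uint64_t mmio_sign_extend(uint64_t raw, unsigned int len_bytes)
{
    assert(len_bytes == 1 || len_bytes == 2 ||
           len_bytes == 4 || len_bytes == 8);

    unsigned int bits = len_bytes * 8;
    uint64_t mask = (bits == 64) ? ~0ULL : ((1ULL << bits) - 1);
    uint64_t sign = 1ULL << (bits - 1);
    uint64_t v = raw & mask;

    /* Flip the sign bit, then subtract it: portable two's complement trick. */
    return (v ^ sign) - sign;
}
```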
      Signed-off-by: Alexander Graf <agraf@suse.de>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      3587d534
    • A
      KVM: PPC: Enable MMIO to do 64 bits, fprs and qprs · b104d066
      Alexander Graf committed
      Right now MMIO access can only happen for GPRs and is at most 32 bit wide.
      That's actually enough for almost all types of hardware out there.
      
      Unfortunately, the guest I was using used FPU writes to MMIO regions, so
      it ended up writing 64 bit MMIOs using FPRs and QPRs.
      
      So let's add code to handle those odd cases too.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      b104d066
  7. 01 Mar, 2010 6 commits
    • A
      KVM: PPC: Fix initial GPR settings · 1c0006d8
      Alexander Graf committed
      Commit 7d01b4c3ed2bb33ceaf2d270cb4831a67a76b51b introduced PACA backed vcpu
      values. Since that commit, when a userspace app set GPRs before the vcpu was
      actually first loaded, the values it set got discarded.
      
      This is because vcpu_load loads them from the vcpu backing store that we use
      whenever we're not owning the PACA.
      
      That behavior is not really a major problem, because we don't need it for
      qemu. Other users (like kvmctl) do have problems with it though, so let's
      better do it right.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      1c0006d8
    • A
      KVM: PPC: Emulate trap SRR1 flags properly · 25a8a02d
      Alexander Graf committed
      Book3S needs some flags in SRR1 to get to know details about an interrupt.
      
      One such example is the trap instruction. It tells the guest kernel that
      a program interrupt is due to a trap using a bit in SRR1.
      
      This patch implements the above behavior, making WARN_ON behave like WARN_ON.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      25a8a02d
    • A
      KVM: PPC: Use PACA backed shadow vcpu · 7e57cba0
      Alexander Graf committed
      We're being horribly racy right now. All the entry and exit code hijacks
      random fields from the PACA that could easily be used by different code in
      case we get interrupted, for example by a #MC or even page fault.
      
      After discussing this with Ben, we figured it's best to reserve some more
      space in the PACA and just shove off some vcpu state to there.
      
      That way we can drastically improve the readability of the code, make it
      less racy and less complex.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      7e57cba0
    • A
      KVM: PPC: Add helpers for CR, XER · 992b5b29
      Alexander Graf committed
      We now have helpers for the GPRs, so let's also add some for CR and XER.
      
      Having them in the PACA simplifies code a lot, as we don't need to care
      about where to store CC or not to overflow any integers.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      992b5b29
    • A
      KVM: PPC: Use accessor functions for GPR access · 8e5b26b5
      Alexander Graf committed
      All code in PPC KVM currently accesses gprs in the vcpu struct directly.
      
      While there's nothing wrong with that wrt the current way gprs are stored
      and loaded, it doesn't suffice for the PACA acceleration that will follow
      in this patchset.
      
      So let's just create little wrapper inline functions that we call whenever
      a GPR needs to be read from or written to. The compiled code shouldn't really
      change at all for now.
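
      A minimal sketch of such wrappers, assuming a toy vcpu struct that holds
      the GPRs in a plain array. The struct and function names are hypothetical;
      the point is that callers stop touching the storage directly, so it can
      later be moved without changing them.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_GPRS 32

/* Toy stand-in for the vcpu struct; only the GPR storage is modelled. */
struct toy_vcpu {
    uint64_t gpr[NUM_GPRS];
};

/* Read GPR number num through the accessor instead of vcpu->gpr[num]. */
static inline uint64_t toy_get_gpr(const struct toy_vcpu *vcpu, int num)
{
    assert(num >= 0 && num < NUM_GPRS);
    return vcpu->gpr[num];
}

/* Write GPR number num through the accessor. */
static inline void toy_set_gpr(struct toy_vcpu *vcpu, int num, uint64_t val)
{
    assert(num >= 0 && num < NUM_GPRS);
    vcpu->gpr[num] = val;
}
```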
      Signed-off-by: Alexander Graf <agraf@suse.de>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      8e5b26b5
    • A
      KVM: powerpc: Improve DEC handling · 7706664d
      Alexander Graf committed
      We treated the DEC interrupt like an edge based one. This is not true for
      Book3s. The DEC keeps firing until mtdec is issued again and thus clears
      the interrupt line.
      
      So let's implement this logic in KVM too. This patch moves the line clearing
      from the firing of the interrupt to the mtdec emulation.
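
      A minimal sketch of the level-style behavior, assuming a toy decrementer
      model with a single pending flag. All names are hypothetical; the sketch
      only shows that delivering the interrupt leaves the line asserted and that
      the mtdec emulation is what clears it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy decrementer state: the current value and the interrupt line. */
struct toy_dec {
    uint32_t dec;
    bool pending;
};

/* The decrementer expired: assert the interrupt line. */
static void toy_dec_expired(struct toy_dec *d)
{
    assert(d);
    d->pending = true;
}

/* Delivering the interrupt does not clear the line any more. */
static bool toy_dec_deliver(const struct toy_dec *d)
{
    assert(d);
    return d->pending;
}

/* Emulated mtdec: load a new value and clear the line. */
static void toy_emulate_mtdec(struct toy_dec *d, uint32_t val)
{
    assert(d);
    d->dec = val;
    d->pending = false;
}
```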
      
      This makes PPC64 guests work without AGGRESSIVE_DEC defined.
      Signed-off-by: Alexander Graf <agraf@suse.de>
      Acked-by: Hollis Blanchard <hollis@penguinppc.org>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      7706664d
  8. 05 Nov, 2009 1 commit
  9. 24 Mar, 2009 6 commits
  10. 31 Dec, 2008 9 commits