1. 10 4月, 2017 1 次提交
  2. 02 3月, 2017 2 次提交
  3. 31 1月, 2017 10 次提交
    • D
      KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing implementation · 5e985969
      David Gibson 提交于
      This adds a not yet working outline of the HPT resizing PAPR
      extension.  Specifically it adds the necessary ioctl() functions,
      their basic steps, the work function which will handle preparation for
      the resize, and synchronization between these, the guest page fault
      path and guest HPT update path.
      
      The actual guts of the implementation isn't here yet, so for now the
      calls will always fail.
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      5e985969
    • D
      KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size · f98a8bf9
      David Gibson 提交于
      The KVM_PPC_ALLOCATE_HTAB ioctl() is used to set the size of hashed page
      table (HPT) that userspace expects a guest VM to have, and is also used to
      clear that HPT when necessary (e.g. guest reboot).
      
      At present, once the ioctl() is called for the first time, the HPT size can
      never be changed thereafter - it will be cleared but always sized as from
      the first call.
      
      With upcoming HPT resize implementation, we're going to need to allow
      userspace to resize the HPT at reset (to change it back to the default size
      if the guest changed it).
      
      So, we need to allow this ioctl() to change the HPT size.
      
      This patch also updates Documentation/virtual/kvm/api.txt to reflect
      the new behaviour.  In fact the documentation was already slightly
      incorrect since 572abd56 "KVM: PPC: Book3S HV: Don't fall back to
      smaller HPT size in allocation ioctl"
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      f98a8bf9
    • D
      KVM: PPC: Book3S HV: Split HPT allocation from activation · aae0777f
      David Gibson 提交于
      Currently, kvmppc_alloc_hpt() both allocates a new hashed page table (HPT)
      and sets it up as the active page table for a VM.  For the upcoming HPT
      resize implementation we're going to want to allocate HPTs separately from
      activating them.
      
      So, split the allocation itself out into kvmppc_allocate_hpt() and perform
      the activation with a new kvmppc_set_hpt() function.  Likewise we split
      kvmppc_free_hpt(), which just frees the HPT, from kvmppc_release_hpt()
      which unsets it as an active HPT, then frees it.
      
      We also move the logic to fall back to smaller HPT sizes if the first try
      fails into the single caller which used that behaviour,
      kvmppc_hv_setup_htab_rma().  This introduces a slight semantic change, in
      that previously if the initial attempt at CMA allocation failed, we would
      fall back to attempting smaller sizes with the page allocator.  Now, we
      try first CMA, then the page allocator at each size.  As far as I can tell
      this change should be harmless.
      
      To match, we make kvmppc_free_hpt() just free the actual HPT itself.  The
      call to kvmppc_free_lpid() that was there, we move to the single caller.
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      aae0777f
    • D
      KVM: PPC: Book3S HV: Gather HPT related variables into sub-structure · 3f9d4f5a
      David Gibson 提交于
      Currently, the powerpc kvm_arch structure contains a number of variables
      tracking the state of the guest's hashed page table (HPT) in KVM HV.  This
      patch gathers them all together into a single kvm_hpt_info substructure.
      This makes life more convenient for the upcoming HPT resizing
      implementation.
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      3f9d4f5a
    • P
      KVM: PPC: Book3S HV: Enable radix guest support · 8cf4ecc0
      Paul Mackerras 提交于
      This adds a few last pieces of the support for radix guests:
      
      * Implement the backends for the KVM_PPC_CONFIGURE_V3_MMU and
        KVM_PPC_GET_RMMU_INFO ioctls for radix guests
      
      * On POWER9, allow secondary threads to be on/off-lined while guests
        are running.
      
      * Set up LPCR and the partition table entry for radix guests.
      
      * Don't allocate the rmap array in the kvm_memory_slot structure
        on radix.
      
      * Don't try to initialize the HPT for radix guests, since they don't
        have an HPT.
      
      * Take out the code that prevents the HV KVM module from
        initializing on radix hosts.
      
      At this stage, we only support radix guests if the host is running
      in radix mode, and only support HPT guests if the host is running in
      HPT mode.  Thus a guest cannot switch from one mode to the other,
      which enables some simplifications.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      8cf4ecc0
    • P
      KVM: PPC: Book3S HV: Invalidate TLB on radix guest vcpu movement · a29ebeaf
      Paul Mackerras 提交于
      With radix, the guest can do TLB invalidations itself using the tlbie
      (global) and tlbiel (local) TLB invalidation instructions.  Linux guests
      use local TLB invalidations for translations that have only ever been
      accessed on one vcpu.  However, that doesn't mean that the translations
      have only been accessed on one physical cpu (pcpu) since vcpus can move
      around from one pcpu to another.  Thus a tlbiel might leave behind stale
      TLB entries on a pcpu where the vcpu previously ran, and if that task
      then moves back to that previous pcpu, it could see those stale TLB
      entries and thus access memory incorrectly.  The usual symptom of this
      is random segfaults in userspace programs in the guest.
      
      To cope with this, we detect when a vcpu is about to start executing on
      a thread in a core that is a different core from the last time it
      executed.  If that is the case, then we mark the core as needing a
      TLB flush and then send an interrupt to any thread in the core that is
      currently running a vcpu from the same guest.  This will get those vcpus
      out of the guest, and the first one to re-enter the guest will do the
      TLB flush.  The reason for interrupting the vcpus executing on the old
      core is to cope with the following scenario:
      
      	CPU 0			CPU 1			CPU 4
      	(core 0)			(core 0)			(core 1)
      
      	VCPU 0 runs task X      VCPU 1 runs
      	core 0 TLB gets
      	entries from task X
      	VCPU 0 moves to CPU 4
      							VCPU 0 runs task X
      							Unmap pages of task X
      							tlbiel
      
      				(still VCPU 1)			task X moves to VCPU 1
      				task X runs
      				task X sees stale TLB
      				entries
      
      That is, as soon as the VCPU starts executing on the new core, it
      could unmap and tlbiel some page table entries, and then the task
      could migrate to one of the VCPUs running on the old core and
      potentially see stale TLB entries.
      
      Since the TLB is shared between all the threads in a core, we only
      use the bit of kvm->arch.need_tlb_flush corresponding to the first
      thread in the core.  To ensure that we don't have a window where we
      can miss a flush, this moves the clearing of the bit from before the
      actual flush to after it.  This way, two threads might both do the
      flush, but we prevent the situation where one thread can enter the
      guest before the flush is finished.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      a29ebeaf
    • P
      KVM: PPC: Book3S HV: Implement dirty page logging for radix guests · 8f7b79b8
      Paul Mackerras 提交于
      This adds code to keep track of dirty pages when requested (that is,
      when memslot->dirty_bitmap is non-NULL) for radix guests.  We use the
      dirty bits in the PTEs in the second-level (partition-scoped) page
      tables, together with a bitmap of pages that were dirty when their
      PTE was invalidated (e.g., when the page was paged out).  This bitmap
      is stored in the first half of the memslot->dirty_bitmap area, and
      kvm_vm_ioctl_get_dirty_log_hv() now uses the second half for the
      bitmap that gets returned to userspace.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      8f7b79b8
    • P
      KVM: PPC: Book3S HV: Page table construction and page faults for radix guests · 5a319350
      Paul Mackerras 提交于
      This adds the code to construct the second-level ("partition-scoped" in
      architecturese) page tables for guests using the radix MMU.  Apart from
      the PGD level, which is allocated when the guest is created, the rest
      of the tree is all constructed in response to hypervisor page faults.
      
      As well as hypervisor page faults for missing pages, we also get faults
      for reference/change (RC) bits needing to be set, as well as various
      other error conditions.  For now, we only set the R or C bit in the
      guest page table if the same bit is set in the host PTE for the
      backing page.
      
      This code can take advantage of the guest being backed with either
      transparent or ordinary 2MB huge pages, and insert 2MB page entries
      into the guest page tables.  There is no support for 1GB huge pages
      yet.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      5a319350
    • P
      KVM: PPC: Book3S HV: Set process table for HPT guests on POWER9 · 468808bd
      Paul Mackerras 提交于
      This adds the implementation of the KVM_PPC_CONFIGURE_V3_MMU ioctl
      for HPT guests on POWER9.  With this, we can return 1 for the
      KVM_CAP_PPC_MMU_HASH_V3 capability.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      468808bd
    • P
      KVM: PPC: Book3S HV: Add userspace interfaces for POWER9 MMU · c9270132
      Paul Mackerras 提交于
      This adds two capabilities and two ioctls to allow userspace to
      find out about and configure the POWER9 MMU in a guest.  The two
      capabilities tell userspace whether KVM can support a guest using
      the radix MMU, or using the hashed page table (HPT) MMU with a
      process table and segment tables.  (Note that the MMUs in the
      POWER9 processor cores do not use the process and segment tables
      when in HPT mode, but the nest MMU does).
      
      The KVM_PPC_CONFIGURE_V3_MMU ioctl allows userspace to specify
      whether a guest will use the radix MMU or the HPT MMU, and to
      specify the size and location (in guest space) of the process
      table.
      
      The KVM_PPC_GET_RMMU_INFO ioctl gives userspace information about
      the radix MMU.  It returns a list of supported radix tree geometries
      (base page size and number of bits indexed at each level of the
      radix tree) and the encoding used to specify the various page
      sizes for the TLB invalidate entry instruction.
      
      Initially, both capabilities return 0 and the ioctls return -EINVAL,
      until the necessary infrastructure for them to operate correctly
      is added.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c9270132
  4. 27 1月, 2017 2 次提交
    • P
      KVM: PPC: Book3S HV: Fix H_PROD to actually wake the target vcpu · 8464c884
      Paul Mackerras 提交于
      The H_PROD hypercall is supposed to wake up an idle vcpu.  We have
      an implementation, but because Linux doesn't use it except when
      doing cpu hotplug, it was never tested properly.  AIX does use it,
      and reported it broken.  It turns out we were waking the wrong
      vcpu (the one doing H_PROD, not the target of the prod) and we
      weren't handling the case where the target needs an IPI to wake
      it.  Fix it by using the existing kvmppc_fast_vcpu_kick_hv()
      function, which is intended for this kind of thing, and by using
      the target vcpu not the current vcpu.
      
      We were also not looking at the prodded flag when checking whether a
      ceded vcpu should wake up, so this adds checks for the prodded flag
      alongside the checks for pending exceptions.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      8464c884
    • P
      KVM: PPC: Book3S HV: Don't try to signal cpu -1 · 3deda5e5
      Paul Mackerras 提交于
      If the target vcpu for kvmppc_fast_vcpu_kick_hv() is not running on
      any CPU, then we will have vcpu->arch.thread_cpu == -1, and as it
      happens, kvmppc_fast_vcpu_kick_hv will call kvmppc_ipi_thread with
      -1 as the cpu argument.  Although this is not meaningful, in the past,
      before commit 1704a81c ("KVM: PPC: Book3S HV: Use msgsnd for IPIs
      to other cores on POWER9", 2016-11-18), it was harmless because CPU
      -1 is not in the same core as any real CPU thread.  On a POWER9,
      however, we don't do the "same core" check, so we were trying to
      do a msgsnd to thread -1, which is invalid.  To avoid this, we add
      a check to see that vcpu->arch.thread_cpu is >= 0 before calling
      kvmppc_ipi_thread() with it.  Since vcpu->arch.thread_vcpu can change
      asynchronously, we use READ_ONCE to ensure that the value we check is
      the same value that we use as the argument to kvmppc_ipi_thread().
      
      Fixes: 1704a81c ("KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9")
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      3deda5e5
  5. 26 12月, 2016 1 次提交
    • T
      ktime: Cleanup ktime_set() usage · 8b0e1953
      Thomas Gleixner 提交于
      ktime_set(S,N) was required for the timespec storage type and is still
      useful for situations where a Seconds and Nanoseconds part of a time value
      needs to be converted. For anything where the Seconds argument is 0, this
      is pointless and can be replaced with a simple assignment.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      8b0e1953
  6. 25 12月, 2016 1 次提交
  7. 02 12月, 2016 1 次提交
  8. 28 11月, 2016 3 次提交
  9. 24 11月, 2016 8 次提交
    • S
      KVM: PPC: Book3S HV: Update kvmppc_set_arch_compat() for ISA v3.00 · 2ee13be3
      Suraj Jitindar Singh 提交于
      The function kvmppc_set_arch_compat() is used to determine the value of the
      processor compatibility register (PCR) for a guest running in a given
      compatibility mode. There is currently no support for v3.00 of the ISA.
      
      Add support for v3.00 of the ISA which adds an ISA v2.07 compatilibity mode
      to the PCR.
      
      We also add a check to ensure the processor we are running on is capable of
      emulating the chosen processor (for example a POWER7 cannot emulate a
      POWER8, similarly with a POWER8 and a POWER9).
      
      Based on work by: Paul Mackerras <paulus@ozlabs.org>
      
      [paulus@ozlabs.org - moved dummy PCR_ARCH_300 definition here; set
       guest_pcr_bit when arch_compat == 0, added comment.]
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      2ee13be3
    • P
      KVM: PPC: Book3S HV: Treat POWER9 CPU threads as independent subcores · 45c940ba
      Paul Mackerras 提交于
      With POWER9, each CPU thread has its own MMU context and can be
      in the host or a guest independently of the other threads; there is
      still however a restriction that all threads must use the same type
      of address translation, either radix tree or hashed page table (HPT).
      
      Since we only support HPT guests on a HPT host at this point, we
      can treat the threads as being independent, and avoid all of the
      work of coordinating the CPU threads.  To make this simpler, we
      introduce a new threads_per_vcore() function that returns 1 on
      POWER9 and threads_per_subcore on POWER7/8, and use that instead
      of threads_per_subcore or threads_per_core in various places.
      
      This also changes the value of the KVM_CAP_PPC_SMT capability on
      POWER9 systems from 4 to 1, so that userspace will not try to
      create VMs with multiple vcpus per vcore.  (If userspace did create
      a VM that thought it was in an SMT mode, the VM might try to use
      the msgsndp instruction, which will not work as expected.  In
      future it may be possible to trap and emulate msgsndp in order to
      allow VMs to think they are in an SMT mode, if only for the purpose
      of allowing migration from POWER8 systems.)
      
      With all this, we can now run guests on POWER9 as long as the host
      is running with HPT translation.  Since userspace currently has no
      way to request radix tree translation for the guest, the guest has
      no choice but to use HPT translation.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      45c940ba
    • P
      KVM: PPC: Book3S HV: Enable hypervisor virtualization interrupts while in guest · 84f7139c
      Paul Mackerras 提交于
      The new XIVE interrupt controller on POWER9 can direct external
      interrupts to the hypervisor or the guest.  The interrupts directed to
      the hypervisor are controlled by an LPCR bit called LPCR_HVICE, and
      come in as a "hypervisor virtualization interrupt".  This sets the
      LPCR bit so that hypervisor virtualization interrupts can occur while
      we are in the guest.  We then also need to cope with exiting the guest
      because of a hypervisor virtualization interrupt.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      84f7139c
    • P
      KVM: PPC: Book3S HV: Use OPAL XICS emulation on POWER9 · f725758b
      Paul Mackerras 提交于
      POWER9 includes a new interrupt controller, called XIVE, which is
      quite different from the XICS interrupt controller on POWER7 and
      POWER8 machines.  KVM-HV accesses the XICS directly in several places
      in order to send and clear IPIs and handle interrupts from PCI
      devices being passed through to the guest.
      
      In order to make the transition to XIVE easier, OPAL firmware will
      include an emulation of XICS on top of XIVE.  Access to the emulated
      XICS is via OPAL calls.  The one complication is that the EOI
      (end-of-interrupt) function can now return a value indicating that
      another interrupt is pending; in this case, the XIVE will not signal
      an interrupt in hardware to the CPU, and software is supposed to
      acknowledge the new interrupt without waiting for another interrupt
      to be delivered in hardware.
      
      This adapts KVM-HV to use the OPAL calls on machines where there is
      no XICS hardware.  When there is no XICS, we look for a device-tree
      node with "ibm,opal-intc" in its compatible property, which is how
      OPAL indicates that it provides XICS emulation.
      
      In order to handle the EOI return value, kvmppc_read_intr() has
      become kvmppc_read_one_intr(), with a boolean variable passed by
      reference which can be set by the EOI functions to indicate that
      another interrupt is pending.  The new kvmppc_read_intr() keeps
      calling kvmppc_read_one_intr() until there are no more interrupts
      to process.  The return value from kvmppc_read_intr() is the
      largest non-zero value of the returns from kvmppc_read_one_intr().
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      f725758b
    • P
      KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9 · 1704a81c
      Paul Mackerras 提交于
      On POWER9, the msgsnd instruction is able to send interrupts to
      other cores, as well as other threads on the local core.  Since
      msgsnd is generally simpler and faster than sending an IPI via the
      XICS, we use msgsnd for all IPIs sent by KVM on POWER9.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      1704a81c
    • P
      KVM: PPC: Book3S HV: Adapt TLB invalidations to work on POWER9 · 7c5b06ca
      Paul Mackerras 提交于
      POWER9 adds new capabilities to the tlbie (TLB invalidate entry)
      and tlbiel (local tlbie) instructions.  Both instructions get a
      set of new parameters (RIC, PRS and R) which appear as bits in the
      instruction word.  The tlbiel instruction now has a second register
      operand, which contains a PID and/or LPID value if needed, and
      should otherwise contain 0.
      
      This adapts KVM-HV's usage of tlbie and tlbiel to work on POWER9
      as well as older processors.  Since we only handle HPT guests so
      far, we need RIC=0 PRS=0 R=0, which ends up with the same instruction
      word as on previous processors, so we don't need to conditionally
      execute different instructions depending on the processor.
      
      The local flush on first entry to a guest in book3s_hv_rmhandlers.S
      is a loop which depends on the number of TLB sets.  Rather than
      using feature sections to set the number of iterations based on
      which CPU we're on, we now work out this number at VM creation time
      and store it in the kvm_arch struct.  That will make it possible to
      get the number from the device tree in future, which will help with
      compatibility with future processors.
      
      Since mmu_partition_table_set_entry() does a global flush of the
      whole LPID, we don't need to do the TLB flush on first entry to the
      guest on each processor.  Therefore we don't set all bits in the
      tlb_need_flush bitmap on VM startup on POWER9.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      7c5b06ca
    • P
      KVM: PPC: Book3S HV: Add new POWER9 guest-accessible SPRs · e9cf1e08
      Paul Mackerras 提交于
      This adds code to handle two new guest-accessible special-purpose
      registers on POWER9: TIDR (thread ID register) and PSSCR (processor
      stop status and control register).  They are context-switched
      between host and guest, and the guest values can be read and set
      via the one_reg interface.
      
      The PSSCR contains some fields which are guest-accessible and some
      which are only accessible in hypervisor mode.  We only allow the
      guest-accessible fields to be read or set by userspace.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e9cf1e08
    • P
      KVM: PPC: Book3S HV: Set partition table rather than SDR1 on POWER9 · 7a84084c
      Paul Mackerras 提交于
      On POWER9, the SDR1 register (hashed page table base address) is no
      longer used, and instead the hardware reads the HPT base address
      and size from the partition table.  The partition table entry also
      contains the bits that specify the page size for the VRMA mapping,
      which were previously in the LPCR.  The VPM0 bit of the LPCR is
      now reserved; the processor now always uses the VRMA (virtual
      real-mode area) mechanism for guest real-mode accesses in HPT mode,
      and the RMO (real-mode offset) mechanism has been dropped.
      
      When entering or exiting the guest, we now only have to set the
      LPIDR (logical partition ID register), not the SDR1 register.
      There is also no requirement now to transition via a reserved
      LPID value.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      7a84084c
  10. 21 11月, 2016 3 次提交
    • P
      KVM: PPC: Book3S HV: Save/restore XER in checkpointed register state · 0d808df0
      Paul Mackerras 提交于
      When switching from/to a guest that has a transaction in progress,
      we need to save/restore the checkpointed register state.  Although
      XER is part of the CPU state that gets checkpointed, the code that
      does this saving and restoring doesn't save/restore XER.
      
      This fixes it by saving and restoring the XER.  To allow userspace
      to read/write the checkpointed XER value, we also add a new ONE_REG
      specifier.
      
      The visible effect of this bug is that the guest may see its XER
      value being corrupted when it uses transactions.
      
      Fixes: e4e38121 ("KVM: PPC: Book3S HV: Add transactional memory support")
      Fixes: 0a8eccef ("KVM: PPC: Book3S HV: Add missing code for transaction reclaim on guest exit")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: NThomas Huth <thuth@redhat.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      0d808df0
    • Y
      KVM: PPC: Book3S HV: Add a per vcpu cache for recently page faulted MMIO entries · a56ee9f8
      Yongji Xie 提交于
      This keeps a per vcpu cache for recently page faulted MMIO entries.
      On a page fault, if the entry exists in the cache, we can avoid some
      time-consuming paths, for example, looking up HPT, locking HPTE twice
      and searching mmio gfn from memslots, then directly call
      kvmppc_hv_emulate_mmio().
      
      In current implenment, we limit the size of cache to four. We think
      it's enough to cover the high-frequency MMIO HPTEs in most case.
      For example, considering the case of using virtio device, for virtio
      legacy devices, one HPTE could handle notifications from up to
      1024 (64K page / 64 byte Port IO register) devices, so one cache entry
      is enough; for virtio modern devices, we always need one HPTE to handle
      notification for each device because modern device would use a 8M MMIO
      register to notify host instead of Port IO register, typically the
      system's configuration should not exceed four virtio devices per
      vcpu, four cache entry is also enough in this case. Of course, if needed,
      we could also modify the macro to a module parameter in the future.
      Signed-off-by: NYongji Xie <xyjxie@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      a56ee9f8
    • W
      KVM: PPC: Book3S HV: Use list_move_tail instead of list_del/list_add_tail · 28d057c8
      Wei Yongjun 提交于
      Using list_move_tail() instead of list_del() + list_add_tail().
      Signed-off-by: NWei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      28d057c8
  11. 27 9月, 2016 2 次提交
    • P
      KVM: PPC: Book3S HV: Take out virtual core piggybacking code · b009031f
      Paul Mackerras 提交于
      This takes out the code that arranges to run two (or more) virtual
      cores on a single subcore when possible, that is, when both vcores
      are from the same VM, the VM is configured with one CPU thread per
      virtual core, and all the per-subcore registers have the same value
      in each vcore.  Since the VTB (virtual timebase) is a per-subcore
      register, and will almost always differ between vcores, this code
      is disabled on POWER8 machines, meaning that it is only usable on
      POWER7 machines (which don't have VTB).  Given the tiny number of
      POWER7 machines which have firmware that allows them to run HV KVM,
      the benefit of simplifying the code outweighs the loss of this
      feature on POWER7 machines.
      Tested-by: NThomas Huth <thuth@redhat.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      b009031f
    • P
      KVM: PPC: Book3S: Treat VTB as a per-subcore register, not per-thread · 88b02cf9
      Paul Mackerras 提交于
      POWER8 has one virtual timebase (VTB) register per subcore, not one
      per CPU thread.  The HV KVM code currently treats VTB as a per-thread
      register, which can lead to spurious soft lockup messages from guests
      which use the VTB as the time source for the soft lockup detector.
      (CPUs before POWER8 did not have the VTB register.)
      
      For HV KVM, this fixes the problem by making only the primary thread
      in each virtual core save and restore the VTB value.  With this,
      the VTB state becomes part of the kvmppc_vcore structure.  This
      also means that "piggybacking" of multiple virtual cores onto one
      subcore is not possible on POWER8, because then the virtual cores
      would share a single VTB register.
      
      PR KVM emulates a VTB register, which is per-vcpu because PR KVM
      has no notion of CPU threads or SMT.  For PR KVM we move the VTB
      state into the kvmppc_vcpu_book3s struct.
      
      Cc: stable@vger.kernel.org # v3.14+
      Reported-by: NThomas Huth <thuth@redhat.com>
      Tested-by: NThomas Huth <thuth@redhat.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      88b02cf9
  12. 12 9月, 2016 4 次提交
    • P
      KVM: PPC: Book3S HV: Set server for passed-through interrupts · 5d375199
      Paul Mackerras 提交于
      When a guest has a PCI pass-through device with an interrupt, it
      will direct the interrupt to a particular guest VCPU.  In fact the
      physical interrupt might arrive on any CPU, and then get
      delivered to the target VCPU in the emulated XICS (guest interrupt
      controller), and eventually delivered to the target VCPU.
      
      Now that we have code to handle device interrupts in real mode
      without exiting to the host kernel, there is an advantage to having
      the device interrupt arrive on the same sub(core) as the target
      VCPU is running on.  In this situation, the interrupt can be
      delivered to the target VCPU without any exit to the host kernel
      (using a hypervisor doorbell interrupt between threads if
      necessary).
      
      This patch aims to get passed-through device interrupts arriving
      on the correct core by setting the interrupt server in the real
      hardware XICS for the interrupt to the first thread in the (sub)core
      where its target VCPU is running.  We do this in the real-mode H_EOI
      code because the H_EOI handler already needs to look at the
      emulated ICS state for the interrupt (whereas the H_XIRR handler
      doesn't), and we know we are running in the target VCPU context
      at that point.
      
      We set the server CPU in hardware using an OPAL call, regardless of
      what the IRQ affinity mask for the interrupt says, and without
      updating the affinity mask.  This amounts to saying that when an
      interrupt is passed through to a guest, as a matter of policy we
      allow the guest's affinity for the interrupt to override the host's.
      
      This is inspired by an earlier patch from Suresh Warrier, although
      none of this code came from that earlier patch.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      5d375199
    • S
      KVM: PPC: Book3S HV: Tunable to disable KVM IRQ bypass · 644abbb2
      Suresh Warrier 提交于
      Add a  module parameter kvm_irq_bypass for kvm_hv.ko to
      disable IRQ bypass for passthrough interrupts. The default
      value of this tunable is 1 - that is enable the feature.
      
      Since the tunable is used by built-in kernel code, we use
      the module_param_cb macro to achieve this.
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      644abbb2
    • S
      KVM: PPC: Book3S HV: Complete passthrough interrupt in host · f7af5209
      Suresh Warrier 提交于
      In existing real mode ICP code, when updating the virtual ICP
      state, if there is a required action that cannot be completely
      handled in real mode, as for instance, a VCPU needs to be woken
      up, flags are set in the ICP to indicate the required action.
      This is checked when returning from hypercalls to decide whether
      the call needs switch back to the host where the action can be
      performed in virtual mode. Note that if h_ipi_redirect is enabled,
      real mode code will first try to message a free host CPU to
      complete this job instead of returning the host to do it ourselves.
      
      Currently, the real mode PCI passthrough interrupt handling code
      checks if any of these flags are set and simply returns to the host.
      This is not good enough as the trap value (0x500) is treated as an
      external interrupt by the host code. It is only when the trap value
      is a hypercall that the host code searches for and acts on unfinished
      work by calling kvmppc_xics_rm_complete.
      
      This patch introduces a special trap BOOK3S_INTERRUPT_HV_RM_HARD
      which is returned by KVM if there is unfinished business to be
      completed in host virtual mode after handling a PCI passthrough
      interrupt. The host checks for this special interrupt condition
      and calls into the kvmppc_xics_rm_complete, which is made an
      exported function for this reason.
      
      [paulus@ozlabs.org - moved logic to set r12 to BOOK3S_INTERRUPT_HV_RM_HARD
       in book3s_hv_rmhandlers.S into the end of kvmppc_check_wake_reason.]
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      f7af5209
    • S
      KVM: PPC: Book3S HV: Handle passthrough interrupts in guest · e3c13e56
      Suresh Warrier 提交于
      Currently, KVM switches back to the host to handle any external
      interrupt (when the interrupt is received while running in the
      guest). This patch updates real-mode KVM to check if an interrupt
      is generated by a passthrough adapter that is owned by this guest.
      If so, the real mode KVM will directly inject the corresponding
      virtual interrupt to the guest VCPU's ICS and also EOI the interrupt
      in hardware. In short, the interrupt is handled entirely in real
      mode in the guest context without switching back to the host.
      
      In some rare cases, the interrupt cannot be completely handled in
      real mode, for instance, a VCPU that is sleeping needs to be woken
      up. In this case, KVM simply switches back to the host with trap
      reason set to 0x500. This works, but it is clearly not very efficient.
      A following patch will distinguish this case and handle it
      correctly in the host. Note that we can use the existing
      check_too_hard() routine even though we are not in a hypercall to
      determine if there is unfinished business that needs to be
      completed in host virtual mode.
      
      The patch assumes that the mapping between hardware interrupt IRQ
      and virtual IRQ to be injected to the guest already exists for the
      PCI passthrough interrupts that need to be handled in real mode.
      If the mapping does not exist, KVM falls back to the default
      existing behavior.
      
      The KVM real mode code reads mappings from the mapped array in the
      passthrough IRQ map without taking any lock.  We carefully order the
      loads and stores of the fields in the kvmppc_irq_map data structure
      using memory barriers to avoid an inconsistent mapping being seen by
      the reader. Thus, although it is possible to miss a map entry, it is
      not possible to read a stale value.
      
      [paulus@ozlabs.org - get irq_chip from irq_map rather than pimap,
       pulled out powernv eoi change into a separate patch, made
       kvmppc_read_intr get the vcpu from the paca rather than being
       passed in, rewrote the logic at the end of kvmppc_read_intr to
       avoid deep indentation, simplified logic in book3s_hv_rmhandlers.S
       since we were always restoring SRR0/1 anyway, get rid of the cached
       array (just use the mapped array), removed the kick_all_cpus_sync()
       call, clear saved_xirr PACA field when we handle the interrupt in
       real mode, fix compilation with CONFIG_KVM_XICS=n.]
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e3c13e56
  13. 09 9月, 2016 2 次提交
    • S
      KVM: PPC: Book3S HV: Enable IRQ bypass · c57875f5
      Suresh Warrier 提交于
      Add the irq_bypass_add_producer and irq_bypass_del_producer
      functions. These functions get called whenever a GSI is being
      defined for a guest. They create/remove the mapping between
      host real IRQ numbers and the guest GSI.
      
      Add the following helper functions to manage the
      passthrough IRQ map.
      
      kvmppc_set_passthru_irq()
        Creates a mapping in the passthrough IRQ map that maps a host
        IRQ to a guest GSI. It allocates the structure (one per guest VM)
        the first time it is called.
      
      kvmppc_clr_passthru_irq()
        Removes the passthrough IRQ map entry given a guest GSI.
        The passthrough IRQ map structure is not freed even when the
        number of mapped entries goes to zero. It is only freed when
        the VM is destroyed.
      
      [paulus@ozlabs.org - modified to use is_pnv_opal_msi() rather than
       requiring all passed-through interrupts to use the same irq_chip;
       changed deletion so it zeroes out the r_hwirq field rather than
       copying the last entry down and decrementing the number of entries.]
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      c57875f5
    • S
      KVM: PPC: Book3S HV: Introduce kvmppc_passthru_irqmap · 8daaafc8
      Suresh Warrier 提交于
      This patch introduces an IRQ mapping structure, the
      kvmppc_passthru_irqmap structure that is to be used
      to map the real hardware IRQ in the host with the virtual
      hardware IRQ (gsi) that is injected into a guest by KVM for
      passthrough adapters.
      
      Currently, we assume a separate IRQ mapping structure for
      each guest. Each kvmppc_passthru_irqmap has a mapping arrays,
      containing all defined real<->virtual IRQs.
      
      [paulus@ozlabs.org - removed irq_chip field from struct
       kvmppc_passthru_irqmap; changed parameter for
       kvmppc_get_passthru_irqmap from struct kvm_vcpu * to struct
       kvm *, removed small cached array.]
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      8daaafc8