1. 20 Jun, 2017 1 commit
    • KVM: PPC: Book3S HV: Don't sleep if XIVE interrupt pending on POWER9 · ee3308a2
      Committed by Paul Mackerras
      On a POWER9 system, it is possible for an interrupt to become pending
      for a VCPU when that VCPU is about to cede (execute a H_CEDE hypercall)
      and has already disabled interrupts, or in the H_CEDE processing up
      to the point where the XIVE context is pulled from the hardware.  In
      such a case, the H_CEDE should not sleep, but should return immediately
      to the guest.  However, the conditions tested in kvmppc_vcpu_woken()
      don't include the condition that a XIVE interrupt is pending, so the
      VCPU could sleep until the next decrementer interrupt.
      
      To fix this, we add a new xive_interrupt_pending() helper which looks
      in the XIVE context that was pulled from the hardware to see if the
      priority of any pending interrupt is higher than (that is, numerically
      lower than) the CPU priority.  If so then kvmppc_vcpu_woken() will return true.
      If the XIVE context has never been used, then both the pipr and the
      cppr fields will be zero and the test will indicate that no interrupt
      is pending.
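
The pending test described above can be sketched in C as follows. This is a minimal illustration rather than the kernel code: the struct xive_saved_ctx and its layout are assumptions made for the example, and only the pipr and cppr field names and the comparison come from the description.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the XIVE context pulled from the hardware. */
struct xive_saved_ctx {
	uint8_t pipr;	/* priority of the pending interrupt */
	uint8_t cppr;	/* current CPU priority */
};

/* Pending if the interrupt priority is numerically lower than the CPU priority. */
static bool xive_interrupt_pending(const struct xive_saved_ctx *ctx)
{
	return ctx->pipr < ctx->cppr;
}
```

A context that has never been used has both fields at zero, so the comparison is false and the VCPU is allowed to sleep.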
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  2. 19 Jun, 2017 5 commits
    • KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9 · 57900694
      Committed by Paul Mackerras
      On POWER9, we no longer have the restriction that we had on POWER8
      where all threads in a core have to be in the same partition, so
      the CPU threads are now independent.  However, we still want to be
      able to run guests with a virtual SMT topology, if only to allow
      migration of guests from POWER8 systems to POWER9.
      
      A guest that has a virtual SMT mode greater than 1 will expect to
      be able to use the doorbell facility; it will expect the msgsndp
      and msgclrp instructions to work appropriately and to be able to read
      sensible values from the TIR (thread identification register) and
      DPDES (directed privileged doorbell exception status) special-purpose
      registers.  However, since each CPU thread is a separate sub-processor
      in POWER9, these instructions and registers can only be used within
      a single CPU thread.
      
      In order for these instructions to appear to act correctly according
      to the guest's virtual SMT mode, we have to trap and emulate them.
      We cause them to trap by clearing the HFSCR_MSGP bit in the HFSCR
      register.  The emulation is triggered by the hypervisor facility
      unavailable interrupt that occurs when the guest uses them.
      
      To cause a doorbell interrupt to occur within the guest, we set the
      DPDES register to 1.  If the guest has interrupts enabled, the CPU
      will generate a doorbell interrupt and clear the DPDES register in
      hardware.  The DPDES hardware register for the guest is saved in the
      vcpu->arch.vcore->dpdes field.  Since this gets written by the guest
      exit code, other VCPUs wishing to cause a doorbell interrupt don't
      write that field directly, but instead set a vcpu->arch.doorbell_request
      flag.  This is consumed and set to 0 by the guest entry code, which
      then sets DPDES to 1.
      
      Emulating reads of the DPDES register is somewhat involved, because
      it requires reading the doorbell pending interrupt status of all of the
      VCPU threads in the virtual core, and if any of those VCPUs are
      running, their doorbell status is only up-to-date in the hardware
      DPDES registers of the CPUs where they are running.  In order to get
      a reasonable approximation of the current doorbell status, we send
      those CPUs an IPI, causing an exit from the guest which will update
      the vcpu->arch.vcore->dpdes field.  We then use that value in
      constructing the emulated DPDES register value.
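
The doorbell request flow and the emulated DPDES read can be modelled roughly as below. This is a simplified sketch under stated assumptions: the struct doorbell_vcpu and the function names are hypothetical, the IPI that forces running VCPUs out of the guest is omitted, and treating a not-yet-consumed doorbell_request as pending in the emulated read is an assumption of this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-thread model; only doorbell_request and dpdes are named in the text. */
struct doorbell_vcpu {
	bool doorbell_request;	/* set by another VCPU that wants to ring this one */
	uint64_t dpdes;		/* models this thread's DPDES value (0 or 1) */
};

/* Sender side: request a doorbell without writing the target's DPDES. */
static void send_doorbell(struct doorbell_vcpu *target)
{
	target->doorbell_request = true;
}

/* Guest entry: consume the request flag, then set DPDES to 1. */
static void guest_entry(struct doorbell_vcpu *v)
{
	if (v->doorbell_request) {
		v->doorbell_request = false;
		v->dpdes = 1;
	}
}

/* Emulated DPDES read: one bit per thread of the virtual core. */
static uint64_t emulated_dpdes(const struct doorbell_vcpu *threads, int nthreads)
{
	uint64_t val = 0;

	for (int i = 0; i < nthreads; i++)
		if (threads[i].dpdes || threads[i].doorbell_request)
			val |= (uint64_t)1 << i;
	return val;
}
```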
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Allow userspace to set the desired SMT mode · 3c313524
      Committed by Paul Mackerras
      This allows userspace to set the desired virtual SMT (simultaneous
      multithreading) mode for a VM, that is, the number of VCPUs that
      get assigned to each virtual core.  Previously, the virtual SMT mode
      was fixed to the number of threads per subcore, and if userspace
      wanted to have fewer vcpus per vcore, then it would achieve that by
      using a sparse CPU numbering.  This had the disadvantage that the
      vcpu numbers can get quite large, particularly for SMT1 guests on
      a POWER8 with 8 threads per core.  With this patch, userspace can
      set its desired virtual SMT mode and then use contiguous vcpu
      numbering.
      
      On POWER8, where the threading mode is "strict", the virtual SMT mode
      must be less than or equal to the number of threads per subcore.  On
      POWER9, which implements a "loose" threading mode, the virtual SMT
      mode can be any power of 2 between 1 and 8, even though there is
      effectively one thread per subcore, since the threads are independent
      and can all be in different partitions.
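
The two validity rules can be written as a small predicate. This is a sketch only: the function and parameter names are hypothetical, and applying the power-of-2 and maximum-of-8 checks in strict mode as well is an assumption of the example, since the text states them only for POWER9.

```c
#include <stdbool.h>

/* True if n is a non-zero power of two. */
static bool is_pow2(unsigned int n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * loose_threading: true for POWER9-style independent threads,
 * false for the strict POWER8 model.
 */
static bool smt_mode_ok(bool loose_threading, unsigned int threads_per_subcore,
			unsigned int mode)
{
	if (mode > 8 || !is_pow2(mode))
		return false;
	if (!loose_threading && mode > threads_per_subcore)
		return false;
	return true;
}
```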
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Context-switch HFSCR between host and guest on POWER9 · 769377f7
      Committed by Paul Mackerras
      This adds code to allow us to use a different value for the HFSCR
      (Hypervisor Facilities Status and Control Register) when running the
      guest from that which applies in the host.  The reason for doing this
      is to allow us to trap the msgsndp instruction and related operations
      in future so that they can be virtualized.  We also save the value of
      HFSCR when a hypervisor facility unavailable interrupt occurs, because
      the high byte of HFSCR indicates which facility the guest attempted to
      access.
      
      We save and restore the host value on guest entry/exit because some
      bits of it affect host userspace execution.
      
      We only do all this on POWER9, not on POWER8, because we are not
      intending to virtualize any of the facilities controlled by HFSCR on
      POWER8.  In particular, the HFSCR bit that controls execution of
      msgsndp and related operations does not exist on POWER8.  The HFSCR
      doesn't exist at all on POWER7.
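
As an illustration of why the HFSCR value is saved at interrupt time, the facility code can be pulled out of the high byte like this. The helper name is hypothetical and the snippet does not attempt to list the facility numbering.

```c
#include <stdint.h>

/* The top 8 bits of the 64-bit HFSCR identify the facility the guest tried to use. */
static unsigned int hfscr_cause(uint64_t saved_hfscr)
{
	return (unsigned int)(saved_hfscr >> 56);
}
```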
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Don't let VCPU sleep if it has a doorbell pending · 1da4e2f4
      Committed by Paul Mackerras
      It is possible, through a narrow race condition, for a VCPU to exit
      the guest with a H_CEDE hypercall while it has a doorbell interrupt
      pending.  In this case, the H_CEDE should return immediately, but in
      fact it puts the VCPU to sleep until some other interrupt becomes
      pending or a prod is received (via another VCPU doing H_PROD).
      
      This fixes it by checking the DPDES (Directed Privileged Doorbell
      Exception Status) bit for the thread along with the other interrupt
      pending bits.
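
A rough model of the extended wake-up check is shown below. The struct cede_state, its fields and the function names are hypothetical, and the "not ceded" term is an assumption about what the other checks look like; only the idea of testing the thread's DPDES bit alongside the other pending conditions comes from the description.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the state examined when a VCPU cedes. */
struct cede_state {
	bool pending_exception;	/* some other interrupt is pending */
	bool ceded;		/* VCPU executed H_CEDE and has not been woken */
	uint64_t vcore_dpdes;	/* one doorbell bit per thread of the vcore */
	unsigned int thread;	/* this VCPU's thread index within the vcore */
};

/* True if the DPDES bit for this thread is set. */
static bool doorbell_pending(const struct cede_state *s)
{
	return (s->vcore_dpdes >> s->thread) & 1;
}

/* The VCPU must not sleep if anything, including a doorbell, is pending. */
static bool vcpu_should_wake(const struct cede_state *s)
{
	return s->pending_exception || !s->ceded || doorbell_pending(s);
}
```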
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Enable guests to use large decrementer mode on POWER9 · 1bc3fe81
      Committed by Paul Mackerras
      This allows userspace (e.g. QEMU) to enable large decrementer mode for
      the guest when running on a POWER9 host, by setting the LPCR_LD bit in
      the guest LPCR value.  With this, the guest exit code saves 64 bits of
      the guest DEC value on exit.  Other places that use the guest DEC
      value check the LPCR_LD bit in the guest LPCR value, and if it is set,
      omit the 32-bit sign extension that would otherwise be done.
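
The effect of the LPCR_LD check on a saved DEC value can be sketched as follows. The function name and the boolean parameter are hypothetical; the example only shows the difference between using the full saved value and sign-extending its low 32 bits.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Interpret a saved guest DEC value.
 * lpcr_ld models the LPCR_LD bit in the guest LPCR value.
 */
static int64_t guest_dec_value(uint64_t saved_dec, bool lpcr_ld)
{
	if (lpcr_ld)
		return (int64_t)saved_dec;	/* large decrementer: no 32-bit sign extension */
	/* Classic mode: sign-extend the low 32 bits (two's-complement conversion). */
	return (int64_t)(int32_t)(uint32_t)saved_dec;
}
```

With the bit clear, a saved value of 0xffffffff reads as -1 (the decrementer has just expired); with the bit set, the same bits are a large positive count.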
      
      This doesn't change the DEC emulation used by PR KVM because PR KVM
      is not supported on POWER9 yet.
      
      This is partly based on an earlier patch by Oliver O'Halloran.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  3. 16 Jun, 2017 1 commit
    • KVM: PPC: Book3S HV: Ignore timebase offset on POWER9 DD1 · 3d3efb68
      Committed by Paul Mackerras
      POWER9 DD1 has an erratum where writing to the TBU40 register, which
      is used to apply an offset to the timebase, can cause the timebase to
      lose counts.  This results in the timebase on some CPUs getting out of
      sync with other CPUs, which then results in misbehaviour of the
      timekeeping code.
      
      To work around the problem, we make KVM ignore the timebase offset for
      all guests on POWER9 DD1 machines.  This means that live migration
      cannot be supported on POWER9 DD1 machines.
      
      Cc: stable@vger.kernel.org # v4.10+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  4. 15 Jun, 2017 2 commits
    • KVM: PPC: Book3S HV: Preserve userspace HTM state properly · 46a704f8
      Committed by Paul Mackerras
      If userspace attempts to call the KVM_RUN ioctl when it has hardware
      transactional memory (HTM) enabled, the values that it has put in the
      HTM-related SPRs TFHAR, TFIAR and TEXASR will get overwritten by
      guest values.  To fix this, we detect this condition and save those
      SPR values in the thread struct, and disable HTM for the task.  If
      userspace goes to access those SPRs or the HTM facility in future,
      a TM-unavailable interrupt will occur and the handler will reload
      those SPRs and re-enable HTM.
      
      If userspace has started a transaction and suspended it, we would
      currently lose the transactional state in the guest entry path and
      would almost certainly get a "TM Bad Thing" interrupt, which would
      cause the host to crash.  To avoid this, we detect this case and
      return from the KVM_RUN ioctl with an EINVAL error, with the KVM
      exit reason set to KVM_EXIT_FAIL_ENTRY.
      
      Fixes: b005255e ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
      Cc: stable@vger.kernel.org # v3.14+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Restore critical SPRs to host values on guest exit · 4c3bb4cc
      Committed by Paul Mackerras
      This restores several special-purpose registers (SPRs) to sane values
      on guest exit that were missed before.
      
      TAR and VRSAVE are readable and writable by userspace, and we need to
      save and restore them to prevent the guest from potentially affecting
      userspace execution (not that TAR or VRSAVE are used by any known
      program that uses the KVM_RUN ioctl).  We save/restore these
      in kvmppc_vcpu_run_hv() rather than on every guest entry/exit.
      
      FSCR affects userspace execution in that it can prohibit access to
      certain facilities by userspace.  We restore it to the normal value
      for the task on exit from the KVM_RUN ioctl.
      
      IAMR is normally 0, and is restored to 0 on guest exit.  However,
      with a radix host on POWER9, it is set to a value that prevents the
      kernel from executing user-accessible memory.  On POWER9, we save
      IAMR on guest entry and restore it on guest exit to the saved value
      rather than 0.  On POWER8 we continue to set it to 0 on guest exit.
      
      PSPB is normally 0.  We restore it to 0 on guest exit to prevent
      userspace taking advantage of the guest having set it non-zero
      (which would allow userspace to set its SMT priority to high).
      
      UAMOR is normally 0.  We restore it to 0 on guest exit to prevent
      the AMR from being used as a covert channel between userspace
      processes, since the AMR is not context-switched at present.
      
      Fixes: b005255e ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
      Cc: stable@vger.kernel.org # v3.14+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  5. 13 Jun, 2017 1 commit
    • KVM: PPC: Book3S HV: Context-switch EBB registers properly · ca8efa1d
      Committed by Paul Mackerras
      This adds code to save the values of three SPRs (special-purpose
      registers) used by userspace to control event-based branches (EBBs),
      which are essentially interrupts that get delivered directly to
      userspace.  These registers are loaded up with guest values when
      entering the guest, and their values are saved when exiting the
      guest, but we were not saving the host values and restoring them
      before going back to userspace.
      
      On POWER8 this would only affect userspace programs which explicitly
      request the use of EBBs and also use the KVM_RUN ioctl, since the
      only source of EBBs on POWER8 is the PMU, and there is an explicit
      enable bit in the PMU registers (and those PMU registers do get
      properly context-switched between host and guest).  On POWER9 there
      is provision for externally-generated EBBs, and these are not subject
      to the control in the PMU registers.
      
      Since these registers only affect userspace, we can save them when
      we first come in from userspace and restore them before returning to
      userspace, rather than saving/restoring the host values on every
      guest entry/exit.  Similarly, we don't need to worry about their
      values on offline secondary threads since they execute in the context
      of the idle task, which never executes in userspace.
      
      Fixes: b005255e ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
      Cc: stable@vger.kernel.org # v3.14+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  6. 28 Apr, 2017 1 commit
    • KVM: PPC: Book3S HV: Avoid preemptibility warning in module initialization · db4b0dfa
      Committed by Denis Kirjanov
      With CONFIG_DEBUG_PREEMPT, get_paca() produces the following warning
      in kvmppc_book3s_init_hv() since it calls debug_smp_processor_id().
      
      There is no real issue with the xics_phys field.
      If paca->kvm_hstate.xics_phys is non-zero on one cpu, it will be
      non-zero on them all.  Therefore this is not fixing any actual
      problem, just the warning.
      
      [  138.521188] BUG: using smp_processor_id() in preemptible [00000000] code: modprobe/5596
      [  138.521308] caller is .kvmppc_book3s_init_hv+0x184/0x350 [kvm_hv]
      [  138.521404] CPU: 5 PID: 5596 Comm: modprobe Not tainted 4.11.0-rc3-00022-gc7e790c5 #1
      [  138.521509] Call Trace:
      [  138.521563] [c0000007d018b810] [c0000000023eef10] .dump_stack+0xe4/0x150 (unreliable)
      [  138.521694] [c0000007d018b8a0] [c000000001f6ec04] .check_preemption_disabled+0x134/0x150
      [  138.521829] [c0000007d018b940] [d00000000a010274] .kvmppc_book3s_init_hv+0x184/0x350 [kvm_hv]
      [  138.521963] [c0000007d018ba00] [c00000000191d5cc] .do_one_initcall+0x5c/0x1c0
      [  138.522082] [c0000007d018bad0] [c0000000023e9494] .do_init_module+0x84/0x240
      [  138.522201] [c0000007d018bb70] [c000000001aade18] .load_module+0x1f68/0x2a10
      [  138.522319] [c0000007d018bd20] [c000000001aaeb30] .SyS_finit_module+0xc0/0xf0
      [  138.522439] [c0000007d018be30] [c00000000191baec] system_call+0x38/0xfc
      Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  7. 27 Apr, 2017 1 commit
  8. 20 Apr, 2017 1 commit
  9. 10 Apr, 2017 1 commit
  10. 02 Mar, 2017 2 commits
  11. 31 Jan, 2017 10 commits
    • KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing implementation · 5e985969
      Committed by David Gibson
      This adds a not yet working outline of the HPT resizing PAPR
      extension.  Specifically it adds the necessary ioctl() functions,
      their basic steps, the work function which will handle preparation for
      the resize, and synchronization between these, the guest page fault
      path and guest HPT update path.
      
      The actual guts of the implementation aren't here yet, so for now the
      calls will always fail.
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size · f98a8bf9
      Committed by David Gibson
      The KVM_PPC_ALLOCATE_HTAB ioctl() is used to set the size of hashed page
      table (HPT) that userspace expects a guest VM to have, and is also used to
      clear that HPT when necessary (e.g. guest reboot).
      
      At present, once the ioctl() is called for the first time, the HPT size can
      never be changed thereafter - it will be cleared but always sized as from
      the first call.
      
      With the upcoming HPT resize implementation, we're going to need to allow
      userspace to resize the HPT at reset (to change it back to the default size
      if the guest changed it).
      
      So, we need to allow this ioctl() to change the HPT size.
      
      This patch also updates Documentation/virtual/kvm/api.txt to reflect
      the new behaviour.  In fact the documentation was already slightly
      incorrect since 572abd56 ("KVM: PPC: Book3S HV: Don't fall back to
      smaller HPT size in allocation ioctl").
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Split HPT allocation from activation · aae0777f
      Committed by David Gibson
      Currently, kvmppc_alloc_hpt() both allocates a new hashed page table (HPT)
      and sets it up as the active page table for a VM.  For the upcoming HPT
      resize implementation we're going to want to allocate HPTs separately from
      activating them.
      
      So, split the allocation itself out into kvmppc_allocate_hpt() and perform
      the activation with a new kvmppc_set_hpt() function.  Likewise we split
      kvmppc_free_hpt(), which just frees the HPT, from kvmppc_release_hpt()
      which unsets it as an active HPT, then frees it.
      
      We also move the logic to fall back to smaller HPT sizes if the first try
      fails into the single caller which used that behaviour,
      kvmppc_hv_setup_htab_rma().  This introduces a slight semantic change, in
      that previously if the initial attempt at CMA allocation failed, we would
      fall back to attempting smaller sizes with the page allocator.  Now, we
      try first CMA, then the page allocator at each size.  As far as I can tell
      this change should be harmless.
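
The new fallback order (CMA, then the page allocator, at each size before moving to the next smaller size) can be modelled as below. Everything here is a hypothetical stand-in: the two allocators are reduced to "largest order that would succeed" limits so that the control flow can be shown without real memory allocation.

```c
/* Hypothetical model: the largest order each allocator could satisfy. */
struct alloc_limits {
	unsigned int cma_max_order;
	unsigned int page_max_order;
};

enum hpt_source { HPT_NONE, HPT_FROM_CMA, HPT_FROM_PAGES };

struct hpt_result {
	unsigned int order;
	enum hpt_source source;
};

/* One allocation attempt at a fixed order: CMA first, then the page allocator. */
static enum hpt_source allocate_hpt_model(const struct alloc_limits *lim,
					  unsigned int order)
{
	if (order <= lim->cma_max_order)
		return HPT_FROM_CMA;
	if (order <= lim->page_max_order)
		return HPT_FROM_PAGES;
	return HPT_NONE;
}

/* Caller-side fallback: retry with smaller orders until one works or min_order fails. */
static struct hpt_result allocate_with_fallback(const struct alloc_limits *lim,
						unsigned int order,
						unsigned int min_order)
{
	struct hpt_result r = { order, allocate_hpt_model(lim, order) };

	while (r.source == HPT_NONE && r.order > min_order) {
		r.order--;
		r.source = allocate_hpt_model(lim, r.order);
	}
	return r;
}
```

Keeping the retry loop in the caller means the allocation function itself has a simple contract: it either delivers the requested size or fails.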
      
      To match, we make kvmppc_free_hpt() just free the actual HPT itself.  The
      call to kvmppc_free_lpid() that was there, we move to the single caller.
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Gather HPT related variables into sub-structure · 3f9d4f5a
      Committed by David Gibson
      Currently, the powerpc kvm_arch structure contains a number of variables
      tracking the state of the guest's hashed page table (HPT) in KVM HV.  This
      patch gathers them all together into a single kvm_hpt_info substructure.
      This makes life more convenient for the upcoming HPT resizing
      implementation.
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Enable radix guest support · 8cf4ecc0
      Committed by Paul Mackerras
      This adds a few last pieces of the support for radix guests:
      
      * Implement the backends for the KVM_PPC_CONFIGURE_V3_MMU and
        KVM_PPC_GET_RMMU_INFO ioctls for radix guests
      
      * On POWER9, allow secondary threads to be on/off-lined while guests
        are running.
      
      * Set up LPCR and the partition table entry for radix guests.
      
      * Don't allocate the rmap array in the kvm_memory_slot structure
        on radix.
      
      * Don't try to initialize the HPT for radix guests, since they don't
        have an HPT.
      
      * Take out the code that prevents the HV KVM module from
        initializing on radix hosts.
      
      At this stage, we only support radix guests if the host is running
      in radix mode, and only support HPT guests if the host is running in
      HPT mode.  Thus a guest cannot switch from one mode to the other,
      which enables some simplifications.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Invalidate TLB on radix guest vcpu movement · a29ebeaf
      Committed by Paul Mackerras
      With radix, the guest can do TLB invalidations itself using the tlbie
      (global) and tlbiel (local) TLB invalidation instructions.  Linux guests
      use local TLB invalidations for translations that have only ever been
      accessed on one vcpu.  However, that doesn't mean that the translations
      have only been accessed on one physical cpu (pcpu) since vcpus can move
      around from one pcpu to another.  Thus a tlbiel might leave behind stale
      TLB entries on a pcpu where the vcpu previously ran, and if that task
      then moves back to that previous pcpu, it could see those stale TLB
      entries and thus access memory incorrectly.  The usual symptom of this
      is random segfaults in userspace programs in the guest.
      
      To cope with this, we detect when a vcpu is about to start executing on
      a thread in a core that is a different core from the last time it
      executed.  If that is the case, then we mark the core as needing a
      TLB flush and then send an interrupt to any thread in the core that is
      currently running a vcpu from the same guest.  This will get those vcpus
      out of the guest, and the first one to re-enter the guest will do the
      TLB flush.  The reason for interrupting the vcpus executing on the old
      core is to cope with the following scenario:
      
      CPU 0                   CPU 1                   CPU 4
      (core 0)                (core 0)                (core 1)

      VCPU 0 runs task X      VCPU 1 runs
      core 0 TLB gets
      entries from task X
      VCPU 0 moves to CPU 4
                                                      VCPU 0 runs task X
                                                      Unmap pages of task X
                                                      tlbiel

                              (still VCPU 1)          task X moves to VCPU 1
                              task X runs
                              task X sees stale TLB
                              entries
      
      That is, as soon as the VCPU starts executing on the new core, it
      could unmap and tlbiel some page table entries, and then the task
      could migrate to one of the VCPUs running on the old core and
      potentially see stale TLB entries.
      
      Since the TLB is shared between all the threads in a core, we only
      use the bit of kvm->arch.need_tlb_flush corresponding to the first
      thread in the core.  To ensure that we don't have a window where we
      can miss a flush, this moves the clearing of the bit from before the
      actual flush to after it.  This way, two threads might both do the
      flush, but we prevent the situation where one thread can enter the
      guest before the flush is finished.
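
The bookkeeping described above can be sketched as a small model. The struct and function names are hypothetical, a fixed four threads per core is assumed for the example, and the interrupt sent to running threads is left out; the sketch marks the core where the vcpu previously ran, which is how the description reads.

```c
#include <stdbool.h>
#include <stdint.h>

#define THREADS_PER_CORE 4	/* assumed for this example */

/* Hypothetical per-VM state: one bit per pcpu, only first-thread bits are used. */
struct radix_vm {
	uint64_t need_tlb_flush;
};

/* Hypothetical per-vcpu state. */
struct radix_vcpu {
	int prev_cpu;	/* pcpu this vcpu last ran on, -1 if it has never run */
};

/* Number of the first thread in the core containing cpu. */
static int first_thread(int cpu)
{
	return cpu - (cpu % THREADS_PER_CORE);
}

/* Called when a vcpu is about to run on pcpu. */
static void check_need_tlb_flush(struct radix_vm *vm, struct radix_vcpu *v, int pcpu)
{
	if (v->prev_cpu >= 0 && first_thread(v->prev_cpu) != first_thread(pcpu))
		vm->need_tlb_flush |= (uint64_t)1 << first_thread(v->prev_cpu);
	v->prev_cpu = pcpu;
}

/* Guest entry on pcpu: flush if the core is marked. Returns true if a flush was done. */
static bool flush_if_needed(struct radix_vm *vm, int pcpu)
{
	uint64_t bit = (uint64_t)1 << first_thread(pcpu);

	if (!(vm->need_tlb_flush & bit))
		return false;
	/* ... the actual TLB flush would be performed here ... */
	vm->need_tlb_flush &= ~bit;	/* cleared only after the flush has completed */
	return true;
}
```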
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Implement dirty page logging for radix guests · 8f7b79b8
      Committed by Paul Mackerras
      This adds code to keep track of dirty pages when requested (that is,
      when memslot->dirty_bitmap is non-NULL) for radix guests.  We use the
      dirty bits in the PTEs in the second-level (partition-scoped) page
      tables, together with a bitmap of pages that were dirty when their
      PTE was invalidated (e.g., when the page was paged out).  This bitmap
      is stored in the first half of the memslot->dirty_bitmap area, and
      kvm_vm_ioctl_get_dirty_log_hv() now uses the second half for the
      bitmap that gets returned to userspace.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Page table construction and page faults for radix guests · 5a319350
      Committed by Paul Mackerras
      This adds the code to construct the second-level ("partition-scoped" in
      architecturese) page tables for guests using the radix MMU.  Apart from
      the PGD level, which is allocated when the guest is created, the rest
      of the tree is all constructed in response to hypervisor page faults.
      
      As well as hypervisor page faults for missing pages, we also get faults
      for reference/change (RC) bits needing to be set, as well as various
      other error conditions.  For now, we only set the R or C bit in the
      guest page table if the same bit is set in the host PTE for the
      backing page.
      
      This code can take advantage of the guest being backed with either
      transparent or ordinary 2MB huge pages, and insert 2MB page entries
      into the guest page tables.  There is no support for 1GB huge pages
      yet.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Set process table for HPT guests on POWER9 · 468808bd
      Committed by Paul Mackerras
      This adds the implementation of the KVM_PPC_CONFIGURE_V3_MMU ioctl
      for HPT guests on POWER9.  With this, we can return 1 for the
      KVM_CAP_PPC_MMU_HASH_V3 capability.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Add userspace interfaces for POWER9 MMU · c9270132
      Committed by Paul Mackerras
      This adds two capabilities and two ioctls to allow userspace to
      find out about and configure the POWER9 MMU in a guest.  The two
      capabilities tell userspace whether KVM can support a guest using
      the radix MMU, or using the hashed page table (HPT) MMU with a
      process table and segment tables.  (Note that the MMUs in the
      POWER9 processor cores do not use the process and segment tables
      when in HPT mode, but the nest MMU does).
      
      The KVM_PPC_CONFIGURE_V3_MMU ioctl allows userspace to specify
      whether a guest will use the radix MMU or the HPT MMU, and to
      specify the size and location (in guest space) of the process
      table.
      
      The KVM_PPC_GET_RMMU_INFO ioctl gives userspace information about
      the radix MMU.  It returns a list of supported radix tree geometries
      (base page size and number of bits indexed at each level of the
      radix tree) and the encoding used to specify the various page
      sizes for the TLB invalidate entry instruction.
      
      Initially, both capabilities return 0 and the ioctls return -EINVAL,
      until the necessary infrastructure for them to operate correctly
      is added.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  12. 27 Jan, 2017 2 commits
    • KVM: PPC: Book3S HV: Fix H_PROD to actually wake the target vcpu · 8464c884
      Committed by Paul Mackerras
      The H_PROD hypercall is supposed to wake up an idle vcpu.  We have
      an implementation, but because Linux doesn't use it except when
      doing cpu hotplug, it was never tested properly.  AIX does use it,
      and reported it broken.  It turns out we were waking the wrong
      vcpu (the one doing H_PROD, not the target of the prod) and we
      weren't handling the case where the target needs an IPI to wake
      it.  Fix it by using the existing kvmppc_fast_vcpu_kick_hv()
      function, which is intended for this kind of thing, and by using
      the target vcpu not the current vcpu.
      
      We were also not looking at the prodded flag when checking whether a
      ceded vcpu should wake up, so this adds checks for the prodded flag
      alongside the checks for pending exceptions.
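
A simplified model of the corrected behaviour is given below. The struct prod_vcpu, the kick counter and the return values are hypothetical; kick() merely stands in for kvmppc_fast_vcpu_kick_hv(), and only kicking a target that has ceded is an assumption of this sketch. The point illustrated is that the flag is set on, and the wake-up directed at, the target of the prod rather than the caller.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-vcpu state for the example. */
struct prod_vcpu {
	bool prodded;		/* a prod has been received */
	bool ceded;		/* vcpu is idle after H_CEDE */
	unsigned int kicks;	/* how many times this vcpu was kicked */
};

/* Stand-in for kvmppc_fast_vcpu_kick_hv(): just counts wake-up attempts. */
static void kick(struct prod_vcpu *v)
{
	v->kicks++;
}

/* H_PROD handling: act on the target vcpu, never on the caller. */
static int h_prod(struct prod_vcpu *target)
{
	if (target == NULL)
		return -1;	/* no such vcpu */
	target->prodded = true;
	if (target->ceded)
		kick(target);
	return 0;
}

/* A ceded vcpu wakes for a pending exception or a prod. */
static bool ceded_vcpu_should_wake(const struct prod_vcpu *v, bool pending_exception)
{
	return pending_exception || v->prodded;
}
```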
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Don't try to signal cpu -1 · 3deda5e5
      Committed by Paul Mackerras
      If the target vcpu for kvmppc_fast_vcpu_kick_hv() is not running on
      any CPU, then we will have vcpu->arch.thread_cpu == -1, and as it
      happens, kvmppc_fast_vcpu_kick_hv will call kvmppc_ipi_thread with
      -1 as the cpu argument.  Although this is not meaningful, in the past,
      before commit 1704a81c ("KVM: PPC: Book3S HV: Use msgsnd for IPIs
      to other cores on POWER9", 2016-11-18), it was harmless because CPU
      -1 is not in the same core as any real CPU thread.  On a POWER9,
      however, we don't do the "same core" check, so we were trying to
      do a msgsnd to thread -1, which is invalid.  To avoid this, we add
      a check to see that vcpu->arch.thread_cpu is >= 0 before calling
      kvmppc_ipi_thread() with it.  Since vcpu->arch.thread_cpu can change
      asynchronously, we use READ_ONCE to ensure that the value we check is
      the same value that we use as the argument to kvmppc_ipi_thread().
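
The read-once-then-check pattern can be illustrated in plain C as follows. This is a userspace approximation under stated assumptions: READ_ONCE_INT is a hypothetical macro imitating the kernel's READ_ONCE() with a volatile access, and ipi_thread() is a stand-in for kvmppc_ipi_thread() that only records its argument.

```c
#include <stdbool.h>

/* Userspace approximation of READ_ONCE() for an int: forces a single load. */
#define READ_ONCE_INT(x) (*(const volatile int *)&(x))

/* Hypothetical vcpu model. */
struct kick_vcpu {
	int thread_cpu;	/* -1 when the vcpu is not running on any CPU */
};

/* Stand-in for kvmppc_ipi_thread(): remembers which CPU was signalled. */
static int last_ipi_cpu = -1;

static void ipi_thread(int cpu)
{
	last_ipi_cpu = cpu;
}

/* Signal the CPU the vcpu is running on, if any. Returns true if an IPI was sent. */
static bool kick_running_vcpu(struct kick_vcpu *v)
{
	int cpu = READ_ONCE_INT(v->thread_cpu);	/* checked and used from one load */

	if (cpu < 0)
		return false;
	ipi_thread(cpu);
	return true;
}
```

Copying the field into a local variable guarantees that the value tested against zero is the value passed on, even if the field changes concurrently.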
      
      Fixes: 1704a81c ("KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  13. 26 Dec, 2016 1 commit
    • ktime: Cleanup ktime_set() usage · 8b0e1953
      Committed by Thomas Gleixner
      ktime_set(S,N) was required for the timespec storage type and is still
      useful for situations where a Seconds and Nanoseconds part of a time value
      needs to be converted. For anything where the Seconds argument is 0, this
      is pointless and can be replaced with a simple assignment.
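
To see why the call is redundant when the seconds argument is 0, consider this simplified model of a nanosecond-based time value. The names ktime_model_t and ktime_set_model() are hypothetical, and the model leaves out any range checking the real helper may perform.

```c
#include <stdint.h>

typedef int64_t ktime_model_t;	/* time value in nanoseconds */

#define NSEC_PER_SEC_MODEL 1000000000LL

/* Combine a seconds part and a nanoseconds part into one nanosecond count. */
static ktime_model_t ktime_set_model(int64_t secs, unsigned long nsecs)
{
	return secs * NSEC_PER_SEC_MODEL + (int64_t)nsecs;
}
```

With secs equal to 0 the result is simply nsecs, so `t = ktime_set_model(0, n);` and `t = n;` store the same value.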
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
  14. 25 Dec, 2016 1 commit
  15. 02 Dec, 2016 1 commit
  16. 28 Nov, 2016 3 commits
  17. 24 Nov, 2016 6 commits
    • KVM: PPC: Book3S HV: Update kvmppc_set_arch_compat() for ISA v3.00 · 2ee13be3
      Committed by Suraj Jitindar Singh
      The function kvmppc_set_arch_compat() is used to determine the value of the
      processor compatibility register (PCR) for a guest running in a given
      compatibility mode. There is currently no support for v3.00 of the ISA.
      
      Add support for v3.00 of the ISA, which adds an ISA v2.07 compatibility mode
      to the PCR.
      
      We also add a check to ensure the processor we are running on is capable of
      emulating the chosen processor (for example a POWER7 cannot emulate a
      POWER8, similarly with a POWER8 and a POWER9).
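
The capability check amounts to an ordering comparison between the host's ISA level and the requested compatibility level. The sketch below uses a hypothetical enum and function name and does not reproduce the PCR bit encodings.

```c
#include <errno.h>

/* Hypothetical ordering of ISA levels: POWER7, POWER8, POWER9. */
enum isa_level {
	ISA_V2_06 = 1,
	ISA_V2_07,
	ISA_V3_00,
};

/* A host can only provide compatibility modes at or below its own level. */
static int check_arch_compat(enum isa_level host, enum isa_level requested)
{
	return requested > host ? -EINVAL : 0;
}
```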
      
      Based on work by: Paul Mackerras <paulus@ozlabs.org>
      
      [paulus@ozlabs.org - moved dummy PCR_ARCH_300 definition here; set
       guest_pcr_bit when arch_compat == 0, added comment.]
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Treat POWER9 CPU threads as independent subcores · 45c940ba
      Committed by Paul Mackerras
      With POWER9, each CPU thread has its own MMU context and can be
      in the host or a guest independently of the other threads; there is
      still however a restriction that all threads must use the same type
      of address translation, either radix tree or hashed page table (HPT).
      
      Since we only support HPT guests on a HPT host at this point, we
      can treat the threads as being independent, and avoid all of the
      work of coordinating the CPU threads.  To make this simpler, we
      introduce a new threads_per_vcore() function that returns 1 on
      POWER9 and threads_per_subcore on POWER7/8, and use that instead
      of threads_per_subcore or threads_per_core in various places.
      
      This also changes the value of the KVM_CAP_PPC_SMT capability on
      POWER9 systems from 4 to 1, so that userspace will not try to
      create VMs with multiple vcpus per vcore.  (If userspace did create
      a VM that thought it was in an SMT mode, the VM might try to use
      the msgsndp instruction, which will not work as expected.  In
      future it may be possible to trap and emulate msgsndp in order to
      allow VMs to think they are in an SMT mode, if only for the purpose
      of allowing migration from POWER8 systems.)
      
      With all this, we can now run guests on POWER9 as long as the host
      is running with HPT translation.  Since userspace currently has no
      way to request radix tree translation for the guest, the guest has
      no choice but to use HPT translation.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      45c940ba
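
The shape of the threads_per_vcore() helper described above can be shown with a small stand-alone sketch. The `sk_host` struct and its `is_power9` flag are hypothetical stand-ins for the kernel's CPU feature test; only the returned values follow the commit message.

```c
#include <stdbool.h>

/* Hypothetical description of the host; not a kernel structure. */
struct sk_host {
    bool is_power9;           /* stand-in for a CPU feature check */
    int  threads_per_subcore; /* as configured on POWER7/8 */
};

/* 1 on POWER9 (threads are independent), threads_per_subcore otherwise. */
static int sk_threads_per_vcore(const struct sk_host *h)
{
    return h->is_power9 ? 1 : h->threads_per_subcore;
}

/* Value this sketch would report for an SMT capability query. */
static int sk_smt_capability(const struct sk_host *h)
{
    return sk_threads_per_vcore(h);
}
```

A caller that previously sized its per-vcore work by threads_per_subcore would use the helper instead, so the POWER9 case falls out without separate branches at each call site.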
    • P
      KVM: PPC: Book3S HV: Enable hypervisor virtualization interrupts while in guest · 84f7139c
      Paul Mackerras committed
      The new XIVE interrupt controller on POWER9 can direct external
      interrupts to the hypervisor or the guest.  The interrupts directed to
      the hypervisor are controlled by an LPCR bit called LPCR_HVICE, and
      come in as a "hypervisor virtualization interrupt".  This sets the
      LPCR bit so that hypervisor virtualization interrupts can occur while
      we are in the guest.  We then also need to cope with exiting the guest
      because of a hypervisor virtualization interrupt.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      84f7139c
    • P
      KVM: PPC: Book3S HV: Use OPAL XICS emulation on POWER9 · f725758b
      Paul Mackerras committed
      POWER9 includes a new interrupt controller, called XIVE, which is
      quite different from the XICS interrupt controller on POWER7 and
      POWER8 machines.  KVM-HV accesses the XICS directly in several places
      in order to send and clear IPIs and handle interrupts from PCI
      devices being passed through to the guest.
      
      In order to make the transition to XIVE easier, OPAL firmware will
      include an emulation of XICS on top of XIVE.  Access to the emulated
      XICS is via OPAL calls.  The one complication is that the EOI
      (end-of-interrupt) function can now return a value indicating that
      another interrupt is pending; in this case, the XIVE will not signal
      an interrupt in hardware to the CPU, and software is supposed to
      acknowledge the new interrupt without waiting for another interrupt
      to be delivered in hardware.
      
      This adapts KVM-HV to use the OPAL calls on machines where there is
      no XICS hardware.  When there is no XICS, we look for a device-tree
      node with "ibm,opal-intc" in its compatible property, which is how
      OPAL indicates that it provides XICS emulation.
      
      In order to handle the EOI return value, kvmppc_read_intr() has
      become kvmppc_read_one_intr(), with a boolean variable passed by
      reference which can be set by the EOI functions to indicate that
      another interrupt is pending.  The new kvmppc_read_intr() keeps
      calling kvmppc_read_one_intr() until there are no more interrupts
      to process.  The return value from kvmppc_read_intr() is the
      largest non-zero value of the returns from kvmppc_read_one_intr().
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      f725758b
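
The read loop described above can be sketched as follows. The `sk_intr_queue` type is a hypothetical stand-in for the interrupt controller: it replays a list of per-interrupt return values and sets the `again` flag whenever another entry is still queued, which models an EOI reporting a further pending interrupt. Only the loop structure and the "largest non-zero value" rule come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical interrupt source: a list of results to hand out in order. */
struct sk_intr_queue {
    const long *results;
    size_t count;
    size_t next;
};

/* Handle one interrupt; set *again if another one is already pending. */
static long sk_read_one_intr(struct sk_intr_queue *q, bool *again)
{
    long rc;

    assert(again != NULL);
    if (q->next >= q->count)
        return 0;               /* nothing pending */
    rc = q->results[q->next++];
    if (q->next < q->count)
        *again = true;          /* models the EOI return value */
    return rc;
}

/* Keep reading until no interrupt is pending; return the largest non-zero rc. */
static long sk_read_intr(struct sk_intr_queue *q)
{
    long ret = 0;
    bool again;

    do {
        long rc;

        again = false;
        rc = sk_read_one_intr(q, &again);
        if (rc && (ret == 0 || rc > ret))
            ret = rc;
    } while (again);
    return ret;
}
```

Passing the flag by reference keeps the per-interrupt function's return value free for its status code, which is why the outer loop can aggregate those codes independently of the "more pending" signal.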
    • P
      KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9 · 1704a81c
      Paul Mackerras committed
      On POWER9, the msgsnd instruction is able to send interrupts to
      other cores, as well as other threads on the local core.  Since
      msgsnd is generally simpler and faster than sending an IPI via the
      XICS, we use msgsnd for all IPIs sent by KVM on POWER9.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      1704a81c
    • P
      KVM: PPC: Book3S HV: Adapt TLB invalidations to work on POWER9 · 7c5b06ca
      Paul Mackerras committed
      POWER9 adds new capabilities to the tlbie (TLB invalidate entry)
      and tlbiel (local tlbie) instructions.  Both instructions get a
      set of new parameters (RIC, PRS and R) which appear as bits in the
      instruction word.  The tlbiel instruction now has a second register
      operand, which contains a PID and/or LPID value if needed, and
      should otherwise contain 0.
      
      This adapts KVM-HV's usage of tlbie and tlbiel to work on POWER9
      as well as older processors.  Since we only handle HPT guests so
      far, we need RIC=0 PRS=0 R=0, which ends up with the same instruction
      word as on previous processors, so we don't need to conditionally
      execute different instructions depending on the processor.
      
      The local flush on first entry to a guest in book3s_hv_rmhandlers.S
      is a loop which depends on the number of TLB sets.  Rather than
      using feature sections to set the number of iterations based on
      which CPU we're on, we now work out this number at VM creation time
      and store it in the kvm_arch struct.  That will make it possible to
      get the number from the device tree in future, which will help with
      compatibility with future processors.
      
      Since mmu_partition_table_set_entry() does a global flush of the
      whole LPID, we don't need to do the TLB flush on first entry to the
      guest on each processor.  Therefore we don't set all bits in the
      tlb_need_flush bitmap on VM startup on POWER9.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      7c5b06ca
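
The two VM-creation decisions described above (storing the TLB set count, and leaving the flush bitmap clear on POWER9) can be sketched in plain C. Everything here is illustrative: the `sk_` names are hypothetical, the set counts are assumed values chosen for the example, the bitmap is limited to 64 CPUs, and the loop only counts iterations where real code would issue a tlbiel per set.

```c
#include <assert.h>
#include <stdint.h>

enum sk_cpu { SK_POWER7, SK_POWER8, SK_POWER9 };

/* Hypothetical per-VM state; mirrors the idea, not the kernel layout. */
struct sk_kvm_arch {
    unsigned int tlb_sets;       /* iterations for the local flush loop */
    uint64_t     tlb_need_flush; /* one bit per physical CPU (max 64 here) */
};

/* Decide the loop count once, at VM creation, instead of per entry. */
static void sk_vm_init(struct sk_kvm_arch *a, enum sk_cpu cpu)
{
    switch (cpu) {
    case SK_POWER9: a->tlb_sets = 256; break;  /* assumed value */
    case SK_POWER8: a->tlb_sets = 512; break;  /* assumed value */
    default:        a->tlb_sets = 128; break;  /* assumed value */
    }
    /*
     * On POWER9 the partition table update already flushed the whole
     * LPID globally, so no CPU needs a flush on first entry.
     */
    a->tlb_need_flush = (cpu == SK_POWER9) ? 0 : ~UINT64_C(0);
}

/* Returns how many per-set invalidations were done for this CPU. */
static unsigned int sk_flush_on_entry(struct sk_kvm_arch *a, unsigned int cpu_id)
{
    uint64_t bit;
    unsigned int set, done = 0;

    assert(cpu_id < 64);
    bit = UINT64_C(1) << cpu_id;
    if (!(a->tlb_need_flush & bit))
        return 0;
    for (set = 0; set < a->tlb_sets; set++)
        done++;                 /* real code would invalidate this set */
    a->tlb_need_flush &= ~bit;
    return done;
}
```

Because the count lives in per-VM state, a later change could fill it from the device tree without touching the entry path.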