1. 01 Nov, 2017 1 commit
    • KVM: PPC: Book3S HV: Rename hpte_setup_done to mmu_ready · 1b151ce4
      Paul Mackerras committed
      This renames the kvm->arch.hpte_setup_done field to mmu_ready because
      we will want to use it for radix guests too -- both for setting things
      up before vcpu execution, and for excluding vcpus from executing while
      MMU-related things get changed, such as in future switching the MMU
      from radix to HPT mode or vice-versa.
      
      This also moves the call to kvmppc_setup_partition_table() that was
      done in kvmppc_hv_setup_htab_rma() for HPT guests, and the setting
      of mmu_ready, into the caller in kvmppc_vcpu_run_hv().
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  2. 01 Sep, 2017 1 commit
  3. 22 Jun, 2017 1 commit
    • KVM: PPC: Book3S HV: Exit guest upon MCE when FWNMI capability is enabled · e20bbd3d
      Aravinda Prasad committed
      Enhance KVM to cause a guest exit with KVM_EXIT_NMI
      exit reason upon a machine check exception (MCE) in
      the guest address space if the KVM_CAP_PPC_FWNMI
      capability is enabled (instead of delivering a 0x200
      interrupt to guest). This enables QEMU to build an error
      log and deliver the machine check exception to the guest via
      the guest-registered machine check handler.
      
      This approach simplifies the delivery of machine
      check exception to guest OS compared to the earlier
      approach of KVM directly invoking 0x200 guest interrupt
      vector.
      
      This design/approach is based on the feedback for the
      QEMU patches to handle machine check exception. Details
      of earlier approach of handling machine check exception
      in QEMU and related discussions can be found at:
      
      https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html
      
      Note:
      
      This patch now directly invokes machine_check_print_event_info()
      from kvmppc_handle_exit_hv() to print the event to host console
      at the time of guest exit before the exception is passed on to the
      guest. Hence, the host-side handling which was performed earlier
      via machine_check_fwnmi is removed.
      
      The reasons for this approach are: (i) it is not possible
      to distinguish whether the exception occurred in the
      guest or the host from the pt_regs passed on the
      machine_check_exception(). Hence machine_check_exception()
      calls panic, instead of passing on the exception to
      the guest, if the machine check exception is not
      recoverable. (ii) the approach introduced in this
      patch gives opportunity to the host kernel to perform
      actions in virtual mode before passing on the exception
      to the guest. This approach does not require complex
      tweaks to machine_check_fwnmi and friends.
      Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  4. 21 Jun, 2017 1 commit
    • KVM: PPC: Book3S HV: Add new capability to control MCE behaviour · 134764ed
      Aravinda Prasad committed
      This introduces a new KVM capability to control how KVM behaves
      on machine check exception (MCE) in HV KVM guests.
      
      If this capability has not been enabled, KVM redirects machine check
      exceptions to guest's 0x200 vector, if the address in error belongs to
      the guest. With this capability enabled, KVM will cause a guest exit
      with the exit reason indicating an NMI.
      
      The new capability is required to avoid problems if a new kernel/KVM
      is used with an old QEMU, running a guest that doesn't issue
      "ibm,nmi-register".  As old QEMU does not understand the NMI exit
      type, it treats it as a fatal error.  However, the guest could have
      handled the machine check error if the exception was delivered to
      guest's 0x200 interrupt vector instead of NMI exit in case of old
      QEMU.
      
      [paulus@ozlabs.org - Reworded the commit message to be clearer,
       enable only on HV KVM.]
      Signed-off-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
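The choice this capability controls can be sketched as a small decision function. This is only an illustration of the behaviour described in the commit message, not the kernel code: the enum and function names are hypothetical, and the sketch covers only machine checks whose failing address belongs to the guest.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome labels for illustration; not kernel identifiers. */
enum mce_outcome {
	MCE_DELIVER_TO_GUEST_0X200, /* reflect the MCE to the guest's 0x200 vector */
	MCE_EXIT_TO_USERSPACE_NMI   /* leave the guest with an NMI exit reason */
};

/*
 * Decide what happens to a machine check taken while running a guest.
 * Only errors at guest addresses are described in the commit message,
 * so that is treated as a precondition here.
 */
static enum mce_outcome guest_mce_outcome(bool fwnmi_enabled, bool addr_in_guest)
{
	assert(addr_in_guest);
	return fwnmi_enabled ? MCE_EXIT_TO_USERSPACE_NMI
			     : MCE_DELIVER_TO_GUEST_0X200;
}
```

With the capability left disabled, an old QEMU never sees the NMI exit it does not understand, which is the compatibility concern the message raises.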
  5. 19 Jun, 2017 4 commits
    • KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9 · 57900694
      Paul Mackerras committed
      On POWER9, we no longer have the restriction that we had on POWER8
      where all threads in a core have to be in the same partition, so
      the CPU threads are now independent.  However, we still want to be
      able to run guests with a virtual SMT topology, if only to allow
      migration of guests from POWER8 systems to POWER9.
      
      A guest that has a virtual SMT mode greater than 1 will expect to
      be able to use the doorbell facility; it will expect the msgsndp
      and msgclrp instructions to work appropriately and to be able to read
      sensible values from the TIR (thread identification register) and
      DPDES (directed privileged doorbell exception status) special-purpose
      registers.  However, since each CPU thread is a separate sub-processor
      in POWER9, these instructions and registers can only be used within
      a single CPU thread.
      
      In order for these instructions to appear to act correctly according
      to the guest's virtual SMT mode, we have to trap and emulate them.
      We cause them to trap by clearing the HFSCR_MSGP bit in the HFSCR
      register.  The emulation is triggered by the hypervisor facility
      unavailable interrupt that occurs when the guest uses them.
      
      To cause a doorbell interrupt to occur within the guest, we set the
      DPDES register to 1.  If the guest has interrupts enabled, the CPU
      will generate a doorbell interrupt and clear the DPDES register in
      hardware.  The DPDES hardware register for the guest is saved in the
      vcpu->arch.vcore->dpdes field.  Since this gets written by the guest
      exit code, other VCPUs wishing to cause a doorbell interrupt don't
      write that field directly, but instead set a vcpu->arch.doorbell_request
      flag.  This is consumed and set to 0 by the guest entry code, which
      then sets DPDES to 1.
      
      Emulating reads of the DPDES register is somewhat involved, because
      it requires reading the doorbell pending interrupt status of all of the
      VCPU threads in the virtual core, and if any of those VCPUs are
      running, their doorbell status is only up-to-date in the hardware
      DPDES registers of the CPUs where they are running.  In order to get
      a reasonable approximation of the current doorbell status, we send
      those CPUs an IPI, causing an exit from the guest which will update
      the vcpu->arch.vcore->dpdes field.  We then use that value in
      constructing the emulated DPDES register value.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Allow userspace to set the desired SMT mode · 3c313524
      Paul Mackerras committed
      This allows userspace to set the desired virtual SMT (simultaneous
      multithreading) mode for a VM, that is, the number of VCPUs that
      get assigned to each virtual core.  Previously, the virtual SMT mode
      was fixed to the number of threads per subcore, and if userspace
      wanted to have fewer vcpus per vcore, then it would achieve that by
      using a sparse CPU numbering.  This had the disadvantage that the
      vcpu numbers can get quite large, particularly for SMT1 guests on
      a POWER8 with 8 threads per core.  With this patch, userspace can
      set its desired virtual SMT mode and then use contiguous vcpu
      numbering.
      
      On POWER8, where the threading mode is "strict", the virtual SMT mode
      must be less than or equal to the number of threads per subcore.  On
      POWER9, which implements a "loose" threading mode, the virtual SMT
      mode can be any power of 2 between 1 and 8, even though there is
      effectively one thread per subcore, since the threads are independent
      and can all be in different partitions.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
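The constraints on the virtual SMT mode can be written as a small validity check. This is a sketch, not the kernel's implementation: the function and parameter names are hypothetical, and applying the power-of-2 and maximum-of-8 checks to the "strict" (POWER8) case as well is an assumption, since the message states them only for POWER9.

```c
#include <assert.h>
#include <stdbool.h>

static bool is_power_of_two(unsigned int v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * strict_threading models POWER8, where all threads of a subcore must be
 * in the same partition; loose threading models POWER9.
 */
static bool smt_mode_valid(unsigned int mode, bool strict_threading,
			   unsigned int threads_per_subcore)
{
	assert(threads_per_subcore >= 1);

	if (!is_power_of_two(mode) || mode > 8)
		return false;
	if (strict_threading && mode > threads_per_subcore)
		return false;
	return true;
}
```

For example, a mode of 4 is accepted with loose threading even when there is effectively one thread per subcore, while a mode of 8 is rejected with strict threading on a 4-thread subcore.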
    • KVM: PPC: Book3S HV: Context-switch HFSCR between host and guest on POWER9 · 769377f7
      Paul Mackerras committed
      This adds code to allow us to use a different value for the HFSCR
      (Hypervisor Facilities Status and Control Register) when running the
      guest from that which applies in the host.  The reason for doing this
      is to allow us to trap the msgsndp instruction and related operations
      in future so that they can be virtualized.  We also save the value of
      HFSCR when a hypervisor facility unavailable interrupt occurs, because
      the high byte of HFSCR indicates which facility the guest attempted to
      access.
      
      We save and restore the host value on guest entry/exit because some
      bits of it affect host userspace execution.
      
      We only do all this on POWER9, not on POWER8, because we are not
      intending to virtualize any of the facilities controlled by HFSCR on
      POWER8.  In particular, the HFSCR bit that controls execution of
      msgsndp and related operations does not exist on POWER8.  The HFSCR
      doesn't exist at all on POWER7.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Enable guests to use large decrementer mode on POWER9 · 1bc3fe81
      Paul Mackerras committed
      This allows userspace (e.g. QEMU) to enable large decrementer mode for
      the guest when running on a POWER9 host, by setting the LPCR_LD bit in
      the guest LPCR value.  With this, the guest exit code saves 64 bits of
      the guest DEC value on exit.  Other places that use the guest DEC
      value check the LPCR_LD bit in the guest LPCR value, and if it is set,
      omit the 32-bit sign extension that would otherwise be done.
      
      This doesn't change the DEC emulation used by PR KVM because PR KVM
      is not supported on POWER9 yet.
      
      This is partly based on an earlier patch by Oliver O'Halloran.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
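The conditional sign extension mentioned above can be illustrated as follows. This is a minimal sketch: the struct, the function name and the LPCR_LD_SKETCH bit position are assumptions made for illustration and should not be read as the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit only; the real LPCR_LD definition lives in the kernel headers. */
#define LPCR_LD_SKETCH (1ULL << 17)

/* Hypothetical container for the saved guest DEC and guest LPCR values. */
struct vcpu_sketch {
	uint64_t dec;
	uint64_t lpcr;
};

/*
 * Interpret the saved guest DEC value. Without large decrementer mode,
 * only the low 32 bits are meaningful and they are sign-extended; with
 * LPCR_LD set, the saved 64-bit value is used as is.
 */
static int64_t guest_dec_value(const struct vcpu_sketch *vcpu)
{
	assert(vcpu != NULL);

	if (!(vcpu->lpcr & LPCR_LD_SKETCH)) {
		uint32_t low = (uint32_t)vcpu->dec;

		/* Portable 32-bit sign extension. */
		return (low & 0x80000000u) ? (int64_t)low - 4294967296LL
					   : (int64_t)low;
	}
	return (int64_t)vcpu->dec;
}
```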
  6. 04 Jun, 2017 1 commit
  7. 27 Apr, 2017 1 commit
  8. 20 Apr, 2017 3 commits
    • KVM: PPC: VFIO: Add in-kernel acceleration for VFIO · 121f80ba
      Alexey Kardashevskiy committed
      This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
      and H_STUFF_TCE requests targeting an IOMMU TCE table used for VFIO
      without passing them to user space, which saves the time spent
      switching to user space and back.
      
      This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
      KVM tries to handle a TCE request in real mode; if that fails,
      it passes the request to virtual mode to complete the operation.
      If the virtual mode handler also fails, the request is passed to
      user space; this is not expected to happen though.
      
      To avoid dealing with page use counters (which is tricky in real mode),
      this only accelerates SPAPR TCE IOMMU v2 clients which are required
      to pre-register the userspace memory. The very first TCE request will
      be handled in the VFIO SPAPR TCE driver anyway as the userspace view
      of the TCE table (iommu_table::it_userspace) is not allocated till
      the very first mapping happens and we cannot call vmalloc in real mode.
      
      If we fail to update a hardware IOMMU table for an unexpected reason, we just
      clear it and move on as there is nothing really we can do about it -
      for example, if we hot plug a VFIO device to a guest, existing TCE tables
      will be mirrored automatically to the hardware and there is no interface
      to report to the guest about possible failures.
      
      This adds a new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
      the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
      and associates a physical IOMMU table with the SPAPR TCE table (which
      is a guest view of the hardware IOMMU table). The iommu_table object
      is cached and referenced so we do not have to look it up in real mode.
      
      This does not implement the UNSET counterpart as there is no use for it -
      once the acceleration is enabled, the existing userspace won't
      disable it unless a VFIO container is destroyed; this adds necessary
      cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
      
      This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
      space.
      
      This adds a real mode version of WARN_ON_ONCE() as the generic version
      causes problems with rcu_sched. Since we are testing what vmalloc_to_phys()
      returns in the code, this also adds a check for the already existing
      vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
      
      This finally makes use of vfio_external_user_iommu_id() which was
      introduced quite some time ago and was considered for removal.
      
      Tests show that this patch increases transmission speed from 220MB/s
      to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
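The three-level fallback for TCE requests (real mode, then virtual mode, then user space) can be sketched as a simple dispatch chain. Everything named here is hypothetical: the handler type, the stage labels and the stub handlers exist only to make the control flow concrete.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for where a request ended up being handled. */
enum tce_stage { TCE_REAL_MODE, TCE_VIRTUAL_MODE, TCE_USER_SPACE };

/* Hypothetical handler type: returns true if the request was completed. */
typedef bool (*tce_handler_fn)(unsigned long ioba, unsigned long tce);

/* Try real mode first, then virtual mode, and only then fall back to user space. */
static enum tce_stage handle_tce(tce_handler_fn real_mode,
				 tce_handler_fn virt_mode,
				 unsigned long ioba, unsigned long tce)
{
	assert(real_mode != NULL && virt_mode != NULL);

	if (real_mode(ioba, tce))
		return TCE_REAL_MODE;
	if (virt_mode(ioba, tce))
		return TCE_VIRTUAL_MODE;
	return TCE_USER_SPACE;
}

/* Stub handlers used only to exercise the chain. */
static bool stub_ok(unsigned long ioba, unsigned long tce)
{
	(void)ioba;
	(void)tce;
	return true;
}

static bool stub_fail(unsigned long ioba, unsigned long tce)
{
	(void)ioba;
	(void)tce;
	return false;
}
```

The performance gain reported in the message comes from the common case finishing at the first stage, without a switch to user space.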
    • KVM: PPC: Book3S PR: Preserve storage control bits · 96df2267
      Alexey Kardashevskiy committed
      The PR KVM page fault handler performs eaddr to pte translation for a guest,
      however kvmppc_mmu_book3s_64_xlate() does not preserve WIMG bits
      (storage control) in the kvmppc_pte struct. If PR KVM is running as
      a second level guest under HV KVM, and PR KVM tries inserting an HPT entry,
      this fails in HV KVM if it already has this mapping.
      
      This preserves WIMG bits between kvmppc_mmu_book3s_64_xlate() and
      kvmppc_mmu_map_page().
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S: Add MMIO emulation for FP and VSX instructions · 6f63e81b
      Bin Lu committed
      This patch provides MMIO load/store emulation for instructions
      operating on the following types: double, vector unsigned char,
      vector signed char, vector unsigned short, vector signed short,
      vector unsigned int, vector signed int and vector double.
      
      The instructions that this adds emulation for are:
      
      - ldx, ldux, lwax,
      - lfs, lfsx, lfsu, lfsux, lfd, lfdx, lfdu, lfdux,
      - stfs, stfsx, stfsu, stfsux, stfd, stfdx, stfdu, stfdux, stfiwx,
      - lxsdx, lxsspx, lxsiwax, lxsiwzx, lxvd2x, lxvw4x, lxvdsx,
      - stxsdx, stxsspx, stxsiwx, stxvd2x, stxvw4x
      
      [paulus@ozlabs.org - some cleanups, fixes and rework, make it
       compile for Book E, fix build when PR KVM is built in]
      Signed-off-by: Bin Lu <lblulb@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  9. 07 Apr, 2017 1 commit
  10. 31 Jan, 2017 7 commits
    • KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing implementation · 5e985969
      David Gibson committed
      This adds a not yet working outline of the HPT resizing PAPR
      extension.  Specifically it adds the necessary ioctl() functions,
      their basic steps, the work function which will handle preparation for
      the resize, and synchronization between these, the guest page fault
      path and guest HPT update path.
      
      The actual guts of the implementation aren't here yet, so for now the
      calls will always fail.
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Don't store values derivable from HPT order · 3d089f84
      David Gibson committed
      Currently the kvm_hpt_info structure stores the hashed page table's order,
      and also the number of HPTEs it contains and a mask for its size.  The
      last two can be easily derived from the order, so remove them and just
      calculate them as necessary with a couple of helper inlines.
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Thomas Huth <thuth@redhat.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
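Deriving the HPTE count and the hash mask from the order can be sketched with two small helpers. The struct and function names below are hypothetical, and the sketch assumes the usual hashed page table layout of 16-byte HPTEs grouped eight to a 128-byte HPTE group, with the order being log2 of the table size in bytes.

```c
#include <assert.h>

/* Hypothetical stand-in for the HPT info structure: only the order is stored. */
struct hpt_info_sketch {
	unsigned int order; /* log2 of the HPT size in bytes */
};

/* Number of HPTEs: each HPTE is assumed to be 2^4 = 16 bytes. */
static unsigned long hpt_npte(const struct hpt_info_sketch *hpt)
{
	assert(hpt->order >= 7 && hpt->order < 8 * sizeof(unsigned long));
	return 1UL << (hpt->order - 4);
}

/* Mask for the HPTE group index: each group is assumed to be 2^7 = 128 bytes. */
static unsigned long hpt_mask(const struct hpt_info_sketch *hpt)
{
	assert(hpt->order >= 7 && hpt->order < 8 * sizeof(unsigned long));
	return (1UL << (hpt->order - 7)) - 1;
}
```

For a 16 MiB table (order 24) these give 1048576 HPTEs and a group mask of 131071, so storing either value alongside the order would be redundant.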
    • KVM: PPC: Book3S HV: Gather HPT related variables into sub-structure · 3f9d4f5a
      David Gibson committed
      Currently, the powerpc kvm_arch structure contains a number of variables
      tracking the state of the guest's hashed page table (HPT) in KVM HV.  This
      patch gathers them all together into a single kvm_hpt_info substructure.
      This makes life more convenient for the upcoming HPT resizing
      implementation.
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Invalidate TLB on radix guest vcpu movement · a29ebeaf
      Paul Mackerras committed
      With radix, the guest can do TLB invalidations itself using the tlbie
      (global) and tlbiel (local) TLB invalidation instructions.  Linux guests
      use local TLB invalidations for translations that have only ever been
      accessed on one vcpu.  However, that doesn't mean that the translations
      have only been accessed on one physical cpu (pcpu) since vcpus can move
      around from one pcpu to another.  Thus a tlbiel might leave behind stale
      TLB entries on a pcpu where the vcpu previously ran, and if that task
      then moves back to that previous pcpu, it could see those stale TLB
      entries and thus access memory incorrectly.  The usual symptom of this
      is random segfaults in userspace programs in the guest.
      
      To cope with this, we detect when a vcpu is about to start executing on
      a thread in a core that is a different core from the last time it
      executed.  If that is the case, then we mark the core as needing a
      TLB flush and then send an interrupt to any thread in the core that is
      currently running a vcpu from the same guest.  This will get those vcpus
      out of the guest, and the first one to re-enter the guest will do the
      TLB flush.  The reason for interrupting the vcpus executing on the old
      core is to cope with the following scenario:
      
        CPU 0                   CPU 1                   CPU 4
        (core 0)                (core 0)                (core 1)

        VCPU 0 runs task X      VCPU 1 runs
        core 0 TLB gets
        entries from task X
        VCPU 0 moves to CPU 4
                                                        VCPU 0 runs task X
                                                        Unmap pages of task X
                                                        tlbiel

                                (still VCPU 1)          task X moves to VCPU 1
                                task X runs
                                task X sees stale TLB
                                entries
      
      That is, as soon as the VCPU starts executing on the new core, it
      could unmap and tlbiel some page table entries, and then the task
      could migrate to one of the VCPUs running on the old core and
      potentially see stale TLB entries.
      
      Since the TLB is shared between all the threads in a core, we only
      use the bit of kvm->arch.need_tlb_flush corresponding to the first
      thread in the core.  To ensure that we don't have a window where we
      can miss a flush, this moves the clearing of the bit from before the
      actual flush to after it.  This way, two threads might both do the
      flush, but we prevent the situation where one thread can enter the
      guest before the flush is finished.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
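The core-change detection and the flush-then-clear ordering can be modelled in a few lines of C. This is a sketch under several assumptions: all names are hypothetical, a core is taken to have 4 threads (matching the diagram, where CPU 4 is on core 1), "the core" that gets marked is read as the core the vcpu last ran on, and the interrupt sent to running vcpus is left out.

```c
#include <assert.h>
#include <stdbool.h>

#define THREADS_PER_CORE 4 /* assumption matching the diagram */
#define MAX_CPUS 64        /* arbitrary bound for the sketch */

/* One flag per CPU, but only the first thread's slot of each core is used. */
struct vm_sketch { bool need_tlb_flush[MAX_CPUS]; };
struct vcpu_sketch { int prev_cpu; /* -1 if the vcpu has never run */ };

static int first_thread_of_core(int cpu)
{
	return cpu - (cpu % THREADS_PER_CORE);
}

/* Called before a vcpu runs on pcpu; returns true if a flush was requested. */
static bool prepare_radix_vcpu(struct vm_sketch *vm, struct vcpu_sketch *vcpu,
			       int pcpu)
{
	bool marked = false;

	assert(pcpu >= 0 && pcpu < MAX_CPUS);
	if (vcpu->prev_cpu >= 0 &&
	    first_thread_of_core(vcpu->prev_cpu) != first_thread_of_core(pcpu)) {
		vm->need_tlb_flush[first_thread_of_core(vcpu->prev_cpu)] = true;
		marked = true;
	}
	vcpu->prev_cpu = pcpu;
	return marked;
}

/* Guest entry on pcpu: flush first, clear the flag only afterwards. */
static bool enter_guest_flush(struct vm_sketch *vm, int pcpu,
			      unsigned int *flush_count)
{
	int first = first_thread_of_core(pcpu);

	if (!vm->need_tlb_flush[first])
		return false;
	(*flush_count)++;                  /* stands in for the actual TLB flush */
	vm->need_tlb_flush[first] = false; /* cleared after the flush completes */
	return true;
}
```

Clearing the flag after the flush means two threads may both flush, but no thread can observe a cleared flag while the flush is still in progress.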
    • KVM: PPC: Book3S HV: Modify guest entry/exit paths to handle radix guests · f4c51f84
      Paul Mackerras committed
      This adds code to branch around the parts that radix guests don't
      need - clearing and loading the SLB with the guest SLB contents,
      saving the guest SLB contents on exit, and restoring the host SLB
      contents.
      
      Since the host is now using radix, we need to save and restore the
      host value for the PID register.
      
      On hypervisor data/instruction storage interrupts, we don't do the
      guest HPT lookup on radix, but just save the guest physical address
      for the fault (from the ASDR register) in the vcpu struct.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Add basic infrastructure for radix guests · 9e04ba69
      Paul Mackerras committed
      This adds a field in struct kvm_arch and an inline helper to
      indicate whether a guest is a radix guest or not, plus a new file
      to contain the radix MMU code, which currently contains just a
      translate function which knows how to traverse the guest page
      tables to translate an address.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Set process table for HPT guests on POWER9 · 468808bd
      Paul Mackerras committed
      This adds the implementation of the KVM_PPC_CONFIGURE_V3_MMU ioctl
      for HPT guests on POWER9.  With this, we can return 1 for the
      KVM_CAP_PPC_MMU_HASH_V3 capability.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  11. 28 Nov, 2016 1 commit
    • KVM: PPC: Decrease the powerpc default halt poll max value · f4944613
      Suraj Jitindar Singh committed
      KVM_HALT_POLL_NS_DEFAULT is an arch specific constant which sets the
      default value of the halt_poll_ns kvm module parameter which determines
      the global maximum halt polling interval.
      
      The current value for powerpc is 500000 (500us) which means that any
      repetitive workload with a period of less than that can drive the cpu
      usage to 100% where it may have been mostly idle without halt polling.
      This presents the possibility of a large increase in power usage with
      a comparatively small performance benefit.
      
      Reduce the default to 10000 (10us) and a user can tune this themselves
      to set their affinity for halt polling based on the trade off between power
      and performance which they are willing to make.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  12. 24 Nov, 2016 2 commits
    • KVM: PPC: Book3S HV: Adapt TLB invalidations to work on POWER9 · 7c5b06ca
      Paul Mackerras committed
      POWER9 adds new capabilities to the tlbie (TLB invalidate entry)
      and tlbiel (local tlbie) instructions.  Both instructions get a
      set of new parameters (RIC, PRS and R) which appear as bits in the
      instruction word.  The tlbiel instruction now has a second register
      operand, which contains a PID and/or LPID value if needed, and
      should otherwise contain 0.
      
      This adapts KVM-HV's usage of tlbie and tlbiel to work on POWER9
      as well as older processors.  Since we only handle HPT guests so
      far, we need RIC=0 PRS=0 R=0, which ends up with the same instruction
      word as on previous processors, so we don't need to conditionally
      execute different instructions depending on the processor.
      
      The local flush on first entry to a guest in book3s_hv_rmhandlers.S
      is a loop which depends on the number of TLB sets.  Rather than
      using feature sections to set the number of iterations based on
      which CPU we're on, we now work out this number at VM creation time
      and store it in the kvm_arch struct.  That will make it possible to
      get the number from the device tree in future, which will help with
      compatibility with future processors.
      
      Since mmu_partition_table_set_entry() does a global flush of the
      whole LPID, we don't need to do the TLB flush on first entry to the
      guest on each processor.  Therefore we don't set all bits in the
      tlb_need_flush bitmap on VM startup on POWER9.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Add new POWER9 guest-accessible SPRs · e9cf1e08
      Paul Mackerras committed
      This adds code to handle two new guest-accessible special-purpose
      registers on POWER9: TIDR (thread ID register) and PSSCR (processor
      stop status and control register).  They are context-switched
      between host and guest, and the guest values can be read and set
      via the one_reg interface.
      
      The PSSCR contains some fields which are guest-accessible and some
      which are only accessible in hypervisor mode.  We only allow the
      guest-accessible fields to be read or set by userspace.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  13. 21 Nov, 2016 2 commits
    • KVM: PPC: Book3S HV: Save/restore XER in checkpointed register state · 0d808df0
      Paul Mackerras committed
      When switching from/to a guest that has a transaction in progress,
      we need to save/restore the checkpointed register state.  Although
      XER is part of the CPU state that gets checkpointed, the code that
      does this saving and restoring doesn't save/restore XER.
      
      This fixes it by saving and restoring the XER.  To allow userspace
      to read/write the checkpointed XER value, we also add a new ONE_REG
      specifier.
      
      The visible effect of this bug is that the guest may see its XER
      value being corrupted when it uses transactions.
      
      Fixes: e4e38121 ("KVM: PPC: Book3S HV: Add transactional memory support")
      Fixes: 0a8eccef ("KVM: PPC: Book3S HV: Add missing code for transaction reclaim on guest exit")
      Cc: stable@vger.kernel.org # v3.15+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Thomas Huth <thuth@redhat.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Add a per vcpu cache for recently page faulted MMIO entries · a56ee9f8
      Yongji Xie committed
      This keeps a per vcpu cache for recently page faulted MMIO entries.
      On a page fault, if the entry exists in the cache, we can avoid some
      time-consuming paths (for example, looking up the HPT, locking the HPTE
      twice and searching for the mmio gfn in the memslots) and directly call
      kvmppc_hv_emulate_mmio().
      
      In the current implementation, we limit the size of the cache to four
      entries. We think this is enough to cover the high-frequency MMIO HPTEs
      in most cases. For example, consider virtio devices. For virtio legacy
      devices, one HPTE can handle notifications from up to
      1024 (64K page / 64 byte Port IO register) devices, so one cache entry
      is enough. For virtio modern devices, we always need one HPTE to handle
      notifications for each device, because a modern device uses an 8M MMIO
      register to notify the host instead of a Port IO register. Typically the
      system's configuration should not exceed four virtio devices per
      vcpu, so four cache entries are also enough in this case. Of course, if
      needed, we could also turn the macro into a module parameter in the future.
      Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
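A four-entry per-vcpu cache of this kind can be sketched as a small fixed array. The names, the fields and the round-robin replacement policy below are assumptions for illustration; the commit message only states that the cache is per vcpu and holds four entries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MMIO_CACHE_SIZE 4 /* the size stated in the commit message */

/* Hypothetical entry: a lookup key plus the cached MMIO guest physical address. */
struct mmio_cache_entry {
	uint64_t key;
	uint64_t gpa;
	bool valid;
};

struct mmio_cache {
	struct mmio_cache_entry entry[MMIO_CACHE_SIZE];
	unsigned int next; /* slot to overwrite next (round-robin, an assumption) */
};

/* Return the cached entry for key, or NULL on a miss (take the slow path). */
static const struct mmio_cache_entry *
mmio_cache_lookup(const struct mmio_cache *cache, uint64_t key)
{
	assert(cache != NULL);
	for (size_t i = 0; i < MMIO_CACHE_SIZE; i++) {
		if (cache->entry[i].valid && cache->entry[i].key == key)
			return &cache->entry[i];
	}
	return NULL;
}

/* Record a freshly resolved MMIO fault, evicting the oldest slot in turn. */
static void mmio_cache_insert(struct mmio_cache *cache, uint64_t key, uint64_t gpa)
{
	struct mmio_cache_entry *e = &cache->entry[cache->next];

	e->key = key;
	e->gpa = gpa;
	e->valid = true;
	cache->next = (cache->next + 1) % MMIO_CACHE_SIZE;
}
```

Because the cache belongs to a single vcpu, the sketch needs no locking; a linear scan over four entries is cheaper than the HPT lookup it replaces.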
  14. 27 Sep, 2016 1 commit
    • KVM: PPC: Book3S: Treat VTB as a per-subcore register, not per-thread · 88b02cf9
      Paul Mackerras committed
      POWER8 has one virtual timebase (VTB) register per subcore, not one
      per CPU thread.  The HV KVM code currently treats VTB as a per-thread
      register, which can lead to spurious soft lockup messages from guests
      which use the VTB as the time source for the soft lockup detector.
      (CPUs before POWER8 did not have the VTB register.)
      
      For HV KVM, this fixes the problem by making only the primary thread
      in each virtual core save and restore the VTB value.  With this,
      the VTB state becomes part of the kvmppc_vcore structure.  This
      also means that "piggybacking" of multiple virtual cores onto one
      subcore is not possible on POWER8, because then the virtual cores
      would share a single VTB register.
      
      PR KVM emulates a VTB register, which is per-vcpu because PR KVM
      has no notion of CPU threads or SMT.  For PR KVM we move the VTB
      state into the kvmppc_vcpu_book3s struct.
      
      Cc: stable@vger.kernel.org # v3.14+
      Reported-by: Thomas Huth <thuth@redhat.com>
      Tested-by: Thomas Huth <thuth@redhat.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  15. 12 Sep, 2016 1 commit
    • KVM: PPC: Book3S HV: Counters for passthrough IRQ stats · 65e7026a
      Suresh Warrier committed
      Add VCPU stat counters to track affinity for passthrough
      interrupts.
      
      pthru_all: Counts all passthrough interrupts whose IRQ mappings are
                 in the kvmppc_passthru_irq_map structure.
      pthru_host: Counts all cached passthrough interrupts that were injected
                  from the host through kvm_set_irq (i.e. not handled in
                  real mode).
      pthru_bad_aff: Counts how many cached passthrough interrupts have
                     bad affinity (receiving CPU is not running VCPU that is
                     the target of the virtual interrupt in the guest).
      Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  16. 09 Sep, 2016 1 commit
    • KVM: PPC: Book3S HV: Introduce kvmppc_passthru_irqmap · 8daaafc8
      Suresh Warrier committed
      This patch introduces an IRQ mapping structure, the
      kvmppc_passthru_irqmap structure that is to be used
      to map the real hardware IRQ in the host with the virtual
      hardware IRQ (gsi) that is injected into a guest by KVM for
      passthrough adapters.
      
      Currently, we assume a separate IRQ mapping structure for
      each guest. Each kvmppc_passthru_irqmap has a mapping array
      containing all defined real<->virtual IRQs.
      
      [paulus@ozlabs.org - removed irq_chip field from struct
       kvmppc_passthru_irqmap; changed parameter for
       kvmppc_get_passthru_irqmap from struct kvm_vcpu * to struct
       kvm *, removed small cached array.]
      Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  17. 08 Sep, 2016 5 commits
    • KVM: PPC: Implement existing and add new halt polling vcpu stats · 2a27f514
      Suraj Jitindar Singh committed
      vcpu stats are used to collect information about a vcpu which can be viewed
      in the debugfs. For example halt_attempted_poll and halt_successful_poll
      are used to keep track of the number of times the vcpu attempts to and
      successfully polls. These stats are currently not used on powerpc.
      
      Implement incrementation of the halt_attempted_poll and
      halt_successful_poll vcpu stats for powerpc. Since these stats are summed
      over all the vcpus for all running guests it doesn't matter which vcpu
      they are attributed to, thus we choose the current runner vcpu of the
      vcore.
      
      Also add new vcpu stats: halt_poll_success_ns, halt_poll_fail_ns and
      halt_wait_ns to be used to accumulate the total time spent polling
      successfully, polling unsuccessfully and waiting respectively, and
      halt_successful_wait to accumulate the number of times the vcpu waits.
      Given that halt_poll_success_ns, halt_poll_fail_ns and halt_wait_ns are
      expressed in nanoseconds it is necessary to represent these as 64-bit
      quantities, otherwise they would overflow after only about 4 seconds.
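
The "about 4 seconds" figure follows from 2^32 ns being roughly 4.29 s. The short sketch below checks that arithmetic and shows a 32-bit accumulator wrapping where a 64-bit one does not. The function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_SEC 1000000000.0

/* Seconds of accumulated time a counter of `bits` width can hold
 * before wrapping, when it counts nanoseconds. */
double ns_counter_wrap_seconds(unsigned bits)
{
    assert(bits > 0 && bits < 64);
    return (double)(1ULL << bits) / NS_PER_SEC;
}

/* Accumulate a sample into a 32-bit total (wraps modulo 2^32). */
uint32_t accumulate_ns_u32(uint32_t total, uint32_t delta_ns)
{
    return total + delta_ns;
}

/* Accumulate a sample into a 64-bit total. */
uint64_t accumulate_ns_u64(uint64_t total, uint64_t delta_ns)
{
    return total + delta_ns;
}
```

Adding five one-second samples to the 32-bit total already wraps it, while the 64-bit total holds the full 5,000,000,000 ns.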
      
      Given that the total time spent either polling or waiting will be known and
      the number of times that each was done, it will be possible to determine
      the average poll and wait times. This will give the ability to tune the kvm
      module parameters based on the calculated average wait and poll times.
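
A minimal sketch of the averaging described above. The stat names come from the commit message, but grouping them in one struct is only for illustration, and computing the number of failed polls as attempted minus successful is an assumption of this sketch rather than something the commit states.

```c
#include <assert.h>
#include <stdint.h>

/* Stat names from the commit message; the struct itself is illustrative. */
struct halt_stats {
    uint64_t halt_attempted_poll;
    uint64_t halt_successful_poll;
    uint64_t halt_successful_wait;
    uint64_t halt_poll_success_ns;
    uint64_t halt_poll_fail_ns;
    uint64_t halt_wait_ns;
};

/* Integer average that returns 0 when nothing has been counted yet. */
uint64_t avg_ns(uint64_t total_ns, uint64_t count)
{
    return count ? total_ns / count : 0;
}

uint64_t avg_successful_poll_ns(const struct halt_stats *s)
{
    return avg_ns(s->halt_poll_success_ns, s->halt_successful_poll);
}

/* Assumption: failed polls = attempted polls - successful polls. */
uint64_t avg_failed_poll_ns(const struct halt_stats *s)
{
    assert(s->halt_attempted_poll >= s->halt_successful_poll);
    return avg_ns(s->halt_poll_fail_ns,
                  s->halt_attempted_poll - s->halt_successful_poll);
}

uint64_t avg_wait_ns(const struct halt_stats *s)
{
    return avg_ns(s->halt_wait_ns, s->halt_successful_wait);
}
```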
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      2a27f514
    • S
      KVM: Add provisioning for ulong vm stats and u64 vcpu stats · 8a7e75d4
      Suraj Jitindar Singh committed
      vms and vcpus have statistics associated with them which can be viewed
      within the debugfs. Currently it is assumed within the vcpu_stat_get() and
      vm_stat_get() functions that all of these statistics are represented as
      u32s, however the next patch adds some u64 vcpu statistics.
      
      Change all vcpu statistics to u64 and modify vcpu_stat_get() accordingly.
      Since vcpu statistics are per vcpu, they will only be updated by a single
      vcpu at a time so this shouldn't present a problem on 32-bit machines
      which can't atomically increment 64-bit numbers. However vm statistics
      could potentially be updated by multiple vcpus from that vm at a time.
      To avoid the overhead of atomics make all vm statistics ulong such that
      they are 64-bit on 64-bit systems where they can be atomically incremented
      and are 32-bit on 32-bit systems which may not be able to atomically
      increment 64-bit numbers. Modify vm_stat_get() to expect ulongs.
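
The width choice above can be sketched in user space as follows. The struct layouts and helper name are hypothetical: vcpu counters are u64 and summed across vcpus by byte offset, while the vm counter is an unsigned long so that it matches the native word size on common Linux ABIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative per-vcpu counters, all 64-bit. */
struct vcpu_stats {
    uint64_t halt_attempted_poll;
    uint64_t halt_successful_poll;
};

/* Illustrative per-vm counter, native word size. */
struct vm_stats {
    unsigned long remote_tlb_flush;
};

/* Sum the u64 counter found at `offset` in every vcpu's stats block. */
uint64_t sum_vcpu_stat(const struct vcpu_stats *vcpus, size_t n,
                       size_t offset)
{
    uint64_t total = 0;

    assert(offset + sizeof(uint64_t) <= sizeof(struct vcpu_stats));
    for (size_t i = 0; i < n; i++) {
        uint64_t value;

        /* memcpy avoids any alignment or aliasing concerns. */
        memcpy(&value, (const unsigned char *)&vcpus[i] + offset,
               sizeof(value));
        total += value;
    }
    return total;
}
```

Addressing a counter by offset lets one helper serve every stat, for example `sum_vcpu_stat(v, n, offsetof(struct vcpu_stats, halt_successful_poll))`.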
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: David Matlack <dmatlack@google.com>
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      8a7e75d4
    • S
      KVM: PPC: Book3S HV: Implement halt polling · 0cda69dd
      Suraj Jitindar Singh committed
      This patch introduces new halt polling functionality into the kvm_hv kernel
      module. When a vcore is idle it will poll for some period of time before
      scheduling itself out.
      
      When all of the runnable vcpus on a vcore have ceded (and thus the vcore is
      idle) we schedule ourselves out to allow something else to run. In the
      event that we need to wake up very quickly (for example an interrupt
      arrives), we are required to wait until we get scheduled again.
      
      Implement halt polling so that when a vcore is idle, and before scheduling
      ourselves, we poll for vcpus in the runnable_threads list which have
      pending exceptions or which leave the ceded state. If we poll successfully
      then we can get back into the guest very quickly without ever scheduling
      ourselves, otherwise we schedule ourselves out as before.
      
      There exists generic halt_polling code in virt/kvm/kvm_main.c, however on
      powerpc the polling conditions are different to the generic case. It would
      be nice if we could just implement an arch-specific kvm_check_block()
      function, but there are still other arch-specific things which need to be
      done for kvm_hv (for example manipulating vcore states) which means that a
      separate implementation is the best option.
      
      Testing of this patch with a TCP round robin test between two guests with
      virtio network interfaces has found a decrease in round trip time of ~15us
      on average. A performance gain is only seen when going out of and
      back into the guest often and quickly, otherwise there is no net benefit
      from the polling. The polling interval is adjusted such that when we are
      often scheduled out for long periods of time it is reduced, and when we
      often poll successfully it is increased. The rate at which the polling
      interval increases or decreases, and the maximum polling interval, can
      be set through module parameters.
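
One possible shape for the interval adjustment described above is sketched below. The commit message only says that the interval grows after successful polls, shrinks after long schedule-outs, and is bounded by a maximum, all tunable through module parameters. The multiply/divide policy, the starting value, and every name here are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tunables standing in for the module parameters. */
struct poll_tuning {
    uint64_t max_ns;    /* upper bound on the polling interval */
    uint64_t start_ns;  /* interval to use when growing from zero */
    unsigned grow;      /* growth factor; 0 disables growing */
    unsigned shrink;    /* shrink divisor; 0 resets the interval to 0 */
};

/* Called after a successful poll: lengthen the interval, up to max_ns. */
uint64_t grow_poll_ns(uint64_t cur, const struct poll_tuning *t)
{
    uint64_t next;

    assert(t != NULL);
    if (t->grow == 0)
        return cur;
    next = cur ? cur * t->grow : t->start_ns;
    return next > t->max_ns ? t->max_ns : next;
}

/* Called after a long schedule-out: shorten the interval. */
uint64_t shrink_poll_ns(uint64_t cur, const struct poll_tuning *t)
{
    assert(t != NULL);
    return t->shrink ? cur / t->shrink : 0;
}
```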
      
      Based on the implementation in the generic kvm module by Wanpeng Li and
      Paolo Bonzini, and on direction from Paul Mackerras.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      0cda69dd
    • S
      KVM: PPC: Book3S HV: Change vcore element runnable_threads from linked-list to array · 7b5f8272
      Suraj Jitindar Singh committed
      The struct kvmppc_vcore is a structure used to store various information
      about a virtual core for a kvm guest. The runnable_threads element of the
      struct provides a list of all of the currently runnable vcpus on the core
      (those in the KVMPPC_VCPU_RUNNABLE state). The previous implementation of
      this list was a linked list. The next patch requires that the list can be
      iterated over without holding the vcore lock.
      
      Reimplement the runnable_threads list in the kvmppc_vcore struct as an
      array. Implement a function to iterate over valid entries in the array and
      update access sites accordingly.
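
A minimal sketch of iterating over the valid entries of such an array, assuming that an empty slot is a NULL pointer. The type names, the macro, and the array size are hypothetical stand-ins. This single-threaded version also ignores the memory-ordering care that lock-free readers in the kernel would need.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 8   /* stand-in for MAX_SMT_THREADS */

struct vcpu {
    int id;
};

struct vcore {
    /* NULL marks a slot with no runnable vcpu (assumption). */
    struct vcpu *runnable_threads[MAX_THREADS];
};

/* Return the next non-NULL entry after index *ip and update *ip,
 * or return NULL when the array is exhausted. */
struct vcpu *next_runnable(struct vcore *vc, int *ip)
{
    int i;

    assert(vc != NULL && ip != NULL);
    i = *ip;
    while (++i < MAX_THREADS) {
        struct vcpu *v = vc->runnable_threads[i];

        if (v) {
            *ip = i;
            return v;
        }
    }
    return NULL;
}

/* Visit every occupied slot; `i` starts at -1 so slot 0 is included. */
#define for_each_runnable(i, v, vc) \
    for ((i) = -1; ((v) = next_runnable((vc), &(i))) != NULL; )

/* Example access site: count the runnable vcpus on a vcore. */
int count_runnable(struct vcore *vc)
{
    int i, n = 0;
    struct vcpu *v;

    for_each_runnable(i, v, vc)
        n++;
    return n;
}
```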
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      7b5f8272
    • S
      KVM: PPC: Book3S HV: Move struct kvmppc_vcore from kvm_host.h to kvm_book3s.h · e64fb7e2
      Suraj Jitindar Singh committed
      The next commit will introduce a member to the kvmppc_vcore struct which
      references MAX_SMT_THREADS which is defined in kvm_book3s_asm.h, however
      this file isn't included in kvm_host.h directly. Thus compiling for
      certain platforms such as pmac32_defconfig and ppc64e_defconfig with KVM
      fails due to MAX_SMT_THREADS not being defined.
      
      Move the struct kvmppc_vcore definition to kvm_book3s.h which explicitly
      includes kvm_book3s_asm.h.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      e64fb7e2
  18. 19 Aug, 2016 1 commit
  19. 13 May, 2016 1 commit
    • C
      KVM: halt_polling: provide a way to qualify wakeups during poll · 3491caf2
      Christian Borntraeger committed
      Some wakeups should not be considered a successful poll. For example, on
      s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
      would be considered runnable, letting all vCPUs poll all the time for
      transaction-like workloads, even if one vCPU would be enough.
      This can result in huge CPU usage for large guests.
      This patch lets architectures qualify wakeups as good or bad with
      regard to polling.
      
      For s390 the implementation will fence off halt polling for anything but
      known good, single vCPU events. The s390 implementation for floating
      interrupts does a wakeup for one vCPU, but the interrupt will be delivered
      by whatever CPU checks first for a pending interrupt. We prefer the
      woken up CPU by marking the poll of this CPU as a "good" poll.
      This code will also mark several other wakeup reasons like IPI or
      expired timers as "good". This will of course also mark some events as
      not successful. As KVM on z always runs as a 2nd level hypervisor,
      we prefer not to poll unless we are really sure, though.
      
      This patch successfully limits the CPU usage for cases like uperf 1byte
      transactional ping pong workload or wakeup heavy workload like OLTP
      while still providing a proper speedup.
      
      This also introduces a new vcpu stat "halt_poll_no_tuning" that marks
      wakeups that are considered not good for polling.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
      Cc: David Matlack <dmatlack@google.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      [Rename config symbol. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3491caf2
  20. 12 May, 2016 1 commit
    • G
      kvm: introduce KVM_MAX_VCPU_ID · 0b1b1dfd
      Greg Kurz committed
      The KVM_MAX_VCPUS define provides the maximum number of vCPUs per guest, and
      also the upper limit for vCPU ids. This is okay for all archs except PowerPC
      which can have higher ids, depending on the cpu/core/thread topology. In the
      worst case (single-threaded guest, host with 8 threads per core), it limits
      the maximum number of vCPUs to KVM_MAX_VCPUS / 8.
      
      This patch separates the vCPU numbering from the total number of vCPUs, with
      the introduction of KVM_MAX_VCPU_ID, as the maximal valid value for vCPU ids
      plus one.
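
To make the KVM_MAX_VCPUS / 8 worst case concrete, the sketch below assumes that each guest core occupies a block of ids as wide as the host's threads per core. That layout is an assumption of this example (the commit only says ids depend on the cpu/core/thread topology), and the function names are hypothetical.

```c
#include <assert.h>

/* vCPU id of guest thread `thread` in guest core `core`, assuming each
 * guest core owns a block of `host_tpc` consecutive ids. */
unsigned vcpu_id(unsigned core, unsigned thread, unsigned host_tpc)
{
    assert(host_tpc > 0 && thread < host_tpc);
    return core * host_tpc + thread;
}

/* Number of vCPUs that fit when every id must be below `id_limit`.
 * Only whole guest cores are counted. */
unsigned max_vcpus_under_id_limit(unsigned id_limit, unsigned host_tpc,
                                  unsigned guest_tpc)
{
    assert(host_tpc > 0 && guest_tpc <= host_tpc);
    return (id_limit / host_tpc) * guest_tpc;
}
```

With an id limit of 1024, 8 host threads per core, and a single-threaded guest, only 128 vCPUs fit, which is the limit divided by 8. Raising the id limit independently of the vCPU count removes that penalty.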
      
      The corresponding KVM_CAP_MAX_VCPU_ID allows userspace to validate vCPU ids
      before passing them to KVM_CREATE_VCPU.
      
      This patch only implements KVM_MAX_VCPU_ID with a specific value for PowerPC.
      Other archs continue to return KVM_MAX_VCPUS instead.
      Suggested-by: Radim Krcmar <rkrcmar@redhat.com>
      Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
      Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0b1b1dfd
  21. 02 Mar, 2016 2 commits
  22. 25 Feb, 2016 1 commit
    • M
      KVM: Use simple waitqueue for vcpu->wq · 8577370f
      Marcelo Tosatti committed
      The problem:
      
      On -rt, an emulated LAPIC timer instance has the following path:
      
      1) hard interrupt
      2) ksoftirqd is scheduled
      3) ksoftirqd wakes up vcpu thread
      4) vcpu thread is scheduled
      
      This extra context switch introduces unnecessary latency in the
      LAPIC path for a KVM guest.
      
      The solution:
      
      Allow waking up vcpu thread from hardirq context,
      thus avoiding the need for ksoftirqd to be scheduled.
      
      Normal waitqueues make use of spinlocks, which on -RT
      are sleepable locks. Therefore, waking up a waitqueue
      waiter involves locking a sleeping lock, which
      is not allowed from hard interrupt context.
      
      cyclictest command line:
      
      This patch reduces the average latency in my tests from 14us to 11us.
      
      Daniel writes:
      Paolo asked for numbers from kvm-unit-tests/tscdeadline_latency
      benchmark on mainline. The test was run 1000 times on
      tip/sched/core 4.4.0-rc8-01134-g0905f04e:
      
        ./x86-run x86/tscdeadline_latency.flat -cpu host
      
      with idle=poll.
      
      The test seems not to deliver really stable numbers though most of
      them are smaller. Paolo writes:
      
      "Anything above ~10000 cycles means that the host went to C1 or
      lower---the number means more or less nothing in that case.
      
      The mean shows an improvement indeed."
      
      Before:
      
                     min             max         mean           std
      count  1000.000000     1000.000000  1000.000000   1000.000000
      mean   5162.596000  2019270.084000  5824.491541  20681.645558
      std      75.431231   622607.723969    89.575700   6492.272062
      min    4466.000000    23928.000000  5537.926500    585.864966
      25%    5163.000000  16132529.750000  5790.132275  16683.745433
      50%    5175.000000  2281919.000000  5834.654000  23151.990026
      75%    5190.000000  2382865.750000  5861.412950  24148.206168
      max    5228.000000  4175158.000000  6254.827300  46481.048691
      
      After
                     min            max         mean           std
      count  1000.000000     1000.00000  1000.000000   1000.000000
      mean   5143.511000  2076886.10300  5813.312474  21207.357565
      std      77.668322   610413.09583    86.541500   6331.915127
      min    4427.000000    25103.00000  5529.756600    559.187707
      25%    5148.000000  1691272.75000  5784.889825  17473.518244
      50%    5160.000000  2308328.50000  5832.025000  23464.837068
      75%    5172.000000  2393037.75000  5853.177675  24223.969976
      max    5222.000000  3922458.00000  6186.720500  42520.379830
      
      [Patch was originally based on the swait implementation found in the -rt
       tree. Daniel ported it to mainline's version and gathered the
       benchmark numbers for tscdeadline_latency test.]
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: linux-rt-users@vger.kernel.org
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1455871601-27484-4-git-send-email-wagi@monom.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      8577370f