1. 19 2月, 2019 1 次提交
    • P
      KVM: PPC: Book3S: Allow XICS emulation to work in nested hosts using XIVE · 03f95332
      Paul Mackerras 提交于
      Currently, the KVM code assumes that if the host kernel is using the
      XIVE interrupt controller (the new interrupt controller that first
      appeared in POWER9 systems), then the in-kernel XICS emulation will
      use the XIVE hardware to deliver interrupts to the guest.  However,
      this only works when the host is running in hypervisor mode and has
      full access to all of the XIVE functionality.  It doesn't work in any
      nested virtualization scenario, either with PR KVM or nested-HV KVM,
      because the XICS-on-XIVE code calls directly into the native-XIVE
      routines, which are not initialized and cannot function correctly
      because they use OPAL calls, and OPAL is not available in a guest.
      
      This means that using the in-kernel XICS emulation in a nested
      hypervisor that is using XIVE as its interrupt controller will cause a
      (nested) host kernel crash.  To fix this, we change most of the places
      where the current code calls xive_enabled() to select between the
      XICS-on-XIVE emulation and the plain XICS emulation to call a new
      function, xics_on_xive(), which returns false in a guest.
      
      However, there is a further twist.  The plain XICS emulation has some
      functions which are used in real mode and access the underlying XICS
      controller (the interrupt controller of the host) directly.  In the
      case of a nested hypervisor, this means doing XICS hypercalls
      directly.  When the nested host is using XIVE as its interrupt
      controller, these hypercalls will fail.  Therefore this also adds
      checks in the places where the XICS emulation wants to access the
      underlying interrupt controller directly, and if that is XIVE, makes
      the code use the virtual mode fallback paths, which call generic
      kernel infrastructure rather than doing direct XICS access.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      03f95332
  2. 17 12月, 2018 2 次提交
  3. 09 10月, 2018 5 次提交
    • P
      KVM: PPC: Book3S HV: Add a VM capability to enable nested virtualization · aa069a99
      Paul Mackerras 提交于
      With this, userspace can enable a KVM-HV guest to run nested guests
      under it.
      
      The administrator can control whether any nested guests can be run;
      setting the "nested" module parameter to false prevents any guests
      becoming nested hypervisors (that is, any attempt to enable the nested
      capability on a guest will fail).  Guests which are already nested
      hypervisors will continue to be so.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      aa069a99
    • P
      KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests · 95a6432c
      Paul Mackerras 提交于
      This creates an alternative guest entry/exit path which is used for
      radix guests on POWER9 systems when we have indep_threads_mode=Y.  In
      these circumstances there is exactly one vcpu per vcore and there is
      no coordination required between vcpus or vcores; the vcpu can enter
      the guest without needing to synchronize with anything else.
      
      The new fast path is implemented almost entirely in C in book3s_hv.c
      and runs with the MMU on until the guest is entered.  On guest exit
      we use the existing path until the point where we are committed to
      exiting the guest (as distinct from handling an interrupt in the
      low-level code and returning to the guest) and we have pulled the
      guest context from the XIVE.  At that point we check a flag in the
      stack frame to see whether we came in via the old path and the new
      path; if we came in via the new path then we go back to C code to do
      the rest of the process of saving the guest context and restoring the
      host context.
      
      The C code is split into separate functions for handling the
      OS-accessible state and the hypervisor state, with the idea that the
      latter can be replaced by a hypercall when we implement nested
      virtualization.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      [mpe: Fix CONFIG_ALTIVEC=n build]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      95a6432c
    • P
      KVM: PPC: Book3S HV: Move interrupt delivery on guest entry to C code · f7035ce9
      Paul Mackerras 提交于
      This is based on a patch by Suraj Jitindar Singh.
      
      This moves the code in book3s_hv_rmhandlers.S that generates an
      external, decrementer or privileged doorbell interrupt just before
      entering the guest to C code in book3s_hv_builtin.c.  This is to
      make future maintenance and modification easier.  The algorithm
      expressed in the C code is almost identical to the previous
      algorithm.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      f7035ce9
    • A
      KVM: PPC: Remove redundand permission bits removal · a3ac077b
      Alexey Kardashevskiy 提交于
      The kvmppc_gpa_to_ua() helper itself takes care of the permission
      bits in the TCE and yet every single caller removes them.
      
      This changes semantics of kvmppc_gpa_to_ua() so it takes TCEs
      (which are GPAs + TCE permission bits) to make the callers simpler.
      
      This should cause no behavioural change.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      a3ac077b
    • A
      KVM: PPC: Validate TCEs against preregistered memory page sizes · 42de7b9e
      Alexey Kardashevskiy 提交于
      The userspace can request an arbitrary supported page size for a DMA
      window and this works fine as long as the mapped memory is backed with
      the pages of the same or bigger size; if this is not the case,
      mm_iommu_ua_to_hpa{_rm}() fail and tables do not populated with
      dangerously incorrect TCEs.
      
      However since it is quite easy to misconfigure the KVM and we do not do
      reverts to all changes made to TCE tables if an error happens in a middle,
      we better do the acceptable page size validation before we even touch
      the tables.
      
      This enhances kvmppc_tce_validate() to check the hardware IOMMU page sizes
      against the preregistered memory page sizes.
      
      Since the new check uses real/virtual mode helpers, this renames
      kvmppc_tce_validate() to kvmppc_rm_tce_validate() to handle the real mode
      case and mirrors it for the virtual mode under the old name. The real
      mode handler is not used for the virtual mode as:
      1. it uses _lockless() list traversing primitives instead of RCU;
      2. realmode's mm_iommu_ua_to_hpa_rm() uses vmalloc_to_phys() which
      virtual mode does not have to use and since on POWER9+radix only virtual
      mode handlers actually work, we do not want to slow down that path even
      a bit.
      
      This removes EXPORT_SYMBOL_GPL(kvmppc_tce_validate) as the validators
      are static now.
      
      From now on the attempts on mapping IOMMU pages bigger than allowed
      will result in KVM exit.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      [mpe: Fix KVM_HV=n build]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      42de7b9e
  4. 22 5月, 2018 3 次提交
  5. 30 3月, 2018 1 次提交
  6. 19 3月, 2018 1 次提交
  7. 09 2月, 2018 1 次提交
  8. 19 1月, 2018 3 次提交
  9. 23 11月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Fix migration and HPT resizing of HPT guests on radix hosts · ded13fc1
      Paul Mackerras 提交于
      This fixes two errors that prevent a guest using the HPT MMU from
      successfully migrating to a POWER9 host in radix MMU mode, or resizing
      its HPT when running on a radix host.
      
      The first bug was that commit 8dc6cca5 ("KVM: PPC: Book3S HV:
      Don't rely on host's page size information", 2017-09-11) missed two
      uses of hpte_base_page_size(), one in the HPT rehashing code and
      one in kvm_htab_write() (which is used on the destination side in
      migrating a HPT guest).  Instead we use kvmppc_hpte_base_page_shift().
      Having the shift count means that we can use left and right shifts
      instead of multiplication and division in a few places.
      
      Along the way, this adds a check in kvm_htab_write() to ensure that the
      page size encoding in the incoming HPTEs is recognized, and if not
      return an EINVAL error to userspace.
      
      The second bug was that kvm_htab_write was performing some but not all
      of the functions of kvmhv_setup_mmu(), resulting in the destination VM
      being left in radix mode as far as the hardware is concerned.  The
      simplest fix for now is make kvm_htab_write() call
      kvmppc_setup_partition_table() like kvmppc_hv_setup_htab_rma() does.
      In future it would be better to refactor the code more extensively
      to remove the duplication.
      
      Fixes: 8dc6cca5 ("KVM: PPC: Book3S HV: Don't rely on host's page size information")
      Fixes: 7a84084c ("KVM: PPC: Book3S HV: Set partition table rather than SDR1 on POWER9")
      Reported-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Tested-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      ded13fc1
  10. 01 11月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix host · 18c3640c
      Paul Mackerras 提交于
      This sets up the machinery for switching a guest between HPT (hashed
      page table) and radix MMU modes, so that in future we can run a HPT
      guest on a radix host on POWER9 machines.
      
      * The KVM_PPC_CONFIGURE_V3_MMU ioctl can now specify either HPT or
        radix mode, on a radix host.
      
      * The KVM_CAP_PPC_MMU_HASH_V3 capability now returns 1 on POWER9
        with HV KVM on a radix host.
      
      * The KVM_PPC_GET_SMMU_INFO returns information about the HPT MMU on a
        radix host.
      
      * The KVM_PPC_ALLOCATE_HTAB ioctl on a radix host will switch the
        guest to HPT mode and allocate a HPT.
      
      * For simplicity, we now allocate the rmap array for each memslot,
        even on a radix host, since it will be needed if the guest switches
        to HPT mode.
      
      * Since we cannot yet run a HPT guest on a radix host, the KVM_RUN
        ioctl will return an EINVAL error in that case.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      18c3640c
  11. 19 6月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Allow userspace to set the desired SMT mode · 3c313524
      Paul Mackerras 提交于
      This allows userspace to set the desired virtual SMT (simultaneous
      multithreading) mode for a VM, that is, the number of VCPUs that
      get assigned to each virtual core.  Previously, the virtual SMT mode
      was fixed to the number of threads per subcore, and if userspace
      wanted to have fewer vcpus per vcore, then it would achieve that by
      using a sparse CPU numbering.  This had the disadvantage that the
      vcpu numbers can get quite large, particularly for SMT1 guests on
      a POWER8 with 8 threads per core.  With this patch, userspace can
      set its desired virtual SMT mode and then use contiguous vcpu
      numbering.
      
      On POWER8, where the threading mode is "strict", the virtual SMT mode
      must be less than or equal to the number of threads per subcore.  On
      POWER9, which implements a "loose" threading mode, the virtual SMT
      mode can be any power of 2 between 1 and 8, even though there is
      effectively one thread per subcore, since the threads are independent
      and can all be in different partitions.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      3c313524
  12. 27 4月, 2017 1 次提交
  13. 20 4月, 2017 5 次提交
    • A
      KVM: PPC: VFIO: Add in-kernel acceleration for VFIO · 121f80ba
      Alexey Kardashevskiy 提交于
      This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
      and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
      without passing them to user space which saves time on switching
      to user space and back.
      
      This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
      KVM tries to handle a TCE request in the real mode, if failed
      it passes the request to the virtual mode to complete the operation.
      If it a virtual mode handler fails, the request is passed to
      the user space; this is not expected to happen though.
      
      To avoid dealing with page use counters (which is tricky in real mode),
      this only accelerates SPAPR TCE IOMMU v2 clients which are required
      to pre-register the userspace memory. The very first TCE request will
      be handled in the VFIO SPAPR TCE driver anyway as the userspace view
      of the TCE table (iommu_table::it_userspace) is not allocated till
      the very first mapping happens and we cannot call vmalloc in real mode.
      
      If we fail to update a hardware IOMMU table unexpected reason, we just
      clear it and move on as there is nothing really we can do about it -
      for example, if we hot plug a VFIO device to a guest, existing TCE tables
      will be mirrored automatically to the hardware and there is no interface
      to report to the guest about possible failures.
      
      This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
      the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
      and associates a physical IOMMU table with the SPAPR TCE table (which
      is a guest view of the hardware IOMMU table). The iommu_table object
      is cached and referenced so we do not have to look up for it in real mode.
      
      This does not implement the UNSET counterpart as there is no use for it -
      once the acceleration is enabled, the existing userspace won't
      disable it unless a VFIO container is destroyed; this adds necessary
      cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
      
      This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
      space.
      
      This adds real mode version of WARN_ON_ONCE() as the generic version
      causes problems with rcu_sched. Since we testing what vmalloc_to_phys()
      returns in the code, this also adds a check for already existing
      vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
      
      This finally makes use of vfio_external_user_iommu_id() which was
      introduced quite some time ago and was considered for removal.
      
      Tests show that this patch increases transmission speed from 220MB/s
      to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: NAlex Williamson <alex.williamson@redhat.com>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      121f80ba
    • A
      KVM: PPC: iommu: Unify TCE checking · b1af23d8
      Alexey Kardashevskiy 提交于
      This reworks helpers for checking TCE update parameters in way they
      can be used in KVM.
      
      This should cause no behavioral change.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      b1af23d8
    • A
      KVM: PPC: Pass kvm* to kvmppc_find_table() · 503bfcbe
      Alexey Kardashevskiy 提交于
      The guest view TCE tables are per KVM anyway (not per VCPU) so pass kvm*
      there. This will be used in the following patches where we will be
      attaching VFIO containers to LIOBNs via ioctl() to KVM (rather than
      to VCPU).
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      503bfcbe
    • B
      KVM: PPC: Book3S: Add MMIO emulation for FP and VSX instructions · 6f63e81b
      Bin Lu 提交于
      This patch provides the MMIO load/store emulation for instructions
      of 'double & vector unsigned char & vector signed char & vector
      unsigned short & vector signed short & vector unsigned int & vector
      signed int & vector double '.
      
      The instructions that this adds emulation for are:
      
      - ldx, ldux, lwax,
      - lfs, lfsx, lfsu, lfsux, lfd, lfdx, lfdu, lfdux,
      - stfs, stfsx, stfsu, stfsux, stfd, stfdx, stfdu, stfdux, stfiwx,
      - lxsdx, lxsspx, lxsiwax, lxsiwzx, lxvd2x, lxvw4x, lxvdsx,
      - stxsdx, stxsspx, stxsiwx, stxvd2x, stxvw4x
      
      [paulus@ozlabs.org - some cleanups, fixes and rework, make it
       compile for Book E, fix build when PR KVM is built in]
      Signed-off-by: NBin Lu <lblulb@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      6f63e81b
    • P
      KVM: PPC: Provide functions for queueing up FP/VEC/VSX unavailable interrupts · 307d9279
      Paul Mackerras 提交于
      This provides functions that can be used for generating interrupts
      indicating that a given functional unit (floating point, vector, or
      VSX) is unavailable.  These functions will be used in instruction
      emulation code.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      307d9279
  14. 10 4月, 2017 3 次提交
  15. 31 1月, 2017 5 次提交
    • D
      KVM: PPC: Book3S HV: Outline of KVM-HV HPT resizing implementation · 5e985969
      David Gibson 提交于
      This adds a not yet working outline of the HPT resizing PAPR
      extension.  Specifically it adds the necessary ioctl() functions,
      their basic steps, the work function which will handle preparation for
      the resize, and synchronization between these, the guest page fault
      path and guest HPT update path.
      
      The actual guts of the implementation isn't here yet, so for now the
      calls will always fail.
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      5e985969
    • D
      KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size · f98a8bf9
      David Gibson 提交于
      The KVM_PPC_ALLOCATE_HTAB ioctl() is used to set the size of hashed page
      table (HPT) that userspace expects a guest VM to have, and is also used to
      clear that HPT when necessary (e.g. guest reboot).
      
      At present, once the ioctl() is called for the first time, the HPT size can
      never be changed thereafter - it will be cleared but always sized as from
      the first call.
      
      With upcoming HPT resize implementation, we're going to need to allow
      userspace to resize the HPT at reset (to change it back to the default size
      if the guest changed it).
      
      So, we need to allow this ioctl() to change the HPT size.
      
      This patch also updates Documentation/virtual/kvm/api.txt to reflect
      the new behaviour.  In fact the documentation was already slightly
      incorrect since 572abd56 "KVM: PPC: Book3S HV: Don't fall back to
      smaller HPT size in allocation ioctl"
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      f98a8bf9
    • D
      KVM: PPC: Book3S HV: Split HPT allocation from activation · aae0777f
      David Gibson 提交于
      Currently, kvmppc_alloc_hpt() both allocates a new hashed page table (HPT)
      and sets it up as the active page table for a VM.  For the upcoming HPT
      resize implementation we're going to want to allocate HPTs separately from
      activating them.
      
      So, split the allocation itself out into kvmppc_allocate_hpt() and perform
      the activation with a new kvmppc_set_hpt() function.  Likewise we split
      kvmppc_free_hpt(), which just frees the HPT, from kvmppc_release_hpt()
      which unsets it as an active HPT, then frees it.
      
      We also move the logic to fall back to smaller HPT sizes if the first try
      fails into the single caller which used that behaviour,
      kvmppc_hv_setup_htab_rma().  This introduces a slight semantic change, in
      that previously if the initial attempt at CMA allocation failed, we would
      fall back to attempting smaller sizes with the page allocator.  Now, we
      try first CMA, then the page allocator at each size.  As far as I can tell
      this change should be harmless.
      
      To match, we make kvmppc_free_hpt() just free the actual HPT itself.  The
      call to kvmppc_free_lpid() that was there, we move to the single caller.
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      aae0777f
    • D
      KVM: PPC: Book3S HV: Rename kvm_alloc_hpt() for clarity · db9a290d
      David Gibson 提交于
      The difference between kvm_alloc_hpt() and kvmppc_alloc_hpt() is not at
      all obvious from the name.  In practice kvmppc_alloc_hpt() allocates an HPT
      by whatever means, and calls kvm_alloc_hpt() which will attempt to allocate
      it with CMA only.
      
      To make this less confusing, rename kvm_alloc_hpt() to kvm_alloc_hpt_cma().
      Similarly, kvm_release_hpt() is renamed kvm_free_hpt_cma().
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: NThomas Huth <thuth@redhat.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      db9a290d
    • P
      KVM: PPC: Book3S HV: Add userspace interfaces for POWER9 MMU · c9270132
      Paul Mackerras 提交于
      This adds two capabilities and two ioctls to allow userspace to
      find out about and configure the POWER9 MMU in a guest.  The two
      capabilities tell userspace whether KVM can support a guest using
      the radix MMU, or using the hashed page table (HPT) MMU with a
      process table and segment tables.  (Note that the MMUs in the
      POWER9 processor cores do not use the process and segment tables
      when in HPT mode, but the nest MMU does).
      
      The KVM_PPC_CONFIGURE_V3_MMU ioctl allows userspace to specify
      whether a guest will use the radix MMU or the HPT MMU, and to
      specify the size and location (in guest space) of the process
      table.
      
      The KVM_PPC_GET_RMMU_INFO ioctl gives userspace information about
      the radix MMU.  It returns a list of supported radix tree geometries
      (base page size and number of bits indexed at each level of the
      radix tree) and the encoding used to specify the various page
      sizes for the TLB invalidate entry instruction.
      
      Initially, both capabilities return 0 and the ioctls return -EINVAL,
      until the necessary infrastructure for them to operate correctly
      is added.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c9270132
  16. 01 12月, 2016 1 次提交
  17. 24 11月, 2016 1 次提交
    • P
      KVM: PPC: Book3S HV: Use OPAL XICS emulation on POWER9 · f725758b
      Paul Mackerras 提交于
      POWER9 includes a new interrupt controller, called XIVE, which is
      quite different from the XICS interrupt controller on POWER7 and
      POWER8 machines.  KVM-HV accesses the XICS directly in several places
      in order to send and clear IPIs and handle interrupts from PCI
      devices being passed through to the guest.
      
      In order to make the transition to XIVE easier, OPAL firmware will
      include an emulation of XICS on top of XIVE.  Access to the emulated
      XICS is via OPAL calls.  The one complication is that the EOI
      (end-of-interrupt) function can now return a value indicating that
      another interrupt is pending; in this case, the XIVE will not signal
      an interrupt in hardware to the CPU, and software is supposed to
      acknowledge the new interrupt without waiting for another interrupt
      to be delivered in hardware.
      
      This adapts KVM-HV to use the OPAL calls on machines where there is
      no XICS hardware.  When there is no XICS, we look for a device-tree
      node with "ibm,opal-intc" in its compatible property, which is how
      OPAL indicates that it provides XICS emulation.
      
      In order to handle the EOI return value, kvmppc_read_intr() has
      become kvmppc_read_one_intr(), with a boolean variable passed by
      reference which can be set by the EOI functions to indicate that
      another interrupt is pending.  The new kvmppc_read_intr() keeps
      calling kvmppc_read_one_intr() until there are no more interrupts
      to process.  The return value from kvmppc_read_intr() is the
      largest non-zero value of the returns from kvmppc_read_one_intr().
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      f725758b
  18. 12 9月, 2016 4 次提交
    • P
      KVM: PPC: Book3S HV: Set server for passed-through interrupts · 5d375199
      Paul Mackerras 提交于
      When a guest has a PCI pass-through device with an interrupt, it
      will direct the interrupt to a particular guest VCPU.  In fact the
      physical interrupt might arrive on any CPU, and then get
      delivered to the target VCPU in the emulated XICS (guest interrupt
      controller), and eventually delivered to the target VCPU.
      
      Now that we have code to handle device interrupts in real mode
      without exiting to the host kernel, there is an advantage to having
      the device interrupt arrive on the same sub(core) as the target
      VCPU is running on.  In this situation, the interrupt can be
      delivered to the target VCPU without any exit to the host kernel
      (using a hypervisor doorbell interrupt between threads if
      necessary).
      
      This patch aims to get passed-through device interrupts arriving
      on the correct core by setting the interrupt server in the real
      hardware XICS for the interrupt to the first thread in the (sub)core
      where its target VCPU is running.  We do this in the real-mode H_EOI
      code because the H_EOI handler already needs to look at the
      emulated ICS state for the interrupt (whereas the H_XIRR handler
      doesn't), and we know we are running in the target VCPU context
      at that point.
      
      We set the server CPU in hardware using an OPAL call, regardless of
      what the IRQ affinity mask for the interrupt says, and without
      updating the affinity mask.  This amounts to saying that when an
      interrupt is passed through to a guest, as a matter of policy we
      allow the guest's affinity for the interrupt to override the host's.
      
      This is inspired by an earlier patch from Suresh Warrier, although
      none of this code came from that earlier patch.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      5d375199
    • S
      KVM: PPC: Book3S HV: Tunable to disable KVM IRQ bypass · 644abbb2
      Suresh Warrier 提交于
      Add a  module parameter kvm_irq_bypass for kvm_hv.ko to
      disable IRQ bypass for passthrough interrupts. The default
      value of this tunable is 1 - that is enable the feature.
      
      Since the tunable is used by built-in kernel code, we use
      the module_param_cb macro to achieve this.
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      644abbb2
    • S
      KVM: PPC: Book3S HV: Complete passthrough interrupt in host · f7af5209
      Suresh Warrier 提交于
      In existing real mode ICP code, when updating the virtual ICP
      state, if there is a required action that cannot be completely
      handled in real mode, as for instance, a VCPU needs to be woken
      up, flags are set in the ICP to indicate the required action.
      This is checked when returning from hypercalls to decide whether
      the call needs switch back to the host where the action can be
      performed in virtual mode. Note that if h_ipi_redirect is enabled,
      real mode code will first try to message a free host CPU to
      complete this job instead of returning the host to do it ourselves.
      
      Currently, the real mode PCI passthrough interrupt handling code
      checks if any of these flags are set and simply returns to the host.
      This is not good enough as the trap value (0x500) is treated as an
      external interrupt by the host code. It is only when the trap value
      is a hypercall that the host code searches for and acts on unfinished
      work by calling kvmppc_xics_rm_complete.
      
      This patch introduces a special trap BOOK3S_INTERRUPT_HV_RM_HARD
      which is returned by KVM if there is unfinished business to be
      completed in host virtual mode after handling a PCI passthrough
      interrupt. The host checks for this special interrupt condition
      and calls into the kvmppc_xics_rm_complete, which is made an
      exported function for this reason.
      
      [paulus@ozlabs.org - moved logic to set r12 to BOOK3S_INTERRUPT_HV_RM_HARD
       in book3s_hv_rmhandlers.S into the end of kvmppc_check_wake_reason.]
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      f7af5209
    • S
      KVM: PPC: Book3S HV: Handle passthrough interrupts in guest · e3c13e56
      Suresh Warrier 提交于
      Currently, KVM switches back to the host to handle any external
      interrupt (when the interrupt is received while running in the
      guest). This patch updates real-mode KVM to check if an interrupt
      is generated by a passthrough adapter that is owned by this guest.
      If so, the real mode KVM will directly inject the corresponding
      virtual interrupt to the guest VCPU's ICS and also EOI the interrupt
      in hardware. In short, the interrupt is handled entirely in real
      mode in the guest context without switching back to the host.
      
      In some rare cases, the interrupt cannot be completely handled in
      real mode, for instance, a VCPU that is sleeping needs to be woken
      up. In this case, KVM simply switches back to the host with trap
      reason set to 0x500. This works, but it is clearly not very efficient.
      A following patch will distinguish this case and handle it
      correctly in the host. Note that we can use the existing
      check_too_hard() routine even though we are not in a hypercall to
      determine if there is unfinished business that needs to be
      completed in host virtual mode.
      
      The patch assumes that the mapping between hardware interrupt IRQ
      and virtual IRQ to be injected to the guest already exists for the
      PCI passthrough interrupts that need to be handled in real mode.
      If the mapping does not exist, KVM falls back to the default
      existing behavior.
      
      The KVM real mode code reads mappings from the mapped array in the
      passthrough IRQ map without taking any lock.  We carefully order the
      loads and stores of the fields in the kvmppc_irq_map data structure
      using memory barriers to avoid an inconsistent mapping being seen by
      the reader. Thus, although it is possible to miss a map entry, it is
      not possible to read a stale value.
      
      [paulus@ozlabs.org - get irq_chip from irq_map rather than pimap,
       pulled out powernv eoi change into a separate patch, made
       kvmppc_read_intr get the vcpu from the paca rather than being
       passed in, rewrote the logic at the end of kvmppc_read_intr to
       avoid deep indentation, simplified logic in book3s_hv_rmhandlers.S
       since we were always restoring SRR0/1 anyway, get rid of the cached
       array (just use the mapped array), removed the kick_all_cpus_sync()
       call, clear saved_xirr PACA field when we handle the interrupt in
       real mode, fix compilation with CONFIG_KVM_XICS=n.]
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e3c13e56