1. 19 2月, 2019 1 次提交
    • P
      KVM: PPC: Book3S: Allow XICS emulation to work in nested hosts using XIVE · 03f95332
      Paul Mackerras 提交于
      Currently, the KVM code assumes that if the host kernel is using the
      XIVE interrupt controller (the new interrupt controller that first
      appeared in POWER9 systems), then the in-kernel XICS emulation will
      use the XIVE hardware to deliver interrupts to the guest.  However,
      this only works when the host is running in hypervisor mode and has
      full access to all of the XIVE functionality.  It doesn't work in any
      nested virtualization scenario, either with PR KVM or nested-HV KVM,
      because the XICS-on-XIVE code calls directly into the native-XIVE
      routines, which are not initialized and cannot function correctly
      because they use OPAL calls, and OPAL is not available in a guest.
      
      This means that using the in-kernel XICS emulation in a nested
      hypervisor that is using XIVE as its interrupt controller will cause a
      (nested) host kernel crash.  To fix this, we change most of the places
      where the current code calls xive_enabled() to select between the
      XICS-on-XIVE emulation and the plain XICS emulation to call a new
      function, xics_on_xive(), which returns false in a guest.
      
      However, there is a further twist.  The plain XICS emulation has some
      functions which are used in real mode and access the underlying XICS
      controller (the interrupt controller of the host) directly.  In the
      case of a nested hypervisor, this means doing XICS hypercalls
      directly.  When the nested host is using XIVE as its interrupt
      controller, these hypercalls will fail.  Therefore this also adds
      checks in the places where the XICS emulation wants to access the
      underlying interrupt controller directly, and if that is XIVE, makes
      the code use the virtual mode fallback paths, which call generic
      kernel infrastructure rather than doing direct XICS access.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      03f95332
  2. 09 10月, 2018 2 次提交
  3. 18 8月, 2018 1 次提交
  4. 18 5月, 2018 2 次提交
    • N
      KVM: PPC: Book3S HV: Send kvmppc_bad_interrupt NMIs to Linux handlers · 7c1bd80c
      Nicholas Piggin 提交于
      It's possible to take a SRESET or MCE in these paths due to a bug
      in the host code or a NMI IPI, etc. A recent bug attempting to load
      a virtual address from real mode gave th complete but cryptic error,
      abridged:
      
            Oops: Bad interrupt in KVM entry/exit code, sig: 6 [#1]
            LE SMP NR_CPUS=2048 NUMA PowerNV
            CPU: 53 PID: 6582 Comm: qemu-system-ppc Not tainted
            NIP:  c0000000000155ac LR: c0000000000c2430 CTR: c000000000015580
            REGS: c000000fff76dd80 TRAP: 0200   Not tainted
            MSR:  9000000000201003 <SF,HV,ME,RI,LE>  CR: 48082222  XER: 00000000
            CFAR: 0000000102900ef0 DAR: d00017fffd941a28 DSISR: 00000040 SOFTE: 3
            NIP [c0000000000155ac] perf_trace_tlbie+0x2c/0x1a0
            LR [c0000000000c2430] do_tlbies+0x230/0x2f0
      
      Sending the NMIs through the Linux handlers gives a nicer output:
      
            Severe Machine check interrupt [Not recovered]
              NIP [c0000000000155ac]: perf_trace_tlbie+0x2c/0x1a0
              Initiator: CPU
              Error type: Real address [Load (bad)]
                Effective address: d00017fffcc01a28
            opal: Machine check interrupt unrecoverable: MSR(RI=0)
            opal: Hardware platform error: Unrecoverable Machine Check exception
            CPU: 0 PID: 6700 Comm: qemu-system-ppc Tainted: G   M
            NIP:  c0000000000155ac LR: c0000000000c23c0 CTR: c000000000015580
            REGS: c000000fff9e9d80 TRAP: 0200   Tainted: G   M
            MSR:  9000000000201001 <SF,HV,ME,LE>  CR: 48082222  XER: 00000000
            CFAR: 000000010cbc1a30 DAR: d00017fffcc01a28 DSISR: 00000040 SOFTE: 3
            NIP [c0000000000155ac] perf_trace_tlbie+0x2c/0x1a0
            LR [c0000000000c23c0] do_tlbies+0x1c0/0x280
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      7c1bd80c
    • S
      KVM: PPC: Add pt_regs into kvm_vcpu_arch and move vcpu->arch.gpr[] into it · 1143a706
      Simon Guo 提交于
      Current regs are scattered at kvm_vcpu_arch structure and it will
      be more neat to organize them into pt_regs structure.
      
      Also it will enable reimplementation of MMIO emulation code with
      analyse_instr() later.
      Signed-off-by: NSimon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      1143a706
  5. 30 3月, 2018 1 次提交
  6. 01 11月, 2017 2 次提交
    • P
      KVM: PPC: Book3S HV: Run HPT guests on POWER9 radix hosts · c0101509
      Paul Mackerras 提交于
      This patch removes the restriction that a radix host can only run
      radix guests, allowing us to run HPT (hashed page table) guests as
      well.  This is useful because it provides a way to run old guest
      kernels that know about POWER8 but not POWER9.
      
      Unfortunately, POWER9 currently has a restriction that all threads
      in a given code must either all be in HPT mode, or all in radix mode.
      This means that when entering a HPT guest, we have to obtain control
      of all 4 threads in the core and get them to switch their LPIDR and
      LPCR registers, even if they are not going to run a guest.  On guest
      exit we also have to get all threads to switch LPIDR and LPCR back
      to host values.
      
      To make this feasible, we require that KVM not be in the "independent
      threads" mode, and that the CPU cores be in single-threaded mode from
      the host kernel's perspective (only thread 0 online; threads 1, 2 and
      3 offline).  That allows us to use the same code as on POWER8 for
      obtaining control of the secondary threads.
      
      To manage the LPCR/LPIDR changes required, we extend the kvm_split_info
      struct to contain the information needed by the secondary threads.
      All threads perform a barrier synchronization (where all threads wait
      for every other thread to reach the synchronization point) on guest
      entry, both before and after loading LPCR and LPIDR.  On guest exit,
      they all once again perform a barrier synchronization both before
      and after loading host values into LPCR and LPIDR.
      
      Finally, it is also currently necessary to flush the entire TLB every
      time we enter a HPT guest on a radix host.  We do this on thread 0
      with a loop of tlbiel instructions.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      c0101509
    • P
      KVM: PPC: Book3S HV: Don't call real-mode XICS hypercall handlers if not enabled · 00bb6ae5
      Paul Mackerras 提交于
      When running a guest on a POWER9 system with the in-kernel XICS
      emulation disabled (for example by running QEMU with the parameter
      "-machine pseries,kernel_irqchip=off"), the kernel does not pass
      the XICS-related hypercalls such as H_CPPR up to userspace for
      emulation there as it should.
      
      The reason for this is that the real-mode handlers for these
      hypercalls don't check whether a XICS device has been instantiated
      before calling the xics-on-xive code.  That code doesn't check
      either, leading to potential NULL pointer dereferences because
      vcpu->arch.xive_vcpu is NULL.  Those dereferences won't cause an
      exception in real mode but will lead to kernel memory corruption.
      
      This fixes it by adding kvmppc_xics_enabled() checks before calling
      the XICS functions.
      
      Cc: stable@vger.kernel.org # v4.11+
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      00bb6ae5
  7. 14 10月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Handle unexpected interrupts better · 857b99e1
      Paul Mackerras 提交于
      At present, if an interrupt (i.e. an exception or trap) occurs in the
      code where KVM is switching the MMU to or from guest context, we jump
      to kvmppc_bad_host_intr, where we simply spin with interrupts disabled.
      In this situation, it is hard to debug what happened because we get no
      indication as to which interrupt occurred or where.  Typically we get
      a cascade of stall and soft lockup warnings from other CPUs.
      
      In order to get more information for debugging, this adds code to
      create a stack frame on the emergency stack and save register values
      to it.  We start half-way down the emergency stack in order to give
      ourselves some chance of being able to do a stack trace on secondary
      threads that are already on the emergency stack.
      
      On POWER7 or POWER8, we then just spin, as before, because we don't
      know what state the MMU context is in or what other threads are doing,
      and we can't switch back to host context without coordinating with
      other threads.  On POWER9 we can do better; there we load up the host
      MMU context and jump to C code, which prints an oops message to the
      console and panics.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      857b99e1
  8. 01 7月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Simplify dynamic micro-threading code · 898b25b2
      Paul Mackerras 提交于
      Since commit b009031f ("KVM: PPC: Book3S HV: Take out virtual
      core piggybacking code", 2016-09-15), we only have at most one
      vcore per subcore.  Previously, the fact that there might be more
      than one vcore per subcore meant that we had the notion of a
      "master vcore", which was the vcore that controlled thread 0 of
      the subcore.  We also needed a list per subcore in the core_info
      struct to record which vcores belonged to each subcore.  Now that
      there can only be one vcore in the subcore, we can replace the
      list with a simple pointer and get rid of the notion of the
      master vcore (and in fact treat every vcore as a master vcore).
      
      We can also get rid of the subcore_vm[] field in the core_info
      struct since it is never read.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      898b25b2
  9. 12 5月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Add radix checks in real-mode hypercall handlers · acde2572
      Paul Mackerras 提交于
      POWER9 running a radix guest will take some hypervisor interrupts
      without going to real mode (turning off the MMU).  This means that
      early hypercall handlers may now be called in virtual mode.  Most of
      the handlers work just fine in both modes, but there are some that
      can crash the host if called in virtual mode, notably the TCE (IOMMU)
      hypercalls H_PUT_TCE, H_STUFF_TCE and H_PUT_TCE_INDIRECT.  These
      already have both a real-mode and a virtual-mode version, so we
      arrange for the real-mode version to return H_TOO_HARD for radix
      guests, which will result in the virtual-mode version being called.
      
      The other hypercall which is sensitive to the MMU mode is H_RANDOM.
      It doesn't have a virtual-mode version, so this adds code to enable
      it to be called in either mode.
      
      An alternative solution was considered which would refuse to call any
      of the early hypercall handlers when doing a virtual-mode exit from a
      radix guest.  However, the XICS-on-XIVE code depends on the XICS
      hypercalls being handled early even for virtual-mode exits, because
      the handlers need to be called before the XIVE vCPU state has been
      pulled off the hardware.  Therefore that solution would have become
      quite invasive and complicated, and was rejected in favour of the
      simpler, though less elegant, solution presented here.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Tested-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      acde2572
  10. 27 4月, 2017 1 次提交
  11. 19 4月, 2017 1 次提交
  12. 10 4月, 2017 2 次提交
    • B
      powerpc: Consolidate variants of real-mode MMIOs · d381d7ca
      Benjamin Herrenschmidt 提交于
      We have all sort of variants of MMIO accessors for the real mode
      instructions. This creates a clean set of accessors based on
      Linux normal naming conventions, replacing all occurrences of
      the old ones in the tree.
      
      I have purposefully removed the "out/in" variants in favor of
      only including __raw variants. Any code using these is already
      pretty much hand tuned to operate in a very specific environment.
      I've fixed up the 2 users (only one of them actually needed
      a barrier in the first place).
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      d381d7ca
    • B
      powerpc/xive: Native exploitation of the XIVE interrupt controller · 243e2511
      Benjamin Herrenschmidt 提交于
      The XIVE interrupt controller is the new interrupt controller
      found in POWER9. It supports advanced virtualization capabilities
      among other things.
      
      Currently we use a set of firmware calls that simulate the old
      "XICS" interrupt controller but this is fairly inefficient.
      
      This adds the framework for using XIVE along with a native
      backend which OPAL for configuration. Later, a backend allowing
      the use in a KVM or PowerVM guest will also be provided.
      
      This disables some fast path for interrupts in KVM when XIVE is
      enabled as these rely on the firmware emulation code which is no
      longer available when the XIVE is used natively by Linux.
      
      A latter patch will make KVM also directly exploit the XIVE, thus
      recovering the lost performance (and more).
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      [mpe: Fixup pr_xxx("XIVE:"...), don't split pr_xxx() strings,
       tweak Kconfig so XIVE_NATIVE selects XIVE and depends on POWERNV,
       fix build errors when SMP=n, fold in fixes from Ben:
         Don't call cpu_online() on an invalid CPU number
         Fix irq target selection returning out of bounds cpu#
         Extra sanity checks on cpu numbers
       ]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      243e2511
  13. 25 2月, 2017 1 次提交
  14. 07 2月, 2017 1 次提交
  15. 31 1月, 2017 2 次提交
    • D
      KVM: PPC: Book3S HV: Rename kvm_alloc_hpt() for clarity · db9a290d
      David Gibson 提交于
      The difference between kvm_alloc_hpt() and kvmppc_alloc_hpt() is not at
      all obvious from the name.  In practice kvmppc_alloc_hpt() allocates an HPT
      by whatever means, and calls kvm_alloc_hpt() which will attempt to allocate
      it with CMA only.
      
      To make this less confusing, rename kvm_alloc_hpt() to kvm_alloc_hpt_cma().
      Similarly, kvm_release_hpt() is renamed kvm_free_hpt_cma().
      Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: NThomas Huth <thuth@redhat.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      db9a290d
    • P
      KVM: PPC: Book3S HV: Allow guest exit path to have MMU on · 53af3ba2
      Paul Mackerras 提交于
      If we allow LPCR[AIL] to be set for radix guests, then interrupts from
      the guest to the host can be delivered by the hardware with relocation
      on, and thus the code path starting at kvmppc_interrupt_hv can be
      executed in virtual mode (MMU on) for radix guests (previously it was
      only ever executed in real mode).
      
      Most of the code is indifferent to whether the MMU is on or off, but
      the calls to OPAL that use the real-mode OPAL entry code need to
      be switched to use the virtual-mode code instead.  The affected
      calls are the calls to the OPAL XICS emulation functions in
      kvmppc_read_one_intr() and related functions.  We test the MSR[IR]
      bit to detect whether we are in real or virtual mode, and call the
      opal_rm_* or opal_* function as appropriate.
      
      The other place that depends on the MMU being off is the optimization
      where the guest exit code jumps to the external interrupt vector or
      hypervisor doorbell interrupt vector, or returns to its caller (which
      is __kvmppc_vcore_entry).  If the MMU is on and we are returning to
      the caller, then we don't need to use an rfid instruction since the
      MMU is already on; a simple blr suffices.  If there is an external
      or hypervisor doorbell interrupt to handle, we branch to the
      relocation-on version of the interrupt vector.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      53af3ba2
  16. 01 12月, 2016 1 次提交
  17. 24 11月, 2016 3 次提交
    • P
      KVM: PPC: Book3S HV: Fix compilation with unusual configurations · e2702871
      Paul Mackerras 提交于
      This adds the "again" parameter to the dummy version of
      kvmppc_check_passthru(), so that it matches the real version.
      This fixes compilation with CONFIG_BOOK3S_64_HV set but
      CONFIG_KVM_XICS=n.
      
      This includes asm/smp.h in book3s_hv_builtin.c to fix compilation
      with CONFIG_SMP=n.  The explicit inclusion is necessary to provide
      definitions of hard_smp_processor_id() and get_hard_smp_processor_id()
      in UP configs.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e2702871
    • P
      KVM: PPC: Book3S HV: Use OPAL XICS emulation on POWER9 · f725758b
      Paul Mackerras 提交于
      POWER9 includes a new interrupt controller, called XIVE, which is
      quite different from the XICS interrupt controller on POWER7 and
      POWER8 machines.  KVM-HV accesses the XICS directly in several places
      in order to send and clear IPIs and handle interrupts from PCI
      devices being passed through to the guest.
      
      In order to make the transition to XIVE easier, OPAL firmware will
      include an emulation of XICS on top of XIVE.  Access to the emulated
      XICS is via OPAL calls.  The one complication is that the EOI
      (end-of-interrupt) function can now return a value indicating that
      another interrupt is pending; in this case, the XIVE will not signal
      an interrupt in hardware to the CPU, and software is supposed to
      acknowledge the new interrupt without waiting for another interrupt
      to be delivered in hardware.
      
      This adapts KVM-HV to use the OPAL calls on machines where there is
      no XICS hardware.  When there is no XICS, we look for a device-tree
      node with "ibm,opal-intc" in its compatible property, which is how
      OPAL indicates that it provides XICS emulation.
      
      In order to handle the EOI return value, kvmppc_read_intr() has
      become kvmppc_read_one_intr(), with a boolean variable passed by
      reference which can be set by the EOI functions to indicate that
      another interrupt is pending.  The new kvmppc_read_intr() keeps
      calling kvmppc_read_one_intr() until there are no more interrupts
      to process.  The return value from kvmppc_read_intr() is the
      largest non-zero value of the returns from kvmppc_read_one_intr().
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      f725758b
    • P
      KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9 · 1704a81c
      Paul Mackerras 提交于
      On POWER9, the msgsnd instruction is able to send interrupts to
      other cores, as well as other threads on the local core.  Since
      msgsnd is generally simpler and faster than sending an IPI via the
      XICS, we use msgsnd for all IPIs sent by KVM on POWER9.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      1704a81c
  18. 21 11月, 2016 1 次提交
  19. 12 9月, 2016 2 次提交
    • S
      KVM: PPC: Book3S HV: Complete passthrough interrupt in host · f7af5209
      Suresh Warrier 提交于
      In existing real mode ICP code, when updating the virtual ICP
      state, if there is a required action that cannot be completely
      handled in real mode, as for instance, a VCPU needs to be woken
      up, flags are set in the ICP to indicate the required action.
      This is checked when returning from hypercalls to decide whether
      the call needs switch back to the host where the action can be
      performed in virtual mode. Note that if h_ipi_redirect is enabled,
      real mode code will first try to message a free host CPU to
      complete this job instead of returning the host to do it ourselves.
      
      Currently, the real mode PCI passthrough interrupt handling code
      checks if any of these flags are set and simply returns to the host.
      This is not good enough as the trap value (0x500) is treated as an
      external interrupt by the host code. It is only when the trap value
      is a hypercall that the host code searches for and acts on unfinished
      work by calling kvmppc_xics_rm_complete.
      
      This patch introduces a special trap BOOK3S_INTERRUPT_HV_RM_HARD
      which is returned by KVM if there is unfinished business to be
      completed in host virtual mode after handling a PCI passthrough
      interrupt. The host checks for this special interrupt condition
      and calls into the kvmppc_xics_rm_complete, which is made an
      exported function for this reason.
      
      [paulus@ozlabs.org - moved logic to set r12 to BOOK3S_INTERRUPT_HV_RM_HARD
       in book3s_hv_rmhandlers.S into the end of kvmppc_check_wake_reason.]
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      f7af5209
    • S
      KVM: PPC: Book3S HV: Handle passthrough interrupts in guest · e3c13e56
      Suresh Warrier 提交于
      Currently, KVM switches back to the host to handle any external
      interrupt (when the interrupt is received while running in the
      guest). This patch updates real-mode KVM to check if an interrupt
      is generated by a passthrough adapter that is owned by this guest.
      If so, the real mode KVM will directly inject the corresponding
      virtual interrupt to the guest VCPU's ICS and also EOI the interrupt
      in hardware. In short, the interrupt is handled entirely in real
      mode in the guest context without switching back to the host.
      
      In some rare cases, the interrupt cannot be completely handled in
      real mode, for instance, a VCPU that is sleeping needs to be woken
      up. In this case, KVM simply switches back to the host with trap
      reason set to 0x500. This works, but it is clearly not very efficient.
      A following patch will distinguish this case and handle it
      correctly in the host. Note that we can use the existing
      check_too_hard() routine even though we are not in a hypercall to
      determine if there is unfinished business that needs to be
      completed in host virtual mode.
      
      The patch assumes that the mapping between hardware interrupt IRQ
      and virtual IRQ to be injected to the guest already exists for the
      PCI passthrough interrupts that need to be handled in real mode.
      If the mapping does not exist, KVM falls back to the default
      existing behavior.
      
      The KVM real mode code reads mappings from the mapped array in the
      passthrough IRQ map without taking any lock.  We carefully order the
      loads and stores of the fields in the kvmppc_irq_map data structure
      using memory barriers to avoid an inconsistent mapping being seen by
      the reader. Thus, although it is possible to miss a map entry, it is
      not possible to read a stale value.
      
      [paulus@ozlabs.org - get irq_chip from irq_map rather than pimap,
       pulled out powernv eoi change into a separate patch, made
       kvmppc_read_intr get the vcpu from the paca rather than being
       passed in, rewrote the logic at the end of kvmppc_read_intr to
       avoid deep indentation, simplified logic in book3s_hv_rmhandlers.S
       since we were always restoring SRR0/1 anyway, get rid of the cached
       array (just use the mapped array), removed the kick_all_cpus_sync()
       call, clear saved_xirr PACA field when we handle the interrupt in
       real mode, fix compilation with CONFIG_KVM_XICS=n.]
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e3c13e56
  20. 09 9月, 2016 1 次提交
    • S
      KVM: PPC: Book3S HV: Convert kvmppc_read_intr to a C function · 37f55d30
      Suresh Warrier 提交于
      Modify kvmppc_read_intr to make it a C function.  Because it is called
      from kvmppc_check_wake_reason, any of the assembler code that calls
      either kvmppc_read_intr or kvmppc_check_wake_reason now has to assume
      that the volatile registers might have been modified.
      
      This also adds in the optimization of clearing saved_xirr in the case
      where we completely handle and EOI an IPI.  Without this, the next
      device interrupt will require two trips through the host interrupt
      handling code.
      
      [paulus@ozlabs.org - made kvmppc_check_wake_reason create a stack frame
       when it is calling kvmppc_read_intr, which means we can set r12 to
       the trap number (0x500) after the call to kvmppc_read_intr, instead
       of using r31.  Also moved the deliver_guest_interrupt label so as to
       restore XER and CTR, plus other minor tweaks.]
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      37f55d30
  21. 29 2月, 2016 1 次提交
    • S
      KVM: PPC: Book3S HV: Host-side RM data structures · 79b6c247
      Suresh Warrier 提交于
      This patch defines the data structures to support the setting up
      of host side operations while running in real mode in the guest,
      and also the functions to allocate and free it.
      
      The operations are for now limited to virtual XICS operations.
      Currently, we have only defined one operation in the data
      structure:
               - Wake up a VCPU sleeping in the host when it
                 receives a virtual interrupt
      
      The operations are assigned at the core level because PowerKVM
      requires that the host run in SMT off mode. For each core,
      we will need to manage its state atomically - where the state
      is defined by:
      1. Is the core running in the host?
      2. Is there a Real Mode (RM) operation pending on the host?
      
      Currently, core state is only managed at the whole-core level
      even when the system is in split-core mode. This just limits
      the number of free or "available" cores in the host to perform
      any host-side operations.
      
      The kvmppc_host_rm_core.rm_data allows any data to be passed by
      KVM in real mode to the host core along with the operation to
      be performed.
      
      The kvmppc_host_rm_ops structure is allocated the very first time
      a guest VM is started. Initial core state is also set - all online
      cores are in the host. This structure is never deleted, not even
      when there are no active guests. However, it needs to be freed
      when the module is unloaded because the kvmppc_host_rm_ops_hv
      can contain function pointers to kvm-hv.ko functions for the
      different supported host operations.
      Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      79b6c247
  22. 22 8月, 2015 2 次提交
    • P
      KVM: PPC: Book3S HV: Implement dynamic micro-threading on POWER8 · b4deba5c
      Paul Mackerras 提交于
      This builds on the ability to run more than one vcore on a physical
      core by using the micro-threading (split-core) modes of the POWER8
      chip.  Previously, only vcores from the same VM could be run together,
      and (on POWER8) only if they had just one thread per core.  With the
      ability to split the core on guest entry and unsplit it on guest exit,
      we can run up to 8 vcpu threads from up to 4 different VMs, and we can
      run multiple vcores with 2 or 4 vcpus per vcore.
      
      Dynamic micro-threading is only available if the static configuration
      of the cores is whole-core mode (unsplit), and only on POWER8.
      
      To manage this, we introduce a new kvm_split_mode struct which is
      shared across all of the subcores in the core, with a pointer in the
      paca on each thread.  In addition we extend the core_info struct to
      have information on each subcore.  When deciding whether to add a
      vcore to the set already on the core, we now have two possibilities:
      (a) piggyback the vcore onto an existing subcore, or (b) start a new
      subcore.
      
      Currently, when any vcpu needs to exit the guest and switch to host
      virtual mode, we interrupt all the threads in all subcores and switch
      the core back to whole-core mode.  It may be possible in future to
      allow some of the subcores to keep executing in the guest while
      subcore 0 switches to the host, but that is not implemented in this
      patch.
      
      This adds a module parameter called dynamic_mt_modes which controls
      which micro-threading (split-core) modes the code will consider, as a
      bitmap.  In other words, if it is 0, no micro-threading mode is
      considered; if it is 2, only 2-way micro-threading is considered; if
      it is 4, only 4-way, and if it is 6, both 2-way and 4-way
      micro-threading mode will be considered.  The default is 6.
      
      With this, we now have secondary threads which are the primary thread
      for their subcore and therefore need to do the MMU switch.  These
      threads will need to be started even if they have no vcpu to run, so
      we use the vcore pointer in the PACA rather than the vcpu pointer to
      trigger them.
      
      It is now possible for thread 0 to find that an exit has been
      requested before it gets to switch the subcore state to the guest.  In
      that case we haven't added the guest's timebase offset to the
      timebase, so we need to be careful not to subtract the offset in the
      guest exit path.  In fact we just skip the whole path that switches
      back to host context, since we haven't switched to the guest context.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      b4deba5c
    • P
      KVM: PPC: Book3S HV: Make use of unused threads when running guests · ec257165
      Paul Mackerras 提交于
      When running a virtual core of a guest that is configured with fewer
      threads per core than the physical cores have, the extra physical
      threads are currently unused.  This makes it possible to use them to
      run one or more other virtual cores from the same guest when certain
      conditions are met.  This applies on POWER7, and on POWER8 to guests
      with one thread per virtual core.  (It doesn't apply to POWER8 guests
      with multiple threads per vcore because they require a 1-1 virtual to
      physical thread mapping in order to be able to use msgsndp and the
      TIR.)
      
      The idea is that we maintain a list of preempted vcores for each
      physical cpu (i.e. each core, since the host runs single-threaded).
      Then, when a vcore is about to run, it checks to see if there are
      any vcores on the list for its physical cpu that could be
      piggybacked onto this vcore's execution.  If so, those additional
      vcores are put into state VCORE_PIGGYBACK and their runnable VCPU
      threads are started as well as the original vcore, which is called
      the master vcore.
      
      After the vcores have exited the guest, the extra ones are put back
      onto the preempted list if any of their VCPUs are still runnable and
      not idle.
      
      This means that vcpu->arch.ptid is no longer necessarily the same as
      the physical thread that the vcpu runs on.  In order to make it easier
      for code that wants to send an IPI to know which CPU to target, we
      now store that in a new field in struct vcpu_arch, called thread_cpu.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Tested-by: NLaurent Vivier <lvivier@redhat.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      ec257165
  23. 21 4月, 2015 4 次提交
    • P
      KVM: PPC: Book3S HV: Use msgsnd for signalling threads on POWER8 · 66feed61
      Paul Mackerras 提交于
      This uses msgsnd where possible for signalling other threads within
      the same core on POWER8 systems, rather than IPIs through the XICS
      interrupt controller.  This includes waking secondary threads to run
      the guest, the interrupts generated by the virtual XICS, and the
      interrupts to bring the other threads out of the guest when exiting.
      
      Aggregated statistics from debugfs across vcpus for a guest with 32
      vcpus, 8 threads/vcore, running on a POWER8, show this before the
      change:
      
       rm_entry:     3387.6ns (228 - 86600, 1008969 samples)
        rm_exit:     4561.5ns (12 - 3477452, 1009402 samples)
        rm_intr:     1660.0ns (12 - 553050, 3600051 samples)
      
      and this after the change:
      
       rm_entry:     3060.1ns (212 - 65138, 953873 samples)
        rm_exit:     4244.1ns (12 - 9693408, 954331 samples)
        rm_intr:     1342.3ns (12 - 1104718, 3405326 samples)
      
      for a test of booting Fedora 20 big-endian to the login prompt.
      
      The time taken for a H_PROD hcall (which is handled in the host
      kernel) went down from about 35 microseconds to about 16 microseconds
      with this change.
      
      The noinline added to kvmppc_run_core turned out to be necessary for
      good performance, at least with gcc 4.9.2 as packaged with Fedora 21
      and a little-endian POWER8 host.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      66feed61
    • P
      KVM: PPC: Book3S HV: Translate kvmhv_commence_exit to C · eddb60fb
      Paul Mackerras 提交于
      This replaces the assembler code for kvmhv_commence_exit() with C code
      in book3s_hv_builtin.c.  It also moves the IPI sending code that was
      in book3s_hv_rm_xics.c into a new kvmhv_rm_send_ipi() function so it
      can be used by kvmhv_commence_exit() as well as icp_rm_set_vcpu_irq().
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      eddb60fb
    • P
      KVM: PPC: Book3S HV: Use bitmap of active threads rather than count · 7d6c40da
      Paul Mackerras 提交于
      Currently, the entry_exit_count field in the kvmppc_vcore struct
      contains two 8-bit counts, one of the threads that have started entering
      the guest, and one of the threads that have started exiting the guest.
      This changes it to an entry_exit_map field which contains two bitmaps
      of 8 bits each.  The advantage of doing this is that it gives us a
      bitmap of which threads need to be signalled when exiting the guest.
      That means that we no longer need to use the trick of setting the
      HDEC to 0 to pull the other threads out of the guest, which led in
      some cases to a spurious HDEC interrupt on the next guest entry.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      7d6c40da
    • M
      KVM: PPC: Book3S HV: Add fast real-mode H_RANDOM implementation. · e928e9cb
      Michael Ellerman 提交于
      Some PowerNV systems include a hardware random-number generator.
      This HWRNG is present on POWER7+ and POWER8 chips and is capable of
      generating one 64-bit random number every microsecond.  The random
      numbers are produced by sampling a set of 64 unstable high-frequency
      oscillators and are almost completely entropic.
      
      PAPR defines an H_RANDOM hypercall which guests can use to obtain one
      64-bit random sample from the HWRNG.  This adds a real-mode
      implementation of the H_RANDOM hypercall.  This hypercall was
      implemented in real mode because the latency of reading the HWRNG is
      generally small compared to the latency of a guest exit and entry for
      all the threads in the same virtual core.
      
      Userspace can detect the presence of the HWRNG and the H_RANDOM
      implementation by querying the KVM_CAP_PPC_HWRNG capability.  The
      H_RANDOM hypercall implementation will only be invoked when the guest
      does an H_RANDOM hypercall if userspace first enables the in-kernel
      H_RANDOM implementation using the KVM_CAP_PPC_ENABLE_HCALL capability.
      Signed-off-by: NMichael Ellerman <michael@ellerman.id.au>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      e928e9cb
  24. 17 12月, 2014 2 次提交
    • S
      KVM: PPC: Book3S HV: Improve H_CONFER implementation · 90fd09f8
      Sam Bobroff 提交于
      Currently the H_CONFER hcall is implemented in kernel virtual mode,
      meaning that whenever a guest thread does an H_CONFER, all the threads
      in that virtual core have to exit the guest.  This is bad for
      performance because it interrupts the other threads even if they
      are doing useful work.
      
      The H_CONFER hcall is called by a guest VCPU when it is spinning on a
      spinlock and it detects that the spinlock is held by a guest VCPU that
      is currently not running on a physical CPU.  The idea is to give this
      VCPU's time slice to the holder VCPU so that it can make progress
      towards releasing the lock.
      
      To avoid having the other threads exit the guest unnecessarily,
      we add a real-mode implementation of H_CONFER that checks whether
      the other threads are doing anything.  If all the other threads
      are idle (i.e. in H_CEDE) or trying to confer (i.e. in H_CONFER),
      it returns H_TOO_HARD which causes a guest exit and allows the
      H_CONFER to be handled in virtual mode.
      
      Otherwise it spins for a short time (up to 10 microseconds) to give
      other threads the chance to observe that this thread is trying to
      confer.  The spin loop also terminates when any thread exits the guest
      or when all other threads are idle or trying to confer.  If the
      timeout is reached, the H_CONFER returns H_SUCCESS.  In this case the
      guest VCPU will recheck the spinlock word and most likely call
      H_CONFER again.
      
      This also improves the implementation of the H_CONFER virtual mode
      handler.  If the VCPU is part of a virtual core (vcore) which is
      runnable, there will be a 'runner' VCPU which has taken responsibility
      for running the vcore.  In this case we yield to the runner VCPU
      rather than the target VCPU.
      
      We also introduce a check on the target VCPU's yield count: if it
      differs from the yield count passed to H_CONFER, the target VCPU
      has run since H_CONFER was called and may have already released
      the lock.  This check is required by PAPR.
      Signed-off-by: NSam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      90fd09f8
    • P
      KVM: PPC: Book3S HV: Remove code for PPC970 processors · c17b98cf
      Paul Mackerras 提交于
      This removes the code that was added to enable HV KVM to work
      on PPC970 processors.  The PPC970 is an old CPU that doesn't
      support virtualizing guest memory.  Removing PPC970 support also
      lets us remove the code for allocating and managing contiguous
      real-mode areas, the code for the !kvm->arch.using_mmu_notifiers
      case, the code for pinning pages of guest memory when first
      accessed and keeping track of which pages have been pinned, and
      the code for handling H_ENTER hypercalls in virtual mode.
      
      Book3S HV KVM is now supported only on POWER7 and POWER8 processors.
      The KVM_CAP_PPC_RMA capability now always returns 0.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      c17b98cf
  25. 10 11月, 2014 2 次提交
  26. 29 9月, 2014 1 次提交