1. 17 12月, 2018 1 次提交
  2. 14 12月, 2018 1 次提交
    • P
      KVM: PPC: Book3S HV: Fix race between kvm_unmap_hva_range and MMU mode switch · 234ff0b7
      Paul Mackerras 提交于
      Testing has revealed an occasional crash which appears to be caused
      by a race between kvmppc_switch_mmu_to_hpt and kvm_unmap_hva_range_hv.
      The symptom is a NULL pointer dereference in __find_linux_pte() called
      from kvm_unmap_radix() with kvm->arch.pgtable == NULL.
      
      Looking at kvmppc_switch_mmu_to_hpt(), it does indeed clear
      kvm->arch.pgtable (via kvmppc_free_radix()) before setting
      kvm->arch.radix to NULL, and there is nothing to prevent
      kvm_unmap_hva_range_hv() or the other MMU callback functions from
      being called concurrently with kvmppc_switch_mmu_to_hpt() or
      kvmppc_switch_mmu_to_radix().
      
      This patch therefore adds calls to spin_lock/unlock on the kvm->mmu_lock
      around the assignments to kvm->arch.radix, and makes sure that the
      partition-scoped radix tree or HPT is only freed after changing
      kvm->arch.radix.
      
      This also takes the kvm->mmu_lock in kvmppc_rmap_reset() to make sure
      that the clearing of each rmap array (one per memslot) doesn't happen
      concurrently with use of the array in the kvm_unmap_hva_range_hv()
      or the other MMU callbacks.
      
      Fixes: 18c3640c ("KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix host")
      Cc: stable@vger.kernel.org # v4.15+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      234ff0b7
  3. 15 11月, 2018 1 次提交
    • M
      KVM: PPC: Book3S HV: Fix handling for interrupted H_ENTER_NESTED · 6c08ec12
      Michael Roth 提交于
      While running a nested guest VCPU on L0 via H_ENTER_NESTED hcall, a
      pending signal in the L0 QEMU process can generate the following
      sequence:
      
        ret0 = kvmppc_pseries_do_hcall()
          ret1 = kvmhv_enter_nested_guest()
            ret2 = kvmhv_run_single_vcpu()
            if (ret2 == -EINTR)
              return H_INTERRUPT
          if (ret1 == H_INTERRUPT)
            kvmppc_set_gpr(vcpu, 3, 0)
            return -EINTR
          /* skipped: */
          kvmppc_set_gpr(vcpu, 3, ret)
          vcpu->arch.hcall_needed = 0
          return RESUME_GUEST
      
      which causes an exit to L0 userspace with ret0 == -EINTR.
      
      The intention seems to be to set the hcall return value to 0 (via
      VCPU r3) so that L1 will see a successful return from H_ENTER_NESTED
      once we resume executing the VCPU. However, because we don't set
      vcpu->arch.hcall_needed = 0, we do the following once userspace
      resumes execution via kvm_arch_vcpu_ioctl_run():
      
        ...
        } else if (vcpu->arch.hcall_needed) {
          int i
      
          kvmppc_set_gpr(vcpu, 3, run->papr_hcall.ret);
          for (i = 0; i < 9; ++i)
                 kvmppc_set_gpr(vcpu, 4 + i, run->papr_hcall.args[i]);
          vcpu->arch.hcall_needed = 0;
      
      since vcpu->arch.hcall_needed == 1 indicates that userspace should
      have handled the hcall and stored the return value in
      run->papr_hcall.ret. Since that's not the case here, we can get an
      unexpected value in VCPU r3, which can result in
      kvmhv_p9_guest_entry() reporting an unexpected trap value when it
      returns from H_ENTER_NESTED, causing the following register dump to
      console via subsequent call to kvmppc_handle_exit_hv() in L1:
      
        [  350.612854] vcpu 00000000f9564cf8 (0):
        [  350.612915] pc  = c00000000013eb98  msr = 8000000000009033  trap = 1
        [  350.613020] r 0 = c0000000004b9044  r16 = 0000000000000000
        [  350.613075] r 1 = c00000007cffba30  r17 = 0000000000000000
        [  350.613120] r 2 = c00000000178c100  r18 = 00007fffc24f3b50
        [  350.613166] r 3 = c00000007ef52480  r19 = 00007fffc24fff58
        [  350.613212] r 4 = 0000000000000000  r20 = 00000a1e96ece9d0
        [  350.613253] r 5 = 70616d00746f6f72  r21 = 00000a1ea117c9b0
        [  350.613295] r 6 = 0000000000000020  r22 = 00000a1ea1184360
        [  350.613338] r 7 = c0000000783be440  r23 = 0000000000000003
        [  350.613380] r 8 = fffffffffffffffc  r24 = 00000a1e96e9e124
        [  350.613423] r 9 = c00000007ef52490  r25 = 00000000000007ff
        [  350.613469] r10 = 0000000000000004  r26 = c00000007eb2f7a0
        [  350.613513] r11 = b0616d0009eccdb2  r27 = c00000007cffbb10
        [  350.613556] r12 = c0000000004b9000  r28 = c00000007d83a2c0
        [  350.613597] r13 = c000000001b00000  r29 = c0000000783cdf68
        [  350.613639] r14 = 0000000000000000  r30 = 0000000000000000
        [  350.613681] r15 = 0000000000000000  r31 = c00000007cffbbf0
        [  350.613723] ctr = c0000000004b9000  lr  = c0000000004b9044
        [  350.613765] srr0 = 0000772f954dd48c srr1 = 800000000280f033
        [  350.613808] sprg0 = 0000000000000000 sprg1 = c000000001b00000
        [  350.613859] sprg2 = 0000772f9565a280 sprg3 = 0000000000000000
        [  350.613911] cr = 88002848  xer = 0000000020040000  dsisr = 42000000
        [  350.613962] dar = 0000772f95390000
        [  350.614031] fault dar = c000000244b278c0 dsisr = 00000000
        [  350.614073] SLB (0 entries):
        [  350.614157] lpcr = 0040000003d40413 sdr1 = 0000000000000000 last_inst = ffffffff
        [  350.614252] trap=0x1 | pc=0xc00000000013eb98 | msr=0x8000000000009033
      
      followed by L1's QEMU reporting the following before stopping execution
      of the nested guest:
      
        KVM: unknown exit, hardware reason 1
        NIP c00000000013eb98   LR c0000000004b9044 CTR c0000000004b9000 XER 0000000020040000 CPU#0
        MSR 8000000000009033 HID0 0000000000000000  HF 8000000000000000 iidx 3 didx 3
        TB 00000000 00000000 DECR 00000000
        GPR00 c0000000004b9044 c00000007cffba30 c00000000178c100 c00000007ef52480
        GPR04 0000000000000000 70616d00746f6f72 0000000000000020 c0000000783be440
        GPR08 fffffffffffffffc c00000007ef52490 0000000000000004 b0616d0009eccdb2
        GPR12 c0000000004b9000 c000000001b00000 0000000000000000 0000000000000000
        GPR16 0000000000000000 0000000000000000 00007fffc24f3b50 00007fffc24fff58
        GPR20 00000a1e96ece9d0 00000a1ea117c9b0 00000a1ea1184360 0000000000000003
        GPR24 00000a1e96e9e124 00000000000007ff c00000007eb2f7a0 c00000007cffbb10
        GPR28 c00000007d83a2c0 c0000000783cdf68 0000000000000000 c00000007cffbbf0
        CR 88002848  [ L  L  -  -  E  L  G  L  ]             RES ffffffffffffffff
         SRR0 0000772f954dd48c  SRR1 800000000280f033    PVR 00000000004e1202 VRSAVE 0000000000000000
        SPRG0 0000000000000000 SPRG1 c000000001b00000  SPRG2 0000772f9565a280  SPRG3 0000000000000000
        SPRG4 0000000000000000 SPRG5 0000000000000000  SPRG6 0000000000000000  SPRG7 0000000000000000
        HSRR0 0000000000000000 HSRR1 0000000000000000
         CFAR 0000000000000000
         LPCR 0000000003d40413
         PTCR 0000000000000000   DAR 0000772f95390000  DSISR 0000000042000000
      
      Fix this by setting vcpu->arch.hcall_needed = 0 to indicate completion
      of H_ENTER_NESTED before we exit to L0 userspace.
      
      Fixes: 360cae31 ("KVM: PPC: Book3S HV: Nested guest entry via hypercall")
      Cc: linuxppc-dev@ozlabs.org
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMichael Roth <mdroth@linux.vnet.ibm.com>
      Reviewed-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      6c08ec12
  4. 26 10月, 2018 1 次提交
  5. 19 10月, 2018 1 次提交
    • P
      KVM: PPC: Book3S HV: Don't use streamlined entry path on early POWER9 chips · 8d9fcacf
      Paul Mackerras 提交于
      This disables the use of the streamlined entry path for radix guests
      on early POWER9 chips that need the workaround added in commit
      a25bd72b ("powerpc/mm/radix: Workaround prefetch issue with KVM",
      2017-07-24), because the streamlined entry path does not include
      that workaround.  This also means that we can't do nested HV-KVM
      on those chips.
      
      Since the chips that need that workaround are the same ones that can't
      run both radix and HPT guests at the same time on different threads of
      a core, we use the existing 'no_mixing_hpt_and_radix' variable that
      identifies those chips to identify when we can't use the new guest
      entry path, and when we can't do nested virtualization.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      8d9fcacf
  6. 09 10月, 2018 19 次提交
    • P
      KVM: PPC: Book3S HV: Add NO_HASH flag to GET_SMMU_INFO ioctl result · 901f8c3f
      Paul Mackerras 提交于
      This adds a KVM_PPC_NO_HASH flag to the flags field of the
      kvm_ppc_smmu_info struct, and arranges for it to be set when
      running as a nested hypervisor, as an unambiguous indication
      to userspace that HPT guests are not supported.  Reporting the
      KVM_CAP_PPC_MMU_HASH_V3 capability as false could be taken as
      indicating only that the new HPT features in ISA V3.0 are not
      supported, leaving it ambiguous whether pre-V3.0 HPT features
      are supported.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      901f8c3f
    • P
      KVM: PPC: Book3S HV: Add a VM capability to enable nested virtualization · aa069a99
      Paul Mackerras 提交于
      With this, userspace can enable a KVM-HV guest to run nested guests
      under it.
      
      The administrator can control whether any nested guests can be run;
      setting the "nested" module parameter to false prevents any guests
      becoming nested hypervisors (that is, any attempt to enable the nested
      capability on a guest will fail).  Guests which are already nested
      hypervisors will continue to be so.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      aa069a99
    • P
      KVM: PPC: Book3S HV: Allow HV module to load without hypervisor mode · de760db4
      Paul Mackerras 提交于
      With this, the KVM-HV module can be loaded in a guest running under
      KVM-HV, and if the hypervisor supports nested virtualization, this
      guest can now act as a nested hypervisor and run nested guests.
      
      This also adds some checks to inform userspace that HPT guests are not
      supported by nested hypervisors (by returning false for the
      KVM_CAP_PPC_MMU_HASH_V3 capability), and to prevent userspace from
      configuring a guest to use HPT mode.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      de760db4
    • P
      KVM: PPC: Book3S HV: Add one-reg interface to virtual PTCR register · 30323418
      Paul Mackerras 提交于
      This adds a one-reg register identifier which can be used to read and
      set the virtual PTCR for the guest.  This register identifies the
      address and size of the virtual partition table for the guest, which
      contains information about the nested guests under this guest.
      
      Migrating this value is the only extra requirement for migrating a
      guest which has nested guests (assuming of course that the destination
      host supports nested virtualization in the kvm-hv module).
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      30323418
    • P
      KVM: PPC: Book3S HV: Don't access HFSCR, LPIDR or LPCR when running nested · f3c99f97
      Paul Mackerras 提交于
      When running as a nested hypervisor, this avoids reading hypervisor
      privileged registers (specifically HFSCR, LPIDR and LPCR) at startup;
      instead reasonable default values are used.  This also avoids writing
      LPIDR in the single-vcpu entry/exit path.
      
      Also, this removes the check for CPU_FTR_HVMODE in kvmppc_mmu_hv_init()
      since its only caller already checks this.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      f3c99f97
    • S
      KVM: PPC: Book3S HV: Invalidate TLB when nested vcpu moves physical cpu · 9d0b048d
      Suraj Jitindar Singh 提交于
      This is only done at level 0, since only level 0 knows which physical
      CPU a vcpu is running on.  This does for nested guests what L0 already
      did for its own guests, which is to flush the TLB on a pCPU when it
      goes to run a vCPU there, and there is another vCPU in the same VM
      which previously ran on this pCPU and has now started to run on another
      pCPU.  This is to handle the situation where the other vCPU touched
      a mapping, moved to another pCPU and did a tlbiel (local-only tlbie)
      on that new pCPU and thus left behind a stale TLB entry on this pCPU.
      
      This introduces a limit on the the vcpu_token values used in the
      H_ENTER_NESTED hcall -- they must now be less than NR_CPUS.
      
      [paulus@ozlabs.org - made prev_cpu array be short[] to reduce
       memory consumption.]
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      9d0b048d
    • S
      KVM: PPC: Book3S HV: Implement H_TLB_INVALIDATE hcall · e3b6b466
      Suraj Jitindar Singh 提交于
      When running a nested (L2) guest the guest (L1) hypervisor will use
      the H_TLB_INVALIDATE hcall when it needs to change the partition
      scoped page tables or the partition table which it manages.  It will
      use this hcall in the situations where it would use a partition-scoped
      tlbie instruction if it were running in hypervisor mode.
      
      The H_TLB_INVALIDATE hcall can invalidate different scopes:
      
      Invalidate TLB for a given target address:
      - This invalidates a single L2 -> L1 pte
      - We need to invalidate any L2 -> L0 shadow_pgtable ptes which map the L2
        address space which is being invalidated. This is because a single
        L2 -> L1 pte may have been mapped with more than one pte in the
        L2 -> L0 page tables.
      
      Invalidate the entire TLB for a given LPID or for all LPIDs:
      - Invalidate the entire shadow_pgtable for a given nested guest, or
        for all nested guests.
      
      Invalidate the PWC (page walk cache) for a given LPID or for all LPIDs:
      - We don't cache the PWC, so nothing to do.
      
      Invalidate the entire TLB, PWC and partition table for a given/all LPIDs:
      - Here we re-read the partition table entry and remove the nested state
        for any nested guest for which the first doubleword of the partition
        table entry is now zero.
      
      The H_TLB_INVALIDATE hcall takes as parameters the tlbie instruction
      word (of which only the RIC, PRS and R fields are used), the rS value
      (giving the lpid, where required) and the rB value (giving the IS, AP
      and EPN values).
      
      [paulus@ozlabs.org - adapted to having the partition table in guest
      memory, added the H_TLB_INVALIDATE implementation, removed tlbie
      instruction emulation, reworded the commit message.]
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      e3b6b466
    • S
      KVM: PPC: Book3S HV: Introduce rmap to track nested guest mappings · 8cf531ed
      Suraj Jitindar Singh 提交于
      When a host (L0) page which is mapped into a (L1) guest is in turn
      mapped through to a nested (L2) guest we keep a reverse mapping (rmap)
      so that these mappings can be retrieved later.
      
      Whenever we create an entry in a shadow_pgtable for a nested guest we
      create a corresponding rmap entry and add it to the list for the
      L1 guest memslot at the index of the L1 guest page it maps. This means
      at the L1 guest memslot we end up with lists of rmaps.
      
      When we are notified of a host page being invalidated which has been
      mapped through to a (L1) guest, we can then walk the rmap list for that
      guest page, and find and invalidate all of the corresponding
      shadow_pgtable entries.
      
      In order to reduce memory consumption, we compress the information for
      each rmap entry down to 52 bits -- 12 bits for the LPID and 40 bits
      for the guest real page frame number -- which will fit in a single
      unsigned long.  To avoid a scenario where a guest can trigger
      unbounded memory allocations, we scan the list when adding an entry to
      see if there is already an entry with the contents we need.  This can
      occur, because we don't ever remove entries from the middle of a list.
      
      A struct nested guest rmap is a list pointer and an rmap entry;
      ----------------
      | next pointer |
      ----------------
      | rmap entry   |
      ----------------
      
      Thus the rmap pointer for each guest frame number in the memslot can be
      either NULL, a single entry, or a pointer to a list of nested rmap entries.
      
      gfn	 memslot rmap array
       	-------------------------
       0	| NULL			|	(no rmap entry)
       	-------------------------
       1	| single rmap entry	|	(rmap entry with low bit set)
       	-------------------------
       2	| list head pointer	|	(list of rmap entries)
       	-------------------------
      
      The final entry always has the lowest bit set and is stored in the next
      pointer of the last list entry, or as a single rmap entry.
      With a list of rmap entries looking like;
      
      -----------------	-----------------	-------------------------
      | list head ptr	| ----> | next pointer	| ---->	| single rmap entry	|
      -----------------	-----------------	-------------------------
      			| rmap entry	|	| rmap entry		|
      			-----------------	-------------------------
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      8cf531ed
    • P
      KVM: PPC: Book3S HV: Handle hypercalls correctly when nested · 4bad7779
      Paul Mackerras 提交于
      When we are running as a nested hypervisor, we use a hypercall to
      enter the guest rather than code in book3s_hv_rmhandlers.S.  This means
      that the hypercall handlers listed in hcall_real_table never get called.
      There are some hypercalls that are handled there and not in
      kvmppc_pseries_do_hcall(), which therefore won't get processed for
      a nested guest.
      
      To fix this, we add cases to kvmppc_pseries_do_hcall() to handle those
      hypercalls, with the following exceptions:
      
      - The HPT hypercalls (H_ENTER, H_REMOVE, etc.) are not handled because
        we only support radix mode for nested guests.
      
      - H_CEDE has to be handled specially because the cede logic in
        kvmhv_run_single_vcpu assumes that it has been processed by the time
        that kvmhv_p9_guest_entry() returns.  Therefore we put a special
        case for H_CEDE in kvmhv_p9_guest_entry().
      
      For the XICS hypercalls, if real-mode processing is enabled, then the
      virtual-mode handlers assume that they are being called only to finish
      up the operation.  Therefore we turn off the real-mode flag in the XICS
      code when running as a nested hypervisor.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      4bad7779
    • P
      KVM: PPC: Book3S HV: Use XICS hypercalls when running as a nested hypervisor · f3c18e93
      Paul Mackerras 提交于
      This adds code to call the H_IPI and H_EOI hypercalls when we are
      running as a nested hypervisor (i.e. without the CPU_FTR_HVMODE cpu
      feature) and we would otherwise access the XICS interrupt controller
      directly or via an OPAL call.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      f3c18e93
    • P
      KVM: PPC: Book3S HV: Nested guest entry via hypercall · 360cae31
      Paul Mackerras 提交于
      This adds a new hypercall, H_ENTER_NESTED, which is used by a nested
      hypervisor to enter one of its nested guests.  The hypercall supplies
      register values in two structs.  Those values are copied by the level 0
      (L0) hypervisor (the one which is running in hypervisor mode) into the
      vcpu struct of the L1 guest, and then the guest is run until an
      interrupt or error occurs which needs to be reported to L1 via the
      hypercall return value.
      
      Currently this assumes that the L0 and L1 hypervisors are the same
      endianness, and the structs passed as arguments are in native
      endianness.  If they are of different endianness, the version number
      check will fail and the hcall will be rejected.
      
      Nested hypervisors do not support indep_threads_mode=N, so this adds
      code to print a warning message if the administrator has set
      indep_threads_mode=N, and treat it as Y.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      360cae31
    • P
      KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization · 8e3f5fc1
      Paul Mackerras 提交于
      This starts the process of adding the code to support nested HV-style
      virtualization.  It defines a new H_SET_PARTITION_TABLE hypercall which
      a nested hypervisor can use to set the base address and size of a
      partition table in its memory (analogous to the PTCR register).
      On the host (level 0 hypervisor) side, the H_SET_PARTITION_TABLE
      hypercall from the guest is handled by code that saves the virtual
      PTCR value for the guest.
      
      This also adds code for creating and destroying nested guests and for
      reading the partition table entry for a nested guest from L1 memory.
      Each nested guest has its own shadow LPID value, different in general
      from the LPID value used by the nested hypervisor to refer to it.  The
      shadow LPID value is allocated at nested guest creation time.
      
      Nested hypervisor functionality is only available for a radix guest,
      which therefore means a radix host on a POWER9 (or later) processor.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      8e3f5fc1
    • S
      KVM: PPC: Book3S HV: Clear partition table entry on vm teardown · 89329c0b
      Suraj Jitindar Singh 提交于
      When destroying a VM we return the LPID to the pool, however we never
      zero the partition table entry. This is instead done when we reallocate
      the LPID.
      
      Zero the partition table entry on VM teardown before returning the LPID
      to the pool. This means if we were running as a nested hypervisor the
      real hypervisor could use this to determine when it can free resources.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      89329c0b
    • P
      KVM: PPC: Use ccr field in pt_regs struct embedded in vcpu struct · fd0944ba
      Paul Mackerras 提交于
      When the 'regs' field was added to struct kvm_vcpu_arch, the code
      was changed to use several of the fields inside regs (e.g., gpr, lr,
      etc.) but not the ccr field, because the ccr field in struct pt_regs
      is 64 bits on 64-bit platforms, but the cr field in kvm_vcpu_arch is
      only 32 bits.  This changes the code to use the regs.ccr field
      instead of cr, and changes the assembly code on 64-bit platforms to
      use 64-bit loads and stores instead of 32-bit ones.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      fd0944ba
    • P
      KVM: PPC: Book3S HV: Add a debugfs file to dump radix mappings · 9a94d3ee
      Paul Mackerras 提交于
      This adds a file called 'radix' in the debugfs directory for the
      guest, which when read gives all of the valid leaf PTEs in the
      partition-scoped radix tree for a radix guest, in human-readable
      format.  It is analogous to the existing 'htab' file which dumps
      the HPT entries for a HPT guest.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      9a94d3ee
    • P
      KVM: PPC: Book3S HV: Handle hypervisor instruction faults better · 32eb150a
      Paul Mackerras 提交于
      Currently the code for handling hypervisor instruction page faults
      passes 0 for the flags indicating the type of fault, which is OK in
      the usual case that the page is not mapped in the partition-scoped
      page tables.  However, there are other causes for hypervisor
      instruction page faults, such as not being to update a reference
      (R) or change (C) bit.  The cause is indicated in bits in HSRR1,
      including a bit which indicates that the fault is due to not being
      able to write to a page (for example to update an R or C bit).
      Not handling these other kinds of faults correctly can lead to a
      loop of continual faults without forward progress in the guest.
      
      In order to handle these faults better, this patch constructs a
      "DSISR-like" value from the bits which DSISR and SRR1 (for a HISI)
      have in common, and passes it to kvmppc_book3s_hv_page_fault() so
      that it knows what caused the fault.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      32eb150a
    • P
      KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests · 95a6432c
      Paul Mackerras 提交于
      This creates an alternative guest entry/exit path which is used for
      radix guests on POWER9 systems when we have indep_threads_mode=Y.  In
      these circumstances there is exactly one vcpu per vcore and there is
      no coordination required between vcpus or vcores; the vcpu can enter
      the guest without needing to synchronize with anything else.
      
      The new fast path is implemented almost entirely in C in book3s_hv.c
      and runs with the MMU on until the guest is entered.  On guest exit
      we use the existing path until the point where we are committed to
      exiting the guest (as distinct from handling an interrupt in the
      low-level code and returning to the guest) and we have pulled the
      guest context from the XIVE.  At that point we check a flag in the
      stack frame to see whether we came in via the old path and the new
      path; if we came in via the new path then we go back to C code to do
      the rest of the process of saving the guest context and restoring the
      host context.
      
      The C code is split into separate functions for handling the
      OS-accessible state and the hypervisor state, with the idea that the
      latter can be replaced by a hypercall when we implement nested
      virtualization.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      [mpe: Fix CONFIG_ALTIVEC=n build]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      95a6432c
    • P
      KVM: PPC: Book3S HV: Call kvmppc_handle_exit_hv() with vcore unlocked · 53655ddd
      Paul Mackerras 提交于
      Currently kvmppc_handle_exit_hv() is called with the vcore lock held
      because it is called within a for_each_runnable_thread loop.
      However, we already unlock the vcore within kvmppc_handle_exit_hv()
      under certain circumstances, and this is safe because (a) any vcpus
      that become runnable and are added to the runnable set by
      kvmppc_run_vcpu() have their vcpu->arch.trap == 0 and can't actually
      run in the guest (because the vcore state is VCORE_EXITING), and
      (b) for_each_runnable_thread is safe against addition or removal
      of vcpus from the runnable set.
      
      Therefore, in order to simplify things for following patches, let's
      drop the vcore lock in the for_each_runnable_thread loop, so
      kvmppc_handle_exit_hv() gets called without the vcore lock held.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      53655ddd
    • P
      KVM: PPC: Book3S HV: Move interrupt delivery on guest entry to C code · f7035ce9
      Paul Mackerras 提交于
      This is based on a patch by Suraj Jitindar Singh.
      
      This moves the code in book3s_hv_rmhandlers.S that generates an
      external, decrementer or privileged doorbell interrupt just before
      entering the guest to C code in book3s_hv_builtin.c.  This is to
      make future maintenance and modification easier.  The algorithm
      expressed in the C code is almost identical to the previous
      algorithm.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      f7035ce9
  7. 05 10月, 2018 1 次提交
    • P
      KVM: PPC: Book3S HV: Provide mode where all vCPUs on a core must be the same VM · aa227864
      Paul Mackerras 提交于
      This adds a mode where the vcore scheduling logic in HV KVM limits itself
      to scheduling only virtual cores from the same VM on any given physical
      core.  This is enabled via a new module parameter on the kvm-hv module
      called "one_vm_per_core".  For this to work on POWER9, it is necessary to
      set indep_threads_mode=N.  (On POWER8, hardware limitations mean that KVM
      is never in independent threads mode, regardless of the indep_threads_mode
      setting.)
      
      Thus the settings needed for this to work are:
      
      1. The host is in SMT1 mode.
      2. On POWER8, the host is not in 2-way or 4-way static split-core mode.
      3. On POWER9, the indep_threads_mode parameter is N.
      4. The one_vm_per_core parameter is Y.
      
      With these settings, KVM can run up to 4 vcpus on a core at the same
      time on POWER9, or up to 8 vcpus on POWER8 (depending on the guest
      threading mode), and will ensure that all of the vcpus belong to the
      same VM.
      
      This is intended for use in security-conscious settings where users are
      concerned about possible side-channel attacks between threads which could
      perhaps enable one VM to attack another VM on the same core, or the host.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      aa227864
  8. 21 8月, 2018 1 次提交
  9. 30 7月, 2018 1 次提交
  10. 26 7月, 2018 2 次提交
    • P
      KVM: PPC: Book3S HV: Read kvm->arch.emul_smt_mode under kvm->lock · b5c6f760
      Paul Mackerras 提交于
      Commit 1e175d2e ("KVM: PPC: Book3S HV: Pack VCORE IDs to access full
      VCPU ID space", 2018-07-25) added code that uses kvm->arch.emul_smt_mode
      before any VCPUs are created.  However, userspace can change
      kvm->arch.emul_smt_mode at any time up until the first VCPU is created.
      Hence it is (theoretically) possible for the check in
      kvmppc_core_vcpu_create_hv() to race with another userspace thread
      changing kvm->arch.emul_smt_mode.
      
      This fixes it by moving the test that uses kvm->arch.emul_smt_mode into
      the block where kvm->lock is held.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      b5c6f760
    • S
      KVM: PPC: Book3S HV: Pack VCORE IDs to access full VCPU ID space · 1e175d2e
      Sam Bobroff 提交于
      It is not currently possible to create the full number of possible
      VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses fewer
      threads per core than its core stride (or "VSMT mode"). This is
      because the VCORE ID and XIVE offsets grow beyond KVM_MAX_VCPUS
      even though the VCPU ID is less than KVM_MAX_VCPU_ID.
      
      To address this, "pack" the VCORE ID and XIVE offsets by using
      knowledge of the way the VCPU IDs will be used when there are fewer
      guest threads per core than the core stride. The primary thread of
      each core will always be used first. Then, if the guest uses more than
      one thread per core, these secondary threads will sequentially follow
      the primary in each core.
      
      So, the only way an ID above KVM_MAX_VCPUS can be seen, is if the
      VCPUs are being spaced apart, so at least half of each core is empty,
      and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
      into the second half of each core (4..7, in an 8-thread core).
      
      Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
      each core is being left empty, and we can map down into the second and
      third quarters of each core (2, 3 and 5, 6 in an 8-thread core).
      
      Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
      threads are being used and 7/8 of the core is empty, allowing use of
      the 1, 5, 3 and 7 thread slots.
      
      (Strides less than 8 are handled similarly.)
      
      This allows the VCORE ID or offset to be calculated quickly from the
      VCPU ID or XIVE server numbers, without access to the VCPU structure.
      
      [paulus@ozlabs.org - tidied up comment a little, changed some WARN_ONCE
       to pr_devel, wrapped line, fixed id check.]
      Signed-off-by: NSam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      1e175d2e
  11. 18 7月, 2018 2 次提交
  12. 16 7月, 2018 1 次提交
  13. 20 6月, 2018 1 次提交
  14. 13 6月, 2018 1 次提交
    • K
      treewide: Use array_size() in vzalloc() · fad953ce
      Kees Cook 提交于
      The vzalloc() function has no 2-factor argument form, so multiplication
      factors need to be wrapped in array_size(). This patch replaces cases of:
      
              vzalloc(a * b)
      
      with:
              vzalloc(array_size(a, b))
      
      as well as handling cases of:
      
              vzalloc(a * b * c)
      
      with:
      
              vzalloc(array3_size(a, b, c))
      
      This does, however, attempt to ignore constant size factors like:
      
              vzalloc(4 * 1024)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        vzalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        vzalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        vzalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
        vzalloc(
      -	sizeof(TYPE) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
        vzalloc(
      -	SIZE * COUNT
      +	array_size(COUNT, SIZE)
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        vzalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        vzalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vzalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vzalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        vzalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        vzalloc(C1 * C2 * C3, ...)
      |
        vzalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants.
      @@
      expression E1, E2;
      constant C1, C2;
      @@
      
      (
        vzalloc(C1 * C2, ...)
      |
        vzalloc(
      -	E1 * E2
      +	array_size(E1, E2)
        , ...)
      )
      Signed-off-by: NKees Cook <keescook@chromium.org>
      fad953ce
  15. 02 6月, 2018 1 次提交
    • G
      kvm: no need to check return value of debugfs_create functions · 929f45e3
      Greg Kroah-Hartman 提交于
      When calling debugfs functions, there is no need to ever check the
      return value.  The function can work or not, but the code logic should
      never do something different based on this.
      
      This cleans up the error handling a lot, as this code will never get
      hit.
      
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Christoffer Dall <christoffer.dall@arm.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim KrÄmář" <rkrcmar@redhat.com>
      Cc: Arvind Yadav <arvind.yadav.cs@gmail.com>
      Cc: Eric Auger <eric.auger@redhat.com>
      Cc: Andre Przywara <andre.przywara@arm.com>
      Cc: kvm-ppc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: kvmarm@lists.cs.columbia.edu
      Cc: kvm@vger.kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      929f45e3
  16. 18 5月, 2018 2 次提交
  17. 17 5月, 2018 3 次提交
    • P
      KVM: PPC: Book3S HV: Set RWMR on POWER8 so PURR/SPURR count correctly · 7aa15842
      Paul Mackerras 提交于
      Although Linux doesn't use PURR and SPURR ((Scaled) Processor
      Utilization of Resources Register), other OSes depend on them.
      On POWER8 they count at a rate depending on whether the VCPU is
      idle or running, the activity of the VCPU, and the value in the
      RWMR (Region-Weighting Mode Register).  Hardware expects the
      hypervisor to update the RWMR when a core is dispatched to reflect
      the number of online VCPUs in the vcore.
      
      This adds code to maintain a count in the vcore struct indicating
      how many VCPUs are online.  In kvmppc_run_core we use that count
      to set the RWMR register on POWER8.  If the core is split because
      of a static or dynamic micro-threading mode, we use the value for
      8 threads.  The RWMR value is not relevant when the host is
      executing because Linux does not use the PURR or SPURR register,
      so we don't bother saving and restoring the host value.
      
      For the sake of old userspace which does not set the KVM_REG_PPC_ONLINE
      register, we set online to 1 if it was 0 at the time of a KVM_RUN
      ioctl.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      7aa15842
    • P
      KVM: PPC: Book3S HV: Add 'online' register to ONE_REG interface · a1f15826
      Paul Mackerras 提交于
      This adds a new KVM_REG_PPC_ONLINE register which userspace can set
      to 0 or 1 via the GET/SET_ONE_REG interface to indicate whether it
      considers the VCPU to be offline (0), that is, not currently running,
      or online (1).  This will be used in a later patch to configure the
      register which controls PURR and SPURR accumulation.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      a1f15826
    • P
      KVM: PPC: Book3S HV: Snapshot timebase offset on guest entry · 57b8daa7
      Paul Mackerras 提交于
      Currently, the HV KVM guest entry/exit code adds the timebase offset
      from the vcore struct to the timebase on guest entry, and subtracts
      it on guest exit.  Which is fine, except that it is possible for
      userspace to change the offset using the SET_ONE_REG interface while
      the vcore is running, as there is only one timebase offset per vcore
      but potentially multiple VCPUs in the vcore.  If that were to happen,
      KVM would subtract a different offset on guest exit from that which
      it had added on guest entry, leading to the timebase being out of sync
      between cores in the host, which then leads to bad things happening
      such as hangs and spurious watchdog timeouts.
      
      To fix this, we add a new field 'tb_offset_applied' to the vcore struct
      which stores the offset that is currently applied to the timebase.
      This value is set from the vcore tb_offset field on guest entry, and
      is what is subtracted from the timebase on guest exit.  Since it is
      zero when the timebase offset is not applied, we can simplify the
      logic in kvmhv_start_timing and kvmhv_accumulate_time.
      
      In addition, we had secondary threads reading the timebase while
      running concurrently with code on the primary thread which would
      eventually add or subtract the timebase offset from the timebase.
      This occurred while saving or restoring the DEC register value on
      the secondary threads.  Although no specific incorrect behaviour has
      been observed, this is a race which should be fixed.  To fix it, we
      move the DEC saving code to just before we call kvmhv_commence_exit,
      and the DEC restoring code to after the point where we have waited
      for the primary thread to switch the MMU context and add the timebase
      offset.  That way we are sure that the timebase contains the guest
      timebase value in both cases.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      57b8daa7