1. 17 Dec, 2018 8 commits
    • S
      KVM: PPC: Add load_from_eaddr and store_to_eaddr to the kvmppc_ops struct · dceadcf9
      Suraj Jitindar Singh committed
      The kvmppc_ops struct is used to store function pointers to kvm
      implementation specific functions.
      
      Introduce two new functions load_from_eaddr and store_to_eaddr to be
      used to load from and store to a guest effective address respectively.
      
      Also implement these for the kvm-hv module. If we are using the radix
      mmu then we can call the functions to access quadrant 1 and 2.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      dceadcf9
    • S
      KVM: PPC: Book3S HV: Implement functions to access quadrants 1 & 2 · d7b45615
      Suraj Jitindar Singh committed
      The POWER9 radix mmu has the concept of quadrants. The quadrant number
      is the two high bits of the effective address and determines the fully
      qualified address to be used for the translation. The fully qualified
      address consists of the effective lpid, the effective pid and the
      effective address. This then gives 4 possible quadrants: 0, 1, 2, and 3.
      
      When accessing these quadrants the fully qualified address is obtained
      as follows:
      
      Quadrant		| Hypervisor		| Guest
      --------------------------------------------------------------------------
      			| EA[0:1] = 0b00	| EA[0:1] = 0b00
      0			| effLPID = 0		| effLPID = LPIDR
      			| effPID  = PIDR	| effPID  = PIDR
      --------------------------------------------------------------------------
      			| EA[0:1] = 0b01	|
      1			| effLPID = LPIDR	| Invalid Access
      			| effPID  = PIDR	|
      --------------------------------------------------------------------------
      			| EA[0:1] = 0b10	|
      2			| effLPID = LPIDR	| Invalid Access
      			| effPID  = 0		|
      --------------------------------------------------------------------------
      			| EA[0:1] = 0b11	| EA[0:1] = 0b11
      3			| effLPID = 0		| effLPID = LPIDR
      			| effPID  = 0		| effPID  = 0
      --------------------------------------------------------------------------
      
      In the Guest:
      Quadrant 3 is normally used to address the operating system since this
      uses effPID=0 and effLPID=LPIDR, meaning the PID register doesn't need to
      be switched.
      Quadrant 0 is normally used to address user space since the effLPID and
      effPID are taken from the corresponding registers.
      
      In the Host:
      Quadrant 0 and 3 are used as above, however the effLPID is always 0 to
      address the host.
      
      Quadrants 1 and 2 can be used by the host to address guest memory using
      a guest effective address. Since the effLPID comes from the LPID register,
      the host loads the LPID of the guest it would like to access (and the
      PID of the process) and can perform accesses to a guest effective
      address.
      
      This means quadrant 1 can be used to address the guest user space and
      quadrant 2 can be used to address the guest operating system from the
      hypervisor, using a guest effective address.
      
      Access to the quadrants can cause a Hypervisor Data Storage Interrupt
      (HDSI) due to being unable to perform partition scoped translation.
      Previously this could only be generated from a guest and so the code
      path expects us to take the KVM trampoline in the interrupt handler.
      This is no longer the case so we modify the handler to call
      bad_page_fault() to check if we were expecting this fault so we can
      handle it gracefully and just return with an error code. In the hash mmu
      case we still raise an unknown exception since quadrants aren't defined
      for the hash mmu.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      d7b45615
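The quadrant table in commit d7b45615 above can be read as a small decision function. The following is a minimal sketch of that table only, not the kernel's code: `struct fq_addr`, `ea_quadrant()` and `resolve_quadrant()` are hypothetical names introduced for illustration, and a 64-bit effective address is assumed, so EA[0:1] in IBM bit numbering corresponds to the two most significant bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for a fully qualified address (not a kernel type). */
struct fq_addr {
    uint32_t eff_lpid;
    uint32_t eff_pid;
    uint64_t ea;
    bool valid;        /* false for "Invalid Access" in the table */
};

/* EA[0:1] in IBM bit numbering is the top two bits of a 64-bit EA. */
static inline unsigned int ea_quadrant(uint64_t ea)
{
    return (unsigned int)(ea >> 62);
}

/* Apply the table: the hypervisor column if `hypervisor`, else the guest column. */
static inline struct fq_addr resolve_quadrant(uint64_t ea, bool hypervisor,
                                              uint32_t lpidr, uint32_t pidr)
{
    struct fq_addr fq = { .ea = ea, .valid = true };

    switch (ea_quadrant(ea)) {
    case 0:
        fq.eff_lpid = hypervisor ? 0 : lpidr;
        fq.eff_pid = pidr;
        break;
    case 1:             /* hypervisor only: guest user space */
        fq.valid = hypervisor;
        fq.eff_lpid = lpidr;
        fq.eff_pid = pidr;
        break;
    case 2:             /* hypervisor only: guest operating system */
        fq.valid = hypervisor;
        fq.eff_lpid = lpidr;
        fq.eff_pid = 0;
        break;
    default:            /* quadrant 3 */
        fq.eff_lpid = hypervisor ? 0 : lpidr;
        fq.eff_pid = 0;
        break;
    }
    return fq;
}
```

With this model, a host access to a quadrant 2 address resolves to the guest's LPID with PID 0, while the same address used from inside a guest is flagged invalid.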
    • S
      KVM: PPC: Book3S HV: Add function kvmhv_vcpu_is_radix() · d232afeb
      Suraj Jitindar Singh committed
      There exists a function kvm_is_radix() which is used to determine if a
      kvm instance is using the radix mmu. However this only applies to the
      first level (L1) guest. Add a function kvmhv_vcpu_is_radix() which can
      be used to determine if the current execution context of the vcpu is
      radix, accounting for whether the vcpu is running a nested guest.
      
      Currently all nested guests must be radix but this may change in the
      future.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      d232afeb
    • S
      KVM: PPC: Book3S: Only report KVM_CAP_SPAPR_TCE_VFIO on powernv machines · 693ac10a
      Suraj Jitindar Singh committed
      The kvm capability KVM_CAP_SPAPR_TCE_VFIO is used to indicate the
      availability of in kernel tce acceleration for vfio. However it is
      currently the case that this is only available on a powernv machine,
      not for a pseries machine.
      
      Thus make this capability dependent on having the cpu feature
      CPU_FTR_HVMODE.
      
      [paulus@ozlabs.org - fixed compilation for Book E.]
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      693ac10a
    • P
      KVM: PPC: Book3S HV: Flush guest mappings when turning dirty tracking on/off · 5af3e9d0
      Paul Mackerras committed
      This adds code to flush the partition-scoped page tables for a radix
      guest when dirty tracking is turned on or off for a memslot.  Only the
      guest real addresses covered by the memslot are flushed.  The reason
      for this is to get rid of any 2M PTEs in the partition-scoped page
      tables that correspond to host transparent huge pages, so that page
      dirtiness is tracked at a system page (4k or 64k) granularity rather
      than a 2M granularity.  The page tables are also flushed when turning
      dirty tracking off so that the memslot's address space can be
      repopulated with THPs if possible.
      
      To do this, we add a new function kvmppc_radix_flush_memslot().  Since
      this does what's needed for kvmppc_core_flush_memslot_hv() on a radix
      guest, we now make kvmppc_core_flush_memslot_hv() call the new
      kvmppc_radix_flush_memslot() rather than calling kvm_unmap_radix()
      for each page in the memslot.  This has the effect of fixing a bug in
      that kvmppc_core_flush_memslot_hv() was previously calling
      kvm_unmap_radix() without holding the kvm->mmu_lock spinlock, which
      is required to be held.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      5af3e9d0
    • P
      KVM: PPC: Book3S HV: Cleanups - constify memslots, fix comments · c43c3a86
      Paul Mackerras committed
      This adds 'const' to the declarations for the struct kvm_memory_slot
      pointer parameters of some functions, which will make it possible to
      call those functions from kvmppc_core_commit_memory_region_hv()
      in the next patch.
      
      This also fixes some comments about locking.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      c43c3a86
    • P
      KVM: PPC: Book3S HV: Map single pages when doing dirty page logging · f460f679
      Paul Mackerras committed
      For radix guests, this makes KVM map guest memory as individual pages
      when dirty page logging is enabled for the memslot corresponding to the
      guest real address.  Having a separate partition-scoped PTE for each
      system page mapped to the guest means that we have a separate dirty
      bit for each page, thus making the reported dirty bitmap more accurate.
      Without this, if part of guest memory is backed by transparent huge
      pages, the dirty status is reported at a 2MB granularity rather than
      a 64kB (or 4kB) granularity for that part, causing userspace to have
      to transmit more data when migrating the guest.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      f460f679
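The granularity argument in commit f460f679 above comes down to simple arithmetic: one 2MB transparent huge page covers many system pages, so a single dirty bit at 2MB granularity stands in for all of them. The sketch below only works through those numbers; `THP_SIZE` and `pages_per_thp()` are illustrative names, not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>

#define THP_SIZE (2UL * 1024 * 1024)   /* 2MB transparent huge page */

/*
 * Number of system pages (and therefore separate dirty bits) covered by
 * one 2MB huge page. With a single 2MB PTE, a store to any one of them
 * causes all of them to be reported dirty.
 */
static inline size_t pages_per_thp(size_t page_size)
{
    assert(page_size != 0 && THP_SIZE % page_size == 0);
    return THP_SIZE / page_size;
}
```

For 64kB pages this gives 32, and for 4kB pages 512, which is the worst-case factor by which a single guest store inflates the data userspace has to retransmit when tracking is done per huge page.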
    • B
      KVM: PPC: Pass change type down to memslot commit function · f032b734
      Bharata B Rao committed
      Currently, kvm_arch_commit_memory_region() gets called with a
      parameter indicating what type of change is being made to the memslot,
      but it doesn't pass it down to the platform-specific memslot commit
      functions.  This adds the `change' parameter to the lower-level
      functions so that they can use it in future.
      
      [paulus@ozlabs.org - fix book E also.]
      Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      f032b734
  2. 14 Dec, 2018 3 commits
    • S
      KVM: PPC: Book3S PR: Set hflag to indicate that POWER9 supports 1T segments · 6142236c
      Suraj Jitindar Singh committed
      When booting a kvm-pr guest on a POWER9 machine the following message is
      observed:
      "qemu-system-ppc64: KVM does not support 1TiB segments which guest expects"
      
      This is because the guest expects to be able to use 1T segments, but
      we don't indicate support for them: the BOOK3S_HFLAG_MULTI_PGSIZE flag
      is not set in the hflags in kvmppc_set_pvr_pr() on POWER9.
      
      POWER9 does indeed have support for 1T segments, so add a case for
      POWER9 to the switch statement to ensure it is set.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      6142236c
    • Y
      KVM: PPC: Book3S HV: Change to use DEFINE_SHOW_ATTRIBUTE macro · 0f6ddf34
      Yangtao Li committed
      Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.
      Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      0f6ddf34
    • P
      KVM: PPC: Book3S HV: Fix race between kvm_unmap_hva_range and MMU mode switch · 234ff0b7
      Paul Mackerras committed
      Testing has revealed an occasional crash which appears to be caused
      by a race between kvmppc_switch_mmu_to_hpt and kvm_unmap_hva_range_hv.
      The symptom is a NULL pointer dereference in __find_linux_pte() called
      from kvm_unmap_radix() with kvm->arch.pgtable == NULL.
      
      Looking at kvmppc_switch_mmu_to_hpt(), it does indeed clear
      kvm->arch.pgtable (via kvmppc_free_radix()) before setting
      kvm->arch.radix to NULL, and there is nothing to prevent
      kvm_unmap_hva_range_hv() or the other MMU callback functions from
      being called concurrently with kvmppc_switch_mmu_to_hpt() or
      kvmppc_switch_mmu_to_radix().
      
      This patch therefore adds calls to spin_lock/unlock on the kvm->mmu_lock
      around the assignments to kvm->arch.radix, and makes sure that the
      partition-scoped radix tree or HPT is only freed after changing
      kvm->arch.radix.
      
      This also takes the kvm->mmu_lock in kvmppc_rmap_reset() to make sure
      that the clearing of each rmap array (one per memslot) doesn't happen
      concurrently with use of the array in the kvm_unmap_hva_range_hv()
      or the other MMU callbacks.
      
      Fixes: 18c3640c ("KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix host")
      Cc: stable@vger.kernel.org # v4.15+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      234ff0b7
  3. 15 Nov, 2018 1 commit
    • M
      KVM: PPC: Book3S HV: Fix handling for interrupted H_ENTER_NESTED · 6c08ec12
      Michael Roth committed
      While running a nested guest VCPU on L0 via H_ENTER_NESTED hcall, a
      pending signal in the L0 QEMU process can generate the following
      sequence:
      
        ret0 = kvmppc_pseries_do_hcall()
          ret1 = kvmhv_enter_nested_guest()
            ret2 = kvmhv_run_single_vcpu()
            if (ret2 == -EINTR)
              return H_INTERRUPT
          if (ret1 == H_INTERRUPT)
            kvmppc_set_gpr(vcpu, 3, 0)
            return -EINTR
          /* skipped: */
          kvmppc_set_gpr(vcpu, 3, ret)
          vcpu->arch.hcall_needed = 0
          return RESUME_GUEST
      
      which causes an exit to L0 userspace with ret0 == -EINTR.
      
      The intention seems to be to set the hcall return value to 0 (via
      VCPU r3) so that L1 will see a successful return from H_ENTER_NESTED
      once we resume executing the VCPU. However, because we don't set
      vcpu->arch.hcall_needed = 0, we do the following once userspace
      resumes execution via kvm_arch_vcpu_ioctl_run():
      
        ...
        } else if (vcpu->arch.hcall_needed) {
          int i;
      
          kvmppc_set_gpr(vcpu, 3, run->papr_hcall.ret);
          for (i = 0; i < 9; ++i)
                 kvmppc_set_gpr(vcpu, 4 + i, run->papr_hcall.args[i]);
          vcpu->arch.hcall_needed = 0;
      
      since vcpu->arch.hcall_needed == 1 indicates that userspace should
      have handled the hcall and stored the return value in
      run->papr_hcall.ret. Since that's not the case here, we can get an
      unexpected value in VCPU r3, which can result in
      kvmhv_p9_guest_entry() reporting an unexpected trap value when it
      returns from H_ENTER_NESTED, causing the following register dump to
      console via subsequent call to kvmppc_handle_exit_hv() in L1:
      
        [  350.612854] vcpu 00000000f9564cf8 (0):
        [  350.612915] pc  = c00000000013eb98  msr = 8000000000009033  trap = 1
        [  350.613020] r 0 = c0000000004b9044  r16 = 0000000000000000
        [  350.613075] r 1 = c00000007cffba30  r17 = 0000000000000000
        [  350.613120] r 2 = c00000000178c100  r18 = 00007fffc24f3b50
        [  350.613166] r 3 = c00000007ef52480  r19 = 00007fffc24fff58
        [  350.613212] r 4 = 0000000000000000  r20 = 00000a1e96ece9d0
        [  350.613253] r 5 = 70616d00746f6f72  r21 = 00000a1ea117c9b0
        [  350.613295] r 6 = 0000000000000020  r22 = 00000a1ea1184360
        [  350.613338] r 7 = c0000000783be440  r23 = 0000000000000003
        [  350.613380] r 8 = fffffffffffffffc  r24 = 00000a1e96e9e124
        [  350.613423] r 9 = c00000007ef52490  r25 = 00000000000007ff
        [  350.613469] r10 = 0000000000000004  r26 = c00000007eb2f7a0
        [  350.613513] r11 = b0616d0009eccdb2  r27 = c00000007cffbb10
        [  350.613556] r12 = c0000000004b9000  r28 = c00000007d83a2c0
        [  350.613597] r13 = c000000001b00000  r29 = c0000000783cdf68
        [  350.613639] r14 = 0000000000000000  r30 = 0000000000000000
        [  350.613681] r15 = 0000000000000000  r31 = c00000007cffbbf0
        [  350.613723] ctr = c0000000004b9000  lr  = c0000000004b9044
        [  350.613765] srr0 = 0000772f954dd48c srr1 = 800000000280f033
        [  350.613808] sprg0 = 0000000000000000 sprg1 = c000000001b00000
        [  350.613859] sprg2 = 0000772f9565a280 sprg3 = 0000000000000000
        [  350.613911] cr = 88002848  xer = 0000000020040000  dsisr = 42000000
        [  350.613962] dar = 0000772f95390000
        [  350.614031] fault dar = c000000244b278c0 dsisr = 00000000
        [  350.614073] SLB (0 entries):
        [  350.614157] lpcr = 0040000003d40413 sdr1 = 0000000000000000 last_inst = ffffffff
        [  350.614252] trap=0x1 | pc=0xc00000000013eb98 | msr=0x8000000000009033
      
      followed by L1's QEMU reporting the following before stopping execution
      of the nested guest:
      
        KVM: unknown exit, hardware reason 1
        NIP c00000000013eb98   LR c0000000004b9044 CTR c0000000004b9000 XER 0000000020040000 CPU#0
        MSR 8000000000009033 HID0 0000000000000000  HF 8000000000000000 iidx 3 didx 3
        TB 00000000 00000000 DECR 00000000
        GPR00 c0000000004b9044 c00000007cffba30 c00000000178c100 c00000007ef52480
        GPR04 0000000000000000 70616d00746f6f72 0000000000000020 c0000000783be440
        GPR08 fffffffffffffffc c00000007ef52490 0000000000000004 b0616d0009eccdb2
        GPR12 c0000000004b9000 c000000001b00000 0000000000000000 0000000000000000
        GPR16 0000000000000000 0000000000000000 00007fffc24f3b50 00007fffc24fff58
        GPR20 00000a1e96ece9d0 00000a1ea117c9b0 00000a1ea1184360 0000000000000003
        GPR24 00000a1e96e9e124 00000000000007ff c00000007eb2f7a0 c00000007cffbb10
        GPR28 c00000007d83a2c0 c0000000783cdf68 0000000000000000 c00000007cffbbf0
        CR 88002848  [ L  L  -  -  E  L  G  L  ]             RES ffffffffffffffff
         SRR0 0000772f954dd48c  SRR1 800000000280f033    PVR 00000000004e1202 VRSAVE 0000000000000000
        SPRG0 0000000000000000 SPRG1 c000000001b00000  SPRG2 0000772f9565a280  SPRG3 0000000000000000
        SPRG4 0000000000000000 SPRG5 0000000000000000  SPRG6 0000000000000000  SPRG7 0000000000000000
        HSRR0 0000000000000000 HSRR1 0000000000000000
         CFAR 0000000000000000
         LPCR 0000000003d40413
         PTCR 0000000000000000   DAR 0000772f95390000  DSISR 0000000042000000
      
      Fix this by setting vcpu->arch.hcall_needed = 0 to indicate completion
      of H_ENTER_NESTED before we exit to L0 userspace.
      
      Fixes: 360cae31 ("KVM: PPC: Book3S HV: Nested guest entry via hypercall")
      Cc: linuxppc-dev@ozlabs.org
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
      Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      6c08ec12
  4. 07 Nov, 2018 1 commit
    • S
      KVM: PPC: Move and undef TRACE_INCLUDE_PATH/FILE · 28c5bcf7
      Scott Wood committed
      TRACE_INCLUDE_PATH and TRACE_INCLUDE_FILE are used by
      <trace/define_trace.h>, so like that #include, they should
      be outside #ifdef protection.
      
      They also need to be #undefed before defining, in case multiple trace
      headers are included by the same C file.  This became the case on
      book3e after commit cf4a6085 ("powerpc/mm: Add missing tracepoint for
      tlbie"), leading to the following build error:
      
         CC      arch/powerpc/kvm/powerpc.o
      In file included from arch/powerpc/kvm/powerpc.c:51:0:
      arch/powerpc/kvm/trace.h:9:0: error: "TRACE_INCLUDE_PATH" redefined
      [-Werror]
        #define TRACE_INCLUDE_PATH .
        ^
      In file included from arch/powerpc/kvm/../mm/mmu_decl.h:25:0,
                        from arch/powerpc/kvm/powerpc.c:48:
      ./arch/powerpc/include/asm/trace.h:224:0: note: this is the location of
      the previous definition
        #define TRACE_INCLUDE_PATH asm
        ^
      cc1: all warnings being treated as errors
      Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
      Signed-off-by: Scott Wood <oss@buserror.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      28c5bcf7
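The #undef-before-#define pattern from commit 28c5bcf7 above can be shown in isolation. This sketch collapses the two headers into one file; the two values (`asm` and `.`) are the ones in the quoted build error, while `STRINGIFY` and `trace_include_path()` are helpers added here only so the final value can be observed.

```c
#include <assert.h>

/* What the first trace header does. */
#undef TRACE_INCLUDE_PATH
#define TRACE_INCLUDE_PATH asm

/*
 * What a second trace header included by the same C file does. Without
 * the #undef, this line is the "TRACE_INCLUDE_PATH redefined" diagnostic,
 * which becomes a build failure under -Werror.
 */
#undef TRACE_INCLUDE_PATH
#define TRACE_INCLUDE_PATH .

/* Two-step stringification so the macro is expanded before being quoted. */
#define STRINGIFY_(x) #x
#define STRINGIFY(x) STRINGIFY_(x)

/* Returns the value of the most recent definition as a string. */
static inline const char *trace_include_path(void)
{
    return STRINGIFY(TRACE_INCLUDE_PATH);
}
```

The last definition wins, so each trace header sees its own path when it goes on to include <trace/define_trace.h>.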
  5. 26 Oct, 2018 1 commit
  6. 20 Oct, 2018 1 commit
    • A
      KVM: PPC: Optimize clearing TCEs for sparse tables · 6e301a8e
      Alexey Kardashevskiy committed
      The powernv platform maintains 2 TCE tables for VFIO - a hardware TCE
      table and a table with userspace addresses. These tables are radix trees;
      we allocate indirect levels when they are written to. Since
      the memory allocation is problematic in real mode, we have 2 accessors
      to the entries:
      - for virtual mode: it allocates the memory and it is always expected
      to return non-NULL;
      - for real mode: it does not allocate and can return NULL.
      
      Also, DMA windows can span up to 55 bits of the address space and since
      we never have this much RAM, such windows are sparse. However currently
      the SPAPR TCE IOMMU driver walks through all TCEs to unpin DMA memory.
      
      Since we maintain a userspace addresses table for VFIO which is a mirror
      of the hardware table, we can use it to know which parts of the DMA
      window have not been mapped and skip these, which is what this patch does.
      
      The bare metal systems do not have this problem as they use a bypass mode
      of a PHB which maps RAM directly.
      
      This helps a lot with sparse DMA windows, reducing the shutdown time from
      about 3 minutes per 1 billion TCEs to a few seconds for a 32GB sparse guest.
      Just skipping the last level seems to be good enough.
      
      As the non-allocating accessor is now used in virtual mode as well, rename it
      from IOMMU_TABLE_USERSPACE_ENTRY_RM (real mode) to _RO (read only).
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      6e301a8e
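The idea in commit 6e301a8e above, using a non-allocating accessor to skip unpopulated last-level chunks while clearing, can be sketched with a two-level table. Everything here is hypothetical and scaled down: `struct sparse_table`, `entry_ro()`, `clear_range()` and the tiny `LEAF_ENTRIES` value are illustration only and do not reflect the real IOMMU table layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LEAF_ENTRIES 4   /* deliberately tiny, for illustration */

/* Hypothetical two-level table: levels[i] stays NULL until first written. */
struct sparse_table {
    uint64_t **levels;
    size_t nlevels;
};

/* Read-only accessor: never allocates, returns NULL for an absent leaf. */
static inline uint64_t *entry_ro(const struct sparse_table *t, size_t idx)
{
    uint64_t *leaf = t->levels[idx / LEAF_ENTRIES];

    return leaf ? &leaf[idx % LEAF_ENTRIES] : NULL;
}

/*
 * Clear entries [start, start + count) and return how many entries were
 * actually visited. Absent leaves are skipped in a single step.
 */
static inline size_t clear_range(struct sparse_table *t, size_t start,
                                 size_t count)
{
    size_t visited = 0;
    size_t i = start;
    size_t end = start + count;

    assert(end <= t->nlevels * LEAF_ENTRIES);
    while (i < end) {
        uint64_t *e = entry_ro(t, i);

        if (!e) {
            /* Whole leaf is absent: jump to the start of the next leaf. */
            i = (i / LEAF_ENTRIES + 1) * LEAF_ENTRIES;
            continue;
        }
        *e = 0;
        visited++;
        i++;
    }
    return visited;
}
```

For a window of 12 entries where only the middle leaf exists, the loop touches 4 entries instead of 12; the same proportion is what turns a walk over a sparse 55-bit window into a short one.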
  7. 19 Oct, 2018 1 commit
    • P
      KVM: PPC: Book3S HV: Don't use streamlined entry path on early POWER9 chips · 8d9fcacf
      Paul Mackerras committed
      This disables the use of the streamlined entry path for radix guests
      on early POWER9 chips that need the workaround added in commit
      a25bd72b ("powerpc/mm/radix: Workaround prefetch issue with KVM",
      2017-07-24), because the streamlined entry path does not include
      that workaround.  This also means that we can't do nested HV-KVM
      on those chips.
      
      Since the chips that need that workaround are the same ones that can't
      run both radix and HPT guests at the same time on different threads of
      a core, we use the existing 'no_mixing_hpt_and_radix' variable that
      identifies those chips to identify when we can't use the new guest
      entry path, and when we can't do nested virtualization.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      8d9fcacf
  8. 18 Oct, 2018 1 commit
    • M
      powerpc: Add -Werror at arch/powerpc level · 23ad1a27
      Michael Ellerman committed
      Back when I added -Werror in commit ba55bd74 ("powerpc: Add
      configurable -Werror for arch/powerpc") I did it by adding it to most
      of the arch Makefiles.
      
      At the time we excluded math-emu, because apparently it didn't build
      cleanly. But that seems to have been fixed somewhere in the interim.
      
      So move the -Werror addition to the top-level of the arch, this saves
      us from repeating it in every Makefile and means we won't forget to
      add it to any new sub-dirs.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      23ad1a27
  9. 09 Oct, 2018 23 commits
    • P
      KVM: PPC: Book3S HV: Add NO_HASH flag to GET_SMMU_INFO ioctl result · 901f8c3f
      Paul Mackerras committed
      This adds a KVM_PPC_NO_HASH flag to the flags field of the
      kvm_ppc_smmu_info struct, and arranges for it to be set when
      running as a nested hypervisor, as an unambiguous indication
      to userspace that HPT guests are not supported.  Reporting the
      KVM_CAP_PPC_MMU_HASH_V3 capability as false could be taken as
      indicating only that the new HPT features in ISA V3.0 are not
      supported, leaving it ambiguous whether pre-V3.0 HPT features
      are supported.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      901f8c3f
    • P
      KVM: PPC: Book3S HV: Add a VM capability to enable nested virtualization · aa069a99
      Paul Mackerras committed
      With this, userspace can enable a KVM-HV guest to run nested guests
      under it.
      
      The administrator can control whether any nested guests can be run;
      setting the "nested" module parameter to false prevents any guests
      becoming nested hypervisors (that is, any attempt to enable the nested
      capability on a guest will fail).  Guests which are already nested
      hypervisors will continue to be so.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      aa069a99
    • P
      KVM: PPC: Book3S HV: Add nested shadow page tables to debugfs · 83a05510
      Paul Mackerras committed
      This adds a list of valid shadow PTEs for each nested guest to
      the 'radix' file for the guest in debugfs.  This can be useful for
      debugging.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      83a05510
    • P
      KVM: PPC: Book3S HV: Allow HV module to load without hypervisor mode · de760db4
      Paul Mackerras committed
      With this, the KVM-HV module can be loaded in a guest running under
      KVM-HV, and if the hypervisor supports nested virtualization, this
      guest can now act as a nested hypervisor and run nested guests.
      
      This also adds some checks to inform userspace that HPT guests are not
      supported by nested hypervisors (by returning false for the
      KVM_CAP_PPC_MMU_HASH_V3 capability), and to prevent userspace from
      configuring a guest to use HPT mode.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      de760db4
    • S
      KVM: PPC: Book3S HV: Handle differing endianness for H_ENTER_NESTED · 10b5022d
      Suraj Jitindar Singh committed
      The hcall H_ENTER_NESTED takes two parameters: the address in L1 guest
      memory of a hv_regs struct and the address of a pt_regs struct.  The
      hcall requests the L0 hypervisor to use the register values in these
      structs to run a L2 guest and to return the exit state of the L2 guest
      in these structs.  These are in the endianness of the L1 guest, rather
      than being always big-endian as is usually the case for PAPR
      hypercalls.
      
      This is convenient because it means that the L1 guest can pass the
      address of the regs field in its kvm_vcpu_arch struct.  This also
      improves performance slightly by avoiding the need for two copies of
      the pt_regs struct.
      
      When reading/writing these structures, this patch handles the case
      where the endianness of the L1 guest differs from that of the L0
      hypervisor, by byteswapping the structures after reading and before
      writing them back.
      
      Since all the fields of the pt_regs are of the same type, i.e.,
      unsigned long, we treat it as an array of unsigned longs.  The fields
      of struct hv_guest_state are not all the same, so its fields are
      byteswapped individually.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      10b5022d
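The "treat it as an array of unsigned longs" approach from commit 10b5022d above can be sketched as follows. `struct regs_like` is a hypothetical stand-in for pt_regs (the real structure has many more fields), and `swab64()` and `byteswap_regs()` are written out here rather than taken from the kernel; the sketch assumes 64-bit fields throughout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for pt_regs: every field is the same 64-bit type. */
struct regs_like {
    uint64_t gpr[4];
    uint64_t nip;
    uint64_t msr;
};

/* Reverse the byte order of one 64-bit word. */
static inline uint64_t swab64(uint64_t v)
{
    uint64_t r = 0;

    for (int i = 0; i < 8; i++) {
        r = (r << 8) | (v & 0xff);
        v >>= 8;
    }
    return r;
}

/*
 * Byteswap the whole structure word by word. This only works because all
 * fields have the same size; a structure with mixed field sizes has to be
 * swapped field by field instead.
 */
static inline void byteswap_regs(struct regs_like *regs)
{
    unsigned char *p = (unsigned char *)regs;
    size_t n = sizeof(*regs) / sizeof(uint64_t);

    assert(regs != NULL);
    for (size_t i = 0; i < n; i++) {
        uint64_t w;

        memcpy(&w, p + i * sizeof(w), sizeof(w));
        w = swab64(w);
        memcpy(p + i * sizeof(w), &w, sizeof(w));
    }
}
```

Applying `byteswap_regs()` twice restores the original values, which matches the commit's scheme of swapping once after reading and once before writing back.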
    • S
      KVM: PPC: Book3S HV: Sanitise hv_regs on nested guest entry · 73937deb
      Suraj Jitindar Singh committed
      restore_hv_regs() is used to copy the hv_regs L1 wants to set to run the
      nested (L2) guest into the vcpu structure. We need to sanitise these
      values to ensure we don't let the L1 guest hypervisor do things we don't
      want it to.
      
      We don't let data address watchpoints or completed instruction address
      breakpoints be set to match in hypervisor state.
      
      We also don't let L1 enable features in the hypervisor facility status
      and control register (HFSCR) for L2 which we have disabled for L1. That
      is, L2 will get the subset of features which the L0 hypervisor has
      enabled for L1 and the features L1 wants to enable for L2. This could
      mean we give L1 a hypervisor facility unavailable interrupt for a
      facility it thinks it has enabled, however it shouldn't have enabled a
      facility it itself doesn't have for the L2 guest.
      
      We sanitise the registers when copying in the L2 hv_regs. We don't need
      to sanitise when copying back the L1 hv_regs since these shouldn't be
      able to contain invalid values as they're just what was copied out.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      73937deb
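The HFSCR rule in commit 73937deb above (L2 gets only the features that L0 has enabled for L1 and that L1 also wants for L2) is a bitwise intersection. A minimal sketch, with `sanitise_hfscr()` as a hypothetical helper and arbitrary bit patterns rather than real HFSCR facility bits:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Facilities L2 may use: those L1 asked for, restricted to those that L0
 * has enabled for L1 itself. Bit meanings are not modelled here.
 */
static inline uint64_t sanitise_hfscr(uint64_t l1_requested_for_l2,
                                      uint64_t l0_enabled_for_l1)
{
    return l1_requested_for_l2 & l0_enabled_for_l1;
}
```

Any bit L1 requests but does not itself hold is silently dropped, which is why L1 can later see a facility unavailable interrupt for a facility it believed it had enabled.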
    • P
      KVM: PPC: Book3S HV: Add one-reg interface to virtual PTCR register · 30323418
      Paul Mackerras committed
      This adds a one-reg register identifier which can be used to read and
      set the virtual PTCR for the guest.  This register identifies the
      address and size of the virtual partition table for the guest, which
      contains information about the nested guests under this guest.
      
      Migrating this value is the only extra requirement for migrating a
      guest which has nested guests (assuming of course that the destination
      host supports nested virtualization in the kvm-hv module).
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      30323418
    • P
      KVM: PPC: Book3S HV: Don't access HFSCR, LPIDR or LPCR when running nested · f3c99f97
      Paul Mackerras committed
      When running as a nested hypervisor, this avoids reading hypervisor
      privileged registers (specifically HFSCR, LPIDR and LPCR) at startup;
      instead reasonable default values are used.  This also avoids writing
      LPIDR in the single-vcpu entry/exit path.
      
      Also, this removes the check for CPU_FTR_HVMODE in kvmppc_mmu_hv_init()
      since its only caller already checks this.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f3c99f97
    • S
      KVM: PPC: Book3S HV: Invalidate TLB when nested vcpu moves physical cpu · 9d0b048d
      Suraj Jitindar Singh committed
      This is only done at level 0, since only level 0 knows which physical
      CPU a vcpu is running on.  This does for nested guests what L0 already
      did for its own guests, which is to flush the TLB on a pCPU when it
      goes to run a vCPU there, and there is another vCPU in the same VM
      which previously ran on this pCPU and has now started to run on another
      pCPU.  This is to handle the situation where the other vCPU touched
      a mapping, moved to another pCPU and did a tlbiel (local-only tlbie)
      on that new pCPU and thus left behind a stale TLB entry on this pCPU.
      
      This introduces a limit on the vcpu_token values used in the
      H_ENTER_NESTED hcall -- they must now be less than NR_CPUS.
      
      [paulus@ozlabs.org - made prev_cpu array be short[] to reduce
       memory consumption.]
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      9d0b048d
    • P
      KVM: PPC: Book3S HV: Use hypercalls for TLB invalidation when nested · 690ed4ca
      Paul Mackerras committed
      This adds code to call the H_TLB_INVALIDATE hypercall when running as
      a guest, in the cases where we need to invalidate TLBs (or other MMU
      caches) as part of managing the mappings for a nested guest.  Calling
      H_TLB_INVALIDATE lets the nested hypervisor inform the parent
      hypervisor about changes to partition-scoped page tables or the
      partition table without needing to do hypervisor-privileged tlbie
      instructions.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      690ed4ca
    • S
      KVM: PPC: Book3S HV: Implement H_TLB_INVALIDATE hcall · e3b6b466
      Suraj Jitindar Singh committed
      When running a nested (L2) guest the guest (L1) hypervisor will use
      the H_TLB_INVALIDATE hcall when it needs to change the partition
      scoped page tables or the partition table which it manages.  It will
      use this hcall in the situations where it would use a partition-scoped
      tlbie instruction if it were running in hypervisor mode.
      
      The H_TLB_INVALIDATE hcall can invalidate different scopes:
      
      Invalidate TLB for a given target address:
      - This invalidates a single L2 -> L1 pte
      - We need to invalidate any L2 -> L0 shadow_pgtable ptes which map the L2
        address space which is being invalidated. This is because a single
        L2 -> L1 pte may have been mapped with more than one pte in the
        L2 -> L0 page tables.
      
      Invalidate the entire TLB for a given LPID or for all LPIDs:
      - Invalidate the entire shadow_pgtable for a given nested guest, or
        for all nested guests.
      
      Invalidate the PWC (page walk cache) for a given LPID or for all LPIDs:
      - We don't cache the PWC, so nothing to do.
      
      Invalidate the entire TLB, PWC and partition table for a given/all LPIDs:
      - Here we re-read the partition table entry and remove the nested state
        for any nested guest for which the first doubleword of the partition
        table entry is now zero.
      
      The H_TLB_INVALIDATE hcall takes as parameters the tlbie instruction
      word (of which only the RIC, PRS and R fields are used), the rS value
      (giving the lpid, where required) and the rB value (giving the IS, AP
      and EPN values).
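
A rough sketch of how the four scopes listed above might be told apart once the RIC and IS fields have been extracted. The enum, the function name and the numeric encodings (RIC 0 = TLB, 1 = PWC, 2 = all; IS 0 = target address, 2 = one LPID, 3 = all LPIDs) are assumptions for illustration, and extracting the fields from the instruction word and rB is deliberately left out.

```c
#include <assert.h>

/* Hypothetical action codes, one per case in the commit message. */
enum inval_action {
	INVAL_SHADOW_PTES_FOR_ADDR,   /* single L2 -> L1 pte, target address */
	INVAL_SHADOW_PGTABLE_ONE,     /* whole shadow_pgtable for one LPID   */
	INVAL_SHADOW_PGTABLE_ALL,     /* shadow_pgtables of all nested guests */
	INVAL_NOTHING,                /* PWC only: nothing is cached         */
	INVAL_REREAD_PATE_ONE,        /* re-read partition table entry       */
	INVAL_REREAD_PATE_ALL,
	INVAL_UNSUPPORTED,
};

/* ric and is are assumed to be already-decoded field values. */
enum inval_action classify_inval(int ric, int is)
{
	switch (ric) {
	case 0:  /* TLB only */
		if (is == 0)
			return INVAL_SHADOW_PTES_FOR_ADDR;
		if (is == 2)
			return INVAL_SHADOW_PGTABLE_ONE;
		if (is == 3)
			return INVAL_SHADOW_PGTABLE_ALL;
		break;
	case 1:  /* page walk cache only */
		if (is == 2 || is == 3)
			return INVAL_NOTHING;
		break;
	case 2:  /* TLB, PWC and partition table */
		if (is == 2)
			return INVAL_REREAD_PATE_ONE;
		if (is == 3)
			return INVAL_REREAD_PATE_ALL;
		break;
	}
	return INVAL_UNSUPPORTED;
}
```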
      
      [paulus@ozlabs.org - adapted to having the partition table in guest
      memory, added the H_TLB_INVALIDATE implementation, removed tlbie
      instruction emulation, reworded the commit message.]
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      e3b6b466
    • S
      KVM: PPC: Book3S HV: Introduce rmap to track nested guest mappings · 8cf531ed
      Suraj Jitindar Singh committed
      When a host (L0) page which is mapped into a (L1) guest is in turn
      mapped through to a nested (L2) guest we keep a reverse mapping (rmap)
      so that these mappings can be retrieved later.
      
      Whenever we create an entry in a shadow_pgtable for a nested guest we
      create a corresponding rmap entry and add it to the list for the
      L1 guest memslot at the index of the L1 guest page it maps. This means
      at the L1 guest memslot we end up with lists of rmaps.
      
      When we are notified of a host page being invalidated which has been
      mapped through to a (L1) guest, we can then walk the rmap list for that
      guest page, and find and invalidate all of the corresponding
      shadow_pgtable entries.
      
      In order to reduce memory consumption, we compress the information for
      each rmap entry down to 52 bits -- 12 bits for the LPID and 40 bits
      for the guest real page frame number -- which will fit in a single
      unsigned long.  To avoid a scenario where a guest can trigger
      unbounded memory allocations, we scan the list when adding an entry to
      see if there is already an entry with the contents we need.  This can
      occur, because we don't ever remove entries from the middle of a list.
      
      A struct nested guest rmap is a list pointer and an rmap entry;
      ----------------
      | next pointer |
      ----------------
      | rmap entry   |
      ----------------
      
      Thus the rmap pointer for each guest frame number in the memslot can be
      either NULL, a single entry, or a pointer to a list of nested rmap entries.
      
      gfn	 memslot rmap array
       	-------------------------
       0	| NULL			|	(no rmap entry)
       	-------------------------
       1	| single rmap entry	|	(rmap entry with low bit set)
       	-------------------------
       2	| list head pointer	|	(list of rmap entries)
       	-------------------------
      
      The final entry always has the lowest bit set and is stored in the next
      pointer of the last list entry, or as a single rmap entry.
      With a list of rmap entries looking like;
      
      -----------------	-----------------	-------------------------
      | list head ptr	| ----> | next pointer	| ---->	| single rmap entry	|
      -----------------	-----------------	-------------------------
      			| rmap entry	|	| rmap entry		|
      			-----------------	-------------------------
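
The 52-bit packing described above could look something like the sketch below. The exact bit positions are an assumption (LPID in the top 12 bits, the 40-bit page frame number below it, and bit 0 as the "single entry" flag), as are all macro and function names. It uses uint64_t in place of unsigned long so that it behaves the same on any host.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of one packed rmap entry (names are illustrative). */
#define RMAP_LPID_SHIFT    52
#define RMAP_LPID_MASK     0xFFF0000000000000ULL   /* 12 bits: LPID */
#define RMAP_PFN_SHIFT     12
#define RMAP_PFN_MASK      0x000FFFFFFFFFF000ULL   /* 40 bits: page frame */
#define RMAP_SINGLE_ENTRY  0x1ULL                  /* low bit: not a pointer */

/* Pack an LPID and a guest real page frame number into one word. */
uint64_t rmap_pack(unsigned int lpid, uint64_t pfn)
{
	return (((uint64_t)lpid << RMAP_LPID_SHIFT) & RMAP_LPID_MASK) |
	       ((pfn << RMAP_PFN_SHIFT) & RMAP_PFN_MASK);
}

unsigned int rmap_lpid(uint64_t entry)
{
	return (unsigned int)((entry & RMAP_LPID_MASK) >> RMAP_LPID_SHIFT);
}

uint64_t rmap_pfn(uint64_t entry)
{
	return (entry & RMAP_PFN_MASK) >> RMAP_PFN_SHIFT;
}

/* Mark a packed entry as the final (or only) entry of a list. */
uint64_t rmap_mark_single(uint64_t entry)
{
	return entry | RMAP_SINGLE_ENTRY;
}

/* A slot with the low bit set holds an entry rather than a list pointer. */
int rmap_is_single(uint64_t slot)
{
	return (slot & RMAP_SINGLE_ENTRY) != 0;
}
```

Because list nodes are at least word-aligned, a real pointer never has its low bit set, which is what lets the same slot hold either a pointer or a tagged entry.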
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      8cf531ed
    • S
      KVM: PPC: Book3S HV: Handle page fault for a nested guest · fd10be25
      Suraj Jitindar Singh committed
      Consider a normal (L1) guest running under the main hypervisor (L0),
      and then a nested guest (L2) running under the L1 guest which is acting
      as a nested hypervisor. L0 has page tables to map the address space for
      L1 providing the translation from L1 real address -> L0 real address;
      
      	L1
      	|
      	| (L1 -> L0)
      	|
      	----> L0
      
      There are also page tables in L1 used to map the address space for L2,
      providing the translation from L2 real address -> L1 real address. Since
      the hardware can only walk a single level of page table, we need to
      maintain in L0 a "shadow_pgtable" for L2 which provides the translation
      from L2 real address -> L0 real address. The two views look like this:
      
      	L2				L2
      	|				|
      	| (L2 -> L1)			|
      	|				|
      	----> L1			| (L2 -> L0)
      	      |				|
      	      | (L1 -> L0)		|
      	      |				|
      	      ----> L0			--------> L0
      
      When a page fault occurs while running a nested (L2) guest we need to
      insert a pte into this "shadow_pgtable" for the L2 -> L0 mapping. To
      do this we need to:
      
      1. Walk the pgtable in L1 memory to find the L2 -> L1 mapping, and
         provide a page fault to L1 if this mapping doesn't exist.
      2. Use our L1 -> L0 pgtable to convert this L1 address to an L0 address,
         or try to insert a pte for that mapping if it doesn't exist.
      3. Now we have a L2 -> L0 mapping, insert this into our shadow_pgtable
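
The three steps above amount to composing two translations and caching the result. A toy model, with flat arrays standing in for the radix trees and every name invented for the example, might look like this; it is a sketch of the idea only and ignores permissions, page sizes and locking.

```c
#include <assert.h>

#define NPAGES   8
#define UNMAPPED (-1)

/* Toy page tables: index = source page number, value = target page number. */
struct nested_mmu {
	int l2_to_l1[NPAGES];   /* maintained by L1, lives in L1 memory */
	int l1_to_l0[NPAGES];   /* maintained by L0 for the L1 guest    */
	int shadow[NPAGES];     /* L2 -> L0 shadow_pgtable kept by L0   */
	int next_host_page;     /* hypothetical host page allocator     */
};

enum fault_result { FAULT_RESOLVED, FAULT_REFLECT_TO_L1 };

void nested_mmu_init(struct nested_mmu *m)
{
	for (int i = 0; i < NPAGES; i++) {
		m->l2_to_l1[i] = UNMAPPED;
		m->l1_to_l0[i] = UNMAPPED;
		m->shadow[i] = UNMAPPED;
	}
	m->next_host_page = 100;
}

enum fault_result handle_nested_fault(struct nested_mmu *m, int l2_page)
{
	assert(l2_page >= 0 && l2_page < NPAGES);

	/* Step 1: walk L1's table; if L1 has no mapping, L1 must handle it. */
	int l1_page = m->l2_to_l1[l2_page];
	if (l1_page == UNMAPPED)
		return FAULT_REFLECT_TO_L1;
	assert(l1_page >= 0 && l1_page < NPAGES);

	/* Step 2: translate L1 -> L0, inserting a mapping if it is missing. */
	if (m->l1_to_l0[l1_page] == UNMAPPED)
		m->l1_to_l0[l1_page] = m->next_host_page++;

	/* Step 3: insert the combined L2 -> L0 mapping into the shadow table. */
	m->shadow[l2_page] = m->l1_to_l0[l1_page];
	return FAULT_RESOLVED;
}
```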
      
      Once this mapping exists we can take rc faults when hardware is unable
      to automatically set the reference and change bits in the pte. On these
      we need to:
      
      1. Check the rc bits on the L2 -> L1 pte match, and otherwise reflect
         the fault down to L1.
      2. Set the rc bits in the L1 -> L0 pte which corresponds to the same
         host page.
      3. Set the rc bits in the L2 -> L0 pte.
      
      As we reuse a large number of functions in book3s_64_mmu_radix.c for
      this we also needed to refactor a number of these functions to take
      an lpid parameter so that the correct lpid is used for tlb invalidations.
      The functionality however has remained the same.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      fd10be25
    • P
      KVM: PPC: Book3S HV: Handle hypercalls correctly when nested · 4bad7779
      Paul Mackerras committed
      When we are running as a nested hypervisor, we use a hypercall to
      enter the guest rather than code in book3s_hv_rmhandlers.S.  This means
      that the hypercall handlers listed in hcall_real_table never get called.
      There are some hypercalls that are handled there and not in
      kvmppc_pseries_do_hcall(), which therefore won't get processed for
      a nested guest.
      
      To fix this, we add cases to kvmppc_pseries_do_hcall() to handle those
      hypercalls, with the following exceptions:
      
      - The HPT hypercalls (H_ENTER, H_REMOVE, etc.) are not handled because
        we only support radix mode for nested guests.
      
      - H_CEDE has to be handled specially because the cede logic in
        kvmhv_run_single_vcpu assumes that it has been processed by the time
        that kvmhv_p9_guest_entry() returns.  Therefore we put a special
        case for H_CEDE in kvmhv_p9_guest_entry().
      
      For the XICS hypercalls, if real-mode processing is enabled, then the
      virtual-mode handlers assume that they are being called only to finish
      up the operation.  Therefore we turn off the real-mode flag in the XICS
      code when running as a nested hypervisor.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      4bad7779
    • P
      KVM: PPC: Book3S HV: Use XICS hypercalls when running as a nested hypervisor · f3c18e93
      Paul Mackerras committed
      This adds code to call the H_IPI and H_EOI hypercalls when we are
      running as a nested hypervisor (i.e. without the CPU_FTR_HVMODE cpu
      feature) and we would otherwise access the XICS interrupt controller
      directly or via an OPAL call.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f3c18e93
    • P
      KVM: PPC: Book3S HV: Nested guest entry via hypercall · 360cae31
      Paul Mackerras committed
      This adds a new hypercall, H_ENTER_NESTED, which is used by a nested
      hypervisor to enter one of its nested guests.  The hypercall supplies
      register values in two structs.  Those values are copied by the level 0
      (L0) hypervisor (the one which is running in hypervisor mode) into the
      vcpu struct of the L1 guest, and then the guest is run until an
      interrupt or error occurs which needs to be reported to L1 via the
      hypercall return value.
      
      Currently this assumes that the L0 and L1 hypervisors are the same
      endianness, and the structs passed as arguments are in native
      endianness.  If they are of different endianness, the version number
      check will fail and the hcall will be rejected.
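
To see why a plain version comparison is enough to catch an endianness mismatch, consider the sketch below. The version constant, its 64-bit width and the function names are assumptions for illustration. A small version number written by an opposite-endian L1 reads back byte-swapped in L0, so it no longer equals the expected value.

```c
#include <assert.h>
#include <stdint.h>

#define GUEST_STATE_VERSION 1ULL   /* assumption: illustrative value */

/* Reverse the byte order of a 64-bit value (what a cross-endian read sees). */
uint64_t swap_bytes64(uint64_t v)
{
	uint64_t r = 0;

	for (int i = 0; i < 8; i++) {
		r = (r << 8) | (v & 0xFF);
		v >>= 8;
	}
	return r;
}

/* Accept the struct only if its version field reads as the expected value. */
int version_ok(uint64_t version_as_read)
{
	return version_as_read == GUEST_STATE_VERSION;
}
```

Here version_ok(GUEST_STATE_VERSION) succeeds, while version_ok(swap_bytes64(GUEST_STATE_VERSION)) fails because the swapped value is 0x0100000000000000.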
      
      Nested hypervisors do not support indep_threads_mode=N, so this adds
      code to print a warning message if the administrator has set
      indep_threads_mode=N, and treat it as Y.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      360cae31
    • P
      KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization · 8e3f5fc1
      Paul Mackerras committed
      This starts the process of adding the code to support nested HV-style
      virtualization.  It defines a new H_SET_PARTITION_TABLE hypercall which
      a nested hypervisor can use to set the base address and size of a
      partition table in its memory (analogous to the PTCR register).
      On the host (level 0 hypervisor) side, the H_SET_PARTITION_TABLE
      hypercall from the guest is handled by code that saves the virtual
      PTCR value for the guest.
      
      This also adds code for creating and destroying nested guests and for
      reading the partition table entry for a nested guest from L1 memory.
      Each nested guest has its own shadow LPID value, different in general
      from the LPID value used by the nested hypervisor to refer to it.  The
      shadow LPID value is allocated at nested guest creation time.
      
      Nested hypervisor functionality is only available for a radix guest,
      which therefore means a radix host on a POWER9 (or later) processor.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      8e3f5fc1
    • P
      KVM: PPC: Book3S HV: Use kvmppc_unmap_pte() in kvm_unmap_radix() · f0f825f0
      Paul Mackerras committed
      kvmppc_unmap_pte() does a sequence of operations that are open-coded in
      kvm_unmap_radix().  This extends kvmppc_unmap_pte() a little so that it
      can be used by kvm_unmap_radix(), and makes kvm_unmap_radix() call it.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f0f825f0
    • S
      KVM: PPC: Book3S HV: Refactor radix page fault handler · 04bae9d5
      Suraj Jitindar Singh committed
      The radix page fault handler accounts for all cases, including just
      needing to insert a pte.  This breaks it up into separate functions for
      the two main cases; setting rc and inserting a pte.
      
      This allows us to make the setting of rc and inserting of a pte
      generic for any pgtable, not specific to the one for this guest.
      
      [paulus@ozlabs.org - reduced diffs from previous code]
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      04bae9d5
    • S
      KVM: PPC: Book3S HV: Make kvmppc_mmu_radix_xlate process/partition table agnostic · 9811c78e
      Suraj Jitindar Singh committed
      kvmppc_mmu_radix_xlate() is used to translate an effective address
      through the process tables. The process table and partition tables have
      identical layout. Exploit this fact to make the kvmppc_mmu_radix_xlate()
      function able to translate either an effective address through the
      process tables or a guest real address through the partition tables.
      
      [paulus@ozlabs.org - reduced diffs from previous code]
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      9811c78e
    • S
      KVM: PPC: Book3S HV: Clear partition table entry on vm teardown · 89329c0b
      Suraj Jitindar Singh committed
      When destroying a VM we return the LPID to the pool, however we never
      zero the partition table entry. This is instead done when we reallocate
      the LPID.
      
      Zero the partition table entry on VM teardown before returning the LPID
      to the pool. This means if we were running as a nested hypervisor the
      real hypervisor could use this to determine when it can free resources.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      89329c0b
    • P
      KVM: PPC: Use ccr field in pt_regs struct embedded in vcpu struct · fd0944ba
      Paul Mackerras committed
      When the 'regs' field was added to struct kvm_vcpu_arch, the code
      was changed to use several of the fields inside regs (e.g., gpr, lr,
      etc.) but not the ccr field, because the ccr field in struct pt_regs
      is 64 bits on 64-bit platforms, but the cr field in kvm_vcpu_arch is
      only 32 bits.  This changes the code to use the regs.ccr field
      instead of cr, and changes the assembly code on 64-bit platforms to
      use 64-bit loads and stores instead of 32-bit ones.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      fd0944ba
    • P
      KVM: PPC: Book3S HV: Add a debugfs file to dump radix mappings · 9a94d3ee
      Paul Mackerras committed
      This adds a file called 'radix' in the debugfs directory for the
      guest, which when read gives all of the valid leaf PTEs in the
      partition-scoped radix tree for a radix guest, in human-readable
      format.  It is analogous to the existing 'htab' file which dumps
      the HPT entries for a HPT guest.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      9a94d3ee