1. 05 Apr, 2019 1 commit
    • KVM: PPC: Book3S HV: Perserve PSSCR FAKE_SUSPEND bit on guest exit · 7cb9eb10
      Suraj Jitindar Singh committed
      There is a hardware bug in some POWER9 processors where a treclaim in
      fake suspend mode can cause an inconsistency in the XER[SO] bit across
      the threads of a core. The workaround is to force the core into SMT4
      when doing the treclaim.

      The FAKE_SUSPEND bit (bit 10) in the PSSCR controls whether a thread
      is in fake suspend or real suspend mode. The important difference is
      that thread reconfiguration is blocked in real suspend mode but not in
      fake suspend mode.
      
      When we exit a guest which was in fake suspend mode, we force the core
      into SMT4 while we do the treclaim in kvmppc_save_tm_hv().
      However, on the new exit path introduced with the function
      kvmhv_run_single_vcpu(), we restore the host PSSCR before calling
      kvmppc_save_tm_hv(). If we were in fake suspend mode, clearing the
      PSSCR[FAKE_SUSPEND] bit puts the thread into real suspend mode, which
      blocks thread reconfiguration. The thread that is trying to get the
      core into SMT4 before it can do the treclaim then spins forever, since
      it is itself blocking thread reconfiguration. The result is that the
      core is essentially lost.
      
      This results in a trace such as:
      [   93.512904] CPU: 7 PID: 13352 Comm: qemu-system-ppc Not tainted 5.0.0 #4
      [   93.512905] NIP:  c000000000098a04 LR: c0000000000cc59c CTR: 0000000000000000
      [   93.512908] REGS: c000003fffd2bd70 TRAP: 0100   Not tainted  (5.0.0)
      [   93.512908] MSR:  9000000302883033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE,TM[SE]>  CR: 22222444  XER: 00000000
      [   93.512914] CFAR: c000000000098a5c IRQMASK: 3
      [   93.512915] PACATMSCRATCH: 0000000000000001
      [   93.512916] GPR00: 0000000000000001 c000003f6cc1b830 c000000001033100 0000000000000004
      [   93.512928] GPR04: 0000000000000004 0000000000000002 0000000000000004 0000000000000007
      [   93.512930] GPR08: 0000000000000000 0000000000000004 0000000000000000 0000000000000004
      [   93.512932] GPR12: c000203fff7fc000 c000003fffff9500 0000000000000000 0000000000000000
      [   93.512935] GPR16: 2000000000300375 000000000000059f 0000000000000000 0000000000000000
      [   93.512951] GPR20: 0000000000000000 0000000000080053 004000000256f41f c000003f6aa88ef0
      [   93.512953] GPR24: c000003f6aa89100 0000000000000010 0000000000000000 0000000000000000
      [   93.512956] GPR28: c000003f9e9a0800 0000000000000000 0000000000000001 c000203fff7fc000
      [   93.512959] NIP [c000000000098a04] pnv_power9_force_smt4_catch+0x1b4/0x2c0
      [   93.512960] LR [c0000000000cc59c] kvmppc_save_tm_hv+0x40/0x88
      [   93.512960] Call Trace:
      [   93.512961] [c000003f6cc1b830] [0000000000080053] 0x80053 (unreliable)
      [   93.512965] [c000003f6cc1b8a0] [c00800001e9cb030] kvmhv_p9_guest_entry+0x508/0x6b0 [kvm_hv]
      [   93.512967] [c000003f6cc1b940] [c00800001e9cba44] kvmhv_run_single_vcpu+0x2dc/0xb90 [kvm_hv]
      [   93.512968] [c000003f6cc1ba10] [c00800001e9cc948] kvmppc_vcpu_run_hv+0x650/0xb90 [kvm_hv]
      [   93.512969] [c000003f6cc1bae0] [c00800001e8f620c] kvmppc_vcpu_run+0x34/0x48 [kvm]
      [   93.512971] [c000003f6cc1bb00] [c00800001e8f2d4c] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
      [   93.512972] [c000003f6cc1bb90] [c00800001e8e3918] kvm_vcpu_ioctl+0x460/0x7d0 [kvm]
      [   93.512974] [c000003f6cc1bd00] [c0000000003ae2c0] do_vfs_ioctl+0xe0/0x8e0
      [   93.512975] [c000003f6cc1bdb0] [c0000000003aeb24] ksys_ioctl+0x64/0xe0
      [   93.512978] [c000003f6cc1be00] [c0000000003aebc8] sys_ioctl+0x28/0x80
      [   93.512981] [c000003f6cc1be20] [c00000000000b3a4] system_call+0x5c/0x70
      [   93.512983] Instruction dump:
      [   93.512986] 419dffbc e98c0000 2e8b0000 38000001 60000000 60000000 60000000 40950068
      [   93.512993] 392bffff 39400000 79290020 39290001 <7d2903a6> 60000000 60000000 7d235214
      
      To fix this we preserve the PSSCR[FAKE_SUSPEND] bit until we call
      kvmppc_save_tm_hv() which will mean the core can get into SMT4 and
      perform the treclaim. Note kvmppc_save_tm_hv() clears the
      PSSCR[FAKE_SUSPEND] bit again so there is no need to explicitly do that.
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      7cb9eb10
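
The bit handling described in this fix can be sketched in user-space C. This is an illustration only: the helper name and the idea of passing the exit-time PSSCR as a plain integer are assumptions, and the mask assumes that "bit 10" is counted from the least-significant bit. The real code operates on the PSSCR special-purpose register inside the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: FAKE_SUSPEND is bit 10 counted from the least-significant bit. */
#define PSSCR_FAKE_SUSPEND_LG 10
#define PSSCR_FAKE_SUSPEND    (UINT64_C(1) << PSSCR_FAKE_SUSPEND_LG)

/*
 * Hypothetical helper: compute the PSSCR value to install on guest exit.
 * The host value is restored, but the FAKE_SUSPEND bit is carried over from
 * the value in effect when the guest exited, so a later treclaim still runs
 * in fake suspend mode and thread reconfiguration is not blocked.
 */
uint64_t psscr_on_guest_exit(uint64_t host_psscr, uint64_t exit_psscr)
{
    return (host_psscr & ~PSSCR_FAKE_SUSPEND) |
           (exit_psscr & PSSCR_FAKE_SUSPEND);
}

/* Usage check: the bit survives only if it was set at exit time. */
void psscr_example(void)
{
    assert(psscr_on_guest_exit(0x3, PSSCR_FAKE_SUSPEND) ==
           (0x3 | PSSCR_FAKE_SUSPEND));
    assert(psscr_on_guest_exit(0x3, 0) == 0x3);
}
```

In this sketch nothing clears the bit afterwards; per the commit message, kvmppc_save_tm_hv() takes care of that in the real exit path.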
  2. 22 Feb, 2019 1 commit
    • KVM: PPC: Book3S HV: Fix build failure without IOMMU support · e40542af
      Jordan Niethe committed
      Currently trying to build without IOMMU support will fail:
      
        (.text+0x1380): undefined reference to `kvmppc_h_get_tce'
        (.text+0x1384): undefined reference to `kvmppc_rm_h_put_tce'
        (.text+0x149c): undefined reference to `kvmppc_rm_h_stuff_tce'
        (.text+0x14a0): undefined reference to `kvmppc_rm_h_put_tce_indirect'
      
      This happens because turning off IOMMU support will prevent
      book3s_64_vio_hv.c from being built because it is only built when
      SPAPR_TCE_IOMMU is set, which depends on IOMMU support.
      
      Fix it using ifdefs for the undefined references.
      
      Fixes: 76d837a4 ("KVM: PPC: Book3S PR: Don't include SPAPR TCE code on non-pseries platforms")
      Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      e40542af
  3. 21 Feb, 2019 6 commits
    • powerpc/64s: Better printing of machine check info for guest MCEs · c0577201
      Paul Mackerras committed
      This adds an "in_guest" parameter to machine_check_print_event_info()
      so that we can avoid trying to translate guest NIP values into
      symbolic form using the host kernel's symbol table.
      Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      c0577201
    • KVM: PPC: Book3S HV: Simplify machine check handling · 884dfb72
      Paul Mackerras committed
      This makes the handling of machine check interrupts that occur inside
      a guest simpler and more robust, with less done in assembler code and
      in real mode.
      
      Now, when a machine check occurs inside a guest, we always get the
      machine check event struct and put a copy in the vcpu struct for the
      vcpu where the machine check occurred.  We no longer call
      machine_check_queue_event() from kvmppc_realmode_mc_power7(), because
      on POWER8, when a vcpu is running on an offline secondary thread and
      we call machine_check_queue_event(), that calls irq_work_queue(),
      which doesn't work because the CPU is offline, but instead triggers
      the WARN_ON(lazy_irq_pending()) in pnv_smp_cpu_kill_self() (which
      fires again and again because nothing clears the condition).
      
      All that machine_check_queue_event() actually does is to cause the
      event to be printed to the console.  For a machine check occurring in
      the guest, we now print the event in kvmppc_handle_exit_hv()
      instead.
      
      The assembly code at label machine_check_realmode now just calls C
      code and then continues exiting the guest.  We no longer either
      synthesize a machine check for the guest in assembly code or return
      to the guest without a machine check.
      
      The code in kvmppc_handle_exit_hv() is extended to handle the case
      where the guest is not FWNMI-capable.  In that case we now always
      synthesize a machine check interrupt for the guest.  Previously, if
      the host thought it had recovered the machine check fully, it would
      return to the guest without any notification that the machine check
      had occurred.  If the machine check was caused by some action of the
      guest (such as creating duplicate SLB entries), it is much better to
      tell the guest that it has caused a problem.  Therefore we now always
      generate a machine check interrupt for guests that are not
      FWNMI-capable.
      Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      884dfb72
    • KVM: PPC: Book3S HV: Context switch AMR on Power9 · d976f680
      Michael Ellerman committed
      kvmhv_p9_guest_entry() implements a fast-path guest entry for Power9
      when guest and host are both running with the Radix MMU.
      
      Currently in that path we don't save the host AMR (Authority Mask
      Register) value, and we always restore 0 on return to the host. That
      is OK at the moment because the AMR is not used for storage keys with
      the Radix MMU.
      
      However we plan to start using the AMR on Radix to prevent the kernel
      from reading/writing to userspace outside of copy_to/from_user(). In
      order to make that work we need to save/restore the AMR value.
      
      We only restore the value if it is different from the guest value,
      which is already in the register when we exit to the host. This should
      mean we rarely need to actually restore the value when running a
      modern Linux as a guest, because it will be using the same value as
      us.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Tested-by: Russell Currey <ruscur@russell.cc>
      d976f680
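
The "restore only if different" idea from this commit can be shown with a simulated register. Everything here is a stand-in: `sim_amr`, the write counter, and the function names are hypothetical, and the real code reads and writes the AMR special-purpose register in the guest exit path.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated AMR; in the kernel this is a special-purpose register. */
static uint64_t sim_amr;
static unsigned int sim_amr_writes;

static void sim_write_amr(uint64_t val)
{
    sim_amr = val;
    sim_amr_writes++;
}

/*
 * Hypothetical exit step: the guest's value is still in the register, so
 * the (comparatively expensive) write is skipped when it already matches
 * the saved host value.
 */
void restore_host_amr(uint64_t host_amr)
{
    if (sim_amr != host_amr)
        sim_write_amr(host_amr);
}

void amr_example(void)
{
    sim_amr = 0x5;             /* value the guest left behind */
    sim_amr_writes = 0;

    restore_host_amr(0x5);     /* same value: no write needed */
    assert(sim_amr_writes == 0);

    restore_host_amr(0x0);     /* different value: exactly one write */
    assert(sim_amr_writes == 1 && sim_amr == 0x0);
}
```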
    • KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start · dee339b5
      Nir Weiner committed
      grow_halt_poll_ns() behaves strangely when
      (vcpu->halt_poll_ns != 0) &&
      (vcpu->halt_poll_ns < halt_poll_ns_grow_start).

      In this case, vcpu->halt_poll_ns is multiplied by the grow factor
      (halt_poll_ns_grow), so several grow iterations are needed to reach a
      value bigger than halt_poll_ns_grow_start.
      This means that growing vcpu->halt_poll_ns from a value of 0 is faster
      than growing it from a positive value less than halt_poll_ns_grow_start,
      which is inconsistent.

      Fix the issue by changing grow_halt_poll_ns() to set vcpu->halt_poll_ns
      to halt_poll_ns_grow_start whenever
      (vcpu->halt_poll_ns < halt_poll_ns_grow_start),
      regardless of whether vcpu->halt_poll_ns is 0.

      Use READ_ONCE to get a consistent value in all cases.
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      dee339b5
    • KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter · 49113d36
      Nir Weiner committed
      The hard-coded value 10000 in grow_halt_poll_ns() is the initial
      start value used when raising vcpu->halt_poll_ns.
      In effect it sets the timeout of the first polling session.
      This value has a significant effect on how tolerant we are of outliers.
      In the common case, a higher value is better: we spend more time
      in the polling busy loop, handle events/interrupts faster, and get
      better performance.
      For outliers, however, it puts us in a busy loop that does nothing.
      Even if the shrink factor is zero, we still waste time on the first
      iteration.
      The optimal value differs between workloads. It depends on the
      outlier rate and the length of polling sessions.
      As this value has a significant effect on the dynamic halt-polling
      algorithm, it should be configurable and exposed.
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      49113d36
    • KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns · 7fa08e71
      Nir Weiner committed
      grow_halt_poll_ns() behaves strangely when
      (halt_poll_ns_grow == 0) && (vcpu->halt_poll_ns != 0).

      In this case, vcpu->halt_poll_ns is set to zero,
      which shrinks it instead of growing it.

      Fix the issue by changing grow_halt_poll_ns() to leave
      vcpu->halt_poll_ns unmodified when halt_poll_ns_grow is zero.
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
      Suggested-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7fa08e71
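
Taken together, the three halt-polling commits from this date describe a grow step that can be sketched as follows. This is a simplified user-space reading of the commit messages, not the kernel function: the name `grow_poll_ns` and its plain-integer parameters are assumptions, the parameters stand in for vcpu->halt_poll_ns, halt_poll_ns_grow and halt_poll_ns_grow_start, and any upper cap or READ_ONCE handling is omitted.

```c
#include <assert.h>

/*
 * Simplified grow step:
 *  - a grow factor of zero leaves the current value untouched
 *    (it must never shrink it);
 *  - any value below grow_start, zero or not, jumps straight to grow_start;
 *  - otherwise the value is multiplied by the grow factor.
 */
unsigned int grow_poll_ns(unsigned int cur, unsigned int grow,
                          unsigned int grow_start)
{
    if (grow == 0)
        return cur;

    if (cur < grow_start)
        return grow_start;

    return cur * grow;
}

void grow_poll_example(void)
{
    /* Starting from zero and from a small positive value behave alike. */
    assert(grow_poll_ns(0, 2, 10000) == 10000);
    assert(grow_poll_ns(500, 2, 10000) == 10000);

    /* At or above the start value, normal multiplicative growth applies. */
    assert(grow_poll_ns(20000, 2, 10000) == 40000);

    /* A zero grow factor does not shrink the value. */
    assert(grow_poll_ns(20000, 0, 10000) == 20000);
}
```

The default of 10000 for the start value is the figure quoted in the commit message; the other numbers are arbitrary.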
  4. 19 Feb, 2019 2 commits
    • KVM: PPC: Book3S: Allow XICS emulation to work in nested hosts using XIVE · 03f95332
      Paul Mackerras committed
      Currently, the KVM code assumes that if the host kernel is using the
      XIVE interrupt controller (the new interrupt controller that first
      appeared in POWER9 systems), then the in-kernel XICS emulation will
      use the XIVE hardware to deliver interrupts to the guest.  However,
      this only works when the host is running in hypervisor mode and has
      full access to all of the XIVE functionality.  It doesn't work in any
      nested virtualization scenario, either with PR KVM or nested-HV KVM,
      because the XICS-on-XIVE code calls directly into the native-XIVE
      routines, which are not initialized and cannot function correctly
      because they use OPAL calls, and OPAL is not available in a guest.
      
      This means that using the in-kernel XICS emulation in a nested
      hypervisor that is using XIVE as its interrupt controller will cause a
      (nested) host kernel crash.  To fix this, we change most of the places
      where the current code calls xive_enabled() to select between the
      XICS-on-XIVE emulation and the plain XICS emulation to call a new
      function, xics_on_xive(), which returns false in a guest.
      
      However, there is a further twist.  The plain XICS emulation has some
      functions which are used in real mode and access the underlying XICS
      controller (the interrupt controller of the host) directly.  In the
      case of a nested hypervisor, this means doing XICS hypercalls
      directly.  When the nested host is using XIVE as its interrupt
      controller, these hypercalls will fail.  Therefore this also adds
      checks in the places where the XICS emulation wants to access the
      underlying interrupt controller directly, and if that is XIVE, makes
      the code use the virtual mode fallback paths, which call generic
      kernel infrastructure rather than doing direct XICS access.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      03f95332
    • KVM: PPC: Book3S HV: Replace kmalloc_node+memset with kzalloc_node · 08434ab4
      wangbo committed
      Replace kmalloc_node and memset with kzalloc_node.
      Signed-off-by: wangbo <wang.bo116@zte.com.cn>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      08434ab4
  5. 17 Dec, 2018 5 commits
  6. 14 Dec, 2018 1 commit
    • KVM: PPC: Book3S HV: Fix race between kvm_unmap_hva_range and MMU mode switch · 234ff0b7
      Paul Mackerras committed
      Testing has revealed an occasional crash which appears to be caused
      by a race between kvmppc_switch_mmu_to_hpt and kvm_unmap_hva_range_hv.
      The symptom is a NULL pointer dereference in __find_linux_pte() called
      from kvm_unmap_radix() with kvm->arch.pgtable == NULL.
      
      Looking at kvmppc_switch_mmu_to_hpt(), it does indeed clear
      kvm->arch.pgtable (via kvmppc_free_radix()) before setting
      kvm->arch.radix to NULL, and there is nothing to prevent
      kvm_unmap_hva_range_hv() or the other MMU callback functions from
      being called concurrently with kvmppc_switch_mmu_to_hpt() or
      kvmppc_switch_mmu_to_radix().
      
      This patch therefore adds calls to spin_lock/unlock on the kvm->mmu_lock
      around the assignments to kvm->arch.radix, and makes sure that the
      partition-scoped radix tree or HPT is only freed after changing
      kvm->arch.radix.
      
      This also takes the kvm->mmu_lock in kvmppc_rmap_reset() to make sure
      that the clearing of each rmap array (one per memslot) doesn't happen
      concurrently with use of the array in the kvm_unmap_hva_range_hv()
      or the other MMU callbacks.
      
      Fixes: 18c3640c ("KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix host")
      Cc: stable@vger.kernel.org # v4.15+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      234ff0b7
  7. 15 Nov, 2018 1 commit
    • KVM: PPC: Book3S HV: Fix handling for interrupted H_ENTER_NESTED · 6c08ec12
      Michael Roth committed
      While running a nested guest VCPU on L0 via H_ENTER_NESTED hcall, a
      pending signal in the L0 QEMU process can generate the following
      sequence:
      
        ret0 = kvmppc_pseries_do_hcall()
          ret1 = kvmhv_enter_nested_guest()
            ret2 = kvmhv_run_single_vcpu()
            if (ret2 == -EINTR)
              return H_INTERRUPT
          if (ret1 == H_INTERRUPT)
            kvmppc_set_gpr(vcpu, 3, 0)
            return -EINTR
          /* skipped: */
          kvmppc_set_gpr(vcpu, 3, ret)
          vcpu->arch.hcall_needed = 0
          return RESUME_GUEST
      
      which causes an exit to L0 userspace with ret0 == -EINTR.
      
      The intention seems to be to set the hcall return value to 0 (via
      VCPU r3) so that L1 will see a successful return from H_ENTER_NESTED
      once we resume executing the VCPU. However, because we don't set
      vcpu->arch.hcall_needed = 0, we do the following once userspace
      resumes execution via kvm_arch_vcpu_ioctl_run():
      
        ...
        } else if (vcpu->arch.hcall_needed) {
          int i;
      
          kvmppc_set_gpr(vcpu, 3, run->papr_hcall.ret);
          for (i = 0; i < 9; ++i)
                 kvmppc_set_gpr(vcpu, 4 + i, run->papr_hcall.args[i]);
          vcpu->arch.hcall_needed = 0;
      
      since vcpu->arch.hcall_needed == 1 indicates that userspace should
      have handled the hcall and stored the return value in
      run->papr_hcall.ret. Since that's not the case here, we can get an
      unexpected value in VCPU r3, which can result in
      kvmhv_p9_guest_entry() reporting an unexpected trap value when it
      returns from H_ENTER_NESTED, causing the following register dump to
      console via subsequent call to kvmppc_handle_exit_hv() in L1:
      
        [  350.612854] vcpu 00000000f9564cf8 (0):
        [  350.612915] pc  = c00000000013eb98  msr = 8000000000009033  trap = 1
        [  350.613020] r 0 = c0000000004b9044  r16 = 0000000000000000
        [  350.613075] r 1 = c00000007cffba30  r17 = 0000000000000000
        [  350.613120] r 2 = c00000000178c100  r18 = 00007fffc24f3b50
        [  350.613166] r 3 = c00000007ef52480  r19 = 00007fffc24fff58
        [  350.613212] r 4 = 0000000000000000  r20 = 00000a1e96ece9d0
        [  350.613253] r 5 = 70616d00746f6f72  r21 = 00000a1ea117c9b0
        [  350.613295] r 6 = 0000000000000020  r22 = 00000a1ea1184360
        [  350.613338] r 7 = c0000000783be440  r23 = 0000000000000003
        [  350.613380] r 8 = fffffffffffffffc  r24 = 00000a1e96e9e124
        [  350.613423] r 9 = c00000007ef52490  r25 = 00000000000007ff
        [  350.613469] r10 = 0000000000000004  r26 = c00000007eb2f7a0
        [  350.613513] r11 = b0616d0009eccdb2  r27 = c00000007cffbb10
        [  350.613556] r12 = c0000000004b9000  r28 = c00000007d83a2c0
        [  350.613597] r13 = c000000001b00000  r29 = c0000000783cdf68
        [  350.613639] r14 = 0000000000000000  r30 = 0000000000000000
        [  350.613681] r15 = 0000000000000000  r31 = c00000007cffbbf0
        [  350.613723] ctr = c0000000004b9000  lr  = c0000000004b9044
        [  350.613765] srr0 = 0000772f954dd48c srr1 = 800000000280f033
        [  350.613808] sprg0 = 0000000000000000 sprg1 = c000000001b00000
        [  350.613859] sprg2 = 0000772f9565a280 sprg3 = 0000000000000000
        [  350.613911] cr = 88002848  xer = 0000000020040000  dsisr = 42000000
        [  350.613962] dar = 0000772f95390000
        [  350.614031] fault dar = c000000244b278c0 dsisr = 00000000
        [  350.614073] SLB (0 entries):
        [  350.614157] lpcr = 0040000003d40413 sdr1 = 0000000000000000 last_inst = ffffffff
        [  350.614252] trap=0x1 | pc=0xc00000000013eb98 | msr=0x8000000000009033
      
      followed by L1's QEMU reporting the following before stopping execution
      of the nested guest:
      
        KVM: unknown exit, hardware reason 1
        NIP c00000000013eb98   LR c0000000004b9044 CTR c0000000004b9000 XER 0000000020040000 CPU#0
        MSR 8000000000009033 HID0 0000000000000000  HF 8000000000000000 iidx 3 didx 3
        TB 00000000 00000000 DECR 00000000
        GPR00 c0000000004b9044 c00000007cffba30 c00000000178c100 c00000007ef52480
        GPR04 0000000000000000 70616d00746f6f72 0000000000000020 c0000000783be440
        GPR08 fffffffffffffffc c00000007ef52490 0000000000000004 b0616d0009eccdb2
        GPR12 c0000000004b9000 c000000001b00000 0000000000000000 0000000000000000
        GPR16 0000000000000000 0000000000000000 00007fffc24f3b50 00007fffc24fff58
        GPR20 00000a1e96ece9d0 00000a1ea117c9b0 00000a1ea1184360 0000000000000003
        GPR24 00000a1e96e9e124 00000000000007ff c00000007eb2f7a0 c00000007cffbb10
        GPR28 c00000007d83a2c0 c0000000783cdf68 0000000000000000 c00000007cffbbf0
        CR 88002848  [ L  L  -  -  E  L  G  L  ]             RES ffffffffffffffff
         SRR0 0000772f954dd48c  SRR1 800000000280f033    PVR 00000000004e1202 VRSAVE 0000000000000000
        SPRG0 0000000000000000 SPRG1 c000000001b00000  SPRG2 0000772f9565a280  SPRG3 0000000000000000
        SPRG4 0000000000000000 SPRG5 0000000000000000  SPRG6 0000000000000000  SPRG7 0000000000000000
        HSRR0 0000000000000000 HSRR1 0000000000000000
         CFAR 0000000000000000
         LPCR 0000000003d40413
         PTCR 0000000000000000   DAR 0000772f95390000  DSISR 0000000042000000
      
      Fix this by setting vcpu->arch.hcall_needed = 0 to indicate completion
      of H_ENTER_NESTED before we exit to L0 userspace.
      
      Fixes: 360cae31 ("KVM: PPC: Book3S HV: Nested guest entry via hypercall")
      Cc: linuxppc-dev@ozlabs.org
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
      Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      6c08ec12
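
The effect of the stale hcall_needed flag in this commit can be reproduced with a small simulation. The struct and function names below are hypothetical stand-ins for the vcpu fields and code paths named in the commit message, and the flag is assumed to be set while the hcall is being handled, as the message implies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal stand-ins for the vcpu state named in the commit message. */
struct sim_vcpu {
    uint64_t gpr3;       /* hcall return value as seen by L1 */
    bool hcall_needed;   /* true: userspace is expected to supply the result */
};

/* Interrupted H_ENTER_NESTED exit; the fix is applied when `fixed` is true. */
void exit_interrupted_hcall(struct sim_vcpu *vcpu, bool fixed)
{
    vcpu->gpr3 = 0;                   /* report success to L1 */
    if (fixed)
        vcpu->hcall_needed = false;   /* the hcall is already complete */
}

/* Re-entry from userspace: the hcall result is copied in only if flagged. */
void reenter_from_userspace(struct sim_vcpu *vcpu, uint64_t userspace_ret)
{
    if (vcpu->hcall_needed) {
        vcpu->gpr3 = userspace_ret;
        vcpu->hcall_needed = false;
    }
}

void hcall_example(void)
{
    struct sim_vcpu v = { .gpr3 = 0, .hcall_needed = true };

    /* Without the fix, a value userspace never set overwrites r3. */
    exit_interrupted_hcall(&v, false);
    reenter_from_userspace(&v, 0xdeadbeef);
    assert(v.gpr3 == 0xdeadbeef);

    /* With the fix, r3 keeps the success value written on exit. */
    v.hcall_needed = true;
    exit_interrupted_hcall(&v, true);
    reenter_from_userspace(&v, 0xdeadbeef);
    assert(v.gpr3 == 0);
}
```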
  8. 26 Oct, 2018 1 commit
  9. 19 Oct, 2018 1 commit
    • KVM: PPC: Book3S HV: Don't use streamlined entry path on early POWER9 chips · 8d9fcacf
      Paul Mackerras committed
      This disables the use of the streamlined entry path for radix guests
      on early POWER9 chips that need the workaround added in commit
      a25bd72b ("powerpc/mm/radix: Workaround prefetch issue with KVM",
      2017-07-24), because the streamlined entry path does not include
      that workaround.  This also means that we can't do nested HV-KVM
      on those chips.
      
      Since the chips that need that workaround are the same ones that can't
      run both radix and HPT guests at the same time on different threads of
      a core, we use the existing 'no_mixing_hpt_and_radix' variable that
      identifies those chips to identify when we can't use the new guest
      entry path, and when we can't do nested virtualization.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      8d9fcacf
  10. 09 Oct, 2018 19 commits
    • KVM: PPC: Book3S HV: Add NO_HASH flag to GET_SMMU_INFO ioctl result · 901f8c3f
      Paul Mackerras committed
      This adds a KVM_PPC_NO_HASH flag to the flags field of the
      kvm_ppc_smmu_info struct, and arranges for it to be set when
      running as a nested hypervisor, as an unambiguous indication
      to userspace that HPT guests are not supported.  Reporting the
      KVM_CAP_PPC_MMU_HASH_V3 capability as false could be taken as
      indicating only that the new HPT features in ISA V3.0 are not
      supported, leaving it ambiguous whether pre-V3.0 HPT features
      are supported.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      901f8c3f
    • KVM: PPC: Book3S HV: Add a VM capability to enable nested virtualization · aa069a99
      Paul Mackerras committed
      With this, userspace can enable a KVM-HV guest to run nested guests
      under it.
      
      The administrator can control whether any nested guests can be run;
      setting the "nested" module parameter to false prevents any guests
      becoming nested hypervisors (that is, any attempt to enable the nested
      capability on a guest will fail).  Guests which are already nested
      hypervisors will continue to be so.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      aa069a99
    • KVM: PPC: Book3S HV: Allow HV module to load without hypervisor mode · de760db4
      Paul Mackerras committed
      With this, the KVM-HV module can be loaded in a guest running under
      KVM-HV, and if the hypervisor supports nested virtualization, this
      guest can now act as a nested hypervisor and run nested guests.
      
      This also adds some checks to inform userspace that HPT guests are not
      supported by nested hypervisors (by returning false for the
      KVM_CAP_PPC_MMU_HASH_V3 capability), and to prevent userspace from
      configuring a guest to use HPT mode.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      de760db4
    • KVM: PPC: Book3S HV: Add one-reg interface to virtual PTCR register · 30323418
      Paul Mackerras committed
      This adds a one-reg register identifier which can be used to read and
      set the virtual PTCR for the guest.  This register identifies the
      address and size of the virtual partition table for the guest, which
      contains information about the nested guests under this guest.
      
      Migrating this value is the only extra requirement for migrating a
      guest which has nested guests (assuming of course that the destination
      host supports nested virtualization in the kvm-hv module).
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      30323418
    • KVM: PPC: Book3S HV: Don't access HFSCR, LPIDR or LPCR when running nested · f3c99f97
      Paul Mackerras committed
      When running as a nested hypervisor, this avoids reading hypervisor
      privileged registers (specifically HFSCR, LPIDR and LPCR) at startup;
      instead reasonable default values are used.  This also avoids writing
      LPIDR in the single-vcpu entry/exit path.
      
      Also, this removes the check for CPU_FTR_HVMODE in kvmppc_mmu_hv_init()
      since its only caller already checks this.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f3c99f97
    • KVM: PPC: Book3S HV: Invalidate TLB when nested vcpu moves physical cpu · 9d0b048d
      Suraj Jitindar Singh committed
      This is only done at level 0, since only level 0 knows which physical
      CPU a vcpu is running on.  This does for nested guests what L0 already
      did for its own guests, which is to flush the TLB on a pCPU when it
      goes to run a vCPU there, and there is another vCPU in the same VM
      which previously ran on this pCPU and has now started to run on another
      pCPU.  This is to handle the situation where the other vCPU touched
      a mapping, moved to another pCPU and did a tlbiel (local-only tlbie)
      on that new pCPU and thus left behind a stale TLB entry on this pCPU.
      
      This introduces a limit on the vcpu_token values used in the
      H_ENTER_NESTED hcall -- they must now be less than NR_CPUS.
      
      [paulus@ozlabs.org - made prev_cpu array be short[] to reduce
       memory consumption.]
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      9d0b048d
    • KVM: PPC: Book3S HV: Implement H_TLB_INVALIDATE hcall · e3b6b466
      Suraj Jitindar Singh committed
      When running a nested (L2) guest the guest (L1) hypervisor will use
      the H_TLB_INVALIDATE hcall when it needs to change the partition
      scoped page tables or the partition table which it manages.  It will
      use this hcall in the situations where it would use a partition-scoped
      tlbie instruction if it were running in hypervisor mode.
      
      The H_TLB_INVALIDATE hcall can invalidate different scopes:
      
      Invalidate TLB for a given target address:
      - This invalidates a single L2 -> L1 pte
      - We need to invalidate any L2 -> L0 shadow_pgtable ptes which map the L2
        address space which is being invalidated. This is because a single
        L2 -> L1 pte may have been mapped with more than one pte in the
        L2 -> L0 page tables.
      
      Invalidate the entire TLB for a given LPID or for all LPIDs:
      - Invalidate the entire shadow_pgtable for a given nested guest, or
        for all nested guests.
      
      Invalidate the PWC (page walk cache) for a given LPID or for all LPIDs:
      - We don't cache the PWC, so nothing to do.
      
      Invalidate the entire TLB, PWC and partition table for a given/all LPIDs:
      - Here we re-read the partition table entry and remove the nested state
        for any nested guest for which the first doubleword of the partition
        table entry is now zero.
      
      The H_TLB_INVALIDATE hcall takes as parameters the tlbie instruction
      word (of which only the RIC, PRS and R fields are used), the rS value
      (giving the lpid, where required) and the rB value (giving the IS, AP
      and EPN values).
      
      [paulus@ozlabs.org - adapted to having the partition table in guest
      memory, added the H_TLB_INVALIDATE implementation, removed tlbie
      instruction emulation, reworded the commit message.]
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      e3b6b466
    • KVM: PPC: Book3S HV: Introduce rmap to track nested guest mappings · 8cf531ed
      Suraj Jitindar Singh committed
      When a host (L0) page which is mapped into a (L1) guest is in turn
      mapped through to a nested (L2) guest we keep a reverse mapping (rmap)
      so that these mappings can be retrieved later.
      
      Whenever we create an entry in a shadow_pgtable for a nested guest we
      create a corresponding rmap entry and add it to the list for the
      L1 guest memslot at the index of the L1 guest page it maps. This means
      at the L1 guest memslot we end up with lists of rmaps.
      
      When we are notified of a host page being invalidated which has been
      mapped through to a (L1) guest, we can then walk the rmap list for that
      guest page, and find and invalidate all of the corresponding
      shadow_pgtable entries.
      
      In order to reduce memory consumption, we compress the information for
      each rmap entry down to 52 bits -- 12 bits for the LPID and 40 bits
      for the guest real page frame number -- which will fit in a single
      unsigned long.  To avoid a scenario where a guest can trigger
      unbounded memory allocations, we scan the list when adding an entry to
      see if there is already an entry with the contents we need.  This can
      occur, because we don't ever remove entries from the middle of a list.
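The 52-bit packing can be sketched as below. The exact bit layout is an assumption made for illustration (LPID in the top 12 bits, page frame number in the next 40, low 12 bits left free for flags), and all names are hypothetical. `uint64_t` stands in for the 64-bit `unsigned long` the commit refers to.

```c
#include <stdint.h>

/* Assumed layout: [63:52] LPID, [51:12] guest real page frame number, [11:0] free. */
#define RMAP_LPID_SHIFT 52
#define RMAP_GFN_SHIFT  12
#define RMAP_LPID_MASK  (((UINT64_C(1) << 12) - 1) << RMAP_LPID_SHIFT)
#define RMAP_GFN_MASK   (((UINT64_C(1) << 40) - 1) << RMAP_GFN_SHIFT)

/* Pack a 12-bit LPID and a 40-bit frame number into one 64-bit word. */
static inline uint64_t rmap_pack(unsigned int lpid, uint64_t gfn)
{
    return (((uint64_t)lpid << RMAP_LPID_SHIFT) & RMAP_LPID_MASK) |
           ((gfn << RMAP_GFN_SHIFT) & RMAP_GFN_MASK);
}

static inline unsigned int rmap_lpid(uint64_t entry)
{
    return (unsigned int)((entry & RMAP_LPID_MASK) >> RMAP_LPID_SHIFT);
}

static inline uint64_t rmap_gfn(uint64_t entry)
{
    return (entry & RMAP_GFN_MASK) >> RMAP_GFN_SHIFT;
}
```

Leaving the low bits unused is what makes it possible to tag an entry with a flag bit without losing information.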
      
      A nested guest rmap struct consists of a list pointer and an rmap entry:
      ----------------
      | next pointer |
      ----------------
      | rmap entry   |
      ----------------
      
      Thus the rmap pointer for each guest frame number in the memslot can be
      either NULL, a single entry, or a pointer to a list of nested rmap entries.
      
      gfn	 memslot rmap array
       	-------------------------
       0	| NULL			|	(no rmap entry)
       	-------------------------
       1	| single rmap entry	|	(rmap entry with low bit set)
       	-------------------------
       2	| list head pointer	|	(list of rmap entries)
       	-------------------------
      
      The final entry always has the lowest bit set and is stored either in the
      next pointer of the last list entry or, when it is the only entry,
      directly in the memslot rmap array.
      A list of rmap entries looks like this:
      
      -----------------	-----------------	-------------------------
      | list head ptr	| ----> | next pointer	| ---->	| single rmap entry	|
      -----------------	-----------------	-------------------------
      			| rmap entry	|	| rmap entry		|
      			-----------------	-------------------------
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      8cf531ed
    • P
      KVM: PPC: Book3S HV: Handle hypercalls correctly when nested · 4bad7779
      Paul Mackerras committed
      When we are running as a nested hypervisor, we use a hypercall to
      enter the guest rather than code in book3s_hv_rmhandlers.S.  This means
      that the hypercall handlers listed in hcall_real_table never get called.
      There are some hypercalls that are handled there and not in
      kvmppc_pseries_do_hcall(), which therefore won't get processed for
      a nested guest.
      
      To fix this, we add cases to kvmppc_pseries_do_hcall() to handle those
      hypercalls, with the following exceptions:
      
      - The HPT hypercalls (H_ENTER, H_REMOVE, etc.) are not handled because
        we only support radix mode for nested guests.
      
      - H_CEDE has to be handled specially because the cede logic in
        kvmhv_run_single_vcpu assumes that it has been processed by the time
        that kvmhv_p9_guest_entry() returns.  Therefore we put a special
        case for H_CEDE in kvmhv_p9_guest_entry().
      
      For the XICS hypercalls, if real-mode processing is enabled, then the
      virtual-mode handlers assume that they are being called only to finish
      up the operation.  Therefore we turn off the real-mode flag in the XICS
      code when running as a nested hypervisor.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      4bad7779
    • P
      KVM: PPC: Book3S HV: Use XICS hypercalls when running as a nested hypervisor · f3c18e93
      Paul Mackerras committed
      This adds code to call the H_IPI and H_EOI hypercalls when we are
      running as a nested hypervisor (i.e. without the CPU_FTR_HVMODE cpu
      feature) and we would otherwise access the XICS interrupt controller
      directly or via an OPAL call.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f3c18e93
    • P
      KVM: PPC: Book3S HV: Nested guest entry via hypercall · 360cae31
      Paul Mackerras committed
      This adds a new hypercall, H_ENTER_NESTED, which is used by a nested
      hypervisor to enter one of its nested guests.  The hypercall supplies
      register values in two structs.  Those values are copied by the level 0
      (L0) hypervisor (the one which is running in hypervisor mode) into the
      vcpu struct of the L1 guest, and then the guest is run until an
      interrupt or error occurs which needs to be reported to L1 via the
      hypercall return value.
      
      Currently this assumes that the L0 and L1 hypervisors are the same
      endianness, and the structs passed as arguments are in native
      endianness.  If they are of different endianness, the version number
      check will fail and the hcall will be rejected.
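Why a version check is enough to catch an endianness mismatch can be illustrated with a small sketch. The struct, constant and function names here are hypothetical and the 64-bit version field is an assumption; the point is only that a small version number read with the wrong byte order becomes a very large value that cannot match.

```c
#include <stdint.h>

#define GUEST_STATE_VERSION UINT64_C(1)   /* hypothetical version constant */

/* Hypothetical header of a struct passed by the L1 hypervisor. */
struct guest_state_hdr {
    uint64_t version;
};

/* Reverse the byte order of a 64-bit value. */
static uint64_t swap64(uint64_t v)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (r << 8) | (v & 0xff);
        v >>= 8;
    }
    return r;
}

/* Returns 1 if the version matches, 0 if the hcall should be rejected. */
static int guest_state_version_ok(const struct guest_state_hdr *hdr)
{
    return hdr->version == GUEST_STATE_VERSION;
}
```

A struct written by an opposite-endian L1 would present version 1 as `swap64(1)`, which is 0x0100000000000000, so the comparison fails without any explicit endianness flag.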
      
      Nested hypervisors do not support indep_threads_mode=N, so this adds
      code to print a warning message if the administrator has set
      indep_threads_mode=N, and treat it as Y.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      360cae31
    • P
      KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization · 8e3f5fc1
      Paul Mackerras committed
      This starts the process of adding the code to support nested HV-style
      virtualization.  It defines a new H_SET_PARTITION_TABLE hypercall which
      a nested hypervisor can use to set the base address and size of a
      partition table in its memory (analogous to the PTCR register).
      On the host (level 0 hypervisor) side, the H_SET_PARTITION_TABLE
      hypercall from the guest is handled by code that saves the virtual
      PTCR value for the guest.
      
      This also adds code for creating and destroying nested guests and for
      reading the partition table entry for a nested guest from L1 memory.
      Each nested guest has its own shadow LPID value, different in general
      from the LPID value used by the nested hypervisor to refer to it.  The
      shadow LPID value is allocated at nested guest creation time.
      
      Nested hypervisor functionality is only available for a radix guest,
      which therefore means a radix host on a POWER9 (or later) processor.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      8e3f5fc1
    • S
      KVM: PPC: Book3S HV: Clear partition table entry on vm teardown · 89329c0b
      Suraj Jitindar Singh committed
      When destroying a VM we return the LPID to the pool, however we never
      zero the partition table entry. This is instead done when we reallocate
      the LPID.
      
      Zero the partition table entry on VM teardown before returning the LPID
      to the pool. This means if we were running as a nested hypervisor the
      real hypervisor could use this to determine when it can free resources.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      89329c0b
    • P
      KVM: PPC: Use ccr field in pt_regs struct embedded in vcpu struct · fd0944ba
      Paul Mackerras committed
      When the 'regs' field was added to struct kvm_vcpu_arch, the code
      was changed to use several of the fields inside regs (e.g., gpr, lr,
      etc.) but not the ccr field, because the ccr field in struct pt_regs
      is 64 bits on 64-bit platforms, but the cr field in kvm_vcpu_arch is
      only 32 bits.  This changes the code to use the regs.ccr field
      instead of cr, and changes the assembly code on 64-bit platforms to
      use 64-bit loads and stores instead of 32-bit ones.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      fd0944ba
    • P
      KVM: PPC: Book3S HV: Add a debugfs file to dump radix mappings · 9a94d3ee
      Paul Mackerras committed
      This adds a file called 'radix' in the debugfs directory for the
      guest, which when read gives all of the valid leaf PTEs in the
      partition-scoped radix tree for a radix guest, in human-readable
      format.  It is analogous to the existing 'htab' file which dumps
      the HPT entries for a HPT guest.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      9a94d3ee
    • P
      KVM: PPC: Book3S HV: Handle hypervisor instruction faults better · 32eb150a
      Paul Mackerras committed
      Currently the code for handling hypervisor instruction page faults
      passes 0 for the flags indicating the type of fault, which is OK in
      the usual case that the page is not mapped in the partition-scoped
      page tables.  However, there are other causes for hypervisor
      instruction page faults, such as not being able to update a reference
      (R) or change (C) bit.  The cause is indicated in bits in HSRR1,
      including a bit which indicates that the fault is due to not being
      able to write to a page (for example to update an R or C bit).
      Not handling these other kinds of faults correctly can lead to a
      loop of continual faults without forward progress in the guest.
      
      In order to handle these faults better, this patch constructs a
      "DSISR-like" value from the bits which DSISR and SRR1 (for a HISI)
      have in common, and passes it to kvmppc_book3s_hv_page_fault() so
      that it knows what caused the fault.
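The construction of the "DSISR-like" value can be sketched as below. The constant values are assumptions modelled on the kernel's register definitions and should be checked against the headers before being relied on. The function name is hypothetical, and mapping the "could not write" bit onto a store-fault flag is likewise an assumption about how such a value might be built.

```c
#include <stdint.h>

/* Assumed bit values; verify against the kernel's register definitions. */
#define DSISR_NOHPTE      UINT32_C(0x40000000)  /* no translation found */
#define DSISR_NOEXEC_OR_G UINT32_C(0x10000000)  /* no-execute or guarded page */
#define DSISR_PROTFAULT   UINT32_C(0x08000000)  /* protection fault */
#define DSISR_ISSTORE     UINT32_C(0x02000000)  /* access was a store */
#define HSRR1_HISI_WRITE  UINT32_C(0x00010000)  /* HISI: could not update memory */

/* Bits that occupy the same positions in DSISR and in (H)SRR1 for a HISI. */
#define DSISR_SRR1_COMMON (DSISR_NOHPTE | DSISR_NOEXEC_OR_G | DSISR_PROTFAULT)

/* Build a DSISR-like fault descriptor from the HSRR1 value saved at a HISI. */
static uint32_t hisi_to_dsisr(uint64_t hsrr1)
{
    uint32_t dsisr = (uint32_t)(hsrr1 & DSISR_SRR1_COMMON);

    /* A failed R/C update is reported as a write-type fault. */
    if (hsrr1 & HSRR1_HISI_WRITE)
        dsisr |= DSISR_ISSTORE;
    return dsisr;
}
```

With a value like this, the page fault handler can distinguish a missing mapping from a failed R or C update instead of treating every instruction fault as an unmapped page.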
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      32eb150a
    • P
      KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests · 95a6432c
      Paul Mackerras committed
      This creates an alternative guest entry/exit path which is used for
      radix guests on POWER9 systems when we have indep_threads_mode=Y.  In
      these circumstances there is exactly one vcpu per vcore and there is
      no coordination required between vcpus or vcores; the vcpu can enter
      the guest without needing to synchronize with anything else.
      
      The new fast path is implemented almost entirely in C in book3s_hv.c
      and runs with the MMU on until the guest is entered.  On guest exit
      we use the existing path until the point where we are committed to
      exiting the guest (as distinct from handling an interrupt in the
      low-level code and returning to the guest) and we have pulled the
      guest context from the XIVE.  At that point we check a flag in the
      stack frame to see whether we came in via the old path or the new
      path; if we came in via the new path then we go back to C code to do
      the rest of the process of saving the guest context and restoring the
      host context.
      
      The C code is split into separate functions for handling the
      OS-accessible state and the hypervisor state, with the idea that the
      latter can be replaced by a hypercall when we implement nested
      virtualization.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      [mpe: Fix CONFIG_ALTIVEC=n build]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      95a6432c
    • P
      KVM: PPC: Book3S HV: Call kvmppc_handle_exit_hv() with vcore unlocked · 53655ddd
      Paul Mackerras committed
      Currently kvmppc_handle_exit_hv() is called with the vcore lock held
      because it is called within a for_each_runnable_thread loop.
      However, we already unlock the vcore within kvmppc_handle_exit_hv()
      under certain circumstances, and this is safe because (a) any vcpus
      that become runnable and are added to the runnable set by
      kvmppc_run_vcpu() have their vcpu->arch.trap == 0 and can't actually
      run in the guest (because the vcore state is VCORE_EXITING), and
      (b) for_each_runnable_thread is safe against addition or removal
      of vcpus from the runnable set.
      
      Therefore, in order to simplify things for following patches, let's
      drop the vcore lock in the for_each_runnable_thread loop, so
      kvmppc_handle_exit_hv() gets called without the vcore lock held.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      53655ddd
    • P
      KVM: PPC: Book3S HV: Move interrupt delivery on guest entry to C code · f7035ce9
      Paul Mackerras committed
      This is based on a patch by Suraj Jitindar Singh.
      
      This moves the code in book3s_hv_rmhandlers.S that generates an
      external, decrementer or privileged doorbell interrupt just before
      entering the guest to C code in book3s_hv_builtin.c.  This is to
      make future maintenance and modification easier.  The algorithm
      expressed in the C code is almost identical to the previous
      algorithm.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f7035ce9
  11. 05 Oct, 2018 1 commit
    • P
      KVM: PPC: Book3S HV: Provide mode where all vCPUs on a core must be the same VM · aa227864
      Paul Mackerras committed
      This adds a mode where the vcore scheduling logic in HV KVM limits itself
      to scheduling only virtual cores from the same VM on any given physical
      core.  This is enabled via a new module parameter on the kvm-hv module
      called "one_vm_per_core".  For this to work on POWER9, it is necessary to
      set indep_threads_mode=N.  (On POWER8, hardware limitations mean that KVM
      is never in independent threads mode, regardless of the indep_threads_mode
      setting.)
      
      Thus the settings needed for this to work are:
      
      1. The host is in SMT1 mode.
      2. On POWER8, the host is not in 2-way or 4-way static split-core mode.
      3. On POWER9, the indep_threads_mode parameter is N.
      4. The one_vm_per_core parameter is Y.
      
      With these settings, KVM can run up to 4 vcpus on a core at the same
      time on POWER9, or up to 8 vcpus on POWER8 (depending on the guest
      threading mode), and will ensure that all of the vcpus belong to the
      same VM.
      
      This is intended for use in security-conscious settings where users are
      concerned about possible side-channel attacks between threads which could
      perhaps enable one VM to attack another VM on the same core, or the host.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      aa227864
  12. 21 Aug, 2018 1 commit