1. 01 December 2019, 1 commit
    • KVM: PPC: Book3S HV: Flush link stack on guest exit to host kernel · 345712c9
      Michael Ellerman committed
      commit af2e8c68b9c5403f77096969c516f742f5bb29e0 upstream.
      
      On some systems that are vulnerable to Spectre v2, it is up to
      software to flush the link stack (return address stack), in order to
      protect against Spectre-RSB.
      
      When exiting from a guest we do some housekeeping and then
      potentially exit to C code which is several stack frames deep in the
      host kernel. We will then execute a series of returns without
      preceding calls, opening up the possibility that the guest could have
      poisoned the link stack and could direct speculative execution of the
      host to a gadget of some sort.
      
      To prevent this we add a flush of the link stack on exit from a guest.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      [dja: straightforward backport to v4.19]
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      345712c9
  2. 24 November 2019, 2 commits
  3. 10 November 2019, 1 commit
  4. 12 October 2019, 5 commits
    • powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flag · d1e4b4cc
      Aneesh Kumar K.V committed
      commit 09ce98cacd51fcd0fa0af2f79d1e1d3192f4cbb0 upstream.
      
      Rename the #define to indicate that it relates to the store vs. tlbie
      ordering issue. In the next patch, we will be adding another feature
      flag that is used to handle the ERAT flush vs. tlbie ordering issue.
      
      Fixes: a5d4b589 ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190924035254.24612-2-aneesh.kumar@linux.ibm.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d1e4b4cc
    • KVM: PPC: Book3S HV: XIVE: Free escalation interrupts before disabling the VP · 34b13ff6
      Cédric Le Goater committed
      [ Upstream commit 237aed48c642328ff0ab19b63423634340224a06 ]
      
      When a vCPU is brought down, the XIVE VP (Virtual Processor) is first
      disabled and then the event notification queues are freed. When freeing
      the queues, we check for possible escalation interrupts and free them
      as well.
      
      But when a XIVE VP is disabled, the underlying XIVE ENDs are also
      disabled in OPAL. When an END (Event Notification Descriptor) is
      disabled, its ESB pages (ESn and ESe) are disabled and loads return all
      1s, which means that any access to the ESB page of the escalation
      interrupt will return invalid values.
      
      When an interrupt is freed, the shutdown handler computes a 'saved_p'
      field from the value returned by a load in xive_do_source_set_mask().
      This value is incorrect for escalation interrupts for the reason
      described above.
      
      This has no impact on Linux/KVM today because we don't make use of it,
      but future changes will introduce a xive_get_irqchip_state() handler.
      That handler will use the 'saved_p' field to return the state of an
      interrupt, and with 'saved_p' being incorrect, a softlockup will occur.
      
      Fix the vCPU cleanup sequence by first freeing the escalation interrupts,
      if any, then disabling the XIVE VP, and finally freeing the queues.
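
      A minimal, hedged sketch of the reordered cleanup (the two free_*
      helpers are illustrative placeholders, not the upstream function names):

        static void xive_cleanup_vcpu(struct kvmppc_xive_vcpu *xc)
        {
                /* 1. Free escalation interrupts while their ESB pages still work */
                free_escalation_interrupts(xc);         /* illustrative helper */
                /* 2. Only then disable the VP, which disables the ENDs in OPAL */
                xive_native_disable_vp(xc->vp_id);
                /* 3. Finally free the event notification queues */
                free_queues(xc);                        /* illustrative helper */
        }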
      
      Fixes: 90c73795afa2 ("KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation mode")
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190806172538.5087-1-clg@kaod.org
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      34b13ff6
    • KVM: PPC: Book3S HV: Don't lose pending doorbell request on migration on P9 · 30fbe0d3
      Paul Mackerras committed
      commit ff42df49e75f053a8a6b4c2533100cdcc23afe69 upstream.
      
      On POWER9, when userspace reads the value of the DPDES register on a
      vCPU, it is possible for 0 to be returned although there is a doorbell
      interrupt pending for the vCPU.  This can lead to a doorbell interrupt
      being lost across migration.  If the guest kernel uses doorbell
      interrupts for IPIs, then it could malfunction because of the lost
      interrupt.
      
      This happens because a newly-generated doorbell interrupt is signalled
      by setting vcpu->arch.doorbell_request to 1; the DPDES value in
      vcpu->arch.vcore->dpdes is not updated, because it can only be updated
      when holding the vcpu mutex, in order to avoid races.
      
      To fix this, we OR in vcpu->arch.doorbell_request when reading the
      DPDES value.
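
      A hedged sketch of the read side (not the verbatim upstream diff; the
      exact bit folded in is an assumption):

        /* When userspace reads DPDES, fold in a doorbell that has only been
         * recorded in doorbell_request and not yet propagated to the vcore. */
        u64 dpdes = vcpu->arch.vcore->dpdes;
        if (vcpu->arch.doorbell_request)
                dpdes |= 1;
        *val = get_reg_val(id, dpdes);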
      
      Cc: stable@vger.kernel.org # v4.13+
      Fixes: 57900694 ("KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      30fbe0d3
    • KVM: PPC: Book3S HV: Check for MMU ready on piggybacked virtual cores · 4faa7f05
      Paul Mackerras committed
      commit d28eafc5a64045c78136162af9d4ba42f8230080 upstream.
      
      When we are running multiple vcores on the same physical core, they
      could be from different VMs and so it is possible that one of the
      VMs could have its arch.mmu_ready flag cleared (for example by a
      concurrent HPT resize) when we go to run it on a physical core.
      We currently check the arch.mmu_ready flag for the primary vcore
      but not the flags for the other vcores that will be run alongside
      it.  This adds that check, and also a check when we select the
      secondary vcores from the preempted vcores list.
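
      A hedged sketch of the added check (loop context omitted; field names
      follow the text above):

        /* Skip a candidate vcore whose VM does not have its MMU set up. */
        if (!pvc->kvm->arch.mmu_ready)
                continue;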
      
      Cc: stable@vger.kernel.org # v4.14+
      Fixes: 38c53af8 ("KVM: PPC: Book3S HV: Fix exclusion between HPT resizing and other HPT updates")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4faa7f05
    • KVM: PPC: Book3S HV: Fix race in re-enabling XIVE escalation interrupts · 577a5119
      Paul Mackerras committed
      commit 959c5d5134786b4988b6fdd08e444aa67d1667ed upstream.
      
      Escalation interrupts are interrupts sent to the host by the XIVE
      hardware when it has an interrupt to deliver to a guest VCPU but that
      VCPU is not running anywhere in the system.  Hence we disable the
      escalation interrupt for the VCPU being run when we enter the guest
      and re-enable it when the guest does an H_CEDE hypercall indicating
      it is idle.
      
      It is possible that an escalation interrupt gets generated just as we
      are entering the guest.  In that case the escalation interrupt may be
      using a queue entry in one of the interrupt queues, and that queue
      entry may not have been processed when the guest exits with an H_CEDE.
      The existing entry code detects this situation and does not clear the
      vcpu->arch.xive_esc_on flag as an indication that there is a pending
      queue entry (if the queue entry gets processed, xive_esc_irq() will
      clear the flag).  There is a comment in the code saying that if the
      flag is still set on H_CEDE, we have to abort the cede rather than
      re-enabling the escalation interrupt, lest we end up with two
      occurrences of the escalation interrupt in the interrupt queue.
      
      However, the exit code doesn't do that; it aborts the cede in the sense
      that vcpu->arch.ceded gets cleared, but it still enables the escalation
      interrupt by setting the source's PQ bits to 00.  Instead we need to
      set the PQ bits to 10, indicating that an interrupt has been triggered.
      We also need to avoid setting vcpu->arch.xive_esc_on in this case
      (i.e. vcpu->arch.xive_esc_on seen to be set on H_CEDE) because
      xive_esc_irq() will run at some point and clear it, and if we race with
      that we may end up with an incorrect result (i.e. xive_esc_on set when
      the escalation interrupt has just been handled).
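
      A hedged C-level sketch of the changed H_CEDE exit path (the real change
      is in the guest exit assembly; names follow the text above):

        if (vcpu->arch.xive_esc_on) {
                /* Leave the escalation source "triggered" (PQ = 10) rather than
                 * re-enabling it (PQ = 00), and do not set xive_esc_on again. */
                __raw_readq((void __iomem *)(vcpu->arch.xive_esc_vaddr +
                                             XIVE_ESB_SET_PQ_10));
                vcpu->arch.ceded = 0;   /* abort the cede */
        }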
      
      It is extremely unlikely that having two queue entries would cause
      observable problems; theoretically it could cause queue overflow, but
      the CPU would have to have thousands of interrupts targeted to it for
      that to be possible.  However, this fix will also make it possible to
      determine accurately whether there is an unhandled escalation
      interrupt in the queue, which will be needed by the following patch.
      
      Fixes: 9b9b13a6 ("KVM: PPC: Book3S HV: Keep XIVE escalation interrupt masked unless ceded")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190813100349.GD9567@blackberry
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      577a5119
  5. 16 September 2019, 4 commits
    • KVM: PPC: Book3S HV: Fix CR0 setting in TM emulation · 3a1b79ad
      Michael Neuling committed
      [ Upstream commit 3fefd1cd95df04da67c83c1cb93b663f04b3324f ]
      
      When emulating tsr, treclaim and trechkpt, we incorrectly set CR0. The
      code currently sets:
          CR0 <- 00 || MSR[TS]
      but according to the ISA it should be:
          CR0 <-  0 || MSR[TS] || 0
      
      This fixes the bit shift to put the bits in the correct location.
      
      This is a data integrity issue as CR0 is corrupted.
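
      A hedged, standalone illustration of the bit-layout difference (the TS
      value is just an example):

        unsigned int msr_ts     = 0x2;             /* MSR[TS] = 0b10              */
        unsigned int cr0_before = msr_ts;          /* 00 || TS      -> 0b0010     */
        unsigned int cr0_after  = msr_ts << 1;     /* 0 || TS || 0  -> 0b0100     */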
      
      Fixes: 4bb3c7a0 ("KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9")
      Cc: stable@vger.kernel.org # v4.17+
      Tested-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      3a1b79ad
    • KVM: PPC: Use ccr field in pt_regs struct embedded in vcpu struct · 3ac71806
      Paul Mackerras committed
      [ Upstream commit fd0944baad806dfb4c777124ec712c55b714ff51 ]
      
      When the 'regs' field was added to struct kvm_vcpu_arch, the code
      was changed to use several of the fields inside regs (e.g., gpr, lr,
      etc.) but not the ccr field, because the ccr field in struct pt_regs
      is 64 bits on 64-bit platforms, but the cr field in kvm_vcpu_arch is
      only 32 bits.  This changes the code to use the regs.ccr field
      instead of cr, and changes the assembly code on 64-bit platforms to
      use 64-bit loads and stores instead of 32-bit ones.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      3ac71806
    • powerpc/kvm: Save and restore host AMR/IAMR/UAMOR · 915c9d0a
      Michael Ellerman committed
      [ Upstream commit c3c7470c75566a077c8dc71dcf8f1948b8ddfab4 ]
      
      When the hash MMU is active the AMR, IAMR and UAMOR are used for
      pkeys. The AMR is directly writable by user space, and the UAMOR masks
      those writes, meaning both registers are effectively user register
      state. The IAMR is used to create an execute only key.
      
      Also we must maintain the value of at least the AMR when running in
      process context, so that any memory accesses done by the kernel on
      behalf of the process are correctly controlled by the AMR.
      
      Although we are correctly switching all registers when going into a
      guest, on returning to the host we just write 0 into all regs, except
      on Power9 where we restore the IAMR correctly.
      
      This could be observed by a user process if it writes the AMR, then
      runs a guest and we then return immediately to it without
      rescheduling. Because we have written 0 to the AMR that would have the
      effect of granting read/write permission to pages that the process was
      trying to protect.
      
      In addition, when using the Radix MMU, the AMR can prevent inadvertent
      kernel access to userspace data; writing 0 to the AMR disables that
      protection.
      
      So save and restore AMR, IAMR and UAMOR.
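
      A hedged C-level sketch of the pattern (the actual fix is in the guest
      entry/exit assembly):

        unsigned long host_amr   = mfspr(SPRN_AMR);
        unsigned long host_iamr  = mfspr(SPRN_IAMR);
        unsigned long host_uamor = mfspr(SPRN_UAMOR);

        /* ... load the guest values and run the guest ... */

        mtspr(SPRN_AMR,   host_amr);
        mtspr(SPRN_IAMR,  host_iamr);
        mtspr(SPRN_UAMOR, host_uamor);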
      
      Fixes: cf43d3b2 ("powerpc: Enable pkey subsystem")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Russell Currey <ruscur@russell.cc>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Acked-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      915c9d0a
    • KVM: PPC: Book3S HV: Fix race between kvm_unmap_hva_range and MMU mode switch · d3984e80
      Paul Mackerras committed
      [ Upstream commit 234ff0b729ad882d20f7996591a964965647addf ]
      
      Testing has revealed an occasional crash which appears to be caused
      by a race between kvmppc_switch_mmu_to_hpt and kvm_unmap_hva_range_hv.
      The symptom is a NULL pointer dereference in __find_linux_pte() called
      from kvm_unmap_radix() with kvm->arch.pgtable == NULL.
      
      Looking at kvmppc_switch_mmu_to_hpt(), it does indeed clear
      kvm->arch.pgtable (via kvmppc_free_radix()) before setting
      kvm->arch.radix to NULL, and there is nothing to prevent
      kvm_unmap_hva_range_hv() or the other MMU callback functions from
      being called concurrently with kvmppc_switch_mmu_to_hpt() or
      kvmppc_switch_mmu_to_radix().
      
      This patch therefore adds calls to spin_lock/unlock on the kvm->mmu_lock
      around the assignments to kvm->arch.radix, and makes sure that the
      partition-scoped radix tree or HPT is only freed after changing
      kvm->arch.radix.
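
      A hedged sketch of the resulting ordering (not the exact upstream diff):

        spin_lock(&kvm->mmu_lock);
        kvm->arch.radix = 0;            /* flip the MMU mode under mmu_lock */
        spin_unlock(&kvm->mmu_lock);
        kvmppc_free_radix(kvm);         /* free the radix tree only afterwards */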
      
      This also takes the kvm->mmu_lock in kvmppc_rmap_reset() to make sure
      that the clearing of each rmap array (one per memslot) doesn't happen
      concurrently with use of the array in the kvm_unmap_hva_range_hv()
      or the other MMU callbacks.
      
      Fixes: 18c3640c ("KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix host")
      Cc: stable@vger.kernel.org # v4.15+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d3984e80
  6. 06 September 2019, 1 commit
  7. 16 August 2019, 1 commit
    • KVM: Fix leak vCPU's VMCS value into other pCPU · 2bc73d91
      Wanpeng Li committed
      commit 17e433b54393a6269acbcb792da97791fe1592d8 upstream.
      
      After commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts), a
      five-year-old bug is exposed. Running the ebizzy benchmark in three 80-vCPU VMs
      on an 80-pCPU Skylake server produces a lot of rcu_sched stall warnings
      in the VMs after stress testing:
      
       INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
       Call Trace:
         flush_tlb_mm_range+0x68/0x140
         tlb_flush_mmu.part.75+0x37/0xe0
         tlb_finish_mmu+0x55/0x60
         zap_page_range+0x142/0x190
         SyS_madvise+0x3cd/0x9c0
         system_call_fastpath+0x1c/0x21
      
      swait_active() remains true before finish_swait() is called in
      kvm_vcpu_block(), and voluntarily preempted vCPUs are taken into account
      by the kvm_vcpu_on_spin() loop, which greatly increases the probability
      that kvm_arch_vcpu_runnable(vcpu) is checked and found to be true. When
      APICv is enabled, the yield-candidate vCPU's VMCS RVI field leaks (via
      vmx_sync_pir_to_irr()) into the spinning-on-a-taken-lock vCPU's current
      VMCS.
      
      This patch fixes it by conservatively checking only a subset of events.
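
      A hedged sketch of the shape of the fix (the predicate body is an
      illustrative placeholder, not a real kernel helper):

        bool kvm_arch_dy_runnable(struct kvm_vcpu *vcpu)
        {
                /* Check only events that are cheap to test from another pCPU,
                 * so the candidate vCPU's VMCS never has to be loaded here. */
                return vcpu_has_cheap_pending_event(vcpu);      /* illustrative */
        }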
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Marc Zyngier <Marc.Zyngier@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2bc73d91
  8. 22 June 2019, 2 commits
    • KVM: PPC: Book3S HV: Don't take kvm->lock around kvm_for_each_vcpu · df6384e0
      Paul Mackerras committed
      [ Upstream commit 5a3f49364c3ffa1107bd88f8292406e98c5d206c ]
      
      Currently the HV KVM code takes the kvm->lock around calls to
      kvm_for_each_vcpu() and kvm_get_vcpu_by_id() (which can call
      kvm_for_each_vcpu() internally).  However, that leads to a lock
      order inversion problem, because these are called in contexts where
      the vcpu mutex is held, but the vcpu mutexes nest within kvm->lock
      according to Documentation/virtual/kvm/locking.txt.  Hence there
      is a possibility of deadlock.
      
      To fix this, we simply don't take the kvm->lock mutex around these
      calls.  This is safe because the implementations of kvm_for_each_vcpu()
      and kvm_get_vcpu_by_id() have been designed to be able to be called
      locklessly.
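
      A hedged sketch of a call site after the change (the surrounding
      mutex_lock(&kvm->lock)/mutex_unlock() pair is simply dropped):

        struct kvm_vcpu *vcpu;
        int i;

        kvm_for_each_vcpu(i, vcpu, kvm) {
                /* per-vCPU work; safe without holding kvm->lock */
        }
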
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      df6384e0
    • KVM: PPC: Book3S: Use new mutex to synchronize access to rtas token list · b376683f
      Paul Mackerras committed
      [ Upstream commit 1659e27d2bc1ef47b6d031abe01b467f18cb72d9 ]
      
      Currently the Book 3S KVM code uses kvm->lock to synchronize access
      to the kvm->arch.rtas_tokens list.  Because this list is scanned
      inside kvmppc_rtas_hcall(), which is called with the vcpu mutex held,
      taking kvm->lock causes a lock inversion problem, which could lead to
      a deadlock.
      
      To fix this, we add a new mutex, kvm->arch.rtas_token_lock, which nests
      inside the vcpu mutexes, and use that instead of kvm->lock when
      accessing the rtas token list.
      
      This removes the lockdep_assert_held() in kvmppc_rtas_tokens_free().
      At this point we don't hold the new mutex, but that is OK because
      kvmppc_rtas_tokens_free() is only called when the whole VM is being
      destroyed, and at that point nothing can be looking up a token in
      the list.
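
      A hedged sketch of the new locking (the callee is an illustrative
      placeholder):

        mutex_lock(&kvm->arch.rtas_token_lock);
        rc = rtas_token_lookup_or_define(kvm, &args);   /* illustrative helper */
        mutex_unlock(&kvm->arch.rtas_token_lock);
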
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b376683f
  9. 09 June 2019, 2 commits
    • KVM: s390: Do not report unusabled IDs via KVM_CAP_MAX_VCPU_ID · 6a2fbec7
      Thomas Huth committed
      commit a86cb413f4bf273a9d341a3ab2c2ca44e12eb317 upstream.
      
      KVM_CAP_MAX_VCPU_ID is currently always reporting KVM_MAX_VCPU_ID on all
      architectures. However, on s390x, the number of usable CPUs is determined
      at runtime - it depends on the features of the machine the code
      is running on. Since we are using the vcpu_id as an index into the SCA
      structures that are defined by the hardware (see e.g. the sca_add_vcpu()
      function), it is not only the number of CPUs that is limited by the
      hardware, but also the range of IDs that we can use.
      Thus KVM_CAP_MAX_VCPU_ID must be determined during runtime on s390x, too.
      So the handling of KVM_CAP_MAX_VCPU_ID has to be moved from the common
      code into the architecture specific code, and on s390x we have to return
      the same value here as for KVM_CAP_MAX_VCPUS.
      This problem has been discovered with the kvm_create_max_vcpus selftest.
      With this change applied, the selftest now passes on s390x, too.
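
      A hedged sketch of the per-architecture handling on s390x (the helper
      name is an illustrative placeholder):

        case KVM_CAP_MAX_VCPU_ID:
        case KVM_CAP_MAX_VCPUS:
                /* Both now report the runtime-determined, SCA-based limit. */
                r = kvm_s390_runtime_max_vcpus();       /* illustrative helper */
                break;
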
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Reviewed-by: Cornelia Huck <cohuck@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Thomas Huth <thuth@redhat.com>
      Message-Id: <20190523164309.13345-9-thuth@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6a2fbec7
    • KVM: PPC: Book3S HV: XIVE: Do not clear IRQ data of passthrough interrupts · 55a94d81
      Cédric Le Goater committed
      commit ef9740204051d0e00f5402fe96cf3a43ddd2bbbf upstream.
      
      The passthrough interrupts are defined at the host level and their IRQ
      data should not be cleared unless specifically deconfigured (shutdown)
      by the host. They differ from the IPI interrupts which are allocated
      by the XIVE KVM device and reserved to the guest usage only.
      
      This fixes a host crash when destroying a VM in which a PCI adapter
      was passed-through. In this case, the interrupt is cleared and freed
      by the KVM device and then shut down by vfio at the host level.
      
      [ 1007.360265] BUG: Kernel NULL pointer dereference at 0x00000d00
      [ 1007.360285] Faulting instruction address: 0xc00000000009da34
      [ 1007.360296] Oops: Kernel access of bad area, sig: 7 [#1]
      [ 1007.360303] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
      [ 1007.360314] Modules linked in: vhost_net vhost iptable_mangle ipt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc kvm_hv kvm xt_tcpudp iptable_filter squashfs fuse binfmt_misc vmx_crypto ib_iser rdma_cm iw_cm ib_cm libiscsi scsi_transport_iscsi nfsd ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress lzo_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq multipath mlx5_ib ib_uverbs ib_core crc32c_vpmsum mlx5_core
      [ 1007.360425] CPU: 9 PID: 15576 Comm: CPU 18/KVM Kdump: loaded Not tainted 5.1.0-gad7e7d0ef #4
      [ 1007.360454] NIP:  c00000000009da34 LR: c00000000009e50c CTR: c00000000009e5d0
      [ 1007.360482] REGS: c000007f24ccf330 TRAP: 0300   Not tainted  (5.1.0-gad7e7d0ef)
      [ 1007.360500] MSR:  900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24002484  XER: 00000000
      [ 1007.360532] CFAR: c00000000009da10 DAR: 0000000000000d00 DSISR: 00080000 IRQMASK: 1
      [ 1007.360532] GPR00: c00000000009e62c c000007f24ccf5c0 c000000001510600 c000007fe7f947c0
      [ 1007.360532] GPR04: 0000000000000d00 0000000000000000 0000000000000000 c000005eff02d200
      [ 1007.360532] GPR08: 0000000000400000 0000000000000000 0000000000000000 fffffffffffffffd
      [ 1007.360532] GPR12: c00000000009e5d0 c000007fffff7b00 0000000000000031 000000012c345718
      [ 1007.360532] GPR16: 0000000000000000 0000000000000008 0000000000418004 0000000000040100
      [ 1007.360532] GPR20: 0000000000000000 0000000008430000 00000000003c0000 0000000000000027
      [ 1007.360532] GPR24: 00000000000000ff 0000000000000000 00000000000000ff c000007faa90d98c
      [ 1007.360532] GPR28: c000007faa90da40 00000000000fe040 ffffffffffffffff c000007fe7f947c0
      [ 1007.360689] NIP [c00000000009da34] xive_esb_read+0x34/0x120
      [ 1007.360706] LR [c00000000009e50c] xive_do_source_set_mask.part.0+0x2c/0x50
      [ 1007.360732] Call Trace:
      [ 1007.360738] [c000007f24ccf5c0] [c000000000a6383c] snooze_loop+0x15c/0x270 (unreliable)
      [ 1007.360775] [c000007f24ccf5f0] [c00000000009e62c] xive_irq_shutdown+0x5c/0xe0
      [ 1007.360795] [c000007f24ccf630] [c00000000019e4a0] irq_shutdown+0x60/0xe0
      [ 1007.360813] [c000007f24ccf660] [c000000000198c44] __free_irq+0x3a4/0x420
      [ 1007.360831] [c000007f24ccf700] [c000000000198dc8] free_irq+0x78/0xe0
      [ 1007.360849] [c000007f24ccf730] [c00000000096c5a8] vfio_msi_set_vector_signal+0xa8/0x350
      [ 1007.360878] [c000007f24ccf7f0] [c00000000096c938] vfio_msi_set_block+0xe8/0x1e0
      [ 1007.360899] [c000007f24ccf850] [c00000000096cae0] vfio_msi_disable+0xb0/0x110
      [ 1007.360912] [c000007f24ccf8a0] [c00000000096cd04] vfio_pci_set_msi_trigger+0x1c4/0x3d0
      [ 1007.360922] [c000007f24ccf910] [c00000000096d910] vfio_pci_set_irqs_ioctl+0xa0/0x170
      [ 1007.360941] [c000007f24ccf930] [c00000000096b400] vfio_pci_disable+0x80/0x5e0
      [ 1007.360963] [c000007f24ccfa10] [c00000000096b9bc] vfio_pci_release+0x5c/0x90
      [ 1007.360991] [c000007f24ccfa40] [c000000000963a9c] vfio_device_fops_release+0x3c/0x70
      [ 1007.361012] [c000007f24ccfa70] [c0000000003b5668] __fput+0xc8/0x2b0
      [ 1007.361040] [c000007f24ccfac0] [c0000000001409b0] task_work_run+0x140/0x1b0
      [ 1007.361059] [c000007f24ccfb20] [c000000000118f8c] do_exit+0x3ac/0xd00
      [ 1007.361076] [c000007f24ccfc00] [c0000000001199b0] do_group_exit+0x60/0x100
      [ 1007.361094] [c000007f24ccfc40] [c00000000012b514] get_signal+0x1a4/0x8f0
      [ 1007.361112] [c000007f24ccfd30] [c000000000021cc8] do_notify_resume+0x1a8/0x430
      [ 1007.361141] [c000007f24ccfe20] [c00000000000e444] ret_from_except_lite+0x70/0x74
      [ 1007.361159] Instruction dump:
      [ 1007.361175] 38422c00 e9230000 712a0004 41820010 548a2036 7d442378 78840020 71290020
      [ 1007.361194] 4082004c e9230010 7c892214 7c0004ac <e9240000> 0c090000 4c00012c 792a0022
      
      Cc: stable@vger.kernel.org # v4.12+
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Signed-off-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      55a94d81
  10. 03 April 2019, 2 commits
  11. 13 February 2019, 1 commit
  12. 01 December 2018, 1 commit
    • KVM: PPC: Move and undef TRACE_INCLUDE_PATH/FILE · 8d976d7a
      Scott Wood committed
      [ Upstream commit 28c5bcf74fa07c25d5bd118d1271920f51ce2a98 ]
      
      TRACE_INCLUDE_PATH and TRACE_INCLUDE_FILE are used by
      <trace/define_trace.h>, so like that #include, they should
      be outside #ifdef protection.
      
      They also need to be #undefed before defining, in case multiple trace
      headers are included by the same C file.  This became the case on
      book3e after commit cf4a6085151a ("powerpc/mm: Add missing tracepoint for
      tlbie"), leading to the following build error:
      
         CC      arch/powerpc/kvm/powerpc.o
      In file included from arch/powerpc/kvm/powerpc.c:51:0:
      arch/powerpc/kvm/trace.h:9:0: error: "TRACE_INCLUDE_PATH" redefined
      [-Werror]
        #define TRACE_INCLUDE_PATH .
        ^
      In file included from arch/powerpc/kvm/../mm/mmu_decl.h:25:0,
                        from arch/powerpc/kvm/powerpc.c:48:
      ./arch/powerpc/include/asm/trace.h:224:0: note: this is the location of
      the previous definition
        #define TRACE_INCLUDE_PATH asm
        ^
      cc1: all warnings being treated as errors
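
      The resulting layout of such a trace header looks roughly like this (a
      hedged sketch, not the verbatim arch/powerpc/kvm/trace.h):

        #if !defined(_TRACE_KVM_H) || defined(TRACE_HEADER_MULTI_READ)
        #define _TRACE_KVM_H
        /* ... tracepoint definitions ... */
        #endif /* _TRACE_KVM_H */

        /* Outside the guard, and #undef'd first so that several trace headers
         * can be included from the same C file. */
        #undef TRACE_INCLUDE_PATH
        #define TRACE_INCLUDE_PATH .
        #undef TRACE_INCLUDE_FILE
        #define TRACE_INCLUDE_FILE trace

        #include <trace/define_trace.h>
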
      Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
      Signed-off-by: Scott Wood <oss@buserror.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      8d976d7a
  13. 04 October 2018, 1 commit
    • KVM: PPC: Book3S HV: Avoid crash from THP collapse during radix page fault · 6579804c
      Paul Mackerras committed
      Commit 71d29f43 ("KVM: PPC: Book3S HV: Don't use compound_order to
      determine host mapping size", 2018-09-11) added a call to 
      __find_linux_pte() and a dereference of the returned PTE pointer to the
      radix page fault path in the common case where the page is normal
      system memory.  Previously, __find_linux_pte() was only called for
      mappings to physical addresses which don't have a page struct (e.g.
      memory-mapped I/O) or where the page struct is marked as reserved
      memory.
      
      This exposes us to the possibility that the returned PTE pointer
      could be NULL, for example in the case of a concurrent THP collapse
      operation.  Dereferencing the returned NULL pointer causes a host
      crash.
      
      To fix this, we check for NULL, and if it is NULL, we retry the
      operation by returning to the guest, with the expectation that it
      will generate the same page fault again (unless of course it has
      been fixed up by another CPU in the meantime).
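
      A hedged sketch of the added check (fault-path context omitted; names
      follow the commit text where possible):

        ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
        if (!ptep)
                return RESUME_GUEST;    /* let the guest retry the access */
        pte = *ptep;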
      
      Fixes: 71d29f43 ("KVM: PPC: Book3S HV: Don't use compound_order to determine host mapping size")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      6579804c
  14. 12 September 2018, 2 commits
    • KVM: PPC: Book3S HV: Don't use compound_order to determine host mapping size · 71d29f43
      Nicholas Piggin committed
      THP paths can defer splitting compound pages until after the actual
      remap and TLB flushes to split a huge PMD/PUD. This causes radix
      partition scope page table mappings to get out of sync with the host
      qemu page table mappings.
      
      This results in random memory corruption in the guest when running
      with THP. The easiest way to reproduce is use KVM balloon to free up
      a lot of memory in the guest and then shrink the balloon to give the
      memory back, while some work is being done in the guest.
      
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: kvm-ppc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      71d29f43
    • KVM: PPC: Avoid marking DMA-mapped pages dirty in real mode · 425333bf
      Alexey Kardashevskiy committed
      At the moment the real mode handler of H_PUT_TCE calls iommu_tce_xchg_rm()
      which in turn reads the old TCE and if it was a valid entry, marks
      the physical page dirty if it was mapped for writing. Since it is in
      real mode, realmode_pfn_to_page() is used instead of pfn_to_page()
      to get the page struct. However SetPageDirty() itself reads the compound
      page head and returns a virtual address for the head page struct, and
      setting the dirty bit for that kills the system.
      
      This adds additional dirty bit tracking into the MM/IOMMU API for use
      in the real mode. Note that this does not change how VFIO and
      KVM (in virtual mode) set this bit. The KVM (real mode) changes include:
      - use the lowest bit of the cached host phys address to carry
      the dirty bit;
      - mark pages dirty when they are unpinned which happens when
      the preregistered memory is released which always happens in virtual
      mode;
      - add mm_iommu_ua_mark_dirty_rm() helper to set delayed dirty bit;
      - change iommu_tce_xchg_rm() to take the kvm struct for the mm to use
      in the new mm_iommu_ua_mark_dirty_rm() helper;
      - move iommu_tce_xchg_rm() to book3s_64_vio_hv.c (which is the only
      caller anyway) to reduce the real mode KVM and IOMMU knowledge
      across different subsystems.
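
      A hedged sketch of the delayed dirty tracking described in the first two
      points above (macro and field names follow this description and are
      assumptions, not the exact diff):

        #define MM_IOMMU_TABLE_GROUP_PAGE_DIRTY 0x1UL

        /* Real mode: just remember "dirty" in the low bit of the cached hpa. */
        mem->hpas[entry] |= MM_IOMMU_TABLE_GROUP_PAGE_DIRTY;

        /* Virtual mode, when the preregistered memory is released: */
        if (mem->hpas[entry] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
                SetPageDirty(pfn_to_page(mem->hpas[entry] >> PAGE_SHIFT));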
      
      This removes realmode_pfn_to_page() as it is not used anymore.
      
      While we are at it, remove some EXPORT_SYMBOL_GPL() as that code is for
      real mode only and modules cannot call it anyway.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      425333bf
  15. 24 August 2018, 1 commit
  16. 21 August 2018, 1 commit
  17. 20 August 2018, 1 commit
    • KVM: PPC: Book3S HV: Don't truncate HPTE index in xlate function · 46dec40f
      Paul Mackerras committed
      This fixes a bug which causes guest virtual addresses to get translated
      to guest real addresses incorrectly when the guest is using the HPT MMU
      and has more than 256GB of RAM, or more specifically has a HPT larger
      than 2GB.  This has showed up in testing as a failure of the host to
      emulate doorbell instructions correctly on POWER9 for HPT guests with
      more than 256GB of RAM.
      
      The bug is that the HPTE index in kvmppc_mmu_book3s_64_hv_xlate()
      is stored as an int, and in forming the HPTE address, the index gets
      shifted left 4 bits as an int before being signed-extended to 64 bits.
      The simple fix is to make the variable a long int, matching the
      return type of kvmppc_hv_find_lock_hpte(), which is what calculates
      the index.
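
      A hedged, standalone illustration of the truncation (values chosen to
      show the sign-extension; this is not the kernel code itself):

        unsigned int index = 0x08000000;      /* plausible for an HPT > 2GB      */
        long bad  = (int)(index << 4);        /* 0x80000000 sign-extends to
                                                 0xffffffff80000000              */
        long good = (long)index << 4;         /* 0x0000000080000000, as intended */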
      
      Fixes: 697d3899 ("KVM: PPC: Implement MMIO emulation support for Book3S HV guests")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      46dec40f
  18. 18 August 2018, 1 commit
  19. 15 August 2018, 1 commit
    • KVM: PPC: Book3S HV: Use correct pagesize in kvm_unmap_radix() · c066fafc
      Paul Mackerras committed
      Since commit e641a317 ("KVM: PPC: Book3S HV: Unify dirty page map
      between HPT and radix", 2017-10-26), kvm_unmap_radix() computes the
      number of PAGE_SIZEd pages being unmapped and passes it to
      kvmppc_update_dirty_map(), which expects to be passed the page size
      instead.  Consequently it will only mark one system page dirty even
      when a large page (for example a THP page) is being unmapped.  The
      consequence of this is that part of the THP page might not get copied
      during live migration, resulting in memory corruption for the guest.
      
      This fixes it by computing and passing the page size in kvm_unmap_radix().
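
      A hedged sketch of the corrected call (variable names illustrative):

        unsigned long page_size = 1UL << shift;   /* e.g. 2MB for a THP mapping */
        kvmppc_update_dirty_map(memslot, gfn, page_size);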
      
      Cc: stable@vger.kernel.org # v4.15+
      Fixes: e641a317 ("KVM: PPC: Book3S HV: Unify dirty page map between HPT and radix")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      c066fafc
  20. 30 July 2018, 4 commits
  21. 26 July 2018, 2 commits
    • KVM: PPC: Book3S HV: Read kvm->arch.emul_smt_mode under kvm->lock · b5c6f760
      Paul Mackerras committed
      Commit 1e175d2e ("KVM: PPC: Book3S HV: Pack VCORE IDs to access full
      VCPU ID space", 2018-07-25) added code that uses kvm->arch.emul_smt_mode
      before any VCPUs are created.  However, userspace can change
      kvm->arch.emul_smt_mode at any time up until the first VCPU is created.
      Hence it is (theoretically) possible for the check in
      kvmppc_core_vcpu_create_hv() to race with another userspace thread
      changing kvm->arch.emul_smt_mode.
      
      This fixes it by moving the test that uses kvm->arch.emul_smt_mode into
      the block where kvm->lock is held.
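
      A hedged sketch of the reordering (the exact condition and error
      handling are assumptions):

        mutex_lock(&kvm->lock);
        /* The test that reads kvm->arch.emul_smt_mode now runs under kvm->lock,
         * so it cannot race with userspace changing the mode. */
        if (id >= KVM_MAX_VCPUS * kvm->arch.emul_smt_mode) {
                mutex_unlock(&kvm->lock);
                return ERR_PTR(-EINVAL);
        }
        /* ... remainder of vCPU/vcore setup, still under kvm->lock ... */
        mutex_unlock(&kvm->lock);
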
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      b5c6f760
    • KVM: PPC: Book3S HV: Pack VCORE IDs to access full VCPU ID space · 1e175d2e
      Sam Bobroff committed
      It is not currently possible to create the full number of possible
      VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses fewer
      threads per core than its core stride (or "VSMT mode"). This is
      because the VCORE ID and XIVE offsets grow beyond KVM_MAX_VCPUS
      even though the VCPU ID is less than KVM_MAX_VCPU_ID.
      
      To address this, "pack" the VCORE ID and XIVE offsets by using
      knowledge of the way the VCPU IDs will be used when there are fewer
      guest threads per core than the core stride. The primary thread of
      each core will always be used first. Then, if the guest uses more than
      one thread per core, these secondary threads will sequentially follow
      the primary in each core.
      
      So, the only way an ID above KVM_MAX_VCPUS can be seen is if the
      VCPUs are being spaced apart, so at least half of each core is empty,
      and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
      into the second half of each core (4..7, in an 8-thread core).
      
      Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
      each core is being left empty, and we can map down into the second and
      third quarters of each core (2, 3 and 5, 6 in an 8-thread core).
      
      Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
      threads are being used and 7/8 of the core is empty, allowing use of
      the 1, 5, 3 and 7 thread slots.
      
      (Strides less than 8 are handled similarly.)
      
      This allows the VCORE ID or offset to be calculated quickly from the
      VCPU ID or XIVE server numbers, without access to the VCPU structure.
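
      A hedged, runnable illustration of the packing idea for an 8-thread core
      (the offset table follows the description above; the real kernel helper
      may differ in detail):

        static const unsigned int block_offsets[8] = { 0, 4, 2, 6, 1, 5, 3, 7 };

        static unsigned int pack_vcore_id(unsigned int vcpu_id, unsigned int max_vcpus)
        {
                unsigned int block = vcpu_id / max_vcpus;       /* 0..7 */

                return vcpu_id % max_vcpus + block_offsets[block];
        }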
      
      [paulus@ozlabs.org - tidied up comment a little, changed some WARN_ONCE
       to pr_devel, wrapped line, fixed id check.]
      Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      1e175d2e
  22. 18 July 2018, 3 commits
    • KVM: PPC: Check if IOMMU page is contained in the pinned physical page · 76fa4975
      Alexey Kardashevskiy committed
      A VM which has:
       - a DMA capable device passed through to it (eg. network card);
       - running a malicious kernel that ignores H_PUT_TCE failure;
      - capability of using IOMMU pages bigger than physical pages
      can create an IOMMU mapping that exposes (for example) 16MB of
      the host physical memory to the device when only 64K was allocated to the VM.
      
      The remaining 16MB - 64K will be some other content of host memory, possibly
      including pages of the VM, but also pages of host kernel memory, host
      programs or other VMs.
      
      The attacking VM does not control the location of the page it can map,
      and is only allowed to map as many pages as it has pages of RAM.
      
      We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
      an IOMMU page is contained in the physical page so the PCI hardware won't
      get access to unassigned host memory; however this check is missing in
      the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and
      did not hit this yet because the very first time the mapping happens
      we do not have tbl::it_userspace allocated yet and fall back to
      userspace, which in turn calls the VFIO IOMMU driver; this fails and
      the guest does not retry.
      
      This stores the smallest preregistered page size in the preregistered
      region descriptor and changes the mm_iommu_xxx API to check this against
      the IOMMU page size.
      
      This calculates maximum page size as a minimum of the natural region
      alignment and compound page size. For the page shift this uses the shift
      returned by find_linux_pte() which indicates how the page is mapped to
      the current userspace - if the page is huge and this is not a zero, then
      it is a leaf pte and the page is mapped within the range.
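
      A hedged sketch of the added check in the accelerated H_PUT_TCE path
      (the return value choice is illustrative):

        /* The IOMMU page must be contained within the pinned physical page. */
        if (mem->pageshift < tbl->it_page_shift)
                return H_PARAMETER;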
      
      Fixes: 121f80ba ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      76fa4975
    • KVM: PPC: Book3S HV: Fix constant size warning · 0abb75b7
      Nicholas Mc Guire committed
      The constants are 64-bit but not explicitly declared UL, resulting
      in sparse warnings. Fix this by declaring the constants UL.
      Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      0abb75b7
    • KVM: PPC: Book3S HV: Add of_node_put() in success path · 51eaa08f
      Nicholas Mc Guire committed
      The call to of_find_compatible_node() returns a pointer with an
      incremented refcount, so it must be explicitly decremented after the
      last use. As here it is only being used to check for node presence,
      and the result is not actually used in the success path, the reference
      can be dropped immediately.
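
      A hedged sketch of the pattern (the compatible string is illustrative):

        struct device_node *np;

        np = of_find_compatible_node(NULL, NULL, "ibm,opal-intc");
        if (!np)
                return -ENODEV;
        of_node_put(np);        /* only presence mattered; drop the reference */
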
      Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
      Fixes: commit f725758b ("KVM: PPC: Book3S HV: Use OPAL XICS emulation on POWER9")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      51eaa08f