1. 01 Nov 2017, 6 commits
    • KVM: PPC: Book3S HV: Unify dirty page map between HPT and radix · e641a317
      Paul Mackerras committed
      Currently, the HPT code in HV KVM maintains a dirty bit per guest page
      in the rmap array, whether or not dirty page tracking has been enabled
      for the memory slot.  In contrast, the radix code maintains a dirty
      bit per guest page in memslot->dirty_bitmap, and only does so when
      dirty page tracking has been enabled.
      
      This changes the HPT code to maintain the dirty bits in the memslot
      dirty_bitmap like radix does.  This results in slightly less code
      overall, and will mean that we do not lose the dirty bits when
      transitioning between HPT and radix mode in future.
      
      There is one minor change to behaviour as a result.  With HPT, when
      dirty tracking was enabled for a memslot, we would previously clear
      all the dirty bits at that point (both in the HPT entries and in the
      rmap arrays), meaning that a KVM_GET_DIRTY_LOG ioctl immediately
      following would show no pages as dirty (assuming no vcpus have run
      in the meantime).  With this change, the dirty bits on HPT entries
      are not cleared at the point where dirty tracking is enabled, so
      KVM_GET_DIRTY_LOG would show as dirty any guest pages that are
      resident in the HPT and dirty.  This is consistent with what happens
      on radix.
      
      This also fixes a bug in the mark_pages_dirty() function for radix
      (in the sense that the function no longer exists).  In the case where
      a large page of 64 normal pages or more is marked dirty, the
      addressing of the dirty bitmap was incorrect and could write past
      the end of the bitmap.  Fortunately this case was never hit in
      practice because a 2MB large page is only 32 x 64kB pages, and we
      don't support backing the guest with 1GB huge pages at this point.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      e641a317
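      A minimal user-space sketch (not the KVM code itself) of the
      dirty-bitmap bookkeeping this commit converges on: one bit per small
      guest page in the memslot's bitmap, and a large page marks a whole
      run of bits without running past the end of the bitmap.
      mark_dirty_range() and the sizes below are illustrative assumptions.

      #include <assert.h>
      #include <stdio.h>
      #include <string.h>

      #define BITS_PER_LONG (8 * sizeof(unsigned long))

      /* Set bits [gfn_offset, gfn_offset + npages) in the memslot bitmap. */
      static void mark_dirty_range(unsigned long *bitmap,
                                   unsigned long gfn_offset,
                                   unsigned long npages)
      {
          for (unsigned long i = 0; i < npages; i++) {
              unsigned long bit = gfn_offset + i;
              bitmap[bit / BITS_PER_LONG] |= 1UL << (bit % BITS_PER_LONG);
          }
      }

      int main(void)
      {
          /* A 1024-page memslot needs 1024 bits of dirty bitmap. */
          unsigned long bitmap[1024 / BITS_PER_LONG];
          memset(bitmap, 0, sizeof(bitmap));

          /* Dirty a large page made up of 256 small pages at page 512. */
          mark_dirty_range(bitmap, 512, 256);

          assert(bitmap[512 / BITS_PER_LONG] == ~0UL);  /* inside the range */
          assert(bitmap[768 / BITS_PER_LONG] == 0UL);   /* past it: untouched */
          printf("dirty bits stay inside the bitmap\n");
          return 0;
      }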
    • KVM: PPC: Book3S HV: Rename hpte_setup_done to mmu_ready · 1b151ce4
      Paul Mackerras committed
      This renames the kvm->arch.hpte_setup_done field to mmu_ready because
      we will want to use it for radix guests too -- both for setting things
      up before vcpu execution, and for excluding vcpus from executing while
      MMU-related things get changed, such as in future switching the MMU
      from radix to HPT mode or vice-versa.
      
      This also moves the call to kvmppc_setup_partition_table() that was
      done in kvmppc_hv_setup_htab_rma() for HPT guests, and the setting
      of mmu_ready, into the caller in kvmppc_vcpu_run_hv().
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      1b151ce4
    • KVM: PPC: Book3S HV: Don't rely on host's page size information · 8dc6cca5
      Paul Mackerras committed
      This removes the dependence of KVM on the mmu_psize_defs array (which
      stores information about hardware support for various page sizes) and
      the things derived from it, chiefly hpte_page_sizes[], hpte_page_size(),
      hpte_actual_page_size() and get_sllp_encoding().  We also no longer
      rely on the mmu_slb_size variable or the MMU_FTR_1T_SEGMENTS feature
      bit.
      
      The reason for doing this is so we can support a HPT guest on a radix
      host.  In a radix host, the mmu_psize_defs array contains information
      about page sizes supported by the MMU in radix mode rather than the
      page sizes supported by the MMU in HPT mode.  Similarly, mmu_slb_size
      and the MMU_FTR_1T_SEGMENTS bit are not set.
      
      Instead we hard-code knowledge of the behaviour of the HPT MMU in the
      POWER7, POWER8 and POWER9 processors (which are the only processors
      supported by HV KVM) - specifically the encoding of the LP fields in
      the HPT and SLB entries, and the fact that they have 32 SLB entries
      and support 1TB segments.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      8dc6cca5
    • KVM: PPC: Book3S: Fix gas warning due to using r0 as immediate 0 · 93897a1f
      Nicholas Piggin committed
      This fixes the message:
      
      arch/powerpc/kvm/book3s_segment.S: Assembler messages:
      arch/powerpc/kvm/book3s_segment.S:330: Warning: invalid register expression
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      93897a1f
    • KVM: PPC: Book3S PR: Only install valid SLBs during KVM_SET_SREGS · f4093ee9
      Greg Kurz committed
      Userland passes an array of 64 SLB descriptors to KVM_SET_SREGS,
      some of which are valid (ie, SLB_ESID_V is set) and the rest are
      likely all-zeroes (with QEMU at least).
      
      Each of them is then passed to kvmppc_mmu_book3s_64_slbmte(), which
      expects to find the SLB index in the lower 3 bits of its rb argument.
      When passed zeroed arguments, it happily overwrites the 0th SLB entry
      with zeroes. This is exactly what happens while doing live migration
      with QEMU when the destination pushes the incoming SLB descriptors to
      KVM PR. When reloading the SLBs at the next synchronization, QEMU first
      clears its SLB array and only restores valid ones, but the 0th one is
      now gone and we cannot access the corresponding memory anymore:
      
      (qemu) x/x $pc
      c0000000000b742c: Cannot access memory
      
      To avoid this, let's filter out non-valid SLB entries. While here, we
      also force a full SLB flush before installing new entries. Since SLB
      is for 64-bit only, we now build this path conditionally to avoid a
      build break on 32-bit, which doesn't define SLB_ESID_V.
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      f4093ee9
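      A sketch of the filtering described above, using local stand-ins for
      the KVM PR internals: flush_all_slbs() and install_slb() are
      hypothetical helpers, and SLB_ESID_V mirrors the architected SLBE
      valid bit.  The point is that zeroed descriptors from userspace are
      skipped instead of overwriting SLB slot 0.

      #include <stdint.h>
      #include <stdio.h>

      #define SLB_ESID_V 0x0000000008000000ULL  /* SLBE valid bit */

      struct slb_desc { uint64_t esid; uint64_t vsid; };

      static void flush_all_slbs(void) { printf("SLB flushed\n"); }
      static void install_slb(uint64_t vsid, uint64_t esid)
      {
          printf("slbmte vsid=%#llx esid=%#llx\n",
                 (unsigned long long)vsid, (unsigned long long)esid);
      }

      static void set_sregs_slbs(const struct slb_desc *slb, int n)
      {
          flush_all_slbs();                 /* start from a clean SLB */
          for (int i = 0; i < n; i++) {
              if (!(slb[i].esid & SLB_ESID_V))
                  continue;                 /* skip zeroed/invalid descriptors */
              install_slb(slb[i].vsid, slb[i].esid);
          }
      }

      int main(void)
      {
          struct slb_desc slbs[64] = {
              /* one valid entry; the other 63 stay all-zero as QEMU sends them */
              [0] = { .esid = 0xc000000000000000ULL | SLB_ESID_V, .vsid = 0x400 },
          };
          set_sregs_slbs(slbs, 64);
          return 0;
      }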
    • KVM: PPC: Book3S HV: Don't call real-mode XICS hypercall handlers if not enabled · 00bb6ae5
      Paul Mackerras committed
      When running a guest on a POWER9 system with the in-kernel XICS
      emulation disabled (for example by running QEMU with the parameter
      "-machine pseries,kernel_irqchip=off"), the kernel does not pass
      the XICS-related hypercalls such as H_CPPR up to userspace for
      emulation there as it should.
      
      The reason for this is that the real-mode handlers for these
      hypercalls don't check whether a XICS device has been instantiated
      before calling the xics-on-xive code.  That code doesn't check
      either, leading to potential NULL pointer dereferences because
      vcpu->arch.xive_vcpu is NULL.  Those dereferences won't cause an
      exception in real mode but will lead to kernel memory corruption.
      
      This fixes it by adding kvmppc_xics_enabled() checks before calling
      the XICS functions.
      
      Cc: stable@vger.kernel.org # v4.11+
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      00bb6ae5
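      A sketch of the guard this commit adds, with simplified types: if no
      in-kernel XICS device exists, the real-mode fast path bails out
      instead of dereferencing the NULL xive_vcpu pointer, so the
      hypercall can be handled elsewhere.  The structures and handler
      below are illustrative, not the actual KVM code.

      #include <stddef.h>
      #include <stdio.h>

      #define H_SUCCESS  0
      #define H_TOO_HARD 9999   /* kernel-internal "punt to a slower path" code */

      struct xive_vcpu_state { int cppr; };
      struct vcpu {
          struct xive_vcpu_state *xive_vcpu;  /* NULL with kernel_irqchip=off */
          int xics_enabled;                   /* kvmppc_xics_enabled() stand-in */
      };

      static long real_mode_h_cppr(struct vcpu *vcpu, unsigned long cppr)
      {
          if (!vcpu->xics_enabled)
              return H_TOO_HARD;              /* let a higher layer handle it */
          vcpu->xive_vcpu->cppr = (int)cppr;  /* safe: the device exists */
          return H_SUCCESS;
      }

      int main(void)
      {
          struct vcpu v = { .xive_vcpu = NULL, .xics_enabled = 0 };
          printf("rc=%ld (punted, no NULL dereference)\n",
                 real_mode_h_cppr(&v, 5));
          return 0;
      }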
  2. 20 Oct 2017, 1 commit
    • KVM: PPC: Tie KVM_CAP_PPC_HTM to the user-visible TM feature · 2a3d6553
      Michael Ellerman committed
      Currently we use CPU_FTR_TM to decide if the CPU/kernel can support
      TM (Transactional Memory), and if it's true we advertise that to
      QEMU (or similar) via KVM_CAP_PPC_HTM.
      
      PPC_FEATURE2_HTM is the user-visible feature bit, which indicates that
      the CPU and kernel can support TM. Currently CPU_FTR_TM and
      PPC_FEATURE2_HTM always have the same value, either true or false, so
      using the former for KVM_CAP_PPC_HTM is correct.
      
      However, some POWER9 CPUs can operate in a mode where TM is enabled but
      TM suspended state is disabled. In this mode CPU_FTR_TM is true, but
      PPC_FEATURE2_HTM is false. Instead a different PPC_FEATURE2 bit is
      set, to indicate that this different mode of TM is available.
      
      It is not safe to let guests use TM as-is, when the CPU is in this
      mode. So to prevent that from happening, use PPC_FEATURE2_HTM to
      determine the value of KVM_CAP_PPC_HTM.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      2a3d6553
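      A sketch of the capability decision described above, with
      illustrative feature-word plumbing (the real code reads
      cur_cpu_spec): KVM_CAP_PPC_HTM follows the user-visible
      PPC_FEATURE2_HTM bit rather than the kernel-internal CPU_FTR_TM bit.

      #include <stdbool.h>
      #include <stdio.h>

      #define PPC_FEATURE2_HTM 0x40000000u  /* user-visible HWCAP2 bit */
      #define CPU_FTR_TM       (1u << 0)    /* stand-in value for the sketch */

      struct cpu_spec { unsigned cpu_features; unsigned cpu_user_features2; };

      /* advertise KVM_CAP_PPC_HTM only from the user-visible bit */
      static bool cap_ppc_htm(const struct cpu_spec *spec)
      {
          return spec->cpu_user_features2 & PPC_FEATURE2_HTM;
      }

      int main(void)
      {
          /* POWER9 mode with TM on but TM suspend disabled:
           * CPU_FTR_TM set, PPC_FEATURE2_HTM clear -> do not advertise. */
          struct cpu_spec p9 = { .cpu_features = CPU_FTR_TM,
                                 .cpu_user_features2 = 0 };
          printf("KVM_CAP_PPC_HTM = %d\n", cap_ppc_htm(&p9));
          return 0;
      }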
  3. 19 Oct 2017, 1 commit
  4. 16 Oct 2017, 1 commit
    • KVM: PPC: Book3S HV: Explicitly disable HPT operations on radix guests · 891f1ebf
      Paul Mackerras committed
      This adds code to make sure that we don't try to access the
      non-existent HPT for a radix guest using the htab file for the VM
      in debugfs, a file descriptor obtained using the KVM_PPC_GET_HTAB_FD
      ioctl, or via the KVM_PPC_RESIZE_HPT_{PREPARE,COMMIT} ioctls.
      
      At present nothing bad happens if userspace does access these
      interfaces on a radix guest, mostly because kvmppc_hpt_npte()
      gives 0 for a radix guest, which in turn is because 1 << -4
      comes out as 0 on POWER processors.  However, that relies on
      undefined behaviour, so it is better to be explicit about not
      accessing the HPT for a radix guest.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      891f1ebf
  5. 14 Oct 2017, 5 commits
  6. 03 Oct 2017, 1 commit
  7. 22 Sep 2017, 1 commit
    • KVM: PPC: Book3S HV: Check for updated HDSISR on P9 HDSI exception · e001fa78
      Michael Neuling committed
      On POWER9 DD2.1 and below, sometimes on a Hypervisor Data Storage
      Interrupt (HDSI) the HDSISR is not updated at all.
      
      To work around this we put a canary value into the HDSISR before
      returning to a guest and then check for this canary when we take a
      HDSI. If we find the canary on a HDSI, we know the hardware didn't
      update the HDSISR. In this case we return to the guest to retake the
      HDSI, which should correctly update the HDSISR on the second HDSI
      entry.
      
      After talking to Paulus we've applied this workaround to all POWER9
      CPUs. The workaround of returning to the guest shouldn't ever be
      triggered on a well-behaved CPU. The extra instructions should have
      negligible performance impact.
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e001fa78
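      A user-space sketch of the canary scheme: the SPR accessors and the
      canary value are stand-ins, but the control flow matches the
      description above, writing a canary before guest entry and retrying
      the interrupt when the canary is still there.

      #include <stdint.h>
      #include <stdio.h>

      #define HDSISR_CANARY 0x7fff  /* a value the hardware would never report */

      static uint32_t hdsisr;                       /* pretend SPR */
      static void     mtspr_hdsisr(uint32_t v) { hdsisr = v; }
      static uint32_t mfspr_hdsisr(void)       { return hdsisr; }

      static void enter_guest(void) { mtspr_hdsisr(HDSISR_CANARY); }

      /* returns 1 if the guest should re-enter and retake the HDSI */
      static int handle_hdsi(void)
      {
          uint32_t dsisr = mfspr_hdsisr();
          if (dsisr == HDSISR_CANARY)
              return 1;     /* hardware didn't update HDSISR: retry */
          printf("handling HDSI, HDSISR=%#x\n", dsisr);
          return 0;
      }

      int main(void)
      {
          enter_guest();
          /* simulate the erratum: HDSISR left untouched on the interrupt */
          if (handle_hdsi())
              printf("canary seen, returning to guest to retake the HDSI\n");
          return 0;
      }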
  8. 15 Sep 2017, 1 commit
  9. 12 Sep 2017, 3 commits
    • KVM: PPC: Book3S HV: Fix bug causing host SLB to be restored incorrectly · 67f8a8c1
      Paul Mackerras committed
      Aneesh Kumar reported seeing host crashes when running recent kernels
      on POWER8.  The symptom was an oops like this:
      
      Unable to handle kernel paging request for data at address 0xf00000000786c620
      Faulting instruction address: 0xc00000000030e1e4
      Oops: Kernel access of bad area, sig: 11 [#1]
      LE SMP NR_CPUS=2048 NUMA PowerNV
      Modules linked in: powernv_op_panel
      CPU: 24 PID: 6663 Comm: qemu-system-ppc Tainted: G        W 4.13.0-rc7-43932-gfc36c59 #2
      task: c000000fdeadfe80 task.stack: c000000fdeb68000
      NIP:  c00000000030e1e4 LR: c00000000030de6c CTR: c000000000103620
      REGS: c000000fdeb6b450 TRAP: 0300   Tainted: G        W        (4.13.0-rc7-43932-gfc36c59)
      MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24044428  XER: 20000000
      CFAR: c00000000030e134 DAR: f00000000786c620 DSISR: 40000000 SOFTE: 0
      GPR00: 0000000000000000 c000000fdeb6b6d0 c0000000010bd000 000000000000e1b0
      GPR04: c00000000115e168 c000001fffa6e4b0 c00000000115d000 c000001e1b180386
      GPR08: f000000000000000 c000000f9a8913e0 f00000000786c600 00007fff587d0000
      GPR12: c000000fdeb68000 c00000000fb0f000 0000000000000001 00007fff587cffff
      GPR16: 0000000000000000 c000000000000000 00000000003fffff c000000fdebfe1f8
      GPR20: 0000000000000004 c000000fdeb6b8a8 0000000000000001 0008000000000040
      GPR24: 07000000000000c0 00007fff587cffff c000000fdec20bf8 00007fff587d0000
      GPR28: c000000fdeca9ac0 00007fff587d0000 00007fff587c0000 00007fff587d0000
      NIP [c00000000030e1e4] __get_user_pages_fast+0x434/0x1070
      LR [c00000000030de6c] __get_user_pages_fast+0xbc/0x1070
      Call Trace:
      [c000000fdeb6b6d0] [c00000000139dab8] lock_classes+0x0/0x35fe50 (unreliable)
      [c000000fdeb6b7e0] [c00000000030ef38] get_user_pages_fast+0xf8/0x120
      [c000000fdeb6b830] [c000000000112318] kvmppc_book3s_hv_page_fault+0x308/0xf30
      [c000000fdeb6b960] [c00000000010e10c] kvmppc_vcpu_run_hv+0xfdc/0x1f00
      [c000000fdeb6bb20] [c0000000000e915c] kvmppc_vcpu_run+0x2c/0x40
      [c000000fdeb6bb40] [c0000000000e5650] kvm_arch_vcpu_ioctl_run+0x110/0x300
      [c000000fdeb6bbe0] [c0000000000d6468] kvm_vcpu_ioctl+0x528/0x900
      [c000000fdeb6bd40] [c0000000003bc04c] do_vfs_ioctl+0xcc/0x950
      [c000000fdeb6bde0] [c0000000003bc930] SyS_ioctl+0x60/0x100
      [c000000fdeb6be30] [c00000000000b96c] system_call+0x58/0x6c
      Instruction dump:
      7ca81a14 2fa50000 41de0010 7cc8182a 68c60002 78c6ffe2 0b060000 3cc2000a
      794a3664 390610d8 e9080000 7d485214 <e90a0020> 7d435378 790507e1 408202f0
      ---[ end trace fad4a342d0414aa2 ]---
      
      It turns out that what has happened is that the SLB entry for the
      vmemmap region hasn't been reloaded on exit from a guest, and it has
      the wrong page size.  Then, when the host next accesses the vmemmap
      region, it gets a page fault.
      
      Commit a25bd72b ("powerpc/mm/radix: Workaround prefetch issue with
      KVM", 2017-07-24) modified the guest exit code so that it now only clears
      out the SLB for hash guests.  The code tests the radix flag and puts the
      result in a non-volatile CR field, CR2, and later branches based on CR2.
      
      Unfortunately, the kvmppc_save_tm function, which gets called between
      those two points, modifies all the user-visible registers in the case
      where the guest was in transactional or suspended state, except for a
      few which it restores (namely r1, r2, r9 and r13).  Thus the hash/radix
      indication in CR2 gets corrupted.
      
      This fixes the problem by re-doing the comparison just before the
      result is needed.  For good measure, this also adds comments next to
      the call sites of kvmppc_save_tm and kvmppc_restore_tm pointing out
      that non-volatile register state will be lost.
      
      Cc: stable@vger.kernel.org # v4.13
      Fixes: a25bd72b ("powerpc/mm/radix: Workaround prefetch issue with KVM")
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      67f8a8c1
    • KVM: PPC: Book3S HV: Hold kvm->lock around call to kvmppc_update_lpcr · cf5f6f31
      Paul Mackerras committed
      Commit 468808bd ("KVM: PPC: Book3S HV: Set process table for HPT
      guests on POWER9", 2017-01-30) added a call to kvmppc_update_lpcr()
      which doesn't hold the kvm->lock mutex around the call, as required.
      This adds the lock/unlock pair, and for good measure, includes
      the kvmppc_setup_partition_table() call in the locked region, since
      it is altering global state of the VM.
      
      This error appears not to have any fatal consequences for the host;
      the consequences would be that the VCPUs could end up running with
      different LPCR values, or an update to the LPCR value by userspace
      using the one_reg interface could get overwritten, or the update
      done by kvmhv_configure_mmu() could get overwritten.
      
      Cc: stable@vger.kernel.org # v4.10+
      Fixes: 468808bd ("KVM: PPC: Book3S HV: Set process table for HPT guests on POWER9")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      cf5f6f31
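      A sketch of the locking rule, with pthread and simplified types
      standing in for kvm->lock and the VM state; the function names
      follow the commit but the bodies are illustrative.  Both the
      partition table setup and the LPCR update sit inside one locked
      region so concurrent updaters cannot interleave.

      #include <pthread.h>
      #include <stdio.h>

      struct kvm {
          pthread_mutex_t lock;
          unsigned long lpcr;
          unsigned long ptcr;
      };

      static void kvmppc_setup_partition_table(struct kvm *kvm)
      {
          kvm->ptcr = 0x1000;                    /* illustrative value */
      }

      /* must be called with kvm->lock held */
      static void kvmppc_update_lpcr(struct kvm *kvm, unsigned long val,
                                     unsigned long mask)
      {
          kvm->lpcr = (kvm->lpcr & ~mask) | (val & mask);
      }

      static void configure_mmu(struct kvm *kvm)
      {
          pthread_mutex_lock(&kvm->lock);
          kvmppc_setup_partition_table(kvm);     /* global VM state */
          kvmppc_update_lpcr(kvm, 0x4, 0x4);     /* illustrative LPCR bits */
          pthread_mutex_unlock(&kvm->lock);
      }

      int main(void)
      {
          struct kvm kvm = { .lock = PTHREAD_MUTEX_INITIALIZER };
          configure_mmu(&kvm);
          printf("lpcr=%#lx ptcr=%#lx\n", kvm.lpcr, kvm.ptcr);
          return 0;
      }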
    • KVM: PPC: Book3S HV: Don't access XIVE PIPR register using byte accesses · d222af07
      Benjamin Herrenschmidt committed
      The XIVE interrupt controller on POWER9 machines doesn't support byte
      accesses to any register in the thread management area other than the
      CPPR (current processor priority register).  In particular, when
      reading the PIPR (pending interrupt priority register), we need to
      do a 32-bit or 64-bit load.
      
      Cc: stable@vger.kernel.org # v4.13
      Fixes: 2c4fb78f ("KVM: PPC: Book3S HV: Workaround POWER9 DD1.0 bug causing IPB bit loss")
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      d222af07
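      A small sketch of the access pattern required: instead of a byte
      load at the PIPR offset, do one 64-bit load of the containing
      quadword and extract the byte.  The register block and byte offsets
      here are made up, not the real XIVE TIMA layout.

      #include <stdint.h>
      #include <stdio.h>

      /* pretend 8-byte register block; byte 5 plays the role of PIPR here */
      static volatile uint64_t tm_qw = 0x00000000ff050000ULL;

      static uint8_t read_byte_via_qword(unsigned byte_off)
      {
          uint64_t qw = tm_qw;                          /* one 64-bit load */
          return (uint8_t)(qw >> (8 * (7 - byte_off))); /* big-endian byte order */
      }

      int main(void)
      {
          printf("PIPR-like byte = %#x\n", read_byte_via_qword(5));
          return 0;
      }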
  10. 01 Sep 2017, 1 commit
  11. 31 Aug 2017, 7 commits
  12. 30 Aug 2017, 1 commit
    • KVM: PPC: Book3S HV: Protect updates to spapr_tce_tables list · edd03602
      Paul Mackerras committed
      Al Viro pointed out that while one thread of a process is executing
      in kvm_vm_ioctl_create_spapr_tce(), another thread could guess the
      file descriptor returned by anon_inode_getfd() and close() it before
      the first thread has added it to the kvm->arch.spapr_tce_tables list.
      That highlights a more general problem: there is no mutual exclusion
      between writers to the spapr_tce_tables list, leading to the
      possibility of the list becoming corrupted, which could cause a
      host kernel crash.
      
      To fix the mutual exclusion problem, we add a mutex_lock/unlock
      pair around the list_del_rcu in kvm_spapr_tce_release().  Also,
      this moves the call to anon_inode_getfd() inside the region
      protected by the kvm->lock mutex, after we have done the check for
      a duplicate LIOBN.  This means that if another thread does guess the
      file descriptor and closes it, its call to kvm_spapr_tce_release()
      will not do any harm because it will have to wait until the first
      thread has released kvm->lock.  With this, there are no failure
      points in kvm_vm_ioctl_create_spapr_tce() after the call to
      anon_inode_getfd().
      
      The other things that the second thread could do with the guessed
      file descriptor are to mmap it or to pass it as a parameter to a
      KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE ioctl on a KVM device fd.  An mmap
      call won't cause any harm because kvm_spapr_tce_mmap() and
      kvm_spapr_tce_fault() don't access the spapr_tce_tables list or
      the kvmppc_spapr_tce_table.list field, and the fields that they do use
      have been properly initialized by the time of the anon_inode_getfd()
      call.
      
      The KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE ioctl calls
      kvm_spapr_tce_attach_iommu_group(), which scans the spapr_tce_tables
      list looking for the kvmppc_spapr_tce_table struct corresponding to
      the fd given as the parameter.  Either it will find the new entry
      or it won't; if it doesn't, it just returns an error, and if it
      does, it will function normally.  So, in each case there is no
      harmful effect.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      edd03602
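      A user-space sketch of the ordering this commit establishes, with
      pthread and a plain linked list standing in for kvm->lock and the
      RCU list: the duplicate-LIOBN check, the fd creation and the list
      insertion all happen under the lock, and the release path takes the
      same lock around the unlink.

      #include <pthread.h>
      #include <stdio.h>

      struct tce_table { unsigned long liobn; int fd; struct tce_table *next; };

      static pthread_mutex_t kvm_lock = PTHREAD_MUTEX_INITIALIZER;
      static struct tce_table *tables;          /* kvm->arch.spapr_tce_tables */
      static int next_fd = 3;

      static int create_spapr_tce(struct tce_table *stt, unsigned long liobn)
      {
          pthread_mutex_lock(&kvm_lock);
          for (struct tce_table *t = tables; t; t = t->next) {
              if (t->liobn == liobn) {          /* duplicate LIOBN: reject */
                  pthread_mutex_unlock(&kvm_lock);
                  return -1;
              }
          }
          stt->liobn = liobn;
          stt->fd = next_fd++;                  /* anon_inode_getfd() stand-in */
          stt->next = tables;                   /* list_add_rcu() stand-in */
          tables = stt;
          pthread_mutex_unlock(&kvm_lock);
          return stt->fd;                       /* no failure points after this */
      }

      static void release_spapr_tce(struct tce_table *stt)
      {
          pthread_mutex_lock(&kvm_lock);        /* serialise with creators */
          for (struct tce_table **p = &tables; *p; p = &(*p)->next) {
              if (*p == stt) {                  /* list_del_rcu() stand-in */
                  *p = stt->next;
                  break;
              }
          }
          pthread_mutex_unlock(&kvm_lock);
      }

      int main(void)
      {
          struct tce_table a, b;
          printf("fd=%d\n", create_spapr_tce(&a, 0x10000));
          printf("dup=%d\n", create_spapr_tce(&b, 0x10000));
          release_spapr_tce(&a);
          return 0;
      }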
  13. 29 Aug 2017, 1 commit
  14. 25 Aug 2017, 1 commit
    • KVM: PPC: Book3S: Fix race and leak in kvm_vm_ioctl_create_spapr_tce() · 47c5310a
      Paul Mackerras committed
      Nixiaoming pointed out that there is a memory leak in
      kvm_vm_ioctl_create_spapr_tce() if the call to anon_inode_getfd()
      fails; the memory allocated for the kvmppc_spapr_tce_table struct
      is not freed, nor are the pages allocated for the iommu
      tables.  In addition, we have already incremented the process's
      count of locked memory pages, and this doesn't get restored on
      error.
      
      David Hildenbrand pointed out that there is a race in that the
      function checks early on that there is not already an entry in the
      stt->iommu_tables list with the same LIOBN, but an entry with the
      same LIOBN could get added between then and when the new entry is
      added to the list.
      
      This fixes all three problems.  To simplify things, we now call
      anon_inode_getfd() before placing the new entry in the list.  The
      check for an existing entry is done while holding the kvm->lock
      mutex, immediately before adding the new entry to the list.
      Finally, on failure we now call kvmppc_account_memlimit to
      decrement the process's count of locked memory pages.
      Reported-by: Nixiaoming <nixiaoming@huawei.com>
      Reported-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      47c5310a
  15. 24 Aug 2017, 3 commits
  16. 17 Aug 2017, 1 commit
    • powerpc/mm: Rename find_linux_pte_or_hugepte() · 94171b19
      Aneesh Kumar K.V committed
      Add newer helpers to make the function usage simpler. It is always
      recommended to use find_current_mm_pte() for walking the page table.
      If we cannot use find_current_mm_pte(), it should be documented why
      the said usage of __find_linux_pte() is safe against a parallel THP
      split.
      
      For now we have KVM code using __find_linux_pte(). This is because kvm
      code ends up calling __find_linux_pte() in real mode with MSR_EE=0 but
      with PACA soft_enabled = 1. We may want to fix that later and make
      sure we keep the MSR_EE and PACA soft_enabled in sync. When we do that
      we can switch kvm to use find_linux_pte().
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      94171b19
  17. 08 Aug 2017, 1 commit
    • KVM: add spinlock optimization framework · 199b5763
      Longpeng(Mike) committed
      If a vcpu exits because it was spinning on a lock in user mode, then
      the lock holder may have been preempted in user mode or kernel mode.
      (Note that not all architectures trap spin loops in user mode,
      only AMD x86 and ARM/ARM64 currently do).
      
      But if a vcpu exits in kernel mode, then the holder must have been
      preempted in kernel mode, so we should choose a vcpu in kernel mode
      as a more likely candidate for the lock holder.
      
      This introduces kvm_arch_vcpu_in_kernel() to decide whether the
      vcpu is in kernel-mode when it's preempted.  kvm_vcpu_on_spin's
      new argument says the same of the spinning VCPU.
      Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      199b5763
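      A user-space sketch of the heuristic, with simplified types: when
      the yielding vcpu was in kernel mode, prefer boosting preempted
      vcpus that were also in kernel mode, since one of those is the more
      likely lock holder.  The selection loop mirrors the spirit of
      kvm_vcpu_on_spin(), not its actual code.

      #include <stdbool.h>
      #include <stdio.h>

      struct vcpu { int id; bool preempted; bool in_kernel; };

      static int pick_boost_target(const struct vcpu *vcpus, int n,
                                   bool yielder_in_kernel)
      {
          for (int i = 0; i < n; i++) {
              if (!vcpus[i].preempted)
                  continue;
              if (yielder_in_kernel && !vcpus[i].in_kernel)
                  continue;   /* a user-mode task can't hold the kernel lock */
              return vcpus[i].id;
          }
          return -1;
      }

      int main(void)
      {
          struct vcpu v[] = {
              { .id = 0, .preempted = true,  .in_kernel = false },
              { .id = 1, .preempted = true,  .in_kernel = true  },
              { .id = 2, .preempted = false, .in_kernel = true  },
          };
          printf("boost vcpu %d\n", pick_boost_target(v, 3, true));
          return 0;
      }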
  18. 03 Aug 2017, 1 commit
  19. 26 Jul 2017, 1 commit
    • powerpc/mm/radix: Workaround prefetch issue with KVM · a25bd72b
      Benjamin Herrenschmidt committed
      There's a somewhat architectural issue with Radix MMU and KVM.
      
      When coming out of a guest with AIL (Alternate Interrupt Location, ie,
      MMU enabled), we start executing hypervisor code with the PID register
      still containing whatever the guest has been using.
      
      The problem is that the CPU can (and will) then start prefetching or
      speculatively load from whatever host context has that same PID (if
      any), thus bringing translations for that context into the TLB, which
      Linux doesn't know about.
      
      This can cause stale translations and subsequent crashes.
      
      Fixing this in a way that is neither racy nor a huge performance hit
      is difficult. We could just make the host invalidations always
      use broadcast forms but that would hurt single threaded programs for
      example.
      
      We chose to fix it instead by partitioning the PID space between guest
      and host. This is possible because today Linux only uses 19 out of the
      20 bits of PID space, so existing guests will work if we make the host
      use the top half of the 20-bit space.
      
      We additionally add support for a property to indicate to Linux the
      size of the PID register which will be useful if we eventually have
      processors with a larger PID space available.
      
      There is still an issue with malicious guests purposefully setting the
      PID register to a value in the host's PID range. Hopefully future HW
      can prevent that, but in the meantime, we handle it with a pair of
      kludges:
      
       - On the way out of a guest, before we clear the current VCPU in the
         PACA, we check the PID and if it's outside of the permitted range
         we flush the TLB for that PID.
      
       - When context switching, if the mm is "new" on that CPU (the
         corresponding bit was set for the first time in the mm cpumask), we
         check if any sibling thread is in KVM (has a non-NULL VCPU pointer
         in the PACA). If that is the case, we also flush the PID for that
         CPU (core).
      
      This second part is needed to handle the case where a process is
      migrated (or starts a new pthread) on a sibling thread of the CPU
      coming out of KVM, as there's a window where stale translations can
      exist before we detect it and flush them out.
      
      A future optimization could be added by keeping track of whether the
      PID has ever been used and avoid doing that for completely fresh PIDs.
      We could similarly mark PIDs that have been the subject of a global
      invalidation as "fresh". But for now this will do.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      [mpe: Rework the asm to build with CONFIG_PPC_RADIX_MMU=n, drop
            unneeded include of kvm_book3s_asm.h]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a25bd72b
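      A sketch in plain C of the first kludge above (the real check lives
      in the guest-exit assembly); the 20-bit PID width and the flush
      helper are illustrative.  On the way out of the guest, a PID that
      has strayed into the host's half of the space gets its translations
      flushed.

      #include <stdint.h>
      #include <stdio.h>

      #define PID_BITS      20
      #define HOST_PID_BASE (1u << (PID_BITS - 1))  /* host owns the top half */

      static void flush_tlb_for_pid(uint32_t pid)
      {
          printf("tlbie: flushing translations for PID %#x\n", pid);
      }

      /* called on the way out of the guest, before clearing the vcpu pointer */
      static void check_guest_pid(uint32_t pid_spr)
      {
          if (pid_spr >= HOST_PID_BASE)     /* guest strayed into host range */
              flush_tlb_for_pid(pid_spr);
      }

      int main(void)
      {
          check_guest_pid(0x00042);   /* normal guest PID: nothing to do */
          check_guest_pid(0x80001);   /* PID in the host range: flush it */
          return 0;
      }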
  20. 24 Jul 2017, 2 commits
    • KVM: PPC: Book3S HV: Fix host crash on changing HPT size · ef427198
      Paul Mackerras committed
      Commit f98a8bf9 ("KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB
      ioctl() to change HPT size", 2016-12-20) changed the behaviour of
      the KVM_PPC_ALLOCATE_HTAB ioctl so that it now allocates a new HPT
      and new revmap array if there was a previously-allocated HPT of a
      different size from the size being requested.  In this case, we need
      to reset the rmap arrays of the memslots, because the rmap arrays
      will contain references to HPTEs which are no longer valid.  Worse,
      these references are also references to slots in the new revmap
      array (which parallels the HPT), and the new revmap array contains
      random contents, since it doesn't get zeroed on allocation.
      
      The effect of having these stale references to slots in the revmap
      array that contain random contents is that subsequent calls to
      functions such as kvmppc_add_revmap_chain will crash because they
      will interpret the non-zero contents of the revmap array as HPTE
      indexes and thus index outside of the revmap array.  This leads to
      host crashes such as the following.
      
      [ 7072.862122] Unable to handle kernel paging request for data at address 0xd000000c250c00f8
      [ 7072.862218] Faulting instruction address: 0xc0000000000e1c78
      [ 7072.862233] Oops: Kernel access of bad area, sig: 11 [#1]
      [ 7072.862286] SMP NR_CPUS=1024
      [ 7072.862286] NUMA
      [ 7072.862325] PowerNV
      [ 7072.862378] Modules linked in: kvm_hv vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm iw_cxgb3 mlx5_ib ib_core ses enclosure scsi_transport_sas ipmi_powernv ipmi_devintf ipmi_msghandler powernv_op_panel i2c_opal nfsd auth_rpcgss oid_registry
      [ 7072.863085]  nfs_acl lockd grace sunrpc kvm_pr kvm xfs libcrc32c scsi_dh_alua dm_service_time radeon lpfc nvme_fc nvme_fabrics nvme_core scsi_transport_fc i2c_algo_bit tg3 drm_kms_helper ptp pps_core syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm dm_multipath i2c_core cxgb3 mlx5_core mdio [last unloaded: kvm_hv]
      [ 7072.863381] CPU: 72 PID: 56929 Comm: qemu-system-ppc Not tainted 4.12.0-kvm+ #59
      [ 7072.863457] task: c000000fe29e7600 task.stack: c000001e3ffec000
      [ 7072.863520] NIP: c0000000000e1c78 LR: c0000000000e2e3c CTR: c0000000000e25f0
      [ 7072.863596] REGS: c000001e3ffef560 TRAP: 0300   Not tainted  (4.12.0-kvm+)
      [ 7072.863658] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE,TM[E]>
      [ 7072.863667]   CR: 44082882  XER: 20000000
      [ 7072.863767] CFAR: c0000000000e2e38 DAR: d000000c250c00f8 DSISR: 42000000 SOFTE: 1
      GPR00: c0000000000e2e3c c000001e3ffef7e0 c000000001407d00 d000000c250c00f0
      GPR04: d00000006509fb70 d00000000b3d2048 0000000003ffdfb7 0000000000000000
      GPR08: 00000001007fdfb7 00000000c000000f d0000000250c0000 000000000070f7bf
      GPR12: 0000000000000008 c00000000fdad000 0000000010879478 00000000105a0d78
      GPR16: 00007ffaf4080000 0000000000001190 0000000000000000 0000000000010000
      GPR20: 4001ffffff000415 d00000006509fb70 0000000004091190 0000000ee1881190
      GPR24: 0000000003ffdfb7 0000000003ffdfb7 00000000007fdfb7 c000000f5c958000
      GPR28: d00000002d09fb70 0000000003ffdfb7 d00000006509fb70 d00000000b3d2048
      [ 7072.864439] NIP [c0000000000e1c78] kvmppc_add_revmap_chain+0x88/0x130
      [ 7072.864503] LR [c0000000000e2e3c] kvmppc_do_h_enter+0x84c/0x9e0
      [ 7072.864566] Call Trace:
      [ 7072.864594] [c000001e3ffef7e0] [c000001e3ffef830] 0xc000001e3ffef830 (unreliable)
      [ 7072.864671] [c000001e3ffef830] [c0000000000e2e3c] kvmppc_do_h_enter+0x84c/0x9e0
      [ 7072.864751] [c000001e3ffef920] [d00000000b38d878] kvmppc_map_vrma+0x168/0x200 [kvm_hv]
      [ 7072.864831] [c000001e3ffef9e0] [d00000000b38a684] kvmppc_vcpu_run_hv+0x1284/0x1300 [kvm_hv]
      [ 7072.864914] [c000001e3ffefb30] [d00000000f465664] kvmppc_vcpu_run+0x44/0x60 [kvm]
      [ 7072.865008] [c000001e3ffefb60] [d00000000f461864] kvm_arch_vcpu_ioctl_run+0x114/0x290 [kvm]
      [ 7072.865152] [c000001e3ffefbe0] [d00000000f453c98] kvm_vcpu_ioctl+0x598/0x7a0 [kvm]
      [ 7072.865292] [c000001e3ffefd40] [c000000000389328] do_vfs_ioctl+0xd8/0x8c0
      [ 7072.865410] [c000001e3ffefde0] [c000000000389be4] SyS_ioctl+0xd4/0x130
      [ 7072.865526] [c000001e3ffefe30] [c00000000000b760] system_call+0x58/0x6c
      [ 7072.865644] Instruction dump:
      [ 7072.865715] e95b2110 793a0020 7b4926e4 7f8a4a14 409e0098 807c000c 786326e4 7c6a1a14
      [ 7072.865857] 935e0008 7bbd0020 813c000c 913e000c <93a30008> 93bc000c 48000038 60000000
      [ 7072.866001] ---[ end trace 627b6e4bf8080edc ]---
      
      Note that to trigger this, it is necessary to use a recent upstream
      QEMU (or other userspace that resizes the HPT at CAS time), specify
      a maximum memory size substantially larger than the current memory
      size, and boot a guest kernel that does not support HPT resizing.
      
      This fixes the problem by resetting the rmap arrays when the old HPT
      is freed.
      
      Fixes: f98a8bf9 ("KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size")
      Cc: stable@vger.kernel.org # v4.11+
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      ef427198
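      A sketch of the fix with simplified stand-ins for the KVM memslot
      structures: when the old HPT and revmap are freed, every memslot's
      rmap array is zeroed so no stale HPTE references survive into the
      newly allocated, unzeroed revmap.

      #include <stdio.h>
      #include <string.h>

      struct memslot { unsigned long npages; unsigned long *rmap; };

      static void reset_rmap_arrays(struct memslot *slots, int nslots)
      {
          for (int i = 0; i < nslots; i++)
              memset(slots[i].rmap, 0, slots[i].npages * sizeof(unsigned long));
      }

      int main(void)
      {
          unsigned long rmap0[8] = { 0xdead, 0xbeef };  /* stale HPTE references */
          struct memslot slots[] = { { .npages = 8, .rmap = rmap0 } };

          reset_rmap_arrays(slots, 1);          /* done when the old HPT is freed */
          printf("rmap[0]=%#lx rmap[1]=%#lx\n", rmap0[0], rmap0[1]);
          return 0;
      }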
    • KVM: PPC: Book3S HV: Enable TM before accessing TM registers · e4705715
      Paul Mackerras committed
      Commit 46a704f8 ("KVM: PPC: Book3S HV: Preserve userspace HTM state
      properly", 2017-06-15) added code to read transactional memory (TM)
      registers but forgot to enable TM before doing so.  The result is
      that if userspace does have live values in the TM registers, a KVM_RUN
      ioctl will cause a host kernel crash like this:
      
      [  181.328511] Unrecoverable TM Unavailable Exception f60 at d00000001e7d9980
      [  181.328605] Oops: Unrecoverable TM Unavailable Exception, sig: 6 [#1]
      [  181.328613] SMP NR_CPUS=2048
      [  181.328613] NUMA
      [  181.328618] PowerNV
      [  181.328646] Modules linked in: vhost_net vhost tap nfs_layout_nfsv41_files rpcsec_gss_krb5 nfsv4 dns_resolver nfs
      +fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
      +nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun ebtable_filter ebtables
      +ip6table_filter ip6_tables iptable_filter bridge stp llc kvm_hv kvm nfsd ses enclosure scsi_transport_sas ghash_generic
      +auth_rpcgss gf128mul xts sg ctr nfs_acl lockd vmx_crypto shpchp ipmi_powernv i2c_opal grace ipmi_devintf i2c_core
      +powernv_rng sunrpc ipmi_msghandler ibmpowernv uio_pdrv_genirq uio leds_powernv powernv_op_panel ip_tables xfs sd_mod
      +lpfc ipr bnx2x libata mdio ptp pps_core scsi_transport_fc libcrc32c dm_mirror dm_region_hash dm_log dm_mod
      [  181.329278] CPU: 40 PID: 9926 Comm: CPU 0/KVM Not tainted 4.12.0+ #1
      [  181.329337] task: c000003fc6980000 task.stack: c000003fe4d80000
      [  181.329396] NIP: d00000001e7d9980 LR: d00000001e77381c CTR: d00000001e7d98f0
      [  181.329465] REGS: c000003fe4d837e0 TRAP: 0f60   Not tainted  (4.12.0+)
      [  181.329523] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
      [  181.329527]   CR: 24022448  XER: 00000000
      [  181.329608] CFAR: d00000001e773818 SOFTE: 1
      [  181.329608] GPR00: d00000001e77381c c000003fe4d83a60 d00000001e7ef410 c000003fdcfe0000
      [  181.329608] GPR04: c000003fe4f00000 0000000000000000 0000000000000000 c000003fd7954800
      [  181.329608] GPR08: 0000000000000001 c000003fc6980000 0000000000000000 d00000001e7e2880
      [  181.329608] GPR12: d00000001e7d98f0 c000000007b19000 00000001295220e0 00007fffc0ce2090
      [  181.329608] GPR16: 0000010011886608 00007fff8c89f260 0000000000000001 00007fff8c080028
      [  181.329608] GPR20: 0000000000000000 00000100118500a6 0000010011850000 0000010011850000
      [  181.329608] GPR24: 00007fffc0ce1b48 0000010011850000 00000000d673b901 0000000000000000
      [  181.329608] GPR28: 0000000000000000 c000003fdcfe0000 c000003fdcfe0000 c000003fe4f00000
      [  181.330199] NIP [d00000001e7d9980] kvmppc_vcpu_run_hv+0x90/0x6b0 [kvm_hv]
      [  181.330264] LR [d00000001e77381c] kvmppc_vcpu_run+0x2c/0x40 [kvm]
      [  181.330322] Call Trace:
      [  181.330351] [c000003fe4d83a60] [d00000001e773478] kvmppc_set_one_reg+0x48/0x340 [kvm] (unreliable)
      [  181.330437] [c000003fe4d83b30] [d00000001e77381c] kvmppc_vcpu_run+0x2c/0x40 [kvm]
      [  181.330513] [c000003fe4d83b50] [d00000001e7700b4] kvm_arch_vcpu_ioctl_run+0x114/0x2a0 [kvm]
      [  181.330586] [c000003fe4d83bd0] [d00000001e7642f8] kvm_vcpu_ioctl+0x598/0x7a0 [kvm]
      [  181.330658] [c000003fe4d83d40] [c0000000003451b8] do_vfs_ioctl+0xc8/0x8b0
      [  181.330717] [c000003fe4d83de0] [c000000000345a64] SyS_ioctl+0xc4/0x120
      [  181.330776] [c000003fe4d83e30] [c00000000000b004] system_call+0x58/0x6c
      [  181.330833] Instruction dump:
      [  181.330869] e92d0260 e9290b50 e9290108 792807e3 41820058 e92d0260 e9290b50 e9290108
      [  181.330941] 792ae8a4 794a1f87 408204f4 e92d0260 <7d4022a6> f9490ff0 e92d0260 7d4122a6
      [  181.331013] ---[ end trace 6f6ddeb4bfe92a92 ]---
      
      The fix is just to turn on the TM bit in the MSR before accessing the
      registers.
      
      Cc: stable@vger.kernel.org # v3.14+
      Fixes: 46a704f8 ("KVM: PPC: Book3S HV: Preserve userspace HTM state properly")
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Tested-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      e4705715
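      A sketch of the fix in plain C: set the TM bit in the MSR before
      touching the TM SPRs, then put the old MSR back.  mfmsr()/mtmsrd()
      and the SPR read are stubs, and the exact MSR_TM bit position should
      be treated as illustrative here.

      #include <stdint.h>
      #include <stdio.h>

      #define MSR_TM (1ULL << 32)   /* transactional memory available */

      static uint64_t fake_msr;                   /* stand-in for the MSR */
      static uint64_t mfmsr(void)        { return fake_msr; }
      static void     mtmsrd(uint64_t v) { fake_msr = v; }

      static uint64_t mfspr_texasr(void)
      {
          if (!(fake_msr & MSR_TM)) {
              fprintf(stderr, "TM unavailable facility interrupt!\n");
              return 0;
          }
          return 0x123;                           /* pretend TEXASR contents */
      }

      int main(void)
      {
          uint64_t old_msr = mfmsr();
          mtmsrd(old_msr | MSR_TM);               /* enable TM first ... */
          uint64_t texasr = mfspr_texasr();       /* ... then read TM SPRs */
          mtmsrd(old_msr);                        /* restore the MSR */
          printf("TEXASR=%#lx\n", (unsigned long)texasr);
          return 0;
      }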