1. 09 Oct 2018, 19 commits
  2. 02 Oct 2018, 2 commits
  3. 12 Sep 2018, 2 commits
    • KVM: PPC: Book3S HV: Don't use compound_order to determine host mapping size · 71d29f43
      Committed by Nicholas Piggin
      THP paths can defer splitting compound pages until after the actual
      remap and TLB flushes to split a huge PMD/PUD. This causes radix
      partition scope page table mappings to get out of sync with the host
      qemu page table mappings.
      
      This results in random memory corruption in the guest when running
      with THP. The easiest way to reproduce it is to use the KVM balloon to free up
      a lot of memory in the guest and then shrink the balloon to give the
      memory back, while some work is being done in the guest.
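
      As an illustrative sketch of the idea behind the fix (not the exact
      upstream code), the mapping size can be taken from the host page table
      walk rather than from compound_order(); the __find_linux_pte() call and
      its exact signature are assumptions here:

          /* Illustrative only: derive the host mapping size from the Linux
           * page table, since a deferred THP split can leave a compound page
           * behind PTE-sized host mappings. */
          static unsigned int host_mapping_shift(struct mm_struct *mm,
                                                 unsigned long hva)
          {
                  unsigned int shift = 0;
                  pte_t *ptep;

                  /* shift reports the level at which the mapping was found */
                  ptep = __find_linux_pte(mm->pgd, hva, NULL, &shift);
                  if (!ptep || !pte_present(*ptep))
                          return 0;               /* no usable host mapping */

                  return shift ? shift : PAGE_SHIFT; /* 0 means a normal pte */
          }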
      
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: kvm-ppc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Avoid marking DMA-mapped pages dirty in real mode · 425333bf
      Committed by Alexey Kardashevskiy
      At the moment the real mode handler of H_PUT_TCE calls iommu_tce_xchg_rm(),
      which in turn reads the old TCE and, if it was a valid entry, marks the
      physical page dirty if it was mapped for writing. Since this runs in
      real mode, realmode_pfn_to_page() is used instead of pfn_to_page()
      to get the page struct. However, SetPageDirty() itself reads the compound
      page head and returns a virtual address for the head page struct, and
      setting the dirty bit through that virtual address kills the system.
      
      This adds additional dirty bit tracking into the MM/IOMMU API for use
      in real mode. Note that this does not change how VFIO and
      KVM (in virtual mode) set this bit. The KVM (real mode) changes include:
      - use the lowest bit of the cached host physical address to carry the
        dirty bit (a sketch of this encoding follows the list);
      - mark pages dirty when they are unpinned, which happens when the
        preregistered memory is released, which always happens in virtual mode;
      - add an mm_iommu_ua_mark_dirty_rm() helper to set the delayed dirty bit;
      - change iommu_tce_xchg_rm() to take the kvm struct for the mm to use
        in the new mm_iommu_ua_mark_dirty_rm() helper;
      - move iommu_tce_xchg_rm() to book3s_64_vio_hv.c (which is the only
        caller anyway) to reduce the spread of real mode KVM and IOMMU knowledge
        across different subsystems.
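
      A minimal sketch of the low-bit encoding described in the first item
      above; the flag name and helpers are illustrative, not the upstream code:

          #define MM_IOMMU_HPA_DIRTY  0x1UL  /* borrowed low bit, name assumed */

          /* Real mode: struct page cannot be touched here, so only set the
           * deferred flag in the cached (page-aligned) host physical address. */
          static void ua_mark_dirty_rm(unsigned long *hpas, long entry)
          {
                  hpas[entry] |= MM_IOMMU_HPA_DIRTY;
          }

          /* Virtual mode unpin path: transfer the deferred flag to the page. */
          static void unpin_hpa(unsigned long hpa)
          {
                  struct page *page = pfn_to_page(hpa >> PAGE_SHIFT);

                  if (hpa & MM_IOMMU_HPA_DIRTY)
                          SetPageDirty(page);
                  put_page(page);
          }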
      
      This removes realmode_pfn_to_page() as it is not used anymore.
      
      While we are at it, remove some EXPORT_SYMBOL_GPL() as that code is for
      real mode only and modules cannot call it anyway.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  4. 24 Aug 2018, 1 commit
  5. 21 Aug 2018, 1 commit
  6. 20 Aug 2018, 1 commit
    • KVM: PPC: Book3S HV: Don't truncate HPTE index in xlate function · 46dec40f
      Committed by Paul Mackerras
      This fixes a bug which causes guest virtual addresses to get translated
      to guest real addresses incorrectly when the guest is using the HPT MMU
      and has more than 256GB of RAM, or more specifically has a HPT larger
      than 2GB.  This has shown up in testing as a failure of the host to
      emulate doorbell instructions correctly on POWER9 for HPT guests with
      more than 256GB of RAM.
      
      The bug is that the HPTE index in kvmppc_mmu_book3s_64_hv_xlate()
      is stored as an int, and in forming the HPTE address, the index gets
      shifted left 4 bits as an int before being sign-extended to 64 bits.
      The simple fix is to make the variable a long int, matching the
      return type of kvmppc_hv_find_lock_hpte(), which is what calculates
      the index.
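
      A standalone illustration of the truncation (ordinary userspace C, not
      kernel code):

          #include <stdio.h>

          int main(void)
          {
                  /* An index valid in a >2GB HPT: index * 16 sets bit 31. */
                  int  index32 = 0x08000000;
                  long index64 = 0x08000000L;

                  /* The 32-bit shift overflows the int (formally undefined
                   * behaviour); in practice it wraps negative and is then
                   * sign-extended to 64 bits, which is exactly the bug. */
                  unsigned long bad  = (unsigned long)(index32 << 4);
                  unsigned long good = (unsigned long)(index64 << 4);

                  printf("int index:  0x%016lx\n", bad);  /* ffffffff80000000 */
                  printf("long index: 0x%016lx\n", good); /* 0000000080000000 */
                  return 0;
          }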
      
      Fixes: 697d3899 ("KVM: PPC: Implement MMIO emulation support for Book3S HV guests")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  7. 18 Aug 2018, 1 commit
  8. 15 Aug 2018, 1 commit
    • KVM: PPC: Book3S HV: Use correct pagesize in kvm_unmap_radix() · c066fafc
      Committed by Paul Mackerras
      Since commit e641a317 ("KVM: PPC: Book3S HV: Unify dirty page map
      between HPT and radix", 2017-10-26), kvm_unmap_radix() computes the
      number of PAGE_SIZEd pages being unmapped and passes it to
      kvmppc_update_dirty_map(), which expects to be passed the page size
      instead.  Consequently it will only mark one system page dirty even
      when a large page (for example a THP page) is being unmapped.  The
      consequence of this is that part of the THP page might not get copied
      during live migration, resulting in memory corruption for the guest.
      
      This fixes it by computing and passing the page size in kvm_unmap_radix().
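
      A sketch of the corrected call; the shift is taken from the
      partition-scope PTE, and the surrounding variable names are assumed:

          /* Pass the mapping size in bytes, not the number of PAGE_SIZE
           * pages covered by it. */
          unsigned long page_size = 1UL << shift;

          kvmppc_update_dirty_map(memslot, gfn, page_size);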
      
      Cc: stable@vger.kernel.org # v4.15+
      Fixes: e641a317 ("KVM: PPC: Book3S HV: Unify dirty page map between HPT and radix")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  9. 30 Jul 2018, 4 commits
  10. 26 Jul 2018, 2 commits
    • KVM: PPC: Book3S HV: Read kvm->arch.emul_smt_mode under kvm->lock · b5c6f760
      Committed by Paul Mackerras
      Commit 1e175d2e ("KVM: PPC: Book3S HV: Pack VCORE IDs to access full
      VCPU ID space", 2018-07-25) added code that uses kvm->arch.emul_smt_mode
      before any VCPUs are created.  However, userspace can change
      kvm->arch.emul_smt_mode at any time up until the first VCPU is created.
      Hence it is (theoretically) possible for the check in
      kvmppc_core_vcpu_create_hv() to race with another userspace thread
      changing kvm->arch.emul_smt_mode.
      
      This fixes it by moving the test that uses kvm->arch.emul_smt_mode into
      the block where kvm->lock is held.
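
      A sketch of the locking pattern; the condition and error handling are
      illustrative, not the exact upstream check:

          mutex_lock(&kvm->lock);

          /* emul_smt_mode can only change before the first VCPU exists, and
           * such changes are made under kvm->lock, so this read cannot race
           * with userspace updating the mode. */
          if (id >= (KVM_MAX_VCPUS * kvm->arch.emul_smt_mode)) {
                  mutex_unlock(&kvm->lock);
                  return ERR_PTR(-EINVAL);        /* illustrative */
          }
          /* ... rest of the VCPU/vcore setup, still under kvm->lock ... */
          mutex_unlock(&kvm->lock);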
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Pack VCORE IDs to access full VCPU ID space · 1e175d2e
      Committed by Sam Bobroff
      It is not currently possible to create the full number of possible
      VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses fewer
      threads per core than its core stride (or "VSMT mode"). This is
      because the VCORE ID and XIVE offsets grow beyond KVM_MAX_VCPUS
      even though the VCPU ID is less than KVM_MAX_VCPU_ID.
      
      To address this, "pack" the VCORE ID and XIVE offsets by using
      knowledge of the way the VCPU IDs will be used when there are fewer
      guest threads per core than the core stride. The primary thread of
      each core will always be used first. Then, if the guest uses more than
      one thread per core, these secondary threads will sequentially follow
      the primary in each core.
      
      So the only way an ID above KVM_MAX_VCPUS can be seen is if the
      VCPUs are being spaced apart, so at least half of each core is empty,
      and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
      into the second half of each core (4..7, in an 8-thread core).
      
      Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
      each core is being left empty, and we can map down into the second and
      third quarters of each core (2, 3 and 5, 6 in an 8-thread core).
      
      Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
      threads are being used and 7/8 of the core is empty, allowing use of
      the 1, 5, 3 and 7 thread slots.
      
      (Strides less than 8 are handled similarly.)
      
      This allows the VCORE ID or offset to be calculated quickly from the
      VCPU ID or XIVE server numbers, without access to the VCPU structure.
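
      The packing rule can be reconstructed as a small standalone program;
      the KVM_MAX_VCPUS value and the offset table below are illustrative,
      not copied from the kernel source:

          #include <stdio.h>

          #define MAX_SMT_THREADS 8
          #define KVM_MAX_VCPUS   (16 * MAX_SMT_THREADS) /* illustrative */

          static unsigned int pack_vcpu_id(unsigned int id, unsigned int stride)
          {
                  /* Thread slots used by successive ID blocks: the second
                   * half of the core first, then the remaining odd slots. */
                  static const int block_offsets[MAX_SMT_THREADS] =
                          { 0, 4, 2, 6, 1, 5, 3, 7 };
                  unsigned int block =
                          (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);

                  return (id % KVM_MAX_VCPUS) + block_offsets[block];
          }

          int main(void)
          {
                  /* Stride 8 with one thread per core in use: IDs that are
                   * KVM_MAX_VCPUS apart fold onto different thread slots of
                   * the same core. */
                  for (unsigned int i = 0; i < 4; i++)
                          printf("id %3u -> packed %3u\n", i * KVM_MAX_VCPUS,
                                 pack_vcpu_id(i * KVM_MAX_VCPUS, 8));
                  return 0;
          }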
      
      [paulus@ozlabs.org - tidied up comment a little, changed some WARN_ONCE
       to pr_devel, wrapped line, fixed id check.]
      Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  11. 18 Jul 2018, 5 commits
    • KVM: PPC: Check if IOMMU page is contained in the pinned physical page · 76fa4975
      Committed by Alexey Kardashevskiy
      A VM which has:
       - a DMA-capable device passed through to it (e.g. a network card);
       - a malicious kernel that ignores H_PUT_TCE failures;
       - the capability of using IOMMU pages bigger than physical pages;
      can create an IOMMU mapping that exposes (for example) 16MB of
      the host physical memory to the device when only 64K was allocated to the VM.
      
      The remaining 16MB - 64K will be some other content of host memory, possibly
      including pages of the VM, but also pages of host kernel memory, host
      programs or other VMs.
      
      The attacking VM does not control the location of the page it can map,
      and is only allowed to map as many pages as it has pages of RAM.
      
      We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
      an IOMMU page is contained in the physical page so the PCI hardware won't
      get access to unassigned host memory; however this check is missing in
      the KVM fastpath (H_PUT_TCE accelerated code). We have been lucky so far
      and have not hit this yet: the very first time the mapping happens we do
      not have tbl::it_userspace allocated yet, so we fall back to userspace,
      which in turn calls the VFIO IOMMU driver; this fails and the guest does
      not retry.
      
      This stores the smallest preregistered page size in the preregistered
      region descriptor and changes the mm_iommu_xxx API to check this against
      the IOMMU page size.
      
      This calculates the maximum page size as the minimum of the natural
      region alignment and the compound page size. For the page shift this uses
      the shift returned by find_linux_pte(), which indicates how the page is
      mapped to the current userspace: if the page is huge and the shift is
      non-zero, then it is a leaf PTE and the page is mapped within the range.
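
      A sketch of the containment check described above; the helper and
      parameter names are assumptions, only the comparison follows the text:

          /* Refuse an IOMMU page larger than the smallest page size the
           * region was pinned with, otherwise the mapping would expose
           * adjacent host memory to the device. */
          static long tce_page_is_contained(unsigned int mem_pageshift,
                                            unsigned int tce_pageshift)
          {
                  return mem_pageshift >= tce_pageshift ? H_SUCCESS : H_PARAMETER;
          }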
      
      Fixes: 121f80ba ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Fix constant size warning · 0abb75b7
      Committed by Nicholas Mc Guire
      The constants are 64-bit but not explicitly declared UL, resulting
      in sparse warnings. Fix this by declaring the constants UL.
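
      For illustration only (the constant name is a placeholder, not one of
      the constants touched by this patch):

          /* Plain int literal: shifting into bit 31 goes negative and
           * sign-extends when used in a 64-bit expression. */
          #define EXAMPLE_MASK    (0xf << 28)

          /* UL literal: the value stays a clean 64-bit 0xf0000000 and the
           * sparse warning goes away. */
          #define EXAMPLE_MASK_UL (0xfUL << 28)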
      Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Add of_node_put() in success path · 51eaa08f
      Committed by Nicholas Mc Guire
      of_find_compatible_node() returns a pointer with an incremented refcount,
      so the refcount must be explicitly decremented after the last use. Since
      the node is only used here to check for its presence, and the result is
      not actually used in the success path, the reference can be dropped
      immediately.
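
      An illustrative sketch of the pattern (the compatible string is a
      placeholder, not the one used by this code):

          struct device_node *np;

          np = of_find_compatible_node(NULL, NULL, "example,compatible");
          if (!np)
                  return -ENODEV;
          /* Only the presence of the node matters here, so drop the
           * reference straight away instead of leaking it on success. */
          of_node_put(np);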
      Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
      Fixes: f725758b ("KVM: PPC: Book3S HV: Use OPAL XICS emulation on POWER9")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S: Fix matching of hardware and emulated TCE tables · 76346cd9
      Committed by Alexey Kardashevskiy
      When attaching a hardware table to LIOBN in KVM, we match table parameters
      such as page size, table offset and table size. However the tables are
      created via very different paths - VFIO and KVM - and the VFIO path goes
      through the platform code which has minimum TCE page size requirement
      (which is 4K but since we allocate memory by pages and cannot avoid
      alignment anyway, we align to 64k pages for powernv_defconfig).
      
      So when we match the tables, one might be bigger than the other, which
      means the hardware table cannot get attached to LIOBN and DMA mapping
      fails.
      
      This removes the table size alignment from the guest visible table.
      This does not affect the memory allocation which is still aligned -
      kvmppc_tce_pages() takes care of this.
      
      This relaxes the check we do when attaching tables to allow the hardware
      table to be bigger than the guest visible table.
      
      Ideally we want the KVM table to cover the same space as the hardware
      table does, but since the hardware table may use multiple levels, and
      all levels must use the same table size (IODA2 design), the area it can
      actually cover may end up quite different from the window size which
      the guest requested, even though the guest won't map it all.
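
      A sketch of the relaxed comparison (illustrative, not the exact
      upstream check; variable names are assumed):

          /* The hardware table may cover more than the guest visible table,
           * but it must never cover less. */
          if (hw_table_entries < guest_table_entries)
                  return -EINVAL;         /* hardware window too small */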
      
      Fixes: ca1fc489 ("KVM: PPC: Book3S: Allow backing bigger guest IOMMU pages with smaller physical pages")
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Remove mmio_vsx_tx_sx_enabled in KVM MMIO emulation · 4eeb8556
      Committed by Simon Guo
      Originally, PPC KVM MMIO emulation used only 5 bits (VSRs 0-31) for the
      VSR register number, and used the mmio_vsx_tx_sx_enabled field together
      with it to address VSRs 0-63.
      
      Currently PPC KVM MMIO emulation is reimplemented with analyse_instr()
      assistance.  analyse_instr() returns 0-63 for the VSR register number, so
      the additional mmio_vsx_tx_sx_enabled field is no longer necessary.
      
      This patch extends the related register bits (expanding io_gpr from u8
      to u16 and using 6 bits for the VSR register number) so that
      mmio_vsx_tx_sx_enabled can be removed.
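
      A sketch of the widened encoding (the macro names and mask values are
      illustrative, not copied from the kernel headers):

          /* io_gpr is now a u16: the low 6 bits carry the register number
           * (enough for VSR 0-63), the upper bits carry the register class. */
          #define KVM_MMIO_REG_NUM_MASK   0x003fU /* 6-bit register number */
          #define KVM_MMIO_REG_TYPE_MASK  0xffc0U /* register class bits */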
      Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  12. 16 Jul 2018, 1 commit
    • powerpc/powernv/ioda: Allocate indirect TCE levels on demand · a68bd126
      Committed by Alexey Kardashevskiy
      At the moment we allocate the entire TCE table, twice (the hardware part
      and the userspace translation cache). This normally works, as we usually
      have contiguous memory and the guest will map its entire RAM for 64-bit DMA.
      
      However, if we have sparse RAM (one example is a memory device), then
      we will allocate TCEs which will never be used, as the guest only maps
      actual memory for DMA. If it is a single-level TCE table, there is nothing
      we can really do, but if it is a multilevel table, we can skip allocating
      TCEs we know we won't need.
      
      This adds the ability to allocate only the first level, saving memory.
      
      This changes iommu_table::free() to avoid allocating an extra level;
      iommu_table::set() will do this when needed.
      
      This adds an @alloc parameter to iommu_table::exchange() to tell the
      callback whether it can allocate an extra level; the flag is set to
      "false" for the realmode KVM handlers of H_PUT_TCE hcalls, and in that
      case the callback returns H_TOO_HARD when an allocation would be needed.
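
      A sketch of that contract (types and surrounding code are illustrative;
      only the @alloc / H_TOO_HARD behaviour follows the text above):

          /* Real mode cannot allocate memory, so the callback is told not to
           * create any missing indirect level (alloc = false)... */
          ret = tbl->it_ops->exchange(tbl, entry, &hpa, &dir, false);

          /* ...and when a level is missing it returns H_TOO_HARD, which makes
           * the hcall drop back to the virtual mode handler. */
          if (ret == H_TOO_HARD)
                  return H_TOO_HARD;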
      
      This still requires the entire table to be counted in mm::locked_vm.
      
      To be conservative, this only does on-demand allocation when the
      userspace cache table is requested, which is the case for VFIO.
      
      The example math for a system replicating a powernv setup with NVLink2
      in a guest:
      16GB RAM mapped at 0x0
      128GB GPU RAM window (16GB of actual RAM) mapped at 0x244000000000
      
      the table to cover all of that with 64K pages takes:
      (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4656MB
      
      If we allocate only the necessary TCE levels, we will only need:
      (((0x400000000 + 0x400000000) >> 16)*8)>>20 = 4MB (plus some for indirect
      levels).
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>