  1. 09 Oct, 2018 4 commits
    • KVM: PPC: Book3S: Simplify external interrupt handling · d24ea8a7
      Paul Mackerras committed
      Currently we use two bits in the vcpu pending_exceptions bitmap to
      indicate that an external interrupt is pending for the guest, one
      for "one-shot" interrupts that are cleared when delivered, and one
      for interrupts that persist until cleared by an explicit action of
      the OS (e.g. an acknowledge to an interrupt controller).  The
      BOOK3S_IRQPRIO_EXTERNAL bit is used for one-shot interrupt requests
      and BOOK3S_IRQPRIO_EXTERNAL_LEVEL is used for persisting interrupts.
      
      In practice BOOK3S_IRQPRIO_EXTERNAL never gets used, because our
      Book3S platforms generally, and pseries in particular, expect
      external interrupt requests to persist until they are acknowledged
      at the interrupt controller.  That combined with the confusion
      introduced by having two bits for what is essentially the same thing
      makes it attractive to simplify things by only using one bit.  This
      patch does that.
      
      With this patch there is only BOOK3S_IRQPRIO_EXTERNAL, and by default
      it has the semantics of a persisting interrupt.  In order to avoid
      breaking the ABI, we introduce a new "external_oneshot" flag which
      preserves the behaviour of the KVM_INTERRUPT ioctl with the
      KVM_INTERRUPT_SET argument.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
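A minimal sketch of the single-bit scheme described in this commit message, assuming a heavily simplified vcpu model. Only the `external_oneshot` flag and the idea of a single external-interrupt bit come from the message; the struct, the bit number and the function names are hypothetical and are not the kernel's actual code.

```c
#include <stdbool.h>

/* Hypothetical bit number, chosen only for illustration. */
#define SKETCH_IRQPRIO_EXTERNAL 13

struct sketch_vcpu {
	unsigned long pending_exceptions; /* one bit per interrupt priority */
	bool external_oneshot;            /* set only for one-shot requests */
};

/* Queue an external interrupt; oneshot selects the legacy one-shot semantics. */
static void sketch_queue_external(struct sketch_vcpu *vcpu, bool oneshot)
{
	vcpu->external_oneshot = oneshot;
	vcpu->pending_exceptions |= 1UL << SKETCH_IRQPRIO_EXTERNAL;
}

/*
 * Called after the interrupt has been delivered to the guest.
 * A one-shot request is cleared here; a persisting request stays
 * pending until something else explicitly dequeues it.
 */
static void sketch_after_delivery(struct sketch_vcpu *vcpu)
{
	if (vcpu->external_oneshot) {
		vcpu->external_oneshot = false;
		vcpu->pending_exceptions &= ~(1UL << SKETCH_IRQPRIO_EXTERNAL);
	}
}

/* Report whether the external interrupt bit is still set. */
static bool sketch_external_pending(const struct sketch_vcpu *vcpu)
{
	return (vcpu->pending_exceptions & (1UL << SKETCH_IRQPRIO_EXTERNAL)) != 0;
}
```

With this model, a request queued with `oneshot == true` is no longer pending after delivery, while one queued with `oneshot == false` remains pending.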
    • KVM: PPC: Remove redundand permission bits removal · a3ac077b
      Alexey Kardashevskiy committed
      The kvmppc_gpa_to_ua() helper itself takes care of the permission
      bits in the TCE and yet every single caller removes them.
      
      This changes semantics of kvmppc_gpa_to_ua() so it takes TCEs
      (which are GPAs + TCE permission bits) to make the callers simpler.
      
      This should cause no behavioural change.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Propagate errors to the guest when failed instead of ignoring · 2691f0ff
      Alexey Kardashevskiy committed
      At the moment if the PUT_TCE{_INDIRECT} handlers fail to update
      the hardware tables, we print a warning once, clear the entry and
      continue. The assumption at the time was that when a VFIO device is
      hotplugged into the guest and userspace replays the virtual DMA
      mappings (i.e. TCEs) to the hardware tables, there is nothing useful
      we can do if the replay fails.
      
      However the assumption is not valid as these handlers are not called for
      TCE replay (VFIO ioctl interface is used for that) and these handlers
      are for new TCEs.
      
      This returns an error to the guest if there is a request which cannot be
      processed. By now the only possible failure must be H_TOO_HARD.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Validate TCEs against preregistered memory page sizes · 42de7b9e
      Alexey Kardashevskiy committed
      The userspace can request an arbitrary supported page size for a DMA
      window and this works fine as long as the mapped memory is backed with
      the pages of the same or bigger size; if this is not the case,
      mm_iommu_ua_to_hpa{_rm}() fail and the tables are not populated with
      dangerously incorrect TCEs.
      
      However, since it is quite easy to misconfigure KVM and we do not
      revert the changes already made to TCE tables if an error happens in
      the middle, we had better validate the page size before we even touch
      the tables.
      
      This enhances kvmppc_tce_validate() to check the hardware IOMMU page sizes
      against the preregistered memory page sizes.
      
      Since the new check uses real/virtual mode helpers, this renames
      kvmppc_tce_validate() to kvmppc_rm_tce_validate() to handle the real mode
      case and mirrors it for the virtual mode under the old name. The real
      mode handler is not used for the virtual mode as:
      1. it uses _lockless() list traversing primitives instead of RCU;
      2. realmode's mm_iommu_ua_to_hpa_rm() uses vmalloc_to_phys() which
      virtual mode does not have to use and since on POWER9+radix only virtual
      mode handlers actually work, we do not want to slow down that path even
      a bit.
      
      This removes EXPORT_SYMBOL_GPL(kvmppc_tce_validate) as the validators
      are static now.
      
      From now on the attempts on mapping IOMMU pages bigger than allowed
      will result in KVM exit.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      [mpe: Fix KVM_HV=n build]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  2. 02 Oct, 2018 2 commits
  3. 12 Sep, 2018 2 commits
    • KVM: PPC: Book3S HV: Don't use compound_order to determine host mapping size · 71d29f43
      Nicholas Piggin committed
      THP paths can defer splitting compound pages until after the actual
      remap and TLB flushes to split a huge PMD/PUD. This causes radix
      partition scope page table mappings to get out of synch with the host
      qemu page table mappings.
      
      This results in random memory corruption in the guest when running
      with THP. The easiest way to reproduce is to use KVM balloon to free up
      a lot of memory in the guest and then shrink the balloon to give the
      memory back, while some work is being done in the guest.
      
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: kvm-ppc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Avoid marking DMA-mapped pages dirty in real mode · 425333bf
      Alexey Kardashevskiy committed
      At the moment the real mode handler of H_PUT_TCE calls iommu_tce_xchg_rm()
      which in turn reads the old TCE and if it was a valid entry, marks
      the physical page dirty if it was mapped for writing. Since it is in
      real mode, realmode_pfn_to_page() is used instead of pfn_to_page()
      to get the page struct. However SetPageDirty() itself reads the compound
      page head and returns a virtual address for the head page struct and
      setting dirty bit for that kills the system.
      
      This adds additional dirty bit tracking into the MM/IOMMU API for use
      in the real mode. Note that this does not change how VFIO and
      KVM (in virtual mode) set this bit. The KVM (real mode) changes include:
      - use the lowest bit of the cached host phys address to carry
      the dirty bit;
      - mark pages dirty when they are unpinned which happens when
      the preregistered memory is released which always happens in virtual
      mode;
      - add mm_iommu_ua_mark_dirty_rm() helper to set delayed dirty bit;
      - change iommu_tce_xchg_rm() to take the kvm struct for the mm to use
      in the new mm_iommu_ua_mark_dirty_rm() helper;
      - move iommu_tce_xchg_rm() to book3s_64_vio_hv.c (which is the only
      caller anyway) to reduce the real mode KVM and IOMMU knowledge
      across different subsystems.
      
      This removes realmode_pfn_to_page() as it is not used anymore.
      
      While we are at it, remove some EXPORT_SYMBOL_GPL() as that code is for
      the real mode only and modules cannot call it anyway.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
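The deferred dirty tracking listed in this commit message can be sketched as follows. This assumes that cached host physical addresses are page aligned, so bit 0 is free to use as a flag; the macro and function names are hypothetical and do not mirror the kernel's helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag: bit 0 is unused because the address is page aligned. */
#define SKETCH_HPA_DIRTY 0x1ULL

/*
 * Real mode side: remember that the page was written to by setting the
 * flag in the cached entry, without touching any page struct.
 */
static void sketch_mark_dirty(uint64_t *cached_hpa)
{
	*cached_hpa |= SKETCH_HPA_DIRTY;
}

/* Strip the flag to recover the host physical address. */
static uint64_t sketch_hpa(uint64_t cached_hpa)
{
	return cached_hpa & ~SKETCH_HPA_DIRTY;
}

/*
 * Virtual mode side, at unpin time: report whether the page needs to be
 * marked dirty and clear the deferred flag.
 */
static bool sketch_test_and_clear_dirty(uint64_t *cached_hpa)
{
	bool dirty = (*cached_hpa & SKETCH_HPA_DIRTY) != 0;

	*cached_hpa &= ~SKETCH_HPA_DIRTY;
	return dirty;
}
```

The point of the design is that the real mode path only flips a bit in memory it already owns, and the page struct is touched later from virtual mode, where that is safe.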
  4. 24 Aug, 2018 1 commit
  5. 21 Aug, 2018 1 commit
  6. 20 Aug, 2018 1 commit
    • KVM: PPC: Book3S HV: Don't truncate HPTE index in xlate function · 46dec40f
      Paul Mackerras committed
      This fixes a bug which causes guest virtual addresses to get translated
      to guest real addresses incorrectly when the guest is using the HPT MMU
      and has more than 256GB of RAM, or more specifically has a HPT larger
      than 2GB.  This has shown up in testing as a failure of the host to
      emulate doorbell instructions correctly on POWER9 for HPT guests with
      more than 256GB of RAM.
      
      The bug is that the HPTE index in kvmppc_mmu_book3s_64_hv_xlate()
      is stored as an int, and in forming the HPTE address, the index gets
      shifted left 4 bits as an int before being sign-extended to 64 bits.
      The simple fix is to make the variable a long int, matching the
      return type of kvmppc_hv_find_lock_hpte(), which is what calculates
      the index.
      
      Fixes: 697d3899 ("KVM: PPC: Implement MMIO emulation support for Book3S HV guests")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
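The truncation described in this commit message can be reproduced in isolation. The sketch below is not the kernel function; the names are hypothetical. It models the buggy path by doing the 4-bit shift in 32-bit arithmetic and then sign-extending, and the fixed path by widening first. The shift is performed on an unsigned value so the sketch itself avoids signed-overflow undefined behaviour; the conversion back to `int32_t` is assumed to wrap, as it does on mainstream compilers.

```c
#include <stdint.h>

/*
 * Buggy path: the index is held in 32 bits, shifted left by 4 in 32-bit
 * arithmetic, and only then sign-extended to 64 bits.
 */
static uint64_t sketch_hpte_offset_int(int32_t index)
{
	int32_t shifted = (int32_t)((uint32_t)index << 4);

	return (uint64_t)(int64_t)shifted;
}

/* Fixed path: the index is already 64 bits wide when it is shifted. */
static uint64_t sketch_hpte_offset_long(int64_t index)
{
	return (uint64_t)(index << 4);
}
```

An index of 2^27 corresponds to a byte offset of exactly 2GB, which is where the two paths start to disagree: the 32-bit path yields 0xFFFFFFFF80000000 while the 64-bit path yields 0x80000000.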
  7. 18 Aug, 2018 1 commit
  8. 15 Aug, 2018 1 commit
    • KVM: PPC: Book3S HV: Use correct pagesize in kvm_unmap_radix() · c066fafc
      Paul Mackerras committed
      Since commit e641a317 ("KVM: PPC: Book3S HV: Unify dirty page map
      between HPT and radix", 2017-10-26), kvm_unmap_radix() computes the
      number of PAGE_SIZEd pages being unmapped and passes it to
      kvmppc_update_dirty_map(), which expects to be passed the page size
      instead.  Consequently it will only mark one system page dirty even
      when a large page (for example a THP page) is being unmapped.  The
      consequence of this is that part of the THP page might not get copied
      during live migration, resulting in memory corruption for the guest.
      
      This fixes it by computing and passing the page size in kvm_unmap_radix().
      
      Cc: stable@vger.kernel.org # v4.15+
      Fixes: e641a317 (KVM: PPC: Book3S HV: Unify dirty page map between HPT and radix)
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
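To see why passing a page count where a byte size is expected marks only one page, consider the sketch below. It assumes a 64K system page and a helper that rounds a byte size up to whole pages; both the page size and the helper name are assumptions for illustration, not the kernel's definitions.

```c
/* Assumed 64K system page size, for illustration only. */
#define SKETCH_PAGE_SHIFT 16
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/*
 * Hypothetical stand-in for a helper that takes a mapping size in bytes
 * and returns how many system pages must be marked dirty.
 */
static unsigned long sketch_pages_to_mark(unsigned long psize)
{
	return (psize + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}
```

For a 2MB mapping, `sketch_pages_to_mark(1UL << 21)` gives 32 pages. If the caller first divides by the page size and passes 32 instead, the helper treats it as 32 bytes and returns 1.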
  9. 30 Jul, 2018 4 commits
  10. 26 Jul, 2018 2 commits
    • KVM: PPC: Book3S HV: Read kvm->arch.emul_smt_mode under kvm->lock · b5c6f760
      Paul Mackerras committed
      Commit 1e175d2e ("KVM: PPC: Book3S HV: Pack VCORE IDs to access full
      VCPU ID space", 2018-07-25) added code that uses kvm->arch.emul_smt_mode
      before any VCPUs are created.  However, userspace can change
      kvm->arch.emul_smt_mode at any time up until the first VCPU is created.
      Hence it is (theoretically) possible for the check in
      kvmppc_core_vcpu_create_hv() to race with another userspace thread
      changing kvm->arch.emul_smt_mode.
      
      This fixes it by moving the test that uses kvm->arch.emul_smt_mode into
      the block where kvm->lock is held.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Pack VCORE IDs to access full VCPU ID space · 1e175d2e
      Sam Bobroff committed
      It is not currently possible to create the full number of possible
      VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses fewer
      threads per core than its core stride (or "VSMT mode"). This is
      because the VCORE ID and XIVE offsets grow beyond KVM_MAX_VCPUS
      even though the VCPU ID is less than KVM_MAX_VCPU_ID.
      
      To address this, "pack" the VCORE ID and XIVE offsets by using
      knowledge of the way the VCPU IDs will be used when there are fewer
      guest threads per core than the core stride. The primary thread of
      each core will always be used first. Then, if the guest uses more than
      one thread per core, these secondary threads will sequentially follow
      the primary in each core.
      
      So, the only way an ID above KVM_MAX_VCPUS can be seen, is if the
      VCPUs are being spaced apart, so at least half of each core is empty,
      and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
      into the second half of each core (4..7, in an 8-thread core).
      
      Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
      each core is being left empty, and we can map down into the second and
      third quarters of each core (2, 3 and 5, 6 in an 8-thread core).
      
      Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
      threads are being used and 7/8 of the core is empty, allowing use of
      the 1, 5, 3 and 7 thread slots.
      
      (Strides less than 8 are handled similarly.)
      
      This allows the VCORE ID or offset to be calculated quickly from the
      VCPU ID or XIVE server numbers, without access to the VCPU structure.
      
      [paulus@ozlabs.org - tidied up comment a little, changed some WARN_ONCE
       to pr_devel, wrapped line, fixed id check.]
      Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
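The packing idea can be sketched as a small pure function. This is an illustration under stated assumptions, not the kernel's code: the limit of 16 VCPUs is deliberately tiny, the function name is hypothetical, and the order of the block offset table is an assumption chosen so that each later block of IDs lands in thread slots that earlier blocks leave empty.

```c
/* Assumed limits, kept small so the mapping is easy to follow. */
#define SKETCH_MAX_VCPUS       16
#define SKETCH_MAX_SMT_THREADS 8

/*
 * Map a sparse VCPU ID onto a dense range below SKETCH_MAX_VCPUS.
 * stride is the core stride (VSMT mode) and must be 1, 2, 4 or 8.
 * Returns the packed ID, or -1 if the ID cannot be packed.
 */
static int sketch_pack_vcpu_id(unsigned int id, int stride)
{
	/* Assumed slot order: second half first, then quarters, then eighths. */
	static const int block_offsets[SKETCH_MAX_SMT_THREADS] = {
		0, 4, 2, 6, 1, 5, 3, 7
	};
	unsigned int packed;
	int block;

	if (stride <= 0 || stride > SKETCH_MAX_SMT_THREADS)
		return -1;

	/* Which multiple of SKETCH_MAX_VCPUS the ID falls into, scaled by stride. */
	block = (int)(id / SKETCH_MAX_VCPUS) * (SKETCH_MAX_SMT_THREADS / stride);
	if (block >= SKETCH_MAX_SMT_THREADS)
		return -1;

	packed = (id % SKETCH_MAX_VCPUS) + (unsigned int)block_offsets[block];
	if (packed >= SKETCH_MAX_VCPUS)
		return -1;

	return (int)packed;
}
```

With a stride of 8 and one thread per core, the sixteen IDs 0, 8, 16, ..., 120 each map to a distinct value below 16, and the result depends only on the ID and the stride, so no VCPU structure is needed.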
  11. 18 Jul, 2018 5 commits
    • KVM: PPC: Check if IOMMU page is contained in the pinned physical page · 76fa4975
      Alexey Kardashevskiy committed
      A VM which has:
       - a DMA capable device passed through to it (eg. network card);
       - running a malicious kernel that ignores H_PUT_TCE failure;
       - capability of using IOMMU pages bigger than physical pages
      can create an IOMMU mapping that exposes (for example) 16MB of
      the host physical memory to the device when only 64K was allocated to the VM.
      
      The remaining 16MB - 64K will be some other content of host memory, possibly
      including pages of the VM, but also pages of host kernel memory, host
      programs or other VMs.
      
      The attacking VM does not control the location of the page it can map,
      and is only allowed to map as many pages as it has pages of RAM.
      
      We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that
      an IOMMU page is contained in the physical page so the PCI hardware won't
      get access to unassigned host memory; however this check is missing in
      the KVM fastpath (H_PUT_TCE accelerated code). We were lucky so far and
      did not hit this yet as the very first time when the mapping happens
      we do not have tbl::it_userspace allocated yet and fall back to
      the userspace which in turn calls VFIO IOMMU driver, this fails and
      the guest does not retry.
      
      This stores the smallest preregistered page size in the preregistered
      region descriptor and changes the mm_iommu_xxx API to check this against
      the IOMMU page size.
      
      This calculates maximum page size as a minimum of the natural region
      alignment and compound page size. For the page shift this uses the shift
      returned by find_linux_pte() which indicates how the page is mapped to
      the current userspace - if the page is huge and this is not a zero, then
      it is a leaf pte and the page is mapped within the range.
      
      Fixes: 121f80ba ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
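The containment check described in this commit message can be sketched as follows, assuming page sizes are tracked as shifts (log2 of the size in bytes). The struct and function names are hypothetical and do not correspond to the real mm_iommu API.

```c
#include <stdbool.h>

/* Hypothetical descriptor for a preregistered memory region. */
struct sketch_prereg_region {
	unsigned int pageshift; /* smallest backing page shift seen so far */
};

/* While pinning, remember the smallest backing page in the region. */
static void sketch_note_backing_page(struct sketch_prereg_region *region,
				     unsigned int shift)
{
	if (shift < region->pageshift)
		region->pageshift = shift;
}

/*
 * An IOMMU page may be mapped only if it fits inside the smallest
 * backing page; otherwise the device could reach memory that was never
 * assigned to the VM.
 */
static bool sketch_iommu_page_fits(const struct sketch_prereg_region *region,
				   unsigned int iommu_pageshift)
{
	return iommu_pageshift <= region->pageshift;
}
```

For example, a region backed by at least one 64K page (shift 16) accepts 64K IOMMU pages but rejects 16MB ones (shift 24).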
    • KVM: PPC: Book3S HV: Fix constant size warning · 0abb75b7
      Nicholas Mc Guire committed
      The constants are 64bit but not explicitly declared UL resulting
      in sparse warnings. Fix this by declaring the constants UL.
      Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Add of_node_put() in success path · 51eaa08f
      Nicholas Mc Guire committed
      The call to of_find_compatible_node() is returning a pointer with
      incremented refcount so it must be explicitly decremented after the
      last use. As here it is only being used for checking of node presence
      but the result is not actually used in the success path it can be
      dropped immediately.
      Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
      Fixes: commit f725758b ("KVM: PPC: Book3S HV: Use OPAL XICS emulation on POWER9")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S: Fix matching of hardware and emulated TCE tables · 76346cd9
      Alexey Kardashevskiy committed
      When attaching a hardware table to LIOBN in KVM, we match table parameters
      such as page size, table offset and table size. However the tables are
      created via very different paths - VFIO and KVM - and the VFIO path goes
      through the platform code which has minimum TCE page size requirement
      (which is 4K but since we allocate memory by pages and cannot avoid
      alignment anyway, we align to 64k pages for powernv_defconfig).
      
      So when we match the tables, one might be bigger than the other which
      means the hardware table cannot get attached to LIOBN and DMA mapping
      fails.
      
      This removes the table size alignment from the guest visible table.
      This does not affect the memory allocation which is still aligned -
      kvmppc_tce_pages() takes care of this.
      
      This relaxes the check we do when attaching tables to allow the hardware
      table to be bigger than the guest visible table.
      
      Ideally we want the KVM table to cover the same space as the hardware
      table does but since the hardware table may use multiple levels, and
      all levels must use the same table size (IODA2 design), the area it can
      actually cover might get very different from the window size which
      the guest requested, even though the guest won't map it all.
      
      Fixes: ca1fc489 "KVM: PPC: Book3S: Allow backing bigger guest IOMMU pages with smaller physical pages"
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Remove mmio_vsx_tx_sx_enabled in KVM MMIO emulation · 4eeb8556
      Simon Guo committed
      Originally PPC KVM MMIO emulation used only 5 bits (0~31) for the VSR
      register number, and used the mmio_vsx_tx_sx_enabled field in
      addition to address VSR registers 0~63.
      
      Currently PPC KVM MMIO emulation is reimplemented with analyse_instr()
      assistance.  analyse_instr() returns 0~63 for VSR register number, so
      it is not necessary to use additional mmio_vsx_tx_sx_enabled field
      any more.
      
      This patch extends related reg bits (expand io_gpr to u16 from u8
      and use 6 bits for VSR reg#), so that mmio_vsx_tx_sx_enabled can
      be removed.
      Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  12. 16 Jul, 2018 4 commits
    • powerpc/powernv/ioda: Allocate indirect TCE levels on demand · a68bd126
      Alexey Kardashevskiy committed
      At the moment we allocate the entire TCE table, twice (hardware part and
      userspace translation cache). This normally works as we normally have
      contiguous memory and the guest will map the entire RAM for 64bit DMA.
      
      However if we have sparse RAM (one example is a memory device), then
      we will allocate TCEs which will never be used as the guest only maps
      actual memory for DMA. If it is a single level TCE table, there is nothing
      we can really do but if it is a multilevel table, we can skip allocating
      TCEs we know we won't need.
      
      This adds ability to allocate only first level, saving memory.
      
      This changes iommu_table::free() to avoid allocating an extra level;
      iommu_table::set() will do this when needed.
      
      This adds @alloc parameter to iommu_table::exchange() to tell the callback
      if it can allocate an extra level; the flag is set to "false" for
      the realmode KVM handlers of H_PUT_TCE hcalls and the callback returns
      H_TOO_HARD.
      
      This still requires the entire table to be counted in mm::locked_vm.
      
      To be conservative, this only does on-demand allocation when
      the userspace cache table is requested, which is the case for VFIO.
      
      The example math for a system replicating a powernv setup with NVLink2
      in a guest:
      16GB RAM mapped at 0x0
      128GB GPU RAM window (16GB of actual RAM) mapped at 0x244000000000
      
      the table to cover that all with 64K pages takes:
      (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4656MB
      
      If we allocate only necessary TCE levels, we will only need:
      (((0x400000000 + 0x400000000) >> 16)*8)>>20 = 4MB (plus some for indirect
      levels).
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
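The example math in this commit message can be checked mechanically. The helper below is a hypothetical sketch that assumes a flat table of 8-byte TCEs covering everything from address 0 up to `end`. Evaluated in 64-bit arithmetic, the first expression comes to 4656 MB and the second to 4 MB.

```c
#include <stdint.h>

/*
 * Size in MB of a flat table of 8-byte TCEs covering addresses
 * [0, end) with IOMMU pages of (1 << page_shift) bytes.
 */
static uint64_t sketch_tce_table_mb(uint64_t end, unsigned int page_shift)
{
	uint64_t entries = end >> page_shift; /* one TCE per IOMMU page */
	uint64_t bytes = entries * 8;         /* 8 bytes per TCE */

	return bytes >> 20;                   /* bytes to MB */
}
```

Calling `sketch_tce_table_mb(0x244000000000 + 0x2000000000, 16)` covers the whole sparse address range, while `sketch_tce_table_mb(0x400000000 + 0x400000000, 16)` counts only the 32GB of memory that is actually mapped.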
    • powerpc/powernv: Add indirect levels to it_userspace · 090bad39
      Alexey Kardashevskiy committed
      We want to support sparse memory and therefore huge chunks of DMA windows
      do not need to be mapped. If a DMA window is big enough to require 2 or
      more indirect levels, and the DMA window is used to map all RAM (which is
      the default case for a 64bit window), we can actually save some memory by
      not allocating TCEs for regions which we are not going to map anyway.
      
      The hardware tables already support indirect levels but we also keep
      host-physical-to-userspace translation array which is allocated by
      vmalloc() and is a flat array which might use quite some memory.
      
      This converts it_userspace from vmalloc'ed array to a multi level table.
      
      As the format becomes platform dependent, this replaces the direct access
      to it_userspace with an iommu_table_ops::useraddrptr hook which returns
      a pointer to the userspace copy of a TCE; future extension will return
      NULL if the level was not allocated.
      
      This should not change non-KVM handling of TCE tables and it_userspace
      will not be allocated for non-KVM tables.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Make iommu_table::it_userspace big endian · 00a5c58d
      Alexey Kardashevskiy committed
      We are going to reuse multilevel TCE code for the userspace copy of
      the TCE table and since it is big endian, let's make the copy big endian
      too.
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s: Remove POWER9 DD1 support · 2bf1071a
      Nicholas Piggin committed
      POWER9 DD1 was never a product. It is no longer supported by upstream
      firmware, and it is not effectively supported in Linux due to lack of
      testing.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
      [mpe: Remove arch_make_huge_pte() entirely]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  13. 20 Jun, 2018 1 commit
  14. 13 Jun, 2018 8 commits
    • KVM: PPC: Book3S PR: Fix failure status setting in tabort. emulation · f61e0d3c
      Simon Guo committed
      tabort. will perform transaction failure recording and the recording
      depends on TEXASR FS bit. Currently the TEXASR FS bit is retrieved
      after tabort., when the TEXASR FS bit has already been updated by
      tabort. itself.
      
      This patch corrects this behavior by retrieving TEXASR val before
      tabort.
      
      tabort. will not immediately lead to transaction failure handling
      in suspend state, so this patch also removes the mtspr on the
      TEXASR/TFIAR registers to avoid a TM Bad Thing exception.
      
      Fixes: 26798f88 ("KVM: PPC: Book3S PR: Add emulation for tabort. in privileged state")
      Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S PR: Enable use on POWER9 bare-metal hosts in HPT mode · db96a04a
      Paul Mackerras committed
      It turns out that PR KVM has no dependency on the format of HPTEs,
      because it uses functions pointed to by mmu_hash_ops which do all
      the formatting and interpretation of HPTEs.  Thus we can allow PR
      KVM to load on POWER9 bare-metal hosts as long as they are running
      in HPT mode.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S PR: Don't let PAPR guest set MSR hypervisor bit · 4f169d21
      Paul Mackerras committed
      PAPR guests run in supervisor mode and should not be able to set the
      MSR HV (hypervisor mode) bit or clear the ME (machine check enable)
      bit by mtmsrd or any other means.  To enforce this, we force MSR_HV
      off and MSR_ME on in kvmppc_set_msr_pr.  Without this, the guest
      can appear to be in hypervisor mode to itself and to userspace.
      This has been observed to cause a crash in QEMU when it tries to
      deliver a system reset interrupt to the guest.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
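The enforcement described in this commit message amounts to masking one bit off and forcing another on. The sketch below assumes the 64-bit MSR layout with HV at bit 60 and ME at bit 12 (counting from the least significant bit); the macro and function names are hypothetical, and the real kvmppc_set_msr_pr() does considerably more than this.

```c
#include <stdint.h>

/* Assumed MSR bit positions, counted from the least significant bit. */
#define SKETCH_MSR_HV (1ULL << 60) /* hypervisor state */
#define SKETCH_MSR_ME (1ULL << 12) /* machine check enable */

/*
 * Sanitize an MSR value requested by a PAPR guest: the guest must never
 * appear to run in hypervisor mode, and machine checks must stay enabled.
 */
static uint64_t sketch_sanitize_papr_msr(uint64_t msr)
{
	msr &= ~SKETCH_MSR_HV; /* force HV off */
	msr |= SKETCH_MSR_ME;  /* force ME on */
	return msr;
}
```

Every other bit of the requested value passes through unchanged, so a guest that tries to set HV or clear ME simply sees those two bits corrected.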
    • KVM: PPC: Book3S PR: Fix failure status setting in treclaim. emulation · a50623fb
      Paul Mackerras committed
      The treclaim. emulation needs to record failure status in the TEXASR
      register if the transaction had not previously failed.  However, the
      current code first does kvmppc_save_tm_pr() (which does a treclaim.
      itself) and then checks the failure summary bit in TEXASR after that.
      Since treclaim. itself causes transaction failure, the FS bit is
      always set, so we were never updating TEXASR with the failure cause
      supplied by the guest as the RA parameter to the treclaim. instruction.
      This caused the tm-unavailable test in tools/testing/selftests/powerpc/tm
      to fail.
      
      To fix this, we need to read TEXASR before calling kvmppc_save_tm_pr(),
      and base the final value of TEXASR on that value.
      
      Fixes: 03c81682 ("KVM: PPC: Book3S PR: Add emulation for treclaim.")
      Reviewed-by: Simon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S PR: Fix MSR setting when delivering interrupts · 916ccadc
      Paul Mackerras committed
      This makes sure that MSR "partial-function" bits are not transferred
      to SRR1 when delivering an interrupt.  This was causing failures in
      guests running kernels that include commit f3d96e69 ("powerpc/mm:
      Overhaul handling of bad page faults", 2017-07-19), which added code
      to check bits of SRR1 on instruction storage interrupts (ISIs) that
      indicate a bad page fault.  The symptom was that a guest user program
      that handled a signal and attempted to return from the signal handler
      would get a SIGBUS signal and die.
      
      The code that generated ISIs and some other interrupts would
      previously set bits in the guest MSR to indicate the interrupt status
      and then call kvmppc_book3s_queue_irqprio().  This technique no
      longer works now that kvmppc_inject_interrupt() is masking off those
      bits.  Instead we make kvmppc_core_queue_data_storage() and
      kvmppc_core_queue_inst_storage() call kvmppc_inject_interrupt()
      directly, and make sure that all the places that generate ISIs or
      DSIs call kvmppc_core_queue_{data,inst}_storage instead of
      kvmppc_book3s_queue_irqprio().
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S PR: Handle additional interrupt types · b71dc519
      Cameron Kaiser committed
      This adds trivial handling for additional interrupt types that KVM-PR must
      support for proper virtualization on a POWER9 host in HPT mode, as a further
      prerequisite to enabling KVM-PR on that configuration.
      Signed-off-by: Cameron Kaiser <spectre@floodgap.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • treewide: Use array_size() in vzalloc() · fad953ce
      Kees Cook committed
      The vzalloc() function has no 2-factor argument form, so multiplication
      factors need to be wrapped in array_size(). This patch replaces cases of:
      
              vzalloc(a * b)
      
      with:
              vzalloc(array_size(a, b))
      
      as well as handling cases of:
      
              vzalloc(a * b * c)
      
      with:
      
              vzalloc(array3_size(a, b, c))
      
      This does, however, attempt to ignore constant size factors like:
      
              vzalloc(4 * 1024)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
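The reason for wrapping the factors is that a plain `a * b` can silently wrap around and request a much smaller allocation than intended. A minimal sketch of a saturating two-factor helper follows; `sketch_array_size` is a hypothetical stand-in written for illustration and is not the kernel's array_size() implementation.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Multiply two size factors, saturating at SIZE_MAX on overflow.
 * An allocator asked for SIZE_MAX bytes is expected to fail cleanly
 * instead of returning an undersized buffer.
 */
static size_t sketch_array_size(size_t a, size_t b)
{
	if (b != 0 && a > SIZE_MAX / b)
		return SIZE_MAX; /* a * b would not fit in size_t */

	return a * b;
}
```

Small products such as `sketch_array_size(4, 1024)` are returned unchanged, while a product that would overflow, such as `sketch_array_size(SIZE_MAX, 2)`, saturates at `SIZE_MAX`.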
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        vzalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        vzalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        vzalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        vzalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
        vzalloc(
      -	sizeof(TYPE) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
        vzalloc(
      -	SIZE * COUNT
      +	array_size(COUNT, SIZE)
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        vzalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vzalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        vzalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vzalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vzalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        vzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        vzalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vzalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        vzalloc(C1 * C2 * C3, ...)
      |
        vzalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2-factor products when they're not all constants.
      @@
      expression E1, E2;
      constant C1, C2;
      @@
      
      (
        vzalloc(C1 * C2, ...)
      |
        vzalloc(
      -	E1 * E2
      +	array_size(E1, E2)
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
      fad953ce
    • K
      treewide: Use array_size() in vmalloc() · 42bc47b3
      Kees Cook committed
      The vmalloc() function has no 2-factor argument form, so multiplication
      factors need to be wrapped in array_size(). This patch replaces cases of:
      
              vmalloc(a * b)
      
      with:

              vmalloc(array_size(a, b))
      
      as well as handling cases of:
      
              vmalloc(a * b * c)
      
      with:
      
              vmalloc(array3_size(a, b, c))
      
      This does, however, attempt to ignore constant size factors like:
      
              vmalloc(4 * 1024)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", "u8", or "__u8"
      were dropped, since they always evaluate to 1 and are therefore redundant.
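
      To illustrate why the factors are wrapped rather than multiplied in place,
      here is a minimal userspace sketch of the saturating behaviour of
      array_size() and array3_size(). The demo_ names are hypothetical stand-ins,
      not the kernel implementation, and the sketch assumes a GCC or Clang
      compiler that provides __builtin_mul_overflow():

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for array_size(): returns a * b, or SIZE_MAX if
 * the product does not fit in size_t. An allocator asked for SIZE_MAX
 * bytes fails, so an overflow becomes an allocation failure instead of
 * a silently undersized buffer.
 */
static inline size_t demo_array_size(size_t a, size_t b)
{
	size_t bytes;

	if (__builtin_mul_overflow(a, b, &bytes))
		return SIZE_MAX;
	return bytes;
}

/* Hypothetical stand-in for array3_size(): same idea with three factors. */
static inline size_t demo_array3_size(size_t a, size_t b, size_t c)
{
	size_t bytes;

	if (__builtin_mul_overflow(a, b, &bytes))
		return SIZE_MAX;
	if (__builtin_mul_overflow(bytes, c, &bytes))
		return SIZE_MAX;
	return bytes;
}
```

      With these helpers, a plain "count * size" that would wrap around to a
      small number instead yields SIZE_MAX, which the allocator rejects.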
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        vmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        vmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        vmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
        vmalloc(
      -	sizeof(TYPE) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
        vmalloc(
      -	SIZE * COUNT
      +	array_size(COUNT, SIZE)
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        vmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        vmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        vmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        vmalloc(C1 * C2 * C3, ...)
      |
        vmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2-factor products when they're not all constants.
      @@
      expression E1, E2;
      constant C1, C2;
      @@
      
      (
        vmalloc(C1 * C2, ...)
      |
        vmalloc(
      -	E1 * E2
      +	array_size(E1, E2)
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
      42bc47b3
  15. 02 Jun, 2018 2 commits
    • G
      kvm: no need to check return value of debugfs_create functions · 929f45e3
      Greg Kroah-Hartman committed
      When calling debugfs functions, there is no need to ever check the
      return value.  The function can work or not, but the code logic should
      never do something different based on this.
      
      This cleans up the error handling a lot, as this code will never get
      hit.
      
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Christoffer Dall <christoffer.dall@arm.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Arvind Yadav <arvind.yadav.cs@gmail.com>
      Cc: Eric Auger <eric.auger@redhat.com>
      Cc: Andre Przywara <andre.przywara@arm.com>
      Cc: kvm-ppc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: kvmarm@lists.cs.columbia.edu
      Cc: kvm@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      929f45e3
    • S
      kvm: Change return type to vm_fault_t · 1499fa80
      Souptick Joarder committed
      Use new return type vm_fault_t for fault handler. For
      now, this is just documenting that the function returns
      a VM_FAULT value rather than an errno. Once all instances
      are converted, vm_fault_t will become a distinct type.
      
      See commit 1c8f4220 ("mm: change return type to vm_fault_t").
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1499fa80
  16. 01 Jun, 2018 1 commit