1. 19 Feb 2019 (3 commits)
  2. 04 Jan 2019 (1 commit)
    • Remove 'type' argument from access_ok() function · 96d4f267
      Linus Torvalds committed
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted in this patch, because there is no way we're going
      to move the old VERIFY_xyz interface to that model.  And it's best done at
      the end of the merge window when I've done most of my merges, so let's
      just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 30 Dec 2018 (1 commit)
    • KVM: PPC: Book3S HV: radix: Fix uninitialized var build error · f4607722
      Michael Ellerman committed
      Old GCCs (4.6.3 at least) aren't able to follow the logic in
      __kvmhv_copy_tofrom_guest_radix() and warn that old_pid is used
      uninitialized:
      
        arch/powerpc/kvm/book3s_64_mmu_radix.c:75:3: error: 'old_pid' may be
        used uninitialized in this function
      
      The logic is OK: we only use old_pid if quadrant == 1, and in that
      case it has definitely been initialised, eg:
      
      	if (quadrant == 1) {
      		old_pid = mfspr(SPRN_PID);
      	...
      	if (quadrant == 1 && pid != old_pid)
      		mtspr(SPRN_PID, old_pid);
      
      Annotate it to fix the error.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  4. 21 Dec 2018 (8 commits)
    • treewide: surround Kconfig file paths with double quotes · 8636a1f9
      Masahiro Yamada committed
      The Kconfig lexer supports special characters such as '.' and '/' in
      the parameter context. In my understanding, the reason is just to
      support bare file paths in the source statement.
      
      I do not see a good reason to complicate Kconfig for the room of
      ambiguity.
      
      The majority of code already surrounds file paths with double quotes,
      and it makes sense since file paths are constant string literals.
      
      Make it treewide consistent now.
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Wolfram Sang <wsa@the-dreams.de>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Ingo Molnar <mingo@kernel.org>
    • KVM: Make kvm_set_spte_hva() return int · 748c0e31
      Lan Tianyu committed
      Make kvm_set_spte_hva() return int so that the caller can check the
      return value to determine whether or not to flush the TLB.
      Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
      Acked-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • powerpc/vfio/iommu/kvm: Do not pin device memory · c10c21ef
      Alexey Kardashevskiy committed
      This new memory does not have page structs as it is not plugged into
      the host, so gup() will fail anyway.
      
      This adds 2 helpers:
      - mm_iommu_newdev() to preregister the "memory device" memory so
      the rest of the API can still be used;
      - mm_iommu_is_devmem() to know if the physical address is one of these
      new regions which we must avoid unpinning.
      
      This adds @mm to tce_page_is_contained() and iommu_tce_xchg() to test
      if the memory is device memory to avoid pfn_to_page().
      
      This adds a check for device memory in mm_iommu_ua_mark_dirty_rm() which
      does delayed pages dirtying.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Keep rc bits in shadow pgtable in sync with host · ae59a7e1
      Suraj Jitindar Singh committed
      The rc bits contained in ptes are used to track whether a page has been
      accessed and whether it is dirty. The accessed bit is used to age a page
      and the dirty bit to track whether a page is dirty or not.
      
      Now that we support nested guests there are three ptes which track the
      state of the same page:
      - The partition-scoped page table in the L1 guest, mapping L2->L1 address
      - The partition-scoped page table in the host for the L1 guest, mapping
        L1->L0 address
      - The shadow partition-scoped page table for the nested guest in the host,
        mapping L2->L0 address
      
      The idea is to attempt to keep the rc state of these three ptes in sync,
      both when setting and when clearing rc bits.
      
      When setting the bits we achieve consistency by:
      - Initially setting the bits in the shadow page table as the 'and' of the
        other two.
      - When updating in software the rc bits in the shadow page table we
        ensure the state is consistent with the other two locations first, and
        update these before reflecting the change into the shadow page table.
        i.e. only set the bits in the L2->L0 pte if also set in both the
             L2->L1 and the L1->L0 pte.
      
      When clearing the bits we achieve consistency by:
      - The rc bits in the shadow page table are only cleared when discarding
        a pte, and we don't need to record this as if either bit is set then
        it must also be set in the pte mapping L1->L0.
      - When L1 clears an rc bit in the L2->L1 mapping it __should__ issue a
        tlbie instruction
        - This means we will discard the pte from the shadow page table
          meaning the mapping will have to be setup again.
        - When we set up the pte again in the shadow page table we will ensure
          consistency with the L2->L1 pte.
      - When the host clears an rc bit in the L1->L0 mapping we need to also
        clear the bit in any ptes in the shadow page table which map the same
        gfn so we will be notified if a nested guest accesses the page.
        This case is what this patch specifically concerns.
        - We can search the nest_rmap list for that given gfn and clear the
          same bit from all corresponding ptes in shadow page tables.
        - If a nested guest causes either of the rc bits to be set by software
          in future then we will update the L1->L0 pte and maintain consistency.
      
      With the process outlined above we aim to maintain consistency of the 3
      pte locations where we track rc for a given guest page.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Introduce kvmhv_update_nest_rmap_rc_list() · 90165d3d
      Suraj Jitindar Singh committed
      Introduce a function kvmhv_update_nest_rmap_rc_list() which for a given
      nest_rmap list will traverse it, find the corresponding pte in the shadow
      page tables, and if it still maps the same host page update the rc bits
      accordingly.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Apply combination of host and l1 pte rc for nested guest · 8b23eee4
      Suraj Jitindar Singh committed
      The shadow page table contains ptes for translations from nested guest
      address to host address. Currently when creating these ptes we take the
      rc bits from the pte for the L1 guest address to host address
      translation. This is incorrect as we must also factor in the rc bits
      from the pte for the nested guest address to L1 guest address
      translation (as contained in the L1 guest partition table for the nested
      guest).
      
      By not calculating these bits correctly L1 may not have been correctly
      notified when it needed to update its rc bits in the partition table it
      maintains for its nested guest.
      
      Modify the code so that the rc bits in the resultant pte for the L2->L0
      translation are the 'and' of the rc bits in the L2->L1 pte and the L1->L0
      pte, also accounting for whether this was a write access when setting
      the dirty bit.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Align gfn to L1 page size when inserting nest-rmap entry · 8400f874
      Suraj Jitindar Singh committed
      Nested rmap entries are used to store the translation from L1 gpa to L2
      gpa when entries are inserted into the shadow (nested) page tables. This
      rmap list is located by indexing the rmap array in the memslot by L1
      gfn. When we come to search for these entries we only know the L1 page size
      (which could be PAGE_SIZE, 2M or a 1G page) and so can only select a gfn
      aligned to that size. This means that when we insert an entry, we need
      to align the gfn used to select the rmap list to the L1 page size as
      well, so that the entry can be found later.
      
      By not doing this we were missing nested rmap entries when modifying L1
      ptes which were for a page also passed through to an L2 guest.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Hold kvm->mmu_lock across updating nested pte rc bits · bec6e03b
      Suraj Jitindar Singh committed
      We already hold the kvm->mmu_lock spin lock across updating the rc bits
      in the pte for the L1 guest. Continue to hold the lock across updating
      the rc bits in the pte for the nested guest as well to prevent
      invalidations from occurring.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  5. 20 Dec 2018 (2 commits)
  6. 17 Dec 2018 (12 commits)
    • KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L3 guest · 95d386c2
      Suraj Jitindar Singh committed
      Previously when a device was being emulated by an L1 guest for an L2
      guest, that device couldn't then be passed through to an L3 guest. This
      was because the L1 guest had no method for accessing L3 memory.
      
      The hcall H_COPY_TOFROM_GUEST provides this access. Thus this setup for
      passthrough can now be allowed.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S: Introduce new hcall H_COPY_TOFROM_GUEST to access quadrants 1 & 2 · 6ff887b8
      Suraj Jitindar Singh committed
      A guest cannot access quadrants 1 or 2 as this would result in an
      exception. Thus introduce the hcall H_COPY_TOFROM_GUEST to be used by a
      guest when it wants to perform an access to quadrants 1 or 2, for
      example when it wants to access memory for one of its nested guests.
      
      Also provide an implementation for the kvm-hv module.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L2 guest · 873db2cd
      Suraj Jitindar Singh committed
      Allow for a device which is being emulated at L0 (the host) for an L1
      guest to be passed through to a nested (L2) guest.
      
      The existing kvmppc_hv_emulate_mmio function can be used here. The main
      challenge is that for a load the result must be stored into the L2 gpr,
      not an L1 gpr as would normally be the case after going out to qemu to
      complete the operation. This presents a challenge as at this point the
      L2 gpr state has been written back into L1 memory.
      
      To work around this we store the address in L1 memory of the L2 gpr
      where the result of the load is to be stored and use the new io_gpr
      value KVM_MMIO_REG_NESTED_GPR to indicate that this is a nested load for
      which completion must be done when returning back into the kernel. Then
      in kvmppc_complete_mmio_load() the resultant value is written into L1
      memory at the location of the indicated L2 gpr.
      
      Note that we don't currently let an L1 guest emulate a device for an L2
      guest which is then passed through to an L3 guest.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Update kvmppc_st and kvmppc_ld to use quadrants · cc6929cc
      Suraj Jitindar Singh committed
      The functions kvmppc_st and kvmppc_ld are used to access guest memory
      from the host using a guest effective address. They do so by translating
      through the process table to obtain a guest real address and then using
      kvm_read_guest or kvm_write_guest to make the access with the guest real
      address.
      
      This method of access however only works for L1 guests and will give the
      incorrect results for a nested guest.
      
      We can however use the store_to_eaddr and load_from_eaddr kvmppc_ops to
      perform the access for a nested guest (and an L1 guest). So attempt this
      method first and fall back to the old method if this fails and we aren't
      running a nested guest.
      
      At this stage there is no fall back method to perform the access for a
      nested guest and this is left as a future improvement. For now we will
      return to the nested guest and rely on the fact that a translation
      should be faulted in before retrying the access.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Add load_from_eaddr and store_to_eaddr to the kvmppc_ops struct · dceadcf9
      Suraj Jitindar Singh committed
      The kvmppc_ops struct is used to store function pointers to kvm
      implementation specific functions.
      
      Introduce two new functions load_from_eaddr and store_to_eaddr to be
      used to load from and store to a guest effective address respectively.
      
      Also implement these for the kvm-hv module. If we are using the radix
      mmu then we can call the functions to access quadrant 1 and 2.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Implement functions to access quadrants 1 & 2 · d7b45615
      Suraj Jitindar Singh committed
      The POWER9 radix mmu has the concept of quadrants. The quadrant number
      is the two high bits of the effective address and determines the fully
      qualified address to be used for the translation. The fully qualified
      address consists of the effective lpid, the effective pid and the
      effective address. This gives 4 possible quadrants: 0, 1, 2 and 3.
      
      When accessing these quadrants the fully qualified address is obtained
      as follows:
      
      Quadrant		| Hypervisor		| Guest
      --------------------------------------------------------------------------
      			| EA[0:1] = 0b00	| EA[0:1] = 0b00
      0			| effLPID = 0		| effLPID = LPIDR
      			| effPID  = PIDR	| effPID  = PIDR
      --------------------------------------------------------------------------
      			| EA[0:1] = 0b01	|
      1			| effLPID = LPIDR	| Invalid Access
      			| effPID  = PIDR	|
      --------------------------------------------------------------------------
      			| EA[0:1] = 0b10	|
      2			| effLPID = LPIDR	| Invalid Access
      			| effPID  = 0		|
      --------------------------------------------------------------------------
      			| EA[0:1] = 0b11	| EA[0:1] = 0b11
      3			| effLPID = 0		| effLPID = LPIDR
      			| effPID  = 0		| effPID  = 0
      --------------------------------------------------------------------------
      
      In the guest:
      Quadrant 3 is normally used to address the operating system since this
      uses effPID=0 and effLPID=LPIDR, meaning the PID register doesn't need to
      be switched.
      Quadrant 0 is normally used to address user space since the effLPID and
      effPID are taken from the corresponding registers.
      
      In the host:
      Quadrant 0 and 3 are used as above, however the effLPID is always 0 to
      address the host.
      
      Quadrants 1 and 2 can be used by the host to address guest memory using
      a guest effective address. Since the effLPID comes from the LPID register,
      the host loads the LPID of the guest it would like to access (and the
      PID of the process) and can perform accesses to a guest effective
      address.
      
      This means quadrant 1 can be used to address the guest user space and
      quadrant 2 can be used to address the guest operating system from the
      hypervisor, using a guest effective address.
      
      Access to the quadrants can cause a Hypervisor Data Storage Interrupt
      (HDSI) due to being unable to perform partition scoped translation.
      Previously this could only be generated from a guest and so the code
      path expects us to take the KVM trampoline in the interrupt handler.
      This is no longer the case so we modify the handler to call
      bad_page_fault() to check if we were expecting this fault so we can
      handle it gracefully and just return with an error code. In the hash mmu
      case we still raise an unknown exception since quadrants aren't defined
      for the hash mmu.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Add function kvmhv_vcpu_is_radix() · d232afeb
      Suraj Jitindar Singh committed
      There exists a function kvm_is_radix() which is used to determine if a
      kvm instance is using the radix mmu. However this only applies to the
      first level (L1) guest. Add a function kvmhv_vcpu_is_radix() which can
      be used to determine if the current execution context of the vcpu is
      radix, accounting for if the vcpu is running a nested guest.
      
      Currently all nested guests must be radix but this may change in the
      future.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S: Only report KVM_CAP_SPAPR_TCE_VFIO on powernv machines · 693ac10a
      Suraj Jitindar Singh committed
      The kvm capability KVM_CAP_SPAPR_TCE_VFIO is used to indicate the
      availability of in-kernel TCE acceleration for VFIO. However it is
      currently the case that this is only available on a powernv machine,
      not for a pseries machine.
      
      Thus make this capability dependent on having the cpu feature
      CPU_FTR_HVMODE.
      
      [paulus@ozlabs.org - fixed compilation for Book E.]
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Flush guest mappings when turning dirty tracking on/off · 5af3e9d0
      Paul Mackerras committed
      This adds code to flush the partition-scoped page tables for a radix
      guest when dirty tracking is turned on or off for a memslot.  Only the
      guest real addresses covered by the memslot are flushed.  The reason
      for this is to get rid of any 2M PTEs in the partition-scoped page
      tables that correspond to host transparent huge pages, so that page
      dirtiness is tracked at a system page (4k or 64k) granularity rather
      than a 2M granularity.  The page tables are also flushed when turning
      dirty tracking off so that the memslot's address space can be
      repopulated with THPs if possible.
      
      To do this, we add a new function kvmppc_radix_flush_memslot().  Since
      this does what's needed for kvmppc_core_flush_memslot_hv() on a radix
      guest, we now make kvmppc_core_flush_memslot_hv() call the new
      kvmppc_radix_flush_memslot() rather than calling kvm_unmap_radix()
      for each page in the memslot.  This has the effect of fixing a bug in
      that kvmppc_core_flush_memslot_hv() was previously calling
      kvm_unmap_radix() without holding the kvm->mmu_lock spinlock, which
      is required to be held.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Cleanups - constify memslots, fix comments · c43c3a86
      Paul Mackerras committed
      This adds 'const' to the declarations for the struct kvm_memory_slot
      pointer parameters of some functions, which will make it possible to
      call those functions from kvmppc_core_commit_memory_region_hv()
      in the next patch.
      
      This also fixes some comments about locking.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Map single pages when doing dirty page logging · f460f679
      Paul Mackerras committed
      For radix guests, this makes KVM map guest memory as individual pages
      when dirty page logging is enabled for the memslot corresponding to the
      guest real address.  Having a separate partition-scoped PTE for each
      system page mapped to the guest means that we have a separate dirty
      bit for each page, thus making the reported dirty bitmap more accurate.
      Without this, if part of guest memory is backed by transparent huge
      pages, the dirty status is reported at a 2MB granularity rather than
      a 64kB (or 4kB) granularity for that part, causing userspace to have
      to transmit more data when migrating the guest.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Pass change type down to memslot commit function · f032b734
      Bharata B Rao committed
      Currently, kvm_arch_commit_memory_region() gets called with a
      parameter indicating what type of change is being made to the memslot,
      but it doesn't pass it down to the platform-specific memslot commit
      functions.  This adds the `change' parameter to the lower-level
      functions so that they can use it in future.
      
      [paulus@ozlabs.org - fix book E also.]
      Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  7. 14 Dec 2018 (4 commits)
  8. 04 Dec 2018 (1 commit)
  9. 15 Nov 2018 (1 commit)
    • KVM: PPC: Book3S HV: Fix handling for interrupted H_ENTER_NESTED · 6c08ec12
      Michael Roth committed
      While running a nested guest VCPU on L0 via H_ENTER_NESTED hcall, a
      pending signal in the L0 QEMU process can generate the following
      sequence:
      
        ret0 = kvmppc_pseries_do_hcall()
          ret1 = kvmhv_enter_nested_guest()
            ret2 = kvmhv_run_single_vcpu()
            if (ret2 == -EINTR)
              return H_INTERRUPT
          if (ret1 == H_INTERRUPT)
            kvmppc_set_gpr(vcpu, 3, 0)
            return -EINTR
          /* skipped: */
          kvmppc_set_gpr(vcpu, 3, ret)
          vcpu->arch.hcall_needed = 0
          return RESUME_GUEST
      
      which causes an exit to L0 userspace with ret0 == -EINTR.
      
      The intention seems to be to set the hcall return value to 0 (via
      VCPU r3) so that L1 will see a successful return from H_ENTER_NESTED
      once we resume executing the VCPU. However, because we don't set
      vcpu->arch.hcall_needed = 0, we do the following once userspace
      resumes execution via kvm_arch_vcpu_ioctl_run():
      
        ...
        } else if (vcpu->arch.hcall_needed) {
           int i;
      
          kvmppc_set_gpr(vcpu, 3, run->papr_hcall.ret);
          for (i = 0; i < 9; ++i)
                 kvmppc_set_gpr(vcpu, 4 + i, run->papr_hcall.args[i]);
          vcpu->arch.hcall_needed = 0;
      
      since vcpu->arch.hcall_needed == 1 indicates that userspace should
      have handled the hcall and stored the return value in
      run->papr_hcall.ret. Since that's not the case here, we can get an
      unexpected value in VCPU r3, which can result in
      kvmhv_p9_guest_entry() reporting an unexpected trap value when it
      returns from H_ENTER_NESTED, causing the following register dump to
      console via subsequent call to kvmppc_handle_exit_hv() in L1:
      
        [  350.612854] vcpu 00000000f9564cf8 (0):
        [  350.612915] pc  = c00000000013eb98  msr = 8000000000009033  trap = 1
        [  350.613020] r 0 = c0000000004b9044  r16 = 0000000000000000
        [  350.613075] r 1 = c00000007cffba30  r17 = 0000000000000000
        [  350.613120] r 2 = c00000000178c100  r18 = 00007fffc24f3b50
        [  350.613166] r 3 = c00000007ef52480  r19 = 00007fffc24fff58
        [  350.613212] r 4 = 0000000000000000  r20 = 00000a1e96ece9d0
        [  350.613253] r 5 = 70616d00746f6f72  r21 = 00000a1ea117c9b0
        [  350.613295] r 6 = 0000000000000020  r22 = 00000a1ea1184360
        [  350.613338] r 7 = c0000000783be440  r23 = 0000000000000003
        [  350.613380] r 8 = fffffffffffffffc  r24 = 00000a1e96e9e124
        [  350.613423] r 9 = c00000007ef52490  r25 = 00000000000007ff
        [  350.613469] r10 = 0000000000000004  r26 = c00000007eb2f7a0
        [  350.613513] r11 = b0616d0009eccdb2  r27 = c00000007cffbb10
        [  350.613556] r12 = c0000000004b9000  r28 = c00000007d83a2c0
        [  350.613597] r13 = c000000001b00000  r29 = c0000000783cdf68
        [  350.613639] r14 = 0000000000000000  r30 = 0000000000000000
        [  350.613681] r15 = 0000000000000000  r31 = c00000007cffbbf0
        [  350.613723] ctr = c0000000004b9000  lr  = c0000000004b9044
        [  350.613765] srr0 = 0000772f954dd48c srr1 = 800000000280f033
        [  350.613808] sprg0 = 0000000000000000 sprg1 = c000000001b00000
        [  350.613859] sprg2 = 0000772f9565a280 sprg3 = 0000000000000000
        [  350.613911] cr = 88002848  xer = 0000000020040000  dsisr = 42000000
        [  350.613962] dar = 0000772f95390000
        [  350.614031] fault dar = c000000244b278c0 dsisr = 00000000
        [  350.614073] SLB (0 entries):
        [  350.614157] lpcr = 0040000003d40413 sdr1 = 0000000000000000 last_inst = ffffffff
        [  350.614252] trap=0x1 | pc=0xc00000000013eb98 | msr=0x8000000000009033
      
      followed by L1's QEMU reporting the following before stopping execution
      of the nested guest:
      
        KVM: unknown exit, hardware reason 1
        NIP c00000000013eb98   LR c0000000004b9044 CTR c0000000004b9000 XER 0000000020040000 CPU#0
        MSR 8000000000009033 HID0 0000000000000000  HF 8000000000000000 iidx 3 didx 3
        TB 00000000 00000000 DECR 00000000
        GPR00 c0000000004b9044 c00000007cffba30 c00000000178c100 c00000007ef52480
        GPR04 0000000000000000 70616d00746f6f72 0000000000000020 c0000000783be440
        GPR08 fffffffffffffffc c00000007ef52490 0000000000000004 b0616d0009eccdb2
        GPR12 c0000000004b9000 c000000001b00000 0000000000000000 0000000000000000
        GPR16 0000000000000000 0000000000000000 00007fffc24f3b50 00007fffc24fff58
        GPR20 00000a1e96ece9d0 00000a1ea117c9b0 00000a1ea1184360 0000000000000003
        GPR24 00000a1e96e9e124 00000000000007ff c00000007eb2f7a0 c00000007cffbb10
        GPR28 c00000007d83a2c0 c0000000783cdf68 0000000000000000 c00000007cffbbf0
        CR 88002848  [ L  L  -  -  E  L  G  L  ]             RES ffffffffffffffff
         SRR0 0000772f954dd48c  SRR1 800000000280f033    PVR 00000000004e1202 VRSAVE 0000000000000000
        SPRG0 0000000000000000 SPRG1 c000000001b00000  SPRG2 0000772f9565a280  SPRG3 0000000000000000
        SPRG4 0000000000000000 SPRG5 0000000000000000  SPRG6 0000000000000000  SPRG7 0000000000000000
        HSRR0 0000000000000000 HSRR1 0000000000000000
         CFAR 0000000000000000
         LPCR 0000000003d40413
         PTCR 0000000000000000   DAR 0000772f95390000  DSISR 0000000042000000
      
      Fix this by setting vcpu->arch.hcall_needed = 0 to indicate completion
      of H_ENTER_NESTED before we exit to L0 userspace.
      
      Fixes: 360cae31 ("KVM: PPC: Book3S HV: Nested guest entry via hypercall")
      Cc: linuxppc-dev@ozlabs.org
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
      Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  10. 07 Nov 2018 (1 commit)
    • KVM: PPC: Move and undef TRACE_INCLUDE_PATH/FILE · 28c5bcf7
      Scott Wood committed
      TRACE_INCLUDE_PATH and TRACE_INCLUDE_FILE are used by
      <trace/define_trace.h>, so like that #include, they should
      be outside #ifdef protection.
      
      They also need to be #undefed before defining, in case multiple trace
      headers are included by the same C file.  This became the case on
      book3e after commit cf4a6085 ("powerpc/mm: Add missing tracepoint for
      tlbie"), leading to the following build error:
      
         CC      arch/powerpc/kvm/powerpc.o
      In file included from arch/powerpc/kvm/powerpc.c:51:0:
      arch/powerpc/kvm/trace.h:9:0: error: "TRACE_INCLUDE_PATH" redefined
      [-Werror]
        #define TRACE_INCLUDE_PATH .
        ^
      In file included from arch/powerpc/kvm/../mm/mmu_decl.h:25:0,
                        from arch/powerpc/kvm/powerpc.c:48:
      ./arch/powerpc/include/asm/trace.h:224:0: note: this is the location of
      the previous definition
        #define TRACE_INCLUDE_PATH asm
        ^
      cc1: all warnings being treated as errors
      Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
      Signed-off-by: Scott Wood <oss@buserror.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  11. 26 Oct 2018 (1 commit)
  12. 20 Oct 2018 (1 commit)
    • KVM: PPC: Optimize clearing TCEs for sparse tables · 6e301a8e
      Alexey Kardashevskiy committed
      The powernv platform maintains 2 TCE tables for VFIO - a hardware TCE
      table and a table with userspace addresses. These tables are radix trees,
      we allocate indirect levels when they are written to. Since
      the memory allocation is problematic in real mode, we have 2 accessors
      to the entries:
      - for virtual mode: it allocates the memory and it is always expected
      to return non-NULL;
      - for real mode: it does not allocate and can return NULL.
      
      Also, DMA windows can span up to 55 bits of the address space and since
      we never have this much RAM, such windows are sparse. However currently
      the SPAPR TCE IOMMU driver walks through all TCEs to unpin DMA memory.
      
      Since we maintain a userspace address table for VFIO which is a mirror
      of the hardware table, we can use it to know which parts of the DMA
      window have not been mapped and skip those, which is what this patch does.
      
      The bare metal systems do not have this problem as they use a bypass mode
      of a PHB which maps RAM directly.
      
      This helps a lot with sparse DMA windows, reducing the shutdown time from
      about 3 minutes per 1 billion TCEs to a few seconds for a 32GB sparse
      guest. Just skipping the last level seems to be good enough.
      
      As the non-allocating accessor is now used in virtual mode as well, rename
      it from IOMMU_TABLE_USERSPACE_ENTRY_RM (real mode) to _RO (read only).
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  13. 19 Oct 2018 (1 commit)
    • KVM: PPC: Book3S HV: Don't use streamlined entry path on early POWER9 chips · 8d9fcacf
      Paul Mackerras committed
      This disables the use of the streamlined entry path for radix guests
      on early POWER9 chips that need the workaround added in commit
      a25bd72b ("powerpc/mm/radix: Workaround prefetch issue with KVM",
      2017-07-24), because the streamlined entry path does not include
      that workaround.  This also means that we can't do nested HV-KVM
      on those chips.
      
      Since the chips that need that workaround are the same ones that can't
      run both radix and HPT guests at the same time on different threads of
      a core, we use the existing 'no_mixing_hpt_and_radix' variable that
      identifies those chips to identify when we can't use the new guest
      entry path, and when we can't do nested virtualization.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  14. 18 Oct 2018 (1 commit)
    • powerpc: Add -Werror at arch/powerpc level · 23ad1a27
      Michael Ellerman committed
      Back when I added -Werror in commit ba55bd74 ("powerpc: Add
      configurable -Werror for arch/powerpc") I did it by adding it to most
      of the arch Makefiles.
      
      At the time we excluded math-emu, because apparently it didn't build
      cleanly. But that seems to have been fixed somewhere in the interim.
      
      So move the -Werror addition to the top-level of the arch, this saves
      us from repeating it in every Makefile and means we won't forget to
      add it to any new sub-dirs.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  15. 09 Oct 2018 (2 commits)