1. 28 11月, 2019 1 次提交
    • B
      KVM: PPC: Book3S HV: Support for running secure guests · ca9f4942
      Bharata B Rao 提交于
      A pseries guest can be run as secure guest on Ultravisor-enabled
      POWER platforms. On such platforms, this driver will be used to manage
      the movement of guest pages between the normal memory managed by
      hypervisor (HV) and secure memory managed by Ultravisor (UV).
      
      HV is informed about the guest's transition to secure mode via hcalls:
      
      H_SVM_INIT_START: Initiate securing a VM
      H_SVM_INIT_DONE: Conclude securing a VM
      
      As part of H_SVM_INIT_START, register all existing memslots with
      the UV. H_SVM_INIT_DONE call by UV informs HV that transition of
      the guest to secure mode is complete.
      
      These two states (transition to secure mode STARTED and transition
      to secure mode COMPLETED) are recorded in kvm->arch.secure_guest.
      Setting these states will cause the assembly code that enters the
      guest to call the UV_RETURN ucall instead of trying to enter the
      guest directly.
      
      Migration of pages betwen normal and secure memory of secure
      guest is implemented in H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.
      
      H_SVM_PAGE_IN: Move the content of a normal page to secure page
      H_SVM_PAGE_OUT: Move the content of a secure page to normal page
      
      Private ZONE_DEVICE memory equal to the amount of secure memory
      available in the platform for running secure guests is created.
      Whenever a page belonging to the guest becomes secure, a page from
      this private device memory is used to represent and track that secure
      page on the HV side. The movement of pages between normal and secure
      memory is done via migrate_vma_pages() using UV_PAGE_IN and
      UV_PAGE_OUT ucalls.
      
      In order to prevent the device private pages (that correspond to pages
      of secure guest) from participating in KSM merging, H_SVM_PAGE_IN
      calls ksm_madvise() under read version of mmap_sem. However
      ksm_madvise() needs to be under write lock.  Hence we call
      kvmppc_svm_page_in with mmap_sem held for writing, and it then
      downgrades to a read lock after calling ksm_madvise.
      
      [paulus@ozlabs.org - roll in patch "KVM: PPC: Book3S HV: Take write
       mmap_sem when calling ksm_madvise"]
      Signed-off-by: NBharata B Rao <bharata@linux.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      ca9f4942
  2. 22 10月, 2019 3 次提交
  3. 21 9月, 2019 1 次提交
    • J
      powerpc/64s: Set reserved PCR bits · 13c7bb3c
      Jordan Niethe 提交于
      Currently the reserved bits of the Processor Compatibility
      Register (PCR) are cleared as per the Programming Note in Section
      1.3.3 of version 3.0B of the Power ISA. This causes all new
      architecture features to be made available when running on newer
      processors with new architecture features added to the PCR as bits
      must be set to disable a given feature.
      
      For example to disable new features added as part of Version 2.07 of
      the ISA the corresponding bit in the PCR needs to be set.
      
      As new processor features generally require explicit kernel support
      they should be disabled until such support is implemented. Therefore
      kernels should set all unknown/reserved bits in the PCR such that any
      new architecture features which the kernel does not currently know
      about get disabled.
      
      An update is planned to the ISA to clarify that the PCR is an
      exception to the Programming Note on reserved bits in Section 1.3.3.
      Signed-off-by: NAlistair Popple <alistair@popple.id.au>
      Signed-off-by: NJordan Niethe <jniethe5@gmail.com>
      Tested-by: NJoel Stanley <joel@jms.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190917004605.22471-2-alistair@popple.id.au
      13c7bb3c
  4. 05 9月, 2019 1 次提交
    • N
      powerpc/64s/radix: introduce options to disable use of the tlbie instruction · 2275d7b5
      Nicholas Piggin 提交于
      Introduce two options to control the use of the tlbie instruction. A
      boot time option which completely disables the kernel using the
      instruction, this is currently incompatible with HASH MMU, KVM, and
      coherent accelerators.
      
      And a debugfs option can be switched at runtime and avoids using tlbie
      for invalidating CPU TLBs for normal process and kernel address
      mappings. Coherent accelerators are still managed with tlbie, as will
      KVM partition scope translations.
      
      Cross-CPU TLB flushing is implemented with IPIs and tlbiel. This is a
      basic implementation which does not attempt to make any optimisation
      beyond the tlbie implementation.
      
      This is useful for performance testing among other things. For example
      in certain situations on large systems, using IPIs may be faster than
      tlbie as they can be directed rather than broadcast. Later we may also
      take advantage of the IPIs to do more interesting things such as trim
      the mm cpumask more aggressively.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190902152931.17840-7-npiggin@gmail.com
      2275d7b5
  5. 27 8月, 2019 2 次提交
    • P
      KVM: PPC: Book3S HV: Don't lose pending doorbell request on migration on P9 · ff42df49
      Paul Mackerras 提交于
      On POWER9, when userspace reads the value of the DPDES register on a
      vCPU, it is possible for 0 to be returned although there is a doorbell
      interrupt pending for the vCPU.  This can lead to a doorbell interrupt
      being lost across migration.  If the guest kernel uses doorbell
      interrupts for IPIs, then it could malfunction because of the lost
      interrupt.
      
      This happens because a newly-generated doorbell interrupt is signalled
      by setting vcpu->arch.doorbell_request to 1; the DPDES value in
      vcpu->arch.vcore->dpdes is not updated, because it can only be updated
      when holding the vcpu mutex, in order to avoid races.
      
      To fix this, we OR in vcpu->arch.doorbell_request when reading the
      DPDES value.
      
      Cc: stable@vger.kernel.org # v4.13+
      Fixes: 57900694 ("KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9")
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Tested-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      ff42df49
    • P
      KVM: PPC: Book3S HV: Check for MMU ready on piggybacked virtual cores · d28eafc5
      Paul Mackerras 提交于
      When we are running multiple vcores on the same physical core, they
      could be from different VMs and so it is possible that one of the
      VMs could have its arch.mmu_ready flag cleared (for example by a
      concurrent HPT resize) when we go to run it on a physical core.
      We currently check the arch.mmu_ready flag for the primary vcore
      but not the flags for the other vcores that will be run alongside
      it.  This adds that check, and also a check when we select the
      secondary vcores from the preempted vcores list.
      
      Cc: stable@vger.kernel.org # v4.14+
      Fixes: 38c53af8 ("KVM: PPC: Book3S HV: Fix exclusion between HPT resizing and other HPT updates")
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      d28eafc5
  6. 15 7月, 2019 2 次提交
    • S
      KVM: PPC: Book3S HV: Save and restore guest visible PSSCR bits on pseries · c8b4083d
      Suraj Jitindar Singh 提交于
      The Performance Stop Status and Control Register (PSSCR) is used to
      control the power saving facilities of the processor. This register
      has various fields, some of which can be modified only in hypervisor
      state, and others which can be modified in both hypervisor and
      privileged non-hypervisor state. The bits which can be modified in
      privileged non-hypervisor state are referred to as guest visible.
      
      Currently the L0 hypervisor saves and restores both it's own host
      value as well as the guest value of the PSSCR when context switching
      between the hypervisor and guest. However a nested hypervisor running
      it's own nested guests (as indicated by kvmhv_on_pseries()) doesn't
      context switch the PSSCR register. That means if a nested (L2) guest
      modifies the PSSCR then the L1 guest hypervisor will run with that
      modified value, and if the L1 guest hypervisor modifies the PSSCR and
      then goes to run the nested (L2) guest again then the L2 PSSCR value
      will be lost.
      
      Fix this by having the (L1) nested hypervisor save and restore both
      its host and the guest PSSCR value when entering and exiting a
      nested (L2) guest. Note that only the guest visible parts of the PSSCR
      are context switched since this is all the L1 nested hypervisor can
      access, this is fine however as these are the only fields the L0
      hypervisor provides guest control of anyway and so all other fields
      are ignored.
      
      This could also have been implemented by adding the PSSCR register to
      the hv_regs passed to the L0 hypervisor as input to the H_ENTER_NESTED
      hcall, however this would have meant updating the structure layout and
      thus required modifications to both the L0 and L1 kernels. Whereas the
      approach used doesn't require L0 kernel modifications while achieving
      the same result.
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Cc: stable@vger.kernel.org # v4.20+
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190703012022.15644-3-sjitindarsingh@gmail.com
      c8b4083d
    • S
      KVM: PPC: Book3S HV: Always save guest pmu for guest capable of nesting · 63279eeb
      Suraj Jitindar Singh 提交于
      The performance monitoring unit (PMU) registers are saved on guest
      exit when the guest has set the pmcregs_in_use flag in its lppaca, if
      it exists, or unconditionally if it doesn't. If a nested guest is
      being run then the hypervisor doesn't, and in most cases can't, know
      if the PMU registers are in use since it doesn't know the location of
      the lppaca for the nested guest, although it may have one for its
      immediate guest. This results in the values of these registers being
      lost across nested guest entry and exit in the case where the nested
      guest was making use of the performance monitoring facility while it's
      nested guest hypervisor wasn't.
      
      Further more the hypervisor could interrupt a guest hypervisor between
      when it has loaded up the PMU registers and it calling H_ENTER_NESTED
      or between returning from the nested guest to the guest hypervisor and
      the guest hypervisor reading the PMU registers, in
      kvmhv_p9_guest_entry(). This means that it isn't sufficient to just
      save the PMU registers when entering or exiting a nested guest, but
      that it is necessary to always save the PMU registers whenever a guest
      is capable of running nested guests to ensure the register values
      aren't lost in the context switch.
      
      Ensure the PMU register values are preserved by always saving their
      value into the vcpu struct when a guest is capable of running nested
      guests.
      
      This should have minimal performance impact however any impact can be
      avoided by booting a guest with "-machine pseries,cap-nested-hv=false"
      on the qemu commandline.
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Cc: stable@vger.kernel.org # v4.20+
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190703012022.15644-1-sjitindarsingh@gmail.com
      63279eeb
  7. 20 6月, 2019 2 次提交
    • S
      KVM: PPC: Book3S HV: Clear pending decrementer exceptions on nested guest entry · 3c25ab35
      Suraj Jitindar Singh 提交于
      If we enter an L1 guest with a pending decrementer exception then this
      is cleared on guest exit if the guest has writtien a positive value
      into the decrementer (indicating that it handled the decrementer
      exception) since there is no other way to detect that the guest has
      handled the pending exception and that it should be dequeued. In the
      event that the L1 guest tries to run a nested (L2) guest immediately
      after this and the L2 guest decrementer is negative (which is loaded
      by L1 before making the H_ENTER_NESTED hcall), then the pending
      decrementer exception isn't cleared and the L2 entry is blocked since
      L1 has a pending exception, even though L1 may have already handled
      the exception and written a positive value for it's decrementer. This
      results in a loop of L1 trying to enter the L2 guest and L0 blocking
      the entry since L1 has an interrupt pending with the outcome being
      that L2 never gets to run and hangs.
      
      Fix this by clearing any pending decrementer exceptions when L1 makes
      the H_ENTER_NESTED hcall since it won't do this if it's decrementer
      has gone negative, and anyway it's decrementer has been communicated
      to L0 in the hdec_expires field and L0 will return control to L1 when
      this goes negative by delivering an H_DECREMENTER exception.
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Cc: stable@vger.kernel.org # v4.20+
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      3c25ab35
    • S
      KVM: PPC: Book3S HV: Signed extend decrementer value if not using large decrementer · 86953770
      Suraj Jitindar Singh 提交于
      On POWER9 the decrementer can operate in large decrementer mode where
      the decrementer is 56 bits and signed extended to 64 bits. When not
      operating in this mode the decrementer behaves as a 32 bit decrementer
      which is NOT signed extended (as on POWER8).
      
      Currently when reading a guest decrementer value we don't take into
      account whether the large decrementer is enabled or not, and this
      means the value will be incorrect when the guest is not using the
      large decrementer. Fix this by sign extending the value read when the
      guest isn't using the large decrementer.
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Cc: stable@vger.kernel.org # v4.20+
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      86953770
  8. 19 6月, 2019 1 次提交
  9. 30 5月, 2019 2 次提交
    • S
      KVM: PPC: Book3S HV: Restore SPRG3 in kvmhv_p9_guest_entry() · d724c9e5
      Suraj Jitindar Singh 提交于
      The sprgs are a set of 4 general purpose sprs provided for software use.
      SPRG3 is special in that it can also be read from userspace. Thus it is
      used on linux to store the cpu and numa id of the process to speed up
      syscall access to this information.
      
      This register is overwritten with the guest value on kvm guest entry,
      and so needs to be restored on exit again. Thus restore the value on
      the guest exit path in kvmhv_p9_guest_entry().
      
      Cc: stable@vger.kernel.org # v4.20+
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      d724c9e5
    • P
      KVM: PPC: Book3S HV: Fix lockdep warning when entering guest on POWER9 · 1b28d553
      Paul Mackerras 提交于
      Commit 3309bec8 ("KVM: PPC: Book3S HV: Fix lockdep warning when
      entering the guest") moved calls to trace_hardirqs_{on,off} in the
      entry path used for HPT guests.  Similar code exists in the new
      streamlined entry path used for radix guests on POWER9.  This makes
      the same change there, so as to avoid lockdep warnings such as this:
      
      [  228.686461] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)
      [  228.686480] WARNING: CPU: 116 PID: 3803 at ../kernel/locking/lockdep.c:4219 check_flags.part.23+0x21c/0x270
      [  228.686544] Modules linked in: vhost_net vhost xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat
      +xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter
      +ebtables ip6table_filter ip6_tables iptable_filter fuse kvm_hv kvm at24 ipmi_powernv regmap_i2c ipmi_devintf
      +uio_pdrv_genirq ofpart ipmi_msghandler uio powernv_flash mtd ibmpowernv opal_prd ip_tables ext4 mbcache jbd2 btrfs
      +zstd_decompress zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor
      +raid6_pq raid1 raid0 ses sd_mod enclosure scsi_transport_sas ast i2c_opal i2c_algo_bit drm_kms_helper syscopyarea
      +sysfillrect sysimgblt fb_sys_fops ttm drm i40e e1000e cxl aacraid tg3 drm_panel_orientation_quirks i2c_core
      [  228.686859] CPU: 116 PID: 3803 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc1-xive+ #42
      [  228.686911] NIP:  c0000000001b394c LR: c0000000001b3948 CTR: c000000000bfad20
      [  228.686963] REGS: c000200cdb50f570 TRAP: 0700   Not tainted  (5.2.0-rc1-xive+)
      [  228.687001] MSR:  9000000002823033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE>  CR: 48222222  XER: 20040000
      [  228.687060] CFAR: c000000000116db0 IRQMASK: 1
      [  228.687060] GPR00: c0000000001b3948 c000200cdb50f800 c0000000015e7600 000000000000002e
      [  228.687060] GPR04: 0000000000000001 c0000000001c71a0 000000006e655f73 72727563284e4f5f
      [  228.687060] GPR08: 0000200e60680000 0000000000000000 c000200cdb486180 0000000000000000
      [  228.687060] GPR12: 0000000000002000 c000200fff61a680 0000000000000000 00007fffb75c0000
      [  228.687060] GPR16: 0000000000000000 0000000000000000 c0000000017d6900 c000000001124900
      [  228.687060] GPR20: 0000000000000074 c008000006916f68 0000000000000074 0000000000000074
      [  228.687060] GPR24: ffffffffffffffff ffffffffffffffff 0000000000000003 c000200d4b600000
      [  228.687060] GPR28: c000000001627e58 c000000001489908 c000000001627e58 c000000002304de0
      [  228.687377] NIP [c0000000001b394c] check_flags.part.23+0x21c/0x270
      [  228.687415] LR [c0000000001b3948] check_flags.part.23+0x218/0x270
      [  228.687466] Call Trace:
      [  228.687488] [c000200cdb50f800] [c0000000001b3948] check_flags.part.23+0x218/0x270 (unreliable)
      [  228.687542] [c000200cdb50f870] [c0000000001b6548] lock_is_held_type+0x188/0x1c0
      [  228.687595] [c000200cdb50f8d0] [c0000000001d939c] rcu_read_lock_sched_held+0xdc/0x100
      [  228.687646] [c000200cdb50f900] [c0000000001dd704] rcu_note_context_switch+0x304/0x340
      [  228.687701] [c000200cdb50f940] [c0080000068fcc58] kvmhv_run_single_vcpu+0xdb0/0x1120 [kvm_hv]
      [  228.687756] [c000200cdb50fa20] [c0080000068fd5b0] kvmppc_vcpu_run_hv+0x5e8/0xe40 [kvm_hv]
      [  228.687816] [c000200cdb50faf0] [c0080000071797dc] kvmppc_vcpu_run+0x34/0x48 [kvm]
      [  228.687863] [c000200cdb50fb10] [c0080000071755dc] kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm]
      [  228.687916] [c000200cdb50fba0] [c008000007165ccc] kvm_vcpu_ioctl+0x424/0x838 [kvm]
      [  228.687957] [c000200cdb50fd10] [c000000000433a24] do_vfs_ioctl+0xd4/0xcd0
      [  228.687995] [c000200cdb50fdb0] [c000000000434724] ksys_ioctl+0x104/0x120
      [  228.688033] [c000200cdb50fe00] [c000000000434768] sys_ioctl+0x28/0x80
      [  228.688072] [c000200cdb50fe20] [c00000000000b888] system_call+0x5c/0x70
      [  228.688109] Instruction dump:
      [  228.688142] 4bf6342d 60000000 0fe00000 e8010080 7c0803a6 4bfffe60 3c82ff87 3c62ff87
      [  228.688196] 388472d0 3863d738 4bf63405 60000000 <0fe00000> 4bffff4c 3c82ff87 3c62ff87
      [  228.688251] irq event stamp: 205
      [  228.688287] hardirqs last  enabled at (205): [<c0080000068fc1b4>] kvmhv_run_single_vcpu+0x30c/0x1120 [kvm_hv]
      [  228.688344] hardirqs last disabled at (204): [<c0080000068fbff0>] kvmhv_run_single_vcpu+0x148/0x1120 [kvm_hv]
      [  228.688412] softirqs last  enabled at (180): [<c000000000c0b2ac>] __do_softirq+0x4ac/0x5d4
      [  228.688464] softirqs last disabled at (169): [<c000000000122aa8>] irq_exit+0x1f8/0x210
      [  228.688513] ---[ end trace eb16f6260022a812 ]---
      [  228.688548] possible reason: unannotated irqs-off.
      [  228.688571] irq event stamp: 205
      [  228.688607] hardirqs last  enabled at (205): [<c0080000068fc1b4>] kvmhv_run_single_vcpu+0x30c/0x1120 [kvm_hv]
      [  228.688664] hardirqs last disabled at (204): [<c0080000068fbff0>] kvmhv_run_single_vcpu+0x148/0x1120 [kvm_hv]
      [  228.688719] softirqs last  enabled at (180): [<c000000000c0b2ac>] __do_softirq+0x4ac/0x5d4
      [  228.688758] softirqs last disabled at (169): [<c000000000122aa8>] irq_exit+0x1f8/0x210
      
      Cc: stable@vger.kernel.org # v4.20+
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: NCédric Le Goater <clg@kaod.org>
      Tested-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      1b28d553
  10. 29 5月, 2019 2 次提交
    • P
      KVM: PPC: Book3S HV: Don't take kvm->lock around kvm_for_each_vcpu · 5a3f4936
      Paul Mackerras 提交于
      Currently the HV KVM code takes the kvm->lock around calls to
      kvm_for_each_vcpu() and kvm_get_vcpu_by_id() (which can call
      kvm_for_each_vcpu() internally).  However, that leads to a lock
      order inversion problem, because these are called in contexts where
      the vcpu mutex is held, but the vcpu mutexes nest within kvm->lock
      according to Documentation/virtual/kvm/locking.txt.  Hence there
      is a possibility of deadlock.
      
      To fix this, we simply don't take the kvm->lock mutex around these
      calls.  This is safe because the implementations of kvm_for_each_vcpu()
      and kvm_get_vcpu_by_id() have been designed to be able to be called
      locklessly.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      5a3f4936
    • P
      KVM: PPC: Book3S HV: Use new mutex to synchronize MMU setup · 0d4ee88d
      Paul Mackerras 提交于
      Currently the HV KVM code uses kvm->lock in conjunction with a flag,
      kvm->arch.mmu_ready, to synchronize MMU setup and hold off vcpu
      execution until the MMU-related data structures are ready.  However,
      this means that kvm->lock is being taken inside vcpu->mutex, which
      is contrary to Documentation/virtual/kvm/locking.txt and results in
      lockdep warnings.
      
      To fix this, we add a new mutex, kvm->arch.mmu_setup_lock, which nests
      inside the vcpu mutexes, and is taken in the places where kvm->lock
      was taken that are related to MMU setup.
      
      Additionally we take the new mutex in the vcpu creation code at the
      point where we are creating a new vcore, in order to provide mutual
      exclusion with kvmppc_update_lpcr() and ensure that an update to
      kvm->arch.lpcr doesn't get missed, which could otherwise lead to a
      stale vcore->lpcr value.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      0d4ee88d
  11. 30 4月, 2019 5 次提交
    • S
      KVM: PPC: Book3S HV: Save/restore vrsave register in kvmhv_p9_guest_entry() · 44b198ae
      Suraj Jitindar Singh 提交于
      On POWER9 and later processors where the host can schedule vcpus on a
      per thread basis, there is a streamlined entry path used when the guest
      is radix. This entry path saves/restores the fp and vr state in
      kvmhv_p9_guest_entry() by calling store_[fp/vr]_state() and
      load_[fp/vr]_state(). This is the same as the old entry path however the
      old entry path also saved/restored the VRSAVE register, which isn't done
      in the new entry path.
      
      This means that the vrsave register is now volatile across guest exit,
      which is an incorrect change in behaviour.
      
      Fix this by saving/restoring the vrsave register in kvmhv_p9_guest_entry().
      This restores the old, correct, behaviour.
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      44b198ae
    • P
      KVM: PPC: Book3S HV: Flush TLB on secondary radix threads · 70ea13f6
      Paul Mackerras 提交于
      When running on POWER9 with kvm_hv.indep_threads_mode = N and the host
      in SMT1 mode, KVM will run guest VCPUs on offline secondary threads.
      If those guests are in radix mode, we fail to load the LPID and flush
      the TLB if necessary, leading to the guest crashing with an
      unsupported MMU fault.  This arises from commit 9a4506e1 ("KVM:
      PPC: Book3S HV: Make radix handle process scoped LPID flush in C,
      with relocation on", 2018-05-17), which didn't consider the case
      where indep_threads_mode = N.
      
      For simplicity, this makes the real-mode guest entry path flush the
      TLB in the same place for both radix and hash guests, as we did before
      9a4506e1, though the code is now C code rather than assembly code.
      We also have the radix TLB flush open-coded rather than calling
      radix__local_flush_tlb_lpid_guest(), because the TLB flush can be
      called in real mode, and in real mode we don't want to invoke the
      tracepoint code.
      
      Fixes: 9a4506e1 ("KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on")
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      70ea13f6
    • P
      KVM: PPC: Book3S HV: smb->smp comment fixup · 6fabc9f2
      Palmer Dabbelt 提交于
      I made the same typo when trying to grep for uses of smp_wmb and figured
      I might as well fix it.
      Signed-off-by: NPalmer Dabbelt <palmer@sifive.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      6fabc9f2
    • A
      KVM: PPC: Book3S HV: Fix lockdep warning when entering the guest · 3309bec8
      Alexey Kardashevskiy 提交于
      The trace_hardirqs_on() sets current->hardirqs_enabled and from here
      the lockdep assumes interrupts are enabled although they are remain
      disabled until the context switches to the guest. Consequent
      srcu_read_lock() checks the flags in rcu_lock_acquire(), observes
      disabled interrupts and prints a warning (see below).
      
      This moves trace_hardirqs_on/off closer to __kvmppc_vcore_entry to
      prevent lockdep from being confused.
      
      DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)
      WARNING: CPU: 16 PID: 8038 at kernel/locking/lockdep.c:4128 check_flags.part.25+0x224/0x280
      [...]
      NIP [c000000000185b84] check_flags.part.25+0x224/0x280
      LR [c000000000185b80] check_flags.part.25+0x220/0x280
      Call Trace:
      [c000003fec253710] [c000000000185b80] check_flags.part.25+0x220/0x280 (unreliable)
      [c000003fec253780] [c000000000187ea4] lock_acquire+0x94/0x260
      [c000003fec253840] [c00800001a1e9768] kvmppc_run_core+0xa60/0x1ab0 [kvm_hv]
      [c000003fec253a10] [c00800001a1ed944] kvmppc_vcpu_run_hv+0x73c/0xec0 [kvm_hv]
      [c000003fec253ae0] [c00800001a1095dc] kvmppc_vcpu_run+0x34/0x48 [kvm]
      [c000003fec253b00] [c00800001a1056bc] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
      [c000003fec253b90] [c00800001a0f3618] kvm_vcpu_ioctl+0x460/0x850 [kvm]
      [c000003fec253d00] [c00000000041c4f4] do_vfs_ioctl+0xe4/0x930
      [c000003fec253db0] [c00000000041ce04] ksys_ioctl+0xc4/0x110
      [c000003fec253e00] [c00000000041ce78] sys_ioctl+0x28/0x80
      [c000003fec253e20] [c00000000000b5a4] system_call+0x5c/0x70
      Instruction dump:
      419e0034 3d220004 39291730 81290000 2f890000 409e0020 3c82ffc6 3c62ffc5
      3884be70 386329c0 4bf6ea71 60000000 <0fe00000> 3c62ffc6 3863be90 4801273d
      irq event stamp: 1025
      hardirqs last  enabled at (1025): [<c00800001a1e9728>] kvmppc_run_core+0xa20/0x1ab0 [kvm_hv]
      hardirqs last disabled at (1024): [<c00800001a1e9358>] kvmppc_run_core+0x650/0x1ab0 [kvm_hv]
      softirqs last  enabled at (0): [<c0000000000f1210>] copy_process.isra.4.part.5+0x5f0/0x1d00
      softirqs last disabled at (0): [<0000000000000000>]           (null)
      ---[ end trace 31180adcc848993e ]---
      possible reason: unannotated irqs-off.
      irq event stamp: 1025
      hardirqs last  enabled at (1025): [<c00800001a1e9728>] kvmppc_run_core+0xa20/0x1ab0 [kvm_hv]
      hardirqs last disabled at (1024): [<c00800001a1e9358>] kvmppc_run_core+0x650/0x1ab0 [kvm_hv]
      softirqs last  enabled at (0): [<c0000000000f1210>] copy_process.isra.4.part.5+0x5f0/0x1d00
      softirqs last disabled at (0): [<0000000000000000>]           (null)
      
      Fixes: 8b24e69f ("KVM: PPC: Book3S HV: Close race with testing for signals on guest entry", 2017-06-26)
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      3309bec8
    • S
      KVM: PPC: Book3S HV: Implement virtual mode H_PAGE_INIT handler · 2d34d1c3
      Suraj Jitindar Singh 提交于
      Implement a virtual mode handler for the H_CALL H_PAGE_INIT which can be
      used to zero or copy a guest page. The page is defined to be 4k and must
      be 4k aligned.
      
      The in-kernel handler halves the time to handle this H_CALL compared to
      handling it in userspace for a radix guest.
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      2d34d1c3
  12. 20 4月, 2019 1 次提交
    • M
      powerpc: Add force enable of DAWR on P9 option · c1fe190c
      Michael Neuling 提交于
      This adds a flag so that the DAWR can be enabled on P9 via:
        echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous
      
      The DAWR was previously force disabled on POWER9 in:
        96541531 powerpc: Disable DAWR in the base POWER9 CPU features
      Also see Documentation/powerpc/DAWR-POWER9.txt
      
      This is a dangerous setting, USE AT YOUR OWN RISK.
      
      Some users may not care about a bad user crashing their box
      (ie. single user/desktop systems) and really want the DAWR.  This
      allows them to force enable DAWR.
      
      This flag can also be used to disable DAWR access. Once this is
      cleared, all DAWR access should be cleared immediately and your
      machine once again safe from crashing.
      
      Userspace may get confused by toggling this. If DAWR is force
      enabled/disabled between getting the number of breakpoints (via
      PTRACE_GETHWDBGINFO) and setting the breakpoint, userspace will get an
      inconsistent view of what's available. Similarly for guests.
      
      For the DAWR to be enabled in a KVM guest, the DAWR needs to be force
      enabled in the host AND the guest. For this reason, this won't work on
      POWERVM as it doesn't allow the HCALL to work. Writes of 'Y' to the
      dawr_enable_dangerous file will fail if the hypervisor doesn't support
      writing the DAWR.
      
      To double check the DAWR is working, run this kernel selftest:
        tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
      Any errors/failures/skips mean something is wrong.
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c1fe190c
  13. 05 4月, 2019 1 次提交
    • S
      KVM: PPC: Book3S HV: Perserve PSSCR FAKE_SUSPEND bit on guest exit · 7cb9eb10
      Suraj Jitindar Singh 提交于
      There is a hardware bug in some POWER9 processors where a treclaim in
      fake suspend mode can cause an inconsistency in the XER[SO] bit across
      the threads of a core, the workaround being to force the core into SMT4
      when doing the treclaim.
      
      The FAKE_SUSPEND bit (bit 10) in the PSSCR is used to control whether a
      thread is in fake suspend or real suspend. The important difference here
      being that thread reconfiguration is blocked in real suspend but not
      fake suspend mode.
      
      When we exit a guest which was in fake suspend mode, we force the core
      into SMT4 while we do the treclaim in kvmppc_save_tm_hv().
      However on the new exit path introduced with the function
      kvmhv_run_single_vcpu() we restore the host PSSCR before calling
      kvmppc_save_tm_hv() which means that if we were in fake suspend mode we
      put the thread into real suspend mode when we clear the
      PSSCR[FAKE_SUSPEND] bit. This means that we block thread reconfiguration
      and the thread which is trying to get the core into SMT4 before it can
      do the treclaim spins forever since it itself is blocking thread
      reconfiguration. The result is that that core is essentially lost.
      
      This results in a trace such as:
      [   93.512904] CPU: 7 PID: 13352 Comm: qemu-system-ppc Not tainted 5.0.0 #4
      [   93.512905] NIP:  c000000000098a04 LR: c0000000000cc59c CTR: 0000000000000000
      [   93.512908] REGS: c000003fffd2bd70 TRAP: 0100   Not tainted  (5.0.0)
      [   93.512908] MSR:  9000000302883033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE,TM[SE]>  CR: 22222444  XER: 00000000
      [   93.512914] CFAR: c000000000098a5c IRQMASK: 3
      [   93.512915] PACATMSCRATCH: 0000000000000001
      [   93.512916] GPR00: 0000000000000001 c000003f6cc1b830 c000000001033100 0000000000000004
      [   93.512928] GPR04: 0000000000000004 0000000000000002 0000000000000004 0000000000000007
      [   93.512930] GPR08: 0000000000000000 0000000000000004 0000000000000000 0000000000000004
      [   93.512932] GPR12: c000203fff7fc000 c000003fffff9500 0000000000000000 0000000000000000
      [   93.512935] GPR16: 2000000000300375 000000000000059f 0000000000000000 0000000000000000
      [   93.512951] GPR20: 0000000000000000 0000000000080053 004000000256f41f c000003f6aa88ef0
      [   93.512953] GPR24: c000003f6aa89100 0000000000000010 0000000000000000 0000000000000000
      [   93.512956] GPR28: c000003f9e9a0800 0000000000000000 0000000000000001 c000203fff7fc000
      [   93.512959] NIP [c000000000098a04] pnv_power9_force_smt4_catch+0x1b4/0x2c0
      [   93.512960] LR [c0000000000cc59c] kvmppc_save_tm_hv+0x40/0x88
      [   93.512960] Call Trace:
      [   93.512961] [c000003f6cc1b830] [0000000000080053] 0x80053 (unreliable)
      [   93.512965] [c000003f6cc1b8a0] [c00800001e9cb030] kvmhv_p9_guest_entry+0x508/0x6b0 [kvm_hv]
      [   93.512967] [c000003f6cc1b940] [c00800001e9cba44] kvmhv_run_single_vcpu+0x2dc/0xb90 [kvm_hv]
      [   93.512968] [c000003f6cc1ba10] [c00800001e9cc948] kvmppc_vcpu_run_hv+0x650/0xb90 [kvm_hv]
      [   93.512969] [c000003f6cc1bae0] [c00800001e8f620c] kvmppc_vcpu_run+0x34/0x48 [kvm]
      [   93.512971] [c000003f6cc1bb00] [c00800001e8f2d4c] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
      [   93.512972] [c000003f6cc1bb90] [c00800001e8e3918] kvm_vcpu_ioctl+0x460/0x7d0 [kvm]
      [   93.512974] [c000003f6cc1bd00] [c0000000003ae2c0] do_vfs_ioctl+0xe0/0x8e0
      [   93.512975] [c000003f6cc1bdb0] [c0000000003aeb24] ksys_ioctl+0x64/0xe0
      [   93.512978] [c000003f6cc1be00] [c0000000003aebc8] sys_ioctl+0x28/0x80
      [   93.512981] [c000003f6cc1be20] [c00000000000b3a4] system_call+0x5c/0x70
      [   93.512983] Instruction dump:
      [   93.512986] 419dffbc e98c0000 2e8b0000 38000001 60000000 60000000 60000000 40950068
      [   93.512993] 392bffff 39400000 79290020 39290001 <7d2903a6> 60000000 60000000 7d235214
      
      To fix this we preserve the PSSCR[FAKE_SUSPEND] bit until we call
      kvmppc_save_tm_hv() which will mean the core can get into SMT4 and
      perform the treclaim. Note kvmppc_save_tm_hv() clears the
      PSSCR[FAKE_SUSPEND] bit again so there is no need to explicitly do that.
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      7cb9eb10
  14. 22 2月, 2019 1 次提交
    • J
      KVM: PPC: Book3S HV: Fix build failure without IOMMU support · e40542af
      Jordan Niethe 提交于
      Currently trying to build without IOMMU support will fail:
      
        (.text+0x1380): undefined reference to `kvmppc_h_get_tce'
        (.text+0x1384): undefined reference to `kvmppc_rm_h_put_tce'
        (.text+0x149c): undefined reference to `kvmppc_rm_h_stuff_tce'
        (.text+0x14a0): undefined reference to `kvmppc_rm_h_put_tce_indirect'
      
      This happens because turning off IOMMU support will prevent
      book3s_64_vio_hv.c from being built because it is only built when
      SPAPR_TCE_IOMMU is set, which depends on IOMMU support.
      
      Fix it using ifdefs for the undefined references.
      
      Fixes: 76d837a4 ("KVM: PPC: Book3S PR: Don't include SPAPR TCE code on non-pseries platforms")
      Signed-off-by: NJordan Niethe <jniethe5@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e40542af
  15. 21 2月, 2019 6 次提交
    • P
      powerpc/64s: Better printing of machine check info for guest MCEs · c0577201
      Paul Mackerras 提交于
      This adds an "in_guest" parameter to machine_check_print_event_info()
      so that we can avoid trying to translate guest NIP values into
      symbolic form using the host kernel's symbol table.
      Reviewed-by: NAravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c0577201
    • P
      KVM: PPC: Book3S HV: Simplify machine check handling · 884dfb72
      Paul Mackerras 提交于
      This makes the handling of machine check interrupts that occur inside
      a guest simpler and more robust, with less done in assembler code and
      in real mode.
      
      Now, when a machine check occurs inside a guest, we always get the
      machine check event struct and put a copy in the vcpu struct for the
      vcpu where the machine check occurred.  We no longer call
      machine_check_queue_event() from kvmppc_realmode_mc_power7(), because
      on POWER8, when a vcpu is running on an offline secondary thread and
      we call machine_check_queue_event(), that calls irq_work_queue(),
      which doesn't work because the CPU is offline, but instead triggers
      the WARN_ON(lazy_irq_pending()) in pnv_smp_cpu_kill_self() (which
      fires again and again because nothing clears the condition).
      
      All that machine_check_queue_event() actually does is to cause the
      event to be printed to the console.  For a machine check occurring in
      the guest, we now print the event in kvmppc_handle_exit_hv()
      instead.
      
      The assembly code at label machine_check_realmode now just calls C
      code and then continues exiting the guest.  We no longer either
      synthesize a machine check for the guest in assembly code or return
      to the guest without a machine check.
      
      The code in kvmppc_handle_exit_hv() is extended to handle the case
      where the guest is not FWNMI-capable.  In that case we now always
      synthesize a machine check interrupt for the guest.  Previously, if
      the host thinks it has recovered the machine check fully, it would
      return to the guest without any notification that the machine check
      had occurred.  If the machine check was caused by some action of the
      guest (such as creating duplicate SLB entries), it is much better to
      tell the guest that it has caused a problem.  Therefore we now always
      generate a machine check interrupt for guests that are not
      FWNMI-capable.
      Reviewed-by: NAravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      884dfb72
    • M
      KVM: PPC: Book3S HV: Context switch AMR on Power9 · d976f680
      Michael Ellerman 提交于
      kvmhv_p9_guest_entry() implements a fast-path guest entry for Power9
      when guest and host are both running with the Radix MMU.
      
      Currently in that path we don't save the host AMR (Authority Mask
      Register) value, and we always restore 0 on return to the host. That
      is OK at the moment because the AMR is not used for storage keys with
      the Radix MMU.
      
      However we plan to start using the AMR on Radix to prevent the kernel
      from reading/writing to userspace outside of copy_to/from_user(). In
      order to make that work we need to save/restore the AMR value.
      
      We only restore the value if it is different from the guest value,
      which is already in the register when we exit to the host. This should
      mean we rarely need to actually restore the value when running a
      modern Linux as a guest, because it will be using the same value as
      us.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Tested-by: NRussell Currey <ruscur@russell.cc>
      d976f680
    • N
      KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start · dee339b5
      Nir Weiner 提交于
      grow_halt_poll_ns() have a strange behaviour in case
      (vcpu->halt_poll_ns != 0) &&
      (vcpu->halt_poll_ns < halt_poll_ns_grow_start).
      
      In this case, vcpu->halt_poll_ns will be multiplied by grow factor
      (halt_poll_ns_grow) which will require several grow iteration in order
      to reach a value bigger than halt_poll_ns_grow_start.
      This means that growing vcpu->halt_poll_ns from value of 0 is slower
      than growing it from a positive value less than halt_poll_ns_grow_start.
      Which is misleading and inaccurate.
      
      Fix issue by changing grow_halt_poll_ns() to set vcpu->halt_poll_ns
      to halt_poll_ns_grow_start in any case that
      (vcpu->halt_poll_ns < halt_poll_ns_grow_start).
      Regardless if vcpu->halt_poll_ns is 0.
      
      use READ_ONCE to get a consistent number for all cases.
      Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NNir Weiner <nir.weiner@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      dee339b5
    • N
      KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter · 49113d36
      Nir Weiner 提交于
      The hard-coded value 10000 in grow_halt_poll_ns() stands for the initial
      start value when raising up vcpu->halt_poll_ns.
      It actually sets the first timeout to the first polling session.
      This value has significant effect on how tolerant we are to outliers.
      On the standard case, higher value is better - we will spend more time
      in the polling busyloop, handle events/interrupts faster and result
      in better performance.
      But on outliers it puts us in a busy loop that does nothing.
      Even if the shrink factor is zero, we will still waste time on the first
      iteration.
      The optimal value changes between different workloads. It depends on
      outliers rate and polling sessions length.
      As this value has significant effect on the dynamic halt-polling
      algorithm, it should be configurable and exposed.
      Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NNir Weiner <nir.weiner@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      49113d36
    • N
      KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns · 7fa08e71
      Nir Weiner 提交于
      grow_halt_poll_ns() have a strange behavior in case
      (halt_poll_ns_grow == 0) && (vcpu->halt_poll_ns != 0).
      
      In this case, vcpu->halt_pol_ns will be set to zero.
      That results in shrinking instead of growing.
      
      Fix issue by changing grow_halt_poll_ns() to not modify
      vcpu->halt_poll_ns in case halt_poll_ns_grow is zero
      Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NNir Weiner <nir.weiner@oracle.com>
      Suggested-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7fa08e71
  16. 19 2月, 2019 2 次提交
    • P
      KVM: PPC: Book3S: Allow XICS emulation to work in nested hosts using XIVE · 03f95332
      Paul Mackerras 提交于
      Currently, the KVM code assumes that if the host kernel is using the
      XIVE interrupt controller (the new interrupt controller that first
      appeared in POWER9 systems), then the in-kernel XICS emulation will
      use the XIVE hardware to deliver interrupts to the guest.  However,
      this only works when the host is running in hypervisor mode and has
      full access to all of the XIVE functionality.  It doesn't work in any
      nested virtualization scenario, either with PR KVM or nested-HV KVM,
      because the XICS-on-XIVE code calls directly into the native-XIVE
      routines, which are not initialized and cannot function correctly
      because they use OPAL calls, and OPAL is not available in a guest.
      
      This means that using the in-kernel XICS emulation in a nested
      hypervisor that is using XIVE as its interrupt controller will cause a
      (nested) host kernel crash.  To fix this, we change most of the places
      where the current code calls xive_enabled() to select between the
      XICS-on-XIVE emulation and the plain XICS emulation to call a new
      function, xics_on_xive(), which returns false in a guest.
      
      However, there is a further twist.  The plain XICS emulation has some
      functions which are used in real mode and access the underlying XICS
      controller (the interrupt controller of the host) directly.  In the
      case of a nested hypervisor, this means doing XICS hypercalls
      directly.  When the nested host is using XIVE as its interrupt
      controller, these hypercalls will fail.  Therefore this also adds
      checks in the places where the XICS emulation wants to access the
      underlying interrupt controller directly, and if that is XIVE, makes
      the code use the virtual mode fallback paths, which call generic
      kernel infrastructure rather than doing direct XICS access.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      03f95332
    • W
      KVM: PPC: Book3S HV: Replace kmalloc_node+memset with kzalloc_node · 08434ab4
      wangbo 提交于
      Replace kmalloc_node and memset with kzalloc_node
      Signed-off-by: Nwangbo <wang.bo116@zte.com.cn>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      08434ab4
  17. 17 12月, 2018 5 次提交
  18. 14 12月, 2018 1 次提交
    • P
      KVM: PPC: Book3S HV: Fix race between kvm_unmap_hva_range and MMU mode switch · 234ff0b7
      Paul Mackerras 提交于
      Testing has revealed an occasional crash which appears to be caused
      by a race between kvmppc_switch_mmu_to_hpt and kvm_unmap_hva_range_hv.
      The symptom is a NULL pointer dereference in __find_linux_pte() called
      from kvm_unmap_radix() with kvm->arch.pgtable == NULL.
      
      Looking at kvmppc_switch_mmu_to_hpt(), it does indeed clear
      kvm->arch.pgtable (via kvmppc_free_radix()) before setting
      kvm->arch.radix to NULL, and there is nothing to prevent
      kvm_unmap_hva_range_hv() or the other MMU callback functions from
      being called concurrently with kvmppc_switch_mmu_to_hpt() or
      kvmppc_switch_mmu_to_radix().
      
      This patch therefore adds calls to spin_lock/unlock on the kvm->mmu_lock
      around the assignments to kvm->arch.radix, and makes sure that the
      partition-scoped radix tree or HPT is only freed after changing
      kvm->arch.radix.
      
      This also takes the kvm->mmu_lock in kvmppc_rmap_reset() to make sure
      that the clearing of each rmap array (one per memslot) doesn't happen
      concurrently with use of the array in the kvm_unmap_hva_range_hv()
      or the other MMU callbacks.
      
      Fixes: 18c3640c ("KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix host")
      Cc: stable@vger.kernel.org # v4.15+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      234ff0b7
  19. 15 11月, 2018 1 次提交
    • M
      KVM: PPC: Book3S HV: Fix handling for interrupted H_ENTER_NESTED · 6c08ec12
      Michael Roth 提交于
      While running a nested guest VCPU on L0 via H_ENTER_NESTED hcall, a
      pending signal in the L0 QEMU process can generate the following
      sequence:
      
        ret0 = kvmppc_pseries_do_hcall()
          ret1 = kvmhv_enter_nested_guest()
            ret2 = kvmhv_run_single_vcpu()
            if (ret2 == -EINTR)
              return H_INTERRUPT
          if (ret1 == H_INTERRUPT)
            kvmppc_set_gpr(vcpu, 3, 0)
            return -EINTR
          /* skipped: */
          kvmppc_set_gpr(vcpu, 3, ret)
          vcpu->arch.hcall_needed = 0
          return RESUME_GUEST
      
      which causes an exit to L0 userspace with ret0 == -EINTR.
      
      The intention seems to be to set the hcall return value to 0 (via
      VCPU r3) so that L1 will see a successful return from H_ENTER_NESTED
      once we resume executing the VCPU. However, because we don't set
      vcpu->arch.hcall_needed = 0, we do the following once userspace
      resumes execution via kvm_arch_vcpu_ioctl_run():
      
        ...
        } else if (vcpu->arch.hcall_needed) {
          int i
      
          kvmppc_set_gpr(vcpu, 3, run->papr_hcall.ret);
          for (i = 0; i < 9; ++i)
                 kvmppc_set_gpr(vcpu, 4 + i, run->papr_hcall.args[i]);
          vcpu->arch.hcall_needed = 0;
      
      since vcpu->arch.hcall_needed == 1 indicates that userspace should
      have handled the hcall and stored the return value in
      run->papr_hcall.ret. Since that's not the case here, we can get an
      unexpected value in VCPU r3, which can result in
      kvmhv_p9_guest_entry() reporting an unexpected trap value when it
      returns from H_ENTER_NESTED, causing the following register dump to
      console via subsequent call to kvmppc_handle_exit_hv() in L1:
      
        [  350.612854] vcpu 00000000f9564cf8 (0):
        [  350.612915] pc  = c00000000013eb98  msr = 8000000000009033  trap = 1
        [  350.613020] r 0 = c0000000004b9044  r16 = 0000000000000000
        [  350.613075] r 1 = c00000007cffba30  r17 = 0000000000000000
        [  350.613120] r 2 = c00000000178c100  r18 = 00007fffc24f3b50
        [  350.613166] r 3 = c00000007ef52480  r19 = 00007fffc24fff58
        [  350.613212] r 4 = 0000000000000000  r20 = 00000a1e96ece9d0
        [  350.613253] r 5 = 70616d00746f6f72  r21 = 00000a1ea117c9b0
        [  350.613295] r 6 = 0000000000000020  r22 = 00000a1ea1184360
        [  350.613338] r 7 = c0000000783be440  r23 = 0000000000000003
        [  350.613380] r 8 = fffffffffffffffc  r24 = 00000a1e96e9e124
        [  350.613423] r 9 = c00000007ef52490  r25 = 00000000000007ff
        [  350.613469] r10 = 0000000000000004  r26 = c00000007eb2f7a0
        [  350.613513] r11 = b0616d0009eccdb2  r27 = c00000007cffbb10
        [  350.613556] r12 = c0000000004b9000  r28 = c00000007d83a2c0
        [  350.613597] r13 = c000000001b00000  r29 = c0000000783cdf68
        [  350.613639] r14 = 0000000000000000  r30 = 0000000000000000
        [  350.613681] r15 = 0000000000000000  r31 = c00000007cffbbf0
        [  350.613723] ctr = c0000000004b9000  lr  = c0000000004b9044
        [  350.613765] srr0 = 0000772f954dd48c srr1 = 800000000280f033
        [  350.613808] sprg0 = 0000000000000000 sprg1 = c000000001b00000
        [  350.613859] sprg2 = 0000772f9565a280 sprg3 = 0000000000000000
        [  350.613911] cr = 88002848  xer = 0000000020040000  dsisr = 42000000
        [  350.613962] dar = 0000772f95390000
        [  350.614031] fault dar = c000000244b278c0 dsisr = 00000000
        [  350.614073] SLB (0 entries):
        [  350.614157] lpcr = 0040000003d40413 sdr1 = 0000000000000000 last_inst = ffffffff
        [  350.614252] trap=0x1 | pc=0xc00000000013eb98 | msr=0x8000000000009033
      
      followed by L1's QEMU reporting the following before stopping execution
      of the nested guest:
      
        KVM: unknown exit, hardware reason 1
        NIP c00000000013eb98   LR c0000000004b9044 CTR c0000000004b9000 XER 0000000020040000 CPU#0
        MSR 8000000000009033 HID0 0000000000000000  HF 8000000000000000 iidx 3 didx 3
        TB 00000000 00000000 DECR 00000000
        GPR00 c0000000004b9044 c00000007cffba30 c00000000178c100 c00000007ef52480
        GPR04 0000000000000000 70616d00746f6f72 0000000000000020 c0000000783be440
        GPR08 fffffffffffffffc c00000007ef52490 0000000000000004 b0616d0009eccdb2
        GPR12 c0000000004b9000 c000000001b00000 0000000000000000 0000000000000000
        GPR16 0000000000000000 0000000000000000 00007fffc24f3b50 00007fffc24fff58
        GPR20 00000a1e96ece9d0 00000a1ea117c9b0 00000a1ea1184360 0000000000000003
        GPR24 00000a1e96e9e124 00000000000007ff c00000007eb2f7a0 c00000007cffbb10
        GPR28 c00000007d83a2c0 c0000000783cdf68 0000000000000000 c00000007cffbbf0
        CR 88002848  [ L  L  -  -  E  L  G  L  ]             RES ffffffffffffffff
         SRR0 0000772f954dd48c  SRR1 800000000280f033    PVR 00000000004e1202 VRSAVE 0000000000000000
        SPRG0 0000000000000000 SPRG1 c000000001b00000  SPRG2 0000772f9565a280  SPRG3 0000000000000000
        SPRG4 0000000000000000 SPRG5 0000000000000000  SPRG6 0000000000000000  SPRG7 0000000000000000
        HSRR0 0000000000000000 HSRR1 0000000000000000
         CFAR 0000000000000000
         LPCR 0000000003d40413
         PTCR 0000000000000000   DAR 0000772f95390000  DSISR 0000000042000000
      
      Fix this by setting vcpu->arch.hcall_needed = 0 to indicate completion
      of H_ENTER_NESTED before we exit to L0 userspace.
      
      Fixes: 360cae31 ("KVM: PPC: Book3S HV: Nested guest entry via hypercall")
      Cc: linuxppc-dev@ozlabs.org
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMichael Roth <mdroth@linux.vnet.ibm.com>
      Reviewed-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      6c08ec12