1. 21 10月, 2019 1 次提交
    • F
      KVM: PPC: Report single stepping capability · 1a9167a2
      Fabiano Rosas 提交于
      When calling the KVM_SET_GUEST_DEBUG ioctl, userspace might request
      the next instruction to be single stepped via the
      KVM_GUESTDBG_SINGLESTEP control bit of the kvm_guest_debug structure.
      
      This patch adds the KVM_CAP_PPC_GUEST_DEBUG_SSTEP capability in order
      to inform userspace about the state of single stepping support.
      
      We currently don't have support for guest single stepping implemented
      in Book3S HV so the capability is only present for Book3S PR and
      BookE.
      Signed-off-by: NFabiano Rosas <farosas@linux.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      1a9167a2
  2. 27 8月, 2019 1 次提交
    • P
      KVM: PPC: Book3S: Enable XIVE native capability only if OPAL has required functions · 2ad7a27d
      Paul Mackerras 提交于
      There are some POWER9 machines where the OPAL firmware does not support
      the OPAL_XIVE_GET_QUEUE_STATE and OPAL_XIVE_SET_QUEUE_STATE calls.
      The impact of this is that a guest using XIVE natively will not be able
      to be migrated successfully.  On the source side, the get_attr operation
      on the KVM native device for the KVM_DEV_XIVE_GRP_EQ_CONFIG attribute
      will fail; on the destination side, the set_attr operation for the same
      attribute will fail.
      
      This adds tests for the existence of the OPAL get/set queue state
      functions, and if they are not supported, the XIVE-native KVM device
      is not created and the KVM_CAP_PPC_IRQ_XIVE capability returns false.
      Userspace can then either provide a software emulation of XIVE, or
      else tell the guest that it does not have a XIVE controller available
      to it.
      
      Cc: stable@vger.kernel.org # v5.2+
      Fixes: 3fab2d10 ("KVM: PPC: Book3S HV: XIVE: Activate XIVE exploitation mode")
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      2ad7a27d
  3. 05 8月, 2019 2 次提交
    • P
      KVM: remove kvm_arch_has_vcpu_debugfs() · 741cbbae
      Paolo Bonzini 提交于
      There is no need for this function as all arches have to implement
      kvm_arch_create_vcpu_debugfs() no matter what.  A #define symbol
      let us actually simplify the code.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      741cbbae
    • W
      KVM: Fix leak vCPU's VMCS value into other pCPU · 17e433b5
      Wanpeng Li 提交于
      After commit d73eb57b (KVM: Boost vCPUs that are delivering interrupts), a
      five years old bug is exposed. Running ebizzy benchmark in three 80 vCPUs VMs
      on one 80 pCPUs Skylake server, a lot of rcu_sched stall warning splatting
      in the VMs after stress testing:
      
       INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
       Call Trace:
         flush_tlb_mm_range+0x68/0x140
         tlb_flush_mmu.part.75+0x37/0xe0
         tlb_finish_mmu+0x55/0x60
         zap_page_range+0x142/0x190
         SyS_madvise+0x3cd/0x9c0
         system_call_fastpath+0x1c/0x21
      
      swait_active() sustains to be true before finish_swait() is called in
      kvm_vcpu_block(), voluntarily preempted vCPUs are taken into account
      by kvm_vcpu_on_spin() loop greatly increases the probability condition
      kvm_arch_vcpu_runnable(vcpu) is checked and can be true, when APICv
      is enabled the yield-candidate vCPU's VMCS RVI field leaks(by
      vmx_sync_pir_to_irr()) into spinning-on-a-taken-lock vCPU's current
      VMCS.
      
      This patch fixes it by checking conservatively a subset of events.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Marc Zyngier <Marc.Zyngier@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      17e433b5
  4. 05 6月, 2019 2 次提交
  5. 28 5月, 2019 1 次提交
    • T
      KVM: s390: Do not report unusabled IDs via KVM_CAP_MAX_VCPU_ID · a86cb413
      Thomas Huth 提交于
      KVM_CAP_MAX_VCPU_ID is currently always reporting KVM_MAX_VCPU_ID on all
      architectures. However, on s390x, the amount of usable CPUs is determined
      during runtime - it is depending on the features of the machine the code
      is running on. Since we are using the vcpu_id as an index into the SCA
      structures that are defined by the hardware (see e.g. the sca_add_vcpu()
      function), it is not only the amount of CPUs that is limited by the hard-
      ware, but also the range of IDs that we can use.
      Thus KVM_CAP_MAX_VCPU_ID must be determined during runtime on s390x, too.
      So the handling of KVM_CAP_MAX_VCPU_ID has to be moved from the common
      code into the architecture specific code, and on s390x we have to return
      the same value here as for KVM_CAP_MAX_VCPUS.
      This problem has been discovered with the kvm_create_max_vcpus selftest.
      With this change applied, the selftest now passes on s390x, too.
      Reviewed-by: NAndrew Jones <drjones@redhat.com>
      Reviewed-by: NCornelia Huck <cohuck@redhat.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NThomas Huth <thuth@redhat.com>
      Message-Id: <20190523164309.13345-9-thuth@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      a86cb413
  6. 30 4月, 2019 2 次提交
  7. 16 4月, 2019 1 次提交
  8. 01 3月, 2019 1 次提交
  9. 19 2月, 2019 1 次提交
    • P
      KVM: PPC: Book3S: Allow XICS emulation to work in nested hosts using XIVE · 03f95332
      Paul Mackerras 提交于
      Currently, the KVM code assumes that if the host kernel is using the
      XIVE interrupt controller (the new interrupt controller that first
      appeared in POWER9 systems), then the in-kernel XICS emulation will
      use the XIVE hardware to deliver interrupts to the guest.  However,
      this only works when the host is running in hypervisor mode and has
      full access to all of the XIVE functionality.  It doesn't work in any
      nested virtualization scenario, either with PR KVM or nested-HV KVM,
      because the XICS-on-XIVE code calls directly into the native-XIVE
      routines, which are not initialized and cannot function correctly
      because they use OPAL calls, and OPAL is not available in a guest.
      
      This means that using the in-kernel XICS emulation in a nested
      hypervisor that is using XIVE as its interrupt controller will cause a
      (nested) host kernel crash.  To fix this, we change most of the places
      where the current code calls xive_enabled() to select between the
      XICS-on-XIVE emulation and the plain XICS emulation to call a new
      function, xics_on_xive(), which returns false in a guest.
      
      However, there is a further twist.  The plain XICS emulation has some
      functions which are used in real mode and access the underlying XICS
      controller (the interrupt controller of the host) directly.  In the
      case of a nested hypervisor, this means doing XICS hypercalls
      directly.  When the nested host is using XIVE as its interrupt
      controller, these hypercalls will fail.  Therefore this also adds
      checks in the places where the XICS emulation wants to access the
      underlying interrupt controller directly, and if that is XIVE, makes
      the code use the virtual mode fallback paths, which call generic
      kernel infrastructure rather than doing direct XICS access.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      03f95332
  10. 17 12月, 2018 4 次提交
    • S
      KVM: PPC: Book3S HV: Allow passthrough of an emulated device to an L2 guest · 873db2cd
      Suraj Jitindar Singh 提交于
      Allow for a device which is being emulated at L0 (the host) for an L1
      guest to be passed through to a nested (L2) guest.
      
      The existing kvmppc_hv_emulate_mmio function can be used here. The main
      challenge is that for a load the result must be stored into the L2 gpr,
      not an L1 gpr as would normally be the case after going out to qemu to
      complete the operation. This presents a challenge as at this point the
      L2 gpr state has been written back into L1 memory.
      
      To work around this we store the address in L1 memory of the L2 gpr
      where the result of the load is to be stored and use the new io_gpr
      value KVM_MMIO_REG_NESTED_GPR to indicate that this is a nested load for
      which completion must be done when returning back into the kernel. Then
      in kvmppc_complete_mmio_load() the resultant value is written into L1
      memory at the location of the indicated L2 gpr.
      
      Note that we don't currently let an L1 guest emulate a device for an L2
      guest which is then passed through to an L3 guest.
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      873db2cd
    • S
      KVM: PPC: Update kvmppc_st and kvmppc_ld to use quadrants · cc6929cc
      Suraj Jitindar Singh 提交于
      The functions kvmppc_st and kvmppc_ld are used to access guest memory
      from the host using a guest effective address. They do so by translating
      through the process table to obtain a guest real address and then using
      kvm_read_guest or kvm_write_guest to make the access with the guest real
      address.
      
      This method of access however only works for L1 guests and will give the
      incorrect results for a nested guest.
      
      We can however use the store_to_eaddr and load_from_eaddr kvmppc_ops to
      perform the access for a nested guesti (and a L1 guest). So attempt this
      method first and fall back to the old method if this fails and we aren't
      running a nested guest.
      
      At this stage there is no fall back method to perform the access for a
      nested guest and this is left as a future improvement. For now we will
      return to the nested guest and rely on the fact that a translation
      should be faulted in before retrying the access.
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      cc6929cc
    • S
      KVM: PPC: Book3S: Only report KVM_CAP_SPAPR_TCE_VFIO on powernv machines · 693ac10a
      Suraj Jitindar Singh 提交于
      The kvm capability KVM_CAP_SPAPR_TCE_VFIO is used to indicate the
      availability of in kernel tce acceleration for vfio. However it is
      currently the case that this is only available on a powernv machine,
      not for a pseries machine.
      
      Thus make this capability dependent on having the cpu feature
      CPU_FTR_HVMODE.
      
      [paulus@ozlabs.org - fixed compilation for Book E.]
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      693ac10a
    • B
      KVM: PPC: Pass change type down to memslot commit function · f032b734
      Bharata B Rao 提交于
      Currently, kvm_arch_commit_memory_region() gets called with a
      parameter indicating what type of change is being made to the memslot,
      but it doesn't pass it down to the platform-specific memslot commit
      functions.  This adds the `change' parameter to the lower-level
      functions so that they can use it in future.
      
      [paulus@ozlabs.org - fix book E also.]
      Signed-off-by: NBharata B Rao <bharata@linux.vnet.ibm.com>
      Reviewed-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      f032b734
  11. 14 12月, 2018 1 次提交
  12. 09 10月, 2018 2 次提交
  13. 30 7月, 2018 1 次提交
  14. 18 7月, 2018 1 次提交
    • S
      KVM: PPC: Remove mmio_vsx_tx_sx_enabled in KVM MMIO emulation · 4eeb8556
      Simon Guo 提交于
      Originally PPC KVM MMIO emulation uses only 0~31#(5 bits) for VSR
      reg number, and use mmio_vsx_tx_sx_enabled field together for
      0~63# VSR regs.
      
      Currently PPC KVM MMIO emulation is reimplemented with analyse_instr()
      assistance.  analyse_instr() returns 0~63 for VSR register number, so
      it is not necessary to use additional mmio_vsx_tx_sx_enabled field
      any more.
      
      This patch extends related reg bits (expand io_gpr to u16 from u8
      and use 6 bits for VSR reg#), so that mmio_vsx_tx_sx_enabled can
      be removed.
      Signed-off-by: NSimon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      4eeb8556
  15. 02 6月, 2018 1 次提交
  16. 01 6月, 2018 3 次提交
  17. 22 5月, 2018 4 次提交
  18. 18 5月, 2018 1 次提交
    • S
      KVM: PPC: Fix a mmio_host_swabbed uninitialized usage issue · f19d1f36
      Simon Guo 提交于
      When KVM emulates VMX store, it will invoke kvmppc_get_vmx_data() to
      retrieve VMX reg val. kvmppc_get_vmx_data() will check mmio_host_swabbed
      to decide which double word of vr[] to be used. But the
      mmio_host_swabbed can be uninitialized during VMX store procedure:
      
      kvmppc_emulate_loadstore
      	\- kvmppc_handle_store128_by2x64
      		\- kvmppc_get_vmx_data
      
      So vcpu->arch.mmio_host_swabbed is not meant to be used at all for
      emulation of store instructions, and this patch makes that true for
      VMX stores. This patch also initializes mmio_host_swabbed to avoid
      possible future problems.
      Signed-off-by: NSimon Guo <wei.guo.simon@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      f19d1f36
  19. 23 3月, 2018 1 次提交
    • P
      KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9 · 4bb3c7a0
      Paul Mackerras 提交于
      POWER9 has hardware bugs relating to transactional memory and thread
      reconfiguration (changes to hardware SMT mode).  Specifically, the core
      does not have enough storage to store a complete checkpoint of all the
      architected state for all four threads.  The DD2.2 version of POWER9
      includes hardware modifications designed to allow hypervisor software
      to implement workarounds for these problems.  This patch implements
      those workarounds in KVM code so that KVM guests see a full, working
      transactional memory implementation.
      
      The problems center around the use of TM suspended state, where the
      CPU has a checkpointed state but execution is not transactional.  The
      workaround is to implement a "fake suspend" state, which looks to the
      guest like suspended state but the CPU does not store a checkpoint.
      In this state, any instruction that would cause a transition to
      transactional state (rfid, rfebb, mtmsrd, tresume) or would use the
      checkpointed state (treclaim) causes a "soft patch" interrupt (vector
      0x1500) to the hypervisor so that it can be emulated.  The trechkpt
      instruction also causes a soft patch interrupt.
      
      On POWER9 DD2.2, we avoid returning to the guest in any state which
      would require a checkpoint to be present.  The trechkpt in the guest
      entry path which would normally create that checkpoint is replaced by
      either a transition to fake suspend state, if the guest is in suspend
      state, or a rollback to the pre-transactional state if the guest is in
      transactional state.  Fake suspend state is indicated by a flag in the
      PACA plus a new bit in the PSSCR.  The new PSSCR bit is write-only and
      reads back as 0.
      
      On exit from the guest, if the guest is in fake suspend state, we still
      do the treclaim instruction as we would in real suspend state, in order
      to get into non-transactional state, but we do not save the resulting
      register state since there was no checkpoint.
      
      Emulation of the instructions that cause a softpatch interrupt is
      handled in two paths.  If the guest is in real suspend mode, we call
      kvmhv_p9_tm_emulation_early() to handle the cases where the guest is
      transitioning to transactional state.  This is called before we do the
      treclaim in the guest exit path; because we haven't done treclaim, we
      can get back to the guest with the transaction still active.  If the
      instruction is a case that kvmhv_p9_tm_emulation_early() doesn't
      handle, or if the guest is in fake suspend state, then we proceed to
      do the complete guest exit path and subsequently call
      kvmhv_p9_tm_emulation() in host context with the MMU on.  This handles
      all the cases including the cases that generate program interrupts
      (illegal instruction or TM Bad Thing) and facility unavailable
      interrupts.
      
      The emulation is reasonably straightforward and is mostly concerned
      with checking for exception conditions and updating the state of
      registers such as MSR and CR0.  The treclaim emulation takes care to
      ensure that the TEXASR register gets updated as if it were the guest
      treclaim instruction that had done failure recording, not the treclaim
      done in hypervisor state in the guest exit path.
      
      With this, the KVM_CAP_PPC_HTM capability returns true (1) even if
      transactional memory is not available to host userspace.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      4bb3c7a0
  20. 13 2月, 2018 2 次提交
    • P
      KVM: PPC: Book3S: Fix compile error that occurs with some gcc versions · 6df3877f
      Paul Mackerras 提交于
      Some versions of gcc generate a warning that the variable "emulated"
      may be used uninitialized in function kvmppc_handle_load128_by2x64().
      It would be used uninitialized if kvmppc_handle_load128_by2x64 was
      ever called with vcpu->arch.mmio_vmx_copy_nums == 0, but neither of
      the callers ever do that, so there is no actual bug.  When gcc
      generates a warning, it causes the build to fail because arch/powerpc
      is compiled with -Werror.
      
      This silences the warning by initializing "emulated" to EMULATE_DONE.
      
      Fixes: 09f98496 ("KVM: PPC: Book3S: Add MMIO emulation for VMX instructions")
      Reported-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      6df3877f
    • P
      KVM: PPC: Fix compile error that occurs when CONFIG_ALTIVEC=n · c662f773
      Paul Mackerras 提交于
      Commit accb757d ("KVM: Move vcpu_load to arch-specific
      kvm_arch_vcpu_ioctl_run", 2017-12-04) added a "goto out"
      statement and an "out:" label to kvm_arch_vcpu_ioctl_run().
      Since the only "goto out" is inside a CONFIG_VSX block,
      compiling with CONFIG_VSX=n gives a warning that label "out"
      is defined but not used, and because arch/powerpc is compiled
      with -Werror, that becomes a compile error that makes the kernel
      build fail.
      
      Merge commit 1ab03c07 ("Merge tag 'kvm-ppc-next-4.16-2' of
      git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc",
      2018-02-09) added a similar block of code inside a #ifdef
      CONFIG_ALTIVEC, with a "goto out" statement.
      
      In order to make the build succeed, this adds a #ifdef around the
      "out:" label.  This is a minimal, ugly fix, to be replaced later
      by a refactoring of the code.  Since CONFIG_VSX depends on
      CONFIG_ALTIVEC, it is sufficient to use #ifdef CONFIG_ALTIVEC here.
      
      Fixes: accb757d ("KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_run")
      Reported-by: NChristian Zigotzky <chzigotzky@xenosoft.de>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      c662f773
  21. 09 2月, 2018 2 次提交
  22. 19 1月, 2018 1 次提交
    • P
      KVM: PPC: Book3S: Provide information about hardware/firmware CVE workarounds · 3214d01f
      Paul Mackerras 提交于
      This adds a new ioctl, KVM_PPC_GET_CPU_CHAR, that gives userspace
      information about the underlying machine's level of vulnerability
      to the recently announced vulnerabilities CVE-2017-5715,
      CVE-2017-5753 and CVE-2017-5754, and whether the machine provides
      instructions to assist software to work around the vulnerabilities.
      
      The ioctl returns two u64 words describing characteristics of the
      CPU and required software behaviour respectively, plus two mask
      words which indicate which bits have been filled in by the kernel,
      for extensibility.  The bit definitions are the same as for the
      new H_GET_CPU_CHARACTERISTICS hypercall.
      
      There is also a new capability, KVM_CAP_PPC_GET_CPU_CHAR, which
      indicates whether the new ioctl is available.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      3214d01f
  23. 16 1月, 2018 1 次提交
    • P
      KVM: PPC: Book3S HV: Enable migration of decrementer register · 5855564c
      Paul Mackerras 提交于
      This adds a register identifier for use with the one_reg interface
      to allow the decrementer expiry time to be read and written by
      userspace.  The decrementer expiry time is in guest timebase units
      and is equal to the sum of the decrementer and the guest timebase.
      (The expiry time is used rather than the decrementer value itself
      because the expiry time is not constantly changing, though the
      decrementer value is, while the guest vcpu is not running.)
      
      Without this, a guest vcpu migrated to a new host will see its
      decrementer set to some random value.  On POWER8 and earlier, the
      decrementer is 32 bits wide and counts down at 512MHz, so the
      guest vcpu will potentially see no decrementer interrupts for up
      to about 4 seconds, which will lead to a stall.  With POWER9, the
      decrementer is now 56 bits side, so the stall can be much longer
      (up to 2.23 years) and more noticeable.
      
      To help work around the problem in cases where userspace has not been
      updated to migrate the decrementer expiry time, we now set the
      default decrementer expiry at vcpu creation time to the current time
      rather than the maximum possible value.  This should mean an
      immediate decrementer interrupt when a migrated vcpu starts
      running.  In cases where the decrementer is 32 bits wide and more
      than 4 seconds elapse between the creation of the vcpu and when it
      first runs, the decrementer would have wrapped around to positive
      values and there may still be a stall - but this is no worse than
      the current situation.  In the large-decrementer case, we are sure
      to get an immediate decrementer interrupt (assuming the time from
      vcpu creation to first run is less than 2.23 years) and we thus
      avoid a very long stall.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      5855564c
  24. 14 12月, 2017 3 次提交