1. 30 4月, 2019 23 次提交
    • C
      KVM: PPC: Book3S HV: XIVE: Add passthrough support · 232b984b
      Cédric Le Goater 提交于
      The KVM XICS-over-XIVE device and the proposed KVM XIVE native device
      implement an IRQ space for the guest using the generic IPI interrupts
      of the XIVE IC controller. These interrupts are allocated at the OPAL
      level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
      Interrupt management is performed in the XIVE way: using loads and
      stores on the addresses of the XIVE IPI interrupt ESB pages.
      
      Both KVM devices share the same internal structure caching information
      on the interrupts, among which the xive_irq_data struct containing the
      addresses of the IPI ESB pages and an extra one in case of pass-through.
      The later contains the addresses of the ESB pages of the underlying HW
      controller interrupts, PHB4 in all cases for now.
      
      A guest, when running in the XICS legacy interrupt mode, lets the KVM
      XICS-over-XIVE device "handle" interrupt management, that is to
      perform the loads and stores on the addresses of the ESB pages of the
      guest interrupts. However, when running in XIVE native exploitation
      mode, the KVM XIVE native device exposes the interrupt ESB pages to
      the guest and lets the guest perform directly the loads and stores.
      
      The VMA exposing the ESB pages make use of a custom VM fault handler
      which role is to populate the VMA with appropriate pages. When a fault
      occurs, the guest IRQ number is deduced from the offset, and the ESB
      pages of associated XIVE IPI interrupt are inserted in the VMA (using
      the internal structure caching information on the interrupts).
      
      Supporting device passthrough in the guest running in XIVE native
      exploitation mode adds some extra refinements because the ESB pages
      of a different HW controller (PHB4) need to be exposed to the guest
      along with the initial IPI ESB pages of the XIVE IC controller. But
      the overall mechanic is the same.
      
      When the device HW irqs are mapped into or unmapped from the guest
      IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped()
      and kvmppc_xive_clr_mapped(), are called to record or clear the
      passthrough interrupt information and to perform the switch.
      
      The approach taken by this patch is to clear the ESB pages of the
      guest IRQ number being mapped and let the VM fault handler repopulate.
      The handler will insert the ESB page corresponding to the HW interrupt
      of the device being passed-through or the initial IPI ESB page if the
      device is being removed.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      232b984b
    • C
      KVM: PPC: Book3S HV: XIVE: Add a mapping for the source ESB pages · 6520ca64
      Cédric Le Goater 提交于
      Each source is associated with an Event State Buffer (ESB) with a
      even/odd pair of pages which provides commands to manage the source:
      to trigger, to EOI, to turn off the source for instance.
      
      The custom VM fault handler will deduce the guest IRQ number from the
      offset of the fault, and the ESB page of the associated XIVE interrupt
      will be inserted into the VMA using the internal structure caching
      information on the interrupts.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      6520ca64
    • C
      KVM: PPC: Book3S HV: XIVE: Add a TIMA mapping · 39e9af3d
      Cédric Le Goater 提交于
      Each thread has an associated Thread Interrupt Management context
      composed of a set of registers. These registers let the thread handle
      priority management and interrupt acknowledgment. The most important
      are :
      
          - Interrupt Pending Buffer     (IPB)
          - Current Processor Priority   (CPPR)
          - Notification Source Register (NSR)
      
      They are exposed to software in four different pages each proposing a
      view with a different privilege. The first page is for the physical
      thread context and the second for the hypervisor. Only the third
      (operating system) and the fourth (user level) are exposed the guest.
      
      A custom VM fault handler will populate the VMA with the appropriate
      pages, which should only be the OS page for now.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      39e9af3d
    • C
      KVM: PPC: Book3S HV: XIVE: Add get/set accessors for the VP XIVE state · e4945b9d
      Cédric Le Goater 提交于
      The state of the thread interrupt management registers needs to be
      collected for migration. These registers are cached under the
      'xive_saved_state.w01' field of the VCPU when the VPCU context is
      pulled from the HW thread. An OPAL call retrieves the backup of the
      IPB register in the underlying XIVE NVT structure and merges it in the
      KVM state.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e4945b9d
    • C
      KVM: PPC: Book3S HV: XIVE: Add a control to dirty the XIVE EQ pages · e6714bd1
      Cédric Le Goater 提交于
      When migration of a VM is initiated, a first copy of the RAM is
      transferred to the destination before the VM is stopped, but there is
      no guarantee that the EQ pages in which the event notifications are
      queued have not been modified.
      
      To make sure migration will capture a consistent memory state, the
      XIVE device should perform a XIVE quiesce sequence to stop the flow of
      event notifications and stabilize the EQs. This is the purpose of the
      KVM_DEV_XIVE_EQ_SYNC control which will also marks the EQ pages dirty
      to force their transfer.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e6714bd1
    • C
      KVM: PPC: Book3S HV: XIVE: Add a control to sync the sources · 7b46b616
      Cédric Le Goater 提交于
      This control will be used by the H_INT_SYNC hcall from QEMU to flush
      event notifications on the XIVE IC owning the source.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      7b46b616
    • C
      KVM: PPC: Book3S HV: XIVE: Add a global reset control · 5ca80647
      Cédric Le Goater 提交于
      This control is to be used by the H_INT_RESET hcall from QEMU. Its
      purpose is to clear all configuration of the sources and EQs. This is
      necessary in case of a kexec (for a kdump kernel for instance) to make
      sure that no remaining configuration is left from the previous boot
      setup so that the new kernel can start safely from a clean state.
      
      The queue 7 is ignored when the XIVE device is configured to run in
      single escalation mode. Prio 7 is used by escalations.
      
      The XIVE VP is kept enabled as the vCPU is still active and connected
      to the XIVE device.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      5ca80647
    • C
      KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration · 13ce3297
      Cédric Le Goater 提交于
      These controls will be used by the H_INT_SET_QUEUE_CONFIG and
      H_INT_GET_QUEUE_CONFIG hcalls from QEMU to configure the underlying
      Event Queue in the XIVE IC. They will also be used to restore the
      configuration of the XIVE EQs and to capture the internal run-time
      state of the EQs. Both 'get' and 'set' rely on an OPAL call to access
      the EQ toggle bit and EQ index which are updated by the XIVE IC when
      event notifications are enqueued in the EQ.
      
      The value of the guest physical address of the event queue is saved in
      the XIVE internal xive_q structure for later use. That is when
      migration needs to mark the EQ pages dirty to capture a consistent
      memory state of the VM.
      
      To be noted that H_INT_SET_QUEUE_CONFIG does not require the extra
      OPAL call setting the EQ toggle bit and EQ index to configure the EQ,
      but restoring the EQ state will.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      13ce3297
    • C
      KVM: PPC: Book3S HV: XIVE: Add a control to configure a source · e8676ce5
      Cédric Le Goater 提交于
      This control will be used by the H_INT_SET_SOURCE_CONFIG hcall from
      QEMU to configure the target of a source and also to restore the
      configuration of a source when migrating the VM.
      
      The XIVE source interrupt structure is extended with the value of the
      Effective Interrupt Source Number. The EISN is the interrupt number
      pushed in the event queue that the guest OS will use to dispatch
      events internally. Caching the EISN value in KVM eases the test when
      checking if a reconfiguration is indeed needed.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e8676ce5
    • C
      KVM: PPC: Book3S HV: XIVE: add a control to initialize a source · 4131f83c
      Cédric Le Goater 提交于
      The XIVE KVM device maintains a list of interrupt sources for the VM
      which are allocated in the pool of generic interrupts (IPIs) of the
      main XIVE IC controller. These are used for the CPU IPIs as well as
      for virtual device interrupts. The IRQ number space is defined by
      QEMU.
      
      The XIVE device reuses the source structures of the XICS-on-XIVE
      device for the source blocks (2-level tree) and for the source
      interrupts. Under XIVE native, the source interrupt caches mostly
      configuration information and is less used than under the XICS-on-XIVE
      device in which hcalls are still necessary at run-time.
      
      When a source is initialized in KVM, an IPI interrupt source is simply
      allocated at the OPAL level and then MASKED. KVM only needs to know
      about its type: LSI or MSI.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      4131f83c
    • C
      KVM: PPC: Book3S HV: XIVE: Introduce a new capability KVM_CAP_PPC_IRQ_XIVE · eacc56bb
      Cédric Le Goater 提交于
      The user interface exposes a new capability KVM_CAP_PPC_IRQ_XIVE to
      let QEMU connect the vCPU presenters to the XIVE KVM device if
      required. The capability is not advertised for now as the full support
      for the XIVE native exploitation mode is not yet available. When this
      is case, the capability will be advertised on PowerNV Hypervisors
      only. Nested guests (pseries KVM Hypervisor) are not supported.
      
      Internally, the interface to the new KVM device is protected with a
      new interrupt mode: KVMPPC_IRQ_XIVE.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      eacc56bb
    • C
      KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation mode · 90c73795
      Cédric Le Goater 提交于
      This is the basic framework for the new KVM device supporting the XIVE
      native exploitation mode. The user interface exposes a new KVM device
      to be created by QEMU, only available when running on a L0 hypervisor.
      Support for nested guests is not available yet.
      
      The XIVE device reuses the device structure of the XICS-on-XIVE device
      as they have a lot in common. That could possibly change in the future
      if the need arise.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      90c73795
    • S
      KVM: PPC: Book3S HV: Save/restore vrsave register in kvmhv_p9_guest_entry() · 44b198ae
      Suraj Jitindar Singh 提交于
      On POWER9 and later processors where the host can schedule vcpus on a
      per thread basis, there is a streamlined entry path used when the guest
      is radix. This entry path saves/restores the fp and vr state in
      kvmhv_p9_guest_entry() by calling store_[fp/vr]_state() and
      load_[fp/vr]_state(). This is the same as the old entry path however the
      old entry path also saved/restored the VRSAVE register, which isn't done
      in the new entry path.
      
      This means that the vrsave register is now volatile across guest exit,
      which is an incorrect change in behaviour.
      
      Fix this by saving/restoring the vrsave register in kvmhv_p9_guest_entry().
      This restores the old, correct, behaviour.
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      44b198ae
    • P
      KVM: PPC: Book3S HV: Flush TLB on secondary radix threads · 70ea13f6
      Paul Mackerras 提交于
      When running on POWER9 with kvm_hv.indep_threads_mode = N and the host
      in SMT1 mode, KVM will run guest VCPUs on offline secondary threads.
      If those guests are in radix mode, we fail to load the LPID and flush
      the TLB if necessary, leading to the guest crashing with an
      unsupported MMU fault.  This arises from commit 9a4506e1 ("KVM:
      PPC: Book3S HV: Make radix handle process scoped LPID flush in C,
      with relocation on", 2018-05-17), which didn't consider the case
      where indep_threads_mode = N.
      
      For simplicity, this makes the real-mode guest entry path flush the
      TLB in the same place for both radix and hash guests, as we did before
      9a4506e1, though the code is now C code rather than assembly code.
      We also have the radix TLB flush open-coded rather than calling
      radix__local_flush_tlb_lpid_guest(), because the TLB flush can be
      called in real mode, and in real mode we don't want to invoke the
      tracepoint code.
      
      Fixes: 9a4506e1 ("KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on")
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      70ea13f6
    • P
      KVM: PPC: Book3S HV: Move HPT guest TLB flushing to C code · 2940ba0c
      Paul Mackerras 提交于
      This replaces assembler code in book3s_hv_rmhandlers.S that checks
      the kvm->arch.need_tlb_flush cpumask and optionally does a TLB flush
      with C code in book3s_hv_builtin.c.  Note that unlike the radix
      version, the hash version doesn't do an explicit ERAT invalidation
      because we will invalidate and load up the SLB before entering the
      guest, and that will invalidate the ERAT.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      2940ba0c
    • S
      KVM: PPC: Book3S HV: Handle virtual mode in XIVE VCPU push code · 7ae9bda7
      Suraj Jitindar Singh 提交于
      The code in book3s_hv_rmhandlers.S that pushes the XIVE virtual CPU
      context to the hardware currently assumes it is being called in real
      mode, which is usually true.  There is however a path by which it can
      be executed in virtual mode, in the case where indep_threads_mode = N.
      A virtual CPU executing on an offline secondary thread can take a
      hypervisor interrupt in virtual mode and return from the
      kvmppc_hv_entry() call after the kvm_secondary_got_guest label.
      It is possible for it to be given another vcpu to execute before it
      gets to execute the stop instruction.  In that case it will call
      kvmppc_hv_entry() for the second VCPU in virtual mode, and the XIVE
      vCPU push code will be executed in virtual mode.  The result in that
      case will be a host crash due to an unexpected data storage interrupt
      caused by executing the stdcix instruction in virtual mode.
      
      This fixes it by adding a code path for virtual mode, which uses the
      virtual TIMA pointer and normal load/store instructions.
      
      [paulus@ozlabs.org - wrote patch description]
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      7ae9bda7
    • P
      KVM: PPC: Book3S HV: Fix XICS-on-XIVE H_IPI when priority = 0 · 1f80ba3d
      Paul Mackerras 提交于
      This fixes a bug in the XICS emulation on POWER9 machines which is
      triggered by the guest doing a H_IPI with priority = 0 (the highest
      priority).  What happens is that the notification interrupt arrives
      at the destination at priority zero.  The loop in scan_interrupts()
      sees that a priority 0 interrupt is pending, but because xc->mfrr is
      zero, we break out of the loop before taking the notification
      interrupt out of the queue and EOI-ing it.  (This doesn't happen
      when xc->mfrr != 0; in that case we process the priority-0 notification
      interrupt on the first iteration of the loop, and then break out of
      a subsequent iteration of the loop with hirq == XICS_IPI.)
      
      To fix this, we move the prio >= xc->mfrr check down to near the end
      of the loop.  However, there are then some other things that need to
      be adjusted.  Since we are potentially handling the notification
      interrupt and also delivering an IPI to the guest in the same loop
      iteration, we need to update pending and handle any q->pending_count
      value before the xc->mfrr check, rather than at the end of the loop.
      Also, we need to update the queue pointers when we have processed and
      EOI-ed the notification interrupt, since we may not do it later.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      1f80ba3d
    • P
      KVM: PPC: Book3S HV: smb->smp comment fixup · 6fabc9f2
      Palmer Dabbelt 提交于
      I made the same typo when trying to grep for uses of smp_wmb and figured
      I might as well fix it.
      Signed-off-by: NPalmer Dabbelt <palmer@sifive.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      6fabc9f2
    • A
      KVM: PPC: Book3S: Allocate guest TCEs on demand too · e1a1ef84
      Alexey Kardashevskiy 提交于
      We already allocate hardware TCE tables in multiple levels and skip
      intermediate levels when we can, now it is a turn of the KVM TCE tables.
      Thankfully these are allocated already in 2 levels.
      
      This moves the table's last level allocation from the creating helper to
      kvmppc_tce_put() and kvm_spapr_tce_fault(). Since such allocation cannot
      be done in real mode, this creates a virtual mode version of
      kvmppc_tce_put() which handles allocations.
      
      This adds kvmppc_rm_ioba_validate() to do an additional test if
      the consequent kvmppc_tce_put() needs a page which has not been allocated;
      if this is the case, we bail out to virtual mode handlers.
      
      The allocations are protected by a new mutex as kvm->lock is not suitable
      for the task because the fault handler is called with the mmap_sem held
      but kvmhv_setup_mmu() locks kvm->lock and mmap_sem in the reverse order.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e1a1ef84
    • A
      KVM: PPC: Book3S HV: Avoid lockdep debugging in TCE realmode handlers · 2001825e
      Alexey Kardashevskiy 提交于
      The kvmppc_tce_to_ua() helper is called from real and virtual modes
      and it works fine as long as CONFIG_DEBUG_LOCKDEP is not enabled.
      However if the lockdep debugging is on, the lockdep will most likely break
      in kvm_memslots() because of srcu_dereference_check() so we need to use
      PPC-own kvm_memslots_raw() which uses realmode safe
      rcu_dereference_raw_notrace().
      
      This creates a realmode copy of kvmppc_tce_to_ua() which replaces
      kvm_memslots() with kvm_memslots_raw().
      
      Since kvmppc_rm_tce_to_ua() becomes static and can only be used inside
      HV KVM, this moves it earlier under CONFIG_KVM_BOOK3S_HV_POSSIBLE.
      
      This moves truly virtual-mode kvmppc_tce_to_ua() to where it belongs and
      drops the prmap parameter which was never used in the virtual mode.
      
      Fixes: d3695aa4 ("KVM: PPC: Add support for multiple-TCE hcalls", 2016-02-15)
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      2001825e
    • A
      KVM: PPC: Book3S HV: Fix lockdep warning when entering the guest · 3309bec8
      Alexey Kardashevskiy 提交于
      The trace_hardirqs_on() sets current->hardirqs_enabled and from here
      the lockdep assumes interrupts are enabled although they are remain
      disabled until the context switches to the guest. Consequent
      srcu_read_lock() checks the flags in rcu_lock_acquire(), observes
      disabled interrupts and prints a warning (see below).
      
      This moves trace_hardirqs_on/off closer to __kvmppc_vcore_entry to
      prevent lockdep from being confused.
      
      DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)
      WARNING: CPU: 16 PID: 8038 at kernel/locking/lockdep.c:4128 check_flags.part.25+0x224/0x280
      [...]
      NIP [c000000000185b84] check_flags.part.25+0x224/0x280
      LR [c000000000185b80] check_flags.part.25+0x220/0x280
      Call Trace:
      [c000003fec253710] [c000000000185b80] check_flags.part.25+0x220/0x280 (unreliable)
      [c000003fec253780] [c000000000187ea4] lock_acquire+0x94/0x260
      [c000003fec253840] [c00800001a1e9768] kvmppc_run_core+0xa60/0x1ab0 [kvm_hv]
      [c000003fec253a10] [c00800001a1ed944] kvmppc_vcpu_run_hv+0x73c/0xec0 [kvm_hv]
      [c000003fec253ae0] [c00800001a1095dc] kvmppc_vcpu_run+0x34/0x48 [kvm]
      [c000003fec253b00] [c00800001a1056bc] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
      [c000003fec253b90] [c00800001a0f3618] kvm_vcpu_ioctl+0x460/0x850 [kvm]
      [c000003fec253d00] [c00000000041c4f4] do_vfs_ioctl+0xe4/0x930
      [c000003fec253db0] [c00000000041ce04] ksys_ioctl+0xc4/0x110
      [c000003fec253e00] [c00000000041ce78] sys_ioctl+0x28/0x80
      [c000003fec253e20] [c00000000000b5a4] system_call+0x5c/0x70
      Instruction dump:
      419e0034 3d220004 39291730 81290000 2f890000 409e0020 3c82ffc6 3c62ffc5
      3884be70 386329c0 4bf6ea71 60000000 <0fe00000> 3c62ffc6 3863be90 4801273d
      irq event stamp: 1025
      hardirqs last  enabled at (1025): [<c00800001a1e9728>] kvmppc_run_core+0xa20/0x1ab0 [kvm_hv]
      hardirqs last disabled at (1024): [<c00800001a1e9358>] kvmppc_run_core+0x650/0x1ab0 [kvm_hv]
      softirqs last  enabled at (0): [<c0000000000f1210>] copy_process.isra.4.part.5+0x5f0/0x1d00
      softirqs last disabled at (0): [<0000000000000000>]           (null)
      ---[ end trace 31180adcc848993e ]---
      possible reason: unannotated irqs-off.
      irq event stamp: 1025
      hardirqs last  enabled at (1025): [<c00800001a1e9728>] kvmppc_run_core+0xa20/0x1ab0 [kvm_hv]
      hardirqs last disabled at (1024): [<c00800001a1e9358>] kvmppc_run_core+0x650/0x1ab0 [kvm_hv]
      softirqs last  enabled at (0): [<c0000000000f1210>] copy_process.isra.4.part.5+0x5f0/0x1d00
      softirqs last disabled at (0): [<0000000000000000>]           (null)
      
      Fixes: 8b24e69f ("KVM: PPC: Book3S HV: Close race with testing for signals on guest entry", 2017-06-26)
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      3309bec8
    • S
      KVM: PPC: Book3S HV: Implement real mode H_PAGE_INIT handler · eadfb1c5
      Suraj Jitindar Singh 提交于
      Implement a real mode handler for the H_CALL H_PAGE_INIT which can be
      used to zero or copy a guest page. The page is defined to be 4k and must
      be 4k aligned.
      
      The in-kernel real mode handler halves the time to handle this H_CALL
      compared to handling it in userspace for a hash guest.
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      eadfb1c5
    • S
      KVM: PPC: Book3S HV: Implement virtual mode H_PAGE_INIT handler · 2d34d1c3
      Suraj Jitindar Singh 提交于
      Implement a virtual mode handler for the H_CALL H_PAGE_INIT which can be
      used to zero or copy a guest page. The page is defined to be 4k and must
      be 4k aligned.
      
      The in-kernel handler halves the time to handle this H_CALL compared to
      handling it in userspace for a radix guest.
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      2d34d1c3
  2. 20 4月, 2019 1 次提交
    • M
      powerpc: Add force enable of DAWR on P9 option · c1fe190c
      Michael Neuling 提交于
      This adds a flag so that the DAWR can be enabled on P9 via:
        echo Y > /sys/kernel/debug/powerpc/dawr_enable_dangerous
      
      The DAWR was previously force disabled on POWER9 in:
        96541531 powerpc: Disable DAWR in the base POWER9 CPU features
      Also see Documentation/powerpc/DAWR-POWER9.txt
      
      This is a dangerous setting, USE AT YOUR OWN RISK.
      
      Some users may not care about a bad user crashing their box
      (ie. single user/desktop systems) and really want the DAWR.  This
      allows them to force enable DAWR.
      
      This flag can also be used to disable DAWR access. Once this is
      cleared, all DAWR access should be cleared immediately and your
      machine once again safe from crashing.
      
      Userspace may get confused by toggling this. If DAWR is force
      enabled/disabled between getting the number of breakpoints (via
      PTRACE_GETHWDBGINFO) and setting the breakpoint, userspace will get an
      inconsistent view of what's available. Similarly for guests.
      
      For the DAWR to be enabled in a KVM guest, the DAWR needs to be force
      enabled in the host AND the guest. For this reason, this won't work on
      POWERVM as it doesn't allow the HCALL to work. Writes of 'Y' to the
      dawr_enable_dangerous file will fail if the hypervisor doesn't support
      writing the DAWR.
      
      To double check the DAWR is working, run this kernel selftest:
        tools/testing/selftests/powerpc/ptrace/ptrace-hwbreak.c
      Any errors/failures/skips mean something is wrong.
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c1fe190c
  3. 05 4月, 2019 2 次提交
    • A
      KVM: PPC: Book3S: Protect memslots while validating user address · 345077c8
      Alexey Kardashevskiy 提交于
      Guest physical to user address translation uses KVM memslots and reading
      these requires holding the kvm->srcu lock. However recently introduced
      kvmppc_tce_validate() broke the rule (see the lockdep warning below).
      
      This moves srcu_read_lock(&vcpu->kvm->srcu) earlier to protect
      kvmppc_tce_validate() as well.
      
      =============================
      WARNING: suspicious RCU usage
      5.1.0-rc2-le_nv2_aikATfstn1-p1 #380 Not tainted
      -----------------------------
      include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      1 lock held by qemu-system-ppc/8020:
       #0: 0000000094972fe9 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0xdc/0x850 [kvm]
      
      stack backtrace:
      CPU: 44 PID: 8020 Comm: qemu-system-ppc Not tainted 5.1.0-rc2-le_nv2_aikATfstn1-p1 #380
      Call Trace:
      [c000003fece8f740] [c000000000bcc134] dump_stack+0xe8/0x164 (unreliable)
      [c000003fece8f790] [c000000000181be0] lockdep_rcu_suspicious+0x130/0x170
      [c000003fece8f810] [c0000000000d5f50] kvmppc_tce_to_ua+0x280/0x290
      [c000003fece8f870] [c00800001a7e2c78] kvmppc_tce_validate+0x80/0x1b0 [kvm]
      [c000003fece8f8e0] [c00800001a7e3fac] kvmppc_h_put_tce+0x94/0x3e4 [kvm]
      [c000003fece8f9a0] [c00800001a8baac4] kvmppc_pseries_do_hcall+0x30c/0xce0 [kvm_hv]
      [c000003fece8fa10] [c00800001a8bd89c] kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv]
      [c000003fece8fae0] [c00800001a7d95dc] kvmppc_vcpu_run+0x34/0x48 [kvm]
      [c000003fece8fb00] [c00800001a7d56bc] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
      [c000003fece8fb90] [c00800001a7c3618] kvm_vcpu_ioctl+0x460/0x850 [kvm]
      [c000003fece8fd00] [c00000000041c4f4] do_vfs_ioctl+0xe4/0x930
      [c000003fece8fdb0] [c00000000041ce04] ksys_ioctl+0xc4/0x110
      [c000003fece8fe00] [c00000000041ce78] sys_ioctl+0x28/0x80
      [c000003fece8fe20] [c00000000000b5a4] system_call+0x5c/0x70
      
      Fixes: 42de7b9e ("KVM: PPC: Validate TCEs against preregistered memory page sizes", 2018-09-10)
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      345077c8
    • S
      KVM: PPC: Book3S HV: Perserve PSSCR FAKE_SUSPEND bit on guest exit · 7cb9eb10
      Suraj Jitindar Singh 提交于
      There is a hardware bug in some POWER9 processors where a treclaim in
      fake suspend mode can cause an inconsistency in the XER[SO] bit across
      the threads of a core, the workaround being to force the core into SMT4
      when doing the treclaim.
      
      The FAKE_SUSPEND bit (bit 10) in the PSSCR is used to control whether a
      thread is in fake suspend or real suspend. The important difference here
      being that thread reconfiguration is blocked in real suspend but not
      fake suspend mode.
      
      When we exit a guest which was in fake suspend mode, we force the core
      into SMT4 while we do the treclaim in kvmppc_save_tm_hv().
      However on the new exit path introduced with the function
      kvmhv_run_single_vcpu() we restore the host PSSCR before calling
      kvmppc_save_tm_hv() which means that if we were in fake suspend mode we
      put the thread into real suspend mode when we clear the
      PSSCR[FAKE_SUSPEND] bit. This means that we block thread reconfiguration
      and the thread which is trying to get the core into SMT4 before it can
      do the treclaim spins forever since it itself is blocking thread
      reconfiguration. The result is that that core is essentially lost.
      
      This results in a trace such as:
      [   93.512904] CPU: 7 PID: 13352 Comm: qemu-system-ppc Not tainted 5.0.0 #4
      [   93.512905] NIP:  c000000000098a04 LR: c0000000000cc59c CTR: 0000000000000000
      [   93.512908] REGS: c000003fffd2bd70 TRAP: 0100   Not tainted  (5.0.0)
      [   93.512908] MSR:  9000000302883033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE,TM[SE]>  CR: 22222444  XER: 00000000
      [   93.512914] CFAR: c000000000098a5c IRQMASK: 3
      [   93.512915] PACATMSCRATCH: 0000000000000001
      [   93.512916] GPR00: 0000000000000001 c000003f6cc1b830 c000000001033100 0000000000000004
      [   93.512928] GPR04: 0000000000000004 0000000000000002 0000000000000004 0000000000000007
      [   93.512930] GPR08: 0000000000000000 0000000000000004 0000000000000000 0000000000000004
      [   93.512932] GPR12: c000203fff7fc000 c000003fffff9500 0000000000000000 0000000000000000
      [   93.512935] GPR16: 2000000000300375 000000000000059f 0000000000000000 0000000000000000
      [   93.512951] GPR20: 0000000000000000 0000000000080053 004000000256f41f c000003f6aa88ef0
      [   93.512953] GPR24: c000003f6aa89100 0000000000000010 0000000000000000 0000000000000000
      [   93.512956] GPR28: c000003f9e9a0800 0000000000000000 0000000000000001 c000203fff7fc000
      [   93.512959] NIP [c000000000098a04] pnv_power9_force_smt4_catch+0x1b4/0x2c0
      [   93.512960] LR [c0000000000cc59c] kvmppc_save_tm_hv+0x40/0x88
      [   93.512960] Call Trace:
      [   93.512961] [c000003f6cc1b830] [0000000000080053] 0x80053 (unreliable)
      [   93.512965] [c000003f6cc1b8a0] [c00800001e9cb030] kvmhv_p9_guest_entry+0x508/0x6b0 [kvm_hv]
      [   93.512967] [c000003f6cc1b940] [c00800001e9cba44] kvmhv_run_single_vcpu+0x2dc/0xb90 [kvm_hv]
      [   93.512968] [c000003f6cc1ba10] [c00800001e9cc948] kvmppc_vcpu_run_hv+0x650/0xb90 [kvm_hv]
      [   93.512969] [c000003f6cc1bae0] [c00800001e8f620c] kvmppc_vcpu_run+0x34/0x48 [kvm]
      [   93.512971] [c000003f6cc1bb00] [c00800001e8f2d4c] kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
      [   93.512972] [c000003f6cc1bb90] [c00800001e8e3918] kvm_vcpu_ioctl+0x460/0x7d0 [kvm]
      [   93.512974] [c000003f6cc1bd00] [c0000000003ae2c0] do_vfs_ioctl+0xe0/0x8e0
      [   93.512975] [c000003f6cc1bdb0] [c0000000003aeb24] ksys_ioctl+0x64/0xe0
      [   93.512978] [c000003f6cc1be00] [c0000000003aebc8] sys_ioctl+0x28/0x80
      [   93.512981] [c000003f6cc1be20] [c00000000000b3a4] system_call+0x5c/0x70
      [   93.512983] Instruction dump:
      [   93.512986] 419dffbc e98c0000 2e8b0000 38000001 60000000 60000000 60000000 40950068
      [   93.512993] 392bffff 39400000 79290020 39290001 <7d2903a6> 60000000 60000000 7d235214
      
      To fix this we preserve the PSSCR[FAKE_SUSPEND] bit until we call
      kvmppc_save_tm_hv() which will mean the core can get into SMT4 and
      perform the treclaim. Note kvmppc_save_tm_hv() clears the
      PSSCR[FAKE_SUSPEND] bit again so there is no need to explicitly do that.
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      7cb9eb10
  4. 01 3月, 2019 1 次提交
  5. 23 2月, 2019 1 次提交
  6. 22 2月, 2019 3 次提交
    • M
      powerpc/kvm: Save and restore host AMR/IAMR/UAMOR · c3c7470c
      Michael Ellerman 提交于
      When the hash MMU is active the AMR, IAMR and UAMOR are used for
      pkeys. The AMR is directly writable by user space, and the UAMOR masks
      those writes, meaning both registers are effectively user register
      state. The IAMR is used to create an execute only key.
      
      Also we must maintain the value of at least the AMR when running in
      process context, so that any memory accesses done by the kernel on
      behalf of the process are correctly controlled by the AMR.
      
      Although we are correctly switching all registers when going into a
      guest, on returning to the host we just write 0 into all regs, except
      on Power9 where we restore the IAMR correctly.
      
      This could be observed by a user process if it writes the AMR, then
      runs a guest and we then return immediately to it without
      rescheduling. Because we have written 0 to the AMR that would have the
      effect of granting read/write permission to pages that the process was
      trying to protect.
      
      In addition, when using the Radix MMU, the AMR can prevent inadvertent
      kernel access to userspace data, writing 0 to the AMR disables that
      protection.
      
      So save and restore AMR, IAMR and UAMOR.
      
      Fixes: cf43d3b2 ("powerpc: Enable pkey subsystem")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: NRussell Currey <ruscur@russell.cc>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Acked-by: NPaul Mackerras <paulus@ozlabs.org>
      c3c7470c
    • A
      KVM: PPC: Book3S: Improve KVM reference counting · 716cb116
      Alexey Kardashevskiy 提交于
      The anon fd's ops releases the KVM reference in the release hook.
      However we reference the KVM object after we create the fd so there is
      small window when the release function can be called and
      dereferenced the KVM object which potentially may free it.
      
      It is not a problem at the moment as the file is created and KVM is
      referenced under the KVM lock and the release function obtains the same
      lock before dereferencing the KVM (although the lock is not held when
      calling kvm_put_kvm()) but it is potentially fragile against future changes.
      
      This references the KVM object before creating a file.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      716cb116
    • J
      KVM: PPC: Book3S HV: Fix build failure without IOMMU support · e40542af
      Jordan Niethe 提交于
      Currently trying to build without IOMMU support will fail:
      
        (.text+0x1380): undefined reference to `kvmppc_h_get_tce'
        (.text+0x1384): undefined reference to `kvmppc_rm_h_put_tce'
        (.text+0x149c): undefined reference to `kvmppc_rm_h_stuff_tce'
        (.text+0x14a0): undefined reference to `kvmppc_rm_h_put_tce_indirect'
      
      This happens because turning off IOMMU support will prevent
      book3s_64_vio_hv.c from being built because it is only built when
      SPAPR_TCE_IOMMU is set, which depends on IOMMU support.
      
      Fix it using ifdefs for the undefined references.
      
      Fixes: 76d837a4 ("KVM: PPC: Book3S PR: Don't include SPAPR TCE code on non-pseries platforms")
      Signed-off-by: NJordan Niethe <jniethe5@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e40542af
  7. 21 2月, 2019 6 次提交
    • P
      powerpc/64s: Better printing of machine check info for guest MCEs · c0577201
      Paul Mackerras 提交于
      This adds an "in_guest" parameter to machine_check_print_event_info()
      so that we can avoid trying to translate guest NIP values into
      symbolic form using the host kernel's symbol table.
      Reviewed-by: NAravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c0577201
    • P
      KVM: PPC: Book3S HV: Simplify machine check handling · 884dfb72
      Paul Mackerras 提交于
      This makes the handling of machine check interrupts that occur inside
      a guest simpler and more robust, with less done in assembler code and
      in real mode.
      
      Now, when a machine check occurs inside a guest, we always get the
      machine check event struct and put a copy in the vcpu struct for the
      vcpu where the machine check occurred.  We no longer call
      machine_check_queue_event() from kvmppc_realmode_mc_power7(), because
      on POWER8, when a vcpu is running on an offline secondary thread and
      we call machine_check_queue_event(), that calls irq_work_queue(),
      which doesn't work because the CPU is offline, but instead triggers
      the WARN_ON(lazy_irq_pending()) in pnv_smp_cpu_kill_self() (which
      fires again and again because nothing clears the condition).
      
      All that machine_check_queue_event() actually does is to cause the
      event to be printed to the console.  For a machine check occurring in
      the guest, we now print the event in kvmppc_handle_exit_hv()
      instead.
      
      The assembly code at label machine_check_realmode now just calls C
      code and then continues exiting the guest.  We no longer either
      synthesize a machine check for the guest in assembly code or return
      to the guest without a machine check.
      
      The code in kvmppc_handle_exit_hv() is extended to handle the case
      where the guest is not FWNMI-capable.  In that case we now always
      synthesize a machine check interrupt for the guest.  Previously, if
      the host thinks it has recovered the machine check fully, it would
      return to the guest without any notification that the machine check
      had occurred.  If the machine check was caused by some action of the
      guest (such as creating duplicate SLB entries), it is much better to
      tell the guest that it has caused a problem.  Therefore we now always
      generate a machine check interrupt for guests that are not
      FWNMI-capable.
      Reviewed-by: NAravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      884dfb72
    • M
      KVM: PPC: Book3S HV: Context switch AMR on Power9 · d976f680
      Michael Ellerman 提交于
      kvmhv_p9_guest_entry() implements a fast-path guest entry for Power9
      when guest and host are both running with the Radix MMU.
      
      Currently in that path we don't save the host AMR (Authority Mask
      Register) value, and we always restore 0 on return to the host. That
      is OK at the moment because the AMR is not used for storage keys with
      the Radix MMU.
      
      However we plan to start using the AMR on Radix to prevent the kernel
      from reading/writing to userspace outside of copy_to/from_user(). In
      order to make that work we need to save/restore the AMR value.
      
      We only restore the value if it is different from the guest value,
      which is already in the register when we exit to the host. This should
      mean we rarely need to actually restore the value when running a
      modern Linux as a guest, because it will be using the same value as
      us.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Tested-by: NRussell Currey <ruscur@russell.cc>
      d976f680
    • N
      KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start · dee339b5
      Nir Weiner 提交于
      grow_halt_poll_ns() have a strange behaviour in case
      (vcpu->halt_poll_ns != 0) &&
      (vcpu->halt_poll_ns < halt_poll_ns_grow_start).
      
      In this case, vcpu->halt_poll_ns will be multiplied by grow factor
      (halt_poll_ns_grow) which will require several grow iteration in order
      to reach a value bigger than halt_poll_ns_grow_start.
      This means that growing vcpu->halt_poll_ns from value of 0 is slower
      than growing it from a positive value less than halt_poll_ns_grow_start.
      Which is misleading and inaccurate.
      
      Fix issue by changing grow_halt_poll_ns() to set vcpu->halt_poll_ns
      to halt_poll_ns_grow_start in any case that
      (vcpu->halt_poll_ns < halt_poll_ns_grow_start).
      Regardless if vcpu->halt_poll_ns is 0.
      
      use READ_ONCE to get a consistent number for all cases.
      Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NNir Weiner <nir.weiner@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      dee339b5
    • N
      KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter · 49113d36
      Nir Weiner 提交于
      The hard-coded value 10000 in grow_halt_poll_ns() stands for the initial
      start value when raising up vcpu->halt_poll_ns.
      It actually sets the first timeout to the first polling session.
      This value has significant effect on how tolerant we are to outliers.
      On the standard case, higher value is better - we will spend more time
      in the polling busyloop, handle events/interrupts faster and result
      in better performance.
      But on outliers it puts us in a busy loop that does nothing.
      Even if the shrink factor is zero, we will still waste time on the first
      iteration.
      The optimal value changes between different workloads. It depends on
      outliers rate and polling sessions length.
      As this value has significant effect on the dynamic halt-polling
      algorithm, it should be configurable and exposed.
      Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NNir Weiner <nir.weiner@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      49113d36
    • N
      KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns · 7fa08e71
      Nir Weiner 提交于
      grow_halt_poll_ns() have a strange behavior in case
      (halt_poll_ns_grow == 0) && (vcpu->halt_poll_ns != 0).
      
      In this case, vcpu->halt_pol_ns will be set to zero.
      That results in shrinking instead of growing.
      
      Fix issue by changing grow_halt_poll_ns() to not modify
      vcpu->halt_poll_ns in case halt_poll_ns_grow is zero
      Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NNir Weiner <nir.weiner@oracle.com>
      Suggested-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7fa08e71
  8. 19 2月, 2019 3 次提交
    • S
      KVM: PPC: Book3S HV: Add KVM stat largepages_[2M/1G] · 8f1f7b9b
      Suraj Jitindar Singh 提交于
      This adds an entry to the kvm_stats_debugfs directory which provides the
      number of large (2M or 1G) pages which have been used to setup the guest
      mappings, for radix guests.
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      8f1f7b9b
    • A
      KVM: PPC: Release all hardware TCE tables attached to a group · a67614cc
      Alexey Kardashevskiy 提交于
      The SPAPR TCE KVM device references all hardware IOMMU tables assigned to
      some IOMMU group to ensure that in-kernel KVM acceleration of H_PUT_TCE
      can work. The tables are references when an IOMMU group gets registered
      with the VFIO KVM device by the KVM_DEV_VFIO_GROUP_ADD ioctl;
      KVM_DEV_VFIO_GROUP_DEL calls into the dereferencing code
      in kvm_spapr_tce_release_iommu_group() which walks through the list of
      LIOBNs, finds a matching IOMMU table and calls kref_put() when found.
      
      However that code stops after the very first successful derefencing
      leaving other tables referenced till the SPAPR TCE KVM device is destroyed
      which normally happens on guest reboot or termination so if we do hotplug
      and unplug in a loop, we are leaking IOMMU tables here.
      
      This removes a premature return to let kvm_spapr_tce_release_iommu_group()
      find and dereference all attached tables.
      
      Fixes: 121f80ba ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      a67614cc
    • S
      KVM: PPC: Book3S HV: Optimise mmio emulation for devices on FAST_MMIO_BUS · 1b642257
      Suraj Jitindar Singh 提交于
      Devices on the KVM_FAST_MMIO_BUS by definition have length zero and are
      thus used for notification purposes rather than data transfer. For
      example eventfd for virtio devices.
      
      This means that when emulating mmio instructions which target devices on
      this bus we can immediately handle them and return without needing to load
      the instruction from guest memory.
      
      For now we restrict this to stores as this is the only use case at
      present.
      
      For a normal guest the effect is negligible, however for a nested guest
      we save on the order of 5us per access.
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      1b642257