1. 24 9月, 2019 1 次提交
    • M
      KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag · 3a83f677
      Michael Roth 提交于
      On a 2-socket Power9 system with 32 cores/128 threads (SMT4) and 1TB
      of memory running the following guest configs:
      
        guest A:
          - 224GB of memory
          - 56 VCPUs (sockets=1,cores=28,threads=2), where:
            VCPUs 0-1 are pinned to CPUs 0-3,
            VCPUs 2-3 are pinned to CPUs 4-7,
            ...
            VCPUs 54-55 are pinned to CPUs 108-111
      
        guest B:
          - 4GB of memory
          - 4 VCPUs (sockets=1,cores=4,threads=1)
      
      with the following workloads (with KSM and THP enabled in all):
      
        guest A:
          stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M
      
        guest B:
          stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M
      
        host:
          stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M
      
      the below soft-lockup traces were observed after an hour or so and
      persisted until the host was reset (this was found to be reliably
      reproducible for this configuration, for kernels 4.15, 4.18, 5.0,
      and 5.3-rc5):
      
        [ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU
        [ 1253.183319] rcu:     124-....: (5250 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=1941
        [ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU 52/KVM:19709]
        [ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! [worker:19913]
        [ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [worker:20331]
        [ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! [worker:20338]
        [ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! [avocado:19525]
        [ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
        [ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU
        [ 1316.198032] rcu:     124-....: (21003 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=8243
        [ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
        [ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU
        [ 1379.212629] rcu:     124-....: (36756 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=14714
        [ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
        [ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU
        [ 1442.227115] rcu:     124-....: (52509 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=21403
        [ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds.
        [ 1455.111822]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds.
        [ 1455.111905]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds.
        [ 1455.111986]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds.
        [ 1455.112068]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds.
        [ 1455.112159]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds.
        [ 1455.112231]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112282] INFO: task worker:20335 blocked for more than 120 seconds.
        [ 1455.112303]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.112332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112372] INFO: task worker:20336 blocked for more than 120 seconds.
        [ 1455.112392]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
      
      CPUs 45, 24, and 124 are stuck on spin locks, likely held by
      CPUs 105 and 31.
      
      CPUs 105 and 31 are stuck in smp_call_function_many(), waiting on
      target CPU 42. For instance:
      
        # CPU 105 registers (via xmon)
        R00 = c00000000020b20c   R16 = 00007d1bcd800000
        R01 = c00000363eaa7970   R17 = 0000000000000001
        R02 = c0000000019b3a00   R18 = 000000000000006b
        R03 = 000000000000002a   R19 = 00007d537d7aecf0
        R04 = 000000000000002a   R20 = 60000000000000e0
        R05 = 000000000000002a   R21 = 0801000000000080
        R06 = c0002073fb0caa08   R22 = 0000000000000d60
        R07 = c0000000019ddd78   R23 = 0000000000000001
        R08 = 000000000000002a   R24 = c00000000147a700
        R09 = 0000000000000001   R25 = c0002073fb0ca908
        R10 = c000008ffeb4e660   R26 = 0000000000000000
        R11 = c0002073fb0ca900   R27 = c0000000019e2464
        R12 = c000000000050790   R28 = c0000000000812b0
        R13 = c000207fff623e00   R29 = c0002073fb0ca808
        R14 = 00007d1bbee00000   R30 = c0002073fb0ca800
        R15 = 00007d1bcd600000   R31 = 0000000000000800
        pc  = c00000000020b260 smp_call_function_many+0x3d0/0x460
        cfar= c00000000020b270 smp_call_function_many+0x3e0/0x460
        lr  = c00000000020b20c smp_call_function_many+0x37c/0x460
        msr = 900000010288b033   cr  = 44024824
        ctr = c000000000050790   xer = 0000000000000000   trap =  100
      
      CPU 42 is running normally, doing VCPU work:
      
        # CPU 42 stack trace (via xmon)
        [link register   ] c00800001be17188 kvmppc_book3s_radix_page_fault+0x90/0x2b0 [kvm_hv]
        [c000008ed3343820] c000008ed3343850 (unreliable)
        [c000008ed33438d0] c00800001be11b6c kvmppc_book3s_hv_page_fault+0x264/0xe30 [kvm_hv]
        [c000008ed33439d0] c00800001be0d7b4 kvmppc_vcpu_run_hv+0x8dc/0xb50 [kvm_hv]
        [c000008ed3343ae0] c00800001c10891c kvmppc_vcpu_run+0x34/0x48 [kvm]
        [c000008ed3343b00] c00800001c10475c kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm]
        [c000008ed3343b90] c00800001c0f5a78 kvm_vcpu_ioctl+0x470/0x7c8 [kvm]
        [c000008ed3343d00] c000000000475450 do_vfs_ioctl+0xe0/0xc70
        [c000008ed3343db0] c0000000004760e4 ksys_ioctl+0x104/0x120
        [c000008ed3343e00] c000000000476128 sys_ioctl+0x28/0x80
        [c000008ed3343e20] c00000000000b388 system_call+0x5c/0x70
        --- Exception: c00 (System Call) at 00007d545cfd7694
        SP (7d53ff7edf50) is in userspace
      
      It was subsequently found that ipi_message[PPC_MSG_CALL_FUNCTION]
      was set for CPU 42 by at least 1 of the CPUs waiting in
      smp_call_function_many(), but somehow the corresponding
      call_single_queue entries were never processed by CPU 42, causing the
      callers to spin in csd_lock_wait() indefinitely.
      
      Nick Piggin suggested something similar to the following sequence as
      a possible explanation (interleaving of CALL_FUNCTION/RESCHEDULE
      IPI messages seems to be most common, but any mix of CALL_FUNCTION and
      !CALL_FUNCTION messages could trigger it):
      
          CPU
            X: smp_muxed_ipi_set_message():
            X:   smp_mb()
            X:   message[RESCHEDULE] = 1
            X: doorbell_global_ipi(42):
            X:   kvmppc_set_host_ipi(42, 1)
            X:   ppc_msgsnd_sync()/smp_mb()
            X:   ppc_msgsnd() -> 42
           42: doorbell_exception(): // from CPU X
           42:   ppc_msgsync()
          105: smp_muxed_ipi_set_message():
          105:   smb_mb()
               // STORE DEFERRED DUE TO RE-ORDERING
        --105:   message[CALL_FUNCTION] = 1
        | 105: doorbell_global_ipi(42):
        | 105:   kvmppc_set_host_ipi(42, 1)
        |  42:   kvmppc_set_host_ipi(42, 0)
        |  42: smp_ipi_demux_relaxed()
        |  42: // returns to executing guest
        |      // RE-ORDERED STORE COMPLETES
        ->105:   message[CALL_FUNCTION] = 1
          105:   ppc_msgsnd_sync()/smp_mb()
          105:   ppc_msgsnd() -> 42
           42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
          105: // hangs waiting on 42 to process messages/call_single_queue
      
      This can be prevented with an smp_mb() at the beginning of
      kvmppc_set_host_ipi(), such that stores to message[<type>] (or other
      state indicated by the host_ipi flag) are ordered vs. the store to
      to host_ipi.
      
      However, doing so might still allow for the following scenario (not
      yet observed):
      
          CPU
            X: smp_muxed_ipi_set_message():
            X:   smp_mb()
            X:   message[RESCHEDULE] = 1
            X: doorbell_global_ipi(42):
            X:   kvmppc_set_host_ipi(42, 1)
            X:   ppc_msgsnd_sync()/smp_mb()
            X:   ppc_msgsnd() -> 42
           42: doorbell_exception(): // from CPU X
           42:   ppc_msgsync()
               // STORE DEFERRED DUE TO RE-ORDERING
        -- 42:   kvmppc_set_host_ipi(42, 0)
        |  42: smp_ipi_demux_relaxed()
        | 105: smp_muxed_ipi_set_message():
        | 105:   smb_mb()
        | 105:   message[CALL_FUNCTION] = 1
        | 105: doorbell_global_ipi(42):
        | 105:   kvmppc_set_host_ipi(42, 1)
        |      // RE-ORDERED STORE COMPLETES
        -> 42:   kvmppc_set_host_ipi(42, 0)
           42: // returns to executing guest
          105:   ppc_msgsnd_sync()/smp_mb()
          105:   ppc_msgsnd() -> 42
           42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
          105: // hangs waiting on 42 to process messages/call_single_queue
      
      Fixing this scenario would require an smp_mb() *after* clearing
      host_ipi flag in kvmppc_set_host_ipi() to order the store vs.
      subsequent processing of IPI messages.
      
      To handle both cases, this patch splits kvmppc_set_host_ipi() into
      separate set/clear functions, where we execute smp_mb() prior to
      setting host_ipi flag, and after clearing host_ipi flag. These
      functions pair with each other to synchronize the sender and receiver
      sides.
      
      With that change in place the above workload ran for 20 hours without
      triggering any lock-ups.
      
      Fixes: 755563bc ("powerpc/powernv: Fixes for hypervisor doorbell handling") # v4.0
      Signed-off-by: NMichael Roth <mdroth@linux.vnet.ibm.com>
      Acked-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190911223155.16045-1-mdroth@linux.vnet.ibm.com
      3a83f677
  2. 27 8月, 2019 1 次提交
    • P
      KVM: PPC: Book3S: Enable XIVE native capability only if OPAL has required functions · 2ad7a27d
      Paul Mackerras 提交于
      There are some POWER9 machines where the OPAL firmware does not support
      the OPAL_XIVE_GET_QUEUE_STATE and OPAL_XIVE_SET_QUEUE_STATE calls.
      The impact of this is that a guest using XIVE natively will not be able
      to be migrated successfully.  On the source side, the get_attr operation
      on the KVM native device for the KVM_DEV_XIVE_GRP_EQ_CONFIG attribute
      will fail; on the destination side, the set_attr operation for the same
      attribute will fail.
      
      This adds tests for the existence of the OPAL get/set queue state
      functions, and if they are not supported, the XIVE-native KVM device
      is not created and the KVM_CAP_PPC_IRQ_XIVE capability returns false.
      Userspace can then either provide a software emulation of XIVE, or
      else tell the guest that it does not have a XIVE controller available
      to it.
      
      Cc: stable@vger.kernel.org # v5.2+
      Fixes: 3fab2d10 ("KVM: PPC: Book3S HV: XIVE: Activate XIVE exploitation mode")
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      2ad7a27d
  3. 05 6月, 2019 1 次提交
  4. 30 4月, 2019 8 次提交
    • C
      KVM: PPC: Book3S HV: XIVE: Add get/set accessors for the VP XIVE state · e4945b9d
      Cédric Le Goater 提交于
      The state of the thread interrupt management registers needs to be
      collected for migration. These registers are cached under the
      'xive_saved_state.w01' field of the VCPU when the VPCU context is
      pulled from the HW thread. An OPAL call retrieves the backup of the
      IPB register in the underlying XIVE NVT structure and merges it in the
      KVM state.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e4945b9d
    • C
      KVM: PPC: Book3S HV: XIVE: Introduce a new capability KVM_CAP_PPC_IRQ_XIVE · eacc56bb
      Cédric Le Goater 提交于
      The user interface exposes a new capability KVM_CAP_PPC_IRQ_XIVE to
      let QEMU connect the vCPU presenters to the XIVE KVM device if
      required. The capability is not advertised for now as the full support
      for the XIVE native exploitation mode is not yet available. When this
      is case, the capability will be advertised on PowerNV Hypervisors
      only. Nested guests (pseries KVM Hypervisor) are not supported.
      
      Internally, the interface to the new KVM device is protected with a
      new interrupt mode: KVMPPC_IRQ_XIVE.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      eacc56bb
    • C
      KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation mode · 90c73795
      Cédric Le Goater 提交于
      This is the basic framework for the new KVM device supporting the XIVE
      native exploitation mode. The user interface exposes a new KVM device
      to be created by QEMU, only available when running on a L0 hypervisor.
      Support for nested guests is not available yet.
      
      The XIVE device reuses the device structure of the XICS-on-XIVE device
      as they have a lot in common. That could possibly change in the future
      if the need arise.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      90c73795
    • P
      KVM: PPC: Book3S HV: Flush TLB on secondary radix threads · 70ea13f6
      Paul Mackerras 提交于
      When running on POWER9 with kvm_hv.indep_threads_mode = N and the host
      in SMT1 mode, KVM will run guest VCPUs on offline secondary threads.
      If those guests are in radix mode, we fail to load the LPID and flush
      the TLB if necessary, leading to the guest crashing with an
      unsupported MMU fault.  This arises from commit 9a4506e1 ("KVM:
      PPC: Book3S HV: Make radix handle process scoped LPID flush in C,
      with relocation on", 2018-05-17), which didn't consider the case
      where indep_threads_mode = N.
      
      For simplicity, this makes the real-mode guest entry path flush the
      TLB in the same place for both radix and hash guests, as we did before
      9a4506e1, though the code is now C code rather than assembly code.
      We also have the radix TLB flush open-coded rather than calling
      radix__local_flush_tlb_lpid_guest(), because the TLB flush can be
      called in real mode, and in real mode we don't want to invoke the
      tracepoint code.
      
      Fixes: 9a4506e1 ("KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on")
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      70ea13f6
    • P
      KVM: PPC: Book3S HV: Move HPT guest TLB flushing to C code · 2940ba0c
      Paul Mackerras 提交于
      This replaces assembler code in book3s_hv_rmhandlers.S that checks
      the kvm->arch.need_tlb_flush cpumask and optionally does a TLB flush
      with C code in book3s_hv_builtin.c.  Note that unlike the radix
      version, the hash version doesn't do an explicit ERAT invalidation
      because we will invalidate and load up the SLB before entering the
      guest, and that will invalidate the ERAT.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      2940ba0c
    • A
      KVM: PPC: Book3S: Allocate guest TCEs on demand too · e1a1ef84
      Alexey Kardashevskiy 提交于
      We already allocate hardware TCE tables in multiple levels and skip
      intermediate levels when we can, now it is a turn of the KVM TCE tables.
      Thankfully these are allocated already in 2 levels.
      
      This moves the table's last level allocation from the creating helper to
      kvmppc_tce_put() and kvm_spapr_tce_fault(). Since such allocation cannot
      be done in real mode, this creates a virtual mode version of
      kvmppc_tce_put() which handles allocations.
      
      This adds kvmppc_rm_ioba_validate() to do an additional test if
      the consequent kvmppc_tce_put() needs a page which has not been allocated;
      if this is the case, we bail out to virtual mode handlers.
      
      The allocations are protected by a new mutex as kvm->lock is not suitable
      for the task because the fault handler is called with the mmap_sem held
      but kvmhv_setup_mmu() locks kvm->lock and mmap_sem in the reverse order.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e1a1ef84
    • A
      KVM: PPC: Book3S HV: Avoid lockdep debugging in TCE realmode handlers · 2001825e
      Alexey Kardashevskiy 提交于
      The kvmppc_tce_to_ua() helper is called from real and virtual modes
      and it works fine as long as CONFIG_DEBUG_LOCKDEP is not enabled.
      However if the lockdep debugging is on, the lockdep will most likely break
      in kvm_memslots() because of srcu_dereference_check() so we need to use
      PPC-own kvm_memslots_raw() which uses realmode safe
      rcu_dereference_raw_notrace().
      
      This creates a realmode copy of kvmppc_tce_to_ua() which replaces
      kvm_memslots() with kvm_memslots_raw().
      
      Since kvmppc_rm_tce_to_ua() becomes static and can only be used inside
      HV KVM, this moves it earlier under CONFIG_KVM_BOOK3S_HV_POSSIBLE.
      
      This moves truly virtual-mode kvmppc_tce_to_ua() to where it belongs and
      drops the prmap parameter which was never used in the virtual mode.
      
      Fixes: d3695aa4 ("KVM: PPC: Add support for multiple-TCE hcalls", 2016-02-15)
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      2001825e
    • S
      KVM: PPC: Book3S HV: Implement real mode H_PAGE_INIT handler · eadfb1c5
      Suraj Jitindar Singh 提交于
      Implement a real mode handler for the H_CALL H_PAGE_INIT which can be
      used to zero or copy a guest page. The page is defined to be 4k and must
      be 4k aligned.
      
      The in-kernel real mode handler halves the time to handle this H_CALL
      compared to handling it in userspace for a hash guest.
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      eadfb1c5
  5. 27 2月, 2019 1 次提交
    • P
      KVM: PPC: Fix compilation when KVM is not enabled · e74d53e3
      Paul Mackerras 提交于
      Compiling with CONFIG_PPC_POWERNV=y and KVM disabled currently gives
      an error like this:
      
        CC      arch/powerpc/kernel/dbell.o
      In file included from arch/powerpc/kernel/dbell.c:20:0:
      arch/powerpc/include/asm/kvm_ppc.h: In function ‘xics_on_xive’:
      arch/powerpc/include/asm/kvm_ppc.h:625:9: error: implicit declaration of function ‘xive_enabled’ [-Werror=implicit-function-declaration]
        return xive_enabled() && cpu_has_feature(CPU_FTR_HVMODE);
               ^
      cc1: all warnings being treated as errors
      scripts/Makefile.build:276: recipe for target 'arch/powerpc/kernel/dbell.o' failed
      make[3]: *** [arch/powerpc/kernel/dbell.o] Error 1
      
      Fix this by making the xics_on_xive() definition conditional on the
      same symbol (CONFIG_KVM_BOOK3S_64_HANDLER) that determines whether we
      include <asm/xive.h> or not, since that's the header that defines
      xive_enabled().
      
      Fixes: 03f95332 ("KVM: PPC: Book3S: Allow XICS emulation to work in nested hosts using XIVE")
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e74d53e3
  6. 21 2月, 2019 1 次提交
    • P
      KVM: PPC: Book3S HV: Simplify machine check handling · 884dfb72
      Paul Mackerras 提交于
      This makes the handling of machine check interrupts that occur inside
      a guest simpler and more robust, with less done in assembler code and
      in real mode.
      
      Now, when a machine check occurs inside a guest, we always get the
      machine check event struct and put a copy in the vcpu struct for the
      vcpu where the machine check occurred.  We no longer call
      machine_check_queue_event() from kvmppc_realmode_mc_power7(), because
      on POWER8, when a vcpu is running on an offline secondary thread and
      we call machine_check_queue_event(), that calls irq_work_queue(),
      which doesn't work because the CPU is offline, but instead triggers
      the WARN_ON(lazy_irq_pending()) in pnv_smp_cpu_kill_self() (which
      fires again and again because nothing clears the condition).
      
      All that machine_check_queue_event() actually does is to cause the
      event to be printed to the console.  For a machine check occurring in
      the guest, we now print the event in kvmppc_handle_exit_hv()
      instead.
      
      The assembly code at label machine_check_realmode now just calls C
      code and then continues exiting the guest.  We no longer either
      synthesize a machine check for the guest in assembly code or return
      to the guest without a machine check.
      
      The code in kvmppc_handle_exit_hv() is extended to handle the case
      where the guest is not FWNMI-capable.  In that case we now always
      synthesize a machine check interrupt for the guest.  Previously, if
      the host thinks it has recovered the machine check fully, it would
      return to the guest without any notification that the machine check
      had occurred.  If the machine check was caused by some action of the
      guest (such as creating duplicate SLB entries), it is much better to
      tell the guest that it has caused a problem.  Therefore we now always
      generate a machine check interrupt for guests that are not
      FWNMI-capable.
      Reviewed-by: NAravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      884dfb72
  7. 19 2月, 2019 1 次提交
    • P
      KVM: PPC: Book3S: Allow XICS emulation to work in nested hosts using XIVE · 03f95332
      Paul Mackerras 提交于
      Currently, the KVM code assumes that if the host kernel is using the
      XIVE interrupt controller (the new interrupt controller that first
      appeared in POWER9 systems), then the in-kernel XICS emulation will
      use the XIVE hardware to deliver interrupts to the guest.  However,
      this only works when the host is running in hypervisor mode and has
      full access to all of the XIVE functionality.  It doesn't work in any
      nested virtualization scenario, either with PR KVM or nested-HV KVM,
      because the XICS-on-XIVE code calls directly into the native-XIVE
      routines, which are not initialized and cannot function correctly
      because they use OPAL calls, and OPAL is not available in a guest.
      
      This means that using the in-kernel XICS emulation in a nested
      hypervisor that is using XIVE as its interrupt controller will cause a
      (nested) host kernel crash.  To fix this, we change most of the places
      where the current code calls xive_enabled() to select between the
      XICS-on-XIVE emulation and the plain XICS emulation to call a new
      function, xics_on_xive(), which returns false in a guest.
      
      However, there is a further twist.  The plain XICS emulation has some
      functions which are used in real mode and access the underlying XICS
      controller (the interrupt controller of the host) directly.  In the
      case of a nested hypervisor, this means doing XICS hypercalls
      directly.  When the nested host is using XIVE as its interrupt
      controller, these hypercalls will fail.  Therefore this also adds
      checks in the places where the XICS emulation wants to access the
      underlying interrupt controller directly, and if that is XIVE, makes
      the code use the virtual mode fallback paths, which call generic
      kernel infrastructure rather than doing direct XICS access.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      03f95332
  8. 17 12月, 2018 2 次提交
  9. 09 10月, 2018 5 次提交
    • P
      KVM: PPC: Book3S HV: Add a VM capability to enable nested virtualization · aa069a99
      Paul Mackerras 提交于
      With this, userspace can enable a KVM-HV guest to run nested guests
      under it.
      
      The administrator can control whether any nested guests can be run;
      setting the "nested" module parameter to false prevents any guests
      becoming nested hypervisors (that is, any attempt to enable the nested
      capability on a guest will fail).  Guests which are already nested
      hypervisors will continue to be so.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      aa069a99
    • P
      KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests · 95a6432c
      Paul Mackerras 提交于
      This creates an alternative guest entry/exit path which is used for
      radix guests on POWER9 systems when we have indep_threads_mode=Y.  In
      these circumstances there is exactly one vcpu per vcore and there is
      no coordination required between vcpus or vcores; the vcpu can enter
      the guest without needing to synchronize with anything else.
      
      The new fast path is implemented almost entirely in C in book3s_hv.c
      and runs with the MMU on until the guest is entered.  On guest exit
      we use the existing path until the point where we are committed to
      exiting the guest (as distinct from handling an interrupt in the
      low-level code and returning to the guest) and we have pulled the
      guest context from the XIVE.  At that point we check a flag in the
      stack frame to see whether we came in via the old path and the new
      path; if we came in via the new path then we go back to C code to do
      the rest of the process of saving the guest context and restoring the
      host context.
      
      The C code is split into separate functions for handling the
      OS-accessible state and the hypervisor state, with the idea that the
      latter can be replaced by a hypercall when we implement nested
      virtualization.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      [mpe: Fix CONFIG_ALTIVEC=n build]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      95a6432c
    • P
      KVM: PPC: Book3S HV: Move interrupt delivery on guest entry to C code · f7035ce9
      Paul Mackerras 提交于
      This is based on a patch by Suraj Jitindar Singh.
      
      This moves the code in book3s_hv_rmhandlers.S that generates an
      external, decrementer or privileged doorbell interrupt just before
      entering the guest to C code in book3s_hv_builtin.c.  This is to
      make future maintenance and modification easier.  The algorithm
      expressed in the C code is almost identical to the previous
      algorithm.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      f7035ce9
    • A
      KVM: PPC: Remove redundand permission bits removal · a3ac077b
      Alexey Kardashevskiy 提交于
      The kvmppc_gpa_to_ua() helper itself takes care of the permission
      bits in the TCE and yet every single caller removes them.
      
      This changes semantics of kvmppc_gpa_to_ua() so it takes TCEs
      (which are GPAs + TCE permission bits) to make the callers simpler.
      
      This should cause no behavioural change.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      a3ac077b
    • A
      KVM: PPC: Validate TCEs against preregistered memory page sizes · 42de7b9e
      Alexey Kardashevskiy 提交于
      The userspace can request an arbitrary supported page size for a DMA
      window and this works fine as long as the mapped memory is backed with
      the pages of the same or bigger size; if this is not the case,
      mm_iommu_ua_to_hpa{_rm}() fail and tables do not populated with
      dangerously incorrect TCEs.
      
      However since it is quite easy to misconfigure the KVM and we do not do
      reverts to all changes made to TCE tables if an error happens in a middle,
      we better do the acceptable page size validation before we even touch
      the tables.
      
      This enhances kvmppc_tce_validate() to check the hardware IOMMU page sizes
      against the preregistered memory page sizes.
      
      Since the new check uses real/virtual mode helpers, this renames
      kvmppc_tce_validate() to kvmppc_rm_tce_validate() to handle the real mode
      case and mirrors it for the virtual mode under the old name. The real
      mode handler is not used for the virtual mode as:
      1. it uses _lockless() list traversing primitives instead of RCU;
      2. realmode's mm_iommu_ua_to_hpa_rm() uses vmalloc_to_phys() which
      virtual mode does not have to use and since on POWER9+radix only virtual
      mode handlers actually work, we do not want to slow down that path even
      a bit.
      
      This removes EXPORT_SYMBOL_GPL(kvmppc_tce_validate) as the validators
      are static now.
      
      From now on the attempts on mapping IOMMU pages bigger than allowed
      will result in KVM exit.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      [mpe: Fix KVM_HV=n build]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      42de7b9e
  10. 22 5月, 2018 3 次提交
  11. 30 3月, 2018 1 次提交
  12. 19 3月, 2018 1 次提交
  13. 09 2月, 2018 1 次提交
  14. 19 1月, 2018 3 次提交
  15. 23 11月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Fix migration and HPT resizing of HPT guests on radix hosts · ded13fc1
      Paul Mackerras 提交于
      This fixes two errors that prevent a guest using the HPT MMU from
      successfully migrating to a POWER9 host in radix MMU mode, or resizing
      its HPT when running on a radix host.
      
      The first bug was that commit 8dc6cca5 ("KVM: PPC: Book3S HV:
      Don't rely on host's page size information", 2017-09-11) missed two
      uses of hpte_base_page_size(), one in the HPT rehashing code and
      one in kvm_htab_write() (which is used on the destination side in
      migrating a HPT guest).  Instead we use kvmppc_hpte_base_page_shift().
      Having the shift count means that we can use left and right shifts
      instead of multiplication and division in a few places.
      
      Along the way, this adds a check in kvm_htab_write() to ensure that the
      page size encoding in the incoming HPTEs is recognized, and if not
      return an EINVAL error to userspace.
      
      The second bug was that kvm_htab_write was performing some but not all
      of the functions of kvmhv_setup_mmu(), resulting in the destination VM
      being left in radix mode as far as the hardware is concerned.  The
      simplest fix for now is make kvm_htab_write() call
      kvmppc_setup_partition_table() like kvmppc_hv_setup_htab_rma() does.
      In future it would be better to refactor the code more extensively
      to remove the duplication.
      
      Fixes: 8dc6cca5 ("KVM: PPC: Book3S HV: Don't rely on host's page size information")
      Fixes: 7a84084c ("KVM: PPC: Book3S HV: Set partition table rather than SDR1 on POWER9")
      Reported-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Tested-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      ded13fc1
  16. 01 11月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Add infrastructure for running HPT guests on radix host · 18c3640c
      Paul Mackerras 提交于
      This sets up the machinery for switching a guest between HPT (hashed
      page table) and radix MMU modes, so that in future we can run a HPT
      guest on a radix host on POWER9 machines.
      
      * The KVM_PPC_CONFIGURE_V3_MMU ioctl can now specify either HPT or
        radix mode, on a radix host.
      
      * The KVM_CAP_PPC_MMU_HASH_V3 capability now returns 1 on POWER9
        with HV KVM on a radix host.
      
      * The KVM_PPC_GET_SMMU_INFO returns information about the HPT MMU on a
        radix host.
      
      * The KVM_PPC_ALLOCATE_HTAB ioctl on a radix host will switch the
        guest to HPT mode and allocate a HPT.
      
      * For simplicity, we now allocate the rmap array for each memslot,
        even on a radix host, since it will be needed if the guest switches
        to HPT mode.
      
      * Since we cannot yet run a HPT guest on a radix host, the KVM_RUN
        ioctl will return an EINVAL error in that case.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      18c3640c
  17. 19 6月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Allow userspace to set the desired SMT mode · 3c313524
      Paul Mackerras 提交于
      This allows userspace to set the desired virtual SMT (simultaneous
      multithreading) mode for a VM, that is, the number of VCPUs that
      get assigned to each virtual core.  Previously, the virtual SMT mode
      was fixed to the number of threads per subcore, and if userspace
      wanted to have fewer vcpus per vcore, then it would achieve that by
      using a sparse CPU numbering.  This had the disadvantage that the
      vcpu numbers can get quite large, particularly for SMT1 guests on
      a POWER8 with 8 threads per core.  With this patch, userspace can
      set its desired virtual SMT mode and then use contiguous vcpu
      numbering.
      
      On POWER8, where the threading mode is "strict", the virtual SMT mode
      must be less than or equal to the number of threads per subcore.  On
      POWER9, which implements a "loose" threading mode, the virtual SMT
      mode can be any power of 2 between 1 and 8, even though there is
      effectively one thread per subcore, since the threads are independent
      and can all be in different partitions.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      3c313524
  18. 27 4月, 2017 1 次提交
  19. 20 4月, 2017 5 次提交
    • A
      KVM: PPC: VFIO: Add in-kernel acceleration for VFIO · 121f80ba
      Alexey Kardashevskiy 提交于
      This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
      and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
      without passing them to user space which saves time on switching
      to user space and back.
      
      This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
      KVM tries to handle a TCE request in the real mode, if failed
      it passes the request to the virtual mode to complete the operation.
      If it a virtual mode handler fails, the request is passed to
      the user space; this is not expected to happen though.
      
      To avoid dealing with page use counters (which is tricky in real mode),
      this only accelerates SPAPR TCE IOMMU v2 clients which are required
      to pre-register the userspace memory. The very first TCE request will
      be handled in the VFIO SPAPR TCE driver anyway as the userspace view
      of the TCE table (iommu_table::it_userspace) is not allocated till
      the very first mapping happens and we cannot call vmalloc in real mode.
      
      If we fail to update a hardware IOMMU table unexpected reason, we just
      clear it and move on as there is nothing really we can do about it -
      for example, if we hot plug a VFIO device to a guest, existing TCE tables
      will be mirrored automatically to the hardware and there is no interface
      to report to the guest about possible failures.
      
      This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
      the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
      and associates a physical IOMMU table with the SPAPR TCE table (which
      is a guest view of the hardware IOMMU table). The iommu_table object
      is cached and referenced so we do not have to look up for it in real mode.
      
      This does not implement the UNSET counterpart as there is no use for it -
      once the acceleration is enabled, the existing userspace won't
      disable it unless a VFIO container is destroyed; this adds necessary
      cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
      
      This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
      space.
      
      This adds real mode version of WARN_ON_ONCE() as the generic version
      causes problems with rcu_sched. Since we testing what vmalloc_to_phys()
      returns in the code, this also adds a check for already existing
      vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
      
      This finally makes use of vfio_external_user_iommu_id() which was
      introduced quite some time ago and was considered for removal.
      
      Tests show that this patch increases transmission speed from 220MB/s
      to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: NAlex Williamson <alex.williamson@redhat.com>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      121f80ba
    • A
      KVM: PPC: iommu: Unify TCE checking · b1af23d8
      Alexey Kardashevskiy 提交于
      This reworks helpers for checking TCE update parameters in way they
      can be used in KVM.
      
      This should cause no behavioral change.
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      b1af23d8
    • A
      KVM: PPC: Pass kvm* to kvmppc_find_table() · 503bfcbe
      Alexey Kardashevskiy 提交于
      The guest view TCE tables are per KVM anyway (not per VCPU) so pass kvm*
      there. This will be used in the following patches where we will be
      attaching VFIO containers to LIOBNs via ioctl() to KVM (rather than
      to VCPU).
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      503bfcbe
    • B
      KVM: PPC: Book3S: Add MMIO emulation for FP and VSX instructions · 6f63e81b
      Bin Lu 提交于
      This patch provides the MMIO load/store emulation for instructions
      of 'double & vector unsigned char & vector signed char & vector
      unsigned short & vector signed short & vector unsigned int & vector
      signed int & vector double '.
      
      The instructions that this adds emulation for are:
      
      - ldx, ldux, lwax,
      - lfs, lfsx, lfsu, lfsux, lfd, lfdx, lfdu, lfdux,
      - stfs, stfsx, stfsu, stfsux, stfd, stfdx, stfdu, stfdux, stfiwx,
      - lxsdx, lxsspx, lxsiwax, lxsiwzx, lxvd2x, lxvw4x, lxvdsx,
      - stxsdx, stxsspx, stxsiwx, stxvd2x, stxvw4x
      
      [paulus@ozlabs.org - some cleanups, fixes and rework, make it
       compile for Book E, fix build when PR KVM is built in]
      Signed-off-by: NBin Lu <lblulb@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      6f63e81b
    • P
      KVM: PPC: Provide functions for queueing up FP/VEC/VSX unavailable interrupts · 307d9279
      Paul Mackerras 提交于
      This provides functions that can be used for generating interrupts
      indicating that a given functional unit (floating point, vector, or
      VSX) is unavailable.  These functions will be used in instruction
      emulation code.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      307d9279
  20. 10 4月, 2017 1 次提交