1. 15 Oct 2019 (1 commit)
    • KVM: PPC: Book3S HV: XIVE: Ensure VP isn't already in use · 12ade69c
      By Greg Kurz
      Connecting a vCPU to a XIVE KVM device means establishing a 1:1
      association between a vCPU id and the offset (VP id) of a VP
      structure within a fixed size block of VPs. We currently try to
      enforce the 1:1 relationship by checking that a vCPU with the
      same id isn't already connected. This is good but unfortunately
      not enough because we don't map VP ids to raw vCPU ids but to
      packed vCPU ids, and the packing function kvmppc_pack_vcpu_id()
      isn't bijective by design. We got away with it because QEMU passes
      vCPU ids that fit well in the packing pattern. But nothing prevents
      userspace from coming up with a forged vCPU id that results in a packed
      id collision, which causes the KVM device to associate two vCPUs with the
      same VP. This greatly confuses the irq layer and ultimately crashes
      the kernel, as shown below.
      
      Example: a guest with 1 guest thread per core, a core stride of
      8 and 300 vCPUs has vCPU ids 0,8,16...2392. If QEMU is patched to
      inject at some point an invalid vCPU id 348, which is the packed
      version of itself and 2392, we get:
      
      genirq: Flags mismatch irq 199. 00010000 (kvm-2-2392) vs. 00010000 (kvm-2-348)
      CPU: 24 PID: 88176 Comm: qemu-system-ppc Not tainted 5.3.0-xive-nr-servers-5.3-gku+ #38
      Call Trace:
      [c000003f7f9937e0] [c000000000c0110c] dump_stack+0xb0/0xf4 (unreliable)
      [c000003f7f993820] [c0000000001cb480] __setup_irq+0xa70/0xad0
      [c000003f7f9938d0] [c0000000001cb75c] request_threaded_irq+0x13c/0x260
      [c000003f7f993940] [c00800000d44e7ac] kvmppc_xive_attach_escalation+0x104/0x270 [kvm]
      [c000003f7f9939d0] [c00800000d45013c] kvmppc_xive_connect_vcpu+0x424/0x620 [kvm]
      [c000003f7f993ac0] [c00800000d444428] kvm_arch_vcpu_ioctl+0x260/0x448 [kvm]
      [c000003f7f993b90] [c00800000d43593c] kvm_vcpu_ioctl+0x154/0x7c8 [kvm]
      [c000003f7f993d00] [c0000000004840f0] do_vfs_ioctl+0xe0/0xc30
      [c000003f7f993db0] [c000000000484d44] ksys_ioctl+0x104/0x120
      [c000003f7f993e00] [c000000000484d88] sys_ioctl+0x28/0x80
      [c000003f7f993e20] [c00000000000b278] system_call+0x5c/0x68
      xive-kvm: Failed to request escalation interrupt for queue 0 of VCPU 2392
      ------------[ cut here ]------------
      remove_proc_entry: removing non-empty directory 'irq/199', leaking at least 'kvm-2-348'
      WARNING: CPU: 24 PID: 88176 at /home/greg/Work/linux/kernel-kvm-ppc/fs/proc/generic.c:684 remove_proc_entry+0x1ec/0x200
      Modules linked in: kvm_hv kvm dm_mod vhost_net vhost tap xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter squashfs loop fuse i2c_dev sg ofpart ocxl powernv_flash at24 xts mtd uio_pdrv_genirq vmx_crypto opal_prd ipmi_powernv uio ipmi_devintf ipmi_msghandler ibmpowernv ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables ext4 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq libcrc32c raid1 raid0 linear sd_mod ast i2c_algo_bit drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci libata tg3 drm_panel_orientation_quirks [last unloaded: kvm]
      CPU: 24 PID: 88176 Comm: qemu-system-ppc Not tainted 5.3.0-xive-nr-servers-5.3-gku+ #38
      NIP:  c00000000053b0cc LR: c00000000053b0c8 CTR: c0000000000ba3b0
      REGS: c000003f7f9934b0 TRAP: 0700   Not tainted  (5.3.0-xive-nr-servers-5.3-gku+)
      MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 48228222  XER: 20040000
      CFAR: c000000000131a50 IRQMASK: 0
      GPR00: c00000000053b0c8 c000003f7f993740 c0000000015ec500 0000000000000057
      GPR04: 0000000000000001 0000000000000000 000049fb98484262 0000000000001bcf
      GPR08: 0000000000000007 0000000000000007 0000000000000001 9000000000001033
      GPR12: 0000000000008000 c000003ffffeb800 0000000000000000 000000012f4ce5a1
      GPR16: 000000012ef5a0c8 0000000000000000 000000012f113bb0 0000000000000000
      GPR20: 000000012f45d918 c000003f863758b0 c000003f86375870 0000000000000006
      GPR24: c000003f86375a30 0000000000000007 c0002039373d9020 c0000000014c4a48
      GPR28: 0000000000000001 c000003fe62a4f6b c00020394b2e9fab c000003fe62a4ec0
      NIP [c00000000053b0cc] remove_proc_entry+0x1ec/0x200
      LR [c00000000053b0c8] remove_proc_entry+0x1e8/0x200
      Call Trace:
      [c000003f7f993740] [c00000000053b0c8] remove_proc_entry+0x1e8/0x200 (unreliable)
      [c000003f7f9937e0] [c0000000001d3654] unregister_irq_proc+0x114/0x150
      [c000003f7f993880] [c0000000001c6284] free_desc+0x54/0xb0
      [c000003f7f9938c0] [c0000000001c65ec] irq_free_descs+0xac/0x100
      [c000003f7f993910] [c0000000001d1ff8] irq_dispose_mapping+0x68/0x80
      [c000003f7f993940] [c00800000d44e8a4] kvmppc_xive_attach_escalation+0x1fc/0x270 [kvm]
      [c000003f7f9939d0] [c00800000d45013c] kvmppc_xive_connect_vcpu+0x424/0x620 [kvm]
      [c000003f7f993ac0] [c00800000d444428] kvm_arch_vcpu_ioctl+0x260/0x448 [kvm]
      [c000003f7f993b90] [c00800000d43593c] kvm_vcpu_ioctl+0x154/0x7c8 [kvm]
      [c000003f7f993d00] [c0000000004840f0] do_vfs_ioctl+0xe0/0xc30
      [c000003f7f993db0] [c000000000484d44] ksys_ioctl+0x104/0x120
      [c000003f7f993e00] [c000000000484d88] sys_ioctl+0x28/0x80
      [c000003f7f993e20] [c00000000000b278] system_call+0x5c/0x68
      Instruction dump:
      2c230000 41820008 3923ff78 e8e900a0 3c82ff69 3c62ff8d 7fa6eb78 7fc5f378
      3884f080 3863b948 4bbf6925 60000000 <0fe00000> 4bffff7c fba10088 4bbf6e41
      ---[ end trace b925b67a74a1d8d1 ]---
      BUG: Kernel NULL pointer dereference at 0x00000010
      Faulting instruction address: 0xc00800000d44fc04
      Oops: Kernel access of bad area, sig: 11 [#1]
      LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
      Modules linked in: kvm_hv kvm dm_mod vhost_net vhost tap xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter squashfs loop fuse i2c_dev sg ofpart ocxl powernv_flash at24 xts mtd uio_pdrv_genirq vmx_crypto opal_prd ipmi_powernv uio ipmi_devintf ipmi_msghandler ibmpowernv ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables ext4 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq libcrc32c raid1 raid0 linear sd_mod ast i2c_algo_bit drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci libata tg3 drm_panel_orientation_quirks [last unloaded: kvm]
      CPU: 24 PID: 88176 Comm: qemu-system-ppc Tainted: G        W         5.3.0-xive-nr-servers-5.3-gku+ #38
      NIP:  c00800000d44fc04 LR: c00800000d44fc00 CTR: c0000000001cd970
      REGS: c000003f7f9938e0 TRAP: 0300   Tainted: G        W          (5.3.0-xive-nr-servers-5.3-gku+)
      MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24228882  XER: 20040000
      CFAR: c0000000001cd9ac DAR: 0000000000000010 DSISR: 40000000 IRQMASK: 0
      GPR00: c00800000d44fc00 c000003f7f993b70 c00800000d468300 0000000000000000
      GPR04: 00000000000000c7 0000000000000000 0000000000000000 c000003ffacd06d8
      GPR08: 0000000000000000 c000003ffacd0738 0000000000000000 fffffffffffffffd
      GPR12: 0000000000000040 c000003ffffeb800 0000000000000000 000000012f4ce5a1
      GPR16: 000000012ef5a0c8 0000000000000000 000000012f113bb0 0000000000000000
      GPR20: 000000012f45d918 00007ffffe0d9a80 000000012f4f5df0 000000012ef8c9f8
      GPR24: 0000000000000001 0000000000000000 c000003fe4501ed0 c000003f8b1d0000
      GPR28: c0000033314689c0 c000003fe4501c00 c000003fe4501e70 c000003fe4501e90
      NIP [c00800000d44fc04] kvmppc_xive_cleanup_vcpu+0xfc/0x210 [kvm]
      LR [c00800000d44fc00] kvmppc_xive_cleanup_vcpu+0xf8/0x210 [kvm]
      Call Trace:
      [c000003f7f993b70] [c00800000d44fc00] kvmppc_xive_cleanup_vcpu+0xf8/0x210 [kvm] (unreliable)
      [c000003f7f993bd0] [c00800000d450bd4] kvmppc_xive_release+0xdc/0x1b0 [kvm]
      [c000003f7f993c30] [c00800000d436a98] kvm_device_release+0xb0/0x110 [kvm]
      [c000003f7f993c70] [c00000000046730c] __fput+0xec/0x320
      [c000003f7f993cd0] [c000000000164ae0] task_work_run+0x150/0x1c0
      [c000003f7f993d30] [c000000000025034] do_notify_resume+0x304/0x440
      [c000003f7f993e20] [c00000000000dcc4] ret_from_except_lite+0x70/0x74
      Instruction dump:
      3bff0008 7fbfd040 419e0054 847e0004 2fa30000 419effec e93d0000 8929203c
      2f890000 419effb8 4800821d e8410018 <e9230010> e9490008 9b2a0039 7c0004ac
      ---[ end trace b925b67a74a1d8d2 ]---
      
      Kernel panic - not syncing: Fatal exception
      
      Both the XIVE and XICS-on-XIVE devices have been affected since their introduction.
      
      Check the VP id instead of the vCPU id when a new vCPU is connected.
      The allocation of the XIVE CPU structure in kvmppc_xive_connect_vcpu()
      is moved after the check to avoid the need for rollback.
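
      A minimal sketch of such a check (helper and field names assumed here,
      not necessarily the verbatim patch):

        /* Sketch: reject a connection whose packed VP id is already taken. */
        static bool xive_vp_in_use(struct kvm *kvm, u32 vp_id)
        {
                struct kvm_vcpu *vcpu;
                int i;

                kvm_for_each_vcpu(i, vcpu, kvm) {
                        struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;

                        if (xc && vp_id == xc->vp_id)
                                return true;
                }
                return false;
        }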
      
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  2. 09 Oct 2019 (1 commit)
    • powerpc/kvm: Fix kvmppc_vcore->in_guest value in kvmhv_switch_to_host · 7fe4e117
      By Jordan Niethe
      kvmhv_switch_to_host() in arch/powerpc/kvm/book3s_hv_rmhandlers.S
      needs to set kvmppc_vcore->in_guest to 0 to signal secondary CPUs to
      continue. This happens after resetting the PCR. Before commit
      13c7bb3c ("powerpc/64s: Set reserved PCR bits"), r0 would always
      be 0 before it was stored to kvmppc_vcore->in_guest. However, because
      of this change in that commit:
      
                /* Reset PCR */
                ld      r0, VCORE_PCR(r5)
        -       cmpdi   r0, 0
        +       LOAD_REG_IMMEDIATE(r6, PCR_MASK)
        +       cmpld   r0, r6
                beq     18f
        -       li      r0, 0
        -       mtspr   SPRN_PCR, r0
        +       mtspr   SPRN_PCR, r6
         18:
                /* Signal secondary CPUs to continue */
                stb     r0,VCORE_IN_GUEST(r5)
      
      We are no longer comparing r0 against 0 and loading it with 0 if it
      contains something else. Hence when we store r0 to
      kvmppc_vcore->in_guest, it might not be 0. This means that secondary
      CPUs will not be signalled to continue. Those CPUs get stuck and
      errors like the following are logged:
      
          KVM: CPU 1 seems to be stuck
          KVM: CPU 2 seems to be stuck
          KVM: CPU 3 seems to be stuck
          KVM: CPU 4 seems to be stuck
          KVM: CPU 5 seems to be stuck
          KVM: CPU 6 seems to be stuck
          KVM: CPU 7 seems to be stuck
      
      This can be reproduced with:
          $ for i in `seq 1 7` ; do chcpu -d $i ; done ;
          $ taskset -c 0 qemu-system-ppc64 -smp 8,threads=8 \
             -M pseries,accel=kvm,kvm-type=HV -m 1G -nographic -vga none \
             -kernel vmlinux -initrd initrd.cpio.xz
      
      Fix by making sure r0 is 0 before storing it to
      kvmppc_vcore->in_guest.
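
      In the diff notation used above, the fix presumably amounts to
      reloading r0 with 0 before the store (a sketch, not the verbatim
      patch):

                /* Signal secondary CPUs to continue */
        +       li      r0, 0
                stb     r0,VCORE_IN_GUEST(r5)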
      
      Fixes: 13c7bb3c ("powerpc/64s: Set reserved PCR bits")
      Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
      Reviewed-by: Alistair Popple <alistair@popple.id.au>
      Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20191004025317.19340-1-jniethe5@gmail.com
  3. 01 Oct 2019 (1 commit)
  4. 24 Sep 2019 (3 commits)
    • powerpc/mm: Fixup tlbie vs mtpidr/mtlpidr ordering issue on POWER9 · 047e6575
      By Aneesh Kumar K.V
      On POWER9, under some circumstances, a broadcast TLB invalidation will
      fail to invalidate the ERAT cache on some threads when there are
      parallel mtpidr/mtlpidr happening on other threads of the same core.
      This can cause stores to continue to go to a page after it's unmapped.
      
      The workaround is to force an ERAT flush using a PID=0 or LPID=0
      tlbie. This additional flush invalidates the ERAT cache. Since we are
      using PID=0 or LPID=0, we don't get filtered out by the TLB snoop
      filtering logic.
      
      We still need to follow this up with another tlbie to take care of
      the store vs. tlbie ordering issue explained in commit
      a5d4b589 ("powerpc/mm: Fixup tlbie vs store ordering issue on
      POWER9"). The presence of the ERAT cache implies we can still get new
      stores, and they may miss the store queue marking flush.
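
      A sketch of what the fixup helper can look like (the feature flags are
      the ones named in this series; the __tlbie_* helpers are internals of
      the radix TLB code, so treat the details as illustrative):

        static inline void fixup_tlbie_pid(unsigned long pid)
        {
                /* Any address works for the ordering tlbie; pick one that
                 * is probably unused. */
                unsigned long va = ((1UL << 52) - 1);

                if (cpu_has_feature(CPU_FTR_P9_TLBIE_ERAT_BUG)) {
                        /* PID=0 flush to force the ERAT invalidation */
                        asm volatile("ptesync" : : : "memory");
                        __tlbie_pid(0, RIC_FLUSH_TLB);
                }

                if (cpu_has_feature(CPU_FTR_P9_TLBIE_STQ_BUG)) {
                        /* Follow-up tlbie for the store ordering issue */
                        asm volatile("ptesync" : : : "memory");
                        __tlbie_va(va, pid, mmu_get_ap(MMU_PAGE_64K),
                                   RIC_FLUSH_TLB);
                }
        }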
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190924035254.24612-3-aneesh.kumar@linux.ibm.com
    • powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flag · 09ce98ca
      By Aneesh Kumar K.V
      Rename the #define to indicate that it is related to the store vs.
      tlbie ordering issue. The next patch will add another feature
      flag that is used to handle the ERAT flush vs. tlbie ordering issue.
      
      Fixes: a5d4b589 ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190924035254.24612-2-aneesh.kumar@linux.ibm.com
    • KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag · 3a83f677
      By Michael Roth
      On a 2-socket Power9 system with 32 cores/128 threads (SMT4) and 1TB
      of memory running the following guest configs:
      
        guest A:
          - 224GB of memory
          - 56 VCPUs (sockets=1,cores=28,threads=2), where:
            VCPUs 0-1 are pinned to CPUs 0-3,
            VCPUs 2-3 are pinned to CPUs 4-7,
            ...
            VCPUs 54-55 are pinned to CPUs 108-111
      
        guest B:
          - 4GB of memory
          - 4 VCPUs (sockets=1,cores=4,threads=1)
      
      with the following workloads (with KSM and THP enabled in all):
      
        guest A:
          stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M
      
        guest B:
          stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M
      
        host:
          stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M
      
      the below soft-lockup traces were observed after an hour or so and
      persisted until the host was reset (this was found to be reliably
      reproducible for this configuration, for kernels 4.15, 4.18, 5.0,
      and 5.3-rc5):
      
        [ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU
        [ 1253.183319] rcu:     124-....: (5250 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=1941
        [ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU 52/KVM:19709]
        [ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! [worker:19913]
        [ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [worker:20331]
        [ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! [worker:20338]
        [ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! [avocado:19525]
        [ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
        [ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU
        [ 1316.198032] rcu:     124-....: (21003 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=8243
        [ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
        [ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU
        [ 1379.212629] rcu:     124-....: (36756 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=14714
        [ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
        [ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU
        [ 1442.227115] rcu:     124-....: (52509 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=21403
        [ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds.
        [ 1455.111822]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds.
        [ 1455.111905]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds.
        [ 1455.111986]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds.
        [ 1455.112068]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds.
        [ 1455.112159]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds.
        [ 1455.112231]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112282] INFO: task worker:20335 blocked for more than 120 seconds.
        [ 1455.112303]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.112332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112372] INFO: task worker:20336 blocked for more than 120 seconds.
        [ 1455.112392]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
      
      CPUs 45, 24, and 124 are stuck on spin locks, likely held by
      CPUs 105 and 31.
      
      CPUs 105 and 31 are stuck in smp_call_function_many(), waiting on
      target CPU 42. For instance:
      
        # CPU 105 registers (via xmon)
        R00 = c00000000020b20c   R16 = 00007d1bcd800000
        R01 = c00000363eaa7970   R17 = 0000000000000001
        R02 = c0000000019b3a00   R18 = 000000000000006b
        R03 = 000000000000002a   R19 = 00007d537d7aecf0
        R04 = 000000000000002a   R20 = 60000000000000e0
        R05 = 000000000000002a   R21 = 0801000000000080
        R06 = c0002073fb0caa08   R22 = 0000000000000d60
        R07 = c0000000019ddd78   R23 = 0000000000000001
        R08 = 000000000000002a   R24 = c00000000147a700
        R09 = 0000000000000001   R25 = c0002073fb0ca908
        R10 = c000008ffeb4e660   R26 = 0000000000000000
        R11 = c0002073fb0ca900   R27 = c0000000019e2464
        R12 = c000000000050790   R28 = c0000000000812b0
        R13 = c000207fff623e00   R29 = c0002073fb0ca808
        R14 = 00007d1bbee00000   R30 = c0002073fb0ca800
        R15 = 00007d1bcd600000   R31 = 0000000000000800
        pc  = c00000000020b260 smp_call_function_many+0x3d0/0x460
        cfar= c00000000020b270 smp_call_function_many+0x3e0/0x460
        lr  = c00000000020b20c smp_call_function_many+0x37c/0x460
        msr = 900000010288b033   cr  = 44024824
        ctr = c000000000050790   xer = 0000000000000000   trap =  100
      
      CPU 42 is running normally, doing VCPU work:
      
        # CPU 42 stack trace (via xmon)
        [link register   ] c00800001be17188 kvmppc_book3s_radix_page_fault+0x90/0x2b0 [kvm_hv]
        [c000008ed3343820] c000008ed3343850 (unreliable)
        [c000008ed33438d0] c00800001be11b6c kvmppc_book3s_hv_page_fault+0x264/0xe30 [kvm_hv]
        [c000008ed33439d0] c00800001be0d7b4 kvmppc_vcpu_run_hv+0x8dc/0xb50 [kvm_hv]
        [c000008ed3343ae0] c00800001c10891c kvmppc_vcpu_run+0x34/0x48 [kvm]
        [c000008ed3343b00] c00800001c10475c kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm]
        [c000008ed3343b90] c00800001c0f5a78 kvm_vcpu_ioctl+0x470/0x7c8 [kvm]
        [c000008ed3343d00] c000000000475450 do_vfs_ioctl+0xe0/0xc70
        [c000008ed3343db0] c0000000004760e4 ksys_ioctl+0x104/0x120
        [c000008ed3343e00] c000000000476128 sys_ioctl+0x28/0x80
        [c000008ed3343e20] c00000000000b388 system_call+0x5c/0x70
        --- Exception: c00 (System Call) at 00007d545cfd7694
        SP (7d53ff7edf50) is in userspace
      
      It was subsequently found that ipi_message[PPC_MSG_CALL_FUNCTION]
      was set for CPU 42 by at least 1 of the CPUs waiting in
      smp_call_function_many(), but somehow the corresponding
      call_single_queue entries were never processed by CPU 42, causing the
      callers to spin in csd_lock_wait() indefinitely.
      
      Nick Piggin suggested something similar to the following sequence as
      a possible explanation (interleaving of CALL_FUNCTION/RESCHEDULE
      IPI messages seems to be most common, but any mix of CALL_FUNCTION and
      !CALL_FUNCTION messages could trigger it):
      
          CPU
            X: smp_muxed_ipi_set_message():
            X:   smp_mb()
            X:   message[RESCHEDULE] = 1
            X: doorbell_global_ipi(42):
            X:   kvmppc_set_host_ipi(42, 1)
            X:   ppc_msgsnd_sync()/smp_mb()
            X:   ppc_msgsnd() -> 42
           42: doorbell_exception(): // from CPU X
           42:   ppc_msgsync()
          105: smp_muxed_ipi_set_message():
          105:   smp_mb()
               // STORE DEFERRED DUE TO RE-ORDERING
        --105:   message[CALL_FUNCTION] = 1
        | 105: doorbell_global_ipi(42):
        | 105:   kvmppc_set_host_ipi(42, 1)
        |  42:   kvmppc_set_host_ipi(42, 0)
        |  42: smp_ipi_demux_relaxed()
        |  42: // returns to executing guest
        |      // RE-ORDERED STORE COMPLETES
        ->105:   message[CALL_FUNCTION] = 1
          105:   ppc_msgsnd_sync()/smp_mb()
          105:   ppc_msgsnd() -> 42
           42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
          105: // hangs waiting on 42 to process messages/call_single_queue
      
      This can be prevented with an smp_mb() at the beginning of
      kvmppc_set_host_ipi(), such that stores to message[<type>] (or other
      state indicated by the host_ipi flag) are ordered vs. the store to
      host_ipi.
      
      However, doing so might still allow for the following scenario (not
      yet observed):
      
          CPU
            X: smp_muxed_ipi_set_message():
            X:   smp_mb()
            X:   message[RESCHEDULE] = 1
            X: doorbell_global_ipi(42):
            X:   kvmppc_set_host_ipi(42, 1)
            X:   ppc_msgsnd_sync()/smp_mb()
            X:   ppc_msgsnd() -> 42
           42: doorbell_exception(): // from CPU X
           42:   ppc_msgsync()
               // STORE DEFERRED DUE TO RE-ORDERING
        -- 42:   kvmppc_set_host_ipi(42, 0)
        |  42: smp_ipi_demux_relaxed()
        | 105: smp_muxed_ipi_set_message():
        | 105:   smp_mb()
        | 105:   message[CALL_FUNCTION] = 1
        | 105: doorbell_global_ipi(42):
        | 105:   kvmppc_set_host_ipi(42, 1)
        |      // RE-ORDERED STORE COMPLETES
        -> 42:   kvmppc_set_host_ipi(42, 0)
           42: // returns to executing guest
          105:   ppc_msgsnd_sync()/smp_mb()
          105:   ppc_msgsnd() -> 42
           42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
          105: // hangs waiting on 42 to process messages/call_single_queue
      
      Fixing this scenario would require an smp_mb() *after* clearing
      host_ipi flag in kvmppc_set_host_ipi() to order the store vs.
      subsequent processing of IPI messages.
      
      To handle both cases, this patch splits kvmppc_set_host_ipi() into
      separate set/clear functions, where we execute smp_mb() prior to
      setting host_ipi flag, and after clearing host_ipi flag. These
      functions pair with each other to synchronize the sender and receiver
      sides.
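
      A minimal sketch of the split (the paca field is the one visible in
      the traces above; details may differ from the actual patch):

        static inline void kvmppc_set_host_ipi(int cpu)
        {
                /* Order previous stores (e.g. to the IPI message[] slots)
                 * before the store that raises host_ipi. */
                smp_mb();
                paca_ptrs[cpu]->kvm_hstate.host_ipi = 1;
        }

        static inline void kvmppc_clear_host_ipi(int cpu)
        {
                paca_ptrs[cpu]->kvm_hstate.host_ipi = 0;
                /* Order the clearing of host_ipi before any subsequent
                 * processing of IPI messages. */
                smp_mb();
        }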
      
      With that change in place the above workload ran for 20 hours without
      triggering any lock-ups.
      
      Fixes: 755563bc ("powerpc/powernv: Fixes for hypervisor doorbell handling") # v4.0
      Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
      Acked-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190911223155.16045-1-mdroth@linux.vnet.ibm.com
  5. 21 Sep 2019 (1 commit)
    • powerpc/64s: Set reserved PCR bits · 13c7bb3c
      By Jordan Niethe
      Currently the reserved bits of the Processor Compatibility
      Register (PCR) are cleared, as per the Programming Note in Section
      1.3.3 of version 3.0B of the Power ISA. Because a PCR bit must be
      set to disable a given feature, clearing the reserved bits makes all
      new architecture features available when running on newer processors
      that define new PCR bits.
      
      For example to disable new features added as part of Version 2.07 of
      the ISA the corresponding bit in the PCR needs to be set.
      
      As new processor features generally require explicit kernel support
      they should be disabled until such support is implemented. Therefore
      kernels should set all unknown/reserved bits in the PCR such that any
      new architecture features which the kernel does not currently know
      about get disabled.
      
      An update is planned to the ISA to clarify that the PCR is an
      exception to the Programming Note on reserved bits in Section 1.3.3.
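
      A sketch of the idea behind the PCR_MASK value seen in the diff quoted
      earlier in this log (bit names illustrative, not exhaustive):

        /* Known architected PCR bits (illustrative subset) */
        #define PCR_HIGH_BITS   (PCR_ARCH_205 | PCR_ARCH_206 | PCR_ARCH_207)
        /* Reset value: all unknown/reserved bits set, known bits clear,
         * so features the kernel does not know about stay disabled. */
        #define PCR_MASK        ~(PCR_HIGH_BITS)

                mtspr(SPRN_PCR, PCR_MASK);      /* rather than mtspr(SPRN_PCR, 0) */
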
      Signed-off-by: Alistair Popple <alistair@popple.id.au>
      Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
      Tested-by: Joel Stanley <joel@jms.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190917004605.22471-2-alistair@popple.id.au
  6. 05 Sep 2019 (3 commits)
  7. 30 Aug 2019 (3 commits)
  8. 27 Aug 2019 (4 commits)
    • KVM: PPC: Book3S HV: Don't lose pending doorbell request on migration on P9 · ff42df49
      By Paul Mackerras
      On POWER9, when userspace reads the value of the DPDES register on a
      vCPU, it is possible for 0 to be returned although there is a doorbell
      interrupt pending for the vCPU.  This can lead to a doorbell interrupt
      being lost across migration.  If the guest kernel uses doorbell
      interrupts for IPIs, then it could malfunction because of the lost
      interrupt.
      
      This happens because a newly-generated doorbell interrupt is signalled
      by setting vcpu->arch.doorbell_request to 1; the DPDES value in
      vcpu->arch.vcore->dpdes is not updated, because it can only be updated
      when holding the vcpu mutex, in order to avoid races.
      
      To fix this, we OR in vcpu->arch.doorbell_request when reading the
      DPDES value.
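
      A sketch of the read side, in the one-reg get path (accessor shape as
      used elsewhere in the HV code):

        case KVM_REG_PPC_DPDES:
                /* Fold in a doorbell signalled via doorbell_request but
                 * not yet reflected in vcore->dpdes, so userspace (e.g.
                 * migration) does not lose it. */
                *val = get_reg_val(id, vcpu->arch.vcore->dpdes |
                                       vcpu->arch.doorbell_request);
                break;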
      
      Cc: stable@vger.kernel.org # v4.13+
      Fixes: 57900694 ("KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
    • KVM: PPC: Book3S HV: Check for MMU ready on piggybacked virtual cores · d28eafc5
      By Paul Mackerras
      When we are running multiple vcores on the same physical core, they
      could be from different VMs and so it is possible that one of the
      VMs could have its arch.mmu_ready flag cleared (for example by a
      concurrent HPT resize) when we go to run it on a physical core.
      We currently check the arch.mmu_ready flag for the primary vcore
      but not the flags for the other vcores that will be run alongside
      it.  This adds that check, and also a check when we select the
      secondary vcores from the preempted vcores list.
      
      Cc: stable@vger.kernel.org # v4.14+
      Fixes: 38c53af8 ("KVM: PPC: Book3S HV: Fix exclusion between HPT resizing and other HPT updates")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S: Enable XIVE native capability only if OPAL has required functions · 2ad7a27d
      By Paul Mackerras
      There are some POWER9 machines where the OPAL firmware does not support
      the OPAL_XIVE_GET_QUEUE_STATE and OPAL_XIVE_SET_QUEUE_STATE calls.
      The impact of this is that a guest using XIVE natively will not be able
      to be migrated successfully.  On the source side, the get_attr operation
      on the KVM native device for the KVM_DEV_XIVE_GRP_EQ_CONFIG attribute
      will fail; on the destination side, the set_attr operation for the same
      attribute will fail.
      
      This adds tests for the existence of the OPAL get/set queue state
      functions, and if they are not supported, the XIVE-native KVM device
      is not created and the KVM_CAP_PPC_IRQ_XIVE capability returns false.
      Userspace can then either provide a software emulation of XIVE, or
      else tell the guest that it does not have a XIVE controller available
      to it.
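
      A sketch of the probe (opal_check_token() is the standard way to test
      for the presence of an OPAL call):

        bool xive_native_has_queue_state_support(void)
        {
                /* Both calls are needed for queue state save/restore */
                return opal_check_token(OPAL_XIVE_GET_QUEUE_STATE) &&
                       opal_check_token(OPAL_XIVE_SET_QUEUE_STATE);
        }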
      
      Cc: stable@vger.kernel.org # v5.2+
      Fixes: 3fab2d10 ("KVM: PPC: Book3S HV: XIVE: Activate XIVE exploitation mode")
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S: Fix incorrect guest-to-user-translation error handling · ddfd151f
      By Alexey Kardashevskiy
      H_PUT_TCE_INDIRECT handlers receive a page with up to 512 TCEs from
      a guest. Although we verify correctness of TCEs before we do anything
      with the existing tables, there is a small window when a check in
      kvmppc_tce_validate might pass and right after that the guest alters
      the page of TCEs, causing an early exit from the handler and leaving
      srcu_read_lock(&vcpu->kvm->srcu) (virtual mode) or lock_rmap(rmap)
      (real mode) locked.
      
      This fixes the bug by jumping to the common exit code with an appropriate
      unlock.
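
      The pattern, sketched for the virtual-mode handler (the real-mode
      handler does the same with lock_rmap()/unlock_rmap()):

        idx = srcu_read_lock(&vcpu->kvm->srcu);
        for (i = 0; i < npages; ++i) {
                /* Re-validate: the guest may have altered the TCE page */
                ret = kvmppc_tce_validate(stt, tce);
                if (ret != H_SUCCESS)
                        goto unlock_exit;       /* not a bare return */
                /* ... apply the TCE ... */
        }
        unlock_exit:
                srcu_read_unlock(&vcpu->kvm->srcu, idx);
                return ret;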
      
      Cc: stable@vger.kernel.org # v4.11+
      Fixes: 121f80ba ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  9. 23 Aug 2019 (2 commits)
    • KVM: PPC: Book3S HV: Define usage types for rmap array in guest memslot · d22deab6
      By Suraj Jitindar Singh
      The rmap array in the guest memslot is an array with one entry per
      guest page, allocated at memslot creation time. Each rmap entry in this array
      is used to store information about the guest page to which it
      corresponds. For example for a hpt guest it is used to store a lock bit,
      rc bits, a present bit and the index of a hpt entry in the guest hpt
      which maps this page. For a radix guest which is running nested guests
      it is used to store a pointer to a linked list of nested rmap entries
      which store the nested guest physical address which maps this guest
      address and for which there is a pte in the shadow page table.
      
      As there are currently two uses for the rmap array, and the potential
      for this to expand to more in the future, define a type field (being the
      top 8 bits of the rmap entry) to be used to define the type of the rmap
      entry which is currently present and define two values for this field
      for the two current uses of the rmap array.
      
      Since the nested case uses the rmap entry to store a pointer, define
      this type as having the two high bits set as is expected for a pointer.
      Define the hpt entry type as having bit 56 set (bit 7 IBM bit ordering).
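
      Based on that description, the encodings presumably look like:

        /* Top 8 bits of an rmap entry give its type */
        #define KVMPPC_RMAP_TYPE_MASK   0xff00000000000000UL
        #define KVMPPC_RMAP_NESTED      0xc000000000000000UL  /* two high bits: pointer */
        #define KVMPPC_RMAP_HPT         0x0100000000000000UL  /* bit 56 (IBM bit 7) */
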
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S: Mark expected switch fall-through · ff7240cc
      By Paul Menzel
      Fix the error below triggered by `-Wimplicit-fallthrough`, by tagging
      it as an expected fall-through.
      
          arch/powerpc/kvm/book3s_32_mmu.c: In function ‘kvmppc_mmu_book3s_32_xlate_pte’:
          arch/powerpc/kvm/book3s_32_mmu.c:241:21: error: this statement may fall through [-Werror=implicit-fallthrough=]
                pte->may_write = true;
                ~~~~~~~~~~~~~~~^~~~~~
          arch/powerpc/kvm/book3s_32_mmu.c:242:5: note: here
               case 3:
               ^~~~
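
      The fix presumably just annotates the intentional fall-through at that
      point, e.g.:

                        pte->may_write = true;
                        /* fall through */
                case 3:
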
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  10. 22 Aug 2019 (1 commit)
  11. 16 Aug 2019 (4 commits)
    • powerpc/xive: Implement get_irqchip_state method for XIVE to fix shutdown race · da15c03b
      By Paul Mackerras
      Testing has revealed the existence of a race condition where a XIVE
      interrupt being shut down can be in one of the XIVE interrupt queues
      (of which there are up to 8 per CPU, one for each priority) at the
      point where free_irq() is called.  If this happens, the fetch from
      the queue can return an interrupt number which has been shut down.
      This can lead to various
      symptoms:
      
      - irq_to_desc(irq) can be NULL.  In this case, no end-of-interrupt
        function gets called, so the CPU's elevated interrupt priority
        (numerically lowered CPPR) never gets reset.  That then
        means that the CPU stops processing interrupts, causing device
        timeouts and other errors in various device drivers.
      
      - The irq descriptor or related data structures can be in the process
        of being freed as the interrupt code is using them.  This typically
        leads to crashes due to bad pointer dereferences.
      
      This race is basically what commit 62e04686 ("genirq: Add optional
      hardware synchronization for shutdown", 2019-06-28) is intended to
      fix, given a get_irqchip_state() method for the interrupt controller
      being used.  It works by polling the interrupt controller when an
      interrupt is being freed until the controller says it is not pending.
      
      With XIVE, the PQ bits of the interrupt source indicate the state of
      the interrupt source, and in particular the P bit goes from 0 to 1 at
      the point where the hardware writes an entry into the interrupt queue
      that this interrupt is directed towards.  Normally, the code will then
      process the interrupt and do an end-of-interrupt (EOI) operation which
      will reset PQ to 00 (assuming another interrupt hasn't been generated
      in the meantime).  However, there are situations where the code resets
      P even though a queue entry exists (for example, by setting PQ to 01,
      which disables the interrupt source), and also situations where the
      code leaves P at 1 after removing the queue entry (for example, this
      is done for escalation interrupts so they cannot fire again until
      they are explicitly re-enabled).
      
      The code already has a 'saved_p' flag for the interrupt source which
      indicates that a queue entry exists, although it isn't maintained
      consistently.  This patch adds a 'stale_p' flag to indicate that
      P has been left at 1 after processing a queue entry, and adds code
      to set and clear saved_p and stale_p as necessary to maintain a
      consistent indication of whether a queue entry may or may not exist.
      
      With this, we can implement xive_get_irqchip_state() by looking at
      stale_p, saved_p and the ESB PQ bits for the interrupt.
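
      A sketch of that method (ESB accessor names follow the existing XIVE
      driver; treat the details as illustrative):

        static int xive_get_irqchip_state(struct irq_data *data,
                                          enum irqchip_irq_state which,
                                          bool *state)
        {
                struct xive_irq_data *xd = irq_data_get_irq_handler_data(data);

                switch (which) {
                case IRQCHIP_STATE_ACTIVE:
                        /* Pending if a queue entry is known to exist
                         * (saved_p), or if P is set and not merely stale. */
                        *state = !xd->stale_p &&
                                 (xd->saved_p ||
                                  !!(xive_esb_read(xd, XIVE_ESB_GET) &
                                     XIVE_ESB_VAL_P));
                        return 0;
                default:
                        return -EINVAL;
                }
        }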
      
      There is some additional code to handle escalation interrupts
      properly, because they are enabled and disabled in KVM assembly code,
      which does not have access to the xive_irq_data struct for the
      escalation interrupt.  Hence, stale_p may be incorrect when the
      escalation interrupt is freed in kvmppc_xive_{,native_}cleanup_vcpu().
      Fortunately, we can fix it up by looking at vcpu->arch.xive_esc_on,
      with some careful attention to barriers in order to ensure the correct
      result if xive_esc_irq() races with kvmppc_xive_cleanup_vcpu().
      
      Finally, this adds code to make noise on the console (pr_crit and
      WARN_ON(1)) if we find an interrupt queue entry for an interrupt
      which does not have a descriptor.  While this won't catch the race
      reliably, if it does get triggered it will be an indication that
      the race is occurring and needs to be debugged.
      
      Fixes: 243e2511 ("powerpc/xive: Native exploitation of the XIVE interrupt controller")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190813100648.GE9567@blackberry
    • KVM: PPC: Book3S HV: Don't push XIVE context when not using XIVE device · 8d4ba9c9
      By Paul Mackerras
      At present, when running a guest on POWER9 using HV KVM but not using
      an in-kernel interrupt controller (XICS or XIVE), for example if QEMU
      is run with the kernel_irqchip=off option, the guest entry code goes
      ahead and tries to load the guest context into the XIVE hardware, even
      though no context has been set up.
      
      To fix this, we check that the "CAM word" is non-zero before pushing
      it to the hardware.  The CAM word is initialized to a non-zero value
      in kvmppc_xive_connect_vcpu() and kvmppc_xive_native_connect_vcpu(),
      and is now cleared in kvmppc_xive_{,native_}cleanup_vcpu.
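
      Conceptually, the guard is (a C rendering of a check that actually
      lives in assembly; the push helper is a placeholder name):

        /* Only push a XIVE context that has actually been set up */
        if (vcpu->arch.xive_cam_word)           /* zero => no in-kernel XIVE */
                kvmppc_xive_push_vcpu(vcpu);    /* placeholder name */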
      
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Cc: stable@vger.kernel.org # v4.12+
      Reported-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190813100100.GC9567@blackberry
    • KVM: PPC: Book3S HV: Fix race in re-enabling XIVE escalation interrupts · 959c5d51
      By Paul Mackerras
      Escalation interrupts are interrupts sent to the host by the XIVE
      hardware when it has an interrupt to deliver to a guest VCPU but that
      VCPU is not running anywhere in the system.  Hence we disable the
      escalation interrupt for the VCPU being run when we enter the guest
      and re-enable it when the guest does an H_CEDE hypercall indicating
      it is idle.
      
      It is possible that an escalation interrupt gets generated just as we
      are entering the guest.  In that case the escalation interrupt may be
      using a queue entry in one of the interrupt queues, and that queue
      entry may not have been processed when the guest exits with an H_CEDE.
      The existing entry code detects this situation and does not clear the
      vcpu->arch.xive_esc_on flag as an indication that there is a pending
      queue entry (if the queue entry gets processed, xive_esc_irq() will
      clear the flag).  There is a comment in the code saying that if the
      flag is still set on H_CEDE, we have to abort the cede rather than
      re-enabling the escalation interrupt, lest we end up with two
      occurrences of the escalation interrupt in the interrupt queue.
      
      However, the exit code doesn't do that; it aborts the cede in the sense
      that vcpu->arch.ceded gets cleared, but it still enables the escalation
      interrupt by setting the source's PQ bits to 00.  Instead we need to
      set the PQ bits to 10, indicating that an interrupt has been triggered.
      We also need to avoid setting vcpu->arch.xive_esc_on in this case
      (i.e. vcpu->arch.xive_esc_on seen to be set on H_CEDE) because
      xive_esc_irq() will run at some point and clear it, and if we race with
      that we may end up with an incorrect result (i.e. xive_esc_on set when
      the escalation interrupt has just been handled).
      
      It is extremely unlikely that having two queue entries would cause
      observable problems; theoretically it could cause queue overflow, but
      the CPU would have to have thousands of interrupts targeted at it for
      that to be possible.  However, this fix will also make it possible to
      determine accurately whether there is an unhandled escalation
      interrupt in the queue, which will be needed by the following patch.
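
      In C terms, the aborted-cede path presumably changes from re-enabling
      the source (PQ = 00) to leaving it triggered (PQ = 10); the ESB offset
      macro is from the XIVE driver, and the real change is in assembly:

        if (vcpu->arch.xive_esc_on) {
                /* Leave PQ = 10 so a second queue entry cannot be
                 * generated for the escalation interrupt. */
                __raw_readq((void __iomem *)vcpu->arch.xive_esc_vaddr +
                            XIVE_ESB_SET_PQ_10);
        }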
      
      Fixes: 9b9b13a6 ("KVM: PPC: Book3S HV: Keep XIVE escalation interrupt masked unless ceded")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190813100349.GD9567@blackberry
    • KVM: PPC: Book3S HV: XIVE: Free escalation interrupts before disabling the VP · 237aed48
      By Cédric Le Goater
      When a vCPU is brought down, the XIVE VP (Virtual Processor) is first
      disabled and then the event notification queues are freed. When freeing
      the queues, we check for possible escalation interrupts and free them
      also.
      
      But when a XIVE VP is disabled, the underlying XIVE ENDs are also
      disabled in OPAL. When an END (Event Notification Descriptor) is
      disabled, its ESB pages (ESn and ESe) are disabled and loads return all
      1s, which means that any access to the ESB page of the escalation
      interrupt will return invalid values.
      
      When an interrupt is freed, the shutdown handler computes a 'saved_p'
      field from the value returned by a load in xive_do_source_set_mask().
      This value is incorrect for escalation interrupts for the reason
      described above.
      
      This has no impact on Linux/KVM today because we don't make use of it,
      but future changes will introduce a xive_get_irqchip_state()
      handler. This handler will use the 'saved_p' field to return the state
      of an interrupt, and with 'saved_p' incorrect, soft lockups will occur.
      
      Fix the vCPU cleanup sequence by first freeing the escalation
      interrupts if any, then disabling the XIVE VP, and finally freeing the queues.
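
      Sketched ordering (the xive_native_* calls exist in the driver; the
      escalation-free helper here is a hypothetical name):

        /* kvmppc_xive_cleanup_vcpu(), reordered */
        for (i = 0; i < KVMPPC_XIVE_Q_COUNT; i++)
                xive_free_escalation(vcpu, i);     /* hypothetical: free_irq() etc. */
        xive_native_disable_vp(xc->vp_id);         /* ESB pages go invalid after this */
        for (i = 0; i < KVMPPC_XIVE_Q_COUNT; i++)
                xive_native_disable_queue(xc->vp_id, &xc->queues[i], i);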
      
      Fixes: 90c73795 ("KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation mode")
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190806172538.5087-1-clg@kaod.org
  12. 05 Aug 2019 (2 commits)
    • KVM: remove kvm_arch_has_vcpu_debugfs() · 741cbbae
      By Paolo Bonzini
      There is no need for this function as all arches have to implement
      kvm_arch_create_vcpu_debugfs() no matter what.  A #define symbol
      lets us actually simplify the code.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Fix leak vCPU's VMCS value into other pCPU · 17e433b5
      By Wanpeng Li
      After commit d73eb57b ("KVM: Boost vCPUs that are delivering interrupts"), a
      five-year-old bug is exposed. Running the ebizzy benchmark in three 80-vCPU VMs
      on one 80-pCPU Skylake server produces a lot of rcu_sched stall warnings
      in the VMs after stress testing:
      
       INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
       Call Trace:
         flush_tlb_mm_range+0x68/0x140
         tlb_flush_mmu.part.75+0x37/0xe0
         tlb_finish_mmu+0x55/0x60
         zap_page_range+0x142/0x190
         SyS_madvise+0x3cd/0x9c0
         system_call_fastpath+0x1c/0x21
      
      swait_active() remains true before finish_swait() is called in
      kvm_vcpu_block().  Since voluntarily preempted vCPUs are taken into
      account by the kvm_vcpu_on_spin() loop, the probability that the
      condition kvm_arch_vcpu_runnable(vcpu) is checked and found true
      greatly increases.  When APICv is enabled, the yield-candidate vCPU's
      VMCS RVI field then leaks (via vmx_sync_pir_to_irr()) into the current
      VMCS of the vCPU that is spinning on a taken lock.
      
      This patch fixes it by conservatively checking only a subset of events.
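
      A sketch of the conservative check (helper names assumed; an arch hook
      lets each architecture define the subset that is safe to check):

        static bool kvm_vcpu_dy_runnable(struct kvm_vcpu *vcpu)
        {
                /* Only events that can be checked without touching state
                 * that may be loaded on another pCPU (such as the VMCS). */
                if (kvm_arch_dy_runnable(vcpu))
                        return true;
        #ifdef CONFIG_KVM_ASYNC_PF
                if (!list_empty_careful(&vcpu->async_pf.done))
                        return true;
        #endif
                return false;
        }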
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Marc Zyngier <Marc.Zyngier@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  13. 29 Jul 2019 (1 commit)
  14. 19 Jul 2019 (1 commit)
  15. 17 Jul 2019 (1 commit)
  16. 15 Jul 2019 (2 commits)
    • KVM: PPC: Book3S HV: Save and restore guest visible PSSCR bits on pseries · c8b4083d
      By Suraj Jitindar Singh
      The Performance Stop Status and Control Register (PSSCR) is used to
      control the power saving facilities of the processor. This register
      has various fields, some of which can be modified only in hypervisor
      state, and others which can be modified in both hypervisor and
      privileged non-hypervisor state. The bits which can be modified in
      privileged non-hypervisor state are referred to as guest visible.
      
      Currently the L0 hypervisor saves and restores both its own host
      value as well as the guest value of the PSSCR when context switching
      between the hypervisor and guest. However a nested hypervisor running
      its own nested guests (as indicated by kvmhv_on_pseries()) doesn't
      context switch the PSSCR register. That means if a nested (L2) guest
      modifies the PSSCR then the L1 guest hypervisor will run with that
      modified value, and if the L1 guest hypervisor modifies the PSSCR and
      then goes to run the nested (L2) guest again then the L2 PSSCR value
      will be lost.
      
      Fix this by having the (L1) nested hypervisor save and restore both
      its host and the guest PSSCR value when entering and exiting a
      nested (L2) guest. Note that only the guest visible parts of the PSSCR
      are context switched, since this is all the L1 nested hypervisor can
      access. This is fine, however, as these are the only fields the L0
      hypervisor provides guest control of anyway, so all other fields
      are ignored.
      
      This could also have been implemented by adding the PSSCR register to
      the hv_regs passed to the L0 hypervisor as input to the H_ENTER_NESTED
      hcall, however this would have meant updating the structure layout and
      thus required modifications to both the L0 and L1 kernels. Whereas the
      approach used doesn't require L0 kernel modifications while achieving
      the same result.
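
      A sketch of the added context switch in the nested entry path
      (SPRN_PSSCR_PR is the privileged, guest-visible view of the register):

        /* kvmhv_on_pseries(): we are an L1 hypervisor entering L2 */
        host_psscr = mfspr(SPRN_PSSCR_PR);        /* save L1's guest-visible bits */
        mtspr(SPRN_PSSCR_PR, vcpu->arch.psscr);   /* load L2's value */

        /* ... enter the L2 guest via H_ENTER_NESTED ... */

        vcpu->arch.psscr = mfspr(SPRN_PSSCR_PR);  /* save L2's value back */
        mtspr(SPRN_PSSCR_PR, host_psscr);         /* restore L1's value */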
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Cc: stable@vger.kernel.org # v4.20+
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190703012022.15644-3-sjitindarsingh@gmail.com
    • KVM: PPC: Book3S HV: Always save guest pmu for guest capable of nesting · 63279eeb
      By Suraj Jitindar Singh
      The performance monitoring unit (PMU) registers are saved on guest
      exit when the guest has set the pmcregs_in_use flag in its lppaca, if
      it exists, or unconditionally if it doesn't. If a nested guest is
      being run then the hypervisor doesn't, and in most cases can't, know
      if the PMU registers are in use since it doesn't know the location of
      the lppaca for the nested guest, although it may have one for its
      immediate guest. This results in the values of these registers being
      lost across nested guest entry and exit in the case where the nested
      guest was making use of the performance monitoring facility while its
      nested guest hypervisor wasn't.
      
      Furthermore, the hypervisor could interrupt a guest hypervisor between
      when it has loaded up the PMU registers and when it calls H_ENTER_NESTED,
      or between returning from the nested guest to the guest hypervisor and
      the guest hypervisor reading the PMU registers, in
      kvmhv_p9_guest_entry(). This means that it isn't sufficient to just
      save the PMU registers when entering or exiting a nested guest, but
      that it is necessary to always save the PMU registers whenever a guest
      is capable of running nested guests to ensure the register values
      aren't lost in the context switch.
      
      Ensure the PMU register values are preserved by always saving their
      value into the vcpu struct when a guest is capable of running nested
      guests.
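
      A sketch of the save decision on guest exit (helper names assumed):

        /* lp is the guest's lppaca, if we know where it is */
        save_pmu = lp ? lp->pmcregs_in_use : 1;
        /* A guest able to run nested guests may have PMU state we cannot
         * see; always save in that case. */
        save_pmu |= nesting_enabled(vcpu->kvm);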
      
      This should have minimal performance impact; however, any impact can be
      avoided by booting a guest with "-machine pseries,cap-nested-hv=false"
      on the qemu commandline.
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Cc: stable@vger.kernel.org # v4.20+
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190703012022.15644-1-sjitindarsingh@gmail.com
  17. 13 Jul 2019 (1 commit)
  18. 04 Jul 2019 (2 commits)
  19. 03 Jul 2019 (4 commits)
  20. 20 Jun 2019 (2 commits)
    • KVM: PPC: Book3S HV: Clear pending decrementer exceptions on nested guest entry · 3c25ab35
      By Suraj Jitindar Singh
      If we enter an L1 guest with a pending decrementer exception then this
      is cleared on guest exit if the guest has written a positive value
      into the decrementer (indicating that it handled the decrementer
      exception) since there is no other way to detect that the guest has
      handled the pending exception and that it should be dequeued. In the
      event that the L1 guest tries to run a nested (L2) guest immediately
      after this and the L2 guest decrementer is negative (which is loaded
      by L1 before making the H_ENTER_NESTED hcall), then the pending
      decrementer exception isn't cleared and the L2 entry is blocked since
      L1 has a pending exception, even though L1 may have already handled
      the exception and written a positive value for its decrementer. This
      results in a loop of L1 trying to enter the L2 guest and L0 blocking
      the entry since L1 has an interrupt pending with the outcome being
      that L2 never gets to run and hangs.
      
      Fix this by clearing any pending decrementer exceptions when L1 makes
      the H_ENTER_NESTED hcall, since it won't do this if its decrementer
      has gone negative. In any case, L1's decrementer has been communicated
      to L0 in the hdec_expires field, and L0 will return control to L1 when
      this goes negative by delivering an H_DECREMENTER exception.
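
      A sketch of the dequeue at nested entry (kvmppc_core_dequeue_dec() is
      the existing dequeue helper; the placement is assumed):

        /* In the H_ENTER_NESTED handler: L1 is entering L2 deliberately,
         * so any still-pending decrementer exception for L1 can be
         * dropped; L1's own deadline travels in hdec_expires. */
        kvmppc_core_dequeue_dec(vcpu);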
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Cc: stable@vger.kernel.org # v4.20+
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • KVM: PPC: Book3S HV: Signed extend decrementer value if not using large decrementer · 86953770
      By Suraj Jitindar Singh
      On POWER9 the decrementer can operate in large decrementer mode, where
      the decrementer is 56 bits wide and sign-extended to 64 bits. When not
      operating in this mode the decrementer behaves as a 32 bit decrementer
      which is NOT sign-extended (as on POWER8).
      
      Currently when reading a guest decrementer value we don't take into
      account whether the large decrementer is enabled or not, and this
      means the value will be incorrect when the guest is not using the
      large decrementer. Fix this by sign-extending the value read when the
      guest isn't using the large decrementer.
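
      A sketch of the read-side fix (LPCR_LD is the large-decrementer
      enable bit):

        dec = mfspr(SPRN_DEC);
        /* In 32-bit mode the hardware does not sign-extend the value */
        if (!(vc->lpcr & LPCR_LD))
                dec = (s32) dec;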
      
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Cc: stable@vger.kernel.org # v4.20+
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>