1. 24 January 2020 (8 commits)
  2. 18 December 2019 (1 commit)
    • KVM: PPC: Book3S HV: Don't do ultravisor calls on systems without ultravisor · d89c69f4
      Paul Mackerras authored
      Commit 22945688 ("KVM: PPC: Book3S HV: Support reset of secure
      guest") added a call to uv_svm_terminate, which is an ultravisor
      call, without any check that the guest is a secure guest or even that
      the system has an ultravisor.  On a system without an ultravisor,
      the ultracall will degenerate to a hypercall, but since we are not
      in KVM guest context, the hypercall will get treated as a system
      call, which could have random effects depending on what happens to
      be in r0, and could also corrupt the current task's kernel stack.
      Hence this adds a test for the guest being a secure guest before
      doing uv_svm_terminate().
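      
      A minimal sketch of the guard, assuming the secure-guest state is
      tracked in kvm->arch.secure_guest as described in the commit below:
      
          if (kvm->arch.secure_guest)
                  uv_svm_terminate(kvm->arch.lpid);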
      
      Fixes: 22945688 ("KVM: PPC: Book3S HV: Support reset of secure guest")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  3. 17 December 2019 (1 commit)
  4. 28 November 2019 (5 commits)
    • KVM: PPC: Book3S HV: Support reset of secure guest · 22945688
      Bharata B Rao authored
      Add support for reset of a secure guest via a new ioctl, KVM_PPC_SVM_OFF.
      This ioctl will be issued by QEMU during reset and includes the
      following steps:
      
      - Release all device pages of the secure guest.
      - Ask the UV to terminate the guest via the UV_SVM_TERMINATE ucall.
      - Unpin the VPA pages so that they can be migrated back to secure
        side when guest becomes secure again. This is required because
        pinned pages can't be migrated.
      - Reinit the partition-scoped page tables.
      
      After these steps, the guest is ready to issue the UV_ESM call once
      again to switch to secure mode.
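      
      A rough sketch of how the ioctl handler sequences these steps
      (helper names here are illustrative, not necessarily the exact
      ones in the patch):
      
          static int kvmhv_svm_off(struct kvm *kvm)
          {
                  if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
                          return 0;
      
                  /* 1. release the device pages backing secure memory */
                  kvmppc_uvmem_release_pages(kvm);
                  /* 2. ask the ultravisor to tear down the SVM */
                  uv_svm_terminate(kvm->arch.lpid);
                  /* 3. unpin the VPA pages so they can migrate again */
                  kvmppc_unpin_vpa_pages(kvm);
                  /* 4. rebuild the partition-scoped page tables */
                  kvmppc_reinit_partition_table(kvm);
      
                  kvm->arch.secure_guest = 0;
                  return 0;
          }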
      Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      	[Implementation of uv_svm_terminate() and its call from
      	guest shutdown path]
      Signed-off-by: Ram Pai <linuxram@us.ibm.com>
      	[Unpinning of VPA pages]
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Handle memory plug/unplug to secure VM · c3262257
      Bharata B Rao authored
      Register the new memslot with the UV during plug, and unregister
      the memslot during unplug. In addition, release all the
      device pages during unplug.
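      
      A sketch of what this looks like in the memslot commit path
      (uv_register_mem_slot()/uv_unregister_mem_slot() are the ucall
      wrappers from this series; the surrounding logic is illustrative):
      
          if (kvm->arch.secure_guest && change == KVM_MR_CREATE)
                  uv_register_mem_slot(kvm->arch.lpid,
                                       new->base_gfn << PAGE_SHIFT,
                                       new->npages * PAGE_SIZE,
                                       0, new->id);
      
          if (kvm->arch.secure_guest && change == KVM_MR_DELETE) {
                  /* release device pages, then drop the slot from the UV */
                  kvmppc_uvmem_drop_pages(old, kvm);
                  uv_unregister_mem_slot(kvm->arch.lpid, old->id);
          }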
      Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Radix changes for secure guest · 008e359c
      Bharata B Rao authored
      - After the guest becomes secure, when we handle a page fault for a page
        belonging to the SVM in the HV, send that page to the UV via UV_PAGE_IN.
      - Whenever a page is unmapped on the HV side, inform the UV via UV_PAGE_INVAL.
      - Ensure that all routines that walk the secondary page tables of
        the guest don't do so for a secure VM. For a secure guest, the
        active secondary page tables are in secure memory, and the secondary
        page tables in the HV are freed when the guest becomes secure.
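      
      In code terms, the fault and unmap paths gain checks along these
      lines (a sketch; the helper names appear in this series but the
      call sites are simplified):
      
          /* HV page fault: a secure guest's pages go to the UV instead
           * of being mapped into the HV's partition-scoped tables */
          if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
                  return kvmppc_send_page_to_uv(kvm, gfn);
      
          /* HV-side unmap: tell the UV the translation is gone */
          if (kvm->arch.secure_guest)
                  uv_page_inval(kvm->arch.lpid, gpa, PAGE_SHIFT);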
      Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Shared pages support for secure guests · 60f0a643
      Bharata B Rao authored
      A secure guest will share some of its pages with the hypervisor (e.g.
      virtio bounce buffers). Support sharing of pages between the hypervisor
      and the ultravisor.
      
      A shared page is reachable via both HV and UV side page tables. Once a
      secure page is converted to a shared page, the device page that
      represented the secure page is unmapped from the HV side page tables.
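      
      A sketch of the shared-page path in H_SVM_PAGE_IN, assuming an
      H_PAGE_IN_SHARED flag as in this series (error handling omitted):
      
          if (flags & H_PAGE_IN_SHARED) {
                  /* back the GFN with a normal host page ... */
                  pfn = gfn_to_pfn(kvm, gfn);
                  /* ... and ask the UV to map it as a shared page */
                  ret = uv_page_in(kvm->arch.lpid, pfn << page_shift,
                                   gpa, 0, page_shift);
                  kvm_release_pfn_clean(pfn);
          }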
      Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: Support for running secure guests · ca9f4942
      Bharata B Rao authored
      A pseries guest can be run as a secure guest on Ultravisor-enabled
      POWER platforms. On such platforms, this driver will be used to manage
      the movement of guest pages between the normal memory managed by the
      hypervisor (HV) and the secure memory managed by the Ultravisor (UV).
      
      HV is informed about the guest's transition to secure mode via hcalls:
      
      H_SVM_INIT_START: Initiate securing a VM
      H_SVM_INIT_DONE: Conclude securing a VM
      
      As part of H_SVM_INIT_START, register all existing memslots with
      the UV. The H_SVM_INIT_DONE call by the UV informs the HV that the
      transition of the guest to secure mode is complete.
      
      These two states (transition to secure mode STARTED and transition
      to secure mode COMPLETED) are recorded in kvm->arch.secure_guest.
      Setting these states will cause the assembly code that enters the
      guest to call the UV_RETURN ucall instead of trying to enter the
      guest directly.
      
      Migration of pages between the normal and secure memory of a secure
      guest is implemented in the H_SVM_PAGE_IN and H_SVM_PAGE_OUT hcalls.
      
      H_SVM_PAGE_IN: Move the content of a normal page to a secure page
      H_SVM_PAGE_OUT: Move the content of a secure page to a normal page
      
      Private ZONE_DEVICE memory equal to the amount of secure memory
      available in the platform for running secure guests is created.
      Whenever a page belonging to the guest becomes secure, a page from
      this private device memory is used to represent and track that secure
      page on the HV side. The movement of pages between normal and secure
      memory is done via migrate_vma_pages() using UV_PAGE_IN and
      UV_PAGE_OUT ucalls.
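      
      The device-private region setup is roughly the following, using the
      5.x memremap API (the pgmap and ops names are illustrative):
      
          res = request_free_mem_region(&iomem_resource, size,
                                        "kvmppc-uvmem");
          kvmppc_uvmem_pgmap.type = MEMORY_DEVICE_PRIVATE;
          kvmppc_uvmem_pgmap.res = *res;
          /* the ops supply migrate_to_ram() for faults on device pages */
          kvmppc_uvmem_pgmap.ops = &kvmppc_uvmem_ops;
          addr = memremap_pages(&kvmppc_uvmem_pgmap, NUMA_NO_NODE);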
      
      In order to prevent the device private pages (that correspond to pages
      of the secure guest) from participating in KSM merging, H_SVM_PAGE_IN
      calls ksm_madvise() under the read version of mmap_sem. However,
      ksm_madvise() needs to be called under the write lock. Hence we call
      kvmppc_svm_page_in() with mmap_sem held for writing, and it then
      downgrades to a read lock after calling ksm_madvise().
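      
      That is, the locking in kvmppc_svm_page_in() ends up roughly as:
      
          down_write(&kvm->mm->mmap_sem);
          /* ksm_madvise() must be called with mmap_sem held for write */
          ksm_madvise(vma, vma->vm_start, vma->vm_end,
                      MADV_UNMERGEABLE, &vma->vm_flags);
          downgrade_write(&kvm->mm->mmap_sem);
          /* the rest of the page-in runs under the read lock */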
      
      [paulus@ozlabs.org - roll in patch "KVM: PPC: Book3S HV: Take write
       mmap_sem when calling ksm_madvise"]
      Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  5. 21 November 2019 (2 commits)
    • KVM: PPC: Book3S HV: XIVE: Fix potential page leak on error path · 30486e72
      Greg Kurz authored
      We need to check that the host page size is big enough to accommodate
      the EQ. Let's do this before taking a reference on the EQ page to avoid
      a potential leak if the check fails.
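      
      A sketch of the reordering (the exact size check is simplified):
      
          /* validate the EQ fits in a host page *before* get_page() */
          if ((1ull << kvm_eq.qshift) > PAGE_SIZE)
                  return -EINVAL;
      
          page = gfn_to_page(kvm, gfn);
          if (is_error_page(page))
                  return -EINVAL;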
      
      Cc: stable@vger.kernel.org # v5.2
      Fixes: 13ce3297 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration")
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
    • KVM: PPC: Book3S HV: XIVE: Free previous EQ page when setting up a new one · 31a88c82
      Greg Kurz authored
      The EQ page is allocated by the guest and then passed to the hypervisor
      with the H_INT_SET_QUEUE_CONFIG hcall. A reference is taken on the page
      before handing it over to the HW. This reference is dropped either when
      the guest issues the H_INT_RESET hcall or when the KVM device is released.
      But the guest can legitimately call H_INT_SET_QUEUE_CONFIG several times,
      either to reset the EQ (vCPU hot unplug) or to set a new EQ (guest reboot).
      In both cases the existing EQ page reference is leaked because we simply
      overwrite it in the XIVE queue structure without calling put_page().
      
      This is especially visible when the guest memory is backed with huge pages:
      start a VM up to the guest userspace, either reboot it or unplug a vCPU,
      quit QEMU. The leak is observed by comparing the value of HugePages_Free in
      /proc/meminfo before and after the VM is run.
      
      Ideally we'd want the XIVE code to handle the EQ page de-allocation at the
      platform level. This isn't the case right now because the various XIVE
      drivers have different allocation needs. It might be worth introducing
      hooks for this purpose instead of exposing XIVE internals to the drivers,
      but that is certainly a bigger piece of work for later.
      
      In the meantime, for easier backport, fix both vCPU unplug and guest reboot
      leaks by introducing a wrapper around xive_native_configure_queue() that
      does the necessary cleanup.
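      
      The wrapper looks roughly like this (a sketch based on the commit
      description; the qpage bookkeeping is simplified):
      
          static int kvmppc_xive_native_configure_queue(u32 vp_id,
                          struct xive_q *q, u8 prio, __be32 *qpage,
                          u32 order, bool can_escalate)
          {
                  __be32 *qpage_prev = q->qpage;
                  int rc;
      
                  rc = xive_native_configure_queue(vp_id, q, prio, qpage,
                                                   order, can_escalate);
                  if (rc)
                          return rc;
      
                  /* drop the reference taken on the previous EQ page */
                  if (qpage_prev)
                          put_page(virt_to_page(qpage_prev));
      
                  return rc;
          }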
      Reported-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org # v5.2
      Fixes: 13ce3297 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration")
      Signed-off-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Tested-by: Lijun Pan <ljp@linux.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  6. 14 November 2019 (1 commit)
    • KVM: PPC: Book3S HV: Flush link stack on guest exit to host kernel · af2e8c68
      Michael Ellerman authored
      On some systems that are vulnerable to Spectre v2, it is up to
      software to flush the link stack (return address stack), in order to
      protect against Spectre-RSB.
      
      When exiting from a guest we do some housekeeping and then
      potentially exit to C code which is several stack frames deep in the
      host kernel. We will then execute a series of returns without
      preceding calls, opening up the possibility that the guest could have
      poisoned the link stack, directing speculative execution of the host
      to a gadget of some sort.
      
      To prevent this, we add a flush of the link stack on exit from a guest.
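      
      A software link-stack flush works by stuffing the predictor with
      benign entries. The sketch below illustrates the idea only; the
      depth and exact sequence are assumptions, not this patch's actual
      code:
      
          .rept 64                /* assumed link-stack depth */
          bl      1f              /* push a benign return address */
      1:  nop
          .endr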
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  7. 22 October 2019 (13 commits)
  8. 21 October 2019 (1 commit)
    • KVM: PPC: Report single stepping capability · 1a9167a2
      Fabiano Rosas authored
      When calling the KVM_SET_GUEST_DEBUG ioctl, userspace might request
      the next instruction to be single stepped via the
      KVM_GUESTDBG_SINGLESTEP control bit of the kvm_guest_debug structure.
      
      This patch adds the KVM_CAP_PPC_GUEST_DEBUG_SSTEP capability in order
      to inform userspace about the state of single stepping support.
      
      We currently don't have support for guest single stepping implemented
      in Book3S HV, so the capability is only present for Book3S PR and
      BookE.
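      
      A sketch of the corresponding check_extension handling (the exact
      condition is illustrative):
      
          case KVM_CAP_PPC_GUEST_DEBUG_SSTEP:
                  /* single stepping is implemented for PR and BookE only */
                  r = !hv_enabled;
                  break;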
      Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  9. 15 October 2019 (1 commit)
    • KVM: PPC: Book3S HV: XIVE: Ensure VP isn't already in use · 12ade69c
      Greg Kurz authored
      Connecting a vCPU to a XIVE KVM device means establishing a 1:1
      association between a vCPU id and the offset (VP id) of a VP
      structure within a fixed size block of VPs. We currently try to
      enforce the 1:1 relationship by checking that a vCPU with the
      same id isn't already connected. This is good but unfortunately
      not enough because we don't map VP ids to raw vCPU ids but to
      packed vCPU ids, and the packing function kvmppc_pack_vcpu_id()
      isn't bijective by design. We got away with it because QEMU passes
      vCPU ids that fit well in the packing pattern. But nothing prevents
      userspace from coming up with a forged vCPU id resulting in a packed-id
      collision, which causes the KVM device to associate two vCPUs with the
      same VP. This greatly confuses the irq layer and ultimately crashes
      the kernel, as shown below.
      
      Example: a guest with 1 guest thread per core, a core stride of
      8 and 300 vCPUs has vCPU ids 0,8,16...2392. If QEMU is patched to
      inject at some point an invalid vCPU id 348, which is the packed
      version of itself and 2392, we get:
      
      genirq: Flags mismatch irq 199. 00010000 (kvm-2-2392) vs. 00010000 (kvm-2-348)
      CPU: 24 PID: 88176 Comm: qemu-system-ppc Not tainted 5.3.0-xive-nr-servers-5.3-gku+ #38
      Call Trace:
      [c000003f7f9937e0] [c000000000c0110c] dump_stack+0xb0/0xf4 (unreliable)
      [c000003f7f993820] [c0000000001cb480] __setup_irq+0xa70/0xad0
      [c000003f7f9938d0] [c0000000001cb75c] request_threaded_irq+0x13c/0x260
      [c000003f7f993940] [c00800000d44e7ac] kvmppc_xive_attach_escalation+0x104/0x270 [kvm]
      [c000003f7f9939d0] [c00800000d45013c] kvmppc_xive_connect_vcpu+0x424/0x620 [kvm]
      [c000003f7f993ac0] [c00800000d444428] kvm_arch_vcpu_ioctl+0x260/0x448 [kvm]
      [c000003f7f993b90] [c00800000d43593c] kvm_vcpu_ioctl+0x154/0x7c8 [kvm]
      [c000003f7f993d00] [c0000000004840f0] do_vfs_ioctl+0xe0/0xc30
      [c000003f7f993db0] [c000000000484d44] ksys_ioctl+0x104/0x120
      [c000003f7f993e00] [c000000000484d88] sys_ioctl+0x28/0x80
      [c000003f7f993e20] [c00000000000b278] system_call+0x5c/0x68
      xive-kvm: Failed to request escalation interrupt for queue 0 of VCPU 2392
      ------------[ cut here ]------------
      remove_proc_entry: removing non-empty directory 'irq/199', leaking at least 'kvm-2-348'
      WARNING: CPU: 24 PID: 88176 at /home/greg/Work/linux/kernel-kvm-ppc/fs/proc/generic.c:684 remove_proc_entry+0x1ec/0x200
      Modules linked in: kvm_hv kvm dm_mod vhost_net vhost tap xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter squashfs loop fuse i2c_dev sg ofpart ocxl powernv_flash at24 xts mtd uio_pdrv_genirq vmx_crypto opal_prd ipmi_powernv uio ipmi_devintf ipmi_msghandler ibmpowernv ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables ext4 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq libcrc32c raid1 raid0 linear sd_mod ast i2c_algo_bit drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci libata tg3 drm_panel_orientation_quirks [last unloaded: kvm]
      CPU: 24 PID: 88176 Comm: qemu-system-ppc Not tainted 5.3.0-xive-nr-servers-5.3-gku+ #38
      NIP:  c00000000053b0cc LR: c00000000053b0c8 CTR: c0000000000ba3b0
      REGS: c000003f7f9934b0 TRAP: 0700   Not tainted  (5.3.0-xive-nr-servers-5.3-gku+)
      MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 48228222  XER: 20040000
      CFAR: c000000000131a50 IRQMASK: 0
      GPR00: c00000000053b0c8 c000003f7f993740 c0000000015ec500 0000000000000057
      GPR04: 0000000000000001 0000000000000000 000049fb98484262 0000000000001bcf
      GPR08: 0000000000000007 0000000000000007 0000000000000001 9000000000001033
      GPR12: 0000000000008000 c000003ffffeb800 0000000000000000 000000012f4ce5a1
      GPR16: 000000012ef5a0c8 0000000000000000 000000012f113bb0 0000000000000000
      GPR20: 000000012f45d918 c000003f863758b0 c000003f86375870 0000000000000006
      GPR24: c000003f86375a30 0000000000000007 c0002039373d9020 c0000000014c4a48
      GPR28: 0000000000000001 c000003fe62a4f6b c00020394b2e9fab c000003fe62a4ec0
      NIP [c00000000053b0cc] remove_proc_entry+0x1ec/0x200
      LR [c00000000053b0c8] remove_proc_entry+0x1e8/0x200
      Call Trace:
      [c000003f7f993740] [c00000000053b0c8] remove_proc_entry+0x1e8/0x200 (unreliable)
      [c000003f7f9937e0] [c0000000001d3654] unregister_irq_proc+0x114/0x150
      [c000003f7f993880] [c0000000001c6284] free_desc+0x54/0xb0
      [c000003f7f9938c0] [c0000000001c65ec] irq_free_descs+0xac/0x100
      [c000003f7f993910] [c0000000001d1ff8] irq_dispose_mapping+0x68/0x80
      [c000003f7f993940] [c00800000d44e8a4] kvmppc_xive_attach_escalation+0x1fc/0x270 [kvm]
      [c000003f7f9939d0] [c00800000d45013c] kvmppc_xive_connect_vcpu+0x424/0x620 [kvm]
      [c000003f7f993ac0] [c00800000d444428] kvm_arch_vcpu_ioctl+0x260/0x448 [kvm]
      [c000003f7f993b90] [c00800000d43593c] kvm_vcpu_ioctl+0x154/0x7c8 [kvm]
      [c000003f7f993d00] [c0000000004840f0] do_vfs_ioctl+0xe0/0xc30
      [c000003f7f993db0] [c000000000484d44] ksys_ioctl+0x104/0x120
      [c000003f7f993e00] [c000000000484d88] sys_ioctl+0x28/0x80
      [c000003f7f993e20] [c00000000000b278] system_call+0x5c/0x68
      Instruction dump:
      2c230000 41820008 3923ff78 e8e900a0 3c82ff69 3c62ff8d 7fa6eb78 7fc5f378
      3884f080 3863b948 4bbf6925 60000000 <0fe00000> 4bffff7c fba10088 4bbf6e41
      ---[ end trace b925b67a74a1d8d1 ]---
      BUG: Kernel NULL pointer dereference at 0x00000010
      Faulting instruction address: 0xc00800000d44fc04
      Oops: Kernel access of bad area, sig: 11 [#1]
      LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
      Modules linked in: kvm_hv kvm dm_mod vhost_net vhost tap xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter squashfs loop fuse i2c_dev sg ofpart ocxl powernv_flash at24 xts mtd uio_pdrv_genirq vmx_crypto opal_prd ipmi_powernv uio ipmi_devintf ipmi_msghandler ibmpowernv ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables ext4 mbcache jbd2 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq libcrc32c raid1 raid0 linear sd_mod ast i2c_algo_bit drm_vram_helper ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci libata tg3 drm_panel_orientation_quirks [last unloaded: kvm]
      CPU: 24 PID: 88176 Comm: qemu-system-ppc Tainted: G        W         5.3.0-xive-nr-servers-5.3-gku+ #38
      NIP:  c00800000d44fc04 LR: c00800000d44fc00 CTR: c0000000001cd970
      REGS: c000003f7f9938e0 TRAP: 0300   Tainted: G        W          (5.3.0-xive-nr-servers-5.3-gku+)
      MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24228882  XER: 20040000
      CFAR: c0000000001cd9ac DAR: 0000000000000010 DSISR: 40000000 IRQMASK: 0
      GPR00: c00800000d44fc00 c000003f7f993b70 c00800000d468300 0000000000000000
      GPR04: 00000000000000c7 0000000000000000 0000000000000000 c000003ffacd06d8
      GPR08: 0000000000000000 c000003ffacd0738 0000000000000000 fffffffffffffffd
      GPR12: 0000000000000040 c000003ffffeb800 0000000000000000 000000012f4ce5a1
      GPR16: 000000012ef5a0c8 0000000000000000 000000012f113bb0 0000000000000000
      GPR20: 000000012f45d918 00007ffffe0d9a80 000000012f4f5df0 000000012ef8c9f8
      GPR24: 0000000000000001 0000000000000000 c000003fe4501ed0 c000003f8b1d0000
      GPR28: c0000033314689c0 c000003fe4501c00 c000003fe4501e70 c000003fe4501e90
      NIP [c00800000d44fc04] kvmppc_xive_cleanup_vcpu+0xfc/0x210 [kvm]
      LR [c00800000d44fc00] kvmppc_xive_cleanup_vcpu+0xf8/0x210 [kvm]
      Call Trace:
      [c000003f7f993b70] [c00800000d44fc00] kvmppc_xive_cleanup_vcpu+0xf8/0x210 [kvm] (unreliable)
      [c000003f7f993bd0] [c00800000d450bd4] kvmppc_xive_release+0xdc/0x1b0 [kvm]
      [c000003f7f993c30] [c00800000d436a98] kvm_device_release+0xb0/0x110 [kvm]
      [c000003f7f993c70] [c00000000046730c] __fput+0xec/0x320
      [c000003f7f993cd0] [c000000000164ae0] task_work_run+0x150/0x1c0
      [c000003f7f993d30] [c000000000025034] do_notify_resume+0x304/0x440
      [c000003f7f993e20] [c00000000000dcc4] ret_from_except_lite+0x70/0x74
      Instruction dump:
      3bff0008 7fbfd040 419e0054 847e0004 2fa30000 419effec e93d0000 8929203c
      2f890000 419effb8 4800821d e8410018 <e9230010> e9490008 9b2a0039 7c0004ac
      ---[ end trace b925b67a74a1d8d2 ]---
      
      Kernel panic - not syncing: Fatal exception
      
      This has affected both XIVE and XICS-on-XIVE devices since the beginning.
      
      Check the VP id instead of the vCPU id when a new vCPU is connected.
      The allocation of the XIVE CPU structure in kvmppc_xive_connect_vcpu()
      is moved after the check to avoid the need for rollback.
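      
      A sketch of the reworked check (kvmppc_xive_vp_in_use() stands for
      the collision test; exact naming may differ):
      
          u32 vp_id = kvmppc_xive_vp(xive, cpu);  /* vp_base + packed id */
      
          if (kvmppc_xive_vp_in_use(xive->kvm, vp_id)) {
                  pr_devel("Duplicate VP id %d\n", vp_id);
                  return -EEXIST;
          }
          /* only now allocate the XIVE vCPU structure */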
      
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: Cédric Le Goater <clg@kaod.org>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  10. 09 October 2019 (1 commit)
    • powerpc/kvm: Fix kvmppc_vcore->in_guest value in kvmhv_switch_to_host · 7fe4e117
      Jordan Niethe authored
      kvmhv_switch_to_host() in arch/powerpc/kvm/book3s_hv_rmhandlers.S
      needs to set kvmppc_vcore->in_guest to 0 to signal secondary CPUs to
      continue. This happens after resetting the PCR. Before commit
      13c7bb3c ("powerpc/64s: Set reserved PCR bits"), r0 would always
      be 0 before it was stored to kvmppc_vcore->in_guest. However, because
      of this change in that commit:
      
                /* Reset PCR */
                ld      r0, VCORE_PCR(r5)
        -       cmpdi   r0, 0
        +       LOAD_REG_IMMEDIATE(r6, PCR_MASK)
        +       cmpld   r0, r6
                beq     18f
        -       li      r0, 0
        -       mtspr   SPRN_PCR, r0
        +       mtspr   SPRN_PCR, r6
         18:
                /* Signal secondary CPUs to continue */
                stb     r0,VCORE_IN_GUEST(r5)
      
      We are no longer comparing r0 against 0 and loading it with 0 if it
      contains something else. Hence when we store r0 to
      kvmppc_vcore->in_guest, it might not be 0. This means that secondary
      CPUs will not be signalled to continue. Those CPUs get stuck and
      errors like the following are logged:
      
          KVM: CPU 1 seems to be stuck
          KVM: CPU 2 seems to be stuck
          KVM: CPU 3 seems to be stuck
          KVM: CPU 4 seems to be stuck
          KVM: CPU 5 seems to be stuck
          KVM: CPU 6 seems to be stuck
          KVM: CPU 7 seems to be stuck
      
      This can be reproduced with:
          $ for i in `seq 1 7` ; do chcpu -d $i ; done ;
          $ taskset -c 0 qemu-system-ppc64 -smp 8,threads=8 \
             -M pseries,accel=kvm,kvm-type=HV -m 1G -nographic -vga none \
             -kernel vmlinux -initrd initrd.cpio.xz
      
      Fix by making sure r0 is 0 before storing it to
      kvmppc_vcore->in_guest.
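      That is, roughly:
      
          /* Signal secondary CPUs to continue */
          li      r0, 0                   /* make sure r0 really is 0 */
          stb     r0, VCORE_IN_GUEST(r5)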
      
      Fixes: 13c7bb3c ("powerpc/64s: Set reserved PCR bits")
      Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
      Reviewed-by: Alistair Popple <alistair@popple.id.au>
      Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20191004025317.19340-1-jniethe5@gmail.com
  11. 01 October 2019 (1 commit)
  12. 24 September 2019 (3 commits)
    • powerpc/mm: Fixup tlbie vs mtpidr/mtlpidr ordering issue on POWER9 · 047e6575
      Aneesh Kumar K.V authored
      On POWER9, under some circumstances, a broadcast TLB invalidation will
      fail to invalidate the ERAT cache on some threads when there are
      parallel mtpidr/mtlpidr happening on other threads of the same core.
      This can cause stores to continue to go to a page after it's unmapped.
      
      The workaround is to force an ERAT flush using a PID=0 or LPID=0 tlbie
      flush. This additional TLB flush will cause the ERAT cache
      invalidation. Since we are using PID=0 or LPID=0, we don't get
      filtered out by the TLB snoop filtering logic.
      
      We still need to follow this up with another tlbie to take care of the
      store vs tlbie ordering issue explained in commit
      a5d4b589 ("powerpc/mm: Fixup tlbie vs store ordering issue on
      POWER9"). The presence of the ERAT cache implies we can still get new
      stores, and they may miss the store queue marking flush.
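      
      The combined fixup thus looks roughly like the following (a sketch;
      the actual helper in the patch also picks a specific address for
      the second invalidation):
      
          static inline void fixup_tlbie_pid(unsigned long pid)
          {
                  /* ERAT flush: PID=0 is never snoop-filtered */
                  if (cpu_has_feature(CPU_FTR_P9_TLBIE_ERAT_BUG)) {
                          asm volatile("ptesync" : : : "memory");
                          __tlbie_pid(0, RIC_FLUSH_TLB);
                  }
      
                  /* store vs tlbie ordering, as per commit a5d4b589 */
                  if (cpu_has_feature(CPU_FTR_P9_TLBIE_STQ_BUG)) {
                          asm volatile("ptesync" : : : "memory");
                          __tlbie_pid(pid, RIC_FLUSH_TLB);
                  }
          }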
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190924035254.24612-3-aneesh.kumar@linux.ibm.com
    • powerpc/book3s64/radix: Rename CPU_FTR_P9_TLBIE_BUG feature flag · 09ce98ca
      Aneesh Kumar K.V authored
      Rename the #define to indicate that this is related to the store vs
      tlbie ordering issue. In the next patch, we will be adding another
      feature flag that is used to handle the ERAT flush vs tlbie ordering
      issue.
      
      Fixes: a5d4b589 ("powerpc/mm: Fixup tlbie vs store ordering issue on POWER9")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190924035254.24612-2-aneesh.kumar@linux.ibm.com
    • KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag · 3a83f677
      Michael Roth authored
      On a 2-socket Power9 system with 32 cores/128 threads (SMT4) and 1TB
      of memory running the following guest configs:
      
        guest A:
          - 224GB of memory
          - 56 VCPUs (sockets=1,cores=28,threads=2), where:
            VCPUs 0-1 are pinned to CPUs 0-3,
            VCPUs 2-3 are pinned to CPUs 4-7,
            ...
            VCPUs 54-55 are pinned to CPUs 108-111
      
        guest B:
          - 4GB of memory
          - 4 VCPUs (sockets=1,cores=4,threads=1)
      
      with the following workloads (with KSM and THP enabled in all):
      
        guest A:
          stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M
      
        guest B:
          stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M
      
        host:
          stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M
      
      the below soft-lockup traces were observed after an hour or so and
      persisted until the host was reset (this was found to be reliably
      reproducible for this configuration, for kernels 4.15, 4.18, 5.0,
      and 5.3-rc5):
      
        [ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU
        [ 1253.183319] rcu:     124-....: (5250 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=1941
        [ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU 52/KVM:19709]
        [ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! [worker:19913]
        [ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [worker:20331]
        [ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! [worker:20338]
        [ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! [avocado:19525]
        [ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
        [ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU
        [ 1316.198032] rcu:     124-....: (21003 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=8243
        [ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
        [ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU
        [ 1379.212629] rcu:     124-....: (36756 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=14714
        [ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791]
        [ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU
        [ 1442.227115] rcu:     124-....: (52509 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=21403
        [ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds.
        [ 1455.111822]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds.
        [ 1455.111905]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds.
        [ 1455.111986]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds.
        [ 1455.112068]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds.
        [ 1455.112159]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds.
        [ 1455.112231]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112282] INFO: task worker:20335 blocked for more than 120 seconds.
        [ 1455.112303]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
        [ 1455.112332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 1455.112372] INFO: task worker:20336 blocked for more than 120 seconds.
        [ 1455.112392]       Tainted: G             L    5.3.0-rc5-mdr-vanilla+ #1
      
      CPUs 45, 24, and 124 are stuck on spin locks, likely held by
      CPUs 105 and 31.
      
      CPUs 105 and 31 are stuck in smp_call_function_many(), waiting on
      target CPU 42. For instance:
      
        # CPU 105 registers (via xmon)
        R00 = c00000000020b20c   R16 = 00007d1bcd800000
        R01 = c00000363eaa7970   R17 = 0000000000000001
        R02 = c0000000019b3a00   R18 = 000000000000006b
        R03 = 000000000000002a   R19 = 00007d537d7aecf0
        R04 = 000000000000002a   R20 = 60000000000000e0
        R05 = 000000000000002a   R21 = 0801000000000080
        R06 = c0002073fb0caa08   R22 = 0000000000000d60
        R07 = c0000000019ddd78   R23 = 0000000000000001
        R08 = 000000000000002a   R24 = c00000000147a700
        R09 = 0000000000000001   R25 = c0002073fb0ca908
        R10 = c000008ffeb4e660   R26 = 0000000000000000
        R11 = c0002073fb0ca900   R27 = c0000000019e2464
        R12 = c000000000050790   R28 = c0000000000812b0
        R13 = c000207fff623e00   R29 = c0002073fb0ca808
        R14 = 00007d1bbee00000   R30 = c0002073fb0ca800
        R15 = 00007d1bcd600000   R31 = 0000000000000800
        pc  = c00000000020b260 smp_call_function_many+0x3d0/0x460
        cfar= c00000000020b270 smp_call_function_many+0x3e0/0x460
        lr  = c00000000020b20c smp_call_function_many+0x37c/0x460
        msr = 900000010288b033   cr  = 44024824
        ctr = c000000000050790   xer = 0000000000000000   trap =  100
      
      CPU 42 is running normally, doing VCPU work:
      
        # CPU 42 stack trace (via xmon)
        [link register   ] c00800001be17188 kvmppc_book3s_radix_page_fault+0x90/0x2b0 [kvm_hv]
        [c000008ed3343820] c000008ed3343850 (unreliable)
        [c000008ed33438d0] c00800001be11b6c kvmppc_book3s_hv_page_fault+0x264/0xe30 [kvm_hv]
        [c000008ed33439d0] c00800001be0d7b4 kvmppc_vcpu_run_hv+0x8dc/0xb50 [kvm_hv]
        [c000008ed3343ae0] c00800001c10891c kvmppc_vcpu_run+0x34/0x48 [kvm]
        [c000008ed3343b00] c00800001c10475c kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm]
        [c000008ed3343b90] c00800001c0f5a78 kvm_vcpu_ioctl+0x470/0x7c8 [kvm]
        [c000008ed3343d00] c000000000475450 do_vfs_ioctl+0xe0/0xc70
        [c000008ed3343db0] c0000000004760e4 ksys_ioctl+0x104/0x120
        [c000008ed3343e00] c000000000476128 sys_ioctl+0x28/0x80
        [c000008ed3343e20] c00000000000b388 system_call+0x5c/0x70
        --- Exception: c00 (System Call) at 00007d545cfd7694
        SP (7d53ff7edf50) is in userspace
      
      It was subsequently found that ipi_message[PPC_MSG_CALL_FUNCTION]
      was set for CPU 42 by at least 1 of the CPUs waiting in
      smp_call_function_many(), but somehow the corresponding
      call_single_queue entries were never processed by CPU 42, causing the
      callers to spin in csd_lock_wait() indefinitely.
      
      Nick Piggin suggested something similar to the following sequence as
      a possible explanation (interleaving of CALL_FUNCTION/RESCHEDULE
      IPI messages seems to be most common, but any mix of CALL_FUNCTION and
      !CALL_FUNCTION messages could trigger it):
      
          CPU
            X: smp_muxed_ipi_set_message():
            X:   smp_mb()
            X:   message[RESCHEDULE] = 1
            X: doorbell_global_ipi(42):
            X:   kvmppc_set_host_ipi(42, 1)
            X:   ppc_msgsnd_sync()/smp_mb()
            X:   ppc_msgsnd() -> 42
           42: doorbell_exception(): // from CPU X
           42:   ppc_msgsync()
          105: smp_muxed_ipi_set_message():
          105:   smp_mb()
               // STORE DEFERRED DUE TO RE-ORDERING
        --105:   message[CALL_FUNCTION] = 1
        | 105: doorbell_global_ipi(42):
        | 105:   kvmppc_set_host_ipi(42, 1)
        |  42:   kvmppc_set_host_ipi(42, 0)
        |  42: smp_ipi_demux_relaxed()
        |  42: // returns to executing guest
        |      // RE-ORDERED STORE COMPLETES
        ->105:   message[CALL_FUNCTION] = 1
          105:   ppc_msgsnd_sync()/smp_mb()
          105:   ppc_msgsnd() -> 42
           42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
          105: // hangs waiting on 42 to process messages/call_single_queue
      
      This can be prevented with an smp_mb() at the beginning of
      kvmppc_set_host_ipi(), such that stores to message[<type>] (or other
      state indicated by the host_ipi flag) are ordered vs. the store to
      host_ipi.
      
      However, doing so might still allow for the following scenario (not
      yet observed):
      
          CPU
            X: smp_muxed_ipi_set_message():
            X:   smp_mb()
            X:   message[RESCHEDULE] = 1
            X: doorbell_global_ipi(42):
            X:   kvmppc_set_host_ipi(42, 1)
            X:   ppc_msgsnd_sync()/smp_mb()
            X:   ppc_msgsnd() -> 42
           42: doorbell_exception(): // from CPU X
           42:   ppc_msgsync()
               // STORE DEFERRED DUE TO RE-ORDERING
        -- 42:   kvmppc_set_host_ipi(42, 0)
        |  42: smp_ipi_demux_relaxed()
        | 105: smp_muxed_ipi_set_message():
        | 105:   smp_mb()
        | 105:   message[CALL_FUNCTION] = 1
        | 105: doorbell_global_ipi(42):
        | 105:   kvmppc_set_host_ipi(42, 1)
        |      // RE-ORDERED STORE COMPLETES
        -> 42:   kvmppc_set_host_ipi(42, 0)
           42: // returns to executing guest
          105:   ppc_msgsnd_sync()/smp_mb()
          105:   ppc_msgsnd() -> 42
           42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored
          105: // hangs waiting on 42 to process messages/call_single_queue
      
      Fixing this scenario would require an smp_mb() *after* clearing the
      host_ipi flag in kvmppc_set_host_ipi() to order the store vs.
      subsequent processing of IPI messages.
      
      To handle both cases, this patch splits kvmppc_set_host_ipi() into
      separate set/clear functions, where we execute smp_mb() prior to
      setting the host_ipi flag, and after clearing the host_ipi flag.
      These functions pair with each other to synchronize the sender and
      receiver sides.
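      
      Roughly, the resulting pair of helpers is:
      
          static inline void kvmppc_set_host_ipi(int cpu)
          {
                  /* order prior stores to message[] before host_ipi = 1,
                   * pairing with the receiver's test of host_ipi */
                  smp_mb();
                  paca_ptrs[cpu]->kvm_hstate.host_ipi = 1;
          }
      
          static inline void kvmppc_clear_host_ipi(int cpu)
          {
                  paca_ptrs[cpu]->kvm_hstate.host_ipi = 0;
                  /* order the clear before any subsequent loads of
                   * message[], pairing with the sender's stores */
                  smp_mb();
          }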
      
      With that change in place the above workload ran for 20 hours without
      triggering any lock-ups.
      
      Fixes: 755563bc ("powerpc/powernv: Fixes for hypervisor doorbell handling") # v4.0
      Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
      Acked-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190911223155.16045-1-mdroth@linux.vnet.ibm.com
  13. 21 September 2019 (1 commit)
    • powerpc/64s: Set reserved PCR bits · 13c7bb3c
      Jordan Niethe authored
      Currently the reserved bits of the Processor Compatibility
      Register (PCR) are cleared as per the Programming Note in Section
      1.3.3 of version 3.0B of the Power ISA. Because a PCR bit must be
      set to disable a given feature, clearing the reserved bits makes
      all new architecture features available when running on newer
      processors that add new features to the PCR.
      
      For example, to disable new features added as part of Version 2.07 of
      the ISA, the corresponding bit in the PCR needs to be set.
      
      As new processor features generally require explicit kernel support,
      they should be disabled until such support is implemented. Therefore
      kernels should set all unknown/reserved bits in the PCR such that any
      new architecture features which the kernel does not currently know
      about get disabled.
      
      An update is planned to the ISA to clarify that the PCR is an
      exception to the Programming Note on reserved bits in Section 1.3.3.
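      
      In other words, instead of writing 0 to the PCR, the kernel now
      writes a mask with all reserved bits set (a sketch; the exact mask
      definition is an assumption):
      
          /* known PCR bits; everything else is reserved and stays set */
          #define PCR_MASK        ~(PCR_HIGH_BITS | PCR_LOW_BITS)
      
          mtspr(SPRN_PCR, PCR_MASK);      /* was: mtspr(SPRN_PCR, 0) */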
      Signed-off-by: Alistair Popple <alistair@popple.id.au>
      Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
      Tested-by: Joel Stanley <joel@jms.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190917004605.22471-2-alistair@popple.id.au
  14. 05 September 2019 (1 commit)
    • powerpc/64s/radix: introduce options to disable use of the tlbie instruction · 2275d7b5
      Nicholas Piggin authored
      Introduce two options to control the use of the tlbie instruction. The
      first is a boot-time option which completely prevents the kernel from
      using the instruction; this is currently incompatible with the hash
      MMU, KVM, and coherent accelerators.
      
      The second is a debugfs option which can be switched at runtime and
      avoids using tlbie for invalidating CPU TLBs for normal process and
      kernel address mappings. Coherent accelerators are still managed with
      tlbie, as are KVM partition-scope translations.
      
      Cross-CPU TLB flushing is implemented with IPIs and tlbiel. This is a
      basic implementation which does not attempt to make any optimisation
      beyond the tlbie implementation.
      
      This is useful for performance testing among other things. For example,
      in certain situations on large systems, using IPIs may be faster than
      tlbie as they can be directed rather than broadcast. Later we may also
      take advantage of the IPIs to do more interesting things, such as
      trimming the mm cpumask more aggressively.
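      
      A sketch of the IPI + tlbiel scheme (helper names are illustrative):
      
          /* tlbiel only invalidates the local TLB, so run the flush on
           * every CPU that has used this mm */
          static void do_tlbiel_pid(void *info)
          {
                  struct mm_struct *mm = info;
      
                  _tlbiel_pid(mm->context.id, RIC_FLUSH_TLB);
          }
      
          on_each_cpu_mask(mm_cpumask(mm), do_tlbiel_pid, mm, 1);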
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20190902152931.17840-7-npiggin@gmail.com