1. 13 7月, 2019 1 次提交
  2. 20 6月, 2019 1 次提交
    • S
      KVM: PPC: Book3S HV: Invalidate ERAT when flushing guest TLB entries · 50087112
      Suraj Jitindar Singh 提交于
      When a guest vcpu moves from one physical thread to another it is
      necessary for the host to perform a tlb flush on the previous core if
      another vcpu from the same guest is going to run there. This is because the
      guest may use the local form of the tlb invalidation instruction meaning
      stale tlb entries would persist where it previously ran. This is handled
      on guest entry in kvmppc_check_need_tlb_flush() which calls
      flush_guest_tlb() to perform the tlb flush.
      
      Previously the generic radix__local_flush_tlb_lpid_guest() function was
      used, however the functionality was reimplemented in flush_guest_tlb()
      to avoid the trace_tlbie() call as the flushing may be done in real
      mode. The reimplementation in flush_guest_tlb() was missing an erat
      invalidation after flushing the tlb.
      
      This lead to observable memory corruption in the guest due to the
      caching of stale translations. Fix this by adding the erat invalidation.
      
      Fixes: 70ea13f6 ("KVM: PPC: Book3S HV: Flush TLB on secondary radix threads")
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      50087112
  3. 19 6月, 2019 1 次提交
  4. 18 6月, 2019 2 次提交
    • S
      KVM: PPC: Book3S HV: Only write DAWR[X] when handling h_set_dawr in real mode · 84b02824
      Suraj Jitindar Singh 提交于
      The hcall H_SET_DAWR is used by a guest to set the data address
      watchpoint register (DAWR). This hcall is handled in the host in
      kvmppc_h_set_dawr() which can be called in either real mode on the
      guest exit path from hcall_try_real_mode() in book3s_hv_rmhandlers.S,
      or in virtual mode when called from kvmppc_pseries_do_hcall() in
      book3s_hv.c.
      
      The function kvmppc_h_set_dawr() updates the dawr and dawrx fields in
      the vcpu struct accordingly and then also writes the respective values
      into the DAWR and DAWRX registers directly. It is necessary to write
      the registers directly here when calling the function in real mode
      since the path to re-enter the guest won't do this. However when in
      virtual mode the host DAWR and DAWRX values have already been
      restored, and so writing the registers would overwrite these.
      Additionally there is no reason to write the guest values here as
      these will be read from the vcpu struct and written to the registers
      appropriately the next time the vcpu is run.
      
      This also avoids the case when handling h_set_dawr for a nested guest
      where the guest hypervisor isn't able to write the DAWR and DAWRX
      registers directly and must rely on the real hypervisor to do this for
      it when it calls H_ENTER_NESTED.
      
      Fixes: c1fe190c ("powerpc: Add force enable of DAWR on P9 option")
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      84b02824
    • M
      KVM: PPC: Book3S HV: Fix r3 corruption in h_set_dabr() · fabb2efc
      Michael Neuling 提交于
      Commit c1fe190c ("powerpc: Add force enable of DAWR on P9 option")
      screwed up some assembler and corrupted a pointer in r3. This resulted
      in crashes like the below:
      
        BUG: Kernel NULL pointer dereference at 0x000013bf
        Faulting instruction address: 0xc00000000010b044
        Oops: Kernel access of bad area, sig: 11 [#1]
        LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        CPU: 8 PID: 1771 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc4+ #3
        NIP:  c00000000010b044 LR: c0080000089dacf4 CTR: c00000000010aff4
        REGS: c00000179b397710 TRAP: 0300   Not tainted  (5.2.0-rc4+)
        MSR:  800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 42244842  XER: 00000000
        CFAR: c00000000010aff8 DAR: 00000000000013bf DSISR: 42000000 IRQMASK: 0
        GPR00: c0080000089dd6bc c00000179b3979a0 c008000008a04300 ffffffffffffffff
        GPR04: 0000000000000000 0000000000000003 000000002444b05d c0000017f11c45d0
        ...
        NIP kvmppc_h_set_dabr+0x50/0x68
        LR  kvmppc_pseries_do_hcall+0xa3c/0xeb0 [kvm_hv]
        Call Trace:
          0xc0000017f11c0000 (unreliable)
          kvmppc_vcpu_run_hv+0x694/0xec0 [kvm_hv]
          kvmppc_vcpu_run+0x34/0x48 [kvm]
          kvm_arch_vcpu_ioctl_run+0x2f4/0x400 [kvm]
          kvm_vcpu_ioctl+0x460/0x850 [kvm]
          do_vfs_ioctl+0xe4/0xb40
          ksys_ioctl+0xc4/0x110
          sys_ioctl+0x28/0x80
          system_call+0x5c/0x70
        Instruction dump:
        4082fff4 4c00012c 38600000 4e800020 e96280c0 896b0000 2c2b0000 3860ffff
        4d820020 50852e74 508516f6 78840724 <f88313c0> f8a313c8 7c942ba6 7cbc2ba6
      
      Fix the bug by only changing r3 when we are returning immediately.
      
      Fixes: c1fe190c ("powerpc: Add force enable of DAWR on P9 option")
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reported-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      fabb2efc
  5. 05 6月, 2019 2 次提交
  6. 31 5月, 2019 2 次提交
  7. 30 5月, 2019 6 次提交
    • S
      KVM: PPC: Book3S HV: Restore SPRG3 in kvmhv_p9_guest_entry() · d724c9e5
      Suraj Jitindar Singh 提交于
      The sprgs are a set of 4 general purpose sprs provided for software use.
      SPRG3 is special in that it can also be read from userspace. Thus it is
      used on linux to store the cpu and numa id of the process to speed up
      syscall access to this information.
      
      This register is overwritten with the guest value on kvm guest entry,
      and so needs to be restored on exit again. Thus restore the value on
      the guest exit path in kvmhv_p9_guest_entry().
      
      Cc: stable@vger.kernel.org # v4.20+
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      d724c9e5
    • P
      KVM: PPC: Book3S HV: Fix lockdep warning when entering guest on POWER9 · 1b28d553
      Paul Mackerras 提交于
      Commit 3309bec8 ("KVM: PPC: Book3S HV: Fix lockdep warning when
      entering the guest") moved calls to trace_hardirqs_{on,off} in the
      entry path used for HPT guests.  Similar code exists in the new
      streamlined entry path used for radix guests on POWER9.  This makes
      the same change there, so as to avoid lockdep warnings such as this:
      
      [  228.686461] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)
      [  228.686480] WARNING: CPU: 116 PID: 3803 at ../kernel/locking/lockdep.c:4219 check_flags.part.23+0x21c/0x270
      [  228.686544] Modules linked in: vhost_net vhost xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat
      +xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter
      +ebtables ip6table_filter ip6_tables iptable_filter fuse kvm_hv kvm at24 ipmi_powernv regmap_i2c ipmi_devintf
      +uio_pdrv_genirq ofpart ipmi_msghandler uio powernv_flash mtd ibmpowernv opal_prd ip_tables ext4 mbcache jbd2 btrfs
      +zstd_decompress zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor
      +raid6_pq raid1 raid0 ses sd_mod enclosure scsi_transport_sas ast i2c_opal i2c_algo_bit drm_kms_helper syscopyarea
      +sysfillrect sysimgblt fb_sys_fops ttm drm i40e e1000e cxl aacraid tg3 drm_panel_orientation_quirks i2c_core
      [  228.686859] CPU: 116 PID: 3803 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc1-xive+ #42
      [  228.686911] NIP:  c0000000001b394c LR: c0000000001b3948 CTR: c000000000bfad20
      [  228.686963] REGS: c000200cdb50f570 TRAP: 0700   Not tainted  (5.2.0-rc1-xive+)
      [  228.687001] MSR:  9000000002823033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE>  CR: 48222222  XER: 20040000
      [  228.687060] CFAR: c000000000116db0 IRQMASK: 1
      [  228.687060] GPR00: c0000000001b3948 c000200cdb50f800 c0000000015e7600 000000000000002e
      [  228.687060] GPR04: 0000000000000001 c0000000001c71a0 000000006e655f73 72727563284e4f5f
      [  228.687060] GPR08: 0000200e60680000 0000000000000000 c000200cdb486180 0000000000000000
      [  228.687060] GPR12: 0000000000002000 c000200fff61a680 0000000000000000 00007fffb75c0000
      [  228.687060] GPR16: 0000000000000000 0000000000000000 c0000000017d6900 c000000001124900
      [  228.687060] GPR20: 0000000000000074 c008000006916f68 0000000000000074 0000000000000074
      [  228.687060] GPR24: ffffffffffffffff ffffffffffffffff 0000000000000003 c000200d4b600000
      [  228.687060] GPR28: c000000001627e58 c000000001489908 c000000001627e58 c000000002304de0
      [  228.687377] NIP [c0000000001b394c] check_flags.part.23+0x21c/0x270
      [  228.687415] LR [c0000000001b3948] check_flags.part.23+0x218/0x270
      [  228.687466] Call Trace:
      [  228.687488] [c000200cdb50f800] [c0000000001b3948] check_flags.part.23+0x218/0x270 (unreliable)
      [  228.687542] [c000200cdb50f870] [c0000000001b6548] lock_is_held_type+0x188/0x1c0
      [  228.687595] [c000200cdb50f8d0] [c0000000001d939c] rcu_read_lock_sched_held+0xdc/0x100
      [  228.687646] [c000200cdb50f900] [c0000000001dd704] rcu_note_context_switch+0x304/0x340
      [  228.687701] [c000200cdb50f940] [c0080000068fcc58] kvmhv_run_single_vcpu+0xdb0/0x1120 [kvm_hv]
      [  228.687756] [c000200cdb50fa20] [c0080000068fd5b0] kvmppc_vcpu_run_hv+0x5e8/0xe40 [kvm_hv]
      [  228.687816] [c000200cdb50faf0] [c0080000071797dc] kvmppc_vcpu_run+0x34/0x48 [kvm]
      [  228.687863] [c000200cdb50fb10] [c0080000071755dc] kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm]
      [  228.687916] [c000200cdb50fba0] [c008000007165ccc] kvm_vcpu_ioctl+0x424/0x838 [kvm]
      [  228.687957] [c000200cdb50fd10] [c000000000433a24] do_vfs_ioctl+0xd4/0xcd0
      [  228.687995] [c000200cdb50fdb0] [c000000000434724] ksys_ioctl+0x104/0x120
      [  228.688033] [c000200cdb50fe00] [c000000000434768] sys_ioctl+0x28/0x80
      [  228.688072] [c000200cdb50fe20] [c00000000000b888] system_call+0x5c/0x70
      [  228.688109] Instruction dump:
      [  228.688142] 4bf6342d 60000000 0fe00000 e8010080 7c0803a6 4bfffe60 3c82ff87 3c62ff87
      [  228.688196] 388472d0 3863d738 4bf63405 60000000 <0fe00000> 4bffff4c 3c82ff87 3c62ff87
      [  228.688251] irq event stamp: 205
      [  228.688287] hardirqs last  enabled at (205): [<c0080000068fc1b4>] kvmhv_run_single_vcpu+0x30c/0x1120 [kvm_hv]
      [  228.688344] hardirqs last disabled at (204): [<c0080000068fbff0>] kvmhv_run_single_vcpu+0x148/0x1120 [kvm_hv]
      [  228.688412] softirqs last  enabled at (180): [<c000000000c0b2ac>] __do_softirq+0x4ac/0x5d4
      [  228.688464] softirqs last disabled at (169): [<c000000000122aa8>] irq_exit+0x1f8/0x210
      [  228.688513] ---[ end trace eb16f6260022a812 ]---
      [  228.688548] possible reason: unannotated irqs-off.
      [  228.688571] irq event stamp: 205
      [  228.688607] hardirqs last  enabled at (205): [<c0080000068fc1b4>] kvmhv_run_single_vcpu+0x30c/0x1120 [kvm_hv]
      [  228.688664] hardirqs last disabled at (204): [<c0080000068fbff0>] kvmhv_run_single_vcpu+0x148/0x1120 [kvm_hv]
      [  228.688719] softirqs last  enabled at (180): [<c000000000c0b2ac>] __do_softirq+0x4ac/0x5d4
      [  228.688758] softirqs last disabled at (169): [<c000000000122aa8>] irq_exit+0x1f8/0x210
      
      Cc: stable@vger.kernel.org # v4.20+
      Fixes: 95a6432c ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests")
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Reviewed-by: NCédric Le Goater <clg@kaod.org>
      Tested-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      1b28d553
    • C
      KVM: PPC: Book3S HV: XIVE: Fix page offset when clearing ESB pages · bcaa3110
      Cédric Le Goater 提交于
      Under XIVE, the ESB pages of an interrupt are used for interrupt
      management (EOI) and triggering. They are made available to guests
      through a mapping of the XIVE KVM device.
      
      When a device is passed-through, the passthru_irq helpers,
      kvmppc_xive_set_mapped() and kvmppc_xive_clr_mapped(), clear the ESB
      pages of the guest IRQ number being mapped and let the VM fault
      handler repopulate with the correct page.
      
      The ESB pages are mapped at offset 4 (KVM_XIVE_ESB_PAGE_OFFSET) in the
      KVM device mapping. Unfortunately, this offset was not taken into
      account when clearing the pages. This lead to issues with the
      passthrough devices for which the interrupts were not functional under
      some guest configuration (tg3 and single CPU) or in any configuration
      (e1000e adapter).
      Reviewed-by: NGreg Kurz <groug@kaod.org>
      Tested-by: NGreg Kurz <groug@kaod.org>
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      bcaa3110
    • C
      KVM: PPC: Book3S HV: XIVE: Take the srcu read lock when accessing memslots · aedb5b19
      Cédric Le Goater 提交于
      According to Documentation/virtual/kvm/locking.txt, the srcu read lock
      should be taken when accessing the memslots of the VM. The XIVE KVM
      device needs to do so when configuring the page of the OS event queue
      of vCPU for a given priority and when marking the same page dirty
      before migration.
      
      This avoids warnings such as :
      
      [  208.224882] =============================
      [  208.224884] WARNING: suspicious RCU usage
      [  208.224889] 5.2.0-rc2-xive+ #47 Not tainted
      [  208.224890] -----------------------------
      [  208.224894] ../include/linux/kvm_host.h:633 suspicious rcu_dereference_check() usage!
      [  208.224896]
                     other info that might help us debug this:
      
      [  208.224898]
                     rcu_scheduler_active = 2, debug_locks = 1
      [  208.224901] no locks held by qemu-system-ppc/3923.
      [  208.224902]
                     stack backtrace:
      [  208.224907] CPU: 64 PID: 3923 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc2-xive+ #47
      [  208.224909] Call Trace:
      [  208.224918] [c000200cdd98fa30] [c000000000be1934] dump_stack+0xe8/0x164 (unreliable)
      [  208.224924] [c000200cdd98fa80] [c0000000001aec80] lockdep_rcu_suspicious+0x110/0x180
      [  208.224935] [c000200cdd98fb00] [c0080000075933a0] gfn_to_memslot+0x1c8/0x200 [kvm]
      [  208.224943] [c000200cdd98fb40] [c008000007599600] gfn_to_pfn+0x28/0x60 [kvm]
      [  208.224951] [c000200cdd98fb70] [c008000007599658] gfn_to_page+0x20/0x40 [kvm]
      [  208.224959] [c000200cdd98fb90] [c0080000075b495c] kvmppc_xive_native_set_attr+0x8b4/0x1480 [kvm]
      [  208.224967] [c000200cdd98fca0] [c00800000759261c] kvm_device_ioctl_attr+0x64/0xb0 [kvm]
      [  208.224974] [c000200cdd98fcf0] [c008000007592730] kvm_device_ioctl+0xc8/0x110 [kvm]
      [  208.224979] [c000200cdd98fd10] [c000000000433a24] do_vfs_ioctl+0xd4/0xcd0
      [  208.224981] [c000200cdd98fdb0] [c000000000434724] ksys_ioctl+0x104/0x120
      [  208.224984] [c000200cdd98fe00] [c000000000434768] sys_ioctl+0x28/0x80
      [  208.224988] [c000200cdd98fe20] [c00000000000b888] system_call+0x5c/0x70
      legoater@boss01:~$
      
      Fixes: 13ce3297 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration")
      Fixes: e6714bd1 ("KVM: PPC: Book3S HV: XIVE: Add a control to dirty the XIVE EQ pages")
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      aedb5b19
    • C
      KVM: PPC: Book3S HV: XIVE: Do not clear IRQ data of passthrough interrupts · ef974020
      Cédric Le Goater 提交于
      The passthrough interrupts are defined at the host level and their IRQ
      data should not be cleared unless specifically deconfigured (shutdown)
      by the host. They differ from the IPI interrupts which are allocated
      by the XIVE KVM device and reserved to the guest usage only.
      
      This fixes a host crash when destroying a VM in which a PCI adapter
      was passed-through. In this case, the interrupt is cleared and freed
      by the KVM device and then shutdown by vfio at the host level.
      
      [ 1007.360265] BUG: Kernel NULL pointer dereference at 0x00000d00
      [ 1007.360285] Faulting instruction address: 0xc00000000009da34
      [ 1007.360296] Oops: Kernel access of bad area, sig: 7 [#1]
      [ 1007.360303] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
      [ 1007.360314] Modules linked in: vhost_net vhost iptable_mangle ipt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc kvm_hv kvm xt_tcpudp iptable_filter squashfs fuse binfmt_misc vmx_crypto ib_iser rdma_cm iw_cm ib_cm libiscsi scsi_transport_iscsi nfsd ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress lzo_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq multipath mlx5_ib ib_uverbs ib_core crc32c_vpmsum mlx5_core
      [ 1007.360425] CPU: 9 PID: 15576 Comm: CPU 18/KVM Kdump: loaded Not tainted 5.1.0-gad7e7d0ef #4
      [ 1007.360454] NIP:  c00000000009da34 LR: c00000000009e50c CTR: c00000000009e5d0
      [ 1007.360482] REGS: c000007f24ccf330 TRAP: 0300   Not tainted  (5.1.0-gad7e7d0ef)
      [ 1007.360500] MSR:  900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24002484  XER: 00000000
      [ 1007.360532] CFAR: c00000000009da10 DAR: 0000000000000d00 DSISR: 00080000 IRQMASK: 1
      [ 1007.360532] GPR00: c00000000009e62c c000007f24ccf5c0 c000000001510600 c000007fe7f947c0
      [ 1007.360532] GPR04: 0000000000000d00 0000000000000000 0000000000000000 c000005eff02d200
      [ 1007.360532] GPR08: 0000000000400000 0000000000000000 0000000000000000 fffffffffffffffd
      [ 1007.360532] GPR12: c00000000009e5d0 c000007fffff7b00 0000000000000031 000000012c345718
      [ 1007.360532] GPR16: 0000000000000000 0000000000000008 0000000000418004 0000000000040100
      [ 1007.360532] GPR20: 0000000000000000 0000000008430000 00000000003c0000 0000000000000027
      [ 1007.360532] GPR24: 00000000000000ff 0000000000000000 00000000000000ff c000007faa90d98c
      [ 1007.360532] GPR28: c000007faa90da40 00000000000fe040 ffffffffffffffff c000007fe7f947c0
      [ 1007.360689] NIP [c00000000009da34] xive_esb_read+0x34/0x120
      [ 1007.360706] LR [c00000000009e50c] xive_do_source_set_mask.part.0+0x2c/0x50
      [ 1007.360732] Call Trace:
      [ 1007.360738] [c000007f24ccf5c0] [c000000000a6383c] snooze_loop+0x15c/0x270 (unreliable)
      [ 1007.360775] [c000007f24ccf5f0] [c00000000009e62c] xive_irq_shutdown+0x5c/0xe0
      [ 1007.360795] [c000007f24ccf630] [c00000000019e4a0] irq_shutdown+0x60/0xe0
      [ 1007.360813] [c000007f24ccf660] [c000000000198c44] __free_irq+0x3a4/0x420
      [ 1007.360831] [c000007f24ccf700] [c000000000198dc8] free_irq+0x78/0xe0
      [ 1007.360849] [c000007f24ccf730] [c00000000096c5a8] vfio_msi_set_vector_signal+0xa8/0x350
      [ 1007.360878] [c000007f24ccf7f0] [c00000000096c938] vfio_msi_set_block+0xe8/0x1e0
      [ 1007.360899] [c000007f24ccf850] [c00000000096cae0] vfio_msi_disable+0xb0/0x110
      [ 1007.360912] [c000007f24ccf8a0] [c00000000096cd04] vfio_pci_set_msi_trigger+0x1c4/0x3d0
      [ 1007.360922] [c000007f24ccf910] [c00000000096d910] vfio_pci_set_irqs_ioctl+0xa0/0x170
      [ 1007.360941] [c000007f24ccf930] [c00000000096b400] vfio_pci_disable+0x80/0x5e0
      [ 1007.360963] [c000007f24ccfa10] [c00000000096b9bc] vfio_pci_release+0x5c/0x90
      [ 1007.360991] [c000007f24ccfa40] [c000000000963a9c] vfio_device_fops_release+0x3c/0x70
      [ 1007.361012] [c000007f24ccfa70] [c0000000003b5668] __fput+0xc8/0x2b0
      [ 1007.361040] [c000007f24ccfac0] [c0000000001409b0] task_work_run+0x140/0x1b0
      [ 1007.361059] [c000007f24ccfb20] [c000000000118f8c] do_exit+0x3ac/0xd00
      [ 1007.361076] [c000007f24ccfc00] [c0000000001199b0] do_group_exit+0x60/0x100
      [ 1007.361094] [c000007f24ccfc40] [c00000000012b514] get_signal+0x1a4/0x8f0
      [ 1007.361112] [c000007f24ccfd30] [c000000000021cc8] do_notify_resume+0x1a8/0x430
      [ 1007.361141] [c000007f24ccfe20] [c00000000000e444] ret_from_except_lite+0x70/0x74
      [ 1007.361159] Instruction dump:
      [ 1007.361175] 38422c00 e9230000 712a0004 41820010 548a2036 7d442378 78840020 71290020
      [ 1007.361194] 4082004c e9230010 7c892214 7c0004ac <e9240000> 0c090000 4c00012c 792a0022
      
      Cc: stable@vger.kernel.org # v4.12+
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NGreg Kurz <groug@kaod.org>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      ef974020
    • C
      KVM: PPC: Book3S HV: XIVE: Introduce a new mutex for the XIVE device · 7e10b9a6
      Cédric Le Goater 提交于
      The XICS-on-XIVE KVM device needs to allocate XIVE event queues when a
      priority is used by the OS. This is referred as EQ provisioning and it
      is done under the hood when :
      
        1. a CPU is hot-plugged in the VM
        2. the "set-xive" is called at VM startup
        3. sources are restored at VM restore
      
      The kvm->lock mutex is used to protect the different XIVE structures
      being modified but in some contexts, kvm->lock is taken under the
      vcpu->mutex which is not permitted by the KVM locking rules.
      
      Introduce a new mutex 'lock' for the KVM devices for them to
      synchronize accesses to the XIVE device structures.
      Reviewed-by: NGreg Kurz <groug@kaod.org>
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      7e10b9a6
  8. 29 5月, 2019 7 次提交
  9. 28 5月, 2019 1 次提交
    • T
      KVM: s390: Do not report unusabled IDs via KVM_CAP_MAX_VCPU_ID · a86cb413
      Thomas Huth 提交于
      KVM_CAP_MAX_VCPU_ID is currently always reporting KVM_MAX_VCPU_ID on all
      architectures. However, on s390x, the amount of usable CPUs is determined
      during runtime - it is depending on the features of the machine the code
      is running on. Since we are using the vcpu_id as an index into the SCA
      structures that are defined by the hardware (see e.g. the sca_add_vcpu()
      function), it is not only the amount of CPUs that is limited by the hard-
      ware, but also the range of IDs that we can use.
      Thus KVM_CAP_MAX_VCPU_ID must be determined during runtime on s390x, too.
      So the handling of KVM_CAP_MAX_VCPU_ID has to be moved from the common
      code into the architecture specific code, and on s390x we have to return
      the same value here as for KVM_CAP_MAX_VCPUS.
      This problem has been discovered with the kvm_create_max_vcpus selftest.
      With this change applied, the selftest now passes on s390x, too.
      Reviewed-by: NAndrew Jones <drjones@redhat.com>
      Reviewed-by: NCornelia Huck <cohuck@redhat.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NThomas Huth <thuth@redhat.com>
      Message-Id: <20190523164309.13345-9-thuth@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      a86cb413
  10. 24 5月, 2019 2 次提交
  11. 15 5月, 2019 1 次提交
    • I
      mm/gup: change GUP fast to use flags rather than a write 'bool' · 73b0140b
      Ira Weiny 提交于
      To facilitate additional options to get_user_pages_fast() change the
      singular write parameter to be gup_flags.
      
      This patch does not change any functionality.  New functionality will
      follow in subsequent patches.
      
      Some of the get_user_pages_fast() call sites were unchanged because they
      already passed FOLL_WRITE or 0 for the write parameter.
      
      NOTE: It was suggested to change the ordering of the get_user_pages_fast()
      arguments to ensure that callers were converted.  This breaks the current
      GUP call site convention of having the returned pages be the final
      parameter.  So the suggestion was rejected.
      
      Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.comSigned-off-by: NIra Weiny <ira.weiny@intel.com>
      Reviewed-by: NMike Marshall <hubcap@omnibond.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      73b0140b
  12. 14 5月, 2019 2 次提交
  13. 30 4月, 2019 12 次提交
    • N
      powerpc/64s: Reimplement book3s idle code in C · 10d91611
      Nicholas Piggin 提交于
      Reimplement Book3S idle code in C, moving POWER7/8/9 implementation
      speific HV idle code to the powernv platform code.
      
      Book3S assembly stubs are kept in common code and used only to save
      the stack frame and non-volatile GPRs before executing architected
      idle instructions, and restoring the stack and reloading GPRs then
      returning to C after waking from idle.
      
      The complex logic dealing with threads and subcores, locking, SPRs,
      HMIs, timebase resync, etc., is all done in C which makes it more
      maintainable.
      
      This is not a strict translation to C code, there are some
      significant differences:
      
      - Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
        but saves and restores them itself.
      
      - The optimisation where EC=ESL=0 idle modes did not have to save GPRs
        or change MSR is restored, because it's now simple to do. ESL=1
        sleeps that do not lose GPRs can use this optimization too.
      
      - KVM secondary entry and cede is now more of a call/return style
        rather than branchy. nap_state_lost is not required because KVM
        always returns via NVGPR restoring path.
      
      - KVM secondary wakeup from offline sequence is moved entirely into
        the offline wakeup, which avoids a hwsync in the normal idle wakeup
        path.
      
      Performance measured with context switch ping-pong on different
      threads or cores, is possibly improved a small amount, 1-3% depending
      on stop state and core vs thread test for shallow states. Deep states
      it's in the noise compared with other latencies.
      
      KVM improvements:
      
      - Idle sleepers now always return to caller rather than branch out
        to KVM first.
      
      - This allows optimisations like very fast return to caller when no
        state has been lost.
      
      - KVM no longer requires nap_state_lost because it controls NVGPR
        save/restore itself on the way in and out.
      
      - The heavy idle wakeup KVM request check can be moved out of the
        normal host idle code and into the not-performance-critical offline
        code.
      
      - KVM nap code now returns from where it is called, which makes the
        flow a bit easier to follow.
      Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      [mpe: Squash the KVM changes in]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      10d91611
    • P
      KVM: PPC: Book3S HV: XIVE: Clear escalation interrupt pointers on device close · 0caecf5b
      Paul Mackerras 提交于
      This adds code to ensure that after a XIVE or XICS-on-XIVE KVM device
      is closed, KVM will not try to enable or disable any of the escalation
      interrupts for the VCPUs.  We don't have to worry about races between
      clearing the pointers and use of the pointers by the XIVE context
      push/pull code, because the callers hold the vcpu->mutex, which is
      also taken by the KVM_RUN code.  Therefore the vcpu cannot be entering
      or exiting the guest concurrently.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      0caecf5b
    • P
      KVM: PPC: Book3S HV: XIVE: Prevent races when releasing device · 6f868405
      Paul Mackerras 提交于
      Now that we have the possibility of a XIVE or XICS-on-XIVE device being
      released while the VM is still running, we need to be careful about
      races and potential use-after-free bugs.  Although the kvmppc_xive
      struct is not freed, but kept around for re-use, the kvmppc_xive_vcpu
      structs are freed, and they are used extensively in both the XIVE native
      and XICS-on-XIVE code.
      
      There are various ways in which XIVE code gets invoked:
      
      - VCPU entry and exit, which do push and pull operations on the XIVE hardware
      - one_reg get and set functions (vcpu->mutex is held)
      - XICS hypercalls (but only inside guest execution, not from
        kvmppc_pseries_do_hcall)
      - device creation calls (kvm->lock is held)
      - device callbacks - get/set attribute, mmap, pagefault, release/destroy
      - set_mapped/clr_mapped calls (kvm->lock is held)
      - connect_vcpu calls
      - debugfs file read callbacks
      
      Inside a device release function, we know that userspace cannot have an
      open file descriptor referring to the device, nor can it have any mmapped
      regions from the device.  Therefore the device callbacks are excluded,
      as are the connect_vcpu calls (since they need a fd for the device).
      Further, since the caller holds the kvm->lock mutex, no other device
      creation calls or set/clr_mapped calls can be executing concurrently.
      
      To exclude VCPU execution and XICS hypercalls, we temporarily set
      kvm->arch.mmu_ready to 0.  This forces any VCPU task that is trying to
      enter the guest to take the kvm->lock mutex, which is held by the caller
      of the release function.  Then, sending an IPI to all other CPUs forces
      any VCPU currently executing in the guest to exit.
      
      Finally, we take the vcpu->mutex for each VCPU around the process of
      cleaning up and freeing its XIVE data structures, in order to exclude
      any one_reg get/set calls.
      
      To exclude the debugfs read callbacks, we just need to ensure that
      debugfs_remove is called before freeing any data structures.  Once it
      returns we know that no CPU can be executing the callbacks (for our
      kvmppc_xive instance).
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      6f868405
    • C
      KVM: PPC: Book3S HV: XIVE: Replace the 'destroy' method by a 'release' method · 5422e951
      Cédric Le Goater 提交于
      When a P9 sPAPR VM boots, the CAS negotiation process determines which
      interrupt mode to use (XICS legacy or XIVE native) and invokes a
      machine reset to activate the chosen mode.
      
      We introduce 'release' methods for the XICS-on-XIVE and the XIVE
      native KVM devices which are called when the file descriptor of the
      device is closed after the TIMA and ESB pages have been unmapped.
      They perform the necessary cleanups : clear the vCPU interrupt
      presenters that could be attached and then destroy the device. The
      'release' methods replace the 'destroy' methods as 'destroy' is not
      called anymore once 'release' is. Compatibility with older QEMU is
      nevertheless maintained.
      
      This is not considered as a safe operation as the vCPUs are still
      running and could be referencing the KVM device through their
      presenters. To protect the system from any breakage, the kvmppc_xive
      objects representing both KVM devices are now stored in an array under
      the VM. Allocation is performed on first usage and memory is freed
      only when the VM exits.
      
      [paulus@ozlabs.org - Moved freeing of xive structures to book3s.c,
       put it under #ifdef CONFIG_KVM_XICS.]
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      5422e951
    • C
      KVM: PPC: Book3S HV: XIVE: Activate XIVE exploitation mode · 3fab2d10
      Cédric Le Goater 提交于
      Full support for the XIVE native exploitation mode is now available,
      advertise the capability KVM_CAP_PPC_IRQ_XIVE for guests running on
      PowerNV KVM Hypervisors only. Support for nested guests (pseries KVM
      Hypervisor) is not yet available. XIVE should also have been activated
      which is default setting on POWER9 systems running a recent Linux
      kernel.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      3fab2d10
    • C
      KVM: PPC: Book3S HV: XIVE: Add passthrough support · 232b984b
      Cédric Le Goater 提交于
      The KVM XICS-over-XIVE device and the proposed KVM XIVE native device
      implement an IRQ space for the guest using the generic IPI interrupts
      of the XIVE IC controller. These interrupts are allocated at the OPAL
      level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
      Interrupt management is performed in the XIVE way: using loads and
      stores on the addresses of the XIVE IPI interrupt ESB pages.
      
      Both KVM devices share the same internal structure caching information
      on the interrupts, among which the xive_irq_data struct containing the
      addresses of the IPI ESB pages and an extra one in case of pass-through.
      The later contains the addresses of the ESB pages of the underlying HW
      controller interrupts, PHB4 in all cases for now.
      
      A guest, when running in the XICS legacy interrupt mode, lets the KVM
      XICS-over-XIVE device "handle" interrupt management, that is to
      perform the loads and stores on the addresses of the ESB pages of the
      guest interrupts. However, when running in XIVE native exploitation
      mode, the KVM XIVE native device exposes the interrupt ESB pages to
      the guest and lets the guest perform directly the loads and stores.
      
      The VMA exposing the ESB pages make use of a custom VM fault handler
      which role is to populate the VMA with appropriate pages. When a fault
      occurs, the guest IRQ number is deduced from the offset, and the ESB
      pages of associated XIVE IPI interrupt are inserted in the VMA (using
      the internal structure caching information on the interrupts).
      
      Supporting device passthrough in the guest running in XIVE native
      exploitation mode adds some extra refinements because the ESB pages
      of a different HW controller (PHB4) need to be exposed to the guest
      along with the initial IPI ESB pages of the XIVE IC controller. But
      the overall mechanic is the same.
      
      When the device HW irqs are mapped into or unmapped from the guest
      IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped()
      and kvmppc_xive_clr_mapped(), are called to record or clear the
      passthrough interrupt information and to perform the switch.
      
      The approach taken by this patch is to clear the ESB pages of the
      guest IRQ number being mapped and let the VM fault handler repopulate.
      The handler will insert the ESB page corresponding to the HW interrupt
      of the device being passed-through or the initial IPI ESB page if the
      device is being removed.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      232b984b
    • C
      KVM: PPC: Book3S HV: XIVE: Add a mapping for the source ESB pages · 6520ca64
      Cédric Le Goater 提交于
      Each source is associated with an Event State Buffer (ESB) with a
      even/odd pair of pages which provides commands to manage the source:
      to trigger, to EOI, to turn off the source for instance.
      
      The custom VM fault handler will deduce the guest IRQ number from the
      offset of the fault, and the ESB page of the associated XIVE interrupt
      will be inserted into the VMA using the internal structure caching
      information on the interrupts.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      6520ca64
    • C
      KVM: PPC: Book3S HV: XIVE: Add a TIMA mapping · 39e9af3d
      Cédric Le Goater 提交于
      Each thread has an associated Thread Interrupt Management context
      composed of a set of registers. These registers let the thread handle
      priority management and interrupt acknowledgment. The most important
      are :
      
          - Interrupt Pending Buffer     (IPB)
          - Current Processor Priority   (CPPR)
          - Notification Source Register (NSR)
      
      They are exposed to software in four different pages each proposing a
      view with a different privilege. The first page is for the physical
      thread context and the second for the hypervisor. Only the third
      (operating system) and the fourth (user level) are exposed the guest.
      
      A custom VM fault handler will populate the VMA with the appropriate
      pages, which should only be the OS page for now.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      39e9af3d
    • C
      KVM: PPC: Book3S HV: XIVE: Add get/set accessors for the VP XIVE state · e4945b9d
      Cédric Le Goater 提交于
      The state of the thread interrupt management registers needs to be
      collected for migration. These registers are cached under the
      'xive_saved_state.w01' field of the VCPU when the VPCU context is
      pulled from the HW thread. An OPAL call retrieves the backup of the
      IPB register in the underlying XIVE NVT structure and merges it in the
      KVM state.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e4945b9d
    • C
      KVM: PPC: Book3S HV: XIVE: Add a control to dirty the XIVE EQ pages · e6714bd1
      Cédric Le Goater 提交于
      When migration of a VM is initiated, a first copy of the RAM is
      transferred to the destination before the VM is stopped, but there is
      no guarantee that the EQ pages in which the event notifications are
      queued have not been modified.
      
      To make sure migration will capture a consistent memory state, the
      XIVE device should perform a XIVE quiesce sequence to stop the flow of
      event notifications and stabilize the EQs. This is the purpose of the
      KVM_DEV_XIVE_EQ_SYNC control which will also marks the EQ pages dirty
      to force their transfer.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e6714bd1
    • C
      KVM: PPC: Book3S HV: XIVE: Add a control to sync the sources · 7b46b616
      Cédric Le Goater 提交于
      This control will be used by the H_INT_SYNC hcall from QEMU to flush
      event notifications on the XIVE IC owning the source.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      7b46b616
    • C
      KVM: PPC: Book3S HV: XIVE: Add a global reset control · 5ca80647
      Cédric Le Goater 提交于
      This control is to be used by the H_INT_RESET hcall from QEMU. Its
      purpose is to clear all configuration of the sources and EQs. This is
      necessary in case of a kexec (for a kdump kernel for instance) to make
      sure that no remaining configuration is left from the previous boot
      setup so that the new kernel can start safely from a clean state.
      
      The queue 7 is ignored when the XIVE device is configured to run in
      single escalation mode. Prio 7 is used by escalations.
      
      The XIVE VP is kept enabled as the vCPU is still active and connected
      to the XIVE device.
      Signed-off-by: NCédric Le Goater <clg@kaod.org>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      5ca80647