1. 31 8月, 2017 4 次提交
  2. 08 8月, 2017 1 次提交
    • L
      KVM: add spinlock optimization framework · 199b5763
      Longpeng(Mike) 提交于
      If a vcpu exits due to request a user mode spinlock, then
      the spinlock-holder may be preempted in user mode or kernel mode.
      (Note that not all architectures trap spin loops in user mode,
      only AMD x86 and ARM/ARM64 currently do).
      
      But if a vcpu exits in kernel mode, then the holder must be
      preempted in kernel mode, so we should choose a vcpu in kernel mode
      as a more likely candidate for the lock holder.
      
      This introduces kvm_arch_vcpu_in_kernel() to decide whether the
      vcpu is in kernel-mode when it's preempted.  kvm_vcpu_on_spin's
      new argument says the same of the spinning VCPU.
      Signed-off-by: NLongpeng(Mike) <longpeng2@huawei.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      199b5763
  3. 26 7月, 2017 1 次提交
    • B
      powerpc/mm/radix: Workaround prefetch issue with KVM · a25bd72b
      Benjamin Herrenschmidt 提交于
      There's a somewhat architectural issue with Radix MMU and KVM.
      
      When coming out of a guest with AIL (Alternate Interrupt Location, ie,
      MMU enabled), we start executing hypervisor code with the PID register
      still containing whatever the guest has been using.
      
      The problem is that the CPU can (and will) then start prefetching or
      speculatively load from whatever host context has that same PID (if
      any), thus bringing translations for that context into the TLB, which
      Linux doesn't know about.
      
      This can cause stale translations and subsequent crashes.
      
      Fixing this in a way that is neither racy nor a huge performance
      impact is difficult. We could just make the host invalidations always
      use broadcast forms but that would hurt single threaded programs for
      example.
      
      We chose to fix it instead by partitioning the PID space between guest
      and host. This is possible because today Linux only use 19 out of the
      20 bits of PID space, so existing guests will work if we make the host
      use the top half of the 20 bits space.
      
      We additionally add support for a property to indicate to Linux the
      size of the PID register which will be useful if we eventually have
      processors with a larger PID space available.
      
      There is still an issue with malicious guests purposefully setting the
      PID register to a value in the hosts PID range. Hopefully future HW
      can prevent that, but in the meantime, we handle it with a pair of
      kludges:
      
       - On the way out of a guest, before we clear the current VCPU in the
         PACA, we check the PID and if it's outside of the permitted range
         we flush the TLB for that PID.
      
       - When context switching, if the mm is "new" on that CPU (the
         corresponding bit was set for the first time in the mm cpumask), we
         check if any sibling thread is in KVM (has a non-NULL VCPU pointer
         in the PACA). If that is the case, we also flush the PID for that
         CPU (core).
      
      This second part is needed to handle the case where a process is
      migrated (or starts a new pthread) on a sibling thread of the CPU
      coming out of KVM, as there's a window where stale translations can
      exist before we detect it and flush them out.
      
      A future optimization could be added by keeping track of whether the
      PID has ever been used and avoid doing that for completely fresh PIDs.
      We could similarily mark PIDs that have been the subject of a global
      invalidation as "fresh". But for now this will do.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      [mpe: Rework the asm to build with CONFIG_PPC_RADIX_MMU=n, drop
            unneeded include of kvm_book3s_asm.h]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      a25bd72b
  4. 24 7月, 2017 2 次提交
    • P
      KVM: PPC: Book3S HV: Fix host crash on changing HPT size · ef427198
      Paul Mackerras 提交于
      Commit f98a8bf9 ("KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB
      ioctl() to change HPT size", 2016-12-20) changed the behaviour of
      the KVM_PPC_ALLOCATE_HTAB ioctl so that it now allocates a new HPT
      and new revmap array if there was a previously-allocated HPT of a
      different size from the size being requested.  In this case, we need
      to reset the rmap arrays of the memslots, because the rmap arrays
      will contain references to HPTEs which are no longer valid.  Worse,
      these references are also references to slots in the new revmap
      array (which parallels the HPT), and the new revmap array contains
      random contents, since it doesn't get zeroed on allocation.
      
      The effect of having these stale references to slots in the revmap
      array that contain random contents is that subsequent calls to
      functions such as kvmppc_add_revmap_chain will crash because they
      will interpret the non-zero contents of the revmap array as HPTE
      indexes and thus index outside of the revmap array.  This leads to
      host crashes such as the following.
      
      [ 7072.862122] Unable to handle kernel paging request for data at address 0xd000000c250c00f8
      [ 7072.862218] Faulting instruction address: 0xc0000000000e1c78
      [ 7072.862233] Oops: Kernel access of bad area, sig: 11 [#1]
      [ 7072.862286] SMP NR_CPUS=1024
      [ 7072.862286] NUMA
      [ 7072.862325] PowerNV
      [ 7072.862378] Modules linked in: kvm_hv vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm iw_cxgb3 mlx5_ib ib_core ses enclosure scsi_transport_sas ipmi_powernv ipmi_devintf ipmi_msghandler powernv_op_panel i2c_opal nfsd auth_rpcgss oid_registry
      [ 7072.863085]  nfs_acl lockd grace sunrpc kvm_pr kvm xfs libcrc32c scsi_dh_alua dm_service_time radeon lpfc nvme_fc nvme_fabrics nvme_core scsi_transport_fc i2c_algo_bit tg3 drm_kms_helper ptp pps_core syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm dm_multipath i2c_core cxgb3 mlx5_core mdio [last unloaded: kvm_hv]
      [ 7072.863381] CPU: 72 PID: 56929 Comm: qemu-system-ppc Not tainted 4.12.0-kvm+ #59
      [ 7072.863457] task: c000000fe29e7600 task.stack: c000001e3ffec000
      [ 7072.863520] NIP: c0000000000e1c78 LR: c0000000000e2e3c CTR: c0000000000e25f0
      [ 7072.863596] REGS: c000001e3ffef560 TRAP: 0300   Not tainted  (4.12.0-kvm+)
      [ 7072.863658] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE,TM[E]>
      [ 7072.863667]   CR: 44082882  XER: 20000000
      [ 7072.863767] CFAR: c0000000000e2e38 DAR: d000000c250c00f8 DSISR: 42000000 SOFTE: 1
      GPR00: c0000000000e2e3c c000001e3ffef7e0 c000000001407d00 d000000c250c00f0
      GPR04: d00000006509fb70 d00000000b3d2048 0000000003ffdfb7 0000000000000000
      GPR08: 00000001007fdfb7 00000000c000000f d0000000250c0000 000000000070f7bf
      GPR12: 0000000000000008 c00000000fdad000 0000000010879478 00000000105a0d78
      GPR16: 00007ffaf4080000 0000000000001190 0000000000000000 0000000000010000
      GPR20: 4001ffffff000415 d00000006509fb70 0000000004091190 0000000ee1881190
      GPR24: 0000000003ffdfb7 0000000003ffdfb7 00000000007fdfb7 c000000f5c958000
      GPR28: d00000002d09fb70 0000000003ffdfb7 d00000006509fb70 d00000000b3d2048
      [ 7072.864439] NIP [c0000000000e1c78] kvmppc_add_revmap_chain+0x88/0x130
      [ 7072.864503] LR [c0000000000e2e3c] kvmppc_do_h_enter+0x84c/0x9e0
      [ 7072.864566] Call Trace:
      [ 7072.864594] [c000001e3ffef7e0] [c000001e3ffef830] 0xc000001e3ffef830 (unreliable)
      [ 7072.864671] [c000001e3ffef830] [c0000000000e2e3c] kvmppc_do_h_enter+0x84c/0x9e0
      [ 7072.864751] [c000001e3ffef920] [d00000000b38d878] kvmppc_map_vrma+0x168/0x200 [kvm_hv]
      [ 7072.864831] [c000001e3ffef9e0] [d00000000b38a684] kvmppc_vcpu_run_hv+0x1284/0x1300 [kvm_hv]
      [ 7072.864914] [c000001e3ffefb30] [d00000000f465664] kvmppc_vcpu_run+0x44/0x60 [kvm]
      [ 7072.865008] [c000001e3ffefb60] [d00000000f461864] kvm_arch_vcpu_ioctl_run+0x114/0x290 [kvm]
      [ 7072.865152] [c000001e3ffefbe0] [d00000000f453c98] kvm_vcpu_ioctl+0x598/0x7a0 [kvm]
      [ 7072.865292] [c000001e3ffefd40] [c000000000389328] do_vfs_ioctl+0xd8/0x8c0
      [ 7072.865410] [c000001e3ffefde0] [c000000000389be4] SyS_ioctl+0xd4/0x130
      [ 7072.865526] [c000001e3ffefe30] [c00000000000b760] system_call+0x58/0x6c
      [ 7072.865644] Instruction dump:
      [ 7072.865715] e95b2110 793a0020 7b4926e4 7f8a4a14 409e0098 807c000c 786326e4 7c6a1a14
      [ 7072.865857] 935e0008 7bbd0020 813c000c 913e000c <93a30008> 93bc000c 48000038 60000000
      [ 7072.866001] ---[ end trace 627b6e4bf8080edc ]---
      
      Note that to trigger this, it is necessary to use a recent upstream
      QEMU (or other userspace that resizes the HPT at CAS time), specify
      a maximum memory size substantially larger than the current memory
      size, and boot a guest kernel that does not support HPT resizing.
      
      This fixes the problem by resetting the rmap arrays when the old HPT
      is freed.
      
      Fixes: f98a8bf9 ("KVM: PPC: Book3S HV: Allow KVM_PPC_ALLOCATE_HTAB ioctl() to change HPT size")
      Cc: stable@vger.kernel.org # v4.11+
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      ef427198
    • P
      KVM: PPC: Book3S HV: Enable TM before accessing TM registers · e4705715
      Paul Mackerras 提交于
      Commit 46a704f8 ("KVM: PPC: Book3S HV: Preserve userspace HTM state
      properly", 2017-06-15) added code to read transactional memory (TM)
      registers but forgot to enable TM before doing so.  The result is
      that if userspace does have live values in the TM registers, a KVM_RUN
      ioctl will cause a host kernel crash like this:
      
      [  181.328511] Unrecoverable TM Unavailable Exception f60 at d00000001e7d9980
      [  181.328605] Oops: Unrecoverable TM Unavailable Exception, sig: 6 [#1]
      [  181.328613] SMP NR_CPUS=2048
      [  181.328613] NUMA
      [  181.328618] PowerNV
      [  181.328646] Modules linked in: vhost_net vhost tap nfs_layout_nfsv41_files rpcsec_gss_krb5 nfsv4 dns_resolver nfs
      +fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
      +nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun ebtable_filter ebtables
      +ip6table_filter ip6_tables iptable_filter bridge stp llc kvm_hv kvm nfsd ses enclosure scsi_transport_sas ghash_generic
      +auth_rpcgss gf128mul xts sg ctr nfs_acl lockd vmx_crypto shpchp ipmi_powernv i2c_opal grace ipmi_devintf i2c_core
      +powernv_rng sunrpc ipmi_msghandler ibmpowernv uio_pdrv_genirq uio leds_powernv powernv_op_panel ip_tables xfs sd_mod
      +lpfc ipr bnx2x libata mdio ptp pps_core scsi_transport_fc libcrc32c dm_mirror dm_region_hash dm_log dm_mod
      [  181.329278] CPU: 40 PID: 9926 Comm: CPU 0/KVM Not tainted 4.12.0+ #1
      [  181.329337] task: c000003fc6980000 task.stack: c000003fe4d80000
      [  181.329396] NIP: d00000001e7d9980 LR: d00000001e77381c CTR: d00000001e7d98f0
      [  181.329465] REGS: c000003fe4d837e0 TRAP: 0f60   Not tainted  (4.12.0+)
      [  181.329523] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
      [  181.329527]   CR: 24022448  XER: 00000000
      [  181.329608] CFAR: d00000001e773818 SOFTE: 1
      [  181.329608] GPR00: d00000001e77381c c000003fe4d83a60 d00000001e7ef410 c000003fdcfe0000
      [  181.329608] GPR04: c000003fe4f00000 0000000000000000 0000000000000000 c000003fd7954800
      [  181.329608] GPR08: 0000000000000001 c000003fc6980000 0000000000000000 d00000001e7e2880
      [  181.329608] GPR12: d00000001e7d98f0 c000000007b19000 00000001295220e0 00007fffc0ce2090
      [  181.329608] GPR16: 0000010011886608 00007fff8c89f260 0000000000000001 00007fff8c080028
      [  181.329608] GPR20: 0000000000000000 00000100118500a6 0000010011850000 0000010011850000
      [  181.329608] GPR24: 00007fffc0ce1b48 0000010011850000 00000000d673b901 0000000000000000
      [  181.329608] GPR28: 0000000000000000 c000003fdcfe0000 c000003fdcfe0000 c000003fe4f00000
      [  181.330199] NIP [d00000001e7d9980] kvmppc_vcpu_run_hv+0x90/0x6b0 [kvm_hv]
      [  181.330264] LR [d00000001e77381c] kvmppc_vcpu_run+0x2c/0x40 [kvm]
      [  181.330322] Call Trace:
      [  181.330351] [c000003fe4d83a60] [d00000001e773478] kvmppc_set_one_reg+0x48/0x340 [kvm] (unreliable)
      [  181.330437] [c000003fe4d83b30] [d00000001e77381c] kvmppc_vcpu_run+0x2c/0x40 [kvm]
      [  181.330513] [c000003fe4d83b50] [d00000001e7700b4] kvm_arch_vcpu_ioctl_run+0x114/0x2a0 [kvm]
      [  181.330586] [c000003fe4d83bd0] [d00000001e7642f8] kvm_vcpu_ioctl+0x598/0x7a0 [kvm]
      [  181.330658] [c000003fe4d83d40] [c0000000003451b8] do_vfs_ioctl+0xc8/0x8b0
      [  181.330717] [c000003fe4d83de0] [c000000000345a64] SyS_ioctl+0xc4/0x120
      [  181.330776] [c000003fe4d83e30] [c00000000000b004] system_call+0x58/0x6c
      [  181.330833] Instruction dump:
      [  181.330869] e92d0260 e9290b50 e9290108 792807e3 41820058 e92d0260 e9290b50 e9290108
      [  181.330941] 792ae8a4 794a1f87 408204f4 e92d0260 <7d4022a6> f9490ff0 e92d0260 7d4122a6
      [  181.331013] ---[ end trace 6f6ddeb4bfe92a92 ]---
      
      The fix is just to turn on the TM bit in the MSR before accessing the
      registers.
      
      Cc: stable@vger.kernel.org # v3.14+
      Fixes: 46a704f8 ("KVM: PPC: Book3S HV: Preserve userspace HTM state properly")
      Reported-by: NJan Stancek <jstancek@redhat.com>
      Tested-by: NJan Stancek <jstancek@redhat.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e4705715
  5. 13 7月, 2017 1 次提交
    • M
      mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Michal Hocko 提交于
      __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
      the page allocator.  This has been true but only for allocations
      requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has been always
      ignored for smaller sizes.  This is a bit unfortunate because there is
      no way to express the same semantic for those requests and they are
      considered too important to fail so they might end up looping in the
      page allocator for ever, similarly to GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of __GFP_REPEAT flag has been removed for !costly requests we can
      give the original flag a better name and more importantly a more useful
      semantic.  Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
      that the allocator would try really hard but there is no promise of a
      success.  This will work independent of the order and overrides the
      default allocator behavior.  Page allocator users have several levels of
      guarantee vs.  cost options (take GFP_KERNEL as an example)
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most light weight mode which even
         doesn't kick the background reclaim. Should be used carefully because
         it might deplete the memory and the next user might hit the more
         aggressive reclaim
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
         allocation without any attempt to free memory from the current
         context but can wake kswapd to reclaim memory if the zone is below
         the low watermark. Can be used from either atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non sleeping allocation with an expensive fallback so it can access
         some portion of memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because they already had their semantic.  No new users are added.
      __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
      there is no progress and we have already passed the OOM point.
      
      This means that all the reclaim opportunities have been exhausted except
      the most disruptive one (the OOM killer) and a user defined fallback
      behavior is more sensible than keep retrying in the page allocator.
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dcda9b04
  6. 03 7月, 2017 1 次提交
    • P
      KVM: PPC: Book3S: Fix typo in XICS-on-XIVE state saving code · 00c14757
      Paul Mackerras 提交于
      This fixes a typo where the wrong loop index was used to index
      the kvmppc_xive_vcpu.queues[] array in xive_pre_save_scan().
      The variable i contains the vcpu number; we need to index queues[]
      using j, which iterates from 0 to KVMPPC_XIVE_Q_COUNT-1.
      
      The effect of this bug is that things that save the interrupt
      controller state, such as "virsh dump", on a VM with more than
      8 vCPUs, result in xive_pre_save_queue() getting called on a
      bogus queue structure, usually resulting in a crash like this:
      
      [  501.821107] Unable to handle kernel paging request for data at address 0x00000084
      [  501.821212] Faulting instruction address: 0xc008000004c7c6f8
      [  501.821234] Oops: Kernel access of bad area, sig: 11 [#1]
      [  501.821305] SMP NR_CPUS=1024
      [  501.821307] NUMA
      [  501.821376] PowerNV
      [  501.821470] Modules linked in: vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables ses enclosure scsi_transport_sas ipmi_powernv ipmi_devintf ipmi_msghandler powernv_op_panel kvm_hv nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc kvm tg3 ptp pps_core
      [  501.822477] CPU: 3 PID: 3934 Comm: live_migration Not tainted 4.11.0-4.git8caa70f.el7.centos.ppc64le #1
      [  501.822633] task: c0000003f9e3ae80 task.stack: c0000003f9ed4000
      [  501.822745] NIP: c008000004c7c6f8 LR: c008000004c7c628 CTR: 0000000030058018
      [  501.822877] REGS: c0000003f9ed7980 TRAP: 0300   Not tainted  (4.11.0-4.git8caa70f.el7.centos.ppc64le)
      [  501.823030] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
      [  501.823047]   CR: 28022244  XER: 00000000
      [  501.823203] CFAR: c008000004c7c77c DAR: 0000000000000084 DSISR: 40000000 SOFTE: 1
      [  501.823203] GPR00: c008000004c7c628 c0000003f9ed7c00 c008000004c91450 00000000000000ff
      [  501.823203] GPR04: c0000003f5580000 c0000003f559bf98 9000000000009033 0000000000000000
      [  501.823203] GPR08: 0000000000000084 0000000000000000 00000000000001e0 9000000000001003
      [  501.823203] GPR12: c00000000008a7d0 c00000000fdc1b00 000000000a9a0000 0000000000000000
      [  501.823203] GPR16: 00000000402954e8 000000000a9a0000 0000000000000004 0000000000000000
      [  501.823203] GPR20: 0000000000000008 c000000002e8f180 c000000002e8f1e0 0000000000000001
      [  501.823203] GPR24: 0000000000000008 c0000003f5580008 c0000003f4564018 c000000002e8f1e8
      [  501.823203] GPR28: 00003ff6e58bdc28 c0000003f4564000 0000000000000000 0000000000000000
      [  501.825441] NIP [c008000004c7c6f8] xive_get_attr+0x3b8/0x5b0 [kvm]
      [  501.825671] LR [c008000004c7c628] xive_get_attr+0x2e8/0x5b0 [kvm]
      [  501.825887] Call Trace:
      [  501.825991] [c0000003f9ed7c00] [c008000004c7c628] xive_get_attr+0x2e8/0x5b0 [kvm] (unreliable)
      [  501.826312] [c0000003f9ed7cd0] [c008000004c62ec4] kvm_device_ioctl_attr+0x64/0xa0 [kvm]
      [  501.826581] [c0000003f9ed7d20] [c008000004c62fcc] kvm_device_ioctl+0xcc/0xf0 [kvm]
      [  501.826843] [c0000003f9ed7d40] [c000000000350c70] do_vfs_ioctl+0xd0/0x8c0
      [  501.827060] [c0000003f9ed7de0] [c000000000351534] SyS_ioctl+0xd4/0xf0
      [  501.827282] [c0000003f9ed7e30] [c00000000000b8e0] system_call+0x38/0xfc
      [  501.827496] Instruction dump:
      [  501.827632] 419e0078 3b760008 e9160008 83fb000c 83db0010 80fb0008 2f280000 60000000
      [  501.827901] 60000000 60420000 419a0050 7be91764 <7d284c2c> 552a0ffe 7f8af040 419e003c
      [  501.828176] ---[ end trace 2d0529a5bbbbafed ]---
      
      Cc: stable@vger.kernel.org
      Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
      Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      00c14757
  7. 01 7月, 2017 2 次提交
    • P
      KVM: PPC: Book3S HV: Close race with testing for signals on guest entry · 8b24e69f
      Paul Mackerras 提交于
      At present, interrupts are hard-disabled fairly late in the guest
      entry path, in the assembly code.  Since we check for pending signals
      for the vCPU(s) task(s) earlier in the guest entry path, it is
      possible for a signal to be delivered before we enter the guest but
      not be noticed until after we exit the guest for some other reason.
      
      Similarly, it is possible for the scheduler to request a reschedule
      while we are in the guest entry path, and we won't notice until after
      we have run the guest, potentially for a whole timeslice.
      
      Furthermore, with a radix guest on POWER9, we can take the interrupt
      with the MMU on.  In this case we end up leaving interrupts
      hard-disabled after the guest exit, and they are likely to stay
      hard-disabled until we exit to userspace or context-switch to
      another process.  This was masking the fact that we were also not
      setting the RI (recoverable interrupt) bit in the MSR, meaning
      that if we had taken an interrupt, it would have crashed the host
      kernel with an unrecoverable interrupt message.
      
      To close these races, we need to check for signals and reschedule
      requests after hard-disabling interrupts, and then keep interrupts
      hard-disabled until we enter the guest.  If there is a signal or a
      reschedule request from another CPU, it will send an IPI, which will
      cause a guest exit.
      
      This puts the interrupt disabling before we call kvmppc_start_thread()
      for all the secondary threads of this core that are going to run vCPUs.
      The reason for that is that once we have started the secondary threads
      there is no easy way to back out without going through at least part
      of the guest entry path.  However, kvmppc_start_thread() includes some
      code for radix guests which needs to call smp_call_function(), which
      must be called with interrupts enabled.  To solve this problem, this
      patch moves that code into a separate function that is called earlier.
      
      When the guest exit is caused by an external interrupt, a hypervisor
      doorbell or a hypervisor maintenance interrupt, we now handle these
      using the replay facility.  __kvmppc_vcore_entry() now returns the
      trap number that caused the exit on this thread, and instead of the
      assembly code jumping to the handler entry, we return to C code with
      interrupts still hard-disabled and set the irq_happened flag in the
      PACA, so that when we do local_irq_enable() the appropriate handler
      gets called.
      
      With all this, we now have the interrupt soft-enable flag clear while
      we are in the guest.  This is useful because code in the real-mode
      hypercall handlers that checks whether interrupts are enabled will
      now see that they are disabled, which is correct, since interrupts
      are hard-disabled in the real-mode code.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      8b24e69f
    • P
      KVM: PPC: Book3S HV: Simplify dynamic micro-threading code · 898b25b2
      Paul Mackerras 提交于
      Since commit b009031f ("KVM: PPC: Book3S HV: Take out virtual
      core piggybacking code", 2016-09-15), we only have at most one
      vcore per subcore.  Previously, the fact that there might be more
      than one vcore per subcore meant that we had the notion of a
      "master vcore", which was the vcore that controlled thread 0 of
      the subcore.  We also needed a list per subcore in the core_info
      struct to record which vcores belonged to each subcore.  Now that
      there can only be one vcore in the subcore, we can replace the
      list with a simple pointer and get rid of the notion of the
      master vcore (and in fact treat every vcore as a master vcore).
      
      We can also get rid of the subcore_vm[] field in the core_info
      struct since it is never read.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      898b25b2
  8. 23 6月, 2017 1 次提交
    • B
      powerpc/mm: Trace tlbie(l) instructions · 0428491c
      Balbir Singh 提交于
      Add a trace point for tlbie(l) (Translation Lookaside Buffer Invalidate
      Entry (Local)) instructions.
      
      The tlbie instruction has changed over the years, so not all versions
      accept the same operands. Use the ISA v3 field operands because they are
      the most verbose, we may change them in future.
      
      Example output:
      
        qemu-system-ppc-5371  [016]  1412.369519: tlbie:
        	tlbie with lpid 0, local 1, rb=67bd8900174c11c1, rs=0, ric=0 prs=0 r=0
      Signed-off-by: NBalbir Singh <bsingharora@gmail.com>
      [mpe: Add some missing trace_tlbie()s, reword change log]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      0428491c
  9. 22 6月, 2017 2 次提交
    • P
      KVM: PPC: Book3S HV: Add capability to report possible virtual SMT modes · 2ed4f9dd
      Paul Mackerras 提交于
      Now that userspace can set the virtual SMT mode by enabling the
      KVM_CAP_PPC_SMT capability, it is useful for userspace to be able
      to query the set of possible virtual SMT modes.  This provides a
      new capability, KVM_CAP_PPC_SMT_POSSIBLE, to provide this
      information.  The return value is a bitmap of possible modes, with
      bit N set if virtual SMT mode 2^N is available.  That is, 1 indicates
      SMT1 is available, 2 indicates that SMT2 is available, 3 indicates
      that both SMT1 and SMT2 are available, and so on.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      2ed4f9dd
    • A
      KVM: PPC: Book3S HV: Exit guest upon MCE when FWNMI capability is enabled · e20bbd3d
      Aravinda Prasad 提交于
      Enhance KVM to cause a guest exit with KVM_EXIT_NMI
      exit reason upon a machine check exception (MCE) in
      the guest address space if the KVM_CAP_PPC_FWNMI
      capability is enabled (instead of delivering a 0x200
      interrupt to guest). This enables QEMU to build error
      log and deliver machine check exception to guest via
      guest registered machine check handler.
      
      This approach simplifies the delivery of machine
      check exception to guest OS compared to the earlier
      approach of KVM directly invoking 0x200 guest interrupt
      vector.
      
      This design/approach is based on the feedback for the
      QEMU patches to handle machine check exception. Details
      of earlier approach of handling machine check exception
      in QEMU and related discussions can be found at:
      
      https://lists.nongnu.org/archive/html/qemu-devel/2014-11/msg00813.html
      
      Note:
      
      This patch now directly invokes machine_check_print_event_info()
      from kvmppc_handle_exit_hv() to print the event to host console
      at the time of guest exit before the exception is passed on to the
      guest. Hence, the host-side handling which was performed earlier
      via machine_check_fwnmi is removed.
      
      The reasons for this approach is (i) it is not possible
      to distinguish whether the exception occurred in the
      guest or the host from the pt_regs passed on the
      machine_check_exception(). Hence machine_check_exception()
      calls panic, instead of passing on the exception to
      the guest, if the machine check exception is not
      recoverable. (ii) the approach introduced in this
      patch gives opportunity to the host kernel to perform
      actions in virtual mode before passing on the exception
      to the guest. This approach does not require complex
      tweaks to machine_check_fwnmi and friends.
      Signed-off-by: NAravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      e20bbd3d
  10. 21 6月, 2017 1 次提交
    • A
      KVM: PPC: Book3S HV: Add new capability to control MCE behaviour · 134764ed
      Aravinda Prasad 提交于
      This introduces a new KVM capability to control how KVM behaves
      on machine check exception (MCE) in HV KVM guests.
      
      If this capability has not been enabled, KVM redirects machine check
      exceptions to guest's 0x200 vector, if the address in error belongs to
      the guest. With this capability enabled, KVM will cause a guest exit
      with the exit reason indicating an NMI.
      
      The new capability is required to avoid problems if a new kernel/KVM
      is used with an old QEMU, running a guest that doesn't issue
      "ibm,nmi-register".  As old QEMU does not understand the NMI exit
      type, it treats it as a fatal error.  However, the guest could have
      handled the machine check error if the exception was delivered to
      guest's 0x200 interrupt vector instead of NMI exit in case of old
      QEMU.
      
      [paulus@ozlabs.org - Reworded the commit message to be clearer,
       enable only on HV KVM.]
      Signed-off-by: NAravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      134764ed
  11. 20 6月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Don't sleep if XIVE interrupt pending on POWER9 · ee3308a2
      Paul Mackerras 提交于
      On a POWER9 system, it is possible for an interrupt to become pending
      for a VCPU when that VCPU is about to cede (execute a H_CEDE hypercall)
      and has already disabled interrupts, or in the H_CEDE processing up
      to the point where the XIVE context is pulled from the hardware.  In
      such a case, the H_CEDE should not sleep, but should return immediately
      to the guest.  However, the conditions tested in kvmppc_vcpu_woken()
      don't include the condition that a XIVE interrupt is pending, so the
      VCPU could sleep until the next decrementer interrupt.
      
      To fix this, we add a new xive_interrupt_pending() helper which looks
      in the XIVE context that was pulled from the hardware to see if the
      priority of any pending interrupt is higher (numerically lower than)
      the CPU priority.  If so then kvmppc_vcpu_woken() will return true.
      If the XIVE context has never been used, then both the pipr and the
      cppr fields will be zero and the test will indicate that no interrupt
      is pending.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      ee3308a2
  12. 19 6月, 2017 6 次提交
    • N
      powerpc/64s/idle: Avoid SRR usage in idle sleep/wake paths · 9d292501
      Nicholas Piggin 提交于
      Idle code now always runs at the 0xc... effective address whether
      in real or virtual mode. This means rfid can be ditched, along
      with a lot of SRR manipulations.
      
      In the wakeup path, carry SRR1 around in r12. Use mtmsrd to change
      MSR states as required.
      
      This also balances the return prediction for the idle call, by
      doing blr rather than rfid to return to the idle caller.
      
      On POWER9, 2-process context switch on different cores, with snooze
      disabled, increases performance by 2%.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      [mpe: Incorporate v2 fixes from Nick]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      9d292501
    • P
      KVM: PPC: Book3S HV: Virtualize doorbell facility on POWER9 · 57900694
      Paul Mackerras 提交于
      On POWER9, we no longer have the restriction that we had on POWER8
      where all threads in a core have to be in the same partition, so
      the CPU threads are now independent.  However, we still want to be
      able to run guests with a virtual SMT topology, if only to allow
      migration of guests from POWER8 systems to POWER9.
      
      A guest that has a virtual SMT mode greater than 1 will expect to
      be able to use the doorbell facility; it will expect the msgsndp
      and msgclrp instructions to work appropriately and to be able to read
      sensible values from the TIR (thread identification register) and
      DPDES (directed privileged doorbell exception status) special-purpose
      registers.  However, since each CPU thread is a separate sub-processor
      in POWER9, these instructions and registers can only be used within
      a single CPU thread.
      
      In order for these instructions to appear to act correctly according
      to the guest's virtual SMT mode, we have to trap and emulate them.
      We cause them to trap by clearing the HFSCR_MSGP bit in the HFSCR
      register.  The emulation is triggered by the hypervisor facility
      unavailable interrupt that occurs when the guest uses them.
      
      To cause a doorbell interrupt to occur within the guest, we set the
      DPDES register to 1.  If the guest has interrupts enabled, the CPU
      will generate a doorbell interrupt and clear the DPDES register in
      hardware.  The DPDES hardware register for the guest is saved in the
      vcpu->arch.vcore->dpdes field.  Since this gets written by the guest
      exit code, other VCPUs wishing to cause a doorbell interrupt don't
      write that field directly, but instead set a vcpu->arch.doorbell_request
      flag.  This is consumed and set to 0 by the guest entry code, which
      then sets DPDES to 1.
      
      Emulating reads of the DPDES register is somewhat involved, because
      it requires reading the doorbell pending interrupt status of all of the
      VCPU threads in the virtual core, and if any of those VCPUs are
      running, their doorbell status is only up-to-date in the hardware
      DPDES registers of the CPUs where they are running.  In order to get
      a reasonable approximation of the current doorbell status, we send
      those CPUs an IPI, causing an exit from the guest which will update
      the vcpu->arch.vcore->dpdes field.  We then use that value in
      constructing the emulated DPDES register value.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      57900694
    • P
      KVM: PPC: Book3S HV: Allow userspace to set the desired SMT mode · 3c313524
      Paul Mackerras 提交于
      This allows userspace to set the desired virtual SMT (simultaneous
      multithreading) mode for a VM, that is, the number of VCPUs that
      get assigned to each virtual core.  Previously, the virtual SMT mode
      was fixed to the number of threads per subcore, and if userspace
      wanted to have fewer vcpus per vcore, then it would achieve that by
      using a sparse CPU numbering.  This had the disadvantage that the
      vcpu numbers can get quite large, particularly for SMT1 guests on
      a POWER8 with 8 threads per core.  With this patch, userspace can
      set its desired virtual SMT mode and then use contiguous vcpu
      numbering.
      
      On POWER8, where the threading mode is "strict", the virtual SMT mode
      must be less than or equal to the number of threads per subcore.  On
      POWER9, which implements a "loose" threading mode, the virtual SMT
      mode can be any power of 2 between 1 and 8, even though there is
      effectively one thread per subcore, since the threads are independent
      and can all be in different partitions.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      3c313524
    • P
      KVM: PPC: Book3S HV: Context-switch HFSCR between host and guest on POWER9 · 769377f7
      Paul Mackerras 提交于
      This adds code to allow us to use a different value for the HFSCR
      (Hypervisor Facilities Status and Control Register) when running the
      guest from that which applies in the host.  The reason for doing this
      is to allow us to trap the msgsndp instruction and related operations
      in future so that they can be virtualized.  We also save the value of
      HFSCR when a hypervisor facility unavailable interrupt occurs, because
      the high byte of HFSCR indicates which facility the guest attempted to
      access.
      
      We save and restore the host value on guest entry/exit because some
      bits of it affect host userspace execution.
      
      We only do all this on POWER9, not on POWER8, because we are not
      intending to virtualize any of the facilities controlled by HFSCR on
      POWER8.  In particular, the HFSCR bit that controls execution of
      msgsndp and related operations does not exist on POWER8.  The HFSCR
      doesn't exist at all on POWER7.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      769377f7
    • P
      KVM: PPC: Book3S HV: Don't let VCPU sleep if it has a doorbell pending · 1da4e2f4
      Paul Mackerras 提交于
      It is possible, through a narrow race condition, for a VCPU to exit
      the guest with a H_CEDE hypercall while it has a doorbell interrupt
      pending.  In this case, the H_CEDE should return immediately, but in
      fact it puts the VCPU to sleep until some other interrupt becomes
      pending or a prod is received (via another VCPU doing H_PROD).
      
      This fixes it by checking the DPDES (Directed Privileged Doorbell
      Exception Status) bit for the thread along with the other interrupt
      pending bits.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      1da4e2f4
    • P
      KVM: PPC: Book3S HV: Enable guests to use large decrementer mode on POWER9 · 1bc3fe81
      Paul Mackerras 提交于
      This allows userspace (e.g. QEMU) to enable large decrementer mode for
      the guest when running on a POWER9 host, by setting the LPCR_LD bit in
      the guest LPCR value.  With this, the guest exit code saves 64 bits of
      the guest DEC value on exit.  Other places that use the guest DEC
      value check the LPCR_LD bit in the guest LPCR value, and if it is set,
      omit the 32-bit sign extension that would otherwise be done.
      
      This doesn't change the DEC emulation used by PR KVM because PR KVM
      is not supported on POWER9 yet.
      
      This is partly based on an earlier patch by Oliver O'Halloran.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      1bc3fe81
  13. 16 6月, 2017 2 次提交
    • P
      KVM: PPC: Book3S HV: Ignore timebase offset on POWER9 DD1 · 3d3efb68
      Paul Mackerras 提交于
      POWER9 DD1 has an erratum where writing to the TBU40 register, which
      is used to apply an offset to the timebase, can cause the timebase to
      lose counts.  This results in the timebase on some CPUs getting out of
      sync with other CPUs, which then results in misbehaviour of the
      timekeeping code.
      
      To work around the problem, we make KVM ignore the timebase offset for
      all guests on POWER9 DD1 machines.  This means that live migration
      cannot be supported on POWER9 DD1 machines.
      
      Cc: stable@vger.kernel.org # v4.10+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      3d3efb68
    • P
      KVM: PPC: Book3S HV: Save/restore host values of debug registers · 7ceaa6dc
      Paul Mackerras 提交于
      At present, HV KVM on POWER8 and POWER9 machines loses any instruction
      or data breakpoint set in the host whenever a guest is run.
      Instruction breakpoints are currently only used by xmon, but ptrace
      and the perf_event subsystem can set data breakpoints as well as xmon.
      
      To fix this, we save the host values of the debug registers (CIABR,
      DAWR and DAWRX) before entering the guest and restore them on exit.
      To provide space to save them in the stack frame, we expand the stack
      frame allocated by kvmppc_hv_entry() from 112 to 144 bytes.
      
      Fixes: b005255e ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
      Cc: stable@vger.kernel.org # v3.14+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      7ceaa6dc
  14. 15 6月, 2017 3 次提交
    • B
      powerpc/xive: Fix offset for store EOI MMIOs · 25642705
      Benjamin Herrenschmidt 提交于
      Architecturally we should apply a 0x400 offset for these. Not doing
      it will break future HW implementations.
      
      The offset of 0 is supposed to remain for "triggers" though not all
      sources support both trigger and store EOI, and in P9 specifically,
      some sources will treat 0 as a store EOI. But future chips will not.
      So this makes us use the properly architected offset which should work
      always.
      
      Fixes: 243e2511 ("powerpc/xive: Native exploitation of the XIVE interrupt controller")
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      25642705
    • P
      KVM: PPC: Book3S HV: Preserve userspace HTM state properly · 46a704f8
      Paul Mackerras 提交于
      If userspace attempts to call the KVM_RUN ioctl when it has hardware
      transactional memory (HTM) enabled, the values that it has put in the
      HTM-related SPRs TFHAR, TFIAR and TEXASR will get overwritten by
      guest values.  To fix this, we detect this condition and save those
      SPR values in the thread struct, and disable HTM for the task.  If
      userspace goes to access those SPRs or the HTM facility in future,
      a TM-unavailable interrupt will occur and the handler will reload
      those SPRs and re-enable HTM.
      
      If userspace has started a transaction and suspended it, we would
      currently lose the transactional state in the guest entry path and
      would almost certainly get a "TM Bad Thing" interrupt, which would
      cause the host to crash.  To avoid this, we detect this case and
      return from the KVM_RUN ioctl with an EINVAL error, with the KVM
      exit reason set to KVM_EXIT_FAIL_ENTRY.
      
      Fixes: b005255e ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
      Cc: stable@vger.kernel.org # v3.14+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      46a704f8
    • P
      KVM: PPC: Book3S HV: Restore critical SPRs to host values on guest exit · 4c3bb4cc
      Paul Mackerras 提交于
      This restores several special-purpose registers (SPRs) to sane values
      on guest exit that were missed before.
      
      TAR and VRSAVE are readable and writable by userspace, and we need to
      save and restore them to prevent the guest from potentially affecting
      userspace execution (not that TAR or VRSAVE are used by any known
      program that run uses the KVM_RUN ioctl).  We save/restore these
      in kvmppc_vcpu_run_hv() rather than on every guest entry/exit.
      
      FSCR affects userspace execution in that it can prohibit access to
      certain facilities by userspace.  We restore it to the normal value
      for the task on exit from the KVM_RUN ioctl.
      
      IAMR is normally 0, and is restored to 0 on guest exit.  However,
      with a radix host on POWER9, it is set to a value that prevents the
      kernel from executing user-accessible memory.  On POWER9, we save
      IAMR on guest entry and restore it on guest exit to the saved value
      rather than 0.  On POWER8 we continue to set it to 0 on guest exit.
      
      PSPB is normally 0.  We restore it to 0 on guest exit to prevent
      userspace taking advantage of the guest having set it non-zero
      (which would allow userspace to set its SMT priority to high).
      
      UAMOR is normally 0.  We restore it to 0 on guest exit to prevent
      the AMR from being used as a covert channel between userspace
      processes, since the AMR is not context-switched at present.
      
      Fixes: b005255e ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
      Cc: stable@vger.kernel.org # v3.14+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      4c3bb4cc
  15. 13 6月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Context-switch EBB registers properly · ca8efa1d
      Paul Mackerras 提交于
      This adds code to save the values of three SPRs (special-purpose
      registers) used by userspace to control event-based branches (EBBs),
      which are essentially interrupts that get delivered directly to
      userspace.  These registers are loaded up with guest values when
      entering the guest, and their values are saved when exiting the
      guest, but we were not saving the host values and restoring them
      before going back to userspace.
      
      On POWER8 this would only affect userspace programs which explicitly
      request the use of EBBs and also use the KVM_RUN ioctl, since the
      only source of EBBs on POWER8 is the PMU, and there is an explicit
      enable bit in the PMU registers (and those PMU registers do get
      properly context-switched between host and guest).  On POWER9 there
      is provision for externally-generated EBBs, and these are not subject
      to the control in the PMU registers.
      
      Since these registers only affect userspace, we can save them when
      we first come in from userspace and restore them before returning to
      userspace, rather than saving/restoring the host values on every
      guest entry/exit.  Similarly, we don't need to worry about their
      values on offline secondary threads since they execute in the context
      of the idle task, which never executes in userspace.
      
      Fixes: b005255e ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
      Cc: stable@vger.kernel.org # v3.14+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      ca8efa1d
  16. 04 6月, 2017 1 次提交
  17. 29 5月, 2017 1 次提交
    • P
      KVM: PPC: Book3S HV: Cope with host using large decrementer mode · 2f272463
      Paul Mackerras 提交于
      POWER9 introduces a new mode for the decrementer register, called
      large decrementer mode, in which the decrementer counter is 56 bits
      wide rather than 32, and reads are sign-extended rather than
      zero-extended.  For the decrementer, this new mode is optional and
      controlled by a bit in the LPCR.  The hypervisor decrementer (HDEC)
      is 56 bits wide on POWER9 and has no mode control.
      
      Since KVM code reads and writes the decrementer and hypervisor
      decrementer registers in a few places, it needs to be aware of the
      need to treat the decrementer value as a 64-bit quantity, and only do
      a 32-bit sign extension when large decrementer mode is not in effect.
      Similarly, the HDEC should always be treated as a 64-bit quantity on
      POWER9.  We define a new EXTEND_HDEC macro to encapsulate the feature
      test for POWER9 and the sign extension.
      
      To enable the sign extension to be removed in large decrementer mode,
      we test the LPCR_LD bit in the host LPCR image stored in the struct
      kvm for the guest.  If is set then large decrementer mode is enabled
      and the sign extension should be skipped.
      
      This is partly based on an earlier patch by Oliver O'Halloran.
      
      Cc: stable@vger.kernel.org # v4.10+
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      2f272463
  18. 26 5月, 2017 1 次提交
  19. 12 5月, 2017 3 次提交
    • P
      KVM: PPC: Book3S PR: Don't include SPAPR TCE code on non-pseries platforms · 76d837a4
      Paul Mackerras 提交于
      Commit e91aa8e6 ("KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64
      permanently", 2017-03-22) enabled the SPAPR TCE code for all 64-bit
      Book 3S kernel configurations in order to simplify the code and
      reduce #ifdefs.  However, 64-bit Book 3S PPC platforms other than
      pseries and powernv don't implement the necessary IOMMU callbacks,
      leading to build failures like the following (for a pasemi config):
      
      scripts/kconfig/conf  --silentoldconfig Kconfig
      warning: (KVM_BOOK3S_64) selects SPAPR_TCE_IOMMU which has unmet direct dependencies (IOMMU_SUPPORT && (PPC_POWERNV || PPC_PSERIES))
      
      ...
      
        CC [M]  arch/powerpc/kvm/book3s_64_vio.o
      /home/paulus/kernel/kvm/arch/powerpc/kvm/book3s_64_vio.c: In function ‘kvmppc_clear_tce’:
      /home/paulus/kernel/kvm/arch/powerpc/kvm/book3s_64_vio.c:363:2: error: implicit declaration of function ‘iommu_tce_xchg’ [-Werror=implicit-function-declaration]
        iommu_tce_xchg(tbl, entry, &hpa, &dir);
        ^
      
      To fix this, we make the inclusion of the SPAPR TCE support, and the
      code that uses it in book3s_vio.c and book3s_vio_hv.c, depend on
      the inclusion of support for the pseries and/or powernv platforms.
      This means that when running a 'pseries' guest on those platforms,
      the guest won't have in-kernel acceleration of the PAPR TCE hypercalls,
      but at least now they compile.
      Reviewed-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      76d837a4
    • P
      KVM: PPC: Book3S PR: Check copy_to/from_user return values · 67325e98
      Paul Mackerras 提交于
      The PR KVM implementation of the PAPR HPT hypercalls (H_ENTER etc.)
      access an image of the HPT in userspace memory using copy_from_user
      and copy_to_user.  Recently, the declarations of those functions were
      annotated to indicate that the return value must be checked.  Since
      this code doesn't currently check the return value, this causes
      compile warnings like the ones shown below, and since on PPC the
      default is to compile arch/powerpc with -Werror, this causes the
      build to fail.
      
      To fix this, we check the return values, and if non-zero, fail the
      hypercall being processed with a H_FUNCTION error return value.
      There is really no good error return value to use since PAPR didn't
      envisage the possibility that the hypervisor may not be able to access
      the guest's HPT, and H_FUNCTION (function not supported) seems as
      good as any.
      
      The typical compile warnings look like this:
      
        CC      arch/powerpc/kvm/book3s_pr_papr.o
      /home/paulus/kernel/kvm/arch/powerpc/kvm/book3s_pr_papr.c: In function ‘kvmppc_h_pr_enter’:
      /home/paulus/kernel/kvm/arch/powerpc/kvm/book3s_pr_papr.c:53:2: error: ignoring return value of ‘copy_from_user’, declared with attribute warn_unused_result [-Werror=unused-result]
        copy_from_user(pteg, (void __user *)pteg_addr, sizeof(pteg));
        ^
      /home/paulus/kernel/kvm/arch/powerpc/kvm/book3s_pr_papr.c:74:2: error: ignoring return value of ‘copy_to_user’, declared with attribute warn_unused_result [-Werror=unused-result]
        copy_to_user((void __user *)pteg_addr, hpte, HPTE_SIZE);
        ^
      
      ... etc.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      67325e98
    • P
      KVM: PPC: Book3S HV: Add radix checks in real-mode hypercall handlers · acde2572
      Paul Mackerras 提交于
      POWER9 running a radix guest will take some hypervisor interrupts
      without going to real mode (turning off the MMU).  This means that
      early hypercall handlers may now be called in virtual mode.  Most of
      the handlers work just fine in both modes, but there are some that
      can crash the host if called in virtual mode, notably the TCE (IOMMU)
      hypercalls H_PUT_TCE, H_STUFF_TCE and H_PUT_TCE_INDIRECT.  These
      already have both a real-mode and a virtual-mode version, so we
      arrange for the real-mode version to return H_TOO_HARD for radix
      guests, which will result in the virtual-mode version being called.
      
      The other hypercall which is sensitive to the MMU mode is H_RANDOM.
      It doesn't have a virtual-mode version, so this adds code to enable
      it to be called in either mode.
      
      An alternative solution was considered which would refuse to call any
      of the early hypercall handlers when doing a virtual-mode exit from a
      radix guest.  However, the XICS-on-XIVE code depends on the XICS
      hypercalls being handled early even for virtual-mode exits, because
      the handlers need to be called before the XIVE vCPU state has been
      pulled off the hardware.  Therefore that solution would have become
      quite invasive and complicated, and was rejected in favour of the
      simpler, though less elegant, solution presented here.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Tested-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      acde2572
  20. 28 4月, 2017 1 次提交
    • D
      KVM: PPC: Book3S HV: Avoid preemptibility warning in module initialization · db4b0dfa
      Denis Kirjanov 提交于
      With CONFIG_DEBUG_PREEMPT, get_paca() produces the following warning
      in kvmppc_book3s_init_hv() since it calls debug_smp_processor_id().
      
      There is no real issue with the xics_phys field.
      If paca->kvm_hstate.xics_phys is non-zero on one cpu, it will be
      non-zero on them all.  Therefore this is not fixing any actual
      problem, just the warning.
      
      [  138.521188] BUG: using smp_processor_id() in preemptible [00000000] code: modprobe/5596
      [  138.521308] caller is .kvmppc_book3s_init_hv+0x184/0x350 [kvm_hv]
      [  138.521404] CPU: 5 PID: 5596 Comm: modprobe Not tainted 4.11.0-rc3-00022-gc7e790c5 #1
      [  138.521509] Call Trace:
      [  138.521563] [c0000007d018b810] [c0000000023eef10] .dump_stack+0xe4/0x150 (unreliable)
      [  138.521694] [c0000007d018b8a0] [c000000001f6ec04] .check_preemption_disabled+0x134/0x150
      [  138.521829] [c0000007d018b940] [d00000000a010274] .kvmppc_book3s_init_hv+0x184/0x350 [kvm_hv]
      [  138.521963] [c0000007d018ba00] [c00000000191d5cc] .do_one_initcall+0x5c/0x1c0
      [  138.522082] [c0000007d018bad0] [c0000000023e9494] .do_init_module+0x84/0x240
      [  138.522201] [c0000007d018bb70] [c000000001aade18] .load_module+0x1f68/0x2a10
      [  138.522319] [c0000007d018bd20] [c000000001aaeb30] .SyS_finit_module+0xc0/0xf0
      [  138.522439] [c0000007d018be30] [c00000000191baec] system_call+0x38/0xfc
      Signed-off-by: NDenis Kirjanov <kda@linux-powerpc.org>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      db4b0dfa
  21. 27 4月, 2017 2 次提交
  22. 20 4月, 2017 2 次提交
    • T
      KVM: PPC: Book3S PR: Do not fail emulation with mtspr/mfspr for unknown SPRs · feafd13c
      Thomas Huth 提交于
      According to the PowerISA 2.07, mtspr and mfspr should not always
      generate an illegal instruction exception when being used with an
      undefined SPR, but rather treat the instruction as a NOP or inject a
      privilege exception in some cases, too - depending on the SPR number.
      Also turn the printk here into a ratelimited print statement, so that
      the guest can not flood the dmesg log of the host by issueing lots of
      illegal mtspr/mfspr instruction here.
      Signed-off-by: NThomas Huth <thuth@redhat.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      feafd13c
    • A
      KVM: PPC: VFIO: Add in-kernel acceleration for VFIO · 121f80ba
      Alexey Kardashevskiy 提交于
      This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
      and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
      without passing them to user space which saves time on switching
      to user space and back.
      
      This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
      KVM tries to handle a TCE request in the real mode, if failed
      it passes the request to the virtual mode to complete the operation.
      If it a virtual mode handler fails, the request is passed to
      the user space; this is not expected to happen though.
      
      To avoid dealing with page use counters (which is tricky in real mode),
      this only accelerates SPAPR TCE IOMMU v2 clients which are required
      to pre-register the userspace memory. The very first TCE request will
      be handled in the VFIO SPAPR TCE driver anyway as the userspace view
      of the TCE table (iommu_table::it_userspace) is not allocated till
      the very first mapping happens and we cannot call vmalloc in real mode.
      
      If we fail to update a hardware IOMMU table unexpected reason, we just
      clear it and move on as there is nothing really we can do about it -
      for example, if we hot plug a VFIO device to a guest, existing TCE tables
      will be mirrored automatically to the hardware and there is no interface
      to report to the guest about possible failures.
      
      This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
      the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
      and associates a physical IOMMU table with the SPAPR TCE table (which
      is a guest view of the hardware IOMMU table). The iommu_table object
      is cached and referenced so we do not have to look up for it in real mode.
      
      This does not implement the UNSET counterpart as there is no use for it -
      once the acceleration is enabled, the existing userspace won't
      disable it unless a VFIO container is destroyed; this adds necessary
      cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
      
      This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
      space.
      
      This adds real mode version of WARN_ON_ONCE() as the generic version
      causes problems with rcu_sched. Since we testing what vmalloc_to_phys()
      returns in the code, this also adds a check for already existing
      vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
      
      This finally makes use of vfio_external_user_iommu_id() which was
      introduced quite some time ago and was considered for removal.
      
      Tests show that this patch increases transmission speed from 220MB/s
      to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
      Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: NAlex Williamson <alex.williamson@redhat.com>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      121f80ba