1. 01 3月, 2016 1 次提交
  2. 16 1月, 2016 1 次提交
    • D
      kvm: rename pfn_t to kvm_pfn_t · ba049e93
      Dan Williams 提交于
      To date, we have implemented two I/O usage models for persistent memory,
      PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
      userspace).  This series adds a third, DAX-GUP, that allows DAX mappings
      to be the target of direct-i/o.  It allows userspace to coordinate
      DMA/RDMA from/to persistent memory.
      
      The implementation leverages the ZONE_DEVICE mm-zone that went into
      4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
      and dynamically mapped by a device driver.  The pmem driver, after
      mapping a persistent memory range into the system memmap via
      devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
      page-backed pmem-pfns via flags in the new pfn_t type.
      
      The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
      resulting pte(s) inserted into the process page tables with a new
      _PAGE_DEVMAP flag.  Later, when get_user_pages() is walking ptes it keys
      off _PAGE_DEVMAP to pin the device hosting the page range active.
      Finally, get_page() and put_page() are modified to take references
      against the device driver established page mapping.
      
      Finally, this need for "struct page" for persistent memory requires
      memory capacity to store the memmap array.  Given the memmap array for a
      large pool of persistent may exhaust available DRAM introduce a
      mechanism to allocate the memmap from persistent memory.  The new
      "struct vmem_altmap *" parameter to devm_memremap_pages() enables
      arch_add_memory() to use reserved pmem capacity rather than the page
      allocator.
      
      This patch (of 18):
      
      The core has developed a need for a "pfn_t" type [1].  Move the existing
      pfn_t in KVM to kvm_pfn_t [2].
      
      [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
      [2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.htmlSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NChristoffer Dall <christoffer.dall@linaro.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba049e93
  3. 14 1月, 2016 1 次提交
  4. 10 12月, 2015 1 次提交
    • P
      KVM: PPC: Book3S HV: Prohibit setting illegal transaction state in MSR · c20875a3
      Paul Mackerras 提交于
      Currently it is possible for userspace (e.g. QEMU) to set a value
      for the MSR for a guest VCPU which has both of the TS bits set,
      which is an illegal combination.  The result of this is that when
      we execute a hrfid (hypervisor return from interrupt doubleword)
      instruction to enter the guest, the CPU will take a TM Bad Thing
      type of program interrupt (vector 0x700).
      
      Now, if PR KVM is configured in the kernel along with HV KVM, we
      actually handle this without crashing the host or giving hypervisor
      privilege to the guest; instead what happens is that we deliver a
      program interrupt to the guest, with SRR0 reflecting the address
      of the hrfid instruction and SRR1 containing the MSR value at that
      point.  If PR KVM is not configured in the kernel, then we try to
      run the host's program interrupt handler with the MMU set to the
      guest context, which almost certainly causes a host crash.
      
      This closes the hole by making kvmppc_set_msr_hv() check for the
      illegal combination and force the TS field to a safe value (00,
      meaning non-transactional).
      
      Cc: stable@vger.kernel.org # v3.9+
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      c20875a3
  5. 09 12月, 2015 3 次提交
    • G
      KVM: PPC: Book3S PR: Remove unused variable 'vcpu_book3s' · edfaff26
      Geyslan G. Bem 提交于
      The vcpu_book3s variable is assigned but never used. So remove it.
      Found using cppcheck.
      Signed-off-by: NGeyslan G. Bem <geyslan@gmail.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      edfaff26
    • T
      KVM: PPC: Fix emulation of H_SET_DABR/X on POWER8 · 760a7364
      Thomas Huth 提交于
      In the old DABR register, the BT (Breakpoint Translation) bit
      is bit number 61. In the new DAWRX register, the WT (Watchpoint
      Translation) bit is bit number 59. So to move the DABR-BT bit
      into the position of the DAWRX-WT bit, it has to be shifted by
      two, not only by one. This fixes hardware watchpoints in gdb of
      older guests that only use the H_SET_DABR/X interface instead
      of the new H_SET_MODE interface.
      
      Cc: stable@vger.kernel.org # v3.14+
      Signed-off-by: NThomas Huth <thuth@redhat.com>
      Reviewed-by: NLaurent Vivier <lvivier@redhat.com>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      760a7364
    • P
      KVM: PPC: Book3S HV: Handle unexpected traps in guest entry/exit code better · 1c9e3d51
      Paul Mackerras 提交于
      As we saw with the TM Bad Thing type of program interrupt occurring
      on the hrfid that enters the guest, it is not completely impossible
      to have a trap occurring in the guest entry/exit code, despite the
      fact that the code has been written to avoid taking any traps.
      
      This adds a check in the kvmppc_handle_exit_hv() function to detect
      the case when a trap has occurred in the hypervisor-mode code, and
      instead of treating it just like a trap in guest code, we now print
      a message and return to userspace with a KVM_EXIT_INTERNAL_ERROR
      exit reason.
      
      Of the various interrupts that get handled in the assembly code in
      the guest exit path and that can return directly to the guest, the
      only one that can occur when MSR.HV=1 and MSR.EE=0 is machine check
      (other than system call, which we can avoid just by not doing a sc
      instruction).  Therefore this adds code to the machine check path to
      ensure that if the MCE occurred in hypervisor mode, we exit to the
      host rather than trying to continue the guest.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      1c9e3d51
  6. 02 12月, 2015 2 次提交
  7. 01 12月, 2015 1 次提交
  8. 30 11月, 2015 1 次提交
  9. 26 11月, 2015 1 次提交
  10. 06 11月, 2015 2 次提交
    • P
      KVM: PPC: Book3S HV: Don't dynamically split core when already split · f74f2e2e
      Paul Mackerras 提交于
      In static micro-threading modes, the dynamic micro-threading code
      is supposed to be disabled, because subcores can't make independent
      decisions about what micro-threading mode to put the core in - there is
      only one micro-threading mode for the whole core.  The code that
      implements dynamic micro-threading checks for this, except that the
      check was missed in one case.  This means that it is possible for a
      subcore in static 2-way micro-threading mode to try to put the core
      into 4-way micro-threading mode, which usually leads to stuck CPUs,
      spinlock lockups, and other stalls in the host.
      
      The problem was in the can_split_piggybacked_subcores() function, which
      should always return false if the system is in a static micro-threading
      mode.  This fixes the problem by making can_split_piggybacked_subcores()
      use subcore_config_ok() for its checks, as subcore_config_ok() includes
      the necessary check for the static micro-threading modes.
      
      Credit to Gautham Shenoy for working out that the reason for the hangs
      and stalls we were seeing was that we were trying to do dynamic 4-way
      micro-threading while we were in static 2-way mode.
      
      Fixes: b4deba5c
      Cc: vger@stable.kernel.org # v4.3
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      f74f2e2e
    • P
      KVM: PPC: Book3S HV: Synthesize segment fault if SLB lookup fails · cf29b215
      Paul Mackerras 提交于
      When handling a hypervisor data or instruction storage interrupt (HDSI
      or HISI), we look up the SLB entry for the address being accessed in
      order to translate the effective address to a virtual address which can
      be looked up in the guest HPT.  This lookup can occasionally fail due
      to the guest replacing an SLB entry without invalidating the evicted
      SLB entry.  In this situation an ERAT (effective to real address
      translation cache) entry can persist and be used by the hardware even
      though there is no longer a corresponding SLB entry.
      
      Previously we would just deliver a data or instruction storage interrupt
      (DSI or ISI) to the guest in this case.  However, this is not correct
      and has been observed to cause guests to crash, typically with a
      data storage protection interrupt on a store to the vmemmap area.
      
      Instead, what we do now is to synthesize a data or instruction segment
      interrupt.  That should cause the guest to reload an appropriate entry
      into the SLB and retry the faulting instruction.  If it still faults,
      we should find an appropriate SLB entry next time and be able to handle
      the fault.
      Tested-by: NThomas Huth <thuth@redhat.com>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      cf29b215
  11. 21 10月, 2015 5 次提交
    • P
      powerpc: Revert "Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8" · 23316316
      Paul Mackerras 提交于
      This reverts commit 9678cdaa ("Use the POWER8 Micro Partition
      Prefetch Engine in KVM HV on POWER8") because the original commit had
      multiple, partly self-cancelling bugs, that could cause occasional
      memory corruption.
      
      In fact the logmpp instruction was incorrectly using register r0 as the
      source of the buffer address and operation code, and depending on what
      was in r0, it would either do nothing or corrupt the 64k page pointed to
      by r0.
      
      The logmpp instruction encoding and the operation code definitions could
      be corrected, but then there is the problem that there is no clearly
      defined way to know when the hardware has finished writing to the
      buffer.
      
      The original commit attempted to work around this by aborting the
      write-out before starting the prefetch, but this is ineffective in the
      case where the virtual core is now executing on a different physical
      core from the one where the write-out was initiated.
      
      These problems plus advice from the hardware designers not to use the
      function (since the measured performance improvement from using the
      feature was actually mostly negative), mean that reverting the code is
      the best option.
      
      Fixes: 9678cdaa ("Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8")
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      23316316
    • G
      KVM: PPC: Book3S HV: Handle H_DOORBELL on the guest exit path · 70aa3961
      Gautham R. Shenoy 提交于
      Currently a CPU running a guest can receive a H_DOORBELL in the
      following two cases:
      1) When the CPU is napping due to CEDE or there not being a guest
      vcpu.
      2) The CPU is running the guest vcpu.
      
      Case 1), the doorbell message is not cleared since we were waking up
      from nap. Hence when the EE bit gets set on transition from guest to
      host, the H_DOORBELL interrupt is delivered to the host and the
      corresponding handler is invoked.
      
      However in Case 2), the message gets cleared by the action of taking
      the H_DOORBELL interrupt. Since the CPU was running a guest, instead
      of invoking the doorbell handler, the code invokes the second-level
      interrupt handler to switch the context from the guest to the host. At
      this point the setting of the EE bit doesn't result in the CPU getting
      the doorbell interrupt since it has already been delivered once. So,
      the handler for this doorbell is never invoked!
      
      This causes softlockups if the missed DOORBELL was an IPI sent from a
      sibling subcore on the same CPU.
      
      This patch fixes it by explitly invoking the doorbell handler on the
      exit path if the exit reason is H_DOORBELL similar to the way an
      EXTERNAL interrupt is handled. Since this will also handle Case 1), we
      can unconditionally clear the doorbell message in
      kvmppc_check_wake_reason.
      Signed-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      70aa3961
    • N
      KVM: PPC: Implement extension to report number of memslots · bfec5c2c
      Nikunj A Dadhania 提交于
      QEMU assumes 32 memslots if this extension is not implemented. Although,
      current value of KVM_USER_MEM_SLOTS is 32, once KVM_USER_MEM_SLOTS
      changes QEMU would take a wrong value.
      Signed-off-by: NNikunj A Dadhania <nikunj@linux.vnet.ibm.com>
      Reviewed-by: NThomas Huth <thuth@redhat.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      bfec5c2c
    • P
      KVM: PPC: Book3S HV: Make H_REMOVE return correct HPTE value for absent HPTEs · c64dfe2a
      Paul Mackerras 提交于
      This fixes a bug where the old HPTE value returned by H_REMOVE has
      the valid bit clear if the HPTE was an absent HPTE, as happens for
      HPTEs for emulated MMIO pages and for RAM pages that have been paged
      out by the host.  If the absent bit is set, we clear it and set the
      valid bit, because from the guest's point of view, the HPTE is valid.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      c64dfe2a
    • P
      KVM: PPC: Book3S HV: Don't fall back to smaller HPT size in allocation ioctl · 572abd56
      Paul Mackerras 提交于
      Currently the KVM_PPC_ALLOCATE_HTAB will try to allocate the requested
      size of HPT, and if that is not possible, then try to allocate smaller
      sizes (by factors of 2) until either a minimum is reached or the
      allocation succeeds.  This is not ideal for userspace, particularly in
      migration scenarios, where the destination VM really does require the
      size requested.  Also, the minimum HPT size of 256kB may be
      insufficient for the guest to run successfully.
      
      This removes the fallback to smaller sizes on allocation failure for
      the KVM_PPC_ALLOCATE_HTAB ioctl.  The fallback still exists for the
      case where the HPT is allocated at the time the first VCPU is run, if
      no HPT has been allocated by ioctl by that time.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      572abd56
  12. 16 10月, 2015 1 次提交
    • M
      KVM: PPC: Book3S HV: Deliver machine check with MSR(RI=0) to guest as MCE · 966d713e
      Mahesh Salgaonkar 提交于
      For the machine check interrupt that happens while we are in the guest,
      kvm layer attempts the recovery, and then delivers the machine check interrupt
      directly to the guest if recovery fails. On successful recovery we go back to
      normal functioning of the guest. But there can be cases where a machine check
      interrupt can happen with MSR(RI=0) while we are in the guest. This means
      MC interrupt is unrecoverable and we have to deliver a machine check to the
      guest since the machine check interrupt might have trashed valid values in
      SRR0/1. The current implementation do not handle this case, causing guest
      to crash with Bad kernel stack pointer instead of machine check oops message.
      
      [26281.490060] Bad kernel stack pointer 3fff9ccce5b0 at c00000000000490c
      [26281.490434] Oops: Bad kernel stack pointer, sig: 6 [#1]
      [26281.490472] SMP NR_CPUS=2048 NUMA pSeries
      
      This patch fixes this issue by checking MSR(RI=0) in KVM layer and forwarding
      unrecoverable interrupt to guest which then panics with proper machine check
      Oops message.
      Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Acked-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      966d713e
  13. 15 10月, 2015 3 次提交
  14. 12 10月, 2015 1 次提交
    • A
      powerpc/mm: Differentiate between hugetlb and THP during page walk · 891121e6
      Aneesh Kumar K.V 提交于
      We need to properly identify whether a hugepage is an explicit or
      a transparent hugepage in follow_huge_addr(). We used to depend
      on hugepage shift argument to do that. But in some case that can
      result in wrong results. For ex:
      
      On finding a transparent hugepage we set hugepage shift to PMD_SHIFT.
      But we can end up clearing the thp pte, via pmdp_huge_get_and_clear.
      We do prevent reusing the pfn page via the usage of
      kick_all_cpus_sync(). But that happens after we updated the pte to 0.
      Hence in follow_huge_addr() we can find hugepage shift set, but transparent
      huge page check fail for a thp pte.
      
      NOTE: We fixed a variant of this race against thp split in commit
      691e95fd
      ("powerpc/mm/thp: Make page table walk safe against thp split/collapse")
      
      Without this patch, we may hit the BUG_ON(flags & FOLL_GET) in
      follow_page_mask occasionally.
      
      In the long term, we may want to switch ppc64 64k page size config to
      enable CONFIG_ARCH_WANT_GENERAL_HUGETLB
      Reported-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      891121e6
  15. 21 9月, 2015 3 次提交
    • T
      KVM: PPC: Book3S: Take the kvm->srcu lock in kvmppc_h_logical_ci_load/store() · 3eb4ee68
      Thomas Huth 提交于
      Access to the kvm->buses (like with the kvm_io_bus_read() and -write()
      functions) has to be protected via the kvm->srcu lock.
      The kvmppc_h_logical_ci_load() and -store() functions are missing
      this lock so far, so let's add it there, too.
      This fixes the problem that the kernel reports "suspicious RCU usage"
      when lock debugging is enabled.
      
      Cc: stable@vger.kernel.org # v4.1+
      Fixes: 99342cf8Signed-off-by: NThomas Huth <thuth@redhat.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      3eb4ee68
    • G
      KVM: PPC: Book3S HV: Pass the correct trap argument to kvmhv_commence_exit · 7e022e71
      Gautham R. Shenoy 提交于
      In guest_exit_cont we call kvmhv_commence_exit which expects the trap
      number as the argument. However r3 doesn't contain the trap number at
      this point and as a result we would be calling the function with a
      spurious trap number.
      
      Fix this by copying r12 into r3 before calling kvmhv_commence_exit as
      r12 contains the trap number.
      
      Cc: stable@vger.kernel.org # v4.1+
      Fixes: eddb60fbSigned-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      7e022e71
    • P
      KVM: PPC: Book3S HV: Fix handling of interrupted VCPUs · 5fc3e64f
      Paul Mackerras 提交于
      This fixes a bug which results in stale vcore pointers being left in
      the per-cpu preempted vcore lists when a VM is destroyed.  The result
      of the stale vcore pointers is usually either a crash or a lockup
      inside collect_piggybacks() when another VM is run.  A typical
      lockup message looks like:
      
      [  472.161074] NMI watchdog: BUG: soft lockup - CPU#24 stuck for 22s! [qemu-system-ppc:7039]
      [  472.161204] Modules linked in: kvm_hv kvm_pr kvm xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ses enclosure shpchp rtc_opal i2c_opal powernv_rng binfmt_misc dm_service_time scsi_dh_alua radeon i2c_algo_bit drm_kms_helper ttm drm tg3 ptp pps_core cxgb3 ipr i2c_core mdio dm_multipath [last unloaded: kvm_hv]
      [  472.162111] CPU: 24 PID: 7039 Comm: qemu-system-ppc Not tainted 4.2.0-kvm+ #49
      [  472.162187] task: c000001e38512750 ti: c000001e41bfc000 task.ti: c000001e41bfc000
      [  472.162262] NIP: c00000000096b094 LR: c00000000096b08c CTR: c000000000111130
      [  472.162337] REGS: c000001e41bff520 TRAP: 0901   Not tainted  (4.2.0-kvm+)
      [  472.162399] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24848844  XER: 00000000
      [  472.162588] CFAR: c00000000096b0ac SOFTE: 1
      GPR00: c000000000111170 c000001e41bff7a0 c00000000127df00 0000000000000001
      GPR04: 0000000000000003 0000000000000001 0000000000000000 0000000000874821
      GPR08: c000001e41bff8e0 0000000000000001 0000000000000000 d00000000efde740
      GPR12: c000000000111130 c00000000fdae400
      [  472.163053] NIP [c00000000096b094] _raw_spin_lock_irqsave+0xa4/0x130
      [  472.163117] LR [c00000000096b08c] _raw_spin_lock_irqsave+0x9c/0x130
      [  472.163179] Call Trace:
      [  472.163206] [c000001e41bff7a0] [c000001e41bff7f0] 0xc000001e41bff7f0 (unreliable)
      [  472.163295] [c000001e41bff7e0] [c000000000111170] __wake_up+0x40/0x90
      [  472.163375] [c000001e41bff830] [d00000000efd6fc0] kvmppc_run_core+0x1240/0x1950 [kvm_hv]
      [  472.163465] [c000001e41bffa30] [d00000000efd8510] kvmppc_vcpu_run_hv+0x5a0/0xd90 [kvm_hv]
      [  472.163559] [c000001e41bffb70] [d00000000e9318a4] kvmppc_vcpu_run+0x44/0x60 [kvm]
      [  472.163653] [c000001e41bffba0] [d00000000e92e674] kvm_arch_vcpu_ioctl_run+0x64/0x170 [kvm]
      [  472.163745] [c000001e41bffbe0] [d00000000e9263a8] kvm_vcpu_ioctl+0x538/0x7b0 [kvm]
      [  472.163834] [c000001e41bffd40] [c0000000002d0f50] do_vfs_ioctl+0x480/0x7c0
      [  472.163910] [c000001e41bffde0] [c0000000002d1364] SyS_ioctl+0xd4/0xf0
      [  472.163986] [c000001e41bffe30] [c000000000009260] system_call+0x38/0xd0
      [  472.164060] Instruction dump:
      [  472.164098] ebc1fff0 ebe1fff8 7c0803a6 4e800020 60000000 60000000 60420000 8bad02e2
      [  472.164224] 7fc3f378 4b6a57c1 60000000 7c210b78 <e92d0000> 89290009 792affe3 40820070
      
      The bug is that kvmppc_run_vcpu does not correctly handle the case
      where a vcpu task receives a signal while its guest vcpu is executing
      in the guest as a result of being piggy-backed onto the execution of
      another vcore.  In that case we need to wait for the vcpu to finish
      executing inside the guest, and then remove this vcore from the
      preempted vcores list.  That way, we avoid leaving this vcpu's vcore
      on the preempted vcores list when the vcpu gets interrupted.
      
      Fixes: ec257165Reported-by: NThomas Huth <thuth@redhat.com>
      Tested-by: NThomas Huth <thuth@redhat.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      5fc3e64f
  16. 16 9月, 2015 1 次提交
    • P
      KVM: add halt_attempted_poll to VCPU stats · 62bea5bf
      Paolo Bonzini 提交于
      This new statistic can help diagnosing VCPUs that, for any reason,
      trigger bad behavior of halt_poll_ns autotuning.
      
      For example, say halt_poll_ns = 480000, and wakeups are spaced exactly
      like 479us, 481us, 479us, 481us. Then KVM always fails polling and wastes
      10+20+40+80+160+320+480 = 1110 microseconds out of every
      479+481+479+481+479+481+479 = 3359 microseconds. The VCPU then
      is consuming about 30% more CPU than it would use without
      polling.  This would show as an abnormally high number of
      attempted polling compared to the successful polls.
      
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com<
      Reviewed-by: NDavid Matlack <dmatlack@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      62bea5bf
  17. 04 9月, 2015 1 次提交
  18. 03 9月, 2015 2 次提交
    • G
      KVM: PPC: Book3S HV: Exit on H_DOORBELL if HOST_IPI is set · 06554d9f
      Gautham R. Shenoy 提交于
      The code that handles the case when we receive a H_DOORBELL interrupt
      has a comment which says "Hypervisor doorbell - exit only if host IPI
      flag set".  However, the current code does not actually check if the
      host IPI flag is set.  This is due to a comparison instruction that
      got missed.
      
      As a result, the current code performs the exit to host only
      if some sibling thread or a sibling sub-core is exiting to the
      host.  This implies that, an IPI sent to a sibling core in
      (subcores-per-core != 1) mode will be missed by the host unless the
      sibling core is on the exit path to the host.
      
      This patch adds the missing comparison operation which will ensure
      that when HOST_IPI flag is set, we unconditionally exit to the host.
      
      Fixes: 66feed61
      Cc: stable@vger.kernel.org # v4.1+
      Signed-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      06554d9f
    • G
      KVM: PPC: Book3S HV: Fix race in starting secondary threads · 7f235328
      Gautham R. Shenoy 提交于
      The current dynamic micro-threading code has a race due to which a
      secondary thread naps when it is supposed to be running a vcpu. As a
      side effect of this, on a guest exit, the primary thread in
      kvmppc_wait_for_nap() finds that this secondary thread hasn't cleared
      its vcore pointer. This results in "CPU X seems to be stuck!"
      warnings.
      
      The race is possible since the primary thread on exiting the guests
      only waits for all the secondaries to clear its vcore pointer. It
      subsequently expects the secondary threads to enter nap while it
      unsplits the core. A secondary thread which hasn't yet entered the nap
      will loop in kvm_no_guest until its vcore pointer and the do_nap flag
      are unset. Once the core has been unsplit, a new vcpu thread can grab
      the core and set the do_nap flag *before* setting the vcore pointers
      of the secondary. As a result, the secondary thread will now enter nap
      via kvm_unsplit_nap instead of running the guest vcpu.
      
      Fix this by setting the do_nap flag after setting the vcore pointer in
      the PACA of the secondary in kvmppc_run_core. Also, ensure that a
      secondary thread doesn't nap in kvm_unsplit_nap when the vcore pointer
      in its PACA struct is set.
      
      Fixes: b4deba5cSigned-off-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      7f235328
  19. 22 8月, 2015 9 次提交
    • S
      KVM: PPC: Book3S: correct width in XER handling · c63517c2
      Sam bobroff 提交于
      In 64 bit kernels, the Fixed Point Exception Register (XER) is a 64
      bit field (e.g. in kvm_regs and kvm_vcpu_arch) and in most places it is
      accessed as such.
      
      This patch corrects places where it is accessed as a 32 bit field by a
      64 bit kernel.  In some cases this is via a 32 bit load or store
      instruction which, depending on endianness, will cause either the
      lower or upper 32 bits to be missed.  In another case it is cast as a
      u32, causing the upper 32 bits to be cleared.
      
      This patch corrects those places by extending the access methods to
      64 bits.
      Signed-off-by: NSam Bobroff <sam.bobroff@au1.ibm.com>
      Reviewed-by: NLaurent Vivier <lvivier@redhat.com>
      Reviewed-by: NThomas Huth <thuth@redhat.com>
      Tested-by: NThomas Huth <thuth@redhat.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      c63517c2
    • P
      KVM: PPC: Book3S HV: Fix preempted vcore stolen time calculation · 563a1e93
      Paul Mackerras 提交于
      Whenever a vcore state is VCORE_PREEMPT we need to be counting stolen
      time for it.  This currently isn't the case when we have a vcore that
      no longer has any runnable threads in it but still has a runner task,
      so we do an explicit call to kvmppc_core_start_stolen() in that case.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      563a1e93
    • P
      KVM: PPC: Book3S HV: Fix preempted vcore list locking · 402813fe
      Paul Mackerras 提交于
      When a vcore gets preempted, we put it on the preempted vcore list for
      the current CPU.  The runner task then calls schedule() and comes back
      some time later and takes itself off the list.  We need to be careful
      to lock the list that it was put onto, which may not be the list for the
      current CPU since the runner task may have moved to another CPU.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      402813fe
    • P
      KVM: PPC: Book3S HV: Implement H_CLEAR_REF and H_CLEAR_MOD · cdeee518
      Paul Mackerras 提交于
      This adds implementations for the H_CLEAR_REF (test and clear reference
      bit) and H_CLEAR_MOD (test and clear changed bit) hypercalls.
      
      When clearing the reference or change bit in the guest view of the HPTE,
      we also have to clear it in the real HPTE so that we can detect future
      references or changes.  When we do so, we transfer the R or C bit value
      to the rmap entry for the underlying host page so that kvm_age_hva_hv(),
      kvm_test_age_hva_hv() and kvmppc_hv_get_dirty_log() know that the page
      has been referenced and/or changed.
      
      These hypercalls are not used by Linux guests.  These implementations
      have been tested using a FreeBSD guest.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      cdeee518
    • P
      KVM: PPC: Book3S HV: Fix bug in dirty page tracking · 08fe1e7b
      Paul Mackerras 提交于
      This fixes a bug in the tracking of pages that get modified by the
      guest.  If the guest creates a large-page HPTE, writes to memory
      somewhere within the large page, and then removes the HPTE, we only
      record the modified state for the first normal page within the large
      page, when in fact the guest might have modified some other normal
      page within the large page.
      
      To fix this we use some unused bits in the rmap entry to record the
      order (log base 2) of the size of the page that was modified, when
      removing an HPTE.  Then in kvm_test_clear_dirty_npages() we use that
      order to return the correct number of modified pages.
      
      The same thing could in principle happen when removing a HPTE at the
      host's request, i.e. when paging out a page, except that we never
      page out large pages, and the guest can only create large-page HPTEs
      if the guest RAM is backed by large pages.  However, we also fix
      this case for the sake of future-proofing.
      
      The reference bit is also subject to the same loss of information.  We
      don't make the same fix here for the reference bit because there isn't
      an interface for userspace to find out which pages the guest has
      referenced, whereas there is one for userspace to find out which pages
      the guest has modified.  Because of this loss of information, the
      kvm_age_hva_hv() and kvm_test_age_hva_hv() functions might incorrectly
      say that a page has not been referenced when it has, but that doesn't
      matter greatly because we never page or swap out large pages.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      08fe1e7b
    • P
      KVM: PPC: Book3S HV: Fix race in reading change bit when removing HPTE · 1e5bf454
      Paul Mackerras 提交于
      The reference (R) and change (C) bits in a HPT entry can be set by
      hardware at any time up until the HPTE is invalidated and the TLB
      invalidation sequence has completed.  This means that when removing
      a HPTE, we need to read the HPTE after the invalidation sequence has
      completed in order to obtain reliable values of R and C.  The code
      in kvmppc_do_h_remove() used to do this.  However, commit 6f22bd32
      ("KVM: PPC: Book3S HV: Make HTAB code LE host aware") removed the
      read after invalidation as a side effect of other changes.  This
      restores the read of the HPTE after invalidation.
      
      The user-visible effect of this bug would be that when migrating a
      guest, there is a small probability that a page modified by the guest
      and then unmapped by the guest might not get re-transmitted and thus
      the destination might end up with a stale copy of the page.
      
      Fixes: 6f22bd32Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      1e5bf454
    • P
      KVM: PPC: Book3S HV: Implement dynamic micro-threading on POWER8 · b4deba5c
      Paul Mackerras 提交于
      This builds on the ability to run more than one vcore on a physical
      core by using the micro-threading (split-core) modes of the POWER8
      chip.  Previously, only vcores from the same VM could be run together,
      and (on POWER8) only if they had just one thread per core.  With the
      ability to split the core on guest entry and unsplit it on guest exit,
      we can run up to 8 vcpu threads from up to 4 different VMs, and we can
      run multiple vcores with 2 or 4 vcpus per vcore.
      
      Dynamic micro-threading is only available if the static configuration
      of the cores is whole-core mode (unsplit), and only on POWER8.
      
      To manage this, we introduce a new kvm_split_mode struct which is
      shared across all of the subcores in the core, with a pointer in the
      paca on each thread.  In addition we extend the core_info struct to
      have information on each subcore.  When deciding whether to add a
      vcore to the set already on the core, we now have two possibilities:
      (a) piggyback the vcore onto an existing subcore, or (b) start a new
      subcore.
      
      Currently, when any vcpu needs to exit the guest and switch to host
      virtual mode, we interrupt all the threads in all subcores and switch
      the core back to whole-core mode.  It may be possible in future to
      allow some of the subcores to keep executing in the guest while
      subcore 0 switches to the host, but that is not implemented in this
      patch.
      
      This adds a module parameter called dynamic_mt_modes which controls
      which micro-threading (split-core) modes the code will consider, as a
      bitmap.  In other words, if it is 0, no micro-threading mode is
      considered; if it is 2, only 2-way micro-threading is considered; if
      it is 4, only 4-way, and if it is 6, both 2-way and 4-way
      micro-threading mode will be considered.  The default is 6.
      
      With this, we now have secondary threads which are the primary thread
      for their subcore and therefore need to do the MMU switch.  These
      threads will need to be started even if they have no vcpu to run, so
      we use the vcore pointer in the PACA rather than the vcpu pointer to
      trigger them.
      
      It is now possible for thread 0 to find that an exit has been
      requested before it gets to switch the subcore state to the guest.  In
      that case we haven't added the guest's timebase offset to the
      timebase, so we need to be careful not to subtract the offset in the
      guest exit path.  In fact we just skip the whole path that switches
      back to host context, since we haven't switched to the guest context.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      b4deba5c
    • P
      KVM: PPC: Book3S HV: Make use of unused threads when running guests · ec257165
      Paul Mackerras 提交于
      When running a virtual core of a guest that is configured with fewer
      threads per core than the physical cores have, the extra physical
      threads are currently unused.  This makes it possible to use them to
      run one or more other virtual cores from the same guest when certain
      conditions are met.  This applies on POWER7, and on POWER8 to guests
      with one thread per virtual core.  (It doesn't apply to POWER8 guests
      with multiple threads per vcore because they require a 1-1 virtual to
      physical thread mapping in order to be able to use msgsndp and the
      TIR.)
      
      The idea is that we maintain a list of preempted vcores for each
      physical cpu (i.e. each core, since the host runs single-threaded).
      Then, when a vcore is about to run, it checks to see if there are
      any vcores on the list for its physical cpu that could be
      piggybacked onto this vcore's execution.  If so, those additional
      vcores are put into state VCORE_PIGGYBACK and their runnable VCPU
      threads are started as well as the original vcore, which is called
      the master vcore.
      
      After the vcores have exited the guest, the extra ones are put back
      onto the preempted list if any of their VCPUs are still runnable and
      not idle.
      
      This means that vcpu->arch.ptid is no longer necessarily the same as
      the physical thread that the vcpu runs on.  In order to make it easier
      for code that wants to send an IPI to know which CPU to target, we
      now store that in a new field in struct vcpu_arch, called thread_cpu.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Tested-by: NLaurent Vivier <lvivier@redhat.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      ec257165
    • T
      KVM: PPC: add missing pt_regs initialization · 845ac985
      Tudor Laurentiu 提交于
      On this switch branch the regs initialization
      doesn't happen so add it.
      This was found with the help of a static
      code analysis tool.
      Signed-off-by: NLaurentiu Tudor <Laurentiu.Tudor@freescale.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      845ac985