1. 19 5月, 2016 3 次提交
  2. 13 5月, 2016 3 次提交
    • C
      KVM: s390: set halt polling to 80 microseconds · c4a8de35
      Christian Borntraeger 提交于
      on s390 we disabled the halt polling with commit 920552b2
      ("KVM: disable halt_poll_ns as default for s390x"), as floating
      interrupts would let all CPUs have a successful poll, resulting
      in much higher CPU usage (on otherwise idle systems).
      
      With the improved selection of polls we can now retry halt polling.
      Performance measurements with different choices like 25,50,80,100,200
      microseconds showed that 80 microseconds seems to improve several cases
      without increasing the CPU costs too much. Higher values would improve
      the performance even more but increased the cpu time as well.
      So let's start small and use this value of 80 microseconds on s390 until
      we have a better understanding of cost/benefit of higher values.
      Acked-by: NCornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c4a8de35
    • C
      KVM: halt_polling: provide a way to qualify wakeups during poll · 3491caf2
      Christian Borntraeger 提交于
      Some wakeups should not be considered a sucessful poll. For example on
      s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
      would be considered runnable - letting all vCPUs poll all the time for
      transactional like workload, even if one vCPU would be enough.
      This can result in huge CPU usage for large guests.
      This patch lets architectures provide a way to qualify wakeups if they
      should be considered a good/bad wakeups in regard to polls.
      
      For s390 the implementation will fence of halt polling for anything but
      known good, single vCPU events. The s390 implementation for floating
      interrupts does a wakeup for one vCPU, but the interrupt will be delivered
      by whatever CPU checks first for a pending interrupt. We prefer the
      woken up CPU by marking the poll of this CPU as "good" poll.
      This code will also mark several other wakeup reasons like IPI or
      expired timers as "good". This will of course also mark some events as
      not sucessful. As  KVM on z runs always as a 2nd level hypervisor,
      we prefer to not poll, unless we are really sure, though.
      
      This patch successfully limits the CPU usage for cases like uperf 1byte
      transactional ping pong workload or wakeup heavy workload like OLTP
      while still providing a proper speedup.
      
      This also introduced a new vcpu stat "halt_poll_no_tuning" that marks
      wakeups that are considered not good for polling.
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
      Cc: David Matlack <dmatlack@google.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      [Rename config symbol. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3491caf2
    • P
      Merge branch 'kvm-ppc-next' of... · d7e1633a
      Paolo Bonzini 提交于
      Merge branch 'kvm-ppc-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD
      d7e1633a
  3. 12 5月, 2016 6 次提交
    • P
      KVM: PPC: Book3S HV: Re-enable XICS fast path for irqfd-generated interrupts · b1a4286b
      Paul Mackerras 提交于
      Commit c9a5ecca ("kvm/eventfd: add arch-specific set_irq",
      2015-10-16) added the possibility for architecture-specific code
      to handle the generation of virtual interrupts in atomic context
      where possible, without having to schedule a work function.
      
      Since we can easily generate virtual interrupts on XICS without
      having to do anything worse than take a spinlock, we define a
      kvm_arch_set_irq_inatomic() for XICS.  We also remove kvm_set_msi()
      since it is not used any more.
      
      The one slightly tricky thing is that with the new interface, we
      don't get told whether the interrupt is an MSI (or other edge
      sensitive interrupt) vs. level-sensitive.  The difference as far
      as interrupt generation is concerned is that for LSIs we have to
      set the asserted flag so it will continue to fire until it is
      explicitly cleared.
      
      In fact the XICS code gets told which interrupts are LSIs by userspace
      when it configures the interrupt via the KVM_DEV_XICS_GRP_SOURCES
      attribute group on the XICS device.  To store this information, we add
      a new "lsi" field to struct ics_irq_state.  With that we can also do a
      better job of returning accurate values when reading the attribute
      group.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      b1a4286b
    • A
      kvm: Conditionally register IRQ bypass consumer · 14717e20
      Alex Williamson 提交于
      If we don't support a mechanism for bypassing IRQs, don't register as
      a consumer.  This eliminates meaningless dev_info()s when the connect
      fails between producer and consumer, such as on AMD systems where
      kvm_x86_ops->update_pi_irte is not implemented
      Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      14717e20
    • A
      irqbypass: Disallow NULL token · b52f3ed0
      Alex Williamson 提交于
      A NULL token is meaningless and can only lead to unintended problems.
      Error on registration with a NULL token, ignore de-registrations with
      a NULL token.
      Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b52f3ed0
    • G
      kvm: introduce KVM_MAX_VCPU_ID · 0b1b1dfd
      Greg Kurz 提交于
      The KVM_MAX_VCPUS define provides the maximum number of vCPUs per guest, and
      also the upper limit for vCPU ids. This is okay for all archs except PowerPC
      which can have higher ids, depending on the cpu/core/thread topology. In the
      worst case (single threaded guest, host with 8 threads per core), it limits
      the maximum number of vCPUS to KVM_MAX_VCPUS / 8.
      
      This patch separates the vCPU numbering from the total number of vCPUs, with
      the introduction of KVM_MAX_VCPU_ID, as the maximal valid value for vCPU ids
      plus one.
      
      The corresponding KVM_CAP_MAX_VCPU_ID allows userspace to validate vCPU ids
      before passing them to KVM_CREATE_VCPU.
      
      This patch only implements KVM_MAX_VCPU_ID with a specific value for PowerPC.
      Other archs continue to return KVM_MAX_VCPUS instead.
      Suggested-by: NRadim Krcmar <rkrcmar@redhat.com>
      Signed-off-by: NGreg Kurz <gkurz@linux.vnet.ibm.com>
      Reviewed-by: NCornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      0b1b1dfd
    • G
      KVM: remove NULL return path for vcpu ids >= KVM_MAX_VCPUS · 9b9e3fc4
      Greg Kurz 提交于
      Commit c896939f ("KVM: use heuristic for fast VCPU lookup by id") added
      a return path that prevents vcpu ids to exceed KVM_MAX_VCPUS. This is a
      problem for powerpc where vcpu ids can grow up to 8*KVM_MAX_VCPUS.
      
      This patch simply reverses the logic so that we only try fast path if the
      vcpu id can be tried as an index in kvm->vcpus[]. The slow path is not
      affected by the change.
      Reviewed-by: NDavid Hildenbrand <dahi@linux.vnet.ibm.com>
      Reviewed-by: NCornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: NGreg Kurz <gkurz@linux.vnet.ibm.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9b9e3fc4
    • P
      Merge tag 'kvm-arm-for-4.7' of... · bdb4094e
      Paolo Bonzini 提交于
      Merge tag 'kvm-arm-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
      
      KVM/ARM Changes for Linux v4.7
      
      Reworks our stage 2 page table handling to have page table manipulation
      macros separate from those of the host systems as the underlying
      hardware page tables can be configured to be noticably different in
      layout from the stage 1 page tables used by the host.
      
      Adds 16K page size support based on the above.
      
      Adds a generic firmware probing layer for the timer and GIC so that KVM
      initializes using the same logic based on both ACPI and FDT.
      
      Finally adds support for hardware updating of the access flag.
      bdb4094e
  4. 11 5月, 2016 4 次提交
  5. 10 5月, 2016 7 次提交
    • P
      Merge tag 'kvm-s390-next-4.7-2' of... · 6ac0f61f
      Paolo Bonzini 提交于
      Merge tag 'kvm-s390-next-4.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
      
      KVM: s390: features and fixes for 4.7 part2
      
      - Use hardware provided information about facility bits that do not
        need any hypervisor activitiy
      - Add missing documentation for KVM_CAP_S390_RI
      - Some updates/fixes for handling cpu models and facilities
      6ac0f61f
    • J
      MIPS: KVM: Add missing disable FPU hazard barriers · 4ac33429
      James Hogan 提交于
      Add the necessary hazard barriers after disabling the FPU in
      kvm_lose_fpu(), just to be safe.
      Signed-off-by: NJames Hogan <james.hogan@imgtec.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: linux-mips@linux-mips.org
      Cc: kvm@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4ac33429
    • J
      MIPS: KVM: Fix preemption warning reading FPU capability · 556f2a52
      James Hogan 提交于
      Reading the KVM_CAP_MIPS_FPU capability returns cpu_has_fpu, however
      this uses smp_processor_id() to read the current CPU capabilities (since
      some old MIPS systems could have FPUs present on only a subset of CPUs).
      
      We don't support any such systems, so work around the warning by using
      raw_cpu_has_fpu instead.
      
      We should probably instead claim not to support FPU at all if any one
      CPU is lacking an FPU, but this should do for now.
      Signed-off-by: NJames Hogan <james.hogan@imgtec.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: linux-mips@linux-mips.org
      Cc: kvm@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      556f2a52
    • J
      MIPS: KVM: Fix preemptable kvm_mips_get_*_asid() calls · f049729c
      James Hogan 提交于
      There are a couple of places in KVM fault handling code which implicitly
      use smp_processor_id() via kvm_mips_get_kernel_asid() and
      kvm_mips_get_user_asid() from preemptable context. This is unsafe as a
      preemption could cause the guest kernel ASID to be changed, resulting in
      a host TLB entry being written with the wrong ASID.
      
      Fix by disabling preemption around the kvm_mips_get_*_asid() call and
      the corresponding kvm_mips_host_tlb_write().
      Signed-off-by: NJames Hogan <james.hogan@imgtec.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: linux-mips@linux-mips.org
      Cc: kvm@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f049729c
    • J
      MIPS: KVM: Fix timer IRQ race when writing CP0_Compare · b45bacd2
      James Hogan 提交于
      Writing CP0_Compare clears the timer interrupt pending bit
      (CP0_Cause.TI), but this wasn't being done atomically. If a timer
      interrupt raced with the write of the guest CP0_Compare, the timer
      interrupt could end up being pending even though the new CP0_Compare is
      nowhere near CP0_Count.
      
      We were already updating the hrtimer expiry with
      kvm_mips_update_hrtimer(), which used both kvm_mips_freeze_hrtimer() and
      kvm_mips_resume_hrtimer(). Close the race window by expanding out
      kvm_mips_update_hrtimer(), and clearing CP0_Cause.TI and setting
      CP0_Compare between the freeze and resume. Since the pending timer
      interrupt should not be cleared when CP0_Compare is written via the KVM
      user API, an ack argument is added to distinguish the source of the
      write.
      
      Fixes: e30492bb ("MIPS: KVM: Rewrite count/compare timer emulation")
      Signed-off-by: NJames Hogan <james.hogan@imgtec.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: linux-mips@linux-mips.org
      Cc: kvm@vger.kernel.org
      Cc: <stable@vger.kernel.org> # 3.16.x-
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b45bacd2
    • J
      MIPS: KVM: Fix timer IRQ race when freezing timer · 4355c44f
      James Hogan 提交于
      There's a particularly narrow and subtle race condition when the
      software emulated guest timer is frozen which can allow a guest timer
      interrupt to be missed.
      
      This happens due to the hrtimer expiry being inexact, so very
      occasionally the freeze time will be after the moment when the emulated
      CP0_Count transitions to the same value as CP0_Compare (so an IRQ should
      be generated), but before the moment when the hrtimer is due to expire
      (so no IRQ is generated). The IRQ won't be generated when the timer is
      resumed either, since the resume CP0_Count will already match CP0_Compare.
      
      With VZ guests in particular this is far more likely to happen, since
      the soft timer may be frozen frequently in order to restore the timer
      state to the hardware guest timer. This happens after 5-10 hours of
      guest soak testing, resulting in an overflow in guest kernel timekeeping
      calculations, hanging the guest. A more focussed test case to
      intentionally hit the race (with the help of a new hypcall to cause the
      timer state to migrated between hardware & software) hits the condition
      fairly reliably within around 30 seconds.
      
      Instead of relying purely on the inexact hrtimer expiry to determine
      whether an IRQ should be generated, read the guest CP0_Compare and
      directly check whether the freeze time is before or after it. Only if
      CP0_Count is on or after CP0_Compare do we check the hrtimer expiry to
      determine whether the last IRQ has already been generated (which will
      have pushed back the expiry by one timer period).
      
      Fixes: e30492bb ("MIPS: KVM: Rewrite count/compare timer emulation")
      Signed-off-by: NJames Hogan <james.hogan@imgtec.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: linux-mips@linux-mips.org
      Cc: kvm@vger.kernel.org
      Cc: <stable@vger.kernel.org> # 3.16.x-
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4355c44f
    • C
      kvm: arm64: Enable hardware updates of the Access Flag for Stage 2 page tables · 06485053
      Catalin Marinas 提交于
      The ARMv8.1 architecture extensions introduce support for hardware
      updates of the access and dirty information in page table entries. With
      VTCR_EL2.HA enabled (bit 21), when the CPU accesses an IPA with the
      PTE_AF bit cleared in the stage 2 page table, instead of raising an
      Access Flag fault to EL2 the CPU sets the actual page table entry bit
      (10). To ensure that kernel modifications to the page table do not
      inadvertently revert a bit set by hardware updates, certain Stage 2
      software pte/pmd operations must be performed atomically.
      
      The main user of the AF bit is the kvm_age_hva() mechanism. The
      kvm_age_hva_handler() function performs a "test and clear young" action
      on the pte/pmd. This needs to be atomic in respect of automatic hardware
      updates of the AF bit. Since the AF bit is in the same position for both
      Stage 1 and Stage 2, the patch reuses the existing
      ptep_test_and_clear_young() functionality if
      __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG is defined. Otherwise, the
      existing pte_young/pte_mkold mechanism is preserved.
      
      The kvm_set_s2pte_readonly() (and the corresponding pmd equivalent) have
      to perform atomic modifications in order to avoid a race with updates of
      the AF bit. The arm64 implementation has been re-written using
      exclusives.
      
      Currently, kvm_set_s2pte_writable() (and pmd equivalent) take a pointer
      argument and modify the pte/pmd in place. However, these functions are
      only used on local variables rather than actual page table entries, so
      it makes more sense to follow the pte_mkwrite() approach for stage 1
      attributes. The change to kvm_s2pte_mkwrite() makes it clear that these
      functions do not modify the actual page table entries.
      
      The (pte|pmd)_mkyoung() uses on Stage 2 entries (setting the AF bit
      explicitly) do not need to be modified since hardware updates of the
      dirty status are not supported by KVM, so there is no possibility of
      losing such information.
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Acked-by: NMarc Zyngier <marc.zyngier@arm.com>
      Reviewed-by: NChristoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: NChristoffer Dall <christoffer.dall@linaro.org>
      06485053
  6. 09 5月, 2016 9 次提交
  7. 04 5月, 2016 2 次提交
  8. 03 5月, 2016 6 次提交