1. 14 Jul, 2017 3 commits
  2. 13 Jul, 2017 2 commits
    • kvm: x86: hyperv: add KVM_CAP_HYPERV_SYNIC2 · efc479e6
      Roman Kagan authored
      There is a flaw in the Hyper-V SynIC implementation in KVM: when the message
      page or event flags page is enabled by setting the corresponding MSR,
      KVM zeroes it out.  This is problematic because on migration the
      corresponding MSRs are loaded on the destination, so the content of
      those pages is lost.
      
      This went unnoticed so far because the only user of those pages was
      in-KVM hyperv synic timers, which could continue working despite that
      zeroing.
      
      Newer QEMU uses those pages for Hyper-V VMBus implementation, and
      zeroing them breaks the migration.
      
      Besides, in newer QEMU the content of those pages is fully managed by
      QEMU, so zeroing them is undesirable even when writing the MSRs from the
      guest side.
      
      To support this new scheme, introduce a new capability,
      KVM_CAP_HYPERV_SYNIC2, which, when enabled, makes sure that the synic
      pages aren't zeroed out in KVM.
      
      Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
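The behaviour that the new capability toggles can be sketched as a small, self-contained toy model. This is not the kernel code: `toy_synic`, `dont_zero_synic_pages`, and `toy_synic_enable_page` are hypothetical names, and only the idea (skip the zeroing once the capability is enabled) comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Hypothetical stand-in for the per-vCPU SynIC state. */
struct toy_synic {
	/* Assumed flag: set once userspace enables the SYNIC2 capability. */
	bool dont_zero_synic_pages;
};

/*
 * Models a write to the MSR that enables a message or event flags page.
 * Legacy behaviour: the page is cleared when it is enabled.
 * New behaviour: the page content is left entirely to userspace.
 */
static void toy_synic_enable_page(const struct toy_synic *synic, uint8_t *page)
{
	assert(synic != NULL && page != NULL);

	if (!synic->dont_zero_synic_pages)
		memset(page, 0, TOY_PAGE_SIZE);
}
```

With the flag set, a page restored on the migration destination keeps its content when the MSR is loaded; with the flag clear, the same MSR load wipes it.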
    • KVM: x86: make backwards_tsc_observed a per-VM variable · a826faf1
      Ladi Prosek authored
      The backwards_tsc_observed global introduced in commit 16a96021 is never
      reset to false. If a VM happens to be running while the host is suspended
      (a common source of the TSC jumping backwards), master clock will never
      be enabled again for any VM. In contrast, if no VM is running while the
      host is suspended, master clock is unaffected. This is inconsistent and
      unnecessarily strict. Let's track the backwards_tsc_observed variable
      separately and let each VM start with a clean slate.
      
      Real world impact: My Windows VMs get slower after my laptop undergoes a
      suspend/resume cycle. The only way to get the perf back is unloading and
      reloading the kvm module.
      
      Signed-off-by: Ladi Prosek <lprosek@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  3. 03 Jul, 2017 1 commit
    • kvm: x86: mmu: allow A/D bits to be disabled in an mmu · ac8d57e5
      Peter Feiner authored
      Adds the plumbing to disable A/D bits in the MMU based on a new role
      bit, ad_disabled. When A/D is disabled, the MMU operates as though A/D
      aren't available (i.e., using access tracking faults instead).
      
      To avoid SP -> kvm_mmu_page.role.ad_disabled lookups all over the
      place, A/D disablement is now stored in the SPTE. This state is stored
      in the SPTE by tweaking the use of SPTE_SPECIAL_MASK for access
      tracking. Rather than just setting SPTE_SPECIAL_MASK when an
      access-tracking SPTE is non-present, we now always set
      SPTE_SPECIAL_MASK for access-tracking SPTEs.
      
      Signed-off-by: Peter Feiner <pfeiner@google.com>
      [Use role.ad_disabled even for direct (non-shadow) EPT page tables.  Add
       documentation and a few MMU_WARN_ONs. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
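Storing the A/D state in the SPTE itself can be illustrated with the following sketch. The `toy_` names are hypothetical, and using bit 62 as the special bit is an assumption taken from the 37f0e8fe entry further down this history. MMIO SPTEs, which also use the special bit, are ignored here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed position of the "special" SPTE bit (bit 62). */
#define TOY_SPTE_SPECIAL_MASK (1ULL << 62)

/* Build an SPTE for an MMU whose role has ad_disabled set or clear. */
static uint64_t toy_make_spte(uint64_t pfn_and_perms, bool ad_disabled)
{
	/* The caller must not pass a value that already uses the special bit. */
	assert(!(pfn_and_perms & TOY_SPTE_SPECIAL_MASK));

	return ad_disabled ? (pfn_and_perms | TOY_SPTE_SPECIAL_MASK)
			   : pfn_and_perms;
}

/* The A/D state is read from the SPTE, with no lookup of the page role. */
static bool toy_spte_ad_enabled(uint64_t spte)
{
	return !(spte & TOY_SPTE_SPECIAL_MASK);
}
```

The design point is that code handling a single SPTE can decide between A/D bits and access-tracking faults without walking back to the owning `kvm_mmu_page`.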
  4. 04 Jun, 2017 1 commit
  5. 17 May, 2017 1 commit
    • KVM: x86: lower default for halt_poll_ns · b401ee0b
      Paolo Bonzini authored
      In some fio benchmarks, halt_poll_ns=400000 caused CPU utilization to
      increase heavily even in cases where the performance improvement was
      small.  In particular, bandwidth divided by CPU usage was as much as
      60% lower.
      
      To some extent this is the expected effect of the patch, and the
      additional CPU utilization is only visible when running the
      benchmarks.  However, halving the threshold also halves the extra
      CPU utilization (from +30-130% to +20-70%) and has no negative
      effect on performance.
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  6. 09 May, 2017 1 commit
  7. 02 May, 2017 1 commit
  8. 27 Apr, 2017 2 commits
    • KVM: mark requests that need synchronization · 7a97cec2
      Paolo Bonzini authored
      kvm_make_all_requests() provides a synchronization that waits until all
      kicked VCPUs have acknowledged the kick.  This is important for
      KVM_REQ_MMU_RELOAD as it prevents freeing while lockless paging is
      underway.
      
      This patch adds the synchronization property into all requests that are
      currently being used with kvm_make_all_requests() in order to preserve
      the current behavior and only introduce a new framework.  Removing it
      from requests where it is not necessary is left for future patches.
      
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: mark requests that do not need a wakeup · 930f7fd6
      Radim Krčmář authored
      Some operations must ensure that the guest is not running with stale
      data, but if the guest is halted, then the update can wait until another
      event happens.  kvm_make_all_requests() currently doesn't wake up, so we
      can mark all requests used with it.
      
      First 8 bits were arbitrarily reserved for request numbers.
      
      Most uses of requests have the request type as a constant, so a compiler
      will optimize the '&'.
      
      An alternative would be to have an inline function that would return
      whether the request needs a wake-up or not, but I like this one better
      even though it might produce worse assembly.
      
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
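The two request-marking commits above encode properties in the bits above the request number. A minimal sketch of that encoding follows; the low 8 bits for the number come from the commit message, while the exact flag positions, the `toy_` names, and the two example requests are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* The low 8 bits carry the request number. */
#define TOY_REQUEST_MASK      0xffu
/* Assumed flag positions just above the request number. */
#define TOY_REQUEST_NO_WAKEUP (1u << 8)
#define TOY_REQUEST_WAIT      (1u << 9)

static_assert(((TOY_REQUEST_NO_WAKEUP | TOY_REQUEST_WAIT) & TOY_REQUEST_MASK) == 0,
	      "flags must not overlap the request number");

/* Hypothetical requests: one carrying both properties, one carrying none. */
#define TOY_REQ_TLB_FLUSH (0u | TOY_REQUEST_WAIT | TOY_REQUEST_NO_WAKEUP)
#define TOY_REQ_EVENT     (6u)

/* With a constant argument the compiler can fold each '&' away. */
static unsigned int toy_request_nr(unsigned int req)
{
	return req & TOY_REQUEST_MASK;
}

static bool toy_request_needs_wakeup(unsigned int req)
{
	return !(req & TOY_REQUEST_NO_WAKEUP);
}

static bool toy_request_needs_ack(unsigned int req)
{
	return (req & TOY_REQUEST_WAIT) != 0;
}
```

Keeping the properties inside the request constant means the caller passes a single value, and the bit index used for the pending-request bitmap is recovered with one mask.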
  9. 21 Apr, 2017 1 commit
  10. 13 Apr, 2017 1 commit
  11. 07 Apr, 2017 2 commits
    • kvm: make KVM_COALESCED_MMIO_PAGE_OFFSET public · 4b4357e0
      Paolo Bonzini authored
      Its value has never changed; we might as well make it part of the ABI instead
      of using the return value of KVM_CHECK_EXTENSION(KVM_CAP_COALESCED_MMIO).
      
      Because PPC does not always make MMIO available, the code has to be made
      dependent on CONFIG_KVM_MMIO rather than KVM_COALESCED_MMIO_PAGE_OFFSET.
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • kvm: nVMX: support EPT accessed/dirty bits · ae1e2d10
      Paolo Bonzini authored
      Now use bit 6 of EPTP to optionally enable A/D bits for EPTP.  Another
      thing to change is that, when EPT accessed and dirty bits are not in use,
      VMX treats accesses to guest paging structures as data reads.  When they
      are in use (bit 6 of EPTP is set), they are treated as writes and the
      corresponding EPT dirty bit is set.  The MMU didn't know this detail,
      so this patch adds it.
      
      We also have to fix up the exit qualification.  It may be wrong because
      KVM sets bit 6 but the guest might not.
      
      L1 emulates EPT A/D bits using write permissions, so in principle it may
      be possible for EPT A/D bits to be used by L1 even though not available
      in hardware.  The problem is that guest page-table walks will be treated
      as reads rather than writes, so they would not cause an EPT violation.
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      [Fixed typo in walk_addr_generic() comment and changed bit clear +
       conditional-set pattern in handle_ept_violation() to conditional-clear]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
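The rule described in this commit, that accesses to guest paging structures count as writes only when bit 6 of EPTP is set, can be written down as a short sketch. The `toy_` names are hypothetical, and the snippet models only the classification, not the MMU walk or the exit qualification fix-up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 6 of EPTP enables EPT accessed/dirty bits (per the commit message). */
#define TOY_EPTP_AD_ENABLE (1ULL << 6)

static_assert(TOY_EPTP_AD_ENABLE == 0x40, "EPTP A/D enable is bit 6");

enum toy_access { TOY_ACCESS_READ, TOY_ACCESS_WRITE };

static bool toy_eptp_ad_enabled(uint64_t eptp)
{
	return (eptp & TOY_EPTP_AD_ENABLE) != 0;
}

/*
 * Classify an access to a guest paging structure made during a page walk:
 * a write when A/D bits are in use, a data read otherwise.
 */
static enum toy_access toy_page_walk_access(uint64_t eptp)
{
	return toy_eptp_ad_enabled(eptp) ? TOY_ACCESS_WRITE : TOY_ACCESS_READ;
}
```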
  12. 17 Feb, 2017 1 commit
  13. 15 Feb, 2017 1 commit
    • KVM: x86: do not scan IRR twice on APICv vmentry · 76dfafd5
      Paolo Bonzini authored
      Calls to apic_find_highest_irr are scanning IRR twice, once
      in vmx_sync_pir_from_irr and once in apic_search_irr.  Change
      sync_pir_from_irr to get the new maximum IRR from kvm_apic_update_irr;
      now that it does the computation, it can also do the RVI write.
      
      In order to avoid complications in svm.c, make the callback optional.
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
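The idea of computing the new maximum IRR while merging the posted-interrupt requests, instead of scanning a second time, might look roughly like this. It is a toy model under assumptions: the 8 x 32-bit layout and the `toy_` function names are illustrative, and no atomics or real APIC registers are involved.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_IRR_WORDS 8 /* 256 vectors, 32 per word */

/* Index of the highest set bit in a non-zero 32-bit word. */
static int toy_highest_bit(uint32_t word)
{
	int bit = 31;

	assert(word != 0);
	while (!(word & (1u << bit)))
		bit--;
	return bit;
}

/*
 * Merge pending posted interrupts (pir) into irr and return the highest
 * pending vector found in the same pass, or -1 if nothing is pending.
 */
static int toy_update_irr(uint32_t *pir, uint32_t *irr)
{
	int i, max_irr = -1;

	for (i = 0; i < TOY_IRR_WORDS; i++) {
		irr[i] |= pir[i];
		pir[i] = 0;
		if (irr[i])
			max_irr = i * 32 + toy_highest_bit(irr[i]);
	}
	return max_irr;
}
```

Because the words are visited in ascending order, the last non-zero word seen determines the result, so the caller gets the value it would otherwise obtain from a second scan.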
  14. 09 Jan, 2017 7 commits
    • KVM: x86: add VCPU stat for KVM_REQ_EVENT processing · 0f1e261e
      Paolo Bonzini authored
      This statistic can be useful to estimate the cost of an IRQ injection
      scenario, by comparing it with irq_injections.  For example the stat
      shows that sti;hlt triggers more KVM_REQ_EVENT than sti;nop.
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: svm: Use the hardware provided GPA instead of page walk · 0f89b207
      Tom Lendacky authored
      When a guest causes an NPF which requires emulation, KVM sometimes walks
      the guest page tables to translate the GVA to a GPA. This is unnecessary
      most of the time on AMD hardware since the hardware provides the GPA in
      EXITINFO2.
      
      The only exceptions are string operations that use rep and operations
      that use two memory locations. With rep, the GPA is only the value
      from the initial NPF, and with two memory locations we cannot know
      which memory address was translated into EXITINFO2.
      
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits. · f160c7b7
      Junaid Shahid authored
      This change implements lockless access tracking for Intel CPUs without EPT
      A bits. This is achieved by marking the PTEs as not-present (but not
      completely clearing them) when clear_flush_young() is called after marking
      the pages as accessed. When an EPT Violation is generated as a result of
      the VM accessing those pages, the PTEs are restored to their original values.
      
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
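A rough sketch of the mark-and-restore cycle described above follows. The bit layout is an assumption chosen for illustration (permissions in bits 0-2, a saved copy in bits 52-54, a marker in bit 62), the `toy_` names are hypothetical, and the real implementation may save fewer bits and must update the PTE atomically.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative layout only; not taken from the kernel sources. */
#define TOY_PERM_MASK   0x7ULL
#define TOY_SAVED_SHIFT 52
#define TOY_SAVED_MASK  (TOY_PERM_MASK << TOY_SAVED_SHIFT)
#define TOY_ACC_TRACK   (1ULL << 62)

static bool toy_is_access_track(uint64_t spte)
{
	return (spte & TOY_ACC_TRACK) != 0;
}

/* Make the PTE not-present while keeping enough state to restore it. */
static uint64_t toy_mark_for_access_track(uint64_t spte)
{
	uint64_t perms = spte & TOY_PERM_MASK;

	assert(!(spte & (TOY_SAVED_MASK | TOY_ACC_TRACK)));
	spte &= ~TOY_PERM_MASK;
	spte |= perms << TOY_SAVED_SHIFT;
	return spte | TOY_ACC_TRACK;
}

/* On the resulting EPT violation, put the original permissions back. */
static uint64_t toy_restore_from_access_track(uint64_t spte)
{
	uint64_t perms = (spte >> TOY_SAVED_SHIFT) & TOY_PERM_MASK;

	assert(toy_is_access_track(spte));
	spte &= ~(TOY_SAVED_MASK | TOY_ACC_TRACK);
	return spte | perms;
}
```

Since the address bits are never cleared, restoring the entry needs no lookup of the backing page.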
    • kvm: x86: mmu: Do not use bit 63 for tracking special SPTEs · 37f0e8fe
      Junaid Shahid authored
      MMIO SPTEs currently set both bits 62 and 63 to distinguish them as special
      PTEs. However, bit 63 is used as the SVE bit in Intel EPT PTEs. The SVE bit
      is ignored for misconfigured PTEs but not necessarily for not-Present PTEs.
      Since MMIO SPTEs use an EPT misconfiguration, using bit 63 for them is
      acceptable. However, the upcoming fast access tracking feature adds another
      type of special tracking PTE, which uses not-Present PTEs and hence should
      not set bit 63.
      
      In order to use common bits to distinguish both types of special PTEs, we
      now use only bit 62 as the special bit.
      
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: reduce collisions in mmu_page_hash · 114df303
      David Matlack authored
      When using two-dimensional paging, the mmu_page_hash (which provides
      lookups for existing kvm_mmu_page structs) becomes imbalanced, with
      too many collisions in buckets 0 and 512. This has been seen to cause
      mmu_lock to be held for multiple milliseconds in kvm_mmu_get_page on
      VMs with a large amount of RAM mapped with 4K pages.
      
      The current hash function uses the lower 10 bits of gfn to index into
      mmu_page_hash. When doing shadow paging, gfn is the address of the
      guest page table being shadowed. These tables are 4K-aligned, which
      makes the low bits of gfn a good hash. However, with two-dimensional
      paging, no guest page tables are being shadowed, so gfn is the base
      address that is mapped by the table. Thus page tables (level=1) have
      a 2MB aligned gfn, page directories (level=2) have a 1GB aligned gfn,
      etc. This means hashes will only differ in their 10th bit.
      
      hash_64() provides a better hash. For example, on a VM with ~200G
      (99458 direct=1 kvm_mmu_page structs):
      
      hash            max_mmu_page_hash_collisions
      --------------------------------------------
      low 10 bits     49847
      hash_64         105
      perfect         97
      
      While we're changing the hash, increase the table size by 4x to better
      support large VMs (further reduces number of collisions in 200G VM to
      29).
      
      Note that hash_64() does not provide a good distribution prior to commit
      ef703f49 ("Eliminate bad hash multipliers from hash_32() and
      hash_64()").
      
      Signed-off-by: David Matlack <dmatlack@google.com>
      Change-Id: I5aa6b13c834722813c6cca46b8b1ed6f53368ade
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
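The collision problem is easy to reproduce outside the kernel. The sketch below feeds 2MB-aligned gfns (multiples of 512) through a low-10-bits hash and through a multiplicative hash. The multiplier is assumed to match the one used by the kernel's hash_64(), and the `toy_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HASH_BITS 10
#define TOY_BUCKETS   (1u << TOY_HASH_BITS)
/* 64-bit golden-ratio multiplier (assumed to match hash_64()). */
#define TOY_GOLDEN_RATIO_64 0x61C8864680B583EBULL

/* Old scheme: index by the low 10 bits of the gfn. */
static unsigned int toy_hash_low_bits(uint64_t gfn)
{
	return (unsigned int)(gfn & (TOY_BUCKETS - 1));
}

/* New scheme: multiply, then keep the top 10 bits of the product. */
static unsigned int toy_hash_mult(uint64_t gfn)
{
	return (unsigned int)((gfn * TOY_GOLDEN_RATIO_64) >> (64 - TOY_HASH_BITS));
}

/* Count how many distinct buckets n 2MB-aligned gfns land in. */
static size_t toy_buckets_used(unsigned int (*hash)(uint64_t), size_t n)
{
	bool used[TOY_BUCKETS] = { false };
	size_t i, count = 0;

	assert(hash != NULL);
	for (i = 0; i < n; i++) {
		uint64_t gfn = (uint64_t)i << 9; /* 512 4K pages = 2MB */
		unsigned int bucket = hash(gfn);

		if (!used[bucket]) {
			used[bucket] = true;
			count++;
		}
	}
	return count;
}
```

Calling `toy_buckets_used(toy_hash_low_bits, 1000)` uses only buckets 0 and 512, matching the imbalance described in the commit message, while the multiplicative variant spreads the same gfns over many more buckets.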
    • kvm: x86: export maximum number of mmu_page_hash collisions · f3414bc7
      David Matlack authored
      Report the maximum number of mmu_page_hash collisions as a per-VM stat.
      This will make it easy to identify problems with the mmu_page_hash in
      the future.
      
      Signed-off-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: decouple irqchip_in_kernel() and pic_irqchip() · 49776faf
      Radim Krčmář authored
      irqchip_in_kernel() tried to save a bit by reusing pic_irqchip(), but it
      just complicated the code.
      Add a separate state for the irqchip mode.
      
      Reviewed-by: David Hildenbrand <david@redhat.com>
      [Used Paolo's version of condition in irqchip_in_kernel().]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  15. 25 Dec, 2016 1 commit
  16. 17 Dec, 2016 1 commit
  17. 08 Dec, 2016 3 commits
    • KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry · 9ed38ffa
      Ladi Prosek authored
      Loading CR3 as part of emulating vmentry is different from regular CR3 loads,
      as implemented in kvm_set_cr3, in several ways.
      
      * different rules are followed to check CR3 and it is desirable for the caller
      to distinguish between the possible failures
      * PDPTRs are not loaded if PAE paging and nested EPT are both enabled
      * many MMU operations are not necessary
      
      This patch introduces nested_vmx_load_cr3 suitable for CR3 loads as part of
      nested vmentry and vmexit, and makes use of it on the nested vmentry path.
      
      Signed-off-by: Ladi Prosek <lprosek@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: x86: Add kvm_skip_emulated_instruction and use it. · 6affcbed
      Kyle Huey authored
      kvm_skip_emulated_instruction calls both
      kvm_x86_ops->skip_emulated_instruction and kvm_vcpu_check_singlestep,
      skipping the emulated instruction and generating a trap if necessary.
      
      Replacing skip_emulated_instruction calls with
      kvm_skip_emulated_instruction is straightforward, except for:
      
      - ICEBP, which is already inside a trap, so avoid triggering another trap.
      - Instructions that can trigger exits to userspace, such as the IO insns,
        MOVs to CR8, and HALT. If kvm_skip_emulated_instruction does trigger a
        KVM_GUESTDBG_SINGLESTEP exit, and the handling code for
        IN/OUT/MOV CR8/HALT also triggers an exit to userspace, the latter will
        take precedence. The singlestep will be triggered again on the next
        instruction, which is the current behavior.
      - Task switch instructions which would require additional handling (e.g.
        the task switch bit) and are instead left alone.
      - Cases where VMLAUNCH/VMRESUME do not proceed to the next instruction,
        which do not trigger singlestep traps as mentioned previously.
      
      Signed-off-by: Kyle Huey <khuey@kylehuey.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: x86: Add a return value to kvm_emulate_cpuid · 6a908b62
      Kyle Huey authored
      Once skipping the emulated instruction can potentially trigger an exit to
      userspace (via KVM_GUESTDBG_SINGLESTEP) kvm_emulate_cpuid will need to
      propagate a return value.
      
      Signed-off-by: Kyle Huey <khuey@kylehuey.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  18. 25 Nov, 2016 2 commits
    • kvm: svm: Add kvm_fast_pio_in support · 8370c3d0
      Tom Lendacky authored
      Update the I/O interception support to add the kvm_fast_pio_in function
      to speed up the in instruction similar to the out instruction.
      
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • kvm: svm: Add support for additional SVM NPF error codes · 14727754
      Tom Lendacky authored
      AMD hardware adds two additional bits to aid in nested page fault handling.
      
      Bit 32 - NPF occurred while translating the guest's final physical address
      Bit 33 - NPF occurred while translating the guest page tables
      
      The guest page tables fault indicator can be used as an aid for nested
      virtualization. Using V0 for the host, V1 for the first level guest and
      V2 for the second level guest, when both V1 and V2 are using nested paging
      there are currently a number of unnecessary instruction emulations. When
      V2 is launched shadow paging is used in V1 for the nested tables of V2. As
      a result, KVM marks these pages as RO in the host nested page tables. When
      V2 exits and we resume V1, these pages are still marked RO.
      
      Every nested walk for a guest page table is treated as a user-level write
      access and this causes a lot of NPFs because the V1 page tables are marked
      RO in the V0 nested tables. While executing V1, when these NPFs occur KVM
      sees a write to a read-only page, emulates the V1 instruction and unprotects
      the page (marking it RW). This patch looks for cases where we get an NPF due
      to a guest page table walk where the page was marked RO. It immediately
      unprotects the page and resumes the guest, leading to far fewer instruction
      emulations when nested virtualization is used.
      
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
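The check this commit describes, an NPF caused by a guest page table walk hitting a read-only page, can be expressed as a mask test on the 64-bit error code. In this sketch bits 32 and 33 come from the commit message and bits 0 and 1 are the architectural present and write bits; the `toy_` names and the two-way classification are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural page-fault error code bits. */
#define TOY_PFERR_PRESENT     (1ULL << 0)
#define TOY_PFERR_WRITE       (1ULL << 1)
/* Additional NPF bits described in the commit message. */
#define TOY_PFERR_GUEST_FINAL (1ULL << 32)
#define TOY_PFERR_GUEST_PAGE  (1ULL << 33)

static_assert(TOY_PFERR_GUEST_PAGE > UINT32_MAX,
	      "the new bits need a 64-bit error code");

/* A write-protection fault taken while walking the guest page tables. */
#define TOY_PFERR_NESTED_GUEST_PAGE \
	(TOY_PFERR_GUEST_PAGE | TOY_PFERR_WRITE | TOY_PFERR_PRESENT)

enum toy_npf_action { TOY_NPF_UNPROTECT_AND_RESUME, TOY_NPF_NORMAL_PATH };

/* Decide whether the fault can skip instruction emulation entirely. */
static enum toy_npf_action toy_classify_npf(uint64_t error_code)
{
	if ((error_code & TOY_PFERR_NESTED_GUEST_PAGE) ==
	    TOY_PFERR_NESTED_GUEST_PAGE)
		return TOY_NPF_UNPROTECT_AND_RESUME;
	return TOY_NPF_NORMAL_PATH;
}
```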
  19. 03 Nov, 2016 1 commit
    • KVM: x86: drop TSC offsetting kvm_x86_ops to fix KVM_GET/SET_CLOCK · ea26e4ec
      Paolo Bonzini authored
      Since commit a545ab6a ("kvm: x86: add tsc_offset field to struct
      kvm_vcpu_arch", 2016-09-07) the offset between host and L1 TSC is
      cached and need not be fished out of the VMCS or VMCB.  This means
      that we can implement adjust_tsc_offset_guest and read_l1_tsc
      entirely in generic code.  The simplification is particularly
      significant for VMX code, where vmx->nested.vmcs01_tsc_offset
      was duplicating what is now in vcpu->arch.tsc_offset.  Therefore
      the vmcs01_tsc_offset can be dropped completely.
      
      More importantly, this fixes KVM_GET_CLOCK/KVM_SET_CLOCK
      which, after commit 108b249c ("KVM: x86: introduce get_kvmclock_ns",
      2016-09-01) called read_l1_tsc while the VMCS was not loaded.
      It thus returned bogus values on Intel CPUs.
      
      Fixes: 108b249c
      Reported-by: Roman Kagan <rkagan@virtuozzo.com>
      Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  20. 20 Sep, 2016 1 commit
  21. 16 Sep, 2016 2 commits
  22. 08 Sep, 2016 3 commits
  23. 14 Jul, 2016 1 commit