1. 09 Jan 2017, 4 commits
    • kvm: x86: mmu: Do not use bit 63 for tracking special SPTEs · 37f0e8fe
      Junaid Shahid committed
      MMIO SPTEs currently set both bits 62 and 63 to distinguish them as special
      PTEs. However, bit 63 is used as the SVE bit in Intel EPT PTEs. The SVE bit
      is ignored for misconfigured PTEs but not necessarily for not-Present PTEs.
      Since MMIO SPTEs use an EPT misconfiguration, using bit 63 for them is
      acceptable. However, the upcoming fast access tracking feature adds another
      type of special tracking PTE, which uses not-Present PTEs and hence should
      not set bit 63.
      
      In order to use common bits to distinguish both types of special PTEs, we
      now use only bit 62 as the special bit.
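      
      A minimal sketch of the mask change (illustrative names and a standalone
      check, not the exact kernel definitions):
      
        #include <stdint.h>
        #include <stdbool.h>
        
        /* Before: special (MMIO) SPTEs set both bit 62 and bit 63. */
        #define SPTE_SPECIAL_MASK_OLD   ((1ULL << 62) | (1ULL << 63))
        /* After: bit 62 alone marks a special SPTE, so bit 63 (the EPT
         * "suppress #VE" bit) stays free for not-Present access-tracking PTEs. */
        #define SPTE_SPECIAL_MASK       (1ULL << 62)
        
        static bool is_special_spte(uint64_t spte)
        {
                return (spte & SPTE_SPECIAL_MASK) == SPTE_SPECIAL_MASK;
        }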
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: reduce collisions in mmu_page_hash · 114df303
      David Matlack committed
      When using two-dimensional paging, the mmu_page_hash (which provides
      lookups for existing kvm_mmu_page structs) becomes imbalanced, with
      too many collisions in buckets 0 and 512. This has been seen to cause
      mmu_lock to be held for multiple milliseconds in kvm_mmu_get_page on
      VMs with a large amount of RAM mapped with 4K pages.
      
      The current hash function uses the lower 10 bits of gfn to index into
      mmu_page_hash. When doing shadow paging, gfn is the address of the
      guest page table being shadowed. These tables are 4K-aligned, which
      makes the low bits of gfn a good hash. However, with two-dimensional
      paging, no guest page tables are being shadowed, so gfn is the base
      address that is mapped by the table. Thus page tables (level=1) have
      a 2MB aligned gfn, page directories (level=2) have a 1GB aligned gfn,
      etc. This means hashes will only differ in their 10th bit.
      
      hash_64() provides a better hash. For example, on a VM with ~200G
      (99458 direct=1 kvm_mmu_page structs):
      
      hash            max_mmu_page_hash_collisions
      --------------------------------------------
      low 10 bits     49847
      hash_64         105
      perfect         97
      
      While we're changing the hash, increase the table size by 4x to better
      support large VMs (further reduces number of collisions in 200G VM to
      29).
      
      Note that hash_64() does not provide a good distribution prior to commit
      ef703f49 ("Eliminate bad hash multipliers from hash_32() and
      hash_64()").
      Signed-off-by: David Matlack <dmatlack@google.com>
      Change-Id: I5aa6b13c834722813c6cca46b8b1ed6f53368ade
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: export maximum number of mmu_page_hash collisions · f3414bc7
      David Matlack committed
      Report the maximum number of mmu_page_hash collisions as a per-VM stat.
      This will make it easy to identify problems with the mmu_page_hash in
      the future.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: decouple irqchip_in_kernel() and pic_irqchip() · 49776faf
      Radim Krčmář committed
      irqchip_in_kernel() tried to save a bit by reusing pic_irqchip(), but it
      just complicated the code.
      Add a separate state for the irqchip mode.
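      
      A hedged sketch of the separated state (the enum shape follows the
      description; the split-irqchip value and field name are assumptions):
      
        enum kvm_irqchip_mode {
                KVM_IRQCHIP_NONE,
                KVM_IRQCHIP_KERNEL,     /* full in-kernel PIC/IOAPIC/LAPIC */
                KVM_IRQCHIP_SPLIT,      /* in-kernel LAPIC, userspace PIC/IOAPIC */
        };
        
        static inline int irqchip_in_kernel(struct kvm *kvm)
        {
                /* Test the dedicated mode field instead of going through
                 * pic_irqchip(). */
                return kvm->arch.irqchip_mode != KVM_IRQCHIP_NONE;
        }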
      Reviewed-by: David Hildenbrand <david@redhat.com>
      [Used Paolo's version of condition in irqchip_in_kernel().]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  2. 25 Dec 2016, 1 commit
  3. 17 Dec 2016, 1 commit
  4. 08 Dec 2016, 3 commits
    • KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry · 9ed38ffa
      Ladi Prosek committed
      Loading CR3 as part of emulating vmentry is different from regular CR3 loads,
      as implemented in kvm_set_cr3, in several ways.
      
      * different rules are followed to check CR3 and it is desirable for the caller
      to distinguish between the possible failures
      * PDPTRs are not loaded if PAE paging and nested EPT are both enabled
      * many MMU operations are not necessary
      
      This patch introduces nested_vmx_load_cr3 suitable for CR3 loads as part of
      nested vmentry and vmexit, and makes use of it on the nested vmentry path.
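      
      A hedged sketch of the helper's shape (signature, helper names, and
      failure codes here are assumptions based on the description, not the
      actual patch):
      
        static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
                                       bool nested_ept, u32 *entry_failure_code)
        {
                /* vmentry-specific CR3 checks; report which check failed so
                 * the caller can distinguish the failure. */
                if (cr3 & rsvd_bits(cpuid_maxphyaddr(vcpu), 63)) {
                        *entry_failure_code = ENTRY_FAIL_DEFAULT;
                        return 1;
                }
                /* Skip the PDPTR load when nested EPT is enabled; only load
                 * them for PAE paging without nested EPT. */
                if (!nested_ept && is_pae(vcpu) &&
                    !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3)) {
                        *entry_failure_code = ENTRY_FAIL_PDPTE;
                        return 1;
                }
                vcpu->arch.cr3 = cr3;   /* without the extra MMU work of kvm_set_cr3 */
                return 0;
        }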
      Signed-off-by: Ladi Prosek <lprosek@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: x86: Add kvm_skip_emulated_instruction and use it. · 6affcbed
      Kyle Huey committed
      kvm_skip_emulated_instruction calls both
      kvm_x86_ops->skip_emulated_instruction and kvm_vcpu_check_singlestep,
      skipping the emulated instruction and generating a trap if necessary.
      
      Replacing skip_emulated_instruction calls with
      kvm_skip_emulated_instruction is straightforward, except for:
      
      - ICEBP, which is already inside a trap, so avoid triggering another trap.
      - Instructions that can trigger exits to userspace, such as the IO insns,
        MOVs to CR8, and HALT. If kvm_skip_emulated_instruction does trigger a
        KVM_GUESTDBG_SINGLESTEP exit, and the handling code for
        IN/OUT/MOV CR8/HALT also triggers an exit to userspace, the latter will
        take precedence. The singlestep will be triggered again on the next
        instruction, which is the current behavior.
      - Task switch instructions which would require additional handling (e.g.
        the task switch bit) and are instead left alone.
      - Cases where VMLAUNCH/VMRESUME do not proceed to the next instruction,
        which do not trigger singlestep traps as mentioned previously.
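      
      A minimal sketch of the resulting handler pattern (illustrative handler;
      the convention is to return 1 to keep running the guest and 0 to exit to
      userspace):
      
        static int handle_some_instruction(struct kvm_vcpu *vcpu)
        {
                /* ... emulate the instruction's side effects ... */
        
                /* Advances RIP and runs the singlestep check; returns 0 if a
                 * KVM_EXIT_DEBUG exit to userspace was queued, 1 otherwise. */
                return kvm_skip_emulated_instruction(vcpu);
        }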
      Signed-off-by: Kyle Huey <khuey@kylehuey.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: x86: Add a return value to kvm_emulate_cpuid · 6a908b62
      Kyle Huey committed
      Once skipping the emulated instruction can potentially trigger an exit to
      userspace (via KVM_GUESTDBG_SINGLESTEP) kvm_emulate_cpuid will need to
      propagate a return value.
      Signed-off-by: Kyle Huey <khuey@kylehuey.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  5. 25 Nov 2016, 2 commits
    • kvm: svm: Add kvm_fast_pio_in support · 8370c3d0
      Tom Lendacky committed
      Update the I/O interception support to add the kvm_fast_pio_in function
      to speed up the in instruction similar to the out instruction.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • kvm: svm: Add support for additional SVM NPF error codes · 14727754
      Tom Lendacky committed
      AMD hardware adds two additional bits to aid in nested page fault handling.
      
      Bit 32 - NPF occurred while translating the guest's final physical address
      Bit 33 - NPF occurred while translating the guest page tables
      
      The guest page tables fault indicator can be used as an aid for nested
      virtualization. Using V0 for the host, V1 for the first level guest and
      V2 for the second level guest, when both V1 and V2 are using nested paging
      there are currently a number of unnecessary instruction emulations. When
      V2 is launched, shadow paging is used in V1 for the nested tables of V2. As
      a result, KVM marks these pages as RO in the host nested page tables. When
      V2 exits and we resume V1, these pages are still marked RO.
      
      Every nested walk for a guest page table is treated as a user-level write
      access and this causes a lot of NPFs because the V1 page tables are marked
      RO in the V0 nested tables. While executing V1, when these NPFs occur KVM
      sees a write to a read-only page, emulates the V1 instruction and unprotects
      the page (marking it RW). This patch looks for cases where we get an NPF due
      to a guest page table walk where the page was marked RO. It immediately
      unprotects the page and resumes the guest, leading to far fewer instruction
      emulations when nested virtualization is used.
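      
      A hedged sketch of the new bits and the fast path described above (the
      PFERR_* names and the helper are illustrative):
      
        #define PFERR_GUEST_FINAL_MASK  (1ULL << 32)  /* fault on the final guest physical address */
        #define PFERR_GUEST_PAGE_MASK   (1ULL << 33)  /* fault while walking the guest page tables */
        
        /* If the NPF hit a write-protected page during a guest page-table
         * walk, unprotect it and re-enter the guest instead of emulating. */
        static bool unprotect_and_retry(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
        {
                if (!(error_code & PFERR_GUEST_PAGE_MASK))
                        return false;
                kvm_mmu_unprotect_page(vcpu->kvm, gpa >> PAGE_SHIFT);
                return true;    /* resume V1 directly, far fewer emulations */
        }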
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  6. 03 Nov 2016, 1 commit
    • KVM: x86: drop TSC offsetting kvm_x86_ops to fix KVM_GET/SET_CLOCK · ea26e4ec
      Paolo Bonzini committed
      Since commit a545ab6a ("kvm: x86: add tsc_offset field to struct
      kvm_vcpu_arch", 2016-09-07) the offset between host and L1 TSC is
      cached and need not be fished out of the VMCS or VMCB.  This means
      that we can implement adjust_tsc_offset_guest and read_l1_tsc
      entirely in generic code.  The simplification is particularly
      significant for VMX code, where vmx->nested.vmcs01_tsc_offset
      was duplicating what is now in vcpu->arch.tsc_offset.  Therefore
      the vmcs01_tsc_offset can be dropped completely.
      
      More importantly, this fixes KVM_GET_CLOCK/KVM_SET_CLOCK
      which, after commit 108b249c ("KVM: x86: introduce get_kvmclock_ns",
      2016-09-01) called read_l1_tsc while the VMCS was not loaded.
      It thus returned bogus values on Intel CPUs.
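      
      A hedged sketch of the now-generic helper this enables (close to, but not
      necessarily identical to, the patch):
      
        u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
        {
                /* The cached offset replaces a vendor callback that had to read
                 * TSC_OFFSET out of a VMCS/VMCB that may not be loaded. */
                return vcpu->arch.tsc_offset + kvm_scale_tsc(vcpu, host_tsc);
        }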
      
      Fixes: 108b249c ("KVM: x86: introduce get_kvmclock_ns")
      Reported-by: Roman Kagan <rkagan@virtuozzo.com>
      Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  7. 20 Sep 2016, 1 commit
  8. 16 Sep 2016, 2 commits
  9. 08 Sep 2016, 3 commits
  10. 14 Jul 2016, 8 commits
  11. 24 Jun 2016, 1 commit
  12. 16 Jun 2016, 3 commits
    • kvm: vmx: hook preemption timer support · 64672c95
      Yunhong Jiang committed
      Hook the VMX preemption timer to the "hv timer" functionality added
      by the previous patch.  This includes: checking if the feature is
      supported, if the feature is broken on the CPU, the hooks to
      setup/clean the VMX preemption timer, arming the timer on vmentry
      and handling the vmexit.
      
      A module parameter controls whether the VMX preemption timer is used.
      Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
      [Move hv_deadline_tsc to struct vcpu_vmx, use -1 as the "unset" value.
       Put all VMX bits here.  Enable it by default #yolo. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: support using the vmx preemption timer for tsc deadline timer · ce7a058a
      Yunhong Jiang committed
      The VMX preemption timer can be used to virtualize the TSC deadline timer.
      The VMX preemption timer is armed when the vCPU is running, and a VMExit
      will happen if the virtual TSC deadline timer expires.
      
      When the vCPU thread is blocked because of HLT, KVM will switch to use
      an hrtimer, and then go back to the VMX preemption timer when the vCPU
      thread is unblocked.
      
      This solution avoids the complexity of the OS's hrtimer system and the
      cost of host timer interrupt handling, replacing them with a little math
      (for guest->host TSC and host TSC->preemption timer conversion)
      and a cheaper VMexit.  This benefits latency for isolated pCPUs.
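      
      A self-contained sketch of the "little math" mentioned above (function and
      parameter names are illustrative):
      
        #include <stdint.h>
        
        /* Convert the remaining guest-TSC ticks until the deadline into a VMX
         * preemption timer value.  The preemption timer counts down at the TSC
         * rate divided by 2^shift, with shift taken from MSR_IA32_VMX_MISC. */
        static uint64_t deadline_to_preemption_timer(uint64_t guest_deadline_tsc,
                                                     uint64_t guest_tsc_now,
                                                     unsigned int shift)
        {
                uint64_t delta = guest_deadline_tsc - guest_tsc_now;
                return delta >> shift;
        }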
      
      [A word about performance... Yunhong reported a 30% reduction in average
       latency from cyclictest.  I made a similar test with tscdeadline_latency
       from kvm-unit-tests, and measured
      
       - ~20 clock cycles loss (out of ~3200, so less than 1% but still
         statistically significant) in the worst case where the test halts
         just after programming the TSC deadline timer
      
       - ~800 clock cycles gain (25% reduction in latency) in the best case
         where the test busy waits.
      
       I removed the VMX bits from Yunhong's patch, to concentrate them in the
       next patch - Paolo]
      Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: svm: Fix implicit declaration for __default_cpu_present_to_apicid() · 7d669f50
      Suravee Suthikulpanit committed
      Commit 8221c137 ("svm: Manage vcpu load/unload when enable AVIC")
      introduces a build error due to an implicit function declaration
      when CONFIG_X86_32 is defined and CONFIG_X86_LOCAL_APIC is not
      (as reported by the kbuild test robot, i386-randconfig-x0-06121009).
      
      So, this patch introduces a kvm_cpu_get_apicid() wrapper
      around __default_cpu_present_to_apicid() with additional
      handling if CONFIG_X86_LOCAL_APIC is not defined.
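      
      A hedged sketch of the wrapper (the fallback value for the no-local-APIC
      case is an assumption):
      
        static inline int kvm_cpu_get_apicid(int mps_cpu)
        {
        #ifdef CONFIG_X86_LOCAL_APIC
                return __default_cpu_present_to_apicid(mps_cpu);
        #else
                /* Assumed fallback when no local APIC support is built in. */
                return BAD_APICID;
        #endif
        }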
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Fixes: 8221c137 ("svm: Manage vcpu load/unload when enable AVIC")
      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  13. 19 May 2016, 6 commits
  14. 13 May 2016, 1 commit
    • KVM: halt_polling: provide a way to qualify wakeups during poll · 3491caf2
      Christian Borntraeger committed
      Some wakeups should not be considered a successful poll. For example, on
      s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
      would be considered runnable, letting all vCPUs poll all the time for
      transactional workloads, even if one vCPU would be enough.
      This can result in huge CPU usage for large guests.
      This patch lets architectures provide a way to qualify wakeups as
      good or bad with regard to polling.
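      
      A hypothetical sketch of how the generic polling code could consume such a
      qualification (hook and stat names are illustrative; "halt_poll_no_tuning"
      is the stat named later in this message):
      
        static void account_poll_wakeup(struct kvm_vcpu *vcpu)
        {
                /* The architecture flags whether this wakeup was worth polling
                 * for (s390 flags only known-good, single-vCPU events). */
                if (vcpu_valid_wakeup(vcpu))
                        ++vcpu->stat.halt_successful_poll;
                else
                        ++vcpu->stat.halt_poll_no_tuning;
        }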
      
      For s390 the implementation will fence off halt polling for anything but
      known-good, single-vCPU events. The s390 implementation for floating
      interrupts does a wakeup for one vCPU, but the interrupt will be delivered
      by whatever CPU checks first for a pending interrupt. We prefer the
      woken-up CPU by marking its poll as a "good" poll.
      This code will also mark several other wakeup reasons like IPIs or
      expired timers as "good". This will of course also mark some events as
      not successful. Since KVM on z always runs as a second-level hypervisor,
      we prefer not to poll unless we are really sure, though.
      
      This patch successfully limits the CPU usage for cases like the uperf
      1-byte transactional ping-pong workload or wakeup-heavy workloads like
      OLTP, while still providing a proper speedup.
      
      This also introduces a new vcpu stat, "halt_poll_no_tuning", that marks
      wakeups that are considered not good for polling.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
      Cc: David Matlack <dmatlack@google.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      [Rename config symbol. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  15. 20 Apr 2016, 1 commit
  16. 01 Apr 2016, 1 commit
    • KVM: x86: reduce default value of halt_poll_ns parameter · 14ebda33
      Paolo Bonzini committed
      Windows lets applications choose the frequency of the timer tick,
      and in Windows 10 the maximum rate was changed from 1024 Hz to
      2048 Hz.  Unfortunately, because of the way the Windows API
      works, most applications that need a higher rate than the default
      64 Hz will just do
      
         timeGetDevCaps(&tc, sizeof(tc));
         timeBeginPeriod(tc.wPeriodMin);
      
      and pick the maximum rate.  This causes very high CPU usage when
      playing media or games on Windows 10, even if the guest does not
      actually use the CPU very much, because the frequent timer tick
      causes halt_poll_ns to kick in.
      
      There is no really good solution, especially because Microsoft
      could sooner or later bump the limit to 4096 Hz, but for now
      the best we can do is lower the upper limit for halt_poll_ns a bit. :-(
      Reported-by: Jon Panozzo <jonp@lime-technology.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  17. 22 Mar 2016, 1 commit