1. 09 Jan 2017: 5 commits
• kvm: x86: mmu: Do not use bit 63 for tracking special SPTEs · 37f0e8fe
  Committed by Junaid Shahid
      MMIO SPTEs currently set both bits 62 and 63 to distinguish them as special
      PTEs. However, bit 63 is used as the SVE bit in Intel EPT PTEs. The SVE bit
      is ignored for misconfigured PTEs but not necessarily for not-Present PTEs.
Since MMIO SPTEs use an EPT misconfiguration, using bit 63 for them is
      acceptable. However, the upcoming fast access tracking feature adds another
      type of special tracking PTE, which uses not-Present PTEs and hence should
      not set bit 63.
      
In order to use common bits to distinguish both types of special PTEs, we
      now use only bit 62 as the special bit.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
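A minimal sketch of the resulting layout, assuming the special bit is expressed as a mask (the names below are illustrative, not necessarily those in the patch):

    #include <linux/types.h>

    /* Only bit 62 is reserved for "special" SPTEs; bit 63, the EPT
     * suppress-#VE (SVE) bit, is left alone so not-present access-tracking
     * SPTEs can share the scheme with MMIO SPTEs. */
    #define SPTE_SPECIAL_MASK (1ULL << 62)

    static inline bool is_special_spte(u64 spte)   /* illustrative helper name */
    {
            return spte & SPTE_SPECIAL_MASK;
    }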
• kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications · 27959a44
  Committed by Junaid Shahid
      This change adds some symbolic constants for VM Exit Qualifications
      related to EPT Violations and updates handle_ept_violation() to use
      these constants instead of hard-coded numbers.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
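For reference, a sketch of the kind of constants this introduces (names are illustrative; the bit positions follow the SDM's layout of the EPT-violation exit qualification):

    #define EPT_VIOLATION_ACC_READ   (1UL << 0)  /* the access was a data read */
    #define EPT_VIOLATION_ACC_WRITE  (1UL << 1)  /* the access was a data write */
    #define EPT_VIOLATION_ACC_INSTR  (1UL << 2)  /* the access was an instruction fetch */
    #define EPT_VIOLATION_READABLE   (1UL << 3)  /* the GPA was readable */
    #define EPT_VIOLATION_WRITABLE   (1UL << 4)  /* the GPA was writable */
    #define EPT_VIOLATION_EXECUTABLE (1UL << 5)  /* the GPA was executable */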
• kvm: x86: reduce collisions in mmu_page_hash · 114df303
  Committed by David Matlack
      When using two-dimensional paging, the mmu_page_hash (which provides
lookups for existing kvm_mmu_page structs) becomes imbalanced, with
      too many collisions in buckets 0 and 512. This has been seen to cause
      mmu_lock to be held for multiple milliseconds in kvm_mmu_get_page on
      VMs with a large amount of RAM mapped with 4K pages.
      
      The current hash function uses the lower 10 bits of gfn to index into
      mmu_page_hash. When doing shadow paging, gfn is the address of the
guest page table being shadowed. These tables are 4K-aligned, which
      makes the low bits of gfn a good hash. However, with two-dimensional
      paging, no guest page tables are being shadowed, so gfn is the base
      address that is mapped by the table. Thus page tables (level=1) have
      a 2MB aligned gfn, page directories (level=2) have a 1GB aligned gfn,
      etc. This means hashes will only differ in their 10th bit.
      
      hash_64() provides a better hash. For example, on a VM with ~200G
      (99458 direct=1 kvm_mmu_page structs):
      
      hash            max_mmu_page_hash_collisions
      --------------------------------------------
      low 10 bits     49847
      hash_64         105
      perfect         97
      
      While we're changing the hash, increase the table size by 4x to better
      support large VMs (further reduces number of collisions in 200G VM to
      29).
      
      Note that hash_64() does not provide a good distribution prior to commit
      ef703f49 ("Eliminate bad hash multipliers from hash_32() and
      hash_64()").
Signed-off-by: David Matlack <dmatlack@google.com>
Change-Id: I5aa6b13c834722813c6cca46b8b1ed6f53368ade
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
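A minimal sketch of the hashing change described above (the shift value and function names are approximations):

    #include <linux/hash.h>
    #include <linux/types.h>

    #define KVM_MMU_HASH_SHIFT 12                 /* 4x more buckets than the old 10-bit table */

    /* Old scheme: index with the low bits of gfn. With TDP, direct-mapped
     * gfns are 2MB/1GB aligned, so nearly everything collides in a couple
     * of buckets. */
    static unsigned int old_mmu_page_hash(u64 gfn)
    {
            return gfn & ((1 << 10) - 1);
    }

    /* New scheme: hash_64() mixes the high bits of gfn, spreading aligned
     * values evenly across the whole table. */
    static unsigned int new_mmu_page_hash(u64 gfn)
    {
            return hash_64(gfn, KVM_MMU_HASH_SHIFT);
    }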
• kvm: x86: export maximum number of mmu_page_hash collisions · f3414bc7
  Committed by David Matlack
      Report the maximum number of mmu_page_hash collisions as a per-VM stat.
      This will make it easy to identify problems with the mmu_page_hash in
      the future.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
• KVM: x86: decouple irqchip_in_kernel() and pic_irqchip() · 49776faf
  Committed by Radim Krčmář
      irqchip_in_kernel() tried to save a bit by reusing pic_irqchip(), but it
      just complicated the code.
      Add a separate state for the irqchip mode.
Reviewed-by: David Hildenbrand <david@redhat.com>
[Used Paolo's version of condition in irqchip_in_kernel().]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
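A sketch of the decoupling: track the irqchip mode explicitly instead of inferring "in kernel" from the PIC pointer (enum and field names are approximations of the KVM code):

    enum kvm_irqchip_mode {
            KVM_IRQCHIP_NONE,
            KVM_IRQCHIP_KERNEL,     /* full in-kernel PIC/IOAPIC/LAPIC */
            KVM_IRQCHIP_SPLIT,      /* in-kernel LAPIC, userspace PIC/IOAPIC */
    };

    static inline int irqchip_in_kernel(struct kvm *kvm)
    {
            return kvm->arch.irqchip_mode != KVM_IRQCHIP_NONE;
    }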
2. 30 Dec 2016: 1 commit
• mm: optimize PageWaiters bit use for unlock_page() · b91e1302
  Committed by Linus Torvalds
      In commit 62906027 ("mm: add PageWaiters indicating tasks are
      waiting for a page bit") Nick Piggin made our page locking no longer
      unconditionally touch the hashed page waitqueue, which not only helps
      performance in general, but is particularly helpful on NUMA machines
      where the hashed wait queues can bounce around a lot.
      
      However, the "clear lock bit atomically and then test the waiters bit"
      sequence turns out to be much more expensive than it needs to be,
      because you get a nasty stall when trying to access the same word that
      just got updated atomically.
      
      On architectures where locking is done with LL/SC, this would be trivial
      to fix with a new primitive that clears one bit and tests another
      atomically, but that ends up not working on x86, where the only atomic
      operations that return the result end up being cmpxchg and xadd.  The
      atomic bit operations return the old value of the same bit we changed,
      not the value of an unrelated bit.
      
      On x86, we could put the lock bit in the high bit of the byte, and use
      "xadd" with that bit (where the overflow ends up not touching other
      bits), and look at the other bits of the result.  However, an even
      simpler model is to just use a regular atomic "and" to clear the lock
      bit, and then the sign bit in eflags will indicate the resulting state
      of the unrelated bit #7.
      
      So by moving the PageWaiters bit up to bit #7, we can atomically clear
      the lock bit and test the waiters bit on x86 too.  And architectures
      with LL/SC (which is all the usual RISC suspects), the particular bit
      doesn't matter, so they are fine with this approach too.
      
      This avoids the extra access to the same atomic word, and thus avoids
      the costly stall at page unlock time.
      
      The only downside is that the interface ends up being a bit odd and
      specialized: clear a bit in a byte, and test the sign bit.  Nick doesn't
      love the resulting name of the new primitive, but I'd rather make the
      name be descriptive and very clear about the limitation imposed by
      trying to work across all relevant architectures than make it be some
      generic thing that doesn't make the odd semantics explicit.
      
      So this introduces the new architecture primitive
      
          clear_bit_unlock_is_negative_byte();
      
      and adds the trivial implementation for x86.  We have a generic
      non-optimized fallback (that just does a "clear_bit()"+"test_bit(7)"
      combination) which can be overridden by any architecture that can do
better.  According to Nick, Power has the same hiccup x86 has, for
      example, but some other architectures may not even care.
      
      All these optimizations mean that my page locking stress-test (which is
      just executing a lot of small short-lived shell scripts: "make test" in
      the git source tree) no longer makes our page locking look horribly bad.
Before all these optimizations, the unlock_page() costs alone were just
over 3% of all CPU overhead on "make test".  After this, it's down to
0.66%, about a quarter of the cost it used to be.
      
      (The difference on NUMA is bigger, but there this micro-optimization is
      likely less noticeable, since the big issue on NUMA was not the accesses
      to 'struct page', but the waitqueue accesses that were already removed
      by Nick's earlier commit).
Acked-by: Nick Piggin <npiggin@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
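A minimal sketch of the generic fallback described above and of how the unlock path can use it (this is an approximation of the mm/filemap.c internals, not the exact code; wake_up_page_bit() is the waker added by the earlier PageWaiters commit):

    /* Generic, non-optimized fallback: clear the lock bit with release
     * semantics, then report whether bit 7 (the relocated PG_waiters bit)
     * is set. x86 overrides this with one atomic "and" plus a sign test. */
    #ifndef clear_bit_unlock_is_negative_byte
    static inline bool clear_bit_unlock_is_negative_byte(unsigned int nr,
                                                         volatile unsigned long *mem)
    {
            clear_bit_unlock(nr, mem);
            return test_bit(7, mem);
    }
    #endif

    /* One atomic operation both releases the lock and says whether anyone is
     * waiting, avoiding a second access to the freshly-modified word. */
    void unlock_page(struct page *page)
    {
            page = compound_head(page);
            if (clear_bit_unlock_is_negative_byte(PG_locked, &page->flags))
                    wake_up_page_bit(page, PG_locked);
    }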
3. 25 Dec 2016: 2 commits
4. 19 Dec 2016: 6 commits
5. 18 Dec 2016: 1 commit
6. 17 Dec 2016: 2 commits
7. 15 Dec 2016: 3 commits
• x86/mm: Drop unused argument 'removed' from sync_global_pgds() · 5372e155
  Committed by Kirill A. Shutemov
      Since commit af2cf278 ("x86/mm/hotplug: Don't remove PGD entries in
      remove_pagetable()") there are no callers of sync_global_pgds() which set
      the 'removed' argument to 1.
      
      Remove the argument and the related conditionals in the function.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Link: http://lkml.kernel.org/r/20161214234403.137556-1-kirill.shutemov@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
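The resulting interface change, sketched as prototypes (treat the exact parameter names as approximations):

    /* Before: every caller had to pass a 'removed' flag nobody sets anymore. */
    void sync_global_pgds(unsigned long start, unsigned long end, int removed);

    /* After: the argument and its conditionals inside the function are gone. */
    void sync_global_pgds(unsigned long start, unsigned long end);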
• x86/tsc: Force TSC_ADJUST register to value >= zero · 5bae1562
  Committed by Thomas Gleixner
Roland reported that his DELL T5810 sports a value-add BIOS which
completely wrecks the TSC. The squirmware [(TM) Ingo Molnar] boots with
random negative TSC_ADJUST values, different on all CPUs. That renders the
TSC useless because the synchronization check fails.
      
Roland tested the new TSC_ADJUST mechanism. While it manages to readjust
the TSCs, he needs to disable the TSC deadline timer, otherwise the machine
just stops booting.
      
      Deeper investigation unearthed that the TSC deadline timer is sensitive to
      the TSC_ADJUST value. Writing TSC_ADJUST to a negative value results in an
      interrupt storm caused by the TSC deadline timer.
      
      This does not make any sense and it's hard to imagine what kind of hardware
      wreckage is behind that misfeature, but it's reliably reproducible on other
      systems which have TSC_ADJUST and TSC deadline timer.
      
      While it would be understandable that a big enough negative value which
      moves the resulting TSC readout into the negative space could have the
described effect, this happens even with an adjust value of -1, which keeps
      the TSC readout definitely in the positive space. The compare register for
      the TSC deadline timer is set to a positive value larger than the TSC, but
      despite not having reached the deadline the interrupt is raised
      immediately. If this happens on the boot CPU, then the machine dies
      silently because this setup happens before the NMI watchdog is armed.
      
      Further experiments showed that any other adjustment of TSC_ADJUST works as
      expected as long as it stays in the positive range. The direction of the
      adjustment has no influence either. See the lkml link for further analysis.
      
      Yet another proof for the theory that timers are designed by janitors and
      the underlying (obviously undocumented) mechanisms which allow BIOSes to
      wreckage them are considered a feature. Well done Intel - NOT!
      
      To address this wreckage add the following sanity measures:
      
      - If the TSC_ADJUST value on the boot cpu is not 0, set it to 0
      
      - If the TSC_ADJUST value on any cpu is negative, set it to 0
      
      - Prevent the cross package synchronization mechanism from setting negative
        TSC_ADJUST values.
Reported-and-tested-by: Roland Scheidegger <rscheidegger_lists@hispeed.ch>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Bruce Schlobohm <bruce.schlobohm@intel.com>
      Cc: Kevin Stanton <kevin.b.stanton@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Allen Hung <allen_hung@dell.com>
      Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/20161213131211.397588033@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
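A simplified sketch of the sanity measures listed above (the real logic lives in the TSC_ADJUST verification/sync code and is more involved; function name and structure here are assumptions):

    #include <asm/msr.h>

    static void tsc_sanitize_adjust(bool bootcpu)
    {
            s64 adj;

            rdmsrl(MSR_IA32_TSC_ADJUST, adj);

            /* Boot CPU: force TSC_ADJUST to 0. Any CPU: never leave a negative
             * value in place, since that provokes the TSC deadline timer
             * interrupt storm described above. */
            if ((bootcpu && adj != 0) || adj < 0)
                    wrmsrl(MSR_IA32_TSC_ADJUST, 0);
    }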
• x86/tsc: Validate TSC_ADJUST after resume · 6a369583
  Committed by Thomas Gleixner
Some 'feature' BIOSes fiddle with the TSC_ADJUST register during
suspend/resume, which renders the TSC unusable.
      
      Add sanity checks into the resume path and restore the
      original value if it was adjusted.
Reported-and-tested-by: Roland Scheidegger <rscheidegger_lists@hispeed.ch>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Bruce Schlobohm <bruce.schlobohm@intel.com>
      Cc: Kevin Stanton <kevin.b.stanton@intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Allen Hung <allen_hung@dell.com>
      Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/20161213131211.317654500@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
8. 14 Dec 2016: 1 commit
9. 11 Dec 2016: 1 commit
• x86/paravirt: Fix bool return type for PVOP_CALL() · 11f254db
  Committed by Peter Zijlstra
      Commit:
      
        3cded417 ("x86/paravirt: Optimize native pv_lock_ops.vcpu_is_preempted()")
      
introduced a paravirt op with bool return type [*].
      
      It turns out that the PVOP_CALL*() macros miscompile when rettype is
      bool. Code that looked like:
      
         83 ef 01                sub    $0x1,%edi
         ff 15 32 a0 d8 00       callq  *0xd8a032(%rip)        # ffffffff81e28120 <pv_lock_ops+0x20>
         84 c0                   test   %al,%al
      
      ended up looking like so after PVOP_CALL1() was applied:
      
         83 ef 01                sub    $0x1,%edi
         48 63 ff                movslq %edi,%rdi
         ff 14 25 20 81 e2 81    callq  *0xffffffff81e28120
         48 85 c0                test   %rax,%rax
      
      Note how it tests the whole of %rax, even though a typical bool return
      function only sets %al, like:
      
        0f 95 c0                setne  %al
        c3                      retq
      
      This is because ____PVOP_CALL() does:
      
      		__ret = (rettype)__eax;
      
      and while regular integer type casts truncate the result, a cast to
      bool tests for any !0 value. Fix this by explicitly truncating to
      sizeof(rettype) before casting.
      
      [*] The actual bug should've been exposed in commit:
            446f3dc8 ("locking/core, x86/paravirt: Implement vcpu_is_preempted(cpu) for KVM and Xen guests")
          but that didn't properly implement the paravirt call.
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alok Kataria <akataria@vmware.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 3cded417 ("x86/paravirt: Optimize native pv_lock_ops.vcpu_is_preempted()")
Link: http://lkml.kernel.org/r/20161208154349.346057680@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
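A standalone sketch of the failure mode and the fix in plain C (not the actual PVOP macros): a cast to bool tests the whole value for non-zero, so stale upper register bits leak through, while masking down to sizeof(rettype) first restores the intended truncation.

    #include <stdbool.h>
    #include <stdint.h>

    /* The callee only set %al; the upper bits of %rax are whatever was there. */

    static bool broken_convert(uint64_t eax)
    {
            return (bool)eax;                 /* true if *any* of the 64 bits is set */
    }

    static bool fixed_convert(uint64_t eax)
    {
            /* Equivalent of masking with ~0ULL >> (64 - 8 * sizeof(rettype)). */
            uint64_t mask = ~0ULL >> (64 - 8 * sizeof(bool));
            return (bool)(eax & mask);        /* only the %al-sized part decides */
    }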
10. 10 Dec 2016: 5 commits
11. 09 Dec 2016: 2 commits
• tracing: Have the reg function allow to fail · 8cf868af
  Committed by Steven Rostedt (Red Hat)
Some tracepoints have a registration function that gets called when the
tracepoint is enabled. There may be cases where the registration function
must fail (for example, it cannot allocate enough memory). In this case, the
tracepoint should also fail to register; otherwise the user would not know
why the tracepoint is not working.
      
      Cc: David Howells <dhowells@redhat.com>
      Cc: Seiji Aguchi <seiji.aguchi@hds.com>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
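A hedged sketch of what this enables; the callback and buffer below are hypothetical, not taken from the commit. The point is that the reg function now returns int, so a failure propagates and the tracepoint fails to register instead of silently doing nothing.

    #include <linux/slab.h>

    static void *my_trace_buf;      /* hypothetical per-tracepoint resource */

    static int my_trace_reg(void)
    {
            my_trace_buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
            if (!my_trace_buf)
                    return -ENOMEM;         /* propagated to the enable path */
            return 0;
    }

    static void my_trace_unreg(void)
    {
            kfree(my_trace_buf);
            my_trace_buf = NULL;
    }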
• x86: Make E820_X_MAX unconditionally larger than E820MAX · 9d2f86c6
  Committed by Alex Thorlton
      It's really not necessary to limit E820_X_MAX to 128 in the non-EFI
      case.  This commit drops E820_X_MAX's dependency on CONFIG_EFI, so that
      E820_X_MAX is always at least slightly larger than E820MAX.
      
      The real motivation behind this is actually to prevent some issues in
      the Xen kernel, where the XENMEM_machine_memory_map hypercall can
      produce an e820 map larger than 128 entries, even on systems where the
      original e820 table was quite a bit smaller than that, depending on how
      many IOAPICs are installed on the system.
Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Suggested-by: Ingo Molnar <mingo@redhat.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
12. 08 Dec 2016: 5 commits
• KVM: nVMX: check host CR3 on vmentry and vmexit · 1dc35dac
  Committed by Ladi Prosek
      This commit adds missing host CR3 checks. Before entering guest mode, the value
      of CR3 is checked for reserved bits. After returning, nested_vmx_load_cr3 is
      called to set the new CR3 value and check and load PDPTRs.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
• KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry · 9ed38ffa
  Committed by Ladi Prosek
      Loading CR3 as part of emulating vmentry is different from regular CR3 loads,
      as implemented in kvm_set_cr3, in several ways.
      
      * different rules are followed to check CR3 and it is desirable for the caller
      to distinguish between the possible failures
      * PDPTRs are not loaded if PAE paging and nested EPT are both enabled
      * many MMU operations are not necessary
      
      This patch introduces nested_vmx_load_cr3 suitable for CR3 loads as part of
      nested vmentry and vmexit, and makes use of it on the nested vmentry path.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
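A rough sketch of the helper's shape as described above. Names and details are assumptions, not the exact code in vmx.c: CR3 is checked under the vmentry rules, PDPTRs are loaded only when nested EPT is not handling PAE paging, and the caller gets a distinguishable failure code instead of a #GP injection as kvm_set_cr3() would do.

    static int nested_vmx_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
                                   bool nested_ept, u32 *entry_failure_code)
    {
            /* Reserved-bit check on the new CR3 value. */
            if (cr3 & rsvd_bits(cpuid_maxphyaddr(vcpu), 63)) {
                    *entry_failure_code = ENTRY_FAIL_DEFAULT;       /* assumed name */
                    return 1;
            }

            /* With nested EPT, the CPU does not consume CR3 for PAE paging, so
             * the PDPTRs must not be read from guest memory here. */
            if (!nested_ept && is_pae(vcpu) && is_paging(vcpu) && !is_long_mode(vcpu) &&
                !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3)) {
                    *entry_failure_code = ENTRY_FAIL_PDPTE;         /* assumed name */
                    return 1;
            }

            vcpu->arch.cr3 = cr3;
            kvm_mmu_reset_context(vcpu);    /* far fewer MMU operations than kvm_set_cr3 */
            return 0;
    }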
• KVM: nVMX: support restore of VMX capability MSRs · 62cc6b9d
  Committed by David Matlack
      The VMX capability MSRs advertise the set of features the KVM virtual
      CPU can support. This set of features varies across different host CPUs
and KVM versions. This patch aims to address both sources of
      differences, allowing VMs to be migrated across CPUs and KVM versions
      without guest-visible changes to these MSRs. Note that cross-KVM-
      version migration is only supported from this point forward.
      
      When the VMX capability MSRs are restored, they are audited to check
      that the set of features advertised are a subset of what KVM and the
      CPU support.
      
      Since the VMX capability MSRs are read-only, they do not need to be on
      the default MSR save/restore lists. The userspace hypervisor can set
      the values of these MSRs or read them from KVM at VCPU creation time,
      and restore the same value after every save/restore.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
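A userspace sketch of the save/restore flow described above (assumed flow, not code from the patch): read a VMX capability MSR right after vcpu creation, keep it with the saved VM state, and write the same value back on restore so the guest never sees it change. KVM audits the written value against what it and the host CPU support.

    #include <linux/kvm.h>
    #include <string.h>
    #include <sys/ioctl.h>

    struct one_msr {
            struct kvm_msrs hdr;
            struct kvm_msr_entry entries[1];
    };

    static int get_vmx_basic(int vcpu_fd, __u64 *val)
    {
            struct one_msr m;

            memset(&m, 0, sizeof(m));
            m.hdr.nmsrs = 1;
            m.entries[0].index = 0x480;     /* IA32_VMX_BASIC, as one example */
            if (ioctl(vcpu_fd, KVM_GET_MSRS, &m) != 1)
                    return -1;
            *val = m.entries[0].data;
            return 0;
    }

    static int set_vmx_basic(int vcpu_fd, __u64 val)
    {
            struct one_msr m;

            memset(&m, 0, sizeof(m));
            m.hdr.nmsrs = 1;
            m.entries[0].index = 0x480;
            m.entries[0].data = val;        /* must be a subset of what KVM supports */
            return ioctl(vcpu_fd, KVM_SET_MSRS, &m) == 1 ? 0 : -1;
    }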
• KVM: x86: Add kvm_skip_emulated_instruction and use it. · 6affcbed
  Committed by Kyle Huey
      kvm_skip_emulated_instruction calls both
      kvm_x86_ops->skip_emulated_instruction and kvm_vcpu_check_singlestep,
      skipping the emulated instruction and generating a trap if necessary.
      
      Replacing skip_emulated_instruction calls with
      kvm_skip_emulated_instruction is straightforward, except for:
      
      - ICEBP, which is already inside a trap, so avoid triggering another trap.
      - Instructions that can trigger exits to userspace, such as the IO insns,
        MOVs to CR8, and HALT. If kvm_skip_emulated_instruction does trigger a
        KVM_GUESTDBG_SINGLESTEP exit, and the handling code for
        IN/OUT/MOV CR8/HALT also triggers an exit to userspace, the latter will
        take precedence. The singlestep will be triggered again on the next
        instruction, which is the current behavior.
- Task switch instructions, which would require additional handling (e.g.
  the task switch bit); these are instead left alone.
      - Cases where VMLAUNCH/VMRESUME do not proceed to the next instruction,
        which do not trigger singlestep traps as mentioned previously.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
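An approximate sketch of the new helper per the description above (the real code in x86.c may differ slightly; the call-site name below is purely illustrative): skip the instruction, then run the single-step check so KVM_GUESTDBG_SINGLESTEP can force an exit. Callers treat a zero return as "exit to userspace".

    int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
    {
            unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
            int r = EMULATE_DONE;

            kvm_x86_ops->skip_emulated_instruction(vcpu);
            kvm_vcpu_check_singlestep(vcpu, rflags, &r);

            return r == EMULATE_DONE;
    }

    /* Typical call-site conversion: "skip + return 1" becomes one call whose
     * return value already encodes whether to stay in the kernel. */
    static int handle_example_exit(struct kvm_vcpu *vcpu)
    {
            return kvm_skip_emulated_instruction(vcpu);
    }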
• KVM: x86: Add a return value to kvm_emulate_cpuid · 6a908b62
  Committed by Kyle Huey
      Once skipping the emulated instruction can potentially trigger an exit to
      userspace (via KVM_GUESTDBG_SINGLESTEP) kvm_emulate_cpuid will need to
      propagate a return value.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
13. 06 Dec 2016: 1 commit
14. 02 Dec 2016: 1 commit
15. 30 Nov 2016: 4 commits