1. 09 1月, 2017 3 次提交
    • J
      kvm: x86: mmu: Rename spte_is_locklessly_modifiable() · ea4114bc
      Junaid Shahid 提交于
      This change renames spte_is_locklessly_modifiable() to
      spte_can_locklessly_be_made_writable() to distinguish it from other
      forms of lockless modifications. The full set of lockless modifications
      is covered by spte_has_volatile_bits().
      Signed-off-by: NJunaid Shahid <junaids@google.com>
      Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ea4114bc
    • D
      kvm: x86: reduce collisions in mmu_page_hash · 114df303
      David Matlack 提交于
      When using two-dimensional paging, the mmu_page_hash (which provides
      lookups for existing kvm_mmu_page structs), becomes imbalanced; with
      too many collisions in buckets 0 and 512. This has been seen to cause
      mmu_lock to be held for multiple milliseconds in kvm_mmu_get_page on
      VMs with a large amount of RAM mapped with 4K pages.
      
      The current hash function uses the lower 10 bits of gfn to index into
      mmu_page_hash. When doing shadow paging, gfn is the address of the
      guest page table being shadow. These tables are 4K-aligned, which
      makes the low bits of gfn a good hash. However, with two-dimensional
      paging, no guest page tables are being shadowed, so gfn is the base
      address that is mapped by the table. Thus page tables (level=1) have
      a 2MB aligned gfn, page directories (level=2) have a 1GB aligned gfn,
      etc. This means hashes will only differ in their 10th bit.
      
      hash_64() provides a better hash. For example, on a VM with ~200G
      (99458 direct=1 kvm_mmu_page structs):
      
      hash            max_mmu_page_hash_collisions
      --------------------------------------------
      low 10 bits     49847
      hash_64         105
      perfect         97
      
      While we're changing the hash, increase the table size by 4x to better
      support large VMs (further reduces number of collisions in 200G VM to
      29).
      
      Note that hash_64() does not provide a good distribution prior to commit
      ef703f49 ("Eliminate bad hash multipliers from hash_32() and
      hash_64()").
      Signed-off-by: NDavid Matlack <dmatlack@google.com>
      Change-Id: I5aa6b13c834722813c6cca46b8b1ed6f53368ade
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      114df303
    • D
      kvm: x86: export maximum number of mmu_page_hash collisions · f3414bc7
      David Matlack 提交于
      Report the maximum number of mmu_page_hash collisions as a per-VM stat.
      This will make it easy to identify problems with the mmu_page_hash in
      the future.
      Signed-off-by: NDavid Matlack <dmatlack@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f3414bc7
  2. 25 11月, 2016 1 次提交
    • T
      kvm: svm: Add support for additional SVM NPF error codes · 14727754
      Tom Lendacky 提交于
      AMD hardware adds two additional bits to aid in nested page fault handling.
      
      Bit 32 - NPF occurred while translating the guest's final physical address
      Bit 33 - NPF occurred while translating the guest page tables
      
      The guest page tables fault indicator can be used as an aid for nested
      virtualization. Using V0 for the host, V1 for the first level guest and
      V2 for the second level guest, when both V1 and V2 are using nested paging
      there are currently a number of unnecessary instruction emulations. When
      V2 is launched shadow paging is used in V1 for the nested tables of V2. As
      a result, KVM marks these pages as RO in the host nested page tables. When
      V2 exits and we resume V1, these pages are still marked RO.
      
      Every nested walk for a guest page table is treated as a user-level write
      access and this causes a lot of NPFs because the V1 page tables are marked
      RO in the V0 nested tables. While executing V1, when these NPFs occur KVM
      sees a write to a read-only page, emulates the V1 instruction and unprotects
      the page (marking it RW). This patch looks for cases where we get a NPF due
      to a guest page table walk where the page was marked RO. It immediately
      unprotects the page and resumes the guest, leading to far fewer instruction
      emulations when nested virtualization is used.
      Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com>
      Reviewed-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NBrijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      14727754
  3. 23 11月, 2016 1 次提交
  4. 04 11月, 2016 2 次提交
  5. 03 11月, 2016 2 次提交
  6. 19 8月, 2016 1 次提交
  7. 14 7月, 2016 5 次提交
    • P
      x86/kvm: Audit and remove any unnecessary uses of module.h · 1767e931
      Paul Gortmaker 提交于
      Historically a lot of these existed because we did not have
      a distinction between what was modular code and what was providing
      support to modules via EXPORT_SYMBOL and friends.  That changed
      when we forked out support for the latter into the export.h file.
      
      This means we should be able to reduce the usage of module.h
      in code that is obj-y Makefile or bool Kconfig.  In the case of
      kvm where it is modular, we can extend that to also include files
      that are building basic support functionality but not related
      to loading or registering the final module; such files also have
      no need whatsoever for module.h
      
      The advantage in removing such instances is that module.h itself
      sources about 15 other headers; adding significantly to what we feed
      cpp, and it can obscure what headers we are effectively using.
      
      Since module.h was the source for init.h (for __init) and for
      export.h (for EXPORT_SYMBOL) we consider each instance for the
      presence of either and replace as needed.
      
      Several instances got replaced with moduleparam.h since that was
      really all that was required for those particular files.
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kvm@vger.kernel.org
      Link: http://lkml.kernel.org/r/20160714001901.31603-8-paul.gortmaker@windriver.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1767e931
    • B
      kvm: mmu: track read permission explicitly for shadow EPT page tables · d95c5568
      Bandan Das 提交于
      To support execute only mappings on behalf of L1 hypervisors,
      reuse ACC_USER_MASK to signify if the L1 hypervisor has the R bit
      set.
      
      For the nested EPT case, we assumed that the U bit was always set
      since there was no equivalent in EPT page tables.  Strictly
      speaking, this was not necessary because handle_ept_violation
      never set PFERR_USER_MASK in the error code (uf=0 in the
      parlance of update_permission_bitmask).  We now have to set
      both U and UF correctly, respectively in FNAME(gpte_access)
      and in handle_ept_violation.
      
      Also in handle_ept_violation bit 3 of the exit qualification is
      not enough to detect a present PTE; all three bits 3-5 have to
      be checked.
      Signed-off-by: NBandan Das <bsd@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d95c5568
    • B
      kvm: mmu: don't set the present bit unconditionally · ffb128c8
      Bandan Das 提交于
      To support execute only mappings on behalf of L1
      hypervisors, we need to teach set_spte() to honor all three of
      L1's XWR bits.  As a start, add a new variable "shadow_present_mask"
      that will be set for non-EPT shadow paging and clear for EPT.
      Signed-off-by: NBandan Das <bsd@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ffb128c8
    • B
      kvm: mmu: remove is_present_gpte() · 812f30b2
      Bandan Das 提交于
      We have two versions of the above function.
      To prevent confusion and bugs in the future, remove
      the non-FNAME version entirely and replace all calls
      with the actual check.
      Signed-off-by: NBandan Das <bsd@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      812f30b2
    • B
      kvm: mmu: extend the is_present check to 32 bits · 8d5cf161
      Bandan Das 提交于
      This is safe because this function is called
      on host controlled page table and non-present/non-MMIO
      sptes never use bits 1..31. For the EPT case, this
      ensures that cases where only the execute bit is set
      is marked valid.
      Signed-off-by: NBandan Das <bsd@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      8d5cf161
  8. 14 6月, 2016 1 次提交
  9. 02 6月, 2016 1 次提交
    • N
      KVM: x86: avoid write-tearing of TDP · b19ee2ff
      Nadav Amit 提交于
      In theory, nothing prevents the compiler from write-tearing PTEs, or
      split PTE writes. These partially-modified PTEs can be fetched by other
      cores and cause mayhem. I have not really encountered such case in
      real-life, but it does seem possible.
      
      For example, the compiler may try to do something creative for
      kvm_set_pte_rmapp() and perform multiple writes to the PTE.
      Signed-off-by: NNadav Amit <nadav.amit@gmail.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      b19ee2ff
  10. 06 5月, 2016 1 次提交
    • A
      mm: thp: kvm: fix memory corruption in KVM with THP enabled · 127393fb
      Andrea Arcangeli 提交于
      After the THP refcounting change, obtaining a compound pages from
      get_user_pages() no longer allows us to assume the entire compound page
      is immediately mappable from a secondary MMU.
      
      A secondary MMU doesn't want to call get_user_pages() more than once for
      each compound page, in order to know if it can map the whole compound
      page.  So a secondary MMU needs to know from a single get_user_pages()
      invocation when it can map immediately the entire compound page to avoid
      a flood of unnecessary secondary MMU faults and spurious
      atomic_inc()/atomic_dec() (pages don't have to be pinned by MMU notifier
      users).
      
      Ideally instead of the page->_mapcount < 1 check, get_user_pages()
      should return the granularity of the "page" mapping in the "mm" passed
      to get_user_pages().  However it's non trivial change to pass the "pmd"
      status belonging to the "mm" walked by get_user_pages up the stack (up
      to the caller of get_user_pages).  So the fix just checks if there is
      not a single pte mapping on the page returned by get_user_pages, and in
      turn if the caller can assume that the whole compound page is mapped in
      the current "mm" (in a pmd_trans_huge()).  In such case the entire
      compound page is safe to map into the secondary MMU without additional
      get_user_pages() calls on the surrounding tail/head pages.  In addition
      of being faster, not having to run other get_user_pages() calls also
      reduces the memory footprint of the secondary MMU fault in case the pmd
      split happened as result of memory pressure.
      
      Without this fix after a MADV_DONTNEED (like invoked by QEMU during
      postcopy live migration or balloning) or after generic swapping (with a
      failure in split_huge_page() that would only result in pmd splitting and
      not a physical page split), KVM would map the whole compound page into
      the shadow pagetables, despite regular faults or userfaults (like
      UFFDIO_COPY) may map regular pages into the primary MMU as result of the
      pte faults, leading to the guest mode and userland mode going out of
      sync and not working on the same memory at all times.
      
      Any other secondary MMU notifier manager (KVM is just one of the many
      MMU notifier users) will need the same information if it doesn't want to
      run a flood of get_user_pages_fast and it can support multiple
      granularity in the secondary MMU mappings, so I think it is justified to
      be exposed not just to KVM.
      
      The other option would be to move transparent_hugepage_adjust to
      mm/huge_memory.c but that currently has all kind of KVM data structures
      in it, so it's definitely not a cut-and-paste work, so I couldn't do a
      fix as cleaner as this one for 4.6.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: "Li, Liang Z" <liang.z.li@intel.com>
      Cc: Amit Shah <amit.shah@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      127393fb
  11. 20 4月, 2016 1 次提交
  12. 01 4月, 2016 1 次提交
  13. 31 3月, 2016 1 次提交
  14. 22 3月, 2016 3 次提交
  15. 10 3月, 2016 1 次提交
    • P
      KVM: MMU: fix reserved bit check for ept=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 · 5f0b8199
      Paolo Bonzini 提交于
      KVM has special logic to handle pages with pte.u=1 and pte.w=0 when
      CR0.WP=1.  These pages' SPTEs flip continuously between two states:
      U=1/W=0 (user and supervisor reads allowed, supervisor writes not allowed)
      and U=0/W=1 (supervisor reads and writes allowed, user writes not allowed).
      
      When SMEP is in effect, however, U=0 will enable kernel execution of
      this page.  To avoid this, KVM also sets NX=1 in the shadow PTE together
      with U=0, making the two states U=1/W=0/NX=gpte.NX and U=0/W=1/NX=1.
      When guest EFER has the NX bit cleared, the reserved bit check thinks
      that the latter state is invalid; teach it that the smep_andnot_wp case
      will also use the NX bit of SPTEs.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: NXiao Guangrong <guangrong.xiao@linux.inel.com>
      Fixes: c258b62bSigned-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5f0b8199
  16. 08 3月, 2016 8 次提交
  17. 04 3月, 2016 3 次提交
    • X
      KVM: MMU: check kvm_mmu_pages and mmu_page_path indices · e23d3fef
      Xiao Guangrong 提交于
      Give a special invalid index to the root of the walk, so that we
      can check the consistency of kvm_mmu_pages and mmu_page_path.
      Signed-off-by: NXiao Guangrong <guangrong.xiao@linux.intel.com>
      [Extracted from a bigger patch proposed by Guangrong. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e23d3fef
    • P
      KVM: MMU: Fix ubsan warnings · 0a47cd85
      Paolo Bonzini 提交于
      kvm_mmu_pages_init is doing some really yucky stuff.  It is setting
      up a sentinel for mmu_page_clear_parents; however, because of a) the
      way levels are numbered starting from 1 and b) the way mmu_page_path
      sizes its arrays with PT64_ROOT_LEVEL-1 elements, the access can be
      out of bounds.  This is harmless because the code overwrites up to the
      first two elements of parents->idx and these are initialized, and
      because the sentinel is not needed in this case---mmu_page_clear_parents
      exits anyway when it gets to the end of the array.  However ubsan
      complains, and everyone else should too.
      
      This fix does three things.  First it makes the mmu_page_path arrays
      PT64_ROOT_LEVEL elements in size, so that we can write to them without
      checking the level in advance.  Second it disintegrates kvm_mmu_pages_init
      between mmu_unsync_walk (to reset the struct kvm_mmu_pages) and
      for_each_sp (to place the NULL sentinel at the end of the current path).
      This is okay because the mmu_page_path is only used in
      mmu_pages_clear_parents; mmu_pages_clear_parents itself is called within
      a for_each_sp iterator, and hence always after a call to mmu_pages_next.
      Third it changes mmu_pages_clear_parents to just use the sentinel to
      stop iteration, without checking the bounds on level.
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Reported-by: NMike Krinkin <krinkin.m.u@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      0a47cd85
    • P
      KVM: MMU: cleanup handle_abnormal_pfn · 798e88b3
      Paolo Bonzini 提交于
      The goto and temporary variable are unnecessary, just use return
      statements.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      798e88b3
  18. 03 3月, 2016 4 次提交