1. 12 8月, 2017 1 次提交
    • W
      KVM: MMU: Fix softlockup due to mmu_lock is held too long · 42bcbebf
      Wanpeng Li 提交于
      watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [warn_test:3089]
       irq event stamp: 20532
       hardirqs last  enabled at (20531): [<ffffffff8e9b6908>] restore_regs_and_iret+0x0/0x1d
       hardirqs last disabled at (20532): [<ffffffff8e9b7ae8>] apic_timer_interrupt+0x98/0xb0
       softirqs last  enabled at (8266): [<ffffffff8e9badc6>] __do_softirq+0x206/0x4c1
       softirqs last disabled at (8253): [<ffffffff8e083918>] irq_exit+0xf8/0x100
       CPU: 5 PID: 3089 Comm: warn_test Tainted: G           OE   4.13.0-rc3+ #8
       RIP: 0010:kvm_mmu_prepare_zap_page+0x72/0x4b0 [kvm]
       Call Trace:
        make_mmu_pages_available.isra.120+0x71/0xc0 [kvm]
        kvm_mmu_load+0x1cf/0x410 [kvm]
        kvm_arch_vcpu_ioctl_run+0x1316/0x1bf0 [kvm]
        kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? __fget+0xfc/0x210
        do_vfs_ioctl+0xa4/0x6a0
        ? __fget+0x11d/0x210
        SyS_ioctl+0x79/0x90
        entry_SYSCALL_64_fastpath+0x23/0xc2
        ? __this_cpu_preempt_check+0x13/0x20
      
      This can be reproduced readily by ept=N and running syzkaller tests since
      many syzkaller testcases don't setup any memory regions. However, if ept=Y
      rmode identity map will be created, then kvm_mmu_calculate_mmu_pages() will
      extend the number of VM's mmu pages to at least KVM_MIN_ALLOC_MMU_PAGES
      which just hide the issue.
      
      I saw the scenario kvm->arch.n_max_mmu_pages == 0 && kvm->arch.n_used_mmu_pages == 1,
      so there is one active mmu page on the list, kvm_mmu_prepare_zap_page() fails
      to zap any pages, however prepare_zap_oldest_mmu_page() always returns true.
      It incurs infinite loop in make_mmu_pages_available() which causes mmu->lock
      softlockup.
      
      This patch fixes it by setting the return value of prepare_zap_oldest_mmu_page()
      according to whether or not there is mmu page zapped.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      42bcbebf
  2. 10 8月, 2017 2 次提交
    • P
      kvm: nVMX: Add support for fast unprotection of nested guest page tables · eebed243
      Paolo Bonzini 提交于
      This is the same as commit 14727754 ("kvm: svm: Add support for
      additional SVM NPF error codes", 2016-11-23), but for Intel processors.
      In this case, the exit qualification field's bit 8 says whether the
      EPT violation occurred while translating the guest's final physical
      address or rather while translating the guest page tables.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      eebed243
    • B
      KVM: SVM: Limit PFERR_NESTED_GUEST_PAGE error_code check to L1 guest · 64531a3b
      Brijesh Singh 提交于
      Commit 14727754 ("kvm: svm: Add support for additional SVM NPF error
      codes", 2016-11-23) added a new error code to aid nested page fault
      handling.  The commit unprotects (kvm_mmu_unprotect_page) the page when
      we get a NPF due to guest page table walk where the page was marked RO.
      
      However, if an L0->L2 shadow nested page table can also be marked read-only
      when a page is read only in L1's nested page table.  If such a page
      is accessed by L2 while walking page tables it can cause a nested
      page fault (page table walks are write accesses).  However, after
      kvm_mmu_unprotect_page we may get another page fault, and again in an
      endless stream.
      
      To cover this use case, we qualify the new error_code check with
      vcpu->arch.mmu_direct_map so that the error_code check would run on L1
      guest, and not the L2 guest.  This avoids hitting the above scenario.
      
      Fixes: 14727754
      Cc: stable@vger.kernel.org
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: NBrijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      64531a3b
  3. 07 8月, 2017 1 次提交
  4. 14 7月, 2017 2 次提交
  5. 03 7月, 2017 4 次提交
    • P
      x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12 · 995f00a6
      Peter Feiner 提交于
      EPT A/D was enabled in the vmcs02 EPTP regardless of the vmcs12's EPTP
      value. The problem is that enabling A/D changes the behavior of L2's
      x86 page table walks as seen by L1. With A/D enabled, x86 page table
      walks are always treated as EPT writes.
      
      Commit ae1e2d10 ("kvm: nVMX: support EPT accessed/dirty bits",
      2017-03-30) tried to work around this problem by clearing the write
      bit in the exit qualification for EPT violations triggered by page
      walks.  However, that fixup introduced the opposite bug: page-table walks
      that actually set x86 A/D bits were *missing* the write bit in the exit
      qualification.
      
      This patch fixes the problem by disabling EPT A/D in the shadow MMU
      when EPT A/D is disabled in vmcs12's EPTP.
      Signed-off-by: NPeter Feiner <pfeiner@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      995f00a6
    • P
      kvm: x86: mmu: allow A/D bits to be disabled in an mmu · ac8d57e5
      Peter Feiner 提交于
      Adds the plumbing to disable A/D bits in the MMU based on a new role
      bit, ad_disabled. When A/D is disabled, the MMU operates as though A/D
      aren't available (i.e., using access tracking faults instead).
      
      To avoid SP -> kvm_mmu_page.role.ad_disabled lookups all over the
      place, A/D disablement is now stored in the SPTE. This state is stored
      in the SPTE by tweaking the use of SPTE_SPECIAL_MASK for access
      tracking. Rather than just setting SPTE_SPECIAL_MASK when an
      access-tracking SPTE is non-present, we now always set
      SPTE_SPECIAL_MASK for access-tracking SPTEs.
      Signed-off-by: NPeter Feiner <pfeiner@google.com>
      [Use role.ad_disabled even for direct (non-shadow) EPT page tables.  Add
       documentation and a few MMU_WARN_ONs. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ac8d57e5
    • P
      x86: kvm: mmu: make spte mmio mask more explicit · dcdca5fe
      Peter Feiner 提交于
      Specify both a mask (i.e., bits to consider) and a value (i.e.,
      pattern of bits that indicates a special PTE) for mmio SPTEs. On
      Intel, this lets us pack even more information into the
      (SPTE_SPECIAL_MASK | EPT_VMX_RWX_MASK) mask we use for access
      tracking liberating all (SPTE_SPECIAL_MASK | (non-misconfigured-RWX))
      values.
      Signed-off-by: NPeter Feiner <pfeiner@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      dcdca5fe
    • P
      x86: kvm: mmu: dead code thanks to access tracking · ce00053b
      Peter Feiner 提交于
      The MMU always has hardware A bits or access tracking support, thus
      it's unnecessary to handle the scenario where we have neither.
      Signed-off-by: NPeter Feiner <pfeiner@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ce00053b
  6. 11 6月, 2017 1 次提交
    • W
      KVM: async_pf: avoid async pf injection when in guest mode · 9bc1f09f
      Wanpeng Li 提交于
       INFO: task gnome-terminal-:1734 blocked for more than 120 seconds.
             Not tainted 4.12.0-rc4+ #8
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       gnome-terminal- D    0  1734   1015 0x00000000
       Call Trace:
        __schedule+0x3cd/0xb30
        schedule+0x40/0x90
        kvm_async_pf_task_wait+0x1cc/0x270
        ? __vfs_read+0x37/0x150
        ? prepare_to_swait+0x22/0x70
        do_async_page_fault+0x77/0xb0
        ? do_async_page_fault+0x77/0xb0
        async_page_fault+0x28/0x30
      
      This is triggered by running both win7 and win2016 on L1 KVM simultaneously,
      and then gives stress to memory on L1, I can observed this hang on L1 when
      at least ~70% swap area is occupied on L0.
      
      This is due to async pf was injected to L2 which should be injected to L1,
      L2 guest starts receiving pagefault w/ bogus %cr2(apf token from the host
      actually), and L1 guest starts accumulating tasks stuck in D state in
      kvm_async_pf_task_wait() since missing PAGE_READY async_pfs.
      
      This patch fixes the hang by doing async pf when executing L1 guest.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9bc1f09f
  7. 09 5月, 2017 1 次提交
  8. 07 4月, 2017 1 次提交
    • P
      kvm: nVMX: support EPT accessed/dirty bits · ae1e2d10
      Paolo Bonzini 提交于
      Now use bit 6 of EPTP to optionally enable A/D bits for EPTP.  Another
      thing to change is that, when EPT accessed and dirty bits are not in use,
      VMX treats accesses to guest paging structures as data reads.  When they
      are in use (bit 6 of EPTP is set), they are treated as writes and the
      corresponding EPT dirty bit is set.  The MMU didn't know this detail,
      so this patch adds it.
      
      We also have to fix up the exit qualification.  It may be wrong because
      KVM sets bit 6 but the guest might not.
      
      L1 emulates EPT A/D bits using write permissions, so in principle it may
      be possible for EPT A/D bits to be used by L1 even though not available
      in hardware.  The problem is that guest page-table walks will be treated
      as reads rather than writes, so they would not cause an EPT violation.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      [Fixed typo in walk_addr_generic() comment and changed bit clear +
       conditional-set pattern in handle_ept_violation() to conditional-clear]
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      ae1e2d10
  9. 02 3月, 2017 1 次提交
  10. 28 2月, 2017 1 次提交
  11. 27 1月, 2017 4 次提交
  12. 09 1月, 2017 7 次提交
  13. 25 11月, 2016 1 次提交
    • T
      kvm: svm: Add support for additional SVM NPF error codes · 14727754
      Tom Lendacky 提交于
      AMD hardware adds two additional bits to aid in nested page fault handling.
      
      Bit 32 - NPF occurred while translating the guest's final physical address
      Bit 33 - NPF occurred while translating the guest page tables
      
      The guest page tables fault indicator can be used as an aid for nested
      virtualization. Using V0 for the host, V1 for the first level guest and
      V2 for the second level guest, when both V1 and V2 are using nested paging
      there are currently a number of unnecessary instruction emulations. When
      V2 is launched shadow paging is used in V1 for the nested tables of V2. As
      a result, KVM marks these pages as RO in the host nested page tables. When
      V2 exits and we resume V1, these pages are still marked RO.
      
      Every nested walk for a guest page table is treated as a user-level write
      access and this causes a lot of NPFs because the V1 page tables are marked
      RO in the V0 nested tables. While executing V1, when these NPFs occur KVM
      sees a write to a read-only page, emulates the V1 instruction and unprotects
      the page (marking it RW). This patch looks for cases where we get a NPF due
      to a guest page table walk where the page was marked RO. It immediately
      unprotects the page and resumes the guest, leading to far fewer instruction
      emulations when nested virtualization is used.
      Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com>
      Reviewed-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NBrijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      14727754
  14. 23 11月, 2016 1 次提交
  15. 04 11月, 2016 2 次提交
  16. 03 11月, 2016 2 次提交
  17. 19 8月, 2016 1 次提交
  18. 14 7月, 2016 5 次提交
    • P
      x86/kvm: Audit and remove any unnecessary uses of module.h · 1767e931
      Paul Gortmaker 提交于
      Historically a lot of these existed because we did not have
      a distinction between what was modular code and what was providing
      support to modules via EXPORT_SYMBOL and friends.  That changed
      when we forked out support for the latter into the export.h file.
      
      This means we should be able to reduce the usage of module.h
      in code that is obj-y Makefile or bool Kconfig.  In the case of
      kvm where it is modular, we can extend that to also include files
      that are building basic support functionality but not related
      to loading or registering the final module; such files also have
      no need whatsoever for module.h
      
      The advantage in removing such instances is that module.h itself
      sources about 15 other headers; adding significantly to what we feed
      cpp, and it can obscure what headers we are effectively using.
      
      Since module.h was the source for init.h (for __init) and for
      export.h (for EXPORT_SYMBOL) we consider each instance for the
      presence of either and replace as needed.
      
      Several instances got replaced with moduleparam.h since that was
      really all that was required for those particular files.
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kvm@vger.kernel.org
      Link: http://lkml.kernel.org/r/20160714001901.31603-8-paul.gortmaker@windriver.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1767e931
    • B
      kvm: mmu: track read permission explicitly for shadow EPT page tables · d95c5568
      Bandan Das 提交于
      To support execute only mappings on behalf of L1 hypervisors,
      reuse ACC_USER_MASK to signify if the L1 hypervisor has the R bit
      set.
      
      For the nested EPT case, we assumed that the U bit was always set
      since there was no equivalent in EPT page tables.  Strictly
      speaking, this was not necessary because handle_ept_violation
      never set PFERR_USER_MASK in the error code (uf=0 in the
      parlance of update_permission_bitmask).  We now have to set
      both U and UF correctly, respectively in FNAME(gpte_access)
      and in handle_ept_violation.
      
      Also in handle_ept_violation bit 3 of the exit qualification is
      not enough to detect a present PTE; all three bits 3-5 have to
      be checked.
      Signed-off-by: NBandan Das <bsd@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d95c5568
    • B
      kvm: mmu: don't set the present bit unconditionally · ffb128c8
      Bandan Das 提交于
      To support execute only mappings on behalf of L1
      hypervisors, we need to teach set_spte() to honor all three of
      L1's XWR bits.  As a start, add a new variable "shadow_present_mask"
      that will be set for non-EPT shadow paging and clear for EPT.
      Signed-off-by: NBandan Das <bsd@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ffb128c8
    • B
      kvm: mmu: remove is_present_gpte() · 812f30b2
      Bandan Das 提交于
      We have two versions of the above function.
      To prevent confusion and bugs in the future, remove
      the non-FNAME version entirely and replace all calls
      with the actual check.
      Signed-off-by: NBandan Das <bsd@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      812f30b2
    • B
      kvm: mmu: extend the is_present check to 32 bits · 8d5cf161
      Bandan Das 提交于
      This is safe because this function is called
      on host controlled page table and non-present/non-MMIO
      sptes never use bits 1..31. For the EPT case, this
      ensures that cases where only the execute bit is set
      is marked valid.
      Signed-off-by: NBandan Das <bsd@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      8d5cf161
  19. 14 6月, 2016 1 次提交
  20. 02 6月, 2016 1 次提交
    • N
      KVM: x86: avoid write-tearing of TDP · b19ee2ff
      Nadav Amit 提交于
      In theory, nothing prevents the compiler from write-tearing PTEs, or
      split PTE writes. These partially-modified PTEs can be fetched by other
      cores and cause mayhem. I have not really encountered such case in
      real-life, but it does seem possible.
      
      For example, the compiler may try to do something creative for
      kvm_set_pte_rmapp() and perform multiple writes to the PTE.
      Signed-off-by: NNadav Amit <nadav.amit@gmail.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      b19ee2ff