1. 12 January 2021, 1 commit
  2. 23 October 2020, 1 commit
  3. 22 October 2020, 1 commit
  4. 28 September 2020, 1 commit
  5. 12 September 2020, 1 commit
  6. 22 August 2020, 1 commit
    • KVM: Pass MMU notifier range flags to kvm_unmap_hva_range() · fdfe7cbd
      Will Deacon committed
      The 'flags' field of 'struct mmu_notifier_range' is used to indicate
      whether invalidate_range_{start,end}() are permitted to block. In the
      case of kvm_mmu_notifier_invalidate_range_start(), this field is not
      forwarded on to the architecture-specific implementation of
      kvm_unmap_hva_range() and therefore the backend cannot sensibly decide
      whether or not to block.
      
      Add an extra 'flags' parameter to kvm_unmap_hva_range() so that
      architectures are aware as to whether or not they are permitted to block.
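
      A minimal sketch of the changed hook, assuming the flag bit checked is
      the existing MMU_NOTIFIER_RANGE_BLOCKABLE (not the verbatim patch):

      int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start,
                              unsigned long end, unsigned int flags);

      /* in an architecture backend, blocking can then be gated on: */
      bool blockable = flags & MMU_NOTIFIER_RANGE_BLOCKABLE;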
      
      Cc: <stable@vger.kernel.org>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
      Message-Id: <20200811102725.7121-2-will@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fdfe7cbd
  7. 13 August 2020, 1 commit
  8. 10 July 2020, 1 commit
  9. 09 July 2020, 1 commit
  10. 02 July 2020, 1 commit
  11. 10 June 2020, 2 commits
    • mmap locking API: use coccinelle to convert mmap_sem rwsem call sites · d8ed45c5
      Michel Lespinasse committed
      This change converts the existing mmap_sem rwsem calls to use the new mmap
      locking API instead.
      
      The change is generated using coccinelle with the following rule:
      
      // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .
      
      @@
      expression mm;
      @@
      (
      -init_rwsem
      +mmap_init_lock
      |
      -down_write
      +mmap_write_lock
      |
      -down_write_killable
      +mmap_write_lock_killable
      |
      -down_write_trylock
      +mmap_write_trylock
      |
      -up_write
      +mmap_write_unlock
      |
      -downgrade_write
      +mmap_write_downgrade
      |
      -down_read
      +mmap_read_lock
      |
      -down_read_killable
      +mmap_read_lock_killable
      |
      -down_read_trylock
      +mmap_read_trylock
      |
      -up_read
      +mmap_read_unlock
      )
      -(&mm->mmap_sem)
      +(mm)
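
      For illustration, a typical call site converted by this rule looks like
      the following sketch (walk_vmas() is a made-up example function):

      /* before */
      static void walk_vmas(struct mm_struct *mm)
      {
              down_read(&mm->mmap_sem);
              /* ... walk mm->mmap ... */
              up_read(&mm->mmap_sem);
      }

      /* after */
      static void walk_vmas(struct mm_struct *mm)
      {
              mmap_read_lock(mm);
              /* ... walk mm->mmap ... */
              mmap_read_unlock(mm);
      }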
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Liam Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ying Han <yinghan@google.com>
      Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8ed45c5
    • mm: don't include asm/pgtable.h if linux/mm.h is already included · e31cf2f4
      Mike Rapoport committed
      Patch series "mm: consolidate definitions of page table accessors", v2.
      
      The low level page table accessors (pXY_index(), pXY_offset()) are
      duplicated across all architectures and sometimes more than once.  For
      instance, we have 31 definitions of pgd_offset() for 25 supported
      architectures.
      
      Most of these definitions are actually identical and typically it boils
      down to, e.g.
      
      static inline unsigned long pmd_index(unsigned long address)
      {
              return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
      }
      
      static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
      {
              return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
      }
      
      These definitions can be shared among 90% of the arches provided
      XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.
      
      For architectures that really need a custom version, there is always the
      possibility to override the generic version with the usual ifdefs magic.
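
      For example, a shared header could carry the generic accessor guarded so
      that an architecture may still supply its own, roughly along these lines
      (a sketch of the pattern, not the exact header text):

      #ifndef pmd_offset
      static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
      {
              return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
      }
      #define pmd_offset pmd_offset
      #endif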
      
      These patches introduce include/linux/pgtable.h that replaces
      include/asm-generic/pgtable.h and add the definitions of the page table
      accessors to the new header.
      
      This patch (of 12):
      
      The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
      functions involving page table manipulations, e.g.  pte_alloc() and
      pmd_alloc().  So, there is no point to explicitly include <asm/pgtable.h>
      in the files that include <linux/mm.h>.
      
      The include statements in such cases are removed with a simple loop:
      
      	for f in $(git grep -l "include <linux/mm.h>") ; do
      		sed -i -e '/include <asm\/pgtable.h>/ d' $f
      	done
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
      Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e31cf2f4
  12. 09 June 2020, 1 commit
    • mm/gup.c: convert to use get_user_{page|pages}_fast_only() · dadbb612
      Souptick Joarder committed
      The API __get_user_pages_fast() is renamed to get_user_pages_fast_only()
      to align with pin_user_pages_fast_only().

      As part of this, the write parameter is removed; callers now pass
      FOLL_WRITE to get_user_pages_fast_only() instead.  This does not change
      any existing functionality of the API.
      
      All the callers are changed to pass FOLL_WRITE.
      
      Also introduce get_user_page_fast_only(), and use it in a few places
      that hard-code nr_pages to 1.
      
      Updated the documentation of the API.
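
      A hedged sketch of a converted call site for the single-page helper
      (assuming the prototypes described above):

      struct page *page;

      /* old: __get_user_pages_fast(addr, 1, 1, &page) */
      if (get_user_page_fast_only(addr, FOLL_WRITE, &page)) {
              /* got a writable reference to the page */
              put_page(page);
      }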
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Paul Mackerras <paulus@ozlabs.org>		[arch/powerpc/kvm]
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Michal Suchanek <msuchanek@suse.de>
      Link: http://lkml.kernel.org/r/1590396812-31277-1-git-send-email-jrdr.linux@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dadbb612
  13. 08 June 2020, 1 commit
    • KVM: x86: Fix APIC page invalidation race · e649b3f0
      Eiichi Tsukata committed
      Commit b1394e74 ("KVM: x86: fix APIC page invalidation") tried
      to fix inappropriate APIC page invalidation by re-introducing arch
      specific kvm_arch_mmu_notifier_invalidate_range() and calling it from
      kvm_mmu_notifier_invalidate_range_start. However, the patch left a
      possible race where the VMCS APIC address cache is updated *before*
      it is unmapped:
      
        (Invalidator) kvm_mmu_notifier_invalidate_range_start()
        (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
        (KVM VCPU) vcpu_enter_guest()
        (KVM VCPU) kvm_vcpu_reload_apic_access_page()
        (Invalidator) actually unmap page
      
      Because of the above race, there can be a mismatch between the
      host physical address stored in the APIC_ACCESS_PAGE VMCS field and
      the host physical address stored in the EPT entry for the APIC GPA
      (0xfee00000).  When this happens, the processor will not trap APIC
      accesses, and will instead show the raw contents of the APIC-access page.
      Because Windows OS periodically checks for unexpected modifications to
      the LAPIC register, this will show up as a BSOD crash with BugCheck
      CRITICAL_STRUCTURE_CORRUPTION (109), which we are currently seeing in
      https://bugzilla.redhat.com/show_bug.cgi?id=1751017.
      
      The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range()
      cannot guarantee that no additional references are taken to the pages in
      the range before kvm_mmu_notifier_invalidate_range_end().  Fortunately,
      this case is supported by the MMU notifier API, as documented in
      include/linux/mmu_notifier.h:
      
      	 * If the subsystem
               * can't guarantee that no additional references are taken to
               * the pages in the range, it has to implement the
               * invalidate_range() notifier to remove any references taken
               * after invalidate_range_start().
      
      The fix therefore is to reload the APIC-access page field in the VMCS
      from kvm_mmu_notifier_invalidate_range() instead of ..._range_start().
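
      Conceptually the reload ends up being requested from the
      invalidate_range() path, roughly like this simplified sketch (not the
      verbatim patch):

      void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
                                                  unsigned long start,
                                                  unsigned long end)
      {
              unsigned long apic_address;

              /* The APIC-access page HPA is cached in the VMCS; ask all vCPUs
               * to reload it once the backing page is invalidated. */
              apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
              if (start <= apic_address && apic_address < end)
                      kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
      }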
      
      Cc: stable@vger.kernel.org
      Fixes: b1394e74 ("KVM: x86: fix APIC page invalidation")
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=197951
      Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com>
      Message-Id: <20200606042627.61070-1-eiichi.tsukata@nutanix.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e649b3f0
  14. 05 June 2020, 1 commit
  15. 04 June 2020, 1 commit
    • KVM: let kvm_destroy_vm_debugfs clean up vCPU debugfs directories · d56f5136
      Paolo Bonzini committed
      After commit 63d04348 ("KVM: x86: move kvm_create_vcpu_debugfs after
      last failure point") we are creating the per-vCPU debugfs files
      after the creation of the vCPU file descriptor.  This makes it
      possible for userspace to reach kvm_vcpu_release before
      kvm_create_vcpu_debugfs has finished.  The vcpu->debugfs_dentry
      then does not have any associated inode anymore, and this causes
      a NULL-pointer dereference in debugfs_create_file.
      
      The solution is simply to avoid removing the files; they are
      cleaned up when the VM file descriptor is closed (and that must be
      after KVM_CREATE_VCPU returns).  We can stop storing the dentry
      in struct kvm_vcpu too, because it is not needed anywhere after
      kvm_create_vcpu_debugfs returns.
      
      Reported-by: syzbot+705f4401d5a93a59b87d@syzkaller.appspotmail.com
      Fixes: 63d04348 ("KVM: x86: move kvm_create_vcpu_debugfs after last failure point")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d56f5136
  16. 01 June 2020, 4 commits
  17. 16 May 2020, 4 commits
  18. 14 May 2020, 1 commit
    • kvm: Replace vcpu->swait with rcuwait · da4ad88c
      Davidlohr Bueso committed
      The use of any sort of waitqueue (simple or regular) for
      waiting/waking vcpus has always been overkill and semantically
      wrong. Because this is per-vcpu, and it is the vcpu itself that
      blocks, there is only ever a single waiting vcpu, and thus no
      need for any sort of queue.
      
      As such, make use of the rcuwait primitive, with the following
      considerations:
      
        - rcuwait already provides the proper barriers that serialize
        concurrent waiter and waker.
      
        - Task wakeup is done in an RCU read-side critical section,
        with a stable task pointer.
      
        - Because there is no concurrency among waiters, we need
        not worry about rcuwait_wait_event() calls corrupting
        the wait->task. As a consequence, this saves the locking
        done in swait when modifying the queue. This also applies
        to per-vcore wait for powerpc kvm-hv.
      
      The x86 tscdeadline_latency test mentioned in 8577370f
      ("KVM: Use simple waitqueue for vcpu->wq") shows that, on avg,
      latency is reduced by around 15-20% with this change.
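
      For reference, the waiter/waker pairing of the primitive itself looks
      roughly like this (a generic sketch of the rcuwait API; vcpu_runnable is
      a placeholder condition, not code from this patch):

      struct rcuwait wait;

      rcuwait_init(&wait);

      /* the single blocked vcpu side */
      rcuwait_wait_event(&wait, READ_ONCE(vcpu_runnable), TASK_INTERRUPTIBLE);

      /* the waker side (e.g. interrupt injection) */
      rcuwait_wake_up(&wait);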
      
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: kvmarm@lists.cs.columbia.edu
      Cc: linux-mips@vger.kernel.org
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Message-Id: <20200424054837.5138-6-dave@stgolabs.net>
      [Avoid extra logic changes. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      da4ad88c
  19. 08 May 2020, 1 commit
  20. 25 April 2020, 1 commit
  21. 21 April 2020, 4 commits
  22. 16 April 2020, 1 commit
  23. 31 March 2020, 1 commit
  24. 26 March 2020, 1 commit
  25. 17 March 2020, 6 commits
    • KVM: Drop largepages_enabled and its accessor/mutator · 600087b6
      Sean Christopherson committed
      Drop largepages_enabled, kvm_largepages_enabled() and
      kvm_disable_largepages() now that all users are gone.
      
      Note, largepages_enabled was an x86-only flag that got left in common
      KVM code when KVM gained support for multiple architectures.
      
      No functional change intended.
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      600087b6
    • KVM: Drop gfn_to_pfn_atomic() · 2bde08f9
      Peter Xu committed
      It's never used anywhere now.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2bde08f9
    • KVM: x86: enable dirty log gradually in small chunks · 3c9bd400
      Jay Zhou committed
      Enabling dirty logging for the first time can take kvm->mmu_lock for
      an extended period of time. The main cost is clearing the D-bits of
      all last-level SPTEs. This situation can also benefit from manual
      dirty log protection, which reduces the time mmu_lock is held. The
      sequence is as follows:
      
      1. Initialize all the bits of the dirty bitmap to 1 when enabling
         dirty log for the first time
      2. Only write protect the huge pages
      3. KVM_GET_DIRTY_LOG returns the dirty bitmap info
      4. KVM_CLEAR_DIRTY_LOG will clear D-bit for each of the leaf level
         SPTEs gradually in small chunks
      
      On an Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz host, I ran some tests
      with a 128G Windows VM and measured the time taken by
      memory_global_dirty_log_start; here are the numbers:

      VM Size        Before    After optimization
      128G           460ms     10ms
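
      From userspace, the behaviour is opted into through the manual dirty log
      protect capability; a hedged sketch (the capability and flag names are
      taken from this series and assumed, vm_fd is an already-created VM fd):

      struct kvm_enable_cap cap = {
              .cap  = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2,
              .args = { KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE |
                        KVM_DIRTY_LOG_INITIALLY_SET },
      };

      ioctl(vm_fd, KVM_ENABLE_CAP, &cap);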
      Signed-off-by: Jay Zhou <jianjay.zhou@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3c9bd400
    • KVM: Dynamically size memslot array based on number of used slots · 36947254
      Sean Christopherson committed
      Now that the memslot logic doesn't assume memslots are always non-NULL,
      dynamically size the array of memslots instead of unconditionally
      allocating memory for the maximum number of memslots.
      
      Note, because a to-be-deleted memslot must first be invalidated, the
      array size cannot be immediately reduced when deleting a memslot.
      However, consecutive deletions will realize the memory savings, i.e.
      a second deletion will trim the entry.
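
      A rough sketch of the idea, assuming a flexible memslots array sized by
      the number of used slots (the helper name kvm_memslots_size() is
      illustrative, not necessarily the patch's):

      static size_t kvm_memslots_size(int slots)
      {
              return sizeof(struct kvm_memslots) +
                     sizeof(struct kvm_memory_slot) * slots;
      }

      /* allocate room for the used slots plus the one being added */
      slots = kvzalloc(kvm_memslots_size(old->used_slots + 1), GFP_KERNEL_ACCOUNT);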
      Tested-by: Christoffer Dall <christoffer.dall@arm.com>
      Tested-by: Marc Zyngier <maz@kernel.org>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      36947254
    • KVM: Terminate memslot walks via used_slots · 0577d1ab
      Sean Christopherson committed
      Refactor memslot handling to treat the number of used slots as the de
      facto size of the memslot array, e.g. return NULL from id_to_memslot()
      when an invalid index is provided instead of relying on npages==0 to
      detect an invalid memslot.  Rework the sorting and walking of memslots
      in advance of dynamically sizing memslots to aid bisection and debug,
      e.g. with luck, a bug in the refactoring will bisect here and/or hit a
      WARN instead of randomly corrupting memory.
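
      A caller-side sketch of the new convention (id is a slot id supplied by
      userspace; the error value is illustrative):

      struct kvm_memory_slot *slot = id_to_memslot(slots, id);

      if (!slot)
              return -EINVAL;   /* invalid index: the walk stops here */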
      
      Alternatively, a global null/invalid memslot could be returned, i.e. so
      callers of id_to_memslot() don't have to explicitly check for a NULL
      memslot, but that approach runs the risk of introducing difficult-to-
      debug issues, e.g. if the global null slot is modified.  Constifying
      the return from id_to_memslot() to combat such issues is possible, but
      would require a massive refactoring of arch specific code and would
      still be susceptible to casting shenanigans.
      
      Add function comments to update_memslots() and search_memslots() to
      explicitly (and loudly) state how memslots are sorted.
      
      Opportunistically stuff @hva with a non-canonical value when deleting a
      private memslot on x86 to detect bogus usage of the freed slot.
      
      No functional change intended.
      Tested-by: Christoffer Dall <christoffer.dall@arm.com>
      Tested-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0577d1ab
    • KVM: Ensure validity of memslot with respect to kvm_get_dirty_log() · 2a49f61d
      Sean Christopherson committed
      Rework kvm_get_dirty_log() so that it "returns" the associated memslot
      on success.  A future patch will rework memslot handling such that
      id_to_memslot() can return NULL; returning the memslot makes it more
      obvious that the validity of the memslot has been verified, i.e.
      precludes the need to add validity checks in the arch code that are
      technically unnecessary.
      
      To maintain ordering in s390, move the call to kvm_arch_sync_dirty_log()
      from s390's kvm_vm_ioctl_get_dirty_log() to the new kvm_get_dirty_log().
      This is a nop for PPC, the only other arch that doesn't select
      KVM_GENERIC_DIRTYLOG_READ_PROTECT, as its sync_dirty_log() is empty.
      
      Ideally, moving the sync_dirty_log() call would be done in a separate
      patch, but it can't be done in a follow-on patch because that would
      temporarily break s390's ordering.  Making the move in a preparatory
      patch would be functionally correct, but would create an odd scenario
      where the moved sync_dirty_log() would operate on a "different" memslot
      due to consuming the result of a different id_to_memslot().  The
      memslot couldn't actually be different as slots_lock is held, but the
      code is confusing enough as it is, i.e. moving sync_dirty_log() in this
      patch is the lesser of all evils.
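
      The reworked helper presumably ends up with a shape along these lines (a
      sketch of the signature implied by the description, not the verbatim
      patch):

      int kvm_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log,
                            int *is_dirty, struct kvm_memory_slot **memslot);

      /* on success, *memslot points at the validated slot, so arch callers
       * need not re-check it */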
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2a49f61d