  1. 16 May 2020: 3 commits
    • kvm: add halt-polling cpu usage stats · cb953129
      Authored by David Matlack
      Two new stats for exposing halt-polling cpu usage:
      halt_poll_success_ns
      halt_poll_fail_ns
      
      The sum of these 2 stats is the total cpu time spent polling. "success"
      means the VCPU polled until a virtual interrupt was delivered. "fail"
      means the VCPU had to schedule out (either because the maximum poll time
      was reached or it needed to yield the CPU).
      
      To avoid touching every arch's kvm_vcpu_stat struct, only update and
      export halt-polling cpu usage stats if we're on x86.
      
      Exporting cpu usage as a u64 and in nanoseconds means we will overflow at
      ~500 years, which seems reasonably large.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Jon Cargille <jcargill@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      
      Message-Id: <20200508182240.68440-1-jcargill@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
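      A minimal sketch of how this accounting could look. The helper name and
      surrounding logic are illustrative assumptions, not the upstream diff;
      the real accounting sits in the halt-polling loop of kvm_vcpu_block() in
      virt/kvm/kvm_main.c, and only the two field names come from the commit
      text above.

      /* Illustrative sketch: charge the elapsed poll time to one of the two
       * new counters depending on how polling ended. */
      static void record_halt_poll_time(struct kvm_vcpu *vcpu, ktime_t poll_start,
                                        bool interrupt_arrived)
      {
              u64 poll_ns = ktime_to_ns(ktime_sub(ktime_get(), poll_start));

              if (interrupt_arrived)
                      /* polled until a virtual interrupt was delivered */
                      vcpu->stat.halt_poll_success_ns += poll_ns;
              else
                      /* hit the poll limit or had to schedule out / yield */
                      vcpu->stat.halt_poll_fail_ns += poll_ns;
      }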
    • KVM: VMX: Optimize posted-interrupt delivery for timer fastpath · 379a3c8e
      Authored by Wanpeng Li
      While optimizing posted-interrupt delivery especially for the timer
      fastpath scenario, I measured kvm_x86_ops.deliver_posted_interrupt()
      to introduce substantial latency because the processor has to perform
      all vmentry tasks, ack the posted interrupt notification vector,
      read the posted-interrupt descriptor etc.
      
      This is not only slow, it is also unnecessary when delivering an
      interrupt to the current CPU (as is the case for the LAPIC timer) because
      PIR->IRR and IRR->RVI synchronization is already performed on vmentry.
      Therefore, skip kvm_vcpu_trigger_posted_interrupt in this case, and
      instead do vmx_sync_pir_to_irr() on the EXIT_FASTPATH_REENTER_GUEST
      fastpath as well.
      Tested-by: Haiwei Li <lihaiwei@tencent.com>
      Cc: Haiwei Li <lihaiwei@tencent.com>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1588055009-12677-6-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
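      A simplified sketch of the delivery idea described above, assuming VMX
      helper names from arch/x86/kvm/vmx; the exact flow of the upstream patch
      may differ. When the target vCPU is the one currently running on this
      CPU, setting the PIR bit is enough, because PIR->IRR and IRR->RVI are
      synchronized again on the EXIT_FASTPATH_REENTER_GUEST path.

      /* Sketch only: skip the posted-interrupt notification when the target
       * vCPU is already running on this CPU (e.g. the LAPIC timer case). */
      static void deliver_posted_interrupt_sketch(struct kvm_vcpu *vcpu, int vector)
      {
              struct vcpu_vmx *vmx = to_vmx(vcpu);

              if (pi_test_and_set_pir(vector, &vmx->pi_desc))
                      return;                 /* already pending in the PIR */

              if (pi_test_and_set_on(&vmx->pi_desc))
                      return;                 /* notification already outstanding */

              if (vcpu != kvm_get_running_vcpu() &&
                  !kvm_vcpu_trigger_posted_interrupt(vcpu, false))
                      kvm_vcpu_kick(vcpu);    /* remote vCPU: notify or kick */

              /* Local vCPU: nothing more to do here; vmx_sync_pir_to_irr() on
               * the re-enter fastpath moves the PIR bits into the IRR. */
      }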
    • KVM: No need to retry for hva_to_pfn_remapped() · 5b494aea
      Authored by Peter Xu
      hva_to_pfn_remapped() calls fixup_user_fault(), which already handles
      the retry gracefully.  Even if "unlocked" is set to true, it only means
      that fixup_user_fault() hit VM_FAULT_RETRY internally; the page fault
      has already been retried and the pfn should be set correctly, so there
      is no need to retry again.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20200416155906.267462-1-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
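      A hedged sketch of the behaviour being relied on. The wrapper below is
      an illustration, not hva_to_pfn_remapped() itself (which goes on to read
      the PTE and extract the pfn), and the fixup_user_fault() signature shown
      is the one from this kernel era.

      /* Sketch: even when fixup_user_fault() sets "unlocked" (it saw
       * VM_FAULT_RETRY internally and dropped/retook mmap_sem), the fault has
       * already been retried for us, so no extra retry loop is needed. */
      static int fault_in_remapped_pfn_sketch(unsigned long addr, bool write_fault)
      {
              bool unlocked = false;
              int r;

              r = fixup_user_fault(current, current->mm, addr,
                                   write_fault ? FAULT_FLAG_WRITE : 0, &unlocked);
              if (r)
                      return r;       /* hard failure, propagate */

              /* The PTE is populated now; proceed straight to translating it.
               * "unlocked" does not require a second attempt. */
              return 0;
      }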
  2. 14 May 2020: 1 commit
    • kvm: Replace vcpu->swait with rcuwait · da4ad88c
      Authored by Davidlohr Bueso
      Using any sort of waitqueue (simple or regular) for waiting/waking
      vcpus has always been overkill and semantically wrong: because the wait
      is per-vcpu and that vcpu is blocked, there is only ever a single
      waiting vcpu, so no queue of any sort is needed.
      
      As such, make use of the rcuwait primitive, with the following
      considerations:
      
        - rcuwait already provides the proper barriers that serialize
        concurrent waiter and waker.
      
        - Task wakeup is done in an RCU read-side critical section, with a
        stable task pointer.
      
        - Because there is no concurrency among waiters, we need
        not worry about rcuwait_wait_event() calls corrupting
        the wait->task. As a consequence, this saves the locking
        done in swait when modifying the queue. This also applies
        to per-vcore wait for powerpc kvm-hv.
      
      The x86 tscdeadline_latency test mentioned in 8577370f
      ("KVM: Use simple waitqueue for vcpu->wq") shows that, on avg,
      latency is reduced by around 15-20% with this change.
      
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: kvmarm@lists.cs.columbia.edu
      Cc: linux-mips@vger.kernel.org
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Message-Id: <20200424054837.5138-6-dave@stgolabs.net>
      [Avoid extra logic changes. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
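      A small sketch of the rcuwait pattern the commit switches to, using the
      primitives from include/linux/rcuwait.h. The vcpu_like structure, the
      wakeup_pending flag and the two helper names are assumptions for
      illustration, not the actual KVM code.

      #include <linux/rcuwait.h>
      #include <linux/sched.h>

      struct vcpu_like {
              struct rcuwait wait;    /* replaces the per-vcpu swait queue;
                                       * rcuwait_init() runs at creation time */
              bool wakeup_pending;
      };

      static void vcpu_block_sketch(struct vcpu_like *v)
      {
              /* Single waiter, no queue: sleep until the condition holds. */
              rcuwait_wait_event(&v->wait, READ_ONCE(v->wakeup_pending),
                                 TASK_INTERRUPTIBLE);
      }

      static void vcpu_kick_sketch(struct vcpu_like *v)
      {
              WRITE_ONCE(v->wakeup_pending, true);
              /* Wakes the (single) waiter from an RCU read-side critical
               * section using a stable task pointer; no queue lock needed. */
              rcuwait_wake_up(&v->wait);
      }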
  3. 08 May 2020: 1 commit
  4. 25 April 2020: 1 commit
  5. 21 April 2020: 4 commits
  6. 16 April 2020: 1 commit
  7. 31 March 2020: 1 commit
  8. 26 March 2020: 1 commit
  9. 17 March 2020: 17 commits
  10. 12 February 2020: 1 commit
  11. 05 February 2020: 1 commit
    • KVM: fix overflow of zero page refcount with ksm running · 7df003c8
      Authored by Zhuang Yanying
      While testing virtual machines with KSM on a v5.4-rc2 kernel, we found
      that the refcount of the zero page overflows.  The refcount is
      incremented in try_async_pf() (via get_user_page) without being
      decremented in mmu_set_spte() while handling an EPT violation, because
      kvm_release_pfn_clean() only calls put_page() on unreserved pages and
      the zero page is reserved.  So, as VMs are repeatedly created and
      destroyed, the refcount of the zero page keeps increasing until it
      overflows.
      
      step1:
      echo 10000 > /sys/kernel/mm/ksm/pages_to_scan
      echo 1 > /sys/kernel/mm/ksm/run
      echo 1 > /sys/kernel/mm/ksm/use_zero_pages
      
      step2:
      Create several normal qemu kvm VMs, destroy them after 10s, and repeat
      this continuously.
      
      After a long period of time, all domains hang because the zero-page
      refcount has overflowed.
      
      QEMU prints an error log like the following:
       …
       error: kvm run failed Bad address
       EAX=00006cdc EBX=00000008 ECX=80202001 EDX=078bfbfd
       ESI=ffffffff EDI=00000000 EBP=00000008 ESP=00006cc4
       EIP=000efd75 EFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
       ES =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
       CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
       SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
       DS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
       FS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
       GS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
       LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
       TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
       GDT=     000f7070 00000037
       IDT=     000f70ae 00000000
       CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
       DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
       DR6=00000000ffff0ff0 DR7=0000000000000400
       EFER=0000000000000000
       Code=00 01 00 00 00 e9 e8 00 00 00 c7 05 4c 55 0f 00 01 00 00 00 <8b> 35 00 00 01 00 8b 3d 04 00 01 00 b8 d8 d3 00 00 c1 e0 08 0c ea a3 00 00 01 00 c7 05 04
       …
      
      Meanwhile, a kernel warning is reported.
      
       [40914.836375] WARNING: CPU: 3 PID: 82067 at ./include/linux/mm.h:987 try_get_page+0x1f/0x30
       [40914.836412] CPU: 3 PID: 82067 Comm: CPU 0/KVM Kdump: loaded Tainted: G           OE     5.2.0-rc2 #5
       [40914.836415] RIP: 0010:try_get_page+0x1f/0x30
       [40914.836417] Code: 40 00 c3 0f 1f 84 00 00 00 00 00 48 8b 47 08 a8 01 75 11 8b 47 34 85 c0 7e 10 f0 ff 47 34 b8 01 00 00 00 c3 48 8d 78 ff eb e9 <0f> 0b 31 c0 c3 66 90 66 2e 0f 1f 84 00 00 00 00 00 48 8b 47 08 a8
       [40914.836418] RSP: 0018:ffffb4144e523988 EFLAGS: 00010286
       [40914.836419] RAX: 0000000080000000 RBX: 0000000000000326 RCX: 0000000000000000
       [40914.836420] RDX: 0000000000000000 RSI: 00004ffdeba10000 RDI: ffffdf07093f6440
       [40914.836421] RBP: ffffdf07093f6440 R08: 800000424fd91225 R09: 0000000000000000
       [40914.836421] R10: ffff9eb41bfeebb8 R11: 0000000000000000 R12: ffffdf06bbd1e8a8
       [40914.836422] R13: 0000000000000080 R14: 800000424fd91225 R15: ffffdf07093f6440
       [40914.836423] FS:  00007fb60ffff700(0000) GS:ffff9eb4802c0000(0000) knlGS:0000000000000000
       [40914.836425] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [40914.836426] CR2: 0000000000000000 CR3: 0000002f220e6002 CR4: 00000000003626e0
       [40914.836427] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [40914.836427] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [40914.836428] Call Trace:
       [40914.836433]  follow_page_pte+0x302/0x47b
       [40914.836437]  __get_user_pages+0xf1/0x7d0
       [40914.836441]  ? irq_work_queue+0x9/0x70
       [40914.836443]  get_user_pages_unlocked+0x13f/0x1e0
       [40914.836469]  __gfn_to_pfn_memslot+0x10e/0x400 [kvm]
       [40914.836486]  try_async_pf+0x87/0x240 [kvm]
       [40914.836503]  tdp_page_fault+0x139/0x270 [kvm]
       [40914.836523]  kvm_mmu_page_fault+0x76/0x5e0 [kvm]
       [40914.836588]  vcpu_enter_guest+0xb45/0x1570 [kvm]
       [40914.836632]  kvm_arch_vcpu_ioctl_run+0x35d/0x580 [kvm]
       [40914.836645]  kvm_vcpu_ioctl+0x26e/0x5d0 [kvm]
       [40914.836650]  do_vfs_ioctl+0xa9/0x620
       [40914.836653]  ksys_ioctl+0x60/0x90
       [40914.836654]  __x64_sys_ioctl+0x16/0x20
       [40914.836658]  do_syscall_64+0x5b/0x180
       [40914.836664]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
       [40914.836666] RIP: 0033:0x7fb61cb6bfc7
      Signed-off-by: LinFeng <linfeng23@huawei.com>
      Signed-off-by: Zhuang Yanying <ann.zhuangyanying@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
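      An illustrative sketch of the imbalance described above. This shows the
      buggy flow, not the fix; the function name is made up, the real call
      chain is try_async_pf() -> __gfn_to_pfn_memslot() -> get_user_pages*(),
      and the flag usage is simplified.

      /* Buggy flow, sketched: the reference taken by gup on the zero page is
       * never dropped, because the release side skips reserved pages and the
       * shared zero page is marked PageReserved. */
      static void ept_fault_refcount_sketch(unsigned long hva)
      {
              struct page *page;
              kvm_pfn_t pfn;

              /* takes a reference even when hva is backed by the KSM zero page */
              if (get_user_pages_unlocked(hva, 1, &page, FOLL_WRITE) != 1)
                      return;
              pfn = page_to_pfn(page);

              /* ... mmu_set_spte() installs the mapping ... */

              /* kvm_release_pfn_clean(): put_page() only for unreserved pfns,
               * so for the zero page the reference above leaks on every fault
               * and the refcount eventually overflows. */
              if (!kvm_is_reserved_pfn(pfn))
                      put_page(pfn_to_page(pfn));
      }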
  12. 31 January 2020: 2 commits
  13. 28 January 2020: 6 commits
    • KVM: Play nice with read-only memslots when querying host page size · 42cde48b
      Authored by Sean Christopherson
      Avoid the "writable" check in __gfn_to_hva_many(), which will always fail
      on read-only memslots due to gfn_to_hva() assuming writes.  Functionally,
      this allows x86 to create large mappings for read-only memslots that
      are backed by HugeTLB mappings.
      
      Note, the changelog for commit 05da4558 ("KVM: MMU: large page
      support") states "If the largepage contains write-protected pages, a
      large pte is not used.", but "write-protected" refers to pages that are
      temporarily read-only, e.g. read-only memslots didn't even exist at the
      time.
      
      Fixes: 4d8b81ab ("KVM: introduce readonly memslot")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      [Redone using kvm_vcpu_gfn_to_memslot_prot. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
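      A hedged sketch of the resulting lookup, assuming kvm_vcpu_gfn_to_hva_prot()
      with a NULL "writable" argument as the translation that skips the write
      check. Locking and the exact upstream function body are simplified, and
      the helper name below is illustrative.

      /* Sketch: query the host page size for a gfn without requiring the
       * memslot to be writable, so read-only HugeTLB-backed slots still
       * report their huge page size. */
      static unsigned long host_page_size_sketch(struct kvm_vcpu *vcpu, gfn_t gfn)
      {
              struct vm_area_struct *vma;
              unsigned long addr, size = PAGE_SIZE;

              /* gfn -> hva without the "must be writable" check */
              addr = kvm_vcpu_gfn_to_hva_prot(vcpu, gfn, NULL);
              if (kvm_is_error_hva(addr))
                      return PAGE_SIZE;

              down_read(&vcpu->kvm->mm->mmap_sem);
              vma = find_vma(vcpu->kvm->mm, addr);
              if (vma)
                      size = vma_kernel_pagesize(vma);
              up_read(&vcpu->kvm->mm->mmap_sem);

              return size;
      }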
    • KVM: Use vcpu-specific gva->hva translation when querying host page size · f9b84e19
      Authored by Sean Christopherson
      Use kvm_vcpu_gfn_to_hva() when retrieving the host page size so that the
      correct set of memslots is used when handling x86 page faults in SMM.
      
      Fixes: 54bf36aa ("KVM: x86: use vcpu-specific functions to read/write/translate GFNs")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
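      A short sketch of why the vCPU-aware translation matters here. The
      wrapper name is illustrative; the point is that kvm_vcpu_gfn_to_memslot()
      picks the memslot set for the vCPU's current address space, which on x86
      differs between SMM and non-SMM.

      /* Sketch: resolve gfn -> hva using the vCPU's own memslot address space
       * instead of always using address space 0. */
      static unsigned long gfn_to_hva_for_vcpu_sketch(struct kvm_vcpu *vcpu, gfn_t gfn)
      {
              /* kvm_vcpu_memslots() indexes by kvm_arch_vcpu_memslots_id(vcpu),
               * which selects the SMM slots while the vCPU is in SMM. */
              struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);

              if (!slot)
                      return KVM_HVA_ERR_BAD;

              return __gfn_to_hva_memslot(slot, gfn);
      }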
    • mm: thp: KVM: Explicitly check for THP when populating secondary MMU · 005ba37c
      Authored by Sean Christopherson
      Add a helper, is_transparent_hugepage(), to explicitly check whether a
      compound page is a THP and use it when populating KVM's secondary MMU.
      The explicit check fixes a bug where a remapped compound page, e.g. for
      an XDP Rx socket, is mapped into a KVM guest and is mistaken for a THP,
      which results in KVM incorrectly creating a huge page in its secondary
      MMU.
      
      Fixes: 936a5fe6 ("thp: kvm mmu transparent hugepage support")
      Reported-by: syzbot+c9d1fb51ac9d0d10c39d@syzkaller.appspotmail.com
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
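      A sketch of what the helper checks, reconstructed from the commit
      description and hedged accordingly (the exact upstream test may differ):
      a compound page only counts as a THP if its head page is actually
      managed by the THP code, which a remapped compound page such as an XDP
      Rx buffer is not.

      /* Sketch of the helper's intent: explicitly identify THPs rather than
       * treating any compound page as huge-mappable by KVM. */
      static bool is_transparent_hugepage_sketch(struct page *page)
      {
              if (!PageCompound(page))
                      return false;

              page = compound_head(page);
              /* Only pages allocated and tracked by the THP code qualify; a
               * driver's remapped compound page does not use this destructor. */
              return is_huge_zero_page(page) ||
                     page[1].compound_dtor == TRANSHUGE_PAGE_DTOR;
      }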
    • KVM: Return immediately if __kvm_gfn_to_hva_cache_init() fails · dc9ce71e
      Authored by Sean Christopherson
      Check the result of __kvm_gfn_to_hva_cache_init() and return immediately
      instead of relying on the kvm_is_error_hva() check to detect errors so
      that it's abundantly clear KVM intends to immediately bail on an error.
      
      Note, the hva check is still mandatory to handle errors on subsequent
      calls with the same generation.  Similarly, always return -EFAULT on
      error so that multiple (bad) calls for a given generation will get the
      same result, e.g. on an illegal gfn wrap, propagating the return from
      __kvm_gfn_to_hva_cache_init() would cause the initial call to return
      -EINVAL and subsequent calls to return -EFAULT.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
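      A simplified sketch of the calling pattern described above, based on the
      shape of kvm_write_guest_offset_cached(); dirty-page marking, offsets
      and other details are omitted, and the function name is illustrative.

      /* Sketch: bail with -EFAULT as soon as the cache (re)init fails, instead
       * of relying only on the later bad-hva check; later calls with the same
       * generation then fail the same way via the hva check. */
      static int write_cached_sketch(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
                                     const void *data, unsigned long len)
      {
              struct kvm_memslots *slots = kvm_memslots(kvm);

              if (slots->generation != ghc->generation &&
                  __kvm_gfn_to_hva_cache_init(slots, ghc, ghc->gpa, ghc->len))
                      return -EFAULT;         /* immediate, and always -EFAULT */

              if (kvm_is_error_hva(ghc->hva))
                      return -EFAULT;         /* still needed for repeat calls */

              if (unlikely(!ghc->memslot))    /* cross-page: fall back to slow path */
                      return kvm_write_guest(kvm, ghc->gpa, data, len);

              return copy_to_user((void __user *)ghc->hva, data, len) ? -EFAULT : 0;
      }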
    • KVM: Clean up __kvm_gfn_to_hva_cache_init() and its callers · 6ad1e29f
      Authored by Sean Christopherson
      Barret reported a (technically benign) bug where nr_pages_avail can be
      accessed without being initialized if gfn_to_hva_many() fails.
      
        virt/kvm/kvm_main.c:2193:13: warning: 'nr_pages_avail' may be
        used uninitialized in this function [-Wmaybe-uninitialized]
      
      Rather than simply squashing the warning by initializing nr_pages_avail,
      fix the underlying issues by reworking __kvm_gfn_to_hva_cache_init() to
      return immediately instead of continuing on.  Now that all callers check
      the result and/or bail immediately on a bad hva, there's no need to
      explicitly nullify the memslot on error.
      Reported-by: Barret Rhoden <brho@google.com>
      Fixes: f1b9dd5e ("kvm: Disallow wraparound in kvm_gfn_to_hva_cache_init")
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
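      A minimal sketch of the rework's shape, not the upstream function body:
      the init loop returns the moment gfn_to_hva_many() reports a bad hva, so
      nr_pages_avail is never consumed uninitialized and the memslot no longer
      needs to be nullified on error. Cross-page bookkeeping is omitted here.

      static int cache_init_sketch(struct kvm_memslots *slots,
                                   struct gfn_to_hva_cache *ghc,
                                   gpa_t gpa, unsigned long len)
      {
              gfn_t gfn = gpa >> PAGE_SHIFT;
              gfn_t end = (gpa + len - 1) >> PAGE_SHIFT;
              gfn_t nr_pages_avail;

              if (end < gfn)
                      return -EINVAL;         /* wraparound is still rejected */

              while (gfn <= end) {
                      ghc->memslot = __gfn_to_memslot(slots, gfn);
                      ghc->hva = gfn_to_hva_many(ghc->memslot, gfn, &nr_pages_avail);
                      if (kvm_is_error_hva(ghc->hva))
                              return -EFAULT; /* bail before nr_pages_avail is used */
                      gfn += nr_pages_avail;
              }
              return 0;
      }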
    • KVM: Check for a bad hva before dropping into the ghc slow path · fcfbc617
      Authored by Sean Christopherson
      When reading/writing using the guest/host cache, check for a bad hva
      before checking for a NULL memslot, which triggers the slow path for
      handling cross-page accesses.  Because the memslot is nullified on error
      by __kvm_gfn_to_hva_cache_init(), if the bad hva is encountered after
      crossing into a new page, then the kvm_{read,write}_guest() slow path
      could potentially write/access the first chunk prior to detecting the
      bad hva.
      
      Arguably, performing a partial access is semantically correct from an
      architectural perspective, but that behavior is certainly not intended.
      In the original implementation, memslot was not explicitly nullified
      and therefore the partial access behavior varied based on whether the
      memslot itself was null, or if the hva was simply bad.  The current
      behavior was introduced as a seemingly unintentional side effect in
      commit f1b9dd5e ("kvm: Disallow wraparound in
      kvm_gfn_to_hva_cache_init"), which justified the change with "since some
      callers don't check the return code from this function, it sit seems
      prudent to clear ghc->memslot in the event of an error".
      
      Regardless of intent, the partial access is dependent on _not_ checking
      the result of the cache initialization, which is arguably a bug in its
      own right, at best simply weird.
      
      Fixes: 8f964525 ("KVM: Allow cross page reads and writes from cached translations.")
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Andrew Honig <ahonig@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
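      A sketch of the resulting ordering on the read side (the write side is
      symmetric); offsets and dirty tracking are omitted and the function name
      is illustrative.

      /* Sketch: the bad-hva check runs before the NULL-memslot check, so a
       * cache poisoned by a failed init can never reach the cross-page slow
       * path and perform a partial access before noticing the bad hva. */
      static int read_cached_sketch(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
                                    void *data, unsigned long len)
      {
              if (kvm_is_error_hva(ghc->hva))         /* checked first */
                      return -EFAULT;

              if (unlikely(!ghc->memslot))            /* only then fall back */
                      return kvm_read_guest(kvm, ghc->gpa, data, len);

              return copy_from_user(data, (void __user *)ghc->hva, len) ? -EFAULT : 0;
      }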