1. 21 Jan, 2020 2 commits
  2. 09 Jan, 2020 2 commits
    • KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM · 736c291c
      Sean Christopherson committed
      Convert a plethora of parameters and variables in the MMU and page fault
      flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM.
      
      Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical
      addresses.  When TDP is enabled, the fault address is a guest physical
      address and thus can be a 64-bit value, even when both KVM and its guest
      are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a
      64-bit field, not a natural width field.
      
      Using a gva_t for the fault address means KVM will incorrectly drop the
      upper 32-bits of the GPA.  Ditto for gva_to_gpa() when it is used to
      translate L2 GPAs to L1 GPAs.
      
      Opportunistically rename variables and parameters to better reflect the
      dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain
      "addr" instead of "vaddr" when the address may be either a GVA or an L2
      GPA.  Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid
      a confusing "gpa_t gva" declaration; this also sets the stage for a
      future patch to combine nonpaging_page_fault() and tdp_page_fault() with
      minimal churn.
      
      Sprinkle in a few comments to document flows where an address is known
      to be a GVA and thus can be safely truncated to a 32-bit value.  Add
      WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help
      document such cases and detect bugs.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
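The truncation described in this commit can be illustrated with a minimal sketch. It assumes a 32-bit build where the virtual-address type is 32 bits wide and the physical-address type is 64 bits wide; the typedef `gva32_t` and both helper functions are hypothetical names for illustration, not KVM code.

```c
#include <stdint.h>

/* Hypothetical stand-ins for a 32-bit build: the virtual-address type
 * is 32 bits wide, the guest-physical type is always 64 bits wide. */
typedef uint32_t gva32_t;
typedef uint64_t gpa_t;

/* Pre-fix shape: the fault address travels through a 32-bit parameter,
 * so the upper 32 bits are already gone when the function runs. */
static gpa_t fault_addr_via_gva(gva32_t cr2)
{
    return cr2;
}

/* Post-fix shape: the parameter is wide enough for a full GPA. */
static gpa_t fault_addr_via_gpa(gpa_t cr2_or_gpa)
{
    return cr2_or_gpa;
}
```

Passing a GPA above 4 GiB through the first helper silently keeps only its low 32 bits, while the second helper returns the address unchanged.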
    • KVM: get rid of var page in kvm_set_pfn_dirty() · d29c03a5
      Miaohe Lin committed
      Get rid of the unnecessary local variable 'page' in
      kvm_set_pfn_dirty(), making the code style
      consistent with kvm_set_pfn_accessed().
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 13 Dec, 2019 1 commit
  4. 07 Dec, 2019 1 commit
  5. 06 Dec, 2019 3 commits
  6. 23 Nov, 2019 1 commit
  7. 15 Nov, 2019 3 commits
  8. 14 Nov, 2019 1 commit
  9. 12 Nov, 2019 1 commit
    • KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved · a78986aa
      Sean Christopherson committed
      Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and
      instead manually handle ZONE_DEVICE on a case-by-case basis.  For things
      like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal
      pages, e.g. put pages grabbed via gup().  But for flows such as setting
      A/D bits or shifting refcounts for transparent huge pages, KVM needs
      to avoid processing ZONE_DEVICE pages as the flows in question lack the
      underlying machinery for proper handling of ZONE_DEVICE pages.
      
      This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup()
      when running a KVM guest backed with /dev/dax memory, as KVM straight up
      doesn't put any references to ZONE_DEVICE pages acquired by gup().
      
      Note, Dan Williams proposed an alternative solution of doing put_page()
      on ZONE_DEVICE pages immediately after gup() in order to simplify the
      auditing needed to ensure is_zone_device_page() is called if and only if
      the backing device is pinned (via gup()).  But that approach would break
      kvm_vcpu_{un}map() as KVM requires the page to be pinned from map() 'til
      unmap() when accessing guest memory, unlike KVM's secondary MMU, which
      coordinates with mmu_notifier invalidations to avoid creating stale
      page references, i.e. doesn't rely on pages being pinned.
      
      [*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl

      Reported-by: Adam Borowski <kilobyte@angband.pl>
      Analyzed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: stable@vger.kernel.org
      Fixes: 3565fce3 ("mm, x86: get_user_pages() for dax mappings")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
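The case-by-case handling described above can be sketched with a toy model. The struct `toy_page` and all three helpers are hypothetical and only mirror the rules stated in the commit message: ZONE_DEVICE pages are exempt from the "reserved" check so their references get put, but flows such as dirty tracking still skip them.

```c
#include <stdbool.h>

/* Toy page descriptor; field names are assumptions for illustration. */
struct toy_page {
    bool valid;        /* backed by a page structure at all */
    bool reserved;     /* marked reserved */
    bool zone_device;  /* lives in ZONE_DEVICE (e.g. /dev/dax memory) */
};

/* ZONE_DEVICE pages are explicitly exempted from the reserved check. */
static bool toy_is_reserved(const struct toy_page *p)
{
    if (!p->valid)
        return true;
    return p->reserved && !p->zone_device;
}

/* Refcount flows: treat ZONE_DEVICE pages like normal pages. */
static bool toy_should_put_page(const struct toy_page *p)
{
    return !toy_is_reserved(p);
}

/* A/D-bit style flows: skip ZONE_DEVICE pages explicitly. */
static bool toy_may_set_dirty(const struct toy_page *p)
{
    return !toy_is_reserved(p) && !p->zone_device;
}
```

In this model a reserved ZONE_DEVICE page has its reference put but is not marked dirty, which matches the split described in the commit message.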
  10. 11 Nov, 2019 2 commits
    • KVM: fix placement of refcount initialization · e2d3fcaf
      Paolo Bonzini committed
      Reported by syzkaller:
      
         =============================
         WARNING: suspicious RCU usage
         -----------------------------
         ./include/linux/kvm_host.h:536 suspicious rcu_dereference_check() usage!
      
         other info that might help us debug this:
      
         rcu_scheduler_active = 2, debug_locks = 1
         no locks held by repro_11/12688.
      
         stack backtrace:
         Call Trace:
          dump_stack+0x7d/0xc5
          lockdep_rcu_suspicious+0x123/0x170
          kvm_dev_ioctl+0x9a9/0x1260 [kvm]
          do_vfs_ioctl+0x1a1/0xfb0
          ksys_ioctl+0x6d/0x80
          __x64_sys_ioctl+0x73/0xb0
          do_syscall_64+0x108/0xaa0
          entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Commit a97b0e77 ("kvm: call kvm_arch_destroy_vm if vm creation fails")
      sets users_count to 1 before kvm_arch_init_vm(); however, if kvm_arch_init_vm()
      fails, we need to decrease this count.  By moving it earlier, we can push
      the decrease to out_err_no_arch_destroy_vm without introducing yet another
      error label.
      
      syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=15209b84e00000
      
      Reported-by: syzbot+75475908cd0910f141ee@syzkaller.appspotmail.com
      Fixes: a97b0e77 ("kvm: call kvm_arch_destroy_vm if vm creation fails")
      Cc: Jim Mattson <jmattson@google.com>
      Analyzed-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
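The placement issue can be sketched as follows, under the assumption that taking the initial reference before any fallible step lets a single existing error label undo it. `toy_vm`, `toy_arch_init()` and `toy_create_vm()` are hypothetical names; only the label name echoes the commit message.

```c
#include <stddef.h>
#include <stdlib.h>

struct toy_vm {
    int users_count;
};

/* Hypothetical arch hook: 0 on success, negative on failure. */
static int toy_arch_init(struct toy_vm *vm, int should_fail)
{
    (void)vm;
    return should_fail ? -1 : 0;
}

/* Returns the VM or NULL; *final_refs reports the count at exit. */
static struct toy_vm *toy_create_vm(int arch_init_fails, int *final_refs)
{
    struct toy_vm *vm = calloc(1, sizeof(*vm));

    if (!vm)
        return NULL;

    /* Take the initial reference before any step that can fail. */
    vm->users_count = 1;

    if (toy_arch_init(vm, arch_init_fails))
        goto out_err_no_arch_destroy_vm;

    *final_refs = vm->users_count;
    return vm;

out_err_no_arch_destroy_vm:
    /* One shared label drops the reference taken above. */
    vm->users_count--;
    *final_refs = vm->users_count;
    free(vm);
    return NULL;
}
```

Because the reference is taken first, every later failure can jump to the same label and the count is balanced without a dedicated label for the arch-init failure.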
    • KVM: Fix NULL-ptr deref after kvm_create_vm fails · 8a44119a
      Paolo Bonzini committed
      Reported by syzkaller:
      
          kasan: CONFIG_KASAN_INLINE enabled
          kasan: GPF could be caused by NULL-ptr deref or user memory access
          general protection fault: 0000 [#1] PREEMPT SMP KASAN
          CPU: 0 PID: 14727 Comm: syz-executor.3 Not tainted 5.4.0-rc4+ #0
          RIP: 0010:kvm_coalesced_mmio_init+0x5d/0x110 arch/x86/kvm/../../../virt/kvm/coalesced_mmio.c:121
          Call Trace:
           kvm_dev_ioctl_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:3446 [inline]
           kvm_dev_ioctl+0x781/0x1490 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3494
           vfs_ioctl fs/ioctl.c:46 [inline]
           file_ioctl fs/ioctl.c:509 [inline]
           do_vfs_ioctl+0x196/0x1150 fs/ioctl.c:696
           ksys_ioctl+0x62/0x90 fs/ioctl.c:713
           __do_sys_ioctl fs/ioctl.c:720 [inline]
           __se_sys_ioctl fs/ioctl.c:718 [inline]
           __x64_sys_ioctl+0x6e/0xb0 fs/ioctl.c:718
           do_syscall_64+0xca/0x5d0 arch/x86/entry/common.c:290
           entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Commit 9121923c ("kvm: Allocate memslots and buses before calling kvm_arch_init_vm")
      moves the memslots and buses allocations around.  However, if initialization of
      kvm->srcu/irq_srcu fails, NULL is returned instead of an error code.  The NULL is
      not intercepted in kvm_dev_ioctl_create_vm() and is then dereferenced by
      kvm_coalesced_mmio_init().  This patch fixes it.
      
      Moving the initialization is required anyway to avoid an incorrect synchronize_srcu that
      was also reported by syzkaller:
      
       wait_for_completion+0x29c/0x440 kernel/sched/completion.c:136
       __synchronize_srcu+0x197/0x250 kernel/rcu/srcutree.c:921
       synchronize_srcu_expedited kernel/rcu/srcutree.c:946 [inline]
       synchronize_srcu+0x239/0x3e8 kernel/rcu/srcutree.c:997
       kvm_page_track_unregister_notifier+0xe7/0x130 arch/x86/kvm/page_track.c:212
       kvm_mmu_uninit_vm+0x1e/0x30 arch/x86/kvm/mmu.c:5828
       kvm_arch_destroy_vm+0x4a2/0x5f0 arch/x86/kvm/x86.c:9579
       kvm_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:702 [inline]
      
      so do it.
      
      Reported-by: syzbot+89a8060879fa0bd2db4f@syzkaller.appspotmail.com
      Reported-by: syzbot+e27e7027eb2b80e44225@syzkaller.appspotmail.com
      Fixes: 9121923c ("kvm: Allocate memslots and buses before calling kvm_arch_init_vm")
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
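Why a NULL return slips past the caller can be shown with a small sketch of the error-pointer convention. The helpers below are simplified, hypothetical stand-ins written for this example (the kernel's own macros are not reproduced here): errors are encoded as pointers in the top 4095 values of the address space, so NULL does not count as an error.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer (toy version). */
static void *toy_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* True only for pointers in the reserved error range; NULL is not one. */
static int toy_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}
```

A caller that only tests `toy_is_err()` treats NULL as a valid object and goes on to dereference it, which is the failure mode the commit message describes.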
  11. 08 Nov, 2019 3 commits
  12. 05 Nov, 2019 1 commit
  13. 04 Nov, 2019 1 commit
  14. 31 Oct, 2019 1 commit
  15. 29 Oct, 2019 3 commits
  16. 25 Oct, 2019 1 commit
  17. 22 Oct, 2019 9 commits
    • KVM: Add separate helper for putting borrowed reference to kvm · 149487bd
      Sean Christopherson committed
      Add a new helper, kvm_put_kvm_no_destroy(), to handle putting a borrowed
      reference[*] to the VM when installing a new file descriptor fails.  KVM
      expects the refcount to remain valid in this case, as the in-progress
      ioctl() has an explicit reference to the VM.  The primary motivation
      for the helper is to document that the 'kvm' pointer is still valid
      after putting the borrowed reference, e.g. to document that doing
      mutex(&kvm->lock) immediately after putting a ref to kvm isn't broken.
      
      [*] When exposing a new object to userspace via a file descriptor, e.g.
          a new vcpu, KVM grabs a reference to itself (the VM) prior to making
          the object visible to userspace to avoid prematurely freeing the VM
          in the scenario where userspace immediately closes the file descriptor.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
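The idea of a "no destroy" put can be sketched with a plain integer refcount. The types and function names below are hypothetical; the sketch assumes the helper's only job is to drop a borrowed reference and flag misuse if that drop would have been the last one.

```c
#include <stdbool.h>

struct toy_vm {
    int users_count;
    bool destroyed;
};

static void toy_get(struct toy_vm *vm)
{
    vm->users_count++;
}

/* Normal put: destroys the object when the last reference goes away. */
static void toy_put(struct toy_vm *vm)
{
    if (--vm->users_count == 0)
        vm->destroyed = true;
}

/* Borrowed-reference put: never destroys. Returns true if the count
 * unexpectedly hit zero, which would indicate a caller bug. */
static bool toy_put_no_destroy(struct toy_vm *vm)
{
    return --vm->users_count == 0;
}
```

After `toy_put_no_destroy()` returns false the caller still holds its own reference, so continuing to use the object is safe by construction.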
    • KVM: Don't shrink/grow vCPU halt_poll_ns if host side polling is disabled · 44551b2f
      Wanpeng Li committed
      Don't waste cycles to shrink/grow vCPU halt_poll_ns if host
      side polling is disabled.
      Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: arm64: Provide VCPU attributes for stolen time · 58772e9a
      Steven Price committed
      Allow user space to inform the KVM host where in the physical memory
      map the paravirtualized time structures should be located.
      
      User space can set an attribute on the VCPU providing the IPA base
      address of the stolen time structure for that VCPU. This must be
      repeated for every VCPU in the VM.
      
      The address is given in terms of the physical address visible to
      the guest and must be 64 byte aligned. The guest will discover the
      address via a hypercall.
      Signed-off-by: Steven Price <steven.price@arm.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
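The 64-byte alignment requirement on the stolen time base address amounts to a simple mask test. The function name below is hypothetical and the check is only a sketch of what a validator for the user-supplied IPA might look like.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the guest-visible physical address is 64-byte aligned,
 * i.e. its low six bits are all zero. */
static bool stolen_time_ipa_is_aligned(uint64_t ipa)
{
    return (ipa & UINT64_C(63)) == 0;
}
```

For example, `0x80000040` passes the check while `0x80000020` does not.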
    • KVM: Allow kvm_device_ops to be const · 8538cb22
      Steven Price committed
      Currently a kvm_device_ops structure cannot be const without triggering
      compiler warnings. However the structure doesn't need to be written to
      and, by marking it const, it can be read-only in memory. Add some more
      const keywords to allow this.
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Steven Price <steven.price@arm.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
    • KVM: arm64: Support stolen time reporting via shared structure · 8564d637
      Steven Price committed
      Implement the service call for configuring a shared structure between a
      VCPU and the hypervisor in which the hypervisor can write the time
      stolen from the VCPU's execution time by other tasks on the host.
      
      User space allocates memory which is placed at an IPA also chosen by user
      space. The hypervisor then updates the shared structure using
      kvm_put_guest() to ensure single copy atomicity of the 64-bit value
      reporting the stolen time in nanoseconds.
      
      Whenever stolen time is enabled by the guest, the stolen time counter is
      reset.
      
      The stolen time itself is retrieved from the sched_info structure
      maintained by the Linux scheduler code. We enable SCHEDSTATS when
      selecting KVM Kconfig to ensure this value is meaningful.
      Signed-off-by: Steven Price <steven.price@arm.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
    • KVM: arm64: Implement PV_TIME_FEATURES call · b48c1a45
      Steven Price committed
      This provides a mechanism for querying which paravirtualized time
      features are available in this hypervisor.
      
      Also add the header file which defines the ABI for the paravirtualized
      time features we're about to add.
      Signed-off-by: Steven Price <steven.price@arm.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
    • KVM: arm/arm64: Factor out hypercall handling from PSCI code · 55009c6e
      Christoffer Dall committed
      We currently intertwine the KVM PSCI implementation with the general
      dispatch of hypercall handling, which makes perfect sense because PSCI
      is the only category of hypercalls we support.
      
      However, as we are about to support additional hypercalls, factor out
      this functionality into a separate hypercall handler file.
      Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
      [steven.price@arm.com: rebased]
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Steven Price <steven.price@arm.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
    • KVM: arm/arm64: Allow user injection of external data aborts · da345174
      Christoffer Dall committed
      In some scenarios, such as buggy guest or incorrect configuration of the
      VMM and firmware description data, userspace will detect a memory access
      to a portion of the IPA, which is not mapped to any MMIO region.
      
      For this purpose, the appropriate action is to inject an external abort
      to the guest.  The kernel already has functionality to inject an
      external abort, but we need to wire up a signal from user space that
      lets user space tell the kernel to do this.
      
      It turns out, we already have the set event functionality which we can
      perfectly reuse for this.
      Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
    • KVM: arm/arm64: Allow reporting non-ISV data aborts to userspace · c726200d
      Christoffer Dall committed
      For a long time, if a guest accessed memory outside of a memslot using
      any of the load/store instructions in the architecture which doesn't
      supply decoding information in the ESR_EL2 (the ISV bit is not set), the
      kernel would print the following message and terminate the VM as a
      result of returning -ENOSYS to userspace:
      
        load/store instruction decoding not implemented
      
      The reason behind this message is that KVM assumes that all accesses
      outside a memslot are MMIO accesses which should be handled by
      userspace, and we originally expected to eventually implement some sort
      of decoding of load/store instructions where the ISV bit was not set.
      
      However, it turns out that many of the instructions which don't provide
      decoding information on abort are not safe to use for MMIO accesses, and
      the remaining few that would potentially make sense to use on MMIO
      accesses, such as those with register writeback, are not used in
      practice.  It also turns out that fetching an instruction from guest
      memory can be a pretty horrible affair, involving stopping all CPUs on
      SMP systems, handling multiple corner cases of address translation in
      software, and more.  It doesn't appear likely that we'll ever implement
      this in the kernel.
      
      What is much more common is that a user has misconfigured his/her guest
      and is actually not accessing an MMIO region, but just hitting some
      random hole in the IPA space.  In this scenario, the error message above
      is almost misleading and has led to a great deal of confusion over the
      years.
      
      It is, nevertheless, ABI to userspace, and we therefore need to
      introduce a new capability that userspace explicitly enables to change
      behavior.
      
      This patch introduces KVM_CAP_ARM_NISV_TO_USER (NISV meaning Non-ISV)
      which does exactly that, and introduces a new exit reason to report the
      event to userspace.  User space can then emulate an exception to the
      guest, restart the guest, suspend the guest, or take any other
      appropriate action as per the policy of the running system.
      Reported-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
      Reviewed-by: Alexander Graf <graf@amazon.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
  18. 20 Oct, 2019 3 commits
    • KVM: arm64: pmu: Reset sample period on overflow handling · 8c3252c0
      Marc Zyngier committed
      The PMU emulation code uses the perf event sample period to trigger
      the overflow detection. This works fine for the *first* overflow
      handling, but results in a huge number of interrupts on the host,
      unrelated to the number of interrupts handled in the guest (a x20
      factor is pretty common for the cycle counter). On a slow system
      (such as a SW model), this can result in the guest only making
      forward progress at a glacial pace.
      
      It turns out that the clue is in the name. The sample period is
      exactly that: a period. And once an overflow has occurred,
      the following period should be the full width of the associated
      counter, instead of whatever the guest had initially programmed.
      
      Reset the sample period to the architected value in the overflow
      handler, which now results in a number of host interrupts that is
      much closer to the number of interrupts in the guest.
      
      Fixes: b02386eb ("arm64: KVM: Add PMU overflow interrupt routing")
      Reviewed-by: Andrew Murray <andrew.murray@arm.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
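The arithmetic behind "the following period should be the full width of the counter" can be sketched as the distance from the current count to the next wrap. Both helpers are hypothetical illustrations rather than the PMU emulation code, and the sketch does not special-case a count of exactly zero.

```c
#include <stdint.h>

/* Mask covering a counter of the given width in bits (up to 64). */
static uint64_t counter_mask(unsigned width)
{
    return width >= 64 ? UINT64_MAX : ((UINT64_C(1) << width) - 1);
}

/* Number of events left before a counter of this width wraps. */
static uint64_t period_until_overflow(uint64_t count, unsigned width)
{
    return (UINT64_C(0) - count) & counter_mask(width);
}
```

A guest that programs a 32-bit counter to `0xFFFFFFF0` gets a first period of 16 events. Once the counter has wrapped and holds a small value such as 5, recomputing gives `0xFFFFFFFB`, close to the full 32-bit range, instead of reusing the initial period of 16.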
    • KVM: arm64: pmu: Set the CHAINED attribute before creating the in-kernel event · 725ce669
      Marc Zyngier committed
      The current convention for KVM to request a chained event from the
      host PMU is to set bit[0] in attr.config1 (PERF_ATTR_CFG1_KVM_PMU_CHAINED).
      
      But as it turns out, this bit gets set *after* we create the kernel
      event that backs our virtual counter, meaning that we never get
      a 64bit counter.
      
      Moving the setting to an earlier point solves the problem.
      
      Fixes: 80f393a2 ("KVM: arm/arm64: Support chained PMU counters")
      Reviewed-by: Andrew Murray <andrew.murray@arm.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
    • KVM: arm64: pmu: Fix cycle counter truncation · f4e23cf9
      Marc Zyngier committed
      When a counter is disabled, its value is sampled before the event
      is disabled, and the value is written back to the shadow register.
      
      In that process, the value gets truncated to 32bit, which is adequate
      for any counter but the cycle counter (defined as a 64bit counter).
      
      This obviously results in a corrupted counter, and things like
      "perf record -e cycles" not working at all when run in a guest...
      A similar, but less critical bug exists in kvm_pmu_get_counter_value.
      
      Make the truncation conditional on the counter not being the cycle
      counter, which results in a minor code reorganisation.
      
      Fixes: 80f393a2 ("KVM: arm/arm64: Support chained PMU counters")
      Reviewed-by: Andrew Murray <andrew.murray@arm.com>
      Reported-by: Julien Thierry <julien.thierry.kdev@gmail.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
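Making the truncation conditional can be sketched as below. The constant `TOY_CYCLE_IDX` and the function name are assumptions chosen for illustration; the sketch only shows the rule from the commit message, namely that every counter except the 64-bit cycle counter is truncated to 32 bits when written back.

```c
#include <stdint.h>

/* Assumed index of the cycle counter in this toy model. */
#define TOY_CYCLE_IDX 31u

/* Value to store in the shadow register for counter 'idx'. */
static uint64_t shadow_value(unsigned idx, uint64_t sampled)
{
    if (idx == TOY_CYCLE_IDX)
        return sampled;            /* cycle counter keeps all 64 bits */
    return sampled & UINT64_C(0xFFFFFFFF);  /* others are 32-bit */
}
```

With a sampled value of `0x100000005`, the cycle counter keeps the full value while an ordinary event counter stores 5.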
  19. 01 Oct, 2019 1 commit