1. 15 Nov, 2019 1 commit
  2. 14 Nov, 2019 1 commit
  3. 12 Nov, 2019 1 commit
    • S
      KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved · a78986aa
      Sean Christopherson authored
      Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and
      instead manually handle ZONE_DEVICE on a case-by-case basis.  For things
      like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal
      pages, e.g. put pages grabbed via gup().  But for flows such as setting
      A/D bits or shifting refcounts for transparent huge pages, KVM needs
      to avoid processing ZONE_DEVICE pages as the flows in question lack the
      underlying machinery for proper handling of ZONE_DEVICE pages.
      
      This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup()
      when running a KVM guest backed with /dev/dax memory, as KVM straight up
      doesn't put any references to ZONE_DEVICE pages acquired by gup().
      
      Note, Dan Williams proposed an alternative solution of doing put_page()
      on ZONE_DEVICE pages immediately after gup() in order to simplify the
      auditing needed to ensure is_zone_device_page() is called if and only if
      the backing device is pinned (via gup()).  But that approach would break
      kvm_vcpu_{un}map() as KVM requires the page to be pinned from map() 'til
      unmap() when accessing guest memory, unlike KVM's secondary MMU, which
      coordinates with mmu_notifier invalidations to avoid creating stale
      page references, i.e. doesn't rely on pages being pinned.
      
      [*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl

      Reported-by: Adam Borowski <kilobyte@angband.pl>
      Analyzed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: stable@vger.kernel.org
      Fixes: 3565fce3 ("mm, x86: get_user_pages() for dax mappings")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a78986aa
  4. 11 Nov, 2019 2 commits
    • P
      KVM: fix placement of refcount initialization · e2d3fcaf
      Paolo Bonzini authored
      Reported by syzkaller:
      
         =============================
         WARNING: suspicious RCU usage
         -----------------------------
         ./include/linux/kvm_host.h:536 suspicious rcu_dereference_check() usage!
      
         other info that might help us debug this:
      
         rcu_scheduler_active = 2, debug_locks = 1
         no locks held by repro_11/12688.
      
         stack backtrace:
         Call Trace:
          dump_stack+0x7d/0xc5
          lockdep_rcu_suspicious+0x123/0x170
          kvm_dev_ioctl+0x9a9/0x1260 [kvm]
          do_vfs_ioctl+0x1a1/0xfb0
          ksys_ioctl+0x6d/0x80
          __x64_sys_ioctl+0x73/0xb0
          do_syscall_64+0x108/0xaa0
          entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Commit a97b0e77 (kvm: call kvm_arch_destroy_vm if vm creation fails)
      sets users_count to 1 before kvm_arch_init_vm(), however, if kvm_arch_init_vm()
      fails, we need to decrease this count.  By moving it earlier, we can push
      the decrease to out_err_no_arch_destroy_vm without introducing yet another
      error label.
      
      syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=15209b84e00000
      
      Reported-by: syzbot+75475908cd0910f141ee@syzkaller.appspotmail.com
      Fixes: a97b0e77 ("kvm: call kvm_arch_destroy_vm if vm creation fails")
      Cc: Jim Mattson <jmattson@google.com>
      Analyzed-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e2d3fcaf
    • P
      KVM: Fix NULL-ptr deref after kvm_create_vm fails · 8a44119a
      Paolo Bonzini authored
      Reported by syzkaller:
      
          kasan: CONFIG_KASAN_INLINE enabled
          kasan: GPF could be caused by NULL-ptr deref or user memory access
          general protection fault: 0000 [#1] PREEMPT SMP KASAN
          CPU: 0 PID: 14727 Comm: syz-executor.3 Not tainted 5.4.0-rc4+ #0
          RIP: 0010:kvm_coalesced_mmio_init+0x5d/0x110 arch/x86/kvm/../../../virt/kvm/coalesced_mmio.c:121
          Call Trace:
           kvm_dev_ioctl_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:3446 [inline]
           kvm_dev_ioctl+0x781/0x1490 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3494
           vfs_ioctl fs/ioctl.c:46 [inline]
           file_ioctl fs/ioctl.c:509 [inline]
           do_vfs_ioctl+0x196/0x1150 fs/ioctl.c:696
           ksys_ioctl+0x62/0x90 fs/ioctl.c:713
           __do_sys_ioctl fs/ioctl.c:720 [inline]
           __se_sys_ioctl fs/ioctl.c:718 [inline]
           __x64_sys_ioctl+0x6e/0xb0 fs/ioctl.c:718
           do_syscall_64+0xca/0x5d0 arch/x86/entry/common.c:290
           entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Commit 9121923c ("kvm: Allocate memslots and buses before calling kvm_arch_init_vm")
      moved the memslots and buses allocations around.  However, if initialization of
      kvm->srcu or kvm->irq_srcu fails, kvm_create_vm() returns NULL instead of an error
      code.  kvm_dev_ioctl_create_vm() does not catch the NULL, which is then dereferenced
      by kvm_coalesced_mmio_init().  This patch fixes it.
      
      Moving the initialization is required anyway to avoid an incorrect synchronize_srcu that
      was also reported by syzkaller:
      
       wait_for_completion+0x29c/0x440 kernel/sched/completion.c:136
       __synchronize_srcu+0x197/0x250 kernel/rcu/srcutree.c:921
       synchronize_srcu_expedited kernel/rcu/srcutree.c:946 [inline]
       synchronize_srcu+0x239/0x3e8 kernel/rcu/srcutree.c:997
       kvm_page_track_unregister_notifier+0xe7/0x130 arch/x86/kvm/page_track.c:212
       kvm_mmu_uninit_vm+0x1e/0x30 arch/x86/kvm/mmu.c:5828
       kvm_arch_destroy_vm+0x4a2/0x5f0 arch/x86/kvm/x86.c:9579
       kvm_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:702 [inline]
      
      so do it.
      
      Reported-by: syzbot+89a8060879fa0bd2db4f@syzkaller.appspotmail.com
      Reported-by: syzbot+e27e7027eb2b80e44225@syzkaller.appspotmail.com
      Fixes: 9121923c ("kvm: Allocate memslots and buses before calling kvm_arch_init_vm")
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8a44119a
  5. 05 Nov, 2019 1 commit
  6. 04 Nov, 2019 1 commit
  7. 31 Oct, 2019 1 commit
  8. 25 Oct, 2019 1 commit
  9. 22 Oct, 2019 1 commit
  10. 20 Oct, 2019 3 commits
    • M
      KVM: arm64: pmu: Reset sample period on overflow handling · 8c3252c0
      Marc Zyngier authored
      The PMU emulation code uses the perf event sample period to trigger
      the overflow detection. This works fine for the *first* overflow
      handling, but results in a huge number of interrupts on the host,
      unrelated to the number of interrupts handled in the guest (a x20
      factor is pretty common for the cycle counter). On a slow system
      (such as a SW model), this can result in the guest only making
      forward progress at a glacial pace.
      
      It turns out that the clue is in the name. The sample period is
      exactly that: a period. And once an overflow has occurred,
      the following period should be the full width of the associated
      counter, instead of whatever the guest had initially programmed.
      
      Reset the sample period to the architected value in the overflow
      handler, which now results in a number of host interrupts that is
      much closer to the number of interrupts in the guest.
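
The arithmetic behind this can be sketched in plain C. This is a userspace model, not the kernel code: the helper name `events_until_overflow32` is hypothetical, and it assumes a 32-bit event counter.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of events a 32-bit counter holding `count`
 * can still take before it wraps. This is the distance a sample period
 * has to cover so that the host interrupt fires at the overflow.
 */
static uint64_t events_until_overflow32(uint64_t count)
{
    uint64_t left = (1ULL << 32) - (count & 0xffffffffULL);

    /* The distance is always between 1 and the full counter width. */
    assert(left >= 1 && left <= (1ULL << 32));
    return left;
}
```

If the guest programs the counter to 0xfffffff0, the first period is 16 events. After the wrap the counter is back near zero, so the next period should be close to 2^32 events rather than another 16, which is the difference the commit describes.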
      
      Fixes: b02386eb ("arm64: KVM: Add PMU overflow interrupt routing")
      Reviewed-by: Andrew Murray <andrew.murray@arm.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      8c3252c0
    • M
      KVM: arm64: pmu: Set the CHAINED attribute before creating the in-kernel event · 725ce669
      Marc Zyngier authored
      The current convention for KVM to request a chained event from the
      host PMU is to set bit[0] in attr.config1 (PERF_ATTR_CFG1_KVM_PMU_CHAINED).
      
      But as it turns out, this bit gets set *after* we create the kernel
      event that backs our virtual counter, meaning that we never get
      a 64bit counter.
      
      Moving the setting to an earlier point solves the problem.
      
      Fixes: 80f393a2 ("KVM: arm/arm64: Support chained PMU counters")
      Reviewed-by: Andrew Murray <andrew.murray@arm.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      725ce669
    • M
      KVM: arm64: pmu: Fix cycle counter truncation · f4e23cf9
      Marc Zyngier authored
      When a counter is disabled, its value is sampled before the event
      is disabled, and the value is written back to the shadow register.
      
      In that process, the value gets truncated to 32bit, which is adequate
      for any counter but the cycle counter (defined as a 64bit counter).
      
      This obviously results in a corrupted counter, and things like
      "perf record -e cycles" not working at all when run in a guest...
      A similar, but less critical bug exists in kvm_pmu_get_counter_value.
      
      Make the truncation conditional on the counter not being the cycle
      counter, which results in a minor code reorganisation.
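
A minimal sketch of the conditional truncation, as a userspace model rather than the actual patch. The helper name `shadow_value` is hypothetical, index 31 is assumed to be the cycle counter (as on the ARMv8 PMU), and chained counters are ignored.

```c
#include <assert.h>
#include <stdint.h>

#define NR_COUNTERS       32
#define CYCLE_COUNTER_IDX 31   /* assumption: ARMv8 PMU cycle counter index */

/*
 * Hypothetical helper: value to write back to the shadow register when
 * counter `idx` is disabled. Only event counters are truncated to 32 bits;
 * the cycle counter keeps its full 64-bit value.
 */
static uint64_t shadow_value(unsigned int idx, uint64_t sampled)
{
    assert(idx < NR_COUNTERS);

    if (idx != CYCLE_COUNTER_IDX)
        sampled &= 0xffffffffULL;

    return sampled;
}
```

With an unconditional mask, a cycle count of 0x100000005 would be stored as 5, which is the corruption the commit message refers to.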
      
      Fixes: 80f393a2 ("KVM: arm/arm64: Support chained PMU counters")
      Reviewed-by: Andrew Murray <andrew.murray@arm.com>
      Reported-by: Julien Thierry <julien.thierry.kdev@gmail.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      f4e23cf9
  11. 01 Oct, 2019 1 commit
  12. 18 Sep, 2019 1 commit
  13. 11 Sep, 2019 1 commit
  14. 09 Sep, 2019 1 commit
    • M
      KVM: arm/arm64: vgic: Allow more than 256 vcpus for KVM_IRQ_LINE · 92f35b75
      Marc Zyngier authored
      While parts of the VGIC support a large number of vcpus (we
      bravely allow up to 512), other parts are more limited.
      
      One of these limits is visible in the KVM_IRQ_LINE ioctl, which
      only allows 256 vcpus to be signalled when using the CPU or PPI
      types. Unfortunately, we've cornered ourselves badly by allocating
      all the bits in the irq field.
      
      Since the irq_type subfield (8 bit wide) is currently only taking
      the values 0, 1 and 2 (and we have been careful not to allow anything
      else), let's reduce this field to only 4 bits, and allocate the
      remaining 4 bits to a vcpu2_index, which acts as a multiplier:
      
        vcpu_id = 256 * vcpu2_index + vcpu_index
      
      With that, and a new capability (KVM_CAP_ARM_IRQ_LINE_LAYOUT_2)
      allowing this to be discovered, it becomes possible to inject
      PPIs to up to 4096 vcpus. But please just don't.
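
The resulting field layout can be sketched as follows. The bit positions are an assumption taken from the layout in the KVM API documentation (vcpu2_index in bits 31..28, irq_type in 27..24, vcpu_index in 23..16, the interrupt number in 15..0); the helper names are hypothetical and are not part of the kernel UAPI.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field positions within the 32-bit irq field. */
#define IRQ_VCPU2_SHIFT 28      /* 4 bits: vcpu2_index */
#define IRQ_TYPE_SHIFT  24      /* 4 bits: irq_type (0, 1 or 2 today) */
#define IRQ_VCPU_SHIFT  16      /* 8 bits: vcpu_index */
#define IRQ_NUM_MASK    0xffffu /* 16 bits: interrupt number */

/* Hypothetical helper: pack the irq field for a given vcpu_id. */
static uint32_t irq_line_encode(uint32_t irq_type, uint32_t vcpu_id,
                                uint32_t irq_num)
{
    assert(irq_type < 16 && vcpu_id < 4096 && irq_num <= IRQ_NUM_MASK);

    uint32_t vcpu2_index = vcpu_id / 256;   /* the "multiplier" part */
    uint32_t vcpu_index  = vcpu_id % 256;

    return (vcpu2_index << IRQ_VCPU2_SHIFT) |
           (irq_type << IRQ_TYPE_SHIFT) |
           (vcpu_index << IRQ_VCPU_SHIFT) |
           irq_num;
}

/* Hypothetical helper: recover vcpu_id = 256 * vcpu2_index + vcpu_index. */
static uint32_t irq_line_vcpu_id(uint32_t irq)
{
    uint32_t vcpu2_index = (irq >> IRQ_VCPU2_SHIFT) & 0xf;
    uint32_t vcpu_index  = (irq >> IRQ_VCPU_SHIFT) & 0xff;

    return 256 * vcpu2_index + vcpu_index;
}
```

For example, vcpu 300 splits into vcpu2_index 1 and vcpu_index 44. Userspace that never sets the top four bits keeps the old behaviour, which is why the narrower irq_type field stays compatible.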
      
      Whilst we're there, add a clarification about the use of KVM_IRQ_LINE
      on arm, which is not completely conditioned by KVM_CAP_IRQCHIP.

      Reported-by: Zenghui Yu <yuzenghui@huawei.com>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      92f35b75
  15. 28 Aug, 2019 1 commit
  16. 27 Aug, 2019 1 commit
  17. 25 Aug, 2019 2 commits
  18. 24 Aug, 2019 1 commit
    • A
      KVM: arm/arm64: VGIC: Properly initialise private IRQ affinity · 2e16f3e9
      Andre Przywara authored
      At the moment we initialise the target *mask* of a virtual IRQ to the
      VCPU it belongs to, even though this mask is only defined for GICv2 and
      quickly runs out of bits for many GICv3 guests.
      This behaviour triggers a UBSAN complaint for more than 32 VCPUs:
      ------
      [ 5659.462377] UBSAN: Undefined behaviour in virt/kvm/arm/vgic/vgic-init.c:223:21
      [ 5659.471689] shift exponent 32 is too large for 32-bit type 'unsigned int'
      ------
      Also for GICv3 guests the reporting of TARGET in the "vgic-state" debugfs
      dump is wrong, due to this very same problem.
      
      Because there is no requirement to create the VGIC device before the
      VCPUs (and QEMU actually does it the other way round), we can't safely
      initialise mpidr or targets in kvm_vgic_vcpu_init(). But since we touch
      every private IRQ for each VCPU anyway later (in vgic_init()), we can
      just move the initialisation of those fields into there, where we
      definitely know the VGIC type.
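
The idea of initialising per model can be sketched as below. This is a userspace model under assumptions: the type and function names are hypothetical, the 8-bit target mask reflects the GICv2 limit of 8 CPUs, and the MPIDR value is passed in rather than read from the VCPU.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the VGIC model and a private IRQ. */
enum vgic_model { VGIC_NONE, VGIC_V2, VGIC_V3 };

struct private_irq {
    uint8_t  targets;  /* GICv2 only: one bit per CPU, at most 8 CPUs */
    uint64_t mpidr;    /* GICv3 only: affinity of the owning VCPU */
};

/*
 * Initialise the affinity once the VGIC model is known. Returns 0 on
 * success and -1 if the model is not set or the VCPU does not fit a
 * GICv2 target mask, instead of shifting by an out-of-range exponent.
 */
static int init_private_irq(struct private_irq *irq, enum vgic_model model,
                            unsigned int vcpu_id, uint64_t mpidr)
{
    assert(irq);

    switch (model) {
    case VGIC_V2:
        if (vcpu_id >= 8)
            return -1;
        irq->targets = (uint8_t)(1u << vcpu_id);
        return 0;
    case VGIC_V3:
        irq->mpidr = mpidr;
        return 0;
    default:
        return -1;  /* the uninitialised case is rejected, not ignored */
    }
}
```

In this model VCPU 40 of a GICv3 guest is handled without any shift at all, so the shift exponent reported by UBSAN can never be reached.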
      
      On the way make sure we really have either a VGICv2 or a VGICv3 device,
      since the existing code is just checking for "VGICv3 or not", silently
      ignoring the uninitialised case.

      Signed-off-by: Andre Przywara <andre.przywara@arm.com>
      Reported-by: Dave Martin <dave.martin@arm.com>
      Tested-by: Julien Grall <julien.grall@arm.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      2e16f3e9
  19. 22 Aug, 2019 1 commit
    • A
      KVM: arm/arm64: Only skip MMIO insn once · 2113c5f6
      Andrew Jones authored
      If after an MMIO exit to userspace a VCPU is immediately run with an
      immediate_exit request, such as when a signal is delivered or an MMIO
      emulation completion is needed, then the VCPU completes the MMIO
      emulation and immediately returns to userspace. As the exit_reason
      does not get changed from KVM_EXIT_MMIO in these cases we have to
      be careful not to complete the MMIO emulation again, when the VCPU is
      eventually run again, because the emulation does an instruction skip
      (and doing too many skips would be a waste of guest code :-)). We need
      to use additional VCPU state to track if the emulation is complete.
      As luck would have it, we already have 'mmio_needed', which even
      appears to be used in this way by other architectures already.
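
The flag-based guard can be sketched as follows. This is a userspace model with hypothetical names (`struct vcpu_model`, `mmio_exit_to_userspace`, `vcpu_run`); only the `mmio_needed` flag comes from the commit message, and a fixed 4-byte instruction size is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily reduced VCPU state. */
struct vcpu_model {
    uint64_t pc;
    bool     mmio_needed;  /* set while an MMIO emulation is outstanding */
};

/* The guest faulted on MMIO and we exit to userspace to emulate it. */
static void mmio_exit_to_userspace(struct vcpu_model *v)
{
    assert(v);
    v->mmio_needed = true;
}

/*
 * Entry into the run loop. The instruction skip is tied to the flag, not
 * to the stale exit reason, so re-entering after an immediate exit does
 * not skip a second instruction.
 */
static void vcpu_run(struct vcpu_model *v)
{
    assert(v);
    if (v->mmio_needed) {
        v->pc += 4;             /* assumption: 4-byte A64 instruction */
        v->mmio_needed = false; /* emulation is now complete */
    }
}
```

Calling `vcpu_run()` twice after a single MMIO exit advances the program counter by one instruction only.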
      
      Fixes: 0d640732 ("arm64: KVM: Skip MMIO insn after emulation")
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      2113c5f6
  20. 19 Aug, 2019 12 commits
  21. 09 Aug, 2019 2 commits
  22. 05 Aug, 2019 3 commits
    • M
      KVM: arm/arm64: Sync ICH_VMCR_EL2 back when about to block · 5eeaf10e
      Marc Zyngier authored
      Since commit 328e5664 ("KVM: arm/arm64: vgic: Defer
      touching GICH_VMCR to vcpu_load/put"), we leave ICH_VMCR_EL2 (or
      its GICv2 equivalent) loaded as long as we can, only syncing it
      back when we're scheduled out.
      
      There is a small snag with that though: kvm_vgic_vcpu_pending_irq(),
      which is indirectly called from kvm_vcpu_check_block(), needs to
      evaluate the guest's view of ICC_PMR_EL1. At the point where we
      call kvm_vcpu_check_block(), the vcpu is still loaded, and any
      change to PMR is not visible in memory until we do a vcpu_put().
      
      Things go really south if the guest does the following:
      
      	mov x0, #0	// or any small value masking interrupts
      	msr ICC_PMR_EL1, x0
      
      	[vcpu preempted, then rescheduled, VMCR sampled]
      
      	mov x0, #0xff	// allow all interrupts
      	msr ICC_PMR_EL1, x0
      	wfi		// traps to EL2, so sampling of VMCR
      
      	[interrupt arrives just after WFI]
      
      Here, the hypervisor's view of PMR is zero, while the guest has enabled
      its interrupts. kvm_vgic_vcpu_pending_irq() will then say that no
      interrupts are pending (despite an interrupt being received) and we'll
      block for no reason. If the guest doesn't have a periodic interrupt
      firing once it has blocked, it will stay there forever.
      
      To avoid this unfortunate situation, let's resync VMCR from
      kvm_arch_vcpu_blocking(), ensuring that a following kvm_vcpu_check_block()
      will observe the latest value of PMR.
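
The stale-PMR problem can be illustrated with a small userspace model. All names here are hypothetical, and the model keeps only what matters for the argument: an interrupt counts as deliverable when its priority value is numerically lower than the priority mask, and the pending check reads an in-memory copy of the mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced view of the state involved. */
struct vcpu_model {
    uint8_t hw_pmr;      /* value the guest last wrote, still in hardware */
    uint8_t cached_pmr;  /* in-memory copy read by the pending check */
};

/* Models the pending check: it only sees the in-memory copy. */
static bool irq_pending(const struct vcpu_model *v, uint8_t irq_priority)
{
    assert(v);
    return irq_priority < v->cached_pmr;
}

/* Models the resync done before blocking: refresh the in-memory copy. */
static void sync_vmcr(struct vcpu_model *v)
{
    assert(v);
    v->cached_pmr = v->hw_pmr;
}
```

With `hw_pmr` at 0xff and a stale `cached_pmr` of 0, an interrupt of priority 0xa0 is reported as not pending until `sync_vmcr()` runs, which mirrors the guest blocking for no reason.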
      
      This has been found by booting an arm64 Linux guest with the pseudo NMI
      feature, and thus using interrupt priorities to mask interrupts instead
      of the usual PSTATE masking.
      
      Cc: stable@vger.kernel.org # 4.12
      Fixes: 328e5664 ("KVM: arm/arm64: vgic: Defer touching GICH_VMCR to vcpu_load/put")
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      5eeaf10e
    • G
      KVM: no need to check return value of debugfs_create functions · 3e7093d0
      Greg KH authored
      When calling debugfs functions, there is no need to ever check the
      return value.  The function can work or not, but the code logic should
      never do something different based on this.
      
      Also, when doing this, change kvm_arch_create_vcpu_debugfs() to return
      void instead of an integer, as we should not care at all whether this
      function actually does anything.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <x86@kernel.org>
      Cc: <kvm@vger.kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3e7093d0
    • P
      KVM: remove kvm_arch_has_vcpu_debugfs() · 741cbbae
      Paolo Bonzini authored
      There is no need for this function as all arches have to implement
      kvm_arch_create_vcpu_debugfs() no matter what.  A #define symbol
      lets us actually simplify the code.

      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      741cbbae