1. 24 7月, 2020 1 次提交
  2. 10 7月, 2020 1 次提交
  3. 04 7月, 2020 3 次提交
  4. 30 6月, 2020 1 次提交
    • P
      KVM: x86: bit 8 of non-leaf PDPEs is not reserved · 5ecad245
      Paolo Bonzini 提交于
      Bit 8 would be the "global" bit, which does not quite make sense for non-leaf
      page table entries.  Intel ignores it; AMD ignores it in PDEs and PDPEs, but
      reserves it in PML4Es.
      
      Probably, earlier versions of the AMD manual documented it as reserved in PDPEs
      as well, and that behavior made it into KVM as well as kvm-unit-tests; fix it.
      
      Cc: stable@vger.kernel.org
      Reported-by: NNadav Amit <namit@vmware.com>
      Fixes: a0c0feb5 ("KVM: x86: reserve bit 8 of non-leaf PDPEs and PML4Es in 64-bit mode on AMD", 2014-09-03)
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5ecad245
  5. 29 6月, 2020 1 次提交
    • W
      KVM: X86: Fix async pf caused null-ptr-deref · 9d3c447c
      Wanpeng Li 提交于
      Syzbot reported that:
      
        CPU: 1 PID: 6780 Comm: syz-executor153 Not tainted 5.7.0-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:__apic_accept_irq+0x46/0xb80
        Call Trace:
         kvm_arch_async_page_present+0x7de/0x9e0
         kvm_check_async_pf_completion+0x18d/0x400
         kvm_arch_vcpu_ioctl_run+0x18bf/0x69f0
         kvm_vcpu_ioctl+0x46a/0xe20
         ksys_ioctl+0x11a/0x180
         __x64_sys_ioctl+0x6f/0xb0
         do_syscall_64+0xf6/0x7d0
         entry_SYSCALL_64_after_hwframe+0x49/0xb3
      
      The testcase enables APF mechanism in MSR_KVM_ASYNC_PF_EN with ASYNC_PF_INT
      enabled w/o setting MSR_KVM_ASYNC_PF_INT before, what's worse, interrupt
      based APF 'page ready' event delivery depends on in kernel lapic, however,
      we didn't bail out when lapic is not in kernel during guest setting
      MSR_KVM_ASYNC_PF_EN which causes the null-ptr-deref in host later.
      This patch fixes it.
      
      Reported-by: syzbot+1bf777dfdde86d64b89b@syzkaller.appspotmail.com
      Fixes: 2635b5c4 (KVM: x86: interrupt based APF 'page ready' event delivery)
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Message-Id: <1593426391-8231-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9d3c447c
  6. 23 6月, 2020 8 次提交
    • S
      KVM: VMX: Remove vcpu_vmx's defunct copy of host_pkru · e4553b49
      Sean Christopherson 提交于
      Remove vcpu_vmx.host_pkru, which got left behind when PKRU support was
      moved to common x86 code.
      
      No functional change intended.
      
      Fixes: 37486135 ("KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c")
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200617034123.25647-1-sean.j.christopherson@intel.com>
      Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e4553b49
    • M
      KVM: x86: allow TSC to differ by NTP correction bounds without TSC scaling · 26769f96
      Marcelo Tosatti 提交于
      The Linux TSC calibration procedure is subject to small variations
      (its common to see +-1 kHz difference between reboots on a given CPU, for example).
      
      So migrating a guest between two hosts with identical processor can fail, in case
      of a small variation in calibrated TSC between them.
      
      Without TSC scaling, the current kernel interface will either return an error
      (if user_tsc_khz <= tsc_khz) or enable TSC catchup mode.
      
      This change enables the following TSC tolerance check to
      accept KVM_SET_TSC_KHZ within tsc_tolerance_ppm (which is 250ppm by default).
      
              /*
               * Compute the variation in TSC rate which is acceptable
               * within the range of tolerance and decide if the
               * rate being applied is within that bounds of the hardware
               * rate.  If so, no scaling or compensation need be done.
               */
              thresh_lo = adjust_tsc_khz(tsc_khz, -tsc_tolerance_ppm);
              thresh_hi = adjust_tsc_khz(tsc_khz, tsc_tolerance_ppm);
              if (user_tsc_khz < thresh_lo || user_tsc_khz > thresh_hi) {
                      pr_debug("kvm: requested TSC rate %u falls outside tolerance [%u,%u]\n", user_tsc_khz, thresh_lo, thresh_hi);
                      use_scaling = 1;
              }
      
      NTP daemon in the guest can correct this difference (NTP can correct upto 500ppm).
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      
      Message-Id: <20200616114741.GA298183@fuller.cnet>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      26769f96
    • X
      KVM: X86: Fix MSR range of APIC registers in X2APIC mode · bf10bd0b
      Xiaoyao Li 提交于
      Only MSR address range 0x800 through 0x8ff is architecturally reserved
      and dedicated for accessing APIC registers in x2APIC mode.
      
      Fixes: 0105d1a5 ("KVM: x2apic interface to lapic")
      Signed-off-by: NXiaoyao Li <xiaoyao.li@intel.com>
      Message-Id: <20200616073307.16440-1-xiaoyao.li@intel.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      bf10bd0b
    • S
      KVM: VMX: Stop context switching MSR_IA32_UMWAIT_CONTROL · bf09fb6c
      Sean Christopherson 提交于
      Remove support for context switching between the guest's and host's
      desired UMWAIT_CONTROL.  Propagating the guest's value to hardware isn't
      required for correct functionality, e.g. KVM intercepts reads and writes
      to the MSR, and the latency effects of the settings controlled by the
      MSR are not architecturally visible.
      
      As a general rule, KVM should not allow the guest to control power
      management settings unless explicitly enabled by userspace, e.g. see
      KVM_CAP_X86_DISABLE_EXITS.  E.g. Intel's SDM explicitly states that C0.2
      can improve the performance of SMT siblings.  A devious guest could
      disable C0.2 so as to improve the performance of their workloads at the
      detriment to workloads running in the host or on other VMs.
      
      Wholesale removal of UMWAIT_CONTROL context switching also fixes a race
      condition where updates from the host may cause KVM to enter the guest
      with the incorrect value.  Because updates are are propagated to all
      CPUs via IPI (SMP function callback), the value in hardware may be
      stale with respect to the cached value and KVM could enter the guest
      with the wrong value in hardware.  As above, the guest can't observe the
      bad value, but it's a weird and confusing wart in the implementation.
      
      Removal also fixes the unnecessary usage of VMX's atomic load/store MSR
      lists.  Using the lists is only necessary for MSRs that are required for
      correct functionality immediately upon VM-Enter/VM-Exit, e.g. EFER on
      old hardware, or for MSRs that need to-the-uop precision, e.g. perf
      related MSRs.  For UMWAIT_CONTROL, the effects are only visible in the
      kernel via TPAUSE/delay(), and KVM doesn't do any form of delay in
      vcpu_vmx_run().  Using the atomic lists is undesirable as they are more
      expensive than direct RDMSR/WRMSR.
      
      Furthermore, even if giving the guest control of the MSR is legitimate,
      e.g. in pass-through scenarios, it's not clear that the benefits would
      outweigh the overhead.  E.g. saving and restoring an MSR across a VMX
      roundtrip costs ~250 cycles, and if the guest diverged from the host
      that cost would be paid on every run of the guest.  In other words, if
      there is a legitimate use case then it should be enabled by a new
      per-VM capability.
      
      Note, KVM still needs to emulate MSR_IA32_UMWAIT_CONTROL so that it can
      correctly expose other WAITPKG features to the guest, e.g. TPAUSE,
      UMWAIT and UMONITOR.
      
      Fixes: 6e3ba4ab ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL")
      Cc: stable@vger.kernel.org
      Cc: Jingqi Liu <jingqi.liu@intel.com>
      Cc: Tao Xu <tao3.xu@intel.com>
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200623005135.10414-1-sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      bf09fb6c
    • S
      KVM: nVMX: Plumb L2 GPA through to PML emulation · 2dbebf7a
      Sean Christopherson 提交于
      Explicitly pass the L2 GPA to kvm_arch_write_log_dirty(), which for all
      intents and purposes is vmx_write_pml_buffer(), instead of having the
      latter pull the GPA from vmcs.GUEST_PHYSICAL_ADDRESS.  If the dirty bit
      update is the result of KVM emulation (rare for L2), then the GPA in the
      VMCS may be stale and/or hold a completely unrelated GPA.
      
      Fixes: c5f983f6 ("nVMX: Implement emulated Page Modification Logging")
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200622215832.22090-2-sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      2dbebf7a
    • V
      KVM: x86/mmu: Avoid mixing gpa_t with gfn_t in walk_addr_generic() · 312d16c7
      Vitaly Kuznetsov 提交于
      translate_gpa() returns a GPA, assigning it to 'real_gfn' seems obviously
      wrong. There is no real issue because both 'gpa_t' and 'gfn_t' are u64 and
      we don't use the value in 'real_gfn' as a GFN, we do
      
       real_gfn = gpa_to_gfn(real_gfn);
      
      instead. 'If you see a "buffalo" sign on an elephant's cage, do not trust
      your eyes', but let's fix it for good.
      
      No functional change intended.
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200622151435.752560-1-vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      312d16c7
    • P
      KVM: LAPIC: ensure APIC map is up to date on concurrent update requests · 44d52717
      Paolo Bonzini 提交于
      The following race can cause lost map update events:
      
               cpu1                            cpu2
      
                                      apic_map_dirty = true
        ------------------------------------------------------------
                                      kvm_recalculate_apic_map:
                                           pass check
                                               mutex_lock(&kvm->arch.apic_map_lock);
                                               if (!kvm->arch.apic_map_dirty)
                                           and in process of updating map
        -------------------------------------------------------------
          other calls to
             apic_map_dirty = true         might be too late for affected cpu
        -------------------------------------------------------------
                                           apic_map_dirty = false
        -------------------------------------------------------------
          kvm_recalculate_apic_map:
          bail out on
            if (!kvm->arch.apic_map_dirty)
      
      To fix it, record the beginning of an update of the APIC map in
      apic_map_dirty.  If another APIC map change switches apic_map_dirty
      back to DIRTY during the update, kvm_recalculate_apic_map should not
      make it CLEAN, and the other caller will go through the slow path.
      Reported-by: NIgor Mammedov <imammedo@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      44d52717
    • I
      kvm: lapic: fix broken vcpu hotplug · af28dfac
      Igor Mammedov 提交于
      Guest fails to online hotplugged CPU with error
        smpboot: do_boot_cpu failed(-1) to wakeup CPU#4
      
      It's caused by the fact that kvm_apic_set_state(), which used to call
      recalculate_apic_map() unconditionally and pulled hotplugged CPU into
      apic map, is updating map conditionally on state changes.  In this case
      the APIC map is not considered dirty and the is not updated.
      
      Fix the issue by forcing unconditional update from kvm_apic_set_state(),
      like it used to be.
      
      Fixes: 4abaffce ("KVM: LAPIC: Recalculate apic map in batch")
      Cc: stable@vger.kernel.org
      Signed-off-by: NIgor Mammedov <imammedo@redhat.com>
      Message-Id: <20200622160830.426022-1-imammedo@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      af28dfac
  7. 19 6月, 2020 1 次提交
    • V
      Revert "KVM: VMX: Micro-optimize vmexit time when not exposing PMU" · 49097762
      Vitaly Kuznetsov 提交于
      Guest crashes are observed on a Cascade Lake system when 'perf top' is
      launched on the host, e.g.
      
       BUG: unable to handle kernel paging request at fffffe0000073038
       PGD 7ffa7067 P4D 7ffa7067 PUD 7ffa6067 PMD 7ffa5067 PTE ffffffffff120
       Oops: 0000 [#1] SMP PTI
       CPU: 1 PID: 1 Comm: systemd Not tainted 4.18.0+ #380
      ...
       Call Trace:
        serial8250_console_write+0xfe/0x1f0
        call_console_drivers.constprop.0+0x9d/0x120
        console_unlock+0x1ea/0x460
      
      Call traces are different but the crash is imminent. The problem was
      blindly bisected to the commit 041bc42c ("KVM: VMX: Micro-optimize
      vmexit time when not exposing PMU"). It was also confirmed that the
      issue goes away if PMU is exposed to the guest.
      
      With some instrumentation of the guest we can see what is being switched
      (when we do atomic_switch_perf_msrs()):
      
       vmx_vcpu_run: switching 2 msrs
       vmx_vcpu_run: switching MSR38f guest: 70000000d host: 70000000f
       vmx_vcpu_run: switching MSR3f1 guest: 0 host: 2
      
      The current guess is that PEBS (MSR_IA32_PEBS_ENABLE, 0x3f1) is to blame.
      Regardless of whether PMU is exposed to the guest or not, PEBS needs to
      be disabled upon switch.
      
      This reverts commit 041bc42c.
      Reported-by: NMaxime Coquelin <maxime.coquelin@redhat.com>
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200619094046.654019-1-vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      49097762
  8. 16 6月, 2020 1 次提交
  9. 15 6月, 2020 1 次提交
    • Q
      kvm/svm: disable KCSAN for svm_vcpu_run() · b95273f1
      Qian Cai 提交于
      For some reasons, running a simple qemu-kvm command with KCSAN will
      reset AMD hosts. It turns out svm_vcpu_run() could not be instrumented.
      Disable it for now.
      
       # /usr/libexec/qemu-kvm -name ubuntu-18.04-server-cloudimg -cpu host
      	-smp 2 -m 2G -hda ubuntu-18.04-server-cloudimg.qcow2
      
      === console output ===
      Kernel 5.6.0-next-20200408+ on an x86_64
      
      hp-dl385g10-05 login:
      
      <...host reset...>
      
      HPE ProLiant System BIOS A40 v1.20 (03/09/2018)
      (C) Copyright 1982-2018 Hewlett Packard Enterprise Development LP
      Early system initialization, please wait...
      Signed-off-by: NQian Cai <cai@lca.pw>
      Message-Id: <20200415153709.1559-1-cai@lca.pw>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b95273f1
  10. 14 6月, 2020 1 次提交
    • M
      treewide: replace '---help---' in Kconfig files with 'help' · a7f7f624
      Masahiro Yamada 提交于
      Since commit 84af7a61 ("checkpatch: kconfig: prefer 'help' over
      '---help---'"), the number of '---help---' has been gradually
      decreasing, but there are still more than 2400 instances.
      
      This commit finishes the conversion. While I touched the lines,
      I also fixed the indentation.
      
      There are a variety of indentation styles found.
      
        a) 4 spaces + '---help---'
        b) 7 spaces + '---help---'
        c) 8 spaces + '---help---'
        d) 1 space + 1 tab + '---help---'
        e) 1 tab + '---help---'    (correct indentation)
        f) 1 tab + 1 space + '---help---'
        g) 1 tab + 2 spaces + '---help---'
      
      In order to convert all of them to 1 tab + 'help', I ran the
      following commend:
      
        $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
      Signed-off-by: NMasahiro Yamada <masahiroy@kernel.org>
      a7f7f624
  11. 12 6月, 2020 4 次提交
  12. 11 6月, 2020 3 次提交
  13. 10 6月, 2020 1 次提交
  14. 09 6月, 2020 1 次提交
  15. 08 6月, 2020 4 次提交
    • E
      KVM: x86: Fix APIC page invalidation race · e649b3f0
      Eiichi Tsukata 提交于
      Commit b1394e74 ("KVM: x86: fix APIC page invalidation") tried
      to fix inappropriate APIC page invalidation by re-introducing arch
      specific kvm_arch_mmu_notifier_invalidate_range() and calling it from
      kvm_mmu_notifier_invalidate_range_start. However, the patch left a
      possible race where the VMCS APIC address cache is updated *before*
      it is unmapped:
      
        (Invalidator) kvm_mmu_notifier_invalidate_range_start()
        (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
        (KVM VCPU) vcpu_enter_guest()
        (KVM VCPU) kvm_vcpu_reload_apic_access_page()
        (Invalidator) actually unmap page
      
      Because of the above race, there can be a mismatch between the
      host physical address stored in the APIC_ACCESS_PAGE VMCS field and
      the host physical address stored in the EPT entry for the APIC GPA
      (0xfee0000).  When this happens, the processor will not trap APIC
      accesses, and will instead show the raw contents of the APIC-access page.
      Because Windows OS periodically checks for unexpected modifications to
      the LAPIC register, this will show up as a BSOD crash with BugCheck
      CRITICAL_STRUCTURE_CORRUPTION (109) we are currently seeing in
      https://bugzilla.redhat.com/show_bug.cgi?id=1751017.
      
      The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range()
      cannot guarantee that no additional references are taken to the pages in
      the range before kvm_mmu_notifier_invalidate_range_end().  Fortunately,
      this case is supported by the MMU notifier API, as documented in
      include/linux/mmu_notifier.h:
      
      	 * If the subsystem
               * can't guarantee that no additional references are taken to
               * the pages in the range, it has to implement the
               * invalidate_range() notifier to remove any references taken
               * after invalidate_range_start().
      
      The fix therefore is to reload the APIC-access page field in the VMCS
      from kvm_mmu_notifier_invalidate_range() instead of ..._range_start().
      
      Cc: stable@vger.kernel.org
      Fixes: b1394e74 ("KVM: x86: fix APIC page invalidation")
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=197951Signed-off-by: NEiichi Tsukata <eiichi.tsukata@nutanix.com>
      Message-Id: <20200606042627.61070-1-eiichi.tsukata@nutanix.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e649b3f0
    • P
      KVM: SVM: fix calls to is_intercept · fb7333df
      Paolo Bonzini 提交于
      is_intercept takes an INTERCEPT_* constant, not SVM_EXIT_*; because
      of this, the compiler was removing the body of the conditionals,
      as if is_intercept returned 0.
      
      This unveils a latent bug: when clearing the VINTR intercept,
      int_ctl must also be changed in the L1 VMCB (svm->nested.hsave),
      just like the intercept itself is also changed in the L1 VMCB.
      Otherwise V_IRQ remains set and, due to the VINTR intercept being clear,
      we get a spurious injection of a vector 0 interrupt on the next
      L2->L1 vmexit.
      Reported-by: NQian Cai <cai@lca.pw>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      fb7333df
    • V
      Revert "KVM: x86: work around leak of uninitialized stack contents" · 25597f64
      Vitaly Kuznetsov 提交于
      handle_vmptrst()/handle_vmread() stopped injecting #PF unconditionally
      and switched to nested_vmx_handle_memory_failure() which just kills the
      guest with KVM_EXIT_INTERNAL_ERROR in case of MMIO access, zeroing
      'exception' in kvm_write_guest_virt_system() is not needed anymore.
      
      This reverts commit 541ab2ae.
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200605115906.532682-2-vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      25597f64
    • V
      KVM: VMX: Properly handle kvm_read/write_guest_virt*() result · 7a35e515
      Vitaly Kuznetsov 提交于
      Syzbot reports the following issue:
      
      WARNING: CPU: 0 PID: 6819 at arch/x86/kvm/x86.c:618
       kvm_inject_emulated_page_fault+0x210/0x290 arch/x86/kvm/x86.c:618
      ...
      Call Trace:
      ...
      RIP: 0010:kvm_inject_emulated_page_fault+0x210/0x290 arch/x86/kvm/x86.c:618
      ...
       nested_vmx_get_vmptr+0x1f9/0x2a0 arch/x86/kvm/vmx/nested.c:4638
       handle_vmon arch/x86/kvm/vmx/nested.c:4767 [inline]
       handle_vmon+0x168/0x3a0 arch/x86/kvm/vmx/nested.c:4728
       vmx_handle_exit+0x29c/0x1260 arch/x86/kvm/vmx/vmx.c:6067
      
      'exception' we're trying to inject with kvm_inject_emulated_page_fault()
      comes from:
      
        nested_vmx_get_vmptr()
         kvm_read_guest_virt()
           kvm_read_guest_virt_helper()
             vcpu->arch.walk_mmu->gva_to_gpa()
      
      but it is only set when GVA to GPA conversion fails. In case it doesn't but
      we still fail kvm_vcpu_read_guest_page(), X86EMUL_IO_NEEDED is returned and
      nested_vmx_get_vmptr() calls kvm_inject_emulated_page_fault() with zeroed
      'exception'. This happen when the argument is MMIO.
      
      Paolo also noticed that nested_vmx_get_vmptr() is not the only place in
      KVM code where kvm_read/write_guest_virt*() return result is mishandled.
      VMX instructions along with INVPCID have the same issue. This was already
      noticed before, e.g. see commit 541ab2ae ("KVM: x86: work around
      leak of uninitialized stack contents") but was never fully fixed.
      
      KVM could've handled the request correctly by going to userspace and
      performing I/O but there doesn't seem to be a good need for such requests
      in the first place.
      
      Introduce vmx_handle_memory_failure() as an interim solution.
      
      Note, nested_vmx_get_vmptr() now has three possible outcomes: OK, PF,
      KVM_EXIT_INTERNAL_ERROR and callers need to know if userspace exit is
      needed (for KVM_EXIT_INTERNAL_ERROR) in case of failure. We don't seem
      to have a good enum describing this tristate, just add "int *ret" to
      nested_vmx_get_vmptr() interface to pass the information.
      
      Reported-by: syzbot+2a7156e11dc199bdbd8a@syzkaller.appspotmail.com
      Suggested-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200605115906.532682-1-vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7a35e515
  16. 05 6月, 2020 8 次提交