1. 22 Sep, 2021 36 commits
    • M
      KVM: x86: nVMX: re-evaluate emulation_required on nested VM exit · dbab610a
      Maxim Levitsky authored
      If L1 had invalid state on VM entry (which can happen on SMM transitions
      when we enter from real mode straight into a nested guest), then after
      we load the 'host' state from VMCS12 the state has to become valid
      again. But since we load the segment registers with
      __vmx_set_segment, we weren't always updating emulation_required.

      Update emulation_required explicitly at the end of load_vmcs12_host_state.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210913140954.165665-8-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      dbab610a
    • M
      KVM: x86: nVMX: don't fail nested VM entry on invalid guest state if !from_vmentry · c8607e4a
      Maxim Levitsky authored
      It is possible that when non-root mode is entered via a special entry
      (!from_vmentry), that is, from SMM or from loading the nested state,
      the L2 state is invalid with regard to non-unrestricted guest mode
      but becomes valid later (for example, when RSM emulation restores
      segment registers from SMRAM).

      Thus delay the check to VM entry, where we will check this and fail.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210913140954.165665-7-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c8607e4a
    • M
      KVM: x86: VMX: synthesize invalid VM exit when emulating invalid guest state · c42dec14
      Maxim Levitsky authored
      Since no actual VM entry happened, the VM exit information is stale.
      To avoid this, synthesize an invalid-guest-state VM exit.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210913140954.165665-6-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c42dec14
    • M
      KVM: x86: nSVM: refactor svm_leave_smm and smm_enter_smm · 136a55c0
      Maxim Levitsky authored
      Use return statements instead of nested ifs, and fix the error
      path to free all the maps that were allocated.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210913140954.165665-2-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      136a55c0
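The refactoring described in 136a55c0 follows a common C pattern: bail out early on failure and release mappings in reverse order so no exit path leaks one. The following is only a sketch of that pattern; `sm_map_page`, `sm_unmap_page`, `sm_live_maps` and `sm_leave_smm_sketch` are hypothetical names invented here and are not KVM functions.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a mapped guest page; not a KVM type. */
struct sm_mapping { void *hva; };

/* Counts outstanding mappings so that leaks are observable in this sketch. */
static int sm_live_maps;

static int sm_map_page(struct sm_mapping *m, int should_fail)
{
	if (should_fail)
		return -1;
	m->hva = malloc(64);
	if (!m->hva)
		return -1;
	sm_live_maps++;
	return 0;
}

static void sm_unmap_page(struct sm_mapping *m)
{
	free(m->hva);
	m->hva = NULL;
	sm_live_maps--;
}

/* Returns 0 on success and 1 on failure; every path unmaps what it mapped. */
static int sm_leave_smm_sketch(int fail_first, int fail_second, int fail_restore)
{
	struct sm_mapping a = { 0 }, b = { 0 };
	int ret = 1;

	if (sm_map_page(&a, fail_first))
		return 1;		/* nothing mapped yet, plain return */

	if (sm_map_page(&b, fail_second))
		goto unmap_a;		/* only the first mapping exists */

	if (fail_restore)
		goto unmap_b;		/* both mappings must be released */

	ret = 0;
unmap_b:
	sm_unmap_page(&b);
unmap_a:
	sm_unmap_page(&a);
	return ret;
}
```

The flat structure makes it easy to see which resources are live at each failure point, which is harder to audit with nested ifs.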
    • M
      KVM: x86: SVM: call KVM_REQ_GET_NESTED_STATE_PAGES on exit from SMM mode · e85d3e7b
      Maxim Levitsky authored
      Currently, KVM_REQ_GET_NESTED_STATE_PAGES on SVM only reloads the PDPTRs
      and the MSR bitmap. The former is not really needed for SMM, as the SMM
      exit code reloads them again from SMRAM's CR3, and the latter happens to
      work since the MSR bitmap isn't modified while in SMM.

      Still, it is better to be consistent with VMX.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210913140954.165665-5-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e85d3e7b
    • M
      KVM: x86: reset pdptrs_from_userspace when exiting smm · 37687c40
      Maxim Levitsky authored
      When exiting SMM, the PDPTRs are loaded again from guest memory.

      This fixes a theoretical bug where the exit from SMM triggers entry to
      the nested guest, which re-uses some of the migration code that uses
      this flag as a workaround for legacy userspace.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210913140954.165665-4-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      37687c40
    • M
      KVM: x86: nSVM: restore the L1 host state prior to resuming nested guest on SMM exit · e2e6e449
      Maxim Levitsky authored
      Otherwise guest entry code might see incorrect L1 state (e.g. paging state).

      Fixes: 37be407b ("KVM: nSVM: Fix L1 state corruption upon return from SMM")
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210913140954.165665-3-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e2e6e449
    • V
      KVM: nVMX: Filter out all unsupported controls when eVMCS was activated · 8d68bad6
      Vitaly Kuznetsov authored
      Windows Server 2022 with Hyper-V role enabled failed to boot on KVM when
      enlightened VMCS is advertised. Debugging revealed there are two exposed
      secondary controls it is not happy with: SECONDARY_EXEC_ENABLE_VMFUNC and
      SECONDARY_EXEC_SHADOW_VMCS. These controls are known to be unsupported,
      as there are no corresponding fields in eVMCSv1 (see the comment above
      EVMCS1_UNSUPPORTED_2NDEXEC definition).
      
      Previously, commit 31de3d25 ("x86/kvm/hyper-v: move VMX controls
      sanitization out of nested_enable_evmcs()") introduced the required
      filtering mechanism for VMX MSRs, but for some reason it filtered only
      the controls known to be problematic rather than the full
      EVMCS1_UNSUPPORTED_* lists.

      Note, Windows Server 2022 seems to have gained some sanity check for VMX
      MSRs: it doesn't even try to launch a guest when there's something it
      doesn't like, so the nested_evmcs_check_controls() mechanism can't catch
      the problem.
      
      Let's be bold this time and instead of playing whack-a-mole just filter out
      all unsupported controls from VMX MSRs.
      
      Fixes: 31de3d25 ("x86/kvm/hyper-v: move VMX controls sanitization out of nested_enable_evmcs()")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210907163530.110066-1-vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8d68bad6
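Filtering "all unsupported controls from VMX MSRs", as 8d68bad6 describes, amounts to clearing bits in the allowed-1 half of a VMX capability MSR value. The sketch below assumes the usual layout of those MSRs (allowed-0 settings in the low 32 bits, allowed-1 settings in the high 32 bits). The `EV_*` names and bit positions are placeholders chosen for illustration, not the kernel's definitions.

```c
#include <stdint.h>

/* Placeholder bit positions for illustration; see the kernel headers for the real ones. */
#define EV_ENABLE_VMFUNC	(1u << 13)
#define EV_SHADOW_VMCS		(1u << 14)

/* Hypothetical "unsupported" list covering the two controls named in the commit. */
#define EV_UNSUPPORTED_2NDEXEC	(EV_ENABLE_VMFUNC | EV_SHADOW_VMCS)

/*
 * Clear every unsupported control from the allowed-1 (high) half of a
 * capability MSR value and leave the allowed-0 (low) half untouched.
 */
static uint64_t ev_filter_ctls_msr(uint64_t msr, uint32_t unsupported)
{
	uint32_t allowed0 = (uint32_t)msr;
	uint32_t allowed1 = (uint32_t)(msr >> 32);

	allowed1 &= ~unsupported;
	return allowed0 | ((uint64_t)allowed1 << 32);
}
```

Masking with the whole list, rather than with individually discovered bits, is what removes the need for the whack-a-mole approach the message mentions.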
    • S
      KVM: KVM: Use cpumask_available() to check for NULL cpumask when kicking vCPUs · 0bbc2ca8
      Sean Christopherson authored
      Check for a NULL cpumask_var_t when kicking multiple vCPUs via
      cpumask_available(), which performs a !NULL check if and only if cpumasks
      are configured to be allocated off-stack.  This is a meaningless
      optimization, e.g. avoids a TEST+Jcc and TEST+CMOV on x86, but more
      importantly helps document that the NULL check is necessary even though
      all callers pass in a local variable.
      
      No functional change intended.
      
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210827092516.1027264-3-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0bbc2ca8
    • S
      KVM: Clean up benign vcpu->cpu data races when kicking vCPUs · 85b64045
      Sean Christopherson authored
      Fix a benign data race reported by syzbot+KCSAN[*] by ensuring vcpu->cpu
      is read exactly once, and by ensuring the vCPU is booted from guest mode
      if kvm_arch_vcpu_should_kick() returns true.  Fix a similar race in
      kvm_make_vcpus_request_mask() by ensuring the vCPU is interrupted if
      kvm_request_needs_ipi() returns true.
      
      Reading vcpu->cpu before vcpu->mode (via kvm_arch_vcpu_should_kick() or
      kvm_request_needs_ipi()) means the target vCPU could get migrated (change
      vcpu->cpu) and enter !OUTSIDE_GUEST_MODE between reading vcpu->cpu and
      reading vcpu->mode.  If that happens, the kick/IPI will be sent to the
      old pCPU, not the new pCPU that is now running the vCPU or reading SPTEs.
      
      Although failing to kick the vCPU is not exactly ideal, practically
      speaking it cannot cause a functional issue unless there is also a bug in
      the caller, and any such bug would exist regardless of kvm_vcpu_kick()'s
      behavior.
      
      The purpose of sending an IPI is purely to get a vCPU into the host (or
      out of reading SPTEs) so that the vCPU can recognize a change in state,
      e.g. a KVM_REQ_* request.  If vCPU's handling of the state change is
      required for correctness, KVM must ensure either the vCPU sees the change
      before entering the guest, or that the sender sees the vCPU as running in
      guest mode.  All architectures handle this by (a) sending the request
      before calling kvm_vcpu_kick() and (b) checking for requests _after_
      setting vcpu->mode.
      
      x86's READING_SHADOW_PAGE_TABLES has similar requirements; KVM needs to
      ensure it kicks and waits for vCPUs that started reading SPTEs _before_
      MMU changes were finalized, but any vCPU that starts reading after MMU
      changes were finalized will see the new state and can continue on
      uninterrupted.
      
      For uses of kvm_vcpu_kick() that are not paired with a KVM_REQ_*, e.g.
      x86's kvm_arch_sync_dirty_log(), the order of the kick must not be relied
      upon for functional correctness, e.g. in the dirty log case, userspace
      cannot assume it has a 100% complete log if vCPUs are still running.
      
      All that said, eliminate the benign race since the cost of doing so is an
      "extra" atomic cmpxchg() in the case where the target vCPU is loaded by
      the current pCPU or is not loaded at all.  I.e. the kick will be skipped
      due to kvm_vcpu_exiting_guest_mode() seeing a compatible vcpu->mode as
      opposed to the kick being skipped because of the cpu checks.
      
      Keep the "cpu != me" checks even though they appear useless/impossible at
      first glance.  x86 processes guest IPI writes in a fast path that runs in
      IN_GUEST_MODE, i.e. can call kvm_vcpu_kick() from IN_GUEST_MODE.  And
      calling kvm_vm_bugged()->kvm_make_vcpus_request_mask() from IN_GUEST or
      READING_SHADOW_PAGE_TABLES is perfectly reasonable.
      
      Note, a race with the cpu_online() check in kvm_vcpu_kick() likely
      persists, e.g. the vCPU could exit guest mode and get offlined between
      the cpu_online() check and the sending of smp_send_reschedule().  But,
      the online check appears to exist only to avoid a WARN in x86's
      native_smp_send_reschedule() that fires if the target CPU is not online.
      The reschedule WARN exists because CPU offlining takes the CPU out of the
      scheduling pool, i.e. the WARN is intended to detect the case where the
      kernel attempts to schedule a task on an offline CPU.  The actual sending
      of the IPI is a non-issue as at worst it will simply be dropped on the
      floor.  In other words, KVM's usurping of the reschedule IPI could
      theoretically trigger a WARN if the stars align, but there will be no
      loss of functionality.
      
      [*] https://syzkaller.appspot.com/bug?extid=cd4154e502f43f10808a
      
      Cc: Venkatesh Srinivas <venkateshs@google.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Fixes: 97222cc8 ("KVM: Emulate local APIC in kernel")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210827092516.1027264-2-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      85b64045
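The ordering that 85b64045 argues for (decide from vcpu->mode first, then read vcpu->cpu exactly once) can be modelled in user space with C11 atomics. This is a rough sketch under stated assumptions: `kick_vcpu`, `kick_pick_target` and the `KICK_*` constants are invented for the example, and the compare-and-exchange stands in for the kernel's cmpxchg-based mode check. It omits the cpu_online() test and sends no IPI.

```c
#include <stdatomic.h>

enum { KICK_OUTSIDE_GUEST_MODE, KICK_IN_GUEST_MODE, KICK_EXITING_GUEST_MODE };

/* Hypothetical, minimal model of the two fields involved in the race. */
struct kick_vcpu {
	_Atomic int cpu;	/* pCPU the vCPU is loaded on, or -1 */
	_Atomic int mode;
};

/*
 * Returns the pCPU that should receive the kick, or -1 if none is needed.
 * The mode is examined first; the cpu field is then loaded exactly once
 * and only that local copy is used for the checks and the result.
 */
static int kick_pick_target(struct kick_vcpu *v, int me, int nr_cpus)
{
	int expected = KICK_IN_GUEST_MODE;
	int cpu;

	if (!atomic_compare_exchange_strong(&v->mode, &expected,
					    KICK_EXITING_GUEST_MODE))
		return -1;	/* vCPU was not in guest mode, no kick needed */

	cpu = atomic_load(&v->cpu);
	if (cpu != me && cpu >= 0 && cpu < nr_cpus)
		return cpu;
	return -1;
}
```

Because the decision and the returned target both come from the single local copy `cpu`, a concurrent migration cannot make the sanity checks and the IPI destination disagree.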
    • V
      KVM: x86: Fix stack-out-of-bounds memory access from ioapic_write_indirect() · 2f9b68f5
      Vitaly Kuznetsov authored
      KASAN reports the following issue:
      
       BUG: KASAN: stack-out-of-bounds in kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
       Read of size 8 at addr ffffc9001364f638 by task qemu-kvm/4798
      
       CPU: 0 PID: 4798 Comm: qemu-kvm Tainted: G               X --------- ---
       Hardware name: AMD Corporation DAYTONA_X/DAYTONA_X, BIOS RYM0081C 07/13/2020
       Call Trace:
        dump_stack+0xa5/0xe6
        print_address_description.constprop.0+0x18/0x130
        ? kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
        __kasan_report.cold+0x7f/0x114
        ? kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
        kasan_report+0x38/0x50
        kasan_check_range+0xf5/0x1d0
        kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
        kvm_make_scan_ioapic_request_mask+0x84/0xc0 [kvm]
        ? kvm_arch_exit+0x110/0x110 [kvm]
        ? sched_clock+0x5/0x10
        ioapic_write_indirect+0x59f/0x9e0 [kvm]
        ? static_obj+0xc0/0xc0
        ? __lock_acquired+0x1d2/0x8c0
        ? kvm_ioapic_eoi_inject_work+0x120/0x120 [kvm]
      
      The problem appears to be that 'vcpu_bitmap' is allocated as a single long
      on stack and it should really be KVM_MAX_VCPUS bits long. We also seem to
      clear the lower 16 bits of it with bitmap_zero() for no particular reason
      (my guess would be that 'bitmap' and 'vcpu_bitmap' variables in
      kvm_bitmap_or_dest_vcpus() caused the confusion: while the former is indeed
      16 bits long, the latter should accommodate all possible vCPUs).
      
      Fixes: 7ee30bc1 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs")
      Fixes: 9a2ae9f6 ("KVM: x86: Zero the IOAPIC scan request dest vCPUs bitmap")
      Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210827092516.1027264-7-vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2f9b68f5
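The sizing problem in 2f9b68f5 is easy to reproduce in miniature: a bitmap indexed by vCPU needs one bit per possible vCPU, which is usually more than a single long. A minimal sketch, assuming a made-up limit `BM_MAX_VCPUS` and hypothetical helpers (`bm_set_bit`, `bm_test_bit`, `bm_count_targets`) in place of the kernel's bitmap API:

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define BM_MAX_VCPUS		288	/* hypothetical vCPU limit for this sketch */
#define BM_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define BM_BITS_TO_LONGS(n)	(((n) + BM_BITS_PER_LONG - 1) / BM_BITS_PER_LONG)

static void bm_set_bit(unsigned long *map, size_t bit)
{
	map[bit / BM_BITS_PER_LONG] |= 1UL << (bit % BM_BITS_PER_LONG);
}

static int bm_test_bit(const unsigned long *map, size_t bit)
{
	return (map[bit / BM_BITS_PER_LONG] >> (bit % BM_BITS_PER_LONG)) & 1UL;
}

/* Marks each valid vCPU id in an on-stack bitmap and counts the distinct targets. */
static size_t bm_count_targets(const unsigned int *vcpu_ids, size_t n)
{
	/* One bit per possible vCPU, not a single long. */
	unsigned long vcpu_bitmap[BM_BITS_TO_LONGS(BM_MAX_VCPUS)];
	size_t i, count = 0;

	/* Zero the whole bitmap, not just its first few bits. */
	memset(vcpu_bitmap, 0, sizeof(vcpu_bitmap));

	for (i = 0; i < n; i++)
		if (vcpu_ids[i] < BM_MAX_VCPUS)
			bm_set_bit(vcpu_bitmap, vcpu_ids[i]);

	for (i = 0; i < BM_MAX_VCPUS; i++)
		count += (size_t)bm_test_bit(vcpu_bitmap, i);
	return count;
}
```

With a single `unsigned long`, any id at or above the word size would index past the end of the variable, which is the kind of stack-out-of-bounds access KASAN flagged.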
    • D
      KVM: selftests: Create a separate dirty bitmap per slot · 7c236b81
      David Matlack authored
      The calculation to get the per-slot dirty bitmap was incorrect, leading
      to a buffer overrun. Fix it by splitting out the dirty bitmap into a
      separate bitmap per slot.
      
      Fixes: 609e6202 ("KVM: selftests: Support multiple slots in dirty_log_perf_test")
      Signed-off-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Message-Id: <20210917173657.44011-4-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7c236b81
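Giving each slot its own bitmap, as 7c236b81 describes, removes the offset arithmetic that a single shared bitmap requires. The following sketch shows one way to lay that out; the `dl_*` helpers are hypothetical and use plain `calloc` rather than the selftest library's allocators.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define DL_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define DL_BITS_TO_LONGS(n)	(((n) + DL_BITS_PER_LONG - 1) / DL_BITS_PER_LONG)

static void dl_free_bitmaps(unsigned long **bitmaps, int slots)
{
	int i;

	if (!bitmaps)
		return;
	for (i = 0; i < slots; i++)
		free(bitmaps[i]);	/* free(NULL) is a no-op */
	free(bitmaps);
}

/* Allocates one zeroed bitmap per slot, each sized for pages_per_slot bits. */
static unsigned long **dl_alloc_bitmaps(int slots, size_t pages_per_slot)
{
	unsigned long **bitmaps;
	int i;

	bitmaps = calloc((size_t)slots, sizeof(*bitmaps));
	if (!bitmaps)
		return NULL;

	for (i = 0; i < slots; i++) {
		bitmaps[i] = calloc(DL_BITS_TO_LONGS(pages_per_slot),
				    sizeof(unsigned long));
		if (!bitmaps[i]) {
			dl_free_bitmaps(bitmaps, slots);
			return NULL;
		}
	}
	return bitmaps;
}

/* Page numbers are relative to the slot, so no cross-slot offset is computed. */
static void dl_mark_dirty(unsigned long **bitmaps, int slot, size_t page)
{
	bitmaps[slot][page / DL_BITS_PER_LONG] |= 1UL << (page % DL_BITS_PER_LONG);
}

static int dl_is_dirty(unsigned long **bitmaps, int slot, size_t page)
{
	return (bitmaps[slot][page / DL_BITS_PER_LONG] >> (page % DL_BITS_PER_LONG)) & 1UL;
}
```

Each bitmap is rounded up to whole longs independently, so a slot whose page count is not a multiple of the word size cannot spill into its neighbour.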
    • D
      KVM: selftests: Refactor help message for -s backing_src · 9f2fc555
      David Matlack authored
      All selftests that support the backing_src option were printing their
      own description of the flag and then calling backing_src_help() to dump
      the list of available backing sources. Consolidate the flag printing in
      backing_src_help() to align indentation, reduce duplicated strings, and
      improve consistency across tests.
      
      Note: Passing "-s" to backing_src_help is unnecessary since every test
      uses the same flag. However I decided to keep it for code readability
      at the call sites.
      
      While here, opportunistically fix the incorrectly interleaved printing
      of the -x help message and the list of backing source types in
      dirty_log_perf_test.
      
      Fixes: 609e6202 ("KVM: selftests: Support multiple slots in dirty_log_perf_test")
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20210917173657.44011-3-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9f2fc555
    • D
      KVM: selftests: Change backing_src flag to -s in demand_paging_test · a1e638da
      David Matlack authored
      Every other KVM selftest uses -s for the backing_src, so switch
      demand_paging_test to match.
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20210917173657.44011-2-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a1e638da
    • P
      KVM: SEV: Allow some commands for mirror VM · 5b92b6ca
      Peter Gonda authored
      A mirrored SEV-ES VM will need to call KVM_SEV_LAUNCH_UPDATE_VMSA to
      set up its vCPUs and have them measured, and their VMSAs encrypted. Without
      this change, it is impossible to have mirror VMs as part of SEV-ES VMs.

      Also allow the guest status check and debugging commands since they do
      not change any guest state.
      Signed-off-by: Peter Gonda <pgonda@google.com>
      Cc: Marc Orr <marcorr@google.com>
      Cc: Nathan Tempelman <natet@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Steve Rutherford <srutherford@google.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: 54526d1f ("KVM: x86: Support KVM VMs sharing SEV context", 2021-04-21)
      Message-Id: <20210921150345.2221634-3-pgonda@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5b92b6ca
    • P
      KVM: SEV: Update svm_vm_copy_asid_from for SEV-ES · f43c887c
      Peter Gonda authored
      For mirroring SEV-ES, the mirror VM will need more than just the ASID.
      The FD and the handle are required to allow the mirror to call PSP
      commands. The mirror VM will need to call KVM_SEV_LAUNCH_UPDATE_VMSA to
      set up its vCPUs' VMSAs for SEV-ES.
      Signed-off-by: Peter Gonda <pgonda@google.com>
      Cc: Marc Orr <marcorr@google.com>
      Cc: Nathan Tempelman <natet@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Steve Rutherford <srutherford@google.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: 54526d1f ("KVM: x86: Support KVM VMs sharing SEV context", 2021-04-21)
      Message-Id: <20210921150345.2221634-2-pgonda@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f43c887c
    • C
      KVM: nVMX: Fix nested bus lock VM exit · 24a996ad
      Chenyi Qiang authored
      Nested bus lock VM exits are not supported yet. If L2 triggers bus lock
      VM exit, it will be directed to L1 VMM, which would cause unexpected
      behavior. Therefore, handle L2's bus lock VM exits in L0 directly.
      
      Fixes: fe6b6bc8 ("KVM: VMX: Enable bus lock VM exit")
      Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Message-Id: <20210914095041.29764-1-chenyi.qiang@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      24a996ad
    • S
      KVM: x86: Identify vCPU0 by its vcpu_idx instead of its vCPUs array entry · 94c245a2
      Sean Christopherson authored
      Use vcpu_idx to identify vCPU0 when updating HyperV's TSC page, which is
      shared by all vCPUs and "owned" by vCPU0 (because vCPU0 is the only vCPU
      that's guaranteed to exist).  Using kvm_get_vcpu() to find vCPU0 works,
      but it's a rather odd and suboptimal method to check the index of a given
      vCPU.

      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210910183220.2397812-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      94c245a2
    • S
      KVM: x86: Query vcpu->vcpu_idx directly and drop its accessor · 4eeef242
      Sean Christopherson authored
      Read vcpu->vcpu_idx directly instead of bouncing through the one-line
      wrapper, kvm_vcpu_get_idx(), and drop the wrapper.  The wrapper is a
      remnant of the original implementation and serves no purpose; remove it
      before it gains more users.
      
      Back when kvm_vcpu_get_idx() was added by commit 497d72d8 ("KVM: Add
      kvm_vcpu_get_idx to get vcpu index in kvm->vcpus"), the implementation
      was more than just a simple wrapper as vcpu->vcpu_idx did not exist and
      retrieving the index meant walking over the vCPU array to find the given
      vCPU.
      
      When vcpu_idx was introduced by commit 8750e72a ("KVM: remember
      position in kvm->vcpus array"), the helper was left behind, likely to
      avoid extra thrash (but even then there were only two users, the original
      arm usage having been removed at some point in the past).
      
      No functional change intended.
      Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210910183220.2397812-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4eeef242
    • H
      kvm: fix wrong exception emulation in check_rdtsc · e9337c84
      Hou Wenlong authored
      According to Intel's SDM Vol2 and AMD's APM Vol3, when
      CR4.TSD is set, using the rdtsc/rdtscp instruction at a privilege
      level above 0 should trigger a #GP.

      Fixes: d7eb8203 ("KVM: SVM: Add intercept checks for remaining group7 instructions")
      Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
      Message-Id: <1297c0dd3f1bb47a6d089f850b629c7aa0247040.1629257115.git.houwenlong93@linux.alibaba.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e9337c84
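The rule stated in e9337c84 reduces to a two-input predicate: the instruction faults with #GP only when CR4.TSD is set and the current privilege level is above 0. A small sketch, with `rd_check_rdtsc` and the `RD_*` constants being names invented for this example rather than the emulator's own:

```c
#include <stdint.h>

#define RD_CR4_TSD	(1u << 2)	/* CR4.TSD (time stamp disable) is bit 2 */

enum rd_result {
	RD_NO_FAULT = -1,
	RD_GP_VECTOR = 13,		/* #GP is exception vector 13 */
};

/* Returns the exception to inject for RDTSC/RDTSCP, or RD_NO_FAULT. */
static int rd_check_rdtsc(uint64_t cr4, unsigned int cpl)
{
	if ((cr4 & RD_CR4_TSD) && cpl > 0)
		return RD_GP_VECTOR;
	return RD_NO_FAULT;
}
```

For example, `rd_check_rdtsc(RD_CR4_TSD, 3)` yields `RD_GP_VECTOR`, while the same call with a CPL of 0 yields `RD_NO_FAULT`.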
    • S
      KVM: SEV: Pin guest memory for write for RECEIVE_UPDATE_DATA · 50c03801
      Sean Christopherson authored
      Require the target guest page to be writable when pinning memory for
      RECEIVE_UPDATE_DATA.  Per the SEV API, the PSP writes to guest memory:
      
        The result is then encrypted with GCTX.VEK and written to the memory
        pointed to by GUEST_PADDR field.
      
      Fixes: 15fb7de1 ("KVM: SVM: Add KVM_SEV_RECEIVE_UPDATE_DATA command")
      Cc: stable@vger.kernel.org
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Marc Orr <marcorr@google.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210914210951.2994260-2-seanjc@google.com>
      Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
      Reviewed-by: Peter Gonda <pgonda@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      50c03801
    • M
      KVM: SVM: fix missing sev_decommission in sev_receive_start · f1815e0a
      Mingwei Zhang authored
      DECOMMISSION the current SEV context if binding an ASID fails after
      RECEIVE_START.  Per AMD's SEV API, RECEIVE_START generates a new guest
      context and thus needs to be paired with DECOMMISSION:
      
           The RECEIVE_START command is the only command other than the LAUNCH_START
           command that generates a new guest context and guest handle.
      
      The missing DECOMMISSION can result in subsequent SEV launch failures,
      as the firmware leaks memory and might not be able to allocate more SEV
      guest contexts in the future.
      
      Note, LAUNCH_START suffered the same bug, but was previously fixed by
      commit 934002cd ("KVM: SVM: Call SEV Guest Decommission if ASID
      binding fails").
      
      Cc: Alper Gun <alpergun@google.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Marc Orr <marcorr@google.com>
      Cc: John Allen <john.allen@amd.com>
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vipin Sharma <vipinsh@google.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Marc Orr <marcorr@google.com>
      Acked-by: Brijesh Singh <brijesh.singh@amd.com>
      Fixes: af43cbbf ("KVM: SVM: Add support for KVM_SEV_RECEIVE_START command")
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210912181815.3899316-1-mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f1815e0a
    • P
      KVM: SEV: Acquire vcpu mutex when updating VMSA · bb18a677
      Peter Gonda authored
      The update-VMSA ioctl touches data stored in struct kvm_vcpu, and
      therefore should not be performed concurrently with any VCPU ioctl
      that might cause KVM or the processor to use the same data.
      
      Add a vcpu mutex guard to the VMSA updating code. Refactor out the
      __sev_launch_update_vmsa() function to deal with the per-vCPU parts
      of sev_launch_update_vmsa().
      
      Fixes: ad73109a ("KVM: SVM: Provide support to launch and run an SEV-ES guest")
      Signed-off-by: Peter Gonda <pgonda@google.com>
      Cc: Marc Orr <marcorr@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: kvm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Message-Id: <20210915171755.3773766-1-pgonda@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bb18a677
    • S
      KVM: do not shrink halt_poll_ns below grow_start · ae232ea4
      Sergey Senozhatsky authored
      grow_halt_poll_ns() ignores values between 0 and
      halt_poll_ns_grow_start (10000 by default). However,
      when we shrink halt_poll_ns we may fall way below
      halt_poll_ns_grow_start and end up with halt_poll_ns
      values that don't make a lot of sense, like 1, 9,
      or 19.
      
      VCPU1 trace (halt_poll_ns_shrink equals 2):
      
      VCPU1 grow 10000
      VCPU1 shrink 5000
      VCPU1 shrink 2500
      VCPU1 shrink 1250
      VCPU1 shrink 625
      VCPU1 shrink 312
      VCPU1 shrink 156
      VCPU1 shrink 78
      VCPU1 shrink 39
      VCPU1 shrink 19
      VCPU1 shrink 9
      VCPU1 shrink 4
      
      Mirror what grow_halt_poll_ns() does and set halt_poll_ns
      to 0 as soon as the newly shrunk halt_poll_ns value falls
      below halt_poll_ns_grow_start.
      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210902031100.252080-1-senozhatsky@chromium.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ae232ea4
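The behaviour change in ae232ea4 can be checked with plain integer arithmetic. The sketch below models the shrink step before and after the fix as pure functions. The names `hp_shrink_old` and `hp_shrink_new` are invented here, and treating a shrink factor of 0 as "reset to 0" is an assumption of this sketch that the commit message does not state.

```c
/* Old behaviour: divide only, so the value can decay to 19, 9, 4, ... */
static unsigned int hp_shrink_old(unsigned int val, unsigned int shrink)
{
	if (shrink == 0)
		return 0;	/* assumed reset semantics, see the note above */
	return val / shrink;
}

/* New behaviour: anything that lands below grow_start snaps to 0. */
static unsigned int hp_shrink_new(unsigned int val, unsigned int shrink,
				  unsigned int grow_start)
{
	if (shrink == 0)
		val = 0;	/* assumed reset semantics, see the note above */
	else
		val /= shrink;

	if (val < grow_start)
		val = 0;
	return val;
}
```

Starting from 10000 with a shrink factor of 2, eleven applications of the old step reproduce the trace in the commit message and end at 4, whereas a single application of the new step already yields 0 because 5000 is below the default grow start of 10000.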
    • Y
      KVM: nVMX: fix comments of handle_vmon() · ed7023a1
      Yu Zhang authored
      "VMXON pointer" is saved in vmx->nested.vmxon_ptr since
      commit 3573e22c ("KVM: nVMX: additional checks on
      vmxon region"). Also, handle_vmptrld() & handle_vmclear()
      now have logic to check the VMCS pointer against the VMXON
      pointer.
      
      So just remove the obsolete comments of handle_vmon().
      Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
      Message-Id: <20210908171731.18885-1-yu.c.zhang@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ed7023a1
    • H
      KVM: x86: Handle SRCU initialization failure during page track init · eb7511bf
      Haimin Zhang authored
      Check the return of init_srcu_struct(), which can fail due to OOM, when
      initializing the page track mechanism.  Lack of checking leads to a NULL
      pointer deref found by a modified syzkaller.
      Reported-by: TCS Robot <tcs_robot@tencent.com>
      Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com>
      Message-Id: <1630636626-12262-1-git-send-email-tcs_kernel@tencent.com>
      [Move the call towards the beginning of kvm_arch_init_vm. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      eb7511bf
    • S
      KVM: VMX: Remove defunct "nr_active_uret_msrs" field · cd36ae87
      Sean Christopherson authored
      Remove vcpu_vmx.nr_active_uret_msrs and its associated comment, which are
      both defunct now that KVM keeps the list constant and instead explicitly
      tracks which entries need to be loaded into hardware.
      
      No functional change intended.
      
      Fixes: ee9d22e0 ("KVM: VMX: Use flag to indicate "active" uret MSRs instead of sorting list")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210908002401.1947049-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cd36ae87
    • O
      selftests: KVM: Align SMCCC call with the spec in steal_time · 01f91acb
      Oliver Upton authored
      The SMC64 calling convention passes a function identifier in w0 and its
      parameters in x1-x17. Given this, there are two deviations in the
      SMC64 call performed by the steal_time test: the function identifier is
      assigned to a 64 bit register and the parameter is only 32 bits wide.
      
      Align the call with the SMCCC by using a 32 bit register to handle the
      function identifier and increasing the parameter width to 64 bits.
      Suggested-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Oliver Upton <oupton@google.com>
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Message-Id: <20210921171121.2148982-3-oupton@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      01f91acb
    • O
      selftests: KVM: Fix check for !POLLIN in demand_paging_test · 90b54129
      Oliver Upton authored
      The logical not operator applies only to the left-hand side of a bitwise
      operator. As such, the check for POLLIN not being set in revents is wrong.
      Fix it by adding parentheses around the bitwise expression.
      
      Fixes: 4f72180e ("KVM: selftests: Add demand paging content to the demand paging test")
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Signed-off-by: Oliver Upton <oupton@google.com>
      Message-Id: <20210921171121.2148982-2-oupton@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      90b54129
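The precedence issue in 90b54129 is quick to demonstrate. In C, `!revents & POLLIN` parses as `(!revents) & POLLIN`, so for any non-zero `revents` the result is 0 whether or not POLLIN is set. The two helpers below are hypothetical names written for this illustration. The buggy form is spelled with explicit parentheses so that it shows how the compiler groups the expression without tripping compiler warnings.

```c
#include <poll.h>

/* How the unparenthesised check is actually parsed: negate first, then mask. */
static int pl_pollin_missing_buggy(short revents)
{
	return (!revents) & POLLIN;
}

/* Intended check: mask first, then negate. */
static int pl_pollin_missing_fixed(short revents)
{
	return !(revents & POLLIN);
}
```

With `revents == POLLERR`, the fixed helper reports that POLLIN is missing (returns 1), while the buggy form returns 0 and lets the error go unnoticed.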
    • S
      KVM: x86: Clear KVM's cached guest CR3 at RESET/INIT · 03a6e840
      Sean Christopherson authored
      Explicitly zero the guest's CR3 and mark it available+dirty at RESET/INIT.
      Per Intel's SDM and AMD's APM, CR3 is zeroed at both RESET and INIT.  For
      RESET, this is a nop as vcpu is zero-allocated.  For INIT, the bug has
      likely escaped notice because no firmware/kernel puts its page tables root
      at PA=0, let alone relies on INIT to get the desired CR3 for such page
      tables.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210921000303.400537-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      03a6e840
    • S
      KVM: x86: Mark all registers as avail/dirty at vCPU creation · 7117003f
      Sean Christopherson authored
      Mark all registers as available and dirty at vCPU creation, as the vCPU has
      obviously not been loaded into hardware, let alone been given the chance to
      be modified in hardware.  On SVM, reading from "uninitialized" hardware is
      a non-issue as VMCBs are zero allocated (thus not truly uninitialized) and
      hardware does not allow for arbitrary field encoding schemes.
      
      On VMX, backing memory for VMCSes is also zero allocated, but true
      initialization of the VMCS _technically_ requires VMWRITEs, as the VMX
      architectural specification technically allows CPU implementations to
      encode fields with arbitrary schemes.  E.g. a CPU could theoretically store
      the inverted value of every field, which would result in a VMREAD of a
      zero-allocated field returning all ones.

      In practice, only the AR_BYTES fields are known to be manipulated by
      hardware during VMREAD/VMWRITE; no known hardware or VMM (for nested VMX)
      does fancy encoding of cacheable field values (CR0, CR3, CR4, etc...).  In
      other words, this is technically a bug fix, but practically speaking it's
      a glorified nop.
      
      Failure to mark registers as available has been a lurking bug for quite
      some time.  The original register caching supported only GPRs (+RIP, which
      is kinda sorta a GPR), with the masks initialized at ->vcpu_reset().  That
      worked because the two cacheable registers, RIP and RSP, are generally
      speaking not read as side effects in other flows.
      
      Arguably, commit aff48baa ("KVM: Fetch guest cr3 from hardware on
      demand") was the first instance of failure to mark regs available.  While
      _just_ marking CR3 available during vCPU creation wouldn't have fixed the
      VMREAD from an uninitialized VMCS bug because ept_update_paging_mode_cr0()
      unconditionally read vmcs.GUEST_CR3, marking CR3 _and_ intentionally not
      reading GUEST_CR3 when it's available would have avoided VMREAD to a
      technically-uninitialized VMCS.
      
      Fixes: aff48baa ("KVM: Fetch guest cr3 from hardware on demand")
      Fixes: 6de4f3ad ("KVM: Cache pdptrs")
      Fixes: 6de12732 ("KVM: VMX: Optimize vmx_get_rflags()")
      Fixes: 2fb92db1 ("KVM: VMX: Cache vmcs segment fields")
      Fixes: bd31fe49 ("KVM: VMX: Add proper cache tracking for CR0")
      Fixes: f98c1e77 ("KVM: VMX: Add proper cache tracking for CR4")
      Fixes: 5addc235 ("KVM: VMX: Cache vmcs.EXIT_QUALIFICATION using arch avail_reg flags")
      Fixes: 87915858 ("KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210921000303.400537-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7117003f
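The available/dirty bookkeeping described in the commit above can be illustrated with a toy register cache. This is only a sketch under stated assumptions: `toy_vcpu`, `toy_reg` and every helper below are hypothetical names, not KVM's actual structures or functions, and the "hardware" array is pre-filled with all ones to stand in for a CPU that encodes zero-allocated fields in an arbitrary way.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical register indices; not KVM's actual enum. */
enum toy_reg { TOY_REG_RSP, TOY_REG_RIP, TOY_REG_CR3, TOY_NR_REGS };

#define TOY_ALL_REGS ((1u << TOY_NR_REGS) - 1)

struct toy_vcpu {
    uint64_t regs[TOY_NR_REGS]; /* software cache of register values */
    uint64_t hw[TOY_NR_REGS];   /* stand-in for the VMCS/VMCB fields */
    uint32_t regs_avail;        /* bit set: cached value is valid */
    uint32_t regs_dirty;        /* bit set: cache is newer than hardware */
    unsigned int hw_reads;      /* counts reads that reached "hardware" */
};

static void toy_vcpu_create(struct toy_vcpu *v)
{
    memset(v, 0, sizeof(*v));
    /* Assumption: model "uninitialized" hardware as reading back all ones. */
    for (int r = 0; r < TOY_NR_REGS; r++)
        v->hw[r] = ~UINT64_C(0);
    /* The vCPU has never been loaded, so the cache is authoritative. */
    v->regs_avail = TOY_ALL_REGS;
    v->regs_dirty = TOY_ALL_REGS;
}

static uint64_t toy_read_reg(struct toy_vcpu *v, enum toy_reg r)
{
    if (!(v->regs_avail & (1u << r))) {
        /* Cache miss: fetch from hardware and remember the value. */
        v->regs[r] = v->hw[r];
        v->hw_reads++;
        v->regs_avail |= 1u << r;
    }
    return v->regs[r];
}

static void toy_write_reg(struct toy_vcpu *v, enum toy_reg r, uint64_t val)
{
    v->regs[r] = val;
    v->regs_avail |= 1u << r;
    v->regs_dirty |= 1u << r;
}

/* Before entering the guest: flush dirty cached values to hardware. */
static void toy_sync_to_hw(struct toy_vcpu *v)
{
    for (int r = 0; r < TOY_NR_REGS; r++)
        if (v->regs_dirty & (1u << r))
            v->hw[r] = v->regs[r];
    v->regs_dirty = 0;
}

/* After a guest exit: hardware may have changed, so drop the cache. */
static void toy_after_exit(struct toy_vcpu *v)
{
    v->regs_avail = 0;
}
```

In this model, a read of `TOY_REG_CR3` straight after `toy_vcpu_create()` is served from the zeroed cache and never touches the all-ones "hardware" value, and the first `toy_sync_to_hw()` writes every register because all of them start out dirty.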
    • S
      KVM: selftests: Remove __NR_userfaultfd syscall fallback · 2da4a235
      Sean Christopherson authored
      Revert the __NR_userfaultfd syscall fallback added for KVM selftests now
      that x86's unistd_{32,64}.h overrides are under uapi/ and thus not in
      KVM selftests' search path, i.e. now that KVM gets x86 syscall numbers
      from the installed kernel headers.
      
      No functional change intended.
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210901203030.1292304-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2da4a235
    • S
      KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs · 61e52f16
      Sean Christopherson authored
      Add a test to verify an rseq's CPU ID is updated correctly if the task is
      migrated while the kernel is handling KVM_RUN.  This is a regression test
      for a bug introduced by commit 72c3c0fe ("x86/kvm: Use generic xfer
      to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
      without updating rseq, leading to a stale CPU ID and other badness.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Message-Id: <20210901203030.1292304-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      61e52f16
    • S
      tools: Move x86 syscall number fallbacks to .../uapi/ · de5f4213
      Sean Christopherson authored
      Move unistd_{32,64}.h from x86/include/asm to x86/include/uapi/asm so
      that tools/selftests that install kernel headers, e.g. KVM selftests, can
      include non-uapi tools headers, e.g. to get 'struct list_head', without
      effectively overriding the installed non-tool uapi headers.
      
      Swapping KVM's search order, e.g. to search the kernel headers before
      tool headers, is not a viable option as doing so results in linux/types.h and
      other core headers getting pulled from the kernel headers, which do not
      have the kernel-internal typedefs that are used throughout tools, including
      many files outside of selftests/kvm's control.
      
      Prior to commit cec07f53 ("perf tools: Move syscall number fallbacks
      from perf-sys.h to tools/arch/x86/include/asm/"), the handcoded numbers
      were actual fallbacks, i.e. overriding unistd_{32,64}.h from the kernel
      headers was unintentional.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210901203030.1292304-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      de5f4213
    • S
      entry: rseq: Call rseq_handle_notify_resume() in tracehook_notify_resume() · a68de80f
      Sean Christopherson authored
      Invoke rseq_handle_notify_resume() from tracehook_notify_resume() now
      that the two functions are always called back-to-back by architectures
      that have rseq.  The rseq helper is stubbed out for architectures that
      don't support rseq, i.e. this is a nop across the board.
      
      Note, tracehook_notify_resume() is horribly named and arguably does not
      belong in tracehook.h as literally every line of code in it has nothing
      to do with tracing.  But, that's been true since commit a42c6ded
      ("move key_repace_session_keyring() into tracehook_notify_resume()")
      first usurped tracehook_notify_resume() back in 2012.  Punt cleaning that
      mess up to future patches.
      
      No functional change intended.
      Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210901203030.1292304-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a68de80f
    • S
      KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest · 8646e536
      Sean Christopherson authored
      Invoke rseq's NOTIFY_RESUME handler when processing the flag prior to
      transferring to a KVM guest, which is roughly equivalent to an exit to
      userspace and processes many of the same pending actions.  While the task
      cannot be in an rseq critical section as the KVM path is reachable only
      via ioctl(KVM_RUN), the side effects that apply to rseq outside of a
      critical section still apply, e.g. the current CPU needs to be updated if
      the task is migrated.
      
      Clearing TIF_NOTIFY_RESUME without informing rseq can lead to segfaults
      and other badness in userspace VMMs that use rseq in combination with KVM,
      e.g. due to the CPU ID being stale after task migration.
      
      Fixes: 72c3c0fe ("x86/kvm: Use generic xfer to guest work function")
      Reported-by: Peter Foley <pefoley@google.com>
      Bisected-by: Doug Evans <dje@google.com>
      Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210901203030.1292304-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8646e536
  2. 20 Sep, 2021 4 commits
    • L
      Linux 5.15-rc2 · e4e737bb
      Linus Torvalds authored
      e4e737bb
    • L
      pci_iounmap'2: Electric Boogaloo: try to make sense of it all · 316e8d79
      Linus Torvalds authored
      Nathan Chancellor reports that the recent change to pci_iounmap in
      commit 9caea000 ("parisc: Declare pci_iounmap() parisc version only
      when CONFIG_PCI enabled") causes build errors on arm64.
      
      It took me about two hours to convince myself that I think I know what
      the logic of that mess of #ifdef's in the <asm-generic/io.h> header file
      really aims to do, and rewrite it to be easier to follow.
      
      Famous last words.
      
      Anyway, the code has now been lifted from that grotty header file into
      lib/pci_iomap.c, and has fairly extensive comments about what the logic
      is.  It also avoids indirecting through another confusing (and badly
      named) helper function that has other preprocessor config conditionals.
      
      Let's see what odd architecture did something else strange in this area
      to break things.  But my arm64 cross build is clean.
      
      Fixes: 9caea000 ("parisc: Declare pci_iounmap() parisc version only when CONFIG_PCI enabled")
      Reported-by: Nathan Chancellor <nathan@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Ulrich Teichert <krypton@ulrich-teichert.org>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      316e8d79
    • L
      Merge tag 'x86_urgent_for_v5.15_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 20621d2f
      Linus Torvalds authored
      Pull x86 fixes from Borislav Petkov:
      
       - Prevent an infinite loop in the MCE recovery on return to user space,
         which was caused by a second MCE queueing work for the same page and
         thereby creating a circular work list.
      
       - Make kern_addr_valid() handle existing PMD entries, which are marked
         not present in the higher level page table, correctly instead of
         blindly dereferencing them.
      
       - Pass a valid address to sanitize_phys(). This was caused by the
         mixture of inclusive and exclusive ranges. memtype_reserve() expects
         'end' to be exclusive, but sanitize_phys() wants it inclusive. This
         worked so far, but with end being the end of the physical address
         space the failure is exposed.
      
       - Increase the maximum supported GPIO numbers for 64bit. Newer SoCs
         exceed the previous maximum.
      
      * tag 'x86_urgent_for_v5.15_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/mce: Avoid infinite loop for copy from user recovery
        x86/mm: Fix kern_addr_valid() to cope with existing but not present entries
        x86/platform: Increase maximum GPIO number for X86_64
        x86/pat: Pass valid address to sanitize_phys()
      20621d2f
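The inclusive/exclusive mismatch mentioned in the sanitize_phys() item above can be shown with a toy model. This is a sketch only: the 8-bit address space and the `toy_*` helpers are hypothetical and are not the kernel's sanitize_phys() or memtype_reserve().

```c
#include <stdint.h>

/* Assumption: a tiny 8-bit physical address space, for illustration. */
#define TOY_PHYS_BITS 8
#define TOY_PHYS_MASK ((UINT64_C(1) << TOY_PHYS_BITS) - 1)

/* Expects a valid (inclusive) address and masks off unsupported bits. */
static uint64_t toy_sanitize(uint64_t addr)
{
    return addr & TOY_PHYS_MASK;
}

/*
 * Problematic caller: 'end' is exclusive, so one past the last valid
 * address is not itself a valid address and gets masked down to 0.
 */
static uint64_t toy_end_unadjusted(uint64_t end)
{
    return toy_sanitize(end);
}

/* Adjusted caller: sanitize the last valid byte, then convert back. */
static uint64_t toy_end_adjusted(uint64_t end)
{
    return toy_sanitize(end - 1) + 1;
}
```

For a range that stops in the middle of the address space both helpers agree, which is consistent with the problem staying hidden until a range ended exactly at the top of the physical address space.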
    • L
      Merge tag 'perf-urgent-2021-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · fec30362
      Linus Torvalds authored
      Pull perf event fix from Thomas Gleixner:
       "A single fix for the perf core where a value read with READ_ONCE() was
        checked and then reread which makes all the checks invalid. Reuse the
        already read value instead"
      
      * tag 'perf-urgent-2021-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        events: Reuse value read using READ_ONCE instead of re-reading it
      fec30362
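The check-then-reread pattern described in the perf fix above can be sketched without real concurrency. Everything here is a hypothetical model, not the perf core code: `toy_src` simulates a shared pointer that another CPU changes between two loads, and `toy_load()` stands in for a READ_ONCE()-style single load.

```c
#include <stddef.h>

/* Simulated shared location: the first load sees 'first', later loads 'later'. */
struct toy_src {
    int *first;
    int *later;
    unsigned int loads; /* number of loads performed so far */
};

/* Stand-in for one READ_ONCE()-style load of the shared pointer. */
static int *toy_load(struct toy_src *s)
{
    return s->loads++ == 0 ? s->first : s->later;
}

/* Problematic: validates one load, then uses the result of a second load. */
static int toy_get_reread(struct toy_src *s, int fallback)
{
    if (toy_load(s) == NULL)
        return fallback;
    int *p = toy_load(s); /* may differ from the value that was checked */
    return p ? *p : -1;   /* -1 marks where real code would use a bad pointer */
}

/* Preferred: load once, then check and use that same value. */
static int toy_get_reuse(struct toy_src *s, int fallback)
{
    int *p = toy_load(s);
    if (p == NULL)
        return fallback;
    return *p;
}
```

With a source whose pointer becomes NULL after the first load, the re-reading variant passes its check and then sees the changed value, while the reusing variant performs a single load and works on exactly what it validated.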