1. 11 Feb 2022, 6 commits
    • KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG · cb00a70b
      Authored by David Matlack
      When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not
      write-protected when dirty logging is enabled on the memslot. Instead
      they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for
      the first time and only for the specific sub-region being cleared.
      
      Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to
      write-protecting to avoid causing write-protection faults on vCPU
      threads. This also allows userspace to smear the cost of huge page
      splitting across multiple ioctls, rather than splitting the entire
      memslot as is the case when initially-all-set is not used.
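
      For illustration, the userspace flow that exercises this path looks roughly as
      follows (a fragment using the standard KVM uapi; vm_fd, the memslot number, the
      page range and the bitmap are placeholders, and error handling is minimal):

        struct kvm_clear_dirty_log clear = {
                .slot = 0,                      /* memslot to clear (placeholder) */
                .first_page = 0,                /* start of the sub-region, in pages */
                .num_pages = 512 * 512,         /* e.g. one 1GiB chunk of the slot */
                .dirty_bitmap = bitmap,         /* pages to clear, from KVM_GET_DIRTY_LOG */
        };

        /* Requires KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 with KVM_DIRTY_LOG_INITIALLY_SET. */
        if (ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear) < 0)
                err(1, "KVM_CLEAR_DIRTY_LOG");

      With this patch, huge pages covering the cleared range are split before being
      write-protected, so the cost is paid on the ioctl thread rather than in vCPU
      write-protection faults.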
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-17-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled · a3fe5dbd
      Authored by David Matlack
      When dirty logging is enabled without initially-all-set, try to split
      all huge pages in the memslot down to 4KB pages so that vCPUs do not
      have to take expensive write-protection faults to split huge pages.
      
      Eager page splitting is best-effort only. This commit only adds the
      support for the TDP MMU, and even there splitting may fail due to out
      of memory conditions. Failure to split a huge page is fine from a
      correctness standpoint because KVM will always follow up splitting by
      write-protecting any remaining huge pages.
      
      Eager page splitting moves the cost of splitting huge pages off of the
      vCPU threads and onto the thread enabling dirty logging on the memslot.
      This is useful because:
      
       1. Splitting on the vCPU thread interrupts vCPU execution and is
          disruptive to customers whereas splitting on VM ioctl threads can
          run in parallel with vCPU execution.
      
       2. Splitting all huge pages at once is more efficient because it does
          not require performing VM-exit handling or walking the page table for
          every 4KiB page in the memslot, and greatly reduces the amount of
          contention on the mmu_lock.
      
      For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
      per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
      all of their memory after dirty logging is enabled decreased by 95% from
      2.94s to 0.14s.
      
      Eager Page Splitting is over 100x more efficient than the current
      implementation of splitting on fault under the read lock. For example,
      taking the same workload as above, Eager Page Splitting reduced the CPU
      required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
      * 96 vCPU threads) to only 1.55 CPU-seconds.
      
      Eager page splitting does increase the amount of time it takes to enable
      dirty logging since it has to split all huge pages. For example, the time
      it took to enable dirty logging in the 96GiB region of the
      aforementioned test increased from 0.001s to 1.55s.
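
      For reference, eager splitting is triggered from the same ioctl that turns on
      dirty logging for the memslot; a rough userspace-side sketch (vm_fd, the slot
      number, sizes and addresses are placeholders):

        struct kvm_userspace_memory_region region = {
                .slot = 0,                          /* placeholder memslot id */
                .flags = KVM_MEM_LOG_DIRTY_PAGES,   /* enabling this flag kicks off eager splitting */
                .guest_phys_addr = 0x100000000ull,
                .memory_size = 96ull << 30,         /* 96 GiB, as in the test above */
                .userspace_addr = (__u64)host_mem,  /* HugeTLB-backed allocation */
        };

        /* The splitting cost is paid here, on the ioctl thread, not on the vCPUs. */
        if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region) < 0)
                err(1, "KVM_SET_USER_MEMORY_REGION");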
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Use more verbose names for mem encrypt kvm_x86_ops hooks · 03d004cd
      Authored by Sean Christopherson
      Use slightly more verbose names for the so called "memory encrypt",
      a.k.a. "mem enc", kvm_x86_ops hooks to bridge the gap between the current
      super short kvm_x86_ops names and SVM's more verbose, but non-conforming
      names.  This is a step toward using kvm-x86-ops.h with KVM_X86_CVM_OP()
      to fill svm_x86_ops.
      
      Opportunistically rename mem_enc_op() to mem_enc_ioctl() to better
      reflect its true nature, as it really is a full fledged ioctl() of its
      own.  Ideally, the hook would be named confidential_vm_ioctl() or so, as
      the ioctl() is a gateway to more than just memory encryption, and because
      its underlying purpose is to support Confidential VMs, which can be provided
      without memory encryption, e.g. if the TCB of the guest includes the host
      kernel but not host userspace, or by isolation in hardware without
      encrypting memory.  But, diverging from KVM_MEMORY_ENCRYPT_OP even
      further is undesirable, and short of creating aliases for all related
      ioctl()s, which introduces a different flavor of divergence, KVM is stuck
      with the nomenclature.
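
      The renamed hooks end up looking roughly like this in struct kvm_x86_ops (a
      sketch of the declarations; see arch/x86/include/asm/kvm_host.h for the
      authoritative list):

        int (*mem_enc_ioctl)(struct kvm *kvm, void __user *argp);
        int (*mem_enc_register_region)(struct kvm *kvm, struct kvm_enc_region *argp);
        int (*mem_enc_unregister_region)(struct kvm *kvm, struct kvm_enc_region *argp);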
      
      Defer renaming SVM's functions to a future commit as there are additional
      changes needed to make SVM fully conforming and to match reality (looking
      at you, svm_vm_copy_asid_from()).
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220128005208.4008533-20-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Move get_cs_db_l_bits() helper to SVM · 872e0c53
      Authored by Sean Christopherson
      Move kvm_get_cs_db_l_bits() to SVM and rename it appropriately so that
      its svm_x86_ops entry can be filled via kvm-x86-ops, and to eliminate a
      superfluous export from KVM x86.
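
      For context, the helper itself is tiny; its SVM-local form after the move looks
      roughly like this (a sketch, not the verbatim diff):

        static void svm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
        {
                struct kvm_segment cs;

                kvm_get_segment(vcpu, &cs, VCPU_SREG_CS);
                *db = cs.db;
                *l = cs.l;
        }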
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220128005208.4008533-16-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Rename kvm_x86_ops pointers to align w/ preferred vendor names · e27bc044
      Authored by Sean Christopherson
      Rename a variety of kvm_x86_op function pointers so that preferred name
      for vendor implementations follows the pattern <vendor>_<function>, e.g.
      rename .run() to .vcpu_run() to match {svm,vmx}_vcpu_run().  This will
      allow vendor implementations to be wired up via the KVM_X86_OP macro.
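
      As a rough illustration of the pattern, the vendor ops tables end up with
      entries of the form (sketch; the hooks shown are examples, not the full list):

        /* vmx.c */
        .vcpu_run = vmx_vcpu_run,          /* hook was previously named .run */
        .inject_irq = vmx_inject_irq,      /* previously .set_irq */
        .inject_nmi = vmx_inject_nmi,      /* previously .set_nmi */

        /* svm.c */
        .vcpu_run = svm_vcpu_run,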
      
      In many cases, VMX and SVM "disagree" on the preferred name, though in
      reality it's VMX and x86 that disagree as SVM blindly prepended _svm to
      the kvm_x86_ops name.  Justification for using the VMX nomenclature:
      
        - set_{irq,nmi} => inject_{irq,nmi} because the helper is injecting an
          event that has already been "set" in e.g. the vIRR.  SVM's relevant
          VMCB field is even named event_inj, and KVM's stat is irq_injections.
      
        - prepare_guest_switch => prepare_switch_to_guest because the former is
          ambiguous, e.g. it could mean switching between multiple guests,
          switching from the guest to host, etc...
      
        - update_pi_irte => pi_update_irte to allow for matching the rest
          of VMX's posted interrupt naming scheme, which is vmx_pi_<blah>().
      
        - start_assignment => pi_start_assignment to again follow VMX's posted
          interrupt naming scheme, and to provide context for what bit of code
          might care about an otherwise undescribed "assignment".
      
      The "tlb_flush" => "flush_tlb" creates an inconsistency with respect to
      Hyper-V's "tlb_remote_flush" hooks, but Hyper-V really is the one that's
      wrong.  x86, VMX, and SVM all use flush_tlb, and even common KVM is on a
      variant of the bandwagon with "kvm_flush_remote_tlbs", e.g. a more
      appropriate name for the Hyper-V hooks would be flush_remote_tlbs.  Leave
      that change for another time as the Hyper-V hooks always start as NULL,
      i.e. the name doesn't matter for using kvm-x86-ops.h, and changing all
      names requires an astounding amount of churn.
      
      VMX and SVM function names are intentionally left as is to minimize the
      diff.  Both VMX and SVM will need to rename even more functions in order
      to fully utilize KVM_X86_OPS, i.e. an additional patch for each is
      inevitable.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220128005208.4008533-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Remove unused "vcpu" of kvm_scale_tsc() · 62711e5a
      Authored by Jinrong Liang
      The "struct kvm_vcpu *vcpu" parameter of kvm_scale_tsc() is not used,
      so remove it. No functional change intended.
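
      The resulting signature change is simply (sketch):

        /* before */
        u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc, u64 ratio);

        /* after */
        u64 kvm_scale_tsc(u64 tsc, u64 ratio);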
      Signed-off-by: Jinrong Liang <cloudliang@tencent.com>
      Message-Id: <20220125095909.38122-18-cloudliang@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 01 Feb 2022, 1 commit
  3. 27 Jan 2022, 2 commits
    • KVM: x86: Forcibly leave nested virt when SMM state is toggled · f7e57078
      Authored by Sean Christopherson
      Forcibly leave nested virtualization operation if userspace toggles SMM
      state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS.  If userspace
      forces the vCPU out of SMM while it's post-VMXON and then injects an SMI,
      vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both
      vmxon=false and smm.vmxon=false, but all other nVMX state allocated.
      
      Don't attempt to gracefully handle the transition as (a) most transitions
      are nonsensical, e.g. forcing SMM while L2 is running, (b) there isn't
      sufficient information to handle all transitions, e.g. SVM wants access
      to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede
      KVM_SET_NESTED_STATE during state restore as the latter disallows putting
      the vCPU into L2 if SMM is active, and disallows tagging the vCPU as
      being post-VMXON in SMM if SMM is not active.
      
      Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX
      due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond
      just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU
      in an architecturally impossible state.
      
        WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
        WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
        Modules linked in:
        CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
        RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
        Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00
        Call Trace:
         <TASK>
         kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123
         kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]
         kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460
         kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline]
         kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676
         kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline]
         kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250
         kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273
         __fput+0x286/0x9f0 fs/file_table.c:311
         task_work_run+0xdd/0x1a0 kernel/task_work.c:164
         exit_task_work include/linux/task_work.h:32 [inline]
         do_exit+0xb29/0x2a30 kernel/exit.c:806
         do_group_exit+0xd2/0x2f0 kernel/exit.c:935
         get_signal+0x4b0/0x28c0 kernel/signal.c:2862
         arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
         handle_signal_work kernel/entry/common.c:148 [inline]
         exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
         exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
         __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
         syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
         do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
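
      The shape of the fix, in kvm_vcpu_ioctl_x86_set_vcpu_events(), is roughly the
      following (a sketch assuming the nested_ops->leave_nested() hook; the exact
      guard conditions differ in the real patch):

        if ((events->flags & KVM_VCPUEVENT_VALID_SMM) &&
            !!(vcpu->arch.hflags & HF_SMM_MASK) != events->smi.smm)
                kvm_x86_ops.nested_ops->leave_nested(vcpu);   /* free nVMX/nSVM state */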
      
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+8112db3ab20e70d50c31@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220125220358.2091737-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Pass emulation type to can_emulate_instruction() · 4d31d9ef
      Authored by Sean Christopherson
      Pass the emulation type to kvm_x86_ops.can_emulate_instruction() so that
      a future commit can harden KVM's SEV support to WARN on emulation
      scenarios that should never happen.
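
      The hook signature change amounts to (sketch):

        /* before */
        bool (*can_emulate_instruction)(struct kvm_vcpu *vcpu,
                                        void *insn, int insn_len);
        /* after */
        bool (*can_emulate_instruction)(struct kvm_vcpu *vcpu, int emul_type,
                                        void *insn, int insn_len);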
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 25 Jan 2022, 1 commit
  5. 20 Jan 2022, 2 commits
    • KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks · c3e8abf0
      Authored by Sean Christopherson
      Drop kvm_x86_ops' pre/post_block() now that all implementations are nops.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Reject KVM_RUN if emulation is required with pending exception · fc4fad79
      Authored by Sean Christopherson
      Reject KVM_RUN if emulation is required (because VMX is running without
      unrestricted guest) and an exception is pending, as KVM doesn't support
      emulating exceptions except when emulating real mode via vm86.  The vCPU
      is hosed either way, but letting KVM_RUN proceed triggers a WARN due to
      the impossible condition.  Alternatively, the WARN could be removed, but
      then userspace and/or KVM bugs would result in the vCPU silently running
      in a bad state, which isn't very friendly to users.
      
      Originally, the bug was hit by syzkaller with a nested guest as that
      doesn't require kvm_intel.unrestricted_guest=0.  That particular flavor
      is likely fixed by commit cd0e615c ("KVM: nVMX: Synthesize
      TRIPLE_FAULT for L2 if emulation is required"), but it's trivial to
      trigger the WARN with a non-nested guest, and userspace can likely force
      bad state via ioctls() for a nested guest as well.
      
      Checking for the impossible condition needs to be deferred until KVM_RUN
      because KVM can't force specific ordering between ioctls.  E.g. clearing
      exception.pending in KVM_SET_SREGS doesn't prevent userspace from setting
      it in KVM_SET_VCPU_EVENTS, and disallowing KVM_SET_VCPU_EVENTS with
      emulation_required would prevent userspace from queuing an exception and
      then stuffing sregs.  Note, if KVM were to try and detect/prevent the
      condition prior to KVM_RUN, handle_invalid_guest_state() and/or
      handle_emulation_failure() would need to be modified to clear the pending
      exception prior to exiting to userspace.
      
       ------------[ cut here ]------------
       WARNING: CPU: 6 PID: 137812 at arch/x86/kvm/vmx/vmx.c:1623 vmx_queue_exception+0x14f/0x160 [kvm_intel]
       CPU: 6 PID: 137812 Comm: vmx_invalid_nes Not tainted 5.15.2-7cc36c3e14ae-pop #279
       Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014
       RIP: 0010:vmx_queue_exception+0x14f/0x160 [kvm_intel]
       Code: <0f> 0b e9 fd fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
       RSP: 0018:ffffa45c83577d38 EFLAGS: 00010202
       RAX: 0000000000000003 RBX: 0000000080000006 RCX: 0000000000000006
       RDX: 0000000000000000 RSI: 0000000000010002 RDI: ffff9916af734000
       RBP: ffff9916af734000 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000006
       R13: 0000000000000000 R14: ffff9916af734038 R15: 0000000000000000
       FS:  00007f1e1a47c740(0000) GS:ffff99188fb80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f1e1a6a8008 CR3: 000000026f83b005 CR4: 00000000001726e0
       Call Trace:
        kvm_arch_vcpu_ioctl_run+0x13a2/0x1f20 [kvm]
        kvm_vcpu_ioctl+0x279/0x690 [kvm]
        __x64_sys_ioctl+0x83/0xb0
        do_syscall_64+0x3b/0xc0
        entry_SYSCALL_64_after_hwframe+0x44/0xae
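
       Conceptually the added guard in VMX code is just the following (a rough
       sketch; the actual patch routes this through a vendor pre-run hook, and the
       return value here is approximate):

        if (to_vmx(vcpu)->emulation_required && vcpu->arch.exception.pending)
                return -EIO;    /* refuse KVM_RUN rather than hit the WARN above */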
      
      Reported-by: syzbot+82112403ace4cbd780d8@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211228232437.1875318-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 15 Jan 2022, 1 commit
    • kvm: x86: Disable interception for IA32_XFD on demand · b5274b1b
      Authored by Kevin Tian
      Always intercepting IA32_XFD causes non-negligible overhead when this
      register is updated frequently in the guest.
      
      Disable r/w emulation after intercepting the first WRMSR(IA32_XFD)
      with a non-zero value.
      
      Disabling WRMSR emulation implies that IA32_XFD becomes out of sync
      with the software states in fpstate and the per-cpu xfd cache. This
      leads to two additional changes accordingly:
      
        - Call fpu_sync_guest_vmexit_xfd_state() after vm-exit to bring
          software states back in-sync with the MSR, before handle_exit_irqoff()
          is called.
      
        - Always trap #NM once write interception is disabled for IA32_XFD.
          The #NM exception is rare if the guest doesn't use dynamic
          features. Otherwise, there is at most one exception per guest
          task given a dynamic feature.
      
      p.s. We have confirmed that the SDM is being revised to say that
      when setting IA32_XFD[18] the AMX register state is not guaranteed
      to be preserved. This clarification avoids adding mess for a creative
      guest which sets IA32_XFD[18]=1 before saving active AMX state to
      its own storage.
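
      The WRMSR side of this is roughly the following fragment of the VMX MSR write
      handler (a sketch; the field and helper names follow the series but are quoted
      from memory, not the verbatim diff, and data/msr_info come from the enclosing
      function):

        case MSR_IA32_XFD:
                ret = kvm_set_msr_common(vcpu, msr_info);
                /* On the first non-zero write, stop intercepting IA32_XFD... */
                if (!ret && data) {
                        vmx_disable_intercept_for_msr(vcpu, MSR_IA32_XFD, MSR_TYPE_RW);
                        vcpu->arch.xfd_no_write_intercept = true;
                        /* ...and start trapping #NM so dynamic-feature faults are seen. */
                        vmx_update_exception_bitmap(vcpu);
                }
                break;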
      Signed-off-by: Kevin Tian <kevin.tian@intel.com>
      Signed-off-by: Jing Liu <jing2.liu@intel.com>
      Signed-off-by: Yang Zhong <yang.zhong@intel.com>
      Message-Id: <20220105123532.12586-22-yang.zhong@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  7. 07 Jan 2022, 4 commits
    • KVM: SVM: include CR3 in initial VMSA state for SEV-ES guests · 405329fc
      Authored by Michael Roth
      Normally guests will set up CR3 themselves, but some guests, such as
      kselftests, and potentially CONFIG_PVH guests, rely on being booted
      with paging enabled and CR3 initialized to a pre-allocated page table.
      
      Currently CR3 updates via KVM_SET_SREGS* are not loaded into the guest
      VMCB until just prior to entering the guest. For SEV-ES/SEV-SNP, this
      is too late, since it will have switched over to using the VMSA page
      prior to that point, with the VMSA CR3 copied from the VMCB initial
      CR3 value: 0.
      
      Address this by sync'ing the CR3 value into the VMCB save area
      immediately when KVM_SET_SREGS* is issued so it will find its way into
      the initial VMSA.
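
      The SVM side of the new hook looks roughly like this (a sketch; the function
      name and exact placement are approximate):

        static void sev_post_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
        {
                struct vcpu_svm *svm = to_svm(vcpu);

                /*
                 * For SEV-ES the VMSA is populated from the VMCB save area once,
                 * so the CR3 written via KVM_SET_SREGS* must land there now.
                 */
                if (sev_es_guest(vcpu->kvm)) {
                        svm->vmcb->save.cr3 = cr3;
                        vmcb_mark_dirty(svm->vmcb, VMCB_CR);
                }
        }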
      Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Message-Id: <20211216171358.61140-10-michael.roth@amd.com>
      [Remove vmx_post_set_cr3; add a remark about kvm_set_cr3 not calling the
       new hook. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery · 14243b38
      Authored by David Woodhouse
      This adds basic support for delivering 2 level event channels to a guest.
      
      Initially, it only supports delivery via the IRQ routing table, triggered
      by an eventfd. In order to do so, it has a kvm_xen_set_evtchn_fast()
      function which will use the pre-mapped shared_info page if it already
      exists and is still valid, while the slow path through the irqfd_inject
      workqueue will remap the shared_info page if necessary.
      
      It sets the bits in the shared_info page but not the vcpu_info; that is
      deferred to __kvm_xen_has_interrupt() which raises the vector to the
      appropriate vCPU.
      
      Add a 'verbose' mode to xen_shinfo_test while adding test cases for this.
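
      The new routing entry type added to the uapi is, roughly (sketch of the
      structure; see include/uapi/linux/kvm.h for the authoritative layout):

        struct kvm_irq_routing_xen_evtchn {
                __u32 port;       /* 2-level event channel port to set */
                __u32 vcpu;       /* target vCPU for vcpu_info delivery */
                __u32 priority;   /* KVM_IRQ_ROUTING_XEN_EVTCHN_PRIO_2LEVEL */
        };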
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20211210163625.2886-5-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/xen: Maintain valid mapping of Xen shared_info page · 1cfc9c4b
      Authored by David Woodhouse
      Use the newly reinstated gfn_to_pfn_cache to maintain a kernel mapping
      of the Xen shared_info page so that it can be accessed in atomic context.
      
      Note that we do not participate in dirty tracking for the shared info
      page and we do not explicitly mark it dirty every single time we deliver
      an event channel interrupt. We wouldn't want to do that even if we *did*
      have a valid vCPU context with which to do so.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20211210163625.2886-4-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/pmu: Add pmc->intr to refactor kvm_perf_overflow{_intr}() · 40ccb96d
      Authored by Like Xu
      Depending on whether intr should be triggered or not, KVM registers
      two different event overflow callbacks in the perf_event context.
      
      The code skeleton of these two functions is very similar, so the
      interrupt flag can be stored in pmc->intr by pmc_reprogram_counter(),
      which gives a smaller instruction footprint and is friendlier to the
      micro-architectural branch predictor.
      
      __kvm_perf_overflow() can be called in non-NMI contexts, so a flag is
      needed to distinguish the caller context and thus avoid a check on
      kvm_is_in_guest(); otherwise we might get warnings from suspicious RCU
      or check_preemption_disabled().
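
      The resulting shape is roughly (sketch; only the relevant lines are shown):

        /* pmc_reprogram_counter() remembers whether this counter should raise a PMI */
        pmc->intr = intr;

        /* the single perf overflow callback then just forwards the stored context */
        static void kvm_perf_overflow(struct perf_event *perf_event,
                                      struct perf_sample_data *data,
                                      struct pt_regs *regs)
        {
                struct kvm_pmc *pmc = perf_event->overflow_handler_context;

                __kvm_perf_overflow(pmc, true);   /* in_pmi == true: called from perf */
        }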
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20211130074221.93635-5-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  8. 20 Dec 2021, 1 commit
    • KVM: x86: Always set kvm_run->if_flag · c5063551
      Authored by Marc Orr
      The kvm_run struct's if_flag is a part of the userspace/kernel API. The
      SEV-ES patches failed to set this flag because it's no longer needed by
      QEMU (according to the comment in the source code). However, other
      hypervisors may make use of this flag. Therefore, set the flag for
      guests with encrypted registers (i.e., with guest_state_protected set).
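
      Logically the change makes the exit-time bookkeeping do something like the
      following (a sketch of the intent; the real patch adds a vendor get_if_flag
      hook rather than open-coding it):

        /* post_kvm_run_save(): report IF as set when registers are encrypted */
        kvm_run->if_flag = vcpu->arch.guest_state_protected ?
                           1 : !!(kvm_get_rflags(vcpu) & X86_EFLAGS_IF);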
      
      Fixes: f1c6366e ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
      Signed-off-by: Marc Orr <marcorr@google.com>
      Message-Id: <20211209155257.128747-1-marcorr@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
  9. 10 Dec 2021, 1 commit
    • KVM: x86: Wait for IPIs to be delivered when handling Hyper-V TLB flush hypercall · 1ebfaa11
      Authored by Vitaly Kuznetsov
      Prior to commit 0baedd79 ("KVM: x86: make Hyper-V PV TLB flush use
      tlb_flush_guest()"), kvm_hv_flush_tlb() was using 'KVM_REQ_TLB_FLUSH |
      KVM_REQUEST_NO_WAKEUP' when making a request to flush TLBs on other vCPUs
      and KVM_REQ_TLB_FLUSH is/was defined as:
      
       (0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
      
      so KVM_REQUEST_WAIT was lost. Hyper-V TLFS, however, requires that
      "This call guarantees that by the time control returns back to the
      caller, the observable effects of all flushes on the specified virtual
      processors have occurred." and without KVM_REQUEST_WAIT there's a small
      chance that the vCPU making the TLB flush will resume running before
      all IPIs get delivered to other vCPUs and a stale mapping can get read
      there.
      
      Fix the issue by adding KVM_REQUEST_WAIT flag to KVM_REQ_TLB_FLUSH_GUEST:
      kvm_hv_flush_tlb() is the sole caller which uses it for
      kvm_make_all_cpus_request()/kvm_make_vcpus_request_mask() where
      KVM_REQUEST_WAIT makes a difference.
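
      The fix itself is a one-liner in the request definition, roughly (sketch; the
      request bit number shown is illustrative):

        #define KVM_REQ_TLB_FLUSH_GUEST \
                KVM_ARCH_REQ_FLAGS(27, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)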
      
      Cc: stable@kernel.org
      Fixes: 0baedd79 ("KVM: x86: make Hyper-V PV TLB flush use tlb_flush_guest()")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20211209102937.584397-1-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  10. 08 Dec 2021, 10 commits
  11. 02 Dec 2021, 1 commit
  12. 18 Nov 2021, 1 commit
    • KVM: x86/mmu: include EFER.LMA in extended mmu role · b8453cdc
      Authored by Maxim Levitsky
      Incorporate EFER.LMA into kvm_mmu_extended_role, as it is used to compute the
      guest root level and is not reflected in kvm_mmu_page_role.level when TDP
      is in use.  When simply running the guest, it is impossible for EFER.LMA
      and kvm_mmu.root_level to get out of sync, as the guest cannot transition
      from PAE paging to 64-bit paging without toggling CR0.PG, i.e. without
      first bouncing through a different MMU context.  And stuffing guest state
      via KVM_SET_SREGS{,2} also ensures a full MMU context reset.
      
      However, if KVM_SET_SREGS{,2} is followed by KVM_SET_NESTED_STATE, e.g. to
      set guest state when migrating the VM while L2 is active, the vCPU state
      will reflect L2, not L1.  If L1 is using TDP for L2, then root_mmu will
      have been configured using L2's state, despite not being used for L2.  If
      L2.EFER.LMA != L1.EFER.LMA, and L2 is using PAE paging, then root_mmu will
      be configured for guest PAE paging, but will match the mmu_role for 64-bit
      paging and cause KVM to not reconfigure root_mmu on the next nested VM-Exit.
      
      Alternatively, the root_mmu's role could be invalidated after a successful
      KVM_SET_NESTED_STATE that yields vcpu->arch.mmu != vcpu->arch.root_mmu,
      i.e. that switches the active mmu to guest_mmu, but doing so is unnecessarily
      tricky, and not even needed if L1 and L2 do have the same role (e.g., they
      are both 64-bit guests and run with the same CR4).
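
      In essence the change adds one bit to the extended role and fills it from the
      register state used to build the role, roughly (sketch; the field and helper
      names are approximate):

        /* kvm_mmu_extended_role gains: */
        unsigned int efer_lma:1;

        /* ...and role computation fills it in: */
        ext.efer_lma = ____is_efer_lma(regs);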
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211115131837.195527-3-mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  13. 17 Nov 2021, 5 commits
  14. 11 Nov 2021, 4 commits
    • KVM: x86: Drop arbitrary KVM_SOFT_MAX_VCPUS · da1bfd52
      Authored by Vitaly Kuznetsov
      KVM_CAP_NR_VCPUS is used to get the "recommended" maximum number of
      VCPUs and arm64/mips/riscv report num_online_cpus(). Powerpc reports
      either num_online_cpus() or num_present_cpus(), s390 has multiple
      constants depending on hardware features. On x86, KVM reports an
      arbitrary value of '710' which is supposed to be the maximum tested
      value but it's possible to test all KVM_MAX_VCPUS even when there are
      fewer physical CPUs available.
      
      Drop the arbitrary '710' value and return num_online_cpus() on x86 as
      well. The recommendation will match other architectures and will mean
      'no CPU overcommit'.
      
      For reference, QEMU only queries KVM_CAP_NR_VCPUS to print a warning
      when the requested vCPU number exceeds it. The static limit of '710'
      is quite weird as smaller systems with just a few physical CPUs should
      certainly "recommend" less.
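
      The x86 side of the change is then roughly (sketch of the
      kvm_vm_ioctl_check_extension() case):

        case KVM_CAP_NR_VCPUS:
                r = num_online_cpus();    /* was the hard-coded KVM_SOFT_MAX_VCPUS (710) */
                break;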
      Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20211111134733.86601-1-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Make sure KVM_CPUID_FEATURES really are KVM_CPUID_FEATURES · 760849b1
      Authored by Paul Durrant
      Currently when kvm_update_cpuid_runtime() runs, it assumes that the
      KVM_CPUID_FEATURES leaf is located at 0x40000001. This is not true,
      however, if Hyper-V support is enabled. In this case the KVM leaves will
      be offset.
      
      This patch introduces a new 'kvm_cpuid_base' field into struct
      kvm_vcpu_arch to track the location of the KVM leaves and function
      kvm_update_kvm_cpuid_base() (called from kvm_set_cpuid()) to locate the
      leaves using the 'KVMKVMKVM\0\0\0' signature (which is now given a
      definition in kvm_para.h). Adjustment of KVM_CPUID_FEATURES will hence now
      target the correct leaf.
      
      NOTE: A new for_each_possible_hypervisor_cpuid_base() macro is introduced
            into processor.h to avoid having duplicate code for the iteration
            over possible hypervisor base leaves.
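
      The lookup then becomes, roughly (a sketch of kvm_update_kvm_cpuid_base();
      details approximate):

        static void kvm_update_kvm_cpuid_base(struct kvm_vcpu *vcpu)
        {
                struct kvm_cpuid_entry2 *entry;
                u32 base;

                vcpu->arch.kvm_cpuid_base = 0;

                for_each_possible_hypervisor_cpuid_base(base) {
                        entry = kvm_find_cpuid_entry(vcpu, base, 0);
                        /* the signature "KVMKVMKVM\0\0\0" lives in ebx/ecx/edx */
                        if (entry && !memcmp(&entry->ebx, KVM_SIGNATURE, 12)) {
                                vcpu->arch.kvm_cpuid_base = base;
                                break;
                        }
                }
        }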
      Signed-off-by: Paul Durrant <pdurrant@amazon.com>
      Message-Id: <20211105095101.5384-3-pdurrant@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: inhibit APICv when KVM_GUESTDBG_BLOCKIRQ active · cae72dcc
      Authored by Maxim Levitsky
      KVM_GUESTDBG_BLOCKIRQ relies on interrupts being injected via KVM's
      standard inject_pending_event path, and not via APICv/AVIC.
      
      Since this is a debug feature, just inhibit APICv/AVIC while
      KVM_GUESTDBG_BLOCKIRQ is in use on at least one vCPU.
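
      The mechanism is the existing APICv inhibit machinery with a new reason bit,
      roughly (sketch; 'blockirq_active' is a placeholder for the real bookkeeping
      of whether any vCPU has the flag set):

        kvm_request_apicv_update(vcpu->kvm, !blockirq_active,
                                 APICV_INHIBIT_REASON_BLOCKIRQ);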
      
      Fixes: 61e5f69e ("KVM: x86: implement KVM_GUESTDBG_BLOCKIRQ")
      Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Tested-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211108090245.166408-1-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Fix recording of guest steal time / preempted status · 7e2175eb
      Authored by David Woodhouse
      In commit b0431382 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is
      not missed") we switched to using a gfn_to_pfn_cache for accessing the
      guest steal time structure in order to allow for an atomic xchg of the
      preempted field. This has a couple of problems.
      
      Firstly, kvm_map_gfn() doesn't work at all for IOMEM pages when the
      atomic flag is set, which it is in kvm_steal_time_set_preempted(). So a
      guest vCPU using an IOMEM page for its steal time would never have its
      preempted field set.
      
      Secondly, the gfn_to_pfn_cache is not invalidated in all cases where it
      should have been. There are two stages to the GFN->PFN conversion;
      first the GFN is converted to a userspace HVA, and then that HVA is
      looked up in the process page tables to find the underlying host PFN.
      Correct invalidation of the latter would require being hooked up to the
      MMU notifiers, but that doesn't happen---so it just keeps mapping and
      unmapping the *wrong* PFN after the userspace page tables change.
      
      In the !IOMEM case at least the stale page *is* pinned all the time it's
      cached, so it won't be freed and reused by anyone else while still
      receiving the steal time updates. The map/unmap dance only takes care
      of the KVM administrivia such as marking the page dirty.
      
      Until the gfn_to_pfn cache handles the remapping automatically by
      integrating with the MMU notifiers, we might as well not get a
      kernel mapping of it, and use the perfectly serviceable userspace HVA
      that we already have.  We just need to implement the atomic xchg on
      the userspace address with appropriate exception handling, which is
      fairly trivial.
      
      Cc: stable@vger.kernel.org
      Fixes: b0431382 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed")
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <3645b9b889dac6438394194bb5586a46b68d581f.camel@infradead.org>
      [I didn't entirely agree with David's assessment of the
       usefulness of the gfn_to_pfn cache, and integrated the outcome
       of the discussion in the above commit message. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>