1. 11 Feb 2022, 8 commits
    • KVM: x86/mmu: Split huge pages mapped by the TDP MMU during KVM_CLEAR_DIRTY_LOG · cb00a70b
      Committed by David Matlack
      When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not
      write-protected when dirty logging is enabled on the memslot. Instead
      they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for
      the first time and only for the specific sub-region being cleared.
      
      Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to
      write-protecting to avoid causing write-protection faults on vCPU
      threads. This also allows userspace to smear the cost of huge page
      splitting across multiple ioctls, rather than splitting the entire
      memslot as is the case when initially-all-set is not used.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-17-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
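      As a hedged illustration of the ioctl this commit extends (the struct and
      ioctl are the real KVM UAPI; vm_fd, slot, and the chunk-size policy are
      assumptions for illustration), a userspace sketch of clearing the dirty
      log one sub-region at a time, which is what lets the splitting cost be
      smeared across multiple calls:

        #include <linux/kvm.h>
        #include <sys/ioctl.h>

        /* Clear 1 GiB worth of 4 KiB pages per ioctl (hypothetical policy). */
        #define CHUNK_PAGES (1ULL << 18)

        static int clear_dirty_chunk(int vm_fd, __u32 slot, __u64 first_page,
                                     void *bitmap)
        {
                struct kvm_clear_dirty_log log = {
                        .slot         = slot,
                        .first_page   = first_page, /* must be 64-page aligned */
                        .num_pages    = CHUNK_PAGES,
                        .dirty_bitmap = bitmap,     /* 1 bit per page to clear */
                };

                /*
                 * With KVM_DIRTY_LOG_INITIALLY_SET, KVM write-protects (and,
                 * after this commit, first splits huge pages in) only the
                 * sub-region covered by this call.
                 */
                return ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &log);
        }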
    • KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled · a3fe5dbd
      Committed by David Matlack
      When dirty logging is enabled without initially-all-set, try to split
      all huge pages in the memslot down to 4KB pages so that vCPUs do not
      have to take expensive write-protection faults to split huge pages.
      
      Eager page splitting is best-effort only. This commit only adds
      support for the TDP MMU, and even there splitting may fail due to out
      of memory conditions. Failure to split a huge page is fine from a
      correctness standpoint because KVM will always follow up splitting by
      write-protecting any remaining huge pages.
      
      Eager page splitting moves the cost of splitting huge pages off of the
      vCPU threads and onto the thread enabling dirty logging on the memslot.
      This is useful because:
      
       1. Splitting on the vCPU thread interrupts vCPU execution and is
          disruptive to customers, whereas splitting on VM ioctl threads can
          run in parallel with vCPU execution.
      
       2. Splitting all huge pages at once is more efficient because it does
          not require performing VM-exit handling or walking the page table for
          every 4KiB page in the memslot, and greatly reduces the amount of
          contention on the mmu_lock.
      
      For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
      per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
      all of their memory after dirty logging is enabled decreased by 95% from
      2.94s to 0.14s.
      
      Eager Page Splitting is over 100x more efficient than the current
      implementation of splitting on fault under the read lock. For example,
      taking the same workload as above, Eager Page Splitting reduced the CPU
      required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
      * 96 vCPU threads) to only 1.55 CPU-seconds.
      
      Eager page splitting does increase the amount of time it takes to enable
      dirty logging since it has to split all huge pages. For example, the time
      it took to enable dirty logging in the 96GiB region of the
      aforementioned test increased from 0.001s to 1.55s.
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
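      For context, a minimal sketch of the operation that now triggers eager
      splitting (the flag, struct, and ioctl are the real KVM UAPI; vm_fd and
      the caller-filled region are assumptions for illustration):

        #include <linux/kvm.h>
        #include <sys/ioctl.h>

        static int enable_dirty_logging(int vm_fd,
                                        struct kvm_userspace_memory_region *r)
        {
                r->flags |= KVM_MEM_LOG_DIRTY_PAGES;

                /*
                 * Without initially-all-set, KVM now eagerly splits all huge
                 * pages in the slot on this ioctl thread, rather than on vCPU
                 * threads via write-protection faults.
                 */
                return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, r);
        }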
    • KVM: x86: Use more verbose names for mem encrypt kvm_x86_ops hooks · 03d004cd
      Committed by Sean Christopherson
      Use slightly more verbose names for the so called "memory encrypt",
      a.k.a. "mem enc", kvm_x86_ops hooks to bridge the gap between the current
      super short kvm_x86_ops names and SVM's more verbose, but non-conforming
      names.  This is a step toward using kvm-x86-ops.h with KVM_X86_OP()
      to fill svm_x86_ops.
      
      Opportunistically rename mem_enc_op() to mem_enc_ioctl() to better
      reflect its true nature, as it really is a full-fledged ioctl() of its
      own.  Ideally, the hook would be named confidential_vm_ioctl() or so, as
      the ioctl() is a gateway to more than just memory encryption, and because
      its underlying purpose is to support Confidential VMs, which can be provided
      without memory encryption, e.g. if the TCB of the guest includes the host
      kernel but not host userspace, or by isolation in hardware without
      encrypting memory.  But, diverging from KVM_MEMORY_ENCRYPT_OP even
      further is undesirable, and short of creating aliases for all related
      ioctl()s, which introduces a different flavor of divergence, KVM is stuck
      with the nomenclature.
      
      Defer renaming SVM's functions to a future commit as there are additional
      changes needed to make SVM fully conforming and to match reality (looking
      at you, svm_vm_copy_asid_from()).
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220128005208.4008533-20-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
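      To see why mem_enc_ioctl() is the better name, here is a hedged sketch of
      how the hook is reached from userspace (struct kvm_sev_cmd,
      KVM_MEMORY_ENCRYPT_OP, and KVM_SEV_INIT are the real UAPI; vm_fd and
      sev_fd are assumptions for illustration):

        #include <linux/kvm.h>
        #include <sys/ioctl.h>

        static int sev_vm_init(int vm_fd, int sev_fd)
        {
                struct kvm_sev_cmd cmd = {
                        .id     = KVM_SEV_INIT, /* vendor-defined sub-command */
                        .sev_fd = sev_fd,       /* fd for /dev/sev */
                };

                /*
                 * The "op" is a full sub-command dispatcher, i.e. an ioctl()
                 * in its own right, which lands in the renamed hook.
                 */
                return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
        }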
    • KVM: x86: Move get_cs_db_l_bits() helper to SVM · 872e0c53
      Committed by Sean Christopherson
      Move kvm_get_cs_db_l_bits() to SVM and rename it appropriately so that
      its svm_x86_ops entry can be filled via kvm-x86-ops, and to eliminate a
      superfluous export from KVM x86.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220128005208.4008533-16-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Use static_call() for copy/move encryption context ioctls() · 7ad02ef0
      Committed by Sean Christopherson
      Define and use static_call()s for .vm_{copy,move}_enc_context_from(),
      mostly so that the op is defined in kvm-x86-ops.h.  This will allow using
      KVM_X86_OP in vendor code to wire up the implementation.  Any performance
      gains eked out by using static_call() are a happy bonus and not the
      primary motivation.
      
      Opportunistically refactor the code to reduce indentation and keep line
      lengths reasonable, and to be consistent when wrapping versus running
      a bit over the 80 char soft limit.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220128005208.4008533-12-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
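      The mechanism being set up here is an X-macro header: kvm-x86-ops.h lists
      each op once, and including it under different definitions of KVM_X86_OP
      generates the static_call() declarations, updates, and vendor wiring. A
      simplified, standalone userspace sketch of the pattern (plain function
      pointers stand in for static_call(); all names here are illustrative,
      not the kernel's):

        #include <stdio.h>

        /* One authoritative list of ops, like kvm-x86-ops.h. */
        #define OPS_LIST(OP)  \
                OP(vcpu_run)  \
                OP(vm_copy_enc_context_from)

        /* Expansion 1: declare the ops-table fields. */
        struct x86_ops {
        #define OP(func) int (*func)(void);
                OPS_LIST(OP)
        #undef OP
        };

        static struct x86_ops ops;

        /* Expansion 2: one call wrapper per op (the kernel generates
         * static_call() sites here instead). */
        #define OP(func) static int call_##func(void) { return ops.func(); }
        OPS_LIST(OP)
        #undef OP

        /* Vendor implementations following the <vendor>_<func> pattern. */
        static int vmx_vcpu_run(void) { puts("vmx_vcpu_run"); return 0; }
        static int vmx_vm_copy_enc_context_from(void)
        {
                puts("vmx_vm_copy_enc_context_from");
                return 0;
        }

        int main(void)
        {
                /* Expansion 3: fill the ops table from vendor names. */
        #define OP(func) ops.func = vmx_##func;
                OPS_LIST(OP)
        #undef OP

                call_vcpu_run();
                return call_vm_copy_enc_context_from();
        }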
    • KVM: x86: Use static_call() for .vcpu_deliver_sipi_vector() · a0941a64
      Committed by Sean Christopherson
      Define and use a static_call() for kvm_x86_ops.vcpu_deliver_sipi_vector(),
      mostly so that the op is defined in kvm-x86-ops.h.  This will allow using
      KVM_X86_OP in vendor code to wire up the implementation.  Any performance
      gains eked out by using static_call() are a happy bonus and not the
      primary motivation.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220128005208.4008533-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Rename kvm_x86_ops pointers to align w/ preferred vendor names · e27bc044
      Committed by Sean Christopherson
      Rename a variety of kvm_x86_op function pointers so that preferred name
      for vendor implementations follows the pattern <vendor>_<function>, e.g.
      rename .run() to .vcpu_run() to match {svm,vmx}_vcpu_run().  This will
      allow vendor implementations to be wired up via the KVM_X86_OP macro.
      
      In many cases, VMX and SVM "disagree" on the preferred name, though in
      reality it's VMX and x86 that disagree, as SVM blindly prepended svm_ to
      the kvm_x86_ops name.  Justification for using the VMX nomenclature:
      
        - set_{irq,nmi} => inject_{irq,nmi} because the helper is injecting an
          event that has already been "set" in e.g. the vIRR.  SVM's relevant
          VMCB field is even named event_inj, and KVM's stat is irq_injections.
      
        - prepare_guest_switch => prepare_switch_to_guest because the former is
          ambiguous, e.g. it could mean switching between multiple guests,
          switching from the guest to host, etc...
      
        - update_pi_irte => pi_update_irte to match the rest
          of VMX's posted interrupt naming scheme, which is vmx_pi_<blah>().
      
        - start_assignment => pi_start_assignment to again follow VMX's posted
          interrupt naming scheme, and to provide context for what bit of code
          might care about an otherwise undescribed "assignment".
      
      The "tlb_flush" => "flush_tlb" creates an inconsistency with respect to
      Hyper-V's "tlb_remote_flush" hooks, but Hyper-V really is the one that's
      wrong.  x86, VMX, and SVM all use flush_tlb, and even common KVM is on a
      variant of the bandwagon with "kvm_flush_remote_tlbs", e.g. a more
      appropriate name for the Hyper-V hooks would be flush_remote_tlbs.  Leave
      that change for another time as the Hyper-V hooks always start as NULL,
      i.e. the name doesn't matter for using kvm-x86-ops.h, and changing all
      names requires an astounding amount of churn.
      
      VMX and SVM function names are intentionally left as is to minimize the
      diff.  Both VMX and SVM will need to rename even more functions in order
      to fully utilize KVM_X86_OPS, i.e. an additional patch for each is
      inevitable.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220128005208.4008533-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Remove unused "vcpu" of kvm_scale_tsc() · 62711e5a
      Committed by Jinrong Liang
      The "struct kvm_vcpu *vcpu" parameter of kvm_scale_tsc() is not used,
      so remove it. No functional change intended.
      Signed-off-by: Jinrong Liang <cloudliang@tencent.com>
      Message-Id: <20220125095909.38122-18-cloudliang@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
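      For reference, a sketch of the prototypes as implied by the message (the
      exact parameter list is inferred, not quoted from the patch):

        /* Before: the vcpu argument was accepted but never used. */
        u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc, u64 ratio);

        /* After: only the TSC value and the scaling ratio remain. */
        u64 kvm_scale_tsc(u64 tsc, u64 ratio);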
  2. 09 Feb 2022, 1 commit
  3. 01 Feb 2022, 1 commit
  4. 28 Jan 2022, 1 commit
  5. 27 Jan 2022, 2 commits
    • KVM: x86: Forcibly leave nested virt when SMM state is toggled · f7e57078
      Committed by Sean Christopherson
      Forcibly leave nested virtualization operation if userspace toggles SMM
      state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS.  If userspace
      forces the vCPU out of SMM while it's post-VMXON and then injects an SMI,
      vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both
      vmxon=false and smm.vmxon=false, but all other nVMX state allocated.
      
      Don't attempt to gracefully handle the transition as (a) most transitions
      are nonsensical, e.g. forcing SMM while L2 is running, (b) there isn't
      sufficient information to handle all transitions, e.g. SVM wants access
      to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede
      KVM_SET_NESTED_STATE during state restore as the latter disallows putting
      the vCPU into L2 if SMM is active, and disallows tagging the vCPU as
      being post-VMXON in SMM if SMM is not active.
      
      Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX
      due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond
      just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU
      in an architecturally impossible state.
      
        WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
        WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
        Modules linked in:
        CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
        RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
        Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00
        Call Trace:
         <TASK>
         kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123
         kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]
         kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460
         kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline]
         kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676
         kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline]
         kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250
         kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273
         __fput+0x286/0x9f0 fs/file_table.c:311
         task_work_run+0xdd/0x1a0 kernel/task_work.c:164
         exit_task_work include/linux/task_work.h:32 [inline]
         do_exit+0xb29/0x2a30 kernel/exit.c:806
         do_group_exit+0xd2/0x2f0 kernel/exit.c:935
         get_signal+0x4b0/0x28c0 kernel/signal.c:2862
         arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
         handle_signal_work kernel/entry/common.c:148 [inline]
         exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
         exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
         __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
         syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
         do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
      
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+8112db3ab20e70d50c31@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220125220358.2091737-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
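      The ordering constraint in point (c) above matters to VMMs doing state
      restore. A hedged userspace sketch (the ioctls and structs are the real
      KVM UAPI; vcpu_fd and the caller-filled payloads are assumptions for
      illustration):

        #include <linux/kvm.h>
        #include <sys/ioctl.h>

        static int restore_vcpu_state(int vcpu_fd,
                                      struct kvm_vcpu_events *events,
                                      struct kvm_nested_state *nested)
        {
                /*
                 * 1. Restore SMM state first; if this toggles SMM, KVM now
                 *    forcibly leaves nested operation instead of corrupting
                 *    nVMX state.
                 */
                if (ioctl(vcpu_fd, KVM_SET_VCPU_EVENTS, events))
                        return -1;

                /*
                 * 2. Restore nested state second: it rejects L2/SMM
                 *    combinations that contradict the flags set above.
                 */
                return ioctl(vcpu_fd, KVM_SET_NESTED_STATE, nested);
        }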
    • KVM: x86: Pass emulation type to can_emulate_instruction() · 4d31d9ef
      Committed by Sean Christopherson
      Pass the emulation type to kvm_x86_ops.can_emulate_instruction() so that
      a future commit can harden KVM's SEV support to WARN on emulation
      scenarios that should never happen.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
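      The shape of the change, as implied by the message (a sketch of the hook
      declaration, not the literal diff):

        /* kvm_x86_ops: emul_type (EMULTYPE_* flags) is the new parameter. */
        bool (*can_emulate_instruction)(struct kvm_vcpu *vcpu, int emul_type,
                                        void *insn, int insn_len);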
  6. 25 Jan 2022, 1 commit
  7. 20 Jan 2022, 2 commits
    • KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks · c3e8abf0
      Committed by Sean Christopherson
      Drop kvm_x86_ops' pre/post_block() now that all implementations are nops.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Reject KVM_RUN if emulation is required with pending exception · fc4fad79
      Committed by Sean Christopherson
      Reject KVM_RUN if emulation is required (because VMX is running without
      unrestricted guest) and an exception is pending, as KVM doesn't support
      emulating exceptions except when emulating real mode via vm86.  The vCPU
      is hosed either way, but letting KVM_RUN proceed triggers a WARN due to
      the impossible condition.  Alternatively, the WARN could be removed, but
      then userspace and/or KVM bugs would result in the vCPU silently running
      in a bad state, which isn't very friendly to users.
      
      Originally, the bug was hit by syzkaller with a nested guest as that
      doesn't require kvm_intel.unrestricted_guest=0.  That particular flavor
      is likely fixed by commit cd0e615c ("KVM: nVMX: Synthesize
      TRIPLE_FAULT for L2 if emulation is required"), but it's trivial to
      trigger the WARN with a non-nested guest, and userspace can likely force
      bad state via ioctls() for a nested guest as well.
      
      Checking for the impossible condition needs to be deferred until KVM_RUN
      because KVM can't force specific ordering between ioctls.  E.g. clearing
      exception.pending in KVM_SET_SREGS doesn't prevent userspace from setting
      it in KVM_SET_VCPU_EVENTS, and disallowing KVM_SET_VCPU_EVENTS with
      emulation_required would prevent userspace from queuing an exception and
      then stuffing sregs.  Note, if KVM were to try and detect/prevent the
      condition prior to KVM_RUN, handle_invalid_guest_state() and/or
      handle_emulation_failure() would need to be modified to clear the pending
      exception prior to exiting to userspace.
      
       ------------[ cut here ]------------
       WARNING: CPU: 6 PID: 137812 at arch/x86/kvm/vmx/vmx.c:1623 vmx_queue_exception+0x14f/0x160 [kvm_intel]
       CPU: 6 PID: 137812 Comm: vmx_invalid_nes Not tainted 5.15.2-7cc36c3e14ae-pop #279
       Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014
       RIP: 0010:vmx_queue_exception+0x14f/0x160 [kvm_intel]
       Code: <0f> 0b e9 fd fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
       RSP: 0018:ffffa45c83577d38 EFLAGS: 00010202
       RAX: 0000000000000003 RBX: 0000000080000006 RCX: 0000000000000006
       RDX: 0000000000000000 RSI: 0000000000010002 RDI: ffff9916af734000
       RBP: ffff9916af734000 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000006
       R13: 0000000000000000 R14: ffff9916af734038 R15: 0000000000000000
       FS:  00007f1e1a47c740(0000) GS:ffff99188fb80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f1e1a6a8008 CR3: 000000026f83b005 CR4: 00000000001726e0
       Call Trace:
        kvm_arch_vcpu_ioctl_run+0x13a2/0x1f20 [kvm]
        kvm_vcpu_ioctl+0x279/0x690 [kvm]
        __x64_sys_ioctl+0x83/0xb0
        do_syscall_64+0x3b/0xc0
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Reported-by: syzbot+82112403ace4cbd780d8@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211228232437.1875318-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
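      A hedged paraphrase of the check described above, using kernel-internal
      types (a sketch of the condition, not the literal upstream diff):

        static bool emulation_required_with_pending_exception(struct kvm_vcpu *vcpu)
        {
                struct vcpu_vmx *vmx = to_vmx(vcpu);

                /*
                 * Emulation cannot deliver exceptions except via vm86 real
                 * mode; entering KVM_RUN in this state would trip the WARN.
                 */
                return vmx->emulation_required && !vmx->rmode.vm86_active &&
                       vcpu->arch.exception.pending;
        }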
  8. 16 Jan 2022, 1 commit
  9. 15 Jan 2022, 10 commits
  10. 12 Jan 2022, 2 commits
  11. 08 Jan 2022, 2 commits
  12. 07 Jan 2022, 5 commits
  13. 31 Dec 2021, 1 commit
  14. 30 Dec 2021, 1 commit
  15. 28 Dec 2021, 1 commit
  16. 23 Dec 2021, 1 commit