1. 25 May, 2022 3 commits
    • KVM: set_msr_mce: Permit guests to ignore single-bit ECC errors · 0471a7bd
      Lev Kujawski committed
      Certain guest operating systems (e.g., UNIXWARE) clear bit 0 of
      MC1_CTL to ignore single-bit ECC data errors.  Single-bit ECC data
      errors are always correctable and thus are safe to ignore because they
      are informational in nature rather than signaling a loss of data
      integrity.
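
A minimal sketch of the kind of check this relaxes. It assumes the baseline rule is that a guest may only write all zeroes or all ones to an MCi_CTL register; the helper name and the per-bank restriction are hypothetical and are not taken from the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel's code: decide whether a guest
 * write to an MCi_CTL register should be accepted.
 *
 * Assumption: only "all zeroes" or "all ones" are normally allowed.
 */
static bool mci_ctl_write_ok(unsigned int bank, uint64_t data)
{
	uint64_t ignored = 0;

	/* Bit 0 of MC1_CTL gates single-bit (correctable) ECC data errors. */
	if (bank == 1)
		ignored |= 1ULL << 0;

	/* Accept zero, or all ones once the ignorable bits are filled in. */
	return data == 0 || (data | ignored) == ~0ULL;
}
```

With this shape, a guest that clears only bit 0 of MC1_CTL passes the check, while the same value written to another bank is still rejected.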
      
      Prior to this patch, these guests would crash upon writing MC1_CTL,
      with resultant error messages like the following:
      
      error: kvm run failed Operation not permitted
      EAX=fffffffe EBX=fffffffe ECX=00000404 EDX=ffffffff
      ESI=ffffffff EDI=00000001 EBP=fffdaba4 ESP=fffdab20
      EIP=c01333a5 EFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
      ES =0108 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
      CS =0100 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
      SS =0108 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
      DS =0108 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
      FS =0000 00000000 ffffffff 00c00000
      GS =0000 00000000 ffffffff 00c00000
      LDT=0118 c1026390 00000047 00008200 DPL=0 LDT
      TR =0110 ffff5af0 00000067 00008b00 DPL=0 TSS32-busy
      GDT=     ffff5020 000002cf
      IDT=     ffff52f0 000007ff
      CR0=8001003b CR2=00000000 CR3=0100a000 CR4=00000230
      DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
      DR6=ffff0ff0 DR7=00000400
      EFER=0000000000000000
      Code=08 89 01 89 51 04 c3 8b 4c 24 08 8b 01 8b 51 04 8b 4c 24 04 <0f>
      30 c3 f7 05 a4 6d ff ff 10 00 00 00 74 03 0f 31 c3 33 c0 33 d2 c3 8d
      74 26 00 0f 31 c3
      Signed-off-by: Lev Kujawski <lkujaw@member.fsf.org>
      Message-Id: <20220521081511.187388-1-lkujaw@member.fsf.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: avoid calling x86 emulator without a decoded instruction · fee060cd
      Sean Christopherson committed
      Whenever x86_decode_emulated_instruction() detects a breakpoint, it
      returns the value that kvm_vcpu_check_breakpoint() writes into its
      pass-by-reference second argument.  Unfortunately this is completely
      bogus because the expected outcome of x86_decode_emulated_instruction
      is an EMULATION_* value.
      
      Then, if kvm_vcpu_check_breakpoint() does "*r = 0" (corresponding to
      a KVM_EXIT_DEBUG userspace exit), it is misunderstood as EMULATION_OK
      and x86_emulate_instruction() is called without having decoded the
      instruction.  This causes various havoc from running with a stale
      emulation context.
      
      The fix is to move the call to kvm_vcpu_check_breakpoint() where it was
      before commit 4aa2691d ("KVM: x86: Factor out x86 instruction
      emulation with decoding") introduced x86_decode_emulated_instruction().
      The other caller of the function does not need breakpoint checks,
      because it is invoked as part of a vmexit and the processor has already
      checked those before executing the instruction that #GP'd.
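
The mix-up described above can be modelled in a few lines. This is a simplified sketch, not the kernel code: the function names, the bool parameters, and the numeric values of the EMULATION_* constants are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for this model only. */
enum { EMULATION_FAILED = -1, EMULATION_OK = 0 };

/*
 * Model of a breakpoint check: returns true when a breakpoint was hit
 * and stores the desired run-loop result in *r (0 = exit to userspace).
 */
static bool check_breakpoint(bool hit, int *r)
{
	if (!hit)
		return false;
	*r = 0;
	return true;
}

/* Buggy shape: the run-loop result leaks out as an EMULATION_* value. */
static int decode_buggy(bool bp_hit)
{
	int r;

	if (check_breakpoint(bp_hit, &r))
		return r;		/* 0 is read by the caller as EMULATION_OK */
	return EMULATION_OK;		/* pretend decoding succeeded */
}

/*
 * Fixed shape: the caller checks for breakpoints itself, before decoding.
 * Returns true when the emulator should run; *decoded records whether an
 * instruction was actually decoded.
 */
static bool emulate_fixed(bool bp_hit, int *exit_code, bool *decoded)
{
	*decoded = false;
	if (check_breakpoint(bp_hit, exit_code))
		return false;		/* bail out, emulator never runs */
	*decoded = true;		/* decoding would happen here */
	return true;
}
```

In the buggy shape a breakpoint hit is indistinguishable from a successful decode, which is how the emulator ends up running on a stale context.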
      
      This fixes CVE-2022-1852.
      Reported-by: Qiuhao Li <qiuhao@sysec.org>
      Reported-by: Gaoning Pan <pgn@zju.edu.cn>
      Reported-by: Yongkang Jia <kangel@zju.edu.cn>
      Fixes: 4aa2691d ("KVM: x86: Factor out x86 instruction emulation with decoding")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220311032801.3467418-2-seanjc@google.com>
      [Rewrote commit message according to Qiuhao's report, since a patch
       already existed to fix the bug. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: LAPIC: Trace LAPIC timer expiration on every vmentry · e0ac5351
      Wanpeng Li committed
      In commit ec0671d5 ("KVM: LAPIC: Delay trace_kvm_wait_lapic_expire
      tracepoint to after vmexit", 2019-06-04), trace_kvm_wait_lapic_expire
      was moved after guest_exit_irqoff() because invoking tracepoints within
      kvm_guest_enter/kvm_guest_exit caused a lockdep splat.
      
      These days this is not necessary, because commit 87fa7f3e ("x86/kvm:
      Move context tracking where it belongs", 2020-07-09) restricted
      the RCU extended quiescent state to be closer to vmentry/vmexit.
      Moving the tracepoint back to __kvm_wait_lapic_expire is more accurate,
      because it will be reported even if vcpu_enter_guest causes multiple
      vmentries via the IPI/Timer fast paths, and it allows the removal of
      advance_expire_delta.
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1650961551-38390-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 12 May, 2022 4 commits
  3. 02 May, 2022 1 commit
  4. 30 April, 2022 5 commits
    • KVM: SVM: Introduce trace point for the slow-path of avic_kic_target_vcpus · 9f084f7c
      Suravee Suthikulpanit committed
      This can help identify potential performance issues when handling an
      AVIC incomplete IPI due to the target vCPU not running.
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220420154954.19305-3-suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: replace direct_map with root_role.direct · 347a0d0d
      Paolo Bonzini committed
      direct_map is always equal to the direct field of the root page's role:
      
      - for shadow paging, direct_map is true if CR0.PG=0 and root_role.direct is
      copied from cpu_role.base.direct
      
      - for TDP, it is always true and root_role.direct is also always true
      
      - for shadow TDP, it is always false and root_role.direct is also always
      false
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Clean up and document nested #PF workaround · 6819af75
      Sean Christopherson committed
      Replace the per-vendor hack-a-fix for KVM's #PF => #PF => #DF workaround
      with an explicit, common workaround in kvm_inject_emulated_page_fault().
      Aside from being a hack, the current approach is brittle and incomplete,
      e.g. nSVM's KVM_SET_NESTED_STATE fails to set ->inject_page_fault(),
      and nVMX fails to apply the workaround when VMX is intercepting #PF due
      to allow_smaller_maxphyaddr=1.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: fix bad user ABI for KVM_EXIT_SYSTEM_EVENT · d495f942
      Paolo Bonzini committed
      When KVM_EXIT_SYSTEM_EVENT was introduced, it included a flags
      member that at the time was unused.  Unfortunately this extensibility
      mechanism has several issues:
      
      - x86 is not writing the member, so it would not be possible to use it
        on x86 except for new events
      
      - the member is not aligned to 64 bits, so the definition of the
        uAPI struct is incorrect for 32- on 64-bit userspace.  This is a
        problem for RISC-V, which supports CONFIG_KVM_COMPAT, but fortunately
        usage of flags was only introduced in 5.18.
      
      Since padding has to be introduced, place a new field in there
      that tells if the flags field is valid.  To allow further extensibility,
      in fact, change flags to an array of 16 values, and store how many
      of the values are valid.  The availability of the new ndata field
      is tied to a system capability; all architectures are changed to
      fill in the field.
      
      To avoid breaking compilation of userspace that was using the flags
      field, provide a userspace-only union to overlap flags with data[0].
      The new field is placed at the same offset for both 32- and 64-bit
      userspace.
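
The layout problem and its fix can be sketched with plain C structs. The struct names below are hypothetical stand-ins for the uAPI definition, and the "old" shape (a 32-bit type followed directly by a 64-bit flags member) is an assumption based on the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assumed old shape: the offset of flags depends on the ABI's alignment
 * of 64-bit integers (4 on 32-bit x86, 8 on most 64-bit ABIs).
 */
struct old_system_event {
	uint32_t type;
	uint64_t flags;
};

/*
 * New shape as described: the former padding holds ndata, and flags
 * overlaps data[0] through a union so existing userspace still compiles.
 */
struct new_system_event {
	uint32_t type;
	uint32_t ndata;		/* number of valid entries in data[] */
	union {
		uint64_t flags;	/* userspace-only alias of data[0] */
		uint64_t data[16];
	};
};
```

Because two 32-bit members precede the union, data[] starts at offset 8 regardless of whether 64-bit integers are aligned to 4 or 8 bytes.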
      
      Cc: Will Deacon <will@kernel.org>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Message-Id: <20220422103013.34832-1-pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR · 86931ff7
      Sean Christopherson committed
      Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's
      MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR.
      The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the
      guest, possibly with help from userspace, manages to coerce KVM into
      creating a SPTE for an "impossible" gfn, KVM will leak the associated
      shadow pages (page tables):
      
        WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57
                                      kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
        Modules linked in: kvm_intel kvm irqbypass
        CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G        W         5.18.0-rc1+ #293
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
        Call Trace:
         <TASK>
         kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
         kvm_destroy_vm+0x162/0x2d0 [kvm]
         kvm_vm_release+0x1d/0x30 [kvm]
         __fput+0x82/0x240
         task_work_run+0x5b/0x90
         exit_to_user_mode_prepare+0xd2/0xe0
         syscall_exit_to_user_mode+0x1d/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
      
      On bare metal, encountering an impossible gpa in the page fault path is
      well and truly impossible, barring CPU bugs, as the CPU will signal #PF
      during the gva=>gpa translation (or a similar failure when stuffing a
      physical address into e.g. the VMCS/VMCB).  But if KVM is running as a VM
      itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR
      of the underlying hardware, in which case the hardware will not fault on
      the illegal-from-KVM's-perspective gpa.
      
      Alternatively, KVM could continue allowing the dodgy behavior and simply
      zap the max possible range.  But, for hosts with MAXPHYADDR < 52, that's
      a (minor) waste of cycles, and more importantly, KVM can't reasonably
      support impossible memslots when running on bare metal (or with an
      accurate MAXPHYADDR as a VM).  Note, limiting the overhead by checking if
      KVM is running as a guest is not a safe option as the host isn't required
      to announce itself to the guest in any way, e.g. doesn't need to set the
      HYPERVISOR CPUID bit.
      
      A second alternative to disallowing the memslot behavior would be to
      disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR.  That
      restriction is undesirable as there are legitimate use cases for doing
      so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous
      systems so that VMs can be migrated between hosts with different
      MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess.
      
      Note that any guest.MAXPHYADDR is valid with shadow paging, and it is
      even useful in order to test KVM with MAXPHYADDR=52 (i.e. without
      any reserved physical address bits).
      
      The now common kvm_mmu_max_gfn() is inclusive instead of exclusive.
      The memslot and TDP MMU code want an exclusive value, but the name
      implies the returned value is inclusive, and the MMIO path needs an
      inclusive check.
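
The inclusive/exclusive distinction can be illustrated with a small model. This is a sketch under assumptions: 4 KiB pages, and helper names (max_gfn, memslot_fits, mmio_gfn_ok) that are hypothetical and not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumes 4 KiB pages */

typedef uint64_t gfn_t;

/* Highest legal gfn for a given MAXPHYADDR, inclusive. */
static gfn_t max_gfn(unsigned int maxphyaddr_bits)
{
	return (1ULL << (maxphyaddr_bits - PAGE_SHIFT)) - 1;
}

/* Memslot-style check: wants an exclusive end, so add one. */
static bool memslot_fits(gfn_t base_gfn, uint64_t npages, unsigned int bits)
{
	gfn_t end_exclusive = max_gfn(bits) + 1;

	/* Written to avoid overflow in base_gfn + npages. */
	return npages <= end_exclusive && base_gfn <= end_exclusive - npages;
}

/* MMIO-style check: a single gfn, so the inclusive bound is used as-is. */
static bool mmio_gfn_ok(gfn_t gfn, unsigned int bits)
{
	return gfn <= max_gfn(bits);
}
```

For example, with a 46-bit MAXPHYADDR a one-page memslot at the last legal gfn fits, while a two-page memslot at the same base does not.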
      
      Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
      Fixes: 524a1e4e ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
      Cc: stable@vger.kernel.org
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220428233416.2446833-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. 22 April, 2022 6 commits
    • KVM: SEV: add cache flush to solve SEV cache incoherency issues · 683412cc
      Mingwei Zhang committed
      Flush the CPU caches when memory is reclaimed from an SEV guest (where
      reclaim also includes it being unmapped from KVM's memslots).  Due to lack
      of coherency for SEV encrypted memory, failure to flush results in silent
      data corruption if userspace is malicious/broken and doesn't ensure SEV
      guest memory is properly pinned and unpinned.
      
      Cache coherency is not enforced across the VM boundary in SEV (AMD APM
      vol.2 Section 15.34.7). Confidential cachelines generated by confidential
      VM guests have to be explicitly flushed on the host side. If a memory page
      containing dirty confidential cachelines is released by the VM and
      reallocated to another user, the cachelines may corrupt the new user's
      data at a later time.
      
      KVM takes a shortcut by assuming that all confidential memory remains
      pinned until the end of the VM's lifetime. Therefore, KVM does not flush
      caches at mmu_notifier invalidation events. Because of this incorrect
      assumption and the lack of cache flushing, malicious userspace can crash
      the host kernel by creating a malicious VM that continuously allocates and
      releases unpinned confidential memory pages while the VM is running.
      
      Add cache flush operations to the mmu_notifier operations to ensure that
      any physical memory leaving the guest VM gets flushed. In particular, hook
      the mmu_notifier_invalidate_range_start and mmu_notifier_release events and
      flush the cache accordingly. The hooks run after the mmu lock is released
      to avoid contention with other vCPUs.
      
      Cc: stable@vger.kernel.org
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Reported-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20220421031407.2516575-4-mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Skip KVM_GUESTDBG_BLOCKIRQ APICv update if APICv is disabled · 0047fb33
      Sean Christopherson committed
      Skip the APICv inhibit update for KVM_GUESTDBG_BLOCKIRQ if APICv is
      disabled at the module level to avoid having to acquire the mutex and
      potentially process all vCPUs. The DISABLE inhibit will (barring bugs)
      never be lifted, so piling on more inhibits is unnecessary.
      
      Fixes: cae72dcc ("KVM: x86: inhibit APICv when KVM_GUESTDBG_BLOCKIRQ active")
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220420013732.3308816-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a race · 423ecfea
      Sean Christopherson committed
      Make a KVM_REQ_APICV_UPDATE request when creating a vCPU with an
      in-kernel local APIC and APICv enabled at the module level.  Consuming
      kvm_apicv_activated() and stuffing vcpu->arch.apicv_active directly can
      race with __kvm_set_or_clear_apicv_inhibit(), as vCPU creation happens
      before the vCPU is fully onlined, i.e. it won't get the request made to
      "all" vCPUs.  If APICv is globally inhibited between setting apicv_active
      and onlining the vCPU, the vCPU will end up running with APICv enabled
      and trigger KVM's sanity check.
      
      Mark APICv as active during vCPU creation if APICv is enabled at the
      module level, both to be optimistic about its final state, e.g. to avoid
      additional VMWRITEs on VMX, and because there are likely bugs lurking
      since KVM checks apicv_active in multiple vCPU creation paths.  While
      keeping the current behavior of consuming kvm_apicv_activated() is
      arguably safer from a regression perspective, force apicv_active so that
      vCPU creation runs with deterministic state and so that if there are bugs,
      they are found sooner rather than later, i.e. not when some crazy race
      condition is hit.
      
        WARNING: CPU: 0 PID: 484 at arch/x86/kvm/x86.c:9877 vcpu_enter_guest+0x2ae3/0x3ee0 arch/x86/kvm/x86.c:9877
        Modules linked in:
        CPU: 0 PID: 484 Comm: syz-executor361 Not tainted 5.16.13 #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1~cloud0 04/01/2014
        RIP: 0010:vcpu_enter_guest+0x2ae3/0x3ee0 arch/x86/kvm/x86.c:9877
        Call Trace:
         <TASK>
         vcpu_run arch/x86/kvm/x86.c:10039 [inline]
         kvm_arch_vcpu_ioctl_run+0x337/0x15e0 arch/x86/kvm/x86.c:10234
         kvm_vcpu_ioctl+0x4d2/0xc80 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3727
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:874 [inline]
         __se_sys_ioctl fs/ioctl.c:860 [inline]
         __x64_sys_ioctl+0x16d/0x1d0 fs/ioctl.c:860
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The bug was hit by syzkaller spamming VM creation with 2 vCPUs and a
      call to KVM_SET_GUEST_DEBUG.
      
        r0 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000000), 0x0, 0x0)
        r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
        ioctl$KVM_CAP_SPLIT_IRQCHIP(r1, 0x4068aea3, &(0x7f0000000000)) (async)
        r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0) (async)
        r3 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x400000000000002)
        ioctl$KVM_SET_GUEST_DEBUG(r3, 0x4048ae9b, &(0x7f00000000c0)={0x5dda9c14aa95f5c5})
        ioctl$KVM_RUN(r2, 0xae80, 0x0)
      Reported-by: Gaoning Pan <pgn@zju.edu.cn>
      Reported-by: Yongkang Jia <kangel@zju.edu.cn>
      Fixes: 8df14af4 ("kvm: x86: Add support for dynamic APICv activation")
      Cc: stable@vger.kernel.org
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220420013732.3308816-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Tag APICv DISABLE inhibit, not ABSENT, if APICv is disabled · 80f0497c
      Sean Christopherson committed
      Set the DISABLE inhibit, not the ABSENT inhibit, if APICv is disabled via
      module param.  A recent refactoring to add a wrapper for setting/clearing
      inhibits unintentionally changed the flag, probably due to a copy+paste
      goof.
      
      Fixes: 4f4c4a3e ("KVM: x86: Trace all APICv inhibit changes and capture overall status")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220420013732.3308816-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Add helpers to wrap vcpu->srcu_idx and yell if it's abused · 2031f287
      Sean Christopherson committed
      Add wrappers to acquire/release KVM's SRCU lock when stashing the index
      in vcpu->srcu_idx, along with rudimentary detection of illegal usage,
      e.g. re-acquiring SRCU and thus overwriting vcpu->srcu_idx.  Because the
      SRCU index is (currently) either 0 or 1, illegal nesting bugs can go
      unnoticed for quite some time and only cause problems when the nested
      lock happens to get a different index.
      
      Wrap the WARNs in PROVE_RCU=y, and make them ONCE, otherwise KVM will
      likely yell so loudly that it will bring the kernel to its knees.
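
The detection idea can be sketched in userspace with a depth counter. Everything below is a model: struct vcpu_model, the warned flag (standing in for a one-shot WARN), and model_srcu_read_lock() are hypothetical and only illustrate how a nested acquire would be noticed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the vCPU fields involved. */
struct vcpu_model {
	int srcu_idx;	/* index stashed by the last acquire */
	int srcu_depth;	/* how many acquires are outstanding */
	bool warned;	/* models a one-shot warning */
};

/* Stand-in for the real lock; the index is either 0 or 1. */
static int model_srcu_read_lock(void)
{
	return 0;
}

static void vcpu_srcu_read_lock(struct vcpu_model *v)
{
	/* A non-zero depth means srcu_idx is about to be overwritten. */
	if (v->srcu_depth++ != 0)
		v->warned = true;
	v->srcu_idx = model_srcu_read_lock();
}

static void vcpu_srcu_read_unlock(struct vcpu_model *v)
{
	/* Depth must drop back to zero on a balanced release. */
	if (--v->srcu_depth != 0)
		v->warned = true;
}
```

A balanced lock/unlock pair leaves the flag clear, while a second acquire without an intervening release sets it, even when both acquires happen to return the same index.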
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Tested-by: Fabiano Rosas <farosas@linux.ibm.com>
      Message-Id: <20220415004343.2203171-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Don't re-acquire SRCU lock in complete_emulated_io() · 2d089356
      Sean Christopherson committed
      Don't re-acquire SRCU in complete_emulated_io() now that KVM acquires the
      lock in kvm_arch_vcpu_ioctl_run().  More importantly, don't overwrite
      vcpu->srcu_idx.  If the index acquired by complete_emulated_io() differs
      from the one acquired by kvm_arch_vcpu_ioctl_run(), KVM will effectively
      leak a lock and hang if/when synchronize_srcu() is invoked for the
      relevant grace period.
      
      Fixes: 8d25b7be ("KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220415004343.2203171-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 14 April, 2022 5 commits
  7. 12 April, 2022 1 commit
    • KVM: x86: hyper-v: Avoid writing to TSC page without an active vCPU · 42dcbe7d
      Vitaly Kuznetsov committed
      The following WARN is triggered from kvm_vm_ioctl_set_clock():
       WARNING: CPU: 10 PID: 579353 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:3161 mark_page_dirty_in_slot+0x6c/0x80 [kvm]
       ...
       CPU: 10 PID: 579353 Comm: qemu-system-x86 Tainted: G        W  O      5.16.0.stable #20
       Hardware name: LENOVO 20UF001CUS/20UF001CUS, BIOS R1CET65W(1.34 ) 06/17/2021
       RIP: 0010:mark_page_dirty_in_slot+0x6c/0x80 [kvm]
       ...
       Call Trace:
        <TASK>
        ? kvm_write_guest+0x114/0x120 [kvm]
        kvm_hv_invalidate_tsc_page+0x9e/0xf0 [kvm]
        kvm_arch_vm_ioctl+0xa26/0xc50 [kvm]
        ? schedule+0x4e/0xc0
        ? __cond_resched+0x1a/0x50
        ? futex_wait+0x166/0x250
        ? __send_signal+0x1f1/0x3d0
        kvm_vm_ioctl+0x747/0xda0 [kvm]
        ...
      
      The WARN was introduced by commit 03c0304a86bc ("KVM: Warn if
      mark_page_dirty() is called without an active vCPU") but the change seems
      to be correct (unlike the Hyper-V TSC page update mechanism). In fact,
      there's no real need to actually write to guest memory to invalidate the
      TSC page; this can be done by the first vCPU which goes through
      kvm_guest_time_update().
      Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220407201013.963226-1-vkuznets@redhat.com>
  8. 05 April, 2022 1 commit
    • KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded · 1d0e8480
      Sean Christopherson committed
      Resolve nx_huge_pages to true/false when kvm.ko is loaded; leaving it as
      -1 is technically undefined behavior when its value is read out by
      param_get_bool(), as boolean values are supposed to be '0' or '1'.
      
      Alternatively, KVM could define a custom getter for the param, but the
      auto value doesn't depend on the vendor module in any way, and printing
      "auto" would be unnecessarily unfriendly to the user.
      
      In addition to fixing the undefined behavior, resolving the auto value
      also fixes the scenario where the auto value resolves to N and no vendor
      module is loaded.  Previously, -1 would result in Y being printed even
      though KVM would ultimately disable the mitigation.
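
A minimal sketch of resolving a tri-state "auto" parameter at load time. The function names and the rule used for the auto default (mitigate only when the CPU has the bug and mitigations are not disabled) are assumptions for illustration, not a copy of the kernel logic.

```c
#include <assert.h>
#include <stdbool.h>

#define NX_HUGE_PAGES_AUTO (-1)

/* Assumed policy for the auto value. */
static bool nx_auto_default(bool cpu_has_bug, bool mitigations_off)
{
	return cpu_has_bug && !mitigations_off;
}

/*
 * Resolve the parameter once, at load time, so that later readers only
 * ever observe 0 or 1 and never the -1 sentinel.
 */
static void resolve_nx_huge_pages(int *param, bool cpu_has_bug,
				  bool mitigations_off)
{
	if (*param == NX_HUGE_PAGES_AUTO)
		*param = nx_auto_default(cpu_has_bug, mitigations_off);
}
```

An explicit 0 or 1 chosen by the user is left untouched; only the sentinel is rewritten.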
      
      Rename the existing MMU module init/exit helpers to clarify that they're
      invoked with respect to the vendor module, and add comments to document
      why KVM has two separate "module init" flows.
      
        =========================================================================
        UBSAN: invalid-load in kernel/params.c:320:33
        load of value 255 is not a valid value for type '_Bool'
        CPU: 6 PID: 892 Comm: tail Not tainted 5.17.0-rc3+ #799
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         <TASK>
         dump_stack_lvl+0x34/0x44
         ubsan_epilogue+0x5/0x40
         __ubsan_handle_load_invalid_value.cold+0x43/0x48
         param_get_bool.cold+0xf/0x14
         param_attr_show+0x55/0x80
         module_attr_show+0x1c/0x30
         sysfs_kf_seq_show+0x93/0xc0
         seq_read_iter+0x11c/0x450
         new_sync_read+0x11b/0x1a0
         vfs_read+0xf0/0x190
         ksys_read+0x5f/0xe0
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
        =========================================================================
      
      Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation")
      Cc: stable@vger.kernel.org
      Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220331221359.3912754-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  9. 02 April, 2022 14 commits
    • KVM: x86: optimize PKU branching in kvm_load_{guest|host}_xsave_state · 945024d7
      Jon Kohler committed
      kvm_load_{guest|host}_xsave_state handles xsave on vm entry and exit,
      part of which is managing memory protection key state. The latest
      arch.pkru is updated with a rdpkru, and if that doesn't match the base
      host_pkru (which is the case about 70% of the time), we issue a
      __write_pkru.
      
      To improve performance, implement the following optimizations:
       1. Reorder if conditions prior to wrpkru in both
          kvm_load_{guest|host}_xsave_state.
      
          Flip the ordering of the || condition so that XFEATURE_MASK_PKRU is
          checked first, which when instrumented in our environment appeared
          to be always true and less overall work than kvm_read_cr4_bits.
      
          For kvm_load_guest_xsave_state, hoist arch.pkru != host_pkru ahead
          one position. When instrumented, I saw this be true roughly ~70% of
          the time vs the other conditions which were almost always true.
          With this change, we will avoid 3rd condition check ~30% of the time.
      
       2. Wrap PKU sections with CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS,
          as if the user compiles out this feature, we should not have
          these branches at all.
      Signed-off-by: Jon Kohler <jon@nutanix.com>
      Message-Id: <20220324004439.6709-1-jon@nutanix.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: allow per cpu apicv inhibit reasons · d5fa597e
      Maxim Levitsky committed
      Add optional callback .vcpu_get_apicv_inhibit_reasons returning
      extra inhibit reasons that prevent APICv from working on this vCPU.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220322174050.241850-6-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Don't snapshot "max" TSC if host TSC is constant · 741e511b
      Sean Christopherson committed
      Don't snapshot tsc_khz into max_tsc_khz during KVM initialization if the
      host TSC is constant, in which case the actual TSC frequency will never
      change and thus capturing the "max" TSC during initialization is
      unnecessary, KVM can simply use tsc_khz during VM creation.
      
      On CPUs with constant TSC, but not a hardware-specified TSC frequency,
      snapshotting max_tsc_khz and using that to set a VM's default TSC
      frequency can lead to KVM thinking it needs to manually scale the guest's
      TSC if refining the TSC completes after KVM snapshots tsc_khz.  The
      actual frequency never changes, only the kernel's calculation of what
      that frequency is changes.  On systems without hardware TSC scaling, this
      either puts KVM into "always catchup" mode (extremely inefficient), or
      prevents creating VMs altogether.
      
      Ideally, KVM would not be able to race with TSC refinement, or would have
      a hook into tsc_refine_calibration_work() to get an alert when refinement
      is complete.  Avoiding the race altogether isn't practical as refinement
      takes a relative eternity; it's deliberately put on a work queue outside
      of the normal boot sequence to avoid unnecessarily delaying boot.
      
      Adding a hook is doable, but somewhat gross due to KVM's ability to be
      built as a module.  And if the TSC is constant, which is likely the case
      for every VMX/SVM-capable CPU produced in the last decade, the race can
      be hit if and only if userspace is able to create a VM before TSC
      refinement completes; refinement is slow, but not that slow.
      
      For now, punt on a proper fix, as not taking a snapshot can help some
      use cases and not taking a snapshot is arguably correct irrespective of
      the race with refinement.
      
      [ dwmw2: Rebase on top of KVM-wide default_tsc_khz to ensure that all
               vCPUs get the same frequency even if we hit the race. ]
      
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Anton Romanov <romanton@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20220225145304.36166-3-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Accept KVM_[GS]ET_TSC_KHZ as a VM ioctl. · ffbb61d0
      David Woodhouse committed
      This sets the default TSC frequency for subsequently created vCPUs.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20220225145304.36166-2-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/xen: Advertise and document KVM_XEN_HVM_CONFIG_EVTCHN_SEND · 661a20fa
      David Woodhouse committed
      At the end of the patch series adding this batch of event channel
      acceleration features, finally add the feature bit which advertises
      them and document it all.
      
      For SCHEDOP_poll we need to wake a polling vCPU when a given port
      is triggered, even when it's masked — and we want to implement that
      in the kernel, for efficiency. So we want the kernel to know that it
      has sole ownership of event channel delivery. Thus, we allow
      userspace to make the 'promise' by setting the corresponding feature
      bit in its KVM_XEN_HVM_CONFIG call. As we implement SCHEDOP_poll
      bypass later, we will do so only if that promise has been made by
      userspace.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-16-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_VCPU_ID · 942c2490
      David Woodhouse committed
      In order to intercept hypercalls such as VCPUOP_set_singleshot_timer, we
      need to be aware of the Xen CPU numbering.
      
      This looks a lot like the Hyper-V handling of vpidx, for obvious reasons.
      Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-12-dwmw2@infradead.org>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      942c2490
    • D
      KVM: x86/xen: Support direct injection of event channel events · 35025735
      David Woodhouse committed
      This adds a KVM_XEN_HVM_EVTCHN_SEND ioctl which allows direct injection
      of events given an explicit { vcpu, port, priority } in precisely the
      same form that those fields are given in the IRQ routing table.
      
      Userspace is currently able to inject 2-level events purely by setting
      the bits in the shared_info and vcpu_info, but FIFO event channels are
      harder to deal with; we will need the kernel to take sole ownership of
      delivery when we support those.
      
      A subsequent patch will advertise this feature with a new bit in
      KVM_CAP_XEN_HVM.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-9-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      35025735
    • D
      KVM: x86/xen: Use gfn_to_pfn_cache for vcpu_time_info · 69d413cf
      David Woodhouse committed
      This switches the final pvclock to kvm_setup_pvclock_pfncache() and now
      the old kvm_setup_pvclock_page() can be removed.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-7-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      69d413cf
    • D
      KVM: x86/xen: Use gfn_to_pfn_cache for vcpu_info · 7caf9571
      David Woodhouse committed
      Currently, the fast path of kvm_xen_set_evtchn_fast() doesn't set the
      index bits in the target vCPU's evtchn_pending_sel, because it only has
      a userspace virtual address with which to do so. It just sets them in
      the kernel, and kvm_xen_has_interrupt() then completes the delivery to
      the actual vcpu_info structure when the vCPU runs.
      
      Using a gfn_to_pfn_cache allows kvm_xen_set_evtchn_fast() to do the full
      delivery in the common case.
      
      Clean up the fallback case too, by moving the deferred delivery out into
      a separate kvm_xen_inject_pending_events() function which isn't ever
      called in atomic contexts as __kvm_xen_has_interrupt() is.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-6-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7caf9571
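The 2-level delivery that the commit above completes in the fast path can be sketched as follows. This is a simplified, non-atomic model for a 64-bit guest and is not the kernel's code: the `demo_` structures and `demo_deliver_2level()` are hypothetical stand-ins for the Xen shared_info and vcpu_info layouts, and real code would use atomic bit operations on guest memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WORD_BITS 64u                      /* assumed 64-bit guest word size */
#define MAX_PORTS (WORD_BITS * WORD_BITS)  /* 2-level limit: 64 words of 64 ports */

/* Hypothetical, trimmed-down view of the shared_info event channel bitmaps. */
struct demo_shared_info {
    uint64_t evtchn_pending[WORD_BITS];
    uint64_t evtchn_mask[WORD_BITS];
};

/* Hypothetical, trimmed-down view of the per-vCPU vcpu_info fields. */
struct demo_vcpu_info {
    uint8_t  evtchn_upcall_pending;
    uint64_t evtchn_pending_sel;   /* one "index bit" per pending word */
};

/*
 * Mark a port pending and, if it is not masked, set the index bit in the
 * target vCPU's evtchn_pending_sel and raise the upcall flag.
 * Returns true when the vCPU needs to be notified.
 */
static bool demo_deliver_2level(struct demo_shared_info *shi,
                                struct demo_vcpu_info *vi,
                                unsigned int port)
{
    assert(port < MAX_PORTS);

    unsigned int word = port / WORD_BITS;
    uint64_t bit = 1ULL << (port % WORD_BITS);

    if (shi->evtchn_pending[word] & bit)
        return false;              /* already pending, nothing new to deliver */
    shi->evtchn_pending[word] |= bit;

    if (shi->evtchn_mask[word] & bit)
        return false;              /* masked: stays pending, no upcall */

    vi->evtchn_pending_sel |= 1ULL << word;
    vi->evtchn_upcall_pending = 1;
    return true;
}
```

In this model, the last two stores are the part that previously had to be deferred until the vCPU ran, because the fast path had no kernel mapping of vcpu_info to write through.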
    • D
      KVM: x86: Use gfn_to_pfn_cache for pv_time · 916d3608
      David Woodhouse committed
      Add a new kvm_setup_guest_pvclock() which parallels the existing
      kvm_setup_pvclock_page(). The latter will be removed once we convert
      all users to the gfn_to_pfn_cache version.
      
      Using the new cache, we can potentially let kvm_set_guest_paused() set
      the PVCLOCK_GUEST_STOPPED bit directly rather than having to delegate
      to the vCPU via KVM_REQ_CLOCK_UPDATE. But not yet.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-5-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      916d3608
    • D
      KVM: x86/xen: Use gfn_to_pfn_cache for runstate area · a795cd43
      David Woodhouse committed
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220303154127.202856-4-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a795cd43
    • O
      KVM: x86: Allow userspace to opt out of hypercall patching · f1a9761f
      Oliver Upton committed
      KVM handles the VMCALL/VMMCALL instructions very strangely. Even though
      both of these instructions really should #UD when executed on the wrong
      vendor's hardware (i.e. VMCALL on SVM, VMMCALL on VMX), KVM replaces the
      guest's instruction with the appropriate instruction for the vendor.
      Nonetheless, older guest kernels without commit c1118b36 ("x86: kvm:
      use alternatives for VMCALL vs. VMMCALL if kernel text is read-only")
      do not patch in the appropriate instruction using alternatives, likely
      motivating KVM's intervention.
      
      Add a quirk allowing userspace to opt out of hypercall patching. If the
      quirk is disabled, KVM synthesizes a #UD in the guest.
      Signed-off-by: Oliver Upton <oupton@google.com>
      Message-Id: <20220316005538.2282772-2-oupton@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f1a9761f
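The decision described in the commit above can be modelled in a few lines. This is only an illustrative sketch, not KVM's implementation: the `demo_` names are hypothetical, and the only facts taken as given are the instruction encodings (VMCALL is `0f 01 c1`, VMMCALL is `0f 01 d9`) and the behavior stated in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum demo_vendor { DEMO_VMX, DEMO_SVM };
enum demo_result { DEMO_NATIVE, DEMO_PATCHED, DEMO_INJECT_UD };

static const uint8_t DEMO_VMCALL[3]  = { 0x0f, 0x01, 0xc1 };  /* Intel VMX */
static const uint8_t DEMO_VMMCALL[3] = { 0x0f, 0x01, 0xd9 };  /* AMD SVM */

/*
 * Decide what happens to a guest hypercall instruction on the given host.
 * With the quirk enabled, the wrong vendor's instruction is rewritten in
 * place; with it disabled, the guest gets a #UD and its code is untouched.
 */
static enum demo_result demo_fix_hypercall(enum demo_vendor host,
                                           bool quirk_enabled,
                                           uint8_t insn[3])
{
    const uint8_t *native = (host == DEMO_VMX) ? DEMO_VMCALL : DEMO_VMMCALL;

    /* This model only handles the two hypercall encodings. */
    assert(memcmp(insn, DEMO_VMCALL, 3) == 0 ||
           memcmp(insn, DEMO_VMMCALL, 3) == 0);

    if (memcmp(insn, native, 3) == 0)
        return DEMO_NATIVE;        /* right instruction for this vendor */
    if (!quirk_enabled)
        return DEMO_INJECT_UD;     /* userspace opted out of patching */

    memcpy(insn, native, 3);       /* legacy behavior: patch guest code */
    return DEMO_PATCHED;
}
```

A guest that executes VMMCALL on a VMX host therefore either has its instruction bytes rewritten to VMCALL (quirk enabled) or receives a #UD with its code left intact (quirk disabled).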
    • M
      KVM: x86: SVM: fix tsc scaling when the host doesn't support it · 88099313
      Maxim Levitsky committed
      It was decided that when TSC scaling is not supported,
      the virtual MSR_AMD64_TSC_RATIO should still have the default '1.0'
      value.
      
      However in this case kvm_max_tsc_scaling_ratio is not set,
      which breaks various assumptions.
      
      Fix this by always calculating kvm_max_tsc_scaling_ratio regardless of
      host support.  For consistency, do the same for VMX.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220322172449.235575-8-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      88099313
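To make the "default '1.0' value" in the commit above concrete: the TSC scaling ratio is a fixed-point number, so 1.0 is a 1 shifted left by the number of fractional bits. The sketch below assumes 32 fractional bits for SVM and 48 for VMX, which is background knowledge rather than something the commit message states; the `demo_` helpers are hypothetical, and `unsigned __int128` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SVM_FRAC_BITS 32u   /* assumed: 8.32 fixed-point ratio on SVM */
#define DEMO_VMX_FRAC_BITS 48u   /* assumed: 48 fractional bits on VMX */

/* The fixed-point encoding of a 1.0 ratio, i.e. "no scaling". */
static uint64_t demo_default_ratio(unsigned int frac_bits)
{
    assert(frac_bits > 0 && frac_bits < 64);
    return 1ULL << frac_bits;
}

/* Scale a TSC value: (tsc * ratio) >> frac_bits, using a 128-bit product. */
static uint64_t demo_scale_tsc(uint64_t tsc, uint64_t ratio,
                               unsigned int frac_bits)
{
    assert(frac_bits > 0 && frac_bits < 64);
    return (uint64_t)(((unsigned __int128)tsc * ratio) >> frac_bits);
}
```

With the default ratio the scaled value equals the input, which is why a guest can still see a meaningful 1.0 in the virtual MSR even when the host cannot scale.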
    • H
      KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr · ac8d6cad
      Hou Wenlong committed
      If an MSR access is rejected by MSR filtering,
      kvm_set_msr()/kvm_get_msr() return KVM_MSR_RET_FILTERED,
      and that return value is only handled properly for rdmsr/wrmsr.
      However, some instruction emulation and state transitions also
      use kvm_set_msr()/kvm_get_msr() for MSR access and may produce
      unexpected results if the access is rejected. For example, RDPID
      emulation would inject a #UD, but RDPID would not cause an exit
      when RDPID is supported in hardware and ENABLE_RDTSCP is set.
      Filtering would also cause failures when loading MSRs at nested
      entry/exit. Since MSR filtering is based on the MSR bitmap, it is
      better to do MSR filtering only for rdmsr/wrmsr.
      Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
      Message-Id: <2b2774154f7532c96a6f04d71c82a8bec7d9e80b.1646655860.git.houwenlong.hwl@antgroup.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ac8d6cad