1. 25 May 2022, 4 commits
    • KVM: x86: avoid calling x86 emulator without a decoded instruction · fee060cd
      Authored by Sean Christopherson
      Whenever x86_decode_emulated_instruction() detects a breakpoint, it
      returns the value that kvm_vcpu_check_breakpoint() writes into its
      pass-by-reference second argument.  Unfortunately this is completely
      bogus because the expected outcome of x86_decode_emulated_instruction
      is an EMULATION_* value.
      
      Then, if kvm_vcpu_check_breakpoint() does "*r = 0" (corresponding to
      a KVM_EXIT_DEBUG userspace exit), it is misunderstood as EMULATION_OK
      and x86_emulate_instruction() is called without having decoded the
      instruction.  This causes various havoc from running with a stale
      emulation context.
      
      The fix is to move the call to kvm_vcpu_check_breakpoint() where it was
      before commit 4aa2691d ("KVM: x86: Factor out x86 instruction
      emulation with decoding") introduced x86_decode_emulated_instruction().
      The other caller of the function does not need breakpoint checks,
      because it is invoked as part of a vmexit and the processor has already
      checked those before executing the instruction that #GP'd.
      
      This fixes CVE-2022-1852.
      Reported-by: Qiuhao Li <qiuhao@sysec.org>
      Reported-by: Gaoning Pan <pgn@zju.edu.cn>
      Reported-by: Yongkang Jia <kangel@zju.edu.cn>
      Fixes: 4aa2691d ("KVM: x86: Factor out x86 instruction emulation with decoding")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220311032801.3467418-2-seanjc@google.com>
      [Rewrote commit message according to Qiuhao's report, since a patch
       already existed to fix the bug. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fee060cd
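      A hedged sketch of the restored ordering, reconstructed from the
      description above; the flow and the abridged failure path are
      assumptions, not the verbatim upstream diff:

        /* Simplified x86_emulate_instruction() entry path; illustrative only. */
        int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
                                    int emulation_type, void *insn, int insn_len)
        {
                int r;

                if (!(emulation_type & EMULTYPE_NO_DECODE)) {
                        /*
                         * Check code breakpoints *before* decoding.  The helper's
                         * pass-by-reference result (e.g. 0 for a KVM_EXIT_DEBUG
                         * userspace exit) is returned as-is and can no longer be
                         * mistaken for EMULATION_OK.
                         */
                        if (kvm_vcpu_check_breakpoint(vcpu, &r))
                                return r;

                        r = x86_decode_emulated_instruction(vcpu, emulation_type,
                                                            insn, insn_len);
                        if (r != EMULATION_OK)
                                return handle_emulation_failure(vcpu, emulation_type);
                }

                /* ... emulate the now fully decoded instruction (unchanged) ... */
                return 1;
        }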
    • KVM: SVM: Use kzalloc for sev ioctl interfaces to prevent kernel data leak · d22d2474
      Authored by Ashish Kalra
      For some SEV ioctl interfaces, the length parameter that is passed may be
      less than or equal to SEV_FW_BLOB_MAX_SIZE, but larger than the data
      that PSP firmware returns. In this case, kmalloc will allocate memory
      that is the size of the input rather than the size of the data.
      Since PSP firmware doesn't fully overwrite the allocated buffer, these
      SEV ioctl interfaces may return uninitialized kernel slab memory.
      Reported-by: Andy Nguyen <theflow@google.com>
      Suggested-by: David Rientjes <rientjes@google.com>
      Suggested-by: Peter Gonda <pgonda@google.com>
      Cc: kvm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Fixes: eaf78265 ("KVM: SVM: Move SEV code to separate file")
      Fixes: 2c07ded0 ("KVM: SVM: add support for SEV attestation command")
      Fixes: 4cfdd47d ("KVM: SVM: Add KVM_SEV SEND_START command")
      Fixes: d3d1af85 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command")
      Fixes: eba04b20 ("KVM: x86: Account a variety of miscellaneous allocations")
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Reviewed-by: Peter Gonda <pgonda@google.com>
      Message-Id: <20220516154310.3685678-1-Ashish.Kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d22d2474
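      A minimal sketch of the allocation pattern the fix applies; the
      surrounding parameter handling is simplified and the variable names are
      illustrative:

        if (params.len > SEV_FW_BLOB_MAX_SIZE)
                return -EINVAL;

        /*
         * kzalloc(), not kmalloc(): the PSP may write fewer than params.len
         * bytes, and the untouched tail of a kmalloc()ed buffer would leak
         * stale slab contents to userspace when the blob is copied out.
         */
        blob = kzalloc(params.len, GFP_KERNEL_ACCOUNT);
        if (!blob)
                return -ENOMEM;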
    • x86/fpu: KVM: Set the base guest FPU uABI size to sizeof(struct kvm_xsave) · d187ba53
      Authored by Sean Christopherson
      Set the starting uABI size of KVM's guest FPU to 'struct kvm_xsave',
      i.e. to KVM's historical uABI size.  When saving FPU state for userspace,
      KVM (well, now the FPU) sets the FP+SSE bits in the XSAVE header even if
      the host doesn't support XSAVE.  Setting the XSAVE header allows the VM
      to be migrated to a host that does support XSAVE without the new host
      having to handle FPU state that may or may not be compatible with XSAVE.
      
      Setting the uABI size to the host's default size results in out-of-bounds
      writes (setting the FP+SSE bits) and data corruption (that is thankfully
      caught by KASAN) when running on hosts without XSAVE, e.g. on Core2 CPUs.
      
      WARN if the default size is larger than KVM's historical uABI size; all
      features that can push the FPU size beyond the historical size must be
      opt-in.
      
        ==================================================================
        BUG: KASAN: slab-out-of-bounds in fpu_copy_uabi_to_guest_fpstate+0x86/0x130
        Read of size 8 at addr ffff888011e33a00 by task qemu-build/681
        CPU: 1 PID: 681 Comm: qemu-build Not tainted 5.18.0-rc5-KASAN-amd64 #1
        Hardware name:  /DG35EC, BIOS ECG3510M.86A.0118.2010.0113.1426 01/13/2010
        Call Trace:
         <TASK>
         dump_stack_lvl+0x34/0x45
         print_report.cold+0x45/0x575
         kasan_report+0x9b/0xd0
         fpu_copy_uabi_to_guest_fpstate+0x86/0x130
         kvm_arch_vcpu_ioctl+0x72a/0x1c50 [kvm]
         kvm_vcpu_ioctl+0x47f/0x7b0 [kvm]
         __x64_sys_ioctl+0x5de/0xc90
         do_syscall_64+0x31/0x50
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
        Allocated by task 0:
        (stack is not available)
        The buggy address belongs to the object at ffff888011e33800
         which belongs to the cache kmalloc-512 of size 512
        The buggy address is located 0 bytes to the right of
         512-byte region [ffff888011e33800, ffff888011e33a00)
        The buggy address belongs to the physical page:
        page:0000000089cd4adb refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11e30
        head:0000000089cd4adb order:2 compound_mapcount:0 compound_pincount:0
        flags: 0x4000000000010200(slab|head|zone=1)
        raw: 4000000000010200 dead000000000100 dead000000000122 ffff888001041c80
        raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
        page dumped because: kasan: bad access detected
        Memory state around the buggy address:
         ffff888011e33900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         ffff888011e33980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        >ffff888011e33a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                           ^
         ffff888011e33a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
         ffff888011e33b00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
        ==================================================================
        Disabling lock debugging due to kernel taint
      
      Fixes: be50b206 ("kvm: x86: Add support for getting/setting expanded xstate buffer")
      Fixes: c60427dd ("x86/fpu: Add uabi_size to guest_fpu")
      Reported-by: Zdenek Kaspar <zkaspar82@gmail.com>
      Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: kvm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Tested-by: Zdenek Kaspar <zkaspar82@gmail.com>
      Message-Id: <20220504001219.983513-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d187ba53
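      A minimal sketch of the resulting guest-FPU setup, assuming an
      fpu_alloc_guest_fpstate()-style path; the field and config names follow
      the commit message rather than a verified copy of the patch:

        /* Guest FPU uABI sizing; illustrative. */
        gfpu->uabi_size = sizeof(struct kvm_xsave);

        /*
         * Only opt-in (dynamically enabled) xfeatures may grow the uABI
         * buffer, so a host without XSAVE never writes past a userspace
         * buffer sized to struct kvm_xsave.
         */
        if (WARN_ON_ONCE(fpu_user_cfg.default_size > gfpu->uabi_size))
                gfpu->uabi_size = fpu_user_cfg.default_size;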
    • KVM: LAPIC: Trace LAPIC timer expiration on every vmentry · e0ac5351
      Authored by Wanpeng Li
      In commit ec0671d5 ("KVM: LAPIC: Delay trace_kvm_wait_lapic_expire
      tracepoint to after vmexit", 2019-06-04), trace_kvm_wait_lapic_expire
      was moved after guest_exit_irqoff() because invoking tracepoints within
      kvm_guest_enter/kvm_guest_exit caused a lockdep splat.
      
      These days this is not necessary, because commit 87fa7f3e ("x86/kvm:
      Move context tracking where it belongs", 2020-07-09) restricted
      the RCU extended quiescent state to be closer to vmentry/vmexit.
      Moving the tracepoint back to __kvm_wait_lapic_expire is more accurate,
      because it will be reported even if vcpu_enter_guest causes multiple
      vmentries via the IPI/Timer fast paths, and it allows the removal of
      advance_expire_delta.
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1650961551-38390-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e0ac5351
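      A sketch of the new tracepoint placement in arch/x86/kvm/lapic.c; the
      body is abridged (the dynamic timer-advance tuning is omitted) and is
      an approximation of the code, not the exact patch:

        static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
        {
                struct kvm_lapic *apic = vcpu->arch.apic;
                u64 guest_tsc, tsc_deadline;

                tsc_deadline = apic->lapic_timer.expired_tscdeadline;
                apic->lapic_timer.expired_tscdeadline = 0;
                guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());

                /*
                 * Tracing here fires on every vmentry, including the IPI and
                 * timer fast paths, and removes the need to stash
                 * advance_expire_delta for a post-vmexit tracepoint.
                 */
                trace_kvm_wait_lapic_expire(vcpu->vcpu_id, guest_tsc - tsc_deadline);

                if (guest_tsc < tsc_deadline)
                        __wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
        }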
  2. 20 May 2022, 10 commits
  3. 17 May 2022, 1 commit
    • KVM: arm64: Fix hypercall bitmap writeback when vcpus have already run · 528ada28
      Authored by Marc Zyngier
      We generally want to disallow hypercall bitmaps being changed
      once vcpus have already run. But we must allow the write if
      the written value is unchanged so that userspace can rewrite
      the register file on reboot, for example.
      
      Without this, a QEMU-based VM will fail to reboot correctly.
      
      The original code was correct, and it is me that introduced
      the regression.
      
      Fixes: 05714cab ("KVM: arm64: Setup a framework for hypercall bitmap firmware registers")
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      528ada28
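      A hedged sketch of the intended semantics; the helper name, flag and
      locking below are illustrative assumptions based on the description
      above, not the upstream code:

        static int kvm_arm_set_fw_reg_bmap(struct kvm *kvm, unsigned long *bmap,
                                           u64 val)
        {
                int ret = 0;

                mutex_lock(&kvm->lock);

                /*
                 * Once any vCPU has run, only writes that leave the bitmap
                 * unchanged are accepted, so userspace can blindly restore
                 * the register file across a reboot without being rejected.
                 */
                if (test_bit(KVM_ARCH_FLAG_HAS_RAN_ONCE, &kvm->arch.flags)) {
                        if (val != *bmap)
                                ret = -EBUSY;
                        goto out;
                }

                *bmap = val;
        out:
                mutex_unlock(&kvm->lock);
                return ret;
        }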
  4. 16 May 2022, 5 commits
  5. 15 May 2022, 5 commits
  6. 13 May 2022, 1 commit
  7. 12 May 2022, 14 commits
    • KVM: x86/mmu: Speed up slot_rmap_walk_next for sparsely populated rmaps · 6ba1e04f
      Authored by Vipin Sharma
      Avoid calling handlers on empty rmap entries and skip to the next
      non-empty rmap entry.

      Empty rmap entries are a no-op in handlers.
      Signed-off-by: Vipin Sharma <vipinsh@google.com>
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220502220347.174664-1-vipinsh@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6ba1e04f
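      A sketch of the iterator change (slot_rmap_walk_next() in mmu.c),
      simplified from the commit's description; treat the exact field names
      as assumptions:

        static void slot_rmap_walk_next(struct slot_rmap_walk_iterator *iterator)
        {
                while (++iterator->rmap <= iterator->end_rmap) {
                        iterator->gfn += (1UL << KVM_HPAGE_GFN_SHIFT(iterator->level));

                        /* Only stop on rmaps that actually contain PTEs. */
                        if (iterator->rmap->val)
                                return;
                }

                if (++iterator->level > iterator->end_level) {
                        iterator->rmap = NULL;
                        return;
                }

                rmap_walk_init_level(iterator, iterator->level);
        }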
    • KVM: VMX: Include MKTME KeyID bits in shadow_zero_check · 3c5c3245
      Authored by Kai Huang
      Intel MKTME KeyID bits (including Intel TDX private KeyID bits) must
      never be set in an SPTE.  Set shadow_me_value to 0 and shadow_me_mask
      to all MKTME KeyID bits so that those bits end up in shadow_zero_check.
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <27bc10e97a3c0b58a4105ff9107448c190328239.1650363789.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3c5c3245
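      A sketch of the VMX-side setup this implies; the KeyID-range derivation
      from the physical address width is an assumption based on how MKTME
      reports KeyID bits, not a verified copy of the patch:

        static void vmx_setup_me_spte_mask(void)
        {
                u64 me_mask = 0;

                /*
                 * boot_cpu_data.x86_phys_bits has the MKTME KeyID bits already
                 * carved out, while the shadow physical bits reflect the full
                 * address width, so the delta is exactly the KeyID bit range.
                 */
                if (boot_cpu_data.x86_phys_bits != kvm_get_shadow_phys_bits())
                        me_mask = rsvd_bits(boot_cpu_data.x86_phys_bits,
                                            kvm_get_shadow_phys_bits() - 1);

                /* VMX sets no encryption bit, but all KeyID bits must be zero. */
                kvm_mmu_set_me_spte_mask(0, me_mask);
        }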
    • KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask · e54f1ff2
      Authored by Kai Huang
      Intel Multi-Key Total Memory Encryption (MKTME) repurposes a couple of
      high physical address bits as 'KeyID' bits.  Intel Trust Domain
      Extensions (TDX) further steals part of the MKTME KeyID bits as TDX
      private KeyID bits.  TDX private KeyID bits cannot be set in any mapping
      in the host kernel since they can only be accessed by software running
      inside a new CPU isolated mode.  And unlike AMD's SME, the host kernel
      doesn't set any legacy MKTME KeyID bits in any mapping either.
      Therefore, it's not legitimate for KVM to set any KeyID bits in an SPTE
      that maps guest memory.
      
      KVM maintains shadow_zero_check bits to represent which bits must be
      zero in an SPTE that maps guest memory.  MKTME KeyID bits should be set
      in shadow_zero_check.  Currently, shadow_me_mask is used by AMD to set
      sme_me_mask in the SPTE, and shadow_me_mask is excluded from
      shadow_zero_check.  So initializing shadow_me_mask to represent all
      MKTME KeyID bits doesn't work for VMX (where, on the contrary, those
      bits must be set in shadow_zero_check).
      
      Introduce a new 'shadow_me_value' to replace existing shadow_me_mask,
      and repurpose shadow_me_mask as 'all possible memory encryption bits'.
      The new schematic of them will be:
      
       - shadow_me_value: the memory encryption bit(s) that will be set to the
         SPTE (the original shadow_me_mask).
       - shadow_me_mask: all possible memory encryption bits (which is a super
         set of shadow_me_value).
       - For now, shadow_me_value is supposed to be set by SVM and VMX
         respectively, and it is a constant during KVM's life time.  This
         perhaps doesn't fit MKTME but for now host kernel doesn't support it
         (and perhaps will never do).
       - Bits in shadow_me_mask are set to shadow_zero_check, except the bits
         in shadow_me_value.
      
      Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them.
      Replace shadow_me_mask with shadow_me_value in almost all code paths,
      except the one in PT64_PERM_MASK, which is used by need_remote_flush()
      to determine whether remote TLB flush is needed.  This should still use
      shadow_me_mask as any encryption bit change should need a TLB flush.
      And for AMD, move initializing shadow_me_value/shadow_me_mask from
      kvm_mmu_reset_all_pte_masks() to svm_hardware_setup().
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <f90964b93a3398b1cf1c56f510f3281e0709e2ab.1650363789.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e54f1ff2
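      A minimal sketch of the new helper; the subset sanity check and how it
      reacts to a violation are assumptions, only the two assignments follow
      directly from the description above:

        void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask)
        {
                /* shadow_me_value must be a subset of shadow_me_mask. */
                if (WARN_ON(me_value & ~me_mask))
                        me_value = me_mask = 0;

                shadow_me_value = me_value;
                shadow_me_mask = me_mask;
        }

      SVM would then pass sme_me_mask for both arguments from
      svm_hardware_setup(), while VMX passes a zero value with the full KeyID
      mask, and shadow_zero_check picks up shadow_me_mask & ~shadow_me_value.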
    • KVM: x86/mmu: Rename reset_rsvds_bits_mask() · c919e881
      Authored by Kai Huang
      Rename reset_rsvds_bits_mask() to reset_guest_rsvds_bits_mask() to make
      it clearer that it resets the reserved bits check for guest's page table
      entries.
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <efdc174b85d55598880064b8bf09245d3791031d.1650363789.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c919e881
    • KVM: x86/mmu: Expand and clean up page fault stats · 1075d41e
      Authored by Sean Christopherson
      Expand and clean up the page fault stats.  The current stats are at best
      incomplete, and at worst misleading.  Differentiate between faults that
      are actually fixed vs those that result in an MMIO SPTE being created,
      track faults that are spurious, faults that trigger emulation, faults
      that are fixed in the fast path, and last but not least, track the
      number of faults that are taken.
      
      Note, the number of faults that require emulation for write-protected
      shadow pages can roughly be calculated by subtracting the number of MMIO
      SPTEs created from the overall number of faults that trigger emulation.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1075d41e
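      A hedged sketch of the resulting per-vCPU counters; the field names are
      inferred from the description above and the exact set is an assumption:

        struct kvm_vcpu_stat {
                /* ... existing stats ... */
                u64 pf_taken;             /* every fault handled by KVM        */
                u64 pf_fixed;             /* faults fixed by installing a SPTE */
                u64 pf_emulate;           /* faults that trigger emulation     */
                u64 pf_spurious;          /* already fixed by another vCPU     */
                u64 pf_fast;              /* fixed in the lockless fast path   */
                u64 pf_mmio_spte_created; /* faults answered with an MMIO SPTE */
        };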
    • KVM: x86/mmu: Use IS_ENABLED() to avoid RETPOLINE for TDP page faults · 8d5265b1
      Authored by Sean Christopherson
      Use IS_ENABLED() instead of an #ifdef to activate the anti-RETPOLINE fast
      path for TDP page faults.  The generated code is identical, and the #ifdef
      makes it dangerously difficult to extend the logic (guess who forgot to
      add an "else" inside the #ifdef and ran through the page fault handler
      twice).
      
      No functional or binary change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8d5265b1
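      A sketch of the pattern in a kvm_mmu_do_page_fault()-style dispatcher;
      the struct initialization and handler names are simplified assumptions:

        static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu,
                                                gpa_t cr2_or_gpa, u64 err,
                                                bool prefetch)
        {
                struct kvm_page_fault fault = {
                        .addr = cr2_or_gpa,
                        .error_code = err,
                        .prefetch = prefetch,
                        .is_tdp = likely(vcpu->arch.mmu->page_fault ==
                                         kvm_tdp_page_fault),
                };

                /*
                 * IS_ENABLED() compiles the branch away exactly like the old
                 * #ifdef, but keeps both arms visible, so a missing "else"
                 * shows up as obviously duplicated work instead of silently
                 * running the fault handler twice.
                 */
                if (IS_ENABLED(CONFIG_RETPOLINE) && fault.is_tdp)
                        return kvm_tdp_page_fault(vcpu, &fault);

                return vcpu->arch.mmu->page_fault(vcpu, &fault);
        }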
    • KVM: x86/mmu: Make all page fault handlers internal to the MMU · 8a009d5b
      Authored by Sean Christopherson
      Move kvm_arch_async_page_ready() to mmu.c where it belongs, and move all
      of the page fault handling collateral that was in mmu.h purely for the
      async #PF handler into mmu_internal.h, where it belongs.  This will allow
      kvm_mmu_do_page_fault() to act on the RET_PF_* return without having to
      expose those enums outside of the MMU.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8a009d5b
    • KVM: x86/mmu: Add RET_PF_CONTINUE to eliminate bool+int* "returns" · 5276c616
      Authored by Sean Christopherson
      Add RET_PF_CONTINUE and use it in handle_abnormal_pfn() and
      kvm_faultin_pfn() to signal that the page fault handler should continue
      doing its thing.  Aside from being gross and inefficient, using a boolean
      return to signal continue vs. stop makes it extremely difficult to add
      more helpers and/or move existing code to a helper.
      
      E.g. hypothetically, if nested MMUs were to gain a separate page fault
      handler in the future, everything up to the "is self-modifying PTE" check
      can be shared by all shadow MMUs, but communicating up the stack whether
      to continue on or stop becomes a nightmare.
      
      More concretely, proposed support for private guest memory ran into a
      similar issue, where it'll be forced to forego a helper in order to yield
      sane code: https://lore.kernel.org/all/YkJbxiL%2FAz7olWlq@google.com.
      
      No functional change intended.
      
      Cc: David Matlack <dmatlack@google.com>
      Cc: Chao Peng <chao.p.peng@linux.intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5276c616
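      A sketch of the calling convention after the change; the fault handler
      body is heavily abridged and the helper signatures are assumptions:

        enum {
                RET_PF_CONTINUE = 0,  /* keep going, fault not yet resolved */
                RET_PF_RETRY,
                RET_PF_EMULATE,
                RET_PF_INVALID,
                RET_PF_FIXED,
                RET_PF_SPURIOUS,
        };

        static int direct_page_fault(struct kvm_vcpu *vcpu,
                                     struct kvm_page_fault *fault)
        {
                int r;

                r = kvm_faultin_pfn(vcpu, fault);
                if (r != RET_PF_CONTINUE)
                        return r;  /* one value says both "stop" and "why" */

                r = handle_abnormal_pfn(vcpu, fault, ACC_ALL);
                if (r != RET_PF_CONTINUE)
                        return r;

                /* ... install the SPTE under mmu_lock ... */
                return RET_PF_FIXED;
        }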
    • KVM: x86/mmu: Drop exec/NX check from "page fault can be fast" · 5c64aba5
      Authored by Sean Christopherson
      Tweak the "page fault can be fast" logic to explicitly check for !PRESENT
      faults in the access tracking case, and drop the exec/NX check that
      becomes redundant as a result.  No sane hardware will generate an access
      that is both an instruction fetch and a write, i.e. it's a waste of cycles.
      If hardware goes off the rails, or KVM runs under a misguided hypervisor,
      spuriously running through the fast path is benign (KVM has unknowingly
      been doing exactly that for years).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5c64aba5
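      The resulting predicate, as a simplified sketch of
      page_fault_can_be_fast() after this change and the next one below:

        static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
        {
                if (fault->rsvd)
                        return false;

                /*
                 * A !PRESENT fault can only be handled fast if it might be an
                 * access-tracked SPTE, i.e. only when A/D bits are disabled.
                 */
                if (!fault->present)
                        return !kvm_ad_enabled();

                /*
                 * Present faults are fixable fast only for write-protection;
                 * the old exec/NX special case is gone, since no sane access
                 * is both an instruction fetch and a write.
                 */
                return fault->write;
        }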
    • KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use · 54275f74
      Authored by Sean Christopherson
      Check for A/D bits being disabled instead of the access tracking mask
      being non-zero when deciding whether or not to attempt to fix a page
      fault via the fast path.  Originally, the access tracking mask was
      non-zero if and only if A/D bits were disabled by _KVM_ (including not
      being supported by hardware), but that hasn't been true since nVMX was
      fixed to honor EPTP12's A/D enabling, i.e. since KVM allowed L1 to cause
      KVM to not use A/D bits while running L2 despite KVM using them while
      running L1.
      
      In other words, don't attempt the fast path just because EPT is enabled.
      
      Note, attempting the fast path for all !PRESENT faults can "fix" a very,
      _VERY_ tiny percentage of faults out of mmu_lock by detecting that the
      fault is spurious, i.e. has been fixed by a different vCPU, but again the
      odds of that happening are vanishingly small.  E.g. booting an 8-vCPU VM
      gets less than 10 successes out of 30k+ faults, and that's likely one of
      the more favorable scenarios.  Disabling dirty logging can likely lead to
      a rash of collisions between vCPUs for some workloads that operate on a
      common set of pages, but penalizing _all_ !PRESENT faults for that one
      case is unlikely to be a net positive, not to mention that that problem
      is best solved by not zapping in the first place.
      
      The number of spurious faults does scale with the number of vCPUs, e.g. a
      255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast
      path (again out of 30k), but that's all of 0.2% of faults.  Using legacy
      shadow paging does get more spurious faults, and a few more detected out
      of mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring
      faults that are reflected into the guest), i.e. the extra detections are
      purely due to the sheer number of faults observed.
      
      On the other hand, getting a "negative" in the fast path takes in the
      neighborhood of 150-250 cycles.  So while it is tempting to keep/extend
      the current behavior, such a change needs to come with hard numbers
      showing that it's actually a win in the grand scheme, or any scheme for
      that matter.
      
      Fixes: 995f00a6 ("x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      54275f74
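      The "A/D bits actually in use" test that replaces the access-tracking
      mask check is tiny; a sketch, assuming the spte.h-style helper:

        /* A/D bits are enabled iff a non-zero accessed mask is in use. */
        static inline bool kvm_ad_enabled(void)
        {
                return !!shadow_accessed_mask;
        }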
    • KVM: VMX: clean up pi_wakeup_handler · 91ab933f
      Authored by Li RongQing
      Passing per_cpu() to list_for_each_entry() causes the macro to be
      evaluated N+1 times for N sleeping vCPUs.  This is a very small
      inefficiency, and the code is cleaner if the address of the per-CPU
      variable is loaded earlier.  Do this for both the list and the spinlock.
      Signed-off-by: Li RongQing <lirongqing@baidu.com>
      Message-Id: <1649244302-6777-1-git-send-email-lirongqing@baidu.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      91ab933f
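      A sketch of the cleaned-up handler (vmx/posted_intr.c); the per-CPU
      variable names and the wakeup condition are assumptions based on the
      description above:

        void pi_wakeup_handler(void)
        {
                int cpu = smp_processor_id();
                struct list_head *wakeup_list = &per_cpu(wakeup_vcpus_on_cpu, cpu);
                raw_spinlock_t *spinlock = &per_cpu(wakeup_vcpus_on_cpu_lock, cpu);
                struct vcpu_vmx *vmx;

                /* Resolve the per-CPU addresses once, not on every iteration. */
                raw_spin_lock(spinlock);
                list_for_each_entry(vmx, wakeup_list, pi_wakeup_list) {
                        if (pi_test_on(&vmx->pi_desc))
                                kvm_vcpu_wake_up(&vmx->vcpu);
                }
                raw_spin_unlock(spinlock);
        }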
    • KVM: x86: fix typo in __try_cmpxchg_user causing non-atomicness · 33fbe6be
      Authored by Maxim Levitsky
      This shows up as a TDP MMU leak when running nested.  The non-working
      cmpxchg on L0 makes L1 install two different shadow pages under the same
      SPTE, and one of them is leaked.
      
      Fixes: 1c2361f6 ("KVM: x86: Use __try_cmpxchg_user() to emulate atomic accesses")
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220512101420.306759-1-mlevitsk@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      33fbe6be
    • arm64: Enable repeat tlbi workaround on KRYO4XX gold CPUs · 51f559d6
      Authored by Shreyas K K
      Add KRYO4XX gold/big cores to the list of CPUs that need the
      repeat TLBI workaround. Apply this to the affected
      KRYO4XX cores (rcpe to rfpe).
      
      The variant and revision bits are implementation defined and are
      different from those of the Cortex CPU counterparts on which these
      cores are based, i.e., (r0p0 to r3p0) is equivalent to (rcpe to rfpe).
      Signed-off-by: Shreyas K K <quic_shrekk@quicinc.com>
      Reviewed-by: Sai Prakash Ranjan <quic_saipraka@quicinc.com>
      Link: https://lore.kernel.org/r/20220512110134.12179-1-quic_shrekk@quicinc.com
      Signed-off-by: Will Deacon <will@kernel.org>
      51f559d6
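      The kind of entry this adds to the repeat-TLBI list in
      arch/arm64/kernel/cpu_errata.c, as a sketch; the exact MIDR macro
      arguments encoding rcpe..rfpe are an assumption:

        #ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
                {
                        /* Kryo4xx Gold (rcpe to rfpe) => (r0p0 to r3p0) */
                        ERRATA_MIDR_RANGE(MIDR_QCOM_KRYO_4XX_GOLD, 0xc, 0xe, 0xf, 0xe),
                },
        #endif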