1. 25 Feb, 2022 7 commits
    • x86/kvm: Don't use PV TLB/yield when mwait is advertised · 40cd58db
      Wanpeng Li committed
      MWAIT is advertised when the host is not overcommitted, whereas PV
      TLB/sched yield should be enabled when the host is overcommitted. Add
      an MWAIT check when enabling PV TLB/sched yield.
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1645777780-2581-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Merge tag 'kvmarm-fixes-5.17-4' of... · ece32a75
      Paolo Bonzini committed
      Merge tag 'kvmarm-fixes-5.17-4' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
      
      KVM/arm64 fixes for 5.17, take #4
      
      - Correctly synchronise PMR and co on PSCI CPU_SUSPEND
      
      - Skip tests that depend on GICv3 when the HW isn't available
    • KVM: selftests: aarch64: Skip tests if we can't create a vgic-v3 · 456f89e0
      Mark Brown committed
      The arch_timer and vgic_irq kselftests assume that they can create a
      vgic-v3, using the library function vgic_v3_setup() which aborts with a
      test failure if it is not possible to do so. Since vgic-v3 can only be
      instantiated on systems where the host has GICv3 this leads to false
      positives on older systems where that is not the case.
      
      Fix this by changing vgic_v3_setup() to return an error if the vgic can't
      be instantiated and have the callers skip if this happens. We could also
      exit flagging a skip in vgic_v3_setup() but this would prevent future test
      cases conditionally deciding which GIC to use or generally doing more
      complex output.
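The "return an error and let the caller skip" pattern described above can be sketched in plain C as follows. This is only an illustration: `fake_vgic_v3_setup()` and `run_test()` are hypothetical stand-ins, not the selftest library functions, and the exit codes mirror the kselftest convention (4 means skipped).

```c
#include <stdio.h>

#define KSFT_PASS 0
#define KSFT_SKIP 4 /* kselftest exit code for a skipped test */

/*
 * Hypothetical stand-in for vgic_v3_setup(): returns a non-negative
 * descriptor on success and a negative value when no vgic-v3 can be
 * created, instead of aborting the whole test.
 */
static int fake_vgic_v3_setup(int host_has_gicv3)
{
	return host_has_gicv3 ? 3 : -1;
}

/* The caller, not the library, decides that a missing vgic-v3 means "skip". */
static int run_test(int host_has_gicv3)
{
	int gic_fd = fake_vgic_v3_setup(host_has_gicv3);

	if (gic_fd < 0) {
		printf("skip: failed to create vgic-v3\n");
		return KSFT_SKIP;
	}

	/* ... the real test body would run here ... */
	return KSFT_PASS;
}
```

Keeping the skip decision in the caller leaves room for a future test to fall back to a different GIC instead of skipping.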
      Signed-off-by: Mark Brown <broonie@kernel.org>
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Tested-by: Ricardo Koller <ricarkol@google.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220223131624.1830351-1-broonie@kernel.org
    • Revert "KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()" · 1a715810
      Sean Christopherson committed
      Revert back to refreshing vmcs.HOST_CR3 immediately prior to VM-Enter.
      The PCID (ASID) part of CR3 can be bumped without KVM being scheduled
      out, as the kernel will switch CR3 during __text_poke(), e.g. in response
      to a static key toggling.  If switch_mm_irqs_off() chooses a new ASID for
      the mm associated with KVM, KVM will do VM-Enter => VM-Exit with a stale
      vmcs.HOST_CR3.
      
      Add a comment to explain why KVM must wait until VM-Enter is imminent to
      refresh vmcs.HOST_CR3.
      
      The following splat was captured by stashing vmcs.HOST_CR3 in kvm_vcpu
      and adding a WARN in load_new_mm_cr3() to fire if a new ASID is being
      loaded for the KVM-associated mm while KVM has a "running" vCPU:
      
        static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
        {
      	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
      
      	...
      
      	WARN(vcpu && (vcpu->cr3 & GENMASK(11, 0)) != (new_mm_cr3 & GENMASK(11, 0)) &&
      	     (vcpu->cr3 & PHYSICAL_PAGE_MASK) == (new_mm_cr3 & PHYSICAL_PAGE_MASK),
      	     "KVM is hosed, loading CR3 = %lx, vmcs.HOST_CR3 = %lx", new_mm_cr3, vcpu->cr3);
        }
      
        ------------[ cut here ]------------
        KVM is hosed, loading CR3 = 8000000105393004, vmcs.HOST_CR3 = 105393003
        WARNING: CPU: 4 PID: 20717 at arch/x86/mm/tlb.c:291 load_new_mm_cr3+0x82/0xe0
        Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel
        CPU: 4 PID: 20717 Comm: stable Tainted: G        W         5.17.0-rc3+ #747
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:load_new_mm_cr3+0x82/0xe0
        RSP: 0018:ffffc9000489fa98 EFLAGS: 00010082
        RAX: 0000000000000000 RBX: 8000000105393004 RCX: 0000000000000027
        RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277d1b788
        RBP: 0000000000000004 R08: ffff888277d1b780 R09: ffffc9000489f8b8
        R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
        R13: ffff88810678a800 R14: 0000000000000004 R15: 0000000000000c33
        FS:  00007fa9f0e72700(0000) GS:ffff888277d00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 00000001001b5003 CR4: 0000000000172ea0
        Call Trace:
         <TASK>
         switch_mm_irqs_off+0x1cb/0x460
         __text_poke+0x308/0x3e0
         text_poke_bp_batch+0x168/0x220
         text_poke_finish+0x1b/0x30
         arch_jump_label_transform_apply+0x18/0x30
         static_key_slow_inc_cpuslocked+0x7c/0x90
         static_key_slow_inc+0x16/0x20
         kvm_lapic_set_base+0x116/0x190
         kvm_set_apic_base+0xa5/0xe0
         kvm_set_msr_common+0x2f4/0xf60
         vmx_set_msr+0x355/0xe70 [kvm_intel]
         kvm_set_msr_ignored_check+0x91/0x230
         kvm_emulate_wrmsr+0x36/0x120
         vmx_handle_exit+0x609/0x6c0 [kvm_intel]
         kvm_arch_vcpu_ioctl_run+0x146f/0x1b80
         kvm_vcpu_ioctl+0x279/0x690
         __x64_sys_ioctl+0x83/0xb0
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
        ---[ end trace 0000000000000000 ]---
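As a rough userspace illustration of the condition the WARN above checks, the sketch below compares the PCID bits (11:0) and the page-frame bits of two CR3 values. The helper name `host_cr3_is_stale()` and the 52-bit physical address mask are assumptions made for this example, not kernel definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits 11:0 of CR3 hold the PCID (GENMASK(11, 0) in the WARN above). */
#define CR3_PCID_MASK 0xFFFULL
/* Assumed page-frame mask: bits 51:12, which also drops the bit 63 flag. */
#define CR3_ADDR_MASK 0x000FFFFFFFFFF000ULL

/*
 * True when the cached HOST_CR3 points at the same page tables as the CR3
 * being loaded but carries a different PCID, i.e. the cached copy is stale.
 */
static bool host_cr3_is_stale(uint64_t cached_host_cr3, uint64_t new_cr3)
{
	bool same_mm = (cached_host_cr3 & CR3_ADDR_MASK) ==
		       (new_cr3 & CR3_ADDR_MASK);
	bool pcid_differs = (cached_host_cr3 & CR3_PCID_MASK) !=
			    (new_cr3 & CR3_PCID_MASK);

	return same_mm && pcid_differs;
}
```

With the values from the splat, `host_cr3_is_stale(0x105393003, 0x8000000105393004)` is true: the page-table address is the same, but the PCID changed from 3 to 4.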
      
      This reverts commit 15ad9762.
      
      Fixes: 15ad9762 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
      Reported-by: Wanpeng Li <kernellwp@gmail.com>
      Cc: Lai Jiangshan <laijs@linux.alibaba.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Acked-by: Lai Jiangshan <jiangshanlai@gmail.com>
      Message-Id: <20220224191917.3508476-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "KVM: VMX: Save HOST_CR3 in vmx_set_host_fs_gs()" · bca06b85
      Sean Christopherson committed
      Undo a nested VMX fix as a step toward reverting the commit it fixed,
      15ad9762 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()"),
      as the underlying premise that "host CR3 in the vcpu thread can only be
      changed when scheduling" is wrong.
      
      This reverts commit a9f2705e.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220224191917.3508476-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: nSVM: disallow userspace setting of MSR_AMD64_TSC_RATIO to non... · e910a53f
      Maxim Levitsky committed
      KVM: x86: nSVM: disallow userspace setting of MSR_AMD64_TSC_RATIO to non default value when tsc scaling disabled
      
      If nested TSC scaling is disabled, MSR_AMD64_TSC_RATIO should
      never have a non-default value.
      
      Due to the way nested TSC scaling support was implemented in QEMU,
      it would set this MSR to 0 when nested TSC scaling was disabled.
      Ignore that value for now, as it causes no harm.
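The policy described above can be modelled as a small decision function. This is a sketch, not the KVM code: the enum, the function name, and the constant `TSC_RATIO_DEFAULT_SKETCH` (1.0 in 8.32 fixed point) are assumptions introduced for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed default ratio: 1.0 in 8.32 fixed-point format. */
#define TSC_RATIO_DEFAULT_SKETCH 0x0100000000ULL

enum msr_write_result {
	MSR_WRITE_STORE,  /* accept and store the value */
	MSR_WRITE_IGNORE, /* accept the write but keep the default */
	MSR_WRITE_REJECT, /* fail the write */
};

/* Decide what to do with a userspace write to the TSC ratio MSR. */
static enum msr_write_result
tsc_ratio_userspace_write(bool nested_tsc_scaling_enabled, uint64_t data)
{
	if (nested_tsc_scaling_enabled)
		return MSR_WRITE_STORE; /* further validation is out of scope here */

	if (data == TSC_RATIO_DEFAULT_SKETCH)
		return MSR_WRITE_STORE; /* default value is always fine */

	if (data == 0)
		return MSR_WRITE_IGNORE; /* tolerate the value QEMU writes */

	return MSR_WRITE_REJECT; /* any other non-default value is disallowed */
}
```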
      
      Fixes: 5228eb96 ("KVM: x86: nSVM: implement nested TSC scaling")
      Cc: stable@vger.kernel.org
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220223115649.319134-1-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: make apf token non-zero to fix bug · 6f3c1fc5
      Liang Zhang committed
      In the current async page fault logic, when a page is ready, KVM relies on
      kvm_arch_can_dequeue_async_page_present() to determine whether to deliver
      a READY event to the guest. This function tests the token value of struct
      kvm_vcpu_pv_apf_data, which the guest kernel must reset to zero once it
      has finished handling a READY event. A zero value means the previous READY
      event is done, so KVM can deliver another.
      However, kvm_arch_setup_async_pf() may produce a valid token whose value
      is zero, which is indistinguishable from the "done" state above and may
      cause this READY event to be lost.
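One way to guarantee a non-zero token is sketched below. The token layout (a per-vCPU counter shifted left by 12 bits, OR-ed with the vCPU id) and the names `struct apf_state` and `next_apf_token()` are assumptions for this example; the commit message itself does not spell out the layout.

```c
#include <stdint.h>

/* Hypothetical per-vCPU state for the sketch. */
struct apf_state {
	uint32_t id;      /* monotonically increasing counter */
	uint32_t vcpu_id; /* assumed to fit in the low 12 bits */
};

/*
 * Build the next token. If the shifted counter would be zero (at start-up
 * or after the 32-bit value wraps), restart the counter at 1 so that the
 * token can never equal the "no event pending" value of zero.
 */
static uint32_t next_apf_token(struct apf_state *s)
{
	if ((uint32_t)(s->id << 12) == 0)
		s->id = 1;

	return (s->id++ << 12) | s->vcpu_id;
}
```

For vCPU 0 with a fresh counter, the first token is 0x1000 rather than 0, so the guest can always tell a pending READY event from a completed one.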
      
      This bug may cause a task to block forever in the guest:
       INFO: task stress:7532 blocked for more than 1254 seconds.
             Not tainted 5.10.0 #16
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       task:stress          state:D stack:    0 pid: 7532 ppid:  1409
       flags:0x00000080
       Call Trace:
        __schedule+0x1e7/0x650
        schedule+0x46/0xb0
        kvm_async_pf_task_wait_schedule+0xad/0xe0
        ? exit_to_user_mode_prepare+0x60/0x70
        __kvm_handle_async_pf+0x4f/0xb0
        ? asm_exc_page_fault+0x8/0x30
        exc_page_fault+0x6f/0x110
        ? asm_exc_page_fault+0x8/0x30
        asm_exc_page_fault+0x1e/0x30
       RIP: 0033:0x402d00
       RSP: 002b:00007ffd31912500 EFLAGS: 00010206
       RAX: 0000000000071000 RBX: ffffffffffffffff RCX: 00000000021a32b0
       RDX: 000000000007d011 RSI: 000000000007d000 RDI: 00000000021262b0
       RBP: 00000000021262b0 R08: 0000000000000003 R09: 0000000000000086
       R10: 00000000000000eb R11: 00007fefbdf2baa0 R12: 0000000000000000
       R13: 0000000000000002 R14: 000000000007d000 R15: 0000000000001000
      Signed-off-by: Liang Zhang <zhangliang5@huawei.com>
      Message-Id: <20220222031239.1076682-1-zhangliang5@huawei.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 22 Feb, 2022 2 commits
  3. 18 Feb, 2022 2 commits
  4. 17 Feb, 2022 6 commits
    • x86/kvm/fpu: Remove kvm_vcpu_arch.guest_supported_xcr0 · 988896bb
      Leonardo Bras committed
      kvm_vcpu_arch currently contains the guest supported features in both
      guest_supported_xcr0 and guest_fpu.fpstate->user_xfeatures field.
      
      Currently both fields are set to the same value in
      kvm_vcpu_after_set_cpuid() and are not changed anywhere else after that.
      
      Since it's not good to keep duplicated data, remove guest_supported_xcr0.
      
      To keep the code more readable, introduce kvm_guest_supported_xcr()
      and kvm_guest_supported_xfd() to replace the previous usages of
      guest_supported_xcr0.
      Signed-off-by: Leonardo Bras <leobras@redhat.com>
      Message-Id: <20220217053028.96432-3-leobras@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0 · ad856280
      Leonardo Bras committed
      During host/guest switch (like in kvm_arch_vcpu_ioctl_run()), the kernel
      swaps the fpu between host/guest contexts, by using fpu_swap_kvm_fpstate().
      
      When xsave feature is available, the fpu swap is done by:
      - xsave(s) instruction, with guest's fpstate->xfeatures as mask, is used
        to store the current state of the fpu registers to a buffer.
      - xrstor(s) instruction, with (fpu_kernel_cfg.max_features &
        XFEATURE_MASK_FPSTATE) as mask, is used to put the buffer into fpu regs.
      
      For xsave(s) the mask is used to limit what parts of the fpu regs will
      be copied to the buffer. Likewise on xrstor(s), the mask is used to
      limit what parts of the fpu regs will be changed.
      
      The mask for xsave(s), the guest's fpstate->xfeatures, is defined on
      kvm_arch_vcpu_create(), which (in summary) sets it to all features
      supported by the cpu which are enabled on kernel config.
      
      This means that xsave(s) will save to guest buffer all the fpu regs
      contents the cpu has enabled when the guest is paused, even if they
      are not used.
      
      This would not be an issue, if xrstor(s) would also do that.
      
      xrstor(s)'s mask for host/guest swap is basically every valid feature
      contained in kernel config, except XFEATURE_MASK_PKRU.
      According to the kernel source, PKRU is instead switched in switch_to() and
      flush_thread().
      
      Then, the following happens when a host supporting PKRU starts a
      guest that does not support it:
      1 - Host has XFEATURE_MASK_PKRU set. 1st switch to guest,
      2 - xsave(s) fpu regs to host fpustate (buffer has XFEATURE_MASK_PKRU)
      3 - xrstor(s) guest fpustate to fpu regs (fpu regs have XFEATURE_MASK_PKRU)
      4 - guest runs, then switch back to host,
      5 - xsave(s) fpu regs to guest fpstate (buffer now has XFEATURE_MASK_PKRU)
      6 - xrstor(s) host fpstate to fpu regs.
      7 - kvm_vcpu_ioctl_x86_get_xsave() copies guest fpstate to userspace (with
          XFEATURE_MASK_PKRU, which should not be supported by guest vcpu)
      
      On 5, even though the guest does not support PKRU, it does have the flag
      set on guest fpstate, which is transferred to userspace via vcpu ioctl
      KVM_GET_XSAVE.
      
      This becomes a problem when the user decides to migrate the above guest
      to another machine that does not support PKRU: the new host restores the
      guest's fpu regs to what they were before (xrstor(s)), but since the new
      host doesn't support PKRU, a general-protection exception occurs in
      xrstor(s) and that crashes the guest.
      
      This can be solved by making the guest's fpstate->user_xfeatures hold
      a copy of guest_supported_xcr0. This way, on 7 the only flags copied to
      userspace will be the ones compatible with guest requirements, and thus
      there will be no issue during migration.
      
      As a bonus, it will also fail if userspace tries to set fpu features
      (with the KVM_SET_XSAVE ioctl) that are not compatible with the guest
      configuration.  Such features will never be returned by KVM_GET_XSAVE
      or KVM_GET_XSAVE2.
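The effect of limiting user_xfeatures can be sketched as two bitmask operations. The helper names below are hypothetical; the bit positions follow the XCR0 layout (x87 = bit 0, SSE = bit 1, AVX = bit 2, PKRU = bit 9).

```c
#include <stdbool.h>
#include <stdint.h>

#define XFEATURE_MASK_FP   (1ULL << 0)
#define XFEATURE_MASK_SSE  (1ULL << 1)
#define XFEATURE_MASK_YMM  (1ULL << 2)
#define XFEATURE_MASK_PKRU (1ULL << 9)

/* Features reported to userspace: only what the guest is allowed to use. */
static uint64_t xfeatures_for_get_xsave(uint64_t fpstate_xfeatures,
					uint64_t guest_supported_xcr0)
{
	return fpstate_xfeatures & guest_supported_xcr0;
}

/* A set request is valid only if it asks for no unsupported feature. */
static bool set_xsave_request_ok(uint64_t requested_xfeatures,
				 uint64_t guest_supported_xcr0)
{
	return (requested_xfeatures & ~guest_supported_xcr0) == 0;
}
```

With a host buffer containing FP, SSE, YMM and PKRU and a guest limited to FP, SSE and YMM, the reported mask is 0x7 and a set request that includes PKRU is refused.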
      
      Also, since kvm_vcpu_after_set_cpuid() now sets fpstate->user_xfeatures,
      there is no need to set it in kvm_check_cpuid(). So, change
      fpstate_realloc() so it does not touch fpstate->user_xfeatures if a
      non-NULL guest_fpu is passed, which is the case when kvm_check_cpuid()
      calls it.
      Signed-off-by: Leonardo Bras <leobras@redhat.com>
      Message-Id: <20220217053028.96432-2-leobras@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Disable KVM_HC_CLOCK_PAIRING if tsc is in always catchup mode · 3a55f729
      Anton Romanov committed
      If a vCPU has tsc_always_catchup set, each request updates the pvclock data.
      KVM_HC_CLOCK_PAIRING consumers such as ptp_kvm_x86 rely on a TSC read on
      the host's side and issue the hypercall inside a pvclock_read_retry loop,
      which leads to an infinite loop in this situation.
      
      v3:
          Removed warn
          Changed return code to KVM_EFAULT
      v2:
          Added warn
      Signed-off-by: Anton Romanov <romanton@google.com>
      Message-Id: <20220216182653.506850-1-romanton@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Fix lockdep false negative during host resume · 4cb9a998
      Wanpeng Li committed
      I saw the splat below after the host suspended and resumed.
      
         WARNING: CPU: 0 PID: 2943 at kvm/arch/x86/kvm/../../../virt/kvm/kvm_main.c:5531 kvm_resume+0x2c/0x30 [kvm]
         CPU: 0 PID: 2943 Comm: step_after_susp Tainted: G        W IOE     5.17.0-rc3+ #4
         RIP: 0010:kvm_resume+0x2c/0x30 [kvm]
         Call Trace:
          <TASK>
          syscore_resume+0x90/0x340
          suspend_devices_and_enter+0xaee/0xe90
          pm_suspend.cold+0x36b/0x3c2
          state_store+0x82/0xf0
          kernfs_fop_write_iter+0x1b6/0x260
          new_sync_write+0x258/0x370
          vfs_write+0x33f/0x510
          ksys_write+0xc9/0x160
          do_syscall_64+0x3b/0xc0
          entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      lockdep_is_held() can return -1 when lockdep is disabled which triggers
      this warning. Let's use lockdep_assert_not_held() which can detect
      incorrect calls while holding a lock and it also avoids false negatives
      when lockdep is disabled.
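The difference between the two checks can be modelled with a tri-state return value. In this sketch the three constants mirror the states lockdep_is_held() is described as returning (-1 when lockdep is disabled); the two helper functions are hypothetical models of the old and new checks, not the kernel macros themselves.

```c
#include <stdbool.h>

#define LOCK_STATE_UNKNOWN  (-1) /* lockdep disabled: state cannot be known */
#define LOCK_STATE_NOT_HELD 0
#define LOCK_STATE_HELD     1

/* Old style: warn whenever the returned value is non-zero. */
static bool old_check_warns(int lock_state)
{
	return lock_state != 0;
}

/* lockdep_assert_not_held() style: warn only when the lock is known held. */
static bool new_check_warns(int lock_state)
{
	return lock_state == LOCK_STATE_HELD;
}
```

Both checks agree when lockdep is active; they differ only for LOCK_STATE_UNKNOWN, where the old check fires spuriously and the new one stays quiet.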
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1644920142-81249-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Add KVM_CAP_ENABLE_CAP to x86 · 127770ac
      Aaron Lewis committed
      Follow the precedent set by other architectures that support the VCPU
      ioctl, KVM_ENABLE_CAP, and advertise the VM extension, KVM_CAP_ENABLE_CAP.
      This way, userspace can ensure that KVM_ENABLE_CAP is available on a
      vcpu before using it.
      
      Fixes: 5c919412 ("kvm/x86: Hyper-V synthetic interrupt controller")
      Signed-off-by: Aaron Lewis <aaronlewis@google.com>
      Message-Id: <20220214212950.1776943-1-aaronlewis@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: arm64: Don't miss pending interrupts for suspended vCPU · a867e9d0
      Oliver Upton committed
      In order to properly emulate the WFI instruction, KVM reads back
      ICH_VMCR_EL2 and enables doorbells for GICv4. These preparations are
      necessary in order to recognize pending interrupts in
      kvm_arch_vcpu_runnable() and return to the guest. Until recently, this
      work was done by kvm_arch_vcpu_{blocking,unblocking}(). Since commit
      6109c5a6 ("KVM: arm64: Move vGIC v4 handling for WFI out arch
      callback hook"), these callbacks were gutted and superseded by
      kvm_vcpu_wfi().
      
      It is important to note that KVM implements PSCI CPU_SUSPEND calls as
      a WFI within the guest. However, the implementation calls directly into
      kvm_vcpu_halt(), which skips the needed work done in kvm_vcpu_wfi()
      to detect pending interrupts. Fix the issue by calling the WFI helper.
      
      Fixes: 6109c5a6 ("KVM: arm64: Move vGIC v4 handling for WFI out arch callback hook")
      Signed-off-by: Oliver Upton <oupton@google.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220217101242.3013716-1-oupton@google.com
  5. 14 Feb, 2022 2 commits
  6. 12 Feb, 2022 5 commits
  7. 11 Feb, 2022 2 commits
    • KVM: arm64: vgic: Read HW interrupt pending state from the HW · 5bfa685e
      Marc Zyngier committed
      It appears that a read access to GIC[DR]_I[CS]PENDRn doesn't always
      result in the pending interrupts being accurately reported if they are
      mapped to a HW interrupt. This is particularly visible when acking
      the timer interrupt and reading the GICR_ISPENDR1 register immediately
      after, for example (the interrupt appears as not-pending while it really
      is...).
      
      This is because a HW interrupt has its 'active and pending state' kept
      in the *physical* distributor, and not in the virtual one, as mandated
      by the spec (this is what allows the direct deactivation). The virtual
      distributor only carries the pending and active *states* (note the
      plural, as these are two independent and non-overlapping states).
      
      Fix it by reading the HW state back, either from the timer itself or
      from the distributor if necessary.
      Reported-by: Ricardo Koller <ricarkol@google.com>
      Tested-by: Ricardo Koller <ricarkol@google.com>
      Reviewed-by: Ricardo Koller <ricarkol@google.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220208123726.3604198-1-maz@kernel.org
    • KVM: x86/xen: Fix runstate updates to be atomic when preempting vCPU · fcb732d8
      David Woodhouse committed
      There are circumstances when kvm_xen_update_runstate_guest() should not
      sleep because it ends up being called from __schedule() when the vCPU
      is preempted:
      
      [  222.830825]  kvm_xen_update_runstate_guest+0x24/0x100
      [  222.830878]  kvm_arch_vcpu_put+0x14c/0x200
      [  222.830920]  kvm_sched_out+0x30/0x40
      [  222.830960]  __schedule+0x55c/0x9f0
      
      To handle this, make it use the same trick as __kvm_xen_has_interrupt(),
      of using the hva from the gfn_to_hva_cache directly. Then it can use
      pagefault_disable() around the accesses and just bail out if the page
      is absent (which is unlikely).
      
      I almost switched to using a gfn_to_pfn_cache here and bailing out if
      kvm_map_gfn() fails, like kvm_steal_time_set_preempted() does — but on
      closer inspection it looks like kvm_map_gfn() will *always* fail in
      atomic context for a page in IOMEM, which means it will silently fail
      to make the update every single time for such guests, AFAICT. So I
      didn't do it that way after all. And will probably fix that one too.
      
      Cc: stable@vger.kernel.org
      Fixes: 30b5c851 ("KVM: x86/xen: Add support for vCPU runstate information")
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <b17a93e5ff4561e57b1238e3e7ccd0b613eb827e.camel@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  8. 09 Feb, 2022 9 commits
    • KVM: x86: SVM: move avic definitions from AMD's spec to svm.h · 39150352
      Maxim Levitsky committed
      asm/svm.h is the correct place for all values that are defined in
      the SVM spec, and that includes AVIC.
      
      Also add some values from the spec that were not defined before
      and will soon be useful.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220207155447.840194-10-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it · 755c2bf8
      Maxim Levitsky committed
      kvm_apic_update_apicv is called when AVIC is still active, thus IRR bits
      can be set by the CPU after it is called, and don't cause the irr_pending
      to be set to true.
      
      Also logic in avic_kick_target_vcpu doesn't expect a race with this
      function so to make it simple, just keep irr_pending set to true and
      let the next interrupt injection to the guest clear it.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220207155447.840194-9-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: nSVM: deal with L1 hypervisor that intercepts interrupts but lets L2 control them · 2b0ecccb
      Maxim Levitsky committed
      Fix a corner case in which the L1 hypervisor intercepts
      interrupts (INTERCEPT_INTR) and either doesn't set
      virtual interrupt masking (V_INTR_MASKING) or enters a
      nested guest with EFLAGS.IF disabled prior to the entry.
      
      In this case, despite the fact that L1 intercepts the interrupts,
      KVM still needs to set up an interrupt window to wait before
      injecting the INTR vmexit.
      
      Currently, KVM instead enters an endless loop of 'req_immediate_exit'.
      
      Exactly the same issue also happens for SMIs and NMIs.
      Fix this as well.
      
      Note that on VMX, this case is impossible as there is only the
      'vmexit on external interrupts' execution control, which is either set,
      in which case both host and guest's EFLAGS.IF
      are ignored, or not set, in which case no VMexits are delivered.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220207155447.840194-8-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: nSVM: expose clean bit support to the guest · 91f673b3
      Maxim Levitsky committed
      KVM already honours a few clean bits, so it makes sense
      to let the nested guest know about it.
      
      Note that KVM also doesn't check if the hardware supports
      clean bits, and therefore nested KVM was
      already setting clean bits and L0 KVM
      was already honouring them.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220207155447.840194-6-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM · 759cbd59
      Maxim Levitsky committed
      While RSM induced VM entries are not full VM entries,
      they still need to be followed by actual VM entry to complete it,
      unlike setting the nested state.
      
      This patch fixes the boot of Hyper-V and SMM-enabled
      Windows VMs running nested on KVM, which fail due
      to this issue combined with the lack of dirty bit setting.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Message-Id: <20220207155447.840194-5-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: nSVM: mark vmcb01 as dirty when restoring SMM saved state · e8efa4ff
      Maxim Levitsky committed
      While usually, restoring the smm state makes the KVM enter
      the nested guest thus a different vmcb (vmcb02 vs vmcb01),
      KVM should still mark it as dirty, since hardware
      can in theory cache multiple vmcbs.
      
      Failure to do so, combined with lack of setting the
      nested_run_pending (which is fixed in the next patch),
      might make KVM re-enter vmcb01, which was just exited from,
      with completely different set of guest state registers
      (SMM vs non SMM) and without proper dirty bits set,
      which results in the CPU reusing stale IDTR pointer
      which leads to a guest shutdown on any interrupt.
      
      On real hardware this usually doesn't happen,
      but when running nested, L0's KVM does check and
      honour a few dirty bits, causing this issue to happen.
      
      This patch fixes the boot of Hyper-V and SMM-enabled
      Windows VMs running nested on KVM.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Message-Id: <20220207155447.840194-4-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: nSVM: fix potential NULL derefernce on nested migration · e1779c27
      Maxim Levitsky committed
      Turns out that due to review feedback and/or rebases
      I accidentally moved the call to nested_svm_load_cr3 to be too early,
      before the NPT is enabled, which is very wrong to do.
      
      KVM can't even access guest memory at that point as nested NPT
      is needed for that, and of course it won't initialize the walk_mmu,
      which is the main issue the patch was addressing.
      
      Fix this for real.
      
      Fixes: 232f75d3 ("KVM: nSVM: call nested_svm_load_cr3 on nested state load")
      Cc: stable@vger.kernel.org
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220207155447.840194-3-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: SVM: don't passthrough SMAP/SMEP/PKE bits in !NPT && !gCR0.PG case · c53bbe21
      Maxim Levitsky committed
      When the guest doesn't enable paging, and NPT/EPT is disabled, we
      use the guest's paging CR3 as KVM's shadow paging pointer and
      we are technically in direct mode as if we were to use NPT/EPT.
      
      In direct mode we create SPTEs with user mode permissions
      because usually in the direct mode the NPT/EPT doesn't
      need to restrict access based on guest CPL
      (there are MBE/GMET extensions for that but KVM doesn't use them).
      
      In this special "use guest paging as direct" mode, however,
      if CR4.SMAP/CR4.SMEP are enabled, the CPU will
      fault on each access and KVM will enter an endless loop of page faults.
      
      Since page protection doesn't have any meaning in !PG case,
      just don't passthrough these bits.
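The rule above can be sketched as a pure function over the control-register values. The function name `hw_cr4_for_guest()` is hypothetical; the bit positions are the architectural ones (CR0.PG = bit 31, CR4.SMEP = bit 20, CR4.SMAP = bit 21, CR4.PKE = bit 22).

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_CR0_PG   (1ULL << 31)
#define X86_CR4_SMEP (1ULL << 20)
#define X86_CR4_SMAP (1ULL << 21)
#define X86_CR4_PKE  (1ULL << 22)

/*
 * Compute the CR4 value handed to hardware. Without NPT and with guest
 * paging off, the page-protection bits are masked out because they have
 * no meaning for the guest and would only make every access fault.
 */
static uint64_t hw_cr4_for_guest(uint64_t guest_cr4, uint64_t guest_cr0,
				 bool npt_enabled)
{
	uint64_t hw_cr4 = guest_cr4;

	if (!npt_enabled && !(guest_cr0 & X86_CR0_PG))
		hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE);

	return hw_cr4;
}
```

Once the guest sets CR0.PG, or when NPT is in use, the bits are passed through unchanged.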
      
      The fix is the same as was done for VMX in commit:
      commit 656ec4a4 ("KVM: VMX: fix SMEP and SMAP without EPT")
      
      This fixes the boot of Windows 10 without NPT for good.
      (Without this patch, the BSP boots, but the APs get stuck in an endless
      loop of page faults, causing the VM to boot with 1 CPU)
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Message-Id: <20220207155447.840194-2-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "svm: Add warning message for AVIC IPI invalid target" · dd4589ee
      Sean Christopherson committed
      Remove a WARN on an "AVIC IPI invalid target" exit, the WARN is trivial
      to trigger from guest as it will fail on any destination APIC ID that
      doesn't exist from the guest's perspective.
      
      Don't bother recording anything in the kernel log, the common tracepoint
      for kvm_avic_incomplete_ipi() is sufficient for debugging.
      
      This reverts commit 37ef0c44.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220204214205.3306634-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  9. 07 Feb, 2022 5 commits
    • Linux 5.17-rc3 · dfd42fac
      Linus Torvalds committed
    • Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · d8ad2ce8
      Linus Torvalds committed
      Pull ext4 fixes from Ted Ts'o:
       "Various bug fixes for ext4 fast commit and inline data handling.
      
        Also fix regression introduced as part of moving to the new mount API"
      
      * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
        fs/ext4: fix comments mentioning i_mutex
        ext4: fix incorrect type issue during replay_del_range
        jbd2: fix kernel-doc descriptions for jbd2_journal_shrink_{scan,count}()
        ext4: fix potential NULL pointer dereference in ext4_fill_super()
        jbd2: refactor wait logic for transaction updates into a common function
        jbd2: cleanup unused functions declarations from jbd2.h
        ext4: fix error handling in ext4_fc_record_modified_inode()
        ext4: remove redundant max inline_size check in ext4_da_write_inline_data_begin()
        ext4: fix error handling in ext4_restore_inline_data()
        ext4: fast commit may miss file actions
        ext4: fast commit may not fallback for ineligible commit
        ext4: modify the logic of ext4_mb_new_blocks_simple
        ext4: prevent used blocks from being allocated during fast commit replay
    • Merge tag 'perf-tools-fixes-for-v5.17-2022-02-06' of... · 18118a42
      Linus Torvalds committed
      Merge tag 'perf-tools-fixes-for-v5.17-2022-02-06' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
      
      Pull perf tools fixes from Arnaldo Carvalho de Melo:
      
       - Fix display of grouped aliased events in 'perf stat'.
      
       - Add missing branch_sample_type to perf_event_attr__fprintf().
      
       - Apply correct label to user/kernel symbols in branch mode.
      
       - Fix 'perf ftrace' system_wide tracing, it has to be set before
         creating the maps.
      
       - Return error if procfs isn't mounted for PID namespaces when
         synthesizing records for pre-existing processes.
      
       - Set error stream of objdump process for 'perf annotate' TUI, to avoid
         garbling the screen.
      
       - Add missing arm64 support to perf_mmap__read_self(), the kernel part
         got into 5.17.
      
       - Check for NULL pointer before dereference writing debug info about a
         sample.
      
       - Update UAPI copies for asound, perf_event, prctl and kvm headers.
      
       - Fix a typo in bpf_counter_cgroup.c.
      
      * tag 'perf-tools-fixes-for-v5.17-2022-02-06' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
        perf ftrace: system_wide collection is not effective by default
        libperf: Add arm64 support to perf_mmap__read_self()
        tools include UAPI: Sync sound/asound.h copy with the kernel sources
        perf stat: Fix display of grouped aliased events
        perf tools: Apply correct label to user/kernel symbols in branch mode
        perf bpf: Fix a typo in bpf_counter_cgroup.c
        perf synthetic-events: Return error if procfs isn't mounted for PID namespaces
        perf session: Check for NULL pointer before dereference
        perf annotate: Set error stream of objdump process for TUI
        perf tools: Add missing branch_sample_type to perf_event_attr__fprintf()
        tools headers UAPI: Sync linux/kvm.h with the kernel sources
        tools headers UAPI: Sync linux/prctl.h with the kernel sources
        perf beauty: Make the prctl arg regexp more strict to cope with PR_SET_VMA
        tools headers cpufeatures: Sync with the kernel sources
        tools headers UAPI: Sync linux/perf_event.h with the kernel sources
        tools include UAPI: Sync sound/asound.h copy with the kernel sources
    • Merge tag 'perf_urgent_for_v5.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · c3bf8a14
      Linus Torvalds committed
      Pull perf fixes from Borislav Petkov:
      
       - Intel/PT: filters could crash the kernel
      
       - Intel: default disable the PMU for SMM, some new-ish EFI firmware has
         started using CPL3 and the PMU CPL filters don't discriminate against
         SMM, meaning that CPL3 (userspace only) events now also count EFI/SMM
         cycles.
      
       - Fixup for perf_event_attr::sig_data
      
      * tag 'perf_urgent_for_v5.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/intel/pt: Fix crash with stop filters in single-range mode
        perf: uapi: Document perf_event_attr::sig_data truncation on 32 bit architectures
        selftests/perf_events: Test modification of perf_event_attr::sig_data
        perf: Copy perf_event_attr::sig_data on modification
        x86/perf: Default set FREEZE_ON_SMI for all
    • Merge tag 'objtool_urgent_for_v5.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · aeabe1e0
      Linus Torvalds committed
      Pull objtool fix from Borislav Petkov:
       "Fix a potential truncated string warning triggered by gcc12"
      
      * tag 'objtool_urgent_for_v5.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        objtool: Fix truncated string warning