1. 11 9月, 2019 3 次提交
  2. 10 9月, 2019 1 次提交
  3. 22 8月, 2019 8 次提交
  4. 14 8月, 2019 1 次提交
  5. 05 8月, 2019 1 次提交
    • W
      KVM: Fix leak vCPU's VMCS value into other pCPU · 17e433b5
      Wanpeng Li 提交于
      After commit d73eb57b (KVM: Boost vCPUs that are delivering interrupts), a
      five years old bug is exposed. Running ebizzy benchmark in three 80 vCPUs VMs
      on one 80 pCPUs Skylake server, a lot of rcu_sched stall warning splatting
      in the VMs after stress testing:
      
       INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
       Call Trace:
         flush_tlb_mm_range+0x68/0x140
         tlb_flush_mmu.part.75+0x37/0xe0
         tlb_finish_mmu+0x55/0x60
         zap_page_range+0x142/0x190
         SyS_madvise+0x3cd/0x9c0
         system_call_fastpath+0x1c/0x21
      
      swait_active() sustains to be true before finish_swait() is called in
      kvm_vcpu_block(), voluntarily preempted vCPUs are taken into account
      by kvm_vcpu_on_spin() loop greatly increases the probability condition
      kvm_arch_vcpu_runnable(vcpu) is checked and can be true, when APICv
      is enabled the yield-candidate vCPU's VMCS RVI field leaks(by
      vmx_sync_pir_to_irr()) into spinning-on-a-taken-lock vCPU's current
      VMCS.
      
      This patch fixes it by checking conservatively a subset of events.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Marc Zyngier <Marc.Zyngier@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      17e433b5
  6. 22 7月, 2019 1 次提交
    • W
      KVM: X86: Dynamically allocate user_fpu · d9a710e5
      Wanpeng Li 提交于
      After reverting commit 240c35a3 (kvm: x86: Use task structs fpu field
      for user), struct kvm_vcpu is 19456 bytes on my server, PAGE_ALLOC_COSTLY_ORDER(3)
      is the order at which allocations are deemed costly to service. In serveless
      scenario, one host can service hundreds/thoudands firecracker/kata-container
      instances, howerver, new instance will fail to launch after memory is too
      fragmented to allocate kvm_vcpu struct on host, this was observed in some
      cloud provider product environments.
      
      This patch dynamically allocates user_fpu, kvm_vcpu is 15168 bytes now on my
      Skylake server.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d9a710e5
  7. 20 7月, 2019 1 次提交
    • L
      KVM: SVM: Fix detection of AMD Errata 1096 · 118154bd
      Liran Alon 提交于
      When CPU raise #NPF on guest data access and guest CR4.SMAP=1, it is
      possible that CPU microcode implementing DecodeAssist will fail
      to read bytes of instruction which caused #NPF. This is AMD errata
      1096 and it happens because CPU microcode reading instruction bytes
      incorrectly attempts to read code as implicit supervisor-mode data
      accesses (that is, just like it would read e.g. a TSS), which are
      susceptible to SMAP faults. The microcode reads CS:RIP and if it is
      a user-mode address according to the page tables, the processor
      gives up and returns no instruction bytes.  In this case,
      GuestIntrBytes field of the VMCB on a VMEXIT will incorrectly
      return 0 instead of the correct guest instruction bytes.
      
      Current KVM code attemps to detect and workaround this errata, but it
      has multiple issues:
      
      1) It mistakenly checks if guest CR4.SMAP=0 instead of guest CR4.SMAP=1,
      which is required for encountering a SMAP fault.
      
      2) It assumes SMAP faults can only occur when guest CPL==3.
      However, in case guest CR4.SMEP=0, the guest can execute an instruction
      which reside in a user-accessible page with CPL<3 priviledge. If this
      instruction raise a #NPF on it's data access, then CPU DecodeAssist
      microcode will still encounter a SMAP violation.  Even though no sane
      OS will do so (as it's an obvious priviledge escalation vulnerability),
      we still need to handle this semanticly correct in KVM side.
      
      Note that (2) *is* a useful optimization, because CR4.SMAP=1 is an easy
      triggerable condition and guests usually enable SMAP together with SMEP.
      If the vCPU has CR4.SMEP=1, the errata could indeed be encountered onlt
      at guest CPL==3; otherwise, the CPU would raise a SMEP fault to guest
      instead of #NPF.  We keep this condition to avoid false positives in
      the detection of the errata.
      
      In addition, to avoid future confusion and improve code readbility,
      include details of the errata in code and not just in commit message.
      
      Fixes: 05d5a486 ("KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation)")
      Cc: Singh Brijesh <brijesh.singh@amd.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: NLiran Alon <liran.alon@oracle.com>
      Reviewed-by: NBrijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      118154bd
  8. 11 7月, 2019 1 次提交
    • S
      KVM: x86: Unconditionally enable irqs in guest context · d7a08882
      Sean Christopherson 提交于
      On VMX, KVM currently does not re-enable irqs until after it has exited
      the guest context.  As a result, a tick that fires in the window between
      VM-Exit and guest_exit_irqoff() will be accounted as system time.  While
      said window is relatively small, it's large enough to be problematic in
      some configurations, e.g. if VM-Exits are consistently occurring a hair
      earlier than the tick irq.
      
      Intentionally toggle irqs back off so that guest_exit_irqoff() can be
      used in lieu of guest_exit() in order to avoid the save/restore of flags
      in guest_exit().  On my Haswell system, "nop; cli; sti" is ~6 cycles,
      versus ~28 cycles for "pushf; pop <reg>; cli; push <reg>; popf".
      
      Fixes: f2485b3e ("KVM: x86: use guest_exit_irqoff")
      Reported-by: NWei Yang <w90p710@gmail.com>
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d7a08882
  9. 03 7月, 2019 1 次提交
  10. 02 7月, 2019 1 次提交
    • P
      KVM: nVMX: list VMX MSRs in KVM_GET_MSR_INDEX_LIST · 95c5c7c7
      Paolo Bonzini 提交于
      This allows userspace to know which MSRs are supported by the hypervisor.
      Unfortunately userspace must resort to tricks for everything except
      MSR_IA32_VMX_VMFUNC (which was just added in the previous patch).
      One possibility is to use the feature control MSR, which is tied to nested
      VMX as well and is present on all KVM versions that support feature MSRs.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      95c5c7c7
  11. 19 6月, 2019 1 次提交
  12. 18 6月, 2019 2 次提交
    • S
      KVM: VMX: Handle NMIs, #MCs and async #PFs in common irqs-disabled fn · 95b5a48c
      Sean Christopherson 提交于
      Per commit 1b6269db ("KVM: VMX: Handle NMIs before enabling
      interrupts and preemption"), NMIs are handled directly in vmx_vcpu_run()
      to "make sure we handle NMI on the current cpu, and that we don't
      service maskable interrupts before non-maskable ones".  The other
      exceptions handled by complete_atomic_exit(), e.g. async #PF and #MC,
      have similar requirements, and are located there to avoid extra VMREADs
      since VMX bins hardware exceptions and NMIs into a single exit reason.
      
      Clean up the code and eliminate the vaguely named complete_atomic_exit()
      by moving the interrupts-disabled exception and NMI handling into the
      existing handle_external_intrs() callback, and rename the callback to
      a more appropriate name.  Rename VMexit handlers throughout so that the
      atomic and non-atomic counterparts have similar names.
      
      In addition to improving code readability, this also ensures the NMI
      handler is run with the host's debug registers loaded in the unlikely
      event that the user is debugging NMIs.  Accuracy of the last_guest_tsc
      field is also improved when handling NMIs (and #MCs) as the handler
      will run after updating said field.
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      [Naming cleanups. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      95b5a48c
    • S
      KVM: x86: Move kvm_{before,after}_interrupt() calls to vendor code · 165072b0
      Sean Christopherson 提交于
      VMX can conditionally call kvm_{before,after}_interrupt() since KVM
      always uses "ack interrupt on exit" and therefore explicitly handles
      interrupts as opposed to blindly enabling irqs.
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      165072b0
  13. 05 6月, 2019 4 次提交
  14. 25 5月, 2019 2 次提交
  15. 15 5月, 2019 1 次提交
    • I
      mm/gup: change GUP fast to use flags rather than a write 'bool' · 73b0140b
      Ira Weiny 提交于
      To facilitate additional options to get_user_pages_fast() change the
      singular write parameter to be gup_flags.
      
      This patch does not change any functionality.  New functionality will
      follow in subsequent patches.
      
      Some of the get_user_pages_fast() call sites were unchanged because they
      already passed FOLL_WRITE or 0 for the write parameter.
      
      NOTE: It was suggested to change the ordering of the get_user_pages_fast()
      arguments to ensure that callers were converted.  This breaks the current
      GUP call site convention of having the returned pages be the final
      parameter.  So the suggestion was rejected.
      
      Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.comSigned-off-by: NIra Weiny <ira.weiny@intel.com>
      Reviewed-by: NMike Marshall <hubcap@omnibond.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      73b0140b
  16. 01 5月, 2019 3 次提交
  17. 16 4月, 2019 6 次提交
    • S
      KVM: x86: clear SMM flags before loading state while leaving SMM · 9ec19493
      Sean Christopherson 提交于
      RSM emulation is currently broken on VMX when the interrupted guest has
      CR4.VMXE=1.  Stop dancing around the issue of HF_SMM_MASK being set when
      loading SMSTATE into architectural state, e.g. by toggling it for
      problematic flows, and simply clear HF_SMM_MASK prior to loading
      architectural state (from SMRAM save state area).
      Reported-by: NJon Doron <arilou@gmail.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Fixes: 5bea5123 ("KVM: VMX: check nested state and CR4.VMXE against SMM")
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Tested-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9ec19493
    • S
      KVM: x86: Load SMRAM in a single shot when leaving SMM · ed19321f
      Sean Christopherson 提交于
      RSM emulation is currently broken on VMX when the interrupted guest has
      CR4.VMXE=1.  Rather than dance around the issue of HF_SMM_MASK being set
      when loading SMSTATE into architectural state, ideally RSM emulation
      itself would be reworked to clear HF_SMM_MASK prior to loading non-SMM
      architectural state.
      
      Ostensibly, the only motivation for having HF_SMM_MASK set throughout
      the loading of state from the SMRAM save state area is so that the
      memory accesses from GET_SMSTATE() are tagged with role.smm.  Load
      all of the SMRAM save state area from guest memory at the beginning of
      RSM emulation, and load state from the buffer instead of reading guest
      memory one-by-one.
      
      This paves the way for clearing HF_SMM_MASK prior to loading state,
      and also aligns RSM with the enter_smm() behavior, which fills a
      buffer and writes SMRAM save state in a single go.
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ed19321f
    • W
      x86/kvm: move kvm_load/put_guest_xcr0 into atomic context · 1811d979
      WANG Chao 提交于
      guest xcr0 could leak into host when MCE happens in guest mode. Because
      do_machine_check() could schedule out at a few places.
      
      For example:
      
      kvm_load_guest_xcr0
      ...
      kvm_x86_ops->run(vcpu) {
        vmx_vcpu_run
          vmx_complete_atomic_exit
            kvm_machine_check
              do_machine_check
                do_memory_failure
                  memory_failure
                    lock_page
      
      In this case, host_xcr0 is 0x2ff, guest vcpu xcr0 is 0xff. After schedule
      out, host cpu has guest xcr0 loaded (0xff).
      
      In __switch_to {
           switch_fpu_finish
             copy_kernel_to_fpregs
               XRSTORS
      
      If any bit i in XSTATE_BV[i] == 1 and xcr0[i] == 0, XRSTORS will
      generate #GP (In this case, bit 9). Then ex_handler_fprestore kicks in
      and tries to reinitialize fpu by restoring init fpu state. Same story as
      last #GP, except we get DOUBLE FAULT this time.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NWANG Chao <chao.wang@ucloud.cn>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      1811d979
    • V
      KVM: x86: svm: make sure NMI is injected after nmi_singlestep · 99c22179
      Vitaly Kuznetsov 提交于
      I noticed that apic test from kvm-unit-tests always hangs on my EPYC 7401P,
      the hanging test nmi-after-sti is trying to deliver 30000 NMIs and tracing
      shows that we're sometimes able to deliver a few but never all.
      
      When we're trying to inject an NMI we may fail to do so immediately for
      various reasons, however, we still need to inject it so enable_nmi_window()
      arms nmi_singlestep mode. #DB occurs as expected, but we're not checking
      for pending NMIs before entering the guest and unless there's a different
      event to process, the NMI will never get delivered.
      
      Make KVM_REQ_EVENT request on the vCPU from db_interception() to make sure
      pending NMIs are checked and possibly injected.
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      99c22179
    • S
      svm/avic: Fix invalidate logical APIC id entry · e44e3eac
      Suthikulpanit, Suravee 提交于
      Only clear the valid bit when invalidate logical APIC id entry.
      The current logic clear the valid bit, but also set the rest of
      the bits (including reserved bits) to 1.
      
      Fixes: 98d90582 ('svm: Fix AVIC DFR and LDR handling')
      Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e44e3eac
    • S
      Revert "svm: Fix AVIC incomplete IPI emulation" · 4a58038b
      Suthikulpanit, Suravee 提交于
      This reverts commit bb218fbc.
      
      As Oren Twaig pointed out the old discussion:
      
        https://patchwork.kernel.org/patch/8292231/
      
      that the change coud potentially cause an extra IPI to be sent to
      the destination vcpu because the AVIC hardware already set the IRR bit
      before the incomplete IPI #VMEXIT with id=1 (target vcpu is not running).
      Since writting to ICR and ICR2 will also set the IRR. If something triggers
      the destination vcpu to get scheduled before the emulation finishes, then
      this could result in an additional IPI.
      
      Also, the issue mentioned in the commit bb218fbc was misdiagnosed.
      
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Reported-by: NOren Twaig <oren@scalemp.com>
      Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4a58038b
  18. 06 4月, 2019 2 次提交