1. 24 9月, 2019 4 次提交
    • V
      KVM: x86: hyper-v: set NoNonArchitecturalCoreSharing CPUID bit when SMT is impossible · b2d8b167
      Vitaly Kuznetsov 提交于
      Hyper-V 2019 doesn't expose MD_CLEAR CPUID bit to guests when it cannot
      guarantee that two virtual processors won't end up running on sibling SMT
      threads without knowing about it. This is done as an optimization as in
      this case there is nothing the guest can do to protect itself against MDS
      and issuing additional flush requests is just pointless. On bare metal the
      topology is known, however, when Hyper-V is running nested (e.g. on top of
      KVM) it needs an additional piece of information: a confirmation that the
      exposed topology (wrt vCPU placement on different SMT threads) is
      trustworthy.
      
      NoNonArchitecturalCoreSharing (CPUID 0x40000004 EAX bit 18) is described in
      TLFS as follows: "Indicates that a virtual processor will never share a
      physical core with another virtual processor, except for virtual processors
      that are reported as sibling SMT threads." From KVM we can give such
      guarantee in two cases:
      - SMT is unsupported or forcefully disabled (just 'disabled' doesn't work
       as it can become re-enabled during the lifetime of the guest).
      - vCPUs are properly pinned so the scheduler won't put them on sibling
      SMT threads (when they're not reported as such).
      
      This patch reports NoNonArchitecturalCoreSharing bit in to userspace in the
      first case. The second case is outside of KVM's domain of responsibility
      (as vCPU pinning is actually done by someone who manages KVM's userspace -
      e.g. libvirt pinning QEMU threads).
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b2d8b167
    • V
      KVM/Hyper-V/VMX: Add direct tlb flush support · 6f6a657c
      Vitaly Kuznetsov 提交于
      Hyper-V provides direct tlb flush function which helps
      L1 Hypervisor to handle Hyper-V tlb flush request from
      L2 guest. Add the function support for VMX.
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NTianyu Lan <Tianyu.Lan@microsoft.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      6f6a657c
    • T
      KVM/Hyper-V: Add new KVM capability KVM_CAP_HYPERV_DIRECT_TLBFLUSH · 344c6c80
      Tianyu Lan 提交于
      Hyper-V direct tlb flush function should be enabled for
      guest that only uses Hyper-V hypercall. User space
      hypervisor(e.g, Qemu) can disable KVM identification in
      CPUID and just exposes Hyper-V identification to make
      sure the precondition. Add new KVM capability KVM_CAP_
      HYPERV_DIRECT_TLBFLUSH for user space to enable Hyper-V
      direct tlb function and this function is default to be
      disabled in KVM.
      Signed-off-by: NTianyu Lan <Tianyu.Lan@microsoft.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      344c6c80
    • T
      x86/Hyper-V: Fix definition of struct hv_vp_assist_page · 7a83247e
      Tianyu Lan 提交于
      The struct hv_vp_assist_page was defined incorrectly.
      The "vtl_control" should be u64[3], "nested_enlightenments
      _control" should be a u64 and there are 7 reserved bytes
      following "enlighten_vmentry". Fix the definition.
      Signed-off-by: NTianyu Lan <Tianyu.Lan@microsoft.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7a83247e
  2. 16 9月, 2019 2 次提交
    • R
      x86: bug.h: use asm_inline in _BUG_FLAGS definitions · 32ee8230
      Rasmus Villemoes 提交于
      This helps preventing a BUG* or WARN* in some static inline from
      preventing that (or one of its callers) being inlined, so should allow
      gcc to make better informed inlining decisions.
      
      For example, with gcc 9.2, tcp_fastopen_no_cookie() vanishes from
      net/ipv4/tcp_fastopen.o. It does not itself have any BUG or WARN, but
      it calls dst_metric() which has a WARN_ON_ONCE - and despite that
      WARN_ON_ONCE vanishing since the condition is compile-time false,
      dst_metric() is apparently sufficiently "large" that when it gets
      inlined into tcp_fastopen_no_cookie(), the latter becomes too large
      for inlining.
      
      Overall, if one asks size(1), .text decreases a little and .data
      increases by about the same amount (x86-64 defconfig)
      
      $ size vmlinux.{before,after}
         text    data     bss     dec     hex filename
      19709726        5202600 1630280 26542606        195020e vmlinux.before
      19709330        5203068 1630280 26542678        1950256 vmlinux.after
      
      while bloat-o-meter says
      
      add/remove: 10/28 grow/shrink: 103/51 up/down: 3669/-2854 (815)
      ...
      Total: Before=14783683, After=14784498, chg +0.01%
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: NMiguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      32ee8230
    • R
      x86: alternative.h: use asm_inline for all alternative variants · 40576e5e
      Rasmus Villemoes 提交于
      Most, if not all, uses of the alternative* family just provide one or
      two instructions in .text, but the string literal can be quite large,
      causing gcc to overestimate the size of the generated code. That in
      turn affects its decisions about inlining of the function containing
      the alternative() asm statement.
      
      New enough versions of gcc allow one to overrule the estimated size by
      using "asm inline" instead of just "asm". So replace asm by the helper
      asm_inline, which for older gccs just expands to asm.
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: NMiguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      40576e5e
  3. 14 9月, 2019 1 次提交
    • S
      KVM: x86/mmu: Reintroduce fast invalidate/zap for flushing memslot · 002c5f73
      Sean Christopherson 提交于
      James Harvey reported a livelock that was introduced by commit
      d012a06a ("Revert "KVM: x86/mmu: Zap only the relevant pages when
      removing a memslot"").
      
      The livelock occurs because kvm_mmu_zap_all() as it exists today will
      voluntarily reschedule and drop KVM's mmu_lock, which allows other vCPUs
      to add shadow pages.  With enough vCPUs, kvm_mmu_zap_all() can get stuck
      in an infinite loop as it can never zap all pages before observing lock
      contention or the need to reschedule.  The equivalent of kvm_mmu_zap_all()
      that was in use at the time of the reverted commit (4e103134, "KVM:
      x86/mmu: Zap only the relevant pages when removing a memslot") employed
      a fast invalidate mechanism and was not susceptible to the above livelock.
      
      There are three ways to fix the livelock:
      
      - Reverting the revert (commit d012a06a) is not a viable option as
        the revert is needed to fix a regression that occurs when the guest has
        one or more assigned devices.  It's unlikely we'll root cause the device
        assignment regression soon enough to fix the regression timely.
      
      - Remove the conditional reschedule from kvm_mmu_zap_all().  However, although
        removing the reschedule would be a smaller code change, it's less safe
        in the sense that the resulting kvm_mmu_zap_all() hasn't been used in
        the wild for flushing memslots since the fast invalidate mechanism was
        introduced by commit 6ca18b69 ("KVM: x86: use the fast way to
        invalidate all pages"), back in 2013.
      
      - Reintroduce the fast invalidate mechanism and use it when zapping shadow
        pages in response to a memslot being deleted/moved, which is what this
        patch does.
      
      For all intents and purposes, this is a revert of commit ea145aac
      ("Revert "KVM: MMU: fast invalidate all pages"") and a partial revert of
      commit 7390de1e ("Revert "KVM: x86: use the fast way to invalidate
      all pages""), i.e. restores the behavior of commit 5304b8d3 ("KVM:
      MMU: fast invalidate all pages") and commit 6ca18b69 ("KVM: x86:
      use the fast way to invalidate all pages") respectively.
      
      Fixes: d012a06a ("Revert "KVM: x86/mmu: Zap only the relevant pages when removing a memslot"")
      Reported-by: NJames Harvey <jamespharvey20@gmail.com>
      Cc: Alex Willamson <alex.williamson@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      002c5f73
  4. 12 9月, 2019 1 次提交
    • L
      KVM: x86: Fix INIT signal handling in various CPU states · 4b9852f4
      Liran Alon 提交于
      Commit cd7764fe ("KVM: x86: latch INITs while in system management mode")
      changed code to latch INIT while vCPU is in SMM and process latched INIT
      when leaving SMM. It left a subtle remark in commit message that similar
      treatment should also be done while vCPU is in VMX non-root-mode.
      
      However, INIT signals should actually be latched in various vCPU states:
      (*) For both Intel and AMD, INIT signals should be latched while vCPU
      is in SMM.
      (*) For Intel, INIT should also be latched while vCPU is in VMX
      operation and later processed when vCPU leaves VMX operation by
      executing VMXOFF.
      (*) For AMD, INIT should also be latched while vCPU runs with GIF=0
      or in guest-mode with intercept defined on INIT signal.
      
      To fix this:
      1) Add kvm_x86_ops->apic_init_signal_blocked() such that each CPU vendor
      can define the various CPU states in which INIT signals should be
      blocked and modify kvm_apic_accept_events() to use it.
      2) Modify vmx_check_nested_events() to check for pending INIT signal
      while vCPU in guest-mode. If so, emualte vmexit on
      EXIT_REASON_INIT_SIGNAL. Note that nSVM should have similar behaviour
      but is currently left as a TODO comment to implement in the future
      because nSVM don't yet implement svm_check_nested_events().
      
      Note: Currently KVM nVMX implementation don't support VMX wait-for-SIPI
      activity state as specified in MSR_IA32_VMX_MISC bits 6:8 exposed to
      guest (See nested_vmx_setup_ctls_msrs()).
      If and when support for this activity state will be implemented,
      kvm_check_nested_events() would need to avoid emulating vmexit on
      INIT signal in case activity-state is wait-for-SIPI. In addition,
      kvm_apic_accept_events() would need to be modified to avoid discarding
      SIPI in case VMX activity-state is wait-for-SIPI but instead delay
      SIPI processing to vmx_check_nested_events() that would clear
      pending APIC events and emulate vmexit on SIPI.
      Reviewed-by: NJoao Martins <joao.m.martins@oracle.com>
      Co-developed-by: NNikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: NNikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4b9852f4
  5. 11 9月, 2019 4 次提交
  6. 10 9月, 2019 1 次提交
  7. 06 9月, 2019 3 次提交
  8. 03 9月, 2019 5 次提交
  9. 02 9月, 2019 3 次提交
  10. 31 8月, 2019 1 次提交
  11. 30 8月, 2019 1 次提交
    • K
      perf/x86/amd/ibs: Fix sample bias for dispatched micro-ops · 0f4cd769
      Kim Phillips 提交于
      When counting dispatched micro-ops with cnt_ctl=1, in order to prevent
      sample bias, IBS hardware preloads the least significant 7 bits of
      current count (IbsOpCurCnt) with random values, such that, after the
      interrupt is handled and counting resumes, the next sample taken
      will be slightly perturbed.
      
      The current count bitfield is in the IBS execution control h/w register,
      alongside the maximum count field.
      
      Currently, the IBS driver writes that register with the maximum count,
      leaving zeroes to fill the current count field, thereby overwriting
      the random bits the hardware preloaded for itself.
      
      Fix the driver to actually retain and carry those random bits from the
      read of the IBS control register, through to its write, instead of
      overwriting the lower current count bits with zeroes.
      
      Tested with:
      
      perf record -c 100001 -e ibs_op/cnt_ctl=1/pp -a -C 0 taskset -c 0 <workload>
      
      'perf annotate' output before:
      
       15.70  65:   addsd     %xmm0,%xmm1
       17.30        add       $0x1,%rax
       15.88        cmp       %rdx,%rax
                    je        82
       17.32  72:   test      $0x1,%al
                    jne       7c
        7.52        movapd    %xmm1,%xmm0
        5.90        jmp       65
        8.23  7c:   sqrtsd    %xmm1,%xmm0
       12.15        jmp       65
      
      'perf annotate' output after:
      
       16.63  65:   addsd     %xmm0,%xmm1
       16.82        add       $0x1,%rax
       16.81        cmp       %rdx,%rax
                    je        82
       16.69  72:   test      $0x1,%al
                    jne       7c
        8.30        movapd    %xmm1,%xmm0
        8.13        jmp       65
        8.24  7c:   sqrtsd    %xmm1,%xmm0
        8.39        jmp       65
      
      Tested on Family 15h and 17h machines.
      
      Machines prior to family 10h Rev. C don't have the RDWROPCNT capability,
      and have the IbsOpCurCnt bitfield reserved, so this patch shouldn't
      affect their operation.
      
      It is unknown why commit db98c5fa ("perf/x86: Implement 64-bit
      counter support for IBS") ignored the lower 4 bits of the IbsOpCurCnt
      field; the number of preloaded random bits has always been 7, AFAICT.
      Signed-off-by: NKim Phillips <kim.phillips@amd.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: "Arnaldo Carvalho de Melo" <acme@kernel.org>
      Cc: <x86@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Borislav Petkov" <bp@alien8.de>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: "Namhyung Kim" <namhyung@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Link: https://lkml.kernel.org/r/20190826195730.30614-1-kim.phillips@amd.com
      0f4cd769
  12. 28 8月, 2019 8 次提交
  13. 26 8月, 2019 1 次提交
  14. 23 8月, 2019 3 次提交
    • S
      x86/retpoline: Don't clobber RFLAGS during CALL_NOSPEC on i386 · b63f20a7
      Sean Christopherson 提交于
      Use 'lea' instead of 'add' when adjusting %rsp in CALL_NOSPEC so as to
      avoid clobbering flags.
      
      KVM's emulator makes indirect calls into a jump table of sorts, where
      the destination of the CALL_NOSPEC is a small blob of code that performs
      fast emulation by executing the target instruction with fixed operands.
      
        adcb_al_dl:
           0x000339f8 <+0>:   adc    %dl,%al
           0x000339fa <+2>:   ret
      
      A major motiviation for doing fast emulation is to leverage the CPU to
      handle consumption and manipulation of arithmetic flags, i.e. RFLAGS is
      both an input and output to the target of CALL_NOSPEC.  Clobbering flags
      results in all sorts of incorrect emulation, e.g. Jcc instructions often
      take the wrong path.  Sans the nops...
      
        asm("push %[flags]; popf; " CALL_NOSPEC " ; pushf; pop %[flags]\n"
           0x0003595a <+58>:  mov    0xc0(%ebx),%eax
           0x00035960 <+64>:  mov    0x60(%ebx),%edx
           0x00035963 <+67>:  mov    0x90(%ebx),%ecx
           0x00035969 <+73>:  push   %edi
           0x0003596a <+74>:  popf
           0x0003596b <+75>:  call   *%esi
           0x000359a0 <+128>: pushf
           0x000359a1 <+129>: pop    %edi
           0x000359a2 <+130>: mov    %eax,0xc0(%ebx)
           0x000359b1 <+145>: mov    %edx,0x60(%ebx)
      
        ctxt->eflags = (ctxt->eflags & ~EFLAGS_MASK) | (flags & EFLAGS_MASK);
           0x000359a8 <+136>: mov    -0x10(%ebp),%eax
           0x000359ab <+139>: and    $0x8d5,%edi
           0x000359b4 <+148>: and    $0xfffff72a,%eax
           0x000359b9 <+153>: or     %eax,%edi
           0x000359bd <+157>: mov    %edi,0x4(%ebx)
      
      For the most part this has gone unnoticed as emulation of guest code
      that can trigger fast emulation is effectively limited to MMIO when
      running on modern hardware, and MMIO is rarely, if ever, accessed by
      instructions that affect or consume flags.
      
      Breakage is almost instantaneous when running with unrestricted guest
      disabled, in which case KVM must emulate all instructions when the guest
      has invalid state, e.g. when the guest is in Big Real Mode during early
      BIOS.
      
      Fixes: 776b043848fd2 ("x86/retpoline: Add initial retpoline support")
      Fixes: 1a29b5b7 ("KVM: x86: Make indirect calls in emulator speculation safe")
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20190822211122.27579-1-sean.j.christopherson@intel.com
      b63f20a7
    • V
      clocksource/drivers/hyperv: Enable TSC page clocksource on 32bit · 3e2d9453
      Vitaly Kuznetsov 提交于
      There is no particular reason to not enable TSC page clocksource on
      32-bit. mul_u64_u64_shr() is available and despite the increased
      computational complexity (compared to 64bit) TSC page is still a huge win
      compared to MSR-based clocksource.
      
      In-kernel reads:
        MSR based clocksource: 3361 cycles
        TSC page clocksource: 49 cycles
      
      Reads from userspace (utilizing vDSO in case of TSC page):
        MSR based clocksource: 5664 cycles
        TSC page clocksource: 131 cycles
      
      Enabling TSC page on 32bits allows to get rid of CONFIG_HYPERV_TSCPAGE as
      it is now not any different from CONFIG_HYPERV_TIMER.
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NMichael Kelley <mikelley@microsoft.com>
      Link: https://lkml.kernel.org/r/20190822083630.17059-1-vkuznets@redhat.com
      3e2d9453
    • J
      x86/dma: Get rid of iommu_pass_through · c53c47aa
      Joerg Roedel 提交于
      This variable has no users anymore. Remove it and tell the
      IOMMU code via its new functions about requested DMA modes.
      Reviewed-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NJoerg Roedel <jroedel@suse.de>
      c53c47aa
  15. 22 8月, 2019 2 次提交