1. 22 8月, 2019 16 次提交
  2. 21 8月, 2019 1 次提交
  3. 14 8月, 2019 2 次提交
  4. 05 8月, 2019 5 次提交
    • P
      x86: kvm: remove useless calls to kvm_para_available · 57b76bdb
      Paolo Bonzini 提交于
      Most code in arch/x86/kernel/kvm.c is called through x86_hyper_kvm, and thus only
      runs if KVM has been detected.  There is no need to check again for the CPUID
      base.
      
      Cc: Sergio Lopez <slp@redhat.com>
      Cc: Jan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      57b76bdb
    • G
      KVM: no need to check return value of debugfs_create functions · 3e7093d0
      Greg KH 提交于
      When calling debugfs functions, there is no need to ever check the
      return value.  The function can work or not, but the code logic should
      never do something different based on this.
      
      Also, when doing this, change kvm_arch_create_vcpu_debugfs() to return
      void instead of an integer, as we should not care at all about if this
      function actually does anything or not.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <x86@kernel.org>
      Cc: <kvm@vger.kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3e7093d0
    • P
      KVM: remove kvm_arch_has_vcpu_debugfs() · 741cbbae
      Paolo Bonzini 提交于
      There is no need for this function as all arches have to implement
      kvm_arch_create_vcpu_debugfs() no matter what.  A #define symbol
      let us actually simplify the code.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      741cbbae
    • W
      KVM: Fix leak vCPU's VMCS value into other pCPU · 17e433b5
      Wanpeng Li 提交于
      After commit d73eb57b (KVM: Boost vCPUs that are delivering interrupts), a
      five years old bug is exposed. Running ebizzy benchmark in three 80 vCPUs VMs
      on one 80 pCPUs Skylake server, a lot of rcu_sched stall warning splatting
      in the VMs after stress testing:
      
       INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
       Call Trace:
         flush_tlb_mm_range+0x68/0x140
         tlb_flush_mmu.part.75+0x37/0xe0
         tlb_finish_mmu+0x55/0x60
         zap_page_range+0x142/0x190
         SyS_madvise+0x3cd/0x9c0
         system_call_fastpath+0x1c/0x21
      
      swait_active() sustains to be true before finish_swait() is called in
      kvm_vcpu_block(), voluntarily preempted vCPUs are taken into account
      by kvm_vcpu_on_spin() loop greatly increases the probability condition
      kvm_arch_vcpu_runnable(vcpu) is checked and can be true, when APICv
      is enabled the yield-candidate vCPU's VMCS RVI field leaks(by
      vmx_sync_pir_to_irr()) into spinning-on-a-taken-lock vCPU's current
      VMCS.
      
      This patch fixes it by checking conservatively a subset of events.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Marc Zyngier <Marc.Zyngier@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      17e433b5
    • W
      KVM: LAPIC: Don't need to wakeup vCPU twice afer timer fire · a48d06f9
      Wanpeng Li 提交于
      kvm_set_pending_timer() will take care to wake up the sleeping vCPU which
      has pending timer, don't need to check this in apic_timer_expired() again.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a48d06f9
  5. 24 7月, 2019 2 次提交
    • W
      KVM: X86: Boost queue head vCPU to mitigate lock waiter preemption · 266e85a5
      Wanpeng Li 提交于
      Commit 11752adb (locking/pvqspinlock: Implement hybrid PV queued/unfair locks)
      introduces hybrid PV queued/unfair locks
       - queued mode (no starvation)
       - unfair mode (good performance on not heavily contended lock)
      The lock waiter goes into the unfair mode especially in VMs with over-commit
      vCPUs since increaing over-commitment increase the likehood that the queue
      head vCPU may have been preempted and not actively spinning.
      
      However, reschedule queue head vCPU timely to acquire the lock still can get
      better performance than just depending on lock stealing in over-subscribe
      scenario.
      
      Testing on 80 HT 2 socket Xeon Skylake server, with 80 vCPUs VM 80GB RAM:
      ebizzy -M
                   vanilla     boosting    improved
       1VM          23520        25040         6%
       2VM           8000        13600        70%
       3VM           3100         5400        74%
      
      The lock holder vCPU yields to the queue head vCPU when unlock, to boost queue
      head vCPU which is involuntary preemption or the one which is voluntary halt
      due to fail to acquire the lock after a short spin in the guest.
      
      Cc: Waiman Long <longman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      266e85a5
    • C
      Documentation: move Documentation/virtual to Documentation/virt · 2f5947df
      Christoph Hellwig 提交于
      Renaming docs seems to be en vogue at the moment, so fix on of the
      grossly misnamed directories.  We usually never use "virtual" as
      a shortcut for virtualization in the kernel, but always virt,
      as seen in the virt/ top-level directory.  Fix up the documentation
      to match that.
      
      Fixes: ed16648e ("Move kvm, uml, and lguest subdirectories under a common "virtual" directory, I.E:")
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      2f5947df
  6. 22 7月, 2019 5 次提交
  7. 20 7月, 2019 7 次提交
    • T
      x86/entry/64: Prevent clobbering of saved CR2 value · 6879298b
      Thomas Gleixner 提交于
      The recent fix for CR2 corruption introduced a new way to reliably corrupt
      the saved CR2 value.
      
      CR2 is saved early in the entry code in RDX, which is the third argument to
      the fault handling functions. But it missed that between saving and
      invoking the fault handler enter_from_user_mode() can be called. RDX is a
      caller saved register so the invoked function can freely clobber it with
      the obvious consequences.
      
      The TRACE_IRQS_OFF call is safe as it calls through the thunk which
      preserves RDX, but TRACE_IRQS_OFF_DEBUG is not because it also calls into
      C-code outside of the thunk.
      
      Store CR2 in R12 instead which is a callee saved register and move R12 to
      RDX just before calling the fault handler.
      
      Fixes: a0d14b89 ("x86/mm, tracing: Fix CR2 corruption")
      Reported-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1907201020540.1782@nanos.tec.linutronix.de
      6879298b
    • E
      KVM: x86: Add fixed counters to PMU filter · 30cd8604
      Eric Hankland 提交于
      Updates KVM_CAP_PMU_EVENT_FILTER so it can also whitelist or blacklist
      fixed counters.
      Signed-off-by: NEric Hankland <ehankland@google.com>
      [No need to check padding fields for zero. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      30cd8604
    • P
      KVM: nVMX: do not use dangling shadow VMCS after guest reset · 88dddc11
      Paolo Bonzini 提交于
      If a KVM guest is reset while running a nested guest, free_nested will
      disable the shadow VMCS execution control in the vmcs01.  However,
      on the next KVM_RUN vmx_vcpu_run would nevertheless try to sync
      the VMCS12 to the shadow VMCS which has since been freed.
      
      This causes a vmptrld of a NULL pointer on my machime, but Jan reports
      the host to hang altogether.  Let's see how much this trivial patch fixes.
      Reported-by: NJan Kiszka <jan.kiszka@siemens.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      88dddc11
    • P
      KVM: VMX: dump VMCS on failed entry · 3b20e03a
      Paolo Bonzini 提交于
      This is useful for debugging, and is ratelimited nowadays.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3b20e03a
    • L
      KVM: x86/vPMU: refine kvm_pmu err msg when event creation failed · 6fc3977c
      Like Xu 提交于
      If a perf_event creation fails due to any reason of the host perf
      subsystem, it has no chance to log the corresponding event for guest
      which may cause abnormal sampling data in guest result. In debug mode,
      this message helps to understand the state of vPMC and we may not
      limit the number of occurrences but not in a spamming style.
      Suggested-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NLike Xu <like.xu@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      6fc3977c
    • L
      KVM: SVM: Fix detection of AMD Errata 1096 · 118154bd
      Liran Alon 提交于
      When CPU raise #NPF on guest data access and guest CR4.SMAP=1, it is
      possible that CPU microcode implementing DecodeAssist will fail
      to read bytes of instruction which caused #NPF. This is AMD errata
      1096 and it happens because CPU microcode reading instruction bytes
      incorrectly attempts to read code as implicit supervisor-mode data
      accesses (that is, just like it would read e.g. a TSS), which are
      susceptible to SMAP faults. The microcode reads CS:RIP and if it is
      a user-mode address according to the page tables, the processor
      gives up and returns no instruction bytes.  In this case,
      GuestIntrBytes field of the VMCB on a VMEXIT will incorrectly
      return 0 instead of the correct guest instruction bytes.
      
      Current KVM code attemps to detect and workaround this errata, but it
      has multiple issues:
      
      1) It mistakenly checks if guest CR4.SMAP=0 instead of guest CR4.SMAP=1,
      which is required for encountering a SMAP fault.
      
      2) It assumes SMAP faults can only occur when guest CPL==3.
      However, in case guest CR4.SMEP=0, the guest can execute an instruction
      which reside in a user-accessible page with CPL<3 priviledge. If this
      instruction raise a #NPF on it's data access, then CPU DecodeAssist
      microcode will still encounter a SMAP violation.  Even though no sane
      OS will do so (as it's an obvious priviledge escalation vulnerability),
      we still need to handle this semanticly correct in KVM side.
      
      Note that (2) *is* a useful optimization, because CR4.SMAP=1 is an easy
      triggerable condition and guests usually enable SMAP together with SMEP.
      If the vCPU has CR4.SMEP=1, the errata could indeed be encountered onlt
      at guest CPL==3; otherwise, the CPU would raise a SMEP fault to guest
      instead of #NPF.  We keep this condition to avoid false positives in
      the detection of the errata.
      
      In addition, to avoid future confusion and improve code readbility,
      include details of the errata in code and not just in commit message.
      
      Fixes: 05d5a486 ("KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation)")
      Cc: Singh Brijesh <brijesh.singh@amd.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: NLiran Alon <liran.alon@oracle.com>
      Reviewed-by: NBrijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      118154bd
    • W
      KVM: LAPIC: Inject timer interrupt via posted interrupt · 0c5f81da
      Wanpeng Li 提交于
      Dedicated instances are currently disturbed by unnecessary jitter due
      to the emulated lapic timers firing on the same pCPUs where the
      vCPUs reside.  There is no hardware virtual timer on Intel for guest
      like ARM, so both programming timer in guest and the emulated timer fires
      incur vmexits.  This patch tries to avoid vmexit when the emulated timer
      fires, at least in dedicated instance scenario when nohz_full is enabled.
      
      In that case, the emulated timers can be offload to the nearest busy
      housekeeping cpus since APICv has been found for several years in server
      processors. The guest timer interrupt can then be injected via posted interrupts,
      which are delivered by the housekeeping cpu once the emulated timer fires.
      
      The host should tuned so that vCPUs are placed on isolated physical
      processors, and with several pCPUs surplus for busy housekeeping.
      If disabled mwait/hlt/pause vmexits keep the vCPUs in non-root mode,
      ~3% redis performance benefit can be observed on Skylake server, and the
      number of external interrupt vmexits drops substantially.  Without patch
      
                  VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time   Avg time
      EXTERNAL_INTERRUPT    42916    49.43%   39.30%   0.47us   106.09us   0.71us ( +-   1.09% )
      
      While with patch:
      
                  VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time         Avg time
      EXTERNAL_INTERRUPT    6871     9.29%     2.96%   0.44us    57.88us   0.72us ( +-   4.02% )
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      0c5f81da
  8. 19 7月, 2019 2 次提交
    • D
      x86/hyper-v: Zero out the VP ASSIST PAGE on allocation · e320ab3c
      Dexuan Cui 提交于
      The VP ASSIST PAGE is an "overlay" page (see Hyper-V TLFS's Section
      5.2.1 "GPA Overlay Pages" for the details) and here is an excerpt:
      
      "The hypervisor defines several special pages that "overlay" the guest's
       Guest Physical Addresses (GPA) space. Overlays are addressed GPA but are
       not included in the normal GPA map maintained internally by the hypervisor.
       Conceptually, they exist in a separate map that overlays the GPA map.
      
       If a page within the GPA space is overlaid, any SPA page mapped to the
       GPA page is effectively "obscured" and generally unreachable by the
       virtual processor through processor memory accesses.
      
       If an overlay page is disabled, the underlying GPA page is "uncovered",
       and an existing mapping becomes accessible to the guest."
      
      SPA = System Physical Address = the final real physical address.
      
      When a CPU (e.g. CPU1) is onlined, hv_cpu_init() allocates the VP ASSIST
      PAGE and enables the EOI optimization for this CPU by writing the MSR
      HV_X64_MSR_VP_ASSIST_PAGE. From now on, hvp->apic_assist belongs to the
      special SPA page, and this CPU *always* uses hvp->apic_assist (which is
      shared with the hypervisor) to decide if it needs to write the EOI MSR.
      
      When a CPU is offlined then on the outgoing CPU:
      1. hv_cpu_die() disables the EOI optimizaton for this CPU, and from
         now on hvp->apic_assist belongs to the original "normal" SPA page;
      2. the remaining work of stopping this CPU is done
      3. this CPU is completely stopped.
      
      Between 1 and 3, this CPU can still receive interrupts (e.g. reschedule
      IPIs from CPU0, and Local APIC timer interrupts), and this CPU *must* write
      the EOI MSR for every interrupt received, otherwise the hypervisor may not
      deliver further interrupts, which may be needed to completely stop the CPU.
      
      So, after the EOI optimization is disabled in hv_cpu_die(), it's required
      that the hvp->apic_assist's bit0 is zero, which is not guaranteed by the
      current allocation mode because it lacks __GFP_ZERO. As a consequence the
      bit might be set and interrupt handling would not write the EOI MSR causing
      interrupt delivery to become stuck.
      
      Add the missing __GFP_ZERO to the allocation.
      
      Note 1: after the "normal" SPA page is allocted and zeroed out, neither the
      hypervisor nor the guest writes into the page, so the page remains with
      zeros.
      
      Note 2: see Section 10.3.5 "EOI Assist" for the details of the EOI
      optimization. When the optimization is enabled, the guest can still write
      the EOI MSR register irrespective of the "No EOI required" value, but
      that's slower than the optimized assist based variant.
      
      Fixes: ba696429 ("x86/hyper-v: Implement EOI assist")
      Signed-off-by: NDexuan Cui <decui@microsoft.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/ <PU1P153MB0169B716A637FABF07433C04BFCB0@PU1P153MB0169.APCP153.PROD.OUTLOOK.COM
      e320ab3c
    • M
      proc/sysctl: add shared variables for range check · eec4844f
      Matteo Croce 提交于
      In the sysctl code the proc_dointvec_minmax() function is often used to
      validate the user supplied value between an allowed range.  This
      function uses the extra1 and extra2 members from struct ctl_table as
      minimum and maximum allowed value.
      
      On sysctl handler declaration, in every source file there are some
      readonly variables containing just an integer which address is assigned
      to the extra1 and extra2 members, so the sysctl range is enforced.
      
      The special values 0, 1 and INT_MAX are very often used as range
      boundary, leading duplication of variables like zero=0, one=1,
      int_max=INT_MAX in different source files:
      
          $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
          248
      
      Add a const int array containing the most commonly used values, some
      macros to refer more easily to the correct array member, and use them
      instead of creating a local one for every object file.
      
      This is the bloat-o-meter output comparing the old and new binary
      compiled with the default Fedora config:
      
          # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
          add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
          Data                                         old     new   delta
          sysctl_vals                                    -      12     +12
          __kstrtab_sysctl_vals                          -      12     +12
          max                                           14      10      -4
          int_max                                       16       -     -16
          one                                           68       -     -68
          zero                                         128      28    -100
          Total: Before=20583249, After=20583085, chg -0.00%
      
      [mcroce@redhat.com: tipc: remove two unused variables]
        Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
      [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
      [arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
        Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
      [akpm@linux-foundation.org: fix fs/eventpoll.c]
      Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.comSigned-off-by: NMatteo Croce <mcroce@redhat.com>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NAaron Tomlin <atomlin@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eec4844f